Activation Functions
There is still one thing missing for building neural networks: activation functions. Let's have a look at what they are and why we need them.
An activation function is a crucial component in artificial neural networks, a class of machine learning models inspired by the structure and function of biological neurons in the human brain. The activation function introduces non-linear transformations to the weighted sum of inputs in a neural network, allowing it to model complex relationships between inputs and outputs.
In a neural network, each neuron (or node) receives inputs from the previous layer or directly from the input data. These inputs are multiplied by corresponding weights and summed up. The activation function is then applied to the sum, producing the output of the neuron, which is passed on to the next layer or used as the final output of the network.
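To make this concrete, here is a minimal sketch of a single neuron's forward pass in NumPy. The input values, weights, and bias are arbitrary examples, and the sigmoid used as the activation is just one possible choice (it is introduced in more detail below).

```python
import numpy as np

def sigmoid(z):
    # Squash the pre-activation value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values: three inputs feeding into one neuron
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4, 0.7, -0.2])
bias = 0.1

# Weighted sum of inputs plus bias (the pre-activation)
z = np.dot(weights, inputs) + bias

# The activation function produces the neuron's output
output = sigmoid(z)
print(output)
```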
The activation function serves two main purposes:
Introducing Non-Linearity: Without an activation function, the neural network would be limited to representing only linear relationships between inputs and outputs. However, real-world data often exhibits complex, non-linear relationships. The activation function allows the network to capture and model such non-linearities, enabling it to learn and approximate intricate patterns in the data (a short sketch after this list makes this concrete).
Enabling Computation of Node Outputs: The activation function also defines the output range of a neuron. It transforms the weighted sum of inputs into a specific output value or range of values, which is then used as the input for the next layer or as the final output of the network. By squashing inputs into a particular output range, the activation function helps ensure that the network can learn and represent a wide variety of functions.
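To illustrate the first point, here is a minimal sketch (with arbitrary example matrices) showing that stacking two linear layers without an activation function collapses into a single linear layer, whereas inserting a non-linearity such as ReLU breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # example input vector
W1 = rng.normal(size=(4, 3))  # weights of the first layer
W2 = rng.normal(size=(2, 4))  # weights of the second layer

# Two linear layers without an activation function ...
two_linear_layers = W2 @ (W1 @ x)

# ... are equivalent to a single linear layer with weights W2 @ W1
single_linear_layer = (W2 @ W1) @ x
print(np.allclose(two_linear_layers, single_linear_layer))  # True

# Inserting a non-linearity (here ReLU) breaks this equivalence,
# allowing stacked layers to represent non-linear functions
with_relu = W2 @ np.maximum(0.0, W1 @ x)
print(np.allclose(with_relu, single_linear_layer))  # False (in general)
```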
There are several common types of activation functions used in neural networks:
Sigmoid Activation Function
The sigmoid function squashes its input into the range between 0 and 1, producing a smooth S-shaped curve. It is commonly used in binary classification problems or when the output needs to be interpreted as a probability.
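A minimal NumPy sketch, following the standard definition sigmoid(x) = 1 / (1 + exp(-x)); the sample inputs are arbitrary:

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + exp(-x)); output lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067, 0.5, 0.9933]
```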
Hyperbolic Tangent Activation Function
The tanh function is similar to the sigmoid function but squashes its input into the range between -1 and 1. It is symmetric around the origin and is often used in classification or regression tasks.
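A quick sketch using NumPy's built-in np.tanh (the sample inputs are arbitrary):

```python
import numpy as np

# tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)); output lies in (-1, 1)
print(np.tanh(np.array([-2.0, 0.0, 2.0])))  # ~[-0.964, 0.0, 0.964]
```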
Rectified Linear Unit Activation Function
ReLU is a popular activation function that returns 0 for negative inputs and the input value itself for positive inputs. It is computationally efficient and has been successful in training deep neural networks.
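A minimal NumPy sketch of ReLU, f(x) = max(0, x), applied to arbitrary sample values:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): negative inputs become 0, positive inputs pass through
    return np.maximum(0.0, x)

print(relu(np.array([-3.0, 0.0, 2.5])))  # [0.  0.  2.5]
```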
Leaky ReLU Activation Function
The Leaky ReLU function is a variant of the Rectified Linear Unit (ReLU) activation function commonly used in artificial neural networks. While the traditional ReLU function sets all negative input values to zero, the Leaky ReLU function allows a small, non-zero output for negative inputs, introducing a small slope or leak for negative values.
Mathematically, the Leaky ReLU function is defined as follows:
f(x) = max(ax, x)
where a is a small positive constant (for example, 0.01) that determines the slope for negative inputs.
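A minimal NumPy sketch, with the slope a = 0.01 chosen purely as an illustrative value:

```python
import numpy as np

def leaky_relu(x, a=0.01):
    # f(x) = max(a * x, x): positive inputs pass through,
    # negative inputs are scaled by the small slope a
    return np.maximum(a * x, x)

print(leaky_relu(np.array([-3.0, 0.0, 2.5])))  # [-0.03  0.    2.5 ]
```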
Softmax Activation Function
The softmax function is a mathematical function that is commonly used in machine learning, especially in multi-class classification tasks. It takes a vector of real-valued numbers as input and transforms them into a probability distribution over multiple classes.
The softmax function operates on a vector of logits (also known as log-odds or unnormalized scores). A logit is an unnormalized value that represents the model's confidence or evidence for each class. The softmax function normalizes these logits and maps them to a valid probability distribution.
Mathematically, the softmax function is defined as follows:
softmax(x_i) = exp(x_i) / Σ(exp(x_j))
Here, x_i is the logit for the i-th class, exp(x_i) is the exponential of that logit (which ensures a non-negative value), and the sum in the denominator is taken over the exponentiated logits of all classes. This normalization ensures that the resulting values lie between 0 and 1 and sum up to 1, representing valid probabilities.
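A minimal NumPy sketch of this definition; subtracting the maximum logit before exponentiating is a common numerical-stability trick (not required by the definition) that avoids overflow without changing the result:

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability; the result is unchanged
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    # Normalize so the outputs lie in (0, 1) and sum to 1
    return exps / np.sum(exps)

probs = softmax(np.array([2.0, 1.0, 0.1]))  # arbitrary example logits
print(probs)        # ~[0.659 0.242 0.099]
print(probs.sum())  # 1.0
```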
The softmax function has a few important properties:
Output Interpretation: The output of the softmax function can be interpreted as the estimated probability of each class. Each value represents the model's confidence or belief that the input belongs to the corresponding class.
Class Competition: The softmax function introduces competition among classes. As the confidence of one class increases, the probabilities assigned to other classes decrease. The softmax function emphasizes the most probable class while suppressing the probabilities of less likely classes.
Sensitivity to Magnitude: The softmax function is sensitive to the magnitude or scale of the logits. Large differences between logits can lead to more pronounced differences in the resulting probabilities.
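The last two properties can be seen directly by scaling the logits; the softmax helper below is the same sketch as above, and the values are arbitrary examples:

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

# The same logits, scaled by a factor of 5: larger differences between
# logits concentrate almost all of the probability on the most likely class
print(softmax(np.array([1.0, 2.0, 3.0])))    # ~[0.090 0.245 0.665]
print(softmax(np.array([5.0, 10.0, 15.0])))  # ~[0.000 0.007 0.993]
```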
The softmax function is commonly used as the final activation function in the output layer of neural networks for multi-class classification tasks. It provides a differentiable and probabilistic representation of the model's predictions, enabling efficient training using techniques such as backpropagation and gradient descent.
By applying the softmax function to the logits, the model's outputs can be interpreted as class probabilities, facilitating decision-making and allowing the most likely class to be selected as the prediction based on the highest probability value.