Machine Learning Overview

Activation function progress in deep learning: ReLU, ELU, SELU, GELU, Mish, etc. – includes a table and graphs – day 24

Each activation function below is compared by its formula, its key properties, the problem it addresses (and the solution it offers), and a mathematical explanation.

Sigmoid
  • Formula: σ(z) = 1 / (1 + e^{-z})
  • Comparison: non-zero-centered output; saturates for large values, leading to vanishing gradients.
  • Problem: vanishing gradients for large positive or negative inputs, slowing down learning in deep networks.
  • Solution: ReLU was introduced to avoid the saturation issue by having a linear response for positive values.
  • Mathematical explanation: the gradient of the sigmoid function is σ'(z) = σ(z)(1 - σ(z)). As z moves far from zero (either positive or negative), σ(z) approaches 1 or 0, causing σ'(z) to approach 0, leading to very small gradients and hence slow learning.
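A minimal NumPy sketch (illustrative code, not tied to any framework) makes the saturation easy to see by printing σ(z) and σ'(z) for a few inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # sigma'(z) = sigma(z) * (1 - sigma(z))

for z in [-10.0, -5.0, 0.0, 5.0, 10.0]:
    print(f"z={z:+6.1f}  sigma(z)={sigmoid(z):.5f}  sigma'(z)={sigmoid_grad(z):.6f}")
# The derivative peaks at 0.25 at z = 0 and is already ~0.000045 at |z| = 10.
```
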
ReLU (Rectified Linear Unit)
  • Formula: f(z) = max(0, z)
  • Comparison: simple and computationally efficient; doesn't saturate for positive values; suffers from the "dying ReLU" problem.
  • Problem: "dying ReLU," where neurons stop learning when their inputs are negative, leading to dead neurons.
  • Solution: Leaky ReLU was introduced to allow a small, non-zero gradient when z < 0, preventing neurons from dying.
  • Mathematical explanation: for z < 0, the gradient of ReLU is 0, meaning that neurons receiving negative inputs will not update during backpropagation. If this persists, the neuron is effectively "dead."
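The same kind of sketch shows the dead-gradient behaviour; the gradient function below uses the common convention of treating the derivative at z = 0 as 0:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Subgradient convention: 0 at z = 0 (any value in [0, 1] would be valid there).
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("relu(z)  =", relu(z))
print("relu'(z) =", relu_grad(z))
# Every negative pre-activation gets a zero gradient, so a neuron whose
# inputs stay negative receives no updates -- the "dying ReLU" problem.
```
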
Leaky ReLU
  • Formula: LeakyReLU_α(z) = max(αz, z)
  • Comparison: solves the "dying ReLU" issue by allowing a small gradient for negative values; the hyperparameter α determines the slope for z < 0.
  • Problem: choosing α is crucial; a bad choice can lead to suboptimal learning.
  • Solution: Parametric ReLU (PReLU) allows α to be learned during training, optimizing the activation function for the specific dataset.
  • Mathematical explanation: by returning αz (where α is a small positive number) for z < 0, Leaky ReLU ensures that gradients never fully vanish, maintaining some gradient flow even for negative inputs.
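A small sketch of Leaky ReLU and its derivative, assuming the commonly used default α = 0.01:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z) for alpha < 1: identity for z >= 0, small slope alpha for z < 0.
    return np.where(z >= 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z >= 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(z))       # [-0.02  -0.005  0.5  2.0]
print(leaky_relu_grad(z))  # [ 0.01   0.01   1.0  1.0]
```
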
PReLU (Parametric ReLU)
  • Formula: f(z) = max(αz, z), where α is learned
  • Comparison: adaptively adjusts the slope for negative inputs based on the data; can lead to overfitting on small datasets.
  • Problem: potential overfitting on smaller datasets due to the added flexibility of learning α.
  • Solution: use regularization techniques or simpler activation functions like ELU that don't require learning parameters.
  • Mathematical explanation: the parameter α is learned during training, allowing the network to adaptively find the best slope for negative inputs. However, this flexibility can lead to overfitting, particularly on small datasets.
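Since α is trainable in PReLU, here is a toy sketch of one gradient step on α; the upstream gradients and learning rate are made-up numbers, purely for illustration:

```python
import numpy as np

def prelu(z, alpha):
    return np.where(z >= 0, z, alpha * z)

def prelu_grad_alpha(z):
    # df/dalpha is z for z < 0 and 0 otherwise, so alpha itself receives gradients.
    return np.where(z >= 0, 0.0, z)

alpha = 0.25
z = np.array([-1.5, 0.8, -0.2])
print("prelu(z):", prelu(z, alpha))

upstream = np.array([0.1, -0.3, 0.4])            # dLoss/df from later layers (made up)
grad_alpha = np.sum(upstream * prelu_grad_alpha(z))
alpha -= 0.1 * grad_alpha                         # one SGD step, learning rate 0.1
print("updated alpha:", alpha)
```
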
ELU (Exponential Linear Unit)
  • Formula: ELU_α(z) = α(e^z - 1) if z < 0, and z if z ≥ 0
  • Comparison: allows negative outputs, reducing bias shift; smooths the gradient across the origin.
  • Problem: ReLU, Leaky ReLU, and PReLU can result in a bias shift, and their derivatives aren't smooth at z = 0.
  • Solution: ELU allows negative outputs and provides a smooth gradient transition, helping with faster convergence and better learning stability.
  • Mathematical explanation: ELU keeps the gradient smooth and non-zero even when z < 0, avoiding the abrupt changes seen in ReLU and its variants. This smoothness aids better and faster learning.
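A short ELU sketch; with α = 1 the derivative approaches 1 from both sides of the origin, which is the smoothness referred to above:

```python
import numpy as np

def elu(z, alpha=1.0):
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1.0))

def elu_grad(z, alpha=1.0):
    # For alpha = 1 the derivative tends to 1 from both sides of z = 0,
    # so the gradient is continuous across the origin.
    return np.where(z >= 0, 1.0, alpha * np.exp(z))

z = np.array([-3.0, -1.0, -0.001, 0.001, 1.0])
print("elu(z)  =", elu(z))
print("elu'(z) =", elu_grad(z))
```
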
SELU (Scaled Exponential Linear Unit)
  • Formula: SELU(z) = λ·α(e^z - 1) if z < 0, and λ·z if z ≥ 0
  • Comparison: scaled version of ELU; self-normalizing properties keep outputs stable.
  • Problem: SELU requires specific conditions for self-normalization, limiting its flexibility to only certain architectures.
  • Solution: use it in plain MLPs with standardized inputs and LeCun normal initialization. For other architectures, alternative activation functions like GELU might be better.
  • Mathematical explanation: the scaling factor λ helps keep the output of each layer normalized, maintaining a consistent output distribution. However, the required conditions (e.g., standardized inputs, specific initialization) limit its general applicability.
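A rough self-normalization check, assuming a plain stack of dense layers with standardized inputs and LeCun-normal initialization (the constants are the standard SELU values):

```python
import numpy as np

LAMBDA = 1.0507009873554805   # standard SELU scale
ALPHA = 1.6732632423543772    # standard SELU alpha

def selu(z):
    return LAMBDA * np.where(z >= 0, z, ALPHA * (np.exp(z) - 1.0))

rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 256))              # standardized inputs
for layer in range(20):
    # LeCun normal init: std = sqrt(1 / fan_in)
    w = rng.normal(0.0, np.sqrt(1.0 / x.shape[1]), size=(x.shape[1], 256))
    x = selu(x @ w)
print(f"after 20 layers: mean={x.mean():+.3f}  std={x.std():.3f}")
# Activations should stay close to mean 0 and std 1 under these conditions.
```
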
GELU (Gaussian Error Linear Unit)
  • Formula: f(z) = z · Φ(z), where Φ(z) is the standard Gaussian CDF
  • Comparison: smooth variant of ReLU; non-convex and non-monotonic, which can capture complex patterns better.
  • Problem: ReLU and its variants may struggle with complex tasks due to their linearity or fixed non-linear behavior.
  • Solution: GELU's blend of the identity with the Gaussian CDF provides a more flexible activation, making it better suited to capturing complex patterns.
  • Mathematical explanation: GELU uses the Gaussian CDF to introduce smooth, probabilistic behavior into the activation. This smoothness helps handle complex patterns, but at the cost of higher computational complexity.
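A scalar sketch of GELU, computing the exact erf-based form next to the widely used tanh approximation (the 0.044715 constant comes from that approximation):

```python
import math

def gelu_exact(z):
    # Phi(z): standard normal CDF, written with the error function.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    # Common tanh approximation used by many frameworks.
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"z={z:+.1f}  exact={gelu_exact(z):+.6f}  tanh approx={gelu_tanh(z):+.6f}")
```
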
Swish
  • Formula: f(z) = z · σ(βz), where σ is the sigmoid function and β is a trainable parameter
  • Comparison: similar to GELU, allows smoother transitions and non-linear behavior; slightly more computationally expensive than ReLU.
  • Problem: even with GELU, performance may not be optimal for all tasks.
  • Solution: Swish introduces a parameterized smooth curve, which allows a more adaptive and potentially more effective activation, depending on the task and dataset.
  • Mathematical explanation: the smooth transition offered by Swish, combined with the trainable parameter β, allows dynamic adjustment of the activation behavior, optimizing performance across different tasks.
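A Swish sketch with β exposed as a parameter; β = 1 gives the SiLU special case, and larger β pushes the curve toward ReLU:

```python
import numpy as np

def swish(z, beta=1.0):
    # f(z) = z * sigmoid(beta * z)
    return z / (1.0 + np.exp(-beta * z))

z = np.linspace(-4, 4, 9)
print(swish(z, beta=1.0))
print(swish(z, beta=2.0))   # sharper transition around z = 0
```
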
Mish
  • Formula: f(z) = z · tanh(ln(1 + e^z))
  • Comparison: non-monotonic, smooth, and self-regularizing; encourages better gradient flow and faster convergence.
  • Problem: Swish and GELU, while powerful, may not always capture the most effective gradient dynamics and can be computationally intensive.
  • Solution: Mish enhances gradient flow through smoother transitions and self-regularization, potentially leading to better convergence and model performance.
  • Mathematical explanation: Mish combines properties of both Swish and GELU, providing a smooth activation that enhances gradient flow across layers. The tanh(ln(1 + e^z)) term (tanh of the softplus) makes the activation bounded below and non-monotonic, which can lead to better performance in some deep learning tasks.
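A Mish sketch, writing ln(1 + e^z) through NumPy's numerically stable log-add-exp:

```python
import numpy as np

def softplus(z):
    # ln(1 + e^z), computed in a numerically stable way.
    return np.logaddexp(0.0, z)

def mish(z):
    # f(z) = z * tanh(softplus(z))
    return z * np.tanh(softplus(z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(mish(z))
```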

Let’s see their graph:
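
The original plots are not reproduced here, but a short matplotlib sketch along the following lines (GELU via its tanh approximation, Leaky ReLU with α = 0.01, Swish with β = 1) can regenerate comparable curves:

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-4, 4, 400)
lam, alpha = 1.0507, 1.6733  # SELU constants

curves = {
    "Sigmoid":    1 / (1 + np.exp(-z)),
    "ReLU":       np.maximum(0, z),
    "Leaky ReLU": np.where(z >= 0, z, 0.01 * z),
    "ELU":        np.where(z >= 0, z, np.exp(z) - 1),
    "SELU":       lam * np.where(z >= 0, z, alpha * (np.exp(z) - 1)),
    "GELU":       0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3))),
    "Swish":      z / (1 + np.exp(-z)),
    "Mish":       z * np.tanh(np.logaddexp(0, z)),
}

fig, ax = plt.subplots(figsize=(8, 5))
for name, y in curves.items():
    ax.plot(z, y, label=name)
ax.axhline(0, color="gray", lw=0.5)
ax.axvline(0, color="gray", lw=0.5)
ax.set_xlabel("z")
ax.set_ylabel("activation(z)")
ax.legend()
plt.show()
```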

Now let’s continue even deeper:


Understanding the Impact of Activation Functions on Gradient Descent

Activation functions are integral to the learning mechanisms within neural networks, primarily influencing the efficiency and direction of gradient descent during training. This article delves into how activation functions like ReLU and Sigmoid shape the training process through a detailed comparative example.

Example Neural Network Setup

A simple neural network configuration includes:

  • Input: \(x = 0.5\)
  • Weights: \(w_1 = 0.6\), \(w_2 = -0.4\), \(w_{out} = 0.3\)
  • Biases: \(b_1 = 0.1\), \(b_2 = 0.2\), \(b_{out} = 0.0\)
  • Target Output: \(y_{target} = 1.0\)

Forward Pass Calculation

First, we calculate the output of the hidden neurons and the final output of the network.

| Activation Function | Neuron 1 Output (h1) | Neuron 2 Output (h2) | Network Output (y) |
|---|---|---|---|
| ReLU | 0.4 | 0.0 | 0.12 |
| Sigmoid | 0.5987 | 0.5 | 0.3296 |
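
These numbers can be reproduced with a few lines of Python, assuming both hidden units connect to the output through the same weight \(w_{out} = 0.3\) (which is what the table's outputs imply):

```python
import numpy as np

x, w1, w2, w_out = 0.5, 0.6, -0.4, 0.3
b1, b2, b_out = 0.1, 0.2, 0.0

z1, z2 = w1 * x + b1, w2 * x + b2            # pre-activations: 0.4 and 0.0

relu = lambda z: max(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for name, act in [("ReLU", relu), ("Sigmoid", sigmoid)]:
    h1, h2 = act(z1), act(z2)
    y = w_out * h1 + w_out * h2 + b_out      # both hidden units share w_out (assumed)
    print(f"{name:8s} h1={h1:.4f}  h2={h2:.4f}  y={y:.4f}")
# ReLU     h1=0.4000  h2=0.0000  y=0.1200
# Sigmoid  h1=0.5987  h2=0.5000  y=0.3296
```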

Backward Pass: Gradient Calculation

Next, we calculate the gradients to understand how each activation function influences the learning process:

\[
\text{Loss} = \frac{1}{2} (y_{target} - y)^2
\]
\[
\text{Gradient} = \frac{\partial \text{Loss}}{\partial y} = y - y_{target}
\]

| Activation Function | Output Gradient \((y - y_{target})\) | Neuron 1 Gradient \((\partial \text{Loss} / \partial z_1)\) | Neuron 2 Gradient \((\partial \text{Loss} / \partial z_2)\) |
|---|---|---|---|
| ReLU | -0.88 | \(-0.88 \cdot 0.3 \cdot 1 = -0.264\) | 0 (ReLU derivative is 0 at \(z_2 = 0\)) |
| Sigmoid | -0.6704 | \(-0.6704 \cdot 0.3 \cdot 0.2403 = -0.0483\) | \(-0.6704 \cdot 0.3 \cdot 0.25 = -0.0503\) |
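
The gradient table can be checked the same way; each per-neuron value is the output gradient propagated through \(w_{out}\) and the activation derivative at the pre-activations \(z_1 = 0.4\), \(z_2 = 0\):

```python
import numpy as np

w_out, y_target = 0.3, 1.0
z1, z2 = 0.4, 0.0                             # pre-activations from the forward pass

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu_grad = lambda z: 1.0 if z > 0 else 0.0
sigmoid_grad = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

for name, y, act_grad in [("ReLU", 0.12, relu_grad), ("Sigmoid", 0.3296, sigmoid_grad)]:
    dL_dy = y - y_target                      # derivative of 0.5 * (y_target - y)^2 w.r.t. y
    g1 = dL_dy * w_out * act_grad(z1)         # chain rule through w_out and the activation
    g2 = dL_dy * w_out * act_grad(z2)
    print(f"{name:8s} dL/dy={dL_dy:+.4f}  dL/dz1={g1:+.4f}  dL/dz2={g2:+.4f}")
# ReLU     dL/dy=-0.8800  dL/dz1=-0.2640  dL/dz2=+0.0000
# Sigmoid  dL/dy=-0.6704  dL/dz1=-0.0483  dL/dz2=-0.0503
```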

Analysis

ReLU: Neuron 1 (h1) receives a substantial gradient because the ReLU derivative is 1 for positive pre-activations; Neuron 2 (h2) receives no gradient at all, illustrating how neurons can “die.”

Sigmoid: Both neurons contribute, but with smaller gradients, potentially leading to vanishing gradients in deeper networks.

Conclusion

This discussion underscores the importance of choosing the right activation function to optimize neural network training, balancing between avoiding vanishing gradients and preventing neuron death.