Activation function progress in deep learning: ReLU, ELU, SELU, GELU, Mish, etc. – with table and graphs – Day 24

| Activation Function | Formula | Comparison | Why (Problem and Solution) | Mathematical Explanation and Proof |
| --- | --- | --- | --- | --- |
| Sigmoid | σ(z) = 1 / (1 + e^(−z)) | – Non-zero-centered output<br>– Saturates for large values, leading to vanishing gradients | Problem: Vanishing gradients for large positive or negative inputs, slowing down learning in deep networks.<br>Solution: ReLU was introduced to avoid the saturation issue by having a linear response for positive values. | The gradient of the sigmoid function is σ′(z) = σ(z)(1 − σ(z)). As z moves far from zero (either positive or negative), σ(z) approaches 1 or 0, so σ′(z) approaches 0, leading to very small gradients and hence slow learning. |
| ReLU (Rectified Linear Unit) | f(z) = max(0, z) | – Simple and computationally efficient<br>– Does not saturate for positive values<br>– Suffers from the "dying ReLU" problem | Problem: "Dying ReLU," where neurons stop learning when their inputs are negative, leading to dead neurons.<br>Solution: Leaky ReLU was introduced to allow a small, non-zero gradient when z < 0, preventing neurons from dying. | For z < 0, the gradient of ReLU is 0, so neurons receiving negative inputs do not update during backpropagation. If this persists, the neuron is effectively "dead." |
| Leaky ReLU | LeakyReLU_α(z) = max(αz, z) | – Solves the "dying ReLU" issue by allowing a small gradient for negative values<br>– Hyperparameter α determines the slope for z < 0 | Problem: Choosing α is crucial; a bad choice can lead to suboptimal learning.<br>Solution: Parametric ReLU (PReLU) allows α to be learned during training, optimizing the activation function for the specific dataset. | By outputting αz (where α is a small positive number) for z < 0, Leaky ReLU ensures that gradients never fully vanish, maintaining some gradient flow even for negative inputs. |
| PReLU (Parametric ReLU) | f(z) = max(αz, z), where α is learned | – Adaptively adjusts the slope for negative inputs based on the data<br>– Can lead to overfitting on small datasets | Problem: Potential overfitting on smaller datasets due to the added flexibility of learning α.<br>Solution: Use regularization techniques or simpler activation functions like ELU that don't require learned parameters. | The parameter α is learned during training, allowing the network to adaptively find the best slope for negative inputs. However, this flexibility can lead to overfitting, particularly on small datasets. |
| ELU (Exponential Linear Unit) | ELU_α(z) = α(e^z − 1) if z < 0; z if z ≥ 0 | – Allows negative outputs, reducing bias shift<br>– Smooths the gradient across the origin | Problem: ReLU, Leaky ReLU, and PReLU can cause a bias shift, and their derivatives aren't smooth at z = 0.<br>Solution: ELU allows negative outputs and provides a smooth gradient transition, helping faster convergence and better learning stability. | ELU ensures that the gradient remains smooth and non-zero even when z < 0, avoiding the abrupt changes seen in ReLU and its variants. This smoothness aids better and faster learning. |
| SELU (Scaled Exponential Linear Unit) | SELU(z) = λ · [α(e^z − 1) if z < 0; z if z ≥ 0] | – Scaled version of ELU<br>– Self-normalizing properties ensure stable outputs | Problem: SELU requires specific conditions for self-normalization, limiting it to certain architectures.<br>Solution: Use it in plain MLPs with standardized inputs and LeCun normal initialization. For other architectures, alternative activation functions like GELU may work better. | The scaling factor λ keeps each layer's output normalized, helping to maintain a consistent output distribution. However, the required conditions (e.g., standardized inputs, specific initialization) limit its general applicability. |
| GELU (Gaussian Error Linear Unit) | f(z) = z · Φ(z), where Φ(z) is the Gaussian CDF | – Smooth variant of ReLU<br>– Non-convex and non-monotonic, which can capture complex patterns better | Problem: ReLU and its variants may struggle with complex tasks due to their linearity or fixed non-linear behavior.<br>Solution: GELU's combination of linear and Gaussian-distribution behavior provides a… | |
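
To accompany the table (and the graphs promised in the title), here is a minimal NumPy/matplotlib sketch that implements each activation from its formula and plots them on a common axis. A few details are my own choices rather than anything from the post: the SELU constants (λ ≈ 1.0507, α ≈ 1.6733) are the standard published self-normalizing values, GELU is drawn with the common tanh approximation of z·Φ(z), and α = 0.01 for Leaky ReLU plus the plotting range are arbitrary illustration values.

```python
# Sketch of the activations in the table, using only NumPy + matplotlib.
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # PReLU has the same form; the difference is that alpha is learned.
    return np.where(z < 0, alpha * z, z)

def elu(z, alpha=1.0):
    return np.where(z < 0, alpha * (np.exp(z) - 1.0), z)

def selu(z, lam=1.0507, alpha=1.6733):
    # lam/alpha are the standard self-normalizing constants for SELU.
    return lam * np.where(z < 0, alpha * (np.exp(z) - 1.0), z)

def gelu(z):
    # Common tanh approximation of z * Phi(z).
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.linspace(-4.0, 4.0, 401)
for name, f in [("Sigmoid", sigmoid), ("ReLU", relu), ("Leaky ReLU", leaky_relu),
                ("ELU", elu), ("SELU", selu), ("GELU", gelu)]:
    plt.plot(z, f(z), label=name)
plt.axhline(0.0, color="gray", linewidth=0.5)
plt.axvline(0.0, color="gray", linewidth=0.5)
plt.xlabel("z")
plt.ylabel("activation(z)")
plt.legend()
plt.show()
```

The plot makes the table's comparison column visible at a glance: sigmoid flattens out at both ends, ReLU is exactly zero for negative inputs, while Leaky ReLU, ELU, SELU, and GELU all keep some response below zero.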
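The gradient claims in the Problem column are also easy to check numerically. The snippet below is a small sketch of my own that evaluates the closed-form derivatives from the table at a few sample points: the sigmoid gradient collapses toward 0 at both tails, ReLU's gradient is exactly 0 for negative inputs (the "dying ReLU" case), while Leaky ReLU and ELU keep a non-zero gradient there.

```python
# Closed-form derivatives of the activations, evaluated at a few points
# to illustrate the gradient behaviour described in the table.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                             # sigma'(z) = sigma(z)(1 - sigma(z))

def d_relu(z):
    return np.where(z > 0, 1.0, 0.0)                 # exactly 0 for z < 0

def d_leaky_relu(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)               # small but non-zero for z < 0

def d_elu(z, alpha=1.0):
    return np.where(z > 0, 1.0, alpha * np.exp(z))   # smooth, approaches 1 as z -> 0 from the left

zs = np.array([-10.0, -2.0, -0.5, 0.5, 2.0, 10.0])
print("z             ", zs)
print("sigmoid'(z)   ", np.round(d_sigmoid(zs), 6))  # ~0 at both tails -> vanishing gradients
print("ReLU'(z)      ", d_relu(zs))                  # dead for all negative z
print("LeakyReLU'(z) ", d_leaky_relu(zs))
print("ELU'(z)       ", np.round(d_elu(zs), 6))
```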
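In practice, all of these activations ship as built-in PyTorch modules (nn.ReLU, nn.LeakyReLU, nn.PReLU, nn.ELU, nn.SELU, nn.GELU), so hand-rolled versions are rarely needed. The sketch below follows the SELU row's recommendation of a plain MLP with standardized inputs and LeCun normal initialization; the layer sizes and batch shape are arbitrary placeholders, and LeCun normal is obtained here via kaiming_normal_ with nonlinearity="linear", which reduces to std = 1/√fan_in.

```python
# A plain SELU MLP in PyTorch, following the table's SELU recommendations:
# standardized inputs + LeCun normal initialization. Sizes are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128), nn.SELU(),
    nn.Linear(128, 128), nn.SELU(),
    nn.Linear(128, 10),
)

for m in model.modules():
    if isinstance(m, nn.Linear):
        # kaiming_normal_ with nonlinearity='linear' gives std = 1/sqrt(fan_in),
        # i.e. LeCun normal, which SELU's self-normalization assumes.
        nn.init.kaiming_normal_(m.weight, mode="fan_in", nonlinearity="linear")
        nn.init.zeros_(m.bias)

x = torch.randn(32, 64)        # stand-in for standardized (zero-mean, unit-variance) inputs
print(model(x).shape)          # torch.Size([32, 10])

# Drop-in alternatives from the same table:
# nn.ReLU(), nn.LeakyReLU(0.01), nn.PReLU(), nn.ELU(alpha=1.0), nn.GELU()
```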
