Activation function progress in deep learning: ReLU, ELU, SELU, GELU, Mish, etc. – include table and graphs – Day 24
Activation Function Comparison (Formula, Comparison, Why: Problem and Solution, Mathematical Explanation and Proof)

Sigmoid
Formula: σ(z) = 1 / (1 + e^(−z))
Comparison:
– Non-zero-centered output
– Saturates for large values, leading to vanishing gradients
Why (Problem and Solution):
Problem: Vanishing gradients for large positive or negative inputs, slowing down learning in deep networks.
Solution: ReLU was introduced to avoid the saturation issue by having a linear response for positive values.
Mathematical Explanation and Proof: The gradient of the sigmoid function is σ'(z) = σ(z)(1 − σ(z)). As z moves far from zero (either positive or negative), σ(z) approaches 1 or 0, causing σ'(z) to approach 0, leading to very small gradients and hence slow learning.

ReLU (Rectified Linear Unit)
Formula: f(z) = max(0, z)
Comparison:
– Simple and computationally efficient
– Doesn't saturate for positive values
– Suffers from the "dying ReLU" problem
Why (Problem and Solution):
Problem: "Dying ReLU," where neurons stop learning when their inputs are negative, leading to dead neurons.
Solution: Leaky ReLU was introduced to allow a small, non-zero gradient when z < 0, preventing neurons from dying.
Mathematical Explanation and Proof: For z < 0, the gradient of ReLU is 0, meaning that neurons receiving negative inputs will not update during backpropagation. If this persists, the neuron is effectively "dead."

Leaky ReLU
Formula: LeakyReLU_α(z) = max(αz, z)
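To make the saturation and dying-ReLU behaviour in the table concrete, here is a minimal NumPy sketch (not from the original notes) that evaluates the gradients of sigmoid, ReLU, and Leaky ReLU at a few sample inputs; the function names and the α = 0.01 slope are illustrative choices, not a prescribed implementation.

```python
# Minimal sketch: compare gradients of sigmoid, ReLU, and Leaky ReLU
# at a few inputs to show sigmoid saturation and the "dying ReLU" issue.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)               # sigma'(z) = sigma(z)(1 - sigma(z))

def relu_grad(z):
    return (z > 0).astype(float)       # exactly 0 for z < 0 -> "dead" neurons

def leaky_relu_grad(z, alpha=0.01):    # alpha = 0.01 is an illustrative choice
    return np.where(z > 0, 1.0, alpha) # small non-zero slope for z < 0

z = np.array([-10.0, -1.0, 0.5, 10.0])
print("z:              ", z)
print("sigmoid grad:   ", sigmoid_grad(z))     # near 0 at |z| = 10 (saturation)
print("ReLU grad:      ", relu_grad(z))        # 0 for negative z (no update)
print("Leaky ReLU grad:", leaky_relu_grad(z))  # 0.01 instead of 0 for negative z
```

Running this shows the pattern the table describes: the sigmoid gradient collapses toward 0 for large |z|, the ReLU gradient is exactly 0 for negative inputs, and Leaky ReLU keeps a small non-zero gradient so those neurons can still update.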