Demystifying Trainable and Non-Trainable Parameters in Batch Normalization

Batch normalization (BN) is a powerful technique used in deep learning to stabilize and accelerate training. The core idea behind BN is to normalize the output of the previous layer by subtracting the batch mean and dividing by the batch standard deviation. This is expressed by the following general formula:

\[\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\]
\[y = \gamma \hat{x} + \beta\]

Where:

- \( x \) is the input to the batch normalization layer.
- \( \mu_B \) and \( \sigma_B^2 \) are the mean and variance of the current mini-batch, respectively.
- \( \epsilon \) is a small constant added to avoid division by zero.
- \( \hat{x} \) is the normalized value, before scaling and shifting.
- \( \gamma \) and \( \beta \) are learnable parameters that scale and shift the normalized output.

Why This Formula is Helpful

The normalization step ensures that the input to each layer has a consistent distribution, which addresses the problem of “internal covariate shift”, where the distribution of inputs to a layer changes during training. By maintaining a stable distribution, the training process becomes more efficient, requiring less careful initialization of parameters and allowing higher learning rates.

The addition of the \( \gamma \) and \( \beta \) parameters restores the capacity of the network to represent the original data distribution. The model can still learn any representation it could without normalization (for example, learning \( \gamma = \sqrt{\sigma_B^2 + \epsilon} \) and \( \beta = \mu_B \) would recover the identity mapping), while keeping the benefits of stabilized and accelerated training.

Batch normalization has been shown empirically to yield faster convergence and improved model performance, particularly in deeper networks.

Understanding Trainable and Non-Trainable Parameters in Batch Normalization

A batch normalization layer typically has four parameters (one set per normalized feature) associated with the normalization process:

- γ (Scale) – Trainable
- β (Shift) – Trainable
- μ (Moving Mean) – Non-Trainable
- σ² (Moving Variance) – Non-Trainable

Trainable Parameters: γ and β

γ (Scale): This parameter scales the normalized output, allowing the model to control the variance after normalization. Without it, the network might lose the ability to represent inputs with varying magnitudes.

β (Shift): This parameter shifts the normalized output, essentially adjusting the mean of the output distribution. It prevents the network from losing information about the original distribution after normalization.

Both γ and β are critical because they give the model the flexibility to learn the optimal scale and shift for the normalized activations during training. These parameters are updated via backpropagation, just like the weights in a Dense layer.

Non-Trainable Parameters: μ and σ²

μ (Moving Mean): This is the running average of the means computed over each mini-batch during training. It provides a stable mean during inference, when batch statistics may differ from those seen during training.

σ² (Moving Variance): Similar to the moving mean, this is the running average of the variances computed over each mini-batch. It provides a stable variance during inference.

These parameters are crucial for ensuring that batch normalization behaves consistently during inference. Unlike γ and β, μ and σ² are not updated by backpropagation; instead, they are updated by tracking batch statistics during training, typically as an exponential moving average (see the sketches at the end of this post).

When to Use Trainable and Non-Trainable Parameters

The decision…
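To make the mechanics above concrete, here is a minimal NumPy sketch of a batch normalization layer. The names (`batch_norm_train`, `moving_mean`, `moving_var`, and so on) are illustrative and not any particular framework's API: `gamma` and `beta` play the role of the trainable scale and shift, while the moving statistics are the non-trainable parameters, updated by an exponential moving average rather than by backpropagation.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, moving_mean, moving_var,
                     momentum=0.9, eps=1e-5):
    """Training-mode batch norm over a (batch, features) array.

    gamma/beta are the trainable scale and shift; moving_mean and
    moving_var are the non-trainable running statistics.
    """
    # Batch statistics, computed per feature (per column)
    mu_b = x.mean(axis=0)
    var_b = x.var(axis=0)

    # Normalize, then scale and shift: y = gamma * x_hat + beta
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)
    y = gamma * x_hat + beta

    # Non-trainable parameters are updated by tracking statistics,
    # not by backpropagation (exponential moving average here).
    moving_mean = momentum * moving_mean + (1.0 - momentum) * mu_b
    moving_var = momentum * moving_var + (1.0 - momentum) * var_b
    return y, moving_mean, moving_var

def batch_norm_infer(x, gamma, beta, moving_mean, moving_var, eps=1e-5):
    """Inference-mode batch norm: uses the stored running statistics."""
    x_hat = (x - moving_mean) / np.sqrt(moving_var + eps)
    return gamma * x_hat + beta

# Usage: 32 samples, 4 features
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)              # trainable (learned by SGD)
moving_mean, moving_var = np.zeros(4), np.ones(4)  # non-trainable

y, moving_mean, moving_var = batch_norm_train(x, gamma, beta,
                                              moving_mean, moving_var)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 mean, ~1 std
```

With the default initializations of γ = 1 and β = 0, the training-mode output has roughly zero mean and unit variance per feature; as γ and β are learned, the layer can move away from that default wherever it helps the network.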
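For a framework-level view, the sketch below assumes a TensorFlow/Keras setup (which the earlier mention of a Dense layer suggests, though the post does not name a framework). A BatchNormalization layer following a 16-unit Dense layer carries 4 parameters per feature: 32 trainable (γ and β) and 32 non-trainable (moving mean and moving variance). The layer name "bn" is chosen here purely so it can be looked up afterwards.

```python
import tensorflow as tf

# Small model: Dense(16) followed by BatchNormalization.
# The BN layer has 4 * 16 = 64 parameters: gamma and beta (32, trainable)
# plus moving mean and moving variance (32, non-trainable).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.BatchNormalization(name="bn"),
    tf.keras.layers.Dense(1),
])

model.summary()  # the summary reports the trainable / non-trainable split

bn = model.get_layer("bn")
print([w.name for w in bn.trainable_weights])      # gamma, beta
print([w.name for w in bn.non_trainable_weights])  # moving mean, moving variance
```

The trainable weights are the ones the optimizer updates through backpropagation; the non-trainable weights change only because the layer tracks batch statistics while training, exactly as described above.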