Introduction to Batch Normalization

Batch normalization is a widely used technique in deep learning that significantly improves the performance and stability of neural networks. Introduced by Sergey Ioffe and Christian Szegedy in 2015, it addresses the vanishing and exploding gradient problems that can occur during training, particularly in deep networks.

Why Batch Normalization?

In deep learning, as data propagates through the layers of a neural network, the distribution of inputs to layers deeper in the network can shift, a phenomenon known as internal covariate shift. This shift can cause vanishing gradients, where gradients become too small and training slows down, or exploding gradients, where they become too large and training becomes unstable. Traditional remedies such as careful initialization and lower learning rates help, but they don’t entirely solve these problems.

What is Batch Normalization?

Batch normalization (BN) mitigates these issues by normalizing the inputs of each layer over a mini-batch, ensuring that the inputs to a given layer have a consistent distribution. This normalization is applied just before or after the activation function of each hidden layer. Here’s a step-by-step breakdown of how batch normalization works:

1. Zero-Centering and Normalization: For each mini-batch, compute the mean (\(\mu_B\)) and variance (\(\sigma_B^2\)) of the inputs:

\[ \mu_B = \frac{1}{m_B} \sum_{i=1}^{m_B} x^{(i)} \]

Normalize the inputs by subtracting the mean and dividing by the standard deviation:

\[ \hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \]

Here, \(\epsilon\) is a small constant added for numerical stability.

2. Scaling and Shifting: After normalization, the data is scaled and shifted using two learned parameters, \(\gamma\) (scale) and \(\beta\) (shift):

\[ z^{(i)} = \gamma \hat{x}^{(i)} + \beta \]

This allows the network to undo the normalization if that proves beneficial.

Benefits of Batch Normalization

- Reduces Internal Covariate Shift: By normalizing each layer’s inputs, batch normalization stabilizes the learning process, allowing the network to use higher learning rates.
- Acts as a Regularizer: BN adds noise to each hidden layer’s input in every mini-batch, providing a regularizing effect similar to dropout.
- Eliminates the Need for Careful Initialization: The network becomes less sensitive to the scale of the initial weights, allowing simpler initialization schemes.

Implementing Batch Normalization in Keras

Here’s how you can implement batch normalization in a Keras model:
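A minimal sketch of such a model, assuming a flattened 28×28 input (e.g. Fashion MNIST), two hidden layers, and a 10-class output; the input shape, layer sizes, and optimizer are illustrative assumptions rather than part of the technique:

```python
from tensorflow import keras

# Minimal sketch: a BatchNormalization layer after the input layer
# and after each hidden dense layer (sizes are illustrative).
model = keras.models.Sequential([
    keras.layers.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])
model.summary()
```

Calling model.summary() shows that each BatchNormalization layer adds four parameter vectors (one value per input feature): the trainable \(\gamma\) and \(\beta\), plus the non-trainable moving mean and moving variance used at inference time.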
This model demonstrates the use of batch normalization layers both before and after the dense layers, which is a common practice.

Understanding the Parameters

Gamma (\(\gamma\)) and Beta (\(\beta\)): These are trainable parameters that allow the network to learn the optimal scale and shift for each normalized input.

Moving Mean and Moving Variance: During training, the network keeps running averages of each batch’s mean and variance so it can use them at test time, where the input might not be as well-behaved as during training.

Understanding the Math Behind Batch Normalization

In Part 1, we covered the basic concept and implementation of Batch Normalization (BN). Now, let’s delve deeper into the mathematical foundations and explore more advanced aspects. Batch normalization operates by normalizing the inputs of each layer across the mini-batch. Here’s a more detailed breakdown of the algorithm:

1. Calculate the Mean:

\[ \mu_B = \frac{1}{m_B} \sum_{i=1}^{m_B} x^{(i)} \]

2. Calculate the Variance:

\[ \sigma_B^2 = \frac{1}{m_B} \sum_{i=1}^{m_B} \left(x^{(i)} - \mu_B\right)^2 \]

3. Normalize the Input:

\[ \hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \]

The small constant \(\epsilon\) ensures numerical stability by preventing division by zero.

4. Scale and Shift:

\[ z^{(i)} = \gamma \hat{x}^{(i)} + \beta \]

Here, \(\gamma\) and \(\beta\) are trainable parameters that allow the model to scale and shift the normalized output.

Training and Test Time Differences

One important consideration when using batch normalization is the difference between how the network behaves during training and during inference (test time).

Training: The mean and variance are calculated on each mini-batch, allowing the network to adapt dynamically to the distribution of the input data.

Inference: The network uses a running average of the mean and variance accumulated during training to normalize the inputs. This ensures consistency and avoids issues that might arise from a smaller or differently distributed batch.

Most deep learning frameworks, including Keras, handle this switch between training and inference automatically when using the BatchNormalization layer.

Implementing Batch Normalization in Keras: Code Examples

Here are some practical examples demonstrating different ways to implement Batch Normalization in Keras; code sketches for each configuration follow the descriptions below.

1. Batch Normalization Before the Activation Function: In this configuration, Batch Normalization is applied before the activation functions. This approach is often recommended because it can lead to faster convergence and better performance.

2. Batch Normalization After the Activation Function: Here, Batch Normalization is applied after the activation functions. While this approach can still be effective, it may not be as beneficial as applying BN before the activation, especially when using non-linear activations like ReLU.

3. Using Batch Normalization Without Bias: In certain cases, you can disable the bias term in the dense layers when Batch Normalization is applied before the activation, because the BN layer’s \(\beta\) parameter already provides a learned per-feature offset, making the dense layer’s bias redundant.
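The first two configurations might look like the following minimal sketches; the input shape and layer sizes are illustrative assumptions carried over from the model above.

```python
from tensorflow import keras

# 1. Batch Normalization BEFORE the activation function:
# each Dense layer is created without an activation, BN normalizes its linear
# output, and the activation is applied as a separate layer afterwards.
bn_before_activation = keras.models.Sequential([
    keras.layers.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(300),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(100),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(10, activation="softmax"),
])

# 2. Batch Normalization AFTER the activation function:
# each Dense layer applies ReLU itself, and BN normalizes the activated output.
bn_after_activation = keras.models.Sequential([
    keras.layers.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax"),
])
```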
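For the third configuration, a sketch under the same assumptions: each hidden Dense layer is created with use_bias=False because the BatchNormalization layer that follows it already learns a per-feature offset (\(\beta\)).

```python
from tensorflow import keras

# 3. Batch Normalization without bias:
# use_bias=False drops the redundant bias from the Dense layers, since the
# following BatchNormalization layer provides the offset via its beta parameter.
bn_without_bias = keras.models.Sequential([
    keras.layers.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(300, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(100, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(10, activation="softmax"),
])

bn_without_bias.compile(loss="sparse_categorical_crossentropy",
                        optimizer="sgd",
                        metrics=["accuracy"])
```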