Machine Learning Overview

Batch normalisation part 2 – day 26

Introduction to Batch Normalization

Batch normalization is a widely used technique in deep learning that significantly improves the performance and stability of neural networks. Introduced by Sergey Ioffe and Christian Szegedy in 2015, this technique addresses the issues of vanishing and exploding gradients that can occur during training, particularly in deep networks.

Why Batch Normalization?

In deep learning, the parameters of earlier layers keep changing during training, so the distribution of inputs to deeper layers keeps shifting as well. This phenomenon is known as internal covariate shift. It can contribute to problems such as vanishing gradients, where gradients become too small and slow down training, or exploding gradients, where they become too large and make training unstable. Traditional solutions like careful initialization and lower learning rates help, but they don’t entirely solve these problems.

What is Batch Normalization?

Batch normalization (BN) mitigates these issues by normalizing the inputs of each layer within a mini-batch, ensuring that the inputs to a given layer have a consistent distribution. This normalization happens just before or after the activation function of each hidden layer.

Here’s a step-by-step breakdown of how batch normalization works:

  1. Zero-Centering and Normalization:
    For each mini-batch, compute the mean (\(\mu_B\)) and variance (\(\sigma_B^2\)) of the inputs, where \(m_B\) is the number of instances in the mini-batch:
    \[
    \mu_B = \frac{1}{m_B} \sum_{i=1}^{m_B} x^{(i)}, \qquad \sigma_B^2 = \frac{1}{m_B} \sum_{i=1}^{m_B} \left(x^{(i)} - \mu_B\right)^2
    \]
    Normalize the inputs by subtracting the mean and dividing by the standard deviation:
    \[
    \hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
    \]
    Here, \(\epsilon\) is a small constant added for numerical stability.
  2. Scaling and Shifting:
    After normalization, the data is scaled and shifted using two learned parameters, \(\gamma\) (scale) and \(\beta\) (shift):
    \[
    z^{(i)} = \gamma \hat{x}^{(i)} + \beta
    \]
    This allows the network to undo the normalization if it proves beneficial.
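The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the training-time computation, not the Keras implementation; the function name `batch_norm_forward`, the input shape, and the value of `eps` are assumptions made for the example.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-3):
    """Training-mode BN for a mini-batch x of shape (m_B, n_features)."""
    mu = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                     # per-feature variance (divides by m_B)
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero-center and normalize
    return gamma * x_hat + beta             # scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # made-up mini-batch
z = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

print(z.mean(axis=0))  # close to 0 for every feature
print(z.std(axis=0))   # close to 1 for every feature
```

With \(\gamma = 1\) and \(\beta = 0\) the output is simply the normalized input; other values let the layer choose a different scale and offset.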

Benefits of Batch Normalization

  • Reduces Internal Covariate Shift: By normalizing each layer’s inputs, batch normalization stabilizes the learning process, allowing the network to use higher learning rates.
  • Acts as a Regularizer: BN adds noise to each hidden layer’s input in every mini-batch, providing a regularizing effect similar to dropout.
  • Eliminates the Need for Careful Initialization: The network becomes less sensitive to the scale of initial weights, allowing for simpler initialization schemes.
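To get a feel for why the scale of the initial weights matters less, here is a small NumPy sketch. It assumes BN is applied directly to the output of a linear layer, and the helper name `normalize` and all shapes are made up for the illustration. Multiplying the weights by 0.1 or by 10 gives almost the same normalized output; the tiny remaining difference comes from \(\epsilon\).

```python
import numpy as np

def normalize(x, eps=1e-5):
    """Normalization step of BN (no gamma/beta), per feature over the batch."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(42)
inputs = rng.normal(size=(128, 20))
weights = rng.normal(size=(20, 8))

small = normalize(inputs @ (0.1 * weights))    # weights scaled down
large = normalize(inputs @ (10.0 * weights))   # weights scaled up

max_gap = np.abs(small - large).max()
print(max_gap)  # very small compared with the unit scale of the outputs
```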

Implementing Batch Normalization in Keras

Here’s how you can implement batch normalization in a Keras model:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation='relu', kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation='relu', kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation='softmax')
])

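This model applies a batch normalization layer right after the Flatten layer, where it standardizes the raw inputs, and after each hidden Dense layer. Because the ReLU activation is built into each Dense layer here, these BN layers act after the activation function.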

Understanding the Parameters

  • Gamma (\(\gamma\)) and Beta (\(\beta\)): These are trainable parameters that allow the network to learn the optimal scale and shift for each normalized input.
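  • Moving Mean and Moving Variance: During training, each BN layer keeps exponential moving averages of the batch means and variances. These are not trained by backpropagation; they are used in place of batch statistics at test time, when inputs may arrive one at a time or in batches that are not representative of the training data.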
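As a quick sanity check on these four quantities, the sketch below counts the parameters that the three BN layers in the model above would hold. It assumes one \(\gamma\), one \(\beta\), one moving mean, and one moving variance per input feature; the helper name `bn_param_counts` is made up for this example.

```python
def bn_param_counts(num_features):
    """Return (trainable, non_trainable) parameter counts for one BN layer."""
    trainable = 2 * num_features       # gamma and beta
    non_trainable = 2 * num_features   # moving mean and moving variance
    return trainable, non_trainable

# Feature sizes seen by the three BN layers in the model above:
# the 28*28 flattened inputs, then the 300- and 100-unit hidden layers.
bn_inputs = [28 * 28, 300, 100]
counts = [bn_param_counts(n) for n in bn_inputs]

total_trainable = sum(t for t, _ in counts)
total_non_trainable = sum(n for _, n in counts)
print(total_trainable, total_non_trainable)  # 2368 2368
```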

Understanding the Math Behind Batch Normalization

In Part 1, we covered the basic concept and implementation of Batch Normalization (BN). Now, let’s delve deeper into the mathematical foundations and explore more advanced aspects.

Batch normalization operates by normalizing the inputs of each layer across the mini-batch. Here’s a more detailed breakdown of the algorithm:

  1. Calculate the Mean:
    \[
    \mu_B = \frac{1}{m_B} \sum_{i=1}^{m_B} x^{(i)}
    \]
  2. Calculate the Variance:
    \[
    \sigma_B^2 = \frac{1}{m_B} \sum_{i=1}^{m_B} \left(x^{(i)} - \mu_B\right)^2
    \]
  3. Normalize the Input:
    \[
    \hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
    \]
    The small constant \(\epsilon\) ensures numerical stability by preventing division by zero.
  4. Scale and Shift:
    \[
    z^{(i)} = \gamma \hat{x}^{(i)} + \beta
    \]
    Here, \(\gamma\) and \(\beta\) are trainable parameters that allow the model to scale and shift the normalized output.
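A small worked example may help. The numbers are made up for illustration, and \(\epsilon\) is ignored because it is tiny compared with the variance. Take a single feature over a mini-batch of four values \(x = (1, 2, 3, 4)\), with \(\gamma = 2\) and \(\beta = 1\):

\[
\mu_B = \frac{1 + 2 + 3 + 4}{4} = 2.5, \qquad
\sigma_B^2 = \frac{2.25 + 0.25 + 0.25 + 2.25}{4} = 1.25
\]
\[
\hat{x} \approx (-1.342,\ -0.447,\ 0.447,\ 1.342), \qquad
z = 2\hat{x} + 1 \approx (-1.683,\ 0.106,\ 1.894,\ 3.683)
\]

The normalized values have mean 0 and variance 1; the scale and shift step then moves them to mean \(\beta = 1\) and standard deviation \(\gamma = 2\).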

Training and Test Time Differences

One important consideration when using batch normalization is the difference between how the network behaves during training versus inference (test time).

  • Training: During training, the mean and variance are calculated on each mini-batch, allowing the network to adapt to the distribution of the input data dynamically.
  • Inference: During inference, the network uses a running average of the mean and variance from training to normalize the inputs. This ensures consistency and avoids issues that might arise from having a smaller or differently distributed batch.

Most deep learning frameworks, including Keras, handle this switch between training and inference automatically when using the BatchNormalization layer.
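The switch between the two modes can be sketched with a toy layer in NumPy. This is not the Keras implementation; the class name `SimpleBatchNorm` and its attributes are made up for illustration, and the moving-average update follows the rule given in the Keras documentation (moving = moving * momentum + batch_value * (1 - momentum)).

```python
import numpy as np

class SimpleBatchNorm:
    """Toy BN layer that tracks moving statistics (illustrative only)."""

    def __init__(self, num_features, momentum=0.99, eps=1e-3):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Use the statistics of the current mini-batch ...
            mean, var = x.mean(axis=0), x.var(axis=0)
            # ... and fold them into the moving averages.
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Inference: rely on the statistics accumulated during training.
            mean, var = self.moving_mean, self.moving_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(1)
bn = SimpleBatchNorm(num_features=3, momentum=0.9)
for _ in range(200):  # simulated training steps on made-up data
    bn(rng.normal(loc=4.0, scale=2.0, size=(32, 3)), training=True)

single = np.array([[4.0, 4.0, 4.0]])   # a "batch" of one example
out = bn(single, training=False)
print(bn.moving_mean, bn.moving_var)   # near 4 and near 4 (the data variance)
print(out)                             # near 0, since the input sits at the mean
```

Note that a single example could not be normalized sensibly with its own batch statistics (its variance would be zero), which is one reason the moving averages are used at inference.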

Implementing Batch Normalization in Keras: Code Examples

Here are some practical examples to demonstrate different ways to implement Batch Normalization in Keras.

1. Batch Normalization Before the Activation Function:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),  # standardizes the raw inputs
    tf.keras.layers.Dense(300, kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),  # BN before activation
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),  # BN before activation
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation='softmax')
])

In this configuration, Batch Normalization is applied between each hidden Dense layer and its activation function. The authors of the original BN paper argued for this placement, and it often works well, although which placement is best is debated and can depend on the task.

2. Batch Normalization After the Activation Function:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal"),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.BatchNormalization(),  # BN after activation
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation='softmax')
])

Here, Batch Normalization is applied after the activation functions, so each hidden layer’s outputs are normalized before they reach the next layer. This approach can still be effective, though it may not be as beneficial as applying BN before the activation; it is worth trying both placements on your own task.

3. Using Batch Normalization Without Bias:

When a Batch Normalization layer directly follows a dense layer, you can disable that dense layer’s bias term. The bias is redundant there: BN subtracts the batch mean, which cancels any constant offset, and then adds its own trainable shift parameter (\(\beta\)).

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, use_bias=False, kernel_initializer="he_normal"),  # No bias
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(100, use_bias=False, kernel_initializer="he_normal"),  # No bias
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation='softmax')
])

By setting use_bias=False, we eliminate the redundant bias terms from the dense layers. This saves one parameter per unit, since any bias would be cancelled by BN’s mean subtraction and replaced by its shift parameter (\(\beta\)). The output layer keeps its bias because no BN layer follows it.
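The redundancy of the bias can be checked numerically. The NumPy sketch below uses made-up shapes and a hypothetical helper `normalize` that performs only the normalization step of BN; it shows that adding a constant bias before normalization leaves the result unchanged.

```python
import numpy as np

def normalize(x, eps=1e-5):
    """Normalization step of BN (no gamma/beta), per feature over the batch."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(7)
inputs = rng.normal(size=(64, 10))
weights = rng.normal(size=(10, 5))
bias = rng.normal(size=5)

without_bias = normalize(inputs @ weights)
with_bias = normalize(inputs @ weights + bias)   # bias shifts the mean only

print(np.allclose(without_bias, with_bias))  # True
```

The bias moves each feature’s batch mean by the same constant, so subtracting the mean removes it again, while the variance is unaffected.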

Hyperparameters in Batch Normalization

While the defaults often work well, Batch Normalization has several hyperparameters that you might need to tweak:

  • Momentum:
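    This controls how quickly the running averages of mean and variance are updated. Keras updates each running statistic as moving = moving * momentum + batch_value * (1 - momentum), so a value close to 1 (e.g., 0.9, 0.99, 0.999) means the averages change slowly and rely mostly on past mini-batches rather than the most recent one. The Keras default is 0.99.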
    Larger datasets with smaller mini-batches may require higher momentum values to maintain stability.
  • Axis:
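    Specifies the axis that holds the features (or channels). By default this is the last axis (axis=-1); the mean and variance are computed over all other axes, giving one pair of statistics per feature for fully connected layers or per channel for channels-last convolutional layers.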
    Adjusting the axis parameter allows BN to be applied in various configurations depending on the data structure (e.g., 2D convolutional layers).
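To make the effect of momentum concrete, here is a small pure-Python sketch of the moving-average update, moving = moving * momentum + batch_value * (1 - momentum). It assumes, for simplicity, that every batch reports the same statistic (`target`); the function name `steps_to_track` and the 5% tolerance are made up for the illustration.

```python
def steps_to_track(momentum, target=1.0, tolerance=0.05):
    """Count updates until the moving average is within `tolerance` of `target`."""
    moving, steps = 0.0, 0
    while abs(target - moving) > tolerance:
        moving = momentum * moving + (1.0 - momentum) * target
        steps += 1
    return steps

for m in (0.9, 0.99, 0.999):
    print(m, steps_to_track(m))
```

Each extra 9 in the momentum makes the running statistics roughly ten times slower to respond (on the order of 30, 300, and 3,000 updates here). Slower updates give smoother estimates, but they also need more training steps before the estimates are accurate.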

Challenges and Considerations

While Batch Normalization is a powerful tool, it’s not without its challenges:

  • Increased Complexity: Adding BN layers increases the complexity of your model. Each BN layer introduces two trainable parameters per input feature (\(\gamma\), \(\beta\)), two non-trainable moving statistics per feature, and extra operations that increase computational overhead.
  • Training Time: The addition of BN layers can slow down the training time per epoch due to the extra computations required. However, BN typically reduces the total number of epochs needed to converge, often resulting in a faster overall training process.
  • Small Batch Sizes: BN may become less effective with very small batch sizes since the mean and variance estimates become less reliable. In such cases, alternatives like Layer Normalization (LN) or Group Normalization (GN) might be better suited.
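The small-batch issue can be illustrated with a quick simulation. This sketch uses synthetic standard-normal data, and the helper name `spread_of_batch_means` is made up; it measures how much the batch mean fluctuates from one mini-batch to the next for several batch sizes.

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.normal(loc=0.0, scale=1.0, size=100_000)  # synthetic feature values

def spread_of_batch_means(batch_size, num_batches=2_000):
    """Standard deviation of the batch mean across many random mini-batches."""
    batches = rng.choice(population, size=(num_batches, batch_size))
    return batches.mean(axis=1).std()

for size in (2, 8, 32, 256):
    print(size, round(float(spread_of_batch_means(size)), 3))
```

The spread shrinks roughly as one over the square root of the batch size, so with only a couple of examples per batch the statistics used for normalization are very noisy.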

Practical Tips for Using Batch Normalization

  • Positioning BN Layers:
    Place BN layers before the activation function for ReLU and other non-saturating activations. However, you can experiment with positioning to see what works best for your specific model and dataset.
  • Use with Dropout:
    While BN and Dropout are both regularization techniques, using them together can sometimes result in diminished returns. Often, it’s advisable to use one or the other, though some models benefit from both.
  • Tuning Momentum:
    Experiment with different momentum values, especially if your dataset or batch size changes. Higher momentum can help stabilize training on large datasets with small batch sizes.
  • Handling Small Batches:
    If your model’s batch size is small (e.g., due to memory constraints), consider using other normalization techniques like Layer Normalization, which normalizes across the feature axis rather than across the batch axis.
  • Monitoring Performance:
    Keep an eye on training and validation loss curves when using Batch Normalization. If the validation loss starts to increase while the training loss keeps decreasing, the model is likely overfitting, and you may need additional regularization or adjustments to your BN layers or other hyperparameters.
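For the small-batch tip above, here is a minimal NumPy sketch of the idea behind Layer Normalization. It is a simplified version that omits the learned scale and shift, and the function name `layer_norm` is made up for the example. Each example is normalized over its own features, so the result does not depend on what else is in the batch.

```python
import numpy as np

def layer_norm(x, eps=1e-3):
    """Normalize each example over its own features (the last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(5)
batch = rng.normal(loc=2.0, scale=3.0, size=(4, 16))  # 4 examples, 16 features

full = layer_norm(batch)        # first example normalized inside a batch of 4
alone = layer_norm(batch[:1])   # the same example in a batch of 1

print(np.allclose(full[0], alone[0]))  # True
```

Because no batch statistics are involved, this kind of normalization behaves the same during training and inference and works even with a batch size of one.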

Conclusion

Batch Normalization has revolutionized the training of deep neural networks by making them more stable and allowing for the use of higher learning rates. While it introduces additional complexity and computational overhead, its benefits in terms of faster convergence, reduced internal covariate shift, and regularization often outweigh the downsides.

Integrated and tuned with care, Batch Normalization can make deep networks noticeably easier to train across a wide range of deep learning tasks.