Momentum vs Normalization in Deep Learning – Part 2 – Day 34

Comparing Momentum and Normalization in Deep Learning: A Mathematical Perspective

Momentum and normalization are two pivotal techniques in deep learning that enhance the efficiency and stability of training. This article explores the mathematics behind these methods, provides examples with and without these techniques, and demonstrates why they are beneficial for deep learning models.

Part 1: Comparing Momentum and Normalization

Momentum: Smoothing and Accelerating Convergence

Momentum is an optimization technique that modifies the standard gradient descent by adding a velocity term to the update rule. This velocity term is a running average of past gradients, which helps the optimizer to continue moving in directions where gradients are consistently pointing, thereby accelerating convergence and reducing oscillations.

Mathematical Formulation:

Without Momentum (Standard Gradient Descent):

\theta_t = \theta_{t-1} - \alpha \nabla L(\theta_{t-1})

With Momentum:

v_t = \beta v_{t-1} + (1 - \beta) \nabla L(\theta_{t-1})

\theta_t = \theta_{t-1} - \alpha v_t

Here, \beta is the momentum coefficient (typically around 0.9), and v_t accumulates the gradients to provide smoother and more directed updates.

Example with and Without Momentum:

Consider a simple quadratic loss function L(\theta) = \theta^2, starting with \theta_0 = 2.0, a learning rate \alpha = 0.1, and \beta = 0.9 for momentum.

Without Momentum:

  • Iteration 1:
    • Gradient at \theta_0: \nabla L(\theta_0) = 2 \times 2.0 = 4.0
    • Update: \theta_1 = 2.0 - 0.1 \times 4.0 = 1.6
  • Iteration 2:
    • Gradient at \theta_1: \nabla L(\theta_1) = 2 \times 1.6 = 3.2
    • Update: \theta_2 = 1.6 - 0.1 \times 3.2 = 1.28

With Momentum:

  • Iteration 1:
    • Gradient at \theta_0: \nabla L(\theta_0) = 4.0
    • Velocity update: v_1 = 0.9 \times 0 + 0.1 \times 4.0 = 0.4
    • Update: \theta_1 = 2.0 - 0.1 \times 0.4 = 1.96
  • Iteration 2:
    • Gradient at \theta_1: \nabla L(\theta_1) = 3.92
    • Velocity update: v_2 = 0.9 \times 0.4 + 0.1 \times 3.92 = 0.752
    • Update: \theta_2 = 1.96 - 0.1 \times 0.752 = 1.8848

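The following Python snippet is a minimal sketch of the two update rules above (separate from the Keras example later in this article); running it reproduces the numbers in this worked example.

    # Minimal sketch: gradient descent on L(theta) = theta^2,
    # with and without the EMA-style momentum defined in the formulas above.

    def grad(theta):
        # dL/dtheta for L(theta) = theta^2
        return 2.0 * theta

    def run(alpha=0.1, beta=0.9, steps=2, use_momentum=False):
        theta, v = 2.0, 0.0
        for t in range(1, steps + 1):
            g = grad(theta)
            if use_momentum:
                v = beta * v + (1 - beta) * g   # velocity update
                theta = theta - alpha * v       # parameter update
            else:
                theta = theta - alpha * g       # plain gradient descent
            print(f"iteration {t}: theta = {theta:.4f}")
        return theta

    print("Without momentum:")
    run(use_momentum=False)   # 1.6000, 1.2800 -- matches the example above

    print("With momentum:")
    run(use_momentum=True)    # 1.9600, 1.8848 -- matches the example above
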
Why Momentum is Better:

  • Faster Convergence: With momentum, updates stay aligned with directions where gradients consistently agree, so the optimizer typically reaches the minimum in fewer iterations. (In the one-dimensional toy example above, momentum starts more slowly because the velocity needs a few iterations to build up; its advantage shows over longer runs and on curved, oscillation-prone loss surfaces.)
  • Reduced Oscillations: The momentum term smooths the path to the minimum, damping the oscillations that occur in standard gradient descent, especially in directions with steep gradients.

Normalization: Stabilizing the Learning Process

Normalization techniques, such as Batch Normalization, help to maintain consistent distributions of activations across layers during training, reducing issues like internal covariate shift and vanishing/exploding gradients.

Mathematical Formulation:

Batch Normalization:

\hat{x}^{(i)} = \frac{x^{(i)} - \mu_{\text{batch}}}{\sqrt{\sigma_{\text{batch}}^2 + \epsilon}}

y^{(i)} = \gamma \hat{x}^{(i)} + \beta

Here, \mu_{\text{batch}} and \sigma_{\text{batch}} are the batch mean and standard deviation, \epsilon is a small constant added for numerical stability, and \gamma and \beta are learnable scale and shift parameters.

Example with and Without Normalization:

Consider a batch of inputs x = [1, 2, 3, 4, 5] in a neural network.

Without Normalization:

The raw inputs are fed directly into the network, potentially leading to unstable activations in deeper layers, especially as earlier layers change during training.

With Batch Normalization:

  • Calculate batch mean \mu_{\text{batch}} = 3 and standard deviation \sigma_{\text{batch}} = 1.41.
  • Normalize: \hat{x} = \frac{[1, 2, 3, 4, 5] - 3}{1.41} = [-1.41, -0.71, 0, 0.71, 1.41]
  • Apply scaling (assume \gamma = 2 and \beta = 0): y = 2 \times \hat{x} = [-2.82, -1.42, 0, 1.42, 2.82]

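As a quick numerical check, here is a minimal NumPy sketch of the normalization and scaling steps for this batch (using the same assumed values \gamma = 2 and \beta = 0 as above):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    eps = 1e-5                      # small constant for numerical stability

    mu = x.mean()                   # batch mean: 3.0
    sigma = x.std()                 # batch (population) standard deviation: ~1.41
    x_hat = (x - mu) / np.sqrt(sigma**2 + eps)

    gamma, beta = 2.0, 0.0          # assumed values for the learnable scale and shift
    y = gamma * x_hat + beta

    print(np.round(x_hat, 2))       # [-1.41 -0.71  0.    0.71  1.41]
    print(np.round(y, 2))           # roughly [-2.83 -1.41  0.    1.41  2.83] (differs from the text only by rounding)
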
Why Normalization is Better:

  • Stable Activations: By normalizing inputs, the network receives inputs with consistent distributions, which leads to more stable training and often faster convergence.
  • Improved Gradient Flow: Normalization reduces the likelihood of vanishing/exploding gradients, especially in deep networks.

Part 2: Why Going Fast to the Minimum Can Be Problematic

The Problem with Large Steps in Gradient Descent

While it might seem advantageous to use large steps to quickly reach the minimum, this approach can lead to several issues:

  • Overshooting the Minimum: Large steps can cause the optimizer to jump past the minimum, leading to oscillations or divergence.
  • Landing in Poor Regions: In complex loss landscapes, large erratic steps can bounce the optimizer into suboptimal local minima or saddle-point regions instead of letting it settle near a good minimum.
  • Sensitivity to Noisy Gradients: Large steps can amplify the effect of noise in gradient estimates, causing erratic parameter updates.

How Momentum Helps

Momentum provides a solution to these problems by smoothing the gradient updates:

  • Controlled Progress: Momentum accumulates gradients over time, allowing for smoother and more controlled progress toward the minimum. This prevents overshooting and helps the optimizer escape from local minima or saddle points.
  • Efficiency: Even with smaller individual steps, momentum ensures steady and efficient progress toward the global minimum, combining the advantages of stability and speed.

Why Smaller Steps Without Momentum Are Also Not Ideal:

  • Slow Convergence: While smaller steps reduce the risk of overshooting, they can lead to very slow convergence, especially in flat regions of the loss surface.
  • Inefficiency: The need for many iterations with small steps increases computational costs and delays the learning process.

Momentum strikes a balance, providing a means to move steadily and efficiently toward the minimum without the pitfalls of large or small steps alone.
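
To make the overshooting and slow-convergence points concrete, here is a small sketch on the same toy loss L(\theta) = \theta^2 (separate from the Keras example below), comparing a small, a large, and a too-large learning rate for plain gradient descent:

    # Effect of the learning rate on plain gradient descent for L(theta) = theta^2.
    # The update is theta <- theta - alpha * 2 * theta = theta * (1 - 2 * alpha).

    def gd(alpha, steps=5, theta=2.0):
        path = [theta]
        for _ in range(steps):
            theta = theta - alpha * 2.0 * theta
            path.append(round(theta, 3))
        return path

    print("alpha = 0.1 :", gd(0.1))   # small steps: steady but slow progress toward 0
    print("alpha = 0.9 :", gd(0.9))   # large steps: overshoots past 0 every iteration, oscillating
    print("alpha = 1.1 :", gd(1.1))   # too large: each step overshoots further, so it diverges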

Conclusion

Momentum and normalization are both crucial in deep learning, but they address different challenges:

  • Momentum: Smooths and accelerates the optimization process by using accumulated gradients, enabling efficient convergence even with smaller steps.
  • Normalization: Stabilizes the learning process by maintaining consistent activations across layers, improving gradient flow and training stability.

Together, these techniques enhance the robustness and efficiency of deep learning models, making them indispensable tools in modern neural network training.

Applying Momentum and Normalization in Deep Learning

This section demonstrates how to apply momentum and normalization separately in a simple deep learning model using TensorFlow and Keras. We will train a Convolutional Neural Network (CNN) on the MNIST dataset, a popular dataset of handwritten digits, and compare the results with and without momentum and normalization.

Code Implementation

The following code snippet shows how to apply these techniques in a practical example:


    # Google Colab code for applying Momentum and Normalization in a simple CNN

    # Import necessary libraries
    import tensorflow as tf
    from tensorflow.keras import layers, models
    import matplotlib.pyplot as plt

    # Load and preprocess the MNIST dataset
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # Normalize the images to [0, 1] range
    x_train = x_train / 255.0
    x_test = x_test / 255.0

    # Reshape the data to add a channel dimension
    x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
    x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)

    # Define a function to create a simple CNN model
    def create_model(use_batch_norm=False):
        model = models.Sequential()
        model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
        if use_batch_norm:
            model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))

        model.add(layers.Conv2D(64, (3, 3), activation='relu'))
        if use_batch_norm:
            model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))

        model.add(layers.Flatten())
        model.add(layers.Dense(64, activation='relu'))
        if use_batch_norm:
            model.add(layers.BatchNormalization())
        model.add(layers.Dense(10, activation='softmax'))

        return model

    # Train and evaluate a model without momentum and batch normalization
    def train_model(use_momentum=False, use_batch_norm=False):
        model = create_model(use_batch_norm)

        if use_momentum:
            optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
        else:
            optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

        model.compile(optimizer=optimizer,
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])

        history = model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))

        # Plot training & validation accuracy values
        plt.plot(history.history['accuracy'], label='Training Accuracy')
        plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
        plt.title('Model accuracy')
        plt.xlabel('Epoch')
        plt.ylabel('Accuracy')
        plt.legend(loc='lower right')
        plt.show()

        return model

    # Train and compare models
    print("Training without Momentum and without Batch Normalization:")
    model_no_momentum_no_bn = train_model(use_momentum=False, use_batch_norm=False)

    print("\nTraining with Momentum but without Batch Normalization:")
    model_with_momentum_no_bn = train_model(use_momentum=True, use_batch_norm=False)

    print("\nTraining with Batch Normalization but without Momentum:")
    model_no_momentum_with_bn = train_model(use_momentum=False, use_batch_norm=True)

    print("\nTraining with both Momentum and Batch Normalization:")
    model_with_momentum_with_bn = train_model(use_momentum=True, use_batch_norm=True)

    

Explanation

Model Creation: The create_model function constructs a simple CNN. It has an option (use_batch_norm) to include Batch Normalization layers after each convolutional layer and after the hidden dense layer (but not after the output layer).

Training Function: The train_model function trains the model with or without momentum (by setting use_momentum) and with or without Batch Normalization (by setting use_batch_norm).

Comparison: The code trains four different models:

  • Without Momentum and without Batch Normalization: Baseline model with standard SGD optimization.
  • With Momentum but without Batch Normalization: Model optimized using momentum to accelerate convergence.
  • With Batch Normalization but without Momentum: Model stabilized using Batch Normalization for improved training.
  • With both Momentum and Batch Normalization: Combining both techniques to enhance the training process.

The training and validation accuracies are plotted for each configuration, allowing you to visually compare the impact of momentum and normalization on model performance.

Instructions

Copy and paste this code into a Google Colab notebook and run it to observe how momentum and normalization affect the training and validation accuracy of the CNN. This code will help you see the benefits of using momentum (faster and more stable convergence) and normalization (improved training stability and speed) in training a deep learning model.

Note to Understand the Code

This section breaks down the code into key parts, explaining how momentum and normalization are implemented and compared.

1. Creating the Model

The first key part of the code is the create_model function, which builds a Convolutional Neural Network (CNN). Setting the use_batch_norm parameter to True inserts a Batch Normalization layer after each convolutional layer and after the hidden dense layer.


    def create_model(use_batch_norm=False):
        model = models.Sequential()
        model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
        if use_batch_norm:
            model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))

        model.add(layers.Conv2D(64, (3, 3), activation='relu'))
        if use_batch_norm:
            model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))

        model.add(layers.Flatten())
        model.add(layers.Dense(64, activation='relu'))
        if use_batch_norm:
            model.add(layers.BatchNormalization())
        model.add(layers.Dense(10, activation='softmax'))

        return model

    

Explanation:

  • Conditional Batch Normalization: The use_batch_norm parameter determines whether Batch Normalization is added to the model. If True, a BatchNormalization layer is added after each convolutional layer and after the hidden dense layer.
  • Customization: This design allows you to easily toggle normalization on or off when creating the model, enabling a straightforward comparison between models with and without normalization.

2. Training the Model

The next important part of the code is the train_model function, which handles the training of the CNN. This function takes two parameters: use_momentum and use_batch_norm, allowing you to control whether momentum or Batch Normalization is used during training.


    def train_model(use_momentum=False, use_batch_norm=False):
        model = create_model(use_batch_norm)

        if use_momentum:
            optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
        else:
            optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

        model.compile(optimizer=optimizer,
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])

        history = model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))

        # Plot training & validation accuracy values
        plt.plot(history.history['accuracy'], label='Training Accuracy')
        plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
        plt.title('Model accuracy')
        plt.xlabel('Epoch')
        plt.ylabel('Accuracy')
        plt.legend(loc='lower right')
        plt.show()

        return model

    

Explanation:

  • Conditional Momentum: The use_momentum parameter determines whether the model uses an optimizer with momentum. If True, an SGD optimizer with a momentum of 0.9 is used; otherwise, a standard SGD optimizer without momentum is applied.
  • Comparison Setup: By passing different combinations of use_momentum and use_batch_norm to this function, you can train models under different configurations (e.g., with momentum only, with normalization only, with both, or with neither) and compare their performance.

3. Comparison of Models

The code compares four models:

  1. Without momentum and without Batch Normalization.
  2. With momentum but without Batch Normalization.
  3. With Batch Normalization but without momentum.
  4. With both momentum and Batch Normalization.

    # Train and compare models
    print("Training without Momentum and without Batch Normalization:")
    model_no_momentum_no_bn = train_model(use_momentum=False, use_batch_norm=False)

    print("\nTraining with Momentum but without Batch Normalization:")
    model_with_momentum_no_bn = train_model(use_momentum=True, use_batch_norm=False)

    print("\nTraining with Batch Normalization but without Momentum:")
    model_no_momentum_with_bn = train_model(use_momentum=False, use_batch_norm=True)

    print("\nTraining with both Momentum and Batch Normalization:")
    model_with_momentum_with_bn = train_model(use_momentum=True, use_batch_norm=True)

    

Explanation:

  • Comprehensive Testing: The code systematically trains the CNN under four different conditions to directly compare the impact of momentum and Batch Normalization on the training process.
  • Visual Comparison: After training, the code plots the training and validation accuracy for each model, allowing you to visually assess which configuration performs best.

Summary

  • Model Creation: The create_model function allows you to easily include or exclude Batch Normalization layers.
  • Training with Momentum: The train_model function lets you choose whether to use an optimizer with momentum, enabling a direct comparison of momentum’s effect.
  • Comparison of Models: The final part of the code systematically compares four different model configurations, allowing you to observe the impact of momentum and Batch Normalization on model performance.

Note: Overlap vs. Convergence of Training and Validation Accuracy

When evaluating the performance of a deep learning model, the relationship between training accuracy and validation accuracy is crucial. Let’s break down what it means for these two metrics to either overlap or converge closely, and why one might be preferred over the other.

1. Convergence (Close but Not Perfect Overlap)

Convergence occurs when the training and validation accuracies follow a similar trend and get close to each other but do not perfectly overlap. This situation is often considered ideal for the following reasons:

  • Indication of Good Generalization: A small gap between training and validation accuracy indicates that the model is learning the patterns in the data effectively without overfitting. Overfitting happens when the model performs well on the training data but poorly on unseen (validation) data.
  • Healthy Model Performance: If training accuracy is slightly higher than validation accuracy, it suggests that the model is not simply memorizing the training data (which would lead to overfitting) but is instead learning patterns that can generalize to new data.
  • Regularization in Action: Techniques like Batch Normalization and momentum help in smoothing the learning process, leading to better generalization. This is why even with a small gap, the model can be performing optimally.

2. Perfect Overlap

Perfect overlap occurs when the training accuracy and validation accuracy lines are almost identical.

  • Risk of Underfitting: If the model’s training and validation accuracies are perfectly overlapping and both are relatively low, this can indicate underfitting—the model is not complex enough to capture the underlying patterns in the data.
  • Too Good to Be True? If both accuracies are very high and perfectly overlapping, it might seem like an ideal scenario. However, in many real-world scenarios, this is rare and could suggest that the model might not be as robust to noise or unseen data as it appears. It may also indicate that the validation set is too similar to the training set, not providing a good challenge for the model to prove its generalization ability.

Why Convergence (Close, Not Overlapping) is Generally Better

  • Realistic Assessment of Generalization: A slight gap between the accuracies provides a more realistic view of how well the model will perform on truly unseen data.
  • Prevention of Overfitting: A small gap ensures that the model is not overfitting to the training data, which is a common risk in deep learning.
  • Better Robustness: Models that show good convergence (with a small gap) are often more robust and perform better when applied to new, diverse datasets.
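
One simple way to put a number on this is the gap between the final training and validation accuracies. Note that the train_model function above returns the model rather than the training history, so the sketch below assumes you either call model.fit yourself or modify train_model to also return its history object:

    # Hypothetical sketch: quantify the generalization gap from a Keras History object.
    # Assumes `history` was obtained from a call such as:
    #   history = model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))

    def generalization_gap(history):
        train_acc = history.history['accuracy'][-1]     # final training accuracy
        val_acc = history.history['val_accuracy'][-1]   # final validation accuracy
        gap = train_acc - val_acc
        print(f"train acc: {train_acc:.4f}  val acc: {val_acc:.4f}  gap: {gap:.4f}")
        return gap

    # Rough reading of the gap (illustrative guidelines, not hard rules):
    #   small positive gap               -> healthy convergence and good generalization
    #   large positive gap               -> likely overfitting
    #   near-zero gap with low accuracy  -> possible underfitting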

Conclusion

Convergence with a Small Gap between training and validation accuracy is generally better because it suggests that the model is learning well and generalizing to new data without overfitting.

Perfect Overlap can sometimes indicate underfitting or an overly optimistic view of model performance, especially if the validation set is not sufficiently challenging.

In practice, you should aim for a model where training and validation accuracies converge closely, indicating a balance between learning the data well and generalizing to unseen data. In our comparison, looking at the loss values and how the accuracy curves converge, the model that applied both momentum and normalization gave the best results. Did you enjoy this article? Check out our iOS apps and shop to enjoy even more 🙂 and don't forget to support INGOAMPT so we can keep bringing you even more valuable content soon 🙂