Machine Learning Overview

Weight Initialization in Deep Learning Well Explained: Day 21

Weight Initialization in Deep Learning: Classic and Emerging Techniques

Understanding the correct initialization of weights in deep learning models is crucial for effective training and convergence. This post explores both classic and advanced weight initialization strategies, providing mathematical insights and practical code examples.


Part 1: Classic Weight Initialization Techniques

1. Glorot (Xavier) Initialization

Glorot Initialization is designed to maintain the variance of activations across layers, particularly effective for activation functions like tanh and sigmoid.

Mathematical Formula:

Uniform Distribution:

\( W \sim U\left(-\frac{\sqrt{6}}{\sqrt{\text{fan\_in} + \text{fan\_out}}}, \frac{\sqrt{6}}{\sqrt{\text{fan\_in} + \text{fan\_out}}}\right) \)

Normal Distribution:

\( W \sim \mathcal{N}\left(0, \frac{2}{\text{fan\_in} + \text{fan\_out}}\right) \)

Code Example in Keras:


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import GlorotUniform, GlorotNormal

model = Sequential()

# Using Glorot Uniform
model.add(Dense(64, kernel_initializer=GlorotUniform(), activation='tanh'))

# Using Glorot Normal
model.add(Dense(64, kernel_initializer=GlorotNormal(), activation='tanh'))


2. He Initialization

He Initialization is tailored to ReLU and its variants: because ReLU zeroes out roughly half of its inputs, the weights are scaled up so that the variance of activations and gradients stays in a healthy range across layers.

Mathematical Formula:

Uniform Distribution:

\( W \sim U\left(-\sqrt{\frac{6}{\text{fan\_in}}}, \sqrt{\frac{6}{\text{fan\_in}}}\right) \)

Normal Distribution:

\( W \sim \mathcal{N}\left(0, \frac{2}{\text{fan\_in}}\right) \)

Code Example in Keras:

from tensorflow.keras.initializers import HeUniform, HeNormal

# Using He Uniform
model.add(Dense(64, kernel_initializer=HeUniform(), activation='relu'))

# Using He Normal
model.add(Dense(64, kernel_initializer=HeNormal(), activation='relu'))

3. LeCun Initialization

LeCun Initialization is used for the SELU activation function, maintaining the self-normalizing property of the network.

Mathematical Formula:

Normal Distribution:

\( W \sim \mathcal{N}\left(0, \frac{1}{\text{fan\_in}}\right) \)

Code Example in Keras:


from tensorflow.keras.initializers import LecunNormal

# Using LeCun Normal
model.add(Dense(64, kernel_initializer=LecunNormal(), activation='selu'))

Summary Table: Classic Initialization Methods

Initialization | Activation Functions | Variance (Normal Distribution) | Range (Uniform Distribution)
Glorot (Xavier) | tanh, sigmoid, softmax, linear | \(\frac{2}{\text{fan\_in} + \text{fan\_out}}\) | \(\left[-\sqrt{\frac{6}{\text{fan\_in} + \text{fan\_out}}}, \sqrt{\frac{6}{\text{fan\_in} + \text{fan\_out}}}\right]\)
He | ReLU, Leaky ReLU, ELU | \(\frac{2}{\text{fan\_in}}\) | \(\left[-\sqrt{\frac{6}{\text{fan\_in}}}, \sqrt{\frac{6}{\text{fan\_in}}}\right]\)
LeCun | SELU | \(\frac{1}{\text{fan\_in}}\) | Not typically used with a uniform distribution
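
As a quick sanity check on the table above, the following sketch (assuming TensorFlow 2.x; the fan sizes are arbitrary) samples each classic initializer and compares the empirical standard deviation of the drawn weights with the theoretical value. Keras implements the normal variants as truncated normals, so the match is approximate.

import numpy as np
import tensorflow as tf

fan_in, fan_out = 256, 128

# Classic initializers under test
initializers = {
    "GlorotNormal": tf.keras.initializers.GlorotNormal(seed=0),
    "HeNormal": tf.keras.initializers.HeNormal(seed=0),
    "LecunNormal": tf.keras.initializers.LecunNormal(seed=0),
}

# Theoretical standard deviations from the table above
expected_std = {
    "GlorotNormal": np.sqrt(2.0 / (fan_in + fan_out)),
    "HeNormal": np.sqrt(2.0 / fan_in),
    "LecunNormal": np.sqrt(1.0 / fan_in),
}

for name, init in initializers.items():
    w = init(shape=(fan_in, fan_out)).numpy()
    print(f"{name}: empirical std = {w.std():.4f}, expected std = {expected_std[name]:.4f}")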

Part 2: Emerging Weight Initialization Techniques in 2024

1. Orthogonal Initialization

Orthogonal Initialization sets the weight matrix to an orthogonal matrix (orthonormal rows/columns), which preserves the norm of signals passing through the layer and prevents the magnitude of activations from collapsing or exploding.

Mathematical Explanation:

Ensures that \( W W^T = I \), preserving the variance of inputs across layers.

Code Example in Keras:


from tensorflow.keras.initializers import Orthogonal

# Using Orthogonal Initialization
model.add(Dense(64, kernel_initializer=Orthogonal(), activation='relu'))
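
To verify the \( W W^T = I \) property directly, a minimal check (assuming TensorFlow 2.x) is to sample a square matrix from the initializer and measure how far \( W^T W \) is from the identity:

import tensorflow as tf

# Sample a 64x64 weight matrix from the Orthogonal initializer
init = tf.keras.initializers.Orthogonal(seed=0)
w = init(shape=(64, 64))

# For a square orthogonal matrix, W^T W should be (numerically) the identity
product = tf.matmul(w, w, transpose_a=True)
max_error = tf.reduce_max(tf.abs(product - tf.eye(64)))
print(f"max deviation from identity: {max_error.numpy():.2e}")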


2. Scaled Initialization

Scaled Initialization generalizes the classic schemes through a configurable variance-scaling rule (scale factor, fan mode, and distribution), making it possible to tune the initial variance so that it remains stable across very deep networks.

Code Example in Keras:


from tensorflow.keras.initializers import VarianceScaling

# Example of Scaled Initialization
model.add(Dense(64, kernel_initializer=VarianceScaling(scale=1.0, mode='fan_avg', distribution='uniform'), activation='relu'))


3. Dynamic Weight Initialization

Dynamic Weight Initialization involves adjusting the initialization strategy during training, based on the changing distribution of gradients and activations.

Code Example in PyTorch:


import torch
import torch.nn as nn

class DynamicWeightInitLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super(DynamicWeightInitLayer, self).__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.reset_parameters()

    def reset_parameters(self):
        # Start from standard He (Kaiming) uniform initialization for ReLU layers
        nn.init.kaiming_uniform_(self.linear.weight, a=0, mode='fan_in', nonlinearity='relu')
        # Modules are in training mode by default, so this rescaling runs at
        # construction time; a real dynamic scheme would derive the factor from
        # observed gradient or activation statistics instead of a fixed constant.
        if self.training:
            with torch.no_grad():
                self.linear.weight *= 1.5  # illustrative dynamic adjustment

    def forward(self, x):
        return self.linear(x)

# Example usage
layer = DynamicWeightInitLayer(128, 64)


Conclusion

Incorporating the right weight initialization technique is crucial for the stability and performance of deep learning models. While classic methods like Glorot, He, and LeCun initialization are still highly effective and widely used, newer techniques such as Orthogonal Initialization, Scaled Initialization, and Dynamic Weight Initialization are essential for handling the complexities of modern deep learning architectures.

By understanding the mathematical foundations and having the right code snippets ready, you can apply these techniques to ensure that your neural networks start training on the right foot, leading to better and faster convergence.


Now let’s explain the math behind it in a little more detail:

Understanding the Mathematics Behind Weight Initialization

What is Weight Initialization and Why is it Important?

Weight initialization is the process of setting the initial values for the weights of a neural network before training begins. The choice of these initial values is crucial because it significantly affects how the network learns, how quickly it converges during training, and how effective the model is in generalizing from the training data to unseen data.

Purpose of Weight Initialization:

  1. Avoiding Vanishing/Exploding Gradients: During backpropagation, gradients are calculated to update the weights. If weights are initialized improperly, gradients can either shrink (vanish) or grow exponentially (explode) as they propagate through layers. This can lead to very slow training (vanishing gradients) or cause the network to become unstable and fail to converge (exploding gradients). Proper initialization mitigates these issues, allowing gradients to remain within a reasonable range, which is crucial for effective learning, especially in deep networks.
  2. Ensuring Efficient Learning: Proper initialization allows the network to start with a reasonable range of activations and gradients. This ensures that the learning process is efficient, enabling the network to converge faster and reducing the likelihood of getting stuck in suboptimal regions of the loss landscape.
  3. Preventing Symmetry: If all weights are initialized to the same value (e.g., all zeros), each neuron in a layer will compute the same output and, therefore, have the same gradient during backpropagation. This symmetry prevents the network from learning different features and reduces it to essentially a single neuron per layer. Random initialization helps break this symmetry, ensuring that different neurons can learn different features.
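
To make point 3 concrete, here is a minimal PyTorch sketch (the layer sizes are arbitrary) showing that when every weight starts at the same constant, all hidden units receive identical gradients and therefore remain clones of each other after every update:

import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny two-layer network with every weight set to the same constant
hidden = nn.Linear(4, 3)
out = nn.Linear(3, 1)
for layer in (hidden, out):
    nn.init.constant_(layer.weight, 0.5)
    nn.init.zeros_(layer.bias)

x = torch.randn(8, 4)  # a small random batch
loss = out(torch.tanh(hidden(x))).mean()
loss.backward()

# Every row of the hidden-layer gradient is identical,
# so the three hidden units can never learn different features.
print(hidden.weight.grad)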

The Mathematics Behind Weight Initialization

Weight initialization strategies are rooted in the need to manage the variance of activations and gradients across layers in a neural network. Here’s a deeper look at the mathematics that underpins these strategies:

1. Variance and Activation Flow:

Consider a neural network layer where the input to layer \(l\) is \(a^{(l)}\), the weights are \(W^{(l)}\), and the biases are \(b^{(l)}\). The output of this layer is given by:

\( z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)} \)

and the activation \(a^{(l+1)}\) is:

\( a^{(l+1)} = \phi(z^{(l+1)}) \)

where \(\phi\) is the activation function (like ReLU, sigmoid, or tanh).

Key Mathematical Goal: The variance of the outputs \(a^{(l+1)}\) should be roughly the same as the variance of the inputs \(a^{(l)}\) to ensure stable gradients and activations throughout the network. This goal is achieved by appropriately scaling the initial weights.
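
The following NumPy sketch (layer widths and depth are arbitrary) illustrates this goal: it pushes a random batch through a stack of tanh layers and prints the standard deviation of the activations at each depth, once with a naive small weight scale and once with the Glorot scale. The naive scale makes the signal collapse toward zero within a few layers, while the Glorot scale keeps it at roughly the same order of magnitude.

import numpy as np

rng = np.random.default_rng(0)
fan_in = fan_out = 256
num_layers = 10
x0 = rng.standard_normal((1000, fan_in))

def forward_std(weight_std):
    """Propagate a batch through a tanh stack and record the activation std per layer."""
    a = x0
    stds = []
    for _ in range(num_layers):
        W = rng.standard_normal((fan_in, fan_out)) * weight_std
        a = np.tanh(a @ W)
        stds.append(a.std())
    return stds

naive = forward_std(0.01)                                # too small a scale
glorot = forward_std(np.sqrt(2.0 / (fan_in + fan_out)))  # Glorot scale

print("naive  std per layer:", [f"{s:.3f}" for s in naive])
print("glorot std per layer:", [f"{s:.3f}" for s in glorot])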

2. Different Initialization Techniques:

  • Glorot (Xavier) Initialization: Glorot Initialization aims to keep the variance of the outputs \(a^{(l+1)}\) equal to the variance of the inputs \(a^{(l)}\). This method is particularly effective for symmetric activation functions like tanh and sigmoid.

Mathematics:

The variance of the output \(z^{(l+1)}\) depends on both the number of inputs (fan-in) and outputs (fan-out) of the layer:

\( \text{Var}(W^{(l)}) = \frac{2}{\text{fan\_in} + \text{fan\_out}} \)

This formula ensures that the variance of activations remains stable across layers, helping to prevent vanishing/exploding gradients.

  • He Initialization: He Initialization is designed specifically for ReLU and its variants, which are widely used due to their effectiveness in avoiding saturation issues found in sigmoid/tanh. ReLU outputs are zero for half of the input values, so the initialization must account for this reduction in variance.

Mathematics:

The variance of the output \(z^{(l+1)}\) with ReLU activation is adjusted to:

\( \text{Var}(W^{(l)}) = \frac{2}{\text{fan\_in}} \)

This method is more aggressive than Glorot initialization, providing larger initial weights to compensate for the ReLU’s tendency to “kill” half of the activations.

  • LeCun Initialization: LeCun Initialization is tailored for SELU (Scaled Exponential Linear Units) activation functions, which have self-normalizing properties, meaning they automatically normalize the output variance during training.

Mathematics:

For SELU activations, it’s important that the variance of the weights is set so that the output variance remains close to 1 across layers:

\( \text{Var}(W^{(l)}) = \frac{1}{\text{fan\_in}} \)

This initialization helps maintain the network’s self-normalizing property, making it highly effective for deep networks with SELU activations.
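
To see where these three variance targets come from, here is a short derivation sketch, assuming zero-mean, independent weights and inputs. For a single pre-activation,

\( \text{Var}(z_i^{(l+1)}) = \sum_{j=1}^{\text{fan\_in}} \text{Var}(W_{ij}^{(l)} a_j^{(l)}) = \text{fan\_in} \cdot \text{Var}(W^{(l)}) \cdot \text{Var}(a^{(l)}) \)

Keeping the forward variance constant therefore requires \( \text{Var}(W^{(l)}) = \frac{1}{\text{fan\_in}} \), which is exactly the LeCun rule. Running the same argument on the backward pass, where gradients flow through \( (W^{(l)})^T \), gives \( \text{Var}(W^{(l)}) = \frac{1}{\text{fan\_out}} \); Glorot initialization compromises between the two constraints with \( \text{Var}(W^{(l)}) = \frac{2}{\text{fan\_in} + \text{fan\_out}} \). For ReLU, roughly half of the pre-activations are zeroed out, so the second moment of the activations is \( \mathbb{E}[(a^{(l)})^2] = \frac{1}{2}\text{Var}(z^{(l)}) \), and the extra factor of 2 in the He rule, \( \text{Var}(W^{(l)}) = \frac{2}{\text{fan\_in}} \), compensates for that halving.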

Application to Algorithms

Weight initialization strategies are applicable across a wide range of deep learning models, including feedforward neural networks, convolutional neural networks (CNNs), and recurrent neural networks (RNNs). The choice of initialization depends on several factors:

1. Activation Function:

  • ReLU or Leaky ReLU: Use He Initialization. This is optimal because ReLU units tend to “kill” half of their input values, requiring a more aggressive initialization to maintain gradient flow.
  • Tanh or Sigmoid: Use Glorot (Xavier) Initialization. These symmetric functions benefit from a balanced variance approach that Glorot initialization provides.
  • SELU: Use LeCun Initialization. SELU activations are designed for self-normalizing neural networks, and LeCun Initialization preserves this property.

2. Network Depth:

  • Shallow Networks: For networks with fewer layers, most initialization methods work reasonably well.
  • Deep Networks: As networks become deeper, the risk of vanishing/exploding gradients increases. He Initialization or Orthogonal Initialization (especially in RNNs) are more suitable to maintain stable gradient flow.

3. Type of Task:

  • Image Classification with CNNs: He Initialization is often preferred due to the prevalent use of ReLU in convolutional layers.
  • Sequence Modeling with RNNs: Orthogonal Initialization is commonly used to maintain the stability of the activations over many time steps.
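
For recurrent layers, a common pattern is to apply orthogonal initialization only to the hidden-to-hidden matrices. The sketch below uses PyTorch (parameter names follow PyTorch's LSTM naming convention; the layer sizes are arbitrary):

import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)

# Orthogonal init for recurrent (hidden-to-hidden) weights,
# Glorot/Xavier for input-to-hidden weights, zeros for biases.
for name, param in lstm.named_parameters():
    if 'weight_hh' in name:
        nn.init.orthogonal_(param)
    elif 'weight_ih' in name:
        nn.init.xavier_uniform_(param)
    elif 'bias' in name:
        nn.init.zeros_(param)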

How to Choose the Right Initialization:

  • Use Glorot Initialization for networks with tanh, sigmoid, or softmax activation functions.
  • Use He Initialization for networks with ReLU or its variants, especially in deep networks where the risk of vanishing gradients is higher.
  • Use LeCun Initialization for networks with SELU activation, particularly in self-normalizing neural networks (SNNs).
  • Consider Orthogonal Initialization for very deep networks or recurrent networks where preserving the magnitude of the activations is crucial.
  • Dynamic or Scaled Initialization might be necessary for extremely deep networks or when standard techniques fail to maintain gradient stability.
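
Putting these rules of thumb together, here is a minimal helper sketch (the function name and mapping are illustrative, not a library API) that picks a Keras initializer from the activation function:

from tensorflow.keras import initializers

def pick_initializer(activation: str):
    """Illustrative mapping from activation function to a Keras initializer."""
    if activation in ('relu', 'leaky_relu', 'elu'):
        return initializers.HeNormal()
    if activation in ('tanh', 'sigmoid', 'softmax', 'linear'):
        return initializers.GlorotUniform()
    if activation == 'selu':
        return initializers.LecunNormal()
    # Fall back to a neutral variance-scaling scheme for anything else
    return initializers.VarianceScaling(scale=1.0, mode='fan_avg', distribution='uniform')

# Example usage:
# model.add(Dense(64, activation='relu', kernel_initializer=pick_initializer('relu')))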

Summary

Weight initialization is a crucial step in designing and training neural networks. The choice of initialization strategy depends on the network architecture, the activation functions used, and the specific challenges posed by the model’s depth and complexity. Proper weight initialization helps start the training process on the right foot, leading to better and faster convergence and ensuring that the neural network can learn effectively.

In summary:

  • Glorot Initialization is ideal for symmetric activation functions like tanh and sigmoid.
  • He Initialization is suited for ReLU and its variants, particularly in deep networks.
  • LeCun Initialization is optimal for SELU activations in self-normalizing networks.
  • Orthogonal Initialization is effective in deep and recurrent networks to preserve variance.
  • Dynamic or Scaled Initialization might be necessary for very deep or complex networks.

Understanding these mathematical foundations and their application ensures that your neural networks can be optimized for performance and stability from the very start of training.