Weight Initialization in Deep Learning, Well Explained | Day 21

Weight Initialization in Deep Learning: Classic and Emerging Techniques

Understanding the correct initialization of weights in deep learning models is crucial for effective training and convergence. This post explores both classic and emerging weight initialization strategies, providing mathematical insights and practical code examples.

Part 1: Classic Weight Initialization Techniques

1. Glorot (Xavier) Initialization

Glorot initialization is designed to maintain the variance of activations across layers and is particularly effective for activation functions like tanh and sigmoid.

Mathematical formula (with \(n_{\text{in}}\) the fan-in and \(n_{\text{out}}\) the fan-out of the layer):

Uniform distribution: \(W \sim U\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\ \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right]\)

Normal distribution: \(W \sim \mathcal{N}\left(0,\ \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)\)

Code example in Keras:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import GlorotUniform, GlorotNormal

model = Sequential()

# Using Glorot Uniform
model.add(Dense(64, kernel_initializer=GlorotUniform(), activation='tanh'))

# Using Glorot Normal
model.add(Dense(64, kernel_initializer=GlorotNormal(), activation='tanh'))
```

2. He Initialization

He initialization is optimized for ReLU and its variants, ensuring that the gradients remain within a good range across layers.

Mathematical formula:

Uniform distribution: \(W \sim U\left[-\sqrt{\frac{6}{n_{\text{in}}}},\ \sqrt{\frac{6}{n_{\text{in}}}}\right]\)

Normal distribution: \(W \sim \mathcal{N}\left(0,\ \frac{2}{n_{\text{in}}}\right)\)

Code example in Keras:

```python
from tensorflow.keras.initializers import HeUniform, HeNormal

# Using He Uniform
model.add(Dense(64, kernel_initializer=HeUniform(), activation='relu'))

# Using He Normal
model.add(Dense(64, kernel_initializer=HeNormal(), activation='relu'))
```

3. LeCun Initialization

LeCun initialization is used with the SELU activation function, maintaining the self-normalizing property of the network.

Mathematical formula:

Normal distribution: \(W \sim \mathcal{N}\left(0,\ \frac{1}{n_{\text{in}}}\right)\)

Code example in Keras:

```python
from tensorflow.keras.initializers import LecunNormal

# Using LeCun Normal
model.add(Dense(64, kernel_initializer=LecunNormal(), activation='selu'))
```

Summary Table: Classic Initialization Methods

| Initialization  | Activation Functions           | Variance (Normal Distribution)             | Range (Uniform Distribution)                       |
|-----------------|--------------------------------|--------------------------------------------|----------------------------------------------------|
| Glorot (Xavier) | tanh, sigmoid, softmax, linear | \(2 / (n_{\text{in}} + n_{\text{out}})\)   | \(\pm\sqrt{6 / (n_{\text{in}} + n_{\text{out}})}\) |
| He              | ReLU, Leaky ReLU, ELU          | \(2 / n_{\text{in}}\)                      | \(\pm\sqrt{6 / n_{\text{in}}}\)                    |
| LeCun           | SELU                           | \(1 / n_{\text{in}}\)                      | Not typically used with uniform distribution       |

Part 2: Emerging Weight Initialization Techniques in 2024

1. Orthogonal Initialization

Orthogonal initialization sets the weight matrix to an orthogonal matrix, which helps maintain variance and prevents the collapse of the magnitude of activations.

Mathematical explanation: an orthogonal matrix satisfies \(W^{\top} W = I\), so multiplying by \(W\) preserves the norm of its input, which keeps signals from shrinking or exploding as they pass through the layer.

Code example in Keras:

```python
from tensorflow.keras.initializers import Orthogonal

# Using Orthogonal Initialization
model.add(Dense(64, kernel_initializer=Orthogonal(), activation='relu'))
```

2. Scaled Initialization

Scaled initialization adjusts the scaling factor based on the depth of the network, ensuring that the variance remains stable across very deep networks.

Code example in Keras:

```python
from tensorflow.keras.initializers import VarianceScaling

# Example of Scaled Initialization
model.add(Dense(64,
                kernel_initializer=VarianceScaling(scale=1.0, mode='fan_avg', distribution='uniform'),
                activation='relu'))
```
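As a side note, the built-in Keras initializers from Part 1 correspond to particular VarianceScaling configurations. The sketch below is my own illustration rather than part of the original post; the `*_like` variable names are just for clarity, and the correspondences follow the way these initializers are described in the Keras documentation.

```python
from tensorflow.keras.initializers import VarianceScaling

# Glorot/Xavier: scale=1.0 with the averaged fan (fan_avg)
glorot_uniform_like = VarianceScaling(scale=1.0, mode='fan_avg', distribution='uniform')
glorot_normal_like = VarianceScaling(scale=1.0, mode='fan_avg', distribution='truncated_normal')

# He: scale=2.0 with fan_in
he_uniform_like = VarianceScaling(scale=2.0, mode='fan_in', distribution='uniform')
he_normal_like = VarianceScaling(scale=2.0, mode='fan_in', distribution='truncated_normal')

# LeCun: scale=1.0 with fan_in
lecun_normal_like = VarianceScaling(scale=1.0, mode='fan_in', distribution='truncated_normal')

# Any of these can be passed as kernel_initializer just like the named classes, e.g.:
# model.add(Dense(64, kernel_initializer=he_normal_like, activation='relu'))
```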
3. Dynamic Weight Initialization

Dynamic weight initialization adjusts the initialization strategy during training, based on the changing distribution of gradients and activations.

Code example in PyTorch:

```python
import torch
import torch.nn as nn

class DynamicWeightInitLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super(DynamicWeightInitLayer, self).__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.reset_parameters()

    def reset_parameters(self):
        # Start from He (Kaiming) initialization, which suits ReLU activations
        nn.init.kaiming_uniform_(self.linear.weight, a=0, mode='fan_in', nonlinearity='relu')
        if self.training:
            with torch.no_grad():
                self.linear.weight *= 1.5  # Example of a dynamic adjustment to the initial scale

    def forward(self, x):
        return self.linear(x)

# Example usage
layer = DynamicWeightInitLayer(128, 64)
```

Key Notes

Choosing the right weight initialization technique is crucial for the stability and performance of deep learning models. While classic methods like Glorot, He, and LeCun initialization are still highly effective and widely used, newer techniques such as orthogonal, scaled, and dynamic weight initialization help handle the complexities of modern deep learning architectures. By understanding the mathematical foundations and having the right code snippets ready, you can apply these techniques to ensure that your neural networks start training on the right foot, leading to better and faster convergence.

Now let's explain the math behind it a little more.

Understanding the Mathematics Behind Weight Initialization

What is Weight Initialization and Why is it Important?

Weight initialization is the process of setting the initial values for the weights of a neural network before training begins. The choice of these initial values is crucial because it significantly affects how the network learns, how quickly it converges during training, and how well the model generalizes from the training data to unseen data.

Purpose of weight initialization:

- Avoiding vanishing/exploding gradients: During backpropagation, gradients are calculated to update the weights. If the weights are initialized improperly, gradients can either shrink (vanish) or grow exponentially (explode) as they propagate through the layers. This can lead to very slow training (vanishing gradients) or cause the network to become unstable and fail to converge (exploding gradients). Proper initialization keeps gradients within a reasonable range, which is crucial for effective learning, especially in deep networks.
- Ensuring efficient learning: Proper initialization lets the network start with a reasonable range of activations and gradients. This makes the learning process efficient, enabling the network to converge faster and reducing the likelihood of getting stuck in suboptimal regions of the loss landscape.
- Preventing symmetry: If all weights are initialized to the same value (e.g., all zeros), each neuron in a layer computes the same output and therefore receives the same gradient during backpropagation. This symmetry prevents the network from learning different features and reduces the layer to essentially a single neuron. Random initialization breaks this symmetry, ensuring that different neurons can learn different features (see the sketch after this list).
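To make the symmetry point concrete, here is a minimal PyTorch sketch of my own (the 4-to-3 layer, the constant value 0.5, and the squared-output loss are arbitrary choices). With constant initialization every row of the weight gradient comes out identical, so all neurons would receive the same update forever; a random initialization produces distinct gradient rows.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4)  # a small random input batch

def weight_grad(init_fn):
    """Create a 4->3 linear layer, apply init_fn to its weights, and return the weight gradient."""
    layer = nn.Linear(4, 3, bias=False)
    init_fn(layer.weight)
    loss = (layer(x) ** 2).mean()  # any loss that actually depends on the outputs
    loss.backward()
    return layer.weight.grad

# Constant initialization: every neuron computes the same output, so every
# gradient row is identical and the three neurons can never become different.
print(weight_grad(lambda w: nn.init.constant_(w, 0.5)))

# Random (He/Kaiming) initialization breaks the symmetry: the gradient rows
# differ, so each neuron is free to learn a different feature.
print(weight_grad(lambda w: nn.init.kaiming_uniform_(w, nonlinearity='relu')))
```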
The Mathematics Behind Weight Initialization

Weight initialization strategies are rooted in the need to manage the variance of activations and gradients across the layers of a neural network. Here is a deeper look at the mathematics that underpins these strategies.

1. Variance and Activation Flow

Consider a neural network layer where the input to layer \(l\) is \(a^{(l)}\), the weights are \(W^{(l)}\), and the biases are \(b^{(l)}\). The output of this layer is given by

\[ z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}, \]

and the activation \(a^{(l+1)}\) is

\[ a^{(l+1)} = \phi\big(z^{(l+1)}\big), \]

where \(\phi\) is the activation function (such as ReLU, sigmoid, or tanh).

Key mathematical goal: the variance of the outputs \(a^{(l+1)}\) should be roughly the same as the variance of the inputs \(a^{(l)}\) to ensure stable gradients and activations throughout the network. This goal is achieved by appropriately scaling the initial weights.

2. Different Initialization Techniques

Glorot (Xavier) Initialization: Glorot initialization aims to keep the variance of the outputs \(a^{(l+1)}\) equal to the variance of the inputs \(a^{(l)}\). This method is particularly effective for symmetric activation functions like tanh and sigmoid.

Mathematics: the variance of the output \(z^{(l+1)}\) depends on both the number of inputs (fan-in, \(n_{\text{in}}\)) and outputs (fan-out, \(n_{\text{out}}\)) of the layer, which leads to the Glorot choice

\[ \operatorname{Var}\big(W^{(l)}\big) = \frac{2}{n_{\text{in}} + n_{\text{out}}}. \]

This formula ensures…
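The variance-matching goal described above can also be checked numerically. The following NumPy sketch is my own illustration, assuming a 10-layer tanh network with 256 units per layer: it propagates a random batch forward and prints the standard deviation of the activations at each layer, once with Glorot-scaled weights and once with a deliberately too-small scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, fan_in, fan_out = 10, 256, 256
x = rng.standard_normal((1000, fan_in))  # batch of roughly unit-variance inputs

def activation_stds(weight_std):
    """Propagate x through n_layers tanh layers whose weights have the given std."""
    a = x
    stds = []
    for _ in range(n_layers):
        W = rng.standard_normal((fan_in, fan_out)) * weight_std
        a = np.tanh(a @ W)
        stds.append(float(a.std()))
    return stds

glorot_std = np.sqrt(2.0 / (fan_in + fan_out))  # Glorot/Xavier normal scaling
tiny_std = 0.01                                 # deliberately too small

print("Glorot   :", [f"{s:.4f}" for s in activation_stds(glorot_std)])
print("Too small:", [f"{s:.4f}" for s in activation_stds(tiny_std)])
# Glorot keeps the activation spread roughly stable from layer to layer, while
# the too-small initialization makes the activations vanish toward zero.
```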

