Understanding RNNs & Transformers in Detail: Predicting the Next Letter in a Sequence

We have been focusing on NLP in today's article and in our other two articles on Natural Language Processing (NLP): RNN – Day 63 & The Revolution of Transformer Models – Day 65. In this article, we'll delve deeply into how Recurrent Neural Networks (RNNs) and Transformers work, specifically in the context of predicting the next letter "D" in the sequence "A B C". We'll walk through every step, including the numerical calculations for a simple example, to make the concepts clear. We'll also explain why Transformers are considered neural networks and how they fit into the broader context of deep learning.

**Recurrent Neural Networks (RNNs)**

**Introduction to RNNs**

RNNs are a type of neural network designed to process sequential data by maintaining a hidden state that captures information about previous inputs. This makes them suitable for tasks like language modeling, where the context provided by earlier letters influences the prediction of the next letter.

**Problem Statement**

Given the sequence "A B C", we want the RNN to predict the next letter, which is "D".

**Input Representation**

We need to represent each letter numerically. We'll use one-hot encoding for simplicity. Let's define our vocabulary as the letters A, B, C, D:

A: [1, 0, 0, 0]
B: [0, 1, 0, 0]
C: [0, 0, 1, 0]
D: [0, 0, 0, 1]

**Network Architecture and Parameters**

Let's define the architecture and set the parameters for simplicity. The input size is 4 (since we have four letters), the hidden size is a small number of units, and the output size is 4 (one score per letter in the vocabulary).

**Weights and Biases**: Input-to-Hidden weights $W_{xh}$, Hidden-to-Hidden weights $W_{hh}$, Hidden-to-Output weights $W_{hy}$, and the biases: hidden bias $b_h$ and output bias $b_y$. These are the learnable parameters of the network.

**Forward Pass**

We'll process the input sequence one time step at a time. At every step, the hidden state is updated as

$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$

Time Step 1: Processing 'A'
The input for 'A' is the one-hot vector $x_1 = [1, 0, 0, 0]$. Starting from the initial hidden state $h_0 = 0$, the hidden state update at time step 1 is $h_1 = \tanh(W_{xh} x_1 + W_{hh} h_0 + b_h)$.

Time Step 2: Processing 'B'
The input for 'B' is $x_2 = [0, 1, 0, 0]$. The hidden state update at time step 2 is $h_2 = \tanh(W_{xh} x_2 + W_{hh} h_1 + b_h)$.

Time Step 3: Processing 'C'
The input for 'C' is $x_3 = [0, 0, 1, 0]$. The hidden state update at time step 3 is $h_3 = \tanh(W_{xh} x_3 + W_{hh} h_2 + b_h)$.

**Output Prediction**

We now predict the output based on the final hidden state $h_3$. The output scores are computed as $y = W_{hy} h_3 + b_y$, and the predicted probabilities are obtained with a softmax: $\hat{p} = \mathrm{softmax}(y)$.

**Loss Calculation**

We compute the cross-entropy loss by comparing the predicted probabilities with the true label (which is "D"): $L = -\log \hat{p}_D$.

**Backpropagation Through Time (BPTT)**

During backpropagation, we update the weights by computing gradients of the loss with respect to the parameters. This process is performed across time steps to adjust the hidden-to-hidden weights, the input-to-hidden weights, and the hidden-to-output weights.

Step 1: Compute Gradients at the Output Layer
The gradient of the loss with respect to the output scores is $\frac{\partial L}{\partial y} = \hat{p} - t$, where $t$ is the one-hot vector for "D".

Step 2: Compute Gradients w.r.t. the Output Weights
The gradient with respect to the hidden-to-output weights is $\frac{\partial L}{\partial W_{hy}} = \frac{\partial L}{\partial y} \, h_3^\top$, and the output bias gradient is $\frac{\partial L}{\partial b_y} = \frac{\partial L}{\partial y}$.

Step 3: Compute the Gradient w.r.t. the Hidden State
The gradient with respect to the final hidden state is $\frac{\partial L}{\partial h_3} = W_{hy}^\top \frac{\partial L}{\partial y}$.

Step 4: Backpropagate to Previous Time Steps
The error is propagated back to previous time steps by computing the gradients with respect to the earlier hidden states (passing through the derivative of $\tanh$ at each step). This process continues back in time, accumulating the gradients for the hidden-to-hidden and input-to-hidden matrices.

Step 5: Update the Weights
After computing the gradients, we update the weights using gradient descent: $W \leftarrow W - \eta \, \frac{\partial L}{\partial W}$, where $\eta$ is the learning rate.
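To make the forward pass concrete, here is a minimal NumPy sketch of the computation described above. The weight values are random placeholders rather than the specific example numbers from the walkthrough, so the exact outputs will differ; what the sketch illustrates is the structure of the calculation: one-hot inputs, tanh hidden-state updates, a softmax output, and the cross-entropy loss.

```python
# Minimal sketch of the RNN forward pass for "A B C" -> predict "D".
# Weight values are random placeholders, not the article's example numbers.
import numpy as np

np.random.seed(0)

vocab = ["A", "B", "C", "D"]
vocab_size = 4      # input and output size (four letters)
hidden_size = 3     # small hidden state, chosen arbitrarily for this sketch

def one_hot(letter):
    """Return the one-hot column vector for a letter."""
    v = np.zeros((vocab_size, 1))
    v[vocab.index(letter)] = 1.0
    return v

# Learnable parameters (placeholders standing in for the example values)
W_xh = np.random.randn(hidden_size, vocab_size) * 0.1   # input-to-hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden
W_hy = np.random.randn(vocab_size, hidden_size) * 0.1   # hidden-to-output
b_h = np.zeros((hidden_size, 1))
b_y = np.zeros((vocab_size, 1))

# Forward pass over the sequence "A B C"
h = np.zeros((hidden_size, 1))           # h_0
hs, xs = [h], []
for letter in ["A", "B", "C"]:
    x = one_hot(letter)
    h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    xs.append(x)
    hs.append(h)

# Output prediction from the final hidden state h_3
y = W_hy @ h + b_y                                       # raw scores (logits)
p = np.exp(y - y.max()) / np.sum(np.exp(y - y.max()))    # softmax probabilities

target = one_hot("D")
loss = -np.log(p[vocab.index("D"), 0])    # cross-entropy with true label "D"
print("probabilities:", p.ravel(), "loss:", loss)
```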
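And here is a continuation of the same sketch covering the BPTT steps: the output-layer gradient, the gradients for the hidden-to-output weights, the propagation of the error back through the three time steps, and the gradient-descent update. It reuses the variables defined in the forward-pass sketch above.

```python
# One pass of backpropagation through time (BPTT), continuing the sketch above.
learning_rate = 0.1

# Step 1: gradient of the loss w.r.t. the output scores (softmax + cross-entropy)
dy = p - target                            # dL/dy = p_hat - one_hot(D)

# Step 2: gradients for the hidden-to-output weights and output bias
dW_hy = dy @ hs[-1].T
db_y = dy

# Step 3: gradient w.r.t. the final hidden state
dh = W_hy.T @ dy

# Step 4: propagate the error back through the time steps
dW_xh = np.zeros_like(W_xh)
dW_hh = np.zeros_like(W_hh)
db_h = np.zeros_like(b_h)
for t in reversed(range(3)):               # t = 2, 1, 0
    dtanh = (1.0 - hs[t + 1] ** 2) * dh    # derivative through tanh
    dW_xh += dtanh @ xs[t].T
    dW_hh += dtanh @ hs[t].T
    db_h += dtanh
    dh = W_hh.T @ dtanh                    # pass the gradient on to h_{t-1}

# Step 5: gradient-descent update  W <- W - eta * dL/dW
for param, grad in [(W_xh, dW_xh), (W_hh, dW_hh), (W_hy, dW_hy),
                    (b_h, db_h), (b_y, db_y)]:
    param -= learning_rate * grad
```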
This process then repeats over further training iterations. (Please note that this was only a very brief overview of how an NLP task looks with an RNN, using a deliberately simple approach, so that you have the math in mind.)

**How the Same Example Looks in Transformers**

1. Introduction to Transformers

Transformers model relationships between sequence elements using self-attention and feed-forward layers, making them highly effective for tasks like sequence prediction. Unlike RNNs, Transformers do not rely on sequential processing, which allows parallel computation and better handling of long-range dependencies.

2. Problem Statement

We aim to predict the next letter, "D", given the input sequence "A B C" using a Transformer model.

3. Input Representation

Instead of one-hot encoding, we represent the input letters as embeddings. An embedding matrix maps each letter ("A", "B", "C") to a 2-dimensional vector.

4. Self-Attention Mechanism

Step 1: Compute Queries, Keys, and Values
Transformers use Queries (Q), Keys (K), and Values (V) to compute self-attention. The weight matrices $W_Q$, $W_K$, and $W_V$ linearly transform the embeddings. For each input embedding $x_i$, we compute $q_i = x_i W_Q$, $k_i = x_i W_K$, and $v_i = x_i W_V$.

Step 2: Compute Attention Scores
Attention scores between $q_i$ and every $k_j$ are computed using the scaled dot product: $\text{score}(i, j) = \frac{q_i \cdot k_j}{\sqrt{d_k}}$. In the worked example, we compute the scores for $q_3$ (the query for "C") against $k_1$, $k_2$, and $k_3$.

Step 3: Compute Attention Weights
Apply softmax to normalize the scores into attention weights: $\alpha_{ij} = \frac{\exp(\text{score}(i, j))}{\sum_j \exp(\text{score}(i, j))}$.

Step 4: Compute the Attention Output
The attention output is a weighted sum of the values: $z_i = \sum_j \alpha_{ij} v_j$.

5. Feed-Forward Neural Network

The attention output is passed through a feed-forward layer. Assuming identity weights and zero bias, the feed-forward output equals the attention output.

6. Output Layer and Prediction

The FFN output is mapped to the vocabulary size using output weights $W_{out}$ and bias $b_{out}$: $\text{logits} = z \, W_{out} + b_{out}$.

7. Apply Softmax for Probabilities

Convert the logits to probabilities using softmax. In the worked example, "D" receives the highest probability (0.569), so the prediction is "D".

8. Loss Calculation

The cross-entropy loss for the true label "D" is $L = -\log \hat{p}_D = -\log(0.569) \approx 0.56$.

9. Backpropagation

Compute gradients for all weights ($W_Q$, $W_K$, $W_V$, the feed-forward weights, and $W_{out}$) and update them using gradient descent. This process continues iteratively, with further training and optimization improving the accuracy of the model. This was, of course, only a high-level overview of the Transformer approach for the same example, but it should help you understand the math behind RNNs and Transformers better.

**Why Transformers Are Considered Neural Networks**

Transformers are considered neural networks because they consist of layers of neurons (units) that perform linear transformations (e.g., matrix multiplications) followed by non-linear activations (e.g., ReLU). They also have learnable parameters (weights and biases), and the model is trained using backpropagation to minimize a loss function. The attention mechanism, though different from traditional convolutional or recurrent layers, operates within the framework of a neural network.

**Key Notes**

In this explanation, we covered how RNNs and Transformers predict the next letter in a sequence. While RNNs process inputs sequentially and rely on maintaining a hidden state over time, Transformers leverage self-attention to process all inputs in parallel, making them more efficient and effective at capturing long-range dependencies in data. This makes Transformers particularly well suited for tasks like language modeling, where understanding the relationships between words (or letters) over long distances is essential.
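To tie the self-attention walkthrough above to code, here is a minimal single-head self-attention sketch in NumPy. The embedding values and the Q/K/V projection matrices are illustrative placeholders, not the exact numbers from the worked example; the sketch shows the four steps (Q/K/V projection, scaled dot-product scores, softmax weights, weighted sum of values).

```python
# Minimal single-head self-attention over the three embeddings for "A B C".
# Embeddings and projection matrices are placeholders, not the article's numbers.
import numpy as np

np.random.seed(0)

d_model = 2                                # 2-dimensional embeddings, as in the example
X = np.array([[1.0, 0.0],                  # embedding of "A" (placeholder values)
              [0.0, 1.0],                  # embedding of "B"
              [1.0, 1.0]])                 # embedding of "C"   -> shape (3, d_model)

# Projection matrices for Queries, Keys, and Values (placeholders)
W_Q = np.random.randn(d_model, d_model) * 0.5
W_K = np.random.randn(d_model, d_model) * 0.5
W_V = np.random.randn(d_model, d_model) * 0.5

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # Step 1: compute Q, K, V

scores = Q @ K.T / np.sqrt(d_model)        # Step 2: scaled dot-product scores (3x3)

# Step 3: softmax over each row to get attention weights
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

attn_out = weights @ V                     # Step 4: weighted sum of the values

print("attention weights for 'C':", weights[2])   # how much "C" attends to A, B, C
print("attention output for 'C':", attn_out[2])
```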
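Continuing the same sketch, the lines below cover steps 5 to 8 of the walkthrough: the identity feed-forward layer, the output projection to the four-letter vocabulary (with placeholder output weights), the softmax over the logits, and the cross-entropy loss against the true label "D".

```python
# Feed-forward layer, output projection, softmax, and loss (continuing the sketch above).
vocab = ["A", "B", "C", "D"]

W_ffn = np.eye(d_model)                    # identity weights, as assumed in the example
b_ffn = np.zeros(d_model)                  # zero bias
ffn_out = attn_out[2] @ W_ffn + b_ffn      # FFN output for the last position ("C")

# Output layer: map the 2-d FFN output to the 4-letter vocabulary
W_out = np.random.randn(d_model, len(vocab)) * 0.5   # placeholder output weights
b_out = np.zeros(len(vocab))
logits = ffn_out @ W_out + b_out

# Softmax to get a probability for each candidate next letter
probs = np.exp(logits - logits.max())
probs /= probs.sum()

pred = vocab[int(np.argmax(probs))]
loss = -np.log(probs[vocab.index("D")])    # cross-entropy with true label "D"
print("probabilities:", probs, "prediction:", pred, "loss:", loss)
```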
**Key Differences Between RNNs and Transformers**

- Sequential vs. Parallel Processing: RNNs process inputs one at a time, maintaining a hidden state for each step, while Transformers process the entire sequence in parallel using self-attention mechanisms.
- Handling Long-Range Dependencies: RNNs can struggle to capture long-term dependencies due to vanishing gradients, whereas Transformers handle long-range dependencies efficiently through self-attention.
- Training Efficiency: Because of their parallel nature, Transformers are faster to train on large datasets than RNNs, which must process the data sequentially.
- Architectural Complexity: While RNNs are simpler in their design, Transformers are more complex due to their multi-head self-attention layers and feed-forward networks. However, this complexity allows Transformers to capture more nuanced relationships in the data.

**Back to Our Initial Main Question: Why Transformers Are Considered Superior for NLP Compared to RNNs**

Let's revisit why Transformers are considered superior for NLP tasks compared to RNNs, and directly relate this to the detailed RNN and Transformer examples we explored above.

1. Parallel Processing and Speed

In the earlier RNN explanation, we saw that RNNs process sequences one step at a time. For each letter in the sequence "A B C", the model updates the hidden state sequentially, meaning that at each step the model can only "see" one input at a time. This limits the model's ability to fully utilize parallel computing, resulting in slower training times. In contrast, the Transformer processes the entire sequence at once, thanks to its self-attention mechanism. This allows it to process all letters ("A B C") simultaneously and compute relationships between every pair of tokens at the same time, drastically improving training speed. The self-attention mechanism also ensures that Transformers can…
