Machine Learning Overview

Why Transformers Are Better for NLP? Let’s See the Math Behind It – Day 64




Understanding RNNs and Transformers in Detail: Predicting the Next Letter in a Sequence

In this comprehensive explanation, we’ll delve deeply into how Recurrent Neural Networks (RNNs) and Transformers work, especially in the context of predicting the next letter “D” in the sequence “A B C”. We’ll walk through every step, including actual numerical calculations for a simple example, to make the concepts clear. We’ll also explain why Transformers are considered neural networks and how they fit into the broader context of deep learning.

Part 1: Recurrent Neural Networks (RNNs)

1. Introduction to RNNs

RNNs are a type of neural network designed to process sequential data by maintaining a hidden state that captures information about previous inputs. This makes them suitable for tasks like language modeling, where the context provided by earlier letters influences the prediction of the next letter.

2. Problem Statement

Given the sequence “A B C”, we want the RNN to predict the next letter, which is “D”.

3. Input Representation

We need to represent each letter numerically. We’ll use one-hot encoding for simplicity. Let’s define our vocabulary as letters A, B, C, D.

A: \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}

B: \begin{bmatrix} 0 & 1 & 0 & 0 \end{bmatrix}

C: \begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix}

D: \begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix}
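
The encoding above can be sketched in a few lines of Python. This is only an illustration: the names `VOCAB` and `one_hot` are hypothetical helpers, not part of any library.

```python
# Vocabulary order is taken from the text: A, B, C, D.
VOCAB = ["A", "B", "C", "D"]

def one_hot(letter, vocab=VOCAB):
    """Return a list with a 1 at the letter's index and 0 elsewhere."""
    return [1 if symbol == letter else 0 for symbol in vocab]

for letter in VOCAB:
    print(letter, one_hot(letter))
```

Each vector has exactly one non-zero entry, so multiplying a weight matrix by it simply selects one column of that matrix.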

4. Network Architecture and Parameters

We define the architecture and set the parameters for simplicity. The input size n_x = 4 (since we have four letters), the hidden size n_h = 2, and the output size n_y = 4.

Weights and Biases:

Input-to-Hidden Weights W_{xh}:

W_{xh} = \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 \\ 0.5 & 0.6 & 0.7 & 0.8 \end{bmatrix}

Hidden-to-Hidden Weights W_{hh}:

W_{hh} = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}

Hidden-to-Output Weights W_{hy}:

W_{hy} = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \\ 0.5 & 0.6 \\ 0.7 & 0.8 \end{bmatrix}

Biases:

Hidden bias b_h = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, Output bias b_y = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}

5. Forward Pass

We’ll process the input sequence one time step at a time, starting from an initial hidden state h_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.

Time Step 1: Processing ‘A’

The input for ‘A’ is represented as x_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}.

To compute the hidden state:

W_{xh} \cdot x_1 = \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 \\ 0.5 & 0.6 & 0.7 & 0.8 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.1 \\ 0.5 \end{bmatrix}

The hidden state update at time step 1 is:

h_1 = \tanh(W_{xh} \cdot x_1 + W_{hh} \cdot h_0 + b_h) = \tanh\left(\begin{bmatrix} 0.1 \\ 0.5 \end{bmatrix}\right) = \begin{bmatrix} 0.0997 \\ 0.4621 \end{bmatrix}

Time Step 2: Processing ‘B’

The input for ‘B’ is x_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}.

The hidden state update at time step 2 is:

h_2 = \tanh(W_{xh} \cdot x_2 + W_{hh} \cdot h_1 + b_h) = \tanh\left(\begin{bmatrix} 0.30239 \\ 0.81475 \end{bmatrix}\right) = \begin{bmatrix} 0.2935 \\ 0.6722 \end{bmatrix}

Time Step 3: Processing ‘C’

The input for ‘C’ is x_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}.

The hidden state update at time step 3 is:

h_3 = \tanh(W_{xh} \cdot x_3 + W_{hh} \cdot h_2 + b_h) = \tanh\left(\begin{bmatrix} 0.46379 \\ 1.05693 \end{bmatrix}\right) = \begin{bmatrix} 0.4332 \\ 0.7845 \end{bmatrix}

6. Output Prediction

We now predict the output based on the final hidden state h_3. The output scores (logits) are computed as:

o_3 = W_{hy} \cdot h_3 + b_y = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \\ 0.5 & 0.6 \\ 0.7 & 0.8 \end{bmatrix} \cdot \begin{bmatrix} 0.4332 \\ 0.7845 \end{bmatrix} = \begin{bmatrix} 0.20022 \\ 0.44376 \\ 0.68730 \\ 0.93084 \end{bmatrix}

Applying the softmax function turns these scores into probabilities over the vocabulary:

y_{\text{pred}} = \text{softmax}(o_3) \approx \begin{bmatrix} 0.1672 \\ 0.2133 \\ 0.2722 \\ 0.3472 \end{bmatrix}

7. Loss Calculation

We compute the cross-entropy loss by comparing the predicted probabilities with the true label (which is “D”). Because the true label is one-hot, the loss is the negative log of the probability assigned to “D”:

\text{Loss} = -\log(0.3472) \approx 1.058
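
The forward pass and loss can be reproduced with a short script. This is a minimal sketch in plain Python that uses the weights given in the text and assumes h_0 = 0. The helper names (`matvec`, `softmax`, `rnn_forward`) are illustrative, and the script computes at full floating-point precision, so its last digits may differ slightly from the rounded values shown in the equations.

```python
import math

# Parameters copied from the text (biases are zero, so they are omitted).
W_xh = [[0.1, 0.2, 0.3, 0.4],
        [0.5, 0.6, 0.7, 0.8]]
W_hh = [[0.1, 0.2],
        [0.3, 0.4]]
W_hy = [[0.1, 0.2],
        [0.3, 0.4],
        [0.5, 0.6],
        [0.7, 0.8]]

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_forward(inputs):
    """Run the tanh RNN over one-hot inputs and return h_0 .. h_T."""
    h = [0.0, 0.0]  # h_0, assumed to be zero
    states = [h]
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h))]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    return states

A, B, C = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]
states = rnn_forward([A, B, C])
logits = matvec(W_hy, states[-1])
probs = softmax(logits)
loss = -math.log(probs[3])  # index 3 corresponds to "D"

print("h_3   =", [round(v, 4) for v in states[-1]])
print("probs =", [round(p, 4) for p in probs])
print("loss  =", round(loss, 3))
```

Note that with these untrained weights the network already assigns its highest probability to “D”, but only because the rows of W_{hy} grow from top to bottom, not because it has learned anything.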

8. Backpropagation Through Time (BPTT)

During backpropagation, we compute the gradients of the loss with respect to every parameter and then use them to update the weights. Because the same weights are used at every time step, the gradients are propagated backward across the time steps to adjust the hidden-to-output, hidden-to-hidden, and input-to-hidden weights.

Step 1: Compute Gradients at Output Layer

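For a softmax output with cross-entropy loss, the gradient of the loss with respect to the output scores is the predicted distribution y_{\text{pred}} = \text{softmax}(o_3) minus the one-hot target y_{\text{true}} for “D”:

\delta o_3 = y_{\text{pred}} - y_{\text{true}} \approx \begin{bmatrix} 0.167 \\ 0.213 \\ 0.272 \\ -0.653 \end{bmatrix}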

Step 2: Compute Gradients w.r.t Weights

The gradient with respect to the hidden-to-output weights W_{hy} is calculated as:

\delta W_{hy} = \delta o_3 \cdot h_3^T

The bias gradients are:

\delta b_y = \delta o_3

Step 3: Compute Gradient w.r.t Hidden State

The gradient with respect to the hidden state h_3 is:

\delta h_3 = W_{hy}^T \cdot \delta o_3

Step 4: Backpropagate to Previous Time Steps

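The error is then propagated back to previous time steps. At each step t, starting from t = 3, the gradient first passes through the tanh non-linearity and then moves on to the previous hidden state:

\delta a_t = \delta h_t \odot (1 - h_t^2), \qquad \delta h_{t-1} = W_{hh}^T \cdot \delta a_t

Because W_{hh}, W_{xh}, and b_h are shared across time steps, their gradients are summed over all steps:

\delta W_{xh} = \sum_t \delta a_t \cdot x_t^T, \qquad \delta W_{hh} = \sum_t \delta a_t \cdot h_{t-1}^T, \qquad \delta b_h = \sum_t \delta a_t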

Step 5: Update Weights

After computing the gradients, we update the weights using gradient descent:

W \leftarrow W - \eta \cdot \delta W

where \eta is the learning rate.
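
The five steps above can be sketched end to end as follows. This is a minimal illustration rather than a full training loop. It assumes h_0 = 0, a loss only at the final time step, and a hypothetical learning rate of 0.1. The helper names are made up for this example.

```python
import math

W_xh = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
W_hh = [[0.1, 0.2], [0.3, 0.4]]
W_hy = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]

def add_in_place(M, N):
    """Accumulate matrix N into matrix M."""
    for row_m, row_n in zip(M, N):
        for j, value in enumerate(row_n):
            row_m[j] += value

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Forward pass, keeping every hidden state for the backward pass.
xs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]  # A, B, C
hs = [[0.0, 0.0]]                                  # h_0
for x in xs:
    pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, hs[-1]))]
    hs.append([math.tanh(p) for p in pre])

y_true = [0, 0, 0, 1]  # target "D"
y_pred = softmax(matvec(W_hy, hs[-1]))

# Steps 1-3: output-layer gradients and the gradient reaching h_3.
d_o = [p - t for p, t in zip(y_pred, y_true)]
dW_hy = outer(d_o, hs[-1])
d_h = matvec(transpose(W_hy), d_o)

# Step 4: walk back through time, summing the shared-weight gradients.
dW_xh = [[0.0] * 4 for _ in range(2)]
dW_hh = [[0.0] * 2 for _ in range(2)]
for t in range(len(xs), 0, -1):
    d_a = [g * (1.0 - h * h) for g, h in zip(d_h, hs[t])]  # through tanh
    add_in_place(dW_xh, outer(d_a, xs[t - 1]))
    add_in_place(dW_hh, outer(d_a, hs[t - 1]))
    d_h = matvec(transpose(W_hh), d_a)                      # on to h_{t-1}

# Step 5: one gradient-descent update, shown here for W_hy only.
eta = 0.1  # hypothetical learning rate
W_hy_new = [[w - eta * g for w, g in zip(rw, rg)]
            for rw, rg in zip(W_hy, dW_hy)]

print("delta_o =", [round(v, 4) for v in d_o])
print("dW_hh   =", [[round(v, 4) for v in row] for row in dW_hh])
```

The column of dW_xh that belongs to “D” stays at zero, because “D” never appears as an input and therefore contributes nothing to the gradient.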

Part 2: Transformers

1. Introduction to Transformers

Transformers are a type of neural network architecture that relies entirely on self-attention mechanisms to model relationships in sequential data, without using recurrent connections. They are considered neural networks because they consist of layers of neurons that perform linear transformations and non-linear activations, similar to traditional neural networks.

2. Problem Statement

We aim to predict the next letter “D” given the sequence “A B C” using a Transformer model.

3. Input Representation

The input is represented using embeddings. We use an embedding dimension d_{\text{model}} = 2 for simplicity. The embedding matrix, with one row per letter (A, B, C, D), is:

E = \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \\ 1.0 & 1.0 \\ 0.0 & 0.0 \end{bmatrix}

Positional encoding is typically added to the embeddings to account for the position of the tokens in the sequence, but for simplicity, we will assume positional encodings are zero.

4. Self-Attention Mechanism

In the self-attention mechanism, we compute Queries (Q), Keys (K), and Values (V) for each position in the sequence.

The weights W_Q, W_K, W_V are matrices of size d_{\text{model}} \times d_k, where d_k = d_{\text{model}} = 2. We initialize the weights as:

W_Q = W_K = W_V = \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{bmatrix}

Compute Q, K, V

For each position i:

Q_i = x_i \cdot W_Q

K_i = x_i \cdot W_K

V_i = x_i \cdot W_V
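
As a quick illustration, the projections can be computed for the three input letters at once. The sketch below assumes that the rows of E follow the vocabulary order A, B, C, D and treats each token as a row vector. The `matmul` helper is hypothetical.

```python
# Embedding rows are assumed to follow the vocabulary order A, B, C, D.
VOCAB = ["A", "B", "C", "D"]
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# Identity projection matrices, as initialized in the text.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[1.0, 0.0], [0.0, 1.0]]
W_V = [[1.0, 0.0], [0.0, 1.0]]

def matmul(X, W):
    """Multiply X (n x d) by W (d x k); each row of X is one token."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)]
            for row in X]

X = [E[VOCAB.index(letter)] for letter in "ABC"]  # embeddings of A, B, C
Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)

print("Q =", Q)  # identity weights leave the embeddings unchanged
```

Because the projection matrices are identities, Q, K, and V are all equal to the input embeddings, which keeps the arithmetic in the following steps easy to check by hand.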

5. Compute Attention Scores

For each query Q_i, we compute attention scores with all keys K_j:

\text{Attention Score}_{i,j} = \frac{Q_i \cdot K_j^T}{\sqrt{d_k}}

For simplicity, we compute the attention for the last input “C” (i.e., i = 3).

6. Compute Attention Weights

We apply the softmax function to the attention scores to obtain the attention weights:

\alpha_{i,j} = \frac{e^{\text{Score}_{i,j}}}{\sum_k e^{\text{Score}_{i,k}}}

7. Compute Attention Output

The attention output is the weighted sum of the values V_j:

\text{Attention Output}_i = \sum_j \alpha_{i,j} \cdot V_j
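
To make the three formulas concrete, the following sketch evaluates them for the last input “C” (i = 3 in the text, index 2 in zero-based Python). It assumes the identity projections from above, so Q, K, and V equal the embeddings of A, B, and C.

```python
import math

# With identity projections, Q, K and V all equal the embeddings of A, B, C.
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d_k = 2

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

i = 2  # zero-based index of the third token, "C"
scores = [dot(Q[i], K[j]) / math.sqrt(d_k) for j in range(len(K))]
weights = softmax(scores)
output = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_k)]

print([round(s, 4) for s in scores])   # [0.7071, 0.7071, 1.4142]
print([round(w, 4) for w in weights])  # [0.2483, 0.2483, 0.5035]
print([round(o, 4) for o in output])   # [0.7517, 0.7517]
```

“C” attends most strongly to itself (weight of about 0.50) and equally to “A” and “B” (about 0.25 each), since its embedding [1, 1] overlaps equally with both.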

8. Feed-Forward Neural Network

After computing the attention output, it is passed through a feed-forward neural network. For simplicity, we assume the weights of the feed-forward network are identity matrices and its biases are zero, so only the ReLU activation remains:

\text{FFN}_i = \text{ReLU}(\text{Attention Output}_i)

9. Output Layer and Prediction

The output of the feed-forward network is then passed through an output layer to predict the next letter. The output weights W_{\text{out}} have size 4 \times 2, so they map the 2-dimensional feed-forward output to one score (logit) per letter in the vocabulary.

The predicted output is computed as:

o = W_{\text{out}} \cdot \text{FFN}_3 + b_{\text{out}}

10. Apply Softmax to Get Probabilities

We apply the softmax function to the output logits to obtain the predicted probabilities for each letter in the vocabulary.

11. Loss Calculation

The cross-entropy loss is calculated as:

\text{Loss} = -\log(P_{\text{true}})
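
Steps 8 to 11 can be sketched as below. The text does not give values for W_{\text{out}} and b_{\text{out}}, so this example reuses the RNN’s W_{hy} values and a zero bias purely as a hypothetical stand-in. The attention output for “C” is taken as roughly [0.7517, 0.7517], which is what the identity weights and embeddings above produce.

```python
import math

# Attention output for "C" under the identity weights (rounded).
attention_output = [0.7517, 0.7517]

# Hypothetical output layer: the text does not specify W_out or b_out,
# so the RNN's W_hy values and a zero bias are reused for illustration.
W_out = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
b_out = [0.0, 0.0, 0.0, 0.0]

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(z):
    shift = max(z)
    exps = [math.exp(x - shift) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

# Identity feed-forward weights, so only the ReLU remains.
ffn = relu(attention_output)

# Output layer: one logit per letter in the vocabulary.
logits = [sum(w * f for w, f in zip(row, ffn)) + b
          for row, b in zip(W_out, b_out)]
probs = softmax(logits)
loss = -math.log(probs[3])  # index 3 corresponds to the true letter "D"

print("logits =", [round(v, 4) for v in logits])
print("probs  =", [round(p, 4) for p in probs])
print("loss   =", round(loss, 4))
```

Since both attention outputs are positive, the ReLU leaves them unchanged here. With different weights, negative components would be set to zero before the output layer.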

12. Backpropagation

Backpropagation in the transformer model follows the same principles as in other neural networks. Gradients are computed for the weights in the attention mechanism and the feed-forward network, and the weights are updated using gradient descent.

Why Transformers Are Neural Networks

Transformers are considered neural networks because they consist of layers of neurons (units) that perform linear transformations (e.g., matrix multiplications) followed by non-linear activations (e.g., ReLU). They also have learnable parameters (weights and biases), and the model is trained using backpropagation to minimize a loss function. The attention mechanism, though different from traditional convolutional or recurrent layers, operates within the framework of a neural network.

Conclusion

In this explanation, we covered how RNNs and Transformers work to predict the next letter in a sequence. While RNNs process inputs sequentially and rely on maintaining a hidden state over time, Transformers leverage self-attention to process all inputs in parallel, making them more efficient and effective for capturing long-range dependencies in data. This makes Transformers particularly well-suited for tasks like language modeling, where understanding the relationships between words (or letters) over long distances is essential.

Key Differences Between RNNs and Transformers

  • Sequential vs. Parallel Processing: RNNs process inputs one at a time, maintaining a hidden state for each step, while Transformers process the entire sequence in parallel using self-attention mechanisms.
  • Handling Long-Range Dependencies: RNNs can struggle with capturing long-term dependencies due to vanishing gradients, whereas Transformers handle long-range dependencies efficiently through self-attention.
  • Training Efficiency: Transformers, due to their parallel nature, are faster to train on large datasets compared to RNNs, which must process the data sequentially.
  • Architectural Complexity: While RNNs are simpler in their design, Transformers are more complex due to their multi-head self-attention layers and feed-forward networks. However, the complexity allows Transformers to capture more nuanced relationships in the data.


Why Transformers Are Considered Superior for NLP

Let’s revisit why Transformers are considered superior for NLP tasks compared to RNNs, and directly relate this to the detailed examples of RNN and Transformer architectures we previously explored.

1. Parallel Processing and Speed

In the earlier RNN explanation, we saw that RNNs process sequences one step at a time. For each letter in the sequence “A B C”, the model updates the hidden state sequentially: h_2 cannot be computed until h_1 is known, and h_3 cannot be computed until h_2 is known. This dependency prevents the time steps from being computed in parallel, which results in slower training on long sequences.

In contrast, the Transformer processes the entire sequence at once, thanks to its self-attention mechanism. This allows it to process all letters (“A B C”) simultaneously and compute relationships between every pair of tokens at the same time, drastically improving training speed. The self-attention mechanism also ensures that Transformers can leverage parallel computing on GPUs more efficiently. In the example, we saw how self-attention enabled the model to focus on relevant parts of the input sequence all at once, making it more scalable for large datasets compared to the sequential processing of RNNs.
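
The contrast can be illustrated by writing self-attention for all three letters as matrix products. The sketch below assumes the identity projections and embeddings from the Transformer example and applies no causal mask, matching the simplified setup in the text. The function names are hypothetical.

```python
import math

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # embeddings of A, B, C
d_k = 2

def matmul(A, B):
    """Multiply matrix A (n x m) by matrix B (m x k)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention for every position, no causal mask."""
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]  # one softmax per query row
    return matmul(weights, V)

# Identity projections, as in the text, give Q = K = V = X.
outputs = self_attention(X, X, X)
for letter, row in zip("ABC", outputs):
    print(letter, [round(v, 4) for v in row])
```

No row of the result depends on any other row, so a GPU can compute all of them simultaneously. The RNN loop, by comparison, has to wait for h_{t-1} before it can compute h_t.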

2. Handling Long-Range Dependencies

When discussing RNNs, we highlighted that they can struggle with long-range dependencies due to the vanishing gradient problem. This issue arises because the gradients diminish as they are backpropagated through many time steps, making it hard for the model to update weights and learn relationships between distant words (like in long sentences or documents). In our RNN example, predicting the next letter “D” from “A B C” would become increasingly difficult if the sequence were much longer due to this vanishing gradient problem.
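
A small experiment makes the vanishing gradient visible. The sketch below reuses W_{xh} and W_{hh} from the RNN example, feeds a hypothetical longer input (“A B C” repeated for 30 steps), and sends a unit gradient backward from the last hidden state. With these particular small weights the gradient shrinks at every step. Larger recurrent weights could instead make it grow.

```python
import math

W_xh = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
W_hh = [[0.1, 0.2], [0.3, 0.4]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def one_hot(index, size=4):
    return [1.0 if i == index else 0.0 for i in range(size)]

# Hypothetical longer input: "A B C" repeated for 30 time steps.
T = 30
xs = [one_hot(t % 3) for t in range(T)]

# Forward pass, storing every hidden state.
hs = [[0.0, 0.0]]
for x in xs:
    pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, hs[-1]))]
    hs.append([math.tanh(p) for p in pre])

# Send a unit gradient back from the last hidden state and track its size.
W_hh_T = [list(col) for col in zip(*W_hh)]
grad = [1.0, 1.0]
norms = []
for t in range(T, 0, -1):
    d_a = [g * (1.0 - h * h) for g, h in zip(grad, hs[t])]  # through tanh
    grad = matvec(W_hh_T, d_a)                              # on to h_{t-1}
    norms.append(math.sqrt(sum(g * g for g in grad)))

for steps_back in (1, 5, 10, 20, 30):
    print(steps_back, norms[steps_back - 1])
```

Each backward step multiplies the gradient by the tanh derivative (at most 1) and by W_{hh}^T, so after a few dozen steps almost no signal from the loss reaches the earliest inputs.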

Transformers, on the other hand, excel at capturing long-range dependencies. Using the self-attention mechanism, Transformers can compute relationships between any two tokens, regardless of their distance in the sequence. In the Transformer example, the relevance of each letter to “C” was determined through attention scores. Because every pair of tokens is connected by a single attention step, the strength of that connection does not fade with distance, which enables Transformers to model long sequences effectively and addresses the core problem faced by RNNs.

3. Efficiency in Gradient Flow

In our RNN example, backpropagation through time (BPTT) had to pass gradients through every hidden state in sequence. Over long sequences this becomes computationally expensive and prone to vanishing gradients, which makes gradient flow less efficient and limits the ability of RNNs to learn complex dependencies.

In comparison, Transformers largely avoid the vanishing gradient problem along the sequence. Because attention connects every position to every other position directly, gradients do not have to pass through a long chain of time steps during backpropagation, allowing for smoother training over long sequences. This makes Transformers more robust for learning from complex data, as seen in the Transformer’s ability to handle the entire sequence “A B C” in parallel.

4. Scalability and Model Size

In the RNN architecture we explored, adding more layers or neurons to handle longer sequences becomes computationally challenging. RNNs face limitations in scalability because their sequential structure prevents efficient scaling with modern GPU resources.

Transformers, however, scale much more effectively. The self-attention layer and feed-forward network from the Transformer example form a single block, and models like GPT-3 and BERT stack many such blocks. Because each block processes all tokens in parallel, the architecture can be scaled to massive datasets and complex tasks while still making efficient use of GPU resources. This scalability is a key reason why Transformers have become the foundation for large language models that can be trained on vast corpora of text.

5. Self-Attention vs. Sequential Memory

In RNNs, the hidden state serves as a “memory” of previous tokens. In the example, everything the model knew about “A” and “B” had to be squeezed into the two numbers of h_3, and each new input partially overwrites what was stored before. Over long sequences this memory decays, which limits the ability of RNNs to maintain accurate representations of earlier tokens.

In the Transformer example, we saw that the self-attention mechanism allows the model to compare all tokens in the sequence directly, providing a more precise and dynamic representation of how each token relates to every other token. This makes Transformers far better at capturing the relationships between distant tokens, which is crucial in tasks like machine translation, text summarization, and long-form text generation.

Conclusion: Why Transformers are Superior for NLP

To summarize, based on both the theoretical explanations and the practical examples, Transformers are better suited for NLP tasks because:

  • Parallel processing allows for faster training and scalability.
  • Self-attention helps capture long-range dependencies more effectively.
  • Transformers largely avoid the vanishing gradient problem along the sequence, making gradient flow smoother.
  • The architecture is inherently more scalable, which is why large models like GPT-3 and BERT are based on Transformers.

In essence, the efficiency, ability to handle long dependencies, and scalability make Transformers the go-to architecture for many NLP tasks today.