
Understanding RNNs: Why Not Compare Them with FNNs to Understand the Math Better? – DAY 58

In this article we work through a small numerical example for a feedforward neural network (FNN) and a recurrent neural network (RNN), so you can understand the math behind each by comparing the two. RNNs are tailored for sequential data: they are designed to remember and use information from previous inputs in a sequence, which lets them capture temporal relationships and context. This is what differentiates RNNs from network types that are not inherently sequence-aware.

 

Neural Networks Example

Example Setup

  • Input for FNN: x = 0.5
  • Target Output for FNN: y = 0.8
  • Input for RNN (Sequence): X = [0.5, 0.7]
  • Target Output for RNN (Sequence): Y = [0.8, 0.9]
  • Learning Rate: \eta = 0.1
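
If you want to follow along in code, here is a minimal Python sketch of this setup (plain Python, no framework; the variable names are our own choice):

```python
# Shared values for the worked examples below, mirroring the numbers above.
x = 0.5                # FNN input
y = 0.8                # FNN target
X_seq = [0.5, 0.7]     # RNN input sequence
Y_seq = [0.8, 0.9]     # RNN target sequence
eta = 0.1              # learning rate
```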

1. Feedforward Neural Network (FNN)

Structure

  • Input Layer: 1 neuron
  • Hidden Layer: 1 neuron
  • Output Layer: 1 neuron

Weights and Biases

  • Initial Weights:
    • W_{ih} = 0.4 (Input to Hidden weight)
    • W_{ho} = 0.6 (Hidden to Output weight)
  • Biases:
    • b_h = 0.1 (Hidden layer bias)
    • b_o = 0.2 (Output layer bias)

Step-by-Step Calculation for FNN

Step 1: Forward Pass

  1. Hidden Layer Output:

    h = \text{ReLU}(W_{ih} \cdot x + b_h) = \text{ReLU}(0.4 \cdot 0.5 + 0.1) = \text{ReLU}(0.2 + 0.1) = \text{ReLU}(0.3) = 0.3

  2. Output:

    y_{\text{pred}} = W_{ho} \cdot h + b_o = 0.6 \cdot 0.3 + 0.2 = 0.18 + 0.2 = 0.38
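
As a sanity check, the forward pass above fits in a few lines of Python (a minimal sketch using the same weights; the variable names are our own):

```python
# FNN forward pass: ReLU hidden neuron, linear output neuron.
x = 0.5
W_ih, W_ho = 0.4, 0.6           # input->hidden and hidden->output weights
b_h, b_o = 0.1, 0.2             # hidden and output biases

h = max(0.0, W_ih * x + b_h)    # ReLU(0.4*0.5 + 0.1) = ReLU(0.3) = 0.3
y_pred = W_ho * h + b_o         # 0.6*0.3 + 0.2 = 0.38
print(round(h, 4), round(y_pred, 4))   # 0.3 0.38
```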

Step 2: Loss Calculation

Using Mean Squared Error (MSE):

L = \frac{1}{2} (y_{\text{pred}} - y)^2 = \frac{1}{2} (0.38 - 0.8)^2 = \frac{1}{2} (-0.42)^2 = \frac{1}{2} \cdot 0.1764 = 0.0882
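
The same loss in code (a tiny sketch continuing from the forward-pass values):

```python
# Half mean squared error, as used throughout this article: L = 1/2 (y_pred - y)^2
y, y_pred = 0.8, 0.38
L = 0.5 * (y_pred - y) ** 2
print(round(L, 4))   # 0.0882
```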

Step 3: Backward Pass

  1. Gradient of Loss with respect to Output:

    \frac{\partial L}{\partial y_{\text{pred}}} = y_{\text{pred}} - y = 0.38 - 0.8 = -0.42

  2. Gradient of Output with respect to Hidden Layer:

    \frac{\partial y_{\text{pred}}}{\partial h} = W_{ho}

  3. Gradients of Loss with respect to Weights:

    \frac{\partial L}{\partial W_{ho}} = \frac{\partial L}{\partial y_{\text{pred}}} \cdot \frac{\partial y_{\text{pred}}}{\partial W_{ho}} = -0.42 \cdot 0.3 = -0.126

    \frac{\partial L}{\partial W_{ih}} = \frac{\partial L}{\partial y_{\text{pred}}} \cdot \frac{\partial y_{\text{pred}}}{\partial h} \cdot \frac{\partial h}{\partial W_{ih}} = -0.42 \cdot W_{ho} \cdot \frac{\partial h}{\partial W_{ih}}

    Since the ReLU is active here (its input 0.3 is positive), its derivative is 1, so \frac{\partial h}{\partial W_{ih}} = x \cdot 1 = 0.5:

    \frac{\partial L}{\partial W_{ih}} = -0.42 \cdot 0.6 \cdot 0.5 = -0.126
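
Here is the same chain rule in code (a minimal sketch reusing the forward-pass values; the gradient names are our own):

```python
# FNN gradients via the chain rule, using h = 0.3 and y_pred = 0.38 from above.
x, y = 0.5, 0.8
W_ih, W_ho, b_h = 0.4, 0.6, 0.1
h, y_pred = 0.3, 0.38

dL_dy = y_pred - y                                # -0.42
dL_dWho = dL_dy * h                               # -0.42 * 0.3 = -0.126
relu_grad = 1.0 if W_ih * x + b_h > 0 else 0.0    # ReLU derivative (active here)
dL_dWih = dL_dy * W_ho * relu_grad * x            # -0.42 * 0.6 * 1 * 0.5 = -0.126
print(round(dL_dWho, 4), round(dL_dWih, 4))       # -0.126 -0.126
```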

Step 4: Weight Update

  1. Update Output Weight:

    W_{ho} = W_{ho} - \eta \cdot \frac{\partial L}{\partial W_{ho}} = 0.6 - 0.1 \cdot (-0.126) = 0.6 + 0.0126 = 0.6126

  2. Update Input Weight:

    W_{ih} = W_{ih} - \eta \cdot \frac{\partial L}{\partial W_{ih}} = 0.4 - 0.1 \cdot (-0.126) = 0.4 + 0.0126 = 0.4126
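
And the corresponding gradient-descent step in code (again a sketch with our own variable names):

```python
# One gradient-descent step for the FNN weights with learning rate eta = 0.1.
eta = 0.1
W_ho, W_ih = 0.6, 0.4
dL_dWho = dL_dWih = -0.126

W_ho -= eta * dL_dWho            # 0.6 + 0.0126 = 0.6126
W_ih -= eta * dL_dWih            # 0.4 + 0.0126 = 0.4126
print(round(W_ho, 4), round(W_ih, 4))   # 0.6126 0.4126
```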

2. Recurrent Neural Network (RNN)

Structure

  • Input Layer: 1 neuron
  • Hidden Layer: 1 neuron
  • Output Layer: 1 neuron

Weights and Biases

  • Initial Weights:
    • W_{xh} = 0.5 (Input to Hidden weight)
    • W_{hh} = 0.3 (Hidden to Hidden weight)
    • W_{hy} = 0.7 (Hidden to Output weight)
  • Biases:
    • b_h = 0.1 (Hidden layer bias)
    • b_o = 0.2 (Output layer bias)

Step-by-Step Calculation for RNN

Step 1: Forward Pass

Assuming initial hidden state h_0 = 0. This is where the memory concept starts; the hidden state retains information from previous time steps.

  1. For t = 1 (Input x_1 = 0.5):
    • Hidden State:

      h_1 = \text{tanh}(W_{xh} \cdot x_1 + W_{hh} \cdot h_0 + b_h) = \text{tanh}(0.5 \cdot 0.5 + 0.3 \cdot 0 + 0.1) = \text{tanh}(0.25 + 0.1) = \text{tanh}(0.35) \approx 0.337

      Here, h_1 is influenced by the previous hidden state h_0 (which is 0). This demonstrates how the RNN maintains memory; the hidden state captures the relevant information to influence future computations.

    • Output:

      y_{1, \text{pred}} = W_{hy} \cdot h_1 + b_o = 0.7 \cdot 0.337 + 0.2 \approx 0.2359 + 0.2 = 0.4359

  2. For t = 2 (Input x_2 = 0.7):
    • Hidden State:

      h_2 = \text{tanh}(W_{xh} \cdot x_2 + W_{hh} \cdot h_1 + b_h) = \text{tanh}(0.5 \cdot 0.7 + 0.3 \cdot 0.337 + 0.1)

       = \text{tanh}(0.35 + 0.1011 + 0.1) = \text{tanh}(0.5511) \approx 0.500

      In this step, h_2 is influenced by both the current input x_2 and the previous hidden state h_1. This reflects the memory of the previous input and its influence on the current state.

    • Output:

      y_{2, \text{pred}} = W_{hy} \cdot h_2 + b_o = 0.7 \cdot 0.500 + 0.2 = 0.35 + 0.2 = 0.55
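
In code, the two-step forward pass is just a loop that keeps feeding the previous hidden state back in (a minimal sketch; small differences from the hand-rounded numbers above are only rounding):

```python
import math

# RNN forward pass over the two-step sequence; h carries memory across steps.
W_xh, W_hh, W_hy = 0.5, 0.3, 0.7
b_h, b_o = 0.1, 0.2
X_seq = [0.5, 0.7]

h = 0.0                     # initial hidden state h_0
hidden_states, preds = [], []
for x_t in X_seq:
    h = math.tanh(W_xh * x_t + W_hh * h + b_h)   # new state depends on old h
    hidden_states.append(h)
    preds.append(W_hy * h + b_o)                 # linear output layer

print([round(v, 4) for v in hidden_states])   # ≈ [0.3364, 0.5012]  (h1, h2)
print([round(v, 4) for v in preds])           # ≈ [0.4355, 0.5508]  (y1, y2)
```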

Step 2: Loss Calculation

Using Mean Squared Error (MSE) for the sequence:

  1. For t = 1:

    L_1 = \frac{1}{2} (y_{1, \text{pred}} - 0.8)^2 = \frac{1}{2} (0.4359 - 0.8)^2 \approx 0.0663

  2. For t = 2:

    L_2 = \frac{1}{2} (y_{2, \text{pred}} - 0.9)^2 = \frac{1}{2} (0.55 - 0.9)^2 \approx 0.0613

Total Loss:

L_{\text{total}} = L_1 + L_2 \approx 0.0663 + 0.0613 \approx 0.1276
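
The per-step losses and their sum in code (a sketch; it prints ≈ 0.1275 because it keeps more decimals than the hand calculation above):

```python
# Half-MSE per time step, summed over the sequence.
Y_seq = [0.8, 0.9]
preds = [0.4359, 0.55]       # hand-computed predictions from above
losses = [0.5 * (p - t) ** 2 for p, t in zip(preds, Y_seq)]
print([round(l, 4) for l in losses])   # ≈ [0.0663, 0.0613]
print(round(sum(losses), 4))           # ≈ 0.1275
```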

Step 3: Backward Pass (BPTT)

This is where backpropagation through time takes place. The gradients are computed considering how each hidden state affects the output across the entire sequence.

  1. Gradient of Loss w.r.t Output:
    • For t = 1:

      \frac{\partial L_1}{\partial y_{1, \text{pred}}} = y_{1, \text{pred}} - 0.8 = 0.4359 - 0.8 = -0.3641

    • For t = 2:

      \frac{\partial L_2}{\partial y_{2, \text{pred}}} = y_{2, \text{pred}} - 0.9 = 0.55 - 0.9 = -0.35

  2. Gradient of Output w.r.t Hidden:

    \frac{\partial y_{1, \text{pred}}}{\partial h_1} = W_{hy} = 0.7

    \frac{\partial y_{2, \text{pred}}}{\partial h_2} = W_{hy} = 0.7

  3. Gradient of Hidden States:
    • For t = 1:

      \frac{\partial L_1}{\partial h_1} = \frac{\partial L_1}{\partial y_{1, \text{pred}}} \cdot \frac{\partial y_{1, \text{pred}}}{\partial h_1} = -0.3641 \cdot 0.7 = -0.2549

    • For t = 2:

      \frac{\partial L_2}{\partial h_2} = \frac{\partial L_2}{\partial y_{2, \text{pred}}} \cdot \frac{\partial y_{2, \text{pred}}}{\partial h_2} = -0.35 \cdot 0.7 = -0.245

    • Memory Influence: The hidden state h_2 depends on h_1 and the current input x_2. Thus, the gradients also account for the memory stored in previous hidden states.
  4. Gradient for Weights:
    • For W_{hy} (for brevity, only the t = 1 contribution is shown; a full BPTT pass would also add \frac{\partial L_2}{\partial W_{hy}} = -0.35 \cdot 0.500 = -0.175):

      \frac{\partial L_1}{\partial W_{hy}} = \frac{\partial L_1}{\partial y_{1, \text{pred}}} \cdot h_1 = -0.3641 \cdot 0.337 \approx -0.1227

    • For W_{hh} (the chain rule runs through h_2, so the tanh derivative 1 - h_2^2 appears):

      \frac{\partial L_2}{\partial W_{hh}} = \frac{\partial L_2}{\partial h_2} \cdot \frac{\partial h_2}{\partial W_{hh}} = \frac{\partial L_2}{\partial h_2} \cdot (1 - h_2^2) \cdot h_1 \approx -0.245 \cdot 0.75 \cdot 0.337 \approx -0.0619
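
The same simplified BPTT gradients in code (a sketch that only includes the terms worked out in this article, using the rounded h_1 and h_2 values):

```python
# Simplified BPTT gradients for the two-step example.
W_hy = 0.7
h1, h2 = 0.337, 0.500
y1_pred, y2_pred = 0.4359, 0.55
y1, y2 = 0.8, 0.9

dL1_dy1 = y1_pred - y1                      # -0.3641
dL2_dy2 = y2_pred - y2                      # -0.35
dL1_dh1 = dL1_dy1 * W_hy                    # ≈ -0.2549
dL2_dh2 = dL2_dy2 * W_hy                    # -0.245

dL1_dWhy = dL1_dy1 * h1                     # ≈ -0.1227
dL2_dWhh = dL2_dh2 * (1 - h2 ** 2) * h1     # tanh'(z2) = 1 - h2^2; ≈ -0.0619
print(round(dL1_dWhy, 4), round(dL2_dWhh, 4))   # ≈ -0.1227 -0.0619
```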

Step 4: Weight Update

  1. Update Weights:
    • For W_{hy}:

      W_{hy} = W_{hy} - \eta \cdot \frac{\partial L_1}{\partial W_{hy}} = 0.7 - 0.1 \cdot (-0.1227) = 0.7 + 0.01227 \approx 0.7123

    • For W_{hh}:

      W_{hh} = W_{hh} - \eta \cdot \frac{\partial L_2}{\partial W_{hh}} = 0.3 - 0.1 \cdot (-0.0619) = 0.3 + 0.00619 \approx 0.3062
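
And the corresponding update step in code (same sketch style as for the FNN):

```python
# One gradient-descent step for the RNN weights with eta = 0.1.
eta = 0.1
W_hy, W_hh = 0.7, 0.3
dL1_dWhy, dL2_dWhh = -0.1227, -0.0619

W_hy -= eta * dL1_dWhy           # 0.7 + 0.01227 ≈ 0.7123
W_hh -= eta * dL2_dWhh           # 0.3 + 0.00619 ≈ 0.3062
print(round(W_hy, 4), round(W_hh, 4))   # 0.7123 0.3062
```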

Summary Table

Feedforward Neural Network (FNN)

Step | Calculation
Forward Pass | h = 0.3, \; y_{\text{pred}} \approx 0.38
Loss | L \approx 0.0882
Gradient (Output) | \frac{\partial L}{\partial y_{\text{pred}}} \approx -0.42
Weight Update | W_{ho} = 0.6126, \; W_{ih} = 0.4126

Recurrent Neural Network (RNN)

Step | Calculation | Explanation
Forward Pass | h_1 \approx 0.337, \; y_{1, \text{pred}} \approx 0.4359 | Hidden state h_1 remembers h_0 (initial state).
Forward Pass | h_2 \approx 0.500, \; y_{2, \text{pred}} \approx 0.55 | Hidden state h_2 remembers h_1, capturing memory.
Loss | L_{\text{total}} \approx 0.1276 | Total loss calculated across the sequence.
Gradient (Output) | t=1: -0.3641, \; t=2: -0.35 | Gradients computed for each time step output.
Weight Update | W_{hy} \approx 0.7123, \; W_{hh} \approx 0.3062 | Weights updated based on contributions from previous states.

Key Takeaways

  • FNN: Each input is treated independently, and the backpropagation process is straightforward.
  • RNN: The model retains memory of previous states through the hidden state, making the calculations for gradients more complex, especially during backpropagation through time (BPTT). Each hidden state influences subsequent outputs and reflects the model’s ability to remember past inputs.

Another concept to understand here is, memorization in RNNs happens through the hidden states. Each hidden state (h_t) carries information from previous inputs:

  • At time step t = 1, the hidden state h_1 is influenced by the initial hidden state h_0 (which is 0).
  • At time step t = 2, the hidden state h_2 is influenced by both the current input x_2 and the previous hidden state h_1.
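
A small way to see this "memory" in code: feed the same input value at two different time steps and notice that the hidden state differs, because it also depends on the previous state (a toy sketch with the article's weights; the third input 0.5 is our own addition for illustration):

```python
import math

# Same input (0.5) at step 1 and step 3 -> different hidden states,
# because h_t also depends on h_{t-1}.
W_xh, W_hh, b_h = 0.5, 0.3, 0.1
h = 0.0
for t, x_t in enumerate([0.5, 0.7, 0.5], start=1):
    h = math.tanh(W_xh * x_t + W_hh * h + b_h)
    print(t, x_t, round(h, 4))
# step 1: h ≈ 0.3364   step 3: h ≈ 0.4624 -- same input, different state
```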

This brief example aimed to explain the math behind Feedforward Neural Networks (FNNs) and Recurrent Neural Networks (RNNs) in a simple way; by highlighting their mathematical differences, you can understand each model better.
