
1. Iterative Forecasting: Predicting One Step at a Time, 2. Direct Multi-Step Forecasting with RNN, 3. Seq2Seq Models for Time Series Forecasting – Day 61






Mastering Time Series Forecasting with RNNs and Seq2Seq Models: Detailed Iterations with Calculations, Tables, and Method-Specific Features

Time series forecasting is a crucial task in various domains such as finance, weather prediction, and energy management. Recurrent Neural Networks (RNNs) and Sequence-to-Sequence (Seq2Seq) models are powerful tools for handling sequential data. In this guide, we will provide step-by-step calculations, including forward passes, loss computations, and backpropagation for two iterations across three forecasting methods:

  1. Iterative Forecasting: Predicting One Step at a Time
  2. Direct Multi-Step Forecasting with RNN
  3. Seq2Seq Models for Time Series Forecasting

Assumptions and Initial Parameters

For consistency across all methods, we’ll use the following initial parameters:

  • Input Sequence: X = [1, 2, 3]
  • Desired Outputs:
    • For Iterative Forecasting and Seq2Seq: Y_{\text{true}} = [2, 3, 4]
    • For Direct Multi-Step Forecasting: Y_{\text{true}} = [4, 5]
  • Initial Weights and Biases:
    • Weights:
      • W_h = 0.5 (hidden-to-hidden weight)
      • W_x = 0.2 (input-to-hidden weight)
      • W_y = 1.0 for the single-output layers (Iterative Forecasting and the Seq2Seq decoder); for Direct Multi-Step Forecasting, W_y becomes a 2 \times 1 matrix so that two future steps can be output at once.
    • Biases:
      • b_h = 0.1
      • b_y = 0.0
  • Activation Function: Hyperbolic tangent (\tanh(z))
  • Learning Rate: \eta = 0.01
  • Initial Hidden State: h_0 = 0
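
To make these equations concrete before the hand calculations, here is a minimal plain-Python sketch of a single forward step (the function and variable names are illustrative, not part of any library):

import math

def rnn_step(x_t, h_prev, W_h=0.5, W_x=0.2, b_h=0.1, W_y=1.0, b_y=0.0):
    # z_t = W_h * h_{t-1} + W_x * x_t + b_h
    z_t = W_h * h_prev + W_x * x_t + b_h
    # h_t = tanh(z_t), y_t = W_y * h_t + b_y
    h_t = math.tanh(z_t)
    y_t = W_y * h_t + b_y
    return h_t, y_t

# First time step of the example: x_1 = 1, h_0 = 0
h_1, y_1 = rnn_step(1, 0)
print(round(h_1, 3), round(y_1, 3))  # 0.291 0.291, as in the tables below

The hand calculations that follow simply apply this step once per time step.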

1. Iterative Forecasting: Predicting One Step at a Time

In iterative forecasting, the model predicts one time step ahead and uses that prediction as an input to predict the next step during inference.

Key Feature: During training, we use actual data to prevent error accumulation, but during inference, predictions are fed back into the model.

Iteration 1

Forward Pass

We compute the hidden states and outputs for each time step.

| Time Step t | x_t | h_{t-1} | z_t = W_h h_{t-1} + W_x x_t + b_h | h_t = \tanh(z_t) | y_t = W_y h_t + b_y |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 0.5 \times 0 + 0.2 \times 1 + 0.1 = 0.3 | \tanh(0.3) = 0.291 | 1.0 \times 0.291 + 0 = 0.291 |
| 2 | 2 | 0.291 | 0.5 \times 0.291 + 0.2 \times 2 + 0.1 = 0.6455 | \tanh(0.6455) = 0.569 | 1.0 \times 0.569 + 0 = 0.569 |
| 3 | 3 | 0.569 | 0.5 \times 0.569 + 0.2 \times 3 + 0.1 = 0.9845 | \tanh(0.9845) = 0.755 | 1.0 \times 0.755 + 0 = 0.755 |

Feature Highlight: Each hidden state h_t depends on the previous hidden state h_{t-1} and the current input x_t, showcasing the sequential processing of RNNs.

Loss Calculation

Compute the Mean Squared Error (MSE):

\text{MSE} = \frac{1}{3} \sum_{t=1}^{3} (y_t - Y_{\text{true}, t})^2

| Time Step t | y_t | Y_{\text{true}, t} | (y_t - Y_{\text{true}, t})^2 |
|---|---|---|---|
| 1 | 0.291 | 2 | (0.291 - 2)^2 = 2.921 |
| 2 | 0.569 | 3 | (0.569 - 3)^2 = 5.902 |
| 3 | 0.755 | 4 | (0.755 - 4)^2 = 10.514 |

Sum = 19.337

\text{MSE} = \frac{19.337}{3} = 6.446
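
The forward pass and MSE above can be reproduced with a short plain-Python sketch (same scalar parameters; variable names are illustrative):

import math

W_h, W_x, b_h = 0.5, 0.2, 0.1   # recurrent weight, input weight, hidden bias
W_y, b_y = 1.0, 0.0             # output layer
X, Y_true = [1, 2, 3], [2, 3, 4]

h, ys = 0.0, []
for x in X:
    h = math.tanh(W_h * h + W_x * x + b_h)   # hidden state update
    ys.append(W_y * h + b_y)                 # one-step-ahead prediction

mse = sum((y - t) ** 2 for y, t in zip(ys, Y_true)) / len(ys)
print([round(y, 3) for y in ys])  # [0.291, 0.569, 0.755]
print(round(mse, 3))              # about 6.45

The small difference from 6.446 comes from rounding the hidden states to three decimals in the hand calculation.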

Backpropagation

Feature Highlight: We use Backpropagation Through Time (BPTT) to compute gradients over the entire sequence, considering temporal dependencies.
Step 1: Gradients w.r.t Outputs

\delta_{y_t} = y_t - Y_{\text{true}, t}

| Time Step t | y_t | Y_{\text{true}, t} | \delta_{y_t} = y_t - Y_{\text{true}, t} |
|---|---|---|---|
| 1 | 0.291 | 2 | -1.709 |
| 2 | 0.569 | 3 | -2.431 |
| 3 | 0.755 | 4 | -3.245 |

Step 2: Gradients w.r.t W_y and b_y

\frac{\partial L}{\partial W_y} = \sum_{t=1}^{3} \delta_{y_t} h_t

\frac{\partial L}{\partial b_y} = \sum_{t=1}^{3} \delta_{y_t}

Compute:

\frac{\partial L}{\partial W_y} = (-1.709)(0.291) + (-2.431)(0.569) + (-3.245)(0.755) = -4.326

\frac{\partial L}{\partial b_y} = -1.709 - 2.431 - 3.245 = -7.385

Feature Highlight: Gradients for the output layer accumulate over time steps, reflecting the influence of each time step on the loss.
Step 3: Gradients w.r.t Hidden States

Starting from the last time step and moving backward.

At t = 3:

\delta_{h_3} = \delta_{y_3} W_y (1 - h_3^2)

Compute:

\delta_{h_3} = (-3.245)(1.0)(1 - (0.755)^2) = (-3.245)(0.430) = -1.394

Feature Highlight: The derivative of the activation function (1 - h_t^2) is used because of the tanh activation.

At t = 2:

\delta_{h_2} = (\delta_{h_3} W_h + \delta_{y_2} W_y)(1 - h_2^2)

Compute:

\delta_{h_2} = (-1.394)(0.5) + (-2.431)(1.0) = -0.697 - 2.431 = -3.128

(1 - h_2^2) = 1 - (0.569)^2 = 0.676

\delta_{h_2} = -3.128 \times 0.676 = -2.116

At t = 1:

\delta_{h_1} = (\delta_{h_2} W_h + \delta_{y_1} W_y)(1 - h_1^2)

Compute:

\delta_{h_1} = (-2.116)(0.5) + (-1.709)(1.0) = -1.058 - 1.709 = -2.767

(1 - h_1^2) = 0.915

\delta_{h_1} = -2.767 \times 0.915 = -2.531

Feature Highlight: Gradients w.r.t hidden states consider both the direct error at the current time step and the propagated error from future time steps.
Step 4: Gradients w.r.t W_h, W_x, and b_h

Compute:

\frac{\partial L}{\partial W_h} = \sum_{t=1}^{3} \delta_{h_t} h_{t-1}

\frac{\partial L}{\partial W_x} = \sum_{t=1}^{3} \delta_{h_t} x_t

\frac{\partial L}{\partial b_h} = \sum_{t=1}^{3} \delta_{h_t}

Calculations:

  • For W_h:

    \frac{\partial L}{\partial W_h} = (-2.531)(0) + (-2.116)(0.291) + (-1.394)(0.569) = -0.616 - 0.794 = -1.410

  • For W_x:

    \frac{\partial L}{\partial W_x} = (-2.531)(1) + (-2.116)(2) + (-1.394)(3) = -2.531 - 4.232 - 4.182 = -10.945

  • For b_h:

    \frac{\partial L}{\partial b_h} = -2.531 - 2.116 - 1.394 = -6.041

Feature Highlight: Gradients for weights are influenced by both inputs and hidden states, showing how the network learns from the sequence data.
Step 5: Update Weights and Biases

Update parameters using gradient descent:

W_y^{\text{new}} = W_y - \eta \frac{\partial L}{\partial W_y} = 1.0 - 0.01 \times (-4.326) = 1.043

b_y^{\text{new}} = b_y - \eta \frac{\partial L}{\partial b_y} = 0.0 - 0.01 \times (-7.385) = 0.074

W_h^{\text{new}} = W_h - \eta \frac{\partial L}{\partial W_h} = 0.5 - 0.01 \times (-1.410) = 0.514

W_x^{\text{new}} = W_x - \eta \frac{\partial L}{\partial W_x} = 0.2 - 0.01 \times (-10.945) = 0.309

b_h^{\text{new}} = b_h - \eta \frac{\partial L}{\partial b_h} = 0.1 - 0.01 \times (-6.041) = 0.161

Feature Highlight: Parameters are updated based on accumulated gradients, improving the model’s predictions in subsequent iterations.
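
For readers who want to verify Steps 1-5 end to end, here is a plain-Python sketch of the same BPTT pass and update (it follows the convention of the hand calculation, where \delta_{y_t} = y_t - Y_{\text{true}, t}; variable names are illustrative):

import math

W_h, W_x, b_h, W_y, b_y, eta = 0.5, 0.2, 0.1, 1.0, 0.0, 0.01
X, Y_true = [1, 2, 3], [2, 3, 4]

# Forward pass, storing every hidden state (h_0 = 0 is prepended)
hs = [0.0]
for x in X:
    hs.append(math.tanh(W_h * hs[-1] + W_x * x + b_h))
ys = [W_y * h + b_y for h in hs[1:]]

# Steps 1-2: output deltas and output-layer gradients
d_y = [y - t for y, t in zip(ys, Y_true)]
dW_y = sum(d * h for d, h in zip(d_y, hs[1:]))
db_y = sum(d_y)

# Step 3: backpropagation through time for the hidden states
d_h, d_h_next = [0.0, 0.0, 0.0], 0.0
for t in reversed(range(3)):
    d_h[t] = (d_h_next * W_h + d_y[t] * W_y) * (1 - hs[t + 1] ** 2)
    d_h_next = d_h[t]

# Step 4: gradients for the recurrent parameters
dW_h = sum(d_h[t] * hs[t] for t in range(3))
dW_x = sum(d_h[t] * X[t] for t in range(3))
db_h = sum(d_h)

# Step 5: gradient-descent update
W_y, b_y = W_y - eta * dW_y, b_y - eta * db_y
W_h, W_x, b_h = W_h - eta * dW_h, W_x - eta * dW_x, b_h - eta * db_h
print(round(W_y, 3), round(b_y, 3), round(W_h, 3), round(W_x, 3), round(b_h, 3))
# about 1.043 0.074 0.514 0.31 0.16 (last digits differ slightly from the
# tables because the hand calculation rounds every intermediate value)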

Iteration 2

Using updated parameters.

Forward Pass

| Time Step t | x_t | h_{t-1} | z_t | h_t | y_t |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 0.514 \times 0 + 0.309 \times 1 + 0.161 = 0.470 | \tanh(0.470) = 0.438 | 1.043 \times 0.438 + 0.074 = 0.530 |
| 2 | 2 | 0.438 | 0.514 \times 0.438 + 0.309 \times 2 + 0.161 = 0.992 | \tanh(0.992) = 0.759 | 1.043 \times 0.759 + 0.074 = 0.866 |
| 3 | 3 | 0.759 | 0.514 \times 0.759 + 0.309 \times 3 + 0.161 = 1.504 | \tanh(1.504) = 0.906 | 1.043 \times 0.906 + 0.074 = 1.019 |

Feature Highlight: Updated parameters result in different activations and outputs, reflecting the learning process.

Loss Calculation

| Time Step t | y_t | Y_{\text{true}, t} | (y_t - Y_{\text{true}, t})^2 |
|---|---|---|---|
| 1 | 0.530 | 2 | (0.530 - 2)^2 = 2.159 |
| 2 | 0.866 | 3 | (0.866 - 3)^2 = 4.551 |
| 3 | 1.019 | 4 | (1.019 - 4)^2 = 8.876 |

Sum = 15.586

\text{MSE} = \frac{15.586}{3} = 5.195

Feature Highlight: The decrease in loss indicates that the model is learning and improving its predictions.

Backpropagation

Repeat the same backpropagation steps as in Iteration 1, using the updated parameters and the new activations from this forward pass.
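
Rather than repeating the hand calculation, the two iterations can also be sketched with PyTorch autograd on the same scalar parameters (an illustrative check, not the training code used later in this post):

import torch

W_h = torch.tensor(0.5, requires_grad=True)
W_x = torch.tensor(0.2, requires_grad=True)
b_h = torch.tensor(0.1, requires_grad=True)
W_y = torch.tensor(1.0, requires_grad=True)
b_y = torch.tensor(0.0, requires_grad=True)
params = [W_h, W_x, b_h, W_y, b_y]

X, Y_true, eta = [1.0, 2.0, 3.0], [2.0, 3.0, 4.0], 0.01

for iteration in range(2):
    h = torch.tensor(0.0)
    sq_errors = []
    for x, target in zip(X, Y_true):
        h = torch.tanh(W_h * h + W_x * x + b_h)
        y = W_y * h + b_y
        sq_errors.append((y - target) ** 2)
    mse = sum(sq_errors) / 3
    # The hand calculation uses delta_y = y - Y_true, i.e. it differentiates
    # 0.5 * (sum of squared errors), so that is the quantity backpropagated here.
    loss_for_grads = 0.5 * sum(sq_errors)
    for p in params:
        p.grad = None
    loss_for_grads.backward()
    with torch.no_grad():
        for p in params:
            p -= eta * p.grad
    print(f"Iteration {iteration + 1}: MSE = {mse.item():.3f}")
# Prints roughly 6.45 and then 5.20, mirroring the decrease from 6.446 to 5.195
# in the hand-worked tables.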











2. Direct Multi-Step Forecasting with RNN

In direct multi-step forecasting, the model predicts multiple future time steps simultaneously using the final hidden state of the RNN, without feeding predictions back into the model.

Key Feature: The model outputs multiple predictions at once, capturing dependencies between future steps directly, and backpropagation occurs through the final hidden state to the input time steps.

Iteration 1

Forward Pass

We process the input sequence to obtain the final hidden state and then predict multiple future outputs.

| Time Step t | x_t | h_{t-1} | z_t = W_h h_{t-1} + W_x x_t + b_h | h_t = \tanh(z_t) |
|---|---|---|---|---|
| 1 | 1 | 0 | 0.5 \times 0 + 0.2 \times 1 + 0.1 = 0.3 | \tanh(0.3) = 0.291 |
| 2 | 2 | 0.291 | 0.5 \times 0.291 + 0.2 \times 2 + 0.1 = 0.6455 | \tanh(0.6455) = 0.569 |
| 3 | 3 | 0.569 | 0.5 \times 0.569 + 0.2 \times 3 + 0.1 = 0.9845 | \tanh(0.9845) = 0.755 |

Feature Highlight: Unlike iterative forecasting, we only compute outputs after processing the entire input sequence, using the final hidden state  h_3 .

Predict Outputs:

Assuming  W_y is adjusted to output two values:

  • Let  W_y be a  2 \times 1 matrix:

             W_y = \begin{bmatrix} 1.0 \\ 0.8 \end{bmatrix}, \quad b_y = \begin{bmatrix} 0 \\ 0 \end{bmatrix}

Compute the predictions:

 \hat{Y} = W_y h_3 + b_y = \begin{bmatrix} 1.0 \\ 0.8 \end{bmatrix} \times 0.755 + \begin{bmatrix} 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.755 \\ 0.604 \end{bmatrix}

Feature Highlight: The model predicts multiple future steps simultaneously from the final hidden state.
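
A quick NumPy sketch of this prediction step, under the same parameter assumptions (the 2 \times 1 output matrix is the one defined above; variable names are illustrative):

import numpy as np

W_h, W_x, b_h = 0.5, 0.2, 0.1
X = [1, 2, 3]

# Run the RNN over the whole input; only the final hidden state is kept
h = 0.0
for x in X:
    h = np.tanh(W_h * h + W_x * x + b_h)

# Output layer maps the final hidden state to two future steps at once
W_y = np.array([1.0, 0.8])   # one weight per predicted step
b_y = np.array([0.0, 0.0])
Y_hat = W_y * h + b_y
print(np.round(Y_hat, 3))    # [0.755 0.604]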

Loss Calculation

Compute the Mean Squared Error (MSE):

 \text{MSE} = \frac{1}{2} \sum_{i=1}^{2} (\hat{Y}_i - Y_{\text{true}, i})^2

| Output Index i | \hat{Y}_i | Y_{\text{true}, i} | (\hat{Y}_i - Y_{\text{true}, i})^2 |
|---|---|---|---|
| 1 | 0.755 | 4 | (0.755 - 4)^2 = 10.517 |
| 2 | 0.604 | 5 | (0.604 - 5)^2 = 19.322 |

Sum = 29.839

\text{MSE} = \frac{29.839}{2} = 14.919

Backpropagation

Feature Highlight: Backpropagation flows from the outputs back through the final hidden state and then through the time steps, without iterative feedback from predictions.
Step 1: Gradients w.r.t Outputs

 \delta_{\hat{Y}_i} = \hat{Y}_i - Y_{\text{true}, i}

Compute:

  •  \delta_{\hat{Y}_1} = 0.755 - 4 = -3.245
  •  \delta_{\hat{Y}_2} = 0.604 - 5 = -4.396
Step 2: Gradients w.r.t  W_y and  b_y

 \frac{\partial L}{\partial W_y} = \delta_{\hat{Y}} h_3^\top = \begin{bmatrix} -3.245 \\ -4.396 \end{bmatrix} \times 0.755 = \begin{bmatrix} -2.450 \\ -3.321 \end{bmatrix}

 \frac{\partial L}{\partial b_y} = \delta_{\hat{Y}} = \begin{bmatrix} -3.245 \\ -4.396 \end{bmatrix}

Feature Highlight: The gradient w.r.t  W_y is calculated using the final hidden state and the errors from both outputs.
Step 3: Gradient w.r.t Final Hidden State  h_3

 \delta_{h_3} = W_y^\top \delta_{\hat{Y}} = (1.0)(-3.245) + (0.8)(-4.396) = -3.245 - 3.517 = -6.762

Compute derivative through the activation function:

 \delta_{h_3} = \delta_{h_3} \cdot (1 - h_3^2) = -6.762 \times (1 - 0.755^2) = -6.762 \times 0.430 = -2.906

Feature Highlight: The error is backpropagated from both outputs to the final hidden state, reflecting the simultaneous prediction of multiple steps.
Step 4: Backpropagate to Previous Time Steps

Compute  \delta_{h_t} for  t = 2 and  t = 1 :

  • At  t = 2 :

     \delta_{h_2} = \delta_{h_3} W_h (1 - h_2^2) = (-2.906)(0.5)(1 - 0.569^2) = -2.906 \times 0.5 \times 0.676 = -0.982

  • At  t = 1 :

     \delta_{h_1} = \delta_{h_2} W_h (1 - h_1^2) = (-0.982)(0.5)(1 - 0.291^2) = -0.982 \times 0.5 \times 0.915 = -0.449

Feature Highlight: Gradients are propagated back through time steps from the final hidden state without considering future predicted outputs.
Step 5: Gradients w.r.t  W_h ,  W_x , and  b_h

Compute:

  • For  W_h :

     \frac{\partial L}{\partial W_h} = \delta_{h_3} h_2 + \delta_{h_2} h_1 + \delta_{h_1} h_0 = (-2.906)(0.569) + (-0.982)(0.291) + (-0.449)(0) = -1.654 - 0.286 + 0 = -1.940

  • For  W_x :

     \frac{\partial L}{\partial W_x} = \delta_{h_3} x_3 + \delta_{h_2} x_2 + \delta_{h_1} x_1 = (-2.906)(3) + (-0.982)(2) + (-0.449)(1) = -8.719 - 1.964 - 0.449 = -11.132

  • For  b_h :

     \frac{\partial L}{\partial b_h} = \delta_{h_3} + \delta_{h_2} + \delta_{h_1} = -2.906 - 0.982 - 0.449 = -4.337

Feature Highlight: Weight gradients are computed based on the backpropagated errors from the final hidden state through the RNN layers.
Step 6: Update Weights and Biases

Update parameters:

  •  W_y^{\text{new}} = W_y - \eta \frac{\partial L}{\partial W_y} = \begin{bmatrix} 1.0 \\ 0.8 \end{bmatrix} - 0.01 \times \begin{bmatrix} -2.450 \\ -3.321 \end{bmatrix} = \begin{bmatrix} 1.0245 \\ 0.8332 \end{bmatrix}
  •  b_y^{\text{new}} = b_y - \eta \frac{\partial L}{\partial b_y} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} - 0.01 \times \begin{bmatrix} -3.245 \\ -4.396 \end{bmatrix} = \begin{bmatrix} 0.032 \\ 0.044 \end{bmatrix}
  •  W_h^{\text{new}} = 0.5 - 0.01 \times (-1.940) = 0.519
  •  W_x^{\text{new}} = 0.2 - 0.01 \times (-11.132) = 0.311
  •  b_h^{\text{new}} = 0.1 - 0.01 \times (-4.337) = 0.143
Feature Highlight: Parameters are updated based on gradients that reflect the simultaneous prediction of future steps.
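
Because the MSE over the two outputs has exactly the gradient \hat{Y}_i - Y_{\text{true}, i} used in Step 1, the gradients in Steps 2-5 can be cross-checked directly with autograd; a sketch under the same assumptions:

import torch

W_h = torch.tensor(0.5, requires_grad=True)
W_x = torch.tensor(0.2, requires_grad=True)
b_h = torch.tensor(0.1, requires_grad=True)
W_y = torch.tensor([1.0, 0.8], requires_grad=True)   # maps h_3 to two outputs
b_y = torch.tensor([0.0, 0.0], requires_grad=True)

X, Y_true = [1.0, 2.0, 3.0], torch.tensor([4.0, 5.0])

# Process the whole input sequence; only h_3 feeds the output layer
h = torch.tensor(0.0)
for x in X:
    h = torch.tanh(W_h * h + W_x * x + b_h)

Y_hat = W_y * h + b_y
loss = ((Y_hat - Y_true) ** 2).mean()   # MSE over the two outputs
loss.backward()

print(W_y.grad)                                        # about [-2.45, -3.32]
print(W_h.grad.item(), W_x.grad.item(), b_h.grad.item())
# about -1.94, -11.14, -4.34, matching the hand-computed gradients up to rounding

Subtracting 0.01 times these gradients reproduces the updated parameters listed above.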

Iteration 2

Forward Pass

| Time Step t | x_t | h_{t-1} | z_t | h_t |
|---|---|---|---|---|
| 1 | 1 | 0 | 0.519 \times 0 + 0.311 \times 1 + 0.143 = 0.454 | \tanh(0.454) = 0.425 |
| 2 | 2 | 0.425 | 0.519 \times 0.425 + 0.311 \times 2 + 0.143 = 1.007 | \tanh(1.007) = 0.765 |
| 3 | 3 | 0.765 | 0.519 \times 0.765 + 0.311 \times 3 + 0.143 = 1.569 | \tanh(1.569) = 0.916 |

Predict Outputs

 \hat{Y} = W_y^{\text{new}} h_3 + b_y^{\text{new}} = \begin{bmatrix} 1.0245 \\ 0.8332 \end{bmatrix} \times 0.916 + \begin{bmatrix} 0.032 \\ 0.044 \end{bmatrix} = \begin{bmatrix} 0.968 + 0.032 \\ 0.763 + 0.044 \end{bmatrix} = \begin{bmatrix} 1.000 \\ 0.807 \end{bmatrix}

Loss Calculation

| Output Index i | \hat{Y}_i | Y_{\text{true}, i} | (\hat{Y}_i - Y_{\text{true}, i})^2 |
|---|---|---|---|
| 1 | 1.000 | 4 | (1.000 - 4)^2 = 9.000 |
| 2 | 0.807 | 5 | (0.807 - 5)^2 = 17.550 |

Sum = 26.550

\text{MSE} = \frac{26.550}{2} = 13.275

Feature Highlight: The loss has decreased compared to the first iteration, indicating learning progress.

Backpropagation

Repeat the backpropagation steps as in Iteration 1, using updated parameters and calculations.


3. Seq2Seq Models for Time Series Forecasting

Seq2Seq models use an encoder-decoder architecture to handle input and output sequences of different lengths.

Key Feature: The model consists of two separate RNNs: an encoder that processes the input sequence and a decoder that generates the output sequence, allowing flexible sequence lengths and capturing complex temporal dependencies.

Iteration 1

Forward Pass

Encoder

Process input sequence  X = [1, 2, 3] to obtain the final hidden state  h_{\text{enc}} .

| Encoder Time Step t | x_t | h_{t-1} | z_t | h_t |
|---|---|---|---|---|
| 1 | 1 | 0 | 0.5 \times 0 + 0.2 \times 1 + 0.1 = 0.3 | \tanh(0.3) = 0.291 |
| 2 | 2 | 0.291 | 0.5 \times 0.291 + 0.2 \times 2 + 0.1 = 0.6455 | \tanh(0.6455) = 0.569 |
| 3 | 3 | 0.569 | 0.5 \times 0.569 + 0.2 \times 3 + 0.1 = 0.9845 | \tanh(0.9845) = 0.755 |

Decoder

Initialize decoder hidden state  h_{\text{dec}, 0} = h_{\text{enc}} = 0.755 . Assuming teacher forcing, we use the actual previous output as input.

  • Decoder Weights: We’ll use separate weights for the decoder:
    •  W_{h_{\text{dec}}} = 0.6
    •  W_{x_{\text{dec}}} = 0.3
    •  W_{y_{\text{dec}}} = 1.0
    •  b_{h_{\text{dec}}} = 0.1
    •  b_{y_{\text{dec}}} = 0.0
Feature Highlight: The encoder and decoder have separate parameters, allowing for different dynamics in processing inputs and generating outputs.

| Decoder Time Step t | y_{\text{input}, t} | h_{t-1} | z_t | h_t | y_t |
|---|---|---|---|---|---|
| 1 | 0 (start token) | 0.755 | 0.6 \times 0.755 + 0.3 \times 0 + 0.1 = 0.553 | \tanh(0.553) = 0.503 | 1.0 \times 0.503 + 0 = 0.503 |
| 2 | 2 | 0.503 | 0.6 \times 0.503 + 0.3 \times 2 + 0.1 = 0.902 | \tanh(0.902) = 0.717 | 1.0 \times 0.717 + 0 = 0.717 |
| 3 | 3 | 0.717 | 0.6 \times 0.717 + 0.3 \times 3 + 0.1 = 1.210 | \tanh(1.210) = 0.837 | 1.0 \times 0.837 + 0 = 0.837 |

Feature Highlight: The decoder uses the encoder’s final hidden state as its initial hidden state and processes the target sequence (using teacher forcing).

Loss Calculation

Compute the MSE over the decoder outputs:

 \text{MSE} = \frac{1}{3} \sum_{t=1}^{3} (y_t - Y_{\text{true}, t})^2

| Time Step t | y_t | Y_{\text{true}, t} | (y_t - Y_{\text{true}, t})^2 |
|---|---|---|---|
| 1 | 0.503 | 2 | (0.503 - 2)^2 = 2.235 |
| 2 | 0.717 | 3 | (0.717 - 3)^2 = 5.210 |
| 3 | 0.837 | 4 | (0.837 - 4)^2 = 10.016 |

Sum = 17.461

\text{MSE} = \frac{17.461}{3} = 5.820
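
Here is a compact plain-Python sketch of the teacher-forced encoder-decoder forward pass under the parameter assumptions above (no intermediate rounding is applied, so the exact numbers will differ somewhat from the hand-worked table):

import math

# Encoder parameters (shared with the earlier sections)
W_h, W_x, b_h = 0.5, 0.2, 0.1
# Decoder parameters (separate from the encoder)
W_hd, W_xd, W_yd, b_hd, b_yd = 0.6, 0.3, 1.0, 0.1, 0.0

X, Y_true = [1, 2, 3], [2, 3, 4]

# Encoder: compress the input sequence into its final hidden state
h_enc = 0.0
for x in X:
    h_enc = math.tanh(W_h * h_enc + W_x * x + b_h)

# Decoder with teacher forcing: start token 0, then the true previous targets
decoder_inputs = [0] + Y_true[:-1]
h_dec, outputs = h_enc, []
for y_in in decoder_inputs:
    h_dec = math.tanh(W_hd * h_dec + W_xd * y_in + b_hd)
    outputs.append(W_yd * h_dec + b_yd)

mse = sum((y - t) ** 2 for y, t in zip(outputs, Y_true)) / len(Y_true)
print([round(y, 3) for y in outputs], round(mse, 3))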

Backpropagation

Feature Highlight: Backpropagation involves computing gradients through both the decoder and encoder, considering their separate parameters and the dependency of the decoder’s initial state on the encoder’s final state.
Step 1: Gradients w.r.t Decoder Outputs

 \delta_{y_t} = y_t - Y_{\text{true}, t}

Compute:

  •  \delta_{y_1} = 0.503 - 2 = -1.497
  •  \delta_{y_2} = 0.717 - 3 = -2.283
  •  \delta_{y_3} = 0.837 - 4 = -3.163
Step 2: Backpropagation Through Decoder Time Steps

Compute gradients w.r.t decoder weights and biases.

At  t = 3 :

 \delta_{h_{\text{dec},3}} = \delta_{y_3} W_{y_{\text{dec}}} (1 - h_{\text{dec},3}^2) = (-3.163)(1.0)(1 - 0.837^2) = -3.163 \times 0.299 = -0.946

At  t = 2 :

 \delta_{h_{\text{dec},2}} = (\delta_{h_{\text{dec},3}} W_{h_{\text{dec}}} + \delta_{y_2} W_{y_{\text{dec}}})(1 - h_{\text{dec},2}^2)

 \delta_{h_{\text{dec},2}} = (-0.946)(0.6) + (-2.283)(1.0) = -0.568 - 2.283 = -2.851

 (1 - h_{\text{dec},2}^2) = 1 - 0.717^2 = 0.485

 \delta_{h_{\text{dec},2}} = -2.851 \times 0.485 = -1.383

At  t = 1 :

 \delta_{h_{\text{dec},1}} = (\delta_{h_{\text{dec},2}} W_{h_{\text{dec}}} + \delta_{y_1} W_{y_{\text{dec}}})(1 - h_{\text{dec},1}^2)

 \delta_{h_{\text{dec},1}} = (-1.383)(0.6) + (-1.497)(1.0) = -0.830 - 1.497 = -2.327

 (1 - h_{\text{dec},1}^2) = 1 - 0.503^2 = 0.747

 \delta_{h_{\text{dec},1}} = -2.327 \times 0.747 = -1.738

Feature Highlight: Errors are backpropagated through the decoder time steps, and gradients are computed for decoder-specific weights and biases.
Step 3: Gradients w.r.t Decoder Weights and Biases

Compute:

  • For  W_{y_{\text{dec}}} :

     \frac{\partial L}{\partial W_{y_{\text{dec}}}} = \sum_{t=1}^{3} \delta_{y_t} h_{\text{dec}, t}

    •  \delta_{y_1} h_{\text{dec},1} = (-1.497)(0.503) = -0.753
    •  \delta_{y_2} h_{\text{dec},2} = (-2.283)(0.717) = -1.637
    •  \delta_{y_3} h_{\text{dec},3} = (-3.163)(0.837) = -2.648
    • Total:  -0.753 - 1.637 - 2.648 = -5.038
  • For  W_{h_{\text{dec}}} :

     \frac{\partial L}{\partial W_{h_{\text{dec}}}} = \sum_{t=1}^{3} \delta_{h_{\text{dec}, t}} h_{\text{dec}, t-1}

    •  \delta_{h_{\text{dec},1}} h_{\text{dec},0} = (-1.738)(0.755) = -1.312
    •  \delta_{h_{\text{dec},2}} h_{\text{dec},1} = (-1.383)(0.503) = -0.696
    •  \delta_{h_{\text{dec},3}} h_{\text{dec},2} = (-0.946)(0.717) = -0.679
    • Total:  -1.312 - 0.696 - 0.679 = -2.687
  • For  W_{x_{\text{dec}}} :

     \frac{\partial L}{\partial W_{x_{\text{dec}}}} = \sum_{t=1}^{3} \delta_{h_{\text{dec}, t}} y_{\text{input}, t}

    •  \delta_{h_{\text{dec},1}} y_{\text{input},1} = (-1.738)(0) = 0
    •  \delta_{h_{\text{dec},2}} y_{\text{input},2} = (-1.383)(2) = -2.766
    •  \delta_{h_{\text{dec},3}} y_{\text{input},3} = (-0.946)(3) = -2.838
    • Total:  0 - 2.766 - 2.838 = -5.604
  • For  b_{h_{\text{dec}}} :

     \frac{\partial L}{\partial b_{h_{\text{dec}}}} = \sum_{t=1}^{3} \delta_{h_{\text{dec}, t}} = -1.738 - 1.383 - 0.946 = -4.067

Feature Highlight: Separate gradients are computed for the decoder’s weights and biases, independent of the encoder.
Step 4: Gradient w.r.t Encoder’s Final Hidden State

Since h_{\text{dec},0} = h_{\text{enc}}, the gradient w.r.t the encoder’s final hidden state is

 \delta_{h_{\text{enc}}} = \delta_{h_{\text{dec},1}} W_{h_{\text{dec}}} = (-1.738)(0.6) = -1.043

Note that no \tanh derivative is applied here: h_{\text{dec},0} is the encoder’s hidden-state value itself, so the factor (1 - h_3^2) enters only once, at the encoder step below.

Step 5: Backpropagation Through Encoder Time Steps

Compute  \delta_{h_t} for encoder time steps  t = 3, 2, 1 :

  • At t = 3:

     \delta_{h_3} = \delta_{h_{\text{enc}}} (1 - h_3^2) = (-1.043)(0.430) = -0.449

  • At t = 2:

     \delta_{h_2} = \delta_{h_3} W_h (1 - h_2^2) = (-0.449)(0.5)(0.676) = -0.152

  • At t = 1:

     \delta_{h_1} = \delta_{h_2} W_h (1 - h_1^2) = (-0.152)(0.5)(0.915) = -0.070

Feature Highlight: The encoder receives gradients backpropagated from the decoder, allowing the encoder to adjust its parameters based on the decoder’s performance.
Step 6: Gradients w.r.t Encoder Weights and Biases

Compute:

  • For W_h:

     \frac{\partial L}{\partial W_h} = \delta_{h_3} h_2 + \delta_{h_2} h_1 + \delta_{h_1} h_0 = (-0.449)(0.569) + (-0.152)(0.291) + (-0.070)(0) = -0.255 - 0.044 + 0 = -0.299

  • For W_x:

     \frac{\partial L}{\partial W_x} = \delta_{h_3} x_3 + \delta_{h_2} x_2 + \delta_{h_1} x_1 = (-0.449)(3) + (-0.152)(2) + (-0.070)(1) = -1.347 - 0.304 - 0.070 = -1.721

  • For b_h:

     \frac{\partial L}{\partial b_h} = \delta_{h_3} + \delta_{h_2} + \delta_{h_1} = -0.449 - 0.152 - 0.070 = -0.671

Feature Highlight: The encoder’s parameters are updated based on how well the decoder performs, enabling the entire Seq2Seq model to learn jointly.
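
As a cross-check on this joint gradient flow, here is a small PyTorch autograd sketch under the same scalar assumptions. It is meant to confirm that the decoder’s loss produces gradients for the encoder parameters as well; since the hand tables round every intermediate value, the printed numbers will not match them exactly.

import torch

# Encoder parameters
W_h = torch.tensor(0.5, requires_grad=True)
W_x = torch.tensor(0.2, requires_grad=True)
b_h = torch.tensor(0.1, requires_grad=True)
# Decoder parameters
W_hd = torch.tensor(0.6, requires_grad=True)
W_xd = torch.tensor(0.3, requires_grad=True)
W_yd = torch.tensor(1.0, requires_grad=True)
b_hd = torch.tensor(0.1, requires_grad=True)
b_yd = torch.tensor(0.0, requires_grad=True)

X, Y_true = [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]
decoder_inputs = [0.0, 2.0, 3.0]   # start token, then teacher forcing

# Encoder forward pass
h = torch.tensor(0.0)
for x in X:
    h = torch.tanh(W_h * h + W_x * x + b_h)

# Decoder forward pass with teacher forcing
h_dec = h
loss = torch.tensor(0.0)
for y_in, target in zip(decoder_inputs, Y_true):
    h_dec = torch.tanh(W_hd * h_dec + W_xd * y_in + b_hd)
    y_pred = W_yd * h_dec + b_yd
    loss = loss + (y_pred - target) ** 2
# The hand derivation uses delta_y = y - Y_true, i.e. the gradient of
# 0.5 * (sum of squared errors), so that is what is backpropagated here.
loss = 0.5 * loss
loss.backward()

# Nonzero encoder gradients confirm that the decoder's loss flows back
# through the context vector into the encoder parameters.
print('encoder grads:', W_h.grad.item(), W_x.grad.item(), b_h.grad.item())
print('decoder grads:', W_hd.grad.item(), W_xd.grad.item(),
      W_yd.grad.item(), b_hd.grad.item())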

Iteration 2

Using updated parameters, repeat the forward pass and backpropagation steps for both the encoder and decoder.


Overall Feature Highlight: Seq2Seq models allow for flexible input and output sequence lengths, with separate encoder and decoder networks learning jointly. The backpropagation process involves gradients flowing from the decoder outputs back through the decoder and then to the encoder, enabling complex temporal dependencies to be captured.

Through detailed calculations, tables, and method-specific highlights, we have covered the three methods: Iterative Forecasting (predicting one step at a time), Direct Multi-Step Forecasting with an RNN, and Seq2Seq models for time series forecasting. Each method processes the data differently, and backpropagation is performed differently in each case.




Conclusion: Time Series Forecasting Methods Comparison (Updated for 2024)

Here is a more detailed comparison of the three time series forecasting methods, including which is likely to have less error and insights into their current popularity based on 2024 trends.


Complete Table Comparing the Forecasting Methods

| Method | Prediction Style | When to Use | Drawbacks | Example Use Case | Error Propensity | Popularity in 2024 |
|---|---|---|---|---|---|---|
| Iterative Forecasting | Predicts one step at a time | When the future depends on the immediate past (short-term) | Error accumulation due to feedback | Stock prices or energy consumption prediction | High error potential (due to error feedback) | Commonly used for simple tasks |
| Direct Multi-Step Forecasting | Predicts multiple steps at once | When future steps are loosely connected (medium-term) | May miss time step dependencies | Sales forecasting for the next few months | Moderate error, but no feedback loop errors | Moderate use, effective for medium-term |
| Seq2Seq Models | Encoder-decoder for full sequence | For long-term predictions or variable-length sequences | More complex and harder to train | Long-term financial forecasts or weather predictions | Lower error for complex or long-term tasks | Increasingly popular for complex forecasting, especially in deep learning |

Text-Based Graph Representations for Each Method

1. Iterative Forecasting (Predicting One Step at a Time)

This method predicts one step at a time and uses the predicted output as input for the next step, introducing a feedback loop.

Input:    [X₁, X₂, X₃]    --->   Y₁
                     |             |
                     v             v
Next Input: [X₂, X₃, Y₁] --->   Y₂
                     |             |
                     v             v
Next Input: [X₃, Y₁, Y₂] --->   Y₃

Error Propensity: As each prediction is used in the next step, errors from one prediction propagate through subsequent steps, leading to higher cumulative error.


2. Direct Multi-Step Forecasting with RNN

This method predicts multiple future steps at once, based on the input sequence, without any feedback loop.

Input:    [X₁, X₂, X₃]    --->   [Y₁, Y₂, Y₃]

Error Propensity: The model outputs multiple predictions at once, which can lead to moderate error, but avoids feedback loop problems.


3. Seq2Seq Models for Time Series Forecasting

Seq2Seq models use an encoder-decoder architecture, where the encoder processes the input sequence into a context vector, and the decoder generates the future sequence.

Encoder: [X₁, X₂, X₃] ---> Context Vector ---> Decoder: [Y₁, Y₂, Y₃]

Error Propensity: This model has lower error when applied to complex and long-term time series forecasting problems, as it captures dependencies across the entire sequence.


Most Popular Method in 2024

In 2024, deep learning-based models, particularly Seq2Seq models and more advanced methods like the Temporal Fusion Transformer (TFT) and N-BEATS, are increasingly popular for time series forecasting, particularly for complex, multivariate tasks and long-term forecasts. These methods are favored for their ability to capture complex patterns and dependencies, but classical models like ARIMA and iterative methods remain widely used for simpler or shorter-term tasks.

Seq2Seq models, despite their complexity, are seen as highly effective for scenarios requiring more accurate long-term predictions.







Final Summary: Time Series Forecasting Methods

Part 1: Iterative Forecasting (One Step at a Time)

Training Process

The model predicts one time step ahead and feeds that prediction as input for the next step.

Example Input: [1, 2, 3] → Predicted Output (during training): [2, 3, 4]

Inference Process

During inference, the model predicts unseen future values based on new inputs.

Example Input: [3, 4, 5] → Predicted Output (during inference): [6, 7]

Loss Calculation

The model calculates the error between the predicted value and the actual target:

\text{Loss} = \sum_{t=1}^{T} (y_t - Y_{\text{true}, t})^2

The total loss is summed over all predictions.

Backpropagation & Weight Updates

After each time step, the model uses Backpropagation Through Time (BPTT) to calculate the gradients and update the weights:

W \rightarrow W - \eta \cdot \nabla_W \text{Loss}

Where \eta is the learning rate, and \nabla_W \text{Loss} is the gradient of the loss.

Graph Representation

Training:

Input: [1, 2, 3]    --->   Y₁ = 2
  Calculate Loss    --->   Update weights for Y₁
                     |
                     v
Next Input: [2, 3, Y₁] --->   Y₂ = 3
  Calculate Loss    --->   Update weights for Y₂
                     |
                     v
Next Input: [3, Y₁, Y₂] --->   Y₃ = 4
  Calculate Loss    --->   Update weights for Y₃

Inference:

Input: [3, 4, 5] ---> Model predicts Y₁ = 6, Y₂ = 7

Error Accumulation

Since each prediction depends on the previous step’s prediction, any errors from one prediction will propagate forward, leading to higher loss over time.

Pros

  • Useful for short-term predictions where immediate dependencies exist between time steps.

Cons

  • Error accumulation: If the early predictions are wrong, subsequent steps are affected, which increases the overall error.

Part 2: Direct Multi-Step Forecasting

Training Process

The model predicts multiple future values at once without feeding the predictions back into the input.

Example Input: [1, 2, 3] → Predicted Output: [4, 5]

Inference Process

In inference, the model predicts multiple future values based on the new unseen input.

Example Input: [3, 4, 5] → Predicted Output: [6, 7]

Loss Calculation

The loss is calculated simultaneously for all predictions:

\text{Loss} = \sum_{t=1}^{T} (y_t - Y_{\text{true}, t})^2

Backpropagation & Weight Updates

The model adjusts its weights after all predictions are made:

W \rightarrow W - \eta \cdot \nabla_W \text{Loss}

Graph Representation

Training:

Input: [1, 2, 3]    --->   Y₁ = 4, Y₂ = 5
  Calculate Total Loss (for Y₁ and Y₂) --->   Update weights

Inference:

Input: [3, 4, 5] ---> Model predicts Y₁ = 6, Y₂ = 7

No Error Accumulation

Since all future steps are predicted at once, there is no error propagation between predictions, which can result in more accurate predictions.

Pros

  • Suitable for medium-term forecasting tasks where multiple predictions are needed at once.
  • Avoids the issue of feedback loops.

Cons

  • It may miss some time dependencies between individual time steps, especially if they are strongly connected.

Part 3: Seq2Seq Models (Encoder-Decoder Architecture)

Training Process

The model uses an encoder-decoder structure, where the encoder processes the input sequence into a context vector, and the decoder generates the output sequence based on that vector.

Example Input: [1, 2, 3] → Predicted Output: [4, 5]

Inference Process

The model predicts unseen future values based on new inputs and the context vector generated by the encoder.

Example Input: [3, 4, 5] → Predicted Output: [6, 7]

Loss Calculation

The total loss is calculated across all predicted outputs:

\text{Loss} = \frac{1}{2} \left( (y_1 - 4)^2 + (y_2 - 5)^2 \right)

Backpropagation & Weight Updates

The model uses Backpropagation Through Time (BPTT) to update the weights in both the encoder and decoder:

W_{\text{encoder}} \rightarrow W_{\text{encoder}} - \eta \cdot \nabla_W \text{Loss}

W_{\text{decoder}} \rightarrow W_{\text{decoder}} - \eta \cdot \nabla_W \text{Loss}

Graph Representation

Training:

Encoder: [1, 2, 3] ---> Context Vector
                                      |
                                      v
Decoder: Context Vector ---> Y₁ = 4, Y₂ = 5
  Calculate Total Loss ---> Update weights for encoder and decoder

Inference:

Encoder: [3, 4, 5] ---> Model predicts Y₁ = 6, Y₂ = 7

Long-Term Dependencies

The Seq2Seq architecture is highly effective at capturing long-term dependencies between input and output sequences. The context vector provides a summary of the entire input, allowing for more accurate predictions over longer time horizons.

Pros

  • Best for long-term forecasting, especially when there are complex relationships between different time steps.
  • The encoder-decoder structure is powerful in handling variable-length sequences.

Cons

  • Requires more computational resources due to its complexity.
  • May need more training data to perform well.

Final Thoughts: Choosing the Best Method

Each forecasting method has its strengths and weaknesses:

  • Iterative Forecasting: Ideal for short-term predictions where there are immediate dependencies between the time steps, but it suffers from error accumulation, which can lead to higher overall loss over time.
  • Direct Multi-Step Forecasting: This method avoids error feedback by predicting all future steps simultaneously, making it more suitable for medium-term forecasting. However, it might miss dependencies between individual time steps if the time steps are strongly connected.
  • Seq2Seq Models: Best for long-term predictions where each future step depends on the entire input sequence. The encoder-decoder architecture is powerful for complex and variable-length sequences but requires more computational resources and data to achieve good performance.

Which Method Decreases Loss Better?

In summary, the method that decreases loss better depends on the forecasting task:

  • Iterative Forecasting: Works well for short-term forecasting but suffers from error propagation.
  • Direct Multi-Step Forecasting: Effective when you need to predict multiple steps at once without feedback loops. However, it may miss subtle time dependencies.
  • Seq2Seq Models: Generally offer better performance in long-term forecasting because they capture complex dependencies across the entire input sequence. This helps reduce loss for complex tasks, but they require more training data and computational power.

Final Recommendation

Choosing the right forecasting method depends on the task at hand:

  • If you are predicting short-term sequences with immediate dependencies, use Iterative Forecasting.
  • If you need multiple steps predicted at once without worrying about feedback loops, opt for Direct Multi-Step Forecasting.
  • If you’re working with long-term forecasts and complex relationships between inputs and outputs, the Seq2Seq Model will likely be the best choice.

Now It Is Time to Code It

Part 1: Iterative Forecasting (One Step at a Time)

Example:

  • Input Sequence: [1, 2, 3]
  • Target Sequence (during training): [4] (the next time step)

Explanation:

  • The model predicts one time step ahead.
  • During inference, the model uses its own predictions to predict subsequent steps.
  • Key Difference: The model outputs a single value and uses iterative feedback during inference.

PyTorch Code Example


import torch
import torch.nn as nn
import torch.optim as optim

# Define the RNN model for Iterative Forecasting
class IterativeRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(IterativeRNN, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)  # Output layer predicts one step ahead

    def forward(self, x, hidden):
        out, hidden = self.rnn(x, hidden)
        # Only take the output from the last time step
        out = self.fc(out[:, -1, :])  # Line differs: Predicts one time step
        return out, hidden

# Hyperparameters
input_size = 1
hidden_size = 10
output_size = 1  # Predicting one value
learning_rate = 0.01
num_epochs = 100

# Training data
# Input sequence: [1, 2, 3]
x_train = torch.tensor([[[1.0], [2.0], [3.0]]])  # Shape: (batch_size, seq_length, input_size)
# Target value: [4] (the next time step)
y_train = torch.tensor([[4.0]])  # Shape: (batch_size, output_size)

# Initialize the model, loss function, and optimizer
model = IterativeRNN(input_size, hidden_size, output_size)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()
    hidden = torch.zeros(1, x_train.size(0), hidden_size)
    output, hidden = model(x_train, hidden)
    loss = criterion(output, y_train)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')

# Inference: Predicting future values iteratively
model.eval()
with torch.no_grad():
    input_seq = x_train.clone()
    num_predictions = 2  # Number of future steps to predict
    predictions = []
    for _ in range(num_predictions):
        hidden = torch.zeros(1, input_seq.size(0), hidden_size)  # fresh hidden state for each pass over the window
        output, hidden = model(input_seq, hidden)
        predictions.append(output.item())
        # Slide the window: drop the oldest value and append the new prediction
        input_seq = torch.cat((input_seq[:, 1:, :], output.unsqueeze(1)), dim=1)  # Line differs: Iterative feedback
    print('Predicted values:', predictions)

Explanation

  • Model Definition:
    • self.fc = nn.Linear(hidden_size, output_size)
      Explanation: The output layer predicts a single time step ahead.
  • Forward Method:
    • out = self.fc(out[:, -1, :])
      Explanation: Only the output at the last time step is used for the prediction; the model predicts one time step at a time.
  • Inference Loop:
    • input_seq = torch.cat((input_seq[:, 1:, :], output.unsqueeze(1)), dim=1)
      Explanation: This line appends the model’s prediction as the next input, implementing the iterative feedback used during inference.

Part 2: Direct Multi-Step Forecasting

Example:

  • Input Sequence: [1, 2, 3]
  • Target Sequence: [4, 5] (future values)

Explanation:

  • The model predicts multiple future steps at once.
  • It uses the final hidden state to predict all future values simultaneously.
  • Key Difference: The output layer predicts multiple steps without iterative feedback.

PyTorch Code Example


import torch
import torch.nn as nn
import torch.optim as optim

# Define the RNN model for Direct Multi-Step Forecasting
class DirectMultiStepRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_future_steps):
        super(DirectMultiStepRNN, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        # Output layer predicts multiple future steps
        self.fc = nn.Linear(hidden_size, output_size * num_future_steps)  # Line differs: Predicts multiple steps
        self.num_future_steps = num_future_steps
        self.output_size = output_size

    def forward(self, x, hidden):
        out, hidden = self.rnn(x, hidden)
        # Use the final hidden state to predict future steps
        out = self.fc(out[:, -1, :])  # Final hidden state used
        # Reshape output to (batch_size, num_future_steps, output_size)
        out = out.view(-1, self.num_future_steps, self.output_size)  # Line differs: Reshaping for multiple steps
        return out, hidden

# Hyperparameters
input_size = 1
hidden_size = 10
output_size = 1
num_future_steps = 2  # Predicting two future steps
learning_rate = 0.01
num_epochs = 100

# Training data
# Input sequence: [1, 2, 3]
x_train = torch.tensor([[[1.0], [2.0], [3.0]]])
# Target sequence: [4, 5]
y_train = torch.tensor([[[4.0], [5.0]]])  # Shape: (batch_size, num_future_steps, output_size)

# Initialize the model, loss function, and optimizer
model = DirectMultiStepRNN(input_size, hidden_size, output_size, num_future_steps)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()
    hidden = torch.zeros(1, x_train.size(0), hidden_size)
    output, hidden = model(x_train, hidden)
    loss = criterion(output, y_train)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')

# Inference: Predicting future values
model.eval()
with torch.no_grad():
    input_seq = x_train
    hidden = torch.zeros(1, input_seq.size(0), hidden_size)
    output, hidden = model(input_seq, hidden)
    predictions = output.squeeze().numpy()
    print('Predicted values:', predictions)

Explanation

  • Model Definition:
    • self.fc = nn.Linear(hidden_size, output_size * num_future_steps)
      Explanation: The output layer predicts multiple future steps at once. This is the key difference.
  • Forward Method:
    • out = out.view(-1, self.num_future_steps, self.output_size)
      Explanation: This line reshapes the output to match the target shape for multiple steps.

Part 3: Seq2Seq Models (Encoder-Decoder Architecture)

Example:

  • Input Sequence: [1, 2, 3]
  • Target Sequence: [4, 5] (predicted output sequence)

Explanation:

  • The model uses an encoder to process the input sequence into a context vector.
  • A decoder generates the output sequence based on the context vector.
  • Key Differences:
    • Separate encoder and decoder networks.
    • Use of teacher forcing during training.

PyTorch Code Example


import torch
import torch.nn as nn
import torch.optim as optim

# Define the Encoder
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)

    def forward(self, x, hidden):
        out, hidden = self.rnn(x, hidden)
        return hidden  # Return final hidden state

# Define the Decoder
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.rnn = nn.RNN(output_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden):
        out, hidden = self.rnn(x, hidden)
        out = self.fc(out)  # Apply output layer to each time step
        return out, hidden

# Hyperparameters
input_size = 1
hidden_size = 10
output_size = 1
learning_rate = 0.01
num_epochs = 100

# Training data
# Input sequence: [1, 2, 3]
x_train = torch.tensor([[[1.0], [2.0], [3.0]]])
# Target sequence: [4, 5]
y_train = torch.tensor([[[4.0], [5.0]]])  # Shape: (batch_size, target_length, output_size)

# Initialize models, loss function, and optimizers
encoder = EncoderRNN(input_size, hidden_size)
decoder = DecoderRNN(hidden_size, output_size)
criterion = nn.MSELoss()
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    encoder.train()
    decoder.train()
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Initialize hidden states
    encoder_hidden = torch.zeros(1, x_train.size(0), hidden_size)

    # Encoder forward pass
    encoder_hidden = encoder(x_train, encoder_hidden)  # Line differs: Separate encoder

    # Decoder forward pass
    decoder_input = torch.zeros(x_train.size(0), 1, output_size)  # Start token
    decoder_hidden = encoder_hidden  # Line differs: Decoder hidden state initialized with encoder hidden state
    target_length = y_train.size(1)
    outputs = torch.zeros(x_train.size(0), target_length, output_size)

    # Teacher forcing: use actual target outputs as next inputs
    for t in range(target_length):
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
        outputs[:, t:t+1, :] = decoder_output
        decoder_input = y_train[:, t:t+1, :]  # Line differs: Teacher forcing applied

    # Compute loss
    loss = criterion(outputs, y_train)
    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')

# Inference: Predicting future values
encoder.eval()
decoder.eval()
with torch.no_grad():
    input_seq = x_train
    encoder_hidden = torch.zeros(1, input_seq.size(0), hidden_size)
    encoder_hidden = encoder(input_seq, encoder_hidden)

    decoder_input = torch.zeros(input_seq.size(0), 1, output_size)  # Start token
    decoder_hidden = encoder_hidden
    predictions = []
    target_length = 2  # Number of future steps to predict

    for _ in range(target_length):
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
        predictions.append(decoder_output.item())
        decoder_input = decoder_output  # Line differs: Decoder uses its own output as next input

    print('Predicted values:', predictions)

Explanation

  • Separate Encoder and Decoder:
    • class EncoderRNN(nn.Module): and class DecoderRNN(nn.Module):
      Explanation: These separate classes handle the encoding and decoding of sequences.
  • Encoder Forward Pass:
    • encoder_hidden = encoder(x_train, encoder_hidden)
      Explanation: The encoder processes the input sequence into a context vector (its final hidden state).
  • Decoder Initialization:
    • decoder_hidden = encoder_hidden
      Explanation: The decoder’s hidden state is initialized with the encoder’s final hidden state. This is the key difference in Seq2Seq models.
  • Teacher Forcing:
    • decoder_input = y_train[:, t:t+1, :]
      Explanation: During training, the actual target output is used as the next decoder input. This is called teacher forcing and helps the model learn faster.
  • Inference Without Teacher Forcing:
    • decoder_input = decoder_output
      Explanation: During inference, the decoder uses its own previous output as the next input.

Summary of Key Differences

Iterative Forecasting:

  • Output Layer Definition:
    self.fc = nn.Linear(hidden_size, output_size)  # Predicts one step ahead
  • Forward Method:
    out = self.fc(out[:, -1, :])  # Only last output used
  • Inference Loop:
    input_seq = torch.cat((input_seq[:, 1:, :], output.unsqueeze(1)), dim=1)  # Iterative feedback

Direct Multi-Step Forecasting:

  • Output Layer Definition:
    self.fc = nn.Linear(hidden_size, output_size * num_future_steps)  # Predicts multiple steps
  • Forward Method:
    out = self.fc(out[:, -1, :])       # Final hidden state used
    out = out.view(-1, self.num_future_steps, self.output_size)  # Reshape for multiple steps

Seq2Seq Models:

  • Separate Encoder and Decoder Classes:
    class EncoderRNN(nn.Module):  # Encoder definition
    class DecoderRNN(nn.Module):  # Decoder definition
  • Decoder Initialization:
    decoder_hidden = encoder_hidden  # Decoder uses encoder's final hidden state
  • Teacher Forcing:
    decoder_input = y_train[:, t:t+1, :]  # Teacher forcing
  • Inference Without Teacher Forcing:
    decoder_input = decoder_output  # Decoder uses its own output

By comparing the lines of code across the three models, the specific features and techniques for each method become clear. Iterative forecasting uses feedback loops, direct multi-step forecasting predicts multiple steps simultaneously, and Seq2Seq models employ separate encoder and decoder networks with teacher forcing during training.





