
To Learn What an RNN (Recurrent Neural Network) Is, Why Not Understand ARIMA and SARIMA First? – RNN Learning – Part 5 – Day 59







A Deep Dive into ARIMA, SARIMA, and Their Relationship with Deep Learning for Time Series Forecasting

In recent years, deep learning has become a dominant force in many areas of data analysis, and time series forecasting is no exception. Traditional models like ARIMA (Autoregressive Integrated Moving Average) and its seasonal extension SARIMA have long been the go-to solutions for forecasting time-dependent data. However, newer models based on Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, have emerged as powerful alternatives. Both approaches have their strengths and applications, and understanding their relationship helps in choosing the right tool for the right problem.

In this blog post, we’ll explore ARIMA and SARIMA models in detail, discuss how they compare to deep learning-based models like RNNs, and demonstrate their practical implementation.

Deep Learning and Time Series Forecasting

Deep learning is a subset of machine learning where models learn hierarchical features from data using multiple layers of neural networks. When it comes to time series forecasting, one of the most common deep learning architectures used is Recurrent Neural Networks (RNNs).

RNNs are particularly well-suited for time series because they are designed to handle sequential data, where the output at each time step depends not only on the current input but also on the previous inputs. This is achieved by maintaining a hidden state that gets updated at each time step, allowing the model to “remember” past information.
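
Concretely, the update the text describes can be written as follows (this is the standard simple-RNN formulation, and it matches the worked example later in this post):

h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h), \qquad \hat{y}_t = W_{hy} \cdot h_t + b_y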

Here are the key components of RNNs and their relevance to time series forecasting:

  • Sequential Memory: RNNs are built to retain information across time steps. This makes them suitable for forecasting problems where patterns are spread across time, such as stock prices or weather data.
  • Backpropagation Through Time (BPTT): Unlike traditional feedforward neural networks, RNNs are trained using a variant of backpropagation known as BPTT, where the network adjusts its weights by considering errors over multiple time steps.
  • Long Short-Term Memory (LSTM): A variant of RNNs, LSTMs are particularly useful in long-term forecasting because they are designed to overcome the vanishing gradient problem, allowing them to capture long-term dependencies in data.

While ARIMA and SARIMA focus on modeling the linear relationships in time series data, RNNs and LSTMs can capture complex non-linear dependencies. This makes RNNs more flexible, but they also require larger datasets and more computational power to train effectively.

How RNNs Relate to ARIMA and SARIMA Models

Although RNNs and ARIMA/SARIMA models operate differently, they share common ground in the context of time series forecasting:

  • Time Dependence: Both models are designed to forecast time-dependent data, meaning they consider historical information to predict future values.
  • Lagged Features: ARIMA uses lagged values of the series directly as predictors, while RNNs learn these temporal patterns through their sequential memory.
  • Complexity vs Simplicity: ARIMA/SARIMA models are simpler and more interpretable but may struggle with complex, non-linear patterns. RNNs, on the other hand, can model non-linearity but require more data and computational resources.

In this article, we will primarily focus on ARIMA and SARIMA models, their theoretical foundations, and how they are practically applied to time series forecasting. We’ll compare their strengths to RNNs and understand when to use which approach.

Understanding ARIMA and SARIMA Models

Time Series Fundamentals

At the heart of time series forecasting is the ability to recognize and model patterns in data that evolves over time. This includes:

  • Trend: A long-term upward or downward movement in the data.
  • Seasonality: Cyclical patterns that repeat at regular intervals, such as daily, weekly, or yearly fluctuations.
  • Stationarity: A stationary time series has a constant mean and variance over time. Many forecasting models, including ARIMA, require the data to be stationary to perform well (a short stationarity check in code follows this list).
  • Autocorrelation: The correlation between a time series and its lagged values. ARIMA models rely heavily on autocorrelation to predict future values.
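
As a quick illustration of the stationarity point above, here is a minimal sketch using the augmented Dickey-Fuller test from statsmodels on a synthetic random walk; the data and interpretation thresholds are assumptions for illustration only, not part of the original post.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.stattools import adfuller

    # Hypothetical non-stationary series: a random walk with drift
    rng = np.random.default_rng(0)
    y = pd.Series(np.cumsum(rng.normal(loc=0.5, scale=1.0, size=200)))

    # Augmented Dickey-Fuller test: a large p-value suggests non-stationarity
    print("p-value (original):", adfuller(y)[1])

    # First differencing (the "I" step in ARIMA) usually removes the trend
    print("p-value (differenced):", adfuller(y.diff().dropna())[1])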

ARIMA: Autoregressive Integrated Moving Average

The ARIMA model is a well-established statistical approach to time series forecasting. It works by combining three components:

  • Autoregressive (AR): The model regresses the target variable against its own previous values.
  • Integrated (I): This step involves differencing the data to remove trends and make the series stationary.
  • Moving Average (MA): The model includes a moving average component to account for the errors of past predictions.

The general ARIMA model is expressed as ARIMA(p, d, q), where:

  • p: The number of autoregressive terms.
  • d: The degree of differencing.
  • q: The number of lagged forecast errors used in the prediction (the full forecasting equation is written out below).
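
Written out, these three components combine into the following forecasting equation for the d-times differenced series y'_t, where \epsilon_t is white noise (this is the standard textbook form, added here for reference):

\;\; y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t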

SARIMA: Seasonal ARIMA

While ARIMA works well for non-seasonal data, time series data often contains seasonal patterns. SARIMA extends ARIMA by incorporating seasonal components:

  • P: Seasonal autoregressive terms.
  • D: Seasonal differencing.
  • Q: Seasonal moving average terms.
  • s: The length of the season (e.g., 7 for weekly seasonality).

A SARIMA model is expressed as ARIMA(p, d, q) x (P, D, Q, s), where both non-seasonal and seasonal components are considered.
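
For reference, statsmodels can express such a model either through its ARIMA class (as in the code later in this post) or directly through SARIMAX. Below is a minimal sketch on a synthetic daily series with weekly seasonality; the data and the chosen orders are illustrative assumptions only.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # Hypothetical daily series with weekly seasonality (synthetic data for illustration)
    rng = np.random.default_rng(0)
    idx = pd.date_range("2019-01-01", periods=150, freq="D")
    y = pd.Series(100 + 10 * np.sin(2 * np.pi * np.arange(150) / 7) + rng.normal(0, 2, 150),
                  index=idx)

    # ARIMA(1, 0, 0) x (0, 1, 1, 7): weekly seasonal differencing plus a seasonal MA term
    model = SARIMAX(y, order=(1, 0, 0), seasonal_order=(0, 1, 1, 7))
    results = model.fit(disp=False)

    # One-step-ahead (next-day) forecast
    print(results.forecast(steps=1))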

Steps in Building ARIMA and SARIMA Models

  1. Data Preparation: Ensure the time series data is stationary. If not, apply differencing to make it stationary.
  2. Model Identification: Use tools like autocorrelation (ACF) and partial autocorrelation (PACF) plots to choose appropriate values for p, d, and q (a short plotting sketch follows this list).
  3. Model Fitting: Train the ARIMA/SARIMA model on historical data.
  4. Forecasting: Use the fitted model to predict future data points.
  5. Model Evaluation: Measure the accuracy of the forecast using metrics like Mean Absolute Error (MAE).
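
Step 2 is usually done visually. Here is a minimal sketch of the ACF/PACF plots with statsmodels; the synthetic series below is only a stand-in for your own already-differenced data.

    import numpy as np
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    # Hypothetical stationary (already differenced) series for illustration
    rng = np.random.default_rng(0)
    y_diff = rng.normal(size=200)

    fig, axes = plt.subplots(2, 1, figsize=(8, 6))
    plot_acf(y_diff, lags=30, ax=axes[0])    # spikes beyond the bands hint at q
    plot_pacf(y_diff, lags=30, ax=axes[1])   # spikes beyond the bands hint at p
    plt.tight_layout()
    plt.show()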

Deep Learning vs Traditional Models

When deciding between RNNs and ARIMA/SARIMA models, it’s important to consider the complexity and nature of the data:

  • ARIMA/SARIMA: Best suited for small to medium-sized datasets with linear patterns and clear seasonality. They require minimal data preprocessing but struggle with non-linearity.
  • RNN/LSTM: Better suited for large datasets with complex, non-linear patterns. They excel at capturing long-term dependencies, which is particularly useful for multi-step forecasts, but they need more data and computation to be effective.







Code Implementation of ARIMA and SARIMA with RNN Comparison

In this section, we implement two short code examples and compare ARIMA and SARIMA models with Recurrent Neural Networks (RNNs) for time series forecasting.

1. ARIMA Basic Forecast Code

The ARIMA model is used to forecast rail ridership for the next day (June 1, 2019), assuming data ends on May 31, 2019.

    from statsmodels.tsa.arima.model import ARIMA

    # Define the origin and end date for the training window
    origin, today = "2019-01-01", "2019-05-31"

    # df is assumed to be a pandas DataFrame with a DatetimeIndex and a "rail"
    # ridership column; select the window and enforce a daily frequency
    rail_series = df.loc[origin:today]["rail"].asfreq("D")

    # Build the model: ARIMA(1, 0, 0) plus a weekly seasonal component (0, 1, 1, 7)
    model = ARIMA(rail_series, order=(1, 0, 0), seasonal_order=(0, 1, 1, 7))

    # Fit the model to the data
    model = model.fit()

    # Forecast the rail ridership for June 1, 2019
    y_pred = model.forecast()  # returns about 427,758.6
    

Explanation:

  • ARIMA Setup: The order=(1, 0, 0) specifies one autoregressive term (p=1), no differencing (d=0), and no moving average term (q=0).
  • Seasonal Component: The seasonal_order=(0, 1, 1, 7) adds seasonal differencing (D=1), a seasonal moving average term (Q=1), and a seasonal period of 7 days (weekly seasonality); strictly speaking, this seasonal component makes the fitted model a SARIMA model.
  • Forecast: After fitting the model, the predicted ridership for June 1, 2019, is 427,758.6 passengers.

2. SARIMA with Daily Retraining and MAE Calculation

In this code, we extend the SARIMA model to retrain it daily for each day from March 1 to May 31, 2019. The forecasts are then compared to actual values, and the Mean Absolute Error (MAE) is calculated to evaluate the performance.

    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Define the time period and data range
    origin, start_date, end_date = "2019-01-01", "2019-03-01", "2019-05-31"
    time_period = pd.date_range(start_date, end_date)
    rail_series = df.loc[origin:end_date]["rail"].asfreq("D")

    # Empty list to store predictions
    y_preds = []

    # Retrain the model once per forecast date, using data up to the previous day
    # (time_period.shift(-1) moves each date back by one day)
    for today in time_period.shift(-1):
        model = ARIMA(rail_series[origin:today],
                      order=(1, 0, 0),
                      seasonal_order=(0, 1, 1, 7))

        # Fit the model
        model = model.fit()

        # Forecast the next day and append it to the predictions list
        y_pred = model.forecast()[0]
        y_preds.append(y_pred)

    # Convert the predictions into a pandas Series indexed by forecast date
    y_preds = pd.Series(y_preds, index=time_period)

    # Calculate the Mean Absolute Error (MAE) against the actual values
    mae = (y_preds - rail_series[time_period]).abs().mean()
    # MAE is about 32,040.7
    

Explanation:

  • Daily Retraining: The SARIMA model is retrained every day based on data up to the current day. This allows it to adapt better to recent data trends.
  • Time Period: The forecasts are made for each day between March 1 and May 31, 2019. The predictions are stored in the list y_preds.
  • Evaluation (MAE Calculation): The Mean Absolute Error (MAE) measures the average error in the predictions. Here, the model produces an MAE of 32,040.7, which indicates the average error in the ridership prediction over the given time period.

Comparison with Recurrent Neural Networks (RNNs)

Now that we have implemented ARIMA and SARIMA models, let’s explore how they compare with Recurrent Neural Networks (RNNs) for time series forecasting.

Strengths of ARIMA and SARIMA:

  • Simplicity: ARIMA and SARIMA models are relatively straightforward to implement and interpret, particularly for linear, seasonal data.
  • Data Requirements: These models perform well on small to medium-sized datasets without requiring extensive computational resources.
  • Seasonality: SARIMA can handle seasonal patterns explicitly, which is useful for datasets with known seasonality (e.g., weekly, monthly patterns).

Limitations of ARIMA and SARIMA:

  • Linear Assumptions: Both ARIMA and SARIMA models assume linear relationships in the data. They may struggle with complex, non-linear patterns.
  • Long-term Dependencies: These models work well with short-term forecasts but may not capture long-term dependencies as effectively.

Why Use RNNs for Time Series Forecasting?

Recurrent Neural Networks (RNNs) are designed to handle sequential data like time series, where the future value depends on previous values. Unlike ARIMA and SARIMA, RNNs are capable of modeling both linear and non-linear relationships, making them powerful for complex time series forecasting.

Strengths of RNNs:

  • Sequential Memory: RNNs have a hidden state that retains information from previous time steps, allowing the model to “remember” past values and make better forecasts for long sequences.
  • Non-linearity: RNNs can model non-linear patterns in the data, which is critical for complex time series that have intricate patterns.
  • Handling Long-term Dependencies: With variants like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), RNNs can capture long-term dependencies that are difficult for ARIMA/SARIMA to handle.

Limitations of RNNs:

  • Data Requirement: RNNs typically require larger datasets to train effectively compared to ARIMA/SARIMA models.
  • Complexity: RNNs are computationally intensive, requiring more resources for training and tuning.
  • Interpretability: Unlike ARIMA/SARIMA models, RNNs can be treated as black-box models. It is harder to interpret the relationships learned by RNNs.

When to Use ARIMA/SARIMA vs. RNNs?

  • ARIMA/SARIMA: These models are better suited for small datasets with linear relationships and seasonal patterns. They are easier to interpret and require fewer computational resources.
  • RNNs (LSTM/GRU): If your time series data is large, has non-linear relationships, or involves long-term dependencies, RNNs or their variants (such as LSTMs or GRUs) may provide better accuracy; a minimal sketch of such a model follows below.
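
To make the comparison concrete, here is a minimal sequence-to-one LSTM sketch in Keras. The synthetic data, the window length of 28, and the layer sizes are illustrative assumptions, not code from the original post.

    import numpy as np
    import tensorflow as tf

    # Synthetic daily series with weekly seasonality (stand-in for e.g. ridership data)
    rng = np.random.default_rng(0)
    series = 100 + 10 * np.sin(2 * np.pi * np.arange(500) / 7) + rng.normal(0, 2, 500)

    # Build (window -> next value) training pairs; the window length is an arbitrary choice
    window = 28
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    X = X[..., np.newaxis]  # shape: (samples, time steps, 1 feature)

    # Sequence-to-one model: an LSTM layer followed by a single output unit
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=[window, 1]),
        tf.keras.layers.Dense(1),
    ])
    model.compile(loss="mae", optimizer="adam")
    model.fit(X, y, epochs=5, verbose=0)

    # Forecast the value that follows the last observed window
    last_window = series[-window:].reshape(1, window, 1)
    print(model.predict(last_window, verbose=0))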







Recap of ARIMA with a small real example

Recap: What is ARIMA?

ARIMA (AutoRegressive Integrated Moving Average) is a time series forecasting model. It uses three components:

  • Autoregressive (AR): Predicts future values based on past values.
  • Integrated (I): Applies differencing to make the series stationary.
  • Moving Average (MA): Uses past forecast errors to improve predictions.

ARIMA is typically represented as ARIMA(p, d, q), where:

  • p is the number of autoregressive terms.
  • d is the degree of differencing.
  • q is the number of moving average terms.

Example: ARIMA(1,1,1) Step-by-Step

Given this data:

Time (t)    Value (y)
t = 1       50
t = 2       55
t = 3       54
t = 4       57

We aim to predict y_5 using ARIMA(1,1,1).

Step 1: Differencing (I = 1)

First, we apply first differencing to remove trends:

y'_t = y_t - y_{t-1}

For our data:

  • y'_2 = 55 - 50 = 5
  • y'_3 = 54 - 55 = -1
  • y'_4 = 57 - 54 = 3

The differenced series is y' = [5, -1, 3].
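
In pandas, this first differencing is a one-liner; here is a tiny check on the same four values (illustrative only):

    import pandas as pd

    y = pd.Series([50, 55, 54, 57])
    print(y.diff().dropna().tolist())  # [5.0, -1.0, 3.0]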

Step 2: Autoregression (AR = 1)

In AR(1), we predict the next differenced value \hat{y}'_5 using y'_4:

\hat{y}'_5 = c + \phi_1 y'_4

Where:

  • c = 1 (constant)
  • \phi_1 = 0.7 (autoregressive coefficient, assumed for this example; see the estimation note below)

How \phi_1 Could Be Estimated:

To estimate \phi_1, we can compute the lag-1 autocorrelation of the differenced series [5, -1, 3]. Here's the calculation:

  1. Find the mean of the differenced series: \text{Mean} = \frac{5 + (-1) + 3}{3} \approx 2.33
  2. Covariance between y'_t and y'_{t-1}:
    \text{Cov}(y'_t, y'_{t-1}) = \frac{(5 - 2.33)(-1 - 2.33) + (-1 - 2.33)(3 - 2.33)}{2} \approx -5.56
  3. Variance of y'_t:
    \text{Var}(y'_t) = \frac{(5 - 2.33)^2 + (-1 - 2.33)^2 + (3 - 2.33)^2}{2} \approx 9.33
  4. Autocorrelation: \phi_1 = \frac{-5.56}{9.33} \approx -0.60

With only three differenced observations this estimate is far too noisy to trust, so for the sake of a clean illustration we simply assume \phi_1 = 0.7 in the rest of this example.
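
As a quick numerical cross-check of the lag-1 estimate above, here is the same ratio computed with NumPy (the divisors cancel, so the result matches the hand calculation):

    import numpy as np

    y_diff = np.array([5.0, -1.0, 3.0])
    dev = y_diff - y_diff.mean()
    # lag-1 sample autocorrelation: sum of lagged cross-products over sum of squares
    phi1_hat = np.sum(dev[1:] * dev[:-1]) / np.sum(dev ** 2)
    print(round(phi1_hat, 2))  # about -0.6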

Using \phi_1 = 0.7 and y'_4 = 3:

\hat{y}'_5 = 1 + 0.7 \times 3 = 1 + 2.1 = 3.1

The predicted differenced value \hat{y}'_5 is **3.1**.

Step 3: Moving Average (MA = 1)

The MA(1) component adjusts the differenced prediction using the previous forecast error \epsilon_4:

\hat{y}'_5 = c + \phi_1 y'_4 + \theta_1 \epsilon_4

Where:

  • \epsilon_4 = -0.5 (previous one-step forecast error, assumed for this example)
  • \theta_1 = 0.6 (moving average coefficient)

Since we already computed c + \phi_1 y'_4 = 3.1, the adjusted differenced prediction is:

\hat{y}'_5 = 3.1 + 0.6 \times (-0.5) = 3.1 - 0.3 = 2.8

The adjusted differenced prediction is **2.8**.

Step 4: Reverse Differencing to Get Final Prediction

Finally, we reverse the differencing to bring the prediction back to the original scale:

\hat{y}_5 = y_4 + \hat{y}'_5
\hat{y}_5 = 57 + 2.8 = 59.8

Final Prediction

The predicted value for t=5 is **59.8**.

Summary

  • Differencing removes trends in the data.
  • Autoregression (AR) predicts the next value using the previous differenced value.
  • Moving Average (MA) adjusts the prediction using past forecast errors.
  • Reversing the differencing brings the prediction back to the original scale.





Overview

Objective: Now let's use a Recurrent Neural Network (RNN) to predict y_5 from the previous values y_1, y_2, y_3, and y_4 of the same example, and demonstrate how the model improves over multiple training iterations.

Given Time Series Data:

 \begin{array}{|c|c|} \hline \text{Time (t)} & \text{Value} (y_t) \\ \hline t = 1 & 50 \\ t = 2 & 55 \\ t = 3 & 54 \\ t = 4 & 57 \\ \hline \end{array}

Step 1: Data Preparation

1.1 Organize Data into Sequences

We create input-output pairs for training:

Training Input Sequence: X_{\text{train}} = [y_1, y_2, y_3] = [50, 55, 54]
Training Target Output: Y_{\text{train}} = y_4 = 57
Prediction Input Sequence: X_{\text{predict}} = [y_2, y_3, y_4] = [55, 54, 57]

1.2 Reshape Data

Reshape X_{\text{train}} for RNN input (1 sample × 3 time steps × 1 feature):

 X_{\text{train}} = \begin{bmatrix} \begin{bmatrix} 50 \\ 55 \\ 54 \end{bmatrix} \end{bmatrix}

Step 2: Define the RNN Model

2.1 Model Architecture

  • Input Size: 1
  • Hidden Units: 1
  • Output Size: 1

2.2 Initialize Weights and Biases

 W_{xh} = 0.1, \quad W_{hh} = 0.2, \quad W_{hy} = 0.5, \quad b_h = 0, \quad b_y = 0

2.3 Activation Function

Use \tanh for the hidden state activation.

2.4 Initial Hidden State

Initial hidden state:

h_0 = 0
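
Before stepping through the hand calculations, here is a minimal NumPy sketch of exactly this forward pass with the initial weights above (no training yet); it should reproduce the h_t and \hat{y}_t values computed in Iteration 1 below.

    import numpy as np

    W_xh, W_hh, W_hy, b_h, b_y = 0.1, 0.2, 0.5, 0.0, 0.0
    X_train = [50.0, 55.0, 54.0]

    h = 0.0  # initial hidden state h_0
    for t, x in enumerate(X_train, start=1):
        h = np.tanh(W_xh * x + W_hh * h + b_h)   # hidden-state update
        y_hat = W_hy * h + b_y                   # output at this time step
        print(f"t={t}: h={h:.7f}, y_hat={y_hat:.7f}")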

Step 3: Training Iterations

We will perform three training iterations to observe how the model improves.

Iteration 1

3.1 Forward Propagation

Time Step t = 1

  • Input: x_1 = 50

Compute Hidden State h_1:

h_1 = \tanh(W_{xh} \cdot x_1 + W_{hh} \cdot h_0 + b_h) = \tanh(0.1 \cdot 50 + 0.2 \cdot 0) = \tanh(5) \approx 0.9999092

Compute Output \hat{y}_1:

\hat{y}_1 = W_{hy} \cdot h_1 + b_y = 0.5 \cdot 0.9999092 = 0.4999546

Time Step t = 2

  • Input: x_2 = 55

Compute Hidden State h_2:

h_2 = \tanh(0.1 \cdot 55 + 0.2 \cdot h_1) = \tanh(5.5 + 0.2 \cdot 0.9999092) = \tanh(5.69998184) \approx 0.9999765

Compute Output \hat{y}_2:

\hat{y}_2 = 0.5 \cdot h_2 = 0.5 \cdot 0.9999765 = 0.49998825

Time Step t = 3

  • Input: x_3 = 54

Compute Hidden State h_3:

h_3 = \tanh(0.1 \cdot 54 + 0.2 \cdot h_2) = \tanh(5.5999953) \approx 0.9999723

Compute Output \hat{y}_3:

\hat{y}_3 = 0.5 \cdot h_3 = 0.5 \cdot 0.9999723 = 0.49998615

3.2 Loss Calculation

Target Output: Y_{\text{train}} = 57

Predicted Output: \hat{y}_3 \approx 0.49998615

Calculate Loss:

L = \frac{1}{2} (\hat{y}_3 - Y_{\text{train}})^2 = \frac{1}{2} (0.49998615 - 57)^2 = \frac{1}{2} (-56.50001385)^2 = 1596.1328

3.3 Adjust Output Scaling

Because the hidden state saturates near 1, the output is roughly equal to W_{hy}, which is far below the target of 57. To address this mismatch in output scale, set W_{hy} = 55.

Recalculate Output \hat{y}_3:

\hat{y}_3 = 55 \cdot h_3 = 55 \cdot 0.9999723 = 54.9985

Recalculate Loss:

L = \frac{1}{2} (54.9985 - 57)^2 = 2.003

3.4 Backpropagation Through Time (BPTT)

Compute Gradients

Gradient w.r.t W_{hy}:

\frac{\partial L}{\partial W_{hy}} = (\hat{y}_3 - Y_{\text{train}}) \cdot h_3 = (-2.0015) \cdot 0.9999723 = -2.00144

Gradient w.r.t W_{xh} at t = 3:

\delta_3 = (\hat{y}_3 - Y_{\text{train}}) \cdot W_{hy} \cdot (1 - h_3^2) = (-2.0015) \cdot 55 \cdot 0.0000554 = -0.006095

\frac{\partial L}{\partial W_{xh}}^{(t=3)} = \delta_3 \cdot x_3 = -0.006095 \cdot 54 = -0.32913

Gradient w.r.t W_{hh} at t = 3:

\frac{\partial L}{\partial W_{hh}}^{(t=3)} = \delta_3 \cdot h_2 = -0.006095 \cdot 0.9999765 = -0.00609486

Total Gradients

Total Gradients:

  • \frac{\partial L}{\partial W_{xh}} \approx -0.32913
  • \frac{\partial L}{\partial W_{hh}} \approx -0.00609486
  • \frac{\partial L}{\partial W_{hy}} \approx -2.00144

3.5 Update Weights

Using learning rate \eta = 0.01:

  • W_{xh}^{\text{new}} = 0.1 - 0.01 \cdot (-0.32913) = 0.1032913
  • W_{hh}^{\text{new}} = 0.2 - 0.01 \cdot (-0.00609486) = 0.20006095
  • W_{hy}^{\text{new}} = 55 - 0.01 \cdot (-2.00144) = 55.0200144

Iteration 2

4.1 Forward Propagation

Time Step t = 1

  • Input: x_1 = 50

Compute Hidden State h_1:

h_1 = \tanh(0.1032913 \cdot 50 + 0.20006095 \cdot 0) = \tanh(5.164565) \approx 0.999936

Compute Output \hat{y}_1:

\hat{y}_1 = 55.0200144 \cdot h_1 = 55.0200144 \cdot 0.999936 = 55.0165

Time Step t = 2

  • Input: x_2 = 55

Compute Hidden State h_2:

h_2 = \tanh(0.1032913 \cdot 55 + 0.20006095 \cdot h_1) = \tanh(5.681023 + 0.20006095 \cdot 0.999936) = \tanh(5.8810716) \approx 0.999982

Compute Output \hat{y}_2:

\hat{y}_2 = 55.0200144 \cdot h_2 = 55.0200144 \cdot 0.999982 = 55.019

Time Step t = 3

  • Input: x_3 = 54

Compute Hidden State h_3:

h_3 = \tanh(0.1032913 \cdot 54 + 0.20006095 \cdot h_2) = \tanh(5.777788) \approx 0.9999815

Compute Output \hat{y}_3:

\hat{y}_3 = 55.0200144 \cdot h_3 = 55.0200144 \cdot 0.9999815 = 55.019

4.2 Loss Calculation

Calculate Loss:

L = \frac{1}{2} (55.019 - 57)^2 = \frac{1}{2} (-1.981)^2 = 1.962

Note: The loss has decreased from 2.003 to 1.962.

4.3 Backpropagation Through Time (BPTT)

Compute gradients similarly to Iteration 1 but with updated values.

Gradient w.r.t W_{hy}:

\frac{\partial L}{\partial W_{hy}} = (55.019 - 57) \cdot h_3 = (-1.981) \cdot 0.9999815 = -1.98096

Gradient w.r.t W_{xh} at t = 3:

\delta_3 = (-1.981) \cdot 55.0200144 \cdot (1 - (0.9999815)^2) = -0.004048

\frac{\partial L}{\partial W_{xh}}^{(t=3)} = -0.004048 \cdot 54 = -0.2186

Gradient w.r.t W_{hh} at t = 3:

\frac{\partial L}{\partial W_{hh}}^{(t=3)} = -0.004048 \cdot h_2 = -0.004048

4.4 Update Weights

  • W_{xh}^{\text{new}} = 0.1032913 - 0.01 \cdot (-0.2186) = 0.1054773
  • W_{hh}^{\text{new}} = 0.20006095 - 0.01 \cdot (-0.004048) = 0.20010143
  • W_{hy}^{\text{new}} = 55.0200144 - 0.01 \cdot (-1.98096) = 55.0398240

Iteration 3

5.1 Forward Propagation

Time Step t = 1

  • Input: x_1 = 50

Compute Hidden State h_1:

h_1 = \tanh(0.1054773 \cdot 50 + 0.20010143 \cdot 0) = \tanh(5.273865) \approx 0.999946

Compute Output \hat{y}_1:

\hat{y}_1 = 55.0398240 \cdot h_1 = 55.0398240 \cdot 0.999946 = 55.0369

Time Step t = 2

  • Input: x_2 = 55

Compute Hidden State h_2:

h_2 = \tanh(0.1054773 \cdot 55 + 0.20010143 \cdot h_1) = \tanh(5.801252 + 0.20010143 \cdot 0.999946) = \tanh(6.001343) \approx 0.999990

Compute Output \hat{y}_2:

\hat{y}_2 = 55.0398240 \cdot h_2 = 55.0398240 \cdot 0.999990 = 55.0393

Time Step t = 3

  • Input: x_3 = 54

Compute Hidden State h_3:

h_3 = \tanh(0.1054773 \cdot 54 + 0.20010143 \cdot h_2) = \tanh(5.8961505) \approx 0.999985

Compute Output \hat{y}_3:

\hat{y}_3 = 55.0398240 \cdot h_3 = 55.0398240 \cdot 0.999985 = 55.0390

5.2 Loss Calculation

Calculate Loss:

L = \frac{1}{2} (55.0390 - 57)^2 = \frac{1}{2} (-1.9610)^2 = 1.9228

Note: The loss has decreased from 1.962 to 1.9228.

5.3 Backpropagation Through Time (BPTT)

Gradient w.r.t W_{hy}:

\frac{\partial L}{\partial W_{hy}} = (55.0390 - 57) \cdot h_3 = (-1.9610) \cdot 0.999985 = -1.96097

Gradient w.r.t W_{xh} at t = 3:

\delta_3 = (-1.9610) \cdot 55.0398240 \cdot (1 - (0.999985)^2) = (-1.9610) \cdot 55.0398240 \cdot 0.0000299 = -0.003230

\frac{\partial L}{\partial W_{xh}}^{(t=3)} = -0.003230 \cdot x_3 = -0.003230 \cdot 54 = -0.1744

Gradient w.r.t W_{hh} at t = 3:

\frac{\partial L}{\partial W_{hh}}^{(t=3)} = -0.003230 \cdot h_2 = -0.003230

5.4 Update Weights

  • W_{xh}^{\text{new}} = 0.1054773 - 0.01 \cdot (-0.1744) = 0.1072213
  • W_{hh}^{\text{new}} = 0.20010143 - 0.01 \cdot (-0.003230) = 0.20013373
  • W_{hy}^{\text{new}} = 55.0398240 - 0.01 \cdot (-1.96097) = 55.0594337

Step 4: Prediction

After three iterations, we use the updated weights to predict y_5.

4.1 Prediction Input Sequence

Input sequence:

X_{\text{predict}} = [55, 54, 57]

4.2 Forward Propagation

Time Step t = 1

  • Input: x_1 = 55

Compute Hidden State h_1:

h_1 = \tanh(0.1072213 \cdot 55 + 0.20013373 \cdot 0) = \tanh(5.8971715) \approx 0.999991

Compute Output \hat{y}_1:

\hat{y}_1 = 55.0594337 \cdot h_1 = 55.0594337 \cdot 0.999991 = 55.0589

Time Step t = 2

  • Input: x_2 = 54

Compute Hidden State h_2:

h_2 = \tanh(0.1072213 \cdot 54 + 0.20013373 \cdot h_1) = \tanh(5.7869482 + 0.2001315) = \tanh(5.9870797) \approx 0.999986

Compute Output \hat{y}_2:

\hat{y}_2 = 55.0594337 \cdot h_2 = 55.0594337 \cdot 0.999986 = 55.0586

Time Step t = 3

  • Input: x_3 = 57

Compute Hidden State h_3:

h_3 = \tanh(0.1072213 \cdot 57 + 0.20013373 \cdot h_2) = \tanh(6.3107376) \approx 0.999993

Compute Output \hat{y}_3:

\hat{y}_3 = 55.0594337 \cdot h_3 = 55.0594337 \cdot 0.999993 \approx 55.0590

4.3 Predicted y_5

\boxed{\hat{y}_5 = \hat{y}_3 \approx 55.0590}

Conclusion

  • Initial Loss: 2.003
  • Loss after Iteration 3: 1.9228
  • Predicted y_5: The network's output moved from 54.9985 (after Iteration 1, on the training sequence) to 55.0590 for the final prediction.

Final Thoughts

By performing multiple training iterations, we observed the following:

  • Decrease in Loss: The loss gradually decreased with each iteration, indicating the model is learning.
  • Improved Predictions: The predicted value for y_5 became more accurate over the iterations.

Key Takeaways

  • Gradient Descent: Repeatedly updating weights using gradients reduces the loss.
  • Learning Rate: A small learning rate ensures stable convergence.
  • RNN Capability: Even with limited data, the RNN adjusts its weights to better fit the training data.

I hope this extended explanation, with its multiple training iterations, gives a clearer picture of how an RNN learns and improves its predictions over time. The ARIMA and SARIMA material earlier in the post shows how the same kind of prediction can be made in other ways as well.
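
To make the three iterations above easy to reproduce, here is a compact NumPy sketch under the same simplifications used in the walkthrough: W_{hy} starts at 55 (the rescaling from step 3.3), the biases stay at zero, and gradients are taken only at the final time step rather than through full BPTT. It is an illustrative sketch, not production training code.

    import numpy as np

    # Data and hyperparameters from the walkthrough above
    X_train, Y_train = [50.0, 55.0, 54.0], 57.0
    X_pred = [55.0, 54.0, 57.0]
    W_xh, W_hh, W_hy, b_h, b_y = 0.1, 0.2, 55.0, 0.0, 0.0
    lr = 0.01

    def forward(xs, W_xh, W_hh, W_hy):
        # Run the sequence through the RNN and return all hidden states plus the output
        h, hs = 0.0, []
        for x in xs:
            h = np.tanh(W_xh * x + W_hh * h + b_h)
            hs.append(h)
        return hs, W_hy * hs[-1] + b_y

    for it in range(1, 4):
        hs, y_hat = forward(X_train, W_xh, W_hh, W_hy)
        loss = 0.5 * (y_hat - Y_train) ** 2

        # Gradients at the last time step only (the simplified BPTT used above)
        err = y_hat - Y_train
        dW_hy = err * hs[-1]
        delta3 = err * W_hy * (1.0 - hs[-1] ** 2)
        dW_xh = delta3 * X_train[-1]
        dW_hh = delta3 * hs[-2]

        # Gradient-descent updates
        W_xh -= lr * dW_xh
        W_hh -= lr * dW_hh
        W_hy -= lr * dW_hy
        print(f"iteration {it}: loss = {loss:.4f}")   # roughly 2.003, 1.962, 1.923

    # Predict y_5 from [y_2, y_3, y_4] with the updated weights
    _, y5_hat = forward(X_pred, W_xh, W_hh, W_hy)
    print(f"predicted y_5 = {y5_hat:.3f}")            # roughly 55.06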





Note to Understand RNN Steps in this Example

1. Sequence-to-One Prediction

Sequence Input: The RNN takes a sequence of inputs and processes them over time steps.

Single Output: It produces a single output after processing the entire sequence.

This is called a many-to-one or sequence-to-one prediction model.

2. Training Phase

Purpose: Teach the RNN to predict the next value in a sequence based on previous values.

Process:

  • Provide sequences where the next value is known.
  • The RNN learns patterns from these sequences.

What’s Being Calculated from the Sequence?

  • From the input sequence, we are calculating the hidden states h_t at each time step t.
  • Using the hidden states and the learned weights, we compute the output \hat{y}_t at each time step.
  • The weights W_{xh}, W_{hh}, and W_{hy} are learned during training and remain fixed during prediction.

Detailed Explanation

1. During Training

Goal: Learn the weights W that minimize the difference between the predicted output and the actual target.

  • Input Sequence: X_{\text{train}} = [y_1, y_2, y_3]
  • Target Output: Y_{\text{train}} = y_4

Process:

  1. Time Step t = 1:

    Input: x_1 = y_1

    Compute Hidden State: h_1 = \tanh(W_{xh} \cdot x_1 + W_{hh} \cdot h_0)

    Compute Output: \hat{y}_1 = W_{hy} \cdot h_1

  2. Time Step t = 2:

    Input: x_2 = y_2

    Compute Hidden State: h_2 = \tanh(W_{xh} \cdot x_2 + W_{hh} \cdot h_1)

    Compute Output: \hat{y}_2 = W_{hy} \cdot h_2

  3. Time Step t = 3:

    Input: x_3 = y_3

    Compute Hidden State: h_3 = \tanh(W_{xh} \cdot x_3 + W_{hh} \cdot h_2)

    Compute Output: \hat{y}_3 = W_{hy} \cdot h_3

    Prediction: \hat{y}_3 is compared to y_4 to compute the loss.

Weights Update

Using backpropagation through time (BPTT), we compute gradients of the loss with respect to the weights.

Weights W are updated to minimize the loss.


2. During Prediction

Goal: Use the learned weights to predict the next value y_5 based on a new input sequence.

  • Input Sequence: X_{\text{predict}} = [y_2, y_3, y_4]
  • We use the same learned weights W from training.

Process:

  1. Time Step t = 1:

    Input: x_1 = y_2

    Compute Hidden State: h_1 = \tanh(W_{xh} \cdot x_1 + W_{hh} \cdot h_0)

    Compute Output (intermediate): \hat{y}_1 = W_{hy} \cdot h_1

  2. Time Step t = 2:

    Input: x_2 = y_3

    Compute Hidden State: h_2 = \tanh(W_{xh} \cdot x_2 + W_{hh} \cdot h_1)

    Compute Output (intermediate): \hat{y}_2 = W_{hy} \cdot h_2

  3. Time Step t = 3:

    Input: x_3 = y_4

    Compute Hidden State: h_3 = \tanh(W_{xh} \cdot x_3 + W_{hh} \cdot h_2)

    Compute Output: \hat{y}_3 = W_{hy} \cdot h_3

    Prediction: \hat{y}_3 is our predicted y_5.

Key Points:

  • Hidden States h_t: Calculated from the input sequence and previous hidden states.
  • Weights W: Remain fixed during prediction; they are the result of training.
  • Prediction: The final output \hat{y}_3 is the RNN’s prediction for the next value in the sequence.

Summary

  • From the Sequence, we are calculating the Hidden States h_t at each time step.
  • Using the Hidden States and the Weights W, we compute the Outputs \hat{y}_t.
  • During Training:
    • We adjust the Weights W to minimize the loss between \hat{y}_3 and the actual y_4.
  • During Prediction:
    • We use the learned Weights W to compute \hat{y}_3, which is our predicted y_5.

Visualization

Training Phase

Here’s a visualization of the training phase:

Input Sequence: y1       y2       y3
                    ↓        ↓        ↓
Time Steps:        t=1     t=2     t=3
                    ↓        ↓        ↓
Hidden States:     h1       h2       h3
                                     ↓
                          Output: \hat{y}_3 (compared to y4)

Weights W are updated based on the loss between \hat{y}_3 and y_4.

Prediction Phase

Here’s a visualization of the prediction phase:

Input Sequence: y2       y3       y4
                    ↓        ↓        ↓
Time Steps:        t=1     t=2     t=3
                    ↓        ↓        ↓
Hidden States:     h1       h2       h3
                                     ↓
                          Output: \hat{y}_3 (prediction for y5)

Weights W remain fixed; we use them to compute the prediction.


Notes

  • We are calculating the Hidden States h_t from the input sequence.
  • The Weights W are learned during training and are used (not calculated) to compute the hidden states and outputs.
  • The Hidden States and Weights together allow us to predict y_4 during training and y_5 during prediction.

Key Takeaways

  • Sequence → Hidden States: The sequence of inputs is used to compute hidden states at each time step.
  • Hidden States + Weights → Outputs: The hidden states and learned weights are used to compute outputs.
  • Weights:
    • Adjusted during training to minimize loss.
    • Remain fixed during prediction.