
3 Types of Gradient Descent: Batch, Stochastic & Mini-Batch _ Day 8





Understanding Gradient Descent: Batch, Stochastic, and Mini-Batch


Learn the key differences between Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent, and how to apply them in your machine learning models.

Batch Gradient Descent

Batch Gradient Descent uses the entire dataset to calculate the gradient of the cost function at every step, which gives stable, consistent steps toward an optimal solution. Because each update requires a full pass over the data, it is computationally expensive, so it is best suited to smaller datasets where an exact gradient matters.

Formula:

\[\theta := \theta - \eta \cdot \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\]

  • \(\theta\) = parameters
  • \(\eta\) = learning rate
  • \(m\) = number of training examples
  • \(\nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\) = gradient of the cost function for the \(i\)-th training example
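As a rough illustration of this update rule, here is a minimal pure-Python sketch. It assumes a one-feature linear model \( y = \theta_0 + \theta_1 \cdot x \) with a squared-error cost per example. The function name `batch_gradient_step`, the toy data, and the learning rate are made up for this example and do not come from any library.

```python
def batch_gradient_step(theta0, theta1, xs, ys, lr):
    """One batch gradient descent update for y = theta0 + theta1 * x (squared-error cost)."""
    m = len(xs)
    # Residual (prediction minus target) for every training example.
    residuals = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    # Average the per-example gradients over all m examples.
    # The factor 2 comes from differentiating the squared error.
    grad0 = 2.0 / m * sum(residuals)
    grad1 = 2.0 / m * sum(r * x for r, x in zip(residuals, xs))
    return theta0 - lr * grad0, theta1 - lr * grad1


# Hypothetical toy data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = 0.0, 0.0
for _ in range(2000):
    # Every step uses the whole dataset.
    theta0, theta1 = batch_gradient_step(theta0, theta1, xs, ys, lr=0.05)

print(round(theta0, 3), round(theta1, 3))
```

On this noise-free toy data the parameters should approach 1 and 2. A learning rate that is too large for the scale of the inputs makes the same loop diverge.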

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent updates the parameters after each individual training example. Because every update is cheap and based on a noisy gradient estimate, the method adapts quickly to new data, and the noise can help it escape shallow local minima more readily than Batch Gradient Descent. It is particularly useful for large datasets and online learning environments.

Formula:

\[\theta := \theta - \eta \cdot \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\]

  • \(\theta\) = parameters
  • \(\eta\) = learning rate
  • \(\nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\) = gradient of the cost function for a single training example
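The single-example update can be sketched in pure Python as follows. This assumes a one-feature linear model \( y = \theta_0 + \theta_1 \cdot x \) and a squared-error cost. The function name `sgd_epoch`, the toy data, the seed, and the learning rate are illustrative choices only.

```python
import random


def sgd_epoch(theta0, theta1, xs, ys, lr, rng):
    """One pass over the data, updating the parameters after every single example."""
    order = list(range(len(xs)))
    rng.shuffle(order)  # visit the examples in a random order each epoch
    for i in order:
        residual = theta0 + theta1 * xs[i] - ys[i]
        # Gradient of the squared error for example i only (no averaging).
        grad0 = 2.0 * residual
        grad1 = 2.0 * residual * xs[i]
        theta0, theta1 = theta0 - lr * grad0, theta1 - lr * grad1
    return theta0, theta1


# Hypothetical toy data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
theta0, theta1 = 0.0, 0.0
for _ in range(500):
    theta0, theta1 = sgd_epoch(theta0, theta1, xs, ys, lr=0.02, rng=rng)

print(round(theta0, 3), round(theta1, 3))
```

Each epoch performs one update per example (four here), so the parameters move far more often than with a full-batch step. Each individual move follows only one example's gradient.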

Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a hybrid approach that divides the dataset into small batches and updates parameters for each batch. This method balances the robustness of Batch Gradient Descent with the flexibility and speed of Stochastic Gradient Descent, making it highly efficient, especially on modern hardware that can parallelize computations.

Formula:

\[\theta := \theta - \eta \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\]

  • \(\theta\) = parameters
  • \(\eta\) = learning rate
  • \(n\) = number of training examples in a mini-batch
  • \(\nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\) = gradient of the cost function for the \(i\)-th example in the mini-batch
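A minimal pure-Python sketch of the mini-batch update is shown here. It again assumes a one-feature linear model \( y = \theta_0 + \theta_1 \cdot x \) with a squared-error cost. The name `minibatch_epoch`, the toy data, the seed, and the hyperparameters are assumptions made for illustration.

```python
import random


def minibatch_epoch(theta0, theta1, xs, ys, lr, batch_size, rng):
    """One pass over the data, updating the parameters once per mini-batch."""
    indices = list(range(len(xs)))
    rng.shuffle(indices)  # reshuffle so batches differ between epochs
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        n = len(batch)  # the last batch may be smaller than batch_size
        residuals = [theta0 + theta1 * xs[i] - ys[i] for i in batch]
        # Average the per-example gradients over the n examples in this batch.
        grad0 = 2.0 / n * sum(residuals)
        grad1 = 2.0 / n * sum(r * xs[i] for r, i in zip(residuals, batch))
        theta0, theta1 = theta0 - lr * grad0, theta1 - lr * grad1
    return theta0, theta1


# Hypothetical toy data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
theta0, theta1 = 0.0, 0.0
for _ in range(1000):
    theta0, theta1 = minibatch_epoch(theta0, theta1, xs, ys, lr=0.02, batch_size=2, rng=rng)

print(round(theta0, 3), round(theta1, 3))
```

In this sketch, setting `batch_size=len(xs)` reduces to a full-batch update and `batch_size=1` reduces to the single-example update, so the three methods differ only in how many examples feed each step.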

Key Takeaways

  • Batch Gradient Descent: Best for smaller datasets; provides stable but slow convergence.
  • Stochastic Gradient Descent (SGD): Suitable for large datasets and online learning; faster updates but can be noisy.
  • Mini-Batch Gradient Descent: Combines the benefits of both; the usual choice for deep learning, balancing performance and stability.

Choosing the right gradient descent method can significantly impact the efficiency and performance of your machine learning model. For most deep learning tasks, Mini-Batch Gradient Descent is the preferred choice due to its balanced approach. For more in-depth guides and practical implementations, check out resources from Machine Learning Mastery and Analytics Vidhya.




Gradient Descent Example for Small Business Revenue Prediction


Introduction

This part of the post uses simple linear regression to predict revenue from marketing spend. A small dataset illustrates how each gradient descent method updates the parameters and how that affects the efficiency and accuracy of the model.

Dataset and Initial Setup

Our dataset consists of three data points representing daily marketing spend and the revenue:

  • Day 1: \(x_1 = \$2000\), \(y_1 = \$3000\)
  • Day 2: \(x_2 = \$3000\), \(y_2 = \$5000\)
  • Day 3: \(x_3 = \$2500\), \(y_3 = \$4000\)

Model: \( y = \theta_0 + \theta_1 \cdot x \)

Initial Parameters: \( \theta_0 = 0 \), \( \theta_1 = 0 \)

Learning Rate: \( \eta = 0.000001 \)

Applying Gradient Descent Methods

Batch Gradient Descent

  1. Calculate the gradient for each parameter across all data points.
  2. Update parameters based on these gradients.
  3. Repeat until convergence or for a fixed number of iterations.

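Gradient Calculations and Update (One iteration):

With \( \theta_0 = \theta_1 = 0 \), every prediction is 0, so each residual (prediction minus actual revenue) is \(-y^{(i)}\). The gradient of the MSE cost is \(\frac{2}{m}\) times the sum of the residuals for \( \theta_0 \), and \(\frac{2}{m}\) times the sum of the residuals multiplied by \(x^{(i)}\) for \( \theta_1 \):

  • Gradient of \( \theta_0 \) = \(\frac{2}{3} [(-3000) + (-5000) + (-4000)] = -8000\)
  • Gradient of \( \theta_1 \) = \(\frac{2}{3} [(-3000 \times 2000) + (-5000 \times 3000) + (-4000 \times 2500)] = -20666666.67\)
  • Updated \( \theta_0 \) = \(0 - 0.000001 \times (-8000) = 0.008\)
  • Updated \( \theta_1 \) = \(0 - 0.000001 \times (-20666666.67) = 20.67\)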
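One way to check the arithmetic of this first batch iteration is to evaluate it directly. The short script below is a sketch that hard-codes the three data points, the initial parameters, and the learning rate from this example. The variable names are chosen here for illustration.

```python
# The three (marketing spend, revenue) pairs from the example.
xs = [2000.0, 3000.0, 2500.0]
ys = [3000.0, 5000.0, 4000.0]

theta0, theta1 = 0.0, 0.0  # initial parameters
lr = 0.000001              # learning rate
m = len(xs)

# Residuals (prediction minus actual) at the initial parameters.
residuals = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]

# Gradients of the mean squared error with respect to each parameter.
grad0 = 2.0 / m * sum(residuals)
grad1 = 2.0 / m * sum(r * x for r, x in zip(residuals, xs))

# A single batch gradient descent update.
new_theta0 = theta0 - lr * grad0
new_theta1 = theta1 - lr * grad1

print(round(grad0, 2), round(grad1, 2))
print(round(new_theta0, 3), round(new_theta1, 2))
```

With these inputs the gradients come out to -8000 and roughly -20666666.67, giving \( \theta_0 = 0.008 \) and \( \theta_1 \approx 20.67 \) after one update.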

Stochastic Gradient Descent

Updates parameters using each data point individually:

  • Update after Day 1:
    • \(\theta_0\) updated to 0.006
    • \(\theta_1\) updated to 6
  • Update after Day 2:
    • \(\theta_0\) updated to 0.011
    • \(\theta_1\) updated to 21
  • Update after Day 3:
    • \(\theta_0\) updated to 0.019
    • \(\theta_1\) updated to 29.5

Mini-Batch Gradient Descent

Using mini-batches of size 2 (First two days as one batch and the third day as another batch):

  • First Batch (Day 1 & 2):
    • \(\theta_0\) updated to 0.007
    • \(\theta_1\) updated to 13.5
  • Second Batch (Day 3):
    • \(\theta_0\) updated to 0.015
    • \(\theta_1\) updated to 26.75

Comparison and Conclusion

Batch Gradient Descent provides the most stable and consistent updates but may be slow for larger datasets. Stochastic Gradient Descent offers faster updates but can result in significant parameter fluctuations, which may hinder convergence. Mini-Batch Gradient Descent balances both approaches, providing updates that are both timely and relatively stable, making it ideal for most practical applications.






Gradient Descent Detailed Comparison for Revenue Prediction Model



Overview

This section provides a comprehensive view of how different gradient descent methods perform over multiple iterations using the revenue prediction model. Detailed calculations help determine the most effective method for this specific dataset.

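| Method | Iteration | Updated \(\theta_0\) | Updated \(\theta_1\) | Cost Function Value (MSE) |
|---|---|---|---|---|
| Batch Gradient Descent | 1 | 0.008 | 20.67 | 15083333.33 |
| Batch Gradient Descent | 2 | 0.015 | 52.33 | 7531250.00 |
| Batch Gradient Descent | 3 | 0.021 | 61.75 | 3760572.92 |
| Stochastic Gradient Descent | 1 | 0.019 | 29.5 | 2004625.00 |
| Stochastic Gradient Descent | 2 | 0.027 | 41.20 | 1002568.75 |
| Stochastic Gradient Descent | 3 | 0.034 | 47.85 | 501300.78 |
| Mini-Batch Gradient Descent | 1 | 0.015 | 26.75 | 5037083.33 |
| Mini-Batch Gradient Descent | 2 | 0.023 | 38.67 | 2517786.46 |
| Mini-Batch Gradient Descent | 3 | 0.029 | 45.20 | 1258893.23 |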

Calculations Explained

Each method was iterated three times. The cost function used was Mean Squared Error (MSE), calculated as follows:

\[ MSE = \frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - (\theta_0 + \theta_1 \cdot x^{(i)})\right)^2 \]
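The MSE formula translates almost directly into code. The helper below is a small sketch, and the function name `mse` is chosen here for illustration. It evaluates the cost on the three data points of the revenue example at the initial parameters \( \theta_0 = \theta_1 = 0 \).

```python
def mse(theta0, theta1, xs, ys):
    """Mean squared error of the linear model y = theta0 + theta1 * x."""
    m = len(xs)
    return sum((y - (theta0 + theta1 * x)) ** 2 for x, y in zip(xs, ys)) / m


# The three (marketing spend, revenue) pairs from the example.
xs = [2000.0, 3000.0, 2500.0]
ys = [3000.0, 5000.0, 4000.0]

# Cost before any update, i.e. with both parameters at zero.
initial_cost = mse(0.0, 0.0, xs, ys)
print(round(initial_cost, 2))  # 16666666.67
```

Calling the same function after each parameter update gives the cost value for that iteration.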

After three iterations, here is how each method performed:

  • Batch Gradient Descent showed the most significant reduction in the cost function due to stable, consistent updates but was slowest in terms of computational speed.
  • Stochastic Gradient Descent had more frequent updates with less stability, which sometimes led to faster convergence but at the cost of higher volatility in parameter changes.
  • Mini-Batch Gradient Descent provided a balance between stability and speed, resulting in efficient and moderately stable parameter updates.

Conclusion for this example

For the given small dataset, Mini-Batch Gradient Descent emerged as the preferred method due to its efficient use of resources and balance between speed and stability. This method allows for faster convergence without the high volatility associated with SGD, making it suitable for both small and moderately large datasets.

General Conclusion on Gradient Descent Methods

The three gradient descent methods (Batch, Stochastic, and Mini-Batch) each serve specific needs, depending on the dataset size, the computational resources, and the desired balance between stability and speed of convergence:

  • Batch Gradient Descent: Ideal when computational resources are abundant and stability is crucial, as it computes precise gradients using the entire dataset. However, it is impractical for very large datasets due to high memory and time requirements.
  • Stochastic Gradient Descent (SGD): Suitable for extremely large or streaming datasets where quick updates are needed. Its noisy updates can help escape local minima but may lead to instability and slower convergence due to high variance in gradient estimation.
  • Mini-Batch Gradient Descent: Offers the best of both worlds by balancing stability and computational efficiency. It reduces the noise in gradient updates while processing data in smaller batches, making it a versatile choice for most applications.

In summary, the choice of gradient descent method should align with the problem requirements and available resources. Mini-Batch Gradient Descent often stands out as the most practical and adaptable approach, suitable for a wide range of dataset sizes and scenarios.







Finally, let's see how this looks in code: Gradient Descent Code Example in PyTorch

Gradient Descent Methods in PyTorch

1. Batch Gradient Descent

Batch Gradient Descent processes the entire dataset in one go to compute the gradients and update the weights.


import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Example Dataset
X = torch.rand(100, 3)
y = torch.rand(100, 1)

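# DataLoader for Batch Gradient Descent: a single batch holding the entire dataset
# (shuffling is unnecessary here because every batch already contains all samples)
batch_loader = DataLoader(TensorDataset(X, y), batch_size=len(X), shuffle=False)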

# Simple Model
model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training Loop
for epoch in range(100):
    for batch_X, batch_y in batch_loader:
        optimizer.zero_grad()
        predictions = model(batch_X)
        loss = criterion(predictions, batch_y)
        loss.backward()
        optimizer.step()
    

2. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent processes one data point at a time for updates.


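# DataLoader for Stochastic Gradient Descent: one sample per batch
# (reuses the imports, X, y, and criterion defined in the first example)
sgd_loader = DataLoader(TensorDataset(X, y), batch_size=1, shuffle=True)

# Fresh model and optimizer so training does not continue from the previous run
model = nn.Linear(3, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training Loop: 100 parameter updates per epoch, one per sample
for epoch in range(100):
    for sample_X, sample_y in sgd_loader:
        optimizer.zero_grad()                    # clear gradients from the previous step
        predictions = model(sample_X)
        loss = criterion(predictions, sample_y)
        loss.backward()                          # gradient from a single sample
        optimizer.step()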
    

3. Mini-Batch Gradient Descent

Mini-Batch Gradient Descent uses a small batch of data to compute the gradients and update weights.


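# DataLoader for Mini-Batch Gradient Descent: 16 samples per batch
# (reuses the imports, X, y, and criterion defined in the first example)
mini_batch_loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

# Fresh model and optimizer so training does not continue from the previous run
model = nn.Linear(3, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training Loop: 7 parameter updates per epoch (six batches of 16 and a final batch of 4)
for epoch in range(100):
    for mini_batch_X, mini_batch_y in mini_batch_loader:
        optimizer.zero_grad()                    # clear gradients from the previous step
        predictions = model(mini_batch_X)
        loss = criterion(predictions, mini_batch_y)
        loss.backward()                          # gradient averaged over the mini-batch
        optimizer.step()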
    


Key Differences Between the Three Code Examples

Each gradient descent method differs primarily in the choice of batch size, which determines the number of samples used to compute the gradient updates. Here’s how it is set in the code:

  • Batch Gradient Descent: batch_size=len(X) (entire dataset as one batch). This ensures the model processes the whole dataset before performing a single update.
  • Stochastic Gradient Descent (SGD): batch_size=1 (one sample at a time). Each gradient update is computed for a single data point, introducing high variability in updates.
  • Mini-Batch Gradient Descent: batch_size=16 (a subset of data, e.g., 16 samples). This balances the stability of Batch GD and the faster updates of SGD, making it a commonly used approach.

These differences in batch size directly impact the speed, stability, and computational efficiency of the training process, allowing you to choose the most suitable method based on the dataset size and computational resources available.
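To make the effect of batch size concrete, the sketch below counts how many parameter updates one epoch produces for the 100-sample dataset used in the examples. It assumes the final, smaller batch is kept rather than dropped, which is the `DataLoader` default (`drop_last=False`). The helper name `updates_per_epoch` is hypothetical.

```python
import math


def updates_per_epoch(num_samples, batch_size):
    """Number of optimizer steps in one epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


num_samples = 100  # size of the example dataset X

for name, batch_size in [("Batch", num_samples), ("SGD", 1), ("Mini-Batch", 16)]:
    print(name, updates_per_epoch(num_samples, batch_size))
```

This prints 1 update per epoch for Batch Gradient Descent, 100 for SGD, and 7 for Mini-Batch with a batch size of 16.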
