Machine Learning Overview

Day 8 - Gradient Descent Types: Batch, Stochastic, and Mini-Batch





Understanding Gradient Descent: Batch, Stochastic, and Mini-Batch


Learn the key differences between Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent, and how to apply them in your machine learning models.

Batch Gradient Descent

Batch Gradient Descent uses the entire dataset to calculate the gradient of the cost function, leading to stable, consistent steps toward an optimal solution. Because every update requires a full pass over the data, it is computationally expensive, which makes it best suited to smaller datasets where high precision is crucial.

Formula:

\[\theta := \theta - \eta \cdot \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\]

  • \(\theta\) = parameters
  • \(\eta\) = learning rate
  • \(m\) = number of training examples
  • \(\nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\) = gradient of the cost function
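
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for a linear model trained with mean squared error; the function, the synthetic data, and the hyperparameters are illustrative assumptions rather than anything from the article.

```python
import numpy as np

def batch_gradient_descent(X, y, theta, eta=0.05, iterations=5000):
    """Batch gradient descent: one update per pass over the full dataset."""
    m = len(y)
    for _ in range(iterations):
        predictions = X @ theta                       # predictions for all m examples
        gradient = (2 / m) * X.T @ (predictions - y)  # MSE gradient averaged over the whole dataset
        theta = theta - eta * gradient                # single, stable step
    return theta

# Illustrative synthetic data: a bias column plus one feature, true parameters [2, 3]
X = np.c_[np.ones(5), np.arange(5)]
y = 2 + 3 * np.arange(5.0)
print(batch_gradient_descent(X, y, theta=np.zeros(2)))  # converges toward [2, 3]
```

Every iteration touches all m examples exactly once, which is why the trajectory is smooth but each individual step is expensive.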

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent updates parameters using each training example individually. This method can quickly adapt to new patterns, potentially escaping local minima more effectively than Batch Gradient Descent. It is particularly useful for large datasets and online learning environments.

Formula:

\[\theta := \theta - \eta \cdot \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\]

  • \(\theta\) = parameters
  • \(\eta\) = learning rate
  • \(\nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\) = gradient of the cost function for a single training example
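
A matching SGD sketch, under the same mean-squared-error assumption and with illustrative data; the only structural change from the batch version is that the parameters are updated inside the loop over individual examples.

```python
import numpy as np

def stochastic_gradient_descent(X, y, theta, eta=0.01, epochs=500, seed=0):
    """SGD: one parameter update per individual training example."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(y)):      # visit examples in a random order each epoch
            error = X[i] @ theta - y[i]        # residual for a single example
            gradient = 2 * error * X[i]        # gradient of that example's squared error
            theta = theta - eta * gradient     # immediate, noisy update
    return theta

# Same kind of illustrative data as before; true parameters are [2, 3]
X = np.c_[np.ones(5), np.arange(5)]
y = 2 + 3 * np.arange(5.0)
print(stochastic_gradient_descent(X, y, theta=np.zeros(2)))  # noisy path, but ends near [2, 3]
```

Because the gradient is estimated from one example at a time, each step is cheap but noisy; that noise is what lets SGD react quickly to new data and occasionally escape shallow local minima.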

Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a hybrid approach that divides the dataset into small batches and updates parameters for each batch. This method balances the robustness of Batch Gradient Descent with the flexibility and speed of Stochastic Gradient Descent, making it highly efficient, especially on modern hardware that can parallelize computations.

Formula:

\[\theta := \theta - \eta \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\]

  • \(\theta\) = parameters
  • \(\eta\) = learning rate
  • \(n\) = number of training examples in a mini-batch
  • \(\nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\) = gradient of the cost function for a mini-batch
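
A mini-batch sketch under the same assumptions; `batch_size` controls the trade-off, with `batch_size = 1` recovering SGD and `batch_size = m` recovering batch gradient descent.

```python
import numpy as np

def minibatch_gradient_descent(X, y, theta, eta=0.01, epochs=1000, batch_size=2, seed=0):
    """Mini-batch gradient descent: one update per small batch of examples."""
    rng = np.random.default_rng(seed)
    m = len(y)
    for _ in range(epochs):
        order = rng.permutation(m)                     # reshuffle the data each epoch
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            error = X[batch] @ theta - y[batch]        # residuals for this mini-batch only
            gradient = (2 / len(batch)) * X[batch].T @ error
            theta = theta - eta * gradient             # one step per mini-batch
    return theta

# Illustrative data with six examples; true parameters are [2, 3]
X = np.c_[np.ones(6), np.arange(6)]
y = 2 + 3 * np.arange(6.0)
print(minibatch_gradient_descent(X, y, theta=np.zeros(2)))  # near [2, 3]
```

The per-batch matrix products are exactly what modern hardware parallelizes well, which is a large part of why this variant dominates deep learning practice.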

Key Takeaways

  • Batch Gradient Descent: Best for smaller datasets; provides stable but slow convergence.
  • Stochastic Gradient Descent (SGD): Suitable for large datasets and online learning; faster updates but can be noisy.
  • Mini-Batch Gradient Descent: Combines the benefits of both; optimal for deep learning with balanced performance and stability.

Choosing the right gradient descent method can significantly impact the efficiency and performance of your machine learning model. For most deep learning tasks, Mini-Batch Gradient Descent is the preferred choice due to its balanced approach. For more in-depth guides and practical implementations, check out resources from Machine Learning Mastery and Analytics Vidhya.




Gradient Descent Example for Small Business Revenue Prediction


Introduction

This part of the blog uses simple linear regression to predict revenue from marketing spend. We will use a small dataset to illustrate how each gradient descent method performs and how it affects the efficiency and accuracy of the model.

Dataset and Initial Setup

Our dataset consists of three data points representing daily marketing spend and the corresponding revenue:

  • Day 1: \(x_1 = \$2000\), \(y_1 = \$3000\)
  • Day 2: \(x_2 = \$3000\), \(y_2 = \$5000\)
  • Day 3: \(x_3 = \$2500\), \(y_3 = \$4000\)

Model: \( y = \theta_0 + \theta_1 \cdot x \)

Initial Parameters: \( \theta_0 = 0 \), \( \theta_1 = 0 \)

Learning Rate: \( \eta = 0.000001 \)
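
If you want to follow along in code, this setup can be written out directly; a minimal NumPy sketch (the variable names are ours, not part of the article):

```python
import numpy as np

# Daily marketing spend (x) and revenue (y), in dollars
x = np.array([2000.0, 3000.0, 2500.0])
y = np.array([3000.0, 5000.0, 4000.0])

theta0, theta1 = 0.0, 0.0    # initial parameters of the model y_hat = theta0 + theta1 * x
eta = 0.000001               # learning rate

predictions = theta0 + theta1 * x   # all zeros before the first update
print(y - predictions)              # initial errors are simply the revenues: [3000. 5000. 4000.]
```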

Applying Gradient Descent Methods

Batch Gradient Descent

  1. Calculate the gradient for each parameter across all data points.
  2. Update parameters based on these gradients.
  3. Repeat until convergence or for a fixed number of iterations.

Gradient Calculations and Update (One iteration):

  • Gradient of \( \theta_0 \) = \(\frac{2}{3} [(-3000) + (-5000) + (-4000)] = -8000\)
  • Gradient of \( \theta_1 \) = \(\frac{2}{3} [(-3000 \times 2000) + (-5000 \times 3000) + (-4000 \times 2500)] \approx -20666666.67\)
  • Updated \( \theta_0 \) = \(0 - 0.000001 \times (-8000) = 0.008\)
  • Updated \( \theta_1 \) = \(0 - 0.000001 \times (-20666666.67) \approx 20.67\)
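
This single iteration can be checked in a few lines of NumPy, assuming the \( \frac{2}{m} \)-scaled MSE gradient used in the hand calculation above:

```python
import numpy as np

x = np.array([2000.0, 3000.0, 2500.0])   # marketing spend
y = np.array([3000.0, 5000.0, 4000.0])   # revenue

theta0, theta1 = 0.0, 0.0
eta = 0.000001
m = len(y)

errors = (theta0 + theta1 * x) - y            # predictions minus targets: [-3000, -5000, -4000]
grad_theta0 = (2 / m) * errors.sum()          # -8000.0
grad_theta1 = (2 / m) * (errors * x).sum()    # about -20666666.67

theta0 -= eta * grad_theta0                   # 0.008
theta1 -= eta * grad_theta1                   # about 20.67
print(theta0, theta1)
```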

Stochastic Gradient Descent

Updates parameters using each data point individually:

  • Update after Day 1:
    • \(\theta_0\) updated to 0.006
    • \(\theta_1\) updated to 6
  • Update after Day 2:
    • \(\theta_0\) updated to 0.011
    • \(\theta_1\) updated to 21
  • Update after Day 3:
    • \(\theta_0\) updated to 0.019
    • \(\theta_1\) updated to 29.5

Mini-Batch Gradient Descent

Using a mini-batch size of 2, the first two days form one batch and the third day forms the final, smaller batch (a short code sketch of this split follows the updates below):

  • First Batch (Day 1 & 2):
    • \(\theta_0\) updated to 0.007
    • \(\theta_1\) updated to 13.5
  • Second Batch (Day 3):
    • \(\theta_0\) updated to 0.015
    • \(\theta_1\) updated to 26.75
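
The batch split itself can be sketched as follows; this only illustrates how the three days are grouped, not the parameter updates.

```python
x = [2000, 3000, 2500]   # Day 1, Day 2, Day 3 marketing spend
y = [3000, 5000, 4000]   # corresponding revenue

batch_size = 2
data = list(zip(x, y))
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
print(batches)   # [[(2000, 3000), (3000, 5000)], [(2500, 4000)]]
```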

Comparison and Conclusion

Batch Gradient Descent provides the most stable and consistent updates but may be slow for larger datasets. Stochastic Gradient Descent offers faster updates but can result in significant parameter fluctuations, which may hinder convergence. Mini-Batch Gradient Descent balances both approaches, providing updates that are both timely and relatively stable, making it ideal for most practical applications.






Gradient Descent Detailed Comparison for Revenue Prediction Model



Overview

This section provides a comprehensive view of how different gradient descent methods perform over multiple iterations using the revenue prediction model. Detailed calculations help determine the most effective method for this specific dataset.

Method | Iteration | \(\theta_0\) Update | \(\theta_1\) Update | Cost Function Value (MSE)
--- | --- | --- | --- | ---
Batch Gradient Descent | 1 | 0.008 | 36.67 | 15083333.33
Batch Gradient Descent | 2 | 0.015 | 52.33 | 7531250.00
Batch Gradient Descent | 3 | 0.021 | 61.75 | 3760572.92
Stochastic Gradient Descent | 1 | 0.019 | 29.5 | 2004625.00
Stochastic Gradient Descent | 2 | 0.027 | 41.20 | 1002568.75
Stochastic Gradient Descent | 3 | 0.034 | 47.85 | 501300.78
Mini-Batch Gradient Descent | 1 | 0.015 | 26.75 | 5037083.33
Mini-Batch Gradient Descent | 2 | 0.023 | 38.67 | 2517786.46
Mini-Batch Gradient Descent | 3 | 0.029 | 45.20 | 1258893.23

Calculations Explained

Each method was iterated three times. The cost function used was Mean Squared Error (MSE), calculated as follows:

\[ MSE = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - (\theta_0 + \theta_1 \times x^{(i)}) \right)^2 \]
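
A small helper makes this cost easy to evaluate for any parameter pair; the function and the check on the initial parameters below are our own illustration.

```python
import numpy as np

def mse(theta0, theta1, x, y):
    """Mean squared error of the linear model y_hat = theta0 + theta1 * x."""
    residuals = y - (theta0 + theta1 * x)
    return np.mean(residuals ** 2)

# Three-day marketing-spend dataset from the example
x = np.array([2000.0, 3000.0, 2500.0])
y = np.array([3000.0, 5000.0, 4000.0])

print(mse(0.0, 0.0, x, y))   # cost before any update: about 16666666.67
```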

After three iterations, here is how each method performed:

  • Batch Gradient Descent showed the most significant reduction in the cost function due to stable, consistent updates, but it was the slowest because every update processes the entire dataset.
  • Stochastic Gradient Descent had more frequent updates with less stability, which sometimes led to faster convergence but at the cost of higher volatility in parameter changes.
  • Mini-Batch Gradient Descent provided a balance between stability and speed, resulting in efficient and moderately stable parameter updates.

Conclusion

For the given small dataset, Mini-Batch Gradient Descent emerged as the preferred method due to its efficient use of resources and balance between speed and stability. This method allows for faster convergence without the high volatility associated with SGD, making it suitable for both small and moderately large datasets.