Understanding Gradient Descent: Batch, Stochastic, and Mini-Batch
Learn the key differences between Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent, and how to apply them in your machine learning models.
Batch Gradient Descent
Batch Gradient Descent uses the entire dataset to calculate the gradient of the cost function, leading to stable, consistent steps toward an optimal solution. Because every update requires a full pass over the data, it is computationally expensive on large datasets and is best suited to smaller datasets where stable, precise convergence is the priority.
Formula:
\[\theta := \theta - \eta \cdot \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\]
- \(\theta\) = parameters
- \(\eta\) = learning rate
- \(m\) = number of training examples
- \(\nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\) = gradient of the cost function for training example \(i\)
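As a concrete illustration, here is a minimal Python sketch of this update rule for a linear model trained with mean squared error (the function name `batch_gradient_descent` and the arrays `X` and `y` are illustrative, not from any particular library):

```python
import numpy as np

def batch_gradient_descent(X, y, theta, eta, iterations):
    """Linear model with MSE loss: one full-dataset gradient step per iteration."""
    m = len(y)
    for _ in range(iterations):
        predictions = X @ theta                        # predictions for all m examples
        gradient = (2 / m) * X.T @ (predictions - y)   # average MSE gradient over the whole dataset
        theta = theta - eta * gradient                 # one stable update per pass over the data
    return theta
```

Each iteration touches every training example once, which is why the method is precise but slow on large datasets.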
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent updates parameters using each training example individually. This method can quickly adapt to new patterns, potentially escaping local minima more effectively than Batch Gradient Descent. It is particularly useful for large datasets and online learning environments.
Formula:
\[\theta := \theta - \eta \cdot \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\]
- \(\theta\) = parameters
- \(\eta\) = learning rate
- \(\nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\) = gradient of the cost function for a single training example
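A comparable sketch of SGD for the same linear model with MSE loss (again, the names are illustrative):

```python
import numpy as np

def stochastic_gradient_descent(X, y, theta, eta, epochs):
    """Linear model with MSE loss: one parameter update per training example."""
    m = len(y)
    for _ in range(epochs):
        for i in np.random.permutation(m):   # visit the examples in a random order each epoch
            error = X[i] @ theta - y[i]      # residual for this single example
            gradient = 2 * error * X[i]      # gradient computed from one example only
            theta = theta - eta * gradient   # frequent, noisy update
    return theta
```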
Mini-Batch Gradient Descent
Mini-Batch Gradient Descent is a hybrid approach that divides the dataset into small batches and updates parameters for each batch. This method balances the robustness of Batch Gradient Descent with the flexibility and speed of Stochastic Gradient Descent, making it highly efficient, especially on modern hardware that can parallelize computations.
Formula:
\[\theta := \theta - \eta \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\]
- \(\theta\) = parameters
- \(\eta\) = learning rate
- \(n\) = number of training examples in a mini-batch
- \(\nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\) = gradient of the cost function for training example \(i\) in the mini-batch
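A sketch of the mini-batch variant for the same setup (illustrative names; `batch_size` is a tunable hyperparameter):

```python
import numpy as np

def mini_batch_gradient_descent(X, y, theta, eta, epochs, batch_size=32):
    """Linear model with MSE loss: one parameter update per mini-batch."""
    m = len(y)
    for _ in range(epochs):
        indices = np.random.permutation(m)            # shuffle before slicing into batches
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]
            X_b, y_b = X[batch], y[batch]
            n = len(y_b)                              # the last batch may be smaller than batch_size
            gradient = (2 / n) * X_b.T @ (X_b @ theta - y_b)
            theta = theta - eta * gradient            # one update per mini-batch
    return theta
```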
Key Takeaways
- Batch Gradient Descent: Best for smaller datasets; provides stable but slow convergence.
- Stochastic Gradient Descent (SGD): Suitable for large datasets and online learning; faster updates but can be noisy.
- Mini-Batch Gradient Descent: Combines the benefits of both; optimal for deep learning with balanced performance and stability.
Gradient Descent Example for Small Business Revenue Prediction
Introduction
This part of the blog uses simple linear regression to predict revenue from marketing spend. We use a small dataset to illustrate how each gradient descent method performs and how it affects the efficiency and accuracy of the model.
Dataset and Initial Setup
Our dataset consists of three data points representing daily marketing spend and the corresponding revenue:
- Day 1: \(x_1 = \$2000\), \(y_1 = \$3000\)
- Day 2: \(x_2 = \$3000\), \(y_2 = \$5000\)
- Day 3: \(x_3 = \$2500\), \(y_3 = \$4000\)
Model: \( y = \theta_0 + \theta_1 \cdot x \)
Initial Parameters: \( \theta_0 = 0 \), \( \theta_1 = 0 \)
Learning Rate: \( \eta = 0.000001 \)
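The same setup can be written in a few lines of Python (a sketch only; the variable names are illustrative and are reused in the iteration sketch further below):

```python
import numpy as np

# Daily marketing spend (x) and revenue (y), in dollars
x = np.array([2000.0, 3000.0, 2500.0])
y = np.array([3000.0, 5000.0, 4000.0])

theta_0, theta_1 = 0.0, 0.0   # initial parameters
eta = 0.000001                # learning rate
```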
Applying Gradient Descent Methods
Batch Gradient Descent
- Calculate the gradient for each parameter across all data points.
- Update parameters based on these gradients.
- Repeat until convergence or for a fixed number of iterations.
Gradient Calculations and Update (one iteration; reproduced in the code sketch below):
- Gradient of \( \theta_0 \) = \(\frac{2}{3} [(-3000) + (-5000) + (-4000)] = -8000\)
- Gradient of \( \theta_1 \) = \(\frac{2}{3} [(-3000 \times 2000) + (-5000 \times 3000) + (-4000 \times 2500)] = -20666666.67\)
- Updated \( \theta_0 \) = \(0 - 0.000001 \times (-8000) = 0.008\)
- Updated \( \theta_1 \) = \(0 - 0.000001 \times (-20666666.67) \approx 20.67\)
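Continuing from the setup sketch above, the same single Batch Gradient Descent iteration looks like this in Python (illustrative only):

```python
m = len(x)                                    # m = 3 training examples
errors = y - (theta_0 + theta_1 * x)          # with both thetas at 0, the residuals equal y

grad_theta_0 = -(2 / m) * np.sum(errors)      # -(2/3) * 12000 = -8000
grad_theta_1 = -(2 / m) * np.sum(errors * x)  # -(2/3) * 31000000 ≈ -20666666.67

theta_0 -= eta * grad_theta_0                 # 0.008
theta_1 -= eta * grad_theta_1                 # ≈ 20.67
```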
Stochastic Gradient Descent
Updates parameters using each data point individually:
- Update after Day 1:
- \(\theta_0\) updated to 0.006
- \(\theta_1\) updated to 6
- Update after Day 2:
- \(\theta_0\) updated to 0.011
- \(\theta_1\) updated to 21
- Update after Day 3:
- \(\theta_0\) updated to 0.019
- \(\theta_1\) updated to 29.5
Mini-Batch Gradient Descent
Using a mini-batch size of 2, with the first two days forming one batch and the third day forming a second, smaller batch:
- First Batch (Day 1 & 2):
- \(\theta_0\) updated to 0.007
- \(\theta_1\) updated to 13.5
- Second Batch (Day 3):
- \(\theta_0\) updated to 0.015
- \(\theta_1\) updated to 26.75
Comparison and Conclusion
Batch Gradient Descent provides the most stable and consistent updates but may be slow for larger datasets. Stochastic Gradient Descent offers faster updates but can result in significant parameter fluctuations, which may hinder convergence. Mini-Batch Gradient Descent balances both approaches, providing updates that are both timely and relatively stable, making it ideal for most practical applications.
Gradient Descent Detailed Comparison for Revenue Prediction Model
Overview
This section provides a comprehensive view of how different gradient descent methods perform over multiple iterations using the revenue prediction model. Detailed calculations help determine the most effective method for this specific dataset.
| Method | Iteration | \(\theta_0\) Update | \(\theta_1\) Update | Cost Function Value (MSE) |
|---|---|---|---|---|
| Batch Gradient Descent | 1 | 0.008 | 36.67 | 15083333.33 |
| Batch Gradient Descent | 2 | 0.015 | 52.33 | 7531250.00 |
| Batch Gradient Descent | 3 | 0.021 | 61.75 | 3760572.92 |
| Stochastic Gradient Descent | 1 | 0.019 | 29.5 | 2004625.00 |
| Stochastic Gradient Descent | 2 | 0.027 | 41.20 | 1002568.75 |
| Stochastic Gradient Descent | 3 | 0.034 | 47.85 | 501300.78 |
| Mini-Batch Gradient Descent | 1 | 0.015 | 26.75 | 5037083.33 |
| Mini-Batch Gradient Descent | 2 | 0.023 | 38.67 | 2517786.46 |
| Mini-Batch Gradient Descent | 3 | 0.029 | 45.20 | 1258893.23 |
Calculations Explained
Each method was iterated three times. The cost function used was Mean Squared Error (MSE), calculated as follows:
\[ MSE = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - (\theta_0 + \theta_1 \times x^{(i)}) \right)^2 \]
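For reference, this cost can be computed with a small helper function (a sketch; `mse` is an illustrative name, not from any particular library):

```python
import numpy as np

def mse(x, y, theta_0, theta_1):
    """Mean squared error of the linear model y_hat = theta_0 + theta_1 * x."""
    predictions = theta_0 + theta_1 * x
    return np.mean((y - predictions) ** 2)
```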
After three iterations, here is how each method performed:
- Batch Gradient Descent reduced the cost function steadily and predictably thanks to its stable, consistent updates, but each iteration was the most computationally expensive.
- Stochastic Gradient Descent had more frequent updates with less stability, which sometimes led to faster convergence but at the cost of higher volatility in parameter changes.
- Mini-Batch Gradient Descent provided a balance between stability and speed, resulting in efficient and moderately stable parameter updates.
Conclusion
For the given small dataset, Mini-Batch Gradient Descent emerged as the preferred method due to its efficient use of resources and balance between speed and stability. This method allows for faster convergence without the high volatility associated with SGD, making it suitable for both small and moderately large datasets.