Understanding Gradient Descent: Batch, Stochastic, and Mini-Batch
Learn the key differences between Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent, and how to apply them in your machine learning models.
Batch Gradient Descent
Batch Gradient Descent uses the entire dataset to calculate the gradient of the cost function, leading to stable, consistent steps toward an optimal solution. Because every update requires a full pass over the data, it is computationally expensive on large datasets and is best suited to smaller datasets where stable, precise convergence is the priority.
Formula:
\[\theta := \theta - \eta \cdot \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\]
- \(\theta\) = parameters
- \(\eta\) = learning rate
- \(m\) = number of training examples
- \(\nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\) = gradient of the cost function for training example \(i\)
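As a concrete illustration, here is a minimal Python sketch of this update rule for a linear model trained with mean squared error (the function name `batch_gradient_descent` and the arrays `X` and `y` are illustrative, not from any particular library):

```python
import numpy as np

def batch_gradient_descent(X, y, theta, eta, iterations):
    """Linear model with MSE loss: one full-dataset gradient step per iteration."""
    m = len(y)
    for _ in range(iterations):
        predictions = X @ theta                        # predictions for all m examples
        gradient = (2 / m) * X.T @ (predictions - y)   # average MSE gradient over the whole dataset
        theta = theta - eta * gradient                 # one stable update per pass over the data
    return theta
```

Each iteration touches every training example once, which is why the method is precise but slow on large datasets.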
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent updates parameters using each training example individually. This method can quickly adapt to new patterns, potentially escaping local minima more effectively than Batch Gradient Descent. It is particularly useful for large datasets and online learning environments.
Formula:
\[\theta := \theta - \eta \cdot \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\]
- \(\theta\) = parameters
- \(\eta\) = learning rate
- \(\nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\) = gradient of the cost function for a single training example
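A comparable sketch of SGD for the same linear model with MSE loss (again, the names are illustrative):

```python
import numpy as np

def stochastic_gradient_descent(X, y, theta, eta, epochs):
    """Linear model with MSE loss: one parameter update per training example."""
    m = len(y)
    for _ in range(epochs):
        for i in np.random.permutation(m):   # visit the examples in a random order each epoch
            error = X[i] @ theta - y[i]      # residual for this single example
            gradient = 2 * error * X[i]      # gradient computed from one example only
            theta = theta - eta * gradient   # frequent, noisy update
    return theta
```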
Mini-Batch Gradient Descent
Mini-Batch Gradient Descent is a hybrid approach that divides the dataset into small batches and updates parameters for each batch. This method balances the robustness of Batch Gradient Descent with the flexibility and speed of Stochastic Gradient Descent, making it highly efficient, especially on modern hardware that can parallelize computations.
Formula:
\[\theta := \theta - \eta \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\]
- \(\theta\) = parameters
- \(\eta\) = learning rate
- \(n\) = number of training examples in a mini-batch
- \(\nabla_{\theta} J(\theta; x^{(i)}, y^{(i)})\) = gradient of the cost function for training example \(i\) in the mini-batch
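A sketch of the mini-batch variant for the same setup (illustrative names; `batch_size` is a tunable hyperparameter):

```python
import numpy as np

def mini_batch_gradient_descent(X, y, theta, eta, epochs, batch_size=32):
    """Linear model with MSE loss: one parameter update per mini-batch."""
    m = len(y)
    for _ in range(epochs):
        indices = np.random.permutation(m)            # shuffle before slicing into batches
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]
            X_b, y_b = X[batch], y[batch]
            n = len(y_b)                              # the last batch may be smaller than batch_size
            gradient = (2 / n) * X_b.T @ (X_b @ theta - y_b)
            theta = theta - eta * gradient            # one update per mini-batch
    return theta
```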
Key Takeaways
- Batch Gradient Descent: Best for smaller datasets; provides stable but slow convergence.
- Stochastic Gradient Descent (SGD): Suitable for large datasets and online learning; faster updates but can be noisy.
- Mini-Batch Gradient Descent: Combines the benefits of both; optimal for deep learning with balanced performance and stability.
Gradient Descent Example for Small Business Revenue Prediction
Introduction
This part of the blog uses simple linear regression to predict revenue from marketing spend. We use a small dataset to illustrate how each gradient descent method performs and how it affects the efficiency and accuracy of the model.
Dataset and Initial Setup
Our dataset consists of three data points representing daily marketing spend and the corresponding revenue:
- Day 1: \(x_1 = \$2000\), \(y_1 = \$3000\)
- Day 2: \(x_2 = \$3000\), \(y_2 = \$5000\)
- Day 3: \(x_3 = \$2500\), \(y_3 = \$4000\)
Model: \( y = \theta_0 + \theta_1 \cdot x \)
Initial Parameters: \( \theta_0 = 0 \), \( \theta_1 = 0 \)
Learning Rate: \( \eta = 0.000001 \)
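The same setup can be written in a few lines of Python (a sketch only; the variable names are illustrative and are reused in the iteration sketch further below):

```python
import numpy as np

# Daily marketing spend (x) and revenue (y), in dollars
x = np.array([2000.0, 3000.0, 2500.0])
y = np.array([3000.0, 5000.0, 4000.0])

theta_0, theta_1 = 0.0, 0.0   # initial parameters
eta = 0.000001                # learning rate
```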
Applying Gradient Descent Methods
Batch Gradient Descent
- Calculate the gradient for each parameter across all data points.
- Update parameters based on these gradients.
- Repeat until convergence or for a fixed number of iterations.
Gradient Calculations and Update (one iteration; reproduced in the code sketch below):
- Gradient of \( \theta_0 \) = \(\frac{2}{3} [(-3000) + (-5000) + (-4000)] = -8000\)
- Gradient of \( \theta_1 \) = \(\frac{2}{3} [(-3000 \times 2000) + (-5000 \times 3000) + (-4000 \times 2500)] = -20666666.67\)
- Updated \( \theta_0 \) = \(0 - 0.000001 \times (-8000) = 0.008\)
- Updated \( \theta_1 \) = \(0 - 0.000001 \times (-20666666.67) \approx 20.67\)
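Continuing from the setup sketch above, the same single Batch Gradient Descent iteration looks like this in Python (illustrative only):

```python
m = len(x)                                    # m = 3 training examples
errors = y - (theta_0 + theta_1 * x)          # with both thetas at 0, the residuals equal y

grad_theta_0 = -(2 / m) * np.sum(errors)      # -(2/3) * 12000 = -8000
grad_theta_1 = -(2 / m) * np.sum(errors * x)  # -(2/3) * 31000000 ≈ -20666666.67

theta_0 -= eta * grad_theta_0                 # 0.008
theta_1 -= eta * grad_theta_1                 # ≈ 20.67
```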
Stochastic Gradient Descent
Updates parameters using each data point individually:
- Update after Day 1:
- \(\theta_0\) updated to 0.006
- \(\theta_1\) updated to 6
- Update after Day 2:
- \(\theta_0\) updated to 0.011
- \(\theta_1\) updated to 21
- Update after Day 3:
- \(\theta_0\) updated to 0.019
- \(\theta_1\) updated to 29.5
Mini-Batch Gradient Descent
Using a mini-batch size of 2, with the first two days forming one batch and the third day forming a second, smaller batch:
- First Batch (Day 1 & 2):
- \(\theta_0\) updated to 0.007
- \(\theta_1\) updated to 13.5
- Second Batch (Day 3):
- \(\theta_0\) updated to 0.015
- \(\theta_1\) updated to 26.75
Comparison and Conclusion
Batch Gradient Descent provides the most stable and consistent updates but may be slow for larger datasets. Stochastic Gradient Descent offers faster updates but can result in significant parameter fluctuations, which may hinder convergence. Mini-Batch Gradient Descent balances both approaches, providing updates that are both timely and relatively stable, making it ideal for most practical applications.
Gradient Descent Detailed Comparison for Revenue Prediction Model
Overview
This section provides a comprehensive view of how different gradient descent methods perform over multiple iterations using the revenue prediction model. Detailed calculations help determine the most effective method for this specific dataset.
| Method | Iteration | \(\theta_0\) Update | \(\theta_1\) Update | Cost Function Value (MSE) |
|---|---|---|---|---|
| Batch Gradient Descent | 1 | 0.008 | 36.67 | 15083333.33 |
| Batch Gradient Descent | 2 | 0.015 | 52.33 | 7531250.00 |
| Batch Gradient Descent | 3 | 0.021 | 61.75 | 3760572.92 |
| Stochastic Gradient Descent | 1 | 0.019 | 29.5 | 2004625.00 |
| Stochastic Gradient Descent | 2 | 0.027 | 41.20 | 1002568.75 |
| Stochastic Gradient Descent | 3 | 0.034 | 47.85 | 501300.78 |
| Mini-Batch Gradient Descent | 1 | 0.015 | 26.75 | 5037083.33 |
| Mini-Batch Gradient Descent | 2 | 0.023 | 38.67 | 2517786.46 |
| Mini-Batch Gradient Descent | 3 | 0.029 | 45.20 | 1258893.23 |
Calculations Explained
Each method was iterated three times. The cost function used was Mean Squared Error (MSE), calculated as follows:
\[ MSE = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - (\theta_0 + \theta_1 \times x^{(i)}) \right)^2 \]
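For reference, this cost can be computed with a small helper function (a sketch; `mse` is an illustrative name, not from any particular library):

```python
import numpy as np

def mse(x, y, theta_0, theta_1):
    """Mean squared error of the linear model y_hat = theta_0 + theta_1 * x."""
    predictions = theta_0 + theta_1 * x
    return np.mean((y - predictions) ** 2)
```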
After three iterations, here is how each method performed:
- Batch Gradient Descent reduced the cost function steadily and predictably thanks to its stable, consistent updates, but each iteration was the most computationally expensive.
- Stochastic Gradient Descent had more frequent updates with less stability, which sometimes led to faster convergence but at the cost of higher volatility in parameter changes.
- Mini-Batch Gradient Descent provided a balance between stability and speed, resulting in efficient and moderately stable parameter updates.
Conclusion
For the given small dataset, Mini-Batch Gradient Descent emerged as the preferred method due to its efficient use of resources and balance between speed and stability. This method allows for faster convergence without the high volatility associated with SGD, making it suitable for both small and moderately large datasets.