Understanding Linear Regression: The Normal Equation and Matrix Multiplications Explained
Linear regression is a fundamental concept in machine learning and statistics, used to predict a target variable based on one or more input features. While gradient descent is a popular method for finding the best-fitting line, the normal equation offers a direct, analytical approach that doesn’t require iterations. This blog post will walk you through the normal equation step-by-step, explaining why and how it works, and why using matrices simplifies the process.
Table of Contents
- Introduction to Linear Regression
- Gradient Descent vs. Normal Equation
- Step-by-Step Explanation of the Normal Equation
- Step 1: Add Column of Ones
- Step 2: Transpose of $X$ ($X^T$)
- Step 3: Matrix Multiplication ($X^T X$)
- Step 4: Matrix Multiplication ($X^T y$)
- Step 5: Inverse of $X^T X$ ($(X^T X)^{-1}$)
- Step 6: Final Multiplication to Get $\theta$
- Why the Normal Equation Works Without Gradient Descent
- Advantages of Using Matrices
- Conclusion
Introduction to Linear Regression
Linear regression aims to fit a line to a dataset, predicting a target variable $y$ based on input features $x$. The model is defined as:
$$
y = \theta_0 + \theta_1 x
$$
For multiple features, it generalizes to:
$$
y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n
$$
Gradient Descent vs. Normal Equation
Gradient Descent
- Iterative Method: Starts with initial guesses for parameters $\theta$ and iteratively updates them to minimize the cost function (e.g., Mean Squared Error).
- Requires: Choosing a learning rate and number of iterations.
- Usage: Suitable for very large datasets and complex models.
Normal Equation
- Analytical Method: Directly computes the parameters $\theta$ that minimize the cost function without iterations.
- Formula: $\hat{\theta} = (X^T X)^{-1} X^T y$
- Advantages: Efficient for small to moderately sized datasets, no need for learning rates or iterations.
Step-by-Step Explanation of the Normal Equation
Let’s use a simple dataset to illustrate the process:
| Hours Studied ($x$) | Score ($y$) |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 4.5 |
| 4 | 5 |
| 5 | 7 |
Step 1: Add Column of Ones
Purpose: To include the intercept term ($\theta_0$) in the model.
Process:
$$
X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{bmatrix}
$$
$$
y = \begin{bmatrix} 2 \\ 3 \\ 4.5 \\ 5 \\ 7 \end{bmatrix}
$$
Step 2: Transpose of $X$ ($X^T$)
Purpose: To prepare for matrix multiplication.
Process:
$$
X^T = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \end{bmatrix}
$$
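If you prefer to follow along in code, here is a minimal NumPy sketch of Steps 1 and 2 (assuming NumPy is installed; all variable names are ours, not part of any library):

```python
import numpy as np

# Hours studied and scores from the table above
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 4.5, 5, 7], dtype=float)

# Step 1: prepend a column of ones so theta_0 becomes an ordinary coefficient
X = np.column_stack([np.ones_like(x), x])

# Step 2: transpose of X
print(X)    # 5x2 design matrix
print(X.T)  # 2x5 transpose
```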
Step 3: Matrix Multiplication ($X^T X$)
Purpose: To consolidate feature data.
Process:
$$
X^T X = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \end{bmatrix}
\begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{bmatrix}
= \begin{bmatrix} 5 & 15 \\ 15 & 55 \end{bmatrix}
$$
Step 4: Matrix Multiplication ($X^T y$)
Purpose: To consolidate the relationship between features and target values.
Process:
$$
y = \begin{bmatrix} 2 \\ 3 \\ 4.5 \\ 5 \\ 7 \end{bmatrix}
$$
$$
X^T y = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \end{bmatrix}
\begin{bmatrix} 2 \\ 3 \\ 4.5 \\ 5 \\ 7 \end{bmatrix}
= \begin{bmatrix} 21.5 \\ 76.5 \end{bmatrix}
$$
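Both products from Steps 3 and 4 can be checked with a couple of matrix multiplications. This is a small sketch assuming NumPy; the variable names are illustrative:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 4.5, 5, 7], dtype=float)
X = np.column_stack([np.ones_like(x), x])  # Step 1: column of ones

XtX = X.T @ X   # Step 3: [[ 5., 15.], [15., 55.]]
Xty = X.T @ y   # Step 4: [21.5, 76.5]
print(XtX)
print(Xty)
```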
Step 5: Inverse of $X^T X$ ($(X^T X)^{-1}$)
Purpose: To isolate the parameter vector $\theta$.
Process:
$$
(X^T X)^{-1} = \begin{bmatrix} 5 & 15 \\ 15 & 55 \end{bmatrix}^{-1}
$$
To find the inverse of a 2×2 matrix
$\begin{bmatrix} a & b \\ c & d \end{bmatrix}$, we use the formula:
$$
\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1}
= \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}
$$
Applying this to our matrix:
$$
a = 5, \quad b = 15, \quad c = 15, \quad d = 55
$$
$$
(X^T X)^{-1} = \frac{1}{5 \cdot 55 - 15 \cdot 15}
\begin{bmatrix} 55 & -15 \\ -15 & 5 \end{bmatrix}
= \frac{1}{50}
\begin{bmatrix} 55 & -15 \\ -15 & 5 \end{bmatrix}
= \begin{bmatrix} 1.1 & -0.3 \\ -0.3 & 0.1 \end{bmatrix}
$$
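The explicit inverse can be checked with `np.linalg.inv`. This sketch is shown only to mirror the hand calculation; forming an explicit inverse is usually avoided in practice:

```python
import numpy as np

XtX = np.array([[5.0, 15.0],
                [15.0, 55.0]])

# Step 5: explicit 2x2 inverse, matching the hand calculation
XtX_inv = np.linalg.inv(XtX)
print(XtX_inv)  # [[ 1.1 -0.3]
                #  [-0.3  0.1]]
```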
Step 6: Final Multiplication to Get $\hat{\theta}$
Purpose: To compute the optimal parameters.
Process:
We need to compute the parameter vector $\hat{\theta}$ using the formula:
$$
\hat{\theta} = (X^T X)^{-1} X^T y
$$
Substituting $X^T y$ from Step 4 and $(X^T X)^{-1}$ from Step 5:
$$
\hat{\theta} = \begin{bmatrix} 1.1 & -0.3 \\ -0.3 & 0.1 \end{bmatrix}
\begin{bmatrix} 21.5 \\ 76.5 \end{bmatrix}
$$
Let’s break it down:
For the first element of $\hat{\theta}$:
$$
1.1 \cdot 21.5 + (-0.3) \cdot 76.5
= 23.65 - 22.95
= 0.7
$$
For the second element of $\hat{\theta}$:
$$
-0.3 \cdot 21.5 + 0.1 \cdot 76.5
= -6.45 + 7.65
= 1.2
$$
So, we have:
$$
\hat{\theta} = \begin{bmatrix} 0.7 \\ 1.2 \end{bmatrix}
$$
Our final parameters are:
- $\theta_0 = 0.7$
- $\theta_1 = 1.2$
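As a sanity check, the whole worked example fits in a few lines of NumPy. This is a sketch assuming NumPy is available; it uses `np.linalg.solve` rather than an explicit inverse, which computes the same $\hat{\theta}$ more robustly:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 4.5, 5, 7], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (X^T X) theta = X^T y instead of inverting explicitly
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [0.7, 1.2] up to floating-point rounding
```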
Why the Normal Equation Works Without Gradient Descent
The normal equation provides a closed-form solution by directly solving the linear system that minimizes the cost function. It works without iterations because it analytically finds the exact parameters that minimize the mean squared error (MSE).
- Closed-Form Solution: Directly finds the minimum of the cost function.
- No Iterations: Avoids the need for multiple updates and learning rate adjustments.
- Exact Solution: Provides an exact solution assuming the matrix $X^T X$ is invertible.
Advantages of Using Matrices
- Compact Representation: Simplifies equations and computations.
- Efficient Computation: Leverages linear algebra libraries for fast matrix operations.
- Scalability: Easily handles multiple features and large datasets.
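To make the compactness and efficiency points concrete, here is a small, admittedly toy sketch (reusing the dataset and the $\hat{\theta}$ found above) comparing a per-point Python loop with a single matrix product:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
X = np.column_stack([np.ones_like(x), x])
theta = np.array([0.7, 1.2])

# Loop version: one prediction at a time
preds_loop = [theta[0] + theta[1] * xi for xi in x]

# Matrix version: the whole prediction vector in one operation
preds_matrix = X @ theta

print(np.allclose(preds_loop, preds_matrix))  # True
```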
Conclusion
Understanding the normal equation and matrix operations in linear regression provides a powerful tool for efficiently and accurately finding the best-fitting line for small to moderately sized datasets. By using matrices, we streamline the computation process, making it more scalable and manageable. While gradient descent is useful for larger datasets and complex models, the normal equation offers a straightforward, closed-form solution for many practical applications.
By following the steps outlined in this post, you can leverage the power of matrices and the normal equation to enhance your understanding and application of linear regression.
Let’s Go Even Deeper: Understanding the Math Behind Linear Regression and the Normal Equation
To fully grasp the normal equation and its derivation, we need to dive into the math behind linear regression. We’ll break down each step, explain where the formulas come from, and understand the situations in which they work.
Why Use the Mean Squared Error (MSE) Cost Function?
The Mean Squared Error (MSE) is used as the cost function in linear regression due to its properties:
- Mathematical Simplicity: MSE is easy to differentiate, which simplifies the optimization process.
- Penalizes Large Errors: Squaring the errors penalizes larger errors more than smaller ones, encouraging the model to avoid large deviations.
- Convex Function: The MSE cost function is convex, meaning it has a single global minimum, making it easier to find the optimal parameters using optimization techniques.
Cost Function in Linear Regression
In linear regression, we aim to predict the target variable \( y \) using a linear model:
\[ \hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n \]
This can be written in vector form as:
\[ \hat{y} = X \theta \]
where:
- \(X\) is the matrix of input features, with each row representing a data point and each column representing a feature, plus an additional column of ones to account for the intercept term (\(\theta_0\)).
- \(\theta\) is the vector of parameters.
The cost function (MSE) is defined as:
\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2 \]
where:
- \(m\) is the number of training examples.
- \(\hat{y}^{(i)}\) is the predicted value for the \(i\)-th example.
- \(y^{(i)}\) is the actual value for the \(i\)-th example.
In matrix form, the cost function can be expressed as:
\[ J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y) \]
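As a rough sketch, the matrix form of the cost translates almost directly into code; the function name and variables below are illustrative, not part of any standard API:

```python
import numpy as np

def mse_cost(theta, X, y):
    """J(theta) = (1/2m) * (X theta - y)^T (X theta - y)."""
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)

# Example with the five-point dataset from earlier; J at theta = [0.7, 1.2] is about 0.04
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 4.5, 5, 7], dtype=float)
X = np.column_stack([np.ones_like(x), x])
print(mse_cost(np.array([0.7, 1.2]), X, y))
```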
Deriving the Normal Equation
To minimize \( J(\theta) \), we need to find the parameters \(\theta\) that minimize the cost function. This involves calculus, specifically taking the derivative of the cost function with respect to \(\theta\) and setting it to zero.
Step-by-Step Derivation
- Express the Cost Function in Matrix Form:
This step converts the sum of squared errors into a compact matrix representation:
\[ J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y) \]
- Compute the Gradient:
To minimize \( J(\theta) \), we compute the gradient of the cost function with respect to \(\theta\):
\[ \nabla_{\theta} J(\theta) = \frac{\partial J(\theta)}{\partial \theta} \]
Expanding the cost function (using the fact that \(\theta^T X^T y = y^T X\theta\), since both are scalars):
\[ J(\theta) = \frac{1}{2m} \left( \theta^T X^T X \theta - 2\theta^T X^T y + y^T y \right) \]
Differentiating term by term gives the gradient:
\[ \nabla_{\theta} J(\theta) = \frac{1}{m} X^T (X\theta - y) \]
- Set the Gradient to Zero:
To find the minimum of the cost function, we set the gradient to zero:
\[ \frac{1}{m} X^T (X\theta - y) = 0 \]
Multiplying both sides by \(m\) (to remove the denominator):
\[ X^T (X\theta - y) = 0 \]
- Solve for \(\theta\):
Rearrange the equation to solve for \(\theta\):
\[ X^T X\theta = X^T y \]
\[ \theta = (X^T X)^{-1} X^T y \]
Why This Formula Works
The formula \(\theta = (X^T X)^{-1} X^T y\) is derived from the method of least squares, which ensures that the resulting parameters \(\theta\) will minimize the sum of squared errors between the predicted and actual values. This method is grounded in linear algebra and calculus, providing a closed-form solution for the parameters.
Situations Where the Normal Equation Works
The normal equation is particularly useful in the following scenarios:
- Small to Moderately Sized Datasets: The normal equation is efficient for datasets where the number of features (\(n\)) is not too large. For very large datasets, computing the inverse of \(X^T X\) can be computationally expensive.
- When Exact Solutions are Needed: The normal equation provides an exact solution, assuming \(X^T X\) is invertible.
- No Need for Iterative Methods: Unlike gradient descent, which iteratively updates parameters and requires tuning of learning rates, the normal equation directly computes the optimal parameters.
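One practical note, offered as a suggestion rather than as the method of this post: because forming \((X^T X)^{-1}\) explicitly can be slow and numerically fragile, the same system is usually solved with `np.linalg.solve` or `np.linalg.lstsq`. A minimal sketch (function names are ours):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve the normal equation X^T X theta = X^T y without an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_linear_regression_lstsq(X, y):
    """Least-squares solution; also behaves well when X^T X is nearly singular."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta
```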
Summary: Direct Solution vs. Iterative Methods in Machine Learning
Direct Solution (Normal Equation):
What It Is:
A closed-form formula for linear regression:
\( \theta = (X^T X)^{-1} X^T y \)
directly computes the optimal parameters in one step.
When to Use It:
- Simple Structure: Works for linear regression with a convex, quadratic error function that has a single global minimum.
- Moderate Feature Size: Efficient when the number of features is not too high, making matrix inversion feasible.
- Exact Solution: Provides a precise solution without needing multiple guesses.
Analogy:
Like solving a small puzzle with a clear picture: you can quickly see and place each piece without trial and error.
Iterative Methods (e.g., Gradient Descent):
What It Is:
Techniques like gradient descent start with an initial guess and iteratively adjust parameters to reduce error:
\( \theta := \theta - \alpha \nabla J(\theta) \)
where \( \alpha \) is the learning rate.
When to Use Them:
- Large or Complex Problems: Ideal for high-dimensional data, huge datasets, or non-linear models where direct inversion is too slow or impossible.
- Memory & Computation Limits: Efficiently handle large matrices by working on smaller batches and avoiding the computational burden of matrix inversion.
- Flexibility: Necessary for models without closed-form solutions, such as deep neural networks.
Analogy:
Like solving a massive, complex puzzle without a guiding picture: you place pieces gradually, learning and adjusting along the way.
Why Choose One Over the Other?
Direct Calculation:
Advantages: Fast and exact for small, simple problems.
Limitations: Not practical for very large feature sets or complex models due to computational and memory demands.
Iterative Methods:
Advantages: Scalable, flexible, and practical for large-scale or non-linear problems.
Trade-offs: Requires many steps, parameter tuning, and may only approximate the optimal solution.
When to Use Each?
Use a direct solution like the normal equation when dealing with a straightforward linear regression problem with a manageable number of features.
Opt for iterative methods like gradient descent when the problem is large, complex, or doesn’t allow for a simple closed-form solution.
Let’s compare Direct vs. Iterative Methods in a Final Comparative Linear Regression Example:
Dataset
Consider a simple dataset with three points:
\( (x, y) \) pairs: \( (1, 2), (2, 3), (3, 4) \)
We want to find a linear relationship:
\[
y = \theta_0 + \theta_1 x
\]
1. Direct Solution Using Normal Equation
Step 1: Set Up Matrices
For linear regression, add a bias term for \( \theta_0 \):
\[
X = \begin{pmatrix}
1 & 1 \\
1 & 2 \\
1 & 3
\end{pmatrix}, \quad
y = \begin{pmatrix}
2 \\
3 \\
4
\end{pmatrix}
\]
Step 2: Compute Normal Equation
The normal equation is:
\[
\theta = (X^T X)^{-1} X^T y
\]
Calculate \(X^T X\):
\[
X^T X = \begin{pmatrix} 3 & 6 \\ 6 & 14 \end{pmatrix}
\]
Calculate \(X^T y\):
\[
X^T y = \begin{pmatrix} 9 \\ 20 \end{pmatrix}
\]
Step 3: Solve for \( \theta \)
Invert \(X^T X\) and compute:
\[
\theta = (X^T X)^{-1} X^T y
\]
After calculations, we find:
\[
\theta_0 = 1, \quad \theta_1 = 1
\]
So the model is:
\[
y = 1 + 1 \cdot x
\]
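The same three-point example can be reproduced in a few lines of NumPy (a sketch under the same setup, assuming NumPy is available):

```python
import numpy as np

X = np.array([[1, 1],
              [1, 2],
              [1, 3]], dtype=float)
y = np.array([2, 3, 4], dtype=float)

# Normal equation via a linear solve
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [1.0, 1.0]
```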
2. Iterative Method Using Gradient Descent
Step 1: Initialize Parameters
Start with:
\[
\theta_0 = 0, \quad \theta_1 = 0
\]
and choose a learning rate, e.g., \( \alpha = 0.1 \).
Step 2: Define Hypothesis and Cost Gradient
Hypothesis function:
\[
h_\theta(x) = \theta_0 + \theta_1 x
\]
Gradients:
\[
\begin{aligned}
\frac{\partial J}{\partial \theta_0} &= \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \\
\frac{\partial J}{\partial \theta_1} &= \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \, x^{(i)}
\end{aligned}
\]
with \( m = 3 \).
Step 3: First Iteration
Using initial \( \theta_0 = 0, \theta_1 = 0 \), predictions for \( x = 1, 2, 3 \) are all 0.
Errors: \( -2 \) for \( x=1 \), \( -3 \) for \( x=2 \), \( -4 \) for \( x=3 \).
Compute gradients:
\[
\begin{aligned}
\frac{\partial J}{\partial \theta_0} &= -3 \\
\frac{\partial J}{\partial \theta_1} &\approx -6.667
\end{aligned}
\]
Update parameters:
\[
\begin{aligned}
\theta_0 &:= 0 - 0.1 \times (-3) = 0.3 \\
\theta_1 &:= 0 - 0.1 \times (-6.667) \approx 0.667
\end{aligned}
\]
After the first iteration: \( \theta_0 \approx 0.3, \; \theta_1 \approx 0.667 \).
Step 4: Continue Iterating
With each subsequent iteration, predictions improve and the parameters gradually approach:
\[
\theta_0 = 1, \quad \theta_1 = 1
\]
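For comparison, here is a minimal batch gradient descent loop for the same three points. The learning rate and iteration count below are illustrative choices, not tuned values:

```python
import numpy as np

X = np.array([[1, 1],
              [1, 2],
              [1, 3]], dtype=float)
y = np.array([2, 3, 4], dtype=float)

theta = np.zeros(2)   # start from theta_0 = theta_1 = 0
alpha = 0.1           # learning rate
m = len(y)

for _ in range(1000):
    gradient = X.T @ (X @ theta - y) / m   # (1/m) X^T (X theta - y)
    theta = theta - alpha * gradient

print(theta)  # converges close to [1.0, 1.0]
```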
Key Takeaways
- The direct solution via the normal equation finds the optimal parameters \( \theta_0 = 1, \theta_1 = 1 \) in one calculation.
- The iterative method (gradient descent) starts with an initial guess and gradually improves the parameters until they match the direct solution.
Both approaches ultimately find the best-fit line:
\[
y = 1 + x
\]
for the given data.