# Understanding Linear Regression: The Normal Equation and Matrix Multiplications Explained

Linear regression is a fundamental concept in machine learning and statistics, used to predict a target variable based on one or more input features. While gradient descent is a popular method for finding the best-fitting line, the normal equation offers a direct, analytical approach that requires no iterations. This blog post walks you through the normal equation step by step, explaining why and how it works, and why using matrices simplifies the process.

## Table of Contents

- Introduction to Linear Regression
- Gradient Descent vs. Normal Equation
- Step-by-Step Explanation of the Normal Equation
  - Step 1: Add a Column of Ones
  - Step 2: Transpose of $X$ ($X^T$)
  - Step 3: Matrix Multiplication ($X^T X$)
  - Step 4: Matrix Multiplication ($X^T y$)
  - Step 5: Inverse of $X^T X$ ($(X^T X)^{-1}$)
  - Step 6: Final Multiplication to Get $\hat{\theta}$
- Why the Normal Equation Works Without Gradient Descent
- Advantages of Using Matrices
- Conclusion

## Introduction to Linear Regression

Linear regression aims to fit a line to a dataset, predicting a target variable $y$ based on an input feature $x$. The model is defined as:

$$ y = \theta_0 + \theta_1 x $$

For multiple features, it generalizes to:

$$ y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n $$

## Gradient Descent vs. Normal Equation

### Gradient Descent

- **Iterative Method:** Starts with initial guesses for the parameters $\theta$ and iteratively updates them to minimize the cost function (e.g., mean squared error).
- **Requires:** Choosing a learning rate and a number of iterations.
- **Usage:** Suitable for very large datasets and complex models.

### Normal Equation

- **Analytical Method:** Directly computes the parameters $\theta$ that minimize the cost function, without iterations.
- **Formula:** $\hat{\theta} = (X^T X)^{-1} X^T y$
- **Advantages:** Efficient for small to moderately sized datasets; no learning rate or iteration count to tune.

## Step-by-Step Explanation of the Normal Equation

Let's use a simple dataset to illustrate the process:

| Hours Studied ($x$) | Score ($y$) |
|---------------------|-------------|
| 1                   | 2           |
| 2                   | 3           |
| 3                   | 4.5         |
| 4                   | 5           |
| 5                   | 7           |

### Step 1: Add a Column of Ones

**Purpose:** To include the intercept term ($\theta_0$) in the model.

**Process:**

$$ X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{bmatrix}, \quad y = \begin{bmatrix} 2 \\ 3 \\ 4.5 \\ 5 \\ 7 \end{bmatrix} $$

### Step 2: Transpose of $X$ ($X^T$)

**Purpose:** To prepare for matrix multiplication.

**Process:**

$$ X^T = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \end{bmatrix} $$

### Step 3: Matrix Multiplication ($X^T X$)

**Purpose:** To consolidate the feature data.

**Process:**

$$ X^T X = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{bmatrix} = \begin{bmatrix} 5 & 15 \\ 15 & 55 \end{bmatrix} $$

### Step 4: Matrix Multiplication ($X^T y$)

**Purpose:** To consolidate the relationship between the features and the target values.

**Process:**

$$ X^T y = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \\ 4.5 \\ 5 \\ 7 \end{bmatrix} = \begin{bmatrix} 21.5 \\ 76.5 \end{bmatrix} $$

Here the first entry is the sum of the targets, $2 + 3 + 4.5 + 5 + 7 = 21.5$, and the second is $1 \cdot 2 + 2 \cdot 3 + 3 \cdot 4.5 + 4 \cdot 5 + 5 \cdot 7 = 76.5$.
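To make these intermediate products concrete, here is a minimal NumPy sketch that builds the design matrix with its column of ones and reproduces $X^T X$ and $X^T y$. The array names (`X`, `y`, `XtX`, `Xty`) are illustrative choices, not part of the original derivation:

```python
import numpy as np

# Raw data from the table: hours studied (x) and scores (y)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 4.5, 5, 7])

# Step 1: prepend a column of ones so that theta_0 acts as the intercept
X = np.column_stack([np.ones_like(x), x])

# Steps 2-4: the transpose and the two matrix products
XtX = X.T @ X  # expected: [[ 5., 15.], [15., 55.]]
Xty = X.T @ y  # expected: [21.5, 76.5]

print(XtX)
print(Xty)
```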
### Step 5: Inverse of $X^T X$ ($(X^T X)^{-1}$)

**Purpose:** To isolate the parameter vector $\theta$.

**Process:**

$$ (X^T X)^{-1} = \begin{bmatrix} 5 & 15 \\ 15 & 55 \end{bmatrix}^{-1} $$

To find the inverse of a 2×2 matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$, we use the formula:

$$ \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} $$

Applying this to our matrix:

$$ a = 5, \quad b = 15, \quad c = 15, \quad d = 55 $$

$$ (X^T X)^{-1} = \frac{1}{5 \cdot 55 - 15 \cdot 15} \begin{bmatrix} 55 & -15 \\ -15 & 5 \end{bmatrix} = \frac{1}{50} \begin{bmatrix} 55 & -15 \\ -15 & 5 \end{bmatrix} = \begin{bmatrix} 1.1 & -0.3 \\ -0.3 & 0.1 \end{bmatrix} $$

### Step 6: Final Multiplication to Get $\hat{\theta}$

**Purpose:** To compute the optimal parameters.

**Process:** Using the products we already computed in Steps 3 through 5, we evaluate the formula:

$$ \hat{\theta} = (X^T X)^{-1} X^T y = \begin{bmatrix} 1.1 & -0.3 \\ -0.3 & 0.1 \end{bmatrix} \begin{bmatrix} 21.5 \\ 76.5 \end{bmatrix} $$

Let's break it down. For the first element of $\hat{\theta}$:

$$ 1.1 \cdot 21.5 + (-0.3) \cdot 76.5 = 23.65 - 22.95 = 0.7 $$

For the second element of $\hat{\theta}$:

$$ -0.3 \cdot 21.5 + 0.1 \cdot 76.5 = -6.45 + 7.65 = 1.2 $$

So, we have:

$$ \hat{\theta} = \begin{bmatrix} 0.7 \\ 1.2 \end{bmatrix} $$

Our final parameters are:

- $\theta_0 = 0.7$
- $\theta_1 = 1.2$

giving the fitted line $\hat{y} = 0.7 + 1.2x$.
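As a quick numerical check, the whole derivation collapses to a couple of NumPy lines. This is a sketch rather than a definitive implementation, and it also shows `np.linalg.solve`, which solves the system $X^T X \theta = X^T y$ directly and is generally preferred in practice over forming the inverse explicitly:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 4.5, 5, 7])
X = np.column_stack([np.ones_like(x), x])

# Textbook formula: theta_hat = (X^T X)^{-1} X^T y
theta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(theta_hat)  # expected: [0.7 1.2]

# Numerically safer equivalent: solve X^T X theta = X^T y
# instead of explicitly inverting X^T X
theta_solved = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(theta_hat, theta_solved))  # expected: True
```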
## Why the Normal Equation Works Without Gradient Descent

The normal equation provides a closed-form solution by directly solving the linear system that minimizes the cost function. It works without iterations because it analytically finds the exact parameters that minimize the mean squared error (MSE): setting the gradient of the MSE to zero yields the system $X^T X \theta = X^T y$, and solving that system gives the formula above.

- **Closed-Form Solution:** Directly finds the minimum of the cost function.
- **No Iterations:** Avoids repeated updates and learning-rate tuning.
- **Exact Solution:** Provides an exact solution, assuming the matrix $X^T X$ is invertible.

## Advantages of Using Matrices

- **Compact Representation:** Simplifies equations and…
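To illustrate the compact-representation point with the parameters derived in Step 6, here is a minimal sketch (the variable names are my own) comparing per-example arithmetic with the single matrix expression $\hat{y} = X\theta$:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
X = np.column_stack([np.ones_like(x), x])
theta = np.array([0.7, 1.2])  # parameters from Step 6

# Without matrices: compute each prediction individually
preds_loop = [theta[0] + theta[1] * xi for xi in x]

# With matrices: the entire prediction vector in one expression
preds_matrix = X @ theta

print(np.allclose(preds_loop, preds_matrix))  # expected: True
```

The matrix form also scales unchanged to any number of features, since adding a feature only adds a column to $X$.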