
Difference between Gradient descent and Normal equation


In regression models, our objective is to find a model whose predictions closely match the actual target values. To do so, we look for the model parameters that best serve this objective. The general idea is to calculate the error between the actual and predicted values and then adjust the parameters to make that error as small as possible.
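As a small illustration of "calculating the error", here is a sketch that scores two sets of predictions against the same targets. It assumes the mean squared error as the error measure; the function name `mean_squared_error` and the sample numbers are made up for this example.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = [3.0, 7.0, 11.0]
good_fit = [3.0, 7.0, 11.0]  # predictions from well-chosen parameters (hypothetical)
poor_fit = [2.0, 5.0, 14.0]  # predictions from badly chosen parameters (hypothetical)

error_good = mean_squared_error(y_true, good_fit)
error_poor = mean_squared_error(y_true, poor_fit)
print(error_good, error_poor)
```

The parameters that produce the smaller error are the better ones; both techniques discussed here are ways of searching for the parameters with the smallest such error.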

For models like Linear Regression, two techniques can be used to find the parameters: the Normal Equation and Gradient Descent.

Gradient Descent

Gradient Descent is an iterative optimization algorithm that finds the parameter values that minimize a cost function. It is one of the most widely used optimization techniques in machine learning. The parameters are the coefficients in Linear Regression and the weights in neural networks.

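Repeat until convergence:
\theta_{j}=\theta_{j}-\alpha \frac{\partial}{\partial \theta_{j}} J(\theta)

Here \alpha is the learning rate and J(\theta) is the cost function; all \theta_{j} are updated simultaneously. Gradient descent can converge even if the learning rate is kept fixed, provided the rate is small enough, because the gradient itself shrinks as the parameters approach a minimum.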
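To make the update rule concrete, here is a minimal sketch on a one-parameter toy cost, J(theta) = (theta - 3)^2, whose minimum is at theta = 3. The cost function, the function name `minimize_quadratic`, and the default values are assumptions chosen only for illustration.

```python
def minimize_quadratic(alpha=0.1, num_iterations=100, theta=0.0):
    """Gradient descent on the toy cost J(theta) = (theta - 3)^2."""
    for _ in range(num_iterations):
        gradient = 2.0 * (theta - 3.0)    # dJ/dtheta
        theta = theta - alpha * gradient  # update rule with a fixed alpha
    return theta

theta = minimize_quadratic()
print(theta)  # very close to 3.0
```

With the fixed rate alpha = 0.1 the distance to the minimum shrinks by a constant factor at every step. With a rate that is too large for this cost (for example alpha = 1.1) each step overshoots and the iterates move away from the minimum instead.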

Python Implementation of Gradient Descent 

We can implement gradient descent on our input features using the numpy library. Before running it, we have to choose some hyperparameters, such as the learning rate and the number of iterations; these are set by hand and are not learned by the model.


import numpy as np

def gradient_descent(X, y):
    # Decide learning rate value
    learning_rate = 0.01
    # Decide number of iterations
    num_iterations = 100
    # Get the number of rows and columns in the dataset
    num_samples, num_features = X.shape
    # Initialize random weights
    weights = np.random.randn(num_features)
    for iteration in range(num_iterations):
        # Shuffle the samples (this does not change a full-batch gradient,
        # but would matter if the loop were changed to use mini-batches)
        permutation = np.random.permutation(num_samples)
        X = X[permutation]
        y = y[permutation]
        # Predictions of the current model
        predictions = np.dot(X, weights)
        error = predictions - y
        # Gradient of the mean squared error with respect to the weights
        gradients = 2 * np.dot(X.T, error) / num_samples
        # Update weights
        weights -= learning_rate * gradients
    return weights

X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([3, 7, 11])
weights = gradient_descent(X, y)
# To make prediction on dataset
X_test = np.array([[3, 4], [5, 4], [7, 9]])
predictions = np.dot(X_test, weights)
print(predictions)



Output (the exact values vary from run to run because the weights are initialized randomly):
[ 6.90093216 10.1567656  15.93407653]
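The function above always stops after a fixed number of iterations, whether or not the weights have settled. A possible variation, sketched below on the same training data, stops once the gradient norm falls below a tolerance. The function name, the `tol`, `max_iterations`, and `seed` parameters, and their default values are assumptions made for this sketch.

```python
import numpy as np

def gradient_descent_until_converged(X, y, learning_rate=0.01, tol=1e-8,
                                     max_iterations=100_000, seed=0):
    """Full-batch gradient descent that stops when the gradient is tiny."""
    rng = np.random.default_rng(seed)          # fixed seed for repeatable runs
    weights = rng.standard_normal(X.shape[1])  # random initial weights
    for iteration in range(max_iterations):
        error = X @ weights - y
        gradients = 2 * X.T @ error / len(y)   # gradient of the mean squared error
        weights -= learning_rate * gradients
        if np.linalg.norm(gradients) < tol:    # converged: updates are negligible
            break
    return weights, iteration + 1

X = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
y = np.array([3, 7, 11], dtype=float)
weights, iterations_used = gradient_descent_until_converged(X, y)
print(weights, iterations_used)
```

On this data the targets are exactly the sum of the two features, so the weights approach [1, 1]. Reaching that point takes far more than 100 iterations at this learning rate, which is why the fixed-iteration version returns only approximate predictions.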

Normal Equation

The Normal Equation is an analytical approach to optimization and an alternative to Gradient Descent. It performs the minimization without iteration: it directly computes the model parameters that minimize the sum of squared differences between the actual and predicted values, without needing any hyperparameters such as the learning rate or the number of iterations.
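The closed form comes from setting the gradient of the squared-error cost to zero. A short derivation sketch, assuming the cost is written in matrix form as the sum of squared differences:

```latex
J(\Theta) = (X\Theta - y)^{T}(X\Theta - y)

\nabla_{\Theta} J(\Theta) = 2 X^{T}(X\Theta - y) = 0

X^{T} X \Theta = X^{T} y
```

Solving this linear system for \Theta, when X^{T} X is invertible, gives the closed-form solution.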

\Theta=\left(X^{T} X\right)^{-1} X^{T} y

X = input feature value 
y = output value 
If X^T X is singular (non-invertible), for example because some features are linearly dependent or there are more features than samples, regularization can be used.
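A minimal sketch of that situation, using made-up data in which the second feature is exactly twice the first. The regularization shown is ridge (L2) regularization, which adds a multiple of the identity matrix to X^T X; the strength `lam = 0.1` is an arbitrary value chosen for illustration.

```python
import numpy as np

# Hypothetical data: column 2 is exactly 2 * column 1, so the columns of X
# are linearly dependent and X^T X is singular.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
y = np.array([5.0, 10.0, 15.0])

XtX = X.T @ X
rank = np.linalg.matrix_rank(XtX)
print(rank)  # below 2, so XtX has no inverse

# Ridge (L2) regularization: add lam * I so the system becomes solvable.
lam = 0.1  # assumed regularization strength
theta_ridge = np.linalg.solve(XtX + lam * np.eye(XtX.shape[0]), X.T @ y)
print(theta_ridge)
print(X @ theta_ridge)  # predictions stay close to y
```

The regularized system has a unique solution, at the cost of slightly shrinking the parameters compared with an exact fit.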

Python Implementation of the Normal Equation

We can use the numpy library to apply linear algebra functions to the dataset and obtain the parameters of the linear regression model. We also add a 1 to the beginning of each row of the matrix so that the model learns a bias (intercept) parameter.


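import numpy as np

# Input features (the leading 1 in each row is the intercept term)
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
# Target values
y = np.array([[2], [3], [4], [5]])
# Transpose of X
X_transpose = X.T
# Compute X^T * X
X_transpose_X = np.dot(X_transpose, X)
# Compute X^T * y
X_transpose_y = np.dot(X_transpose, y)
# Add the identity matrix to X_transpose_X. This is L2 (ridge) regularization
# with strength 1: it keeps the system solvable but shrinks theta, so the
# result differs from the plain normal equation.
X_transpose_X_regularized = X_transpose_X + np.eye(X_transpose_X.shape[0])
# Solve the (regularized) normal equation
theta = np.linalg.solve(X_transpose_X_regularized, X_transpose_y)
# Input features for testing
X_test = np.array([[1], [4]])
# Prepend the intercept column of ones
X_test_with_intercept = np.c_[np.ones((X_test.shape[0], 1)), X_test]
predictions = np.dot(X_test_with_intercept, theta)
print(predictions)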
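The listing above adds the identity matrix to X^T X before solving, so it computes a regularized solution. For comparison, here is a sketch of the unregularized solve on the same training data, next to numpy's least-squares routine `np.linalg.lstsq`, which solves the same problem without forming X^T X explicitly. The variable names are chosen for this example.

```python
import numpy as np

# Same training data as above; the first column of ones is the intercept term.
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
y = np.array([[2], [3], [4], [5]], dtype=float)

# Plain normal equation: solve (X^T X) theta = X^T y with nothing added.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver applied directly to X and y.
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal.ravel())
print(theta_lstsq.ravel())
```

Because the targets here satisfy y = 1 + x exactly, both approaches recover an intercept and slope of approximately 1.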




Difference between Gradient Descent and the Normal Equation. 

| Gradient Descent | Normal Equation |
| --- | --- |
| We need to choose the learning rate, the number of iterations, and other hyperparameters. | There is no learning rate or number of iterations to choose. |
| It is an iterative algorithm. | It is an analytical (closed-form) approach. |
| Works well with a large number of features. | Works well with a small number of features. |
| Feature scaling can be used and usually speeds up convergence. | No need for feature scaling. |
| No need to handle non-invertibility cases. | If X^T X is non-invertible, regularization can be used to handle this. |
| Time complexity depends on the number of iterations and the size of the data. | Time complexity is dominated by the inversion of X^T X, which grows roughly cubically with the number of features. |
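The feature-scaling row can be illustrated with a small sketch. It uses made-up data with one large-valued feature, standardizes that feature for gradient descent, and leaves it raw for the normal equation. The condition number of X^T X is used here as a rough indicator of how hard the problem is for gradient descent; the data, the helper `design_matrix`, the learning rate, and the iteration count are assumptions for illustration.

```python
import numpy as np

# Hypothetical data: one large-valued feature, targets exactly linear in it.
x = np.array([1000.0, 2000.0, 3000.0, 4000.0, 5000.0])
y = 0.1 * x + 50.0

def design_matrix(feature):
    """Prepend a column of ones so the model has an intercept."""
    return np.c_[np.ones(len(feature)), feature]

X_raw = design_matrix(x)
x_scaled = (x - x.mean()) / x.std()  # standardize: zero mean, unit variance
X_scaled = design_matrix(x_scaled)

# A large condition number means slow gradient descent convergence.
cond_raw = np.linalg.cond(X_raw.T @ X_raw)
cond_scaled = np.linalg.cond(X_scaled.T @ X_scaled)
print(cond_raw, cond_scaled)

# Normal equation on the raw, unscaled feature.
theta_raw = np.linalg.solve(X_raw.T @ X_raw, X_raw.T @ y)

# Gradient descent on the standardized feature.
theta_gd = np.zeros(2)
for _ in range(200):
    gradient = 2 * X_scaled.T @ (X_scaled @ theta_gd - y) / len(y)
    theta_gd -= 0.1 * gradient

# The parameters differ (different feature scales) but the predictions agree.
print(X_raw @ theta_raw)
print(X_scaled @ theta_gd)
```

The two parameter vectors are expressed in different units, so they should be compared through their predictions rather than directly.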

Last Updated : 10 Jun, 2023