ML | Normal Equation in Linear Regression

The Normal Equation is an analytical approach to Linear Regression with a least-squares cost function. It lets us find the value of θ directly, without using Gradient Descent, which makes it an effective and time-saving option when working with a dataset that has a small number of features.

The Normal Equation is as follows:

$\theta=\left(X^{T} X\right)^{-1} \cdot\left(X^{T} y\right)$

In the above equation,
θ : the hypothesis parameters that fit the data best.
X : the input feature values of each instance.
y : the output value of each instance.
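
As a quick illustration, here is a minimal NumPy sketch of this formula (the function and variable names are made up for the example, and X is assumed to already include the leading column of ones described below):

Python3

import numpy as np

def normal_equation(X, y):
    # theta = (X^T X)^{-1} (X^T y)
    return np.linalg.inv(X.T @ X) @ (X.T @ y)

In practice, np.linalg.pinv(X.T @ X) is often preferred over np.linalg.inv when X^T X may be singular, but the formula above matches the equation as written.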

Maths Behind the equation –

Given the hypothesis function

$h(\theta)=\theta_{0} x_{0}+\theta_{1} x_{1}+\ldots \theta_{n} x_{n}$

where,
n : the number of features in the data set.
x0 : 1 (a constant feature, so the intercept can be folded into the vector multiplication, as in the sketch below).
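
For concreteness, here is a small sketch (using NumPy, with hypothetical array names) of how the constant x0 = 1 column is usually prepended to the raw feature matrix so that θ0 acts as the intercept:

Python3

import numpy as np

# hypothetical raw data: m = 4 instances, n = 2 features
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])

# prepend the x0 = 1 column so theta_0 becomes the intercept term
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
print(X.shape)   # (4, 3) -> columns are x0, x1, x2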

Notice that this is the dot product between the θ and x vectors. So, for convenience, we can write it as:

$h(\theta)=\theta^{T} x$

The motive in Linear Regression is to minimize the cost function:

$J(\theta)=\frac{1}{2 m} \sum_{i=1}^{m}\left[h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right]^{2}$

where,
x^(i) : the input values of the ith training example.
m : the number of training instances.
n : the number of features in the data set.
y^(i) : the expected output of the ith training example.
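
As a sanity check on the formula, here is a minimal NumPy sketch of this cost function (the function and variable names are illustrative, not from the article):

Python3

import numpy as np

def cost(theta, X, y):
    # J(theta) = (1/2m) * sum((h_theta(x^(i)) - y^(i))^2)
    m = len(y)
    residuals = X @ theta - y      # h_theta(x) for every instance, minus y
    return (1.0 / (2 * m)) * np.sum(residuals ** 2)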



Let us represent the cost function in vector form.

$\left[\begin{array}{c}h_{\theta}\left(x^{(1)}\right) \\ h_{\theta}\left(x^{(2)}\right) \\ \vdots \\ h_{\theta}\left(x^{(m)}\right)\end{array}\right]-\left[\begin{array}{c}y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)}\end{array}\right]$

We have ignored the 1/2m factor here, as it makes no difference to the result: it was only included for mathematical convenience when computing the gradient in Gradient Descent, and it is no longer needed here.

$\left[\begin{array}{c}\theta^{T} x^{(1)} \\ \theta^{T} x^{(2)} \\ \vdots \\ \theta^{T} x^{(m)}\end{array}\right]-y$

$\left[\begin{array}{c}\theta_{0} x_{0}^{(1)}+\theta_{1} x_{1}^{(1)}+\ldots+\theta_{n} x_{n}^{(1)} \\ \theta_{0} x_{0}^{(2)}+\theta_{1} x_{1}^{(2)}+\ldots+\theta_{n} x_{n}^{(2)} \\ \vdots \\ \theta_{0} x_{0}^{(m)}+\theta_{1} x_{1}^{(m)}+\ldots+\theta_{n} x_{n}^{(m)}\end{array}\right]-y$

$x_{j}^{(i)}$ : the value of the jth feature in the ith training example.

This can further be reduced to $X \theta-y$.
However, each residual value needs to be squared, and we cannot simply square the above expression, because the square of a vector/matrix is not the same as squaring each of its entries. To get the sum of squared residuals, we multiply the vector by its transpose. So, the expression we obtain is

$(X \theta-y)^{T}(X \theta-y)$

Therefore, the cost function is
Cost $=(X \theta-y)^{T}(X \theta-y)$
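
The equivalence between the summation form and this vectorized form is easy to check numerically; a minimal sketch (with hypothetical, randomly generated arrays) might look like:

Python3

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # x0 = 1 column + 2 features
y = rng.normal(size=5)
theta = rng.normal(size=3)

residuals = X @ theta - y
loop_cost = sum(r ** 2 for r in residuals)        # summation over squared residuals
vector_cost = residuals.T @ residuals             # (X*theta - y)^T (X*theta - y)
print(np.isclose(loop_cost, vector_cost))         # True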

Now, to get the value of θ that minimizes the cost, we take the derivative with respect to θ and set it to zero:

$\frac{\partial J_{\theta}}{\partial \theta}=\frac{\partial}{\partial \theta}\left[(X \theta-y)^{T}(X \theta-y)\right]$

$\frac{\partial J_{\theta}}{\partial \theta}=2 X^{T} X \theta-2 X^{T} y$

$\operatorname{Cost}^{\prime}(\theta)=0$

$2 X^{T} X \theta-2 X^{T} y=0$

$2 X^{T} X \theta=2 X^{T} y$

$\left(X^{T} X\right)^{-1}\left(X^{T} X\right) \theta=\left(X^{T} X\right)^{-1} \cdot\left(X^{T} y\right)$

$\theta=\left(X^{T} X\right)^{-1} \cdot\left(X^{T} y\right)$

So, this is the final derived Normal Equation, with θ giving the minimum cost value.
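
To close the loop, here is a small end-to-end sketch (with made-up data) that computes θ with the derived Normal Equation and checks it against NumPy's built-in least-squares solver:

Python3

import numpy as np

rng = np.random.default_rng(42)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # x0 = 1 column + n features
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=m)                # noisy linear data

# Normal Equation: theta = (X^T X)^{-1} (X^T y)
theta_normal = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Reference solution from NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_normal, theta_lstsq))   # True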

