
Assumptions of Linear Regression

Last Updated : 28 Dec, 2022

Linear regression is one of the simplest machine learning algorithms for predictive analysis. It models a linear relationship between a dependent variable (y) and one or more independent variables (x), which is why it is called linear regression. In other words, it estimates how the value of the dependent variable changes as the values of the independent variables change. Linear regression is used to predict continuous/real-valued numeric variables such as sales, salary, age, product price, etc.

The theory of linear regression is based on certain statistical assumptions. It is crucial to check these regression assumptions before modeling the data using the linear regression approach.

There are 7 main assumptions made when using Linear Regression:

  • Linear Model
  • No Multicollinearity in the data
  • Homoscedasticity of Residuals or Equal Variances
  • No Autocorrelation in residuals
  • Number of observations Greater than the number of predictors
  • Each observation is unique
  • Predictors are distributed Normally
     

Linear Model

According to this assumption, the relationship between the independent and dependent variables should be linear. The reason is that if the relationship is non-linear, which is often the case with real-world data, the predictions made by the linear regression model will be inaccurate and will deviate substantially from the actual observations.
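
The article does not prescribe a specific check here, but a quick and common one is to scatter-plot the target against each predictor and eyeball whether a straight line describes the trend. The variable names and data below are purely illustrative:

```python
# Illustrative sketch: eyeballing linearity by plotting the target against a predictor.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)                      # hypothetical predictor
y = 3.0 * x + 5.0 + rng.normal(0, 2, 200)        # roughly linear target with noise

slope, intercept = np.polyfit(x, y, 1)           # least-squares line for reference
xs = np.linspace(x.min(), x.max(), 100)

plt.scatter(x, y, alpha=0.5)
plt.plot(xs, slope * xs + intercept, color="red")
plt.xlabel("x (predictor)")
plt.ylabel("y (target)")
plt.title("Linearity check: y vs. x")
plt.show()
```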

No Multicollinearity in the data

If the predictor variables are correlated with one another, the data is said to have a multicollinearity problem. Why is this a problem? High collinearity means that two variables vary very similarly and carry the same kind of information, which leads to redundancy in the dataset. Redundancy only increases the complexity of the model; no new information or pattern is learned from it. We generally try to avoid highly correlated features even when using more complex models.

We can identify highly correlated features using scatter plots or a correlation heatmap.
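
A minimal sketch of such a heatmap check, assuming a pandas DataFrame of predictors (the column names and data here are made up):

```python
# Illustrative sketch: spotting correlated predictors with a correlation heatmap.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
area = rng.uniform(500, 3500, 300)
rooms = area / 400 + rng.normal(0, 0.5, 300)     # deliberately correlated with area
age = rng.uniform(0, 50, 300)
X = pd.DataFrame({"area": area, "rooms": rooms, "age": age})

corr = X.corr()                                  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Predictor correlation heatmap")
plt.show()
```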

Homoscedasticity of Residuals or Equal Variances

Homoscedasticity means that the residuals produced by the linear regression model should have a homogeneous, i.e. equal, spread across all levels of the fitted values. If the spread of the residuals is heterogeneous, the model is considered unsatisfactory.

One can easily get an idea of the homoscedasticity of the residuals by making a scatter plot of the residuals against the fitted values.
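
A minimal sketch of such a residual plot, assuming a model fitted with scikit-learn on made-up data; residuals scattered evenly around zero across the fitted values suggest homoscedasticity, while a funnel shape suggests heteroscedasticity:

```python
# Illustrative sketch: residuals vs. fitted values to judge homoscedasticity.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, (300, 1))                    # hypothetical single predictor
y = 2.5 * X[:, 0] + 4.0 + rng.normal(0, 1.5, 300)   # constant-variance noise

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted values")
plt.show()
```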


No Autocorrelation in residuals

One of the critical assumptions of multiple linear regression is that there should be no autocorrelation in the data. Autocorrelation occurs when the residuals are dependent on one another. This is often visible with stock prices, where the price of a stock is not independent of its previous values.
Plotting the residuals on a graph, such as a scatter plot or a line plot, lets you check for autocorrelation, if any.
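
Beyond the plots mentioned above, one widely used numeric check (an addition here, not part of the original text) is the Durbin-Watson statistic on the residuals; values near 2 suggest little autocorrelation. A rough sketch with stand-in residuals:

```python
# Illustrative sketch: checking residual autocorrelation with a line plot
# and the Durbin-Watson statistic (values near 2 suggest no autocorrelation).
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
residuals = rng.normal(0, 1, 200)    # stand-in for residuals from a fitted model

plt.plot(residuals)
plt.xlabel("Observation order")
plt.ylabel("Residual")
plt.title("Residuals in observation order")
plt.show()

print("Durbin-Watson statistic:", durbin_watson(residuals))
```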


Number of observations Greater than the number of predictors

For a better-performing model, the number of observations should always be greater than the number of predictors, and in general, the more observations you have, the better the model performs. Therefore, to build a linear regression model you must have more observations than independent variables (predictors) in the data set. The reason behind this can be understood through the curse of dimensionality.
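
A trivial sanity check, sketched with a hypothetical feature matrix:

```python
# Illustrative sketch: confirm there are more observations (rows) than predictors (columns).
import numpy as np

X = np.random.default_rng(3).normal(size=(150, 4))   # 150 observations, 4 predictors
n_obs, n_pred = X.shape
assert n_obs > n_pred, "Linear regression needs more observations than predictors."
print(f"{n_obs} observations, {n_pred} predictors - OK")
```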

Each observation is unique

It is also important to ensure that each observation is independent of the others, meaning each observation in the data set should be measured separately, from a unique occurrence of the event that produced it.

For example:

If you want to include two observations measuring the density of a liquid with 5 kg mass and 5 L volume, you must perform the experiment twice to obtain two independent measurements. Such observations are called replicates of each other. It would be wrong to reuse the same measurement for both observations, as that would disregard the random error.

Predictors are distributed Normally

This assumption ensures that the observations are evenly distributed across the range of each predictor. As a result, at the end of model training, the predicted values for the test data should also follow a roughly normal distribution. One can get an idea of the distribution of the predicted values by plotting density, KDE, or QQ plots for the predictions.
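
A minimal sketch of a QQ plot, using SciPy's probplot on made-up values; points lying close to the reference line indicate an approximately normal distribution:

```python
# Illustrative sketch: QQ plot to check whether values are approximately normal.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(4)
values = rng.normal(loc=50, scale=10, size=300)   # stand-in for predictor or predicted values

stats.probplot(values, dist="norm", plot=plt)
plt.title("QQ plot of values")
plt.show()
```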

 

