# ML | Multiple Linear Regression (Backward Elimination Technique)

Multiple Linear Regression is a type of regression where the model depends on several independent variables (instead of only one independent variable, as in Simple Linear Regression). There are several techniques for building an effective multiple linear regression model, namely:

- All-in
- Backward Elimination
- Forward Selection
- Bidirectional Elimination

In this article, we will implement multiple linear regression using the backward elimination technique.

Backward Elimination consists of the following steps:

1. Select a significance level to stay in the model (e.g. SL = 0.05).
2. Fit the model with all possible predictors.
3. Consider the predictor with the highest p-value. If p > SL, go to step 4; otherwise, stop.
4. Remove that predictor.
5. Fit the model without this variable and repeat from step 3 until the condition in step 3 becomes false.

Let us suppose that we have a dataset containing expenditure information for different companies. We would like to know the profit made by each company in order to determine which company would give the best results if we collaborated with it. We build the regression model using a step-by-step approach.

**Step 1 : Basic preprocessing and encoding**

```python
# import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# import the dataset
df = pd.read_csv('50_Startups.csv')

# first five entries of the dataset
df.head()

# split the dataframe into dependent and independent variables
x = df[['R&D Spend', 'Administration', 'Marketing Spend', 'State']]
y = df['Profit']
x.head()
y.head()

# since State is a string-valued column, we need to encode it
x = pd.get_dummies(x)
x.head()
```


**Dataset**

**The set of independent variables after encoding the state column**

**Step 2 : Splitting the data into training and testing set and making predictions**

```python
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(x_train, y_train)
pred = lm.predict(x_test)
```


We can see that our predictions are close enough to the test set, but how do we find the most important factor contributing to the profit?

Here is a solution for that.

We know that the equation of a multiple linear regression line is given by

y = b0 + b1\*x1 + b2\*x2 + b3\*x3 + …

where b0, b1, b2, … are the coefficients and x1, x2, x3, … are the independent variables.

Since there is no 'x' attached to the first coefficient, we treat it as b0 multiplied by 1 and hence append a column of ones to the data. Some libraries take care of this automatically, but since we are using the statsmodels library we need to add the column explicitly.

**Step 3 : Using the backward elimination technique**

```python
import statsmodels.api as sm

# add a column of ones as integer data type
x = np.append(arr=np.ones((50, 1)).astype(int),
              values=x, axis=1)

# choose a significance level, usually 0.05; if p > 0.05
# for the highest-p-value parameter, remove that parameter
x_opt = x[:, [0, 1, 2, 3, 4, 5]]
ols = sm.OLS(endog=y, exog=x_opt).fit()
ols.summary()
```

Note that `sm.OLS` lives in `statsmodels.api`, not `statsmodels.formula.api` (the latter is for R-style formula interfaces).


**This figure shows the parameter with the highest p-value**

And now we follow the steps of **backward elimination** and start eliminating unnecessary parameters.

```python
# remove the 4th column as it has the highest p-value
x_opt = x[:, [0, 1, 2, 3, 5]]
ols = sm.OLS(endog=y, exog=x_opt).fit()
ols.summary()

# remove the 5th column as it has the highest p-value
x_opt = x[:, [0, 1, 2, 3]]
ols = sm.OLS(endog=y, exog=x_opt).fit()
ols.summary()

# remove the 3rd column as it has the highest p-value
x_opt = x[:, [0, 1, 2]]
ols = sm.OLS(endog=y, exog=x_opt).fit()
ols.summary()

# remove the 2nd column as it has the highest p-value
x_opt = x[:, [0, 1]]
ols = sm.OLS(endog=y, exog=x_opt).fit()
ols.summary()
```


**Summary after removing the first unnecessary parameter.**

If we continue the process, we see that we are left with only one column at the end, and that is R&D Spend. We can conclude that the company with the maximum expenditure on R&D makes the highest profit.

With this, we have solved the problem statement of finding the company for collaboration. Now let us have a brief look at the parameters of the OLS summary.

- **R squared** – Tells about the goodness of the fit. It ranges between 0 and 1; the closer the value is to 1, the better the fit. It explains the extent of variation of the dependent variable captured by the model. However, it is biased in that it never decreases, even when irrelevant variables are added.
- **Adj. R squared** – This parameter has a penalising factor (the number of regressors), so unlike R squared it can decrease when an added variable does not improve the model. If its value keeps increasing as we remove unnecessary parameters, continue with the elimination; if it drops, stop and revert.
- **F-statistic** – Used to compare two variances and is always greater than 0. It is formulated as s1^2/s2^2; in regression, it is the ratio of the explained to the unexplained variance of the model.
- **AIC and BIC** – AIC stands for Akaike's information criterion and BIC for Bayesian information criterion. Both depend on the likelihood function L.
- **Skew** – Informs about the symmetry of the data about the mean.
- **Kurtosis** – Measures the shape of the distribution, i.e. how much of the data is concentrated near the mean versus in the tails.
- **Omnibus** – D'Agostino's test; a combined statistical test for the presence of skewness and kurtosis.
- **Log-likelihood** – The log of the likelihood function.

This image shows the preferred relative values of the parameters.

**References:** Machine Learning course by Kirill Eremenko and Hadelin de Ponteves.
