ML | Multiple Linear Regression (Backward Elimination Technique)

Multiple Linear Regression is a type of regression where the model depends on several independent variables (instead of only one independent variable, as in Simple Linear Regression). There are several techniques for building an effective multiple linear regression model, namely:

  • All-in
  • Backward Elimination
  • Forward Selection
  • Bidirectional Elimination

In this article, we will implement multiple linear regression using the backward elimination technique.
Backward Elimination consists of the following steps:

  • a. Select a significance level to stay in the model (e.g. SL = 0.05)
  • b. Fit the model with all possible predictors
  • c. Consider the predictor with the highest p-value. If p > SL, go to step d; otherwise, the model is ready
  • d. Remove that predictor
  • e. Fit the model without this variable and repeat from step c until the condition in step c becomes false (a compact code sketch of this loop follows the list)
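
As a quick illustration of how these steps fit together, here is a minimal sketch of the loop in Python. The helper name backward_elimination is ours, not from any library; the sketch assumes a NumPy feature matrix that already includes a column of ones for the intercept, and it uses statsmodels' OLS, which the article introduces in Step 3.

import numpy as np
import statsmodels.api as sm

def backward_elimination(x, y, sl=0.05):
    """Repeat steps c-e: drop the least significant predictor until
    every remaining p-value is at or below the significance level."""
    x_opt = np.asarray(x, dtype=float)
    while True:
        ols = sm.OLS(endog=y, exog=x_opt).fit()
        pvals = np.asarray(ols.pvalues)
        if pvals.max() <= sl:          # step c: every predictor is significant
            return x_opt, ols
        worst = pvals.argmax()         # step d: index of the worst predictor
        x_opt = np.delete(x_opt, worst, axis=1)  # step e: refit without it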

Let us suppose that we have a dataset containing expenditure information for different companies. We would like to predict the profit made by each company, in order to determine which company would give the best results if we collaborated with it. We build the regression model using a step-by-step approach.



Step 1: Basic preprocessing and encoding


# import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
  
# import the dataset
df = pd.read_csv('50_Startups.csv')
  
# first five entries of the dataset
df.head()
  
# split the dataframe into dependent and independent variables. 
x = df[['R&D Spend', 'Administration', 'Marketing Spend', 'State']]
y = df['Profit']
x.head()
y.head()
  
# the State column is categorical (a string), so we need to encode it
x = pd.get_dummies(x)
x.head()
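
One caveat worth noting: pd.get_dummies as used above creates one column per state, and keeping all of them together with an intercept causes perfect multicollinearity (the dummy variable trap). A common alternative, not used in this article, is to let pandas drop one category automatically:

# alternative encoding: drop the first dummy column of each
# categorical variable to avoid the dummy variable trap
x = pd.get_dummies(x, drop_first=True)

In this article, the extra dummy column is instead left out manually in Step 3, when x_opt is constructed.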



[Figure: the first five rows of the dataset]

[Figure: the independent variables after encoding the State column]

Step 2: Splitting the data into training and testing sets and making predictions


from sklearn.linear_model import LinearRegression

# split the data into training (70%) and testing (30%) sets
x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size = 0.3, random_state = 0)

# fit a linear regression model and predict profits for the test set
lm = LinearRegression()
lm.fit(x_train, y_train)
pred = lm.predict(x_test)
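
To quantify how close the predictions are to the actual test-set profits, we can compute the R² score; this quick check is an addition to the original walkthrough:

from sklearn.metrics import r2_score

# R² close to 1 means the predictions track the actual profits well
print(r2_score(y_test, pred))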



We can see that our predictions are close to the test set values, but how do we find the most important factor contributing to the profit?
Here is a solution for that.
We know that the equation of a multiple linear regression line is given by y = b1 + b2*x + b3*x' + b4*x'' + …,
where b1, b2, b3, … are the coefficients and x, x', x'' are the independent variables.
Since we don't have any 'x' for the first coefficient, we assume it can be written as the product of b1 and 1, and hence we append a column of ones. Some libraries take care of this automatically, but since we are using the statsmodels library we need to add the column explicitly.
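
As a side note, statsmodels provides a helper, statsmodels.api.add_constant, that prepends this column of ones for us; the snippet below is an equivalent alternative to the np.append call used in Step 3:

import statsmodels.api as sm

# prepend a column of ones (named 'const') to the feature matrix
x_with_const = sm.add_constant(x)
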
Step 3: Using the backward elimination technique


import statsmodels.api as sm

# add a column of ones for the intercept term (this also converts
# the DataFrame into a NumPy array)
x = np.append(arr = np.ones((50, 1)).astype(int), 
              values = x, axis = 1)

# choose a significance level, usually SL = 0.05; if the highest
# p-value among the predictors exceeds SL, remove that predictor.
# one dummy column is left out here to avoid the dummy variable trap
x_opt = x[:, [0, 1, 2, 3, 4, 5]]
ols = sm.OLS(endog = y, exog = x_opt).fit()
ols.summary()
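
Rather than reading the p-values off the printed summary by eye, they can also be pulled from the fitted results programmatically; ols.pvalues is a standard attribute of statsmodels results, and this small snippet is an addition for illustration:

# p-value of each column in x_opt; the largest one is the
# candidate for removal in the next elimination round
print(ols.pvalues)
print(np.argmax(ols.pvalues), np.max(ols.pvalues))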




[Figure: OLS summary showing the predictor with the highest p-value]

Now we follow the steps of backward elimination and start removing the unnecessary predictors.


# remove column 4, as its predictor has the highest p-value
x_opt = x[:, [0, 1, 2, 3, 5]]
ols = sm.OLS(endog = y, exog = x_opt).fit()
ols.summary()
  
# remove column 5, as its predictor now has the highest p-value
x_opt = x[:, [0, 1, 2, 3]]
ols = sm.OLS(endog = y, exog = x_opt).fit()
ols.summary()
  
# remove column 3, as its predictor now has the highest p-value
x_opt = x[:, [0, 1, 2]]
ols = sm.OLS(endog = y, exog = x_opt).fit()
ols.summary()
  
# remove column 2, as its predictor now has the highest p-value
x_opt = x[:, [0, 1]]
ols = sm.OLS(endog = y, exog = x_opt).fit()
ols.summary()
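
Alternatively, the entire sequence of eliminations above can be reproduced with the backward_elimination helper sketched earlier, after the algorithm steps:

# assumes the backward_elimination sketch defined earlier in the article
x_final, ols_final = backward_elimination(x[:, [0, 1, 2, 3, 4, 5]], y, sl=0.05)
ols_final.summary()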



[Figure: OLS summary after removing the first unnecessary predictor]

So, if we continue the process, we see that we are left with only one column at the end, and that is R&D Spend. We can conclude that the company with the maximum expenditure on R&D makes the highest profit.
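
As a final sanity check, the conclusion can be confirmed directly on the last fitted model; rsquared and pvalues are standard attributes of the fitted statsmodels results:

# the final model: intercept plus R&D Spend only
print(ols.rsquared)   # share of profit variance explained by R&D Spend
print(ols.pvalues)    # both p-values should be below SL = 0.05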

With this, we have solved the problem statement of finding the company for collaboration. Now let us have a brief look at the parameters of the OLS summary.