Multiple Linear Regression With scikit-learn

Last Updated : 11 Jul, 2022

In this article, let’s learn about multiple linear regression using scikit-learn in the Python programming language.

Regression is a statistical method for determining the relationship between features and an outcome variable. In machine learning, it is used for predictive modeling, in which an algorithm is employed to forecast continuous outcomes. Multiple linear regression, often known simply as multiple regression, is a statistical method that predicts the value of a response variable by combining several explanatory variables. It extends simple linear regression (fitted by ordinary least squares), in which just one explanatory variable is used.

Mathematical Formulation:

To improve prediction, more independent variables are combined. The linear relationship between the dependent and independent variables is:

y = b0 + b1x1 + b2x2 + b3x3 + …

Here, y is the dependent variable.

x1, x2, x3, … are the independent variables.
b0 is the intercept.
b1, b2, … are the coefficients.

By comparison, a simple linear regression line has the form:

y = mx + c

For example, consider a simple advertising dataset:

feature 1: TV

feature 2: radio

feature 3: Newspaper

output variable: sales

The independent variables are feature 1, feature 2 and feature 3, and the dependent variable is sales. The equation for this problem is:

y = b0+b1x1+b2x2+b3x3

x1, x2 and x3 are the feature variables.
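As a minimal sketch of how this equation turns feature values into a prediction, the snippet below evaluates it with NumPy. The coefficients and advertising budgets are made-up numbers chosen only for illustration; they do not come from any fitted model or real dataset.

```python
import numpy as np

# Hypothetical values, for illustration only (not fitted to real data)
b0 = 3.0                            # intercept
b = np.array([0.05, 0.2, 0.01])     # b1, b2, b3 for TV, radio, newspaper
x = np.array([100.0, 20.0, 10.0])   # x1, x2, x3 for one observation

# y = b0 + b1*x1 + b2*x2 + b3*x3
y = b0 + np.dot(b, x)
print(y)
```

Writing the weighted sum as a dot product keeps the code the same no matter how many features are used.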

In this example, we use scikit-learn to perform linear regression. As we have multiple feature variables and a single outcome variable, it is a multiple linear regression problem. Let’s see how to do this step by step.

Stepwise Implementation

Step 1: Import the necessary packages

The necessary packages, such as pandas, NumPy and scikit-learn, are imported.

Python3

# importing modules and packages 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import mean_squared_error, mean_absolute_error 
from sklearn import preprocessing 

Step 2: Import the CSV file:

The CSV file is read with the pd.read_csv() function. To access the CSV file click here. The ‘No’ column is dropped because the DataFrame already has an index. The df.head() method retrieves the first five rows of the DataFrame, and the df.columns attribute returns the column names. The columns whose names start with ‘X’ are the independent features in our dataset, and the column ‘Y house price of unit area’ is the dependent variable. As there is more than one independent (explanatory) variable, this is a multiple linear regression.


Python3

# importing data 
df = pd.read_csv('Real estate.csv') 
df.drop('No', inplace=True, axis=1)
  
print(df.head()) 
print(df.columns)

Output:

X1 transaction date X2 house age … X6 longitude Y house price of unit area
0 2012.917 32.0 … 121.54024 37.9
1 2012.917 19.5 … 121.53951 42.2
2 2013.583 13.3 … 121.54391 47.3
3 2013.500 13.3 … 121.54391 54.8
4 2012.833 5.0 … 121.54245 43.1

[5 rows x 7 columns]
Index(['X1 transaction date', 'X2 house age',
       'X3 distance to the nearest MRT station',
       'X4 number of convenience stores', 'X5 latitude', 'X6 longitude',
       'Y house price of unit area'],
      dtype='object')
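The loading step can be tried without the real file. The sketch below reads a tiny in-memory CSV whose column names and values are invented stand-ins, and it uses the non-inplace form of drop(), which returns a new DataFrame instead of modifying the existing one.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a CSV file (hypothetical columns and values)
csv_text = """No,X1 feature,Y target
1,10.5,37.9
2,20.0,42.2
3,13.3,47.3
"""

df_demo = pd.read_csv(io.StringIO(csv_text))

# drop() returns a new DataFrame, so inplace=True is not required
df_demo = df_demo.drop(columns=["No"])

print(df_demo.head())
print(list(df_demo.columns))
```

Reassigning the result is equivalent to the inplace=True call used in this article; which form to use is largely a matter of style.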

Step 3: Create a scatterplot to visualize the data:

A scatterplot is created to visualize the relationship between the independent variable ‘X4 number of convenience stores’ and the dependent variable ‘Y house price of unit area’.

Python3

# plotting a scatterplot 
sns.scatterplot(x='X4 number of convenience stores', 
                y='Y house price of unit area', data=df) 

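Output: a scatterplot of ‘X4 number of convenience stores’ (x-axis) against ‘Y house price of unit area’ (y-axis).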

Step 4: Create feature variables:

To model the data, we need to create feature variables: X contains the independent variables and y contains the dependent variable. Both X and y are printed to inspect the data.

Python3

# creating feature variables 
X = df.drop('Y house price of unit area', axis=1)
y = df['Y house price of unit area'] 
print(X) 
print(y)

Output:

X1 transaction date X2 house age … X5 latitude X6 longitude
0 2012.917 32.0 … 24.98298 121.54024
1 2012.917 19.5 … 24.98034 121.53951
2 2013.583 13.3 … 24.98746 121.54391
3 2013.500 13.3 … 24.98746 121.54391
4 2012.833 5.0 … 24.97937 121.54245
.. … … … … …
409 2013.000 13.7 … 24.94155 121.50381
410 2012.667 5.6 … 24.97433 121.54310
411 2013.250 18.8 … 24.97923 121.53986
412 2013.000 8.1 … 24.96674 121.54067
413 2013.500 6.5 … 24.97433 121.54310

[414 rows x 6 columns]
0 37.9
1 42.2
2 47.3
3 54.8
4 43.1
…
409 15.4
410 50.0
411 40.6
412 52.5
413 63.9
Name: Y house price of unit area, Length: 414, dtype: float64

Step 5: Split data into train and test sets:

Here, the train_test_split() function is used to create the train and test sets; the feature variables X and the target y are passed to it. The test size is given as 0.3, which means 30% of the data goes into the test set and the remaining 70% into the train set. The random state is fixed so that the split is reproducible.

Python3

# creating train and test sets 
X_train, X_test, y_train, y_test = train_test_split( 
    X, y, test_size=0.3, random_state=101) 
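To see what this split produces, the sketch below applies the same call to synthetic data with the same shape as the dataset in this article (414 rows, 6 features). The random arrays are placeholders, not the real-estate data; only the split sizes and the effect of random_state are of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the article's data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(414, 6))
y_demo = rng.normal(size=414)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101)
print(np.array_equal(X_te, X_te2))
```

Because 30% of 414 is not a whole number, scikit-learn rounds the test set size up, so the proportions are only approximately 70/30.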

Step 6: Create a linear regression model

A linear regression model is created. The LinearRegression class, imported from the sklearn.linear_model module, is used to create it; the same class handles both simple and multiple linear regression.

Python3

# creating a regression model 
model = LinearRegression() 

Step 7: Fit the model with training data.

After the model is created, it is fitted to the training data with the fit() method. During fitting, the model estimates the intercept and coefficients from the training set.

Python3

# fitting the model 
model.fit(X_train,y_train)
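After fitting, the estimated b0 and b1, b2, … are available through the intercept_ and coef_ attributes of the model. The sketch below illustrates this on synthetic data generated from known, made-up coefficients, so the recovered values can be compared with the ones used to build the data. The names demo_model, true_coef and true_intercept are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data built from known, made-up coefficients plus small noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
noise = rng.normal(scale=0.01, size=200)
y_demo = true_intercept + X_demo @ true_coef + noise

demo_model = LinearRegression().fit(X_demo, y_demo)

print(demo_model.intercept_)   # estimate of b0
print(demo_model.coef_)        # estimates of b1, b2, b3
```

On the real-estate data, the same two attributes of model would hold one intercept and six coefficients, one per feature column.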

Step 8: Make predictions on the test data set.

Here, the model.predict() method is used to make predictions on X_test. The test data is unseen: the model was not fitted on it and has no knowledge of its statistics.

Python3

# making predictions 
predictions = model.predict(X_test) 
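For a fitted LinearRegression model, predict() amounts to evaluating the regression equation with the learned intercept and coefficients. The following sketch checks this on random placeholder data; the variable names are illustrative and the data has no connection to the real-estate dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Random placeholder data, used only to obtain a fitted model
rng = np.random.default_rng(7)
X_fit = rng.normal(size=(50, 3))
y_fit = rng.normal(size=50)
X_new = rng.normal(size=(5, 3))

demo_model = LinearRegression().fit(X_fit, y_fit)

# predict() versus b0 + b1*x1 + b2*x2 + b3*x3 computed by hand
via_predict = demo_model.predict(X_new)
via_formula = demo_model.intercept_ + X_new @ demo_model.coef_
print(np.allclose(via_predict, via_formula))
```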

Step 9: Evaluate the model with metrics.

The multiple linear regression model is evaluated with the mean_squared_error and mean_absolute_error metrics. Comparing these errors with the mean of the target variable gives a sense of how well the model is predicting. mean_squared_error is the mean of the squared residuals, and mean_absolute_error is the mean of the absolute residuals. The lower the error, the better the model performs.

mean absolute error = the mean of the absolute values of the residuals: MAE = (1/n) Σ |y − ŷ|

mean squared error = the mean of the squares of the residuals: MSE = (1/n) Σ (y − ŷ)²

y = actual value
ŷ (y hat) = predicted value
n = number of observations
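The two definitions above can be checked by hand. This sketch computes both metrics directly with NumPy on three made-up values and prints them next to the results from scikit-learn; the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual values and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 10.0])

residuals = y_true - y_hat                 # 1, 0, -3
mae_manual = np.mean(np.abs(residuals))    # (1 + 0 + 3) / 3
mse_manual = np.mean(residuals ** 2)       # (1 + 0 + 9) / 3

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(mse_manual, mean_squared_error(y_true, y_hat))
```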

Python3

# model evaluation 
print( 
  'mean_squared_error : ', mean_squared_error(y_test, predictions)) 
print( 
  'mean_absolute_error : ', mean_absolute_error(y_test, predictions)) 

Output:

mean_squared_error :  46.21179783493418
mean_absolute_error :  5.392293684756571

The two values differ considerably here because MSE is expressed in squared units of the target, while MAE is in the same units as the target. If you want the metric to be less sensitive to outliers, MAE is the preferable choice; if you want large errors to be penalized more heavily, MSE/RMSE is the way to go. RMSE is always greater than or equal to MAE, and the two are equal only when all errors have the same magnitude.
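How differently the two kinds of metric react to outliers can be sketched with two small sets of made-up errors: one where every error has the same magnitude, and one where a single error is much larger. The helper functions mae and rmse are defined here for illustration and are not part of scikit-learn.

```python
import numpy as np

def mae(errors):
    """Mean of the absolute errors."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Square root of the mean of the squared errors."""
    return np.sqrt(np.mean(errors ** 2))

# Made-up residuals: equal magnitudes versus one large outlier
equal_errors = np.array([2.0, -2.0, 2.0, -2.0])
with_outlier = np.array([2.0, -2.0, 2.0, -20.0])

print(mae(equal_errors), rmse(equal_errors))
print(mae(with_outlier), rmse(with_outlier))
```

With equal magnitudes the two metrics coincide; once the outlier is introduced, RMSE grows faster than MAE because the large error is squared before averaging.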

Code:

Here is the full code, combining the above steps.

Python3

# importing modules and packages 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import mean_squared_error, mean_absolute_error 
from sklearn import preprocessing 
  
# importing data 
df = pd.read_csv('Real estate.csv') 
df.drop('No', inplace=True, axis=1) 
  
print(df.head()) 
  
print(df.columns) 
  
# plotting a scatterplot 
sns.scatterplot(x='X4 number of convenience stores', 
                y='Y house price of unit area', data=df) 
  
# creating feature variables 
X = df.drop('Y house price of unit area', axis=1) 
y = df['Y house price of unit area'] 
  
print(X) 
print(y) 
  
# creating train and test sets 
X_train, X_test, y_train, y_test = train_test_split( 
    X, y, test_size=0.3, random_state=101) 
  
# creating a regression model 
model = LinearRegression() 
  
# fitting the model 
model.fit(X_train, y_train) 
  
# making predictions 
predictions = model.predict(X_test) 
  
# model evaluation 
print('mean_squared_error : ', mean_squared_error(y_test, predictions)) 
print('mean_absolute_error : ', mean_absolute_error(y_test, predictions)) 

Output:

X1 transaction date X2 house age … X6 longitude Y house price of unit area
0 2012.917 32.0 … 121.54024 37.9
1 2012.917 19.5 … 121.53951 42.2
2 2013.583 13.3 … 121.54391 47.3
3 2013.500 13.3 … 121.54391 54.8
4 2012.833 5.0 … 121.54245 43.1

[5 rows x 7 columns]
Index(['X1 transaction date', 'X2 house age',
       'X3 distance to the nearest MRT station',
       'X4 number of convenience stores', 'X5 latitude', 'X6 longitude',
       'Y house price of unit area'],
      dtype='object')
X1 transaction date X2 house age … X5 latitude X6 longitude
0 2012.917 32.0 … 24.98298 121.54024
1 2012.917 19.5 … 24.98034 121.53951
2 2013.583 13.3 … 24.98746 121.54391
3 2013.500 13.3 … 24.98746 121.54391
4 2012.833 5.0 … 24.97937 121.54245
.. … … … … …
409 2013.000 13.7 … 24.94155 121.50381
410 2012.667 5.6 … 24.97433 121.54310
411 2013.250 18.8 … 24.97923 121.53986
412 2013.000 8.1 … 24.96674 121.54067
413 2013.500 6.5 … 24.97433 121.54310

[414 rows x 6 columns]
0 37.9
1 42.2
2 47.3
3 54.8
4 43.1
…
409 15.4
410 50.0
411 40.6
412 52.5
413 63.9
Name: Y house price of unit area, Length: 414, dtype: float64
mean_squared_error :  46.21179783493418
mean_absolute_error :  5.392293684756571
