
Residual Leverage Plot (Regression Diagnostic)

In linear or multiple regression, simply fitting a model to the dataset is not enough; the fit may not give the desired result. To apply linear or multiple regression effectively, the dataset needs to satisfy a few assumptions, and checking them is what makes the model reliable and generally more accurate.

Assumptions of Regression

Regression analysis requires some assumptions to be satisfied by the dataset. These assumptions are:

- Linearity: the relationship between the predictors and the response is linear.
- Independence: the errors (residuals) are independent of one another.
- Homoscedasticity: the errors have a constant variance across the range of fitted values.
- Normality: the errors are approximately normally distributed.

To perform a good linear regression analysis, we also need to check whether these assumptions are violated.

Regression Diagnostic Plots

The plots below are used to validate and test the above assumptions, and together they form the regression diagnostic. Before we discuss the diagnostic plots one by one, let's define some important terms:

- Residual: the difference between an observed value of the response and the value fitted by the model.
- Standardized (Pearson) residual: a residual divided by an estimate of its standard deviation, which puts all residuals on a comparable scale.
- Leverage: a measure of how far an observation's predictor values lie from those of the other observations; technically, the corresponding diagonal entry of the hat matrix.
- Cook's distance: a measure of how much the fitted model changes when an observation is deleted; it combines the size of the residual with the leverage.
Below are the plots used in the diagnostic:

- Residuals vs Fitted: used to check for linearity and homoscedasticity. If the relationship is linear, the points should scatter around a horizontal line without much deviation; if the errors are homoscedastic, the points should be spread equally around the y = 0 line.
- Normal Q-Q: used to check the normality of the residuals; points lying close to the 45-degree line indicate approximately normal errors.
- Scale-Location: plots the square root of the absolute standardized residuals against the fitted values; a flat, evenly spread trend again suggests homoscedasticity.
- Residuals vs Leverage: used to spot influential observations, i.e. points that combine high leverage with a large standardized residual.

Implementation

In this implementation, we plot the different diagnostic plots. For that, we use the Real Estate dataset and fit an Ordinary Least Squares (OLS) regression. We then draw the regression diagnostic plots and the Cook's distance plot.

# imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
  
# Load the Real Estate data
data = pd.read_csv('/content/Real estate.csv')
data.head()
  
# Fit an OLS regression model
model = smf.ols(formula='Y ~ X3 + X2', data=data)
results = model.fit()
print(results.summary())
  
# Get different Variables for diagnostic
residuals = results.resid
fitted_value = results.fittedvalues
stand_resids = results.resid_pearson
influence = results.get_influence()
leverage = influence.hat_matrix_diag
  
# Plot the different diagnostic plots
plt.style.use('seaborn')  # on matplotlib >= 3.6 use 'seaborn-v0_8'
plt.rcParams["figure.figsize"] = (20, 15)
fig, ax = plt.subplots(nrows=2, ncols=2)
  
# Residual vs Fitted Plot
sns.scatterplot(x=fitted_value, y=residuals, ax=ax[0, 0])
ax[0, 0].axhline(y=0, color='grey', linestyle='dashed')
ax[0, 0].set_xlabel('Fitted Values')
ax[0, 0].set_ylabel('Residuals')
ax[0, 0].set_title('Residuals vs Fitted')
  
# Normal Q-Q plot
sm.qqplot(residuals, fit=True, line='45',ax=ax[0, 1], c='#4C72B0')
ax[0, 1].set_title('Normal Q-Q')
  
# Scale-Location Plot
sns.scatterplot(x=fitted_value, y=np.sqrt(np.abs(stand_resids)), ax=ax[1, 0])
ax[1, 0].set_xlabel('Fitted values')
ax[1, 0].set_ylabel('Sqrt(|standardized residuals|)')
ax[1, 0].set_title('Scale-Location Plot')
  
# Residual vs Leverage Plot
sns.scatterplot(x=leverage, y=stand_resids, ax=ax[1, 1])
ax[1, 1].axhline(y=0, color='grey', linestyle='dashed')
ax[1, 1].set_xlabel('Leverage')
ax[1, 1].set_ylabel('Standardized residuals')
ax[1, 1].set_title('Residuals vs Leverage Plot')
  
  
plt.tight_layout()
plt.show()
  
# Plot the Cook's distance plot
sm.graphics.influence_plot(results, criterion="cooks")

                    
------------
# data
   No        X1    X2         X3  X4        X5         X6     Y
0   1  2012.917  32.0   84.87882  10  24.98298  121.54024  37.9
1   2  2012.917  19.5  306.59470   9  24.98034  121.53951  42.2
2   3  2013.583  13.3  561.98450   5  24.98746  121.54391  47.3
3   4  2013.500  13.3  561.98450   5  24.98746  121.54391  54.8
4   5  2012.833   5.0  390.56840   5  24.97937  121.54245  43.1
------------
OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.491
Model:                            OLS   Adj. R-squared:                  0.489
Method:                 Least Squares   F-statistic:                     198.3
Date:                Thu, 19 Aug 2021   Prob (F-statistic):           5.07e-61
Time:                        17:56:17   Log-Likelihood:                -1527.9
No. Observations:                 414   AIC:                             3062.
Df Residuals:                     411   BIC:                             3074.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     49.8856      0.968     51.547      0.000      47.983      51.788
X3            -0.0072      0.000    -18.997      0.000      -0.008      -0.006
X2            -0.2310      0.042     -5.496      0.000      -0.314      -0.148
==============================================================================
Omnibus:                      161.397   Durbin-Watson:                   2.130
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1297.792
Skew:                           1.443   Prob(JB):                    1.54e-282
Kurtosis:                      11.180   Cond. No.                     3.37e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.37e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
--------------

Regression Diagnostic Plots

Cook's Distance Plot
