
Diagnostic Plots for Model Evaluation

Last Updated : 16 Nov, 2023

Model evaluation is a critical step in the lifecycle of any statistical or machine-learning model. Diagnostic plots help assess a model's performance, check its assumptions, and expose potential problems. This article covers the theory behind diagnostic plots, their types, and their interpretation, with worked examples in R.

Purpose of Diagnostic Plots

Diagnostic plots are visual tools for checking the assumptions behind a statistical or machine-learning model. For linear regression, these assumptions include linearity, normality of residuals, and homoscedasticity; the plots also reveal influential points that can distort the fit. In the R programming language, diagnostic plots help analysts and data scientists spot problems with a model and decide whether to improve or transform it.

Types of Diagnostic Plots

The four standard diagnostic plots for a linear model are discussed below.

Residuals vs Fitted Values

The Residuals vs Fitted Values plot checks the linearity assumption of the model. It shows whether the residuals display any pattern or trend relative to the fitted (predicted) values.

  • Random Scatter: In a well-fitted model, the residuals should be randomly scattered around the horizontal axis (zero line) with no discernible pattern. This suggests that the model captures the underlying linear relationship adequately.
  • Patterns: If there are patterns, it might indicate non-linearity in the data that the model fails to capture.
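To make the two cases above concrete, here is a minimal sketch using simulated data. The data-generating formula, the seed, and the names `fit_line` and `fit_quad` are assumptions chosen for illustration, not part of the original example.

```r
# Hypothetical data: the true relationship is quadratic (assumed for illustration)
set.seed(1)
x <- seq(0, 10, length.out = 60)
y <- 2 + 0.5 * x^2 + rnorm(60, sd = 2)

fit_line <- lm(y ~ x)            # misspecified: straight line only
fit_quad <- lm(y ~ x + I(x^2))   # adds the missing quadratic term

# Residuals vs fitted for both fits, side by side
op <- par(mfrow = c(1, 2))
plot(fit_line, which = 1)  # residuals should trace a U-shape around zero
plot(fit_quad, which = 1)  # residuals should scatter without a clear trend
par(op)

# Numeric hint of the same pattern: residuals of the straight-line fit
# track the squared distance of x from its mean
cor(resid(fit_line), (x - mean(x))^2)
```

A strong positive correlation in the last line is the numeric counterpart of the curved pattern in the first panel; adding the quadratic term should remove it.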

Normal Q-Q Plot

The Normal Q-Q (Quantile-Quantile) plot assesses whether the residuals follow a normal distribution. Normality matters mainly for inference: the usual t-tests, F-tests, and confidence intervals for a linear model rely on approximately normal errors, especially in small samples.

  • Straight Line: If the points on the Q-Q plot fall approximately along a straight line, it suggests that the residuals are normally distributed.
  • Deviation from Line: Systematic deviations from the line, such as points curving away at one or both ends, indicate departures from normality (for example skewness or heavy tails) and prompt further investigation.
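As a rough sketch of this check in base R, the snippet below draws a Q-Q plot of standardized residuals and adds a Shapiro-Wilk test as a numeric complement. The model formula on `mtcars` and the names `fit`, `std_res`, and `sw` are assumptions for illustration; the test is an optional extra and not a replacement for looking at the plot.

```r
# Assumed example model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Standardized residuals put the points on a comparable scale
std_res <- rstandard(fit)

# Q-Q plot with a reference line through the first and third quartiles
qqnorm(std_res, main = "Normal Q-Q Plot", ylab = "Standardized Residuals")
qqline(std_res, col = "red", lty = 2)

# Shapiro-Wilk test: a small p-value suggests departure from normality
sw <- shapiro.test(std_res)
sw$p.value
```

With small samples the test has little power, and with large samples it flags trivial departures, so the plot usually remains the primary evidence.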

Scale-Location (Spread-Location) Plot

The Scale-Location plot assesses the homoscedasticity assumption, which states that the spread of the residuals is constant. It plots the square root of the absolute standardized residuals against the fitted values.

  • Consistent Spread: A roughly horizontal trend with evenly spread points suggests homoscedasticity, meaning the variability of the residuals is constant.
  • Increasing/Decreasing Spread: A spread that increases or decreases along the horizontal axis may indicate heteroscedasticity, where the variability of the residuals changes with the fitted values.
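The following sketch simulates data whose noise grows with the predictor, so the Scale-Location plot should show an upward trend. The simulation settings and the names `fit_het`, `spread`, and `spread_trend` are assumptions made for this illustration.

```r
# Hypothetical heteroscedastic data: noise standard deviation grows with x
set.seed(42)
n <- 200
x <- runif(n, min = 1, max = 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)

fit_het <- lm(y ~ x)

# The quantity on the vertical axis of a Scale-Location plot
spread <- sqrt(abs(rstandard(fit_het)))

# Base R Scale-Location plot
plot(fit_het, which = 3)

# A positive correlation between fitted values and spread
# points to variance increasing with the fitted values
spread_trend <- cor(fitted(fit_het), spread)
spread_trend
```

The correlation is only a crude summary; a formal test or a variance-stabilizing transformation of the response may be the next step when the trend is clear.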

Residuals vs Leverage Plot

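The Residuals vs Leverage plot helps identify influential observations, that is, points that disproportionately affect the fitted model. Leverage measures how unusual an observation's predictor values are, and an observation is influential when it combines high leverage with a large residual. Cook's distance summarizes this combined effect.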

  • Points within Bounds: Most points should fall within the Cook’s distance bounds, indicating that they have a low impact on the model.
  • Points outside Bounds: Points outside the bounds are potentially influential. These observations can significantly affect the model fit and merit closer examination.
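The quantities behind this plot can also be inspected directly. The sketch below computes leverage and Cook's distance with base R and flags observations above a cutoff of 4/n. That cutoff is a common rule of thumb rather than a fixed standard, and the model formula and the names `lev`, `cooks_d`, `influence_table`, and `flagged` are assumptions for illustration.

```r
# Assumed example model on the built-in mtcars data
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

lev <- hatvalues(fit)            # leverage of each observation
cooks_d <- cooks.distance(fit)   # Cook's distance of each observation
n <- nrow(mtcars)

influence_table <- data.frame(
  leverage = lev,
  std_resid = rstandard(fit),
  cooks_d = cooks_d
)

# Flag observations above the 4/n rule-of-thumb cutoff (an assumed threshold)
flagged <- influence_table[influence_table$cooks_d > 4 / n, ]
flagged[order(-flagged$cooks_d), ]
```

Flagged rows are candidates for a closer look, not automatic deletions; refitting the model without them and comparing coefficients is a reasonable follow-up.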

Let’s create a linear regression model and generate diagnostic plots for it using a real-world dataset. We’ll use the built-in mtcars dataset in R for this example.

R




# Load the dataset
data(mtcars)
 
# Explore the first few rows of the dataset
head(mtcars)


Output:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Linear Regression Model Creation

R




# Fit a linear regression model with 'mpg' (miles per gallon) as the response
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)
 
summary(model)


Output:

Call:
lm(formula = mpg ~ wt + hp + qsec, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.8591 -1.6418 -0.4636  1.1940  5.6092

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.61053    8.41993   3.279  0.00278 **
wt          -4.35880    0.75270  -5.791 3.22e-06 ***
hp          -0.01782    0.01498  -1.190  0.24418
qsec         0.51083    0.43922   1.163  0.25463
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.578 on 28 degrees of freedom
Multiple R-squared:  0.8348,  Adjusted R-squared:  0.8171
F-statistic: 47.15 on 3 and 28 DF,  p-value: 4.506e-11

In this example, we are creating a linear regression model (lm) with miles per gallon (mpg) as the response variable and weight (wt), horsepower (hp), and quarter-mile time (qsec) as predictor variables using the mtcars dataset.
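Every diagnostic plot is built from a few per-observation quantities of the fitted model. As a small sketch, the code below refits the same formula under the assumed name `model_check`, confirms that a residual is the observed value minus the fitted value, and collects the quantities in a data frame (`diag_values` is likewise a name chosen for illustration).

```r
# Refit the same formula under an assumed name so the snippet is self-contained
model_check <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# A residual is the observed response minus the fitted value
manual_resid <- mtcars$mpg - fitted(model_check)
all.equal(unname(manual_resid), unname(resid(model_check)))

# Per-observation quantities used by the diagnostic plots
diag_values <- data.frame(
  fitted = fitted(model_check),        # predicted mpg
  resid = resid(model_check),          # raw residuals
  std_resid = rstandard(model_check),  # standardized residuals
  leverage = hatvalues(model_check)    # diagonal of the hat matrix
)
head(diag_values, 3)
```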

Diagnostic Plots

R




# Load necessary libraries for diagnostic plots
library(ggplot2)
library(gridExtra)
 
# Function to create diagnostic plots
create_diagnostic_plots <- function(model) {
  # Residuals vs Fitted Values
  plot1 <- ggplot(model, aes(.fitted, .resid)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
    labs(title = "Residuals vs Fitted Values", x = "Fitted Values", y = "Residuals")
 
  # Normal Q-Q Plot
  plot2 <- ggplot(model, aes(sample = .stdresid)) +
    stat_qq() +
    stat_qq_line() +
    labs(title = "Normal Q-Q Plot", x = "Theoretical Quantiles",
         y = "Standardized Residuals")
 
  # Scale-Location Plot
  plot3 <- ggplot(model, aes(.fitted, sqrt(abs(.stdresid)))) +
    geom_point() +
    geom_smooth(se = FALSE, method = "loess", color = "red") +
    labs(title = "Scale-Location (Spread-Location) Plot", x = "Fitted Values",
         y = "√|Standardized Residuals|")
 
  # Residuals vs Leverage Plot
  plot4 <- ggplot(model, aes(.hat, .stdresid)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
    labs(title = "Residuals vs Leverage", x = "Leverage", y = "Standardized Residuals")
 
  # Arrange the plots in a grid
  grid.arrange(plot1, plot2, plot3, plot4, ncol = 2)
}
 
# Create and display diagnostic plots
create_diagnostic_plots(model)


Output:


Diagnostic Plots for Model Evaluation

In this example, we use the ggplot2 library to create diagnostic plots for the linear regression model. The plots include Residuals vs Fitted Values, Normal Q-Q Plot, Scale-Location Plot, and Residuals vs Leverage Plot.

  • Residuals vs Fitted Values: Look for a random scatter of points around zero, indicating linearity.
  • Normal Q-Q Plot: Check if points follow a straight line, suggesting normality of residuals.
  • Scale-Location Plot: Inspect for a consistent spread of residuals across fitted values, indicating homoscedasticity.
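  • Residuals vs Leverage Plot: Look for points that combine high leverage with large standardized residuals, since these are potentially influential. Note that the ggplot2 version above does not draw Cook’s distance contours; base R’s plot(model, which = 5) does.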

These diagnostic plots provide insights into the assumptions and performance of the linear regression model, helping you assess its validity and identify areas for potential improvement. Adjustments to the model may be required based on the observations from these plots.
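For a quick look without extra packages, base R can draw the same four plots directly from a fitted `lm` object. This is a minimal sketch; the name `model_base` is an assumption used to keep the snippet self-contained.

```r
# Same formula as above, refit under an assumed name
model_base <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Arrange the four default diagnostic plots in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(model_base)  # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
par(op)           # restore the previous graphics settings
```

The base R Residuals vs Leverage panel includes Cook's distance contours, which makes it a useful companion to a custom ggplot2 layout.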

Diagnostic Plots for Model Evaluation on Random Dataset

R




# Install the necessary packages (only needed once), then load them
install.packages(c("ggplot2", "gridExtra"))
library(ggplot2)
library(gridExtra)
 
# Create a synthetic dataset; note that y is pure noise and does not depend on x
set.seed(123)
data <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  y = 2 * rnorm(100, mean = 0, sd = 15) + 3 * rnorm(100, mean = 0, sd = 10)
)
 
# Fit a linear regression model
model <- lm(y ~ x, data = data)
 
summary(model)


Output:

Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-82.786 -25.096  -1.692  25.791 129.632

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  28.9283    23.4397   1.234    0.220
x            -0.5607     0.4533  -1.237    0.219

Residual standard error: 41.17 on 98 degrees of freedom
Multiple R-squared:  0.01537,  Adjusted R-squared:  0.005323
F-statistic:  1.53 on 1 and 98 DF,  p-value: 0.2191

The code first installs and loads the required R packages: ggplot2 for creating plots and gridExtra for arranging multiple plots in a grid.

  • A synthetic dataset is generated with 100 observations. The x variable is drawn from a normal distribution with a mean of 50 and a standard deviation of 10. The y variable is a sum of two scaled normal random variables and is generated independently of x.
  • A linear regression model is fitted with the lm function to predict y from x, and summary() reports the fit. Because y does not depend on x, the slope is not significant and R-squared is close to zero, as the output shows.
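Since y in the dataset above is generated without reference to x, the model has nothing to explain. For comparison, here is a hypothetical variant in which y really does depend on x. The intercept of 5, the slope of 2, the noise level, and the names `data_linear` and `model_linear` are assumptions chosen for illustration.

```r
# Hypothetical variant: y depends linearly on x, plus normal noise
set.seed(123)
data_linear <- data.frame(x = rnorm(100, mean = 50, sd = 10))
data_linear$y <- 5 + 2 * data_linear$x + rnorm(100, mean = 0, sd = 10)

model_linear <- lm(y ~ x, data = data_linear)

# The estimated slope should land near the assumed true value of 2
coef(model_linear)
summary(model_linear)$r.squared
```

Running the diagnostic plots on a model like this gives a baseline for what well-behaved residuals look like.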

R




# Obtain model predictions and raw residuals
predictions <- predict(model)
model_residuals <- residuals(model)

# Create a data frame for diagnostic plots, including standardized residuals
diagnostic_data <- data.frame(predictions = predictions, residuals = model_residuals,
                              std_residuals = rstandard(model),
                              hatvalues = hatvalues(model))

# Diagnostic plots for model evaluation
plot_residuals_vs_fitted <- ggplot(diagnostic_data,
                                   aes(x = predictions, y = residuals)) +
  geom_point(color = "blue", size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

plot_qq_plot <- ggplot(diagnostic_data, aes(sample = residuals)) +
  geom_qq(color = "pink") +
  geom_qq_line(color = "green") +
  labs(title = "Normal Q-Q Plot") +
  theme_minimal()

# Use standardized residuals here so the plot matches its axis label
plot_scale_location <- ggplot(diagnostic_data, aes(x = predictions,
                                                   y = sqrt(abs(std_residuals)))) +
  geom_point(color = "yellow", size = 3) +
  geom_smooth(se = FALSE, method = "loess", formula = y ~ x, color = "orange") +
  labs(title = "Scale-Location Plot",
       x = "Fitted Values",
       y = "√|Standardized Residuals|") +
  theme_minimal()
 
plot_residuals_vs_leverage <- ggplot(diagnostic_data,
                                     aes(x = hatvalues, y = residuals)) +
  geom_point(color = "green", size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
  labs(title = "Residuals vs Leverage",
       x = "Leverage",
       y = "Residuals") +
  theme_minimal()
 
# Combine the plots into a grid
grid.arrange(
  plot_residuals_vs_fitted,
  plot_qq_plot,
  plot_scale_location,
  plot_residuals_vs_leverage,
  ncol = 2
)


Output:


Diagnostic Plots for Model Evaluation

Model predictions and residuals are obtained. Predictions are calculated based on the fitted linear regression model, and residuals represent the differences between the observed and predicted values.
A new data frame (diagnostic_data) is created to store the predictions, residuals, and leverage values (hat values) for the purpose of creating diagnostic plots.

Four diagnostic plots are created using ggplot2.

  • Residuals vs Fitted Values: Scatter plot of residuals against fitted values with a horizontal dashed line at 0.
  • Normal Q-Q Plot: Quantile-quantile plot to check if residuals follow a normal distribution.
  • Scale-Location Plot: Scatter plot of square root of absolute standardized residuals against fitted values with a smooth line.
  • Residuals vs Leverage: Scatter plot of residuals against leverage values with a horizontal dashed line at 0.

The grid.arrange function from the gridExtra package is used to arrange the four diagnostic plots into a 2×2 grid for better visualization.

Conclusion

Diagnostic plots are indispensable tools for model evaluation. Understanding their types, purpose, and interpretation empowers practitioners to make informed decisions about model quality and potential improvements. Regular use of diagnostic plots contributes to robust and reliable statistical modeling practices.


