
Standardized Residual in R

Last Updated : 27 Jun, 2023

In statistics, the difference between a dependent variable’s observed value and its predicted value is known as a residual. A standardized residual is a residual that has been rescaled to have a mean of zero and a standard deviation of one. It is used in regression analysis to measure how far a data point lies from its predicted value and to flag potential outliers.

Concepts:

To compute the standardized residual, subtract the predicted value from the observed value, then divide the result by the standard error of the estimate. The standard error of the estimate measures how accurately the independent variable predicts the dependent variable.

The formula for the standardized residual is as follows:

Standardized residual = (Observed value - Predicted value) / Standard error of the estimate
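As a quick sketch (with data invented purely for illustration), R’s built-in ‘rstandard()’ function returns these standardized residuals directly from a fitted model:

```r
# Sketch with made-up data: rstandard() returns standardized residuals
set.seed(42)
x <- rnorm(40)
y <- 5 + 3 * x + rnorm(40)
fit <- lm(y ~ x)

std_res <- rstandard(fit)
round(mean(std_res), 2)  # roughly 0
round(sd(std_res), 2)    # roughly 1
```

Because each residual is rescaled, the values can be compared across observations on a common scale.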

Steps to be followed:

  • Load helpful R packages such as ‘car’ (Companion to Applied Regression) and ‘ggplot2’, which provide tools for regression diagnostics and plotting.
  • Fit a regression model with R’s ‘lm()’ function.
  • Compute the standardized residuals with the ‘rstandard()’ function (part of base R’s ‘stats’ package).
  • Visualize the standardized residuals with a scatter plot or a histogram to spot potential outliers.
  • Interpret the standardized residuals to understand the relationship between the dependent and independent variables.

Using the ‘plot()’ and ‘plot.lm()’ functions, you can draw a simple scatter plot and the model’s diagnostic residual plots, respectively, in R. Here is an illustration:

R




# Create some arbitrary data
x <- rnorm(50)
y <- 2*x + rnorm(50)

# Fit a linear regression model
model <- lm(y ~ x)

# Plot the simple scatter plot
plot(x, y, main = "Simple Plot")

# Plot the residuals-vs-fitted diagnostic plot
plot(model, which = 1, main = "Residuals vs Fitted")


  • In the example above, we first generate some random data and fit a linear regression model using the ‘lm()’ function. We then draw the simple scatter plot with ‘plot()’ and the diagnostic residual plot with ‘plot.lm()’ (dispatched automatically when ‘plot()’ is called on an ‘lm’ object), with the ‘which’ argument set to 1.
  • Two vectors, x and y, are created, each with 50 randomly generated values. The ‘rnorm()’ function draws random numbers from a normal distribution: x comes from a standard normal distribution, and y equals 2*x plus noise drawn from a normal distribution with mean 0.
  • The ‘lm()’ function fits the linear regression model describing the relationship between the independent variable x and the dependent variable y. The result is stored in the model object.
  • ‘plot(x, y)’ draws a simple scatter plot of the data; the ‘main’ argument sets the title.
  • ‘plot(model, which = 1)’ draws the residuals-vs-fitted diagnostic plot. (Note that ‘which = 1’ shows the raw residuals; ‘which = 3’ shows the square root of the absolute standardized residuals, and the standardized residuals can also be plotted directly via ‘rstandard()’.) The ‘main’ argument adds the title.
  • The residual plot is used to check the linear regression model’s underlying assumptions, namely that the residuals are normally distributed with constant variance across the range of the predictor variable. A visible pattern or trend suggests the model is not a good fit for the data.
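Since ‘plot(model, which = 1)’ shows the raw residuals, the standardized residuals can also be plotted against the fitted values directly; a minimal sketch (made-up data) is:

```r
# Sketch: standardized residuals against fitted values (made-up data)
set.seed(7)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
model <- lm(y ~ x)

plot(fitted(model), rstandard(model),
     xlab = "Fitted values", ylab = "Standardized residuals",
     main = "Standardized Residuals vs Fitted")
abline(h = 0, lty = 2)             # reference line at zero
abline(h = c(-2, 2), col = "red")  # common outlier cut-offs
```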

Output:

 

 

Simple histogram and standardized residual plot:
 

R




# Generate some random data
x <- rnorm(50)
y <- 2*x + rnorm(50)
 
# Fit a linear regression model
model <- lm(y ~ x)
 
# Plot the simple histogram
hist(y, main = "Simple Histogram")
 
# Plot the residuals-vs-fitted diagnostic plot
plot(model, which = 1, main = "Residuals vs Fitted")


The first plot is a simple histogram of the y variable, which shows the distribution of the response variable. The second plot is the residual diagnostic plot, which shows how far each observation falls from the fitted regression line. The difference between the two is that the histogram displays the distribution of the response variable, whereas the residual plot displays the distribution of the residuals, the discrepancies between the observed and predicted values. The standardized residuals can be used to spot outliers and to evaluate the regression model’s overall goodness of fit.

Output:

 


 

The difference between a simple histogram and a standardized residual plot in the context of linear regression analysis:

A histogram is a graphical representation of a frequency distribution. In a linear regression analysis it is often drawn for the residuals, the discrepancies between the predicted values and the observed values of the dependent variable (in the example above, the histogram shows the response variable y itself).

The histogram is a popular tool for examining the distribution of the residuals and can reveal patterns such as skewness or outliers. By examining its shape, you can judge whether or not the residuals follow a normal distribution.

A standardized residual plot, on the other hand, is a graphical depiction of the standardized residuals: residuals that have been scaled by their estimated standard deviation. Standardized residual plots are used to evaluate the normality of the residuals and to find possible outliers or influential observations.
 

In the context of linear regression analysis, the primary difference between a simple histogram of the response and a standardized residual plot is that the former displays the distribution of the dependent variable, whereas the latter displays the distribution of the model residuals.

While a histogram can be used to check the normality assumption of the linear regression model, a standardized residual plot can reveal model flaws such as nonlinearity, heteroscedasticity, or influential points.
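To make the contrast concrete, here is a small sketch (made-up data) that draws the response histogram and a histogram of the standardized residuals side by side:

```r
# Sketch: response distribution vs standardized-residual distribution
set.seed(11)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
model <- lm(y ~ x)

op <- par(mfrow = c(1, 2))            # two panels side by side
hist(y, main = "Response y")          # distribution of the response
hist(rstandard(model),
     main = "Standardized residuals") # distribution of the scaled residuals
par(op)                               # restore plotting settings
```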
 

Here are some scatter plot examples using R:
 

A straightforward scatter diagram with a regression line:

R




# Generate random data
x <- rnorm(100)
y <- 2*x + rnorm(100)
 
# Create scatter plot with regression line
plot(x, y)
abline(lm(y ~ x))


The first two lines of the code produce randomly distributed data for x and y. With x on the horizontal axis and y on the vertical axis, the third line generates a scatter plot. The fourth line adds a regression line by fitting a linear regression model of y on x with the ‘lm()’ function and drawing it with ‘abline()’. The resulting plot illustrates the relationship between x and y, with the regression line indicating the direction and strength of the association between the two variables.
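For simple regression, the slope of that line is tied directly to the correlation between x and y; a short sketch (made-up data) illustrating the identity:

```r
# Sketch: slope = cor(x, y) * sd(y) / sd(x) in simple linear regression
set.seed(3)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)

cor(x, y)                  # strength of the linear association
coef(fit)["x"]             # slope of the fitted line drawn by abline()
cor(x, y) * sd(y) / sd(x)  # same value, via the correlation identity
```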

Output:

 

Scatter plot with color-coded points:

R




# Generate random data
x <- rnorm(100)
y <- 2*x + rnorm(100)
group <- sample(1:3, 100, replace = TRUE)
 
# Create scatter plot with color-coded points
plot(x, y, col = group)


Output:
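To make the groups readable, a legend can be added; here is a sketch using the same invented data scheme:

```r
# Sketch: color-coded scatter plot with a legend for the groups
set.seed(5)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
group <- sample(1:3, 100, replace = TRUE)

plot(x, y, col = group, pch = 19)
legend("topleft", legend = paste("Group", 1:3),
       col = 1:3, pch = 19)
```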

 

Here is some R code that uses the ‘mtcars’ dataset:

R




# Load the required packages
library(car)
library(ggplot2)

# Fit a linear regression model of mpg on wt
model <- lm(mpg ~ wt, data = mtcars)

# Compute the standardized residuals
std.resid <- rstandard(model)

# Display the standardized residuals in a histogram
ggplot(data.frame(std.resid), aes(x = std.resid)) +
  geom_histogram(binwidth = 1.0, fill = "red") +
  xlab("Standardized Residuals") +
  ylab("Count") +
  ggtitle("Histogram of Standardized Residuals")


In this example, we begin by loading the required packages, ‘car’ and ‘ggplot2’.

We then use R’s built-in ‘mtcars’ dataset and the ‘lm()’ function to fit a linear regression model between ‘mpg’ (miles per gallon) and ‘wt’ (weight).

Next, we compute the standardized residuals with the ‘rstandard()’ function (part of base R’s ‘stats’ package). Finally, we display the standardized residuals in a histogram produced with ‘ggplot2’.

Keep in mind that the ‘std.resid’ object is wrapped in a data frame with the ‘data.frame()’ function so that ‘ggplot()’ can use it.
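Following the common rule of thumb that an absolute standardized residual above 2 flags a potential outlier, the same ‘mtcars’ model can be screened like this (a sketch, not part of the original example):

```r
# Sketch: flag mtcars observations with |standardized residual| > 2
model <- lm(mpg ~ wt, data = mtcars)
std.resid <- rstandard(model)
std.resid[abs(std.resid) > 2]  # named vector of flagged cars
```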
 

Output:

 

Another example utilising a simulated dataset is as follows:
 

R




# Load the required packages
library(car)
library(ggplot2)

# Set the seed for reproducibility
set.seed(123)

# Simulate the data
x <- rnorm(50)
y <- 3*x + rnorm(50, mean = 0, sd = 0.5)
data <- data.frame(x, y)

# Fit a linear regression model of y on x
model <- lm(y ~ x, data = data)

# Compute the standardized residuals
std.resid <- rstandard(model)

# Display the standardized residuals in a histogram
ggplot(data.frame(std.resid), aes(x = std.resid)) +
  geom_histogram(binwidth = 1.0, fill = "green") +
  xlab("Standardized Residuals") +
  ylab("Count") +
  ggtitle("Histogram of Standardized Residuals")


In this example, we begin by loading the required packages, ‘car’ and ‘ggplot2’. We then simulate a 50-observation dataset for x and y with added Gaussian noise.

We use the ‘lm()’ function to fit a linear regression model between ‘x’ and ‘y’, the ‘rstandard()’ function (from base R’s ‘stats’ package) to obtain the standardized residuals, and ‘ggplot2’ to draw a histogram of them.

Note that we used ‘set.seed()’ to fix the random seed so that the simulation results are reproducible.
 

Output:

 

 

In conclusion, standardized residuals indicate how far an observation deviates from the value predicted by a regression model. They are helpful for locating outliers or influential observations that might be affecting the results of the regression analysis.

In R, standardized residuals are computed with the ‘rstandard()’ function: first fit a linear regression model with ‘lm()’, then pass the model object to ‘rstandard()’.

To spot patterns or outliers, the standardized residuals can be visualized with a histogram or scatter plot. Standardized residuals help us assess the linear regression model’s assumptions and determine how well the model fits the data.
 

Mathematical Concepts Used Here:
 

The standardized residual, expressed in units of the residuals’ standard deviation, measures how far each observed value of the response variable deviates from its predicted value in the linear regression model. It is computed as:

standardized residual = residual / (sqrt(MSE) * sqrt(1 - hii))
 

where MSE is the mean squared error of the model, ‘hii’ is the leverage of observation i, and residual is that observation’s residual. The leverage quantifies how much weight an observation has in determining the model’s fitted values.
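This formula can be checked by hand against ‘rstandard()’; here is a sketch with invented data:

```r
# Sketch: manual standardized residuals vs rstandard() (made-up data)
set.seed(1)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

mse <- sum(residuals(fit)^2) / df.residual(fit)  # mean squared error
h   <- hatvalues(fit)                            # leverage of each observation
std_manual <- residuals(fit) / (sqrt(mse) * sqrt(1 - h))

all.equal(std_manual, rstandard(fit))  # TRUE
```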

The standardized residual plot helps verify the assumptions of linear regression. It charts the standardized residuals against the model’s fitted values. If the assumptions of linear regression hold, the standardized residuals should be scattered randomly around zero, with no clear patterns or trends.

In R, you can produce such a diagnostic plot by calling ‘plot()’ on the fitted ‘lm’ object: ‘which = 1’ plots the residuals against the fitted values, and ‘which = 3’ shows the standardized residuals on a square-root scale.

Here,

We attempt to model the connection between a dependent variable y and one or more independent variables X in linear regression analysis. A linear regression model with only one independent variable has the following general equation:

y = β0 + β1X + ε

where y is the dependent variable, X is the independent variable, β0 is the intercept, β1 is the slope coefficient, and ε is the error term. The error term represents the variation in the dependent variable that is not explained by the independent variable. Linear regression analysis aims to estimate the coefficients β0 and β1 that best fit the data, so that y can be predicted from a given value of X.
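A brief sketch (invented data) of estimating β0 and β1 and predicting y for a new X:

```r
# Sketch: estimate beta0 and beta1, then predict y at a new X
set.seed(9)
X <- rnorm(60)
y <- 1.5 + 0.8 * X + rnorm(60, sd = 0.3)
fit <- lm(y ~ X)

coef(fit)                                  # estimates of beta0 and beta1
predict(fit, newdata = data.frame(X = 2))  # predicted y at X = 2
```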

The sum of squared errors, which is the total of the squared differences between the observed values of y and the predicted values of y, must be minimized when fitting a linear regression model. This can be mathematically written as:

SSE = Σ(yi − ŷi)²

where yi is the observed value of y, ŷi is the predicted value, and Σ denotes the sum over all observations i. The coefficients β0 and β1 that minimize the sum of squared errors can be estimated with the least squares method.
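Since the residuals of a fitted ‘lm’ model are exactly yi − ŷi, the minimized quantity can be verified directly; a sketch with invented data:

```r
# Sketch: SSE equals the sum of squared residuals of the fitted model
set.seed(4)
x <- rnorm(40)
y <- 1 + 2 * x + rnorm(40)
fit <- lm(y ~ x)

sse <- sum((y - fitted(fit))^2)
all.equal(sse, sum(residuals(fit)^2))  # TRUE
```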

Once the linear regression model’s coefficients have been estimated, we can use them to forecast the dependent variable y given a value for X. You can write the expected value of y as:

ŷ = β0 + β1X

The residual is the difference between the measured value of y and the predicted value of y, and it can be written as follows:

ei = yi − ŷi

The residuals represent the variation in the dependent variable that the independent variable does not explain. The linear regression model is considered valid if the residuals are normally distributed with a mean of zero and a constant variance. If the residuals show a pattern, such as nonlinearity or heteroscedasticity, the model may not be reliable, and further steps may be needed to improve it.

The standardized residuals are the residuals divided by their estimated standard deviation. They are helpful for locating outliers or influential observations that might affect the model. An observation whose standardized residual has an absolute value greater than 2 is commonly flagged as a potential outlier and should be examined more closely to see whether it is unduly influencing the model.


