
Standard Error of the Regression vs. R-squared

Last Updated : 06 Oct, 2023

Regression is a statistical technique used to establish a relationship between dependent and independent variables. It predicts continuous values within a given range. The general equation of linear regression is given by

y = mx + c

  • Here y is the dependent variable: the variable whose value changes when the independent variable changes.
  • x is the independent variable; y depends on x. Note that there can be more than one independent variable.
  • m is the slope.
  • c is the y-intercept.

There are different types of regression: Linear Regression, Ridge Regression, Polynomial Regression, and Lasso Regression. Since regression analysis predicts continuous values within a given range, we need evaluation metrics to analyze the performance of the machine learning model. In regression analysis, we measure how much the predicted values deviate from the actual values. Common evaluation metrics for regression analysis include Mean Squared Error, Mean Absolute Error, and R-squared.
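Before diving in, here is a minimal sketch (with made-up predicted values, for illustration only) of how these metrics can be computed with scikit-learn's metrics module:

Python3

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([2, 6, 8, 18, 10])
y_pred = np.array([3, 5, 9, 17, 11])

print('MSE :', mean_squared_error(y_true, y_pred))   # mean squared deviation
print('MAE :', mean_absolute_error(y_true, y_pred))  # mean absolute deviation
print('R2  :', r2_score(y_true, y_pred))             # coefficient of determination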

Standard Error of Regression

Standard error is a statistical measure of the average distance between the observed values and the regression line. It describes how much the actual data is spread around the line; in other words, it measures how much the actual dependent values deviate from the predicted values. Since it is an error, the lower its value, the better the prediction.
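For simple linear regression, this quantity, often denoted S, is commonly computed as

S = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}

where y_i are the observed values, \hat{y}_i are the values predicted by the regression line, and n - 2 is the degrees of freedom (two parameters, the slope and the intercept, are estimated).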

Suppose we want to estimate the slope as well as the y-intercept from the independent and dependent variables. Let us look at the Python code.

Python3

import numpy as np
import statsmodels.api as sm

x = np.array([2, 3, 4, 9, 5])    # Independent variable
y = np.array([2, 6, 8, 18, 10])  # Dependent variable

# Add a constant term (intercept) to the independent variable
x = sm.add_constant(x, prepend=True)

# Fit a linear regression model using ordinary least squares (OLS)
regression = sm.OLS(y, x).fit()

# Display the regression summary
regression.summary()

Output:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.978
Method:                 Least Squares   F-statistic:                     182.8
Date:                Thu, 05 Oct 2023   Prob (F-statistic):           0.000875
Time:                        11:18:32   Log-Likelihood:                -5.1249
No. Observations:                   5   AIC:                             14.25
Df Residuals:                       3   BIC:                             13.47
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.2192      0.837     -1.456      0.241      -3.883       1.445
x1             2.1781      0.161     13.519      0.001       1.665       2.691
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.045
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.624
Skew:                          -0.678   Prob(JB):                        0.732
Kurtosis:                       1.925   Cond. No.                         11.5
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Standard Error:

Python3

# Get the standard errors of regression coefficients
error = regression.bse
 
# Print the standard errors
print('Constant error :',error[0])
print('x1 error       :',error[1])


Output:

Constant error : 0.8371869364710968
x1 error       : 0.16111670104454195

After executing the code, we get a list comprising two values. In the above code, we defined two NumPy arrays and imported the statsmodels library for the statistical calculations. After that, we added a constant term and then fitted a linear regression model on the independent and dependent values. The model analyses the values and establishes a relationship between the dependent and independent variables; in effect, it estimates the slope and the y-intercept. Then, using the bse attribute, we obtain the standard errors of the intercept and the slope respectively. Here the intercept has a standard error of about 0.837, whereas the slope has a standard error of about 0.161.
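Note that bse returns the standard errors of the estimated coefficients. The standard error of the regression itself (the average distance of the data from the regression line, as defined above) can be derived from the fitted model's residual mean square; here is a minimal sketch, assuming the regression object from the code above is still in scope:

Python3

import numpy as np

# Residual standard error: the square root of the mean squared residual,
# i.e. sqrt(SSE / (n - 2)) for simple linear regression
print('Standard error of the regression :', np.sqrt(regression.mse_resid))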

R-squared

R-squared, also known as the coefficient of determination, is a statistical measure of how well the regression line fits the observed values. In other words, it explains how much of the variability in the dependent variable is accounted for by the independent variables. Therefore, the higher the value of R², the stronger the relationship between the independent and dependent variables.
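Formally, R-squared is computed as

R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}

where SS_{res} is the residual sum of squares and SS_{tot} is the total sum of squares about the mean \bar{y}.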

R-squared can be calculated using either the statsmodels library or the scikit-learn library. Let us look at each of them.

Python3

# Get the r square value
r_squared = regression.rsquared
 
# Print
print('R-Squared :',r_squared)


Output:

R-Squared : 0.9838496264009963

In the above code, we defined two NumPy arrays and imported the statsmodels library for the statistical calculations. After that, we added a constant term and then fitted a linear regression model on the independent and dependent values. The model analyses the values and establishes a relationship between the dependent and independent variables. Now, to determine the strength of this relationship, we use the rsquared attribute to calculate the coefficient of determination.
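As mentioned above, the same value can also be obtained with scikit-learn's r2_score. Here is a minimal sketch, assuming the regression model, x, and y from the earlier code are still in scope:

Python3

from sklearn.metrics import r2_score

# Predictions from the fitted statsmodels model
y_pred = regression.predict(x)

# Compare the actual and the predicted values
print('R-Squared :', r2_score(y, y_pred))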

Interpreting standard error vs. R-squared

Standard error tells us, on average, how far the actual data points lie from the values on the regression line. The smaller this distance, the better the model captures the data. R-squared, on the other hand, focuses on the strength of the relationship between the dependent and independent variables: it measures how much of the variability in the dependent variable is explained by changes in the independent variables. R-squared lies between 0 and 1; 0 means the independent variables explain none of the variability in the dependent variable, while 1 means they explain all of it.
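To make this concrete, here is a small sketch (with randomly generated, illustrative data): fitting a line to data with a strong linear signal yields an R-squared close to 1, while fitting a line to pure noise yields an R-squared close to 0.

Python3

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
X = sm.add_constant(x)

y_signal = 3 * x + 2 + rng.normal(0, 0.5, 50)  # strong linear relationship
y_noise = rng.normal(0, 5, 50)                 # unrelated to x

print('R-squared (signal):', sm.OLS(y_signal, X).fit().rsquared)  # near 1
print('R-squared (noise) :', sm.OLS(y_noise, X).fit().rsquared)   # near 0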

Using Both Metrics Together

These metrics can be used together to identify overfitting, to compare two models, and so on. Let us have a look at each of these uses:

Overfitting occurs when a model gives accurate predictions on the training data but performs poorly on new, unseen data. Regression models can overfit, and these two metrics can help detect it: if a model has a very high R-squared and a very low standard error on the training data but performs much worse on unseen data, it is likely overfitting, as the sketch below illustrates.
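One way to check this, sketched below with made-up data, is to compare R-squared on the training data against R-squared on held-out data; a large gap suggests overfitting:

Python3

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: a noisy linear relationship
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, 30).reshape(-1, 1)
y = 2 * X.ravel() + rng.normal(0, 3, 30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

# A very flexible model can fit the training data almost perfectly
model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
model.fit(X_train, y_train)

# A large train/test gap in R-squared suggests overfitting
print('Train R-squared:', model.score(X_train, y_train))
print('Test R-squared :', model.score(X_test, y_test))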

Comparison of two models: since both are evaluation metrics, they can act as benchmarks for comparing two or more regression models, helping users choose the regression model that provides better results (see the sketch below).
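For example, here is a minimal sketch (reusing the data from earlier) that fits two candidate models to the same data and compares their R-squared and residual standard errors side by side:

Python3

import numpy as np
import statsmodels.api as sm

x = np.array([2, 3, 4, 9, 5])
y = np.array([2, 6, 8, 18, 10])

# Candidate 1: a straight line; candidate 2: a line plus a quadratic term
X1 = sm.add_constant(x)
X2 = sm.add_constant(np.column_stack([x, x ** 2]))

for name, X in [('linear   ', X1), ('quadratic', X2)]:
    fit = sm.OLS(y, X).fit()
    s = np.sqrt(fit.mse_resid)  # residual standard error
    print(name, '| R-squared:', round(fit.rsquared, 3),
          '| standard error:', round(s, 3))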

When to use standard error and R-squared?

Both metrics play an important role in the analysis of a regression model. Standard error is used when we want to measure the spread of data points around the regression line. R-squared is used when we want to measure the goodness of fit. Combined, they can be used to check whether the model is overfitting.

Choosing the Right Metric

All of these metrics play an important role in the evaluation of a regression model, and each has its own specific role in the analysis. R-squared, also known as the coefficient of determination, is used to measure the strength of the relationship between two variables. Standard error is essentially a standard deviation that measures the spread of the data around the regression line. Hence, we should use it when we want to measure the spread of the original data around the predicted values.

Limitations

The main drawback of the coefficient of determination is that, on its own, it cannot tell us whether the model is good or bad overall; to evaluate that, we should also use other metrics such as MSE and MAE. On the other hand, the main limitation of the standard error of the regression is that it assumes a linear relationship between the variables and may not give accurate values for nonlinear relationships.

Standard Error of the Regression vs. R-squared

Characteristic | R-Squared | Standard Error
--- | --- | ---
Objective | Determines the strength of the relationship between the independent and dependent variables. | Finds the average distance between the actual values and the predicted values on the regression line.
Measurement focus | Lies between 0 and 1. | Not bounded between 0 and 1; it is expressed in the units of the dependent variable.
Purpose and use | Used for comparing two or more models. | Helps to determine the precision of the predictions.
Interpretation | The higher the value, the more of the variability in the dependent variable is explained. | The lower the value, the better the model.

Relationship between the two metrics: there is no formula-based relationship, but in practice, as the standard error decreases, R-squared tends to increase, and vice versa.


