Prediction Interval for Linear Regression in R

Last Updated : 24 Aug, 2023

A linear regression model describes the relationship between a dependent variable and one or more independent variables. Linear regression in the R programming language is used to make predictions from observed data, and those predictions support many practical decisions. A single predicted number, however, says nothing about how far a future observation may fall from it. A prediction interval addresses this by giving a range within which a future observation is expected to fall with a stated probability, such as 95%.

Consider a retailer who wants to predict next month's unit sales. By modeling past sales, we can estimate a range for future sales, and this prediction interval helps the retailer plan stock levels. A prediction interval is reported as three values: the lower prediction limit, the upper prediction limit, and the point prediction (the fitted value) that lies between them.

Prediction intervals are useful in many real-world settings such as real-estate pricing, stock market analysis, sports analytics, climate change projections, and crop yield estimation. In each case the interval gives an estimated range for the quantity of interest, which helps reduce the risk of mismanagement or loss. For example, we can estimate a price range for a building or a performance range for a player. In this article, we will explore the concept of prediction intervals for linear regression and demonstrate how to calculate them using the R programming language.

Linear regression is used to determine the relationship between two variables based on observed data: an independent variable (x) and a dependent variable (y). The variable that we want to predict is called the dependent variable, and the variables that might influence it are known as independent variables. For example, if we want to predict exam scores based on study hours, exam score is the dependent variable and study time is the independent variable. By analyzing the relationship between these two variables, we can estimate whether more study hours are associated with higher marks. The model fits the best straight line through the given data, and that line is used to predict y for new values of x. To predict the value of the dependent variable (ŷ) for a given value of x, we combine the estimated coefficients of the model:

ŷ = β₀ + β₁x

Here, β₀ and β₁ are the estimated coefficients representing the intercept and slope of the regression line, respectively. A point prediction alone is not enough; it is important to consider the uncertainty associated with it. To calculate the prediction interval we start with the standard error (se) of the prediction, which measures the degree of uncertainty in a predicted value. It combines two sources of uncertainty: the spread of the data points around the regression line and the uncertainty in the estimated line itself. In the example of estimating scores from study hours, if the data points are scattered widely around the line, the standard error is large and the prediction is more uncertain; if they lie close to the line, the standard error is small.

The formula for a prediction interval depends on the statistical model being used. For simple linear regression with normally distributed errors, the interval is based on the t-distribution, because the error variance is not known and has to be estimated from the data. The t-distribution has heavier tails than the normal distribution, which makes the interval appropriately wider when the sample is small. For a given value of the independent variable, x, the prediction interval for the corresponding dependent variable, y, can be calculated as follows:

Prediction Interval

[ŷ - tα/2 * se, ŷ + tα/2 * se] 

where:

  • ŷ is the predicted value of y for the given x.
  • tα/2 is the critical value of the t-distribution with (n – 2) degrees of freedom, where n is the number of observations used to fit the regression model.
  • se is the standard error of the prediction, which is calculated as:
se = √(MSE * (1 + 1/n + (x - x̄)²/∑(xi - x̄)²))

where:

  • n is the number of observations used to fit the regression model.
  • x̄ is the mean of the independent variable values. ​
  • xi represents the individual values of the independent variable.
  • MSE is the mean squared error, which measures the average squared difference between the observed values of y and the predicted values.
MSE = Σ(yi - ŷi)² / (n - 2)

In this formula:

  • yi represents the observed values of the dependent variable.
  • ŷi represents the predicted values of the dependent variable based on the regression model.
  • n is the number of observations used to fit the regression model.
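
The formulas above can be checked numerically. The following is a minimal sketch, assuming the five-house size and price data used in the price prediction example of this article and a new size of 1800 square feet; the variable names (fit, x_new, manual, builtin) are illustrative only. It computes the interval step by step from the formulas and compares it with the result of R's built-in predict() function.

```r
# Illustrative data: five house sizes (sq ft) and prices (thousand $)
size  <- c(1200, 1500, 1700, 1900, 2200)
price <- c(250, 320, 340, 400, 430)
n     <- length(size)

# Fit the line and pull out the estimated intercept and slope
fit <- lm(price ~ size)
b0  <- coef(fit)[["(Intercept)"]]
b1  <- coef(fit)[["size"]]

x_new <- 1800               # assumed new house size
y_hat <- b0 + b1 * x_new    # point prediction from the regression line

# MSE = sum of squared residuals / (n - 2)
mse <- sum(residuals(fit)^2) / (n - 2)

# Standard error of the prediction at x_new
se <- sqrt(mse * (1 + 1 / n +
                  (x_new - mean(size))^2 / sum((size - mean(size))^2)))

# Critical value of the t-distribution for a 95% interval
t_crit <- qt(1 - 0.05 / 2, df = n - 2)

manual  <- c(fit = y_hat, lwr = y_hat - t_crit * se, upr = y_hat + t_crit * se)
builtin <- predict(fit, data.frame(size = x_new),
                   interval = "prediction", level = 0.95)

print(manual)
print(builtin)
```

Both printed results should agree up to rounding, which shows that predict() applies the same formula internally for this simple model.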

The desired level of confidence for the prediction interval enters through the critical value tα/2. For example, to build a 95% prediction interval we use a significance level (α) of 0.05 and find the value tα/2 that leaves a probability of α/2 = 0.025 in the upper tail of the t-distribution. A 95% prediction interval means that, if the model assumptions hold, intervals constructed in this way contain the new observation in about 95% of cases. The critical value makes the interval wide enough to account for the variability in the data. If the sample size is large, the critical value can be approximated using the standard normal distribution instead of the t-distribution. The prediction interval relies on certain assumptions:

  • There should be a linear relationship between the independent and dependent variables.
  • The observations are independent of each other.
  • The spread of the errors around the regression line is constant across all values of the independent variable (constant variance).
  • The errors are approximately normally distributed, which is what justifies using the t-distribution.
  • Extreme or unusual data points (outliers) do not have a significant impact on the fitted model.

Checking these assumptions is important for reliable results: if any of them is violated in our data, the predictions may be biased and the interval may be too narrow or too wide, which makes it misleading.
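
A quick, informal way to look at these assumptions is to inspect the residuals of the fitted model. The sketch below assumes the five-house data from the price prediction example in this article; the object names are illustrative, and with so few points the plots are only indicative, not a formal test.

```r
# Illustrative data: five house sizes (sq ft) and prices (thousand $)
size  <- c(1200, 1500, 1700, 1900, 2200)
price <- c(250, 320, 340, 400, 430)

fit <- lm(price ~ size)
res <- residuals(fit)

# Residuals vs. fitted values: curvature suggests a non-linear relationship,
# a funnel shape suggests non-constant variance
plot(fitted(fit), res,
     xlab = "Fitted price", ylab = "Residual",
     main = "Residuals vs. Fitted Values")
abline(h = 0, lty = "dashed")

# Normal Q-Q plot: points far from the reference line suggest
# non-normal errors or outliers
qqnorm(res)
qqline(res)
```

Residuals scattered evenly around the dashed zero line, and Q-Q points close to the reference line, are consistent with the assumptions listed above.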

To calculate the prediction interval we need to follow the below-mentioned steps:
1. R setup: Install R, and optionally RStudio, and type the commands in the R console, where R reads and executes them.
2. Preparing the data: Organize the observations so that each variable is stored in a vector or in a column of a data frame.
3. Fitting the model: Use lm() to fit a linear model that relates the dependent variable to the independent variable.
4. Calculating the prediction interval: Pass the fitted model and the new values of the independent variable to predict(), which returns the interval.
5. Getting results and interpretation: Read off the predicted value and the limits and interpret them in the context of the problem.

These steps apply to all the examples mentioned below.

Example:

Price Prediction

Let’s take a real-world example of predicting house prices based on their sizes. We will use a small dataset of 5 houses:

House Size (in square feet): [1200, 1500, 1700, 1900, 2200]

House Price (in thousands of dollars): [250, 320, 340, 400, 430]

First, we store the data in R as two vectors, “size” and “price”:

R




# Define the sizes of houses
 size <- c(1200, 1500, 1700, 1900, 2200)
# Define the corresponding prices of houses
 price <- c(250, 320, 340, 400, 430)


lm(): In R, the lm() function is used to fit linear regression models. The term “lm” stands for “linear model.”

Confidence level: The confidence level is the proportion of cases, for example 95%, in which intervals constructed by this method are expected to contain the value being predicted.

To calculate the prediction interval for new house size, we need to define the desired confidence level (CL). Let’s assume a 95% confidence level.

R




# Create a linear regression model using the size and price data
model <- lm(price ~ size)
# Set the confidence level for prediction intervals
CL <- 0.95


Now, we can calculate the prediction interval using the predict() function in R. We will pass the model object, the new size value for which we want to predict the price, and the argument interval = "prediction".

Syntax:

predict(): A generic function in R that computes predictions from a fitted model.

data.frame(): Takes one or more vectors as input and combines them into a data frame where each vector becomes a column. The column name must match the name of the independent variable used in the model (size here).

interval = ' ': An argument of predict() that selects the type of interval: "prediction" for a new observation, "confidence" for the mean response, or "none" for the fitted value only.

level = : An argument of predict() that specifies the confidence level (CL) for the interval; the default is 0.95.

To calculate a prediction interval for a specific size:

R




# Specify the size for which we want to predict the price
new_size <- 1800
prediction <- predict(model, data.frame(size = new_size),
                      interval = "prediction", level = CL)


The prediction object is a one-row matrix with three values: the predicted price (fit), the lower prediction limit (lwr), and the upper prediction limit (upr). We can access these values using indexing.

R




# Extract the predicted price, lower limit, and upper limit from the prediction
predicted_price <- prediction[1]
lower_limit <- prediction[2]
upper_limit <- prediction[3]


Output:

lower_limit: 320.604525926354

predicted_price : 366.275862068966

upper_limit: 411.947198211577

Using a 95% confidence level, we expect the price of a house of 1800 square feet to fall between the lower limit and the upper limit, provided the model assumptions hold. The lower limit is about $320,600 and the upper limit is about $411,900. These are not strict minimum and maximum prices: roughly 5% of the time, the actual price is expected to fall outside this range. The predicted price, about $366,300, is the model's single best estimate. This interval gives the buyer as well as the seller a realistic price range, helping them make better decisions and reduce the risk of loss.
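
The width of the interval depends on the chosen confidence level. As a small sketch, assuming the same house data and a few arbitrarily chosen levels (0.90, 0.95, 0.99), the code below computes the width of the prediction interval at 1800 square feet for each level; the names new_house and widths are illustrative.

```r
# House data from the example above
size  <- c(1200, 1500, 1700, 1900, 2200)
price <- c(250, 320, 340, 400, 430)
model <- lm(price ~ size)

new_house <- data.frame(size = 1800)
levels    <- c(0.90, 0.95, 0.99)   # confidence levels to compare

# Width (upper limit minus lower limit) of the prediction interval per level
widths <- vapply(levels, function(cl) {
  p <- predict(model, new_house, interval = "prediction", level = cl)
  unname(p[1, "upr"] - p[1, "lwr"])
}, numeric(1))

print(data.frame(level = levels, width = round(widths, 1)))
```

A higher confidence level gives a wider interval: to be more certain of capturing the actual price, the range has to cover more possible values.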

We can also visualize our prediction interval by plotting these values:

  • plot(): creates a plot; given two numeric vectors it draws a scatter plot.
  • abline(): adds a straight line to a plot; given a fitted model it draws the regression line.
  • points(): adds points to an existing plot.
  • lines(): adds connected line segments to an existing plot.
  • xlab = ' ': sets the label for the x-axis.
  • ylab = ' ': sets the label for the y-axis.
  • main = ' ': sets the main title of the plot.
  • pch = : sets the plotting symbol (20 is a small solid circle).
  • lwd = : sets the line width.
  • lty = : sets the line type, such as "dashed".
  • col = ' ': sets the color.

Now to plot the prediction interval on a scatter plot:

R




# Create a scatter plot of size versus price
plot(size, price, xlab = "Size", ylab = "Price", main = "Housing Price vs. House Size")
abline(model, col = "green")
points(new_size, prediction[1], col = "black", pch = 20)
lines(c(new_size, new_size), c(prediction[2], prediction[3]),
      col = "red", lwd = 2, lty = "dashed")


Output:

Scatter plot of prediction interval for linear regression

Here, the dashed line represents the prediction interval, and the dot represents the predicted price at the new size. In this example, we used the prediction interval to estimate the range of plausible prices for a house of a specific size, 1800 square feet. The model estimates with 95% confidence that the actual price will fall between about $320,600 and $411,900, with a point prediction of about $366,300. With only five houses in the dataset the interval is wide, but it still gives buyers and sellers an idea of a reasonable price range.
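
A prediction interval should not be confused with a confidence interval. The predict() function for linear models also accepts interval = "confidence", which gives a range for the average price of all houses of that size, not for one individual house. The sketch below, using the same house data with illustrative object names (pi95, ci95), compares the two.

```r
# House data from the example above
size  <- c(1200, 1500, 1700, 1900, 2200)
price <- c(250, 320, 340, 400, 430)
model <- lm(price ~ size)

new_house <- data.frame(size = 1800)

# Interval for a single new house vs. interval for the mean price at this size
pi95 <- predict(model, new_house, interval = "prediction", level = 0.95)
ci95 <- predict(model, new_house, interval = "confidence", level = 0.95)

print(rbind(prediction = pi95[1, ], confidence = ci95[1, ]))

widths <- c(prediction = unname(pi95[1, "upr"] - pi95[1, "lwr"]),
            confidence = unname(ci95[1, "upr"] - ci95[1, "lwr"]))
print(widths)
```

Both intervals are centered on the same fitted value, but the prediction interval is wider because it also accounts for the scatter of individual prices around the regression line.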
Prediction intervals are commonly used in various other fields, such as finance, weather forecasting, machine learning, etc. Here are some other examples of how they are used:

Example:

Weather Forecasting

Prediction intervals are widely used in weather forecasting to express the uncertainty of a forecast. For example, when humidity is estimated from temperature, forecasters often provide a range of possible values instead of a single point value. This range helps us plan our day, since we know the interval of possibilities. The upper prediction limit is the highest humidity we expect at a given temperature at the chosen confidence level, and the lower prediction limit is the lowest humidity we expect at that temperature. We can understand this better with an example:

  • set.seed(): fixes the starting point (seed) of R's random number generator so that results involving random numbers are reproducible. The data in this example are fixed, so it does not affect the output.
  • cat(): prints its arguments to the console.
  • seq(): generates a regular sequence of values.

Let’s try to find out the humidity range at different temperatures using a linear regression model:

R




# Generate example data
set.seed(1)
temperature <- seq(0, 30, by = 5)
humidity <- c(40, 45, 55, 60, 65, 70, 75)
weather_data <- data.frame(temperature, humidity)
 
# Fit linear regression model
model <- lm(humidity ~ temperature, data = weather_data)
 
# New temperature values for prediction
new_temperatures <- seq(5, 25, by = 5)
 
# Calculate predicted values and prediction intervals
predictions <- predict(model, newdata = data.frame(temperature = new_temperatures),
                       interval = "prediction")
 
# Extract lower, upper, and modal values
lower_values <- predictions[, "lwr"]
upper_values <- predictions[, "upr"]
modal_values <- predictions[, "fit"]
 
# Print predicted values and prediction intervals
for (i in seq_along(new_temperatures)) {
  cat("Temperature:", new_temperatures[i], "\n")
  cat("Predicted Lower Value:", lower_values[i], "\n")
  cat("Predicted Upper Value:", upper_values[i], "\n")
  cat("Predicted Modal Value:", modal_values[i], "\n")
  cat("\n")
}


Output:

Temperature: 5 
Predicted Lower Value: 42.01531 
Predicted Upper Value: 51.55612 
Predicted Modal Value: 46.78571 

Temperature: 10 
Predicted Lower Value: 48.11126 
Predicted Upper Value: 57.24589 
Predicted Modal Value: 52.67857 

Temperature: 15 
Predicted Lower Value: 54.07385 
Predicted Upper Value: 63.06901 
Predicted Modal Value: 58.57143 

Temperature: 20 
Predicted Lower Value: 59.89697 
Predicted Upper Value: 69.0316 
Predicted Modal Value: 64.46429 

Temperature: 25 
Predicted Lower Value: 65.58674 
Predicted Upper Value: 75.12755 
Predicted Modal Value: 70.35714

The predicted values give a humidity range for each temperature. For example, at 5 degrees Celsius the 95% prediction interval (0.95 is the default level in predict()) runs from about 42% to about 52% humidity, and the point prediction, labeled the modal value in the code, is about 47%. The limits are not absolute bounds; humidity is expected to fall outside them about 5% of the time. Such an interval not only helps us plan our day or choose appropriate clothing but also plays an important role in agriculture, as humidity has a significant impact on irrigation, crop growth, etc. Thus the prediction interval is an important tool for the agriculture sector as well.
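
The intervals in this output are not all equally wide. Because of the (x - x̄)² term in the standard error, the interval is narrowest at the mean of the observed temperatures and widens as we move away from it. A short sketch, reusing the weather data above with illustrative names (weather_model, width), makes this visible.

```r
# Weather data from the example above
temperature <- seq(0, 30, by = 5)
humidity    <- c(40, 45, 55, 60, 65, 70, 75)
weather_model <- lm(humidity ~ temperature)

new_temperatures <- seq(5, 25, by = 5)
pred <- predict(weather_model,
                newdata = data.frame(temperature = new_temperatures),
                interval = "prediction")

# Width of each interval and distance of each temperature from the mean
width <- pred[, "upr"] - pred[, "lwr"]
print(data.frame(
  temperature        = new_temperatures,
  distance_from_mean = abs(new_temperatures - mean(temperature)),
  width              = round(width, 2)
))
```

The smallest width appears at 15 degrees, the mean of the observed temperatures, and the widths are symmetric around it.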

We can also plot the prediction interval on the graph:

R




# to plot the predicted values and prediction intervals
plot(weather_data$temperature, weather_data$humidity, pch = 16, xlab = "Temperature",
     ylab = "Humidity", main = "Weather Forecasting")
lines(new_temperatures, modal_values, col = "green", lwd = 2)
lines(new_temperatures, lower_values, col = "red", lty = 2)
lines(new_temperatures, upper_values, col = "red", lty = 2)
legend("topleft", legend = c("Modal Value", "Prediction Interval"),
       col = c("green", "red"), lwd = c(2, 1), lty = c(1, 2))


Output:

Prediction interval of Humidity vs Temperature

In the above graph, the green line represents the predicted values from the model, and the red dashed lines mark the lower and upper limits of the prediction interval. New observations are expected to fall between the two dashed lines about 95% of the time, not with certainty. This visualization makes the prediction intervals easier to understand.

Example:

Sales Forecasting

Let’s take another real-world example: predicting sales of a product based on advertising expenses. We have a dataset with the following information:

Advertising Expenses (in thousands of dollars): [2, 4, 6, 8, 10]

Sales (in thousands of units): [5, 7, 8, 11, 14]

We’ll again assume a 95% confidence level for the prediction interval.

R




# Import the data
advertising_expenses <- c(2, 4, 6, 8, 10)
sales <- c(5, 7, 8, 11, 14)
 
# Fit the linear regression model
model <- lm(sales ~ advertising_expenses)
 
# Define the new advertising expense value for prediction
new_advertising_expense <- 7
 
# Calculate the prediction interval
confidence_level <- 0.95
new_data <- data.frame(advertising_expenses = new_advertising_expense)
prediction <- predict(model, newdata = new_data, interval = "prediction",
                      level = confidence_level)
 
# Extract the prediction interval values
lower_limit <- prediction[1, "lwr"]
predicted_sales <- prediction[1, "fit"]
upper_limit <- prediction[1, "upr"]
 
# Print the prediction interval values
cat("Lower Limit:", lower_limit, "\n")
cat("Predicted Sales:", predicted_sales, "\n")
cat("Upper Limit:", upper_limit, "\n")


Output:

Lower Limit: 7.527659 

Predicted Sales: 10.1 

Upper Limit: 12.67234 

The above values show the prediction for sales on the basis of advertising expenses. For an advertising expense of $7,000, the 95% prediction interval runs from about 7,528 units to about 12,672 units, and the point prediction is 10,100 units. Sales can still fall outside this range, but the model expects that to happen only about 5% of the time. Businesses can use this information to plan their production, inventory, and staffing levels accordingly. It helps them execute their marketing plans more effectively, spend resources more wisely, and make better decisions that will increase sales.

To plot the data together with the prediction interval:

R




# Plot the prediction interval on a graph
plot(advertising_expenses, sales, main = "Sales Forecasting",
     xlab = "Advertising Expenses (in thousands of dollars)",
     ylab = "Sales (in thousands of units)", pch = 16)
# Add the regression line
abline(model, col = "green")
# Add the predicted point
points(new_advertising_expense, predicted_sales, col = "red", pch = 16) 
segments(new_advertising_expense, lower_limit, new_advertising_expense,
         upper_limit, col = "red", lwd = 2) 


Output:

Sales Forecasting using Prediction Interval

The red vertical line in the above graph represents the prediction interval for the sales value. It extends from the lower limit of the prediction interval (about 7,528 units) to the upper limit (about 12,672 units) for the given advertising expense. The red dot is the point prediction of sales, 10,100 units, which lies on the regression line at the center of the interval.

Example:

Website Traffic Estimation

Another use of prediction intervals is estimating website traffic on the basis of historical data. In this example, we will look at how to anticipate website traffic using linear regression in R and how to visualize the predicted values and prediction intervals. The intervals are calculated from past visitor counts recorded at successive time periods. Website owners use this tool to manage advertisements, improve user experience, and plan promotions.

R




# Generate example data
website_traffic <- data.frame(
  time = 1:10,
  traffic = c(120, 150, 175, 200, 220, 250, 280, 300, 320, 350)
)
 
# Perform linear regression
lm_model <- lm(traffic ~ time, data = website_traffic)
 
# Predict values
pred_data <- data.frame(time = 11:15) 
pred_values <- predict(lm_model, newdata = pred_data, interval = "prediction")
 
# Combine predicted, lower, and upper values
prediction_results <- data.frame(
  Time = pred_data$time,
  Predicted = pred_values[, "fit"],
  Lower_Bound = pred_values[, "lwr"],
  Upper_Bound = pred_values[, "upr"]
)
 
# Print prediction results
print(prediction_results)


Output:

  Time Predicted Lower_Bound Upper_Bound
1   11  375.0000    365.7760    384.2240
2   12  400.1818    390.5112    409.8524
3   13  425.3636    415.1968    435.5305
4   14  450.5455    439.8396    461.2513
5   15  475.7273    464.4458    487.0088

In the above output, the Time column shows the time periods for which we have made predictions. The Lower_Bound and Upper_Bound columns give the 95% prediction interval for the traffic in each period, and the Predicted column gives the point prediction of the number of visitors. Time periods 11 to 15 lie beyond the observed data (periods 1 to 10), so these predictions assume that the linear trend continues. This output tells us how many visitors to expect and how much the actual number may vary, which helps us make better decisions about administration, maintenance, and campaigning. For example, website owners can schedule campaigns and advertisements for the periods with the highest expected traffic so that they reach more people.
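
Forecasting further beyond the observed time periods makes the interval wider. The sketch below reuses the website traffic data and compares the interval width at a few arbitrarily chosen time points; periods 5 and 10 lie inside the observed range, while 15 and 20 are extrapolations. The names times and width are illustrative.

```r
# Website traffic data from the example above
website_traffic <- data.frame(
  time = 1:10,
  traffic = c(120, 150, 175, 200, 220, 250, 280, 300, 320, 350)
)
lm_model <- lm(traffic ~ time, data = website_traffic)

# Time points to compare: two observed periods and two future periods
times <- c(5, 10, 15, 20)
pred  <- predict(lm_model, newdata = data.frame(time = times),
                 interval = "prediction")

# Width of the 95% prediction interval at each time point
width <- unname(pred[, "upr"] - pred[, "lwr"])
print(data.frame(time = times, width = round(width, 1)))
```

The width grows with the distance from the center of the observed data. Note that the formula only reflects the uncertainty of the fitted line; it cannot tell us whether the linear trend itself will continue, so long-range forecasts should be treated with extra caution.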
To understand our data better we can plot these values on a graph. Data visualization can be achieved with the help of the “ggplot2” package in R, which builds high-quality plots from layers, aesthetics, and geometric objects. To use ggplot2, we first need to install the package once using install.packages("ggplot2").

Once we have installed the package we can load the package into our R session using the library() function and then plot the graph:

  • library(): loads an installed package into the R session.
  • ggplot(): initializes a new ggplot object, which serves as the canvas for creating the plot.
  • pred_data: the data frame that contains the time periods for which predictions were made.
  • prediction_results: the data frame created above that holds the predicted values and the lower and upper bounds.
  • aes(): specifies the aesthetic mappings, that is, which columns are mapped to the x-axis, the y-axis, and so on.
  • geom_line(): adds a line to the plot.
  • geom_point(): adds points to the plot.
  • geom_ribbon(): draws a shaded band between ymin and ymax, used here for the prediction interval.
  • labs(): sets the axis labels.
  • theme_minimal(): applies a minimalistic theme to the plot.

Now, to plot the predicted values on a graph:

R




library(ggplot2)
# Plotting the historical data, the predicted values and the prediction intervals
# The predictions are stored in prediction_results (columns Time, Predicted,
# Lower_Bound, Upper_Bound), so those layers map their own x, y, ymin and ymax
ggplot(pred_data, aes(x = time)) +
  geom_line(data = website_traffic, aes(y = traffic), color = "green") +
  geom_point(data = website_traffic, aes(y = traffic), color = "darkgreen") +
  geom_line(data = prediction_results, aes(x = Time, y = Predicted), color = "red") +
  geom_ribbon(data = prediction_results,
              aes(x = Time, ymin = Lower_Bound, ymax = Upper_Bound),
              fill = "lightgreen", alpha = 0.5) +
  labs(x = "Time", y = "Website Traffic") +
  theme_minimal()


Output:

Website Traffic Prediction Graph

  • The green line represents the historical website traffic values over time. It shows the overall trend or pattern observed in the historical data.
  • The dark green dots represent the individual data points of the historical website traffic. They show the specific value at each point in time.
  • The red line represents the predicted values for website traffic. It shows the estimated values for future time periods based on the linear regression model.
  • The light green ribbon represents the prediction intervals. It depicts the uncertainty associated with the predicted values graphically: the greater the uncertainty, the wider the ribbon; the lesser the uncertainty, the narrower the ribbon.

The graph gives a simple depiction of website traffic patterns, trends, and estimated future values by showing historical data, predicted values, and prediction intervals together. It helps in comparing expectations with the historical data and in judging the related uncertainty.

In conclusion, prediction intervals are a useful tool in linear regression and are applied in many fields to quantify the uncertainty of future values. We use linear regression to find the best-fitting relationship between the variables, and R provides functions such as lm() and predict() that make the calculation straightforward. As the examples above show, prediction intervals support better decision making: knowing the likely range of an estimate, and not only a single value, reduces the risk of loss. By examining the lower and upper prediction limits we can make informed decisions and set realistic expectations. For this reason, prediction intervals are widely used across industries.


