
7 Steps to Run a Linear Regression Analysis using R

Last Updated : 09 Nov, 2023

Linear Regression is a useful statistical tool for modelling the relationship between a dependent variable and one or more independent variables. It is widely used in many disciplines, such as science, medicine, economics, and education. For instance, several areas of education employ linear regression to estimate student performance or identify the factors influencing student performance. It can also be applied in the healthcare industry to comprehend how different elements, such as age and diet, affect a certain medical condition. This aids in inference and improves the accuracy of forecasts. The seven stages of performing a linear regression analysis will be covered in this post.

The seven steps to run a linear regression analysis are:

  1. Install and load necessary packages
  2. Load your data
  3. Explore and Understand the data
  4. Create the model
  5. Get a model summary
  6. Make predictions
  7. Plot and visualize your model

We will first go through the syntax for each step and then apply the steps to an example.

Step 1: Install and load the necessary packages

Before starting the analysis, install and load the packages you need. For example, “ggplot2” provides plotting functions and “dplyr” provides data-manipulation helpers. The lm() function itself ships with base R, so extra packages are only needed for supporting tasks such as visualization. The syntax to install and load a package is:

# Install and load package
install.packages("ggplot2") 
library(ggplot2)

Step 2: Load your data

Import your data into R with a function such as “read.csv” or “read.table”, both of which return a data frame. Here, “your_data.csv” is the path of the file you want to read into R.

# Load your data
data <- read.csv("your_data.csv")
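
As a minimal sketch of this step, the snippet below writes a tiny made-up table to a temporary CSV file and reads it back with read.csv(). The column names hours and score, the values, and the variable names are purely illustrative assumptions, not part of any real dataset.

```r
# Write a small, made-up table to a temporary CSV file (illustrative values only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("hours,score", "1,52", "2,58", "3,65", "4,71"), csv_path)

# Read the file back; read.csv() returns a data frame
example_data <- read.csv(csv_path)

# Quick checks that the import produced what we expect
is.data.frame(example_data)
dim(example_data)
names(example_data)
```

Checking the class, dimensions and column names right after import is a cheap way to catch a wrong separator or a missing header row before any modelling starts.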

Step 3: Explore and Understand the data

It is important to understand the data before modelling it. Functions such as summary(), str(), head(), and tail() help with this. str() displays the structure of the data compactly (the type of each column and its first few values), which is especially useful when the dataset is large.
summary() gives the minimum, maximum, mean, median, and 1st and 3rd quartiles of each numeric column.
head() returns the first rows of the data frame (six by default), whereas tail() returns the last rows.

# Explore and understand the data
summary(data)
head(data)
tail(data)
str(data)

Step 4: Create the model

Creating a linear regression model means estimating the relationship between a response variable and one or more predictor variables. In R, the lm() function is used to fit linear regression models. Its two main arguments are formula and data. The formula names the response variable on the left of the tilde (~) and the predictors on the right; data is the data frame that contains those variables.

# Create a linear regression model
model <- lm(response_variable ~ predictor_variable, data = data)
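
To make the formula argument concrete, here is a small sketch with a made-up data frame. The names toy_data, y, x1 and x2 and all values are assumptions chosen only for illustration. It shows a formula with one predictor and a formula with two predictors joined by +.

```r
# Made-up data frame, for illustration only
toy_data <- data.frame(
  y  = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2),
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(0, 1, 0, 1, 0, 1)
)

# Simple linear regression: response on the left of ~, one predictor on the right
simple_model <- lm(y ~ x1, data = toy_data)

# Multiple linear regression: predictors are joined with +
multiple_model <- lm(y ~ x1 + x2, data = toy_data)

# coef() returns the estimated intercept and slope(s)
coef(simple_model)
coef(multiple_model)
```

R adds the intercept term automatically, which is why the simple model has two coefficients and the multiple model has three.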

Step 5: Get a model summary

The summary() function, applied to a fitted model, returns detailed information about the fit: the minimum, 1st quartile, median, 3rd quartile and maximum of the residuals, the coefficient estimates with their standard errors and p-values, and the R-squared values.

# Get model summary
summary(model)

Step 6: Make Predictions

Once the model is fitted, we can use predict() to make predictions for new data and draw conclusions from them.

# Make predictions
# newdata must be a data frame containing the predictor columns used in the model
predictions <- predict(model, newdata = data)
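
One detail worth checking here: for lm models, predict() receives new observations through the newdata argument, and when that argument is omitted it returns the fitted values for the training rows instead. The sketch below illustrates the difference on made-up data; toy_data, toy_model and unseen are hypothetical names.

```r
# Made-up training data, for illustration only
toy_data  <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.2, 4.1, 5.9, 8.2, 9.8))
toy_model <- lm(y ~ x, data = toy_data)

# Without newdata, predict() returns the fitted values for the training rows
fitted_values <- predict(toy_model)

# With newdata, it predicts for the supplied rows;
# the column names must match the predictors used in the formula
unseen <- data.frame(x = c(6, 7))
new_predictions <- predict(toy_model, newdata = unseen)

fitted_values
new_predictions
```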

Step 7: Visualize our model

Visualizing the model helps us understand it and assess the fit. The “ggplot2” package loaded in step one provides the plotting functions. In geom_smooth(), method = "lm" draws the fitted regression line, and se = FALSE omits the shaded confidence band around the line.

# Create a scatterplot with the regression line
# (each + must end a line, not start one, so that R continues the expression)
ggplot(data, aes(x = predictor_variable, y = response_variable)) +
  # Add points to the plot
  geom_point() +
  # Add a regression line to the scatterplot
  geom_smooth(method = "lm", se = FALSE)

Create model using mtcars dataset

We can understand all these steps easily with the help of an example. For this example, we will load a sample dataset that comes with R, called the “mtcars” dataset. This dataset contains information about various car models. It is a built-in dataset in R. We will apply all the above-mentioned steps to our dataset:

1. Install and load necessary packages

#install packages
install.packages("ggplot2")
#load packages
library(ggplot2)


2. Load your data

# Load the mtcars dataset
data(mtcars) 


3. Explore and Understand the data:

# Explore and understand the data
#show the summary of our data
summary(mtcars)


Output:

      mpg             cyl             disp             hp             drat      
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930  
       wt             qsec             vs               am              gear      
 Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000   Min.   :3.000  
 1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000  
 Median :3.325   Median :17.71   Median :0.0000   Median :0.0000   Median :4.000  
 Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062   Mean   :3.688  
 3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000  
 Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000   Max.   :5.000  
      carb      
 Min.   :1.000  
 1st Qu.:2.000  
 Median :2.000  
 Mean   :2.812  
 3rd Qu.:4.000  
 Max.   :8.000  

summary() gave us an overview of the data. Calling head(mtcars) would additionally show the first 6 rows of the dataset, with one row per car model and columns such as mpg (miles per gallon), cyl (number of cylinders), and hp (horsepower). These are the attributes of our dataset. Next we specify the model: the formula parameter places the dependent variable on the left side of the tilde (~) and the independent variables on the right side. Here we model mpg as a function of hp, and then examine the fitted model with the summary() function.
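
The other exploration functions from step 3 can be applied to mtcars in the same way. The sketch below is one possible way to do it; the cor() call is an addition not used elsewhere in this article, included only as a quick check of how strongly hp and mpg move together before fitting the model.

```r
# mtcars ships with base R (datasets package), so no download is needed
data(mtcars)

# First six rows: one row per car model, one column per attribute
head(mtcars)

# Compact structure: number of observations and the type of each column
str(mtcars)

# Correlation between horsepower and miles per gallon
hp_mpg_cor <- cor(mtcars$hp, mtcars$mpg)
hp_mpg_cor
```

A negative correlation here suggests that a straight line with a negative slope is a reasonable first model for mpg against hp.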

4. Create the model

# Fit a linear regression model
model <- lm(mpg ~ hp, data = mtcars)


5. Get a model summary:

# Get summary
summary(model)


Output:

Call:
lm(formula = mpg ~ hp, data = mtcars)


Residuals:
    Min      1Q  Median      3Q     Max 
-5.7121 -2.1122 -0.8854  1.5819  8.2360 


Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
hp          -0.06823    0.01012  -6.742 1.79e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Residual standard error: 3.863 on 30 degrees of freedom
Multiple R-squared:  0.6024,    Adjusted R-squared:  0.5892 
F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

  • For a fitted model, summary() reports the minimum, 1st quartile, median, 3rd quartile and maximum of the residuals.
  • Residuals are the differences between the observed values and the values predicted by the model.
  • “Estimate” gives the fitted coefficients: the intercept (30.09886) is the predicted mpg at 0 horsepower, and the hp coefficient (-0.06823) means the predicted mpg falls by about 0.068 for each additional unit of horsepower.
  • “Std. Error” represents the standard error of the coefficient estimates.
  • “t value” is the estimate divided by its standard error and is used to test the significance of the coefficients.
  • “Pr(>|t|)” represents the p-value associated with each coefficient.
  • “Multiple R-squared”, which is 0.6024 here, represents the proportion of variance in the dependent variable (mpg) that is explained by the model, i.e. about 60%.
  • “Adjusted R-squared”, 0.5892 in this summary, is a modified version of R-squared that adjusts for the number of predictors in the model.
  • “F-statistic” tests the overall significance of the model; its p-value here (1.788e-07) is very small, indicating that the model is statistically significant.

The summary tells us how well the model fits the data. With the fitted model in hand, we can now make predictions.
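
The numbers printed by summary() can also be extracted programmatically rather than read off the console. The following is a minimal sketch using the same mpg ~ hp model; the variable names model_summary and coef_table are illustrative, and the confint() call is an optional extra that is not part of the steps above.

```r
# Refit the model so that this snippet is self-contained
model <- lm(mpg ~ hp, data = mtcars)
model_summary <- summary(model)

# Coefficient table as a numeric matrix:
# columns are Estimate, Std. Error, t value and Pr(>|t|)
coef_table <- coef(model_summary)
hp_slope   <- coef_table["hp", "Estimate"]
hp_pvalue  <- coef_table["hp", "Pr(>|t|)"]

# Goodness-of-fit measures stored in the summary object
r_squared     <- model_summary$r.squared
adj_r_squared <- model_summary$adj.r.squared
resid_se      <- model_summary$sigma

# 95% confidence intervals for the intercept and slope
confint(model, level = 0.95)
```

Extracting values this way avoids copying rounded numbers by hand when they are needed in later calculations or reports.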

6. Make Predictions

# Predict mpg for a car with 300 horsepower
new_data <- data.frame(hp = 300)
predicted_mpg <- predict(model, new_data)
predicted_mpg


Output:

9.630377 

9.630 is the predicted mpg (miles per gallon) for a car with 300 horsepower. It follows directly from the fitted equation: 30.09886 - 0.06823 × 300 ≈ 9.63. We can also show the fitted relationship on a graph.
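
A single predicted value says nothing about its uncertainty. As a hedged sketch, the code below recomputes the prediction by hand from the coefficients and then asks predict() for a 95% prediction interval. The interval argument is a standard option of predict() for lm models, but its use here is an addition to the steps in this article, and manual_mpg and pred_interval are illustrative names.

```r
# Refit the model so that this snippet is self-contained
model    <- lm(mpg ~ hp, data = mtcars)
new_data <- data.frame(hp = 300)

# Recompute the prediction by hand: intercept + slope * hp
manual_mpg <- unname(coef(model)["(Intercept)"] + coef(model)["hp"] * 300)
manual_mpg

# Same point prediction plus a 95% prediction interval
# (columns: fit = point prediction, lwr/upr = interval bounds)
pred_interval <- predict(model, newdata = new_data,
                         interval = "prediction", level = 0.95)
pred_interval
```

The fit column should match the hand-computed value, while lwr and upr give the range in which a single new car's mpg is expected to fall under the model's assumptions.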

7. Visualize our model

# Create a scatter plot using ggplot2
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "darkgreen") +
  geom_smooth(method = "lm", color = "green") +
  labs(x = "Horsepower", y = "Miles Per Gallon", title = "Scatterplot of hp vs. mpg")


Output:

[Figure: Mpg vs Horsepower graph with the linear regression line]

The line is the linear regression fit drawn by geom_smooth(method = "lm"), and the shaded band around it is the default 95% confidence interval. The dots represent the cars in our dataset. The plot shows that mpg tends to decrease as horsepower increases, which can support decision making.

In this example, we took R's built-in “mtcars” dataset, which holds information about cars and their attributes. We explored the data and then predicted miles per gallon for a car with 300 horsepower. Such a prediction helps in product analysis, in comparison with other cars, and in analysing the performance of a particular car. The summary statistics help us understand the typical attributes of the cars in the dataset. We also visualized the data with a scatter plot and the fitted regression line for better understanding.

In this example we will run linear regression on a small dataset in which we want to understand the relationship between the amount of rainfall and the yield of a specific crop. For this we will create our own fictional dataset and perform linear regression analysis.
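
A minimal sketch of such an analysis is given below. Every number in it is invented for illustration: the rainfall values (taken here to be in millimetres), the yield values (taken to be in tonnes per hectare), and the names crop_data and crop_model are all assumptions, so the results carry no real-world meaning.

```r
# Fictional data: rainfall and the corresponding crop yield (invented values)
crop_data <- data.frame(
  rainfall = c(50, 60, 70, 80, 90, 100, 110, 120),
  yield    = c(2.0, 2.3, 2.9, 3.1, 3.8, 4.0, 4.4, 4.9)
)

# Explore the data
summary(crop_data)

# Fit the model: yield as a function of rainfall
crop_model <- lm(yield ~ rainfall, data = crop_data)

# Inspect coefficients, R-squared and p-values
summary(crop_model)

# Predict the yield for a rainfall value not present in the data
predicted_yield <- predict(crop_model, newdata = data.frame(rainfall = 95))
predicted_yield
```

The same seven steps apply here as in the mtcars example; only the loading step changes, because the data frame is built by hand instead of being read from a file or a built-in dataset.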

Conclusion

In this article, we learned the seven steps needed to run a linear regression analysis in R. We understood the concept with the help of four different examples based on different fields such as education, weather forecasting, wage estimation and prediction using the cars dataset. We handled data in different ways as well, from using a dataset built into R to loading datasets from other websites and then analysing them. We learned to build models and predict values based on historical data, and we worked with datasets of different sizes. Linear regression analysis helps us understand the relationships and dependencies between variables. This supports better decision making and helps avoid waste through better preparation.


