Cross-Validation in R programming

The main challenge in designing a machine learning model is making it perform accurately on unseen data. To find out whether a model generalizes, we have to test it against data points that were not present during training. These data points act as unseen data for the model and make it possible to evaluate its accuracy. Cross-validation is one of the most widely used techniques for checking the effectiveness of a machine learning model, and it is easy to implement in the R programming language. In cross-validation, a portion of the dataset is held out and not used for training. Once the model is trained, the held-out portion is used for testing. The values of the dependent variable are predicted during the testing phase, and the model accuracy is calculated from the prediction error, i.e., the difference between the actual and predicted values of the dependent variable. Several statistical metrics are used to evaluate the accuracy of regression models:

  1. Root Mean Squared Error (RMSE): As the name suggests, it is the square root of the average squared difference between the actual and predicted values of the target variable. It measures the typical size of the prediction error and penalizes large errors more heavily. A lower RMSE indicates a more accurate model.
  2. Mean Absolute Error (MAE): This metric is the average of the absolute differences between the actual values and the values predicted by the model for the target variable. It is less sensitive to outliers than RMSE, so it is a suitable choice when a few large errors should not dominate the evaluation. A lower MAE indicates a better model.
  3. R-squared (R2): This metric indicates what percentage of the variance in the dependent variable is explained collectively by the independent variables. In other words, it reflects the strength of the relationship between the model and the target variable on a scale of 0 to 100%. Unlike RMSE and MAE, it is not an error measure: a better model has a higher R-squared.
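The three metrics above can be computed by hand in a few lines of base R. This is a minimal sketch: the `actual` and `predicted` vectors are made-up values for illustration, and R-squared is computed here as the squared correlation between predictions and actual values, which is one common definition (other definitions exist and can give slightly different numbers).

```r
# Hypothetical actual and predicted values, used only for illustration
actual    <- c(10, 12, 15, 20, 25)
predicted <- c(11, 11, 16, 18, 27)

# prediction errors
errors <- actual - predicted

# RMSE: square root of the mean squared error
rmse <- sqrt(mean(errors^2))

# MAE: mean of the absolute errors
mae <- mean(abs(errors))

# R-squared as the squared correlation between
# predicted and actual values (one common definition)
r_squared <- cor(predicted, actual)^2

print(data.frame(RMSE = rmse, MAE = mae, R2 = r_squared))
```

With these values, RMSE (about 1.48) is slightly larger than MAE (1.4) because squaring gives the two larger errors more weight.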

Types of Cross-Validation

When the complete dataset is partitioned into a training set and a validation set, some important data points may be left out of the training set. Since the model never sees those points, it may fail to detect some patterns. This situation can lead to overfitting or underfitting of the model. To reduce this risk, several cross-validation techniques sample the training and validation sets randomly and systematically, which gives a more reliable estimate of the model's accuracy. Some of the most popular cross-validation techniques are:

  • Validation Set Approach
  • Leave-one-out cross-validation (LOOCV)
  • K-fold cross-Validation
  • Repeated K-fold cross-validation

Loading the Dataset

To implement linear regression, we use the marketing dataset, which ships with the datarium package for R. The code below loads this dataset into your R environment.

# loading required packages

# package to perform data manipulation
# and visualization
library(tidyverse)

# package to compute
# cross-validation methods
library(caret)

# installing the package that contains
# the dataset, if it is not already installed
if (!requireNamespace("datarium", quietly = TRUE)) {
  install.packages("datarium")
}

# loading the dataset
data("marketing", package = "datarium")

# inspecting the dataset
head(marketing)

 Output:



   youtube facebook newspaper sales
1  276.12    45.36     83.04 26.52
2   53.40    47.16     54.12 12.48
3   20.64    55.08     83.16 11.16
4  181.80    49.56     70.20 22.20
5  216.96    12.96     70.08 15.48
6   10.44    58.68     90.00  8.64

Validation Set Approach (or data split)

In this method, the dataset is divided randomly into a training set and a testing set. The following steps are performed to implement this technique:

  1. Randomly split the dataset into training and testing sets
  2. Train the model on the training set
  3. Apply the resulting model to the testing set
  4. Calculate the prediction error using model performance metrics

Below is the implementation of this method:

# R program to implement
# validation set approach

# setting seed to generate a
# reproducible random sampling
set.seed(123)

# creating training data as 80% of the dataset
random_sample <- createDataPartition(marketing$sales,
                                     p = 0.8, list = FALSE)

# generating training dataset
# from the random_sample
training_dataset <- marketing[random_sample, ]

# generating testing dataset
# from rows which are not
# included in random_sample
testing_dataset <- marketing[-random_sample, ]

# Building the model

# training the model by assigning sales column
# as target variable and all other columns
# as independent variables
model <- lm(sales ~ ., data = training_dataset)

# predicting the target variable
predictions <- predict(model, testing_dataset)

# computing model performance metrics
data.frame(R2 = R2(predictions, testing_dataset$sales),
           RMSE = RMSE(predictions, testing_dataset$sales),
           MAE = MAE(predictions, testing_dataset$sales))

Output:

       R2     RMSE      MAE
1 0.9049049 1.965508 1.433609

Advantages:

  • One of the most basic and simple techniques for evaluating a model.
  • No complex steps for implementation.

Disadvantages:

  • The performance estimate depends heavily on which observations end up in the training set and which in the validation set.
  • The model is trained on only one subset of the data, so it can be biased toward that subset.
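The dependence on the split can be seen by repeating the validation set approach with different random seeds. The sketch below is an illustration only: it uses the built-in `mtcars` dataset as a stand-in so that it runs without extra packages, the formula `mpg ~ wt + hp` is an arbitrary choice, and `split_rmse()` is a hypothetical helper, not a function from any package.

```r
# split_rmse() is a hypothetical helper written for this illustration.
# It performs one random 80/20 split and returns the test-set RMSE.
# mtcars and the formula mpg ~ wt + hp are stand-ins (assumptions).
split_rmse <- function(seed, data = mtcars, p = 0.8) {
  set.seed(seed)
  n <- nrow(data)

  # randomly choose the training rows
  train_rows <- sample(n, size = floor(p * n))

  # train on the training rows, predict on the remaining rows
  fit  <- lm(mpg ~ wt + hp, data = data[train_rows, ])
  test <- data[-train_rows, ]
  pred <- predict(fit, newdata = test)

  sqrt(mean((test$mpg - pred)^2))
}

# the same model evaluated on ten different random splits
seeds <- 1:10
rmse_by_seed <- sapply(seeds, split_rmse)

print(round(rmse_by_seed, 3))
print(range(rmse_by_seed))
```

The spread between the smallest and largest RMSE shows how much a single split can move the estimate. To try this on the article's data, swap in `marketing` and `sales ~ .`.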

Leave One Out Cross-Validation (LOOCV)

This method also splits the dataset into two parts, but it overcomes the drawbacks of the validation set approach. LOOCV carries out cross-validation in the following way:

  1. Train the model on N-1 data points
  2. Test the model against the one data point that was left out in the previous step
  3. Calculate the prediction error
  4. Repeat the above 3 steps until every data point has been used once for testing
  5. Generate the overall prediction error by averaging the prediction errors from all cases
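The steps above can be sketched by hand as a simple loop in base R. This is a minimal illustration rather than what any particular package does internally: it uses the built-in `mtcars` dataset and the formula `mpg ~ wt + hp` as stand-ins (both are assumptions), and it summarizes the held-out errors as an RMSE.

```r
# mtcars and the formula mpg ~ wt + hp are stand-ins (assumptions)
data <- mtcars
n <- nrow(data)

# one squared prediction error per left-out data point
squared_errors <- numeric(n)

for (i in seq_len(n)) {
  # step 1: train on all rows except row i
  fit <- lm(mpg ~ wt + hp, data = data[-i, ])

  # step 2: predict the single left-out row
  pred <- predict(fit, newdata = data[i, , drop = FALSE])

  # step 3: record the prediction error for that row
  squared_errors[i] <- (data$mpg[i] - pred)^2
}

# step 5: combine the N errors into one overall RMSE
loocv_rmse <- sqrt(mean(squared_errors))
print(loocv_rmse)
```

The loop fits the model N times, once per data point, which is why the cost grows quickly with the size of the dataset.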

Below is the implementation of this method:

# R program to implement
# Leave One Out Cross-Validation

# defining training control
# as Leave One Out Cross-Validation
train_control <- trainControl(method = "LOOCV")

# training the model by assigning sales column
# as target variable and all other columns
# as independent variables
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)

# printing model performance metrics
# along with other details
print(model)

 Output:

Linear Regression 

200 samples
  3 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 199, 199, 199, 199, 199, 199, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  2.059984  0.8912074  1.539441

Tuning parameter 'intercept' was held constant at a value of TRUE

Advantages:

  • Lower bias, since almost every data point is used for training in each iteration.
  • No randomness in the performance metrics, because the splits are fixed: each data point is held out exactly once.

Disadvantages:

  • Training the model N times leads to expensive computation time if the dataset is large.

K-fold Cross-Validation

This cross-validation technique divides the data into K subsets (folds) of almost equal size. Out of these K folds, one is used as the validation set and the remaining K-1 are used to train the model. The complete procedure is as follows:

  1. Split the dataset into K subsets randomly
  2. Use K-1 subsets for training the model
  3. Test the model against the one subset that was left out in the previous step
  4. Repeat the above steps K times, i.e., until every subset has been used once for testing
  5. Generate the overall prediction error by averaging the prediction errors from all K runs
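The procedure above can be written out by hand in base R to make each step visible. This is a sketch under stated assumptions: the built-in `mtcars` dataset and the formula `mpg ~ wt + hp` are stand-ins, K = 5 is chosen only because the stand-in dataset is small, and the fold assignment shown here is one simple way to create folds of nearly equal size.

```r
# mtcars, the formula mpg ~ wt + hp and k = 5 are
# stand-ins chosen for illustration (assumptions)
set.seed(125)
data <- mtcars
k <- 5
n <- nrow(data)

# step 1: randomly assign every row to one of k folds
# of nearly equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- numeric(k)

for (fold in seq_len(k)) {
  test_rows <- which(fold_id == fold)

  # step 2: train on the other k-1 folds
  fit <- lm(mpg ~ wt + hp, data = data[-test_rows, ])

  # step 3: test on the fold that was left out
  pred <- predict(fit, newdata = data[test_rows, ])
  fold_rmse[fold] <- sqrt(mean((data$mpg[test_rows] - pred)^2))
}

# step 5: average the per-fold errors
cv_rmse <- mean(fold_rmse)

print(fold_rmse)
print(cv_rmse)
```

Printing `fold_rmse` alongside the average shows how much the error varies from fold to fold.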

Below is the implementation of this method:

# R program to implement
# K-fold cross-validation

# setting seed to generate a
# reproducible random sampling
set.seed(125)

# defining training control
# as cross-validation and
# value of K equal to 10
train_control <- trainControl(method = "cv",
                              number = 10)

# training the model by assigning sales column
# as target variable and all other columns
# as independent variables
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)

# printing model performance metrics
# along with other details
print(model)

 Output:

Linear Regression 

200 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 181, 180, 180, 179, 180, 180, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  2.027409  0.9041909  1.539866

Tuning parameter 'intercept' was held constant at a value of TRUE

Advantages:

  • Much faster than LOOCV, since the model is trained only K times.
  • A very effective method to estimate the prediction error and the accuracy of a model.

Disadvantages:

  • A lower value of K gives a more biased estimate of the prediction error, while a higher value of K can increase the variability of the performance metrics. It is therefore important to choose a suitable value of K (K = 5 and K = 10 are common choices).

Repeated K-fold Cross-Validation

As the name suggests, in this method the K-fold cross-validation algorithm is repeated a certain number of times. Below is the implementation of this method:

# R program to implement
# repeated K-fold cross-validation

# setting seed to generate a
# reproducible random sampling
set.seed(125)

# defining training control as
# repeated cross-validation with
# K equal to 10 and 3 repetitions
train_control <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3)

# training the model by assigning sales column
# as target variable and all other columns
# as independent variables
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)

# printing model performance metrics
# along with other details
print(model)

 Output:

Linear Regression 

200 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 181, 180, 180, 179, 180, 180, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  2.020061  0.9038559  1.541517

Tuning parameter 'intercept' was held constant at a value of TRUE

Advantages:

  • In each repetition, the data is reshuffled, which produces different splits of the sample data.

Disadvantages:

  • Each repetition trains the model from scratch, so the computation time to evaluate the model grows in proportion to the number of repetitions.
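Both points can be seen in a hand-written version of repeated K-fold cross-validation. The sketch below is an illustration only: `kfold_rmse()` is a hypothetical helper, and the built-in `mtcars` dataset, the formula `mpg ~ wt + hp`, K = 5 and 3 repetitions are all stand-ins chosen so that the example runs without extra packages.

```r
# kfold_rmse() is a hypothetical helper written for this illustration.
# It runs one round of K-fold cross-validation with a fresh random
# fold assignment and returns the average per-fold RMSE.
# mtcars and the formula mpg ~ wt + hp are stand-ins (assumptions).
kfold_rmse <- function(data, k) {
  # reshuffle: a new random fold assignment on every call
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))

  fold_rmse <- sapply(seq_len(k), function(fold) {
    test_rows <- which(fold_id == fold)
    fit  <- lm(mpg ~ wt + hp, data = data[-test_rows, ])
    pred <- predict(fit, newdata = data[test_rows, ])
    sqrt(mean((data$mpg[test_rows] - pred)^2))
  })

  mean(fold_rmse)
}

set.seed(125)

# 3 repetitions of 5-fold cross-validation: 15 model fits in total
repeat_rmse <- replicate(3, kfold_rmse(mtcars, k = 5))

# overall estimate: average over the repetitions
repeated_cv_rmse <- mean(repeat_rmse)

print(repeat_rmse)
print(repeated_cv_rmse)
```

Each repetition uses a different split and therefore gives a slightly different RMSE. Averaging over repetitions smooths this out at the cost of proportionally more model fits.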

Note: Repeated K-fold cross-validation is a commonly preferred technique for both regression and classification machine learning models.



