
Repeated K-fold Cross Validation in R Programming

Last Updated : 15 Nov, 2021

Repeated K-fold is one of the most widely used cross-validation techniques for both classification and regression machine learning models. Its core procedure is to shuffle and randomly split the data set multiple times, which yields a robust performance estimate because the model is trained and tested on many different partitions of the data. The technique depends on two parameters. The first is K, an integer stating that the given dataset will be split into K folds (or subsets); the model is trained on K-1 of these subsets, and the remaining subset is used to evaluate the model's performance. These steps are then repeated a certain number of times, determined by the second parameter, which is how the method gets its name: the K-fold cross-validation algorithm is simply repeated several times.

Steps involved in the repeated K-fold cross-validation:

Each iteration of repeated K-fold is an application of the ordinary K-fold algorithm, which involves the following steps:

  1. Randomly split the data set into K subsets (folds)
  2. For each of the K subsets:
    • Treat that subset as the validation set
    • Use all the remaining subsets for training
    • Train the model and evaluate it on the validation set
    • Calculate the prediction error
  3. Repeat the above step K times, i.e., until the model has been trained and tested on every subset
  4. Compute the overall prediction error by averaging the prediction errors from all K iterations (a short sketch of these steps follows)
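
To make these steps concrete, below is a minimal base-R sketch of a single K-fold pass. The iris dataset and the lm() model here are illustrative choices only, not part of the caret workflow used in the rest of this article.

R

# a minimal base-R sketch of one pass of K-fold cross-validation
set.seed(42)
k <- 5
data(iris)

# step 1: randomly assign every row to one of the K folds
fold_id <- sample(rep(1:k, length.out = nrow(iris)))

# steps 2-3: loop over the folds, treating each in turn as the
# validation set and the remaining folds as the training set
errors <- numeric(k)
for (i in 1:k) {
  test  <- iris[fold_id == i, ]
  train <- iris[fold_id != i, ]
  fit   <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
              data = train)
  pred  <- predict(fit, newdata = test)
  errors[i] <- sqrt(mean((test$Sepal.Length - pred)^2))  # fold RMSE
}

# step 4: the overall prediction error is the average over all K folds
mean(errors)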

Thus, in the repeated K-fold cross-validation method, the above steps are repeated on the given dataset a certain number of times. In each repetition, the dataset is split into K folds in a completely different way, so the model's performance score differs as well. Finally, the mean performance score across all repetitions gives the final accuracy of the model. To carry out these tasks, the R language provides a rich library of inbuilt functions and packages. Below is the step-by-step approach to implementing the repeated K-fold cross-validation technique on classification and regression machine learning models.

Implement Repeated K-fold Cross-validation on Classification

When the target variable is of a categorical data type, classification machine learning models are used to predict the class labels. In this example, the Naive Bayes algorithm is used as a probabilistic classifier to predict the class label of the target variable.

Step 1:  Loading the required packages and libraries

All the necessary libraries and packages must be imported to perform the task without any error. Below is the code to set up the R environment for the repeated K-fold algorithm.

R




# load the library
 
# package to perform data manipulation
# and visualization
library(tidyverse)
 
# package to compute
# cross-validation methods
library(caret)
 
 
# loading package to
# import desired dataset
library(ISLR)


Step 2: Exploring the dataset

After importing the required libraries, it's time to load the dataset into the R environment. Exploring the dataset is also important, as it shows whether any changes are required before using it for training and testing. Below is the code to carry out this task.

R




# assigning the complete dataset
# Smarket to a variable
dataset <- Smarket[complete.cases(Smarket), ]
 
# display the dataset with details
# like column name and its data type
# along with values in each row
glimpse(dataset)
 
# checking values present
# in the Direction column
# of the dataset
table(dataset$Direction)


Output:

Rows: 1,250
Columns: 9
$ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, ...
$ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1...
$ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0...
$ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -...
$ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, ...
$ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, ...
$ Volume    <dbl> 1.19130, 1.29650, 1.41120, 1.27600, 1.20570, 1.34910, 1.44500, 1.40780, 1.16400, 1.23260, 1.30900, 1.25800, 1.09800, 1.05310, ...
$ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1.747, 0...
$ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up, Down, Up, Down, Up, Down, Down, Down, Down, Up, Down, Down, Up...

> table(dataset$Direction)

Down   Up 
 602  648

The above information shows that the independent variables of the dataset are of the <dbl> data type, i.e., double-precision floating-point numbers. The target variable of the dataset is "Direction" and it is of the desired data type, namely the factor (<fct>) data type. The values present in the dependent variable are Down and Up, and they are in approximately equal proportion. If there were a class imbalance in the target variable, the following methods could be used to correct it (a brief sketch follows the list):

  • Down Sampling
  • Up Sampling
  • Hybrid Sampling using SMOTE and ROSE
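
Although the Smarket classes are nearly balanced and no correction is needed here, it may help to know that caret's trainControl() function accepts a sampling argument that applies the chosen correction inside each cross-validation fold. A minimal sketch (the "smote" and "rose" options additionally require the DMwR and ROSE packages, respectively):

R

# illustrative only: down-sample the majority class within each fold;
# other accepted values are "up", "smote" and "rose"
ctrl_balanced <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3,
                              sampling = "down")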

Step 3: Building the model with repeated K-fold algorithm

The trainControl() function sets the value of the K parameter (number = 10) and the number of repetitions (repeats = 3). After that, the model is built as per the steps involved in the repeated K-fold algorithm. Below is the implementation.

R




# setting seed to generate a 
# reproducible random sampling
set.seed(123)
 
# define training control which
# generates parameters that further
# control how models are created
train_control <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3)
 
# building the model and
# predicting the target variable
# as per the Naive Bayes classifier
model <- train(Direction~., data = dataset,
               trControl = train_control, method = "nb")


Step 4: Evaluating the accuracy of the model

In this final step, the performance score of the model will be generated after testing it on all possible validation folds. Below is the code to print the accuracy and overall summary of the developed model.

R




# summarize results of the
# model after calculating
# prediction error in each case
print(model)


Output:

Naive Bayes 

1250 samples
   8 predictor
   2 classes: 'Down', 'Up' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 1124, 1125, 1126, 1125, 1125, 1126, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa    
  FALSE      0.9562616  0.9121273
   TRUE      0.9696037  0.9390601

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = TRUE and adjust = 1.
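
The accuracy reported above is the average over all 30 validation sets (10 folds × 3 repeats). If finer detail is needed, the per-resample scores can be inspected and the tuned model can be used for prediction; a short sketch:

R

# accuracy and Kappa obtained on each individual validation fold
head(model$resample)

# predict class labels with the tuned model (applied here to rows
# the model has already seen, purely for illustration)
predict(model, newdata = head(dataset))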

Implement Repeated K-fold Cross-validation on Regression

Regression machine learning models are preferred for datasets in which the target variable is continuous in nature, such as the temperature of an area or the cost of a commodity. The values of the target variable are either integers or floating-point numbers. Below are the steps required to implement the repeated K-fold algorithm as the cross-validation technique in regression models.

Step 1: Loading the dataset and required packages

As the first step, the R environment must be loaded with all essential packages and libraries to perform various operations. Below is the code to import all the required libraries.

R




# loading required packages
 
# package to perform data manipulation
# and visualization
library(tidyverse)
 
# package to compute
# cross-validation methods
library(caret)


Step 2: Loading and inspecting the dataset

Once all packages are imported, it's time to load the desired dataset. Here the "trees" dataset, an inbuilt dataset of the R language, is used for the regression model. Moreover, in order to build a correct model, it is necessary to know the structure of the dataset. All these tasks can be performed using the code below.

R




# access the data from R's datasets package
data(trees)
 
# look at the first several rows of the data
head(trees)


Output:

  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

Step 3: Building the model with the repeated K-fold algorithm

As in the classification example, the trainControl() function defines the resampling scheme: K = 10 folds, repeated 3 times. After that, the model is built as per the steps involved in the repeated K-fold algorithm. Below is the implementation.

R




# setting seed to generate a 
# reproducible random sampling
set.seed(125) 
 
# defining training control as
# repeated cross-validation and 
# value of K is 10 and repetition is 3 times
train_control <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3)
 
# training the model by assigning the Volume column
# as the target variable and all other columns
# as independent variables
model <- train(Volume ~., data = trees, 
               method = "lm",
               trControl = train_control)


Step 4:  Evaluating the accuracy of the model

As per the repeated K-fold algorithm, the model is tested against every unique fold (subset) of the dataset, the prediction error is calculated in each case, and finally the mean of all prediction errors is treated as the final performance score of the model. Below is the code to print the final score and overall summary of the model.

R




# printing model performance metrics
# along with other details
print(model)


Output:

Linear Regression 

31 samples
 2 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 28, 28, 28, 29, 28, 28, ... 
Resampling results:

  RMSE      Rsquared  MAE     
  4.021691  0.957571  3.362063

Tuning parameter 'intercept' was held constant at a value of TRUE
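
As before, the RMSE, R-squared, and MAE values shown are averages over all 30 resamples. The per-resample values, and the final linear model refit on the full dataset, are also available; a short sketch:

R

# per-resample RMSE, R-squared and MAE behind the averages above
head(model$resample)

# the final lm model, refit by caret on all 31 observations
summary(model$finalModel)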

Advantages of Repeated K-fold cross-validation

  • A very effective method to estimate the prediction error and the accuracy of a model.
  • In each repetition, the data sample is shuffled, which results in different splits of the sample data.

Disadvantages of Repeated K-fold cross-validation

  • A lower value of K leads to a more biased estimate, while a higher value of K can lead to high variability in the model's performance metrics. Thus, it is essential to choose the value of K carefully (generally K = 5 or K = 10 is desirable).
  • With each repetition, the algorithm has to train the model from scratch, so the total computation time grows in proportion to the number of repetitions.

