Repeated K-fold is a widely used cross-validation technique for both classification and regression machine learning models. Its core procedure is shuffling and randomly re-splitting the dataset multiple times, which yields a robust estimate of model performance because many different training and testing combinations are covered. The technique depends on two parameters. The first, K, is an integer stating that the given dataset will be split into K folds (or subsets); the model is trained on K-1 of these folds and evaluated on the remaining one. The second parameter sets how many times this whole procedure is repeated, which is where the method gets its name: the K-fold cross-validation algorithm is repeated a certain number of times.
Steps involved in the repeated K-fold cross-validation:
Each repetition of repeated K-fold is one run of the ordinary K-fold algorithm. The K-fold cross-validation technique involves the following steps:
- Split the dataset into K subsets at random
- For each of the K subsets:
  - Treat that subset as the validation set
  - Use all the remaining subsets for training
  - Train the model and evaluate it on the validation set
  - Calculate the prediction error
- Repeat the above steps K times, i.e., until every subset has served once as the validation set
- Compute the overall prediction error by averaging the prediction errors from the K iterations
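To make these steps concrete, below is a minimal base-R sketch of repeated K-fold. The built-in mtcars dataset, a simple linear model, and mean squared error as the prediction error are all illustrative assumptions here; the sections that follow use the caret package, which automates the whole procedure.
R
set.seed(42)
k <- 5        # number of folds
repeats <- 3  # number of repetitions
scores <- c()

for (r in 1:repeats) {
  # Shuffle: randomly assign each row of mtcars to one of the k folds
  folds <- sample(rep(1:k, length.out = nrow(mtcars)))
  for (i in 1:k) {
    test_set  <- mtcars[folds == i, ]   # current fold = validation set
    train_set <- mtcars[folds != i, ]   # remaining folds = training set
    fit  <- lm(mpg ~ wt + hp, data = train_set)
    pred <- predict(fit, newdata = test_set)
    # Prediction error for this fold (mean squared error)
    scores <- c(scores, mean((test_set$mpg - pred)^2))
  }
}

# Overall score: the average over all k * repeats per-fold errors
mean(scores)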
Thus, in the repeated K-fold cross-validation method, the above steps are repeated on the given dataset a certain number of times. Each repetition produces a completely different split of the dataset into K folds, so the performance score of the model also differs. Finally, the mean performance score across all repetitions gives the final accuracy of the model. To carry out these tasks, the R language provides a rich library of built-in functions and packages. Below is the step-by-step approach to implementing the repeated K-fold cross-validation technique on classification and regression machine learning models.
Implement Repeated K-fold Cross-validation on Classification
When the target variable is of a categorical data type, classification machine learning models are used to predict the class labels. In this example, the Naive Bayes algorithm is used as a probabilistic classifier to predict the class label of the target variable.
Step 1: Loading the required packages and libraries
All the necessary libraries and packages must be imported to perform the task without errors. Below is the code to set up the R environment for the repeated K-fold algorithm.
R
library(tidyverse)
library(caret)
library(ISLR)
Step 2: Exploring the dataset
After importing the required libraries, it's time to load the dataset into the R environment. Exploring the dataset is also important, as it shows whether any changes are required before using it for training and testing. Below is the code to carry out this task.
R
dataset <- Smarket[complete.cases(Smarket), ]
glimpse(dataset)
table(dataset$Direction)
Output:
Rows: 1,250
Columns: 9
$ Year <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, ...
$ Lag1 <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1...
$ Lag2 <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0...
$ Lag3 <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -...
$ Lag4 <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, ...
$ Lag5 <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, ...
$ Volume <dbl> 1.19130, 1.29650, 1.41120, 1.27600, 1.20570, 1.34910, 1.44500, 1.40780, 1.16400, 1.23260, 1.30900, 1.25800, 1.09800, 1.05310, ...
$ Today <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1.747, 0...
$ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up, Down, Up, Down, Up, Down, Down, Down, Down, Up, Down, Down, Up...
Down Up
602 648
The above information shows that the independent variables of the dataset are of the <dbl> data type, meaning double-precision floating-point numbers. The target variable is "Direction", and it has the desired data type, namely factor (<fct>). The values in the dependent variable are Down and Up, and they occur in approximately equal proportion. If the target variable suffered from class imbalance, the following methods could be used to correct it (see the sketch after this list):
- Down Sampling
- Up Sampling
- Hybrid Sampling using SMOTE and ROSE
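For reference, caret ships the downSample() and upSample() helpers for the first two options; SMOTE and ROSE live in separate packages and are not shown here. The Smarket classes are roughly balanced, so the sketch below is purely illustrative:
R
# Separate the predictors from the target variable
predictors <- subset(dataset, select = -Direction)

# Down sampling: randomly drop majority-class rows until classes match
down_data <- downSample(x = predictors, y = dataset$Direction,
                        yname = "Direction")

# Up sampling: randomly duplicate minority-class rows until classes match
up_data <- upSample(x = predictors, y = dataset$Direction,
                    yname = "Direction")

table(down_data$Direction)  # both classes at the minority count
table(up_data$Direction)    # both classes at the majority count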
Step 3: Building the model with repeated K-fold algorithm
The trainControl() function is used to set the number of repetitions and the value of the K parameter. After that, the model is trained as per the steps of the repeated K-fold algorithm. Below is the implementation.
R
set.seed(123)
train_control <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3)
model <- train(Direction ~ ., data = dataset,
               trControl = train_control, method = "nb")
Step 4: Evaluating the accuracy of the model
In this final step, the performance score of the model is generated after it has been tested on all possible validation folds. Below is the code to print the accuracy and overall summary of the developed model.
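Printing the fitted caret model object is one straightforward way to produce the summary shown below:
R
# Printing a caret model reports the resampled accuracy across folds
print(model)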
Output:
Naive Bayes
1250 samples
8 predictor
2 classes: 'Down', 'Up'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 1124, 1125, 1126, 1125, 1125, 1126, ...
Resampling results across tuning parameters:
usekernel Accuracy Kappa
FALSE 0.9562616 0.9121273
TRUE 0.9696037 0.9390601
Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = TRUE and adjust = 1.
Implement Repeated K-fold Cross-validation on Regression
Regression machine learning models are preferred for datasets in which the target variable is continuous in nature, such as the temperature of an area or the cost of a commodity. The values of the target variable are either integers or floating-point numbers. Below are the steps required to implement the repeated K-fold algorithm as the cross-validation technique in regression models.
Step 1: Loading the dataset and required packages
As the first step, the R environment must be loaded with all essential packages and libraries to perform various operations. Below is the code to import all the required libraries.
R
library(tidyverse)
library(caret)
Step 2: Loading and inspecting the dataset
Once all the packages are imported, it's time to load the desired dataset. Here, the "trees" dataset is used for the regression model; it is a built-in dataset of the R language. Moreover, in order to build a correct model, it is necessary to know the structure of the dataset. All these tasks can be performed using the code below.
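Since trees ships with base R, no explicit loading is needed; inspecting its first rows with head() is one plausible way to produce the output below (str(trees) would additionally show the column types):
R
# trees is a built-in dataset; head() shows its first six rows
head(trees)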
Output:
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
Step 3: Building the model with the repeated K-fold algorithm
The trainControl() function is used to set the number of repetitions and the value of the K parameter. After that, the model is trained as per the steps of the repeated K-fold algorithm. Below is the implementation.
R
set.seed(125)
train_control <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3)
model <- train(Volume ~ ., data = trees,
               method = "lm",
               trControl = train_control)
Step 4: Evaluating the accuracy of the model
As per the repeated K-fold technique, the model is tested against every unique fold (or subset) of the dataset; in each case the prediction error is calculated, and the mean of all prediction errors is taken as the final performance score of the model. Below is the code to print the final score and overall summary of the model.
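As in the classification example, printing the fitted model object is one way to obtain this summary:
R
# Printing the model reports the resampled RMSE, R-squared and MAE
print(model)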
Output:
Linear Regression
31 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 28, 28, 28, 29, 28, 28, ...
Resampling results:
RMSE Rsquared MAE
4.021691 0.957571 3.362063
Tuning parameter 'intercept' was held constant at a value of TRUE
Advantages of Repeated K-fold cross-validation
- A very effective method to estimate the prediction error and the accuracy of a model.
- In each repetition, the data sample is shuffled, which results in different splits of the sample data.
Disadvantages of Repeated K-fold cross-validation
- A lower value of K leads to a biased model, and a higher value of K can lead to variability in the model's performance metrics. Thus, it is essential to use an appropriate value of K (generally K = 5 or K = 10 is desirable).
- With each repetition, the algorithm has to train the model from scratch, which means the total computation time grows in proportion to the number of repetitions.