The primary aim of any machine learning model is to predict the outcome of new, real-world data. To check whether the developed model is good enough to predict the outcome of unseen data points, evaluating the performance of the applied machine learning model becomes very necessary. The K-fold cross-validation technique is essentially a method of resampling the data set in order to evaluate a machine learning model. In this technique, the parameter K refers to the number of different subsets (folds) that the given data set is split into. In each round, K-1 subsets are used to train the model and the remaining subset is used as the validation set.
Steps involved in the K-fold Cross Validation in R:
- Split the data set into K subsets randomly
- For each one of the developed subsets of data points
- Treat that subset as the validation set
- Use all the rest subsets for training purpose
- Train the model on the training subsets and evaluate it on the validation set
- Calculate prediction error
- Repeat the above steps K times, i.e., until the model has been trained and tested on every subset
- Generate the overall prediction error by taking the average of the prediction errors across all K folds (a manual sketch of this procedure is shown below)
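Before turning to the packaged implementation, the whole procedure can be sketched directly in base R. The snippet below is a minimal illustration only, assuming a simple linear model on the built-in mtcars dataset; the caret package used in the rest of this article automates all of these steps.
R
# A minimal base-R sketch of the K-fold procedure (illustrative only)
set.seed(42)
k <- 5
# Randomly assign each row of mtcars to one of the K folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

fold_errors <- sapply(1:k, function(i) {
  train_data <- mtcars[folds != i, ]   # K-1 folds for training
  test_data  <- mtcars[folds == i, ]   # the left-out fold for validation
  fit  <- lm(mpg ~ wt + hp, data = train_data)
  pred <- predict(fit, newdata = test_data)
  sqrt(mean((test_data$mpg - pred)^2)) # prediction error (RMSE) on this fold
})

# Overall prediction error: the average across all K folds
mean(fold_errors)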
To implement all the steps involved in the K-fold method, the R language provides rich libraries and packages with built-in functions that make it very easy to carry out the complete task. The following is the step-by-step procedure to implement the K-fold technique as a cross-validation method on classification and regression machine learning models.
Implement the K-fold Technique on Classification
Classification machine learning models are preferred when the target variable consists of categorical values like spam/not spam, true/false, etc. Here, a Naive Bayes classifier will be used as a probabilistic classifier to predict the class label of the target variable.
Step 1: Loading the dataset and other required packages
The very first requirement is to set up the R environment by loading all required libraries as well as packages to carry out the complete process without any failure. Below is the implementation of this step.
R
# Data manipulation and visualisation helpers
library(tidyverse)
# Cross-validation and model training (trainControl, train)
library(caret)
# Provides the Smarket dataset used below
library(ISLR)
Step 2: Exploring the dataset
In order to perform manipulations on the data set, it is necessary to inspect it first. This gives a clear idea of its structure and of the data types of its columns. For this purpose, the data set is assigned to a variable. Below is the code to do the same.
R
# Keep only the complete rows of the Smarket dataset
dataset <- Smarket[complete.cases(Smarket), ]
# Inspect the structure and column types
glimpse(dataset)
# Check the class distribution of the target variable
table(dataset$Direction)
Output:
Rows: 1,250
Columns: 9
$ Year <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, …
$ Lag1 <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1…
$ Lag2 <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0…
$ Lag3 <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -…
$ Lag4 <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, …
$ Lag5 <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, …
$ Volume <dbl> 1.19130, 1.29650, 1.41120, 1.27600, 1.20570, 1.34910, 1.44500, 1.40780, 1.16400, 1.23260, 1.30900, 1.25800, 1.09800, 1.05310, …
$ Today <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1.747, 0…
$ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up, Down, Up, Down, Up, Down, Down, Down, Down, Up, Down, Down, Up…
> table(dataset$Direction)
Down Up
602 648
According to the above information, the dataset contains 1,250 rows and 9 columns. The data type of the independent variables is <dbl>, which stands for double, i.e., a double-precision floating-point number. The target variable is of type <fct>, meaning factor, which is desirable for a classification model. Moreover, the target variable has 2 outcomes, namely Down and Up, and the ratio of these two categories is almost 1:1, i.e., they are balanced. All the categories of the target variable should be in approximately equal proportion to build an unbiased model.
When the classes are not balanced, there are several techniques to re-balance them, such as the following (a brief example using caret's helpers is shown after the list):
- Down Sampling
- Up Sampling
- Hybrid Sampling using SMOTE and ROSE
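Since Down and Up are already roughly balanced here, no re-balancing is needed. Purely as an illustration, the sketch below shows how caret's downSample() and upSample() helpers could be applied if the classes were imbalanced; the balanced_down and balanced_up names are only illustrative.
R
# Down-sampling: randomly drop rows of the majority class
balanced_down <- downSample(x = dataset[, names(dataset) != "Direction"],
                            y = dataset$Direction)
# Up-sampling: randomly duplicate rows of the minority class
balanced_up <- upSample(x = dataset[, names(dataset) != "Direction"],
                        y = dataset$Direction)
# Both helpers return the predictors plus a re-balanced "Class" column
table(balanced_down$Class)
table(balanced_up$Class)
# caret can also re-balance inside each cross-validation fold via
# trainControl(sampling = "down"); hybrid methods such as SMOTE or ROSE
# require additional packages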
Step 3: Building the model with K-fold algorithm
In this step, the trainControl() function is used to set the value of the K parameter, and then the model is trained following the steps of the K-fold technique. Below is the implementation.
R
# Set the seed for reproducible fold assignment
set.seed(123)
# 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
# Train a Naive Bayes classifier (method "nb" requires the klaR package)
model <- train(Direction ~ ., data = dataset,
               trControl = train_control,
               method = "nb")
Step 4: Evaluating the accuracy of the model
After training and validating the model, it is time to calculate its overall accuracy.
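In caret, this summary is obtained simply by printing the fitted train object (the model variable created in Step 3). Below is the code to generate the summary of the model.
R
# Print the cross-validation summary of the trained model
print(model)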
Output:
Naive Bayes
1250 samples
8 predictor
2 classes: ‘Down’, ‘Up’
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1125, 1125, 1125, 1126, 1125, 1124, …
Resampling results across tuning parameters:
usekernel Accuracy Kappa
FALSE 0.9543996 0.9083514
TRUE 0.9711870 0.9422498
Tuning parameter ‘fL’ was held constant at a value of 0
Tuning parameter ‘adjust’ was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = TRUE and adjust = 1.
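Once tuned, the model can also be used to generate class predictions with predict(). The short sketch below predicts back on the same data purely for illustration, so the resulting accuracy will be optimistic compared with the cross-validated estimate above.
R
# Generate class predictions with the tuned Naive Bayes model
predictions <- predict(model, newdata = dataset)
# Compare predicted and actual classes
confusionMatrix(predictions, dataset$Direction)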
Implement the K-fold Technique on Regression
Regression machine learning models are used to predict a continuous target variable, such as the price of a commodity or the sales of a firm. Below are the complete steps for implementing the K-fold cross-validation technique on regression models.
Step 1: Importing all required packages
Set up the R environment by importing all necessary packages and libraries. Below is the implementation of this step.
R
# Data manipulation and visualisation helpers
library(tidyverse)
# Cross-validation and model training (trainControl, train)
library(caret)
# Provides the marketing dataset used below
install.packages("datarium")
Step 2: Loading and inspecting the dataset
In this step, the desired dataset is loaded in the R environment. After that, some rows of the data set are printed in order to understand its structure. Below is the code to carry out this task.
R
data ( "marketing" , package = "datarium" )
head (marketing)
Output:
youtube facebook newspaper sales
1 276.12 45.36 83.04 26.52
2 53.40 47.16 54.12 12.48
3 20.64 55.08 83.16 11.16
4 181.80 49.56 70.20 22.20
5 216.96 12.96 70.08 15.48
6 10.44 58.68 90.00 8.64
Step 3: Building the model with K-fold algorithm
The value of the K parameter is defined in the trainControl() function and the model is developed according to the steps mentioned in the algorithm of the K-fold cross-validation technique. Below is the implementation.
R
# Set the seed for reproducible fold assignment
set.seed(125)
# 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
# Train a linear regression model
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)
Step 4: Evaluate the model performance
As mentioned in the K-fold algorithm, the model is tested against every unique fold (or subset) of the dataset; in each case the prediction error is calculated, and finally the mean of all prediction errors is treated as the final performance score of the model.
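As in the classification example, this summary is obtained by printing the fitted train object. Below is the code to print the final score and overall summary of the model.
R
# Print the cross-validation summary of the trained regression model
print(model)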
Output:
Linear Regression
200 samples
3 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 181, 180, 180, 179, 180, 180, …
Resampling results:
RMSE Rsquared MAE
2.027409 0.9041909 1.539866
Tuning parameter ‘intercept’ was held constant at a value of TRUE
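As a quick check, the per-fold results are also stored in the fitted object, and averaging them reproduces the summary above; the small sketch below assumes the model object created in Step 3.
R
# RMSE, Rsquared and MAE for each of the 10 folds
model$resample
# Averaging the per-fold metrics gives the overall scores reported above
colMeans(model$resample[, c("RMSE", "Rsquared", "MAE")])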
Advantages of K-fold Cross-Validation
- Computationally cheaper than leave-one-out cross-validation, since the model is refit only K times.
- A very effective method to estimate the prediction error and the accuracy of a model.
Disadvantages of K-fold Cross-Validation
- A lower value of K leads to a more biased estimate of model performance, while a higher value of K can lead to high variability in the performance metrics and a higher computational cost. Thus, it is very important to use an appropriate value of K for the model (generally K = 5 or K = 10 is desirable). One common remedy for the variability, repeated K-fold cross-validation, is sketched below.
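One common way to reduce the variability that comes from a single random split into folds is repeated K-fold cross-validation, which caret supports through method = "repeatedcv". The sketch below assumes the dataset object from the classification example; the repeat_control and model_repeated names are only illustrative.
R
# Repeat 10-fold cross-validation 3 times with different random fold
# assignments and average the results (assumes the classification
# dataset from earlier)
set.seed(123)
repeat_control <- trainControl(method = "repeatedcv",
                               number = 10,
                               repeats = 3)
model_repeated <- train(Direction ~ ., data = dataset,
                        trControl = repeat_control,
                        method = "nb")
print(model_repeated)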