# Cross-Validation in R programming

The main challenge in designing a machine learning model is making it perform accurately on unseen data. To find out whether a model generalizes, we have to test it against data points that were not present during training. These data points play the role of unseen data and make it possible to evaluate the model's accuracy. Cross-validation is one of the most widely used techniques for checking the effectiveness of a machine learning model, and it can be implemented easily in the R programming language. In cross-validation, a portion of the data set is reserved and not used for training the model. Once the model is ready, the reserved data is used for testing. Values of the dependent variable are predicted during the testing phase, and the model accuracy is calculated from the prediction error, i.e., the difference between the actual and predicted values of the dependent variable. Several statistical metrics are used for evaluating the accuracy of regression models:

- **Root Mean Squared Error (RMSE):** As the name suggests, it is the square root of the average squared difference between the actual and predicted values of the target variable. It measures the typical size of the prediction error, with large errors weighted more heavily. A lower RMSE indicates a more accurate model.
- **Mean Absolute Error (MAE):** This metric gives the average absolute difference between the actual values and the values predicted by the model for the target variable. If large errors on outliers should not dominate the assessment, MAE can be used to evaluate the performance of the model. A lower MAE indicates a better model.
- **R-squared (R²):** The value of the R-squared metric indicates what percentage of the variance in the dependent variable is explained collectively by the independent variables. In other words, it reflects the strength of the relationship between the target variable and the model on a scale of 0 to 100%. A better model should have a high value of R-squared.
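To make these definitions concrete, the three metrics can be written as small base R functions. This is a minimal sketch: the helper names `rmse`, `mae`, and `r_squared` and the toy vectors are illustrative and not part of any package. The R-squared here uses the 1 - SSE/SST form; caret's `R2()` function, which appears in the code later in this article, by default reports the squared correlation between predictions and observations, so the two values can differ slightly.

```r
# Illustrative helper functions (names are assumptions, not a package API)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# R-squared as 1 - (sum of squared errors) / (total sum of squares)
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Toy values chosen so the arithmetic is easy to follow
actual    <- c(3, 5, 7, 9)
predicted <- c(2, 5, 8, 10)

rmse(actual, predicted)       # sqrt(mean(c(1, 0, 1, 1))) = sqrt(0.75)
mae(actual, predicted)        # mean(c(1, 0, 1, 1)) = 0.75
r_squared(actual, predicted)  # 1 - 3 / 20 = 0.85
```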

#### Types of Cross-Validation

When the complete dataset is partitioned into a training set and a validation set, some important data points may be left out of the training set. Because the model never sees those points, it cannot detect the patterns they carry. This can lead to overfitting or underfitting. To reduce this risk, several cross-validation techniques sample the training and validation sets randomly, often more than once, so that the performance estimate does not depend on a single split. Some of the most popular cross-validation techniques are:

- Validation Set Approach
- Leave-one-out cross-validation (LOOCV)
- K-fold cross-validation
- Repeated K-fold cross-validation

**Loading the Dataset**

To implement linear regression, we use the **marketing** dataset, which ships with the datarium package for R. Below is the code to load this dataset into your R environment.

```r
# loading required packages

# package to perform data manipulation
# and visualization
library(tidyverse)

# package to compute
# cross-validation methods
library(caret)

# installing the package that provides
# the desired dataset (only if it is missing)
if (!requireNamespace("datarium", quietly = TRUE)) {
  install.packages("datarium")
}

# loading the dataset
data("marketing", package = "datarium")

# inspecting the dataset
head(marketing)
```

**Output:**

```
  youtube facebook newspaper sales
1  276.12    45.36     83.04 26.52
2   53.40    47.16     54.12 12.48
3   20.64    55.08     83.16 11.16
4  181.80    49.56     70.20 22.20
5  216.96    12.96     70.08 15.48
6   10.44    58.68     90.00  8.64
```

#### Validation Set Approach (or Data Split)

In this method, the dataset is divided randomly into a training set and a testing set. The following steps are performed to implement this technique:

- Randomly sample the dataset to form the training set
- Train the model on the training set
- Apply the resulting model to the testing set
- Calculate the prediction error using model performance metrics

Below is the implementation of this method:

```r
# R program to implement
# validation set approach

# setting seed to generate a
# reproducible random sampling
set.seed(123)

# creating training data as 80% of the dataset
random_sample <- createDataPartition(marketing$sales,
                                     p = 0.8, list = FALSE)

# generating training dataset
# from the random_sample
training_dataset <- marketing[random_sample, ]

# generating testing dataset
# from rows which are not
# included in random_sample
testing_dataset <- marketing[-random_sample, ]

# Building the model

# training the model by assigning sales column
# as target variable and rest other columns
# as independent variables
model <- lm(sales ~ ., data = training_dataset)

# predicting the target variable
predictions <- predict(model, testing_dataset)

# computing model performance metrics
data.frame(R2 = R2(predictions, testing_dataset$sales),
           RMSE = RMSE(predictions, testing_dataset$sales),
           MAE = MAE(predictions, testing_dataset$sales))
```

**Output:**

```
         R2     RMSE      MAE
1 0.9049049 1.965508 1.433609
```

**Advantages:**

- One of the most basic and simple techniques for evaluating a model.
- No complex steps for implementation.

**Disadvantages:**

- The performance estimate depends heavily on which observations happen to fall in the training and validation sets.
- Because only one subset of the data is used for training, the model can be biased.
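The dependence on a single split can be seen by repeating the validation set approach with different seeds. The sketch below is an illustration under stated assumptions: it uses base R only and the built-in `mtcars` data (predicting `mpg` from `wt` and `hp`) instead of `marketing`, so that it runs without extra packages, and `holdout_rmse` is a hypothetical helper name.

```r
# Hypothetical helper: one 80/20 split for a given seed, returning the test RMSE
holdout_rmse <- function(seed, data = mtcars) {
  set.seed(seed)
  # rows used for training (80% of the data, rounded down)
  train_rows <- sample(nrow(data), size = floor(0.8 * nrow(data)))
  fit  <- lm(mpg ~ wt + hp, data = data[train_rows, ])
  test <- data[-train_rows, ]
  sqrt(mean((test$mpg - predict(fit, test))^2))
}

# Same model, same data, five different random splits
rmse_by_seed <- sapply(1:5, holdout_rmse)

round(rmse_by_seed, 2)  # one estimate per split
range(rmse_by_seed)     # spread caused only by the choice of split
```

The model and the data are identical in every run, so any difference between the five values comes purely from which rows landed in the testing set.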

#### Leave One Out Cross-Validation (LOOCV)

This method also splits the dataset into two parts, but it overcomes the drawbacks of the validation set approach. LOOCV carries out the cross-validation in the following way:

- Train the model on N-1 data points
- Test the model against the one data point that was left out in the previous step
- Calculate the prediction error
- Repeat the above 3 steps until every data point has been used once for testing
- Generate the overall prediction error by taking the average of the prediction errors from every case
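The steps above can be sketched by hand in base R, which makes it easier to see what a LOOCV routine has to do. This is a minimal sketch under stated assumptions: it uses the built-in `mtcars` data (predicting `mpg` from `wt` and `hp`) rather than `marketing` so that no extra packages are needed, and the variable names are illustrative.

```r
cars_df <- mtcars
n <- nrow(cars_df)

# one prediction error per left-out observation
loo_error <- numeric(n)

for (i in seq_len(n)) {
  # train on all rows except row i
  fit <- lm(mpg ~ wt + hp, data = cars_df[-i, ])
  # test on the single row that was left out
  pred <- predict(fit, cars_df[i, , drop = FALSE])
  loo_error[i] <- cars_df$mpg[i] - pred
}

# overall prediction error, aggregated over all N fits
loocv_rmse <- sqrt(mean(loo_error^2))
loocv_mae  <- mean(abs(loo_error))

c(RMSE = loocv_rmse, MAE = loocv_mae)
```

Because every observation is left out exactly once, rerunning this code gives the same result without setting a seed.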

Below is the implementation of this method using the caret package:

```r
# R program to implement
# Leave one out cross validation

# defining training control
# as Leave One Out Cross Validation
train_control <- trainControl(method = "LOOCV")

# training the model by assigning sales column
# as target variable and rest other column
# as independent variable
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)

# printing model performance metrics
# along with other details
print(model)
```

**Output:**

```
Linear Regression

200 samples
  3 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 199, 199, 199, 199, 199, 199, ...
Resampling results:

  RMSE      Rsquared   MAE
  2.059984  0.8912074  1.539441

Tuning parameter 'intercept' was held constant at a value of TRUE
```

**Advantages:**

- Less biased model, since almost every data point is used for training.
- No randomness in the value of the performance metrics, because every data point is left out exactly once and the splits are therefore fixed.

**Disadvantages:**

- Training the model N times leads to expensive computation time if the dataset is large.

#### K-fold Cross-Validation

This cross-validation technique divides the data into K subsets (folds) of almost equal size. Out of these K folds, one subset is used as the validation set, and the remaining folds are used to train the model. The complete working procedure of this method is as follows:

- Split the dataset into K subsets randomly
- Use K-1 subsets for training the model
- Test the model against the one subset that was left out in the previous step
- Repeat the above steps K times, i.e., until every subset has been used once for testing
- Generate the overall prediction error by taking the average of the prediction errors from every case
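The procedure above can be sketched in base R as follows. This is an illustration, not a reference implementation: it assumes the built-in `mtcars` data (predicting `mpg` from `wt` and `hp`) instead of `marketing`, uses K = 5 because `mtcars` has only 32 rows, and all variable names are illustrative.

```r
set.seed(125)

k <- 5
n <- nrow(mtcars)

# randomly assign each row to one of k folds of nearly equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# train on k - 1 folds, test on the held-out fold, once per fold
fold_rmse <- sapply(seq_len(k), function(fold) {
  test_rows <- which(fold_id == fold)
  fit  <- lm(mpg ~ wt + hp, data = mtcars[-test_rows, ])
  pred <- predict(fit, mtcars[test_rows, ])
  sqrt(mean((mtcars$mpg[test_rows] - pred)^2))
})

table(fold_id)    # fold sizes differ by at most one row
fold_rmse         # one RMSE per held-out fold
mean(fold_rmse)   # overall cross-validated RMSE
```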

Below is the implementation of this method using the caret package:

```r
# R program to implement
# K-fold cross-validation

# setting seed to generate a
# reproducible random sampling
set.seed(125)

# defining training control
# as cross-validation and
# value of K equal to 10
train_control <- trainControl(method = "cv",
                              number = 10)

# training the model by assigning sales column
# as target variable and rest other column
# as independent variable
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)

# printing model performance metrics
# along with other details
print(model)
```

**Output:**

```
Linear Regression

200 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 181, 180, 180, 179, 180, 180, ...
Resampling results:

  RMSE      Rsquared   MAE
  2.027409  0.9041909  1.539866

Tuning parameter 'intercept' was held constant at a value of TRUE
```

**Advantages:**

- Fast computation, since the model is trained only K times.
- A very effective method to estimate the prediction error and the accuracy of a model.

**Disadvantages:**

- A lower value of K leads to a more biased estimate, while a higher value of K can increase the variability of the performance metrics. It is therefore important to choose a suitable value of K for the model (generally K = 5 or K = 10 is used).

#### Repeated K-fold Cross-Validation

As the name suggests, in this method the K-fold cross-validation algorithm is repeated a certain number of times. Below is the implementation of this method:

```r
# R program to implement
# repeated K-fold cross-validation

# setting seed to generate a
# reproducible random sampling
set.seed(125)

# defining training control as
# repeated cross-validation and
# value of K is 10 and repetition is 3 times
train_control <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3)

# training the model by assigning sales column
# as target variable and rest other column
# as independent variable
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)

# printing model performance metrics
# along with other details
print(model)
```

**Output:**

```
Linear Regression

200 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 181, 180, 180, 179, 180, 180, ...
Resampling results:

  RMSE      Rsquared   MAE
  2.020061  0.9038559  1.541517

Tuning parameter 'intercept' was held constant at a value of TRUE
```

**Advantages:**

- In each repetition, the data sample is shuffled which results in developing different splits of the sample data.

**Disadvantages:**

- With each repetition, the algorithm has to train the model from scratch, so the computation time needed to evaluate the model grows in proportion to the number of repetitions.
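The reshuffling and the extra cost can both be seen in a short base R sketch. It rests on the same assumptions as a hand-written K-fold loop: the built-in `mtcars` data (predicting `mpg` from `wt` and `hp`) stands in for `marketing`, K = 5 is used because the data set is small, and `kfold_rmse` is a hypothetical helper name.

```r
# Hypothetical helper: one full pass of K-fold cross-validation,
# returning the RMSE averaged over the k held-out folds
kfold_rmse <- function(data, k) {
  # a fresh random fold assignment on every call
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  mean(sapply(seq_len(k), function(fold) {
    test_rows <- which(fold_id == fold)
    fit  <- lm(mpg ~ wt + hp, data = data[-test_rows, ])
    pred <- predict(fit, data[test_rows, ])
    sqrt(mean((data$mpg[test_rows] - pred)^2))
  }))
}

set.seed(125)

# repeat the whole K-fold procedure 3 times with different shuffles
repeat_rmse <- replicate(3, kfold_rmse(mtcars, k = 5))

repeat_rmse        # one averaged RMSE per repetition
mean(repeat_rmse)  # final estimate, based on 3 x 5 = 15 model fits
```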

**Note:** **Repeated K-fold cross-validation** is often the preferred cross-validation technique for both regression and classification machine learning models.