# The Validation Set Approach in R Programming

**The validation set approach** is a cross-validation technique in machine learning. Cross-validation techniques are used to judge the performance and accuracy of a machine learning model. In the validation set approach, the dataset used to build the model is divided randomly into two parts: a training set and a validation set (or testing set). The model is trained on the training set, and its accuracy is calculated by predicting the target variable for the data points that were not present during training, that is, the validation set. Splitting the data, training the model, and testing the model can be tedious to do by hand, but R provides numerous libraries and built-in functions that carry out these tasks easily and efficiently.

### Steps Involved in the Validation Set Approach

- Randomly split the dataset in a fixed ratio (a 70-30 or 80-20 split is common).
- Train the model on the training set.
- Apply the resulting model to the validation set.
- Calculate the model's accuracy from the prediction error, using model performance metrics.
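The four steps above can be sketched in base R before bringing in any packages. This is a minimal illustration only: the built-in `mtcars` dataset, the predictors `wt` and `hp`, and the 70-30 ratio are assumptions chosen for the example, not part of the walkthrough that follows.

```r
# Minimal base-R sketch of the four steps, using the built-in mtcars data
set.seed(42)  # make the random split reproducible

# Step 1: randomly split the row indices 70-30
n         <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train_set <- mtcars[train_idx, ]
valid_set <- mtcars[-train_idx, ]

# Step 2: train the model on the training set only
fit <- lm(mpg ~ wt + hp, data = train_set)

# Step 3: apply the fitted model to the validation set
pred <- predict(fit, newdata = valid_set)

# Step 4: measure the prediction error on the validation set
rmse <- sqrt(mean((valid_set$mpg - pred)^2))
print(rmse)
```

The key point is that `fit` never sees the rows in `valid_set`, so the error in Step 4 is measured on data the model was not trained on.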

This article walks through a **step-by-step implementation of the validation set approach** as a cross-validation technique for both **classification** and **regression** machine learning models.

### For Classification Machine Learning Models

This type of machine learning model is used when the target variable is categorical, for example positive/negative or diabetic/non-diabetic. The model predicts the class label of the dependent variable. Here, the logistic regression algorithm is used to build the classification model.

#### Step 1: Loading the dataset and other required packages

Before doing any exploration or manipulation, load all the required libraries and the dataset so that the built-in functions needed for the rest of the process are available.

```r
# Load the required packages

# Data manipulation and visualization
library(tidyverse)

# Cross-validation methods and model evaluation
library(caret)

# sample.split() for dividing the data into
# train and test subsets
library(caTools)

# Provides the Smarket dataset
library(ISLR)
```

#### Step 2: Exploring the dataset

Understanding the structure and dimensions of the dataset helps in building a correct model. Also, because this is a classification model, one must know which categories are present in the target variable.

```r
# Assign the Smarket dataset to a variable,
# keeping only the complete rows
dataset <- Smarket[complete.cases(Smarket), ]

# Display the column names, their data types,
# and the first values of each column
glimpse(dataset)

# Count the values present in the Direction column
table(dataset$Direction)
```

**Output:**

```
Rows: 1,250
Columns: 9
$ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, ...
$ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1...
$ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0...
$ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -...
$ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, ...
$ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, ...
$ Volume    <dbl> 1.19130, 1.29650, 1.41120, 1.27600, 1.20570, 1.34910, 1.44500, 1.40780, 1.16400, 1.23260, 1.30900, 1.25800, 1.09800, 1.05310, ...
$ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1.747, 0...
$ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up, Down, Up, Down, Up, Down, Down, Down, Down, Up, Down, Down, Up...

> table(dataset$Direction)

Down   Up
 602  648
```

According to the above information, the imported dataset has 1,250 rows and 9 columns. The column type **<dbl>** denotes a double-precision floating-point number (dbl is short for double). In classification models, the target variable must be of the **factor** data type. Since the Direction column is already of type **<fct>**, there is no need to change anything.

Moreover, the **response variable** (or target variable) is a binary categorical variable, since the only values in the column are Down and Up. The two class labels occur in a proportion of approximately 1:1, which means they are balanced. If the classes were imbalanced, for example in a 1:2 proportion, the data should first be rebalanced so that both categories appear in approximately equal proportion. There are several techniques for this purpose, such as:

- Down Sampling
- Up Sampling
- Hybrid Sampling using SMOTE and ROSE
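As a rough illustration of the first two techniques, the sketch below rebalances a small imbalanced data frame using base R only. The data frame `toy`, its columns, and the 20:10 class counts are hypothetical and exist only for this example; caret also offers `downSample()` and `upSample()` helpers for the same purpose, which are not used here.

```r
set.seed(1)

# Hypothetical imbalanced data: 20 "No" rows and 10 "Yes" rows
toy <- data.frame(
  x     = rnorm(30),
  label = factor(rep(c("No", "Yes"), times = c(20, 10)))
)

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(toy)), toy$label)
n_min <- min(lengths(idx_by_class))
n_max <- max(lengths(idx_by_class))

# Down-sampling: keep n_min randomly chosen rows from every class
down_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), n_min)]
}))
down <- toy[down_idx, ]

# Up-sampling: keep every row, then add rows drawn with replacement
# until each class has n_max rows
up_idx <- unlist(lapply(idx_by_class, function(i) {
  extra <- i[sample.int(length(i), n_max - length(i), replace = TRUE)]
  c(i, extra)
}))
up <- toy[up_idx, ]

print(table(down$label))
print(table(up$label))
```

Down-sampling discards information from the majority class, while up-sampling duplicates minority rows; which trade-off is acceptable depends on how much data is available.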

#### Step 3: Building the model and generating the validation set

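This step involves randomly splitting the dataset into training and validation sets and then training the model. Note that the formula `Direction ~ .` uses every other column as a predictor, including `Today`. In the Smarket data, Direction records whether Today's return is positive or negative, so this column effectively reveals the answer, and the accuracy reported in Step 5 should be read with that in mind. Below is the implementation.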

```r
# Set a seed so the random sampling is reproducible
set.seed(100)

# Divide the complete dataset into two parts
# in a 70% / 30% ratio
spl <- sample.split(dataset$Direction, SplitRatio = 0.7)

# Select the rows that belong to the 70% part
train <- subset(dataset, spl == TRUE)

# Select the rows that belong to the 30% part
test <- subset(dataset, spl == FALSE)

# Check the number of rows and columns
# in the training and testing datasets
print(dim(train))
print(dim(test))

# Build the model: train it with the Direction column
# as the target variable and all other columns
# as independent variables
model_glm <- glm(Direction ~ ., family = "binomial",
                 data = train, maxit = 100)
```

**Output:**

```
> print(dim(train))
[1] 875   9
> print(dim(test))
[1] 375   9
```
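The split above is stratified on the Direction labels, so both subsets keep roughly the same class proportions. The idea can be sketched in base R as follows. This is an illustration of stratified sampling in general, not the caTools implementation; the vector `y` is a hypothetical stand-in that only reproduces the class counts of Direction (602 Down, 648 Up).

```r
set.seed(100)

# Hypothetical labels with the same class counts as Direction
y <- factor(rep(c("Down", "Up"), times = c(602, 648)))

# Select 70% of the rows within each class separately
in_train <- logical(length(y))
for (cls in levels(y)) {
  cls_idx <- which(y == cls)
  n_take  <- round(0.7 * length(cls_idx))
  in_train[cls_idx[sample.int(length(cls_idx), n_take)]] <- TRUE
}

# Class counts in the training and validation parts
print(table(y[in_train]))
print(table(y[!in_train]))
```

Sampling within each class keeps the Down/Up ratio nearly identical in both parts, which a plain random split only achieves on average.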

#### Step 4: Predicting the target variable

Now that the model is trained, it is time to make predictions on the unseen data. The target variable has only two possible values, so **type = "response"** is passed to the **predict()** function. With this setting the model returns, for each data point, a predicted probability between 0 and 1 instead of a value on the log-odds scale.

The probabilities are then converted into a factor of class labels: if the probability score of a data point is at or above a chosen threshold, the point is assigned to one class, and if it is below the threshold, it is assigned to the other. Here, the **probability cutoff is set to 0.5**. Below is the code to implement these steps.

```r
# Predict probabilities on the validation set
predictTest <- predict(model_glm, newdata = test,
                       type = "response")

# Apply a probability cutoff of 0.5
predicted_classes <- as.factor(ifelse(predictTest >= 0.5,
                                      "Up", "Down"))
```
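For a factor response, `glm()` with the binomial family treats the first factor level as failure and models the probability of the other level, which is why probabilities at or above the cutoff map to "Up" here. The small sketch below shows the thresholding step on its own. The `probs` vector is hypothetical, and setting the factor levels explicitly is a suggested precaution rather than something the code above does.

```r
# Hypothetical predicted probabilities of the second level ("Up")
probs <- c(0.91, 0.48, 0.50, 0.12, 0.73)

cutoff <- 0.5

# Setting the levels explicitly keeps both classes in the factor
# even if one of them is never predicted
predicted <- factor(ifelse(probs >= cutoff, "Up", "Down"),
                    levels = c("Down", "Up"))

print(predicted)
print(table(predicted))
```

Raising or lowering `cutoff` trades one kind of misclassification for the other, so 0.5 is a convention rather than a requirement.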

#### Step 5: Evaluating the accuracy of the model

A common way to judge the accuracy of a classification model is the confusion matrix. This matrix counts how many data points are predicted correctly and incorrectly for each class, using the actual values of the target variable in the testing dataset as the reference. Along with the confusion matrix, other statistics of the model such as accuracy and kappa can be calculated using the code below.

```r
# Generate the confusion matrix and other statistics
# from the predictions made by the model
print(confusionMatrix(predicted_classes, test$Direction))
```

**Output:**

```
Confusion Matrix and Statistics

          Reference
Prediction Down  Up
      Down  177   5
      Up      4 189

               Accuracy : 0.976
                 95% CI : (0.9549, 0.989)
    No Information Rate : 0.5173
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.952

 Mcnemar's Test P-Value : 1

            Sensitivity : 0.9779
            Specificity : 0.9742
         Pos Pred Value : 0.9725
         Neg Pred Value : 0.9793
             Prevalence : 0.4827
         Detection Rate : 0.4720
   Detection Prevalence : 0.4853
      Balanced Accuracy : 0.9761

       'Positive' Class : Down
```
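To see where the headline numbers come from, the sketch below recomputes accuracy, sensitivity, specificity, and kappa by hand from the four counts in the confusion matrix above. It uses base R only; the variable names are chosen for this illustration, and "Down" is treated as the positive class, as in the output.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(c(177, 4, 5, 189), nrow = 2,
             dimnames = list(Prediction = c("Down", "Up"),
                             Reference  = c("Down", "Up")))

total <- sum(cm)

# Share of all predictions that are correct
accuracy <- sum(diag(cm)) / total

# Share of actual "Down" rows predicted as "Down"
sensitivity <- cm["Down", "Down"] / sum(cm[, "Down"])

# Share of actual "Up" rows predicted as "Up"
specificity <- cm["Up", "Up"] / sum(cm[, "Up"])

# Agreement expected by chance, then Cohen's kappa
expected    <- sum(rowSums(cm) * colSums(cm)) / total^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = cohen_kappa), 4))
```

Kappa corrects accuracy for the agreement that would occur by chance given the row and column totals, which makes it more informative than raw accuracy when classes are imbalanced.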

### For Regression Machine Learning Models

Regression models are used to predict a continuous quantity such as the price of a house or the sales of a product. In a regression problem, the target variable is generally a real number, stored as an integer or floating-point value. The accuracy of this kind of model is measured by averaging the errors made when predicting the output for the data points in the validation set. Below are the steps to implement the validation set approach for linear regression models.

#### Step 1: Loading the dataset and required packages

R ships with a variety of built-in datasets. Here we use the **trees** dataset to fit a linear regression model. Below is the code to load the required packages and dataset.

```r
# Load the required packages

# Data manipulation and visualization
library(tidyverse)

# Cross-validation methods and model evaluation
library(caret)

# Access the data from R's datasets package
data(trees)

# Look at the first several rows of the data
head(trees)
```

**Output:**

```
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
```

This dataset has 3 columns, of which **Volume** is the target variable. Since the variable is continuous, a linear regression algorithm can be used to predict the outcome.

#### Step 2: Building the model and generating the validation set

In this step, the dataset is split randomly in an 80-20 ratio. 80% of the data points are used to train the model, while the remaining 20% act as the validation set that is used to measure the accuracy of the model. Below is the code for the same.

```r
# Set a seed so the random sampling is reproducible
set.seed(123)

# Select the row indices for the training data (80% of the dataset)
random_sample <- createDataPartition(trees$Volume,
                                     p = 0.8, list = FALSE)

# Generate the training dataset from random_sample
training_dataset <- trees[random_sample, ]

# Generate the testing dataset from the rows
# that are not included in random_sample
testing_dataset <- trees[-random_sample, ]

# Build the model: train it with the Volume column
# as the target variable and all other columns
# as independent variables
model <- lm(Volume ~ ., data = training_dataset)
```

#### Step 3: Predicting the target variable

After building and training the model, it is used to predict the target variable for the data points that belong to the validation set.

```r
# Predict the target variable on the validation set
predictions <- predict(model, testing_dataset)
```

#### Step 4: Evaluating the accuracy of the model

Statistical metrics used for evaluating the performance of a linear regression model include **Root Mean Squared Error (RMSE)**, **Mean Absolute Error (MAE)**, and **R²**. RMSE and MAE measure the size of the prediction error, so lower values indicate a better model. R² measures how well the predictions track the observed values, so a higher value indicates a better model. Below is the code to calculate the prediction error of the model.

```r
# Compute the model performance metrics
data.frame(R2 = R2(predictions, testing_dataset$Volume),
           RMSE = RMSE(predictions, testing_dataset$Volume),
           MAE = MAE(predictions, testing_dataset$Volume))
```

**Output:**

```
         R2     RMSE     MAE
1 0.9564487 5.274129 4.73567
```
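The three metrics can also be written out directly, which makes their definitions explicit. The sketch below uses base R on a small pair of hypothetical vectors (`observed` and `predicted` are made up for illustration and are not taken from the trees model). R² is computed here as the squared correlation between predictions and observations, which is understood to be the default formula used by caret's `R2()`; treat that as an assumption to check against the caret documentation.

```r
# Hypothetical observed and predicted values, for illustration only
observed  <- c(10.3, 16.4, 19.7, 24.2, 31.7)
predicted <- c(11.0, 15.1, 21.2, 23.0, 33.5)

errors <- observed - predicted

# Root Mean Squared Error: penalizes large errors more heavily
rmse <- sqrt(mean(errors^2))

# Mean Absolute Error: average size of the errors
mae <- mean(abs(errors))

# R squared as the squared correlation of predictions and observations
r2 <- cor(predicted, observed)^2

print(data.frame(R2 = r2, RMSE = rmse, MAE = mae))
```

RMSE is never smaller than MAE on the same errors, and the gap between the two grows when a few predictions are far off.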

### Advantages of the Validation Set approach

- One of the most basic and simple techniques for evaluating a model.
- No complex steps for implementation.

### Disadvantages of the Validation Set approach

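- The estimated accuracy depends heavily on which observations happen to fall into the training and validation sets.
- Because the model is trained on only a subset of the data, the results can be biased: a model fitted on fewer observations tends to perform worse, so the validation error may overestimate the error of a model trained on the full dataset.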
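The first disadvantage is easy to observe by repeating the split with different seeds. The sketch below does this in base R on the built-in `trees` data; the helper `split_rmse()` and the choice of ten seeds are hypothetical and serve only this illustration.

```r
data(trees)  # built-in dataset

# Validation RMSE for one random 80-20 split, given a seed
split_rmse <- function(seed) {
  set.seed(seed)
  n         <- nrow(trees)
  train_idx <- sample(seq_len(n), size = floor(0.8 * n))
  fit       <- lm(Volume ~ ., data = trees[train_idx, ])
  held_out  <- trees[-train_idx, ]
  pred      <- predict(fit, newdata = held_out)
  sqrt(mean((held_out$Volume - pred)^2))
}

# Repeat the whole procedure for ten different random splits
rmse_by_seed <- sapply(1:10, split_rmse)

print(round(rmse_by_seed, 2))
print(range(rmse_by_seed))
```

Each seed produces a different validation set and therefore a different error estimate. With only 31 rows in `trees`, a single split gives a noisy picture of model performance.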
