The Validation Set Approach in R Programming

The validation set approach is one of the simplest cross-validation techniques in machine learning. Cross-validation techniques are used to estimate how well a machine learning model will perform on data it has not seen. In the validation set approach, the dataset used to build the model is divided randomly into two parts: a training set and a validation set (also called a testing set). The model is trained on the training set, and its accuracy is measured by predicting the target variable for the data points in the validation set, which were not used during training. Splitting the data, training the model, and testing the model by hand can be tedious, but R provides libraries and built-in functions that carry out these tasks with little code.

Steps Involved in the Validation Set Approach

  1. Random splitting of the dataset in a chosen ratio (a 70-30 or 80-20 split is generally preferred)
  2. Training of the model on the training set
  3. Applying the resulting model to the validation set
  4. Calculating the model's accuracy from its prediction error, using model performance metrics
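
The four steps above can be sketched in a few lines of base R. This is only a minimal illustration: the built-in mtcars data, the 70-30 ratio, and the predictors wt and hp are assumptions chosen for the example and are not the datasets used later in this article.

```r
# Illustrative sketch on the built-in mtcars data (an assumption,
# not one of the datasets used in this article)
set.seed(42)

# 1. random 70-30 split of the row indices
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train_set <- mtcars[train_idx, ]
valid_set <- mtcars[-train_idx, ]

# 2. train the model on the training set only
fit <- lm(mpg ~ wt + hp, data = train_set)

# 3. apply the fitted model to the validation set
pred <- predict(fit, newdata = valid_set)

# 4. estimate the prediction error on the held-out rows
valid_rmse <- sqrt(mean((valid_set$mpg - pred)^2))
print(valid_rmse)
```

The rows in valid_set never reach lm(), so valid_rmse is an estimate of the error on unseen data.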

This article walks step by step through implementing the validation set approach as a cross-validation technique for both classification and regression machine learning models.

For Classification Machine Learning Models

This type of machine learning model is used when the target variable is categorical, such as positive/negative or diabetic/non-diabetic. The model predicts the class label of the dependent variable. Here, the logistic regression algorithm is applied to build the classification model.

Step 1: Loading the dataset and other required packages

Before any exploration or manipulation of the data, load the required packages, which provide the functions used throughout the process, along with the dataset itself.

# loading required packages
  
# package to perform data manipulation 
# and visualization
library(tidyverse)
  
# package to compute 
# cross - validation methods
library(caret)
  
# package Used to split the data 
# used during classification into 
# train and test subsets
library(caTools)
  
# loading package to 
# import desired dataset
library(ISLR)

Step 2: Exploring the dataset

Understanding the structure and dimensions of the dataset helps in building a correct model. Also, as this is a classification model, one must know the different categories present in the target variable.

# assigning the complete dataset
# Smarket to a variable
dataset <- Smarket[complete.cases(Smarket), ] 
  
# display the dataset with details 
# like column name and its data type
# along with values in each row
glimpse(dataset)
  
# checking values present
# in the Direction column 
# of the dataset
table(dataset$Direction)

Output:

Rows: 1,250
Columns: 9
$ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, ...
$ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1...
$ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0...
$ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -...
$ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, ...
$ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, ...
$ Volume    <dbl> 1.19130, 1.29650, 1.41120, 1.27600, 1.20570, 1.34910, 1.44500, 1.40780, 1.16400, 1.23260, 1.30900, 1.25800, 1.09800, 1.05310, ...
$ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1.747, 0...
$ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up, Down, Up, Down, Up, Down, Down, Down, Down, Up, Down, Down, Up...

> table(dataset$Direction)
Down   Up 
 602  648

According to the above output, the imported dataset has 1,250 rows and 9 columns. The <dbl> type denotes a double-precision floating-point number (dbl is short for double). In classification models the target variable must be a factor. Since the Direction column is already of type <fct>, nothing needs to be changed.

Moreover, the target variable is a binary categorical variable (the only values in the column are Down and Up), and the two class labels occur in a proportion of approximately 1:1, so the classes are balanced. If the classes were imbalanced, for example in a proportion of 1:2, the training data would usually be rebalanced so that both categories appear in approximately equal proportion. There are several techniques for this purpose:

  • Down Sampling
  • Up Sampling
  • Hybrid Sampling using SMOTE and ROSE
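
As a rough illustration of the first two techniques, the sketch below rebalances a toy data frame with base R only. The data frame toy and its 1:2 class ratio are hypothetical. The caret package also provides downSample() and upSample() for the same purpose, which are not used here so that the example runs without extra packages.

```r
set.seed(1)

# hypothetical toy data with a 1:2 class ratio
toy <- data.frame(x = rnorm(90),
                  class = factor(rep(c("neg", "pos"), times = c(30, 60))))

counts <- table(toy$class)
rows_by_class <- split(seq_len(nrow(toy)), toy$class)

# down-sampling: keep as many rows of every class as the smallest class has
minority_n <- min(counts)
down_rows <- unlist(lapply(rows_by_class,
                           function(idx) sample(idx, minority_n)))
down <- toy[down_rows, ]

# up-sampling: resample every class with replacement up to the largest class
majority_n <- max(counts)
up_rows <- unlist(lapply(rows_by_class,
                         function(idx) sample(idx, majority_n, replace = TRUE)))
up <- toy[up_rows, ]

table(down$class)
table(up$class)
```

Rebalancing is normally applied to the training set only, so that the validation set keeps the class proportions of the original data.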

Step 3: Building the model and generating the validation set

This step involves the random splitting of the dataset, developing training and validation set, and training of the model. Below is the implementation.

# setting seed to generate a  
# reproducible random sampling
set.seed(100)
  
# dividing the complete dataset
# into 2 parts having ratio of
# 70% and 30%
spl = sample.split(dataset$Direction, SplitRatio = 0.7)
  
# selecting that part of dataset
# which belongs to the 70% of the
# dataset divided in previous step
train = subset(dataset, spl == TRUE)
  
# selecting that part of dataset
# which belongs to the 30% of the
# dataset divided in previous step
test = subset(dataset, spl == FALSE)
  
# checking number of rows and column
# in training and testing dataset
print(dim(train))
print(dim(test))
  
# Building the model 
  
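# training the model by assigning Direction column 
# as target variable and rest other columns 
# as independent variables
# note: "." also includes Today, whose sign determines
# Direction, so the accuracy reported below is very high
model_glm = glm(Direction ~ . , family = "binomial",
                data = train, maxit = 100)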

Output:



> print(dim(train))
[1] 875   9
> print(dim(test))
[1] 375   9
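
The split above keeps roughly the same share of Down and Up rows in both parts. The base R sketch below shows one way such a stratified split can be done by hand. It is an illustration of the idea, not the internal algorithm of sample.split(), and the labels vector is rebuilt from the class counts shown earlier instead of being read from the Smarket data.

```r
set.seed(100)

# stand-in labels with the same class counts as table(dataset$Direction)
labels <- factor(rep(c("Down", "Up"), times = c(602, 648)))

# draw 70% of the row indices within each class separately
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class,
                           function(idx) sample(idx, round(0.7 * length(idx)))))
in_train <- seq_along(labels) %in% train_idx

# class counts in the training part and in the validation part
table(labels[in_train])
table(labels[!in_train])
```

Sampling inside each class keeps the class proportions of the two parts close to those of the full dataset, which a plain sample() over all rows does not guarantee.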

Step 4: Predicting the target variable

Now that the model is trained, it is time to make predictions on the unseen data. The target variable has only two possible values, so predict() is called with type = "response", which makes the model return a probability between 0 and 1 for each data point. For a binomial glm() with a factor target, this is the probability of the second factor level, here Up.

These probabilities are then converted into class labels: a data point whose probability is at or above a chosen threshold is labelled Up, and otherwise it is labelled Down. Here, the probability cutoff is set to 0.5. Below is the code to implement these steps.

# predictions on the validation set
predictTest = predict(model_glm, newdata = test, 
                      type = "response")
  
# assigning the probability cutoff as 0.5 and keeping the
# same factor levels as the actual Direction column
predicted_classes <- factor(ifelse(predictTest >= 0.5, 
                                   "Up", "Down"),
                            levels = levels(test$Direction))
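
To see what the cutoff does, the following sketch fits a logistic regression on simulated data and labels the same probabilities at two different thresholds. The data frame toy, the predictor x, and the 0.7 cutoff are assumptions made for this illustration only.

```r
set.seed(7)

# simulated data: "Up" becomes more likely as x grows
x <- rnorm(200)
y <- factor(ifelse(runif(200) < plogis(2 * x), "Up", "Down"),
            levels = c("Down", "Up"))
toy <- data.frame(x = x, y = y)

fit <- glm(y ~ x, family = "binomial", data = toy)

# type = "response" returns P(y == "Up"), the second factor level
p_up <- predict(fit, newdata = toy, type = "response")
range(p_up)

# the same probabilities turned into labels at two cutoffs
classes_50 <- factor(ifelse(p_up >= 0.5, "Up", "Down"), levels = levels(toy$y))
classes_70 <- factor(ifelse(p_up >= 0.7, "Up", "Down"), levels = levels(toy$y))
table(classes_50)
table(classes_70)
```

Raising the cutoff can only move predictions from Up to Down, so a higher threshold trades missed Up cases for fewer false Up predictions.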

Step 5: Evaluating the accuracy of the model

A common way to judge the accuracy of a classification machine learning model is the confusion matrix. It counts how many data points were predicted correctly and incorrectly for each class, by comparing the predictions with the actual values of the target variable in the testing dataset. Along with the confusion matrix, other statistics of the model such as accuracy and kappa can be calculated using the code below.

# generating confusion matrix and
# other details from the  
# prediction made by the model
print(confusionMatrix(predicted_classes, test$Direction))

Output:

Confusion Matrix and Statistics

          Reference
Prediction Down  Up
      Down  177   5
      Up      4 189
                                         
               Accuracy : 0.976          
                 95% CI : (0.9549, 0.989)
    No Information Rate : 0.5173         
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.952          
                                         
 Mcnemar's Test P-Value : 1              
                                         
            Sensitivity : 0.9779         
            Specificity : 0.9742         
         Pos Pred Value : 0.9725         
         Neg Pred Value : 0.9793         
             Prevalence : 0.4827         
         Detection Rate : 0.4720         
   Detection Prevalence : 0.4853         
      Balanced Accuracy : 0.9761         
                                         
       'Positive' Class : Down                                                                                            
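
The headline statistics in this report follow directly from the four counts in the matrix. As a worked check, the sketch below recomputes accuracy, sensitivity, specificity, and Cohen's kappa in base R from the counts copied from the output above. The variable names are chosen for this illustration.

```r
# counts copied from the confusion matrix above
# rows = predicted class, columns = actual (reference) class
cm <- matrix(c(177, 4, 5, 189), nrow = 2,
             dimnames = list(Prediction = c("Down", "Up"),
                             Reference  = c("Down", "Up")))

n <- sum(cm)

# share of all predictions that are correct: (177 + 189) / 375
accuracy <- sum(diag(cm)) / n

# the positive class is "Down"
sensitivity <- cm["Down", "Down"] / sum(cm[, "Down"])  # 177 / 181
specificity <- cm["Up", "Up"] / sum(cm[, "Up"])        # 189 / 194

# Cohen's kappa: agreement beyond what chance alone would give
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = kappa_stat), 4)
```

Rounded, these values agree with the Accuracy, Sensitivity, Specificity, and Kappa lines printed by confusionMatrix().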

For Regression Machine Learning Models

Regression models are used to predict a continuous quantity such as the price of a house or the sales of a product. In a regression problem the target variable is a real number, stored as an integer or floating-point value. The accuracy of this kind of model is measured by averaging the prediction errors over the data points. Below are the steps to implement the validation set approach for linear regression models.

Step 1: Loading the dataset and required packages

R ships with a variety of built-in datasets. Here we use the trees dataset, which is well suited to a linear regression model. Below is the code to import the required dataset and packages to perform various operations to build the model.

# loading required packages 
  
# package to perform data manipulation 
# and visualization 
library(tidyverse) 
  
# package to compute 
# cross - validation methods 
library(caret) 
  
# access the data from R’s datasets package
data(trees)
  
# look at the first several rows of the data
head(trees)

Output:



  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

So, in this dataset, there are a total of 3 columns among which Volume is the target variable. Since the variable is of continuous nature, a linear regression algorithm can be used to predict the outcome.

Step 2: Building the model and generating the validation set

In this step, the dataset is split randomly in an 80-20 ratio. 80% of the data points are used to train the model, while the remaining 20% act as the validation set on which the accuracy of the model is measured. Below is the code for the same.

# reproducible random sampling 
set.seed(123)
  
# creating training data as 80% of the dataset 
random_sample <- createDataPartition(trees $ Volume,  
                                     p = 0.8, list = FALSE)
  
# generating training dataset 
# from the random_sample 
training_dataset  <- trees[random_sample, ] 
  
# generating testing dataset 
# from rows which are not  
# included in random_sample 
testing_dataset <- trees[-random_sample, ] 
  
# Building the model 
  
# training the model by assigning Volume column 
# as target variable and rest other columns 
# as independent variables 
model <- lm(Volume ~ ., data = training_dataset)

Step 3: Predict the target variable

After building and training the model, the target variable is predicted for the data points that belong to the validation set.

# predicting the target variable 
predictions <- predict(model, testing_dataset)

Step 4: Evaluating the accuracy of the model

Statistical metrics used for evaluating the performance of a linear regression model include Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R2 (the coefficient of determination). RMSE and MAE measure the size of the prediction error, so lower values are better, while R2 measures how closely the predictions track the actual values, so a value close to 1 indicates a better model. Below is the code to calculate the prediction error of the model.

# computing model performance metrics 
data.frame(R2 = R2(predictions, testing_dataset $ Volume), 
           RMSE = RMSE(predictions, testing_dataset $ Volume), 
           MAE = MAE(predictions, testing_dataset $ Volume))

Output:

         R2     RMSE     MAE
1 0.9564487 5.274129 4.73567
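
The three metrics are simple enough to compute by hand, which makes their meaning clearer. The sketch below does so in base R on the trees data. It makes two assumptions: the split uses sample() instead of createDataPartition(), so the selected rows and the resulting numbers differ from the output above, and R2 is taken as the squared correlation between predictions and observations, which to the best of our knowledge is the default form used by caret's R2().

```r
# 80-20 split with base R sample(); the rows chosen differ
# from those picked by createDataPartition() above
set.seed(123)
train_idx <- sample(seq_len(nrow(trees)), size = round(0.8 * nrow(trees)))

fit  <- lm(Volume ~ ., data = trees[train_idx, ])
obs  <- trees$Volume[-train_idx]
pred <- predict(fit, newdata = trees[-train_idx, ])

rmse <- sqrt(mean((obs - pred)^2))  # root mean squared error
mae  <- mean(abs(obs - pred))       # mean absolute error
r2   <- cor(pred, obs)^2            # squared correlation

data.frame(R2 = r2, RMSE = rmse, MAE = mae)
```

RMSE squares the errors before averaging, so it is never smaller than MAE and reacts more strongly to a few large misses.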

Advantages of the Validation Set approach

  • One of the most basic and simple techniques for evaluating a model.
  • No complex steps for implementation.

Disadvantages of the Validation Set approach

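  • The estimated prediction error depends heavily on which observations end up in the training set and which in the validation set.
  • Only a subset of the data is used for training, so the model sees fewer observations than are available and the validation error can be a biased estimate of its performance.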
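
The first disadvantage is easy to observe by repeating the split with different seeds. The sketch below is a minimal illustration on the trees data with base R. The helper split_rmse() and the choice of ten seeds are hypothetical and only serve this example.

```r
# validation RMSE of the same linear model for one random 80-20 split
split_rmse <- function(seed) {
  set.seed(seed)
  train_idx <- sample(seq_len(nrow(trees)), size = round(0.8 * nrow(trees)))
  fit  <- lm(Volume ~ ., data = trees[train_idx, ])
  pred <- predict(fit, newdata = trees[-train_idx, ])
  sqrt(mean((trees$Volume[-train_idx] - pred)^2))
}

# repeat the whole procedure for ten different random splits
rmse_by_seed <- sapply(1:10, split_rmse)
round(rmse_by_seed, 2)
range(rmse_by_seed)
```

The spread between the smallest and largest value shows how much a single validation estimate can change with the random split, especially on a dataset as small as trees.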


