Introduction to the Holdout Method

Last Updated : 26 Aug, 2020

The holdout method is the simplest way to evaluate a classifier. In this method, the data set (a collection of data items or examples) is separated into two sets, called the training set and the test set.

A classifier assigns each data item in a given collection to a target category or class.

Example –
E-mails in our inbox being classified into spam and non-spam.

A classifier should be evaluated to find out its accuracy, error rate, and error estimates. This can be done using various methods. One of the most basic methods for evaluating a classifier is the holdout method.
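As a minimal sketch of the two measures named above, accuracy can be taken as the fraction of items classified correctly and error rate as the fraction classified wrongly. The function names and the five e-mail labels below are made up for illustration and do not come from any particular library.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of items whose predicted class matches the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot score an empty set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def error_rate(true_labels, predicted_labels):
    """Fraction of items that were misclassified (1 - accuracy)."""
    return 1.0 - accuracy(true_labels, predicted_labels)


# Hypothetical labels for five e-mails (made up for illustration).
true_labels = ["spam", "non-spam", "spam", "non-spam", "non-spam"]
predicted_labels = ["spam", "non-spam", "non-spam", "non-spam", "non-spam"]

print(accuracy(true_labels, predicted_labels))    # 4 of 5 correct
print(error_rate(true_labels, predicted_labels))  # 1 of 5 wrong
```

Accuracy and error rate always add up to 1, so reporting either one is enough to recover the other.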

In the holdout method, the data set is partitioned so that most of the data belongs to the training set and the remaining data belongs to the test set.

Example –
If there are 20 data items, 12 are placed in the training set and the remaining 8 are placed in the test set.

After the data set is partitioned into two sets, the training set is used to build a model/classifier.
Once the classifier has been constructed, the data items in the test set are used to measure its accuracy, error rate and error estimate.
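The partition, build and test steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `holdout_split` and `train_majority_classifier` are hypothetical, the 20 data items are made up, and the "classifier" is a deliberately trivial one that always predicts the most common training class.

```python
import random
from collections import Counter


def holdout_split(items, train_fraction=0.6, seed=0):
    """Shuffle a copy of the items, then cut it into (training set, test set)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def train_majority_classifier(training_set):
    """Toy 'model': always predict the most common class seen in training."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda features: majority


# 20 made-up (features, label) items; the features are placeholders.
data = [(i, "spam" if i % 4 == 0 else "non-spam") for i in range(20)]

# Step 1: partition. Step 2: build the classifier from the training set only.
training_set, test_set = holdout_split(data, train_fraction=0.6)
classifier = train_majority_classifier(training_set)

# Step 3: evaluate on the held-out test set, which the classifier never saw.
correct = sum(classifier(x) == label for x, label in test_set)
accuracy = correct / len(test_set)
print(len(training_set), len(test_set))  # 12 8
print("accuracy:", accuracy, "error rate:", 1 - accuracy)
```

Shuffling before cutting keeps the split from depending on the order in which the items happen to be stored. The fixed seed only makes the sketch repeatable.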

However, it is vital to remember two points about the holdout method:

If as many data items as possible are placed in the training set, the classifier has more examples to learn from, so its error rate tends to be lower and its accuracy higher. This is a sign of a good classifier/model.

Example –
A student ‘gfg’ is coached by a teacher. The teacher teaches her all the topics that might appear in the exam. Hence, she tends to make very few mistakes in the exam and performs well.

The more training data are used to construct a classifier, the better prepared it is for any data items from the test set that are used to test it.

If more data items are placed in the test set and used to test the classifier built from the training set, we get a more accurate evaluation of the classifier's accuracy, error rate and error estimate.

Example –
A student ‘gfg’ is coached by a teacher. The teacher teaches her some topics that might appear in the exam. If the student ‘gfg’ is then given a number of exams based on this coaching, her weak and strong points can be determined accurately.

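If more test data are used to evaluate the constructed classifier, its error rate, error estimate and accuracy can be determined more accurately. Since both sets come from the same fixed data set, enlarging one shrinks the other, so choosing the split is a trade-off between these two points.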
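One way to see why a larger test set gives a more accurate evaluation is the binomial standard error of a measured accuracy, sqrt(p(1 - p) / n), where p is the accuracy and n is the number of test items. The sketch below assumes the test items are independent of one another, and the accuracy of 0.8 is chosen only for illustration.

```python
import math


def accuracy_standard_error(accuracy, test_size):
    """Binomial standard error of an accuracy measured on test_size items."""
    if test_size <= 0:
        raise ValueError("test_size must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / test_size)


# Assumed accuracy of 0.8, chosen only for illustration.
for test_size in (8, 80, 800):
    se = accuracy_standard_error(0.8, test_size)
    print(f"test items: {test_size:4d}  standard error: {se:.3f}")
```

The uncertainty shrinks with the square root of the test set size, so quadrupling the number of test items roughly halves it.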

Problem :
While partitioning the whole data set into two parts (the training set and the test set), all data items belonging to one class, say GFG1, may end up entirely in the test set, so that none of the data items of class GFG1 are in the training set. The resulting model/classifier is then never trained on data items of class GFG1.

Solution :
Stratification is a technique in which the data items belonging to class GFG1 are divided between the two data sets, i.e., the training set and the test set, so that each set receives its share of the class (an equal share when the split is even). This ensures that the model/classifier is trained on data items belonging to class GFG1.

Example –
All four data items belonging to class GFG1 are divided equally, two data items each, between the two data sets: the training set and the test set.
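A stratified split can be sketched by splitting each class separately and then combining the pieces. The function name `stratified_holdout_split`, the second class GFG2 and the data items are hypothetical and used only for illustration. The sketch uses an even split so that the four GFG1 items divide two and two.

```python
import random
from collections import defaultdict


def stratified_holdout_split(items, train_fraction=0.5, seed=0):
    """Split (features, label) items so each class keeps about the same ratio."""
    # Group the items by class label.
    by_class = defaultdict(list)
    for item in items:
        by_class[item[1]].append(item)

    rng = random.Random(seed)
    training_set, test_set = [], []
    # Shuffle and cut each class on its own, then pool the results.
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        cut = int(round(len(members) * train_fraction))
        training_set.extend(members[:cut])
        test_set.extend(members[cut:])
    return training_set, test_set


# Made-up data: 4 items of class GFG1 and 8 of a hypothetical class GFG2.
data = [(i, "GFG1") for i in range(4)] + [(i, "GFG2") for i in range(4, 12)]
training_set, test_set = stratified_holdout_split(data, train_fraction=0.5)

print(sum(label == "GFG1" for _, label in training_set))  # 2
print(sum(label == "GFG1" for _, label in test_set))      # 2
```

Because each class is cut on its own, no class can land entirely in one set unless it has too few items to divide. A class with a single item, for example, still ends up in only one of the two sets.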
