
Techniques To Evaluate Accuracy of Classifier in Data Mining

Last Updated : 30 Jan, 2023

Prerequisites: Data Mining

Data mining can be described as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, or data dredging. In this article, we will look at techniques to evaluate the accuracy of classifiers.

Holdout

In the holdout method, the given dataset is randomly divided into three subsets:

  • The training set is the subset of the dataset used to build the predictive model.
  • The validation set is the subset used to assess the performance of the model built in the training phase. It provides a platform for fine-tuning the model’s parameters and selecting the best-performing model. Not all modeling algorithms need a validation set.
  • The test set (unseen examples) is the subset used to assess the likely future performance of the model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.

Typically, two-thirds of the data are allocated to the training set and the remaining one-third to the test set.
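The split above can be sketched with the standard library alone; the function name `holdout_split` and the 2/3 default are illustrative, not from any particular library:

```python
import random

def holdout_split(data, train_frac=2/3, seed=0):
    """Randomly shuffle the data, then split it into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = data[:]                       # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)  # two-thirds train, one-third test by default
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(list(range(30)))
print(len(train), len(test))  # 20 10
```

A validation set, when needed, can be carved out of the training portion with a second call to the same function.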

[Figure: the holdout method]

Random Subsampling

  • Random subsampling is a variation of the holdout method in which the holdout procedure is repeated K times.
  • Each iteration randomly splits the data into a training set and a test set.
  • The model is trained on the training set, and the mean squared error (MSE) is computed from its predictions on the test set.
  • Since the MSE depends on the particular split, a different split can give a different MSE; averaging over K repetitions reduces this split-to-split variability of a single holdout.
  • The overall error estimate is the average of the K test errors: E = \frac{1}{K}\sum_{i=1}^{K} E_{i}
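The repetition-and-averaging loop might look like the sketch below. To keep it self-contained, a trivial "model" that predicts the mean of the training targets stands in for a real learner; the function names are illustrative:

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def random_subsampling(xs, ys, k=5, train_frac=2/3, seed=0):
    """Repeat the holdout split k times and average the k test-set MSEs."""
    rng = random.Random(seed)
    n = len(xs)
    errors = []
    for _ in range(k):
        idx = list(range(n))
        rng.shuffle(idx)                      # a fresh random split each repetition
        cut = int(n * train_frac)
        train_idx, test_idx = idx[:cut], idx[cut:]
        # stand-in model: always predict the mean of the training targets
        pred = sum(ys[i] for i in train_idx) / len(train_idx)
        errors.append(mse([ys[i] for i in test_idx], [pred] * len(test_idx)))
    return sum(errors) / k                    # E = (1/K) * sum of the E_i

xs = list(range(20))
ys = [2 * x for x in xs]
print(random_subsampling(xs, ys))
```

With a real classifier, `pred` would be replaced by the model's per-example predictions, and accuracy or error rate could be averaged the same way.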
[Figure: random subsampling]

Cross-Validation

  • K-fold cross-validation is used when only a limited amount of data is available, to obtain a less biased estimate of the model’s performance.
  • The data is divided into K subsets (folds) of equal size.
  • A model is built K times, each time leaving out one of the folds from training and using it as the test set.
  • If K equals the sample size, this is called “leave-one-out” cross-validation.
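The fold construction can be sketched as follows; `kfold_indices` is an illustrative helper that yields index sets, leaving the actual model training to the caller:

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation over n examples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]     # k near-equal-size folds
    for i in range(k):
        test_idx = folds[i]                   # fold i is held out as the test set
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

for train_idx, test_idx in kfold_indices(10, 5):
    print(len(train_idx), len(test_idx))      # each round: 8 train, 2 test
```

Setting `k = n` in this helper gives the leave-one-out special case, where each test set contains exactly one example.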

Bootstrapping

  • Bootstrapping is a technique for making estimates from data by averaging estimates obtained from smaller data samples.
  • The bootstrap method involves iteratively resampling the dataset with replacement.
  • Instead of estimating a statistic only once on the complete data, resampling lets us estimate it many times.
  • Repeating this many times yields a vector of estimates.
  • From these estimates, the variance, expected value, and other relevant statistics can be computed.
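A minimal sketch of the resampling loop, using only the standard library; the function name `bootstrap_estimates`, the sample data, and the choice of the mean as the statistic are all illustrative:

```python
import random
import statistics

def bootstrap_estimates(data, n_resamples=1000, statistic=statistics.mean, seed=0):
    """Resample the data with replacement n_resamples times,
    computing the statistic on each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic([data[rng.randrange(n)] for _ in range(n)])  # sample WITH replacement
            for _ in range(n_resamples)]

data = [2, 4, 4, 4, 5, 5, 7, 9]
ests = bootstrap_estimates(data)
# the vector of estimates supports variance, expected value, etc.
print(statistics.mean(ests), statistics.stdev(ests))
```

The spread of `ests` around the full-sample mean is what makes the bootstrap useful for gauging the uncertainty of an estimate without collecting more data.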
