True Error vs Sample Error
The true error is the probability that the hypothesis will misclassify a single instance drawn at random from the population. Here, the population means all the data the hypothesis could ever be applied to.
Consider a hypothesis h(x) and the true (target) function f(x) over a population P. The true error, i.e. the probability that h misclassifies an instance drawn at random from P, is:
$$error_P(h) = \Pr_{x \in P}\left[f(x) \neq h(x)\right]$$
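To make the definition concrete, here is a minimal sketch that approximates a true error by simulation. The target `f`, the hypothesis `h`, and the uniform population on [0, 1) are all hypothetical choices made for illustration; they are picked so that the true error is 0.1 by construction, which lets the estimate be checked.

```python
import random


def f(x):
    """Hypothetical target function: positive when x >= 0.5."""
    return x >= 0.5


def h(x):
    """Hypothetical hypothesis: its threshold is slightly off."""
    return x >= 0.6


def estimate_true_error(num_draws, seed=0):
    """Monte Carlo estimate of Pr[f(x) != h(x)] for x uniform on [0, 1)."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(num_draws):
        x = rng.random()
        if f(x) != h(x):
            mistakes += 1
    return mistakes / num_draws


# f and h disagree exactly on [0.5, 0.6), so the true error is 0.1 here.
for num_draws in (100, 10_000, 1_000_000):
    print(num_draws, estimate_true_error(num_draws))
```

With real data the population is not available like this, which is why the true error has to be estimated from a sample.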
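The sample error of h with respect to the target function f and a data sample S is the proportion of examples in S that h misclassifies:
$$error_S(h) = \frac{1}{n} \sum_{x \in S} \delta\left(f(x) \neq h(x)\right)$$
where n is the number of examples in S, and δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x) and 0 otherwise.
Equivalently, the sample error can be written in terms of the accuracy on S:
- S.E. = 1 - Accuracy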
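A minimal sketch of the sample error computation is shown here. The function names and the toy label lists are illustrative assumptions, not part of the original article.

```python
def sample_error(y_true, y_pred):
    """Fraction of examples whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("the sample must not be empty")
    mistakes = sum(t != p for t, p in zip(y_true, y_pred))
    return mistakes / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of examples that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy sample (made up for illustration): 3 of the 10 labels differ.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(f"sample error: {sample_error(y_true, y_pred):.2f}")
print(f"1 - accuracy: {1 - accuracy(y_true, y_pred):.2f}")
```

Both printed values agree, since every example is either classified correctly or misclassified.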
Suppose hypothesis h misclassifies 7 of the 33 examples in the sample. Then the sample error is:
$$error_S(h) = \frac{7}{33} \approx 0.21$$
Bias & Variance
Bias: Bias is the difference between the average prediction of the hypothesis and the correct value. A high-bias hypothesis oversimplifies the problem (the model is too simple for the data). It tends to have both high training error and high test error.
Variance: A high-variance hypothesis produces predictions that change a lot depending on the training data it was given. Such hypotheses are overly complex and do not generalize well to unseen data, so they typically show low training error but high test error.
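The contrast between the two can be sketched with a small simulation. Everything in it is an assumption chosen for illustration: the target function `x * x`, the noise level, the query point `x0 = 0.9`, and the two toy models (a constant predictor standing in for a high-bias hypothesis, and a 1-nearest-neighbour predictor standing in for a high-variance one).

```python
import random
from statistics import mean, pvariance


def target(x):
    """Hypothetical target function used only for this illustration."""
    return x * x


def draw_training_set(rng, n=20, noise=0.3):
    """Draw n noisy training examples with x uniform on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    ys = [target(x) + rng.gauss(0, noise) for x in xs]
    return xs, ys


def constant_model(xs, ys, x0):
    """Very simple model: always predict the mean training label."""
    return mean(ys)


def nearest_neighbour_model(xs, ys, x0):
    """Very flexible model: copy the label of the closest training point."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]


def bias_and_variance(model, x0=0.9, trials=2000, seed=0):
    """Refit on many fresh training sets and summarise the predictions at x0."""
    rng = random.Random(seed)
    preds = [model(*draw_training_set(rng), x0) for _ in range(trials)]
    bias = mean(preds) - target(x0)
    return bias, pvariance(preds)


for name, model in [("constant", constant_model),
                    ("1-nearest-neighbour", nearest_neighbour_model)]:
    bias, variance = bias_and_variance(model)
    print(f"{name}: bias={bias:+.3f}, variance={variance:.3f}")
```

In this setup the constant model should show the larger bias and the nearest-neighbour model the larger variance, which mirrors the two descriptions above.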
In general, the true error cannot be computed directly, because the whole population is never available. It can instead be estimated with a confidence interval, which is computed from the sample error.
Below are the steps for computing the confidence interval:
- Randomly draw n examples (independently of each other) from the population P to form the sample S, where n should be greater than 30.
- Calculate the sample error of h on S.
Here we assume that the sample error is an unbiased estimator of the true error. With approximately s% probability, the true error then lies in the interval:
$$error_S(h) \pm z_s \sqrt{\frac{error_S(h)\,(1 - error_S(h))}{n}}$$
where z_s is the z-score corresponding to a confidence level of s percent:
|% Confidence Interval||50||80||90||95||99||99.5|
|Z-score||0.67||1.28||1.64||1.96||2.58||2.80|
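As a worked example, take the earlier case of 7 misclassified examples out of n = 33 and the 95% column of the table (z = 1.96). Assuming the usual normal-approximation interval, error_S(h) ± z_s · sqrt(error_S(h)(1 − error_S(h)) / n), the arithmetic goes roughly as follows (all values rounded):

```latex
\begin{aligned}
error_S(h) &= \frac{7}{33} \approx 0.212 \\
\sqrt{\frac{error_S(h)\,(1 - error_S(h))}{n}} &= \sqrt{\frac{0.212 \times 0.788}{33}} \approx 0.071 \\
error_P(h) &\approx 0.212 \pm 1.96 \times 0.071 \approx 0.21 \pm 0.14
\end{aligned}
```

So with about 95% confidence the true error lies somewhere between roughly 0.07 and 0.35. The interval is wide because n = 33 is only just above the n > 30 threshold.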
True Error vs Sample Error
|True Error||Sample Error|
|The true error is the probability that a randomly drawn instance from the population is misclassified.||The sample error is the fraction of the sample that is misclassified.|
|The true error measures the error over the whole population.||The sample error measures the error on the sample only.|
|True error is difficult to calculate. It is estimated by the confidence interval range on the basis of Sample error.||Sample Error is easy to calculate. You just have to calculate the fraction of the sample that is misclassified.|
|The true error can be caused by poor data collection methods, selection bias, or non-response bias.||Sampling error can be of type population-specific error (wrong people to survey), selection error, sample-frame error (wrong frame window selected for sample), and non-response error (when respondent failed to respond).|
In this implementation, we estimate the true error using a confidence interval.
Confidence interval
90%: (17.868667310403545, 19.891332689596453)
95%: (17.67492277275104, 20.08507722724896)
99%: (17.29626006422982, 20.463739935770178)
99.5%: (17.154104780989755, 20.60589521901025)
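The data behind the intervals listed above is not given, so the following is a separate, minimal sketch of the same idea rather than a reproduction of those numbers. It assumes synthetic binary labels whose predictions are wrong about 15% of the time, and it uses the normal-approximation interval with z-scores taken from the standard library's `statistics.NormalDist`. The function name `true_error_interval` is hypothetical.

```python
import math
import random
from statistics import NormalDist


def true_error_interval(sample_error, n, confidence):
    """Approximate two-sided confidence interval for the true error."""
    # Two-sided z-score, e.g. about 1.96 for confidence = 0.95.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(sample_error * (1 - sample_error) / n)
    # An error rate is a probability, so clip the interval to [0, 1].
    return max(0.0, sample_error - margin), min(1.0, sample_error + margin)


# Synthetic data (an assumption for illustration): binary labels, and
# predictions that flip the true label about 15% of the time.
rng = random.Random(42)
n = 500
y_true = [rng.randint(0, 1) for _ in range(n)]
y_pred = [1 - y if rng.random() < 0.15 else y for y in y_true]

sample_error = sum(t != p for t, p in zip(y_true, y_pred)) / n
print(f"sample error: {sample_error:.3f} (n = {n})")
for confidence in (0.90, 0.95, 0.99, 0.995):
    low, high = true_error_interval(sample_error, n, confidence)
    print(f"{confidence:.1%}: ({low:.4f}, {high:.4f})")
```

The intervals widen as the confidence level increases, and they narrow as n grows, because the margin shrinks in proportion to 1/sqrt(n).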