True Error vs Sample Error

• Last Updated : 21 Sep, 2021

True Error

The true error is the probability that a hypothesis misclassifies a single instance drawn at random from the population. Here the population is the full distribution of instances the hypothesis could ever encounter, not only the data that has been collected.

Consider a hypothesis h(x) and a true (target) function f(x) over a population P. The true error of h, i.e. the probability that h misclassifies an instance drawn at random from P, is:

$$\text{error}_P(h) = \Pr_{x \in P}\left[f(x) \neq h(x)\right]$$
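The definition can be checked numerically when the population is known. The sketch below is only an illustration under stated assumptions: the population is taken to be uniform on [0, 1), and both `f` and `h` are hypothetical threshold functions invented for this example. Since they disagree only for x between 0.5 and 0.6, the estimate should come out close to 0.1.

```python
import random


def f(x):
    """Hypothetical target function: positive when x > 0.5."""
    return x > 0.5


def h(x):
    """Hypothetical hypothesis: same shape, but the threshold is off (0.6)."""
    return x > 0.6


def estimate_true_error(n_draws=200_000, seed=0):
    """Approximate Pr[f(x) != h(x)] by drawing many instances at random."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(n_draws):
        x = rng.random()  # assumed population: uniform on [0, 1)
        if f(x) != h(x):
            mistakes += 1
    return mistakes / n_draws


estimated = estimate_true_error()
print(f"estimated true error = {estimated:.3f}")
```

In practice the population is not available like this, which is why the true error has to be estimated from a finite sample.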

Sample Error

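The sample error of h with respect to the target function f and a data sample S is the proportion of examples in S that h misclassifies:

$$\text{error}_S(h) = \frac{1}{n} \sum_{x \in S} \delta\left(f(x) \neq h(x)\right)$$

where n is the number of examples in S, and δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x) and 0 otherwise.

Equivalently, the sample error can be written in terms of accuracy: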

• S.E. = 1- Accuracy

Suppose hypothesis h misclassifies 7 of the 33 examples in a sample S. Then the sample error is:

$$\text{error}_S(h) = \frac{7}{33} \approx 0.21$$
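The same calculation can be sketched in Python. The helper name `sample_error` and the label lists are hypothetical, built only so that exactly 7 of the 33 predictions are wrong.

```python
def sample_error(y_true, y_pred):
    """Fraction of examples whose prediction differs from the target."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("the sample must not be empty")
    mistakes = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mistakes / len(y_true)


# Hypothetical sample: 33 examples, the first 7 are misclassified.
y_true = [1] * 33
y_pred = [0] * 7 + [1] * 26

error = sample_error(y_true, y_pred)
accuracy = 1 - error

print(f"sample error = {error:.4f}")     # 0.2121
print(f"accuracy     = {accuracy:.4f}")  # 0.7879
```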

Bias & Variance

Bias: Bias is the difference between the average prediction of the hypothesis and the correct value. A high-bias hypothesis oversimplifies the problem (the model is not complex enough to capture the pattern in the training data), so it tends to have both high training error and high test error.

Variance: Variance measures how much the predictions of a hypothesis change when it is trained on different samples. A high-variance hypothesis is overly complex: it fits the training data closely but does not generalize well to unseen data.
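A small simulation can make the two terms concrete. The sketch below is an illustration under assumptions that are not part of the original text: a regression setting with a hypothetical target sin(2πx), Gaussian noise, and two deliberately extreme predictors. One always predicts the mean of the training labels (very simple), the other copies the label of the nearest training point (very flexible). Each is retrained on many random training sets, and the bias and variance of its prediction at a single point are measured.

```python
import math
import random
import statistics


def target(x):
    """Hypothetical true function used only for this illustration."""
    return math.sin(2 * math.pi * x)


def draw_training_set(rng, n=20, noise=0.3):
    """Draw n noisy examples with x uniform on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    ys = [target(x) + rng.gauss(0, noise) for x in xs]
    return xs, ys


def predict_constant(xs, ys, x0):
    """Very simple model: ignore x0 and predict the mean training label."""
    return statistics.fmean(ys)


def predict_nearest(xs, ys, x0):
    """Very flexible model: copy the label of the closest training point."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]


def bias_and_variance(predict, x0=0.25, trials=500, seed=0):
    """Retrain on many training sets and summarize the predictions at x0."""
    rng = random.Random(seed)
    preds = [predict(*draw_training_set(rng), x0) for _ in range(trials)]
    bias = statistics.fmean(preds) - target(x0)
    variance = statistics.pvariance(preds)
    return bias, variance


bias_c, var_c = bias_and_variance(predict_constant)
bias_n, var_n = bias_and_variance(predict_nearest)

print(f"constant model : bias = {bias_c:+.3f}, variance = {var_c:.3f}")
print(f"nearest point  : bias = {bias_n:+.3f}, variance = {var_n:.3f}")
```

Under these assumptions the constant model should show the larger bias and the nearest-point model the larger variance, which is the trade-off described above.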

Confidence Interval

The true error generally cannot be computed directly, because the whole population is rarely available. It can instead be estimated with a confidence interval, which is computed from the sample error.

The steps for computing the confidence interval are:

• Randomly draw a sample S of n examples from the population P, independently of each other, where n should be greater than 30.
• Calculate the sample error of h on the sample S.

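Here we assume that the sample error is an unbiased estimator of the true error. With approximately s% confidence, the true error then lies in the interval:

$$\text{error}_S(h) \pm z_s \sqrt{\frac{\text{error}_S(h)\left(1 - \text{error}_S(h)\right)}{n}}$$

where z_s is the z-score for a confidence level of s percent. Commonly used values are z_s = 1.64 for 90%, 1.96 for 95%, and 2.58 for 99%.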
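The interval for the true error can be sketched directly from the sample error and the sample size. The function name `true_error_interval` is hypothetical, and clamping the bounds to [0, 1] is an added convenience rather than part of the formula. The z-score is taken from the standard library's `statistics.NormalDist` instead of a lookup table. The example reuses the earlier case of 7 misclassified examples out of 33.

```python
import math
from statistics import NormalDist


def true_error_interval(sample_err, n, confidence=0.95):
    """Normal-approximation confidence interval for the true error.

    sample_err: sample error of the hypothesis on S, between 0 and 1
    n:          number of examples in S (the approximation assumes n > 30)
    confidence: confidence level as a fraction, e.g. 0.95
    """
    if not 0 <= sample_err <= 1:
        raise ValueError("sample_err must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")

    # Two-sided z-score, e.g. about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(sample_err * (1 - sample_err) / n)

    # Clamp to [0, 1] because an error rate cannot fall outside that range.
    return max(0.0, sample_err - half_width), min(1.0, sample_err + half_width)


for level in (0.90, 0.95, 0.99):
    low, high = true_error_interval(7 / 33, 33, level)
    print(f"{level:.0%}: ({low:.4f}, {high:.4f})")
```

The interval widens as the confidence level rises and narrows as n grows, since the half-width scales with 1/√n.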

Implementation:

The following code illustrates the idea by computing confidence intervals, at several confidence levels, for the mean of a randomly generated sample.

Python3

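# imports
import numpy as np
import scipy.stats as st

# define sample data
np.random.seed(0)
data = np.random.randint(10, 30, 10000)

# confidence levels to evaluate
confidence_levels = [0.90, 0.95, 0.99, 0.995]
for level in confidence_levels:
    # The level is passed positionally because newer SciPy releases
    # renamed this keyword from `alpha` to `confidence`.
    print(st.norm.interval(level, loc=np.mean(data), scale=st.sem(data)))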

# confidence Interval
90%: (17.868667310403545, 19.891332689596453)
95%: (17.67492277275104, 20.08507722724896)
99%: (17.29626006422982, 20.463739935770178)
99.5%: (17.154104780989755, 20.60589521901025)
