
Confidence Intervals for Machine Learning

In machine learning, confidence intervals quantify the uncertainty associated with model predictions and parameter estimates. They give a range of values that is expected to contain the true value at a stated confidence level, such as 95%. In this article, we will look at how confidence intervals are calculated and why they are relevant in machine learning.

What are Confidence Intervals?

A confidence interval is a range of values that likely contains the true population parameter, such as the population mean or proportion, based on a sample from that population and a specified level of confidence.

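For example, if we calculate a 95% confidence interval for the mean test score of students, we are using a procedure that, over repeated sampling, produces intervals containing the true population mean about 95% of the time. Informally, we say we are 95% confident that the true population mean lies within that interval.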

import numpy as np
import matplotlib.pyplot as plt

# Sample data
np.random.seed(42)
sample_data = np.random.normal(loc=80, scale=10, size=100)  # mean=80, std=10, sample size=100

# Calculate sample statistics
sample_mean = np.mean(sample_data)
sample_std = np.std(sample_data)  # note: np.std defaults to ddof=0; pass ddof=1 for the sample standard deviation
sample_size = len(sample_data)

# Confidence level
confidence_level = 0.95

# Calculate z-score for 95% confidence level
z_score = 1.96  # approximate z-score for 95% confidence

# Margin of error
margin_of_error = z_score * (sample_std / np.sqrt(sample_size))

# Confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

# Visualize the confidence interval
plt.figure(figsize=(8, 6))
plt.hist(sample_data, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
plt.axvline(confidence_interval[0], color='red', linestyle='--', label='Lower CI')
plt.axvline(confidence_interval[1], color='red', linestyle='--', label='Upper CI')
plt.axvline(sample_mean, color='black', linestyle='-', label='Sample Mean')
plt.xlabel('Test Scores')
plt.ylabel('Frequency')
plt.title('Histogram with 95% Confidence Interval')
plt.legend()
plt.show()

# Print the confidence interval
print("Confidence Interval (95%):", confidence_interval)

Output:

Confidence Interval (95%): (77.1904471198356, 80.73262253228253)
Figure: Visualize the CI (histogram of test scores with the sample mean and the 95% confidence interval bounds marked)


1. For Population Mean (Known Standard Deviation):

The confidence interval for the population mean when the standard deviation is known is calculated using the formula:

[Tex]\text{Confidence Interval} = \bar{x} \pm z \left( \frac{\sigma}{\sqrt{n}} \right) [/Tex]
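Where [Tex]\bar{x}[/Tex] is the sample mean, [Tex]z[/Tex] is the critical value of the standard normal distribution for the chosen confidence level (about 1.96 for 95%), [Tex]\sigma[/Tex] is the known population standard deviation, and [Tex]n[/Tex] is the sample size.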
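As a rough illustration of how the formula behaves, the sketch below computes the margin of error [Tex]z \left( \frac{\sigma}{\sqrt{n}} \right)[/Tex] for a few sample sizes. The helper name `z_margin_of_error` and the value of the population standard deviation are assumptions chosen for the example, not part of any library.

```python
from statistics import NormalDist

def z_margin_of_error(population_std, sample_size, confidence_level=0.95):
    """Margin of error z * sigma / sqrt(n) for a known population standard deviation."""
    # Two-sided critical value, e.g. about 1.96 for a 95% confidence level
    z_score = NormalDist().inv_cdf((1 + confidence_level) / 2)
    return z_score * population_std / sample_size ** 0.5

population_std = 5  # assumed known for this illustration
margins = {n: z_margin_of_error(population_std, n) for n in (10, 40, 160)}

for n, margin in margins.items():
    print(f"n = {n:3d}: margin of error = {margin:.3f}")
```

Because the margin of error scales with [Tex]1/\sqrt{n}[/Tex], quadrupling the sample size halves the width of the interval.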

2. For Population Mean (Unknown Standard Deviation):

The confidence interval for the population mean when the standard deviation is unknown is calculated using the t-distribution instead of the standard normal distribution.

[Tex]\text{Confidence Interval} = \bar{x} \pm t \left( \frac{s}{\sqrt{n}} \right) [/Tex]

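Where [Tex]\bar{x}[/Tex] is the sample mean, [Tex]t[/Tex] is the critical value of the t-distribution with [Tex]n-1[/Tex] degrees of freedom for the chosen confidence level, [Tex]s[/Tex] is the sample standard deviation, and [Tex]n[/Tex] is the sample size.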
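To see why the choice between the two distributions matters mostly for small samples, the following sketch compares the 95% critical value of the standard normal distribution with t critical values for several sample sizes. The sample sizes are arbitrary values picked for illustration.

```python
from scipy import stats

confidence_level = 0.95
upper_tail = (1 + confidence_level) / 2  # 0.975 for a two-sided 95% interval

# Critical value from the standard normal distribution
z_crit = stats.norm.ppf(upper_tail)

# Critical values from the t-distribution with n - 1 degrees of freedom
sample_sizes = [5, 10, 30, 100, 1000]
t_crits = {n: stats.t.ppf(upper_tail, df=n - 1) for n in sample_sizes}

print(f"z critical value: {z_crit:.4f}")
for n, t_crit in t_crits.items():
    print(f"n = {n:4d}: t critical value = {t_crit:.4f}")
```

The t critical value is always larger than the z critical value, so the t-based interval is wider, and the gap shrinks as the sample size grows.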

3. For Population Proportion:

The confidence interval for a population proportion p is calculated using the formula:


[Tex]\text{Confidence Interval} = \hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} [/Tex]

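Where [Tex]\hat{p}[/Tex] is the sample proportion, [Tex]z[/Tex] is the critical value of the standard normal distribution for the chosen confidence level, and [Tex]n[/Tex] is the sample size.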
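As a worked illustration of the proportion formula, assume 35 successes in a sample of 50 and a 95% confidence level, so that [Tex]z \approx 1.96[/Tex]. The numbers are example values only.

[Tex]\hat{p} = \frac{35}{50} = 0.7 [/Tex]

[Tex]z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \approx 1.96 \sqrt{\frac{0.7 \times 0.3}{50}} \approx 1.96 \times 0.0648 \approx 0.127 [/Tex]

[Tex]\text{Confidence Interval} \approx 0.7 \pm 0.127 = (0.573,\ 0.827) [/Tex]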

What are the Features of a Confidence Interval?

How to Calculate Confidence Interval (CI)?

1. For Population Mean (Known Population Standard Deviation):

import scipy.stats as stats

# Sample data
sample_data = [86, 88, 84, 90, 85, 87, 89, 82, 91, 83]

# Calculate sample statistics
sample_mean = sum(sample_data) / len(sample_data)
population_std = 5  # example: known population standard deviation
sample_size = len(sample_data)

# Confidence level
confidence_level = 0.95

# Z-score for 95% confidence level
z_score = stats.norm.ppf((1 + confidence_level) / 2)

# Margin of error
margin_of_error = z_score * (population_std / (sample_size ** 0.5))

# Confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
print("Confidence Interval (Population Mean):", confidence_interval)

Output:

Confidence Interval (Population Mean): (83.4010248384772, 89.5989751615228)

The confidence interval (83.4010248384772, 89.5989751615228) for a population mean represents a range of values within which we are reasonably confident that the true population mean lies. Specifically, this interval indicates that if we were to take multiple samples from the same population and calculate a confidence interval for the population mean from each sample, about 95% of those intervals would contain the true population mean.
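The repeated-sampling interpretation can be checked with a small simulation. The sketch below draws many samples from a normal population whose mean is known to us, builds a known-sigma z-interval from each sample, and counts how often the interval contains the true mean. The function name `coverage_of_z_interval`, the true mean of 86, and the number of trials are assumptions made for this illustration.

```python
import random
from statistics import NormalDist, fmean

def coverage_of_z_interval(true_mean=86.0, population_std=5.0, sample_size=10,
                           confidence_level=0.95, n_trials=10_000, seed=0):
    """Estimate how often the known-sigma z-interval contains the true mean."""
    rng = random.Random(seed)
    z_score = NormalDist().inv_cdf((1 + confidence_level) / 2)
    # With a known sigma the margin of error is the same for every sample
    margin_of_error = z_score * population_std / sample_size ** 0.5

    hits = 0
    for _ in range(n_trials):
        sample = [rng.gauss(true_mean, population_std) for _ in range(sample_size)]
        sample_mean = fmean(sample)
        if sample_mean - margin_of_error <= true_mean <= sample_mean + margin_of_error:
            hits += 1
    return hits / n_trials

coverage = coverage_of_z_interval()
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to 0.95. Any single interval either contains the true mean or does not; the 95% describes the procedure, not one particular interval.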

2. For Population Mean (Unknown Population Standard Deviation)

Following the same steps as above, but instead of using the z-score, you'll use the t-score from the t-distribution based on n−1 degrees of freedom.

import scipy.stats as stats

# Sample data
sample_data = [86, 88, 84, 90, 85, 87, 89, 82, 91, 83]

# Calculate sample statistics
sample_mean = sum(sample_data) / len(sample_data)
sample_std = stats.tstd(sample_data)  # using sample standard deviation
sample_size = len(sample_data)

# Confidence level
confidence_level = 0.95

# T-score for 95% confidence level and (n-1) degrees of freedom
t_score = stats.t.ppf((1 + confidence_level) / 2, df=sample_size-1)

# Margin of error
margin_of_error = t_score * (sample_std / (sample_size ** 0.5))

# Confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
print("Confidence Interval (Population Mean - Unknown Std Dev):", confidence_interval)

Output:

Confidence Interval (Population Mean - Unknown Std Dev): (84.33414941027831, 88.66585058972169)

The confidence interval (84.33414941027831, 88.66585058972169) for a population mean when the standard deviation is unknown represents a range of values within which we are reasonably confident that the true population mean lies. Similar to the previous explanation, this interval indicates that if we were to take multiple samples from the same population and calculate a confidence interval for the population mean from each sample, about 95% of those intervals would contain the true population mean.
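The manual t-interval calculation can be cross-checked against SciPy's built-in helpers. This is a sketch of one way to do it, using `stats.sem` for the standard error and `stats.t.interval` for the bounds. The confidence level is passed positionally here because the keyword name for that argument differs between SciPy versions.

```python
import scipy.stats as stats

sample_data = [86, 88, 84, 90, 85, 87, 89, 82, 91, 83]
sample_mean = sum(sample_data) / len(sample_data)

# stats.sem uses ddof=1 by default, i.e. it returns s / sqrt(n)
standard_error = stats.sem(sample_data)

# Two-sided 95% interval from the t-distribution with n - 1 degrees of freedom
lower, upper = stats.t.interval(0.95, df=len(sample_data) - 1,
                                loc=sample_mean, scale=standard_error)
print("Confidence Interval (scipy.stats.t.interval):", (lower, upper))
```

The bounds should agree with the interval computed step by step for the same sample data, up to floating-point rounding.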

3. For Population Proportion

import scipy.stats as stats

# Sample data
successes = 35  # number of successes
total_obs = 50  # total sample size

# Calculate sample proportion
sample_proportion = successes / total_obs

# Confidence level
confidence_level = 0.95

# Z-score for 95% confidence level
z_score = stats.norm.ppf((1 + confidence_level) / 2)

# Margin of error
margin_of_error = z_score * ((sample_proportion * (1 - sample_proportion)) / total_obs) ** 0.5

# Confidence interval
confidence_interval = (sample_proportion - margin_of_error, sample_proportion + margin_of_error)
print("Confidence Interval (Population Proportion):", confidence_interval)

Output:

Confidence Interval (Population Proportion): (0.5729798163797764, 0.8270201836202236)

The confidence interval (0.5729798163797764, 0.8270201836202236) for a population proportion represents a range of values within which we are reasonably confident that the true population proportion lies. Specifically, this interval indicates that if we were to take multiple samples from the same population and calculate a confidence interval for the population proportion from each sample, about 95% of those intervals would contain the true population proportion.
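In a machine learning setting, the same proportion formula is often applied to classification accuracy, treating each test example as a success (correct prediction) or a failure. The sketch below is a minimal illustration under that assumption; the function name `accuracy_confidence_interval` and the labels are made up for the example, and the normal approximation it relies on is less reliable for small test sets or accuracies near 0 or 1.

```python
from statistics import NormalDist

def accuracy_confidence_interval(y_true, y_pred, confidence_level=0.95):
    """Normal-approximation CI for accuracy, treated as a proportion of correct predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / n
    z_score = NormalDist().inv_cdf((1 + confidence_level) / 2)
    margin_of_error = z_score * (accuracy * (1 - accuracy) / n) ** 0.5
    # Clip to [0, 1], since a proportion cannot fall outside that range
    lower = max(0.0, accuracy - margin_of_error)
    upper = min(1.0, accuracy + margin_of_error)
    return accuracy, (lower, upper)

# Hypothetical test-set labels and predictions (made up for illustration)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 20  # 200 examples
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1] * 20  # 2 mistakes in every 10 predictions

accuracy, (lower, upper) = accuracy_confidence_interval(y_true, y_pred)
print(f"Accuracy: {accuracy:.3f}, 95% CI: ({lower:.3f}, {upper:.3f})")
```

Reporting the interval alongside the point estimate makes it easier to judge whether a difference in accuracy between two models is larger than the uncertainty from the finite test set.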

Applications of Confidence Intervals (CIs)

Conclusion

Confidence intervals are essential tools in machine learning and statistical analysis. They help us understand the uncertainty associated with our estimates, assess the significance of model parameters, and make informed decisions based on reliable data. By providing a range of likely values for population parameters and model predictions, confidence intervals enable us to evaluate model performance, guide decision-making processes, and ensure the robustness of our analyses in various real-world applications.
