
Continuous Probability Distributions for Machine Learning

Machine learning relies heavily on probability distributions because they offer a framework for reasoning about the uncertainty and variability present in data. Continuous probability distributions, in particular, describe how likely a variable is to take values in a continuous range, such as the real numbers.

What are Continuous Probability Distributions (CPDs)?

A probability distribution is a mathematical function that describes the likelihood of different outcomes for a random variable. Continuous probability distributions (CPDs) are probability distributions that apply to continuous random variables. They describe quantities that can take on any value within a specific range, like the height of a person or the amount of time it takes to complete a task.

In continuous probability distributions, two key functions describe the likelihood of a variable taking on specific values:



Probability Density Function (PDF):

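The PDF gives the probability density at each point for a continuous random variable. The density itself is not a probability: the probability that the variable falls within an interval is the area under the PDF over that interval. A higher density at a value means the variable is more likely to fall within a small interval around that value.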

Cumulative Distribution Function (CDF):

The CDF gives the probability that a random variable is less than or equal to a specific value. It provides a cumulative view of the probability distribution, starting at 0 and increasing to 1 as the value of the random variable increases.

CDF is the integral of the PDF, and the PDF is the derivative of the CDF.
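The relationship between the two functions can be checked numerically. The sketch below uses the standard normal distribution from SciPy as an example; the helper names `cdf_by_integration` and `pdf_by_differentiation` are illustrative and not part of any library.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm


def cdf_by_integration(x, mean=0.0, std=1.0):
    """Approximate P(X <= x) by integrating the normal PDF from -inf to x."""
    area, _ = quad(norm.pdf, -np.inf, x, args=(mean, std))
    return area


def pdf_by_differentiation(x, mean=0.0, std=1.0, h=1e-5):
    """Approximate the PDF at x with a central difference of the CDF."""
    return (norm.cdf(x + h, mean, std) - norm.cdf(x - h, mean, std)) / (2 * h)


x0 = 1.0
# Each pair of numbers should agree to several decimal places.
print(cdf_by_integration(x0), norm.cdf(x0))
print(pdf_by_differentiation(x0), norm.pdf(x0))
```

Integrating the PDF reproduces the CDF, and differentiating the CDF reproduces the PDF, up to small numerical error.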


Difference between PDF and CDF in Continuous Probability Distributions


Why are Continuous Probability Distributions important in machine learning?

Imagine trying to build a model to predict the price of a car. You have data on various factors like mileage, year, and brand. But how do you account for the fact that prices can vary continuously? This is where continuous distributions come to the rescue! By fitting a suitable distribution to the price data, you can estimate the probability of a car with specific features falling within a certain price range.
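As a rough sketch of this idea, the snippet below fits a normal distribution to car prices and estimates the probability of a price range. The prices are synthetic, generated only for illustration, and the normal distribution is an assumption; real price data may call for a different distribution.

```python
import numpy as np
from scipy.stats import norm

# Synthetic prices for illustration only (not real data).
rng = np.random.default_rng(0)
prices = rng.normal(loc=20_000, scale=3_000, size=500)

# Fit a normal distribution: norm.fit returns the estimated mean and std.
mu, sigma = norm.fit(prices)


def prob_between(low, high):
    """P(low <= price <= high) under the fitted normal distribution."""
    return norm.cdf(high, mu, sigma) - norm.cdf(low, mu, sigma)


p = prob_between(18_000, 22_000)
print(f"Estimated mean: {mu:.0f}, estimated std: {sigma:.0f}")
print(f"P(18000 <= price <= 22000) = {p:.3f}")
```

The probability of a range is the difference of two CDF values, which is the area under the fitted PDF between the two prices.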

Types of Continuous Probability Distributions

Here are some common types used in machine learning:

Normal Distribution (Bell Curve) or Gaussian Distribution:

The Normal Distribution, also called the Gaussian Distribution, is a symmetric, bell-shaped continuous probability distribution. Two parameters define it: the mean (μ), which sets the distribution’s center, and the standard deviation (σ), which sets its spread or dispersion.

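For a random variable x, its probability density function is expressed as:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$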


Note: The shape of the Normal Distribution is such that about 68% of the values fall within one standard deviation of the mean (μ ± σ), about 95% fall within two standard deviations (μ ± 2σ), and about 99.7% fall within three standard deviations (μ ± 3σ).
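These percentages can be reproduced from the CDF of the standard normal distribution. The helper `prob_within` below is an illustrative name, not a library function.

```python
from scipy.stats import norm


def prob_within(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for any normal distribution."""
    # Standardizing X removes mu and sigma, so the standard normal CDF suffices.
    return norm.cdf(k) - norm.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```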

Uniform Distribution:

The Uniform Distribution is a continuous probability distribution where all values within a specified range are equally likely to occur.

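For a random variable x on the interval [a, b], its probability density function is expressed as:

$$f(x) = \begin{cases} \dfrac{1}{b-a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$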
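A short sketch with SciPy shows the constant density. The endpoints `a` and `b` are arbitrary values chosen for illustration. Note that `scipy.stats.uniform` takes the lower endpoint as `loc` and the interval width as `scale`.

```python
from scipy.stats import uniform

# Illustrative endpoints of the interval [a, b].
a, b = 2.0, 6.0

# SciPy parameterizes the uniform distribution on [loc, loc + scale].
dist = uniform(loc=a, scale=b - a)

print(dist.pdf(3.0))  # density inside the interval: 1 / (b - a) = 0.25
print(dist.pdf(7.0))  # density outside the interval: 0.0
print(dist.cdf(4.0))  # P(X <= 4) = (4 - a) / (b - a) = 0.5
```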

Exponential Distribution:

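The exponential distribution is a continuous probability distribution that represents the time between events in a Poisson process, in which events occur continuously and independently at a constant average rate.

For a random variable x and a rate parameter λ > 0, its probability density function is expressed as:

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}$$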
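The following sketch evaluates an exponential distribution with SciPy. The rate value is hypothetical. `scipy.stats.expon` is parameterized by `scale`, which equals 1 divided by the rate.

```python
import math

from scipy.stats import expon

# Hypothetical rate: on average 0.5 events per unit of time.
rate = 0.5

# SciPy uses scale = 1 / rate for the exponential distribution.
dist = expon(scale=1 / rate)

mean_wait = dist.mean()       # average waiting time, 1 / rate
density_at_1 = dist.pdf(1.0)  # rate * exp(-rate * 1)
prob_longer_3 = dist.sf(3.0)  # P(X > 3) = exp(-rate * 3)

print(mean_wait, density_at_1, prob_longer_3)
```

The survival function `sf` gives the probability of waiting longer than a given time, which is often the quantity of interest for waiting-time problems.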

Chi-Squared Distribution:

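The Chi-Squared Distribution is a continuous probability distribution that arises in statistics, particularly in hypothesis testing and confidence interval estimation. It is the distribution of the sum of the squares of k independent standard normal random variables, where k is called the degrees of freedom.

For a random variable x > 0 with k degrees of freedom, its probability density function is expressed as:

$$f(x) = \frac{x^{k/2 - 1} e^{-x/2}}{2^{k/2}\,\Gamma(k/2)}$$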
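One way to build intuition is by simulation: summing the squares of k independent standard normal draws should produce samples that follow a chi-squared distribution with k degrees of freedom. The sketch below compares such samples with `scipy.stats.chi2`; the value of `k`, the sample size, and the seed are arbitrary choices.

```python
import numpy as np
from scipy.stats import chi2

k = 4  # degrees of freedom (arbitrary choice for illustration)
rng = np.random.default_rng(0)

# Each row holds k independent standard normal draws.
z = rng.standard_normal(size=(100_000, k))

# The sum of squares along each row is one chi-squared sample.
samples = (z ** 2).sum(axis=1)

# The sample mean should be close to the theoretical mean, which equals k.
print(samples.mean(), chi2.mean(k))

# The empirical P(X <= 5) should be close to the theoretical CDF at 5.
print(np.mean(samples <= 5.0), chi2.cdf(5.0, k))
```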

Determining the distribution of a variable

Example:

Consider the Iris dataset and suppose we want to understand how the petal length is distributed. The steps are shown below.

Run the code in a Jupyter notebook or any other environment where the imported libraries are installed.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
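from scipy.stats import norm  # normal (Gaussian) distribution
from sklearn.datasets import load_iris  # assumption: bundled copy of the Iris data

# Step 1: Load the Iris dataset
# The original read a CSV from a `url` that is not given, so scikit-learn's
# bundled copy is substituted and its column renamed to 'petal_length'.
iris_data = load_iris(as_frame=True).frame.rename(
    columns={'petal length (cm)': 'petal_length'}
)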
 
# Step 2: Select the feature for analysis (e.g., petal length)
selected_feature = 'petal_length'
selected_data = iris_data[selected_feature]
 
# Step 3: Plot the histogram of the selected feature
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.hist(selected_data, bins=30, density=True, color='skyblue', alpha=0.6)
plt.title('Histogram of {}'.format(selected_feature))
plt.xlabel(selected_feature)
plt.ylabel('Density')
plt.grid(True)
# Step 4: Fit a Gaussian distribution to the selected feature
estimated_mean, estimated_std = np.mean(selected_data), np.std(selected_data)
 
# Step 5: Plot the histogram along with the fitted Gaussian distribution
plt.subplot(1, 2, 2)
plt.hist(selected_data, bins=30, density=True, color='skyblue', alpha=0.6,
         label='Histogram')

# Evaluate the fitted Gaussian PDF across the observed range of the feature
x = np.linspace(np.min(selected_data), np.max(selected_data), 100)
pdf = norm.pdf(x, estimated_mean, estimated_std)
plt.plot(x, pdf, color='red', linestyle='--', linewidth=2,
         label='Fitted Gaussian Distribution')

plt.title('Histogram and Fitted Gaussian Distribution of {}'.format(selected_feature))
plt.xlabel(selected_feature)
plt.ylabel('Density')
# Labels come from the label= arguments above, so they cannot be mismatched
plt.legend()
plt.grid(True)
 
plt.tight_layout()
plt.show()

                    

Output:

Explanation for the output:

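The left plot shows the histogram of petal lengths in the Iris dataset, and the right plot overlays the fitted Gaussian curve on the same histogram. If the curve follows the bars closely, a Gaussian distribution is a suitable model for the data. For petal length, the histogram shows two separate clusters (setosa flowers have much shorter petals than the other two species), so a single Gaussian fits this feature poorly.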
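A visual check can be complemented with a statistical test. The sketch below uses the Shapiro-Wilk test from SciPy as one possible option. The helper `looks_gaussian` and the 0.05 threshold are illustrative choices, and the two-cluster sample is synthetic data standing in for a feature like petal length.

```python
import numpy as np
from scipy.stats import shapiro


def looks_gaussian(data, alpha=0.05):
    """Return (not_rejected, statistic, p_value) for the Shapiro-Wilk test.

    not_rejected is True when the test fails to reject normality at level
    alpha. This is not proof that the data are Gaussian.
    """
    statistic, p_value = shapiro(data)
    return p_value > alpha, statistic, p_value


# Synthetic stand-in data: two well-separated clusters, 150 points in total.
rng = np.random.default_rng(0)
two_clusters = np.concatenate([
    rng.normal(loc=1.5, scale=0.2, size=50),
    rng.normal(loc=5.0, scale=0.8, size=100),
])

not_rejected, statistic, p_value = looks_gaussian(two_clusters)
print(f"W = {statistic:.3f}, p = {p_value:.3g}, not rejected: {not_rejected}")
```

The same function can be called on `selected_data` from the example above. A small p-value indicates that the data are unlikely to come from a single Gaussian distribution.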

