
Gaussian Distribution In Machine Learning

The Gaussian distribution, also known as the normal distribution, plays a fundamental role in machine learning. It is a key concept used to model the distribution of real-valued random variables and is essential for understanding various statistical methods and algorithms.

Gaussian Distribution

The Gaussian (normal) distribution is a continuous probability distribution that is symmetric about its mean, with most of the probability (about 68%) lying within one standard deviation of the mean. Its probability density function has a characteristic bell shape.

Gaussian Distribution Formula

The PDF (probability density function) of the Gaussian distribution is given by the formula:

[Tex]f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{(x - \mu)^2}{2\sigma^2} \right) [/Tex]

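where:

  • x is the value of the random variable,
  • μ is the mean of the distribution,
  • σ is the standard deviation (σ² is the variance).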
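The density formula can be evaluated directly, one term at a time. The following is a minimal sketch using only the Python standard library; the function names `gaussian_pdf` and `approximate_area` are illustrative choices, not part of any library. It also checks numerically that the area under the curve is close to 1, as expected for a probability density.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x, written term by term from the formula."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


def approximate_area(mu, sigma, half_width=8.0, steps=4000):
    """Midpoint-rule estimate of the area under the density
    over the interval mu ± half_width * sigma."""
    lo = mu - half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    total = sum(gaussian_pdf(lo + (i + 0.5) * dx, mu, sigma) for i in range(steps))
    return total * dx


print(gaussian_pdf(0.0))           # peak of the standard normal, 1 / sqrt(2 * pi)
print(approximate_area(5.0, 2.0))  # should be very close to 1
```

Changing `mu` shifts the curve left or right without changing its shape, while a larger `sigma` lowers the peak and widens the curve.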

Gaussian Distribution Curve

The curve is symmetric and bell-shaped, and it mathematically represents the probability distribution of a continuous random variable. The Gaussian distribution is characterized by two parameters: the mean (μ) and the standard deviation (σ), which determine the location and the spread of the curve.

Figure: Probability distribution curve

  • Within one standard deviation of the mean (Mean ± 1 SD), approximately 68% of the data is expected to fall.
  • Within two standard deviations of the mean (Mean ± 2 SD), approximately 95% of the data is expected to fall.
  • Within three standard deviations of the mean (Mean ± 3 SD), approximately 99.7% of the data is expected to fall.
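These three percentages can be checked numerically. For any Gaussian, the probability of falling within k standard deviations of the mean equals erf(k / √2), where erf is the error function. The sketch below uses `math.erf` from the standard library; the helper name `prob_within` is an illustrative choice.

```python
import math


def prob_within(k):
    """Probability that a Gaussian variable lies within k standard
    deviations of its mean (independent of mu and sigma)."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"Mean ± {k} SD: {prob_within(k):.4f}")
# Mean ± 1 SD: 0.6827
# Mean ± 2 SD: 0.9545
# Mean ± 3 SD: 0.9973
```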

Gaussian Distribution Table

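Note:

  • Rows = value of z to one decimal place, in increments of 0.1 (0.0 to 2.0 in the table below).
  • Columns = second decimal place of z, ranging from 0.00 to 0.09, in increments of 0.01.
  • Each entry is the area under the standard normal curve between 0 and z.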


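| Z-Value | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.0000 | 0.0040 | 0.0080 | 0.0120 | 0.0160 | 0.0199 | 0.0239 | 0.0279 | 0.0319 | 0.0359 |
| 0.1 | 0.0398 | 0.0438 | 0.0478 | 0.0517 | 0.0557 | 0.0596 | 0.0636 | 0.0675 | 0.0714 | 0.0753 |
| 0.2 | 0.0793 | 0.0832 | 0.0871 | 0.0910 | 0.0948 | 0.0987 | 0.1026 | 0.1064 | 0.1103 | 0.1141 |
| 0.3 | 0.1179 | 0.1217 | 0.1255 | 0.1293 | 0.1331 | 0.1368 | 0.1406 | 0.1443 | 0.1480 | 0.1517 |
| 0.4 | 0.1554 | 0.1591 | 0.1628 | 0.1664 | 0.1700 | 0.1736 | 0.1772 | 0.1808 | 0.1844 | 0.1879 |
| 0.5 | 0.1915 | 0.1950 | 0.1985 | 0.2019 | 0.2054 | 0.2088 | 0.2123 | 0.2157 | 0.2190 | 0.2224 |
| 0.6 | 0.2257 | 0.2291 | 0.2324 | 0.2357 | 0.2389 | 0.2422 | 0.2454 | 0.2486 | 0.2517 | 0.2549 |
| 0.7 | 0.2580 | 0.2611 | 0.2642 | 0.2673 | 0.2704 | 0.2734 | 0.2764 | 0.2794 | 0.2823 | 0.2852 |
| 0.8 | 0.2881 | 0.2910 | 0.2939 | 0.2967 | 0.2995 | 0.3023 | 0.3051 | 0.3078 | 0.3106 | 0.3133 |
| 0.9 | 0.3159 | 0.3186 | 0.3212 | 0.3238 | 0.3264 | 0.3289 | 0.3315 | 0.3340 | 0.3365 | 0.3389 |
| 1.0 | 0.3413 | 0.3438 | 0.3461 | 0.3485 | 0.3508 | 0.3531 | 0.3554 | 0.3577 | 0.3599 | 0.3621 |
| 1.1 | 0.3643 | 0.3665 | 0.3686 | 0.3708 | 0.3729 | 0.3749 | 0.3770 | 0.3790 | 0.3810 | 0.3830 |
| 1.2 | 0.3849 | 0.3869 | 0.3888 | 0.3907 | 0.3925 | 0.3944 | 0.3962 | 0.3980 | 0.3997 | 0.4015 |
| 1.3 | 0.4032 | 0.4049 | 0.4066 | 0.4082 | 0.4099 | 0.4115 | 0.4131 | 0.4147 | 0.4162 | 0.4177 |
| 1.4 | 0.4192 | 0.4207 | 0.4222 | 0.4236 | 0.4251 | 0.4265 | 0.4279 | 0.4292 | 0.4306 | 0.4319 |
| 1.5 | 0.4332 | 0.4345 | 0.4357 | 0.4370 | 0.4382 | 0.4394 | 0.4406 | 0.4418 | 0.4429 | 0.4441 |
| 1.6 | 0.4452 | 0.4463 | 0.4474 | 0.4484 | 0.4495 | 0.4505 | 0.4515 | 0.4525 | 0.4535 | 0.4545 |
| 1.7 | 0.4554 | 0.4564 | 0.4573 | 0.4582 | 0.4591 | 0.4599 | 0.4608 | 0.4616 | 0.4625 | 0.4633 |
| 1.8 | 0.4641 | 0.4649 | 0.4656 | 0.4664 | 0.4671 | 0.4678 | 0.4686 | 0.4693 | 0.4699 | 0.4706 |
| 1.9 | 0.4713 | 0.4719 | 0.4726 | 0.4732 | 0.4738 | 0.4744 | 0.4750 | 0.4756 | 0.4761 | 0.4767 |
| 2.0 | 0.4772 | 0.4778 | 0.4783 | 0.4788 | 0.4793 | 0.4798 | 0.4803 | 0.4808 | 0.4812 | 0.4817 |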

The Z score table is often used in statistical calculations and hypothesis testing to determine probabilities associated with specific z-values.

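For example, to look up z = 1.96, read the row for 1.9 and the column for 0.06: the entry is 0.4750, which is the area between the mean and z. Adding the 0.5 that lies to the left of the mean gives a cumulative probability of approximately 0.975, so about 97.5% of the area under the standard normal curve lies to the left of z = 1.96.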
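The table lookup can also be reproduced in code. The sketch below assumes the tabulated entries are areas between 0 and z, which matches the values shown (for instance 0.3413 at z = 1.00). It computes the standard normal cumulative probability from `math.erf`; the function names are illustrative.

```python
import math


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def table_entry(z):
    """Area between 0 and z, the quantity tabulated above (for z >= 0)."""
    return standard_normal_cdf(z) - 0.5


z = 1.96
print(f"table entry: {table_entry(z):.4f}")          # table entry: 0.4750
print(f"cumulative:  {standard_normal_cdf(z):.4f}")  # cumulative:  0.9750
```

By symmetry, the area to the left of a negative value -z is 1 minus the area to the left of z, so a table of non-negative z values is enough to answer both cases.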

Properties of Gaussian Distribution

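Some of the important properties are:

  • The curve is symmetric about the mean μ, where the density reaches its peak; the mean, median, and mode coincide.
  • The distribution is fully specified by two parameters, the mean (μ) and the standard deviation (σ).
  • About 68%, 95%, and 99.7% of the probability lies within one, two, and three standard deviations of the mean, respectively.
  • The total area under the curve is 1.
  • A linear combination of independent Gaussian random variables is itself Gaussian.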

Machine Learning Methods that Use the Gaussian Distribution

Implementation of Gaussian Distribution in Machine Learning

Consider the famous Iris dataset, which consists of 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. We can examine the distribution of one of these features, such as sepal length, by plotting a histogram and overlaying a Gaussian density with the same mean and standard deviation to see whether the feature approximately follows a Gaussian distribution.

from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np

# Load the Iris dataset
iris = load_iris()
sepal_length = iris.data[:, 0]  # Extract sepal length (feature at index 0)

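# Estimate the Gaussian parameters from the data (np.std uses ddof=0 by default)
mu, std = np.mean(sepal_length), np.std(sepal_length)
# Evaluate the Gaussian PDF with those parameters over the observed range
x = np.linspace(np.min(sepal_length), np.max(sepal_length), 100)
y = (1 / (std * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / std)**2)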

plt.figure(figsize=(8, 6))
plt.hist(sepal_length, bins=20, color='skyblue', edgecolor='black', alpha=0.7, density=True)
plt.plot(x, y, color='red', label='Gaussian Fit')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Density')
plt.title('Distribution of Sepal Length in Iris Dataset with Gaussian Fit')
plt.legend()
plt.show()


Output:

Figure 1: Histogram of sepal length in the Iris dataset with the fitted Gaussian curve overlaid.
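A visual comparison can be complemented with simple numerical summaries. One common heuristic, sketched here as an assumption rather than as part of the original analysis, is to compute the sample skewness and excess kurtosis, both of which are close to 0 for Gaussian data. The example uses synthetic samples with arbitrary illustrative parameters so that it runs with the standard library alone; the hypothetical helper `skewness_and_excess_kurtosis` could equally be applied to a list of sepal lengths.

```python
import random
import statistics


def skewness_and_excess_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis.
    Both are approximately 0 for Gaussian data."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic data; the parameters 5.8 and 0.8 are arbitrary illustrative values
gaussian_sample = [rng.gauss(5.8, 0.8) for _ in range(10_000)]
uniform_sample = [rng.uniform(4.0, 8.0) for _ in range(10_000)]

g_skew, g_kurt = skewness_and_excess_kurtosis(gaussian_sample)
u_skew, u_kurt = skewness_and_excess_kurtosis(uniform_sample)
print(f"Gaussian sample: skewness={g_skew:.3f}, excess kurtosis={g_kurt:.3f}")
print(f"Uniform sample:  skewness={u_skew:.3f}, excess kurtosis={u_kurt:.3f}")
```

The uniform sample is included for contrast: it is symmetric, so its skewness is also near 0, but its excess kurtosis is clearly negative (about -1.2 in theory), which flags the departure from a bell shape.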


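Gaussian distributions are stable under linear combinations: a linear combination of independent Gaussian random variables is again Gaussian. This property allows many quantities to be derived analytically, which simplifies reasoning about random variables and making predictions from data, and it is one reason the Gaussian distribution is a cornerstone of statistical modeling and analysis.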
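This stability can be illustrated with `statistics.NormalDist` from the Python standard library, which supports scaling by a constant and adding two distributions under the assumption that they are independent. The sketch below uses arbitrary illustrative parameters: for independent X ~ N(2, 3²) and Y ~ N(-1, 4²), the combination 2X + Y should have mean 2·2 + (-1) = 3 and standard deviation √(2²·3² + 4²) = √52. A seeded simulation is included as a rough cross-check.

```python
import math
import statistics
from statistics import NormalDist

# Two independent Gaussian variables (parameters are arbitrary examples)
X = NormalDist(mu=2.0, sigma=3.0)
Y = NormalDist(mu=-1.0, sigma=4.0)

# Analytical result: scaling and adding keeps the result Gaussian
Z = 2 * X + Y
print(Z.mean, Z.stdev)  # mean 3.0, standard deviation sqrt(52) ≈ 7.211

# Cross-check by simulation with fixed seeds
xs = X.samples(50_000, seed=1)
ys = Y.samples(50_000, seed=2)
simulated = [2 * x + y for x, y in zip(xs, ys)]
sim_mean = statistics.fmean(simulated)
sim_stdev = statistics.pstdev(simulated)
print(sim_mean, sim_stdev)  # should be close to the analytical values
```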
