
Gaussian Mixture Model

Last Updated : 10 Jun, 2023

Suppose there is a set of data points that needs to be grouped into several parts or clusters based on their similarity. In Machine Learning, this is known as Clustering. There are several methods available for clustering, such as K-Means clustering, hierarchical clustering, and Gaussian Mixture Models.

In this article, the Gaussian Mixture Model will be discussed.

Normal or Gaussian Distribution

In real life, many datasets can be modeled by a Gaussian Distribution (univariate or multivariate). So it is quite natural and intuitive to assume that the clusters come from different Gaussian Distributions. In other words, the model tries to represent the dataset as a mixture of several Gaussian Distributions. This is the core idea of the model.
In one dimension the probability density function of a Gaussian Distribution is given by

G(x|\mu,\sigma) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

where \mu and \sigma^2 are respectively the mean and variance of the distribution. For a Multivariate (let us say d-variate) Gaussian Distribution, the probability density function is given by

G(X|\mu, \Sigma)= \frac{1}{\sqrt{(2\pi)^{d}|\boldsymbol\Sigma|}} \exp\left(-\frac{1}{2}(X-\mu)^T\boldsymbol\Sigma^{-1}(X-\mu)\right)

Here \mu is a d-dimensional vector denoting the mean of the distribution and \Sigma is the d \times d covariance matrix.
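To make these two formulas concrete, here is a minimal sketch that evaluates both densities with SciPy. The particular values of the mean, variance, and covariance below are arbitrary illustrations, not part of the original example.

Python3

import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate Gaussian G(x | mu, sigma) at a single point
mu, sigma = 0.0, 1.5          # illustrative values
x = 1.0
print(norm.pdf(x, loc=mu, scale=sigma))

# The same value computed directly from the formula above
print(np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi)))

# Multivariate (d = 2) Gaussian G(X | mu, Sigma)
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.3],
                [0.3, 2.0]])  # symmetric positive-definite covariance
X = np.array([0.5, -1.0])
print(multivariate_normal.pdf(X, mean=mean, cov=cov))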

Gaussian Mixture Model

Suppose there are K clusters (for the sake of simplicity, it is assumed here that the number of clusters is known and it is K). So \mu and \Sigma have to be estimated for each cluster k. Had there been only one distribution, they would have been estimated by the maximum-likelihood method. But since there are K such clusters, the probability density is defined as a linear combination of the densities of all these K distributions, i.e.

p(X) =\sum_{k=1}^K \pi_k G(X|\mu_k, \Sigma_k)

where \pi_k is the mixing coefficient of the kth distribution. To estimate the parameters by the maximum likelihood method, compute the log-likelihood \ln p(X|\mu, \Sigma, \pi).
 

\begin{aligned}\ln p(X|\mu, \Sigma, \pi) &= \sum_{i=1}^N \ln p(X_i)  \\ &=\sum_{i=1}^N \ln \sum_{k=1}^K \pi_k G(X_i | \mu_k, \Sigma_k)\end{aligned}
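As a quick illustration of these two formulas, the sketch below evaluates the mixture density p(x_n) and the resulting log-likelihood for a toy one-dimensional dataset; the data and the values of \pi_k, \mu_k, and \sigma_k are hand-picked for demonstration only.

Python3

import numpy as np
from scipy.stats import norm

# Toy 1-D data and hand-picked parameters for a K = 2 component mixture
X = np.array([-1.2, -0.8, 0.1, 2.9, 3.1, 3.4])
pi = np.array([0.5, 0.5])        # mixing coefficients, must sum to 1
mu = np.array([-1.0, 3.0])
sigma = np.array([0.5, 0.4])

# p(x_n) = sum_k pi_k * G(x_n | mu_k, sigma_k); shape (N, K) -> (N,)
weighted = pi * norm.pdf(X[:, None], loc=mu, scale=sigma)
p_x = weighted.sum(axis=1)

# Log-likelihood: sum_n ln p(x_n)
print(np.log(p_x).sum())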


Now define a random variable \gamma_k(X) such that \gamma_k(X) = p(k|X).

From Bayes' theorem,

\begin{aligned}\gamma_k(X) &=\frac{p(X|k)p(k)}{\sum_{k=1}^K p(k)p(X|k)} \\ &=\frac{p(X|k)\pi_k}{\sum_{k=1}^K \pi_k p(X|k)}\end{aligned}
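Continuing the same toy setup, the responsibilities \gamma_k(x_n) follow directly from this expression; each row of the resulting matrix sums to 1 across the K components. The values below are again purely illustrative.

Python3

import numpy as np
from scipy.stats import norm

# Same toy 1-D data and hand-picked parameters as in the previous sketch
X = np.array([-1.2, -0.8, 0.1, 2.9, 3.1, 3.4])
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 3.0])
sigma = np.array([0.5, 0.4])

# Numerator: pi_k * G(x_n | mu_k, sigma_k) for every point n and component k
weighted = pi * norm.pdf(X[:, None], loc=mu, scale=sigma)   # shape (N, K)

# gamma_k(x_n): normalise each row so it sums to 1 over the components
gamma = weighted / weighted.sum(axis=1, keepdims=True)
print(gamma.round(3))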


Now, for the log-likelihood function to be maximized, its derivatives with respect to \mu, \Sigma, and \pi should be zero. So, equating the derivative of \ln p(X|\mu, \Sigma, \pi) with respect to \mu_k to zero and rearranging the terms gives,
 

\mu_k=\frac{\sum_{n=1}^N \gamma_k(x_n)x_n}{\sum_{n=1}^N \gamma_k(x_n)}


Similarly, taking the derivatives with respect to \Sigma and \pi respectively, one can obtain the following expressions.
 

\Sigma_k=\frac{\sum_{n=1}^N \gamma_k(x_n)(x_n-\mu_k)(x_n-\mu_k)^T}{\sum_{n=1}^N \gamma_k(x_n)}
 

And 

\pi_k=\frac{1}{N} \sum_{n=1}^N \gamma_k(x_n)

Note: \sum_{n=1}^N \gamma_k(x_n) denotes the effective number of sample points assigned to the kth cluster. Here it is assumed that there are N samples in total and that each sample, containing d features, is denoted by x_n.
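Given a matrix of responsibilities, these three update equations translate almost line for line into NumPy. The sketch below is a minimal illustration: it assumes X is an (N, d) data matrix and gamma an (N, K) matrix of responsibilities, both supplied by the caller.

Python3

import numpy as np

def update_parameters(X, gamma):
    """Apply the three update equations, given data X of shape (N, d)
    and responsibilities gamma of shape (N, K)."""
    N, d = X.shape
    K = gamma.shape[1]

    Nk = gamma.sum(axis=0)                 # effective size of each cluster
    pi = Nk / N                            # pi_k
    mu = (gamma.T @ X) / Nk[:, None]       # mu_k, shape (K, d)

    Sigma = np.zeros((K, d, d))
    for k in range(K):
        diff = X - mu[k]                   # (x_n - mu_k) for every n
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma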

Each of these update equations depends on \gamma_k(x_n), which in turn depends on \mu_k, \Sigma_k, and \pi_k. So it can be clearly seen that the parameters cannot be estimated in closed form. This is where the Expectation-Maximization algorithm is beneficial.

Expectation-Maximization (EM) Algorithm

The Expectation-Maximization (EM) algorithm is an iterative way to find maximum-likelihood estimates for model parameters when the data is incomplete, has missing data points, or involves hidden (latent) variables. EM starts from some initial guesses for the missing or latent values and uses them to estimate the parameters; these estimates are then used in turn to produce better estimates of the latent values, and the process is repeated until the values stop changing.

In the Expectation-Maximization (EM) algorithm, the estimation step (E-step) and maximization step (M-step) are the two most important steps, performed iteratively to update the model parameters until the model converges.

Estimation Step (E-step):

  • In the estimation step, we first initialize the model parameters: the means (μk), covariance matrices (Σk), and mixing coefficients (πk).
  • For each data point, we then calculate the posterior probability of it belonging to each Gaussian component, using the current parameter values. These probabilities are represented by the latent variables γk.
  • The estimated values of the latent variables γk are then used in the maximization step.

Maximization Step (M-step):

  • In the maximization step, we update the parameter values (i.e. μk, Σk, and πk) using the estimated latent variables γk.
  • We update the mean of each cluster (μk) by taking the weighted average of the data points, using the corresponding latent variable probabilities as weights.
  • We update the covariance matrix (Σk) by taking the weighted average of the outer products of the differences between the data points and the mean, using the corresponding latent variable probabilities as weights.
  • We update the mixing coefficients (πk) by taking the average of the latent variable probabilities for each component.

Repeat the E-step and M-step until convergence

  • We iterate between the estimation step and the maximization step until the change in the log-likelihood or in the parameters falls below a predefined threshold, or until a maximum number of iterations is reached.
  • In the estimation step, we update the latent variables based on the current parameter values.
  • In the maximization step, we update the parameter values using the estimated latent variables.
  • This process is repeated until the model converges.

The Expectation-Maximization (EM) algorithm is a general framework and can be applied to various models, including Gaussian Mixture Models (GMMs). The steps described above are specific to GMMs, but the overall concept of the E-step and M-step remains the same for other models that use the EM algorithm.
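Putting the E-step and M-step together, a bare-bones EM loop for a GMM might look like the following sketch. The initialization scheme, tolerance, and iteration cap are arbitrary choices made for illustration; in practice one would also add a small regularization term to the covariances (as scikit-learn's reg_covar parameter does) and simply use sklearn.mixture.GaussianMixture, as in the next section.

Python3

import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(X, K, n_iter=100, tol=1e-6, seed=0):
    """Minimal EM for a GMM on data X of shape (N, d); illustrative only."""
    rng = np.random.default_rng(seed)
    N, d = X.shape

    # Initialization: random data points as means, identity covariances,
    # uniform mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)

    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma, shape (N, K)
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: update pi_k, mu_k and Sigma_k
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]

        # Convergence check on the log-likelihood
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma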

Implementation of the Gaussian Mixture Model

In this example, the Iris dataset is used. In Python, scikit-learn provides the GaussianMixture class to implement GMM. Load the Iris dataset from sklearn's datasets package. To keep things simple, take only the first two columns (i.e. sepal length and sepal width respectively). Now plot the dataset.

Python3

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn import datasets
from sklearn.mixture import GaussianMixture
 
# load the iris dataset
iris = datasets.load_iris()
 
# select first two columns
X = iris.data[:, :2]
 
# turn it into a dataframe
d = pd.DataFrame(X)
 
# plot the data
plt.scatter(d[0], d[1])
plt.show()

                    

Output: 

Iris dataset

Now fit the data as a mixture of 3 Gaussians. Then do the clustering, i.e assign a label to each observation. Also, find the number of iterations needed for the log-likelihood function to converge and the converged log-likelihood value.

Python3

gmm = GaussianMixture(n_components=3)
 
# Fit the GMM model for the dataset
# which expresses the dataset as a
# mixture of 3 Gaussian Distributions
gmm.fit(d)
 
# Assign a label to each sample
labels = gmm.predict(d)
d['labels'] = labels
d0 = d[d['labels'] == 0]
d1 = d[d['labels'] == 1]
d2 = d[d['labels'] == 2]
 
# Plot the three clusters in the same plot
plt.scatter(d0[0], d0[1], c='r')
plt.scatter(d1[0], d1[1], c='yellow')
plt.scatter(d2[0], d2[1], c='g')
plt.show()

                    

Output:

Clustering in the Iris dataset using GMM

Print the converged log-likelihood value and the number of iterations needed for the model to converge.

Python3

# print the converged log-likelihood value
print(gmm.lower_bound_)
 
# print the number of iterations needed
# for the log-likelihood value to converge
print(gmm.n_iter_)

                    

Output:

-1.4985672470486966
8

Hence, it needed 8 iterations for the log-likelihood to converge. If more iterations are performed, no appreciable change in the log-likelihood value is observed.
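The estimated parameters themselves can also be inspected. The fitted GaussianMixture object exposes them as attributes; the exact values will vary from run to run, so this is shown only as a quick check against the notation used earlier.

Python3

# Inspect the estimated mixture parameters
print(gmm.weights_)        # mixing coefficients pi_k
print(gmm.means_)          # component means mu_k
print(gmm.covariances_)    # component covariance matrices Sigma_k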


