
Dirichlet Process Mixture Models (DPMMs)

Clustering is the process of grouping similar data points. The objective is to discover natural groupings within a dataset such that data points within the same cluster are more similar to each other than to data points in other clusters. It is unsupervised learning: we do not have a predefined target or label.

Key features of clustering are that it is unsupervised (no labels are required), it relies on a notion of similarity or distance between data points, and the groups themselves are discovered from the data rather than specified in advance.



Flexible Clustering

Well-known traditional clustering algorithms like K-means and the Gaussian Mixture Model require us to specify the number of clusters K beforehand while developing the model. Practically, this is not always feasible in real-world scenarios, as the true number of clusters within a dataset is often unknown. This poses a significant challenge.

Can we have a model that automatically learns the number of clusters? The ability of a model to decide the number of clusters, along with their shapes and sizes, based on the observed data is known as flexible clustering.



Dirichlet Process Mixture Models (DPMMs) are one such example. They offer a probabilistic approach (the model calculates the probability of a data point belonging to an existing cluster or a new cluster) that is also nonparametric (a fancy way of saying we do not need to specify the number of clusters and their parameters beforehand), and they can dynamically adjust to the complexities of the data.

To understand DPMMs we first need to understand the Dirichlet distribution and the Dirichlet process.

Dirichlet Distribution

First, let us see what a Beta distribution is. The Dirichlet distribution is an extension of the Beta distribution, and an understanding of the Beta distribution is also required for the Dirichlet process, which we will discuss after the Dirichlet distribution.

Beta Distribution

Recall that the binomial distribution models the probability of the different outcomes of n Bernoulli trials, with each trial having probability of success p. For example, if we have 10 trials each with probability of success p, we can derive the probability of observing 1 to 10 successes.

However, let's say we don't know the probability of success. Can we have a distribution that models the probability of success based on observed data? The Beta distribution does exactly that. It is defined by two parameters, α and β, which determine the shape of the distribution. In its standard form, it is a continuous probability distribution defined on the interval [0, 1] (though it can be defined for any interval [a, b]).

The PDF of the Beta distribution is given by:

f(x; \alpha, \beta) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}

where the denominator B(α, β) is the Beta function.
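To get a feel for how α and β shape the density, here is a minimal sketch (assuming NumPy and SciPy are available) that evaluates the Beta PDF at a few points for several parameter choices; the specific values are only illustrative:

# Evaluate the Beta PDF on a few points of (0, 1) for different (alpha, beta)
# to see how the parameters change the shape of the distribution.
import numpy as np
from scipy.stats import beta

x = np.linspace(0.1, 0.9, 5)                 # a few points inside (0, 1)
for a, b in [(0.5, 0.5), (2, 2), (2, 5), (5, 2)]:
    print(f"alpha={a}, beta={b}:", np.round(beta.pdf(x, a, b), 3))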

Dirichlet Distribution

The Beta distribution models probabilities for binomial cases where we can only have two outcomes, i.e. success or failure. But what about a multinomial scenario where the number of outcomes is more than two, like rolling a die with 6 outcomes? For a fair die the probability is simply 1/6 for each side. But consider a situation where we have more than two outcomes and we don't know the probabilities of the outcomes.

The Dirichlet distribution is a way to model random probability mass functions (PMFs) for such scenarios: it gives us a distribution over these probabilities based on some parameters. The number of parameters is equal to the number of outcomes, and the values of the parameters define the shape of the distribution.

The PDF of the Dirichlet distribution is given by:

f(p_1, \ldots, p_K; \alpha_1, \ldots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} p_i^{\alpha_i - 1}

where B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)} is the multivariate Beta function and the p_i are non-negative with \sum_{i=1}^{K} p_i = 1.

So what does the above equation do? For each combination of p_i's such that ∑ p_i = 1, it gives us a PDF value. This PDF value depends on α: the α values determine which combinations of p_i are more likely than others.
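To make this concrete, the following small sketch (assuming SciPy is available) evaluates the Dirichlet density at a few candidate PMFs for a fixed α; the α and PMF values are made up for illustration:

# Evaluate the Dirichlet density at a few candidate PMFs (p1, p2, p3)
# for a fixed alpha: PMFs whose proportions resemble alpha score higher.
from scipy.stats import dirichlet

alpha = [4, 2, 2]                     # pseudo-counts for three outcomes
for p in [[0.5, 0.25, 0.25],          # matches the proportions of alpha
          [0.33, 0.33, 0.34],         # roughly uniform PMF
          [0.1, 0.1, 0.8]]:           # far from the proportions of alpha
    print(p, "->", round(dirichlet.pdf(p, alpha), 3))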

How is this alpha determined practically?

Let us say we have a scenario with three outcomes, each with some individual probability. We run a number of trials and observe the frequency of each outcome. Let these be α1, α2, and α3. These observed frequencies become the parameters of our distribution.

How does the alpha shape our distribution?

α influences both the shape and the concentration of the Dirichlet distribution. Small values of α (below 1) concentrate the probability mass near the corners of the simplex, producing sparse PMFs; values of 1 spread it evenly over all PMFs; and large values concentrate the sampled PMFs around the proportions α_i / ∑ α_j, as illustrated in the sketch below.
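The following minimal sketch (NumPy only; the α vectors are illustrative) draws a few random PMFs for small, unit, and large α so the concentration effect is visible:

# Draw random PMFs from Dirichlet distributions with different alphas
# to see how alpha controls the shape and concentration of the samples.
import numpy as np

rng = np.random.default_rng(0)
for alpha in ([0.1, 0.1, 0.1],   # small alpha -> sparse PMFs (mass piles on one outcome)
              [1.0, 1.0, 1.0],   # alpha = 1 -> PMFs spread evenly over the simplex
              [50., 50., 50.]):  # large alpha -> PMFs close to (1/3, 1/3, 1/3)
    samples = rng.dirichlet(alpha, size=3)
    print("alpha =", alpha)
    print(np.round(samples, 2))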

Thus the Dirichlet distribution is a multivariate generalization of the Beta distribution. So what does the Dirichlet distribution have to do with our clustering problem? Note that the Dirichlet distribution gives us a distribution of probabilities over k categories. We can associate each of the k categories with a unique cluster having a certain mean and variance. The probability of a category is then the probability of data points coming from that cluster.

Dirichlet Process

The intuition behind Dirichlet process

Let's revisit the problem at hand. We have a set of data points that we want to cluster without specifying the number of clusters. Essentially, we want the model to assign a given data point either to one of the already existing clusters, if the data point is similar to any of them, or to a newly created cluster if the data point is quite different from all of the existing clusters.

Let us flip our problem: can we define a process that produces this type of data? Why do so? If we can define such a generative process and understand its properties, then we can use those properties to solve our original clustering problem.

Essentially, we want our process to assign each new data point either to one of the existing clusters, with a probability that grows with the number of points already in that cluster, or to a brand-new cluster, with some remaining probability.

The described process is essentially the Dirichlet process, and it can be explained in simpler terms. When exploring the Dirichlet process, you might encounter various metaphors such as the Chinese restaurant process, the Pólya urn process, and the stick-breaking process. Despite the different names, these metaphors all describe the same data generation mechanism inherent in the Dirichlet process. We'll delve deeper into the stick-breaking metaphor in the technical discussion of the Dirichlet process.

To enable such a process, we aim to generate an infinite set of probabilities whose sum remains equal to one. The question arises: how can we generate an infinite number of probabilities while maintaining this sum? This is where the Dirichlet distribution comes into play. Initially assuming probabilities over a finite number of categories (denoted as k), we extend this concept to an infinite number of categories by replacing the finite set of categories with draws from a continuous base distribution. In this context, we can use a Gaussian (normal) base distribution to obtain an unbounded number of categories while ensuring the sum of probabilities remains equal to one.

Mathematical definition of Dirichlet process

Now let us define the Dirichlet process technically.

A Dirichlet process is a stochastic process, meaning it describes a family of probability distributions over some space. In this context, the space is typically the space of probability measures. A Dirichlet process, denoted as DP(α,G0 ), is defined by two parameters:

α: The concentration parameter, a positive real number.

G0: The base distribution, a probability distribution.

A draw G ~ DP(α, G_0) is itself a (discrete) probability distribution, which can be written as

G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\theta_k}

where each atom θ_k is drawn from the base distribution G_0, δ_{θ_k} is a point mass at θ_k, and the weights π_k sum to one (they are produced by the stick-breaking construction described below).

Sampling

There are two levels of sampling in the Dirichlet process. Let's discuss both of them.

1. Sampling from Dirichlet process

We only sample enough probabilities from the Dirichlet process that their sum is close to one, and we only sample new probabilities when we are required to generate a new cluster (this we will see in point 2).

In order to sample only a limited number of weights, we utilize two properties of the Dirichlet distribution: (1) the marginal distribution of any single component of a Dirichlet is a Beta distribution, and (2) conditioned on that component, the remaining components (renormalized) again follow a Dirichlet distribution.

The above sampling process can be explained by a stick-breaking metaphor: start with a stick of length 1, break off a fraction β1 ~ Beta(1, α) of it to obtain the first weight π1, then break off a fraction β2 ~ Beta(1, α) of the remaining stick to obtain π2, and so on. The broken-off pieces are the mixture weights, and in the limit they sum to one.

For each piece of the stick (each category) we also sample a parameter, e.g. a mean μ, from our base distribution G0. These become our cluster parameters.

Visual representation of the stick-breaking process

By repeatedly applying the rules for the marginal and conditional distributions of the Dirichlet, we have reduced sampling from the Dirichlet process to repeated sampling from a Beta distribution. This is the stick-breaking approach for sampling from the Dirichlet process.
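Below is a minimal NumPy sketch of this stick-breaking construction. The Gaussian base distribution G0, its parameters, and the truncation threshold are illustrative assumptions; the true process is infinite:

# Stick-breaking sketch: sample mixture weights from Beta(1, alpha) pieces of
# the stick, and a cluster mean mu_k from the base distribution G0 per piece.
# We truncate once the remaining stick length is negligible (an assumption
# made here only for illustration).
import numpy as np

rng = np.random.default_rng(42)
alpha = 2.0                      # concentration parameter
base_mean, base_std = 0.0, 5.0   # illustrative Gaussian base distribution G0

weights, means = [], []
remaining = 1.0                  # length of the stick not yet broken off
while remaining > 1e-3:
    b = rng.beta(1.0, alpha)     # fraction of the remaining stick to break off
    weights.append(remaining * b)
    means.append(rng.normal(base_mean, base_std))  # theta_k ~ G0
    remaining *= (1.0 - b)

print("number of pieces:", len(weights))
print("weights sum to ~1:", round(sum(weights) + remaining, 4))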

2. Sampling from the distribution drawn from the Dirichlet process to generate the data

Now, once we have obtained the probability distribution (the weights πk and the cluster parameters μk) from our Dirichlet process, how do we generate data? We first pick a cluster index k with probability πk, and then draw the data point from that cluster's distribution, for example a Gaussian with mean μk.
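Here is a small sketch of this second sampling level, continuing with toy weights and means of the kind produced by the stick-breaking step (the unit cluster variance is an assumption made for illustration):

# Given mixture weights and cluster means, generate data: pick a cluster
# with probability pi_k, then draw the point from that cluster's Gaussian.
import numpy as np

rng = np.random.default_rng(7)
weights = np.array([0.5, 0.3, 0.2])      # pi_k (toy values)
means = np.array([-4.0, 0.0, 5.0])       # mu_k sampled earlier from G0 (toy values)

ks = rng.choice(len(weights), size=10, p=weights)   # cluster index for each point
data = rng.normal(means[ks], 1.0)                   # x ~ N(mu_k, 1)
print(ks)
print(np.round(data, 2))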

Having understood the Dirichlet process, let us work out the probability of a data point being generated from one of the existing clusters versus the probability of it being generated from a new cluster.

Let's say we have already derived clusters 1 to K, and n_k represents the number of points generated from cluster k (with n = ∑_k n_k points in total). We want to generate a new data point. The probability that it will be generated from the k-th cluster is

\frac{n_k}{\alpha + n}

The probability that it will generate a new cluster is

\frac{\alpha}{\alpha + n}

These two probabilities, known as the prior probabilities, are used in the DPMM for finding the clusters in our model.
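A quick sketch of these prior probabilities with toy cluster counts (the counts and α below are made up for illustration):

# Prior probabilities for the next point: join cluster k with probability
# n_k / (alpha + n), or open a new cluster with probability alpha / (alpha + n).
import numpy as np

alpha = 1.0
counts = np.array([10, 5, 2])          # n_k for three existing clusters (toy values)
n = counts.sum()

p_existing = counts / (alpha + n)      # one probability per existing cluster
p_new = alpha / (alpha + n)            # probability of opening a new cluster
print("existing:", np.round(p_existing, 3), "new:", round(p_new, 3))
print("total:", round(p_existing.sum() + p_new, 3))   # the probabilities sum to 1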

DPMM

The DPMM uses the Dirichlet process as a prior. Formally, we can define a DPMM as an extension of a finite mixture model that allows for an infinite number of components. It uses the Dirichlet Process (DP) as a prior distribution over the mixture weights, enabling the model to automatically determine the number of components or clusters needed to represent the data.

Let us now understand how we can solve the original problem at hand: learning the cluster assignments of the given data.

We are interested in finding the cluster assignments of our data points. For this, we use a technique called Gibbs sampling. In each sweep, the cluster assignment of every point is resampled in turn, conditioned on all the other assignments, by combining the Dirichlet process prior probabilities above with the likelihood of the point under each cluster, as sketched below.
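The following is a highly simplified sketch of the assignment step for a single one-dimensional point, assuming Gaussian clusters with known unit variance and a Gaussian base distribution; a real DPMM Gibbs sampler sweeps over all points and also resamples the cluster parameters:

# One (highly simplified) Gibbs step for a single point x: combine the
# Dirichlet-process prior with the likelihood under each cluster.
# Assumptions for illustration: 1-D Gaussian clusters with known unit
# variance and base distribution G0 = N(0, base_var).
import numpy as np
from scipy.stats import norm

alpha, base_var = 1.0, 25.0
counts = np.array([10, 5])            # points already in each cluster (toy values)
means = np.array([-3.0, 4.0])         # current cluster means (toy values)
x = 3.5                               # the point being (re)assigned

# Existing clusters: prior proportional to n_k, times likelihood N(x; mu_k, 1).
# The common prior factor 1 / (alpha + n) cancels after normalization.
p = counts * norm.pdf(x, means, 1.0)
# New cluster: prior proportional to alpha, times the marginal likelihood of x
# under the prior, N(x; 0, 1 + base_var) for this conjugate Gaussian setup.
p = np.append(p, alpha * norm.pdf(x, 0.0, np.sqrt(1.0 + base_var)))

p /= p.sum()                          # normalize to get assignment probabilities
print(np.round(p, 3))                 # last entry = probability of a new cluster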

Advantages over traditional methods

Unlike K-means or a finite Gaussian Mixture Model, a DPMM does not require the number of clusters to be fixed in advance, it provides probabilistic (soft) cluster assignments, and its complexity can grow with the data, so new clusters can emerge as more data is observed.

Implementation

Now let us implement the DPMM in scikit-learn. Scikit-learn's BayesianGaussianMixture with a Dirichlet process prior on the mixture weights gives a practical (truncated) approximation of a DPMM.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.mixture import BayesianGaussianMixture
from sklearn.decomposition import PCA

                    

Library Imports: We import NumPy and Matplotlib for numerics and plotting, the scikit-learn datasets module to load the Iris data, BayesianGaussianMixture for the (approximate) DPMM, and PCA for dimensionality reduction when visualizing the clusters.


# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
labels_true = iris.target

                    

Load Iris Dataset: We load the Iris dataset; X holds the four features for each sample and labels_true holds the true species labels (kept only for reference, since clustering does not use them).

# Fit BayesianGaussianMixture model
n_components = 100  # generous upper bound on the number of components/clusters
bgm = BayesianGaussianMixture(
    n_components=n_components,
    weight_concentration_prior_type='dirichlet_process',  # DP prior on the weights (the default)
    weight_concentration_prior=5,
    covariance_type='full',
    random_state=42)
bgm.fit(X)

# Predict cluster assignments
labels_pred = bgm.predict(X)

                    

Fit BayesianGaussianMixture Model: We allow up to 100 components; with the Dirichlet process prior on the weights, the model drives the weights of unneeded components towards zero, so only a few components end up being used.

Predict Cluster Assignments: bgm.predict(X) assigns each sample to the component with the highest posterior responsibility, giving us the cluster labels.

Let us see the actual number of components used by the model:

set(labels_pred)

                    
{5, 8, 13, 19, 21}

We see the model discovered 5 clusters (only 5 of the 100 allowed components received data points). We can get the weights of these components using bgm.weights_ and the cluster means using the attribute bgm.means_, as shown below.
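For example, continuing with the fitted model from above, we can print the weight and mean of each component that actually received data points:

# Inspect the mixture weights and means of the components the model used
# (continues from the fitted bgm and labels_pred above).
import numpy as np

for k in sorted(set(labels_pred)):
    print(f"component {k}: weight={bgm.weights_[k]:.3f}, mean={np.round(bgm.means_[k], 2)}")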

We can visualize our clusters in two dimensions using PCA as below.

# Perform dimensionality reduction for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
 
# Visualize the results
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels_pred, edgecolors='k', cmap=plt.cm.Paired, marker='o', s=100, linewidth=2)
plt.title('Bayesian Gaussian Mixture Model Clustering')
plt.show()

                    

Output: a scatter plot of the Iris data projected onto the first two principal components, colored by the clusters found by the model.

Conclusion

In this article we saw the intuition and the mathematical details behind DPMMs, which offer a flexible way of clustering in which we don't have to define the number of clusters beforehand.

