
Dirichlet Process Mixture Models (DPMMs)

Clustering is the process of grouping similar data points. The objective is to discover natural groupings within a dataset such that data points within the same cluster are more similar to each other than to data points in other clusters. It is unsupervised learning: we do not have a predefined target or label.

Key features of clustering are that it is unsupervised (no labels are required), it relies on a notion of similarity or distance between data points, and the groups themselves are discovered from the data rather than specified in advance.



Flexible Clustering

Well-known traditional clustering algorithms like K-means and the Gaussian Mixture Model require us to specify the number of clusters K beforehand while developing the model. Practically, this is not always feasible in real-world scenarios, as the true number of clusters within a dataset is often unknown. This poses a significant challenge.

Can we have a model that automatically learns the number of clusters? The ability of a model to decide the number of clusters, along with their shapes and sizes, based on the observed data is known as flexible clustering.



Dirichlet Process Mixture Models (DPMMs) are one such example. They offer a probabilistic approach (the model calculates the probability of a data point belonging to an existing cluster or a new cluster) that is also nonparametric (a fancy way of saying we do not need to specify the number of clusters and their parameters beforehand), and they can dynamically adjust to the complexities of the data.

To understand DPMMs we first need to understand the Dirichlet distribution and the Dirichlet process.

Dirichlet Distribution

First, let us see what a Beta distribution is. The Dirichlet distribution is an extension of the Beta distribution, and an understanding of the Beta distribution is also required for the Dirichlet process, which we will discuss after the Dirichlet distribution.

Beta Distribution

Recall that the binomial distribution models the probability of the different outcomes of n Bernoulli trials, with each trial having probability of success p. For example, if we have 10 trials each with probability of success p, we can derive the probability of observing 1 to 10 successes.

However, let's say we don't know the probability of success. Can we have a distribution that models the probability of success based on observed data? The Beta distribution does exactly that. It is defined by two parameters, α and β, which determine the shape of the distribution. In its standard form, it is a continuous probability distribution defined on the interval [0, 1] (though it can be defined for any interval [a, b]).

The PDF of the Beta distribution is given by:

f(x; \alpha, \beta) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}

where the denominator B(α, β) is the Beta function.
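To get a feel for how α and β shape the density, here is a minimal sketch (assuming NumPy and SciPy are available) that evaluates the Beta PDF at a few points for several parameter choices; the specific values are only illustrative:

# Evaluate the Beta PDF on a few points of (0, 1) for different (alpha, beta)
# to see how the parameters change the shape of the distribution.
import numpy as np
from scipy.stats import beta

x = np.linspace(0.1, 0.9, 5)                 # a few points inside (0, 1)
for a, b in [(0.5, 0.5), (2, 2), (2, 5), (5, 2)]:
    print(f"alpha={a}, beta={b}:", np.round(beta.pdf(x, a, b), 3))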

Dirichlet Distribution

The Beta distribution models probabilities for binomial cases where we can only have two outcomes, i.e. success or failure. But what about a multinomial scenario where the number of outcomes is more than two, like rolling a die with 6 outcomes? For a fair die the probability is simply 1/6 for each side. But consider a situation where we have more than two outcomes and we don't know the probabilities of the outcomes.

The Dirichlet distribution is a way to model random probability mass functions (PMFs) for such scenarios: it gives us a distribution over these probabilities based on some parameters. The number of parameters is equal to the number of outcomes, and the values of the parameters define the shape of the distribution.

The PDF of the Dirichlet distribution is given by:

f(p_1, \ldots, p_K; \alpha_1, \ldots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} p_i^{\alpha_i - 1}

where B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)} is the multivariate Beta function and the p_i are non-negative with \sum_{i=1}^{K} p_i = 1.

So what does the above equation do? For each combination of p_i's such that ∑ p_i = 1, it gives us a PDF value. This PDF value depends on α: the α values determine which combinations of p_i are more likely than others.
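To make this concrete, the following small sketch (assuming SciPy is available) evaluates the Dirichlet density at a few candidate PMFs for a fixed α; the α and PMF values are made up for illustration:

# Evaluate the Dirichlet density at a few candidate PMFs (p1, p2, p3)
# for a fixed alpha: PMFs whose proportions resemble alpha score higher.
from scipy.stats import dirichlet

alpha = [4, 2, 2]                     # pseudo-counts for three outcomes
for p in [[0.5, 0.25, 0.25],          # matches the proportions of alpha
          [0.33, 0.33, 0.34],         # roughly uniform PMF
          [0.1, 0.1, 0.8]]:           # far from the proportions of alpha
    print(p, "->", round(dirichlet.pdf(p, alpha), 3))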

How is this alpha determined practically?

Let us say we have a scenario with three outcomes, each with some individual probability. We run a number of trials and observe the frequency of each outcome. Let these be α1, α2, and α3. These observed frequencies become the parameters of our distribution.

How does the alpha shape our distribution?

α influences both the shape and the concentration of the Dirichlet distribution. Small values of α (below 1) concentrate the probability mass near the corners of the simplex, producing sparse PMFs; values of 1 spread it evenly over all PMFs; and large values concentrate the sampled PMFs around the proportions α_i / ∑ α_j, as illustrated in the sketch below.
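The following minimal sketch (NumPy only; the α vectors are illustrative) draws a few random PMFs for small, unit, and large α so the concentration effect is visible:

# Draw random PMFs from Dirichlet distributions with different alphas
# to see how alpha controls the shape and concentration of the samples.
import numpy as np

rng = np.random.default_rng(0)
for alpha in ([0.1, 0.1, 0.1],   # small alpha -> sparse PMFs (mass piles on one outcome)
              [1.0, 1.0, 1.0],   # alpha = 1 -> PMFs spread evenly over the simplex
              [50., 50., 50.]):  # large alpha -> PMFs close to (1/3, 1/3, 1/3)
    samples = rng.dirichlet(alpha, size=3)
    print("alpha =", alpha)
    print(np.round(samples, 2))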

Thus the Dirichlet distribution is a multivariate generalization of the Beta distribution. So what does the Dirichlet distribution have to do with our clustering problem? Note that the Dirichlet distribution gives us a distribution of probabilities over k categories. We can associate each of the k categories with a unique cluster having a certain mean and variance. The probability of a category is then the probability of data points coming from that cluster.

Dirichlet Process

The intuition behind Dirichlet process

Let's revisit the problem at hand. We have a set of data points that we want to cluster without specifying the number of clusters. Essentially, we want the model to assign a given data point either to one of the already existing clusters, if the data point is similar to any of them, or to a newly created cluster if the data point is quite different from all of the existing clusters.

Let us flip our problem: can we define a process that produces this type of data? Why do so? If we can define such a generative process and understand its properties, then we can use those properties to solve our original clustering problem.

Essentially, we want our process to assign each new data point either to one of the existing clusters, with a probability that grows with the number of points already in that cluster, or to a brand-new cluster, with some remaining probability.

The described process is essentially the Dirichlet process, and it can be explained in simpler terms. When exploring the Dirichlet process, you might encounter various metaphors such as the Chinese restaurant process, the Pólya urn process, and the stick-breaking process. Despite the different names, these metaphors all describe the same data generation mechanism inherent in the Dirichlet process. We'll delve deeper into the stick-breaking metaphor in the technical discussion of the Dirichlet process.

To enable such a process, we aim to generate an infinite set of probabilities whose sum remains equal to one. The question arises: how can we generate an infinite number of probabilities while maintaining this sum? This is where the Dirichlet distribution comes into play. Initially assuming probabilities over a finite number of categories (denoted as k), we extend this concept to an infinite number of categories by replacing the finite set of categories with draws from a continuous base distribution. In this context, we can use a Gaussian (normal) base distribution to obtain an unbounded number of categories while ensuring the sum of probabilities remains equal to one.

Mathematical definition of Dirichlet process

Now let us define the Dirichlet process technically.

A Dirichlet process is a stochastic process, meaning it describes a family of probability distributions over some space. In this context, the space is typically the space of probability measures. A Dirichlet process, denoted as DP(α,G0 ), is defined by two parameters:

α: The concentration parameter, a positive real number.

G0: The base distribution, a probability distribution.

A draw G ~ DP(α, G_0) is itself a (discrete) probability distribution, which can be written as

G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\theta_k}

where each atom θ_k is drawn from the base distribution G_0, δ_{θ_k} is a point mass at θ_k, and the weights π_k sum to one (they are produced by the stick-breaking construction described below).

Sampling

There are two levels of sampling in the Dirichlet process. Let's discuss both of them.

1. Sampling from Dirichlet process

We only sample enough probabilities from the Dirichlet process that their sum is close to one, and we only sample new probabilities when we are required to generate a new cluster (this we will see in point 2).

In order to sample only a limited number of weights, we utilize two properties of the Dirichlet distribution: (1) the marginal distribution of any single component of a Dirichlet is a Beta distribution, and (2) conditioned on that component, the remaining components (renormalized) again follow a Dirichlet distribution.

The above sampling process can be explained by a stick-breaking metaphor: start with a stick of length 1, break off a fraction β1 ~ Beta(1, α) of it to obtain the first weight π1, then break off a fraction β2 ~ Beta(1, α) of the remaining stick to obtain π2, and so on. The broken-off pieces are the mixture weights, and in the limit they sum to one.

For each piece of the stick (each category) we also sample a parameter, e.g. a mean μ, from our base distribution G0. These become our cluster parameters.

Visual representation of the stick-breaking process

By repeatedly applying the rules for the marginal and conditional distributions of the Dirichlet, we have reduced sampling from the Dirichlet process to repeated sampling from a Beta distribution. This is the stick-breaking approach for sampling from the Dirichlet process.
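Below is a minimal NumPy sketch of this stick-breaking construction. The Gaussian base distribution G0, its parameters, and the truncation threshold are illustrative assumptions; the true process is infinite:

# Stick-breaking sketch: sample mixture weights from Beta(1, alpha) pieces of
# the stick, and a cluster mean mu_k from the base distribution G0 per piece.
# We truncate once the remaining stick length is negligible (an assumption
# made here only for illustration).
import numpy as np

rng = np.random.default_rng(42)
alpha = 2.0                      # concentration parameter
base_mean, base_std = 0.0, 5.0   # illustrative Gaussian base distribution G0

weights, means = [], []
remaining = 1.0                  # length of the stick not yet broken off
while remaining > 1e-3:
    b = rng.beta(1.0, alpha)     # fraction of the remaining stick to break off
    weights.append(remaining * b)
    means.append(rng.normal(base_mean, base_std))  # theta_k ~ G0
    remaining *= (1.0 - b)

print("number of pieces:", len(weights))
print("weights sum to ~1:", round(sum(weights) + remaining, 4))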

2. Sampling from the distribution drawn from the Dirichlet process to generate the data

Now, once we have obtained the probability distribution (the weights πk and the cluster parameters μk) from our Dirichlet process, how do we generate data? We first pick a cluster index k with probability πk, and then draw the data point from that cluster's distribution, for example a Gaussian with mean μk.
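Here is a small sketch of this second sampling level, continuing with toy weights and means of the kind produced by the stick-breaking step (the unit cluster variance is an assumption made for illustration):

# Given mixture weights and cluster means, generate data: pick a cluster
# with probability pi_k, then draw the point from that cluster's Gaussian.
import numpy as np

rng = np.random.default_rng(7)
weights = np.array([0.5, 0.3, 0.2])      # pi_k (toy values)
means = np.array([-4.0, 0.0, 5.0])       # mu_k sampled earlier from G0 (toy values)

ks = rng.choice(len(weights), size=10, p=weights)   # cluster index for each point
data = rng.normal(means[ks], 1.0)                   # x ~ N(mu_k, 1)
print(ks)
print(np.round(data, 2))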

Having understood the Dirichlet process, let us work out the probability of a data point being generated from one of the existing clusters versus the probability of it being generated from a new cluster.

Let's say we have already derived clusters 1 to K, and n_k represents the number of points generated from cluster k (with n = ∑_k n_k points in total). We want to generate a new data point. The probability that it will be generated from the k-th cluster is

\frac{n_k}{\alpha + n}

The probability that it will generate a new cluster is

\frac{\alpha}{\alpha + n}

These two probabilities, known as the prior probabilities, are used in the DPMM for finding the clusters in our model.
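A quick sketch of these prior probabilities with toy cluster counts (the counts and α below are made up for illustration):

# Prior probabilities for the next point: join cluster k with probability
# n_k / (alpha + n), or open a new cluster with probability alpha / (alpha + n).
import numpy as np

alpha = 1.0
counts = np.array([10, 5, 2])          # n_k for three existing clusters (toy values)
n = counts.sum()

p_existing = counts / (alpha + n)      # one probability per existing cluster
p_new = alpha / (alpha + n)            # probability of opening a new cluster
print("existing:", np.round(p_existing, 3), "new:", round(p_new, 3))
print("total:", round(p_existing.sum() + p_new, 3))   # the probabilities sum to 1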

DPMM

The DPMM uses the Dirichlet process as a prior. Formally, we can define a DPMM as an extension of a finite mixture model that allows for an infinite number of components. It uses the Dirichlet Process (DP) as a prior distribution over the mixture weights, enabling the model to automatically determine the number of components or clusters needed to represent the data.

Let us now understand how we can solve the original problem at hand: learning the cluster assignments of the given data.

We are interested in finding the cluster assignments of our data points. For this, we use a technique called Gibbs sampling. In each sweep, the cluster assignment of every point is resampled in turn, conditioned on all the other assignments, by combining the Dirichlet process prior probabilities above with the likelihood of the point under each cluster, as sketched below.
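The following is a highly simplified sketch of the assignment step for a single one-dimensional point, assuming Gaussian clusters with known unit variance and a Gaussian base distribution; a real DPMM Gibbs sampler sweeps over all points and also resamples the cluster parameters:

# One (highly simplified) Gibbs step for a single point x: combine the
# Dirichlet-process prior with the likelihood under each cluster.
# Assumptions for illustration: 1-D Gaussian clusters with known unit
# variance and base distribution G0 = N(0, base_var).
import numpy as np
from scipy.stats import norm

alpha, base_var = 1.0, 25.0
counts = np.array([10, 5])            # points already in each cluster (toy values)
means = np.array([-3.0, 4.0])         # current cluster means (toy values)
x = 3.5                               # the point being (re)assigned

# Existing clusters: prior proportional to n_k, times likelihood N(x; mu_k, 1).
# The common prior factor 1 / (alpha + n) cancels after normalization.
p = counts * norm.pdf(x, means, 1.0)
# New cluster: prior proportional to alpha, times the marginal likelihood of x
# under the prior, N(x; 0, 1 + base_var) for this conjugate Gaussian setup.
p = np.append(p, alpha * norm.pdf(x, 0.0, np.sqrt(1.0 + base_var)))

p /= p.sum()                          # normalize to get assignment probabilities
print(np.round(p, 3))                 # last entry = probability of a new cluster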

Advantages over traditional methods

Unlike K-means or a finite Gaussian Mixture Model, a DPMM does not require the number of clusters to be fixed in advance, it provides probabilistic (soft) cluster assignments, and its complexity can grow with the data, so new clusters can emerge as more data is observed.

Implementation

Now let us implement the DPMM in scikit-learn. Scikit-learn's BayesianGaussianMixture with a Dirichlet process prior on the mixture weights gives a practical (truncated) approximation of a DPMM.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.mixture import BayesianGaussianMixture
from sklearn.decomposition import PCA

                    

Library Imports: We import NumPy and Matplotlib for numerics and plotting, the scikit-learn datasets module to load the Iris data, BayesianGaussianMixture for the (approximate) DPMM, and PCA for dimensionality reduction when visualizing the clusters.


# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
labels_true = iris.target

                    

Load Iris Dataset: We load the Iris dataset; X holds the four features for each sample and labels_true holds the true species labels (kept only for reference, since clustering does not use them).

# Fit BayesianGaussianMixture model
n_components = 100  # generous upper bound on the number of components/clusters
bgm = BayesianGaussianMixture(
    n_components=n_components,
    weight_concentration_prior_type='dirichlet_process',  # DP prior on the weights (the default)
    weight_concentration_prior=5,
    covariance_type='full',
    random_state=42)
bgm.fit(X)

# Predict cluster assignments
labels_pred = bgm.predict(X)

                    

Fit BayesianGaussianMixture Model: We allow up to 100 components; with the Dirichlet process prior on the weights, the model drives the weights of unneeded components towards zero, so only a few components end up being used.

Predict Cluster Assignments: bgm.predict(X) assigns each sample to the component with the highest posterior responsibility, giving us the cluster labels.

Let us see the actual number of components used by the model:

set(labels_pred)

                    
{5, 8, 13, 19, 21}

We see the model discovered 5 clusters (only 5 of the 100 allowed components received data points). We can get the weights of these components using bgm.weights_ and the cluster means using the attribute bgm.means_, as shown below.
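For example, continuing with the fitted model from above, we can print the weight and mean of each component that actually received data points:

# Inspect the mixture weights and means of the components the model used
# (continues from the fitted bgm and labels_pred above).
import numpy as np

for k in sorted(set(labels_pred)):
    print(f"component {k}: weight={bgm.weights_[k]:.3f}, mean={np.round(bgm.means_[k], 2)}")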

We can visualize our clusters in two dimensions using PCA as below.

# Perform dimensionality reduction for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
 
# Visualize the results
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels_pred, edgecolors='k', cmap=plt.cm.Paired, marker='o', s=100, linewidth=2)
plt.title('Bayesian Gaussian Mixture Model Clustering')
plt.show()

                    

Output: a scatter plot of the Iris data projected onto the first two principal components, colored by the clusters found by the model.

Conclusion

In this article we saw the intuition and the mathematical details behind DPMMs, which offer a flexible way of clustering in which we don't have to define the number of clusters beforehand.

