What is Gaussian mixture model clustering using R

Last Updated : 02 Feb, 2024

Gaussian mixture model (GMM) clustering is a widely used technique in unsupervised machine learning that groups data points based on their probability distributions. Its versatility lies in its ability to model clusters of different shapes and sizes, which makes it applicable to a wide range of scenarios, and R provides mature tooling for it. The approach assumes that the data consist of a mixture of Gaussian distributions, each representing a distinct cluster. By estimating the parameters of these components, GMM clustering identifies and separates the data points belonging to different clusters.

Mathematical Concept

In a GMM, the data are modeled as a mixture of several Gaussian distributions. Each Gaussian component is characterized by its mean vector (center) and covariance matrix (spread and shape). The probability density of a data point under a specific cluster is given by the corresponding Gaussian distribution.
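In symbols, the density of a data point x under a mixture with K Gaussian components is

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

where the mixing weights \pi_k are non-negative and sum to 1, and \mu_k and \Sigma_k are the mean vector and covariance matrix of the k-th component.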

GMM Clustering Algorithm

  1. Initialization: choose the number of clusters (k) and initialize the parameters, such as the means, covariance matrices, and mixing weights.
  2. Expectation (E) step: Assign each data point to a cluster based on the current parameters, considering its probability of belonging to each cluster.
  3. Maximization (M) step: Update the parameters (means and covariances) based on the current cluster assignments.
  4. Repeat steps 2 and 3: iterate until convergence, i.e. until the parameters (or the log-likelihood) barely change. A minimal sketch of this loop follows the list.
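The loop below is a minimal, illustrative EM implementation for a one-dimensional, two-component mixture. It is a sketch for intuition only; the synthetic data, starting values, and stopping tolerance are all assumptions, and in practice you would use a package such as mclust or ClusterR instead.

R

# Toy EM for a 1-D, two-component Gaussian mixture (sketch only)
set.seed(42)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))

# Step 1: initialize means, variances, and mixing weights
mu <- c(min(x), max(x))
sigma2 <- c(1, 1)
pi_k <- c(0.5, 0.5)

for (iter in 1:200) {
  # Step 2 (E): responsibility of each component for each point
  d1 <- pi_k[1] * dnorm(x, mu[1], sqrt(sigma2[1]))
  d2 <- pi_k[2] * dnorm(x, mu[2], sqrt(sigma2[2]))
  r1 <- d1 / (d1 + d2)
  r2 <- 1 - r1

  # Step 3 (M): re-estimate parameters from the soft assignments
  mu_new <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
  sigma2 <- c(sum(r1 * (x - mu_new[1])^2) / sum(r1),
              sum(r2 * (x - mu_new[2])^2) / sum(r2))
  pi_k <- c(mean(r1), mean(r2))

  # Step 4: stop when the means barely move
  if (max(abs(mu_new - mu)) < 1e-8) break
  mu <- mu_new
}

round(mu, 2); round(sqrt(sigma2), 2); round(pi_k, 2)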

Understanding the GMM Architecture

Think of a dataset as a landscape filled with points. Rather than being scattered at random, the points are organized into clusters, each with its own "center of gravity" and "shape." The center of a cluster is represented by a mean vector that contains a value for each feature. The covariance matrix describes how the data points vary around that mean, and thus determines the shape of the cluster.

In a Gaussian Mixture Model, each data point is assumed to belong to one of these clusters and is assigned a membership probability for each of them. Compared with clustering algorithms such as k-means, this probabilistic approach offers several advantages:

  1. GMMs allow data points to belong partially to several clusters, acknowledging that the boundaries between clusters are not always well-defined.
  2. Combined with model-selection criteria, GMMs can infer a suitable number of components, whereas k-means requires the number of clusters to be fixed beforehand.
  3. Thanks to the covariance matrix, GMMs are flexible enough to capture clusters of varying shapes and orientations.

R provides several packages for GMM clustering, ClusterR being one of them. ClusterR offers a set of functions for fitting and analysing GMMs, including:

  • GMM() : the main function, used to fit a GMM to your dataset.
  • predict() : predicts cluster membership for new data points.
  • The optimal number of clusters can be selected using information criteria such as AIC and BIC (a short usage sketch follows this list).
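A brief sketch of this workflow (the argument names gaussian_comps and dist_mode and the predict_GMM() helper follow ClusterR's documentation; treat them as assumptions to verify against your installed version):

R

library(ClusterR)

# Two well-separated synthetic clusters
set.seed(123)
dat <- rbind(matrix(rnorm(100, mean = 0, sd = 1), ncol = 2),
             matrix(rnorm(100, mean = 5, sd = 1), ncol = 2))

# Fit a two-component GMM
fit <- GMM(dat, gaussian_comps = 2, dist_mode = "eucl_dist", em_iter = 10)

# Cluster probabilities and labels for (new) data points
pred <- predict_GMM(dat, fit$centroids, fit$covariance_matrices, fit$weights)
table(pred$cluster_labels)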

mclust:

  • The mclust package offers a number of advanced features, such as model selection and automatic determination of the number of components.
  • The optimal model can be selected according to different criteria, such as BIC and ICL.
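As a quick illustration (assuming a numeric matrix data), calling Mclust() without fixing G makes it scan a range of covariance structures and component counts and keep the BIC-best model, while mclustICL() performs the analogous scan under the ICL criterion:

R

library(mclust)

# Scan model shapes and component counts (default G = 1:9), keep the BIC-best
fit <- Mclust(data)
fit$G          # selected number of components
fit$modelName  # selected covariance structure

# The same kind of scan under the ICL criterion
icl <- mclustICL(data)
summary(icl)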

mixtools:

  • normalmixEM() is a function for fitting GMMs with particular model options.
  • Cluster membership probabilities are available from the fitted model (its posterior component).
  • AIC and BIC make it easy to compare different GMM models.
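A minimal sketch with mixtools on univariate data (the synthetic data, seed, and k = 2 are assumptions):

R

library(mixtools)

set.seed(1)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 1))

# Fit a two-component univariate Gaussian mixture by EM
fit <- normalmixEM(x, k = 2)

fit$mu               # estimated component means
fit$lambda           # estimated mixing weights
head(fit$posterior)  # per-point membership probabilities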

Steps Involved in GMM Clustering using R

  1. Load and preprocess the data so that it is properly formatted for clustering.
  2. Use methods such as the elbow method or silhouette analysis to select the number of clusters.
  3. Apply a GMM to your data, for example using the 'mclust' package in R.
  4. Assign each data point to the cluster with the highest membership probability, then evaluate the clustering with metrics such as the silhouette score or the Calinski-Harabasz index (a silhouette sketch follows this list).
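As a hedged sketch of the evaluation in step 4, the average silhouette width can be computed with the cluster package. This assumes a data matrix data and a fitted mclust model gmm_model, as in the example further below:

R

library(cluster)

# Hard assignments: the most probable cluster for each point
cl <- predict(gmm_model)$classification

# Per-point silhouette widths, then the dataset-wide average
sil <- silhouette(cl, dist(data))
mean(sil[, "sil_width"])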

Important Packages for this model

The mclust package is commonly used for model-based clustering, density estimation, and discriminant analysis with Gaussian mixture models. To estimate the parameters of the models, it uses the Expectation-Maximization (EM) algorithm. The package is known for its broad family of mixture models, for model selection according to criteria such as BIC or ICL, and for its support for density estimation and discriminant analysis. It proves highly useful for organizing and grouping multivariate data that follows a Gaussian distribution.

Customer Segmentation

R

# Install the 'mclust' package if needed, then load it
# install.packages("mclust")
library(mclust)
 
# Generate a synthetic dataset with three clusters
set.seed(123)
data <- rbind(matrix(rnorm(100, mean = 0, sd = 1), ncol = 2),
              matrix(rnorm(100, mean = 5, sd = 1), ncol = 2),
              matrix(rnorm(100, mean = 10, sd = 1), ncol = 2))
 
# Perform GMM clustering
# G represents the number of clusters
gmm_model <- Mclust(data, G = 3) 
 
# Get the cluster assignments
cluster_assignments <- predict(gmm_model)$classification
 
# Visualize the results
plot(data, col = cluster_assignments, main = "GMM Clustering Results")
# mclust stores component means as a (dimension x cluster) matrix; transpose to plot
points(t(gmm_model$parameters$mean), col = 1:3, pch = 8, cex = 2)


Output:


Gaussian mixture model clustering using R

Make sure you have the mclust package installed if you haven't already. The package offers model-selection options that you can explore based on your data.

  • Create a dataset with three clusters using the rnorm function.
  • Utilize the Mclust() function to fit a Gaussian Mixture Model to the dataset, specifying the desired number of clusters (the G parameter).
  • Determine the cluster assignments by using predict(gmm_model)$classification.
  • Present the clustering results visually through a scatter plot.

Anomaly Detection

R

# Install the 'mclust' package if needed, then load it
# install.packages("mclust")
library(mclust)
 
# Generate a synthetic dataset with normal and anomalous data
set.seed(123)
normal_data <- matrix(rnorm(1000, mean = 0, sd = 1), ncol = 2)
anomalous_data <- matrix(rnorm(50, mean = 10, sd = 5), ncol = 2)
data <- rbind(normal_data, anomalous_data)
 
# Fit a Gaussian Mixture Model to the data
# Assuming there are 2 components (normal and anomalous)
gmm_model <- Mclust(data, G = 2) 
summary(gmm_model)


Output:

---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------

Mclust VII (spherical, varying volume) model with 2 components:

log-likelihood     n   df        BIC        ICL
     -1667.953   525    7  -3379.749  -3380.132

Clustering table:
   1    2
 500   25

The Gaussian finite mixture model, specifically Mclust VII with 2 components, was fitted to a dataset of 525 points using the Expectation-Maximization (EM) algorithm.

  • The log-likelihood, a measure of model fit, is approximately -1667.953.
  • The BIC (Bayesian Information Criterion) and ICL (Integrated Completed Likelihood) are -3379.749 and -3380.132, respectively; in mclust's convention these criteria are maximized, so higher values indicate a better model.
  • The model has 7 degrees of freedom, and the clustering table reveals two clusters with 500 and 25 data points, respectively.

Overall, the model suggests a good fit to the data with well-defined clusters based on the provided metrics.

Visualize the result of Anomaly Detection

R

# Get the log-density of each data point under the fitted mixture.
# Note: predict(gmm_model)$z returns posterior membership probabilities,
# not likelihoods, so we use mclust's dens() with the fitted parameters.
log_dens <- dens(data = data, modelName = gmm_model$modelName,
                 parameters = gmm_model$parameters, logarithm = TRUE)
 
# Calculate anomaly scores: the lower the density, the more anomalous
anomaly_scores <- -log_dens
 
# Set a threshold to classify anomalies (the 95th percentile of scores)
threshold <- quantile(anomaly_scores, 0.95)
 
# Identify anomalies based on the threshold
anomalies <- data[anomaly_scores > threshold, ]
 
# Visualize the results
plot(data, pch = 19, col = ifelse(anomaly_scores > threshold, "red", "blue"),
     main = "Anomaly Detection using Gaussian Mixture Model")
points(anomalies, pch = 3, col = "red")


Output:


Gaussian mixture model clustering using R

To exercise the model, we generate a dataset containing both normal and anomalous data points.

  • We then employ the Mclust() function to fit a Gaussian Mixture Model to this dataset.
  • Using dens() with the fitted parameters, we obtain the log-density of each data point under the mixture.
  • From these log-densities we calculate anomaly scores, with higher scores indicating deviation from normality.
  • Anomalies are identified by applying a threshold, here the 95th percentile of the anomaly scores.
  • Finally, we visualize the dataset with the anomalies highlighted in red.

Advantages/Disadvantages of Gaussian Mixture Model (GMM) Clustering in R:

Advantages

  1. Models varied cluster shapes: GMM can capture elliptical or elongated clusters, unlike clustering algorithms that assume spherical clusters.
  2. Handles high-dimensional data: GMM can process datasets with many features and can be used to analyse complex data involving many variables.
  3. Provides soft clustering: GMM assigns each data point a probability for every cluster, allowing a point to belong to multiple clusters with different weights, unlike hard clustering algorithms that assign each point to exactly one cluster (see the snippet after this list).
  4. Estimates cluster densities: GMM estimates the probability density function of each cluster, making it possible to understand how the data are distributed within and between clusters.
  5. Leverages the Expectation-Maximization (EM) algorithm: GMM uses the EM algorithm for parameter estimation, which is known for its efficiency and its robustness to incomplete data and missing values.
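To make the soft clustering of advantage 3 concrete, the membership probabilities can be inspected directly. Continuing the earlier customer-segmentation example (this assumes the fitted gmm_model from that code):

R

# Posterior membership probabilities from mclust: one column per
# component, and each row sums to 1 across the G clusters
head(predict(gmm_model)$z)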

Disadvantages

  1. The initial values of the cluster means and covariances may affect the performance of GMM; poor initialization can lead to unsatisfactory results, such as convergence to a local optimum.
  2. Training GMMs can be computationally intensive, in particular for large datasets and data with more than a few dimensions.
  3. GMM requires the number of clusters to be specified in advance (unless model selection is used); choosing the wrong number can significantly affect the result.
  4. GMM provides cluster probabilities, but the fitted components can be harder to interpret than the output of simpler clustering algorithms.
  5. A GMM may overfit the data, which can lead to poor generalization to unseen data.

