Open In App

K-Mode Clustering in Python

Last Updated : 15 Feb, 2023
Improve
Improve
Like Article
Like
Save
Share
Report

K-mode clustering is an unsupervised machine-learning technique used to group a set of data objects into a specified number of clusters, based on their categorical attributes. The algorithm is called “K-Mode” because it uses modes (i.e. the most frequent values) instead of means or medians to represent the clusters.

In K-means clustering when we used categorical data after converting it into a numerical form. it doesn’t give a good result for high-dimensional data. 
So, Some changes are made for categorical data t.

  • Replace Euclidean distance with Dissimilarity metric
  • Replace Mean by Mode for cluster centers.
  • Apply a frequency-based method in each iteration to update the mode.

And then this is called K-MODE Clustering because of MODE.

Dissimilarity measurements between two data objects

K-modes is an algorithm for clustering categorical data. It is used to partition a dataset into a specified number of clusters, where each cluster is characterized by a mode, which is the most frequent categorical value in the cluster.

Similarity and dissimilarity measurements are used to determine the distance between the data objects in the dataset. In the case of K-modes, these distances are calculated using a dissimilarity measure called the Hamming distance. The Hamming distance between two data objects is the number of categorical attributes that differ between the two objects.

Let x and y be two categorical data objects defined by m features or attributes.

d(x,y) = \sum_{j=1}^{m} \delta(x_{j},y_{j})

                                                        Where,

\delta(x_{j},y_{j}) = \begin{cases} 0 & \text{ if } x_{j}=  y_{j} \\ 1 & \text{ if } x_{j}\neq  y_{j} \end{cases}

For example, consider the following dataset with three categorical attributes:

S.NoAttribute 1 Attribute 2 Attribute 3 
1ABC
2ABD
3ACE
4BCE

To calculate the Hamming distance between data objects 1 and 2, we compare their values for each attribute and count the number of differences. In this case, there is one difference (Attribute 3 is C for object 1 and D for object 2), so the Hamming distance between objects 1 and 2 is 1.

To calculate the Hamming distance between objects 1 and 3, we again compare their values for each attribute and count the number of differences. In this case, there are two differences (Attribute 2 is B for object 1 and C for object 3, and Attribute 3 is C for object 1 and E for object 3), so the Hamming distance between objects 1 and 3 is 2.

To calculate the Hamming distance between objects 1 and 4, we again compare their values for each attribute and count the number of differences. In this case, there are three differences (Attribute 1 is A for objects 1 and B for object 4, Attribute 2 is B for object 1 and C for object 4, and Attribute 3 is C for objects 1 and E for object 4), so the Hamming distance between objects 1 and 4 is 3.

Data objects with a smaller Hamming distance are considered more similar, while objects with a larger Hamming distance is considered more dissimilar.

How K-Modes clustering works:

 Let X be a set of categorical data objects of   X = \begin{bmatrix} x_{11}, & ... & x_{1n}\\ ... & ... & ...\\ x_{n1},& ... & x_{nm} \end{bmatrix}              that can be denoted as X = [X_{1},X_{2},...,X_{n}]             . And the mode of Z is a vector Q = [q_{1},q_{2},...,q_{m}]             then, minimize

  D(X,Q) = \sum_{i=1}^{n}d(X_{i},Q)               

Apply dissimilarity metric equation for data objects 

D(X,Q) = \sum_{i=1}^{n}\sum_{j=1}^{m}\delta(x_{ij},Q)

Suppose we want to K cluster, Then we have Q = [q_{k1},q_{k1},….,q_{km}] \epsilon Q

C(Q) = \sum_{k=1}^{K}\sum_{j=1}^{n}\sum_{j=1}^{m}\delta(x_{ij},q_{kj})              

 

The main task for K-Modes algorithm is to minimize this C(Q) cost function. 

It consists of the following steps.

  1. Select K data objects for each cluster.
  2. Calculate dissimilarities  D(X,Q)            and allocate each data object to nearest cluster.
  3. Calculate the new modes for all clusters.
  4. Repeat step 2 and 3 until the cluster will become stable.

Some variations of the K-modes algorithm may use different methods for updating the centroids (modes) of the clusters, such as taking the weighted mode or the median of the objects within each cluster.

Overall, the goal of K-modes clustering is to minimize the dissimilarities between the data objects and the centroids (modes) of the clusters, using a measure of categorical similarity such as the Hamming distance.

Implementation of the k-mode clustering algorithm in Python:

Python3

import numpy as np
from scipy.stats import mode
  
# Define the data set with three categorical variables
data = np.array([['A', 'B', 'C'],
                 ['B', 'C', 'A'],
                 ['C', 'A', 'B'],
                 ['A', 'C', 'B'],
                 ['A', 'A', 'B']])
  
# Choose the number of clusters k
k = 2
  
# Initialize the modes for each cluster
modes = [['A', 'B', 'C'],
         ['C', 'B', 'A']]
  
# Initialize the cluster assignments for each data objects
clusters = np.zeros(data.shape[0], dtype=int)
clusters_prev = np.zeros(data.shape[0], dtype=int)
  
# Iteratively assign data objects to
# their clusters and update the cluster modes
for i in range(10):
    # Assign data objects to the closest cluster
    for j, object in enumerate(data):
        distances = np.array([sum(object != mode) for mode in modes])
        clusters[j] = np.argmin(distances)
  
    # Update the cluster modes
    for j in range(k):
        modes[j] = mode(data[clusters == j]).mode[0]
  
    # Check if the cluster assignments have converged
    if (clusters == clusters_prev).all():
        break
  
    # Store the current cluster assignments
    clusters_prev = clusters
  
# Print the cluster assignments for each data point
print("The cluster assignments for each data object: ", clusters)
  
# Print the modes for each cluster
print("Modes for each cluster: ", modes)

                    

Output:

The cluster assignments for each data point:  [0 1 0 0 0]
Modes for each cluster:  [array(['A', 'A', 'B'], dtype='<U1'), array(['B', 'C', 'A'], dtype='<U1')]

The above code is an implementation of the k-mode clustering algorithm, which is a clustering algorithm that can be used to group a set of data points with categorical attributes into clusters. The code uses the numpy and scipy.stats modules, which are commonly used for scientific computing in Python.

  • The NumPy module provides functions for working with arrays and matrices.
  • The scipy.stats module provides functions for working with statistical data, which is used to compute the modes for the clusters in the code.
  • The code first defines the data set with three categorical variables and then chooses the number of clusters k that the data should be grouped into. 
  • It then initializes the modes for each cluster and the cluster assignments for each data point.
  • The code then uses a for loop to iteratively assign data points to their clusters and update the cluster modes. 
  • At each iteration, the code first assigns each data point to the cluster with the most similar modes for its categorical attributes. It does this by computing the distances between the modes for each cluster and the values for the categorical attributes of the data point and then assigning the data point to the cluster with the smallest distance.
  • Next, the code updates the cluster modes by computing the modes for the categorical attributes of the data points that are assigned to each cluster. It then checks if the cluster assignments have converged, and if they have, it breaks out of the for a loop. Otherwise, it stores the current cluster assignments in the clusters_prev variable and continues with the next iteration.
  • Finally, the code prints the cluster assignments for each data point, and the modes for each cluster.
  • This shows that the first, third, fourth, and fifth data points have been assigned to the first cluster, and the second data points have been assigned to the second cluster. 
  • The modes for each cluster are also printed, which are [‘A’, ‘A’, ‘B’] for the first cluster and [‘B’, ‘C’, ‘A’] for the second cluster.
     

Cluster with kmodes library.

Python3

pip install kmodes

                    

Find the optimal number of clusters in the K-Mode algorithm?

Elbow method is used to find the optimal number of clusters

Python3

# importing necessary libraries
import pandas as pd
import numpy as np
# !pip install kmodes
from kmodes.kmodes import KModes
import matplotlib.pyplot as plt
%matplotlib inline
  
# Elbow curve to find optimal K
cost = []
K = range(1,5)
for k in list(K):
    kmode = KModes(n_clusters=k, init = "random", n_init = 5, verbose=1)
    kmode.fit_predict(data)
    cost.append(kmode.cost_)
      
plt.plot(K, cost, 'x-')
plt.xlabel('No. of clusters')
plt.ylabel('Cost')
plt.title('Elbow Curve')
plt.show()

                    

Outputs :

Elbow Method - Geeksforgeeks

Elbow Method

As we can see from the graph there is an elbow-like shape at 2 and 3. Now it we can consider either 2 or 3 cluster. Let’s consider Number of cluster =2

Python3

# Building the model with 3 clusters
kmode = KModes(n_clusters=2, init = "random", n_init = 5, verbose=1)
clusters = kmode.fit_predict(data)
clusters

                    

Outputs :

array([1, 0, 1, 1, 1], dtype=uint16)

This also shows that the first, third, fourth, and fifth data points have been assigned to the first cluster, and the second data points have been assigned to the second cluster.  So, our previous answer was 100 % correct.



Like Article
Suggest improvement
Previous
Next
Share your thoughts in the comments

Similar Reads