
Comparing Different Clustering Algorithms on Toy Datasets in Scikit Learn


In machine learning, we generally come across two kinds of problems, regression and classification, both of which are supervised learning problems. In unsupervised learning, we instead try to form clusters out of the data to find patterns in the dataset provided, and for that we have different types of clustering algorithms. In this article, we will look at how different clustering algorithms can be trained and compared with the help of a toy dataset.

In this article we will learn about the KMeans, Agglomerative, DBSCAN, and OPTICS clustering algorithms, and how they compare with one another.

Importing Libraries and Dataset

Python libraries make it very easy for us to handle the data and perform typical and complex tasks with a single line of code.

  • Pandas – This library helps to load the data frame in a 2D array format and has multiple functions to perform analysis tasks in one go.
  • Numpy – Numpy arrays are very fast and can perform large computations in a very short time.
  • Matplotlib/Seaborn – These libraries are used to draw visualizations.

Let’s use the iris dataset as a toy dataset for clustering.

Python3




import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Load the iris dataset and keep only the
# feature columns (drop the species labels)
df = sns.load_dataset('iris')
x = df.drop('species', axis=1)


The Iris dataset consists of three species of flowers, so we expect three clusters to form.
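
As a quick sanity check (a minimal sketch, not part of the original code), we can count how many samples each species has:

Python3

# Each of the three species has 50 samples, so a good
# clustering should recover three groups of roughly this size
print(df['species'].value_counts())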

KMeans Clustering Algorithm:

KMeans clustering is an unsupervised machine learning algorithm that is mainly used when we have to cluster data that do not have labels assigned to them. In the KMeans clustering algorithm, clusters are formed around centroids; hence it is also called a centroid-based algorithm, where k defines the number of centroids (groups) to form.

Python3




from sklearn.cluster import KMeans

# Within-Cluster Sum of Squares (WCSS)
wcss = []

# Using the Elbow method to determine
# the proper n (number of clusters)
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i,
                    init='k-means++',
                    random_state=0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)


WCSS keeps decreasing as “n” (the number of clusters) increases. Plotting WCSS against the number of clusters gives the elbow shape, and the last sharp bend of the elbow indicates the number of clusters to choose for KMeans. The kmeans.inertia_ attribute returns the within-cluster sum of squared distances.

Python3




plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()


Output:

Inertia with respect to the number of clusters formed in the data

The plt.plot() function takes two arguments here: the first is the range of values (or an array of points) to be plotted on the x-axis, and the second is the corresponding array of values for the y-axis.

  1. From the Elbow diagram, it can be concluded that the appropriate number of clusters is three.
  2. There is a drastic change in WCSS at cluster number 3, which indicates that three clusters should ideally be formed.
  3. The number of species in the iris dataset and the number of clusters we get after applying KMeans are the same.

Python3




kmeans = KMeans(n_clusters=3,
                init='k-means++',
                max_iter=300,
                n_init=10,
                random_state=0)
y_kmeans = kmeans.fit_predict(x)

# Convert the DataFrame to a NumPy array
# so that boolean row/column indexing works
x = np.array(x)
plt.scatter(x[y_kmeans == 0, 0],
            x[y_kmeans == 0, 1],
            s=100, c='purple',
            label='Iris-setosa')
plt.scatter(x[y_kmeans == 1, 0],
            x[y_kmeans == 1, 1],
            s=100, c='orange',
            label='Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 0],
            x[y_kmeans == 2, 1],
            s=100, c='green',
            label='Iris-virginica')
plt.legend()
plt.show()


Output:

Clusters formed in the iris data

Agglomerative clustering:

Agglomerative clustering is a type of hierarchical clustering algorithm in which smaller clusters are successively merged into larger clusters until, at the end, a single cluster contains all the data points.

It starts from a distance matrix that holds the pairwise distances between points. At the start, every point is treated as a singleton cluster; the clusters are then merged (agglomerated) according to the linkage criterion used.

Python3




import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
import seaborn as sns
import numpy as np
  
df = sns.load_dataset('iris')
x = df.drop('species', axis=1)


We will be using the same iris dataset that we have used for KMeans Clustering.

How do we determine the value of N, i.e., the number of clusters?

  1. In KMeans it was simple and could be done through the Elbow method.
  2. In Agglomerative clustering, it can be concluded from a dendrogram.
  3. To decide the number of clusters from a dendrogram:
  4. Find the longest vertical line that is not cut by any horizontal line passing through that region.
  5. Count all such longest vertical lines.
  6. That count is exactly the optimal number of clusters to form with the algorithm.
  7. For reference, you can apply these steps to a dendrogram plotted from the data, as in the sketch after this list.
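
The dendrogram itself is not plotted in the original code, so here is a minimal sketch of how it could be drawn with scipy.cluster.hierarchy (imported above as sch); the choice of Ward linkage is an assumption, not something fixed by the article:

Python3

# Build the linkage matrix (Ward linkage is assumed here;
# other linkage methods also work) and draw the dendrogram
dendrogram = sch.dendrogram(sch.linkage(np.array(x), method='ward'))
plt.title('Dendrogram')
plt.xlabel('Samples')
plt.ylabel('Euclidean distance')
plt.show()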

Python3




from sklearn.cluster import AgglomerativeClustering
  
aglo = AgglomerativeClustering(n_clusters=3)
aglo.fit(x)
aglo_labels = aglo.labels_
  
# Visualizing the first 10 labels; each label is
# the cluster a data point has been assigned to
aglo_labels[:10]


Output:

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
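
To see how well these clusters line up with the actual species, one option (a hedged sketch; pandas is assumed to be imported as pd, which the original code does not do) is a simple cross-tabulation:

Python3

import pandas as pd

# Rows are the true species, columns are the cluster
# labels produced by agglomerative clustering
print(pd.crosstab(df['species'], aglo_labels))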

Python3




# Plot each agglomerative cluster with its own colour
x = np.array(x)
plt.scatter(x[aglo_labels == 0, 0],
            x[aglo_labels == 0, 1],
            s=100, c='purple',
            label='Iris-setosa')
plt.scatter(x[aglo_labels == 1, 0],
            x[aglo_labels == 1, 1],
            s=100, c='orange',
            label='Iris-versicolour')
plt.scatter(x[aglo_labels == 2, 0],
            x[aglo_labels == 2, 1],
            s=100, c='green',
            label='Iris-virginica')
plt.legend()
plt.show()


Output:

Clusters formed in the iris data

DBSCAN Clustering Algorithm

The DBSCAN clustering algorithm is an unsupervised machine learning algorithm in which there is no need to pre-specify the number of clusters to form. The clustering depends entirely on the parameters chosen (eps and min_samples), and those parameters are set depending on the problem statement or business use case.

To understand the DBSCAN algorithm in detail, follow the dedicated article on DBSCAN.

Python3




from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
import seaborn as sns
import numpy as np
  
df = sns.load_dataset('iris')
x = df.drop('species', axis=1)
clustering = DBSCAN(eps=0.4,
                    min_samples=5).fit(x)
clustering.labels_


In the above code we used an epsilon value directly, just for the sake of initialization, but to form proper clusters it is necessary to find the optimal value of epsilon. To get it, we do the following:

  1. Find the nearest neighbors using scikit-learn.
  2. The reason for finding nearest neighbors is to see in which distance range we have the maximum number of points, so that we can get the optimal number of clusters accordingly.
  3. If we then sort and plot those distances, we get a “Distance” vs “number of points” graph from which we can read off the optimal distance (epsilon) value.

Python3




from sklearn.neighbors import NearestNeighbors

neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(x)
distances, indices = nbrs.kneighbors(x)

# Keep the distance to the nearest neighbour
# (column 0 is each point's distance to itself)
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.plot(distances)
plt.xlabel('Points sorted by distance')
plt.ylabel('Distance to nearest neighbour')
plt.show()


Output:

Sorted nearest-neighbour distance (k-distance) plot

The above plot shows that the point of abrupt change is somewhere around 0.5–0.6; hence, with trial and error and the silhouette score, we can get the optimal value of epsilon.
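
As a rough illustration of that trial and error, here is a hedged sketch that scans a few candidate eps values (the candidates are assumptions) and scores each DBSCAN labelling with the silhouette score, skipping runs that produce fewer than two clusters:

Python3

from sklearn.metrics import silhouette_score

# Try a few candidate eps values around the elbow of the
# k-distance plot (these candidates are an assumption)
for eps in [0.3, 0.4, 0.5, 0.6, 0.7]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(x)
    mask = labels != -1                      # ignore noise points
    n_clusters = len(set(labels[mask]))
    if n_clusters < 2:
        print(f'eps={eps}: fewer than 2 clusters, skipping')
        continue
    score = silhouette_score(x[mask], labels[mask])
    print(f'eps={eps}: {n_clusters} clusters, silhouette={score:.3f}')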

Python3




x = np.array(x)

# Note: the label -1 marks points that DBSCAN
# treats as noise rather than a regular cluster
plt.scatter(x[clustering.labels_ == -1, 0],
            x[clustering.labels_ == -1, 1],
            s=100, c='purple',
            label='Iris-setosa')
plt.scatter(x[clustering.labels_ == 0, 0],
            x[clustering.labels_ == 0, 1],
            s=100, c='orange',
            label='Iris-versicolour')
plt.scatter(x[clustering.labels_ == 1, 0],
            x[clustering.labels_ == 1, 1],
            s=100, c='green',
            label='Iris-virginica')
plt.legend()
plt.show()


Output:

Clusters formed in the iris data

OPTICS Clustering Algorithm

The OPTICS clustering algorithm is essentially a modification of the DBSCAN clustering algorithm; it simply adds a few more concepts on top of DBSCAN.

This technique is different from other clustering algorithms because, rather than explicitly segmenting the data into clusters, it creates a visualization of the reachability distance and uses that visualization to cluster the data.

Python3




import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
import seaborn as sns
import numpy as np
from sklearn.cluster import OPTICS
  
df = sns.load_dataset('iris')
x = df.drop('species', axis=1)
clustering = OPTICS(eps=0.8,
                    min_samples=13).fit(x)
clustering.labels_


Output:

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,  0,
        0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1, -1,  1, -1,  1, -1,  1,  1,  1,
       -1,  1,  2,  1,  2,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  2, -1,
       -1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1, -1,  2,
        3,  3,  3, -1, -1, -1, -1, -1,  3,  3,  3,  2, -1,  3,  3, -1, -1,
       -1,  3,  2, -1,  2,  3, -1,  2,  2,  3, -1, -1, -1,  3,  2, -1, -1,
        3,  3,  2,  3,  3,  3,  2,  3,  3,  3,  2,  3,  3,  2])
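
Since OPTICS is built around the reachability plot, here is a minimal sketch of that visualization using the fitted model's reachability_ and ordering_ attributes:

Python3

# Reachability distances in the order OPTICS visits
# the points; valleys in this plot correspond to clusters
reachability = clustering.reachability_[clustering.ordering_]
plt.plot(reachability)
plt.title('OPTICS reachability plot')
plt.xlabel('Points (ordering)')
plt.ylabel('Reachability distance')
plt.show()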

Now let’s plot the clusters formed in the data, using a different color for each cluster, to analyze them.

Python3




x = np.array(x)

# Note: this OPTICS run produced labels -1, 0, 1, 2 and 3;
# mirroring the DBSCAN plot above, only the first three
# label values are drawn here
plt.scatter(x[clustering.labels_ == -1, 0],
            x[clustering.labels_ == -1, 1],
            s=100, c='purple',
            label='Iris-setosa')
plt.scatter(x[clustering.labels_ == 0, 0],
            x[clustering.labels_ == 0, 1],
            s=100, c='orange',
            label='Iris-versicolour')
plt.scatter(x[clustering.labels_ == 1, 0],
            x[clustering.labels_ == 1, 1],
            s=100, c='green',
            label='Iris-virginica')
plt.legend()
plt.show()


Output:

Clusters formed in the iris data

Comparison Between Different Clustering Algorithms:

  1. In the KMeans clustering algorithm, once the clusters are created, a new data point can be assigned to one of them with the help of the predict function (see the sketch after this list).
  2. We cannot assign a new point to clusters created by the DBSCAN algorithm, because DBSCAN recomputes the clustering every time; hence the predict function cannot be used with DBSCAN.
  3. The KMeans algorithm is highly affected by outliers and is also highly dependent on the initial values of the centroids, while DBSCAN does not have those disadvantages.
  4. In KMeans and hierarchical clustering, we have to pre-specify the value of k, i.e., the number of clusters, while in DBSCAN the value of epsilon decides it.
  5. Hierarchical clustering is computationally expensive compared to other clustering algorithms.
  6. The output we get from DBSCAN and OPTICS clustering can be fairly different from the real grouping. This mostly happens because the wrong values are chosen for eps and min_points; hence, proper domain knowledge is required before applying these algorithms so that eps and min_points can be chosen properly.
  7. The nearest-neighbour method can also be used to determine an appropriate epsilon value.
  8. DBSCAN and OPTICS clustering algorithms handle outliers efficiently, while the KMeans clustering algorithm does not properly handle outliers and noisy datasets.
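
As a sketch of point 1, a previously unseen measurement can be assigned to one of the fitted KMeans clusters with predict; the sample values below are made up for illustration, and kmeans is assumed to be the 3-cluster model fitted earlier:

Python3

# A hypothetical new flower measurement:
# sepal length, sepal width, petal length, petal width
new_point = np.array([[5.0, 3.4, 1.5, 0.2]])

# KMeans can assign new data to an existing cluster;
# DBSCAN has no equivalent predict method
print(kmeans.predict(new_point))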


Last Updated : 02 Jan, 2023