Silhouette Index – Cluster Validity index | Set 2

Last Updated: 22 May, 2019

Prerequisite: Dunn index and DB index – Cluster Validity indices

Many clustering algorithms are applied to analyze very large datasets, but most of them provide no built-in means of validating or evaluating their results. This makes it difficult to decide which clustering is best and should be taken forward for analysis.

Several cluster validity indices exist for identifying the optimal clustering –

1. Silhouette Index
2. Dunn Index
3. DB Index
4. CS Index
5. I-Index
6. XB or Xie Beni Index

Now, let's discuss the Silhouette Index, an internal cluster validity index.

Silhouette Index –

Silhouette analysis is a method for interpreting and validating the consistency within clusters of data. The silhouette value measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation), so it can be used to study the separation distance between the resulting clusters. The silhouette plot displays how close each point in one cluster is to points in the neighboring clusters, and thus provides a visual way to assess parameters such as the number of clusters.

How Silhouette Analysis Works

The silhouette validation technique calculates the silhouette index for each sample, the average silhouette index for each cluster, and the overall average silhouette index for the dataset. In this way, each cluster can be represented by its silhouette index, which compares how tight the cluster is with how well separated it is from the other clusters.
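The three levels described above can be sketched with scikit-learn's `silhouette_samples` and `silhouette_score`. This is a minimal illustration, not a definitive implementation: the toy data `X`, the fixed `labels`, and the variable names are assumptions made up for the example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy data (hypothetical): two tight, well-separated groups of three points.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

# 1. Silhouette value of every individual sample.
per_sample = silhouette_samples(X, labels)

# 2. Average silhouette value of each cluster.
per_cluster = {int(c): float(per_sample[labels == c].mean())
               for c in np.unique(labels)}

# 3. Overall average over all samples; silhouette_score returns the same number.
overall = float(per_sample.mean())

print("per sample :", np.round(per_sample, 3))
print("per cluster:", per_cluster)
print("overall    :", overall, "vs silhouette_score:", silhouette_score(X, labels))
```

Because the two groups are far apart relative to their size, every per-sample value here is close to 1.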

Calculation of Silhouette Value –
If the silhouette value is high, the object is well matched to its own cluster and poorly matched to neighboring clusters. The Silhouette Coefficient is calculated for each sample from its mean intra-cluster distance (a) and its mean nearest-cluster distance (b). It is defined as –

S(i) = ( b(i) – a(i) ) / max { a(i), b(i) }

Where,

• a(i) is the average dissimilarity of the ith object to all other objects in the same cluster.
• b(i) is the average dissimilarity of the ith object to all objects in the closest neighboring cluster, i.e. the smallest such average over all clusters the object does not belong to.
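The definitions of a(i), b(i) and S(i) can be checked by hand on a tiny dataset. The sketch below uses only the standard library and assumes Euclidean distance as the dissimilarity, at least two clusters, and the common convention that a sample alone in its cluster gets S(i) = 0. The function name `silhouette_values` and the four 1-D points are made up for illustration.

```python
import math

def silhouette_values(points, labels):
    """Return a list of (a_i, b_i, s_i) tuples, one per point (Euclidean distance)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    results = []
    for i, label in enumerate(labels):
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Singleton cluster: a(i) is undefined, S(i) is set to 0 by convention.
            results.append((0.0, b, 0.0))
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        denom = max(a, b)
        results.append((a, b, (b - a) / denom if denom > 0 else 0.0))
    return results

# Hypothetical 1-D data: cluster "A" = {0, 2}, cluster "B" = {10, 12}.
points = [(0.0,), (2.0,), (10.0,), (12.0,)]
labels = ["A", "A", "B", "B"]
values = silhouette_values(points, labels)

for p, (a, b, s) in zip(points, values):
    print(f"x={p[0]:5.1f}  a={a:.1f}  b={b:.1f}  S={s:.3f}")
# For x=0:  a = 2, b = (10 + 12) / 2 = 11, S = 9/11
# For x=2:  a = 2, b = (8 + 10) / 2 = 9,   S = 7/9
```

The points nearer the other cluster (x = 2 and x = 10) get the slightly lower value 7/9, because their b(i) is smaller while a(i) is unchanged.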

Range of Silhouette Value –

S(i) always lies in the range [-1, 1]: since dissimilarities are non-negative, the numerator b(i) – a(i) can never exceed max { a(i), b(i) } in absolute value.

1. If the silhouette value is close to 1, the sample is well clustered and assigned to an appropriate cluster.
2. If the silhouette value is close to 0, the sample lies about equally far from its own cluster and the nearest neighboring cluster, so it could just as well be assigned to either. This indicates overlapping clusters.
3. If the silhouette value is close to –1, the sample is on average closer to the neighboring cluster than to its own, so it has probably been assigned to the wrong cluster.
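The three cases can be reproduced by plugging values straight into the formula. In this small sketch the (a, b) pairs are hypothetical numbers chosen only to hit each regime, and `silhouette` is an illustrative helper name.

```python
def silhouette(a, b):
    """Silhouette value from mean intra-cluster (a) and nearest-cluster (b) distance."""
    return (b - a) / max(a, b)

# Hypothetical (a, b) pairs, one per regime.
cases = {
    "well clustered":     (1.0, 10.0),   # own cluster much closer than the neighbor
    "on the boundary":    (5.0, 5.0),    # both clusters equally far away
    "likely misassigned": (10.0, 1.0),   # neighbor much closer than own cluster
}

for name, (a, b) in cases.items():
    print(f"{name:>18}: a={a:4.1f}  b={b:4.1f}  S={silhouette(a, b):+.2f}")
```

The three cases give S = 0.9, 0 and -0.9 respectively. Swapping a and b only flips the sign, which is why the value is symmetric around 0.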

Below is a Python example that uses scikit-learn to compute the average silhouette score for several numbers of clusters:

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Generating the sample data from make_blobs
    X, Y = make_blobs()

    no_of_clusters = [2, 3, 4, 5, 6]

    for n_clusters in no_of_clusters:

        cluster = KMeans(n_clusters = n_clusters)
        cluster_labels = cluster.fit_predict(X)

        # The silhouette_score gives the
        # average value for all the samples.
        silhouette_avg = silhouette_score(X, cluster_labels)

        print("For no of clusters =", n_clusters,
              " The average silhouette_score is :", silhouette_avg)

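Output (the exact values differ from run to run, because make_blobs() generates random data when no random_state is given):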

For no of clusters = 2  The average silhouette_score is : 0.7722709127556407
For no of clusters = 3  The average silhouette_score is : 0.8307470737845413
For no of clusters = 4  The average silhouette_score is : 0.6782013483149748
For no of clusters = 5  The average silhouette_score is : 0.5220013897800627
For no of clusters = 6  The average silhouette_score is : 0.3453103523071251
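One common way to turn these scores into a choice of cluster count is to pick the count with the highest average silhouette score. The sketch below applies that heuristic to the values printed above. The names `scores` and `best_k` are illustrative, and the rule itself is a heuristic rather than a guarantee of the "true" number of clusters.

```python
# Average silhouette scores copied from the output above, keyed by cluster count.
scores = {
    2: 0.7722709127556407,
    3: 0.8307470737845413,
    4: 0.6782013483149748,
    5: 0.5220013897800627,
    6: 0.3453103523071251,
}

# Pick the cluster count whose average silhouette score is largest.
best_k = max(scores, key=scores.get)

print("Best number of clusters:", best_k)
print("Its average silhouette score:", scores[best_k])
```

For this run the score peaks at 3 clusters and falls steadily as more clusters are added.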
