**Prerequisite:** Dunn index and DB index – Cluster Validity indices

Many clustering algorithms are applied to analyze very large datasets, but most of them provide no built-in means of validating or evaluating their results. This makes it difficult to decide which clustering is best and should be used for analysis.

Several cluster validity indices can be used to assess clustering quality and estimate the optimal number of clusters –

- Silhouette Index
- Dunn Index
- DB Index
- CS Index
- I-Index
- XB or Xie Beni Index

Now, let's discuss the internal cluster validity index *Silhouette Index*.

### Silhouette Index –

Silhouette analysis refers to a method of interpretation and validation of consistency within clusters of data. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually.

**How Silhouette Analysis Works –**

The Silhouette validation technique calculates the silhouette index for each sample, the average silhouette index for each cluster, and the overall average silhouette index for the dataset. With this approach, each cluster can be characterized by its silhouette index, which is based on a comparison of its tightness and separation.
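The three levels described above (per sample, per cluster, whole dataset) can be sketched with scikit-learn's `silhouette_samples`. This is a minimal illustration, not part of the original walkthrough: the dataset size, the number of clusters, and the `random_state` seeds are assumptions chosen only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data; sizes and seeds are illustrative assumptions.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Level 1: one silhouette value per sample.
sample_values = silhouette_samples(X, labels)

# Level 2: mean silhouette value of the samples in each cluster.
cluster_means = {
    int(k): float(sample_values[labels == k].mean()) for k in np.unique(labels)
}

# Level 3: mean silhouette value over every sample in the dataset.
overall_mean = float(sample_values.mean())

print("Per-cluster averages:", cluster_means)
print("Overall average:", overall_mean)
```

The overall average computed here is the same quantity that `silhouette_score(X, labels)` returns.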

**Calculation of Silhouette Value –**

If the Silhouette index value is high, the object is well-matched to its own cluster and poorly matched to neighbouring clusters. The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient is defined as –

**S(i) = ( b(i) – a(i) ) / max { a(i), b(i) }**

Where,

- a(i) is the average dissimilarity of the i-th object to all other objects in the same cluster.
- b(i) is the average dissimilarity of the i-th object to all objects in the closest cluster.
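To make the definitions of a(i) and b(i) concrete, the formula can be written out by hand with NumPy. This is a sketch under stated assumptions: dissimilarity is taken to be Euclidean distance, a sample alone in its cluster is given S(i) = 0 (the convention scikit-learn follows), and the function name `silhouette_values` and the six-point toy dataset are hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_samples


def silhouette_values(X, labels):
    """Compute S(i) for every sample, assuming Euclidean dissimilarity."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # dist[i, j] is the Euclidean distance between samples i and j.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        others = labels == labels[i]
        others[i] = False  # exclude the sample itself from its own cluster
        if not others.any():
            continue  # singleton cluster: S(i) stays 0 by convention
        # a(i): mean distance to the other members of the same cluster.
        a = dist[i, others].mean()
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == k].mean() for k in clusters if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s


# Hypothetical toy data: three tight, well-separated pairs of points.
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
labels = np.array([0, 0, 1, 1, 2, 2])

manual = silhouette_values(X, labels)
print(manual)
print(np.allclose(manual, silhouette_samples(X, labels)))
```

The final line compares the hand-written values against `silhouette_samples`, which is a quick way to check that the formula has been transcribed correctly.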

**Range of Silhouette Value –**

Because |b(i) – a(i)| can never exceed max { a(i), b(i) }, S(i) always lies in the range **[-1, 1]** –

- If the silhouette value is close to 1, the sample is well clustered and assigned to an appropriate cluster.
- If the silhouette value is close to 0, the sample lies about equally far from its own cluster and the nearest neighbouring cluster, so it could just as well be assigned to the latter. This indicates overlapping clusters.
- If the silhouette value is close to –1, the sample is on average closer to the neighbouring cluster than to its own, so it has probably been assigned to the wrong cluster.
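The three cases above can be checked with a little arithmetic. The helper `silhouette` and the values of a and b below are made up purely for illustration.

```python
def silhouette(a: float, b: float) -> float:
    """S(i) for one sample, given its a(i) and b(i)."""
    return (b - a) / max(a, b)


# Own cluster is much closer than the nearest other cluster: well clustered.
print(silhouette(a=1.0, b=4.0))  # 0.75

# Own cluster and nearest other cluster are equally far: on the boundary.
print(silhouette(a=2.0, b=2.0))  # 0.0

# Nearest other cluster is closer than the sample's own: likely misassigned.
print(silhouette(a=4.0, b=1.0))  # -0.75
```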

Below is a Python example that computes the average Silhouette Index for several cluster counts using scikit-learn:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generating the sample data from make_blobs
X, Y = make_blobs()

no_of_clusters = [2, 3, 4, 5, 6]

for n_clusters in no_of_clusters:

    cluster = KMeans(n_clusters=n_clusters)
    cluster_labels = cluster.fit_predict(X)

    # The silhouette_score gives the
    # average value for all the samples.
    silhouette_avg = silhouette_score(X, cluster_labels)

    print("For no of clusters =", n_clusters,
          " The average silhouette_score is :", silhouette_avg)
```


**Output:**

```
For no of clusters = 2 The average silhouette_score is : 0.7722709127556407
For no of clusters = 3 The average silhouette_score is : 0.8307470737845413
For no of clusters = 4 The average silhouette_score is : 0.6782013483149748
For no of clusters = 5 The average silhouette_score is : 0.5220013897800627
For no of clusters = 6 The average silhouette_score is : 0.3453103523071251
```
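In the output above the average score peaks at 3 clusters, which matches the three centers that `make_blobs()` generates by default. Exact values differ from run to run because the data is generated without a fixed seed. A common way to use these scores is to keep the cluster count with the highest average. The sketch below does that; the fixed `random_state`, `n_init`, and sample size are assumptions added for repeatability and are not part of the example above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Seeded synthetic data so that repeated runs give the same scores.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Average silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the cluster count with the highest average silhouette score.
best_k = max(scores, key=scores.get)
print("Scores:", scores)
print("Best number of clusters:", best_k)
```

The highest average is a heuristic, not a guarantee; when two candidate counts score almost the same, looking at the per-sample values for each cluster can help choose between them.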

**References:**

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html

https://en.wikipedia.org/wiki/Silhouette_(clustering)
