One of the fundamental steps in clustering, an unsupervised learning task, is to determine the number of clusters into which the data may be divided. The silhouette algorithm is one of several methods for determining the optimal number of clusters.
In the Silhouette algorithm, we assume that the data has already been clustered into k clusters by a clustering technique (typically K-Means clustering). Then for each data point, we define the following:-
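C(i) – The cluster assigned to the ith data point
|C(i)| – The number of data points in the cluster assigned to the ith data point
a(i) – The average dissimilarity of the ith data point to all other data points in its own cluster. It measures how well the ith data point is assigned to its cluster (the smaller the value, the better the assignment)
$$a(i) = \frac{1}{|C(i)| - 1} \sum_{j \in C(i),\, j \neq i} d(i, j)$$
b(i) – The average dissimilarity of the ith data point to the closest cluster which is not its own cluster
$$b(i) = \min_{C \neq C(i)} \frac{1}{|C|} \sum_{j \in C} d(i, j)$$
Here d(i, j) denotes the dissimilarity (typically the Euclidean distance) between the ith and jth data points.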
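The silhouette coefficient s(i) is given by:-
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
The value of s(i) lies between -1 and 1. A value close to 1 means the data point is much closer to its own cluster than to the nearest other cluster. By convention, s(i) = 0 when |C(i)| = 1.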
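The definitions above can be sketched in plain Python. This is a minimal illustration, not a reference implementation. It assumes the standard definitions (a(i) is the mean distance to the other points of the same cluster, b(i) is the smallest mean distance to the points of any other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i))), uses Euclidean distance as the dissimilarity, and requires at least two clusters. The function name `silhouette_coefficients` and the toy data are hypothetical.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def silhouette_coefficients(points, labels):
    """Return s(i) for every point, given one cluster label per point."""
    # Group point indices by their cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a point that is alone in its cluster gets s(i) = 0.
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        # Guard against duplicate points, where a(i) and b(i) can both be 0.
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical toy data: two tight, well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
scores = silhouette_coefficients(points, labels)
average = sum(scores) / len(scores)
print(scores, average)
```

Because both pairs are tight and far apart, every s(i) here comes out close to 1, which is what a good clustering should produce.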
We compute the average silhouette coefficient over all data points for each value of k. The value of k with the highest average s(i) is taken as the optimal number of clusters.
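The selection procedure can be sketched with scikit-learn, which provides `KMeans` and `silhouette_score` (the latter returns the mean silhouette coefficient over all samples). This is only an illustrative sketch: the data below is a hypothetical toy set of three well-separated groups, not the data used in this article, and the `n_init` and `random_state` settings are assumptions chosen for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical toy data: three well-separated groups of 2-D points.
X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
    [9.0, 1.0], [9.2, 1.1], [8.9, 0.8],
])

avg_silhouette = {}
for k in range(2, 6):  # k = 2, 3, 4, 5
    # Cluster the data into k clusters, then score that clustering.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    avg_silhouette[k] = silhouette_score(X, labels)

# The k with the highest average silhouette is taken as the best choice.
best_k = max(avg_silhouette, key=avg_silhouette.get)
print(avg_silhouette, best_k)
```

Note that `silhouette_score` needs at least 2 and at most n - 1 distinct labels for n samples, so the range of k must respect the size of the data set.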
Let us consider the following data:-
We now iterate over the values of k from 2 to 5. The case k = 1 is excluded: b(i) requires at least one other cluster, and we assume that no practical data exists for which all the data points are optimally grouped into a single cluster.
We construct the following tables for each value of k:-
k = 2
Average value of s(i) = 0.58
k = 3
Average value of s(i) = 0.84
k = 4
Average value of s(i) = 0.37
k = 5
Average value of s(i) = 0
We see that the highest average value of s(i) occurs for k = 3. Therefore we conclude that the optimal number of clusters for the given data is 3.
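The final comparison is a simple argmax over the averages. As a small check, the snippet below takes the four average values reported above and picks the k with the largest one; the variable names are hypothetical.

```python
# Average silhouette values reported above for k = 2, 3, 4, 5.
average_silhouette = {2: 0.58, 3: 0.84, 4: 0.37, 5: 0.0}

# Pick the k whose average silhouette is largest.
best_k = max(average_silhouette, key=average_silhouette.get)
print(best_k)  # 3
```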