A fundamental step in many unsupervised clustering algorithms is determining the number of clusters into which the data should be divided. The silhouette algorithm is one of several methods for determining the optimal number of clusters.
In the silhouette algorithm, we assume that the data has already been clustered into k clusters by a clustering technique (typically the K-Means clustering technique). Then, for each data point, we define the following:-
C(i) – The cluster assigned to the ith data point
|C(i)| – The number of data points in the cluster assigned to the ith data point
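a(i) – The average dissimilarity between the ith data point and the other data points in its own cluster. It measures how well the ith data point is assigned to its cluster: the smaller the value, the better the assignment
b(i) – The average dissimilarity between the ith data point and the data points of the closest cluster that is not its own cluster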
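Written out as formulas, the two averages in the standard definition of the silhouette look as follows. The symbol d(i, j), the dissimilarity between the ith and jth data points (for example, their Euclidean distance), is notation assumed here and does not appear in the definitions above; C ranges over the clusters other than C(i).

```latex
a(i) = \frac{1}{|C(i)| - 1} \sum_{j \in C(i),\, j \neq i} d(i, j)
\qquad
b(i) = \min_{C \neq C(i)} \frac{1}{|C|} \sum_{j \in C} d(i, j)
```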
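The silhouette coefficient s(i) is given by:-
s(i) = (b(i) - a(i)) / max(a(i), b(i))
By convention, s(i) = 0 when |C(i)| = 1. The coefficient always lies between -1 and 1, and a value close to 1 means that the ith data point is well assigned to its cluster.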
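The definitions above can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: it assumes Euclidean distance as the dissimilarity measure, at least two clusters, and the usual convention that a point alone in its cluster gets s(i) = 0. The function name `silhouette_values` and the toy data are made up for this example.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def silhouette_values(points, labels):
    """Return s(i) for every point, given one cluster label per point.

    Assumes at least two distinct labels; with a single cluster b(i) is undefined.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a point alone in its cluster gets s(i) = 0.
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        # s(i) = (b(i) - a(i)) / max(a(i), b(i)); guard against all-zero distances.
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical toy data: two tight pairs of points on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
scores = silhouette_values(points, labels)
print(scores)
```

For the first point, a(i) = 1 and b(i) = (10 + 11) / 2 = 10.5, so s(i) = 9.5 / 10.5, which is close to 1 as expected for a well-separated cluster.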
We determine the average silhouette for each value of k, and the value of k with the maximum average value of s(i) is considered the optimal number of clusters for the unsupervised learning algorithm.
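In practice this sweep over k is often done with a library. The sketch below assumes scikit-learn and NumPy are installed and uses `KMeans` together with `silhouette_score`, which returns the mean of s(i) over all points. The data set is synthetic and purely hypothetical (three well-separated groups generated for this example); it is not the data used in the rest of this article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated 2-D groups of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(30, 2)) for c in centers])

average_silhouette = {}
for k in range(2, 6):
    # Cluster the data into k clusters, then score that clustering.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    average_silhouette[k] = silhouette_score(X, labels)

# The k with the largest average silhouette is taken as the optimal k.
best_k = max(average_silhouette, key=average_silhouette.get)
print(average_silhouette)
print("best k:", best_k)
```

The range starts at 2 because the silhouette is not defined for a single cluster: with k = 1 there is no other cluster from which to compute b(i).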
Let us consider the following data:-
We now iterate over the values of k from 2 to 5. The value k = 1 is skipped because we assume that no practical data exists for which all the data points can be optimally clustered into 1 cluster.
We construct the following tables for each value of k:-
k = 2
Average value of s(i) = 0.58
k = 3
Average value of s(i) = 0.84
k = 4
Average value of s(i) = 0.37
k = 5
Average value of s(i) = 0
We see that the highest average value of s(i) occurs for k = 3. Therefore we conclude that the optimal number of clusters for the given data is 3.
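The final selection step can be checked with a short snippet. It only takes the four average values reported above and picks the k with the largest one; the variable names are chosen for this example.

```python
# Average silhouette values reported above for each value of k.
average_silhouette = {2: 0.58, 3: 0.84, 4: 0.37, 5: 0.0}

# Pick the k whose average silhouette is largest.
best_k = max(average_silhouette, key=average_silhouette.get)
print(best_k)  # → 3
```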