Prerequisites: Cluster Validity Index
Clustering validation has been recognized as one of the important factors essential to the success of clustering algorithms. How to effectively and efficiently assess the clustering results of clustering algorithms is the key to the problem. Generally, cluster validity measures are categorized into 3 classes (Internal cluster validation, External cluster validation and Relative cluster validation).
In this article, we focus on an internal cluster validation index i.e. Calinski-Harabasz Index.
Calinski-Harabasz (CH) Index (introduced by Calinski and Harabasz in 1974) can be used to evaluate the model when ground truth labels are not known where the validation of how well the clustering has been done is made using quantities and features inherent to the dataset. The CH Index (also known as Variance ratio criterion) is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). Here cohesion is estimated based on the distances from the data points in a cluster to its cluster centroid and separation is based on the distance of the cluster centroids from the global centroid. CH index has a form of (a . Separation)/(b . Cohesion) , where a and b are weights.
Calculation of Calinski-Harabasz Index:
The CH index for K number of clusters on a dataset D =[ d1 , d2 , d3 , … dN ] is defined as,
where, nk and ck are the no. of points and centroid of the kth cluster respectively, c is the global centroid, N is the total no. of data points.
Higher value of CH index means the clusters are dense and well separated, although there is no “acceptable” cut-off value. We need to choose that solution which gives a peak or at least an abrupt elbow on the line plot of CH indices. On the other hand, if the line is smooth (horizontal or ascending or descending) then there is no such reason to prefer one solution over others.
Below is the Python implementation of the above CH index using the sklearn library :