Different performance metrics are used to evaluate different machine learning algorithms. For a classification problem, we have a variety of performance measures to evaluate how good a model is. For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters.
Why do we need cluster validity indices?
- To compare clustering algorithms.
- To compare two sets of clusters.
- To compare two clusters, i.e. to decide which one is better in terms of compactness and connectedness.
- To determine whether random structure exists in the data due to noise.
Generally, cluster validity measures fall into three classes:
- Internal cluster validation: The clustering result is evaluated using only the clustered data itself (internal information), without reference to external information.
- External cluster validation: The clustering result is evaluated against an externally known result, such as externally provided class labels.
- Relative cluster validation: The clustering results are evaluated by varying parameters of the same algorithm (e.g. changing the number of clusters) and comparing the outcomes.
In addition to the term cluster validity index, we need two quantities: the inter-cluster distance d(a, b) between two clusters a and b, and the intra-cluster distance D(a) of a cluster a.
The inter-cluster distance d(a, b) between two clusters a and b can be measured as:
- Single linkage distance: Closest distance between two objects belonging to a and b respectively.
- Complete linkage distance: Distance between the two most remote objects belonging to a and b respectively.
- Average linkage distance: Average distance over all pairs of objects, one belonging to a and the other to b.
- Centroid linkage distance: Distance between the centroids of the two clusters a and b.
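The four inter-cluster distances above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming Euclidean distance and clusters stored as arrays with one object per row; the function names and the toy clusters `a` and `b` are made up for this example.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def single_linkage(a, b):
    # Closest pair of objects, one from each cluster.
    return pairwise_distances(a, b).min()


def complete_linkage(a, b):
    # Most remote pair of objects, one from each cluster.
    return pairwise_distances(a, b).max()


def average_linkage(a, b):
    # Mean over all pairs (one object from a, one from b).
    return pairwise_distances(a, b).mean()


def centroid_linkage(a, b):
    # Distance between the two cluster centroids.
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))


# Hypothetical toy clusters lying on the x-axis.
a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[4.0, 0.0], [6.0, 0.0]])

print(single_linkage(a, b))    # 3.0
print(complete_linkage(a, b))  # 6.0
print(average_linkage(a, b))   # 4.5
print(centroid_linkage(a, b))  # 4.5
```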
The intra-cluster distance D(a) of a cluster a can be measured as:
- Complete diameter linkage distance: Distance between the two farthest objects belonging to cluster a.
- Average diameter linkage distance: Average distance between all pairs of objects belonging to cluster a.
- Centroid diameter linkage distance: Twice the average distance between each object of cluster a and the centroid of a.
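The three intra-cluster distances can be sketched the same way. This is a minimal illustration, assuming Euclidean distance and assuming that the average diameter is taken over distinct pairs of objects (each pair counted once, no object paired with itself); the function names and the toy cluster `a` are made up for this example.

```python
import numpy as np


def _self_distances(a):
    """Euclidean distance between every pair of rows of a."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def complete_diameter(a):
    # Distance between the two farthest objects in the cluster.
    return _self_distances(a).max()


def average_diameter(a):
    # Mean distance over distinct pairs (assumption: each pair counted once).
    if len(a) < 2:
        return 0.0
    upper = np.triu_indices(len(a), k=1)
    return _self_distances(a)[upper].mean()


def centroid_diameter(a):
    # Twice the mean distance between each object and the centroid.
    return 2.0 * np.linalg.norm(a - a.mean(axis=0), axis=1).mean()


# Hypothetical toy cluster: three points on the x-axis.
a = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])

print(complete_diameter(a))  # 4.0
print(average_diameter(a))   # (2 + 4 + 2) / 3
print(centroid_diameter(a))  # 2 * (2 + 0 + 2) / 3
```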
Now, let’s discuss two internal cluster validity indices, namely the Dunn index and the DB index.
Dunn index:
The Dunn index (DI), introduced by J. C. Dunn in 1974, is a metric for evaluating clustering algorithms. It is an internal evaluation scheme, where the result is based on the clustered data itself. Like all other such indices, its aim is to identify sets of clusters that are compact, with a small variance between members of the cluster, and well separated, where the means of different clusters are sufficiently far apart compared to the within-cluster variance.
The higher the Dunn index value, the better the clustering. The number of clusters that maximizes the Dunn index is taken as the optimal number of clusters k. The index also has a drawback: as the number of clusters and the dimensionality of the data increase, so does the computational cost.
The Dunn index for c clusters is defined as:
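In the notation introduced earlier, the Dunn index is commonly written as the smallest inter-cluster distance divided by the largest intra-cluster distance. The form below is a reconstruction of that standard definition, using d(i, j) and D(k) as defined above; which linkage is used for d and which diameter for D is a choice left to the user.

```latex
\[
DI(c) \;=\; \frac{\displaystyle \min_{1 \le i < j \le c} d(i, j)}
               {\displaystyle \max_{1 \le k \le c} D(k)}
\]
```

A large numerator means the clusters are well separated, and a small denominator means they are compact, which is why larger values indicate a better clustering.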
Below is a Python implementation of the above Dunn index using the jqmcvi library:
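Independently of jqmcvi, the definition can also be sketched directly in NumPy. This is a minimal illustration and not the jqmcvi API: it assumes Euclidean distance, single linkage for d(i, j), and the complete diameter for D(k). The function name `dunn_index` and the toy data are made up for this example.

```python
import numpy as np


def dunn_index(points, labels):
    """Dunn index with single-linkage d(i, j) and complete-diameter D(k)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    clusters = [points[labels == c] for c in np.unique(labels)]
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    def distances(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))

    # Denominator: the largest intra-cluster diameter.
    max_diameter = max(distances(c, c).max() for c in clusters)
    # Numerator: the smallest distance between objects of different clusters.
    min_separation = min(
        distances(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return min_separation / max_diameter


# Hypothetical toy data: three tight pairs of points.
points = np.array([
    [0.0, 0.0], [1.0, 0.0],
    [5.0, 0.0], [6.0, 0.0],
    [0.0, 10.0], [1.0, 10.0],
])
good = np.array([0, 0, 1, 1, 2, 2])  # each pair is its own cluster
bad = np.array([0, 1, 0, 1, 2, 2])   # the first two pairs are mixed up

print(dunn_index(points, good))  # 4.0 (separation 4, diameter 1)
print(dunn_index(points, bad))   # 0.2 (separation 1, diameter 5)
```

The sketch builds full pairwise distance matrices, so its cost grows quadratically with the number of objects, which matches the computational drawback mentioned earlier.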
DB index:
The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms. It is an internal evaluation scheme, where the quality of the clustering is validated using quantities and features inherent to the dataset.
The lower the DB index value, the better the clustering. The index also has a drawback: a good value reported by this method does not imply the best information retrieval.
The DB index for k clusters is defined as:
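In the notation introduced earlier, the DB index is commonly written as the average, over all clusters, of the worst-case ratio between within-cluster spread and between-cluster separation. The form below is a reconstruction of that standard definition. In the original formulation, D(i) is the mean distance between the objects of cluster i and its centroid, and d(i, j) is the distance between the centroids of clusters i and j.

```latex
\[
DB(k) \;=\; \frac{1}{k} \sum_{i=1}^{k} \; \max_{j \ne i}
\left\{ \frac{D(i) + D(j)}{d(i, j)} \right\}
\]
```

Each term is small when both clusters are compact and far apart, which is why smaller values indicate a better clustering.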
Below is a Python implementation of the above DB index using the sklearn library:
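A minimal sketch of such a computation uses `davies_bouldin_score` from `sklearn.metrics`. The synthetic dataset, the choice of K-Means as the clustering algorithm, and the range of k values are assumptions made for this example, not part of the index itself.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Hypothetical data: three well-separated blobs in two dimensions.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]],
    cluster_std=0.5,
    random_state=0,
)

# Cluster with several values of k and record the DB index for each.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

for k, score in scores.items():
    print(f"k={k}: DB index = {score:.3f}")

# The lowest DB index marks the preferred number of clusters.
best_k = min(scores, key=scores.get)
print("lowest DB index at k =", best_k)
```

Because the blobs are generated far apart relative to their spread, the lowest score should fall at k = 3 here; on real data the minimum is usually less clear-cut.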