In data mining and statistics, hierarchical clustering analysis is a method of cluster analysis which seeks to build a hierarchy of clusters i.e. tree type structure based on the hierarchy.
Basically, there are two types of hierarchical cluster analysis strategies –
- Agglomerative Clustering: Also known as bottom-up approach or hierarchical agglomerative clustering (HAC). A structure that is more informative than the unstructured set of clusters returned by flat clustering. This clustering algorithm does not require us to prespecify the number of clusters. Bottom-up algorithms treat each data as a singleton cluster at the outset and then successively agglomerates pairs of clusters until all clusters have been merged into a single cluster that contains all data.
given a dataset (d1, d2, d3, ....dN) of size N # compute the distance matrix for i=1 to N: # as the distance matrix is symmetric about # the primary diagonal so we compute only lower # part of the primary diagonal for j=1 to i: dis_mat[i][j] = distance[di, dj] each data point is a singleton cluster repeat merge the two cluster having minimum distance update the distance matrix untill only a single cluster remains
Python implementation of the above algorithm using scikit-learn library:
numpy as np
# randomly chosen dataset
# here we need to mention the number of clusters
# otherwise the result will be a single cluster
# containing all the data
# print the class labels
[1, 1, 1, 0, 0, 0]
- Divisive clustering : Also known as top-down approach. This algorithm also does not require to prespecify the number of clusters. Top-down clustering requires a method for splitting a cluster that contains the whole data and proceeds by splitting clusters recursively until individual data have been splitted into singleton cluster.
given a dataset (d1, d2, d3, ....dN) of size N at the top we have all data in one cluster the cluster is split using a flat clustering method eg. K-Means etc repeat choose the best cluster among all the clusters to split split that cluster by the flat clustering algorithm untill each data is in its own singleton cluster
Hierarchical Agglomerative vs Divisive clustering –
- Divisive clustering is more complex as compared to agglomerative clustering, as in case of divisive clustering we need a flat clustering method as “subroutine” to split each cluster until we have each data having its own singleton cluster.
- Divisive clustering is more efficient if we do not generate a complete hierarchy all the way down to individual data leaves. Time complexity of a naive agglomerative clustering is O(n3) because we exhaustively scan the N x N matrix dist_mat for the lowest distance in each of N-1 iterations. Using priority queue data structure we can reduce this complexity to O(n2logn). By using some more optimizations it can be brought down to O(n2). Whereas for divisive clustering given a fixed number of top levels, using an efficient flat algorithm like K-Means, divisive algorithms are linear in the number of patterns and clusters.
- Divisive algorithm is also more accurate. Agglomerative clustering makes decisions by considering the local patterns or neighbor points without initially taking into account the global distribution of data. These early decisions cannot be undone. whereas divisive clustering takes into consideration the global distribution of data when making top-level partitioning decisions.
Attention geek! Strengthen your foundations with the Python Programming Foundation Course and learn the basics.
To begin with, your interview preparations Enhance your Data Structures concepts with the Python DS Course.
- Difference between Hierarchical and Non Hierarchical Clustering
- Implementing Agglomerative Clustering using Sklearn
- Difference between K means and Hierarchical Clustering
- Hierarchical Clustering in R Programming
- Difference between CURE Clustering and DBSCAN Clustering
- DBSCAN Clustering in ML | Density based clustering
- Hierarchical treeview in Python GUI application
- Create non-hierarchical columns with Pandas Group by module
- Python | Clustering, Connectivity and other Graph properties using Networkx
- K means Clustering - Introduction
- Clustering in R Programming
- Analysis of test data using K-Means Clustering in Python
- Clustering in Machine Learning
- Different Types of Clustering Algorithm
- ML | Unsupervised Face Clustering Pipeline
- ML | Determine the optimal value of K in K-Means Clustering
- ML | Mini Batch K-means clustering algorithm
- Image compression using K-means clustering
- ML | Mean-Shift Clustering
- ML | K-Medoids clustering with solved example
If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to firstname.lastname@example.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.
Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.