
Structured vs Unstructured Ward in Hierarchical Clustering Using Scikit Learn

Hierarchical clustering is a widely used method for grouping and organizing data into clusters. Data points are grouped by similarity: similar points are placed in the same cluster and dissimilar points in different clusters.

One of the key decisions in hierarchical clustering is how to measure the similarity between data points and how to merge clusters. In scikit-learn, two commonly used variants of the Ward linkage criterion are structured Ward and unstructured Ward.
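To make the idea concrete, here is a minimal sketch (the point coordinates are invented for illustration): Ward-linkage agglomerative clustering puts nearby points in the same cluster.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two visually obvious groups of 2-D points (made up for illustration)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Ward linkage merges the pair of clusters that least
# increases the total within-cluster variance
labels = AgglomerativeClustering(n_clusters=2,
                                 linkage='ward').fit_predict(X)
print(labels)
```

The three points near the origin end up in one cluster and the three points near (5, 5) in the other.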



Structured Ward Clustering:

Structured Ward is Ward-linkage agglomerative clustering with a connectivity constraint: a connectivity matrix (for example, a k-nearest-neighbors graph) limits which clusters are allowed to merge, so the hierarchy respects the local structure of the data.

Unstructured Ward Clustering:

Unstructured Ward is plain Ward-linkage agglomerative clustering with no connectivity constraint: at every step, the pair of clusters whose merger least increases the total within-cluster variance is merged, regardless of where the clusters sit.

When to use each:

  1. Both structured and unstructured Ward are effective methods for hierarchical clustering in scikit-learn. 
  2. The choice between the two depends on the characteristics of the data and the desired properties of the clusters. 
  3. Structured Ward suits data with meaningful local structure, since the connectivity constraint keeps merges between neighboring points.
  4. Unstructured Ward suits general feature-space data where no such neighborhood constraint is wanted.
  5. Let's see this in the example below.
  6. The silhouette score will show which variant performs better on a given dataset.
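In scikit-learn, what makes Ward "structured" is a connectivity matrix passed to AgglomerativeClustering. A minimal sketch of building one with kneighbors_graph (the sample and neighbor counts are arbitrary illustrative choices):

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

# Sample data; the sizes here are illustrative choices
X, _ = make_blobs(n_samples=200, n_features=2,
                  centers=3, random_state=0)

# Sparse k-nearest-neighbors graph: row i marks the 10 nearest
# neighbors of sample i; only connected clusters may merge
connectivity = kneighbors_graph(X, n_neighbors=10,
                                include_self=False)
print(connectivity.shape)  # one row and one column per sample
```

Passing this matrix as the connectivity argument of AgglomerativeClustering constrains each merge to neighboring samples.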

Here is an example of how to use hierarchical clustering with structured and unstructured Ward in scikit-learn:

Python code for Structured Ward clustering:




from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.neighbors import kneighbors_graph

# Generate some sample data
X, y = make_blobs(n_samples=10000,
                  n_features=8,
                  centers=5)

# Build a connectivity matrix: merges are restricted to
# nearest neighbors, which is what makes the clustering
# "structured"
connectivity = kneighbors_graph(X, n_neighbors=10,
                                include_self=False)

# Create a structured Ward hierarchical clustering object
structured_ward = AgglomerativeClustering(n_clusters=5,
                                          linkage='ward',
                                          connectivity=connectivity)
structured_ward.fit(X)

# Print the labels for each data point and the silhouette score
print("Structured Ward labels:",
      structured_ward.labels_)
print(silhouette_score(X,
                       structured_ward.labels_))

Output (the exact values vary between runs, since make_blobs draws random samples):



Structured Ward labels: [2 4 3 ... 3 4 0]
0.6958103589455868 
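The silhouette score used above ranges from -1 to 1, with values near 1 meaning points sit close to their own cluster and far from other clusters. A small sanity-check sketch (the 1-D coordinates are invented):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight, well-separated toy clusters (invented data)
X = np.array([[0.0], [0.1], [10.0], [10.1]])
labels = [0, 0, 1, 1]

# Near-perfect separation gives a score close to 1
score = silhouette_score(X, labels)
print(score)
```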

Python code for Unstructured Ward clustering:




from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Generate some sample data
X, y = make_blobs(n_samples=10000,
                  n_features=8, centers=5)

# Create an unstructured Ward hierarchical clustering
# object (no connectivity matrix, so any two clusters
# may be merged; Ward always uses Euclidean distances)
unstructured_ward = AgglomerativeClustering(n_clusters=5,
                                            linkage='ward')
unstructured_ward.fit(X)

# Print the labels for each data point and the silhouette score
print("Unstructured Ward labels:",
      unstructured_ward.labels_)
print(silhouette_score(X,
                       unstructured_ward.labels_))

Output (again, the exact values vary between runs):

Unstructured Ward labels: [3 0 2 ... 1 4 0]
0.7733847795088261
  1. This code generates sample data using the make_blobs function and then uses the AgglomerativeClustering class to perform hierarchical clustering with structured and unstructured Ward. 
  2. Both objects use the Ward linkage criterion, which merges the pair of clusters giving the smallest increase in the total within-cluster sum of squared distances.
  3. Ward linkage always measures distances with the Euclidean metric; what distinguishes the two variants is whether a connectivity matrix constrains the merges.
  4. The labels for each data point and the silhouette score are then printed for each clustering method.
  5. In the run shown above, the unstructured variant achieved the higher silhouette score on this dataset; because the data are drawn at random, the exact scores change from run to run.
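To see where the connectivity constraint matters, here is a sketch on make_moons, whose two clusters are not compact blobs (the neighbor count and sample size are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Two interleaved half-moons: clusters that are not compact blobs
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# Neighborhood graph for the structured variant
conn = kneighbors_graph(X, n_neighbors=10, include_self=False)

unstructured = AgglomerativeClustering(n_clusters=2,
                                       linkage='ward').fit(X)
structured = AgglomerativeClustering(n_clusters=2,
                                     linkage='ward',
                                     connectivity=conn).fit(X)
print(set(unstructured.labels_), set(structured.labels_))
```

With the constraint, merges can only follow the neighborhood graph, so the hierarchy tends to trace the moon shapes; without it, Ward is free to merge across the gap between the moons.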

You can run this code yourself to compare the clusters produced by structured and unstructured Ward, and experiment with different settings and parameters to see how they affect the clustering results.

