
PyTorch for Unsupervised Clustering

Last Updated : 04 Mar, 2024

Unsupervised clustering, a fundamental machine learning problem, aims to divide data into groups or clusters based on similarity or some underlying structure. PyTorch, a popular deep learning framework, provides the tensor operations needed to implement such clustering algorithms from scratch.

What is Unsupervised Clustering?

Unsupervised clustering is a machine learning method that finds hidden patterns or groupings within data without requiring labelled examples. It partitions data points into discrete clusters based on distance or similarity measures.

There are several types of unsupervised clustering algorithms, each with its approach to grouping data points. Some of the most common types include:

  1. K-Means Clustering: A partitioning algorithm that divides data points into k clusters based on their features, with each cluster represented by the mean of its data points.
  2. Hierarchical Clustering: Builds a hierarchy of clusters either from the bottom up (agglomerative) or from the top down (divisive), where each data point starts in its own cluster and pairs of clusters are merged or split recursively.
  3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clusters together closely packed points and marks points that are in low-density regions as outliers. It can find clusters of varying shapes and sizes.

K-means Clustering

K-means clustering is a popular unsupervised machine learning technique for dividing data points into K clusters. The algorithm iteratively assigns each data point to the closest centroid based on Euclidean distance and then updates the centroids to minimize the within-cluster sum of squared distances. K-means is sensitive to the initial choice of centroids and may converge to a local minimum.

Implementing K-means clustering using PyTorch

1. Importing Necessary Libraries

Python3




import torch
import torch.nn.functional as F
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt


2. Generate Synthetic Data and Convert It to a PyTorch Tensor

Python3




# Generate synthetic data
data, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
 
# Convert data to PyTorch tensor
tensor_data = torch.from_numpy(data).float()


3. Perform K-means Clustering

In this code, we are going to implement K-means clustering:

  • We randomly select 4 data points from the dataset as the initial centroids and define the number of iterations.
  • In the main loop, we calculate the Euclidean distance between each data point and each centroid, assign each data point to the closest centroid, and then update the centroids by computing the mean of the data points assigned to each of them.
  • These steps are repeated for the specified number of iterations.
  • This process ultimately converges to a set of centroids that represent the centers of the clusters in the data.

Python3




# Initialize centroids randomly
centroids = tensor_data[torch.randperm(tensor_data.size(0))[:4]]

# Define the number of iterations
num_iterations = 100

for _ in range(num_iterations):
    # Calculate distances from data points to centroids
    distances = torch.cdist(tensor_data, centroids)

    # Assign each data point to the closest centroid
    _, labels = torch.min(distances, dim=1)

    # Update centroids by taking the mean of data points assigned to each centroid
    for i in range(4):
        if torch.sum(labels == i) > 0:
            centroids[i] = torch.mean(tensor_data[labels == i], dim=0)


4. Visualize Clusters

Python3




# Visualize clusters
plt.scatter(data[:, 0], data[:, 1], c=labels.numpy(), cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', s=200, color='red')
plt.show()


Output:

(Scatter plot of the synthetic data colored by the four K-means clusters, with the final centroids marked as red X markers.)
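Once the loop finishes, the learned centroids can be reused, for example to assign previously unseen points to the nearest cluster or to compute the within-cluster sum of squares. The snippet below is a minimal sketch, assuming the tensor_data and centroids tensors from the code above are still in scope (the new_points values are made up for illustration).

Python3

# Assign new (unseen) points to the nearest learned centroid
new_points = torch.tensor([[0.0, 4.0], [2.0, 1.0]])  # hypothetical query points
new_labels = torch.argmin(torch.cdist(new_points, centroids), dim=1)

# Within-cluster sum of squares (inertia) of the final clustering
inertia = torch.sum(torch.min(torch.cdist(tensor_data, centroids), dim=1).values ** 2)
print(new_labels, inertia.item())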

Hierarchical Clustering

Hierarchical clustering groups data points into a hierarchy of clusters. Unlike K-means, it does not require the number of clusters to be fixed in advance. Instead, it builds a tree-like structure of clusters by repeatedly merging or splitting clusters based on similarity measures, until all data points belong to a single cluster or each sits in its own cluster. Hierarchical clustering can be agglomerative (bottom-up) or divisive (top-down).

  1. Hierarchical Agglomerative Clustering: It starts with every data point as a distinct cluster and repeatedly joins the closest pairs of clusters until every point is a part of a single cluster.
  2. Hierarchical Divisive Clustering: It starts with all the data points in one cluster and recursively divides the dataset into smaller clusters until every data point is in its own cluster.

Implementing Agglomerative Clustering using PyTorch

The code demonstrates how to perform hierarchical clustering using the linkage function from scipy.cluster.hierarchy and visualize the resulting dendrogram using Matplotlib.

  1. Import Libraries: Import necessary libraries including PyTorch for tensor operations, SciPy for hierarchical clustering, and Matplotlib for plotting.
  2. Sample Data: Create a tensor X containing sample data points.
  3. Standardize Data: Standardize the data using z-score normalization to ensure that all features have equal importance in the distance calculation.
  4. Calculate Pairwise Euclidean Distances: Use torch.cdist to calculate pairwise Euclidean distances between all points in the standardized data.
  5. Convert Distances to Condensed Form: Convert the distance tensor to a NumPy array and then to the condensed (1-D) form that SciPy’s linkage function expects, for example with scipy.spatial.distance.squareform.
  6. Perform Hierarchical Clustering: Use SciPy’s linkage function to perform hierarchical clustering on the pairwise distances. The 'single' method uses the minimum distance between clusters as the criterion for merging them.
  7. Plot Dendrogram: Plot the dendrogram using Matplotlib, visualizing the hierarchical clustering results.

NOTE: We are using SciPy for hierarchical clustering as PyTorch does not have built-in functions for hierarchical clustering. We use PyTorch for calculating pairwise distances between data points and then convert the distances to a NumPy array for use with SciPy’s hierarchical clustering functions.

Python3




import torch
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt

# Sample data
X = torch.tensor([[1, 2], [1, 4], [1, 0],
                  [4, 2], [4, 4], [4, 0]])

# Standardize data (ensure floating point output)
X_std = (X.float() - X.float().mean(dim=0)) / X.float().std(dim=0)

# Calculate pairwise Euclidean distances using PyTorch
distances = torch.cdist(X_std, X_std, p=2)  # p=2 for Euclidean distance

# Convert to a NumPy array and then to SciPy's condensed (1-D) distance form
distances = squareform(distances.numpy().astype('float64'))

# Perform hierarchical clustering using SciPy
Z = linkage(distances, 'single')

# Plot dendrogram using matplotlib
plt.figure(figsize=(10, 5))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Index')
plt.ylabel('Distance')
dendrogram(Z)
plt.show()


Output:

(Dendrogram showing how the six sample points are merged into clusters.)
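The linkage matrix Z can also be cut into a fixed number of flat clusters. Below is a brief sketch, assuming the Z computed above and using SciPy’s fcluster function; the choice of two clusters is arbitrary.

Python3

from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram so that at most 2 flat clusters remain
flat_labels = fcluster(Z, t=2, criterion='maxclust')
print(flat_labels)  # cluster label (1 or 2) for each of the six points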

DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups densely packed data points into clusters and classifies outliers as noise. The user does not need to specify the number of clusters in advance. Clusters are defined as contiguous high-density regions separated by low-density regions, so DBSCAN can detect clusters of arbitrary shape and is robust to outliers. Its two parameters are epsilon (eps), which specifies the radius of the neighborhood around a data point, and min_samples, the minimum number of data points required to form a dense region.

Implementing DBSCAN using PyTorch

1. Importing necessary libraries

Python3




import torch
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons


2. Generate Synthetic Data

Generate synthetic data using the make_moons function with 200 samples, a small amount of noise (0.05), and a fixed random state.

Python3




# Generate synthetic data
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = torch.tensor(X, dtype=torch.float)


3. DBSCAN Algorithm

Define a Euclidean distance function that calculates the distances from a single point to every point in the dataset.

Python3




def euclidean_distance(x1, x2):
    return torch.sqrt(torch.sum((x1 - x2) ** 2, dim=1))


The dbscan function performs the clustering.

The dbscan function implements the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm. It takes an input dataset X, a maximum distance parameter eps, and a minimum number of samples min_samples. It iterates over each data point and finds its neighbors within the specified distance. If a point has fewer neighbors than min_samples, it is labeled as noise; otherwise, it receives a new cluster label and its neighbors are visited in turn to expand the cluster. In this way the function assigns a cluster label to every point based on density and proximity, without requiring the number of clusters to be specified beforehand.

Python3




def dbscan(X, eps, min_samples):
    n_samples = X.shape[0]
    labels = torch.zeros(n_samples, dtype=torch.int)

    # Initialize cluster label and visited flags
    cluster_label = 0
    visited = torch.zeros(n_samples, dtype=torch.bool)

    # Iterate over each point
    for i in range(n_samples):
        if visited[i]:
            continue
        visited[i] = True

        # Find neighbors (reshape to 1-D so a single neighbor is handled correctly)
        neighbors = torch.nonzero(euclidean_distance(X[i], X) < eps).reshape(-1)

        if neighbors.shape[0] < min_samples:
            # Label as noise
            labels[i] = 0
        else:
            # Expand cluster
            cluster_label += 1
            labels[i] = cluster_label
            expand_cluster(X, labels, visited, neighbors, cluster_label, eps, min_samples)

    return labels


This function implements the core logic of the DBSCAN algorithm, which iteratively identifies core points, expands clusters, and labels noise points. It operates directly on the input data and efficiently finds clusters without requiring the user to specify the number of clusters in advance.

The expand_cluster function grows a cluster by assigning the cluster label to all points reachable from the core point.

Python3




def expand_cluster(X, labels, visited, neighbors, cluster_label, eps, min_samples):
    i = 0
    while i < neighbors.shape[0]:
        neighbor_index = neighbors[i].item()
        if not visited[neighbor_index]:
            visited[neighbor_index] = True
            neighbor_neighbors = torch.nonzero(euclidean_distance(X[neighbor_index], X) < eps).reshape(-1)
            if neighbor_neighbors.shape[0] >= min_samples:
                neighbors = torch.cat((neighbors, neighbor_neighbors))
        if labels[neighbor_index] == 0:
            labels[neighbor_index] = cluster_label
        i += 1


4. Perform Clustering and Visualize Clusters

Python3




# DBSCAN parameters
eps = 0.3
min_samples = 5

# Perform clustering
labels = dbscan(X, eps, min_samples)

# Visualize clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar()
plt.show()


Output:

(Scatter plot of the two moon-shaped clusters found by DBSCAN, colored by cluster label.)
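Since this implementation labels noise points as 0 and numbers clusters from 1 upward, the result can be summarized directly from the returned labels tensor. A small sketch, assuming the labels tensor from the run above:

Python3

# Summarize the clustering result
num_clusters = int(labels.max().item())      # clusters are numbered 1..k, noise is 0
num_noise = int((labels == 0).sum().item())  # points left unassigned (noise)
print(f"Found {num_clusters} clusters and {num_noise} noise points")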

Evaluating Clustering Performance

Evaluating clustering performance is necessary to judge the quality of the clustering results. Common evaluation metrics include the silhouette score, the Davies-Bouldin index, and the Calinski-Harabasz index; a short example of computing them with scikit-learn follows the list below.

Metrics for evaluating clustering results

  • The silhouette score measures the compactness and separation of clusters.
  • The Davies-Bouldin index measures the average similarity between each cluster and its most similar cluster, normalized by the spread of the clusters; lower values are better.
  • The Calinski-Harabasz index computes the ratio of between-cluster dispersion to within-cluster dispersion; higher values indicate better-defined clusters.
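As a minimal sketch of how these metrics can be computed in practice, scikit-learn’s implementations can be applied to the K-means result from earlier. This assumes the data array and the labels tensor produced in the K-means example are still in scope (re-run that step first if labels has since been overwritten by the DBSCAN example).

Python3

from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Evaluate the K-means clustering from the earlier example
labels_np = labels.numpy()
print("Silhouette score:", silhouette_score(data, labels_np))
print("Davies-Bouldin index:", davies_bouldin_score(data, labels_np))
print("Calinski-Harabasz index:", calinski_harabasz_score(data, labels_np))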

Conclusion

It is crucial to compare clustering algorithms according to their scalability, suitability for a particular dataset, and performance metrics. Try out different algorithms, such as K-means, hierarchical clustering, and DBSCAN, to see which one works best for your data.

Finally, PyTorch offers an effective framework for exploring and implementing different unsupervised clustering methods. By understanding clustering principles and making use of PyTorch’s functionality, data scientists can extract valuable insights from unlabeled data and reach well-informed conclusions across many fields.


