
Ordering Points To Identify Cluster Structure (OPTICS) using Sklearn

Sklearn’s OPTICS, an acronym for Ordering Points To Identify the Clustering Structure, stands as a powerful tool in the realm of machine learning and data analysis. It is a part of the Scikit-learn library, a popular machine-learning library in Python. OPTICS is particularly adept at uncovering hidden patterns and structures within datasets, making it an invaluable asset for cluster analysis.

Cluster analysis involves grouping data points that share similarities, and OPTICS offers a flexible, density-based way to do it. Unlike algorithms such as k-means, OPTICS does not need the number of clusters in advance, and unlike DBSCAN it does not depend on a single fixed density threshold. Instead, it examines the density distribution of the data points, which lets it identify clusters of varying shape, size, and density.



OPTICS Clustering

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm designed to find clusters of different densities and shapes in a dataset. In contrast to techniques that require a fixed number of clusters, OPTICS analyzes the local density around each data point to identify clusters. The algorithm orders the points using each point's reachability distance, a measure of how densely the point is connected to the points processed before it.

The reachability distance makes it possible to identify noise and outliers and helps reveal the hierarchical structure of clusters. OPTICS is robust to clusters of differing size and density, adapts to varied data patterns, and can detect irregularly shaped clusters.
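As a minimal sketch of the idea, the snippet below fits scikit-learn's OPTICS to two illustrative blobs with different spreads and counts the clusters and noise points it reports. The data, the random seed, and the value of min_samples are assumptions chosen for illustration, not recommendations.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)

# Two well-separated blobs with different spreads (illustrative data only).
dense = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(50, 2))
sparse = rng.normal(loc=[10.0, 10.0], scale=1.0, size=(50, 2))
X = np.vstack((dense, sparse))

model = OPTICS(min_samples=10).fit(X)

# labels_ holds one cluster id per point; -1 marks points treated as noise.
n_clusters = len(set(model.labels_.tolist()) - {-1})
n_noise = int(np.sum(model.labels_ == -1))
print("clusters found:", n_clusters)
print("noise points:", n_noise)
```

Note that the number of clusters is never passed in; it falls out of the density structure that OPTICS finds.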



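Parameters of OPTICS

The scikit-learn OPTICS estimator has several parameters that control its behavior. Here are the key parameters:

min_samples: The number of samples in a neighborhood for a point to be considered a core point. It can be an integer or a fraction of the number of samples.

max_eps: The maximum distance between two samples for one to be considered in the neighborhood of the other. The default of infinity identifies clusters across all scales; smaller values reduce run time.

metric: The distance metric. The default is "minkowski", which with p=2 is the Euclidean distance.

cluster_method: The method used to extract clusters from the computed reachability and ordering, either "xi" or "dbscan".

xi: The minimum steepness on the reachability plot that constitutes a cluster boundary. Used only when cluster_method is "xi".

min_cluster_size: The minimum number of samples in a cluster, given as an absolute number or a fraction of the number of samples. Used only when cluster_method is "xi".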
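As a rough illustration, these parameters are passed to the constructor as keyword arguments. The values shown are believed to be scikit-learn's defaults; check the documentation for the installed version before relying on them.

```python
from sklearn.cluster import OPTICS

model = OPTICS(
    min_samples=5,          # neighbors needed for a point to count as a core point
    max_eps=float("inf"),   # largest neighborhood radius considered
    metric="minkowski",     # distance metric; with p=2 this is Euclidean
    p=2,
    cluster_method="xi",    # alternatively "dbscan"
    xi=0.05,                # minimum steepness of a cluster boundary (xi method)
    min_cluster_size=None,  # None means "use min_samples"
)

# get_params() returns the configuration as a plain dictionary.
params = model.get_params()
print(params["cluster_method"], params["min_samples"], params["xi"])
```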

How OPTICS Works

OPTICS operates based on the notions of reachability and ordering. For each data point it computes a reachability distance: the larger of the actual distance to a previously processed core point and that core point's core distance. Because the algorithm records these distances rather than applying one fixed density threshold, it can adapt to varying density across the dataset.

The algorithm then visits the data points so that points lying in the same dense region end up next to each other, and plotting the reachability distances in that order gives a reachability plot. Valleys in this plot correspond to potential clusters, and the peaks between them mark gaps in the data. By analyzing these structures, OPTICS can reveal clusters of different shapes and sizes, providing a more nuanced understanding of the underlying data distribution.
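The sketch below shows one way to read a reachability plot numerically rather than visually. It uses two illustrative blobs of 40 points each (an assumption of this example), arranges the reachability values in processing order, and locates the largest finite value, which is where the ordering crosses the gap between the blobs.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(1)
left = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(40, 2))
right = rng.normal(loc=[8.0, 0.0], scale=0.3, size=(40, 2))
X = np.vstack((left, right))

model = OPTICS(min_samples=5).fit(X)

# Reachability values arranged in the order OPTICS visited the points.
reach = model.reachability_[model.ordering_]

# The first visited point has no predecessor, so its reachability is inf.
# Mask it out before looking for the highest peak.
finite = np.where(np.isfinite(reach), reach, np.nan)
peak_position = int(np.nanargmax(finite))
print("largest finite reachability at position", peak_position)
```

Everything before the peak belongs to one valley and everything after it to another, which is the structure the plot shows as two dips separated by a spike.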

In essence, Sklearn’s OPTICS introduces a dynamic and adaptable approach to cluster analysis, making it a valuable tool for uncovering intricate patterns in datasets with varying density and structures. Its ability to handle diverse datasets without rigid assumptions contributes to its popularity in the field of machine learning.

Concepts Related to OPTICS

Reachability Distance: The reachability distance of a point with respect to a core point is the larger of the core point's core distance and the actual distance between the two points. Points with low reachability distances lie in dense regions and are likely to belong to the same cluster.

Core Distance: The core distance of a point is the smallest radius at which its neighborhood contains the required minimum number of points, that is, the smallest radius at which it qualifies as a core point. Core points are central to cluster formation, and a small core distance indicates a dense region.

Ordering: OPTICS processes the points one at a time, continuing with the unprocessed point that has the smallest reachability distance from the points already visited. Points in the same dense region therefore end up adjacent in the ordered list.

Reachability Plot: The reachability plot shows the points in their processing order against their reachability distances. Valleys indicate potential clusters, and peaks indicate cluster boundaries and sparse regions in the data.

Clustering Structure: OPTICS aims to identify the underlying clustering structure without assuming fixed cluster shapes or sizes. It adapts to the varying density of data, providing a more realistic representation of clusters.
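To make the core distance concrete, the sketch below recomputes it with a nearest-neighbor query and compares the result with the core_distances_ attribute of a fitted OPTICS model. It assumes that scikit-learn counts the point itself among its min_samples neighbors, which is why the training data is queried directly; the random data is illustrative.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
k = 5

model = OPTICS(min_samples=k).fit(X)

# Distance from every point to its k-th nearest neighbor. Querying with the
# training data returns each point as its own first neighbor (distance 0).
dist, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
kth_distance = dist[:, -1]

matches = bool(np.allclose(kth_distance, model.core_distances_))
print("core distances match k-th neighbor distances:", matches)
```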

Mathematical Concepts Used in OPTICS:

OPTICS involves the following mathematical concepts:

1. Core Distance Formula

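Core distance (CD) for a point p is the distance to its k-th nearest neighbor, denoted NN_k(p), where d is the distance between two points and k is the minimum number of samples:

$$\mathrm{CD}(p) = d\bigl(p, \mathrm{NN}_k(p)\bigr)$$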

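2. Reachability Distance Formula

Reachability distance (RD) for a point o with respect to another point p is given by:

$$\mathrm{RD}(o, p) = \max\bigl(\mathrm{CD}(p),\ d(p, o)\bigr)$$

Here d(p, o) is the actual distance between the two points. The formula ensures that the reachability distance is never smaller than the core distance of p, while still reflecting the actual distance between the points when o lies farther away.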
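A small worked example can make the two formulas concrete. The sketch below uses four points on a line and k = 2, counting neighbors without the point itself (a convention assumed for this example; libraries may count differently). The helper names are hypothetical.

```python
import numpy as np

points = np.array([0.0, 1.0, 2.0, 10.0])
k = 2  # neighbors counted excluding the point itself (assumption of this sketch)


def core_distance(i):
    """Distance from points[i] to its k-th nearest neighbor."""
    d = np.abs(points - points[i])
    d = np.delete(d, i)  # drop the zero distance to itself
    return float(np.sort(d)[k - 1])


def reachability_distance(o, p):
    """Reachability of points[o] with respect to points[p]."""
    return max(core_distance(p), float(abs(points[o] - points[p])))


print(core_distance(0))             # 2.0: neighbors of 0.0 are at 1, 2, 10
print(reachability_distance(1, 0))  # 2.0: the core distance dominates
print(reachability_distance(3, 0))  # 10.0: the actual distance dominates
```

The second and third results show the two branches of the max: a nearby point inherits the core distance, while a distant point keeps its actual distance.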

Implementation of OPTICS Clustering

Importing Libraries

import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np
 
from sklearn.cluster import OPTICS, cluster_optics_dbscan

                    

This code imports the libraries needed for clustering and visualization. scikit-learn's OPTICS estimator performs the clustering, and cluster_optics_dbscan extracts DBSCAN-style clusterings from the OPTICS result at chosen epsilon values. NumPy generates the synthetic dataset, and matplotlib's gridspec module lays out a multi-subplot grid for the reachability plot and the clustered data points.

Generating Data

# Generate modified sample data
 
np.random.seed(42)
n_points_per_cluster = 200
 
C1 = [-3, -1] + 1.0 * np.random.randn(n_points_per_cluster, 2)
C2 = [2, -2] + 0.5 * np.random.randn(n_points_per_cluster, 2)
C3 = [0, 2] + 0.8 * np.random.randn(n_points_per_cluster, 2)
C4 = [-1, 4] + 0.2 * np.random.randn(n_points_per_cluster, 2)
C5 = [1, -3] + 1.2 * np.random.randn(n_points_per_cluster, 2)
C6 = [4, 5] + 1.5 * np.random.randn(n_points_per_cluster, 2)
X_modified = np.vstack((C1, C2, C3, C4, C5, C6))
 
clust = OPTICS(min_samples=40, xi=0.1, min_cluster_size=0.1)

                    

This code uses NumPy's random module to create a synthetic dataset of six clusters with different centers and spreads, 200 points each. It then configures scikit-learn's OPTICS estimator with min_samples=40, xi=0.1, and min_cluster_size=0.1. These parameters set the number of neighboring samples required for a core point, the minimum steepness on the reachability plot that marks a cluster boundary, and the minimum cluster size (here a fraction, 10% of the samples), respectively. The clust object holds the clustering structure once it has been fitted to the data. With matplotlib, the dataset and the clustering results can then be shown.
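Because min_cluster_size is given as a fraction here, it helps to see what it means in points. The conversion below is an assumption about how a fractional value is interpreted (a fraction of the sample count, with a floor of two points), written out as plain arithmetic rather than a call into the library.

```python
n_samples = 6 * 200      # six clusters of 200 points, as generated above
min_cluster_size = 0.1   # the fraction passed to OPTICS

# Assumed conversion: a value no greater than 1 is treated as a fraction of
# the sample count, never going below 2 points.
min_points = max(2, int(min_cluster_size * n_samples))
print(min_points)  # 120
```

Under that assumption, any candidate cluster with fewer than 120 points is not reported as its own cluster by the xi extraction.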

OPTICS Clustering Visualization

# Run the fit
clust.fit(X_modified)
 
labels_050 = cluster_optics_dbscan(
    reachability=clust.reachability_,
    core_distances=clust.core_distances_,
    ordering=clust.ordering_,
    eps=0.7,
)
labels_200 = cluster_optics_dbscan(
    reachability=clust.reachability_,
    core_distances=clust.core_distances_,
    ordering=clust.ordering_,
    eps=1.5,
)
 
space = np.arange(len(X_modified))
reachability = clust.reachability_[clust.ordering_]
labels = clust.labels_[clust.ordering_]
 
plt.figure(figsize=(10, 7))
G = gridspec.GridSpec(2, 3)
ax1 = plt.subplot(G[0, :])
ax2 = plt.subplot(G[1, 0])
ax3 = plt.subplot(G[1, 1])
ax4 = plt.subplot(G[1, 2])
 
# Reachability plot
colors = ["b.", "g.", "r.", "y.", "c."]
for klass, color in zip(range(0, 5), colors):
    Xk = space[labels == klass]
    Rk = reachability[labels == klass]
    ax1.plot(Xk, Rk, color, alpha=0.3)
ax1.plot(space[labels == -1], reachability[labels == -1], "k.", alpha=0.3)
ax1.plot(space, np.full_like(space, 1.5, dtype=float), "k-", alpha=0.5)
ax1.plot(space, np.full_like(space, 0.7, dtype=float), "k-.", alpha=0.5)  # matches the eps=0.7 cut
ax1.set_ylabel("Reachability (epsilon distance)")
ax1.set_title("Reachability Plot")
 
# OPTICS
colors = ["b.", "g.", "r.", "y.", "c."]
for klass, color in zip(range(0, 5), colors):
    Xk = X_modified[clust.labels_ == klass]
    ax2.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3)
ax2.plot(X_modified[clust.labels_ == -1, 0],
         X_modified[clust.labels_ == -1, 1], "k+", alpha=0.1)
ax2.set_title("Automatic Clustering\nOPTICS")
 
# DBSCAN at 0.7
colors = ["b.", "g.", "r.", "c."]
for klass, color in zip(range(0, 4), colors):
    Xk = X_modified[labels_050 == klass]
    ax3.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3)
ax3.plot(X_modified[labels_050 == -1, 0],
         X_modified[labels_050 == -1, 1], "k+", alpha=0.1)
ax3.set_title("Clustering at 0.7 epsilon cut\nDBSCAN")
 
# DBSCAN at 1.5
colors = ["b.", "m.", "y.", "c."]
for klass, color in zip(range(0, 4), colors):
    Xk = X_modified[labels_200 == klass]
    ax4.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3)
ax4.plot(X_modified[labels_200 == -1, 0],
         X_modified[labels_200 == -1, 1], "k+", alpha=0.1)
ax4.set_title("Clustering at 1.5 epsilon cut\nDBSCAN")
 
plt.tight_layout()
plt.show()

                    

Output:

This code fits OPTICS to the dataset (X_modified) with clust.fit(). cluster_optics_dbscan then extracts DBSCAN-style clusterings from the fitted reachability distances, core distances, and ordering at two epsilon values, 0.7 and 1.5; the resulting labels are stored in labels_050 and labels_200 (the numeric suffixes in these names do not correspond to the epsilon values used). The top subplot (ax1) shows the reachability plot, with reachability distance plotted against the points in their OPTICS ordering and horizontal reference lines drawn across it. The lower subplots (ax2, ax3, ax4) show the clusters found by the automatic OPTICS extraction and by the DBSCAN cuts at each epsilon value. Together they reveal the hierarchical structure of the clusters and how the choice of epsilon changes the result. The visualization is arranged in a 2×3 grid, with the reachability plot spanning the top row.
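The effect of the epsilon cut can also be checked numerically instead of by eye. The sketch below is self-contained and uses its own small two-blob dataset and epsilon values (both assumptions for illustration); summarize is a hypothetical helper that counts clusters and noise points in a label array.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

rng = np.random.default_rng(3)
X = np.vstack((
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(60, 2)),
    rng.normal(loc=[6.0, 0.0], scale=0.3, size=(60, 2)),
))
model = OPTICS(min_samples=10).fit(X)


def summarize(labels):
    """Return (number of clusters, number of noise points)."""
    cluster_ids = set(labels.tolist()) - {-1}
    return len(cluster_ids), int(np.sum(labels == -1))


results = {}
for eps in (0.1, 2.0):
    labels = cluster_optics_dbscan(
        reachability=model.reachability_,
        core_distances=model.core_distances_,
        ordering=model.ordering_,
        eps=eps,
    )
    results[eps] = summarize(labels)
    print(f"eps={eps}: clusters={results[eps][0]}, noise={results[eps][1]}")
```

A single OPTICS fit serves both cuts, which is the practical benefit of computing the ordering once and extracting DBSCAN-style labels afterwards.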

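Advantages of OPTICS

OPTICS has several benefits that make it a useful clustering algorithm for certain kinds of datasets. It does not need the number of clusters in advance, it handles clusters of varying density, it can detect irregularly shaped clusters, it identifies noise and outliers, and its reachability plot exposes the hierarchical structure of the clusters.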

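Disadvantages of OPTICS

OPTICS also has drawbacks and restrictions. Its results depend on the choice of min_samples and, for automatic extraction, xi. Computing the ordering can be slow on large datasets, since in the worst case it compares every pair of points. Like other distance-based methods, it also tends to degrade on high-dimensional data, where distances become less informative.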

Conclusion

In conclusion, Sklearn’s OPTICS is a powerful tool for uncovering hidden patterns and structures in datasets. Unlike traditional clustering algorithms, OPTICS adapts to varying data densities, making it versatile for real-world applications. The reachability plot gives insights into the data’s structure, guiding the identification of clusters and sparse regions. The cluster visualization further enhances interpretability, allowing users to understand how data points are grouped. OPTICS excels in scenarios where clusters have irregular shapes and sizes.

This article has introduced OPTICS, explained its working principles, and demonstrated its application on a synthetic dataset. By offering flexibility in cluster analysis without rigid assumptions, OPTICS proves valuable in understanding complex data relationships. Researchers, data scientists, and analysts can use OPTICS to gain deeper insights into their datasets and make informed decisions based on the discovered clustering structure.

