Ordering Points To Identify the Clustering Structure (OPTICS) using Sklearn

Sklearn’s OPTICS, an acronym for Ordering Points To Identify the Clustering Structure, stands as a powerful tool in the realm of machine learning and data analysis. It is a part of the Scikit-learn library, a popular machine-learning library in Python. OPTICS is particularly adept at uncovering hidden patterns and structures within datasets, making it an invaluable asset for cluster analysis.

Cluster analysis involves grouping data points that share similarities, and OPTICS provides a flexible and adaptive approach to it. Unlike many traditional clustering algorithms, OPTICS does not require the number of clusters in advance or rely on a single global density threshold. Instead, it examines the density distribution of data points, allowing for the identification of clusters with varying shapes, sizes, and densities.

OPTICS Clustering

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm designed to find clusters of different densities and shapes in a dataset. Unlike techniques that require a fixed number of clusters, such as k-means, OPTICS analyzes the local density of data points to identify clusters. The algorithm orders the points using each point's reachability distance, a measure of how densely the region around the point is populated.

The reachability distance makes it possible to identify noise and outliers and aids in illuminating the hierarchical structure of clusters. OPTICS has the advantages of robustness against changing cluster sizes, flexibility to accommodate various data patterns, and the capacity to detect irregularly shaped clusters.

Parameters of OPTICS

The OPTICS (Ordering Points To Identify the Clustering Structure) clustering algorithm has several parameters that can be adjusted to control its behavior. Here are the key parameters:

min_samples: The number of samples in a neighborhood for a data point to be considered a core point. Core points are central to the formation of clusters.
xi: The minimum steepness on the reachability plot that counts as a cluster boundary. Smaller values of xi treat smaller changes in reachability as boundaries and so produce a finer-grained clustering structure. It is used only when cluster_method is 'xi'.
min_cluster_size: The minimum number of samples a cluster must contain, given either as an absolute count or as a fraction of the total number of samples. Groups smaller than this threshold are not reported as clusters. If it is left unset, the value of min_samples is used. It is used only when cluster_method is 'xi'.
max_eps: The maximum distance between two samples for one to be considered in the neighborhood of the other. The default is infinity, which identifies clusters across all scales; reducing max_eps shortens run times.
p: The Minkowski metric's power parameter. It is equal to the Euclidean metric when p is set to 2 and to the Manhattan metric when p is set to 1.
metric_params: Additional keyword arguments for the metric function.
cluster_method: The method used to extract clusters from the computed reachability and ordering. Use 'xi' for the steepness-based Xi method, or 'dbscan' for an extraction similar to DBSCAN.
eps: The maximum distance between two samples for one to be considered in the neighborhood of the other during DBSCAN-style extraction. It defaults to max_eps if it is left unset and is used only when cluster_method is 'dbscan'.
predecessor_correction: If True, clusters are corrected according to the predecessors calculated by OPTICS. This has minimal effect on most datasets and is used only when cluster_method is 'xi'.
algorithm: The nearest neighbor search algorithm. The options are 'auto', 'ball_tree', 'kd_tree', and 'brute'.
leaf_size: The leaf size passed to the BallTree or KDTree. It affects the speed of tree construction and queries, as well as the memory required to store the tree.
n_jobs: The number of parallel jobs to run for the nearest neighbors search. It uses all CPUs if set to -1.
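As a minimal sketch of how these parameters are passed in practice, the snippet below builds an OPTICS estimator with several of them written out explicitly. The dataset X_demo and the specific values are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Small synthetic dataset (illustrative only): two Gaussian blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2)),
])

model = OPTICS(
    min_samples=5,          # neighborhood size needed for a core point
    max_eps=np.inf,         # no upper bound on the neighborhood radius
    metric="minkowski",
    p=2,                    # p=2 makes Minkowski equal to Euclidean
    cluster_method="xi",    # steepness-based extraction
    xi=0.05,                # minimum steepness of a cluster boundary
    min_cluster_size=10,    # absolute count; a float in (0, 1] is a fraction
    algorithm="auto",
    leaf_size=30,
    n_jobs=None,
)
model.fit(X_demo)

# One label per sample; -1 marks points treated as noise.
print(model.labels_.shape)
print(model.get_params()["xi"])
```

Switching to cluster_method="dbscan" would make eps relevant and xi, min_cluster_size, and predecessor_correction irrelevant.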

How OPTICS Works

OPTICS operates on the notions of reachability and ordering. For each data point it computes a reachability distance: the smallest distance at which the point can be reached from an already processed core point, which is never smaller than that core point's core distance. Because these distances are recorded for every point rather than compared against one fixed density threshold, the algorithm adapts to the varying density across the dataset.

The algorithm then visits the data points in an order driven by their reachability distances, and plotting the reachability distance of each point in that order gives the reachability plot. Valleys in this plot correspond to dense regions, that is, potential clusters, while peaks mark the sparser gaps that separate them. By analyzing these structures, OPTICS can unveil clusters of different shapes and sizes, providing a more nuanced understanding of the underlying data distribution.

In essence, Sklearn’s OPTICS introduces a dynamic and adaptable approach to cluster analysis, making it a valuable tool for uncovering intricate patterns in datasets with varying density and structures. Its ability to handle diverse datasets without rigid assumptions contributes to its popularity in the field of machine learning.
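The ordering and reachability values described above are exposed as attributes of a fitted estimator. The following is a small sketch on an assumed toy dataset (X_small is made up for illustration) that reads them back in processing order.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Toy data (an assumption for illustration): one dense and one sparser blob.
rng = np.random.default_rng(1)
X_small = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.2, size=(30, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.8, size=(30, 2)),
])

optics = OPTICS(min_samples=5).fit(X_small)

# ordering_ lists sample indices in the order OPTICS processed them.
order = optics.ordering_

# reachability_ and core_distances_ are indexed by sample, so index them
# with the ordering to get the values in processing order.
reach_in_order = optics.reachability_[order]
core_in_order = optics.core_distances_[order]

print("first five indices visited:", order[:5])
print("their reachability distances:", np.round(reach_in_order[:5], 3))
print("their core distances:", np.round(core_in_order[:5], 3))
```

The first processed point has no predecessor, so its reachability distance is reported as infinity; plotting code usually either skips it or lets matplotlib ignore it.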

Concepts Related to OPTICS

Reachability Distance: The reachability distance of a point o with respect to a core point p is the larger of p's core distance and the actual distance between p and o. Points with low reachability distances lie in densely connected regions and likely belong to the same cluster.

Core Distance: The core distance of a point is the smallest radius at which its neighborhood contains min_samples points, in other words the distance to its min_samples-th nearest neighbor. A small core distance indicates that the point sits in a dense region.

Ordering: The ordering is the sequence in which OPTICS processes the points. At each step it moves to the unprocessed point with the smallest reachability distance, so points from the same dense region end up next to each other in the ordered list.

Reachability Plot: The reachability plot shows the points in their OPTICS order on the horizontal axis and their reachability distances on the vertical axis. Valleys in the plot indicate clusters, and peaks indicate cluster boundaries and sparse regions in the data.

Clustering Structure: OPTICS aims to identify the underlying clustering structure without assuming fixed cluster shapes or sizes. It adapts to the varying density of data, providing a more realistic representation of clusters.
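To make the link between the reachability plot and cluster boundaries concrete, here is a sketch on two well-separated blobs (an assumed toy dataset). Within each blob the reachability distances stay small; the single large value appears where the ordering jumps from one blob to the other.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Two tight, well-separated blobs of 50 points each (illustrative data).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.3, size=(50, 2))
X_two = np.vstack([blob_a, blob_b])

opt = OPTICS(min_samples=5).fit(X_two)
reach_ordered = opt.reachability_[opt.ordering_]

# Position 0 has infinite reachability (no predecessor), so skip it when
# looking for the highest peak in the plot.
peak_pos = int(np.argmax(reach_ordered[1:])) + 1

print("peak position in the ordering:", peak_pos)
print("reachability at the peak:", round(float(reach_ordered[peak_pos]), 2))
print("median reachability elsewhere:",
      round(float(np.median(reach_ordered[1:])), 2))
```

Because one whole blob is processed before the other, the peak falls at the position where the second blob begins, which is exactly the boundary a cluster extraction step looks for.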

Mathematical Concepts Used in OPTICS:

OPTICS involves the following mathematical concepts:

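1. Core Distance Formula

Core distance (CD) for a point p is calculated as the distance to its k-th nearest neighbor, denoted as NN_k(p), where k plays the role of min_samples:

$$CD(p) = d\big(p,\ NN_k(p)\big)$$

2. Reachability Distance Formula

Reachability distance (RD) for a point o concerning another point p is given by:

$$RD(o, p) = \max\big(CD(p),\ d(p, o)\big)$$

This formula ensures that the reachability distance considers both the core distance of the point and the actual distance between points, so the reachability distance from p can never fall below p's own core distance.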
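A small worked example may help here. The sketch below computes the two quantities by hand with NumPy, taking the core distance of p as the distance to its k-th nearest neighbor and the reachability distance of o from p as the larger of that core distance and the distance between p and o. The helper names and the five points are hypothetical, and the convention that p is not counted among its own neighbors is an assumption; library implementations may count the point itself.

```python
import numpy as np

# Five points on a line, chosen so the distances are easy to read off.
points = np.array([
    [0.0, 0.0],   # index 0: the point p
    [1.0, 0.0],
    [2.0, 0.0],
    [4.0, 0.0],
    [10.0, 0.0],
])


def core_distance(points, p_idx, k):
    """Distance from point p to its k-th nearest neighbor (p excluded)."""
    dists = np.linalg.norm(points - points[p_idx], axis=1)
    dists_to_others = np.delete(dists, p_idx)
    return float(np.sort(dists_to_others)[k - 1])


def reachability_distance(points, o_idx, p_idx, k):
    """Larger of the core distance of p and the distance between p and o."""
    dist_po = float(np.linalg.norm(points[o_idx] - points[p_idx]))
    return max(core_distance(points, p_idx, k), dist_po)


# Distances from p to the others are 1, 2, 4, 10, so with k=2 CD(p) = 2.
cd_p = core_distance(points, p_idx=0, k=2)

# o = (1, 0) is closer than CD(p), so the core distance wins: max(2, 1) = 2.
rd_near = reachability_distance(points, o_idx=1, p_idx=0, k=2)

# o = (4, 0) is farther than CD(p), so the real distance wins: max(2, 4) = 4.
rd_far = reachability_distance(points, o_idx=3, p_idx=0, k=2)

print(cd_p, rd_near, rd_far)
```

The near point shows the smoothing effect of the max: every point inside p's core neighborhood gets the same reachability distance from p.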

Implementation of Optics Clustering

Importing Libraries

Python3

import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np

from sklearn.cluster import OPTICS, cluster_optics_dbscan

This code imports the libraries required for clustering and visualization. scikit-learn provides the OPTICS estimator and the cluster_optics_dbscan function, which extracts DBSCAN-style clusters from an OPTICS result. NumPy is used to generate a synthetic dataset, and matplotlib, together with its gridspec module, is used to build a multi-subplot grid showing the reachability plot and the clustered data points.

Generating Data

Python3

# Generate modified sample data
np.random.seed(42)
n_points_per_cluster = 200

# Six Gaussian blobs with different centers and spreads
C1 = [-3, -1] + 1.0 * np.random.randn(n_points_per_cluster, 2)
C2 = [2, -2] + 0.5 * np.random.randn(n_points_per_cluster, 2)
C3 = [0, 2] + 0.8 * np.random.randn(n_points_per_cluster, 2)
C4 = [-1, 4] + 0.2 * np.random.randn(n_points_per_cluster, 2)
C5 = [1, -3] + 1.2 * np.random.randn(n_points_per_cluster, 2)
C6 = [4, 5] + 1.5 * np.random.randn(n_points_per_cluster, 2)
X_modified = np.vstack((C1, C2, C3, C4, C5, C6))

# min_cluster_size=0.1 is a fraction: 10% of the 1200 samples
clust = OPTICS(min_samples=40, xi=0.1, min_cluster_size=0.1)

This code uses NumPy's random module to create a modified synthetic dataset of six Gaussian clusters with different centers and spreads. It then configures scikit-learn's OPTICS estimator with min_samples=40, xi=0.1, and min_cluster_size=0.1. These parameters set the number of samples needed for a core point, the minimum steepness of a cluster boundary on the reachability plot, and the minimum cluster size (here 10% of the samples), respectively. At this stage the clust object only holds the configuration; the clustering structure is computed when fit is called in the next step.

OPTICS Clustering Visualization

Python3

# Run the fit
clust.fit(X_modified)

# DBSCAN-style extraction at eps=0.7, reusing the fitted OPTICS result
labels_050 = cluster_optics_dbscan(
    reachability=clust.reachability_,
    core_distances=clust.core_distances_,
    ordering=clust.ordering_,
    eps=0.7,
)

# DBSCAN-style extraction at eps=1.5
labels_200 = cluster_optics_dbscan(
    reachability=clust.reachability_,
    core_distances=clust.core_distances_,
    ordering=clust.ordering_,
    eps=1.5,
)

# Reachability values and labels arranged in OPTICS processing order
space = np.arange(len(X_modified))
reachability = clust.reachability_[clust.ordering_]
labels = clust.labels_[clust.ordering_]

plt.figure(figsize=(10, 7))
G = gridspec.GridSpec(2, 3)
ax1 = plt.subplot(G[0, :])
ax2 = plt.subplot(G[1, 0])
ax3 = plt.subplot(G[1, 1])
ax4 = plt.subplot(G[1, 2])

# Reachability plot
colors = ["b.", "g.", "r.", "y.", "c."]
for klass, color in zip(range(0, 5), colors):
    Xk = space[labels == klass]
    Rk = reachability[labels == klass]
    ax1.plot(Xk, Rk, color, alpha=0.3)
ax1.plot(space[labels == -1], reachability[labels == -1], "k.", alpha=0.3)
# Horizontal lines mark the two epsilon cuts used above (1.5 and 0.7)
ax1.plot(space, np.full_like(space, 1.5, dtype=float), "k-", alpha=0.5)
ax1.plot(space, np.full_like(space, 0.7, dtype=float), "k-.", alpha=0.5)
ax1.set_ylabel("Reachability (epsilon distance)")
ax1.set_title("Reachability Plot")

# OPTICS
colors = ["b.", "g.", "r.", "y.", "c."]
for klass, color in zip(range(0, 5), colors):
    Xk = X_modified[clust.labels_ == klass]
    ax2.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3)
ax2.plot(X_modified[clust.labels_ == -1, 0],
         X_modified[clust.labels_ == -1, 1], "k+", alpha=0.1)
ax2.set_title("Automatic Clustering\nOPTICS")

# DBSCAN at 0.7
colors = ["b.", "g.", "r.", "c."]
for klass, color in zip(range(0, 4), colors):
    Xk = X_modified[labels_050 == klass]
    ax3.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3)
ax3.plot(X_modified[labels_050 == -1, 0],
         X_modified[labels_050 == -1, 1], "k+", alpha=0.1)
ax3.set_title("Clustering at 0.7 epsilon cut\nDBSCAN")

# DBSCAN at 1.5
colors = ["b.", "m.", "y.", "c."]
for klass, color in zip(range(0, 4), colors):
    Xk = X_modified[labels_200 == klass]
    ax4.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3)
ax4.plot(X_modified[labels_200 == -1, 0],
         X_modified[labels_200 == -1, 1], "k+", alpha=0.1)
ax4.set_title("Clustering at 1.5 epsilon cut\nDBSCAN")

plt.tight_layout()
plt.show()

Output:

This code fits OPTICS to the modified dataset (X_modified) with clust.fit(). cluster_optics_dbscan then reuses the computed reachability distances, core distances, and ordering to extract DBSCAN-style clusterings at two epsilon values, producing labels_050 (eps=0.7) and labels_200 (eps=1.5) without refitting. The top subplot (ax1) shows the reachability plot, with the points in OPTICS order on the horizontal axis and their reachability distances on the vertical axis. The bottom subplots show the clusters found automatically by OPTICS (ax2) and by the DBSCAN-style cuts at eps=0.7 (ax3) and eps=1.5 (ax4). Comparing the panels shows how different epsilon values change the clustering result: a lower cut isolates only the densest regions and leaves more points as noise, while a higher cut merges neighboring regions. The figure is arranged on a 2×3 grid, with the reachability plot spanning the top row.
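To compare epsilon cuts numerically rather than by eye, the labels can be summarized as a cluster count and a noise count. The sketch below does this on a separate small dataset so that it runs on its own; summarize_labels is a hypothetical helper and X_cut is assumed toy data, neither is part of scikit-learn or of the example above.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan


def summarize_labels(labels):
    """Count clusters and noise points in a label array (-1 means noise)."""
    labels = np.asarray(labels)
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    return {"n_clusters": n_clusters, "n_noise": n_noise}


# Toy data (illustrative): one tight blob and one loose blob.
rng = np.random.default_rng(7)
X_cut = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(60, 2)),
    rng.normal(loc=[4.0, 4.0], scale=1.0, size=(60, 2)),
])

# Fit once, then extract DBSCAN-style labels at several epsilon cuts.
fitted = OPTICS(min_samples=10).fit(X_cut)

summaries = {}
for eps in (0.3, 1.5):
    cut_labels = cluster_optics_dbscan(
        reachability=fitted.reachability_,
        core_distances=fitted.core_distances_,
        ordering=fitted.ordering_,
        eps=eps,
    )
    summaries[eps] = summarize_labels(cut_labels)

print(summaries)
```

The same helper could be applied to clust.labels_, labels_050, and labels_200 from the example above. Raising eps can only shrink the set of noise points, since a point is noise only when both its reachability and its core distance exceed eps.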

Advantages of OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure) is a useful clustering algorithm for some kinds of datasets because it has a number of benefits:

Flexibility in Cluster Shape and Size: OPTICS is appropriate for datasets with complex structures and irregularly shaped clusters because it can recognize clusters of different sizes and shapes.
Adaptive to Local Density: It can effectively handle datasets with regions of varying point densities because it adjusts to the local density of data points.
Robust to Noise and Outliers: Because OPTICS finds core points and takes reachability distances into account, it is resistant to noise and outliers. In the clustering process, outliers are usually classified as noise.
No Need for Specifying the Number of Clusters: OPTICS is appropriate for datasets where the number of clusters is unknown in advance because it does not require users to predefine the number of clusters.

Disadvantages of OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure) has a number of benefits, but it also has some drawbacks and restrictions:

Computational Complexity: Because OPTICS involves reachability plot creation and distance calculations, it can be computationally costly, particularly for large datasets.
Order-Dependent Output: For a given input, OPTICS itself is deterministic, but the point ordering in the reachability plot depends on the order of the input data. When several points are tied, changing the order of the input can change the ordering and, with it, the cluster assignment of some border points.
Sensitivity to Parameters: The selection of parameters, such as min_samples, xi, and min_cluster_size, can have an impact on OPTICS performance. Different datasets might require different levels of parameter tuning.
Limited Support for Stream Data: OPTICS is not a good fit for streaming data because it usually needs the whole dataset in order to create the reachability plot. Incremental or real-time updates are difficult.
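The sensitivity to parameters noted above can be checked directly with a small sweep. This is a rough sketch on assumed toy data (X_sens); the grid of min_samples and xi values is arbitrary and only meant to show how to tabulate the effect, not which settings to prefer.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Toy data (illustrative): two tight blobs and one loose blob.
rng = np.random.default_rng(3)
X_sens = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.4, size=(80, 2)),
    rng.normal(loc=[3.0, 0.0], scale=0.4, size=(80, 2)),
    rng.normal(loc=[1.5, 3.0], scale=1.0, size=(80, 2)),
])

# Record (number of clusters, number of noise points) per setting.
results = {}
for min_samples in (5, 20):
    for xi in (0.01, 0.05, 0.2):
        labels = OPTICS(min_samples=min_samples, xi=xi).fit_predict(X_sens)
        n_clusters = len(set(labels.tolist()) - {-1})
        n_noise = int(np.sum(labels == -1))
        results[(min_samples, xi)] = (n_clusters, n_noise)

for (min_samples, xi), (n_clusters, n_noise) in results.items():
    print(f"min_samples={min_samples:>2}, xi={xi:<4} -> "
          f"{n_clusters} clusters, {n_noise} noise points")
```

If the counts vary widely across the grid, that is a sign the chosen settings deserve a closer look, for example by inspecting the reachability plot for each one.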

Conclusion

In conclusion, Sklearn’s OPTICS is a powerful tool for uncovering hidden patterns and structures in datasets. Unlike traditional clustering algorithms, OPTICS adapts to varying data densities, making it versatile for real-world applications. The reachability plot gives insights into the data’s structure, guiding the identification of clusters and sparse regions. The cluster visualization further enhances interpretability, allowing users to understand how data points are grouped. OPTICS excels in scenarios where clusters have irregular shapes and sizes.

This article has introduced OPTICS, explained its working principles, and demonstrated its application on a synthetic dataset. By offering flexibility in cluster analysis without rigid assumptions, OPTICS proves valuable in understanding complex data relationships. Researchers, data scientists, and analysts can leverage OPTICS to gain deeper insights into their datasets and make informed decisions based on the discovered clustering structure.
