
Projected clustering in data analytics

We already know about traditional clustering algorithms like k-means, DBSCAN, or hierarchical clustering that operate on all the dimensions of the data simultaneously. However, in high-dimensional data, clusters might only be present in a few dimensions, making the traditional clustering algorithms less effective. In this case, we use projected clustering.

What is Projected Clustering?

Projected clustering is a technique used to identify clusters in high-dimensional data by considering only subsets of dimensions, i.e. projections of the data into lower-dimensional subspaces. It is closely related to subspace clustering, with the difference that each point is assigned to exactly one cluster. The classic projected clustering algorithm, PROCLUS, is based on the concept of k-medoid clustering and was presented by Aggarwal et al. (1999).



In projected clustering, the algorithm determines medoids for each cluster iteratively using a greedy hill-climbing technique. It starts by selecting candidate medoids from a sample of the data and then iteratively refines the result. The quality of the clusters is typically measured by the average distance between data points and their closest medoid; this measure indicates how compact and well-separated the clusters in the output are.
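The quality measure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of any library; the function name `avg_distance_to_medoids` is our own, and the toy data simply shows that well-placed medoids score lower (better) than poorly placed ones.

```python
import numpy as np

def avg_distance_to_medoids(data, medoids):
    """Average distance from each point to its closest medoid.
    Lower values indicate more compact, better-separated clusters."""
    # Pairwise distances, shape (n_points, n_medoids)
    dists = np.linalg.norm(data[:, None, :] - medoids[None, :, :], axis=2)
    return dists.min(axis=1).mean()

# Toy data: two tight groups around (0, 0) and (5, 5)
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.1, (10, 2)),
                  rng.normal(5.0, 0.1, (10, 2))])

good_medoids = np.array([[0.0, 0.0], [5.0, 5.0]])  # near the groups
bad_medoids = np.array([[2.5, 2.5], [3.0, 3.0]])   # between the groups

print(avg_distance_to_medoids(data, good_medoids)
      < avg_distance_to_medoids(data, bad_medoids))  # True
```

A hill-climbing search would repeatedly swap one medoid for a candidate and keep the swap whenever this measure decreases.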

The subspace dimensionality is an important input parameter in projected clustering. It determines the number of dimensions considered when forming each cluster. By selecting relevant subspaces, the algorithm can discover clusters that are not evident when all dimensions are considered simultaneously. The choice of subspace dimensionality influences the size and structure of the resulting clusters.
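To make dimension selection concrete, here is a hedged sketch of the PROCLUS-style rule: for a given cluster, keep the l dimensions along which its points lie closest to the medoid. The function name `select_cluster_dimensions` and the synthetic data are our own; in the example, the cluster is compact only in dimensions 0 and 2, and the rule recovers exactly those.

```python
import numpy as np

def select_cluster_dimensions(points, medoid, l):
    """Keep the l dimensions with the smallest average deviation
    from the medoid (PROCLUS-style dimension selection)."""
    avg_dev = np.abs(points - medoid).mean(axis=0)  # per-dimension spread
    return np.argsort(avg_dev)[:l]

rng = np.random.default_rng(1)
# 5-D points: noisy everywhere except dimensions 0 and 2
points = rng.uniform(-10, 10, (50, 5))
points[:, 0] = rng.normal(3.0, 0.05, 50)
points[:, 2] = rng.normal(-1.0, 0.05, 50)

medoid = np.median(points, axis=0)
dims = sorted(int(d) for d in select_cluster_dimensions(points, medoid, l=2))
print(dims)  # [0, 2]
```

Choosing l too small can split real clusters; choosing it too large reintroduces the noisy dimensions that made full-dimensional clustering unreliable in the first place.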



Features of Projected Clustering:

- Each data point is assigned to exactly one cluster (unlike general subspace clustering, where clusters may overlap across subspaces).
- Each cluster is associated with its own subset of relevant dimensions.
- Points that do not fit any projected cluster can be reported as outliers.
- It remains effective on high-dimensional data, where full-dimensional distances become unreliable.

Steps Required in Projected Clustering

1. Initialization phase: draw a sample of the data and choose candidate medoids that are far apart from one another.
2. Iterative phase: pick k medoids from the candidates, determine the best subspace dimensions for each medoid, assign each point to the closest medoid in that medoid's subspace, and replace poorly performing medoids (greedy hill climbing).
3. Refinement phase: recompute the dimensions and assignments for the final medoids and label points that fit no cluster as outliers.

Input and Output for Projected Clustering:

Input –

- The dataset of points, the number of clusters k, and the average subspace dimensionality l.

Output –

- A partition of the points into k clusters (plus a possible set of outliers), together with the subset of dimensions associated with each cluster.
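The output described above can be pictured as a small data structure. This is purely illustrative (the variable names and values are hypothetical, not produced by any library): each cluster carries both its member points and the dimensions it lives in, and every point belongs to exactly one cluster.

```python
# Hypothetical result of a projected clustering run on 8 points in 5 dimensions:
# each cluster records its member point indices and its relevant dimensions.
projected_clusters = {
    0: {"points": [0, 3, 7], "dimensions": [1, 4]},
    1: {"points": [1, 2, 5, 6], "dimensions": [0, 2, 3]},
}
# Point 4 is left out here, i.e. reported as an outlier.
all_points = sorted(p for c in projected_clusters.values() for p in c["points"])
print(all_points)  # [0, 1, 2, 3, 5, 6, 7]
```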

Python Implementation of Projected Clustering 

To implement projected clustering in Python, we will use a dataset with 20 dimensions. We first apply PCA (principal component analysis) to reduce the dataset from 20 dimensions to 2, and then apply the k-means clustering algorithm on the projected data to cluster the points. Note that this PCA + k-means pipeline is a simple approximation of projected clustering: it clusters in one global projection rather than finding a separate subspace for each cluster.




import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
 
# Generate example high-dimensional data
np.random.seed(0)
num_samples = 1000
num_dimensions = 20
data = np.random.randn(num_samples, num_dimensions)
 
# Dimensionality reduction using PCA
num_selected_dimensions = 2
pca = PCA(n_components=num_selected_dimensions)
projected_data = pca.fit_transform(data)
 
# Perform k-means clustering on the projected data
num_clusters = 3
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
kmeans.fit(projected_data)
cluster_labels = kmeans.labels_
 
# Plot the clusters
plt.scatter(projected_data[:, 0], projected_data[:, 1], c=cluster_labels)
plt.title("Projected Clustering using K-means")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.show()
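As a quick sanity check, we can score the resulting clusters with the silhouette coefficient. This sketch re-uses the same synthetic setup as the example above; since the data is pure Gaussian noise, the score mainly reflects how k-means partitioned the projection, not genuine cluster structure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Same setup as the example above
np.random.seed(0)
data = np.random.randn(1000, 20)
projected = PCA(n_components=2).fit_transform(data)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(projected)

# Silhouette score ranges from -1 (poor) to 1 (well-separated clusters)
score = silhouette_score(projected, labels)
print(f"silhouette score: {score:.3f}")
```

On real data, comparing this score across different numbers of PCA components is one way to choose the subspace dimensionality parameter discussed earlier.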

Output:

[Scatter plot of the 2-D PCA projection, coloured by k-means cluster label, titled "Projected Clustering using K-means"]

