
Spectral Clustering using R

Spectral clustering is a technique used in machine learning and data analysis for grouping data points based on their similarity. The method transforms the data into a representation in which the clusters become apparent and then applies a standard clustering algorithm to this transformed data. In the R programming language, the transformation is carried out using the eigenvalues and eigenvectors of a similarity (or graph Laplacian) matrix.

Spectral Clustering

Spectral clustering in the R programming language is built from a few standard components, described below.



Spectral clustering works by transforming the data into a lower-dimensional space where clustering is performed more effectively. The key steps involved in spectral clustering are as follows:

Affinity Matrix

Start with a dataset of data points. Compute an affinity or similarity matrix that quantifies the relationships between these data points. This matrix reflects how similar or related each pair of data points is. Common affinity measures include Gaussian similarity, k-nearest neighbors, or a user-defined similarity function.
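
As a minimal sketch (the gaussian_affinity helper, the toy matrix X, and the bandwidth sigma are illustrative choices, not part of any particular package), a Gaussian affinity matrix can be computed like this:

# Minimal sketch: Gaussian (RBF) affinity matrix for a numeric data matrix X
# sigma is the kernel bandwidth and controls how quickly similarity decays
gaussian_affinity <- function(X, sigma = 1) {
  D <- as.matrix(dist(X))            # pairwise Euclidean distances
  A <- exp(-D^2 / (2 * sigma^2))     # Gaussian similarity in (0, 1]
  diag(A) <- 0                       # no self-similarity edges
  A
}

# Toy example: 10 random 2-D points
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)
A <- gaussian_affinity(X, sigma = 1)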



Graph Representation

Interpret the affinity matrix as the adjacency matrix of a weighted undirected graph. In this graph, each data point corresponds to a vertex, and the weight of the edge between vertices reflects the similarity between the corresponding data points.
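
As a hedged sketch building on the affinity matrix A from the previous step, the matrix can be treated as a weighted adjacency matrix with the igraph package:

# Sketch: interpret the affinity matrix A as the adjacency matrix of a
# weighted, undirected graph (one vertex per data point)
library(igraph)

g <- graph_from_adjacency_matrix(A, mode = "undirected", weighted = TRUE)
E(g)$weight[1:5]    # edge weights are the pairwise similarities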

Laplacian Matrix

Construct the graph Laplacian matrix, which captures the connectivity of the data points in the graph. Two main types of Laplacian are used in spectral clustering: the unnormalized Laplacian L = D − W and the symmetric normalized Laplacian L_sym = I − D^(−1/2) W D^(−1/2), where W is the affinity (weight) matrix and D is the diagonal degree matrix with D_ii = Σ_j W_ij.
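
A minimal sketch of both Laplacians, assuming the affinity matrix A from the earlier steps (and that every point has a non-zero total similarity):

# Sketch: unnormalized and symmetric normalized graph Laplacians
W <- A                       # affinity / weight matrix
d <- rowSums(W)              # vertex degrees (assumed non-zero)
D <- diag(d)                 # degree matrix

L_unnorm <- D - W                                          # L = D - W
D_inv_sqrt <- diag(1 / sqrt(d))
L_sym <- diag(nrow(W)) - D_inv_sqrt %*% W %*% D_inv_sqrt   # I - D^(-1/2) W D^(-1/2)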

Eigenvalue Decomposition

Compute the eigenvalues (λ_1, λ_2, …, λ_n) and the corresponding eigenvectors (v_1, v_2, …, v_n) of the Laplacian matrix. You typically compute a few eigenvectors, corresponding to the smallest non-zero eigenvalues.
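
Continuing the sketch with the normalized Laplacian L_sym from above, the eigenvectors belonging to the smallest eigenvalues can be extracted with base R's eigen():

# Sketch: eigen-decompose the Laplacian; eigen() returns eigenvalues in
# decreasing order, so the smallest ones occupy the last columns
k <- 3
eig <- eigen(L_sym, symmetric = TRUE)
n_pts <- nrow(L_sym)
U <- eig$vectors[, (n_pts - k + 1):n_pts]   # eigenvectors of the k smallest eigenvalues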

Embedding

Use the selected eigenvectors to embed the data into a lower-dimensional space. The eigenvectors represent new features that capture the underlying structure of the data. The matrix containing these eigenvectors is referred to as the spectral embedding.

Clustering

Apply a clustering algorithm to the rows of the spectral embedding. A standard algorithm such as k-means (or a normalized-cut style partitioning) is typically used to group the data points into clusters in this lower-dimensional space.
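
Continuing the same sketch, k-means can then be applied to the rows of the embedding U; the row normalization shown here follows the common Ng–Jordan–Weiss variant and is optional:

# Sketch: cluster the rows of the spectral embedding U
U_norm <- U / sqrt(rowSums(U^2))                 # optional row normalization
clusters <- kmeans(U_norm, centers = k, nstart = 20)$cluster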

The key idea behind spectral clustering is that by using spectral embeddings, we can potentially find clusters that are not easily separable in the original feature space. The choice of the number of clusters and the number of eigenvectors to retain in the embedding space often depends on domain knowledge, data characteristics, and application-specific requirements.
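
One common way to pick the number of clusters (shown here only as a hedged sketch, not used in the examples below) is the eigengap heuristic: sort the Laplacian eigenvalues and look for the first large jump.

# Sketch: eigengap heuristic for choosing the number of clusters
lambda <- sort(eigen(L_sym, symmetric = TRUE, only.values = TRUE)$values)
plot(lambda, type = "b", xlab = "Index", ylab = "Eigenvalue",
     main = "Sorted Laplacian eigenvalues (look for a large gap)")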

Now we will take the iris dataset for clustering.

Load the iris dataset




# Load the iris dataset
data(iris)
 
head(iris)

Output:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Create a Similarity Matrix




# Gaussian (RBF) similarity based on Euclidean distance (sigma = 1);
# converting the distances to a full matrix keeps the diagonal at exp(0) = 1
dist_mat <- as.matrix(dist(iris[, 1:4]))
similarity_matrix <- exp(-dist_mat^2 / (2 * 1^2))
 
# Compute eigenvalues and eigenvectors of the similarity matrix
eigen_result <- eigen(similarity_matrix, symmetric = TRUE)
eigenvalues <- eigen_result$values
eigenvectors <- eigen_result$vectors
 
# Choose the first k eigenvectors (eigen() sorts eigenvalues in decreasing
# order, so these correspond to the largest eigenvalues)
k <- 3
selected_eigenvectors <- eigenvectors[, 1:k]
 
# Apply k-means clustering to the spectral embedding
set.seed(123)   # for reproducible k-means initialization
cluster_assignments <- kmeans(selected_eigenvectors, centers = k)$cluster
 
# Add the cluster assignments to the iris data frame
iris$Cluster <- factor(cluster_assignments)
iris$Species <- as.character(iris$Species)

A similarity matrix is created by applying a Gaussian (RBF) kernel to the pairwise Euclidean distances between observations: nearby points receive a similarity close to 1, distant points a similarity close to 0. In this simplified variant, the leading eigenvectors of the similarity matrix itself (rather than of a graph Laplacian) are used as the spectral embedding before k-means is applied.
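
To check how well the recovered clusters match the known species, a quick cross-tabulation can be added (this step is not part of the original walkthrough):

# Compare spectral clusters with the known species labels
table(Cluster = iris$Cluster, Species = iris$Species)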

Visualize the Results




library(ggplot2)
 
# Visualizing the clusters with species names
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Cluster, label = Species)) +
  geom_point() +
  geom_text(check_overlap = TRUE, vjust = 1.5) +
  labs(title = "Spectral Clustering of Iris Dataset",
       x = "Sepal Length", y = "Sepal Width")

Output:

[Output plot: Spectral Clustering of Iris Dataset]

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Cluster, label = Species)): This sets up the initial plot using the iris dataset. It specifies that the x-axis should represent Sepal.Length, the y-axis should represent Sepal.Width, and the color of the points should be determined by the ‘Cluster’ column. Additionally, the ‘label’ aesthetic is set to ‘Species’ to label the data points with the species names.

Spectral Clustering with k-means




# Generate random data for clustering
set.seed(123)
n <- 100   # points per cluster
k <- 3
 
# Create three well-separated 2-D clusters centered at (2, 2), (-2, 2), (0, -2)
data <- rbind(
  cbind(rnorm(n, mean =  2, sd = 0.5), rnorm(n, mean =  2, sd = 0.5)),
  cbind(rnorm(n, mean = -2, sd = 0.5), rnorm(n, mean =  2, sd = 0.5)),
  cbind(rnorm(n, mean =  0, sd = 0.5), rnorm(n, mean = -2, sd = 0.5))
)
 
# Compute the Gaussian similarity matrix from pairwise Euclidean distances
similarity_matrix <- exp(-as.matrix(dist(data))^2)
 
# Perform spectral decomposition of the similarity matrix
eigen_result <- eigen(similarity_matrix, symmetric = TRUE)
 
# Extract the top-k eigenvectors (largest eigenvalues come first)
k_eigenvectors <- eigen_result$vectors[, 1:k]
 
# Perform k-means clustering on the spectral embedding
cluster_assignments <- kmeans(k_eigenvectors, centers = k)$cluster
 
# Visualize the clusters
plot(data, col = cluster_assignments, pch = 19,
     main = "Spectral Clustering with k-means")

Output:

[Output plot: Spectral Clustering with k-means]

The code first generates a random dataset with three clusters. It sets the random seed for reproducibility and builds each cluster by drawing x and y coordinates from normal distributions (via rnorm) centered at (2, 2), (-2, 2), and (0, -2), then stacks the three blocks with rbind to form the dataset.

Spectral Clustering using igraph package




library(igraph)
 
# Set seed for reproducibility
set.seed(2000)
 
# Create a tree graph with 80 vertices and a branching factor of 4
treeGraph <- make_tree(80, children = 4, mode = "undirected")
 
# Generate random cluster assignments for the tree graph
num_clusters <- 4
cluster <- sample(1:num_clusters, size = vcount(treeGraph), replace = TRUE)
 
# Define cluster colors and labels
cluster_colors <- c("red", "blue", "green", "purple")
cluster_labels <- c("Cluster A", "Cluster B", "Cluster C", "Cluster D")
 
# Visualize the tree graph with markers
plot(treeGraph,
     layout = layout_nicely(treeGraph),
     vertex.size = 10,
     vertex.label = NA,
     vertex.color = cluster_colors[cluster],  # Use the defined colors
     main = "Spectral Clustering of a Tree Graph",
     edge.arrow.size = 0.2)
 
# Add markers to the plot legend
legend("topright", legend = cluster_labels, fill = cluster_colors,
       title = "Clusters")

Output:

[Output plot: Spectral Clustering of a Tree Graph]

The code first loads the igraph library, an R package for creating and analyzing network graphs and structures. Note that in this example the cluster labels are drawn at random with sample(), purely to illustrate how cluster membership can be colored on a graph; no spectral decomposition is performed in this block.

We define cluster_colors and cluster_labels. cluster_colors is a vector of color names, and cluster_labels is a vector of labels corresponding to each cluster. These will be used in the plot and legend.

Finally, this code adds a legend to the plot. It specifies the position (“topright”) of the legend, the labels (cluster_labels) for each cluster, the fill colors (cluster_colors), and a title for the legend (“Clusters”). This legend provides a visual reference to the cluster assignments and their associated colors on the graph.
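
Since the cluster labels above are sampled at random, an actual spectral clustering of the same tree graph could be sketched as follows, using igraph's laplacian_matrix() together with eigen() and kmeans(). This is a hedged illustration that reuses treeGraph, num_clusters, and cluster_colors from the example above.

# Sketch: Laplacian-based spectral clustering of the tree graph
L <- as.matrix(laplacian_matrix(treeGraph))
 
# Eigenvectors of the smallest eigenvalues form the spectral embedding
# (eigen() sorts eigenvalues in decreasing order, so take the last columns)
eig <- eigen(L, symmetric = TRUE)
n_v <- vcount(treeGraph)
U <- eig$vectors[, (n_v - num_clusters + 1):n_v]
 
# k-means on the embedding gives the spectral cluster assignments
set.seed(2000)
spectral_cluster <- kmeans(U, centers = num_clusters, nstart = 20)$cluster
 
# Color the vertices by their spectral cluster
plot(treeGraph,
     layout = layout_nicely(treeGraph),
     vertex.size = 10,
     vertex.label = NA,
     vertex.color = cluster_colors[spectral_cluster],
     main = "Laplacian-based Spectral Clustering of the Tree Graph")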

