# Spectral Clustering in Machine Learning

Last Updated : 10 May, 2023

Prerequisites: K-Means Clustering

The clustering algorithms studied so far, such as K-Means, group data points by compactness, that is, by the distance between them. Connectivity between data points can also serve as the criterion for clustering. Using connectivity, two data points can be placed in the same cluster even when the distance between them is large, as long as they are linked through a chain of nearby points.

## Spectral Clustering

Spectral Clustering is a clustering technique that uses the connectivity between data points to form clusters. It uses the eigenvalues and eigenvectors of a matrix derived from the pairwise similarities of the data (the graph Laplacian) to project the data into a lower-dimensional space, where the points are then clustered. It is based on a graph representation of the data in which each data point is a node and the similarity between two data points is represented by an edge.

### Steps performed for spectral Clustering

Building the Similarity Graph of the Data: This step builds the similarity graph in the form of an adjacency matrix, denoted by A. The adjacency matrix can be built in any of the following ways:

• Epsilon-neighbourhood Graph: A parameter epsilon is fixed beforehand. Then, each point is connected to all the points that lie within its epsilon-radius. If the distances between connected points are all of a similar scale, the edge weights (the distances between the two points) are typically not stored, since they provide no additional information. In this case, the graph built is undirected and unweighted.
• K-Nearest Neighbours Graph: A parameter k is fixed beforehand. Then, for two vertices u and v, an edge is directed from u to v only if v is among the k-nearest neighbours of u. This produces a directed graph, because v being among the k-nearest neighbours of u does not guarantee that u is among the k-nearest neighbours of v. To make the graph undirected, one of the following approaches is used:
   1. Connect u and v with an undirected edge if either v is among the k-nearest neighbours of u OR u is among the k-nearest neighbours of v.
   2. Connect u and v with an undirected edge only if v is among the k-nearest neighbours of u AND u is among the k-nearest neighbours of v.
• Fully-Connected Graph: Each point is connected to every other point by an undirected edge weighted by the similarity between the two points. Since the graph should model local neighbourhood relationships, a similarity function that decays with distance is typically used, most commonly the Gaussian similarity.
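The three constructions described above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function names, the toy data, and the values of `eps`, `k`, and `sigma` are all made up for the example, and the Gaussian similarity is assumed to take the common form exp(-d² / (2σ²)).

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def epsilon_graph(X, eps):
    """Unweighted, undirected graph: connect points at distance <= eps."""
    A = (pairwise_distances(X) <= eps).astype(int)
    np.fill_diagonal(A, 0)  # no self-loops
    return A

def knn_graph(X, k, mutual=False):
    """Undirected k-nearest-neighbour graph.

    mutual=False applies the OR rule, mutual=True applies the AND rule.
    """
    dist = pairwise_distances(X)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    directed = np.zeros(dist.shape, dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    directed[rows, nearest.ravel()] = True  # edge u -> v for each neighbour v
    undirected = directed & directed.T if mutual else directed | directed.T
    return undirected.astype(int)

def gaussian_graph(X, sigma):
    """Fully connected graph weighted by Gaussian similarity."""
    dist = pairwise_distances(X)
    W = np.exp(-dist ** 2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

# Toy data (illustrative only): two tight pairs of points, far apart
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
A_eps = epsilon_graph(X, eps=1.5)
A_knn = knn_graph(X, k=1)
A_mutual = knn_graph(X, k=1, mutual=True)
W = gaussian_graph(X, sigma=1.0)
print(A_eps)
print(W.round(3))
```

On this toy data the epsilon graph and both k-nearest-neighbour variants link only the points within each pair, while the Gaussian graph keeps every edge but gives the cross-pair edges weights close to zero.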

Projecting the Data onto a Lower-Dimensional Space: This step accounts for the possibility that members of the same cluster may be far apart in the original space. The data is therefore mapped to a lower-dimensional space in which such points lie close together and can be grouped by a traditional clustering algorithm. This is done by computing the Graph Laplacian Matrix.

#### Python Code For Graph Laplacian Matrix

To compute it, the degree of a node must first be defined. The degree of the ith node is given by

$$d_i = \sum_{j=1}^{n} w_{ij}$$

Note that $w_{ij}$ is the weight of the edge between the nodes i and j as defined in the adjacency matrix above (0 if the two nodes are not connected).

## Python3

    # Defining the adjacency matrix
    import numpy as np

    A = np.array([
        [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
        [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]])

The degree matrix D is the diagonal matrix whose ith diagonal entry is the degree of the ith node:

## Python3

    D = np.diag(A.sum(axis=1))
    print(D)

The Graph Laplacian Matrix is then defined as L = D - A:

## Python3

    L = D - A
    print(L)
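Two standard properties of the Laplacian are easy to check numerically on this example: every row of L sums to zero, and the number of zero eigenvalues equals the number of connected components of the graph. The sketch below rebuilds the same matrices so that it runs on its own; the tolerance `1e-9` is an arbitrary choice for deciding when an eigenvalue counts as zero.

```python
import numpy as np

A = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]])
D = np.diag(A.sum(axis=1))
L = D - A

# Each row of L holds the degree on the diagonal and -1 for every neighbour
row_sums = L.sum(axis=1)

# L is symmetric, so eigvalsh applies; it returns eigenvalues in ascending order
eigenvalues = np.linalg.eigvalsh(L)
n_components = int(np.sum(eigenvalues < 1e-9))

print(row_sums)       # all zeros
print(n_components)   # 2
```

The count of 2 reflects that this example graph consists of two groups of nodes, {0, 1, 2, 8, 9} and {3, 4, 5, 6, 7}, with no edge between them.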

This matrix is then normalized so that nodes with very different degrees contribute on a comparable scale. To reduce the dimensions, the eigenvalues and their corresponding eigenvectors are first calculated. If the number of clusters is k, the eigenvectors belonging to the k smallest eigenvalues are taken and stacked into a matrix so that the eigenvectors form its columns.
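The text does not say which normalization is meant. As an assumption, the two forms most commonly used in practice are the symmetric and the random-walk normalized Laplacians, written here with the same D, A, and L as above:

$$L_{\text{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}$$

$$L_{\text{rw}} = D^{-1} L = I - D^{-1} A$$

Both require every node to have a non-zero degree, since D must be invertible.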

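Code for calculating the eigenvalues and eigenvectors of the Laplacian matrix in Python:

## Python3

    # Find the eigenvalues and eigenvectors of the Laplacian.
    # L is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    vals, vecs = np.linalg.eigh(L)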

Clustering the Data: This step clusters the reduced data using any traditional clustering technique, typically K-Means Clustering. First, each node is assigned the corresponding row of the matrix of stacked eigenvectors obtained from the normalized Graph Laplacian Matrix. These rows are then clustered using the chosen technique. The node identifier is retained throughout, so that each cluster label can be mapped back to the original data point.
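Putting the three steps together, a minimal end-to-end sketch might look like the following. It assumes the symmetric normalization and adds a row-normalization step before K-Means, which is the variant described by Ng, Jordan and Weiss; the function name `spectral_cluster` and the edge list are introduced here for illustration only (the edges reproduce the example adjacency matrix used earlier).

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(A, k, random_state=0):
    """Cluster the nodes of a graph with adjacency matrix A into k groups."""
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A
    # Symmetric normalization; assumes every node has at least one edge
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    # eigh sorts eigenvalues in ascending order, so the first k columns
    # are the eigenvectors of the k smallest eigenvalues
    _, vecs = np.linalg.eigh(L_sym)
    U = vecs[:, :k]
    # Scale each row to unit length (Ng-Jordan-Weiss step)
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Row i of U is the low-dimensional representation of node i
    model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
    return model.fit_predict(U)

# Same graph as the example adjacency matrix, written as an edge list
edges = [(0, 1), (0, 2), (0, 8), (0, 9), (1, 2), (8, 9),
         (3, 4), (3, 5), (4, 5), (5, 6), (5, 7), (6, 7)]
A = np.zeros((10, 10))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

labels = spectral_cluster(A, k=2)
print(labels)
```

Because row i of the embedding corresponds to node i, `labels[i]` is directly the cluster of node i. Which group receives label 0 and which receives label 1 is arbitrary.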

Properties:

1. Assumption-Less: Unlike many traditional techniques, this clustering technique does not assume that the data follows a particular distribution or that clusters have a particular shape. This allows it to address a more general class of clustering problems.
2. Ease of Implementation and Speed: The algorithm is straightforward to implement, since it consists mainly of standard matrix computations, and it is fast on small or sparse datasets.
3. Not Scalable: Since it involves building matrices and computing eigenvalues and eigenvectors, it is time-consuming for large, dense datasets.
4. Dimensionality Reduction: The algorithm uses eigenvalue decomposition to reduce the dimensionality of the data, making it easier to visualize and analyze.
5. Cluster Shape: This technique can handle non-linear cluster shapes, making it suitable for a wide range of applications.
6. Noise Sensitivity: It is sensitive to noise and outliers, which may affect the quality of the resulting clusters.
7. Number of Clusters: The algorithm requires the user to specify the number of clusters beforehand, which can be challenging in some cases.
8. Memory Requirements: The algorithm requires significant memory to store the similarity matrix, which can be a limitation for large datasets.
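To give a feel for the memory requirement in the last point, the short calculation below estimates the size of a dense n × n similarity matrix stored as 64-bit floats. The dataset sizes are arbitrary illustrative values, and the helper name `dense_similarity_bytes` is made up for this sketch.

```python
import numpy as np

def dense_similarity_bytes(n_points, dtype=np.float64):
    """Bytes needed to hold a dense n_points x n_points matrix."""
    return n_points * n_points * np.dtype(dtype).itemsize

for n in (1_000, 10_000, 100_000):
    gigabytes = dense_similarity_bytes(n) / 1e9
    print(f"{n:>7} points -> {gigabytes:.3f} GB")
```

The cost grows with the square of the number of points, which is why sparse graphs such as the k-nearest-neighbours graph are attractive for larger datasets.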

## Credit Card Data Clustering Using Spectral Clustering

The steps below demonstrate how to implement Spectral Clustering using Sklearn. The data used is the Credit Card dataset, which can be downloaded from Kaggle.

Step 1: Importing the required libraries

We will first import all the libraries that are needed for this project

## Python3

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import SpectralClustering
    from sklearn.preprocessing import StandardScaler, normalize
    from sklearn.decomposition import PCA
    from sklearn.metrics import silhouette_score

Step 2: Loading and cleaning the data

## Python3

    import os

    # Changing the working location to the location of the data
    os.chdir(r"C:\Users\Dev\Desktop\Kaggle\Credit_Card")

    # Loading the data
    X = pd.read_csv('CC_GENERAL.csv')

    # Dropping the CUST_ID column from the data
    X = X.drop('CUST_ID', axis=1)

    # Handling the missing values, if any, by forward fill
    X = X.ffill()

    X.head()


Step 3: Preprocessing the data to make the data visualizable

## Python3

    # Preprocessing the data to make it visualizable

    # Scaling the data
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Normalizing the data
    X_normalized = normalize(X_scaled)

    # Converting the numpy array into a pandas DataFrame
    X_normalized = pd.DataFrame(X_normalized)

    # Reducing the dimensions of the data
    pca = PCA(n_components=2)
    X_principal = pca.fit_transform(X_normalized)
    X_principal = pd.DataFrame(X_principal)
    X_principal.columns = ['P1', 'P2']

    X_principal.head()
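The two preprocessing calls do different things: `StandardScaler` works column by column, giving each feature zero mean and unit variance, while `normalize` works row by row, scaling each sample to unit length (L2 norm by default). A small sketch on made-up numbers, not on the credit card data, makes the difference visible:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Made-up data: two features on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 2000.0],
                  [4.0, 4000.0]])

X_scaled = StandardScaler().fit_transform(X_toy)   # per-column scaling
X_normalized = normalize(X_scaled)                 # per-row scaling

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
row_norms = np.linalg.norm(X_normalized, axis=1)

print(col_means.round(6), col_stds.round(6))
print(row_norms.round(6))
```

After both steps every sample lies on the unit sphere, so only the direction of each standardized feature vector is passed on to PCA.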

Step 4: Building the Clustering models and Visualizing the Clustering

In the steps below, two Spectral Clustering models are built with different values of the parameter ‘affinity’. The available options are described in the documentation of the SpectralClustering class.

a) affinity = ‘rbf’

## Python3

    # Building the clustering model
    spectral_model_rbf = SpectralClustering(n_clusters=2, affinity='rbf')

    # Training the model and storing the predicted cluster labels
    labels_rbf = spectral_model_rbf.fit_predict(X_principal)

## Python3

    # Building the label to colour mapping
    colours = {0: 'b', 1: 'y'}

    # Plotting the clustered scatter plot, one cluster at a time
    # so that each cluster gets its own legend entry
    plt.figure(figsize=(9, 9))
    for label, colour in colours.items():
        mask = labels_rbf == label
        plt.scatter(X_principal.loc[mask, 'P1'], X_principal.loc[mask, 'P2'],
                    color=colour, label=f'Label {label}')
    plt.legend()
    plt.show()


b) affinity = ‘nearest_neighbors’

## Python3

    # Building the clustering model
    spectral_model_nn = SpectralClustering(n_clusters=2, affinity='nearest_neighbors')

    # Training the model and storing the predicted cluster labels
    labels_nn = spectral_model_nn.fit_predict(X_principal)
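Since the credit card file may not be at hand, the effect of the two affinity settings can also be tried on a synthetic dataset. The sketch below is an illustration under assumptions, not part of the original walkthrough: it uses scikit-learn's `make_moons` generator and, because the true groups are known for synthetic data, compares the predicted labels with them using the adjusted Rand index (1.0 means perfect agreement). The sample size, noise level, and `random_state` are arbitrary choices.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: a classic non-convex clustering problem
X_moons, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

results = {}
for affinity in ("rbf", "nearest_neighbors"):
    model = SpectralClustering(n_clusters=2, affinity=affinity, random_state=0)
    labels = model.fit_predict(X_moons)
    results[affinity] = adjusted_rand_score(y_true, labels)

print(results)
```

scikit-learn may warn that the nearest-neighbours graph is not fully connected; on data like this that simply reflects the two moons forming separate components. The `rbf` result depends strongly on the `gamma` parameter, which is left at its default here.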


Step 5: Evaluating the performances

## Python3

    # List of different values of affinity
    affinity = ['rbf', 'nearest-neighbours']

    # List of Silhouette Scores
    s_scores = []

    # Evaluating the performance
    s_scores.append(silhouette_score(X, labels_rbf))
    s_scores.append(silhouette_score(X, labels_nn))

    print(s_scores)
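To help read the scores, recall that the silhouette value of a point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster; `silhouette_score` returns the average over all points. The following sketch uses four made-up one-dimensional points to show how a sensible and a poor labelling score:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Four made-up points forming two obvious groups
points = np.array([[0.0], [1.0], [10.0], [11.0]])

good_labels = np.array([0, 0, 1, 1])  # groups nearby points together
bad_labels = np.array([0, 1, 0, 1])   # splits each natural group

good = silhouette_score(points, good_labels)
bad = silhouette_score(points, bad_labels)
print(round(good, 4), round(bad, 4))  # 0.8997 -0.45
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate that points sit closer to another cluster than to their own.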

Step 6: Comparing the performances

## Python3

    # Plotting a bar graph to compare the models
    plt.bar(affinity, s_scores)
    plt.xlabel('Affinity')
    plt.ylabel('Silhouette Score')
    plt.title('Comparison of different Clustering Models')
    plt.show()


Spectral Clustering is a type of clustering algorithm in machine learning that uses eigenvectors of a similarity matrix to divide a set of data points into clusters. The basic idea behind spectral clustering is to use the eigenvectors of the Laplacian matrix of a graph to represent the data points and find clusters by applying k-means or another clustering algorithm to the eigenvectors.

Advantages:

1. Scalability: Because clustering is performed in a low-dimensional embedding, spectral clustering copes well with high-dimensional data; for large numbers of points, the cost of building the similarity matrix noted above still applies.
2. Flexibility: Spectral clustering can be applied to non-linearly separable data, as it relies on connectivity rather than on compactness alone.
3. Robustness: Spectral clustering can be more robust to noise and outliers than purely distance-based methods, as it considers the global structure of the data rather than just local distances between data points, although heavy noise can still degrade the similarity graph.