
K-Nearest Neighbors and Curse of Dimensionality

Last Updated : 04 Mar, 2024

In high-dimensional data, the performance of the k-nearest neighbor (k-NN) algorithm often deteriorates: the computational cost of finding neighbors grows, and the assumption that similar points lie close together breaks down. This article explores these challenges, collectively known as the curse of dimensionality, explains how increasing dimensionality affects k-NN performance, and presents strategies, such as dimensionality reduction, for mitigating the problem.

What is the KNN algorithm (K-Nearest Neighbors)?

  • K-Nearest Neighbors (KNN) is a simple, non-parametric, and instance-based learning algorithm used for classification and regression tasks.
  • In KNN, the output for an instance (whether it belongs to a certain class or has a certain value) is determined by the majority vote or the average of its k nearest neighbors in the feature space.
  • The “nearest” neighbors are defined by a distance metric, commonly Euclidean distance, but other metrics such as Manhattan distance or cosine similarity can also be used.
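
For example, the short sketch below (a minimal illustration with a synthetic dataset and k = 3, not part of this article's experiment) shows that the distance metric is simply a parameter of scikit-learn's KNeighborsClassifier:

Python3

# Minimal sketch: KNN classification with different distance metrics.
# The synthetic dataset and k = 3 are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for metric in ['euclidean', 'manhattan', 'cosine']:
    knn = KNeighborsClassifier(n_neighbors=3, metric=metric)
    knn.fit(X_train, y_train)
    print(metric, 'accuracy:', knn.score(X_test, y_test))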

What is the Curse of Dimensionality?

  • The Curse of Dimensionality refers to various phenomena that arise when dealing with high-dimensional data.
  • As the number of features or dimensions increases, the volume of the feature space grows exponentially, leading to sparsity in the data distribution.
  • This sparsity can result in several challenges such as increased computational complexity, overfitting, and deteriorating performance of certain algorithms.
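
The rough sketch below (the sample size and radius are arbitrary illustrative values) shows this sparsity effect: as the dimension grows, a fixed-radius neighborhood around a query point captures a rapidly shrinking fraction of the data.

Python3

# Rough illustration of sparsity: with a fixed number of uniform samples in
# the unit cube [0, 1]^d, the fraction of points within a fixed Euclidean
# distance of a central query point collapses as the dimension d grows.
import numpy as np

rng = np.random.default_rng(0)
n_samples, radius = 10000, 0.5  # arbitrary illustrative values

for d in [1, 2, 5, 10, 50]:
    X = rng.random((n_samples, d))  # uniform points in [0, 1]^d
    query = np.full(d, 0.5)         # query point at the center of the cube
    dists = np.linalg.norm(X - query, axis=1)
    print(f"d={d:3d}  fraction within radius {radius}: {np.mean(dists < radius):.4f}")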

How does Dimensionality affect KNN Performance?

The impact of dimensionality on the performance of KNN (K-Nearest Neighbors) is a well-known issue in machine learning. Here’s a breakdown of how dimensionality affects KNN performance:

  1. Increased Sparsity: As the number of dimensions increases, the volume of the space grows exponentially. Consequently, the available data becomes sparser, meaning that data points are spread farther apart from each other. This sparsity can lead to difficulties in finding meaningful nearest neighbors, as there may be fewer neighboring points within a given distance.
  2. Equal Distances: In high-dimensional spaces, the concept of distance becomes less meaningful. As the number of dimensions increases, the distances between pairs of points tend to become nearly uniform, so every point looks roughly equidistant from every other. This happens because the contribution of any single dimension to the total distance diminishes as the number of dimensions grows, causing distances to concentrate around a common value (see the sketch after this list).
  3. Degraded Performance: KNN relies on the assumption that nearby points in the feature space are likely to have similar labels. However, in high-dimensional spaces, this assumption may no longer hold true due to the increased sparsity and equalization of distances. As a result, KNN may struggle to accurately classify data points, leading to degraded performance.
  4. Increased Computational Complexity: With higher dimensionality, the computational cost of KNN increases significantly. The algorithm needs to compute distances in a high-dimensional space, which involves more calculations. This can make the KNN algorithm slower and less efficient, especially when dealing with large datasets.
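
The rough sketch below illustrates the "equal distances" effect from point 2 (the sample size is arbitrary and exact numbers will vary): the relative contrast between the nearest and farthest point from a query shrinks as the dimension grows.

Python3

# Rough illustration of distance concentration: the relative gap between the
# nearest and farthest point from a random query shrinks as dimension grows,
# so the notion of a "nearest" neighbor becomes less meaningful.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000  # arbitrary illustrative value

for d in [2, 10, 100, 1000]:
    X = rng.random((n_samples, d))
    query = rng.random(d)
    dists = np.linalg.norm(X - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast (max - min) / min: {contrast:.3f}")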

How to reduce Dimensionality?

Dimensionality reduction techniques aim to mitigate the curse of dimensionality by reducing the number of features while preserving the most relevant information.

  1. Principal Component Analysis (PCA): PCA is a technique used to reduce the dimensionality of the data by projecting it onto a lower-dimensional subspace while preserving the maximum variance. It achieves this by finding the principal components, which are orthogonal directions in the feature space that capture the most variance in the data.
  2. Linear Discriminant Analysis (LDA): LDA is a supervised dimensionality reduction technique that aims to find the linear combinations of features that best separate different classes in the data. Unlike PCA, which focuses on maximizing variance, LDA focuses on maximizing the between-class scatter while minimizing the within-class scatter. LDA projects the data onto a lower-dimensional space that maximizes class separability.

Both PCA and LDA can be used as preprocessing steps before applying KNN or other machine learning algorithms to reduce the dimensionality of the data and improve the performance of the models.
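
As a practical aside (not part of the experiment below), a common way to choose the number of PCA components is to look at the cumulative explained variance ratio; the sketch below uses the small scikit-learn digits dataset and an illustrative 95% threshold.

Python3

# Sketch: choosing PCA's n_components from the cumulative explained variance.
# The digits dataset and the 95% threshold are illustrative choices.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)           # 8x8 digit images, 64 features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)                     # keep all components to inspect them
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print("Components needed to retain 95% of the variance:", n_components)

For LDA there is no such threshold to tune: the number of components is limited to at most the number of classes minus one, which is why the implementation below keeps 9 components for the 10 MNIST digit classes.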

Implementation

In the implementation, we aim to assess the impact of dimensionality reduction techniques, specifically Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), on the accuracy of the k-nearest neighbor (KNN) algorithm. We will initially apply KNN without dimensionality reduction, then utilize LDA and PCA separately to reduce the dimensions of the data. Subsequently, we will compare the accuracy scores achieved by KNN with the original data, LDA-transformed data, and PCA-transformed data. This comparative analysis will provide insights into the effectiveness of dimensionality reduction techniques in improving KNN performance.

Steps to check KNN performance on high-dimensional data:

Import the necessary libraries.

Python3




# Importing necessary libraries
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score


Loading the dataset and splitting the data

Python3




# Load the MNIST dataset
mnist = fetch_openml('mnist_784', version=1)
X = mnist.data.astype('float64')
y = mnist.target.astype('int64')
 
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


Standardize the features.

Python3




# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


Comparing KNN with and without dimensionality reduction

In the code snippet below, K-Nearest Neighbors (KNN) classification is performed first without dimensionality reduction and then after applying Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).

Python3




# Implementing KNN without dimensionality reduction
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
accuracy_without_reduction = accuracy_score(y_test, y_pred)
print("Accuracy of KNN without dimensionality reduction:", accuracy_without_reduction)
 
# Implementing PCA for dimensionality reduction
pca = PCA(n_components=100)  # Reduce to 100 dimensions
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
 
# Implementing KNN with PCA
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_with_pca = accuracy_score(y_test, y_pred_pca)
print("Accuracy of KNN with PCA:", accuracy_with_pca)
 
# Implementing LDA for dimensionality reduction
lda = LinearDiscriminantAnalysis(n_components=9)  # Reduce to 9 dimensions
X_train_lda = lda.fit_transform(X_train_scaled, y_train)
X_test_lda = lda.transform(X_test_scaled)
 
# Implementing KNN with LDA
knn_lda = KNeighborsClassifier(n_neighbors=5)
knn_lda.fit(X_train_lda, y_train)
y_pred_lda = knn_lda.predict(X_test_lda)
accuracy_with_lda = accuracy_score(y_test, y_pred_lda)
print("Accuracy of KNN with LDA:", accuracy_with_lda)


Output :

Accuracy of KNN without dimensionality reduction: 0.79457857142857143
Accuracy of KNN with PCA: 0.9497142857142857
Accuracy of KNN with LDA: 0.9161428571428571


Conclusion

In this article, we applied the K-Nearest Neighbors (KNN) algorithm to the MNIST dataset, both on the raw features and after dimensionality reduction with Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). KNN achieved only moderate accuracy without dimensionality reduction. PCA reduced the data to 100 dimensions while markedly improving accuracy and computational efficiency. LDA, although reducing the data to far fewer dimensions (9), still improved accuracy over plain KNN by focusing on class separability. These results illustrate how dimensionality reduction can both speed up KNN and improve its classification performance on high-dimensional data.


