How to Find The Optimal Value of K in KNN

Last Updated : 14 Dec, 2023

KNN, or k-nearest neighbors, is a non-parametric, supervised learning algorithm that can be used for both classification and regression tasks; it uses proximity between data points to make predictions. It is a classic example of a lazy learner: it does not learn a model from the training set up front, but instead stores the dataset and performs the computation only when a prediction is requested.

Using K-Nearest Neighbors, we predict the category of a test point by finding the k training points closest to it and choosing among the class labels of those neighbors.

In this article, we will look at the KNN algorithm’s working mechanism, the parameters affecting the algorithm, the distance metrics it can use, its advantages and disadvantages, and real-world use cases of KNN. Finally, we will build a model to visualize the effect of changing the number of neighbors K, to guide the selection of a suitable K value in the KNN algorithm.

What is KNN(K Nearest Neighbors)?

K-Nearest Neighbors (KNN) is a non-parametric algorithm, which means it does not make assumptions about the underlying data distribution.

K-Nearest Neighbors (KNN) is a versatile supervised machine learning algorithm used for both classification and regression tasks. Its fundamental concept revolves around the idea that similar things are close to each other in a feature space. KNN operates on the principle of lazy learning, where it does not explicitly learn a model during the training phase but stores the entire training dataset.

During the prediction phase, when faced with a new data point, KNN identifies the ‘k’ nearest neighbors to that point based on a chosen distance metric, commonly the Euclidean distance. For classification tasks, KNN assigns the majority class among the neighbors to the new data point. In regression tasks, it calculates the average of the target values of the ‘k’ nearest neighbors.

During the training phase, the KNN algorithm memorizes the entire dataset. When presented with new data, it categorizes it into a class that closely resembles the characteristics of the new data as shown in Figure 1 below.

[Figure 1: K-Nearest Neighbors]
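To make the two modes concrete, here is a minimal sketch (an added example, not from the original article; it uses scikit-learn’s KNeighborsClassifier and KNeighborsRegressor on a tiny made-up dataset) showing how the same neighbor logic yields a majority-vote class in one case and an averaged target value in the other.

Python3

# Minimal sketch: the same neighbor idea for classification (majority vote)
# and regression (average of the neighbors' targets). The tiny dataset is
# made up purely for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X_train = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])                # class labels
y_target = np.array([1.1, 1.9, 3.2, 6.1, 7.0, 8.2])   # continuous targets

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_target)

x_new = np.array([[2.5]])
print(clf.predict(x_new))  # majority class among the 3 nearest neighbors -> [0]
print(reg.predict(x_new))  # average target of the 3 nearest neighbors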

How does KNN algorithm work?

Let’s understand how K-Nearest Neighbors (KNN) works with a simple step-by-step approach:

Step 1: Choose the Number of Neighbors (K)

Start by deciding how many neighbors (data points from your dataset) you want to consider when making predictions. This is your ‘K’ value.

Step 2: Calculate Euclidean Distance

Find the distance between your new data point and every point in the training data. Imagine it as measuring the straight-line distance between two points.

Step 3: Identify Nearest Neighbors

Pick the ‘K’ neighbors with the smallest calculated distances. These are the closest points to your new data.

Step 4: Count Data Points in Each Category

Among these neighbors, count how many belong to each category. For instance, count how many are in Category A and how many are in Category B.

Step 5: Assign to the Majority Category

Assign your new data point to the category that has the most neighbors. If most of them are in Category A, your new point goes into Category A.

Let’s picture this with an example: If we choose ‘K’ to be 5, calculate distances, and find that 3 neighbors are in Category A and 2 are in Category B, our new data point is likely in Category A.

Choosing ‘K’ is crucial, since it controls how many neighbors are considered. KNN is a lazy learning algorithm: instead of building a model during training, it stores the data and computes distances only when a prediction is needed.

As seen in our example, changing ‘K’ changes predictions. With K=3, we might predict Category B, while with K=7, it could be Category A. So, picking the right ‘K’ is a big deal in making KNN work well.

[Figure: KNN Algorithm]
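The five steps above translate almost line for line into code. Below is a minimal from-scratch sketch (the function name and toy data are ours, added purely for illustration; a real project would normally use scikit-learn’s KNeighborsClassifier instead):

Python3

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count labels among the neighbors and return the majority class
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy example: 3 of the 5 nearest points belong to Category 'A'
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 7], [8, 8]])
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=5))  # -> 'A'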

Metrics to compute the Distance between the two Data Points

A distance metric is the measure used to calculate how far apart two data points are. Euclidean distance is the most common choice, but other metrics such as Manhattan distance or Minkowski distance can also be used.

To identify which data points are closest to a given query point in the KNN algorithm, the computation of distances between the query point and other data points is essential. These distance metrics play a crucial role in establishing decision boundaries, effectively dividing query points into distinct regions, often depicted using Voronoi diagrams.

Among the various distance measures available, this discussion will focus on the following:

  • Euclidean Distance (p=2)
    Widely employed, Euclidean distance is tailored for real-valued vectors. Calculated using the formula below, it represents the straight-line distance between the query point and the point being measured.
    d = \sqrt{(x_{2}-x_{1})^2 + (y_{2}-y_{1})^2}
  • Manhattan Distance (p=1)
    Also known as taxicab or city block distance, Manhattan distance sums the absolute differences between the coordinates of two points. Visualized with a grid, it reflects how one might traverse city streets between two addresses.
    Manhattan\; Distance = |x_{1} - x_{2}| + |y_{1} - y_{2}|
  • Minkowski Distance
    A generalized form encompassing both Euclidean and Manhattan metrics, Minkowski distance introduces a parameter, p, allowing the creation of diverse distance metrics. The formula below reduces to Euclidean distance when p is 2 and Manhattan distance when p is 1.
    Minkowski\; Distance = \left(\Sigma_{i=1}^n |x_{i} - y_{i}|^p\right)^\frac{1}{p}
  • Hamming Distance
    Primarily employed with Boolean or string vectors, Hamming distance counts the positions at which two vectors differ. Also known as the overlap metric, the formula below captures this dissimilarity.
    d(x,y) = \Sigma_{i=1}^{n} |x_{i} - y_{i}|
    For instance, given two strings of equal length, the Hamming distance is 2 if exactly two positions differ. (A short code sketch computing each of these metrics follows this list.)
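As a quick sanity check of these formulas, the sketch below (an added example using scipy.spatial.distance; the sample vectors and strings are arbitrary) computes each metric for a pair of points:

Python3

# Computing the distance metrics above for a pair of example points.
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, minkowski

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, 3.0])

print(euclidean(u, v))        # straight-line distance (Minkowski with p = 2)
print(cityblock(u, v))        # Manhattan / taxicab distance (Minkowski with p = 1)
print(minkowski(u, v, p=2))   # same value as the Euclidean distance
print(minkowski(u, v, p=1))   # same value as the Manhattan distance

# Hamming distance between two equal-length strings: count of differing positions
a, b = "karolin", "kathrin"
print(sum(c1 != c2 for c1, c2 in zip(a, b)))  # -> 3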

Selecting the optimal n_neighbors value in KNN

Here are the steps for selecting an optimal value of n_neighbors in KNN:

  • Define a Range: Set a range for n_neighbors, like 1 to 20.
  • Data Split: Segment your dataset into training and testing sets to gauge model generalization.
  • Choose Metric: Opt for a performance metric (e.g., accuracy, precision, recall) aligned with your goal.
  • Cross-Validation Setup: Employ cross-validation, such as k-fold, to robustly evaluate each n_neighbors value.
  • Grid Search: Execute a grid search, testing n_neighbors across the defined range.
  • Evaluate Performance: Compute performance metrics for each n_neighbors through cross-validation, aiding the identification of the optimal value.
  • Visualize Findings: Craft visualizations (line plots, bar charts) to depict the relationship between n_neighbors and the chosen performance metric.
  • Fine-Tune if Needed: If initial results hint at a promising range, narrow it down for finer optimization.
  • Select Optimal n_neighbors: Choose the n_neighbors value balancing accuracy and simplicity, informed by evaluation and visualization.
  • Test on New Data: Validate the chosen n_neighbors value on a test set to ensure efficacy in predicting new, unseen data.

By following these streamlined steps, you can efficiently pinpoint the best n_neighbors for your KNN model, aligning with your dataset’s characteristics and machine learning objectives.
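Putting these steps into code, here is a minimal sketch (an added example using scikit-learn’s GridSearchCV; the synthetic dataset, the 1–20 range, and 5-fold cross-validation are assumptions chosen to mirror the steps above, not the only reasonable choices):

Python3

# Grid search with cross-validation to select n_neighbors.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data and a held-out test set (Data Split step)
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Define a Range + Grid Search + Cross-Validation Setup (5-fold, accuracy)
param_grid = {'n_neighbors': list(range(1, 21))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Select Optimal n_neighbors and Test on New Data
print("Best n_neighbors:", grid.best_params_['n_neighbors'])
print("Best cross-validated accuracy:", grid.best_score_)
print("Test accuracy:", grid.score(X_test, y_test))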

Parameter affecting the KNN algorithm: k (Number of Neighbors)

The user defines the value of k. A smaller k makes the model more sensitive to noise and more prone to overfitting, while a larger k gives smoother predictions but may lose detail at the class boundaries. We will see this in detail later in this article.

To select the value of K that fits your data, we run the KNN algorithm multiple times with different K values, using accuracy as the metric for evaluating each K. The K value that gives the highest and most stable accuracy is a good candidate.

When it comes to choosing the best value for K, we must also keep in mind the number of features and the sample size per group. The more features and groups in our dataset, the larger the range of K values we typically need to search in order to find an appropriate value.

When we decrease the value of K all the way to 1, our predictions become less stable and more sensitive to noise, and the accuracy typically decreases.
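A simple way to carry out this search is to loop over candidate K values, record a cross-validated accuracy for each, and plot the result. The sketch below (an added example using cross_val_score on the same kind of synthetic data used later in this article; the 1–20 range is an assumption) makes the saturation point easy to read off:

Python3

# Accuracy vs. K: cross-validated accuracy for each candidate number of neighbors.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

k_values = list(range(1, 21))
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in k_values]

plt.plot(k_values, scores, marker='o')
plt.xlabel('Number of neighbors (K)')
plt.ylabel('Mean cross-validated accuracy')
plt.title('Accuracy vs. K')
plt.show()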

Example to demonstrate the effect of K neighbors in KNN

The code below demonstrates the K-Nearest Neighbors (KNN) algorithm’s behavior by visualizing its performance with varying values of K. It begins by generating a synthetic dataset and splitting it into training and testing sets. For each selected k (1, 3, 5, 7, 9, 11), a KNN classifier is trained and its decision boundary is plotted. Scatter plots show both the training and testing points, giving an intuitive picture of how well each model fits the data, and the accuracy of each model on the testing set is reported in the subplot title. Together, these plots illustrate how different k values affect predictive accuracy and why choosing a suitable k matters for KNN performance.

Python3

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
 
# Generate a synthetic dataset
X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=42)
 
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
 
# Define a range of k values to test
k_values = [1, 3, 5, 7, 9, 11]
 
# Create subplots for each k value
fig, axes = plt.subplots(1, len(k_values), figsize=(15, 3))
 
# Train and visualize the models with varying k values
for i, k in enumerate(k_values):
    # Create KNN model
    knn = KNeighborsClassifier(n_neighbors=k)
 
    # Train the model
    knn.fit(X_train, y_train)
 
    # Make predictions
    y_pred = knn.predict(X_test)
 
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
 
    # Plot decision boundary
    h = 0.02  # Step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
 
    # Plot the decision boundary
    axes[i].contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
 
    # Plot the training points
    axes[i].scatter(X_train[:, 0], X_train[:, 1], c=y_train,
                    edgecolors='k', cmap=plt.cm.Paired)
 
    # Plot the testing points (the 'x' marker is unfilled, so no edge color is needed)
    axes[i].scatter(X_test[:, 0], X_test[:, 1], c=y_test,
                    marker='x', cmap=plt.cm.Paired)
 
    # Set plot labels and title
    axes[i].set_title(f'k={k}, Accuracy={accuracy:.2f}')
    axes[i].set_xlabel('Feature 1')
    axes[i].set_ylabel('Feature 2')
 
plt.show()

                    

Output:

[Figure: Plot between K values and Accuracy score]

The figure above shows how the evaluation score, in this case accuracy, varies with the number of neighbors K for this model. At k=1 the accuracy is 0.900; increasing k to 2 raises it to 0.950, and the accuracy keeps improving until it reaches a saturation point. Beyond k=11 the accuracy starts to decrease for this dataset. This demonstrates the effect of varying the number of neighbors k in the KNN algorithm.

Advantages of K-Nearest Neighbors (KNN)

Here are a few advantages of the KNN algorithm:

  • Simplicity and Intuitiveness: KNN is easy to understand and implement, making it accessible even for those new to machine learning.
  • No Training Phase: KNN doesn’t require a separate training phase. It quickly adapts to new data, making it suitable for dynamic environments.
  • Versatility: It works well for both classification and regression tasks, providing flexibility in application.

Disadvantages of K-Nearest Neighbors (KNN)

Here are a few disadvantages of the KNN algorithm:

  • Computational Cost: KNN can be computationally expensive, especially with large datasets, as it needs to calculate distances for each prediction.
  • Sensitive to Irrelevant Features: It is sensitive to irrelevant or redundant features in the dataset, which can impact the quality of predictions.
  • Careful Parameter Selection: Selecting the right value for k and the appropriate distance metric is crucial. Poor choices can lead to suboptimal performance.

Real-Life Use Cases of KNN

Here are a few real-life use cases of the KNN algorithm:

  • Image Recognition: KNN is employed in image recognition tasks, identifying similar images based on pixel values. This is used in facial recognition, pattern matching, and image classification.
  • Medical Diagnosis: In healthcare, KNN assists in predicting diseases based on patient data. It analyzes similarities between a patient’s medical profile and known cases to make predictions.
  • Recommendation Systems: KNN is utilized in recommendation systems for suggesting products, movies, or content based on user behavior. It identifies similar users and recommends items they liked.

Frequently Asked Questions

1. Why is KNN a non-parametric Algorithm?

KNN is considered a non-parametric algorithm due to its lack of assumptions about the underlying data distribution. It doesn’t learn parameters; instead, it memorizes the complete dataset, relying on the proximity of data points for predictions.

2. Why is the odd value of “K” preferred over even values in the KNN Algorithm?

Odd values of k are preferred over even values to avoid ties in the majority vote, particularly in binary classification.

3. What are the uses of KNN in OTT platforms?

The KNN algorithm is used on OTT platforms for various purposes, such as recommending a movie or series based on the content you have recently watched. For instance, if you have watched Mission: Impossible 6, the platform may suggest Mission: Impossible 8, Edge of Tomorrow, or other movies quite similar to the one you watched first. This notion of similarity is also used to group movies by genre; the movies mentioned above fall in the action genre.


