
KNN Model Complexity

KNN is a machine learning algorithm used for both classification (via KNeighborsClassifier) and regression (via KNeighborsRegressor) problems. In the KNN algorithm, K is the hyperparameter, and choosing the right value of K matters. A machine learning model is said to have high model complexity if the built model has low bias and high variance.
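For reference, both estimators live in sklearn.neighbors and take K through the same n_neighbors parameter. A minimal sketch (the value K = 5 is chosen arbitrarily for illustration):

# Both the classifier and the regressor accept K via n_neighbors
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

clf = KNeighborsClassifier(n_neighbors=5)  # K = 5 for classification
reg = KNeighborsRegressor(n_neighbors=5)   # K = 5 for regression
# Both expose the same fit() / predict() / score() interface.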

We know that:

  1. High Bias and Low Variance = Under-fitting model.
  2. Low Bias and High Variance = Over-fitting model [indicates a highly complex model].
  3. Low Bias and Low Variance = Best-fitting model [this is preferred].
  4. High training accuracy and Low test accuracy (out-of-sample accuracy) = High Variance = Over-fitting model = More model complexity.
  5. Low training accuracy and Low test accuracy (out-of-sample accuracy) = High Bias = Under-fitting model.

Code: To understand how the K value in the KNN algorithm affects the model complexity.




# This code may not run on the GFG IDE
# as the required modules are not found.

# Import required modules
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
import numpy as np

# Synthetically create a data set
plt.figure()
plt.title('SIMPLE-LINEAR-REGRESSION')
x, y = make_regression(
    n_samples=100, n_features=1,
    n_informative=1, noise=15, random_state=3)
plt.scatter(x, y, color='red', marker='o', s=30)

# Split the data and train the model
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)
knn = KNeighborsRegressor(n_neighbors=7)
knn.fit(x_train, y_train)
predict = knn.predict(x_test)

# For regressors, score() returns the R^2 coefficient,
# which is used here as the "accuracy"
print('Test Accuracy:', knn.score(x_test, y_test))
print('Training Accuracy:', knn.score(x_train, y_train))

# Plot the output
x_new = np.linspace(-3, 2, 100).reshape(100, 1)
predict_new = knn.predict(x_new)
plt.plot(
    x_new, predict_new, color='blue',
    label="K = 7")
plt.scatter(x_train, y_train, color='red')
plt.scatter(x_test, predict, marker='^', s=90)
plt.legend()
plt.show()

Output:



Test Accuracy: 0.6465919540035108
Training Accuracy: 0.8687977824212627

Now let’s vary the value of K (the hyperparameter) from low to high and observe the model complexity.
[Regression plots for K = 1, K = 10, K = 20, K = 50 and K = 70]
Observations:

  • When the K value is small, i.e. K = 1, the model complexity is high (over-fitting or high variance).
  • When the K value is very large, i.e. K = 70, the model complexity decreases (under-fitting or high bias).

Conclusion:
As the value of K becomes small, model complexity increases, and as the value of K becomes large, model complexity decreases. The sketch below makes this concrete.
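A minimal sketch (reusing x_train, x_test, y_train and y_test from the code above) comparing the train and test scores at the two extremes of K:

# A minimal sketch, reusing the train / test split created above
for k in (1, 70):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(x_train, y_train)
    print('K =', k,
          '| Train:', round(knn.score(x_train, y_train), 3),
          '| Test:', round(knn.score(x_test, y_test), 3))
# Expected pattern: K = 1 fits the training set perfectly but
# generalizes worse (over-fitting), while K = 70 scores low on
# both sets (under-fitting). Exact numbers depend on the data.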

Code: To plot test accuracy and training accuracy against the value of K, consider the plot produced below.




# This code may not run on the GFG IDE
# as the required modules are not found.

# Plot test accuracy and train accuracy vs the K value
# (reuses x_train, x_test, y_train, y_test from above)
p = list(range(1, 31))
lst_test = []
lst_train = []
for i in p:
    knn = KNeighborsRegressor(n_neighbors=i)
    knn.fit(x_train, y_train)
    lst_test.append(knn.score(x_test, y_test))
    lst_train.append(knn.score(x_train, y_train))

plt.plot(p, lst_test, color='red', label='Test Accuracy')
plt.plot(p, lst_train, color='b', label='Train Accuracy')
plt.xlabel('K VALUES --->')
plt.title('FINDING BEST VALUE FOR K')
plt.legend()
plt.show()

Output:

[Plot of Test Accuracy and Train Accuracy vs K values]
Observation:
From the above graph, we can conclude that when K is small, i.e. K = 1, training accuracy is high but test accuracy is low, which means the model is over-fitting (high variance or high model complexity). When the value of K is large, i.e. towards K = 30 at the right of the plot, both training accuracy and test accuracy are low, which means the model is under-fitting (high bias or low model complexity).

So hyperparameter tuning is necessary, i.e. selecting the best value of K in the KNN algorithm, for which the model has low bias and low variance and results in a good model with high out-of-sample accuracy.

We can use GridSearchCV or RandomizedSearchCV to find the best value of the hyperparameter K.
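For illustration, a minimal GridSearchCV sketch (assuming the x and y arrays generated above; the K range 1 to 30 mirrors the loop in the previous code):

# Tune K with 5-fold cross-validation over K = 1..30
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': list(range(1, 31))}
grid = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5)
grid.fit(x, y)

print('Best K:', grid.best_params_['n_neighbors'])
print('Best cross-validated R^2:', grid.best_score_)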

