
kNN: k-Nearest Neighbour Algorithm in R From Scratch

Last Updated : 25 Jan, 2024

In this article, we discuss what the KNN algorithm is, how it can be coded in the R programming language, and its applications, advantages, and disadvantages.

kNN algorithm in R

KNN stands for k-nearest neighbours. It is a supervised learning algorithm that can be used for both classification and regression tasks, and it is one of the simplest algorithms used in machine learning, data analytics, and data science. KNN assigns a label to a test observation based on the class labels of its nearest neighbours in the training dataset. It is called a lazy learning algorithm because no model is fitted during a training phase; all the work happens at prediction time. KNN can be applied to both categorical and numerical data. In this article we discuss the KNN algorithm in detail and show how it can be implemented in the R programming language.

Let us now discuss the steps of the KNN algorithm and how a class label is assigned to a test data point based on the training dataset.

  1. Input: take the training dataset and the test data point.
  2. Select the value of K (the number of nearest neighbours to consider).
  3. Calculate the Euclidean distance of every training point from the test data point, where the Euclidean distance is given by √((x2-x1)^2+(y2-y1)^2).
  4. Identify the K nearest training data points.
  5. If k = 1, assign the test data point the class label of that single nearest training data point.
  6. If k > 1, assign the test data point the predominant class label among the K nearest training data points.

A compact R sketch of these steps is shown right after this list.
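The function below is a minimal illustration of the steps above, assuming a training data frame with hypothetical columns pages, cost and label; it is only a sketch, not the full implementation developed later in the article.

R

# minimal sketch of the kNN steps, assuming hypothetical columns 'pages', 'cost', 'label'
knnSketch <- function(train, testPages, testCost, k = 3) {
  # step 3: Euclidean distance of every training row from the test point
  d <- sqrt((train$pages - testPages)^2 + (train$cost - testCost)^2)
  # step 4: keep the k nearest training rows
  nearest <- train[order(d)[1:k], ]
  # steps 5 and 6: return the predominant class label among the neighbours
  votes <- table(nearest$label)
  names(votes)[which.max(votes)]
}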

Example of KNN Algorithm

Let us now work through an example of how the k-nearest neighbour algorithm assigns a class label.

The table below represents the training dataset. The first column is the serial number, the second column is the number of pages in a book, the third column is the cost of the book, and the fourth column is the class of the book. The class names are White and Black and white; each book is categorized as White or Black and white based on its number of pages and cost.

S.No    Number of Pages    Cost of Book    Class
1       167                51              White
2       182                62              Black and white
3       176                69              Black and white
4       173                64              Black and white
5       172                65              Black and white
6       174                56              White
7       169                58              Black and white
8       173                57              Black and white
9       170                55              Black and white

The above table represents the training dataset, which has the class labels White and Black and white. The class labels are assigned based on the cost of the book and the number of pages.

Test data: S.No: 10, Number of pages: 170, Cost of book: 57, Class: ?

For this test data point we need to identify the class label using the training dataset shown above and the KNN algorithm steps.

Let us set the value of K to 3 (i.e., k = 3).

Euclidean distance=√((x2-x1)^2+(y2-y1)^2 )

Now let us calculate the Euclidean distance of every training data point from the test data point. The fourth column in the table below shows this distance, computed with the formula above, where (x2, y2) is the test data point (x2 = number of pages of the test data, y2 = cost of the test data book) and (x1, y1) is a training data point (x1 = number of pages of the training data, y1 = cost of the training data book). For the first row, for example: √((170-167)^2+(57-51)^2) = √(9+36) = √45 ≈ 6.7.
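As a quick check of the arithmetic, the same distances can be computed in R by vectorizing the formula; the two vectors below are simply the page and cost columns of the training table typed in by hand.

R

# page counts and costs of the nine training books (from the table above)
pages <- c(167, 182, 176, 173, 172, 174, 169, 173, 170)
cost  <- c(51, 62, 69, 64, 65, 56, 58, 57, 55)
# Euclidean distance of each training point from the test point (170, 57)
round(sqrt((170 - pages)^2 + (57 - cost)^2), 1)
# 6.7 13.0 13.4  7.6  8.2  4.1  1.4  3.0  2.0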

S.No    Number of Pages    Cost of Book    Euclidean Distance    Class
1       167                51              6.7                   White
2       182                62              13                    Black and white
3       176                69              13.4                  Black and white
4       173                64              7.6                   Black and white
5       172                65              8.2                   Black and white
6       174                56              4.1                   White
7       169                58              1.4                   Black and white
8       173                57              3                     Black and white
9       170                55              2                     Black and white
10      170                57              -                     ?

Now let us rearrange the table by distance, sorting the Euclidean distances in ascending order so that the nearest training points come first. The rearranged table is shown below.
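The same rearrangement can be done in R with the order() function. This is a small sketch that reuses the pages and cost vectors from the snippet above.

R

distances <- sqrt((170 - pages)^2 + (57 - cost)^2)
ord <- order(distances)                                  # training rows, nearest first
sortedTrain <- data.frame(pages, cost, distance = round(distances, 1))[ord, ]
head(sortedTrain, 3)                                     # the three nearest neighbours for k = 3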

S.No    Number of Pages    Cost of Book    Euclidean Distance    Class
1       169                58              1.4                   Black and white
2       170                55              2                     Black and white
3       173                57              3                     Black and white
4       174                56              4.1                   White
5       167                51              6.7                   White
6       173                64              7.6                   Black and white
7       172                65              8.2                   Black and white
8       182                62              13                    Black and white
9       176                69              13.4                  Black and white
10      170                57              -                     ?

The table below shows the same sorted data, with the first three rows marked with the value of k at which each neighbour is included (k = 1, k = 2, k = 3).

S.No    Number of Pages    Cost of Book    Euclidean Distance    Class               k
1       169                58              1.4                   Black and white     k = 1
2       170                55              2                     Black and white     k = 2
3       173                57              3                     Black and white     k = 3
4       174                56              4.1                   White
5       167                51              6.7                   White
6       173                64              7.6                   Black and white
7       172                65              8.2                   Black and white
8       182                62              13                    Black and white
9       176                69              13.4                  Black and white
10      170                57              -                     ?

The value of k is 3. For k = 3 the class labels of the three nearest neighbours are Black and white, Black and white and Black and white. Based on the training dataset we can therefore assign the test data point the class label Black and white, since it is the predominant class label among the k = 3 nearest neighbours.
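The majority vote for k = 3 can be sketched in R with table() and which.max(); the label vector below is just the Class column of the three nearest rows from the sorted table.

R

# class labels of the three nearest neighbours (distances 1.4, 2 and 3)
nearestClasses <- c("Black and white", "Black and white", "Black and white")
votes <- table(nearestClasses)
names(votes)[which.max(votes)]   # "Black and white"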

Step-by-Step Explanation of the KNN Algorithm Code from Scratch

Let us now implement the example above in R from scratch.

Taking Data Set as input

In the code below we use an external dataset with 10 observations and 4 variables: Serial Number, Number of Pages, Cost of Book and Class. The value of Class (White or Black and white) is based on the number of pages and the cost of the book.

We can load the data using the read.csv() function. RStudio also ships with some built-in datasets, but in the code explained below we use the external dataset described above. Many datasets are available from websites such as https://www.kaggle.com, and the example dataset used here can be downloaded from the link below.

Dataset Link: Example Dataset

R




# read the example dataset (downloaded from the dataset link above)
dataFrame <- read.csv("20240112220442/exampleData.csv")
dataFrame


Output:

   S.No Number.of.Pages Cost.of.Book           Class
1     1             167           51           White
2     2             182           62 Black and White
3     3             176           69 Black and White
4     4             173           64 Black and White
5     5             172           65 Black and White
6     6             174           56           White
7     7             169           58 Black and White
8     8             173           57 Black and White
9     9             170           55 Black and White
10   10             170           57 Black and White



We treat the last row of the data as the test data and the remaining rows as the training data, and we predict the class of the test row using the KNN algorithm. The code below shows the split into training and test data.

R




# creating training data (all rows except the last one)
trainData = dataFrame[1:(nrow(dataFrame) - 1), ]
trainData
# creating test data (the last row)
testData = dataFrame[nrow(dataFrame), ]
testData


Output:

  S.No Number.of.Pages Cost.of.Book           Class
1    1             167           51           White
2    2             182           62 Black and White
3    3             176           69 Black and White
4    4             173           64 Black and White
5    5             172           65 Black and White
6    6             174           56           White
7    7             169           58 Black and White
8    8             173           57 Black and White
9    9             170           55 Black and White

testData
   S.No Number.of.Pages Cost.of.Book           Class
10   10             170           57 Black and White



We can inspect and analyze the data by using functions like str() and summary() in R.

R




summary(dataFrame)


Output:

      S.No       Number.of.Pages  Cost.of.Book               Class  
Min. : 1.00 Min. :167.0 Min. :51.00 Black and White:8
1st Qu.: 3.25 1st Qu.:170.0 1st Qu.:56.25 White :2
Median : 5.50 Median :172.5 Median :57.50
Mean : 5.50 Mean :172.6 Mean :59.40
3rd Qu.: 7.75 3rd Qu.:173.8 3rd Qu.:63.50
Max. :10.00 Max. :182.0 Max. :69.00



The summary() function returns a statistical summary of every variable in the provided data.

R




str(dataFrame)


Output:

'data.frame':    10 obs. of  4 variables:
$ S.No : int 1 2 3 4 5 6 7 8 9 10
$ Number.of.Pages: int 167 182 176 173 172 174 169 173 170 170
$ Cost.of.Book : int 51 62 69 64 65 56 58 57 55 57
$ Class : Factor w/ 2 levels "Black and White",..: 2 1 1 1 1 2 1 1 1 1



The str() function in R displays the internal structure of an object. It provides information about the number of rows and columns, the column names and types, and the first few values of each column.

Selecting the Value of K and Calculating the Euclidean Distance

In the KNN algorithm we predict the class of a data point based on the value of K. The value of K is chosen from the number of observations; a common rule of thumb is to use roughly the square root of the number of observations. The data we are using has 10 observations, so this rule gives K = 3 (√10 ≈ 3.16). A short snippet computing k this way follows this paragraph.
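As a short sketch, the rule-of-thumb value of k can be computed directly from the number of rows of the data frame loaded earlier.

R

# square-root rule of thumb for choosing k
k <- floor(sqrt(nrow(dataFrame)))   # sqrt(10) ≈ 3.16, so k = 3
k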

√((x2-x1)^2+(y2-y1)^2) is the formula for the Euclidean distance, where (x1, y1) is a training data point and (x2, y2) is the test data point. We compute the Euclidean distance of each training data point from the test data point. The code below defines a function for this calculation.

R




k <- 3
# function for calculating the Euclidean distance
euclideanDistance = function(x, y) {
  # check whether x and y have the same number of values
  if (length(x) == length(y)) {
    sqrt(sum((x - y)^2))
  } else {
    stop('x and y should have the same number of variables')
  }
}
euclideanDistance(9:15, 16:22)


Output:

[1] 18.52026



Above we created a function that calculates the distance between two points. To check that the function works, we called it with the vectors x = 9:15 and y = 16:22, which have the same length.

Complete implementation of KNN algorithm

R




# function for calculating the Euclidean distance
euclideanDistance = function(x, y) {
  # check whether x and y have the same number of values
  if (length(x) == length(y)) {
    sqrt(sum((x - y)^2))
  } else {
    stop('x and y should have the same number of variables')
  }
}

# function to find the k nearest neighbours
nearestNeighbours = function(trainData, testData, k, funct, s = NULL) {
  # check that training and test data have the same number of columns
  if (ncol(trainData) != ncol(testData)) {
    stop('training and test data should have the same number of columns')
  }
  # distance of every training row from the test row
  if (is.null(s)) {
    distance = apply(trainData, 1, funct, testData)
  } else {
    distance = apply(trainData, 1, funct, testData, s)
  }

  # getting the closest neighbours
  distances = sort(distance)[1:k]
  neighbour_res = which(distance %in% distances)

  if (length(neighbour_res) != k) {
    warning(paste('more than', k, 'neighbours returned because of tied distances'))
  }
  result = list(neighbour_res, distances)
  return(result)
}

# accessing the data (the example dataset from the link above)
dataFrame = read.csv("exampleData.csv")
# creating training data (all rows except the last one)
trainData = dataFrame[1:(nrow(dataFrame) - 1), ]
# creating test data (the last row)
testData = dataFrame[nrow(dataFrame), ]
# calling nearestNeighbours() on the feature columns (Number.of.Pages and Cost.of.Book)
res = nearestNeighbours(trainData[, 2:3], testData[, 2:3], 3, euclideanDistance)[[1]]
as.matrix(trainData[res, 1:3])
# creating a prediction function: majority vote over the neighbours' class labels
knnPrediction = function(trainData, variable) {
  interData = table(trainData[, variable])
  predicted = interData[interData == max(interData)]
  return(predicted)
}
# calling the knnPrediction() function
knnPrediction(trainData[res, ], 'Class')


Output:

  S.No Number.of.Pages Cost.of.Book
7    7             169           58
8    8             173           57
9    9             170           55

Black and White 
              3 


The output above shows the three nearest training rows and the prediction Black and White, which is the predominant class among those neighbours.

Step-by-Step Implementation of the KNN Algorithm Using R Packages

Installing Packages

To implement the KNN algorithm in R, we need to install a few packages: class, ggplot2, caret and GGally.

We can install packages in RStudio in two ways:

  • In RStudio, go to Tools, click on Install Packages, enter the required package name in the dialog that appears, and click Install: open RStudio → Tools → Install Packages → enter the package name in the Install Packages tab → click Install.
  • We can also install packages with the command install.packages("package_name") in the RStudio console: open RStudio → in the console type install.packages("package_name"). A snippet that installs all four packages used in this article follows this list.
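For convenience, all four packages used in this article can be installed with a single install.packages() call, as sketched below.

R

# one-time installation of the packages used in this article
install.packages(c("class", "caret", "ggplot2", "GGally"))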

Importing Packages

In order to work with the KNN algorithm we need to import the installed packages into our script. We load the packages into the R script using the library() function. The lines below import the packages class, caret, ggplot2 and GGally; the purpose of each package is discussed below.

  • class – an R package for classification, including the k-nearest neighbour algorithm. It provides functions such as knn(), knn.cv() and reduce.nn(). In this article we import this package for the knn() function.
  • caret – an R package for working with classification as well as regression problems (training, tuning and evaluating models).
  • ggplot2 – an R package for creating graphics, used here for data visualization.
  • GGally – an extension of ggplot2 that simplifies several common plotting tasks.
  • library() – the function used to load an installed package into the R script. Each call loads one package, so the syntax is library(package_name), repeated once per package.

R




library(class)
library(caret)
library(ggplot2)
library(GGally)


Accessing/Importing Dataset

After importing the required packages we need to load the data into the R script. There are two ways to load data; let us discuss each of them.

  • We can load the available built-in datasets using the data() function; RStudio has approximately 104 built-in datasets available. The code below loads the built-in iris dataset, which has 150 rows and 5 columns.

R




data(iris)
iris


Output:

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa


This loads the iris data into our R script.

  • We can also load data using the read.csv() function, which reads a CSV file and stores it as a data frame. We can download datasets from the kaggle.com website or other sources, or create our own data; a small sketch is shown below.
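A minimal read.csv() sketch is shown below; the file name is hypothetical and should be replaced with the path of the CSV you actually downloaded.

R

# read a CSV file into a data frame (hypothetical file name)
myData <- read.csv("my_dataset.csv", stringsAsFactors = TRUE)
str(myData)   # read.csv() returns a data frame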

Normalization

In the KNN algorithm we use normalization to bring all the variables of the data onto the same scale. We can do this with either normalization or standardization. Normalization is useful when the variable values differ a lot in magnitude; it is not always necessary.

R




# min-max normalization: rescales each variable to the range [0, 1]
normal_frame <- function(a) {
  (a - min(a)) / (max(a) - min(a))
}
iris_new_frame <- as.data.frame(lapply(iris[, -5], normal_frame))
summary(iris_new_frame)


Output:

  Sepal.Length     Sepal.Width      Petal.Length     Petal.Width     
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.2222 1st Qu.:0.3333 1st Qu.:0.1017 1st Qu.:0.08333
Median :0.4167 Median :0.4167 Median :0.5678 Median :0.50000
Mean :0.4287 Mean :0.4406 Mean :0.4675 Mean :0.45806
3rd Qu.:0.5833 3rd Qu.:0.5417 3rd Qu.:0.6949 3rd Qu.:0.70833
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000


We can see that the normalization function has produced output on the same 0-to-1 scale for all variables.
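Standardization, the alternative mentioned above, can be done with the scale() function from base R; the sketch below applies it to the same four iris feature columns.

R

# standardization (z-scores): each column gets mean 0 and standard deviation 1
iris_scaled <- as.data.frame(scale(iris[, -5]))
summary(iris_scaled)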

Creating test and training data

We know that KNN is a supervised learning algorithm, so it needs both training and test data; supervised learning algorithms learn from previously labelled data. We now divide the available data into training data (70%) and test data (the remaining 30%). We create two pairs of objects: the first pair (train_iris, test_iris) contains the normalized feature columns without the class column (Species), and the second pair (train_iris_ran, test_iris_ran) contains only the class column.

R




set.seed(1234)
# sample 70% of the row indices for training
data_ran <- sample(1:nrow(iris_new_frame), size = nrow(iris_new_frame) * 0.7, replace = FALSE)
train_iris <- iris_new_frame[data_ran, ]    # normalized features, training rows
test_iris  <- iris_new_frame[-data_ran, ]   # normalized features, test rows

train_iris_ran <- iris[data_ran, 5]    # Species labels for the training rows
test_iris_ran  <- iris[-data_ran, 5]   # Species labels for the test rows


Creating the Model

We create the KNN model in R using the knn() function from the class package. In the knn() call below we pass the training dataset, the test dataset, the class labels of the training data (the Species column, the fifth column of iris) and the value of k.

R




# knn(): train = normalized training features, test = normalized test features,
# cl = class labels of the training rows, k = number of neighbours considered
knnModel <- knn(train = train_iris, test = test_iris, cl = train_iris_ran, k = 13)
summary(knnModel)


Output:

    setosa versicolor  virginica 
        16         16         13 


Performance of the Model

We evaluate the performance of the model by calculating its accuracy. Accuracy tells us how correctly the species is predicted from the sepal length, sepal width, petal length and petal width. The code below shows how to calculate the accuracy of the model.

R




# percentage of test rows whose predicted species matches the true species
accuracy <- 100 * sum(test_iris_ran == knnModel) / NROW(test_iris_ran)
accuracy


Output:

[1] 95.55556


We can also examine the performance of the model in more detail by creating a confusion matrix. In R we can create it with the confusionMatrix() function, which is available once the caret package has been installed and loaded.

R




# cross-tabulation of predictions against the true test labels
table(knnModel, test_iris_ran)
confusionMatrix(table(knnModel, test_iris_ran))


Output:

            test_iris_ran
knnModel     setosa versicolor virginica
  setosa         16          0         0
  versicolor      0         15         1
  virginica       0          1        12

> confusionMatrix(table(knnModel,test_iris_ran))
Confusion Matrix and Statistics

            test_iris_ran
knnModel     setosa versicolor virginica
  setosa         16          0         0
  versicolor      0         15         1
  virginica       0          1        12

Overall Statistics

               Accuracy : 0.9556
                 95% CI : (0.8485, 0.9946)
    No Information Rate : 0.3556
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.933

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9375           0.9231
Specificity                 1.0000            0.9655           0.9688
Pos Pred Value              1.0000            0.9375           0.9231
Neg Pred Value              1.0000            0.9655           0.9688
Prevalence                  0.3556            0.3556           0.2889
Detection Rate              0.3556            0.3333           0.2667
Detection Prevalence        0.3556            0.3556           0.2889
Balanced Accuracy           1.0000            0.9515           0.9459


Visualization

R




# scatter plot of sepal length against petal width, coloured by species
ggplot(aes(Sepal.Length, Petal.Width), data = iris) +
  geom_point(aes(color = factor(Species)))


Output:


kNN algorithm in R from scratch
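Since the GGally package was loaded earlier, a pairwise view of all four iris features, coloured by species, can also be drawn with its ggpairs() function; this is an optional extra plot, not required for the model.

R

# pairwise scatter plots, densities and correlations for the four iris features
ggpairs(iris, mapping = aes(colour = Species), columns = 1:4)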

Applications of KNN Algorithm

  1. The KNN algorithm is used for classifying images in image recognition.
  2. The KNN algorithm can be used in text categorization tasks.
  3. It is useful for detecting spam messages and spam mail.
  4. The KNN algorithm can also be used for stock prediction, house price prediction, weather prediction, market segmentation and real estate.
  5. The KNN algorithm can be used to identify fraudulent activity in financial transactions.
  6. It can be used to detect unusual network traffic patterns.
  7. The KNN algorithm can be used in drug discovery and disease diagnosis.
  8. It is helpful in recognizing handwriting and face patterns.
  9. The KNN algorithm is useful in robotics for robot navigation and motion planning.

Advantages of KNN Algorithm

  1. The KNN algorithm is a simple algorithm.
  2. It is easy to implement.
  3. The KNN algorithm is a lazy learning algorithm: it does not have an explicit training phase.
  4. Because it is a lazy learner and builds the model only at prediction time, it is suitable for dynamic and changing datasets.
  5. The KNN algorithm is versatile: it can be used for both regression and classification problems.
  6. The KNN algorithm can handle both qualitative and quantitative data (i.e., categorical and numerical data).
  7. With a suitably chosen k, it is less sensitive to individual outliers than some other algorithms.
  8. The KNN algorithm can capture complex patterns and easily pick up the local structure of the data.

Disadvantages of KNN Algorithm

  1. The KNN algorithm is computationally expensive at prediction time, because it must calculate the distance to every training point.
  2. It requires more space, since the whole training dataset must be stored.
  3. The performance of the algorithm decreases as the number of dimensions of the dataset increases.
  4. The performance of the algorithm also depends on the value of k: a small k is sensitive to noise, while a large k reduces sensitivity to local structure (a cross-validation sketch for tuning k follows this list).
  5. The algorithm is sensitive to noisy data, outliers and irrelevant features.
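One common way to handle the sensitivity to k mentioned in point 4 is to tune it with cross-validation. The sketch below uses caret's train() function (the package loaded earlier) with 5-fold cross-validation over a grid of odd k values; the exact grid is an illustrative assumption.

R

# tune k for a KNN classifier on iris using 5-fold cross-validation
set.seed(1234)
tunedKnn <- train(Species ~ ., data = iris,
                  method = "knn",
                  trControl = trainControl(method = "cv", number = 5),
                  tuneGrid = data.frame(k = seq(1, 21, by = 2)))
tunedKnn$bestTune   # the value of k with the best cross-validated accuracy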

Conclusion

In this article we have learned about the KNN algorithm and the steps needed to implement it. We have also seen how to implement the KNN algorithm in the R programming language, both from scratch and with the class package, and we have discussed its applications, advantages and disadvantages in detail.


