K-Nearest Neighbors is one of the most basic yet essential classification algorithms in Machine Learning. It belongs to the supervised learning domain and finds intense application in pattern recognition, data mining and intrusion detection.
It is widely disposable in real-life scenarios since it is non-parametric, meaning, it does not make any underlying assumptions about the distribution of data (as opposed to other algorithms such as GMM, which assume a Gaussian distribution of the given data).
We are given some prior data (also called training data), which classifies coordinates into groups identified by an attribute.
As an example, consider the following table of data points containing two features:
Now, given another set of data points (also called testing data), allocate these points a group by analyzing the training set. Note that the unclassified points are marked as ‘White’.
If we plot these points on a graph, we may be able to locate some clusters or groups. Now, given an unclassified point, we can assign it to a group by observing what group its nearest neighbors belong to. This means a point close to a cluster of points classified as ‘Red’ has a higher probability of getting classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’ and the second point (5.5, 4.5) should be classified as ‘Red’.
Let m be the number of training data samples. Let p be an unknown point.
- Store the training samples in an array of data points arr. This means each element of this array represents a tuple (x, y).
for i=0 to m: Calculate Euclidean distance d(arr[i], p).
- Make set S of K smallest distances obtained. Each of these distances corresponds to an already classified data point.
- Return the majority label among S.
K can be kept as an odd number so that we can calculate a clear majority in the case where only two groups are possible (e.g. Red/Blue). With increasing K, we get smoother, more defined boundaries across different classifications. Also, the accuracy of the above classifier increases as we increase the number of data points in the training set.
Assume 0 and 1 as the two classifiers (groups).
The value classified to unknown point is 0.
This article is contributed by Anannya Uberoi. If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to firstname.lastname@example.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.
Please write comments if you find anything incorrect, or you want to share more information about the topic discussed above.
- Find element that maximizes LCM of an Array in the range 1 to M
- Maximum product of bitonic subsequence of size 3
- Number of factors of very large number N modulo M where M is any prime number
- Why Prim’s and Kruskal's MST algorithm fails for Directed Graph?
- How to Get Masters in Data Science in 2020?
- D'Esopo-Pape Algorithm : Single Source Shortest Path
- Largest element smaller than current element on left for every element in Array
- Maximum repeated frequency of characters in a given string
- Maximum number of strings that can be formed with given zeros and ones
- Count of even and odd set bit with array element after XOR with K
- Count numbers whose maximum sum of distinct digit-sum is less than or equals M
- Count ways to change direction of edges such that graph becomes acyclic
- Count ways to reach end from start stone with at most K jumps at each step
- Minimum sum of product of elements of pairs of the given array