Project | kNN | Classifying IRIS Dataset

Last Updated: 19 Jan, 2022

Introduction | kNN Algorithm

Statistical learning refers to a collection of mathematical and computational tools for understanding data. In what is often called supervised learning, the goal is to estimate or predict an output based on one or more inputs. The inputs go by many names: predictors, independent variables, features, or simply variables. The output or outputs are often called response variables or dependent variables. If the response is quantitative (say, a number that measures weight or height) we call the problem a regression problem; if the response is qualitative (say, yes or no, or blue or green) we call it a classification problem.

This case study deals with one specific approach to classification. The goal is to set up a classifier such that, when it is presented with a new observation whose category is not known, it will assign that observation to a category, or class, based on the observations for which it does know the true category. This specific method is known as the k-Nearest Neighbors classifier, or kNN for short. Given a positive integer k, say 5, and a new data point, it first identifies the k points in the data that are nearest to the new point, then classifies the new point as belonging to the most common class among those k neighbors.
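To make the idea concrete, here is a minimal sketch of the kNN rule; the toy points and labels below are invented purely for illustration.

Python

import numpy as np

# Toy labeled data, invented for illustration.
data = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 3.0], [3.5, 2.5]])
labels = np.array(["blue", "blue", "green", "green"])

def knn_rule(p, data, labels, k=3):
    d = np.sqrt(((data - p) ** 2).sum(axis=1))   # distance from p to every stored point
    nearest = labels[np.argsort(d)[:k]]          # labels of the k nearest points
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]             # the most common label wins

print(knn_rule(np.array([1.2, 1.4]), data, labels, k=3))   # "blue"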
Aim: Build our very own k-Nearest Neighbors classifier to classify data from scikit-learn's IRIS dataset.
 

Distance between two points

We are going to write a function that finds the distance between two given 2-D points in the x-y plane. We import NumPy so we can store the coordinates as NumPy arrays. Finding the distance between two points is the building block for finding the nearest neighbors of an input point.
 

Python
import numpy as np

def distance(p1, p2):
    """Euclidean distance between two points."""
    return np.sqrt(np.sum(np.power(p2 - p1, 2)))

p1 = np.array([1, 1])   # coordinates x = 1, y = 1
p2 = np.array([4, 4])   # coordinates x = 4, y = 4
print(distance(p1, p2)) # 4.242640687119285
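As a quick sanity check, NumPy's built-in np.linalg.norm computes the same Euclidean distance:

Python

import numpy as np

# Same Euclidean distance via NumPy's built-in norm.
print(np.linalg.norm(np.array([4, 4]) - np.array([1, 1])))   # 4.242640687119285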

Majority vote counter

We create a 3 x 3 grid of points as a NumPy array to serve as an environment of dispersed points in the plane. We also create a function called majority_vote() that finds the element with the highest count in a vote list, e.g. (1, 2, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 3, 2, 3). This is simply the mode of the data, so it can also be computed with SciPy's statistics module; a second function, majority_vote_short(), performs the same job using mode() from scipy.stats. Both functions will be needed for predicting points later.
Our aim is to build a kNN classifier, so we need an algorithm that finds the nearest neighbors of a given point. Suppose we want to insert a point into the x-y plane within an environment of existing points. We have to classify the new point into one of the categories of the existing points and insert it accordingly. So we build a function find_nearest_neighbours() that takes three parameters: (i) the point we wish to insert, (ii) the set of existing points, and (iii) k, the number of nearest neighbors to return (as indices). We visualize the situation by plotting the points in the x-y plane with matplotlib.
 

Python
import numpy as np
import random
import scipy.stats as ss
import matplotlib.pyplot as plt

''' distance() from the previous snippet is needed here as well '''

points = np.array([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [3, 1], [3, 2], [3, 3]])
   # points = existing points
p = np.array([2.5, 2])   # p = point we wish to insert

def majority_vote(votes):
    """Return the most common element of votes, breaking ties at random."""
    vote_counts = {}
    for vote in votes:
        if vote in vote_counts:
            vote_counts[vote] += 1
        else:
            vote_counts[vote] = 1
    winners = []
    max_count = max(vote_counts.values())
    for vote, count in vote_counts.items():
        if count == max_count:
            winners.append(vote)
    return random.choice(winners)  # pick a winner at random if there is more than one

# >>> votes = [1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 1, 1, 1, 2, 3, 3, 3, 2, 2, 2, 3, 2, 3, 1, 1, 2]
# sample votes above
# >>> vote_counts = majority_vote(votes)

def majority_vote_short(votes):
    """Same as majority_vote(), using the mode from scipy.stats."""
    mode, count = ss.mstats.mode(votes)
    return mode

def find_nearest_neighbours(p, points, k=5):
    """Return the indices of the k points nearest to p."""
    distances = np.zeros(points.shape[0])
    for i in range(len(distances)):
        distances[i] = distance(p, points[i])
    ind = np.argsort(distances)   # indices that would sort the distances
    return ind[:k]

ind = find_nearest_neighbours(p, points, 2)
print(points[ind])   # the two nearest neighbours for this sample case

plt.plot(points[:, 0], points[:, 1], "ro")   # existing points in red
plt.plot(p[0], p[1], "bo")                   # new point in blue
plt.axis([0.5, 3.5, 0.5, 3.5])
plt.show()
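As a quick check that the two vote counters agree whenever the mode is unique (the sample votes below are made up for illustration):

Python

votes = [1, 2, 3, 2, 2, 3, 1, 1, 2, 3]   # 2 occurs most often
print(majority_vote(votes))         # 2
print(majority_vote_short(votes))   # 2 (returned as a length-1 array by scipy)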

kNN Prediction on Synthetic Data

After finding the nearest neighbors, we have to predict the category of the input point. We build a function called knn_predict() that predicts the class of the point we wish to insert. We also build a function called generate_synth_data() that generates synthetic points in the x-y plane by drawing from two bivariate normal distributions, one per class.
 

Python
import numpy as np
import random
import scipy.stats as ss
import matplotlib.pyplot as plt

''' add the functions and libraries from the previous snippets '''

def knn_predict(p, points, outcomes, k=5):
    """Predict the class of p from the classes of its k nearest neighbours."""
    ind = find_nearest_neighbours(p, points, k)
    return majority_vote(outcomes[ind])

outcomes = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
print(knn_predict(np.array([2.5, 2.7]), points, outcomes, k=2))

def generate_synth_data(n=50):
    """Generate n points per class from two bivariate normal distributions."""
    points = np.concatenate((ss.norm(0, 1).rvs((n, 2)), ss.norm(1, 1).rvs((n, 2))), axis=0)
    outcomes = np.concatenate((np.repeat(0, n), np.repeat(1, n)))
    return (points, outcomes)

n = 20
(points, outcomes) = generate_synth_data(n)   # regenerate points so the plot shows synthetic data
plt.figure()
plt.plot(points[:n, 0], points[:n, 1], "ro")  # class 0 in red
plt.plot(points[n:, 0], points[n:, 1], "bo")  # class 1 in blue
plt.show()
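Because the two classes are drawn from bivariate normals centred at (0, 0) and (1, 1) with unit variance, they overlap, so no classifier can separate them perfectly. A quick look at the empirical class means confirms the offset:

Python

(points, outcomes) = generate_synth_data(50)
print(points[outcomes == 0].mean(axis=0))   # close to (0, 0)
print(points[outcomes == 1].mean(axis=0))   # close to (1, 1)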

kNN Prediction Grid

We build a function called make_prediction_grid() that lays a grid over a region of the plane and assigns a predicted class to every grid point. Another function, plot_prediction_grid(), plots the output of make_prediction_grid() using matplotlib.
 

Python
import numpy as np
import random
import scipy.stats as ss
import matplotlib.pyplot as plt

''' add the functions and libraries from the previous snippets '''

def make_prediction_grid(predictors, outcomes, limits, h, k):
    """Classify each point of a grid with spacing h inside the given limits."""
    (x_min, x_max, y_min, y_max) = limits
    xs = np.arange(x_min, x_max, h)
    ys = np.arange(y_min, y_max, h)
    xx, yy = np.meshgrid(xs, ys)

    prediction_grid = np.zeros(xx.shape, dtype=int)
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            p = np.array([x, y])
            prediction_grid[j, i] = knn_predict(p, predictors, outcomes, k)
    return (xx, yy, prediction_grid)

def plot_prediction_grid(xx, yy, prediction_grid, predictors, outcomes, filename):
    """Plot kNN predictions for every point on the grid."""
    from matplotlib.colors import ListedColormap
    background_colormap = ListedColormap(["hotpink", "lightskyblue", "yellowgreen"])
    observation_colormap = ListedColormap(["red", "blue", "green"])
    plt.figure(figsize=(10, 10))
    plt.pcolormesh(xx, yy, prediction_grid, cmap=background_colormap, alpha=0.5)
    plt.scatter(predictors[:, 0], predictors[:, 1], c=outcomes, cmap=observation_colormap, s=50)
    plt.xlabel('Variable 1'); plt.ylabel('Variable 2')
    plt.xticks(()); plt.yticks(())
    plt.xlim(np.min(xx), np.max(xx))
    plt.ylim(np.min(yy), np.max(yy))
    plt.savefig(filename)

(predictors, outcomes) = generate_synth_data()
# >>> predictors.shape
# >>> outcomes.shape
k = 5; filename = "knn_synth_5.pdf"; limits = (-3, 4, -3, 4); h = 0.1
(xx, yy, prediction_grid) = make_prediction_grid(predictors, outcomes, limits, h, k)
plot_prediction_grid(xx, yy, prediction_grid, predictors, outcomes, filename)
plt.show()

Output: The plot is a grid of two classes, shown as pink and green regions. We predicted the class of every grid point from its position and its nearest neighbors. The green points should fall in the green bricks of the grid and the red points in the pink bricks; zoom in on the plot to check visually that the classifier works.
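To see how the decision boundary changes with k, the same grid can be regenerated for several values of k; a short sketch (the filenames are illustrative):

Python

for k in (1, 5, 50):
    (xx, yy, prediction_grid) = make_prediction_grid(predictors, outcomes, (-3, 4, -3, 4), 0.1, k)
    plot_prediction_grid(xx, yy, prediction_grid, predictors, outcomes, "knn_synth_%d.pdf" % k)

With k = 1 the boundary is jagged and chases individual points; with k = 50 it becomes much smoother.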
 

Classifying the IRIS Dataset

We will test our classifier on a scikit-learn dataset called "IRIS". To import it, we import datasets from sklearn and call datasets.load_iris(). The IRIS dataset holds the sepal length, sepal width, petal length and petal width of three different classes of Iris flower: Iris-Setosa, Iris-Versicolour and Iris-Virginica. Based on this data, we classify and visualize the flowers using our classifier. scikit-learn (sklearn) already ships a prebuilt kNN classifier, so we compare the two classifiers (scikit-learn's versus the one we built) and check the prediction accuracy of each.
 

Python
from sklearn import datasets
import numpy as np
import random
import matplotlib.pyplot as plt

''' add the functions and libraries from the previous snippets '''

iris = datasets.load_iris()
# >>> iris["data"]
predictors = iris.data[:, 0:2]   # use only sepal length and sepal width
outcomes = iris.target

plt.plot(predictors[outcomes == 0][:, 0], predictors[outcomes == 0][:, 1], "ro")
plt.plot(predictors[outcomes == 1][:, 0], predictors[outcomes == 1][:, 1], "go")
plt.plot(predictors[outcomes == 2][:, 0], predictors[outcomes == 2][:, 1], "bo")

k = 5; filename = "iris_grid.pdf"; limits = (4, 8, 1.5, 4.5); h = 0.1
(xx, yy, prediction_grid) = make_prediction_grid(predictors, outcomes, limits, h, k)
plot_prediction_grid(xx, yy, prediction_grid, predictors, outcomes, filename)
plt.show()

from sklearn.neighbors import KNeighborsClassifier   # predictions from scikit-learn
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(predictors, outcomes)
sk_predictions = knn.predict(predictors)

my_predictions = np.array([knn_predict(p, predictors, outcomes, 5) for p in predictors])

# >>> sk_predictions == my_predictions
# >>> np.mean(sk_predictions == my_predictions)
print("training accuracy of the scikit-learn classifier (%):")
print(100 * np.mean(sk_predictions == outcomes))
print("training accuracy of our own classifier (%):")
print(100 * np.mean(my_predictions == outcomes))
# our homemade predictor actually does somewhat better here

Output: On the training data, our classifier actually scores slightly better than the scikit-learn classifier, likely because of how ties among the neighbors' classes are broken.
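Note that both accuracies above are measured on the same data the classifiers were fit to, so they overstate real-world performance. A fairer comparison holds out a test set; here is a minimal sketch using sklearn.model_selection.train_test_split:

Python

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Hold out 30% of the iris data for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    predictors, outcomes, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(100 * np.mean(knn.predict(X_test) == y_test))   # scikit-learn test accuracy (%)

my_test_predictions = np.array([knn_predict(p, X_train, y_train, 5) for p in X_test])
print(100 * np.mean(my_test_predictions == y_test))   # our classifier's test accuracy (%)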

This article is contributed by Amaryta Ranjan Saikia.
 

