
Project | kNN | Classifying IRIS Dataset

Last Updated : 19 Sep, 2023

Introduction | kNN Algorithm

Statistical learning refers to a collection of mathematical and computational tools for understanding data. In what is often called supervised learning, the goal is to estimate or predict an output based on one or more inputs. The inputs go by many names: predictors, independent variables, features, or simply variables. The output or outputs are often called response variables or dependent variables. If the response is quantitative (say, a number that measures weight or height), we call these regression problems. If the response is qualitative (say, yes or no, or blue or green), we call these classification problems. This case study deals with one specific approach to classification.

The goal is to set up a classifier such that, when it is presented with a new observation whose category is unknown, it assigns that observation to a category (or class) based on the observations for which it does know the true category. This specific method is known as the k-Nearest Neighbors classifier, or kNN for short. Given a positive integer k (say, 5) and a new data point, it first identifies the k points in the data that are nearest to the new point, and then classifies the new point as belonging to the most common class among those k neighbors.
Aim: Build our very own k-Nearest Neighbors classifier to classify data from scikit-learn's IRIS dataset.
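Before building the classifier piece by piece, it is worth seeing the whole decision rule at a glance. The sketch below is purely illustrative and is not the article's implementation; the names knn_sketch, train_points and train_labels are ours, and the following sections develop each ingredient (distance, majority vote, neighbor search) properly.

Python

import numpy as np
from collections import Counter

def knn_sketch(p, train_points, train_labels, k=5):
    # Euclidean distance from p to every training point
    dists = np.linalg.norm(train_points - p, axis=1)
    nearest = np.argsort(dists)[:k]   # indices of the k closest points
    # majority class among the k neighbours (ties broken by first occurrence)
    return Counter(train_labels[nearest]).most_common(1)[0][0]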
 

Distance between two points

We are going to write a function that finds the distance between two given 2-D points in the x-y plane. We will import numpy and use numpy arrays to store the coordinates. Finding the distance between two points is what lets us identify the nearest neighbors of an input point.
 

Python




import numpy as np

def distance(p1, p2):
    """Euclidean distance between two points stored as numpy arrays."""
    return np.sqrt(np.sum(np.power(p2 - p1, 2)))

p1 = np.array([1, 1])   # coordinates x = 1, y = 1
p2 = np.array([4, 4])   # coordinates x = 4, y = 4
distance(p1, p2)
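For the sample points above, this returns sqrt((4 - 1)^2 + (4 - 1)^2) = sqrt(18) ≈ 4.243.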


Majority vote counter

We will create a 3 x 3 grid of points with the help of a numpy array to build an environment of dispersed points in the plane. We will also create a function called majority_vote() to find the element with the highest count (the most votes) in a given sequence of votes, e.g. (1, 2, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 3, 2, 3). This is simply the mode of the data, so it can also be computed with the help of the scipy statistics module. We will create another function, majority_vote_short(), which performs the same job as majority_vote() but uses mode() from scipy.stats. Both functions will be needed when predicting points later.

Our aim is to build a kNN classifier, so we need an algorithm that finds the nearest neighbors of a given point. Suppose we wish to insert a point into the x-y plane within an environment of existing points. We will have to classify the new point into one of the categories of the existing points and insert it accordingly. So we will build a function, find_nearest_neighbours(), that takes three parameters: (i) the point we wish to insert, (ii) the set of existing points, and (iii) k, the number of neighbor indices to return. We will visualize the situation by plotting the x-y plane filled with points with the help of matplotlib.
 

Python




import numpy as np
import random
import scipy.stats as ss
import matplotlib.pyplot as plt
  
points = np.array([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2],
                   [2, 3], [3, 1], [3, 2], [3, 3]])   # existing points
p = np.array([2.5, 2])   # point we wish to insert

def majority_vote(votes):
    """Return the most common element in votes, breaking ties at random."""
    vote_counts = {}
    for vote in votes:
        if vote in vote_counts:
            vote_counts[vote] += 1
        else:
            vote_counts[vote] = 1
    winners = []
    max_count = max(vote_counts.values())
    for vote, count in vote_counts.items():
        if count == max_count:
            winners.append(vote)
    return random.choice(winners)   # pick a winner at random if there is a tie

# >>> votes = [1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 1, 1, 1, 2, 3, 3, 3, 2, 2, 2, 3, 2, 3, 1, 1, 2]
# sample votes above
# >>> vote_counts = majority_vote(votes)

def majority_vote_short(votes):
    """Same job as majority_vote(), using the mode from scipy.stats."""
    mode, count = ss.mstats.mode(votes)
    return mode

def find_nearest_neighbours(p, points, k=5):
    """Return the indices of the k points nearest to p."""
    distances = np.zeros(points.shape[0])
    for i in range(len(distances)):
        distances[i] = distance(p, points[i])   # distance() is defined in the previous snippet
    ind = np.argsort(distances)   # indices that would sort the distances
    return ind[:k]

ind = find_nearest_neighbours(p, points, 2)
print(points[ind])   # the nearest neighbours for this sample case

plt.plot(points[:, 0], points[:, 1], "ro")   # existing points in red
plt.plot(p[0], p[1], "bo")                   # new point in blue
plt.axis([0.5, 3.5, 0.5, 3.5])
plt.show()
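With k = 2, the two nearest neighbors of p = (2.5, 2) are (2, 2) and (3, 2), each at distance 0.5, so the printed array is [[2 2] [3 2]].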


kNN Predict around Synthetic Data

After finding the nearest neighbors, we have to predict the category of the input point. We will build a function called knn_predict() that predicts the class of the point we wish to insert. We will also build a function called generate_synth_data() to generate synthetic points in the x-y plane.
 

Python




import numpy as np
import random
import scipy.stats as ss
import matplotlib.pyplot as plt
  
''' add the functions and libraries from the previous programs '''

def knn_predict(p, points, outcomes, k=5):
    """Predict the class of p as the majority class among its k nearest neighbours."""
    ind = find_nearest_neighbours(p, points, k)
    return majority_vote(outcomes[ind])

outcomes = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
knn_predict(np.array([2.5, 2.7]), points, outcomes, k=2)

def generate_synth_data(n=50):
    """Generate n points per class from two bivariate normals, plus class labels."""
    points = np.concatenate((ss.norm(0, 1).rvs((n, 2)),
                             ss.norm(1, 1).rvs((n, 2))), axis=0)
    outcomes = np.concatenate((np.repeat(0, n), np.repeat(1, n)))
    return (points, outcomes)

n = 20
points, outcomes = generate_synth_data(n)   # generate fresh synthetic points to plot
plt.figure()
plt.plot(points[:n, 0], points[:n, 1], "ro")   # class 0 in red
plt.plot(points[n:, 0], points[n:, 1], "bo")   # class 1 in blue
plt.show()
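For the sample point (2.5, 2.7) with k = 2, the two nearest of the nine grid points are (2, 3) and (3, 3); both carry outcome 1, so knn_predict returns 1.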


kNN Prediction GRID

We will build a function called make_prediction_grid() that lays a grid of spacing h over the plane and predicts the class of every point on the grid. Another function, plot_prediction_grid(), is created to plot the output of make_prediction_grid() using matplotlib.
 

Python




import numpy as np
import random
import scipy.stats as ss
import matplotlib.pyplot as plt
  
def make_prediction_grid(predictors, outcomes, limits, h, k):
    """Classify each point of a grid with spacing h over the rectangle given by limits."""
    (x_min, x_max, y_min, y_max) = limits
    xs = np.arange(x_min, x_max, h)
    ys = np.arange(y_min, y_max, h)
    xx, yy = np.meshgrid(xs, ys)

    prediction_grid = np.zeros(xx.shape, dtype=int)
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            p = np.array([x, y])
            # meshgrid puts the ys along the first axis, hence the [j, i] indexing
            prediction_grid[j, i] = knn_predict(p, predictors, outcomes, k)
    return (xx, yy, prediction_grid)

def plot_prediction_grid(xx, yy, prediction_grid, predictors, outcomes, filename):
    """Plot kNN predictions for every point on the grid.

    predictors and outcomes are passed in explicitly so the function
    does not depend on global variables.
    """
    from matplotlib.colors import ListedColormap
    background_colormap = ListedColormap(["hotpink", "lightskyblue", "yellowgreen"])
    observation_colormap = ListedColormap(["red", "blue", "green"])
    plt.figure(figsize=(10, 10))
    plt.pcolormesh(xx, yy, prediction_grid, cmap=background_colormap, alpha=0.5)
    plt.scatter(predictors[:, 0], predictors[:, 1], c=outcomes, cmap=observation_colormap, s=50)
    plt.xlabel('Variable 1'); plt.ylabel('Variable 2')
    plt.xticks(()); plt.yticks(())
    plt.xlim(np.min(xx), np.max(xx))
    plt.ylim(np.min(yy), np.max(yy))
    plt.savefig(filename)
  
(predictors, outcomes) = generate_synth_data()
# >>>predictors.shape
# >>>outcomes.shape
k = 5; filename = "knn_synth_5.pdf"; limits = (-3, 4, -3, 4); h = 0.1
(xx, yy, prediction_grid) = make_prediction_grid(predictors, outcomes, limits, h, k)
plot_prediction_grid(xx, yy, prediction_grid, predictors, outcomes, filename)
plt.show()
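The grid spacing h trades resolution for computation: every cell costs one call to knn_predict(), and with h = 0.1 over the 7 x 7 range above the grid already holds 70 x 70 = 4,900 predictions.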


Output: The plot shown here is a grid of two classes, visually rendered as pink and green. We tried to predict the class of each grid point based on its position and surroundings. The green points should fall in the green cells of the grid and the red points in the pink cells. Look at an enlarged view to visually check that the classifier works.
 

Classifying the IRIS Dataset

We will test our classifier on a scikit-learn dataset called "IRIS". To import it, we import datasets from sklearn and call datasets.load_iris(). The IRIS dataset holds the sepal length, sepal width, petal length and petal width of three classes of Iris flower: Iris-Setosa, Iris-Versicolour and Iris-Virginica. Based on this data, we need to classify and visualize the flowers using our classifier. The scikit-learn (sklearn) library already ships a pre-built kNN classifier, so we will compare the two classifiers (scikit-learn's versus the one we built) and check the prediction accuracy of each.
 

Python




from sklearn import datasets
import numpy as np
import random
import matplotlib.pyplot as plt
   
iris = datasets.load_iris()
# >>> iris["data"]
predictors = iris.data[:, 0:2]   # only sepal length and width, so results stay plottable in 2-D
outcomes = iris.target

plt.plot(predictors[outcomes == 0][:, 0], predictors[outcomes == 0][:, 1], "ro")
plt.plot(predictors[outcomes == 1][:, 0], predictors[outcomes == 1][:, 1], "go")
plt.plot(predictors[outcomes == 2][:, 0], predictors[outcomes == 2][:, 1], "bo")

k = 5; filename = "iris_grid.pdf"; limits = (4, 8, 1.5, 4.5); h = 0.1
(xx, yy, prediction_grid) = make_prediction_grid(predictors, outcomes, limits, h, k)
plot_prediction_grid(xx, yy, prediction_grid, predictors, outcomes, filename)
plt.show()

from sklearn.neighbors import KNeighborsClassifier   # predictions from scikit-learn
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(predictors, outcomes)
sk_predictions = knn.predict(predictors)

my_predictions = np.array([knn_predict(p, predictors, outcomes, 5) for p in predictors])

# >>> sk_predictions == my_predictions
# >>> np.mean(sk_predictions == my_predictions)
print("prediction accuracy of the scikit-learn classifier:")
print(100 * np.mean(sk_predictions == outcomes))
print("prediction accuracy of our own model:")
print(100 * np.mean(my_predictions == outcomes))
# our homemade predictor does slightly better here, but note that both
# accuracies are measured on the training data itself


Output: It seems from the output that our classifier actually performs slightly better than the sklearn classifier on this data. Keep in mind, though, that both accuracies are measured on the very data the classifiers were fit on, and the small gap largely comes down to how each classifier breaks ties between equally distant neighbors.
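A fairer comparison holds out data that neither classifier sees while fitting. Below is a minimal sketch of such a comparison, assuming the imports and functions from the previous snippets are still in scope; train_test_split is scikit-learn's own helper, while the variable names are ours.

Python

from sklearn.model_selection import train_test_split

# hold out 30% of the iris data that neither classifier sees during fitting
X_train, X_test, y_train, y_test = train_test_split(
    predictors, outcomes, test_size=0.3, random_state=0)

sk_knn = KNeighborsClassifier(n_neighbors=5)
sk_knn.fit(X_train, y_train)
sk_test_acc = np.mean(sk_knn.predict(X_test) == y_test)

# our classifier has no fitting step: it simply keeps the training points
my_test_acc = np.mean(np.array(
    [knn_predict(p, X_train, y_train, 5) for p in X_test]) == y_test)

print(100 * sk_test_acc, 100 * my_test_acc)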

 


