kNN Algorithm Python Implementation

kNN stands for k-Nearest Neighbours.

kNN can be used as a classification algorithm as well as a regression algorithm. In this project, I am applying kNN as a classification algorithm. The kNN algorithm chooses the k nearest neighbours of the test data and then assigns the test data to the class that has the highest frequency among those k neighbours.

In simple words, let us take the example shown in the image below.


We have 5 red plots and 5 green plots in the coordinate system. Now we have test data, shown as the blue plot. Our task is to predict the class of the blue plot. We will calculate the Euclidean distance of the blue plot from the other plots. The value of K is the number of nearest points that we have to consider. For K = 3, we select the 3 nearest neighbours (lowest Euclidean distance). Most of these 3 plots belong to class "Red", thus the test data belongs to class "Red".

Now, choosing K = 4, we select the 4 nearest neighbours. Of these 4 plots, 3 are red and 1 is green. The majority of them are "Red", thus the test plot is of the "Red" class. Note, though, that an even K can produce a tie (e.g. 2 red and 2 green), which is one reason odd values of K are usually preferred.

Now, choosing K = 7, we select the 7 nearest neighbours. Of these 7 plots, 5 are red and 2 are green. The majority of them are "Red", thus the test plot is of the "Red" class.
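To make the vote concrete, here is a minimal sketch with made-up coordinates for the red, green and blue plots (the numbers are hypothetical, chosen only for illustration):

import numpy as np

# Hypothetical coordinates: 5 "Red" points and 5 "Green" points
points = np.array([[1, 2], [2, 1], [2, 3], [3, 2], [3, 4],    # Red
                   [6, 6], [7, 5], [7, 7], [8, 6], [8, 8]])   # Green
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])             # 0 = Red, 1 = Green
test = np.array([3, 3])                                       # the blue plot

k = 3
distances = np.sqrt(np.sum((points - test) ** 2, axis=1))  # Euclidean distances
nearest = labels[np.argsort(distances)[:k]]                # labels of the k nearest
votes = np.bincount(nearest, minlength=2)                  # per-class frequency
print('Red' if np.argmax(votes) == 0 else 'Green')         # -> Red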

Now, the question arises: how do we select the value of K? We got the correct class for K = 3, K = 4 and K = 7.

When datasets are completely linearly separable, smaller values of K can easily distinguish between the classes. When a dataset has a blurry boundary, it becomes tough to separate the classes, and this is where we have to experiment with various values of K.

Have a look at the following "dirty" plots for various values of K along with their classification boundaries.






The above plots show the Iris flower test data after applying PCA. Increasing K misclassifies some points, while decreasing K classifies all the points correctly. When the accuracies for various values of K were plotted, we got the following result.


The smallest value of K gave zero training error but a high validation error; we were completely overfitting our data. As we increased K, training accuracy decreased (training error increased) while validation accuracy improved. Around the intersection of the two plots we find an adequate value of K.
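A plot like that can be produced by sweeping K and measuring accuracy on both sets. Below is a minimal sketch; the arrays X_train, y_train, X_val, y_val are assumed to hold an already-made train/validation split (these names are my own, not from the code that follows):

import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest training points to x."""
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = y_train[np.argsort(distances)[:k]].astype(int)
    return np.argmax(np.bincount(nearest, minlength=3))

def accuracy(X_train, y_train, X, y, k):
    preds = np.array([knn_predict(X_train, y_train, x, k) for x in X])
    return np.mean(preds == y)

# Sweep K and compare training accuracy against validation accuracy
for k in range(1, 21):
    print(k, accuracy(X_train, y_train, X_train, y_train, k),
             accuracy(X_train, y_train, X_val, y_val, k))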

I have used the Iris Flower Dataset and my own testing set.

First, we will import the numpy module.

import numpy as np


Now, we will open our training Iris flowers dataset file and load the data into a numpy array. The class name in the last column is encoded as an integer: 0 for Iris-setosa, 1 for Iris-versicolor and 2 for Iris-virginica.

train_file = open('iris.txt')
train_data = np.empty((0, 5))
line = train_file.readline()
while line:
    arr = line.split(',')
    # Store the four features plus the integer-encoded class label
    train_data = np.append(train_data, np.array([[arr[0], arr[1], arr[2], arr[3],
                           0 if arr[4].strip() == 'Iris-setosa'
                           else 1 if arr[4].strip() == 'Iris-versicolor'
                           else 2]]), 0)
    line = train_file.readline()
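As an aside, the same load can be written more compactly with np.genfromtxt. This is only a sketch, assuming the file keeps the comma-separated layout shown above:

import numpy as np

# Map the label names to the same integer codes used above
name_to_code = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
train_data = np.genfromtxt('iris.txt', delimiter=',', encoding='utf-8',
                           converters={4: lambda s: name_to_code[s.strip()]})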

Close the training Iris flower dataset file, then open the test dataset and read its first line.

train_data = train_data.astype(float)
train_file.close()
test_file = open('iris_test.txt')
line = test_file.readline()

Iterate through the test file and calculate the Euclidean distance of each test point from every training point, storing the results in a numpy array. Subtract the test point from the training features, square the differences, sum them along axis=1, and apply the square root. Then append the label of each training point along axis=1 so every distance keeps the class it came from.
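For reference, the distance being computed between a test point x = (x1, x2, x3, x4) and a training point y = (y1, y2, y3, y4) is the usual Euclidean distance: d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + (x3 - y3)^2 + (x4 - y4)^2).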

k = 5
label = np.empty((0, 1))
while line:
    arr = line.split(',')
    test_arr = np.array([[arr[0], arr[1], arr[2], arr[3]]]).astype(float)
    # Integer-encode the true label of the test point (kept for reference)
    label = np.append(label, np.array([[0 if arr[4].strip() == 'Iris-setosa'
                                        else 1 if arr[4].strip() == 'Iris-versicolor'
                                        else 2]]))
    inp = train_data[:, :4]
    onp = inp - test_arr    # feature-wise difference from every training point
    onp = onp**2            # squared differences
    # Column 0: Euclidean distance; column 1: the training point's label
    onp = np.append(np.sqrt(np.sum(onp, axis=1)).reshape(-1, 1),
                    train_data[:, 4].reshape(-1, 1), 1)

We now have the Euclidean distance of the test point from every training point. Sort the rows by distance (the first column) and select the first K rows. We then count the frequency of each class within those K rows; the class with the maximum frequency wins, and the test point is classified to that class.

    frequency = np.zeros((1, 3))
    # Sort rows by distance (column 0) and take the labels of the k nearest
    temp = onp[onp[:, 0].argsort()][0:k, 1].reshape(-1, k)
    for i in temp.reshape(-1, 1):
        frequency[0, int(i)] = frequency[0, int(i)] + 1
    print(np.argmax(frequency))
    line = test_file.readline()
test_file.close()

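A small aside: the counting loop above can also be replaced by np.bincount, which tallies the occurrences of each integer label directly; a sketch:

nearest_labels = onp[onp[:, 0].argsort()][0:k, 1].astype(int)
frequency = np.bincount(nearest_labels, minlength=3)
print(np.argmax(frequency))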
The complete code is as follows:

import numpy as np

# Load the training set; the class name is integer-encoded in column 4
train_file = open('iris.txt')
train_data = np.empty((0, 5))
line = train_file.readline()
while line:
    arr = line.split(',')
    train_data = np.append(train_data, np.array([[arr[0], arr[1], arr[2], arr[3],
                           0 if arr[4].strip() == 'Iris-setosa'
                           else 1 if arr[4].strip() == 'Iris-versicolor'
                           else 2]]), 0)
    line = train_file.readline()
train_data = train_data.astype(float)
train_file.close()

# Classify each test point by majority vote among its k nearest neighbours
test_file = open('iris_test.txt')
line = test_file.readline()
k = 5
label = np.empty((0, 1))
while line:
    arr = line.split(',')
    test_arr = np.array([[arr[0], arr[1], arr[2], arr[3]]]).astype(float)
    label = np.append(label, np.array([[0 if arr[4].strip() == 'Iris-setosa'
                                        else 1 if arr[4].strip() == 'Iris-versicolor'
                                        else 2]]))
    inp = train_data[:, :4]
    onp = inp - test_arr
    onp = onp**2
    onp = np.append(np.sqrt(np.sum(onp, axis=1)).reshape(-1, 1),
                    train_data[:, 4].reshape(-1, 1), 1)
    frequency = np.zeros((1, 3))
    temp = onp[onp[:, 0].argsort()][0:k, 1].reshape(-1, k)
    for i in temp.reshape(-1, 1):
        frequency[0, int(i)] = frequency[0, int(i)] + 1
    print('Iris-setosa' if np.argmax(frequency) == 0
          else 'Iris-versicolor' if np.argmax(frequency) == 1
          else 'Iris-virginica')
    line = test_file.readline()
test_file.close()



With K = 5, the output of the above prediction is:



The training data is the Iris flower dataset as a text file, which can be downloaded from here. The test file is as follows:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica

When I increase the value of K to 10 and have a look at the frequency table, we get the following output:



The Euclidean distance was calculated for each test point from each training point. Each distance had a label associated with it: the label of the training point from which the test point was subtracted. The distances were then sorted, and since K = 10 the first 10 were chosen. The class with the maximum frequency among those 10 was chosen as the winner.
  
Index 0 is Iris-setosa, index 1 is Iris-versicolor and index 2 is Iris-virginica. The 10 nearest training neighbours of the test point 5.1,3.5,1.4,0.2,Iris-setosa all had 'Iris-setosa' as their label; hence there is 10 at index 0, 0 at index 1 and 0 at index 2.

However, in some cases there were 6 virginica points and 4 versicolor points among the first 10 selected. Virginica, having the highest frequency, wins, and the test point gets virginica as its class.
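As an illustration of that last case, the counts and the winning class look like this (the array is reconstructed from the counts described above, not copied from the original output):

import numpy as np

frequency = np.array([[0, 4, 6]])  # [setosa, versicolor, virginica] counts
print(np.argmax(frequency))        # -> 2, i.e. Iris-virginica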

Enjoy.

