
Implementation Of KNN (From Scratch in PYTHON)


The KNN (K-nearest neighbors) classifier is one of the simplest yet most effective supervised machine learning algorithms. It can be used for both classification and regression problems. Several Python libraries implement KNN, so a programmer can build a KNN model easily without working through the underlying mathematics. Implementing KNN from scratch, however, is a bit trickier.
Before getting into the program, let's recall the KNN algorithm:

Algorithm for K-NN:

1.   Load the given data file into your program.
2.   Initialize the number of neighbors to consider, i.e. ‘K’ (an odd value helps avoid tied votes when there are two classes).
3.   For the data point (tuple) to be classified:
                      i.   Calculate the distance between it and each data point in the given data file.
                    ii.   Store each distance alongside its data point (for example, by adding a column for distance).
                  iii.   Sort the data points by distance in ascending order (smallest to largest).
4.   Pick the first K entries from the sorted collection of data.
5.   Observe the labels of the selected K entries.
6.   For classification, return the mode of the K labels; for regression, return the mean of the K labels.
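
Step 6 is the only place where classification and regression differ. As a minimal sketch, the mode and the mean of the K selected labels could be computed as follows. The helper names `predict_class` and `predict_value` and the toy labels are illustrative assumptions, not part of the program in this post.

```python
from collections import Counter

import numpy as np


def predict_class(k_labels):
    """Classification: return the most frequent label (the mode)."""
    # most_common(1) returns [(label, count)] for the top label
    return Counter(k_labels).most_common(1)[0][0]


def predict_value(k_labels):
    """Regression: return the mean of the numeric labels."""
    return float(np.mean(k_labels))


# Hypothetical labels of the K = 3 nearest neighbors
class_result = predict_class(["Virginica", "Versicolor", "Virginica"])
value_result = predict_value([2.0, 3.0, 4.0])
print(class_result)
print(value_result)
```

When two labels receive the same number of votes, `Counter.most_common` returns the one that was encountered first, which is one reason to choose K with care.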

Now we are ready to dive into the code. Let's implement KNN from scratch (using pandas and NumPy only). We are going to classify the iris data into its species using 4 features: sepal length, sepal width, petal length, and petal width. We have 150 observations (tuples) in total, and we will build a KNN classification model on the basis of these observations.

Link to download the iris dataset: iris.csv


import pandas as pd
import numpy as np
import operator

# Load the data file into the program; replace the path with the location of your CSV file
dataset = pd.read_csv("E:/input/iris.csv")
print(dataset.head())  # prints the first five tuples of the data

# Euclidean distance between the first `length` values of two rows
def E_Distance(x1, x2, length):
    distance = 0
    for x in range(length):
        # .iloc gives positional access, so the label column is never touched
        distance += np.square(x1.iloc[x] - x2.iloc[x])
    return np.sqrt(distance)

# K-NN model: returns the predicted class and the row indices of the k nearest neighbors
def knn(trainingSet, testInstance, k):
    distances = {}
    length = testInstance.shape[1]  # number of features
    testRow = testInstance.iloc[0]  # the single row to classify
    for x in range(len(trainingSet)):
        dist = E_Distance(testRow, trainingSet.iloc[x], length)
        distances[x] = dist
    # sort (row index, distance) pairs by distance, smallest first
    sortdist = sorted(distances.items(), key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(sortdist[x][0])
    Count = {}  # number of votes per class among the k neighbors
    for x in range(len(neighbors)):
        # the class label is stored in the last column
        response = trainingSet.iloc[neighbors[x]].iloc[-1]
        if response in Count:
            Count[response] += 1
        else:
            Count[response] = 1
    # the class with the most votes comes first
    sortcount = sorted(Count.items(), key=operator.itemgetter(1), reverse=True)
    return (sortcount[0][0], neighbors)

# making test data set
testSet = [[6.8, 3.4, 4.8, 2.4]]
test = pd.DataFrame(testSet)

# assigning different values to k
k = 1

k1 = 3
k2 = 11

# supplying test data to the model
result, neigh = knn(dataset, test, k)
result1, neigh1 = knn(dataset, test, k1)
result2, neigh2 = knn(dataset, test, k2)

# printing output prediction

print(result)
print(neigh)
print(result1)
print(neigh1)
print(result2)
print(neigh2)



The output of the above program is:

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
4
4
4
Virginica
[141]
Virginica
[141, 145, 110]
Virginica
[141, 145, 110, 115, 139, 147, 77, 148, 140, 112, 144]
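
The program above computes distances one row at a time. As an alternative sketch (not the code that produced the output above), the same steps can be written with NumPy array operations. The function name `knn_vectorized` and the small toy arrays are assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def knn_vectorized(features, labels, query, k):
    """Return (predicted label, indices of the k nearest rows)."""
    # Euclidean distance from the query to every row at once
    distances = np.sqrt(np.sum((features - query) ** 2, axis=1))
    # Indices of the k smallest distances, nearest first
    nearest = np.argsort(distances, kind="stable")[:k]
    # Majority vote among the labels of the k nearest rows
    predicted = Counter(labels[nearest].tolist()).most_common(1)[0][0]
    return predicted, nearest.tolist()


# Toy data (hypothetical values): two features, two classes
toy_features = np.array(
    [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.5, 4.5], [6.0, 5.0]]
)
toy_labels = np.array(["A", "A", "B", "B", "B"])

toy_prediction, toy_neighbors = knn_vectorized(
    toy_features, toy_labels, np.array([5.2, 4.8]), k=3
)
print(toy_prediction, toy_neighbors)
```

With the iris DataFrame, `features` could be taken as `dataset.iloc[:, :4].to_numpy()` and `labels` as `dataset.iloc[:, -1].to_numpy()`, assuming the four feature columns come first and the species column is last, as in the printed header.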
