Skip to main content

Basic Concepts Of KNN Algorithm

Basic Concepts Of K-NN Algorithm 

It is probably, one of the simplest but strong supervised learning algorithms used for classification as well regression purposes. It is most commonly used to classify the data points that are separated into several classes, in order to make prediction for new sample data points. It is a non-parametric and lazy learning algorithm. It classifies the data points based on the similarity measure (e.g. distance measures, mostly Euclidean distance).

Principle: K- NN algorithm is based on the principle that, “the similar things exist closer to each other or Like things are near to each other.”

In this algorithm ‘K’ refers to the number of neighbors to consider for classification. It should be odd value.  The value of ‘K’ must be selected carefully otherwise it may cause defects in our model. If the value of ‘K’ is small then it causes Low Bias, High variance i.e. over fitting of model. In the same way if ‘K’ is very large then it leads to High Bias, Low variance i.e. under fitting of model. There are many researches done on selection of right value of K, however in most of the cases taking ‘K’ = {square-root of (total number of data ‘n’)} gives pretty good result. If the value ‘K’ comes to be odd then it’s all right else we make it odd either by adding or subtracting 1 from it.

K-NN
Classification for k=3
K-NN works pretty well with a small number of input variables (p), but there are more chances of error in prediction when the number of inputs becomes very large.

It is based on the simple mathematics that we used in high school level to measure the distance between two data points in graph. Some of the distance measuring techniques (Formulae) that we can use for K-NN classification are:

  •         Euclidean Distance:

Euclidean Distance
fig: Euclidean Distance

  •         Manhattan Distance:


Manhattan Distance
fig: Manhattan Distance

  •         Minkowski Distance:


Minkowski distance
fig: Minkowski Distance
4.       
For p=1, we get Manhattan Distance and for p=2, we get Euclidean Distance. So, we can say that Minkowski distance is generalized form of Manhattan Distance, Euclidean Distance.

Among these methods, Euclidean Distance method is widely used.

Algorithm for K-NN:

1.      Load the given data file into your program
2.      Initialize the number of neighbor to be considered i.e. ‘K’ (must be odd).
3.      Now for each tuples (entries or data point) in the data file we perform:
                                i.            Calculate distance between the data point (tuple) to be classified and each data points in the given data file.
                              ii.            Then add the distances corresponding to data points (data entries) in given data file (probably by adding column for distance).
                            iii.            Sort the data in data file from smallest to largest (in ascending order) by the distances.
4.      Pick the first K entries from the sorted collection of data.
5.      Observe the labels of the selected K entries.
6.      For classification, return the mode of the K labels and for regression, return the mean of K labels.

Decision Boundary for K-NN:

Decision boundary for classification using K-NN algorithm
source: stackoverflow.com

 Advantages and Disadvantage of K-NN:

Advantages:
1.      The Kalgorithm is quiet easy and simple to implement as it does not include much of mathematics.
2.      We can do solve both classification and regression problem using K-NN algorithm.
Disadvantages:
1.      The algorithm becomes highly slower as the size of data increases.


Comments

Popular posts from this blog

Understanding KNN(K-nearest neighbor) with example

Understanding KNN(K-nearest neighbor) with example.  It is probably, one of the simplest but strong supervised learning algorithms used for classification as well regression purposes. It is most commonly used to classify the data points that are separated into several classes, in order to make prediction for new sample data points. It is a non-parametric and lazy learning algorithm. It classifies the data points based on the similarity measure (e.g. distance measures, mostly Euclidean distance). Assumption of KNN : K- NN algorithm is based on the principle that, “the similar things exist closer to each other or Like things are near to each other.” In this algorithm ‘K’ refers to the number of neighbors to consider for classification. It should be odd value.  The value of ‘K’ must be selected carefully otherwise it may cause defects in our model. If the value of ‘K’ is small then it causes Low Bias, High variance i.e. over fitting of model. In the same way if ‘K’ is v...

Implementation Of KNN (From Scratch in PYTHON)

Implementation Of KNN (From Scratch in PYTHON) Implementation Of KNN (From Scratch in PYTHON) KNN classifier is one of the simplest but strong supervised machine learning algorithm. It can  be used for both classification and regression problems. There are some libraries in python to implement KNN, which allows a programmer to make KNN model easily without using deep ideas of mathematics. But if we try to implement KNN from scratch it becomes a bit tricky. Before getting into the program lets recall the algorithm of KNN: Algorithm for K-NN: 1.     Load the given data file into your program 2.     Initialize the number of neighbor to be considered i.e. ‘K’ (must be odd). 3.     Now for each tuples (entries or data point) in the data file we perform:                        i.     Calculate distance between the data point (tuple) to be classified and each data points in the gi...

Supervised Machine Learning

Supervised Machine Learning What Is Supervised Learning?  It is the machine learning algorithm that learns from labeled data. After the data is analyzed and learned, the algorithm determines which label should be given to new data supplied by the user based on pattern and associating the patterns to the unlabeled new data. Supervised Learning algorithm has two categories i.e Classification & Regression Classification predicts the class or category in which the data belongs to. e.g.: Spam filtering and detection, Churn Prediction, Sentiment Analysis, image classification. Regression predicts a numerical value based on previously observed data. e.g.: House Price Prediction, Stock Price Prediction. Classification Classification is one of the widely and mostly used techniques for determining class the dependent belongs to base on the one or more independent variables. For simple understanding, what classification algorithm does is it simply makes a decision boundary between data po...