K-Nearest Neighbor

Introduction:

K-Nearest Neighbor (KNN) is one of the fundamental algorithms in supervised machine learning. Machine learning models use a set of input values to predict output values. KNN is one of the simplest forms of machine learning: it can solve both classification and regression problems, but it is mostly used for classification. It classifies a data point based on how its neighbors are classified.
KNN is also known as an instance-based model or a lazy learner because it does not construct an internal model during training.

For classification problems, it will find the k nearest neighbors and predict the class by the majority vote of the nearest neighbors.

For regression problems, it will find the k nearest neighbors and predict the value by calculating the mean value of the nearest neighbors.

KNN classifies new data points based on their similarity to the previously stored data points. For example, suppose we have a dataset of apples and bananas. KNN stores the measured features, such as shape and color. When a new object arrives, it checks its similarity in color (red or yellow) and shape against the stored examples.

Topics covered in this story:

  1. What is KNN Classification?
  2. How to find the optimum k value?
  3. How to find the k nearest neighbors?
  4. When should we use KNN Algorithm?
  5. Implementation of KNN
  6. Advantages & Disadvantages of KNN Algorithm
  7. Conclusion

What is KNN Classification?

Let’s learn how to classify data using the KNN algorithm. Suppose we have two classes, Category A and Category B.

Now we have to predict the class of a new data point: does it belong to Category A or Category B?

First, we have to determine the value of k, which denotes the number of neighbors to consider.
Second, we have to find the k nearest neighbors based on distance.

The algorithm finds the k nearest neighbors, and classification is done based on the majority class among those neighbors.

Now, two questions arise:
1. How to find the optimum k value?
2. How to find the k nearest neighbors?

How to find the optimum k value?

One of the trickiest questions is how to choose the value of K. Choosing the right value of K is called parameter tuning, and it is necessary for good results. K in KNN is the number of nearest neighbors considered when assigning a label to a new point. If the value of K is too small, the model is likely to overfit; if it is too large, the algorithm becomes computationally expensive and tends to underfit.

  • One should not use a very low value such as K = 1 because it may lead to overfitting, i.e. the model performs well during the training phase but badly during the testing phase. Choosing a very high value of K can lead to underfitting, i.e. the model performs poorly during both the training and testing phases.
  • A common rule of thumb is K = sqrt(n), where n is the total number of data points.
  • An odd value of K is usually selected to avoid a tie between 2 classes. Since KNN predicts the class based on a majority voting mechanism, an odd K minimizes the chance of a tied vote.
  • Selecting the value of K depends on the individual case, and sometimes the best method is to run through different values of K and compare the outcomes. Using cross-validation, the KNN algorithm can be tested for different values of K, and the value of K that gives the best accuracy can be considered the optimal value.
  • We can also use an error plot or accuracy plot to find the most favorable K value: plot the error for each candidate K and choose the K with the minimum error as the favorable value. A small cross-validation sketch follows this list.
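
As a concrete illustration, the sketch below uses scikit-learn's cross_val_score to compare candidate K values with 5-fold cross-validation. The Iris dataset is used here only as a stand-in; substitute your own feature matrix X and labels y.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset; replace with your own features X and labels y.
X, y = load_iris(return_X_y=True)

# Mean cross-validated accuracy for each candidate K.
k_values = range(1, 31)
cv_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5, scoring="accuracy")
    cv_scores.append(scores.mean())

best_k = k_values[int(np.argmax(cv_scores))]
print(f"Best K by cross-validation: {best_k} (accuracy {max(cv_scores):.3f})")
```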

How to find the k nearest neighbors?

There are different distance metrics for finding the k nearest neighbors, such as

  • Euclidean distance
  • Manhattan distance
  • Minkowski distance

One of the most commonly used metrics is the Euclidean distance.

Euclidean distance is based on the Pythagoras theorem,

Pythagoras theorem - In any right-angled triangle, the square of the hypotenuse (the longest side of the triangle) is equal to the sum of the squares of the other two sides.

Euclidean distance – Euclidean distance is used to calculate the straight-line distance between two points in a plane or in three-dimensional space. Let’s calculate the distance between two points P1 = (x1, y1) and P2 = (x2, y2): d(P1, P2) = sqrt((x2 - x1)^2 + (y2 - y1)^2).
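
A minimal sketch of the three metrics listed above, using two example points chosen purely for illustration:

```python
import numpy as np

# Two example points (illustrative values only).
p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared differences -> 5.0 here.
euclidean = np.sqrt(np.sum((p1 - p2) ** 2))

# Manhattan distance: sum of absolute differences -> 7.0 here.
manhattan = np.sum(np.abs(p1 - p2))

# Minkowski distance of order p (p = 2 gives Euclidean, p = 1 gives Manhattan).
p = 3
minkowski = np.sum(np.abs(p1 - p2) ** p) ** (1 / p)

print(euclidean, manhattan, minkowski)
```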

  1. The KNN algorithm calculates the distance of all data points from the query point using a metric such as the Euclidean distance.
  2. Then, it selects the k nearest neighbors.
  3. Then, based on the majority voting mechanism, the KNN algorithm predicts the class of the query point.
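
A minimal from-scratch sketch of these three steps, using hypothetical training data and a hypothetical query point:

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Predict the class of a query point by majority vote of its k nearest neighbors."""
    # Step 1: Euclidean distance from the query point to every training point.
    distances = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
    # Step 2: indices of the k nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among those neighbors' labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical 2-D training data with two classes, "A" and "B".
X_train = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

print(knn_predict(X_train, y_train, query=np.array([2, 2]), k=3))  # -> A
```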

When should we use KNN Algorithm?

The KNN algorithm is a good choice if you have a small dataset and the data is noise-free and labeled. When the dataset is small, the classifier completes execution in a short time. If your dataset is large, plain KNN is of little use without optimizations such as tree-based indexes (e.g. KD-trees) or approximate nearest-neighbor search.

Implementation of KNN

  • Import the libraries
  • Load the data
  • Feature scaling

StandardScaler performs standardization. Usually a dataset contains variables that differ in scale; for example, one column may hold values in the range 20–70 and another in the range 80–200. Because these two columns are on different scales, they are standardized to a common scale while building a machine learning model, as shown in the sketch after this list.

  • Splitting dataset into training and test set
  • KNN algorithm
  • Predictions and Evaluations
  • Choosing best K value
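
A minimal scikit-learn sketch of the steps above, assuming a hypothetical CSV file data.csv with a label column named 'target':

```python
# Import the libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Load the data (hypothetical CSV with a 'target' label column)
df = pd.read_csv("data.csv")
X = df.drop("target", axis=1)
y = df["target"]

# Feature scaling: standardize all features to a common scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42)

# KNN algorithm
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predictions and evaluation
y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```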

Now it’s time to improve the model and find out the optimal k value.
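
One way to do this, continuing the sketch above and reusing its train/test split, is to record the test error rate for a range of K values and plot it:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Mean error rate on the test set for K = 1..40
error_rate = []
for k in range(1, 41):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    pred_k = knn.predict(X_test)
    error_rate.append(np.mean(pred_k != y_test))

plt.plot(range(1, 41), error_rate, marker="o")
plt.title("Error Rate vs. K Value")
plt.xlabel("K")
plt.ylabel("Error Rate")
plt.show()
```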


From the error plot, we can see that the smallest error we got is 0.16, at K=30. Next, we visualize the plot of accuracy against the K value.
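
A matching accuracy sketch, again reusing the variables from the sketches above (exact error and accuracy values depend on the dataset):

```python
from sklearn.metrics import accuracy_score

# Accuracy on the test set for K = 1..40
accuracies = []
for k in range(1, 41):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, knn.predict(X_test)))

plt.plot(range(1, 41), accuracies, marker="o")
plt.title("Accuracy vs. K Value")
plt.xlabel("K")
plt.ylabel("Accuracy")
plt.show()
```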


Now we can see the improved results: we got an accuracy of 0.84 at K=30. Since the error plot already showed the minimum error at K=30, the model gives its best performance at that K value.

Advantages & Disadvantages of KNN Algorithm

Advantages-

  • It is very easy to understand and implement
  • It is well suited for small datasets
  • Useful for classification or regression

Disadvantages-

  • It performs poorly when variables are on different scales unless the features are standardized
  • It is sensitive to outliers and missing values
  • Does not work well with large datasets
  • It does not work well with high dimensions

Conclusion:

In this article, we covered the introduction to KNN and its working principle. We also looked at the various distance metrics used in KNN to compute the distance between data points.

  • Ankita Das
  • Dec, 27 2022
