Data Science and Artificial Intelligence

K-Nearest Neighbor Algorithm

This entry is part 11 of 17 in the series Machine Learning Algorithms

This blog discusses the fundamental concepts of the k-Nearest Neighbour Classification Algorithm, popularly known by the name KNN classifiers. Classification in Machine Learning is a technique of learning where a particular instance is mapped against one among many labels. The labels are prespecified to train your model. The machine learns the pattern from the data in such a way that the learned representation successfully maps the original dimension to the suggested label/class without any more intervention from a human expert.

How do k-Nearest Neighbors Work

The kNN algorithm belongs to a family of instance-based, competitive learning and lazy learning algorithms.

Free Step-by-step Guide To Become A Data Scientist

Subscribe and get this detailed guide absolutely FREE

Instance-based algorithms are those algorithms that model the problem using data instances (or rows) in order to make predictive decisions. A kNN algorithm is an extreme form of instance-based methods because all training observations are retained as a part of the model.

It is a competitive learning algorithm because it internally uses competition between model elements (data instances) to make a predictive decision. The objective similarity measure between data instances causes each data instance to compete to “win” or be most alike to a given unseen data instance and contribute to a prediction.

Lazy learning refers to the fact that the algorithm does not build a model until the time a prediction is required. It is lazy because it only does work at the last second. This has the benefit of only including data relevant to the unseen data, called a localized model. A disadvantage is that it can be computationally expensive to repeat the same or similar searches over larger training datasets.

Finally, kNN is powerful because it does not assume anything about the data, other than that the distance measure can be calculated consistently between any two instances. Thus, it is called non-parametric or non-linear as it does not assume a functional form.

Euclidean distance

Euclidean distance is the most commonly used distance measure. It is also called simply distance. The usage of Euclidean distance measure is highly recommended when the data is dense or continuous. Euclidean distance is the best proximity measure. The Euclidean distance between two points is the length of the path connecting them. The Pythagorean theorem gives this distance between two points. A generalized term for the Euclidean norm is the L2 norm or L2 distance.

We will use the implementation provided by the python machine learning framework known as scikit-learn.

Problem Statement:

To build a simple KNN classification model for predicting the quality of the car given, here are a few of the other car attributes.

Data details

==========================================
1. Title: Car Evaluation Database
==========================================

The dataset is available at: “http://archive.ics.uci.edu/ml/datasets/Car+Evaluation

2. Sources:
  (a) Creator: Marko Bohanec
  (b) Donors: Marko Bohanec([email protected])
               Blaz Zupan      ([email protected])
   (c) Date: June, 1997

3. Past Usage:

   The hierarchical decision model, from which this dataset is
   derived, was first presented in 

   M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for
   multi-attribute decision making. In 8th Intl Workshop on Expert
   Systems and their Applications, Avignon, France. pages 59-78, 1988.

   Within machine-learning, this dataset was used for the evaluation
   of HINT (Hierarchy INduction Tool), which was proved to be able to
   completely reconstruct the original hierarchical model. This,
   together with a comparison with C4.5, is presented in

   B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by
   function decomposition. ICML-97, Nashville, TN. 1997 (to appear)

4. Relevant Information Paragraph:

   Car Evaluation Database was derived from a simple hierarchical
   decision model originally developed for the demonstration of DEX
   (M. Bohanec, V. Rajkovic: Expert system for decision
   making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates
 cars according to the following concept structure:

   CAR                      car acceptability
   PRICE                  overall price
   buying               buying price
   maint                price of the maintenance
   TECH                   technical characteristics
   COMFORT              comfort
   doors              number of doors
   person capacity in terms of persons to carry
   lug_boot           the size of luggage boot
   safety               estimated safety of the car

   Input attributes are printed in lowercase.
5. Number of Instances: 1728 (instances completely cover the attribute space)
6. Number of Attributes: 6

7. Attribute Values:

   buying       v-high, high, med, low
   maint        v-high, high, med, low
   doors        2, 3, 4, 5-more
   persons      2, 4, more
   lug_boot     small, med, big
   safety       low, med, high

8. Missing Attribute Values: none

9. Class Distribution (number of instances per class)

   class      N          N[%]
   -----------------------------
   unacc     1210     (70.023 %) 
   acc        384     (22.222 %) 
   good        69     ( 3.993 %) 
   v-good      65     ( 3.762 %)

Tools to be used:

Numpy, pandas, scikit-learn

Python Implementation with code:

1) Import necessary libraries

Import the necessary modules from specific libraries.

import os
import numpy as np
import pandas as pd
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

2) Load the data set

Use pandas module to read the bike data from the file system. Check few records of the dataset.

data =
 pd.read_csv('data/car_quality/car.data',names=['buying','maint','doors','persons','lug_boot','safety','class'])
data.head()

  buying maint doors persons lug_boot safety class
0 vhigh  vhigh 2     2       small    low    unacc
1 vhigh  vhigh 2     2       small    med    unacc
2 vhigh  vhigh 2     2       small    high   unacc
3 vhigh  vhigh 2     2       med      low    unacc
4 vhigh  vhigh 2     2       med      med    unacc

3) Check information about the data set

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null object
maint       1728 non-null object
doors       1728 non-null object
persons     1728 non-null object
lug_boot    1728 non-null object
safety      1728 non-null object
class       1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB

4) Identify the target variable

data['class'],class_names = pd.factorize(data['class'])

The target variable is marked as a class in the data frame. The values are present in a string format. However, the algorithm requires the variables to be coded into its equivalent integer codes. We can convert the string categorical values into integer codes using factorize method of the pandas library.

Let’s check the encoded values now.

print(class_names)
print(data['class'].unique())

Index([u'unacc', u'acc', u'vgood', u'good'], dtype='object')
[0 1 2 3]

As we can see the values has been encoded into 4 different numeric labels.

5) Identify the predictor variables and encode any string variables to equivalent integer codes

data['buying'],_ = pd.factorize(data['buying'])
data['maint'],_ = pd.factorize(data['maint'])
data['doors'],_ = pd.factorize(data['doors'])
data['persons'],_ = pd.factorize(data['persons'])
data['lug_boot'],_ = pd.factorize(data['lug_boot'])
data['safety'],_ = pd.factorize(data['safety'])
data.head()

 buying maint doors persons lug_boot safety class
0     0     0    0       0   0       0      0
1     0     0    0       0   0       1      0
2     0     0    0       0    0      2      0
3     0     0    0       0    1      0      0
4     0     0    0       0    1      1      0

Check the data types now:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null int64
maint       1728 non-null int64
doors       1728 non-null int64
persons     1728 non-null int64
lug_boot    1728 non-null int64
safety      1728 non-null int64
class       1728 non-null int64
dtypes: int64(7)
memory usage: 94.6 KB

Everything is now converted in integer form.

6) Select the predictor feature and select the target variable

X = data.iloc[:,:-1]
y = data.iloc[:,-1]

7) Train test split:

# split data randomly into 70% training and 30% test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=0)

8) Training/model fitting

# train the decision tree
## Instantiate the model with 5 neighbors. 
model = KNeighborsClassifier(n_neighbors=5)
## Fit the model on the training data.
model.fit(X_train, y_train)

9) Model parameters study :

# use the model to make predictions with the test data
y_pred = dtree.predict(X_test)
# how did our model perform?
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 32
Accuracy: 0.94

As you can see the algorithm was able to achieve the classification accuracy of 94% on held out set. Only 32 samples were misclassified.

The model actually has a 100% accuracy score, since this is a very simplistic data set with distinctly separable classes. But there you have it. That’s how to implement K-Nearest Neighbors with scikit-learn. Load your favorite data set and give it a try!

Learn Python programming online here at Acadgild.

How to decide the value of n-neighbors

Choosing a large value of K will lead to a greater amount of execution time & underfitting. Selecting the small value of K will lead to overfitting. There is no such guaranteed way to find the best value of K.

from sklearn.metrics import accuracy_score
for K in range(25):
    K_value = K+1
    neigh = KNeighborsClassifier(n_neighbors = K_value)
    neigh.fit(X_train, y_train) 
    y_pred = neigh.predict(X_test)
    print("Accuracy is ", accuracy_score(y_test,y_pred)*100,"% for K-Value:",K_value)

('Accuracy is ', 83.62235067437379, '% for K-Value:', 1)
('Accuracy is ', 80.15414258188824, '% for K-Value:', 2)
('Accuracy is ', 89.21001926782274, '% for K-Value:', 3)
('Accuracy is ', 88.82466281310212, '% for K-Value:', 4)
('Accuracy is ', 93.83429672447014, '% for K-Value:', 5)
('Accuracy is ', 92.8709055876686, '% for K-Value:', 6)
('Accuracy is ', 92.8709055876686, '% for K-Value:', 7)
('Accuracy is ', 89.78805394990366, '% for K-Value:', 8)
('Accuracy is ', 90.94412331406551, '% for K-Value:', 9)
('Accuracy is ', 88.82466281310212, '% for K-Value:', 10)
('Accuracy is ', 89.40269749518305, '% for K-Value:', 11)
('Accuracy is ', 88.6319845857418, '% for K-Value:', 12)
('Accuracy is ', 88.82466281310212, '% for K-Value:', 13)
('Accuracy is ', 89.01734104046243, '% for K-Value:', 14)
('Accuracy is ', 89.78805394990366, '% for K-Value:', 15)
('Accuracy is ', 88.6319845857418, '% for K-Value:', 16)
('Accuracy is ', 88.82466281310212, '% for K-Value:', 17)
('Accuracy is ', 88.4393063583815, '% for K-Value:', 18)
('Accuracy is ', 88.6319845857418, '% for K-Value:', 19)
('Accuracy is ', 88.6319845857418, '% for K-Value:', 20)
('Accuracy is ', 88.2466281310212, '% for K-Value:', 21)
('Accuracy is ', 89.01734104046243, '% for K-Value:', 22)
('Accuracy is ', 89.21001926782274, '% for K-Value:', 23)
('Accuracy is ', 89.01734104046243, '% for K-Value:', 24)
('Accuracy is ', 89.59537572254335, '% for K-Value:', 25)

It shows that we are getting a 93.83% accuracy on K = 5. Hence, we are considering K =5 for this tutorial.

Series Navigation<< XGBoost AlgorithmDecision Trees >>

Abhay Kumar

Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close
Close