KNN in Python

KNN, short for K Nearest Neighbours, is one of the simplest, most versatile, and most popular supervised machine learning algorithms. It is used in applications such as finance, healthcare, and image and video recognition.

KNN can be used for both regression and classification problems. It is a non-parametric algorithm, which means it makes no assumptions about the underlying data; it makes its predictions based on proximity to other data points, regardless of what the numerical feature values represent.

In this blog, we will learn about KNN and implement it in Python on a dataset.

Working of KNN

When we have several data points that each belong to a known class or category and a new data point is introduced, the KNN algorithm decides which class the new data point belongs to based on the points that lie closest to it.

The K in KNN is the number of nearest neighbors considered around the new data point, and it is the core deciding factor.

We pick a value for K and take the K nearest neighbors of the new data point, ranked by their Euclidean distance to it.

Suppose K = 5. We then choose the 5 data points with the smallest Euclidean distance to the new data point.

Among these K neighbors, we count the number of data points in each category, and the new data point is assigned to the category to which the majority of the 5 nearest points belong.

As the image above shows, the new data point (denoted by +) is assigned to class 1, the class that holds the majority of its neighbors.
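To make the steps above concrete, here is a minimal from-scratch sketch of the majority vote using only the Python standard library. The function name knn_predict and the toy points are made up for illustration; this is not how Sklearn implements KNN internally, only a small model of the idea.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=5):
    """Classify new_point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the labels among those neighbors and return the most common one
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two clusters in 2-D
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
                (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["class_0", "class_0", "class_0",
                "class_1", "class_1", "class_1"]

predicted = knn_predict(train_points, train_labels, (6.2, 6.4), k=5)
print(predicted)  # class_1: three of the five nearest points are class_1
```

Sorting every distance is fine for a toy example; for larger datasets a library implementation is the better choice.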

Since we now have a basic idea of how KNN works, we will begin our coding in Python using the ‘Wine’ dataset.

The Wine dataset is a popular dataset for multi-class classification problems. The data is the result of a chemical analysis of wines grown in the same region in Italy using three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

The dataset comprises 13 features and a target variable (the type of cultivar).

This data has three types of cultivar classes: ‘class_0’, ‘class_1’, and ‘class_2’. Here, you can build a model to classify the type of cultivar. The dataset has been imported from the Sklearn library as shown below.

Importing all the necessary libraries:

import numpy as np
import pandas as pd

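# Importing the dataset
from sklearn.datasets import load_wine

wine = load_wine()

# Predictors: the 13 chemical measurements, one column per feature
X = pd.DataFrame(wine.data, columns=wine.feature_names)
X.head()

# Target: map the integer codes to class names, then one-hot encode them
y = pd.Categorical.from_codes(wine.target, wine.target_names)
y = pd.get_dummies(y)
y.head()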

X and y are the predictors and the target variable, respectively. Since the target variable is categorical, consisting of 3 categories of wine cultivar, we have used ‘Categorical.from_codes’.

Also, using the get_dummies() function, we have converted the cultivar categories into dummy variables.
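As a rough illustration of what dummy variables look like, the sketch below one-hot encodes a few labels by hand and then decodes them again by taking the position of the largest entry, which is the same idea as the argmax(axis=1) calls used when we print the confusion matrix. The helper names one_hot and decode and the sample labels are made up for this example; depending on the pandas version, get_dummies() may show True/False instead of 1/0.

```python
# Class names as they appear in the Wine dataset
class_names = ["class_0", "class_1", "class_2"]

def one_hot(label, classes):
    """Return a list with 1 in the position of label and 0 elsewhere."""
    return [1 if name == label else 0 for name in classes]

def decode(row, classes):
    """Recover the label from a one-hot row via the position of its largest entry."""
    return classes[row.index(max(row))]

labels = ["class_1", "class_0", "class_2"]  # made-up sample labels

encoded = [one_hot(label, class_names) for label in labels]
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]

decoded = [decode(row, class_names) for row in encoded]
print(decoded)  # ['class_1', 'class_0', 'class_2']
```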

Checking the info of X and y.

X.info()
y.info()

Checking the shape of X and y.

print(X.shape)
print(y.shape)

The info() output shows that our dataset is free from null values.

Standardizing the Variables.

Before training our model, it is good practice to scale the features so that all of them contribute comparably when distances are calculated.

For scaling, we will import the StandardScaler class from the Sklearn library.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

#fitting scaler to the feature
scaler.fit(X)

Use the .transform() method to transform the features into a scaled version.

scaled_features = scaler.transform(X)

Convert the scaled features to a dataframe and check the head of this dataframe to make sure the scaling worked.

df_feat = pd.DataFrame(scaled_features,columns=X.columns)
df_feat.head()

Hence it looks pretty clear that the variables have been scaled.
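For intuition, the scaling applied to each column can be sketched by hand: subtract the column mean and divide by the column's population standard deviation. The function name standardize and the sample values below are made up for illustration; treat this as a sketch of the idea, not a replacement for StandardScaler.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) standard deviation."""
    mean = fmean(column)
    std = pstdev(column)
    return [(value - mean) / std for value in column]

# Made-up values standing in for one feature column
alcohol = [14.2, 13.2, 13.1, 14.4, 13.2]

scaled = standardize(alcohol)

# After scaling, the mean is (numerically) 0 and the standard deviation is 1
print(abs(fmean(scaled)) < 1e-9, abs(pstdev(scaled) - 1.0) < 1e-9)
```

Because KNN compares points by distance, a feature measured in large units would otherwise dominate features measured in small units.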

Splitting our data into Training and Test data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(scaled_features,y, test_size=0.20)

Implementing KNN algorithm

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

In the above code, the KNN class ‘KNeighborsClassifier’ is initialized with one parameter, n_neighbors. This is the value of K, and there is no fixed value that suits every dataset. For now, we have set it to 5.
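Since there is no fixed value for K, one common way to choose it is to try several values and compare how well each one predicts held-out points. The sketch below does this with leave-one-out evaluation on a tiny made-up dataset, using only the standard library. The helper names and the data are assumptions for illustration, and the Wine data would need its own search.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k):
    """Majority vote among the k training points closest to new_point."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], new_point),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(points, labels, k):
    """Hold out each point in turn, predict it from the rest, return the fraction correct."""
    correct = 0
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(rest_points, rest_labels, points[i], k) == labels[i]:
            correct += 1
    return correct / len(points)

# Toy data (made up): two well-separated clusters of four points each
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (2.5, 2.5),
          (6.0, 6.0), (6.5, 7.0), (7.0, 6.5), (5.5, 5.5)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

# Odd values of K avoid tied votes when there are two classes
scores = {k: leave_one_out_accuracy(points, labels, k) for k in (1, 3, 5, 7)}
print(scores)  # {1: 1.0, 3: 1.0, 5: 1.0, 7: 0.0}
```

With K = 7 every held-out point is outvoted by the four points of the other cluster, which shows how a K that is too large for the data can hurt accuracy.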

Making Prediction

pred = knn.predict(X_test)

Evaluating the algorithm

To evaluate a classifier, the confusion matrix, precision, recall, and F1 score are the most commonly used metrics.

from sklearn.metrics import classification_report,confusion_matrix

print(confusion_matrix(y_test.values.argmax(axis=1), pred.argmax(axis=1)))

The results show that our KNN algorithm was able to classify 33 of the 36 test records correctly. Because the train/test split is random, the exact count can differ from run to run.
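To see how a confusion matrix yields the number of correct classifications, the sketch below builds one by hand for a few made-up labels, with rows for the true class and columns for the predicted class. The function name confusion_matrix_counts and the label lists are illustrative assumptions; the correct predictions are the entries on the diagonal.

```python
def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Build a matrix where rows are true classes and columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

# Made-up integer labels for three classes
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2]

matrix = confusion_matrix_counts(y_true, y_pred, 3)
print(matrix)  # [[1, 1, 0], [0, 2, 1], [0, 0, 3]]

# Correct predictions sit on the diagonal
correct = sum(matrix[i][i] for i in range(3))
accuracy = correct / len(y_true)
print(correct, accuracy)  # 6 0.75
```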

from sklearn import metrics

print(classification_report(y_test, pred))

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, pred))

We got a classification rate of 91.66%, which can be considered very good accuracy.
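For readers who want to see what lies behind the precision, recall, and F1 columns of the classification report, here is a small hand-rolled sketch for a single class. The function name precision_recall_f1 and the label lists are made up for illustration; in practice the Sklearn report computes these values for every class.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Made-up integer labels for three classes
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2]

# For class 1: 2 true positives, 1 false positive, 1 false negative
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```

Precision answers "of the points predicted as this class, how many were right?", while recall answers "of the points that truly belong to this class, how many did we find?".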

This is how we implement the KNN algorithm on a dataset. I hope you find this algorithm useful.

Mitali Singh

Python|| Machine Learning|| Statistics|| Data Science
