KNN stands for K Nearest Neighbour is the easiest, versatile and popular supervised machine learning algorithm. This algorithm is used in various applications such as finance, healthcare, image, and video recognition.
KNN is used for both regression and classification problems and is a non-parametric algorithm which means it doesn’t make any assumption about the underlying data, it makes its selection based on the proximity to other data points regardless of what feature the numerical values represent.
In this blog, we will read about KNN and its implementation using a dataset in Python.
Working of KNN
When we have several data points that belong to some specific class or category and a new data point gets introduced, the KNN algorithm decides which class this new datapoint would belong to on the basis of some factor.
The K, in KNN, is the number of nearest neighbors that surrounds the new data point and is the core deciding factor.
We pick a value for K and will take K nearest neighbors of the new data point according to their Euclidean distance.
Suppose that the value of K = 5, we will choose 5 nearest neighbors to the new data point whose euclidean distance will be less.
Among these neighbors(K), we will count the number of data points in each category and the new data point will be assigned to that category to which the majority of 5 nearest points belong.
As we can see in the above image the new data(denoted by +), belongs to class 1 that has the majority of neighbors.
Since we now have a basic idea of how KNN works, we will begin our coding in Python using the ‘Wine’ dataset.
The Wine dataset is a popular dataset which is famous for multi-class classification problems. This data is the result of a chemical analysis of wines grown in the same region in Italy using three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
The dataset comprises 13 features and a target variable(a type of cultivars).
This data has three types of cultivar classes: ‘class_0’, ‘class_1’, and ‘class_2’. Here, you can build a model to classify the type of cultivar. The dataset has been imported from the Sklearn library as shown below.
Importing all the necessary libraries:
X and y are the predictors and the target variable respectively. Since the target variable is a categorical one consisting of 3 categories of flower species, we have used ‘Categorical.from_codes’.
Also, Using get_dummies() function we have converted our categories of ‘type of cultivators’ into dummy variables.
Checking the info of X and y.
Checking the shape of X and y.
Hence our dataset is free from null values.
Standardizing the Variables.
Before training our data, it is always a good practice to scale the features so that all of them can be uniformly evaluated. Refer to the Link for better understanding.
For scaling, we will import the StandardScalar class from the Sklearn library.
Use the .transform() method to transform the features into a scaled version.
Converting the scaled features to a dataframe and check the head of this dataframe to make sure the scaling worked.
Hence it looks pretty clear that the variables have been scaled.
Splitting our data into Training and Test data.
Implementing KNN algorithm
In the above code the KNN class ‘KNeighborsClassifier’ is initialized with one parameter, i.e. n_neigbours. This is basically the value for the K and there is no fixed value for this parameter. For now, we have set its value as 5.
Evaluating the algorithm
For evaluating an algorithm, confusion matrix, precision, recall, and f1 score are the most commonly used metrics.
The results show that our KNN algorithm was able to classify 33 records correctly.
We got a classification rate of 91.66%, which can be considered as very good accuracy.
This is how we implement a dataset using the KNN algorithm. I hope you find this algorithm useful.
Keep visiting our website for more blogs on Data Science and Data Analytics.