K-Nearest Neighbors using R

KNN, which stands for K-Nearest Neighbors, is a supervised machine learning algorithm used to solve classification and regression problems.

KNN makes a prediction for a new instance (x) by searching the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances, typically the majority class for classification and the mean for regression.

In KNN, K is the number of nearest neighbors considered, and it is the main deciding factor for the prediction. If K = 1, the object is simply assigned to the class of its single nearest neighbor.

To find the most similar points, we compute the distance between points using a distance measure such as Euclidean, Hamming, Manhattan or Minkowski distance. KNN has the following basic steps:

  1. Calculate distance
  2. Find closest neighbors
  3. Group the similar data (assign the majority class among the neighbors)
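The three steps above can be sketched in a few lines of base R. This is a minimal illustration on invented toy points, not the implementation used later in this post; `minkowski` and `knn_predict` are hypothetical helper names introduced only for this example.

```r
# Minkowski distance between two numeric vectors;
# p = 1 gives Manhattan distance, p = 2 gives Euclidean distance.
minkowski <- function(a, b, p = 2) {
  sum(abs(a - b)^p)^(1 / p)
}

# Predict the class of one new point by majority vote among its k nearest
# training points (hypothetical helper, not a library function).
knn_predict <- function(train_x, train_y, new_x, k = 3, p = 2) {
  # Step 1: distance from the new point to every training row
  d <- apply(train_x, 1, minkowski, b = new_x, p = p)
  # Step 2: indices of the k closest rows
  nearest <- order(d)[seq_len(k)]
  # Step 3: most frequent class among those neighbors
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy data: two well-separated groups in two dimensions
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(6, 6), c(6, 7), c(7, 6))
train_y <- c("No", "No", "No", "Yes", "Yes", "Yes")

pred_low  <- knn_predict(train_x, train_y, c(1.5, 1.5), k = 3)
pred_high <- knn_predict(train_x, train_y, c(6.5, 6.5), k = 3)
print(c(pred_low, pred_high))
```

An odd K is a common choice for two-class problems because it avoids tied votes.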

In this blog we will be analysing the ___  dataset using the KNN algorithm. 

Let's dive into the coding part.

Loading the required packages

Loading the dataset and getting the structure of the dataset using the str() function.

We can see there are 4 variables: admit, gre, gpa and rank. Here admit is the target variable and the remaining 3 are predictors.
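The loading step can be sketched as follows. The dataset itself is not reproduced here, so the rows below are invented stand-ins that only share the four column names, and the CSV is written to a temporary file so the example runs on its own.

```r
# Invented stand-in rows; these are NOT values from the real dataset
sample_rows <- data.frame(
  admit = c(0, 1, 1, 0),
  gre   = c(450, 610, 740, 530),
  gpa   = c(3.10, 3.45, 3.90, 2.85),
  rank  = c(3, 2, 1, 4)
)

# Write the rows to a temporary CSV so read.csv() has something to load
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_rows, csv_path, row.names = FALSE)

# Load the file and inspect its structure
data <- read.csv(csv_path)
str(data)
```

`str()` lists each column with its type and first few values, which is a quick way to confirm that the target and predictors were read as expected.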

Checking for missing values, if any.

As we can see, there are no missing values anywhere in the dataset.
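A check of this kind might look like the sketch below. The data frame is an invented example with one deliberately missing value, so that the output shows what a problem would look like; on a clean dataset every count would be zero.

```r
# Invented example data with one missing gre value (not the real dataset)
data <- data.frame(
  admit = c(0, 1, 1, 0),
  gre   = c(450, NA, 740, 530),
  gpa   = c(3.10, 3.45, 3.90, 2.85),
  rank  = c(3, 2, 1, 4)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(data))
print(na_per_column)

# TRUE if there is at least one missing value anywhere
any_missing <- anyNA(data)
print(any_missing)
```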

The target variable ‘admit’ has two values, 0 and 1, where 0 means False (not admitted) and 1 means True (admitted).

Hence we will convert it to a factor, labelling 0 as ‘No’ and 1 as ‘Yes’.
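One way to do this conversion is with `factor()`, as in the sketch below on an invented vector. Text labels such as 'No' and 'Yes' are also valid R names, which caret requires when class probabilities are requested.

```r
# Invented 0/1 values standing in for the admit column
admit_raw <- c(0, 1, 1, 0, 0)

# Map 0 to "No" and 1 to "Yes"; the order of levels matches the order of labels
admit <- factor(admit_raw, levels = c(0, 1), labels = c("No", "Yes"))

print(admit)
print(levels(admit))
```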

Summarizing the dataset

We can see that 127 applicants were admitted and 273 were not.
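The two counts reported above also tell us how balanced the classes are. The short calculation below uses only those two numbers.

```r
# Class counts as reported in the summary above
counts <- c(Yes = 127, No = 273)

total <- sum(counts)            # number of observations
proportions <- counts / total   # share of each class

print(total)
print(proportions)
```

Roughly a third of the observations are 'Yes', so a model that always predicts 'No' would already be right about 68% of the time on the full dataset. That is a useful yardstick when reading accuracy figures.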

Data Partitioning

Partitioning the dataset into training and test data.
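A random split can be sketched in base R as shown below. The 70/30 ratio, the seed and the placeholder data frame are assumptions made for illustration; they are not taken from the original analysis.

```r
set.seed(1234)  # assumed seed, only to make the split reproducible

# Placeholder data frame with 400 rows; the id column stands in for real columns
data <- data.frame(id = seq_len(400))

# Randomly pick 70% of the row indices for training (assumed ratio)
train_idx <- sample(seq_len(nrow(data)), size = floor(0.7 * nrow(data)))

training <- data[train_idx, , drop = FALSE]
test     <- data[-train_idx, , drop = FALSE]

print(c(training = nrow(training), test = nrow(test)))
```

Every row ends up in exactly one of the two sets, so the test data stays unseen while the model is fitted.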

KNN Model

Here we use the function trainControl(), which controls the computational details of the train() function, such as the resampling method.

With trainControl() we are performing 10-fold cross-validation.
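To make the idea of 10-fold cross-validation concrete, here is a base R sketch of the fold bookkeeping alone. It does not call caret; the row count and seed are assumptions chosen for illustration.

```r
set.seed(42)  # assumed seed
n <- 100      # hypothetical number of training rows

# Shuffle the fold labels 1..10 across the rows so each row gets one fold
fold_id <- sample(rep(1:10, length.out = n))

holdout_sizes <- integer(10)
for (fold in 1:10) {
  holdout  <- which(fold_id == fold)   # rows held out for validation this round
  fit_rows <- which(fold_id != fold)   # rows a model would be fitted on this round
  holdout_sizes[fold] <- length(holdout)
}

print(holdout_sizes)
```

Each row is held out exactly once, and the performance measure is averaged over the ten rounds.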

Fitting the train data

Here ROC (the area under the ROC curve) was used to select the optimal model, taking the largest value. The resulting value of k is 39.
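A fit of this kind could be set up with caret roughly as follows. This is a sketch under several assumptions: the caret package is installed, the data are randomly generated stand-ins (so the selected k will not match the value reported above), and the centring/scaling step and the grid of k values are our own choices.

```r
library(caret)

set.seed(123)  # assumed seed
n <- 200

# Synthetic stand-in data; values are random and purely illustrative
gpa  <- round(runif(n, 2, 4), 2)
gre  <- round(runif(n, 300, 800))
rank <- sample(1:4, n, replace = TRUE)
# Let admission loosely depend on gpa so there is some signal to learn
p_admit <- plogis(2 * (gpa - 3.2))
admit <- factor(ifelse(runif(n) < p_admit, "Yes", "No"), levels = c("No", "Yes"))
training <- data.frame(admit, gre, gpa, rank)

# 10-fold cross-validation; class probabilities and twoClassSummary
# are needed so that ROC can be used as the selection metric
ctrl <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

# Fit KNN over a small assumed grid of k values and keep the k with the best ROC
fit <- train(
  admit ~ .,
  data = training,
  method = "knn",
  metric = "ROC",
  trControl = ctrl,
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(k = seq(5, 45, by = 10))
)

print(fit$bestTune)
```

Scaling matters for KNN because gre is measured in hundreds while gpa spans only a couple of units, so without it gre would dominate the distance.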

Model Performance

Plotting the fitted model

We now calculate the variable importance for the object produced by train() using the varImp() function.

The ROC-based variable importance shows that gpa is the most important variable, followed by rank.
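For models like KNN that have no built-in importance score, varImp() falls back, as far as we understand the caret documentation, on a per-predictor ROC analysis. The base R sketch below illustrates that idea on invented toy data; `auc_one` is a hypothetical helper and this is not caret's exact implementation.

```r
# AUC of a single numeric predictor for a two-class outcome, computed from
# ranks (equivalent to the Mann-Whitney U statistic)
auc_one <- function(x, y, positive = "Yes") {
  r <- rank(x)
  n_pos <- sum(y == positive)
  n_neg <- sum(y != positive)
  auc <- (sum(r[y == positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  max(auc, 1 - auc)  # the direction of the relationship does not matter
}

# Invented toy data: gpa separates the classes perfectly, gre only partly
toy <- data.frame(
  admit = c("No", "No", "No", "Yes", "Yes", "Yes"),
  gpa   = c(2.8, 3.0, 3.1, 3.5, 3.7, 3.9),
  gre   = c(500, 700, 520, 680, 540, 720)
)

importance <- sapply(toy[c("gpa", "gre")], auc_one, y = toy$admit)
print(importance)
```

A predictor with an AUC near 1 separates the two classes well on its own, while a value near 0.5 means it carries little information by itself.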

Predictions

Making predictions on the training and test data.

In the confusion matrix above, 7 cases that are actually 'Yes' were predicted as 'No'. Similarly, 26 cases that are actually 'No' were predicted as 'Yes'. These 33 cases have been misclassified.

The model achieves an accuracy of 71% on the test data.
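Accuracy is the share of cases on the diagonal of the confusion matrix. The sketch below shows the calculation on ten invented predictions, so the numbers are illustrative and unrelated to the results above.

```r
# Invented actual and predicted labels for ten cases
actual <- factor(
  c("No", "No", "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"),
  levels = c("No", "Yes")
)
predicted <- factor(
  c("No", "No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"),
  levels = c("No", "Yes")
)

# Rows are predictions, columns are the true classes
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```

The off-diagonal cells are the two kinds of error: actual 'Yes' predicted as 'No', and actual 'No' predicted as 'Yes'.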

Keep visiting our website for more blogs on Data Science and Data Analytics.

Badal Kumar

Data Analyst at Aeon Learning
