Analyzing the USArrests dataset using K-means Clustering

In this blog we use the USArrests dataset to implement the K-means clustering algorithm.

Introduction

K-means clustering is an iterative algorithm that partitions a dataset into K distinct, non-overlapping clusters, where each data point belongs to exactly one cluster.

It assigns data points to clusters so that the sum of squared Euclidean distances between the data points and their cluster's centroid is minimized.
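
In symbols, this objective can be written as below. The notation is introduced here only for illustration: K is the number of clusters, C_k is the set of data points in cluster k, and mu_k is the centroid (mean) of that cluster.

```latex
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 ,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```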

This is a systematic approach for identifying and analyzing patterns and trends in crime using the USArrests dataset. The model that we build in this blog groups states with similar arrest rates, which helps to identify and visualize the states with the highest arrest rates.

This dataset contains statistics, in arrests per 100,000 residents, for assault, murder and rape in each of the 50 US states in 1973. The percentage of the population living in urban areas is also given. The aim is to see whether there is any relationship between a state and its arrest history.

Let us now dive into the coding part.

Fetching the working directory

Loading the dataset with data("USArrests"). This dataset is built into R, so you can load it directly and view the first few records using the head() function.
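
A minimal sketch of these first two steps, assuming a standard R installation where the built-in datasets package is available:

```r
# Print the current working directory
getwd()

# Load the built-in dataset (it ships with R's 'datasets' package)
data("USArrests")

# Show the first six states with their four variables
head(USArrests)
```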

Getting the structure of the dataset using the str() function.

Summarizing the dataset using the summary() function.
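
The two inspection steps above can be run as follows. The variable name arrest_summary is only illustrative.

```r
data("USArrests")

# Structure: number of observations, variable names and their types
str(USArrests)

# Five-number summary plus the mean for every variable
arrest_summary <- summary(USArrests)
print(arrest_summary)
```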

Checking for missing values, if any
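
One possible way to do this check in base R is to count the NA entries per column. The name na_per_column is chosen here for illustration.

```r
data("USArrests")

# Count missing values (NA) in each column; all zeros means nothing is missing
na_per_column <- colSums(is.na(USArrests))
print(na_per_column)

# A single TRUE/FALSE answer for the whole data frame
anyNA(USArrests)
```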

We can see that there are no missing values in the dataset.

Checking for Correlation 

We will now check the correlation between all the variables by using the corrplot() function from the corrplot package.
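
A sketch of this step, assuming the corrplot package is installed. The correlation matrix itself comes from base R's cor(), and the method = "number" argument is just one of the available display styles, not necessarily the one used for the original plot.

```r
data("USArrests")

# Pairwise Pearson correlations between the four variables
cor_matrix <- cor(USArrests)
print(round(cor_matrix, 2))

# Draw the correlation plot only if the corrplot package is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix, method = "number")
}
```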

It gives the following output

[Figure: correlation plot]

We can observe from the above plot that the 3 crime variables are positively correlated with each other, that is, Assault-Murder, Rape-Assault and Rape-Murder.

Scaling the data 

Displaying the first few rows of the dataset after scaling it.

We can see that the data points have been standardized, that is, scaled. Scaling is done to make the variables comparable, since K-means relies on distances and a variable with a large range such as Assault would otherwise dominate.

Standardizing consists of transforming the variables so that they have a mean of zero and a standard deviation of 1.
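
The scaling step can be sketched with base R's scale() function. The name df_scaled is an assumption made for this example; the last two lines simply confirm that each column ends up with mean 0 and standard deviation 1.

```r
data("USArrests")

# Centre each column to mean 0 and rescale it to standard deviation 1
df_scaled <- scale(USArrests)

# First few rows after scaling
head(df_scaled)

# Sanity check: column means are (numerically) 0 and standard deviations are 1
round(colMeans(df_scaled), 10)
apply(df_scaled, 2, sd)
```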

Now we will load the two required R packages, cluster and factoextra.

cluster is for computing clustering algorithms and factoextra for ggplot2-based elegant visualization of clustering results. 

Computing K-means

We’ll use only a subset of the data by taking 10 random rows among the 50 rows in the data set.
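
A minimal sketch of drawing the subset. The seed value is arbitrary and only makes the sample reproducible, so the states selected here will not necessarily match the ones in the original post. The names row_ids and df_sample are illustrative.

```r
data("USArrests")
df_scaled <- scale(USArrests)

# Arbitrary seed, used only so that the same rows are drawn on every run
set.seed(123)

# Pick 10 of the 50 row indices at random and keep those rows
row_ids <- sample(seq_len(nrow(df_scaled)), size = 10)
df_sample <- df_scaled[row_ids, ]
print(df_sample)
```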

Computing Euclidean Distance

We will now compute the Euclidean distance by using the dist() function.

To make it easier to see the distance information generated by the dist() function, we are reformatting the distance vector into a matrix using the as.matrix() function.
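
The distance computation and the conversion to a matrix could look like this, again assuming an arbitrary seed for the 10-row sample. The names dist_eucl and dist_mat are chosen for this example.

```r
data("USArrests")
df_scaled <- scale(USArrests)
set.seed(123)  # arbitrary seed for a reproducible sample
df_sample <- df_scaled[sample(seq_len(nrow(df_scaled)), size = 10), ]

# Pairwise Euclidean distances between the 10 sampled states
dist_eucl <- dist(df_sample, method = "euclidean")

# Convert the compact 'dist' object into a full 10 x 10 matrix
dist_mat <- as.matrix(dist_eucl)

# Show the first 4 states only, rounded to 1 decimal place
round(dist_mat[1:4, 1:4], 1)
```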

As we can see, the Euclidean distances are placed in a matrix. Only 4 states are shown, and the distances are rounded to 1 decimal place.

Visualization

We have used fviz_dist() from the factoextra package to visualize the distance matrices.
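
A sketch of the call, assuming factoextra is installed. The plot is drawn only when the package is available, and the sample again uses an arbitrary seed.

```r
data("USArrests")
df_scaled <- scale(USArrests)
set.seed(123)  # arbitrary seed for a reproducible sample
df_sample <- df_scaled[sample(seq_len(nrow(df_scaled)), size = 10), ]
dist_eucl <- dist(df_sample, method = "euclidean")

# Heat map of the distance matrix (requires the factoextra package)
if (requireNamespace("factoextra", quietly = TRUE)) {
  p <- factoextra::fviz_dist(dist_eucl)
  print(p)
}
```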

It shows the following output.

[Figure: heat map of the distance matrix]

In the above graph, red shows the smallest distances and blue shows the largest distances.

Elbow Method

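Now we look for clusters such that the total intra-cluster variation (total within-cluster sum of squares) is minimized. The elbow method runs K-means for a range of values of K and plots this quantity against K; the bend, or elbow, in the curve suggests a suitable number of clusters.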
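
One way to build the elbow curve by hand is sketched below. It assumes the full scaled dataset is used, K ranges from 1 to 10, and each K-means run uses 25 random starts; these choices and the seed are assumptions for this example rather than settings confirmed by the original post.

```r
data("USArrests")
df_scaled <- scale(USArrests)

set.seed(123)   # arbitrary seed, K-means starts from random centres
k_values <- 1:10

# Total within-cluster sum of squares for each candidate K
wss <- sapply(k_values, function(k) {
  kmeans(df_scaled, centers = k, nstart = 25)$tot.withinss
})

# Plot WSS against K and look for the bend in the curve
plot(k_values, wss, type = "b", pch = 19,
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```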

[Figure: elbow curve]

Optimal number of clusters

Similar to the elbow method above, the fviz_nbclust() function from factoextra can be used to visualize and determine the optimal number of clusters.

[Figure: optimal number of clusters, WSS]
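
A sketch of the call, assuming factoextra is installed and that the full scaled dataset is passed in. Here method = "wss" asks for the within-cluster sum of squares curve.

```r
data("USArrests")
df_scaled <- scale(USArrests)

# WSS curve for a range of K values (requires the factoextra package)
if (requireNamespace("factoextra", quietly = TRUE)) {
  set.seed(123)  # arbitrary seed, K-means starts from random centres
  p <- factoextra::fviz_nbclust(df_scaled, kmeans, method = "wss")
  print(p)
}
```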

Extracting Results

From the above results, 4 appears to be the optimal number of clusters. We can now perform the final analysis and extract the results using these 4 clusters.

The kmeans() function returns a list of components. The most important ones are listed below:

  • cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
  • centers: A matrix of cluster centers.
  • totss: The total sum of squares.
  • withinss: Vector of within-cluster sum of squares, one component per cluster.
  • tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
  • betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss.
  • size: The number of points in each cluster.

These components can be accessed as follows
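
A minimal sketch, assuming K-means is run on the full scaled dataset with 4 centres and 25 random starts. The object name km_res, the seed and the nstart value are assumptions for this example.

```r
data("USArrests")
df_scaled <- scale(USArrests)

set.seed(123)  # arbitrary seed, K-means starts from random centres
km_res <- kmeans(df_scaled, centers = 4, nstart = 25)

# Each component is accessed with the $ operator
km_res$cluster        # cluster number (1 to 4) for every state
km_res$centers        # 4 x 4 matrix of cluster centres (scaled units)
km_res$size           # number of states in each cluster
km_res$withinss       # within-cluster sum of squares, one per cluster
km_res$tot.withinss   # sum of the values above
km_res$betweenss      # totss - tot.withinss
```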

Adding point classification to the original data.
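
One possible way to do this is to bind the cluster vector to the unscaled data frame. The names km_res and clustered are illustrative, and the aggregate() call is an optional extra that shows the mean of each variable per cluster.

```r
data("USArrests")
df_scaled <- scale(USArrests)
set.seed(123)  # arbitrary seed, K-means starts from random centres
km_res <- kmeans(df_scaled, centers = 4, nstart = 25)

# Attach the cluster number of each state to the original (unscaled) data
clustered <- cbind(USArrests, cluster = km_res$cluster)
head(clustered)

# Mean of every original variable within each cluster
aggregate(USArrests, by = list(cluster = km_res$cluster), FUN = mean)
```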

Visualizing K-means Clusters
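
A sketch of the plotting call, assuming factoextra is installed and that km_res holds a 4-cluster K-means fit on the scaled data. Because the data has more than two variables, fviz_cluster() draws the points on the first two principal components.

```r
data("USArrests")
df_scaled <- scale(USArrests)
set.seed(123)  # arbitrary seed, K-means starts from random centres
km_res <- kmeans(df_scaled, centers = 4, nstart = 25)

# Cluster plot of the fitted model (requires the factoextra package)
if (requireNamespace("factoextra", quietly = TRUE)) {
  p <- factoextra::fviz_cluster(km_res, data = df_scaled)
  print(p)
}
```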

[Figure: K-means cluster plot]

Hence we have computed the optimal number of clusters and visualized the K-means clustering.

Hope you find this blog helpful. In case of any queries or suggestions, drop us a comment below.

Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies.

Badal Kumar

Data Analyst at Aeon Learning
