In this blog we use the USArrests dataset and implement the K-means clustering algorithm.
K-means clustering is an iterative algorithm that tries to partition the dataset into distinct non-overlapping clusters, where each data point belongs to only one cluster.
It assigns the data points to clusters so that the sum of squared Euclidean distances between the data points and their cluster's centroid is as small as possible.
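Written as a formula, the objective can be stated as below. The notation is introduced here only for illustration: $C_1, \dots, C_k$ are the clusters, $x_i$ is a data point, and $\mu_j$ is the centroid (mean) of cluster $C_j$.

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```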
This is a systematic approach for identifying and analyzing patterns and trends in crime using the USArrests dataset. The model that we build in this blog groups states by their arrest rates, which helps to identify and visualize crime-prone regions.
This dataset contains statistics, in arrests per 100,000 residents, for assault, murder and rape in each of the 50 US states in 1973. The percentage of the population living in urban areas is also given. The aim is to see whether there is any relationship between a state and its arrest history.
Let us now dive into the coding part.
Fetching the working directory
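A minimal sketch of this step, using base R's getwd(). The variable name wd is only illustrative.

```r
# Get the current working directory as a character string
wd <- getwd()
print(wd)
```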
Loading the dataset with data("USArrests"). This dataset is built into R, so you can load it directly and view the first few records using the head() function.
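In code, this step might look like the following sketch.

```r
# Load the built-in USArrests dataset (part of R's datasets package)
data("USArrests")

# Show the first six rows
head(USArrests)
```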
Getting the structure of the dataset using the str() function.
Summarizing the dataset using the summary() function.
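Both steps can be sketched together as follows.

```r
data("USArrests")

# Structure: the type of each column and a preview of its values
str(USArrests)

# Summary statistics (min, quartiles, mean, max) for each variable
summary(USArrests)
```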
Checking for missing values, if any
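One way to do this check in base R is sketched below. The exact calls used for the original output are not shown in this post, so is.na() and anyNA() are our assumption.

```r
data("USArrests")

# Count missing (NA) values in each column
colSums(is.na(USArrests))

# TRUE if any value in the data frame is missing
anyNA(USArrests)
```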
We can see that there are no missing (NA) values in the dataset.
Checking for Correlation
We will now check the correlation between all the variables by computing the correlation matrix with cor() and plotting it with the corrplot() function from the corrplot package.
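A minimal sketch of the correlation check, assuming the corrplot package is installed. The method = "number" option and the name corr_matrix are our choices for illustration, not necessarily what produced the original plot.

```r
# install.packages("corrplot")  # run once if the package is missing
library(corrplot)

data("USArrests")

# Pairwise Pearson correlations between the four variables
corr_matrix <- cor(USArrests)

# Plot the correlation matrix, printing the coefficient in each cell
corrplot(corr_matrix, method = "number")
```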
It gives the following output
We can observe from the above result that the 3 crime variables are positively correlated with each other, that is, Assault-Murder, Rape-Assault and Rape-Murder.
Scaling the data
Displaying the first few rows of the dataset after scaling it.
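A sketch of the scaling step using base R's scale(). The name df is an assumption used throughout the sketches in this post.

```r
data("USArrests")

# Standardize each column to mean 0 and standard deviation 1
df <- scale(USArrests)

# First few rows of the scaled data
head(df)
```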
We can see that the data points have been standardized, that is, scaled. Scaling is done to make the variables comparable.
Standardizing consists of transforming the variables so that they have a mean of zero and a standard deviation of 1.
Now we will load the two required R packages, cluster and factoextra.
cluster provides clustering algorithms, and factoextra provides elegant ggplot2-based visualization of clustering results.
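Loading the packages could look like this, assuming both are already installed (the commented install line is there for a first run).

```r
# install.packages(c("cluster", "factoextra"))  # run once if the packages are missing
library(cluster)     # clustering algorithms
library(factoextra)  # ggplot2-based visualization of clustering results
```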
We’ll use only a subset of the data by taking 10 random rows among the 50 rows in the data set.
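A sketch of the sampling step. The seed value 123 and the names ss and df_sub are assumptions, so the rows drawn here need not match the ones in the original output.

```r
data("USArrests")
df <- scale(USArrests)

# Fix the random seed so the same rows are drawn on every run (assumed value)
set.seed(123)

# Draw 10 of the 50 row indices without replacement and subset the data
ss <- sample(1:50, 10)
df_sub <- df[ss, ]
```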
Computing Euclidean Distance
We will now compute the Euclidean distance by using the dist() function.
To make it easier to see the distance information generated by the dist() function, we are reformatting the distance vector into a matrix using the as.matrix() function.
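These two steps might be sketched as follows, reusing the assumed seed and the illustrative names df_sub and dist_eucl.

```r
data("USArrests")
df <- scale(USArrests)
set.seed(123)                      # assumed seed, for reproducibility
df_sub <- df[sample(1:50, 10), ]   # 10 randomly chosen states

# Pairwise Euclidean distances between the sampled states
dist_eucl <- dist(df_sub, method = "euclidean")

# Convert to a full matrix and show a 4 x 4 corner, rounded to 1 decimal place
round(as.matrix(dist_eucl)[1:4, 1:4], 1)
```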
As we can see, the Euclidean distances are placed in a matrix, and only 4 states are shown, with distances rounded to 1 decimal place.
We have used fviz_dist() from the factoextra package to visualize the distance matrices.
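A minimal sketch of the visualization, assuming factoextra is installed and using the same assumed seed and names as before.

```r
library(factoextra)

data("USArrests")
df <- scale(USArrests)
set.seed(123)                      # assumed seed, for reproducibility
df_sub <- df[sample(1:50, 10), ]
dist_eucl <- dist(df_sub, method = "euclidean")

# Heatmap of the distance matrix
# (default gradient: red = small distance, blue = large distance)
fviz_dist(dist_eucl)
```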
It shows the following output.
In the above graph, red shows high similarity and blue shows low similarity.
Now we are defining clusters such that the total intra-cluster variation (total within-cluster sum of squares) is minimized.
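To make the quantity being minimized concrete, the sketch below recomputes the total within-cluster sum of squares by hand and compares it with the value reported by kmeans(). The choice of 2 clusters here is arbitrary and only for illustration, as are the seed and the variable names.

```r
data("USArrests")
df <- scale(USArrests)

set.seed(123)  # assumed seed; k-means starts from random centroids
km_demo <- kmeans(df, centers = 2, nstart = 25)  # k = 2 is an arbitrary demo value

# Centroid assigned to each state (one row per state)
assigned_centers <- km_demo$centers[km_demo$cluster, ]

# Sum of squared Euclidean distances from each state to its centroid
wss_manual <- sum((df - assigned_centers)^2)

# Compare with the value stored in the kmeans result
all.equal(wss_manual, km_demo$tot.withinss)
```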
Optimal number of clusters
The fviz_nbclust() function from factoextra is used to visualize and determine the optimal number of clusters. It supports the elbow (within-cluster sum of squares), silhouette and gap statistic methods.
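A sketch of how fviz_nbclust() can be called, assuming factoextra is installed. Which methods were used for the original results is not stated, so the two calls below are examples rather than a reproduction.

```r
library(factoextra)

data("USArrests")
df <- scale(USArrests)

# Elbow method: total within-cluster sum of squares for k = 1..10
fviz_nbclust(df, kmeans, method = "wss")

# Average silhouette width for each k, as a second view
fviz_nbclust(df, kmeans, method = "silhouette")
```

Different methods do not always agree on a single value of k, so it is worth looking at more than one plot before deciding.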
From the above results we found that 4 is the optimal number of clusters. We can now perform the final analysis and extract the results using these 4 clusters.
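Fitting the final model could look like the sketch below. The seed and nstart = 25 are assumptions, not values taken from the original run.

```r
data("USArrests")
df <- scale(USArrests)

set.seed(123)  # assumed seed; k-means starts from random centroids

# Fit K-means with 4 clusters, keeping the best of 25 random starts
km_res <- kmeans(df, centers = 4, nstart = 25)
print(km_res)
```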
The kmeans() function returns a list of components. The most important ones are listed below:
- cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
- centers: A matrix of cluster centers.
- totss: The total sum of squares.
- withinss: Vector of within-cluster sum of squares, one component per cluster.
- tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
- betweenss: The between-cluster sum of squares, i.e. $totss-tot.withinss$.
- size: The number of points in each cluster.
These components can be accessed as follows
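A short sketch of accessing the components with the $ operator, using the assumed result name km_res.

```r
data("USArrests")
df <- scale(USArrests)
set.seed(123)  # assumed seed
km_res <- kmeans(df, centers = 4, nstart = 25)

km_res$cluster        # cluster number assigned to each state
km_res$centers        # matrix of cluster centers (on the scaled variables)
km_res$size           # number of states in each cluster
km_res$tot.withinss   # total within-cluster sum of squares
km_res$betweenss      # between-cluster sum of squares
```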
Adding point classification to the original data.
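One way to attach the cluster labels to the original (unscaled) data is sketched below. The name clustered and the per-cluster means are our additions for illustration.

```r
data("USArrests")
df <- scale(USArrests)
set.seed(123)  # assumed seed
km_res <- kmeans(df, centers = 4, nstart = 25)

# Add the cluster number of each state as a new column
clustered <- cbind(USArrests, cluster = km_res$cluster)
head(clustered)

# Mean of each variable per cluster, on the original scale
aggregate(USArrests, by = list(cluster = km_res$cluster), FUN = mean)
```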
Visualizing K-means Clusters
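A minimal sketch of the plot using fviz_cluster() from factoextra, assuming the package is installed. With more than two variables, the function plots the observations on the first two principal components.

```r
library(factoextra)

data("USArrests")
df <- scale(USArrests)
set.seed(123)  # assumed seed
km_res <- kmeans(df, centers = 4, nstart = 25)

# Plot the states colored by cluster; the data argument is needed for kmeans results
fviz_cluster(km_res, data = df)
```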
Hence we have computed the optimal number of clusters and visualized the K-means clustering.
We hope you find this blog helpful. In case of any queries or suggestions, drop us a comment below.