In this blog, we will analyze the popular Wine dataset using the K-means clustering algorithm.
We analyzed the USArrests dataset using K-means clustering in our previous blog; you can refer to it via the link below:
This Wine dataset is the result of a chemical analysis of wines grown in a particular area. The analysis determined the quantities of 13 constituents found in each of the three types of wine. The attributes are: Alcohol, Malic acid, Ash, Alkalinity of ash, Magnesium, Total phenols, Flavonoids, Non-flavonoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. The dataset has 178 observations and no missing values.
You can download the dataset from the link.
Our goal is to try to group similar observations together and determine the number of possible clusters (it may differ from 3). This would help us make predictions and reduce dimensionality.
Let us now dive into the coding part.
Loading the dataset and getting the first few records of the dataset
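A sketch of this step (assuming the raw wine.data file from the UCI Machine Learning Repository is in the working directory; the raw file has no header row, so the column names below are added manually):

```r
# Load the wine dataset (assumes wine.data is in the working directory;
# the raw UCI file has no header row, so we name the columns ourselves)
wine <- read.csv("wine.data", header = FALSE)
colnames(wine) <- c("Type", "Alcohol", "Malic_acid", "Ash", "Alkalinity",
                    "Magnesium", "Total_phenols", "Flavonoids",
                    "Nonflavonoid_phenols", "Proanthocyanins",
                    "Color_intensity", "Hue", "OD280_OD315", "Proline")
head(wine)  # first few records of the dataset
```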
Getting the structure of the dataset using the str() function.
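Assuming the data frame is named `wine` as above, this is a one-liner:

```r
str(wine)  # compact display: dimensions, column names, types, sample values
```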
We can see that the dataset has 178 rows and 14 columns.
Summarizing the dataset using the summary() function.
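Again assuming the data frame is named `wine`:

```r
summary(wine)  # min, quartiles, median, mean, and max for every column
```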
Checking for missing values: there are no missing values anywhere in the dataset.
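One way to verify this (assuming the data frame is named `wine`):

```r
sum(is.na(wine))      # total count of missing values; 0 means none
colSums(is.na(wine))  # missing-value count per column
```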
Scaling the Data
Displaying the first few rows of the dataset after scaling it.
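A minimal sketch of the scaling step, assuming the data frame is named `wine` and that the class label in the first column (`Type`) is dropped before clustering:

```r
# Drop the class label and standardize the 13 numeric measurements
df <- scale(wine[, -1])  # each column: (x - mean) / sd
head(df)
```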
We can see that the data points have been standardized, that is, scaled. Scaling is done to make the variables comparable.
Standardizing consists of transforming the variables so that they have a mean of zero and a standard deviation of one.
Now we will load the two required R packages, cluster and factoextra.
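Both packages are on CRAN, so loading them looks like this:

```r
# install.packages(c("cluster", "factoextra"))  # if not already installed
library(cluster)
library(factoextra)
```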
Now we define the clusters such that the total intra-cluster variation (total within-cluster sum of squares) is minimized.
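One common way to do this is the elbow method: run `kmeans()` for a range of k values and plot the total within-cluster sum of squares against k. A sketch, assuming `df` is the scaled data from the previous step:

```r
# Elbow method: total within-cluster sum of squares for k = 1..10
set.seed(123)  # kmeans uses random starts, so fix the seed for reproducibility
wss <- sapply(1:10, function(k) {
  kmeans(df, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The "elbow" in the resulting curve, where adding another cluster stops reducing the within-cluster variation substantially, suggests the number of clusters to use.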
This produces the graph below.
Optimal number of clusters
Similar to the elbow method, the fviz_nbclust() function can be used to visualize and determine the optimal number of clusters.
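For example, using the average silhouette method (fviz_nbclust() also supports "wss" and "gap_stat" as the `method` argument); `df` is assumed to be the scaled data:

```r
set.seed(123)
fviz_nbclust(df, kmeans, method = "silhouette")  # higher silhouette is better
```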
From the above results, we find that 3 is the optimal number of clusters, so we can now perform the final analysis and extract the results using k = 3.
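The final fit might look like this, again assuming `df` is the scaled data:

```r
set.seed(123)
final <- kmeans(df, centers = 3, nstart = 25)  # 25 random starts, keep the best
print(final)  # cluster sizes, centers, and within-cluster sum of squares
```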
The cluster component is a vector of integers (from 1 to k) indicating the cluster to which each point is allocated.
Determining the cluster sizes, that is, the number of points in each cluster.
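Both are components of the fitted kmeans object (here assumed to be named `final`):

```r
final$cluster  # cluster assignment (1, 2, or 3) for each of the 178 wines
final$size     # number of observations in each cluster
```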
Visualizing K-means Clusters
2D representation of clusters
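factoextra provides fviz_cluster() for this; it projects the data onto the first two principal components and draws each cluster. Assuming the fitted object `final` and scaled data `df` from above:

```r
fviz_cluster(final, data = df)  # 2D (PCA-based) plot of the three clusters
```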
Hence, we have determined that the optimal number of clusters is 3 and visualized the K-means clustering.
Hope you found this blog helpful. In case of any queries or suggestions, drop us a comment below.
Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies.