Analyzing Wine dataset using K-means Clustering

In this blog, we analyze the popular Wine dataset using the K-means clustering algorithm.

In our previous blog, we analyzed the USArrest dataset using K-means clustering; you can refer to it for background.

The wine dataset is the result of a chemical analysis of wines grown in a particular area. The analysis determined the quantities of 13 constituents found in each of the three types of wine. The attributes are: Alcohol, Malic acid, Ash, Alkalinity of ash, Magnesium, Total phenols, Flavonoids, Non-Flavonoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. The dataset has 178 observations and no missing values.

You can download the dataset from the link.

Our goal is to group similar observations together and determine the number of clusters, which may differ from 3. The resulting groups can help with later prediction and give a compact, lower-dimensional summary of the data.

Let us now dive into the coding part.

Load the dataset and display its first few records.
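A minimal sketch of the loading step, assuming the dataset has been saved locally as a comma-separated file with a header row. The file name `wine.csv` and the helper `load_wine()` are illustrative assumptions, not names from the original post.

```r
# Hypothetical helper: read the wine data from a local CSV file.
# Assumption: the file has a header row and comma-separated values.
load_wine <- function(path = "wine.csv") {
  read.csv(path, header = TRUE)
}

# Usage (uncomment once the file is in the working directory):
# wine <- load_wine("wine.csv")
# head(wine)   # first six records
```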

Get the structure of the dataset using the str() function.

We can see that the dataset has 178 rows and 14 columns: the 13 constituents plus the wine type.

Summarize the dataset using the summary() function.

Check for missing values. There are none in the whole dataset.
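The two inspection steps above can be sketched as follows. The helper `inspect_data()` and the three-row data frame are made up for illustration; with the real data you would pass the loaded wine data frame instead.

```r
# Hypothetical helper: print the structure and a per-column summary
# of a data frame, then return its dimensions (rows, columns).
inspect_data <- function(df) {
  str(df)             # column types and the first few values
  print(summary(df))  # min, quartiles, mean and max for numeric columns
  dim(df)
}

# Small made-up data frame standing in for the wine data
demo <- data.frame(Type = c(1, 2, 3), Alcohol = c(14.2, 12.4, 13.1))
inspect_data(demo)
```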
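One way to run this check, shown on a made-up data frame that deliberately contains one missing value so the output is not all zeros. The helper name `count_missing()` is an assumption for illustration.

```r
# Hypothetical helper: number of NA values in each column.
count_missing <- function(df) {
  colSums(is.na(df))
}

# Made-up data with one deliberately missing value
demo <- data.frame(Type = c(1, 2, 3), Alcohol = c(14.2, NA, 13.1))
count_missing(demo)  # per-column NA counts
anyNA(demo)          # TRUE if at least one value is missing
```

On the wine data, every count should come back as zero.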

Scaling the Data

Display the first few records of the dataset after scaling it.

We can see that the values have been standardized, that is, scaled. Scaling makes the variables comparable, so that variables measured on larger scales, such as Proline, do not dominate the distance calculation.

Standardizing transforms each variable so that it has a mean of 0 and a standard deviation of 1.

Now we will load the two required R packages, cluster and factoextra.
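The scaling step can be sketched as follows. This is a sketch rather than the original code: the helper `scale_features()`, the column name `Type` for the wine-type label, and the three-row stand-in data frame are assumptions for illustration. Base R's `scale()` does the actual standardization.

```r
# Hypothetical helper: standardize all feature columns.
# Assumption: `label_col` names the wine-type column, which is dropped
# so that only the chemical measurements are clustered.
scale_features <- function(df, label_col = "Type") {
  features <- df[, setdiff(names(df), label_col), drop = FALSE]
  scale(features)  # centre each column to mean 0, rescale to sd 1
}

# Small made-up data frame standing in for the wine data
demo <- data.frame(Type    = c(1, 2, 3),
                   Alcohol = c(14.2, 12.4, 13.1),
                   Proline = c(1065, 520, 845))
scaled <- scale_features(demo)
head(scaled)
colMeans(scaled)      # approximately 0 for every column
apply(scaled, 2, sd)  # 1 for every column
```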
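Loading the packages might look like this, assuming both are already installed from CRAN:

```r
# install.packages(c("cluster", "factoextra"))  # run once if not installed
library(cluster)     # clustering algorithms
library(factoextra)  # clustering visualization helpers (fviz_* functions)
```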

Elbow Method

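K-means defines clusters so that the total intra-cluster variation (total within-cluster sum of squares) is minimized. The elbow method plots this quantity against the number of clusters k and looks for the bend beyond which adding more clusters yields little further reduction.

This creates the elbow graph.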
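A minimal sketch of the elbow computation using base R's `kmeans()`. The helper `elbow_wss()`, the seed, and `nstart = 25` are illustrative choices, not values from the original post. Because the wine file is not bundled with R, the built-in `iris` measurements are used purely as a stand-in; with the wine data you would pass the scaled wine matrix instead.

```r
# Hypothetical helper: total within-cluster sum of squares for k = 1..k_max.
elbow_wss <- function(x, k_max = 10, nstart = 25, seed = 123) {
  set.seed(seed)  # k-means starts from random centres; fix them for repeatability
  sapply(seq_len(k_max), function(k) {
    kmeans(x, centers = k, nstart = nstart)$tot.withinss
  })
}

# Stand-in data: scaled iris measurements (replace with the scaled wine data)
x <- scale(iris[, -5])
wss <- elbow_wss(x, k_max = 10)

plot(seq_along(wss), wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```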

Optimal number of clusters

Similar to the elbow method, the fviz_nbclust() function from factoextra can be used to visualize and determine the optimal number of clusters.
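A short sketch of such a call, again with the scaled `iris` measurements as a stand-in for the scaled wine data. The `method` and `k.max` values shown are illustrative; `fviz_nbclust()` also accepts `method = "silhouette"` and `method = "gap_stat"`.

```r
library(factoextra)

# Stand-in data: scaled iris measurements (replace with the scaled wine data)
x <- scale(iris[, -5])

set.seed(123)  # k-means starts from random centres
# method = "wss" draws the within-cluster sum of squares for each k
p <- fviz_nbclust(x, kmeans, method = "wss", k.max = 10)
print(p)
```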

Extracting Results

From the above results, we found that 3 is the optimal number of clusters. We can now perform the final analysis and extract the results using these 3 clusters.

The cluster component is a vector of integers (from 1 to k) indicating the cluster to which each point is allocated.

The size component gives the number of points in each cluster.
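A minimal sketch of the final fit and of the two components described above. The scaled `iris` measurements stand in for the scaled wine data, and the seed and `nstart = 25` are illustrative choices rather than values from the original post.

```r
# Stand-in data: scaled iris measurements (replace with the scaled wine data)
x <- scale(iris[, -5])

set.seed(123)                                 # k-means uses random starting centres
final <- kmeans(x, centers = 3, nstart = 25)  # keep the best of 25 random starts

head(final$cluster)  # cluster number (1 to 3) assigned to each observation
final$size           # number of observations in each cluster
final$centers        # cluster means on the standardized scale
```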

Visualizing K-means Clusters

2D representation of clusters
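Such a plot can be produced with `fviz_cluster()` from factoextra. The sketch below is self-contained and again uses the scaled `iris` measurements as a stand-in for the scaled wine data; the seed and `nstart` value are assumptions.

```r
library(factoextra)

# Stand-in data: scaled iris measurements (replace with the scaled wine data)
x <- scale(iris[, -5])

set.seed(123)
final <- kmeans(x, centers = 3, nstart = 25)

# With more than two variables, fviz_cluster() plots the observations
# on the first two principal components.
p <- fviz_cluster(final, data = x)
print(p)
```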

Hence, we have found that the optimal number of clusters is 3 and visualized the K-means clustering.

We hope you find this blog helpful. If you have any queries or suggestions, drop us a comment below.

Badal Kumar

Data Analyst at Aeon Learning
