Data Science and Artificial Intelligence

K-Means Clustering Algorithm

This entry is part 16 of 17 in the series Machine Learning Algorithms

Introduction:

Through this blog, beginners will get a thorough understanding of the k-Means Clustering Algorithm.

Clustering:

Clustering is the most important unsupervised learning problem which deals with finding structure in a collection of unlabeled data (like every other problem of this kind). A loose definition of clustering could be “the process of organizing objects into groups that have members who are similar in one way or the other”. A cluster is therefore a collection of objects which are “similar” to each other but “dissimilar” to those belonging to other clusters. We can show this with a simple graphical example:

Free Step-by-step Guide To Become A Data Scientist

Subscribe and get this detailed guide absolutely FREE

In this case, we can easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in this case geometrical distance). This is called distance-based clustering. Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if this one defines a concept common to all that objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.

The Goals of Clustering

So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how to decide the basis of good clustering? Unfortunately, there is no absolute “best” criterion which would be independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding “natural clusters” and describe their unknown properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes) or in finding unusual data objects (outlier detection).

K-Means

K-Means is one of the most popular “clustering” algorithms. K-Means stores k centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster’s centroid than any other centroid.

K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.

How it works:

The distance metric used to calculate similarity in step 1 is Euclidean distance.

Problem Statement :

To build a simple K-Means model for clustering the car data into different groups.

Data details

==========================================
1. Title: Car Evaluation Database
==========================================

The dataset is available at: “http://archive.ics.uci.edu/ml/datasets/Car+Evaluation

2. Sources:
  (a) Creator: Marko Bohanec
  (b) Donors: Marko Bohanec([email protected])
               Blaz Zupan      ([email protected])
   (c) Date: June, 1997

3. Past Usage:

   The hierarchical decision model, from which this dataset is
   derived, was first presented in 

   M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for multi-attribute
   decision making. In 8th Intl Workshop on Expert
   Systems and their Applications, Avignon, France. pages 59-78, 1988.

   Within machine-learning, this dataset was used for the evaluation
   of HINT (Hierarchy INduction Tool), which was proved to be able to
   completely reconstruct the original hierarchical model. This,
   together with a comparison with C4.5, is presented in

   B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by
   function decomposition. ICML-97, Nashville, TN. 1997 (to appear)

4. Relevant Information Paragraph:

   Car Evaluation Database was derived from a simple hierarchical
   decision model originally developed for the demonstration of DEX
   (M. Bohanec, V. Rajkovic: Expert system for decision
   making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates
   cars according to the following concept structure:

   CAR                      car acceptability
   PRICE                  overall price
   buying               buying price
   maint                price of the maintenance
   TECH                   technical characteristics
   COMFORT              comfort
   doors              number of doors
   persons            capacity in terms of persons to carry
   lug_boot           the size of luggage boot
   safety               estimated safety of the car
   Input attributes are printed in lowercase. 
5. Number of Instances: 1728 (instances completely cover the attribute space)

6. Number of Attributes: 6

7. Attribute Values:

   buying       v-high, high, med, low
   maint        v-high, high, med, low
   doors        2, 3, 4, 5-more
   persons      2, 4, more
   lug_boot     small, med, big
   safety       low, med, high

8. Missing Attribute Values: none

9. Class Distribution (number of instances per class)

   class      N          N[%]
   -----------------------------
   unacc     1210     (70.023 %) 
   acc        384     (22.222 %) 
   good        69     ( 3.993 %) 
   v-good      65     ( 3.762 %)

Tools to be used:

Numpy,pandas,scikit-learn

Python Implementation with code :

1) Import necessary libraries

Import the necessary modules from specific libraries.

import os
import numpy as np
import pandas as pd
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

2) Load the data set

Use pandas module to read the bike data from the file system. Check few records of the dataset.

data = 
pd.read_csv('data/car_quality/car.data',names=['buying','maint','doors','persons','lug_boot','safety','class'])
data.head()

  buying maint doors persons lug_boot safety class
0 vhigh  vhigh 2     2       small    low    unacc
1 vhigh  vhigh 2     2       small    med    unacc
2 vhigh  vhigh 2     2       small    high   unacc
3 vhigh  vhigh 2     2       med      low    unacc
4 vhigh  vhigh 2     2       med      med    unacc

3) Check information about the data set

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null object
maint       1728 non-null object
doors       1728 non-null object
persons     1728 non-null object
lug_boot    1728 non-null object
safety      1728 non-null object
class       1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB

The training dataset has 1728 rows and 7 columns.

There are no missing values in the dataset.

4) Identify the target variable

data['class'],class_names = pd.factorize(data['class'])

The target variable is marked as a class in the data frame. The values are present in string format. However, the algorithm requires the variables to be coded into its equivalent integer codes. We can convert the string categorical values into an integer code using factorize method of the pandas library.

Let’s check the encoded values now.

print(class_names)
print(data['class'].unique())

Index([u'unacc', u'acc', u'vgood', u'good'], dtype='object')
[0 1 2 3]

As we can see the values has been encoded into 4 different numeric labels.

5) Identify the predictor variables and encode any string variables to equivalent integer codes

data['buying'],_ = pd.factorize(data['buying'])
data['maint'],_ = pd.factorize(data['maint'])
data['doors'],_ = pd.factorize(data['doors'])
data['persons'],_ = pd.factorize(data['persons'])
data['lug_boot'],_ = pd.factorize(data['lug_boot'])
data['safety'],_ = pd.factorize(data['safety'])
data.head()

  buying maint doors persons lug_boot safety class

0 0      0     0     0       0        0      0

1 0      0     0     0       0        1      0

2 0      0     0     0       0        2      0

3 0      0     0     0       1        0      0

4 0      0     0     0       1        1      0

Check the data types now :

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null int64
maint       1728 non-null int64
doors       1728 non-null int64
persons     1728 non-null int64
lug_boot    1728 non-null int64
safety      1728 non-null int64
class       1728 non-null int64
dtypes: int64(7)
memory usage: 94.6 KB

Everything is now converted in integer form.

6) Select the predictor feature and select the target variable

In clustering, there is no target variable as such. However, clustering helps us to find a cluster which can be used as weak labels. These weak labels can bootstrap our supervised learning.

This technique is widely used for semi-supervised learning.

We need not worry about it as of now. You can just consider that y value will be used to validate the accuracy of weak labeling going ahead.

X = data
y = data.iloc[:,-1]

7) Training/model fitting

model = KMeans(n_clusters=4, random_state=123)
model.fit(X)

8) Check few parameters of the learned cluster (Model parameters study):

Check the cluster centroids or means:

It should be 4 vectors or a matrix with 4 rows since the number of clusters we have fitted is 4.

Let’s check

model.cluster_centers_

[[1.76355748e+00 1.82429501e+00 8.37310195e-01 1.09327549e+00
  1.05206074e+00 1.12581345e+00 8.45986985e-01 7.05422993e+00]
 [1.72631579e+00 1.75438596e+00 2.50175439e+00 9.29824561e-01
  9.64912281e-01 9.07017544e-01 3.19298246e-01 1.87719298e+00]
 [5.00000000e-01 1.50000000e+00 1.50000000e+00 1.00000000e+00
  1.00000000e+00 1.00000000e+00 1.94444444e-01 4.50000000e+00]
 [2.18490566e+00 3.88679245e-01 4.98113208e-01 9.88679245e-01
  9.84905660e-01 9.81132075e-01 2.45283019e-01 1.06581410e-14]]

Check the goodness of the cluster i.e. within the sum of a square of the model:

The KMeans algorithm clusters data by trying to separate samples in n groups of equal variances, minimizing a criterion known as the inertia or within-cluster sum-of-squares Inertia, or the within-cluster sum of squares criterion, can be recognized as a measure of how internally coherent clusters are.

The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean j of the samples in the cluster. The means are commonly called the cluster “centroids”.

The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum of squared criterion:

Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called  “curse of dimensionality”). Running a dimensionality reduction algorithm such as PCA prior to k-means clustering can alleviate this problem and speed up the computations.

model.inertia_

9447.907008065738

The lesser this number, the better the model fit.

9) Check the quality of the weak classification by the model

labels = model.labels_

# check how many of the samples were correctly labeled
correct_labels = sum(y == labels)

print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))

Result: 456 out of 1728 samples were correctly labeled.
correct 0.26 classification

We have achieved a weak classification accuracy of 26% by our unsupervised model.

How do we know the value of an optimal value of K

The value of k basically means that in how many clusters you can separate out the most homogeneous data. Choosing a large value of K will lead to a greater amount of execution time. Selecting a small value of K might or might not give a good fit. There is no such guaranteed way to find the best value of K. However, we can run a few experiments and find the value of each model’s inertia and plot it on a graph. This plot is also known as an elbow plot. We basically try to find the value of k from where there is a major shift in the value of inertia.

Let’s plot the graph.

from mpl_toolkits.mplot3d import Axes3D

def elbow_plot(data, maxK=40, seed_centroids=None):
    """
        parameters:
        - data: pandas DataFrame (data to be fitted)
        - maxK (default = 10): integer (maximum number of clusters with which to run k-means)
        - seed_centroids (default = None ): float (initial value of centroids for k-means)
    """
    sse = {}
for k in range(1, maxK):
        print("k: ", k)
        if seed_centroids is not None:
            seeds = seed_centroids.head(k)
            kmeans = KMeans(n_clusters=k, max_iter=500, n_init=100, random_state=0, init=np.reshape(seeds, 
(k,1))).fit(data)
            data["clusters"] = kmeans.labels_
        else:
            kmeans = KMeans(n_clusters=k, max_iter=300, n_init=100, random_state=0).fit(data)
            data["clusters"] = kmeans.labels_
        # Inertia: Sum of distances of samples to their closest cluster center
        sse[k] = kmeans.inertia_
    plt.figure()
    plt.plot(list(sse.keys()), list(sse.values()))
    plt.show()
    return

elbow_plot(X, maxK=10)

By the plot, we can see that there is a kink at k=2. Hence k =2 can be considered a good number of the cluster to cluster this data.

Let’s check the inertia and weak classification with a number of cluster = 2.

model = KMeans(n_clusters=2,random_state=123)
model.fit(X)

labels = model.labels_

# check how many of the samples were correctly labeled
correct_labels = sum(y == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))
print("correct %.02f classification " % (correct_labels/float(y.size)))

Result: 869 out of 1728 samples were correctly labeled.
correct 0.50 classification

In conclusion, our weak labeling was found to be as good as a weak supervised classification model.

Pros and Cons

Time Complexity

K-means is linear in the number of data objects i.e. O(n), where n is the number of data objects. The time complexity of most of the hierarchical clustering algorithms is quadratic i.e. O(n2). Therefore, for the same amount of data, hierarchical clustering will take a quadratic amount of time. Imagine clustering 1 million records?

Shape of Clusters

K-means works well when the shape of the clusters is hyper-spherical (or circular in 2 dimensions). If the natural clusters occurring in the dataset are non-spherical then probably K-means is not a good choice.

Repeatability

K-means starts with a random choice of cluster centers, therefore it may yield different clustering results on different runs of the algorithm. Thus, the results may not be repeatable and lack consistency. However, with hierarchical clustering, you will most definitely get the same clustering results.

Of course, K-means clustering requires prior knowledge of K (or a number of clusters), whereas in hierarchical clustering you can stop at whatever level (or clusters) you wish.

Acadgild provides Python programming course for beginner’s. To know more visit Acadgild.

Series Navigation<< Principal Component AnalysisRandom Forest >>

Abhay Kumar

Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close
Close