Data Science and Artificial Intelligence

Decision Trees

This entry is part 5 of 9 in the series Machine Learning Algorithms

This blog aims at imparting beginners with adequate information on the fundamental concepts of a decision tree so that they can develop their first tree model in no time.  

A decision tree is a support tool that uses a tree-like graph or model of decisions and their possible consequences. It is one way to display an algorithm that contains only conditional control statements.

Free Step-by-step Guide To Become A Data Scientist

Subscribe and get this detailed guide absolutely FREE

A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from the root to leaf represent classification rules.

A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems.

It works for both categorical and continuous input and output variables.

Types of Decision Trees

Types of the decision tree are based on the type of target variable we have. It can be of two types:

Categorical Variable Decision Tree:

A decision tree which has a categorical target variable is called categorical variable decision tree.

Continuous Variable Decision Tree:

A decision tree which has continuous target variable then it is called as the continuous variable decision tree.


Let’s say we have a problem to predict whether a bike is good or not. This can be judged by using a decision tree classifier.

However, to qualify the bike into the good or bad category, mileage becomes an important factor. Mileage is measured using a contiguous value hence it can be measured using the decision tree regressor.

Important Terminology related to Decision Trees

Here’s a look at the basic terminology used with Decision trees:

Root Node:

It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.


It is a process of dividing a node into two or more sub-nodes.

Decision Node:

When a sub-node splits into further sub-nodes, then it is called a decision node.

Leaf/ Terminal Node:

Nodes that do not split are called Leaf or Terminal nodes.\


When we remove sub-nodes of a decision node, this process is called pruning. You can say the opposite process of splitting.

Branch / Sub-Tree:

A sub-section of entire tree is called a branch or sub-tree.

Parent and Child Node:

A node, which is divided into sub-nodes is called the parent node of sub-nodes whereas sub-nodes are the children of a parent node.

How to split nodes?

Making strategic splits drastically affects a tree’s accuracy. The decision criteria are different for classification and regression trees. Decision trees use multiple algorithms to decide to split a node in two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words, we can say that purity of the node increases with respect to the target variable. Decision tree splits the nodes on all available variables and then selects the split which results in most homogeneous sub-nodes.

There are a few algorithms to find the optimum split. Let’s look at the following index to understand the mathematics behind them.

Gini Index

Gini index says, if we randomly select two items from a population, they must be of the same class and probability for this is 1 if the population is pure.

It works with the categorical target variable “Success” or “Failure”. It performs only binary splits. Higher the value of Gini, higher the homogeneity. CART (Classification and Regression Tree) uses the Gini method to create binary splits.

Steps to Calculate Gini for a split

  1. Calculate Gini for sub-nodes, using formula sum of the square of probability for success and failure (p^2+q^2).
  2. Calculate Gini for split using weighted Gini score of each node of that split.

Information Gain

We can derive information gain from entropy as 1 – Entropy. Entropy is a way of measuring the amount of impurity in a given set of data. It is represented by a formula:

The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero and if the sample is equally divided it has the entropy of one.

We will use implementation provided by the python machine learning framework known as scikit-learn to understand Decision Trees.

Problem Statement:

To build a Decision Tree model for prediction of car quality given other attributes about the car.

Data details

1. Title: Car Evaluation Database

The dataset is available at: “

2. Sources:
  (a) Creator: Marko Bohanec
  (b) Donors: Marko Bohanec([email protected])
               Blaz Zupan      ([email protected])
   (c) Date: June, 1997

3. Past Usage:

  The hierarchical decision model, from which this dataset is
   derived, was first presented in 

   M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for
   multi-attribute decision making. In 8th Intl Workshop on Expert
   Systems and their Applications, Avignon, France. pages 59-78, 1988.

   Within machine-learning, this dataset was used for the evaluation
   of HINT (Hierarchy INduction Tool), which was proved to be able to
   completely reconstruct the original hierarchical model. This,
   together with a comparison with C4.5, is presented in

   B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by
   function decomposition. ICML-97, Nashville, TN. 1997 (to appear)

4. Relevant Information Paragraph:

   Car Evaluation Database was derived from a simple hierarchical
   decision model originally developed for the demonstration of DEX
   (M. Bohanec, V. Rajkovic: Expert system for decision
   making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates
   cars according to the following concept structure:

   CAR                      car acceptability
   PRICE                  overall price
   buying               buying price
   maint                price of the maintenance
   TECH                   technical characteristics
   COMFORT              comfort
   doors              number of doors
   persons            capacity in terms of persons to carry
   lug_boot           the size of luggage boot
   safety               estimated safety of the car

   Input attributes are printed in lowercase. 

5. Number of Instances: 1728 (instances completely cover the attribute space)

6. Number of Attributes: 6

7. Attribute Values:

   buying       v-high, high, med, low
   maint        v-high, high, med, low
   doors        2, 3, 4, 5-more
   persons      2, 4, more
   lug_boot     small, med, big
   safety       low, med, high

8. Missing Attribute Values: none

9. Class Distribution (number of instances per class)

   class      N          N[%]
   unacc     1210     (70.023 %) 
   acc        384     (22.222 %) 
   good        69     ( 3.993 %) 
   v-good      65     ( 3.762 %)

Tools to be used:

Numpy, pandas, scikit-learn

Python Implementation with code :

0) Import necessary libraries

Import the necessary modules from specific libraries.

import os
import numpy as np
import pandas as pd
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree, metrics

1) Load the data set

Use pandas module to read the bike data from the file system. Check a few records of the dataset.

data =

  buying maint doors persons lug_boot safety class
0 vhigh  vhigh 2     2       small    low    unacc
1 vhigh  vhigh 2     2       small    med    unacc
2 vhigh  vhigh 2     2       small    high   unacc
3 vhigh  vhigh 2     2       med      low    unacc
4 vhigh  vhigh 2     2       med      med    unacc

2) Check a few information about the data set

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null object
maint       1728 non-null object
doors       1728 non-null object
persons     1728 non-null object
lug_boot    1728 non-null object
safety      1728 non-null object
class       1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB

The training dataset has 1728 rows and 7 columns.

There are no missing values in the dataset

3) Identify the target variable

data['class'],class_names = pd.factorize(data['class'])

The target variable is marked as a class in the data frame. The values are present in string format. However, the algorithm requires the variables to be coded into its equivalent integer codes. We can convert the string categorical values into an integer code using factorize method of the pandas library.

Let’s check the encoded values now.


Index([u'unacc', u'acc', u'vgood', u'good'], dtype='object')
[0 1 2 3]

As we can see the values has been encoded into 4 different numeric labels.

4) Identify the predictor variables and encode any string variables to equivalent integer codes

data['buying'],_ = pd.factorize(data['buying'])
data['maint'],_ = pd.factorize(data['maint'])
data['doors'],_ = pd.factorize(data['doors'])
data['persons'],_ = pd.factorize(data['persons'])
data['lug_boot'],_ = pd.factorize(data['lug_boot'])
data['safety'],_ = pd.factorize(data['safety'])

        buying maint doors persons lug_boot safety class
0       0      0     0     0       0        0      0
1       0      0     0     0       0        1      0
2       0      0     0     0       0        2      0
3       0      0     0     0       1        0      0
4       0      0     0     0       1        1      0

Check the data types now:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null int64
maint       1728 non-null int64
doors       1728 non-null int64
persons     1728 non-null int64
lug_boot    1728 non-null int64
safety      1728 non-null int64
class       1728 non-null int64
dtypes: int64(7)
memory usage: 94.6 KB

Everything is now converted to integer form.

5) Select the predictor feature and the target variable

X = data.iloc[:,:-1]
y = data.iloc[:,-1]

6) Train test split:

# split data randomly into 70% training and 30% test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=0)

7) Training/model fitting

# train the decision tree
dtree = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0), y_train)

8) Model parameters study:

# use the model to make predictions with the test data
y_pred = dtree.predict(X_test)
# how did our model perform?
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 96
Accuracy: 0.82

As you can see the algorithm was able to achieve a classification accuracy of 82% on the held out set. Only 96 samples were misclassified. 

Learn Python Programming Online now. Visit our website to know more.

Visualization of the decision graph:

import graphviz
feature_names = X.columns

dot_data = tree.export_graphviz(dtree, out_file=None, filled=True, rounded=True,
graph = graphviz.Source(dot_data)  

Series Navigation<< K-Nearest Neighbor AlgorithmNaive Bayesian Model >>

Abhay Kumar

Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles