
Decision Tree in Python

In this blog, we will discuss Decision Trees and their implementation in Python with the help of a visualized graph.

In our previous blog, we covered the decision tree and its implementation in R. If you are familiar with the R programming language, we suggest going through that blog at the link below:

https://acadgild.com/blog/decision-tree-using-r

A decision tree is a popular supervised machine learning algorithm used for both classification and regression tasks. It is a flowchart-like binary tree in which each internal node tests a feature variable, each branch represents the group of observations produced by that test, and each leaf represents the final outcome (the predicted class or value).

The main objective of a decision tree is to split the data so that the elements in each resulting group belong to the same category. Decision tree graphs are also easy to interpret.

The splitting of the data is driven by measures that tell us how good a candidate split is, so that the data can be partitioned in the best possible way. The most popular measures are:

  • Gini index
  • Information gain

Gini Index: This measures the impurity of a node and hence the quality of a split. The scikit-learn implementation of DecisionTreeClassifier uses the Gini criterion by default.

It works with categorical target variables such as “Success” and “Failure” and performs only binary splits.

The Gini index falls between 0 and 1, where 0 means that all the elements belong to a single class and values close to 1 mean that the elements are distributed randomly across many classes. When the Gini value is 0, the node is considered pure and no further split is made.
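
To make this concrete, here is a minimal sketch (not from the original post) of how the Gini impurity of a node can be computed from its class labels:

import numpy as np

def gini_impurity(labels):
    # proportion of samples in each class at the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Gini = 1 - sum(p_i^2): 0 for a pure node, larger for mixed nodes
    return 1 - np.sum(p ** 2)

print(gini_impurity(["Success"] * 10))                   # 0.0 -> pure node
print(gini_impurity(["Success"] * 5 + ["Failure"] * 5))  # 0.5 -> evenly mixed binary node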

Information Gain: Information gain is derived from entropy. Entropy measures the amount of impurity in a given set of data.

Information gain is used to determine which feature or attribute gives us the maximum information about a class. 

High entropy means the node contains a mix of different classes, while low entropy means it is dominated by a single class; therefore, we aim to split each node in the way that decreases entropy the most.
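
As an illustration (again, a sketch rather than code from the original post), entropy and the information gain of a split can be computed as follows:

import numpy as np

def entropy(labels):
    # Shannon entropy of the class distribution at a node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # entropy of the parent minus the weighted entropy of the two child nodes
    n = len(left) + len(right)
    child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child_entropy

parent = ["A"] * 5 + ["B"] * 5
print(information_gain(parent, ["A"] * 5, ["B"] * 5))  # 1.0 -> a perfect split removes all impurity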

In this blog, we will be using the popular Iris Data Set. This dataset is perhaps the best known database to be found in the pattern recognition literature. The dataset contains 3 classes of 50 instances each, where each class refers to a type of Iris plant namely Setosa, Versicolor and Virginica. One is linearly separable from the other 2 and the latter are not linearly separable from each other.

The predicted attribute is the class of Iris plant.

The dataset can be loaded from the sklearn library itself.

So let us begin our coding in Python.

We’ll import all the necessary libraries. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

We will now load the dataset from the sklearn library.

from sklearn.datasets import load_iris
iris = load_iris()

Initializing X and y as the predictors and the target variable respectively.

X = pd.DataFrame(iris.data, columns = iris.feature_names)
y = pd.Categorical.from_codes(iris.target, iris.target_names)

Since the target variable is categorical, consisting of 3 flower species, we have used ‘Categorical.from_codes’. This constructor builds a categorical variable from integer codes and the corresponding category names.
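
As a quick illustration (separate from the dataset code above), from_codes maps integer codes to their category labels:

import pandas as pd

# iris.target stores the species as the integer codes 0, 1 and 2;
# from_codes maps those codes back to readable category names
print(pd.Categorical.from_codes([0, 1, 2, 0], ["setosa", "versicolor", "virginica"]))
# output (roughly):
# ['setosa', 'versicolor', 'virginica', 'setosa']
# Categories (3, object): ['setosa', 'versicolor', 'virginica']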

Checking the first few records of both the variables.

X.head()
y = pd.get_dummies(y)
y.head()

Using the get_dummies() function, we have converted our flower species categories into dummy (one-hot encoded) variables.
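
For a small, self-contained illustration of what this one-hot encoding does (not part of the dataset code above):

import pandas as pd

species = pd.Series(["setosa", "versicolor", "virginica", "setosa"])
# each row gets a 1 (or True, in newer pandas) in the column of its species and 0 elsewhere
print(pd.get_dummies(species))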

Checking the info of X and y respectively

X.info()
y.info()

We have 150 non-null values in each column, so there is no missing data in our dataset.

Checking the statistical data of our predictor variables

X.describe()

Counting how many samples belong to each class of the target variable

# summing each dummy column gives the number of samples per species
y.sum()

From the above output, we can see that we have 50 instances each of the three respective species of the plant. 

Visualizing one of the independent features (petal width)

X['petal width (cm)'].plot.hist()
plt.show()

About 50 flowers in this dataset have petal widths between 0.1 and 0.5 cm.

Splitting the data into training and test data.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

Fitting the model with the train data

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
prediction = dt.predict(X_test)

Evaluating the model

Importing the required classes from the sklearn library for the evaluation

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score

print(classification_report(y_test, prediction))
# argmax converts the one-hot encoded labels back to class indices for the confusion matrix
print(confusion_matrix(y_test.values.argmax(axis=1), prediction.argmax(axis=1)))

As we can see, our decision tree classifier correctly classified 29/30 plants.

score = accuracy_score(y_test, prediction)
score

The accuracy of our model is 96%, which is pretty good.

Tree Visualization

Scikit-learn has built-in support for exporting decision trees for visualization. We might not use it often, as it requires us to install Graphviz.

Graphviz is a visualization library and can be installed using the below command:

conda install graphviz

or

conda install python-graphviz

and 

conda install pydot

import graphviz
from sklearn import tree

# export the fitted tree in Graphviz DOT format
dot_data = tree.export_graphviz(dt, out_file=None, filled=True, rounded=True,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names)

graph = graphviz.Source(dot_data)  
graph

The export_graphviz function converts the decision tree classifier into DOT format, and graphviz.Source renders this DOT data into a graph that can be displayed directly in Jupyter.
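
If installing Graphviz is not convenient, newer versions of scikit-learn (0.21 and above) also offer a matplotlib-only plotter, sklearn.tree.plot_tree. A minimal sketch, here fitting a fresh tree on the original integer targets so that the class names show up cleanly (the classifier above was trained on one-hot encoded labels):

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

# fit a single-output tree on the raw integer targets for clearer class labels
clf = DecisionTreeClassifier(random_state=1).fit(iris.data, iris.target)

plt.figure(figsize=(12, 8))
tree.plot_tree(clf, filled=True, rounded=True,
               feature_names=iris.feature_names,
               class_names=iris.target_names)
plt.show()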

In the decision tree chart, each internal node has a decision rule that splits the data. Gini refers to the Gini index, which measures the impurity of the node. A node is pure when all of its records belong to the same class; such pure nodes need no further splitting and end up as leaf nodes.

This is a pruned tree that is less complex, explainable, and easy to understand.

Petal length (cm) <= 2.6 is the first question the decision tree asks: is the petal length less than or equal to 2.6 cm? Based on the answer, the tree follows either the True or the False branch.

gini = 0.443 is the Gini score, a metric that quantifies the impurity of the node. A Gini score of 0 means that the node is pure.

samples tells us how many examples reach that node.

value is the vector giving the number of samples belonging to each class at that node.
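
One simple way to obtain a small, pruned tree like the one described above is to cap its depth when constructing the classifier. A minimal sketch, where the max_depth value of 3 is an arbitrary choice for illustration:

from sklearn.tree import DecisionTreeClassifier

# limiting max_depth keeps the tree shallow and easier to interpret,
# and can also reduce overfitting at the cost of some training accuracy
dt_pruned = DecisionTreeClassifier(max_depth=3, random_state=1)
dt_pruned.fit(X_train, y_train)
print(dt_pruned.score(X_test, y_test))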

And this brings us to the end of our blog. I hope this helps you understand the decision tree classifier. Do leave us a comment for any query or suggestion.

Keep visiting our website for more blogs on Data Science and Data Analytics.

Mitali Singh
