In this blog, we will discuss Decision Trees and their implementation in Python with the help of a visualized graph.
In our previous blog, we learned about the decision tree and its implementation in R using a dataset. If you are familiar with the R programming language, we suggest going through that blog as well.
As we know, the Decision Tree is a popular supervised machine learning algorithm used for both classification and regression tasks. A decision tree is a binary-tree-like flowchart: each internal node tests a feature variable, each branch represents the group of observations that satisfy that test, and each leaf represents the final outcome.
The main objective of the decision tree is to split data in such a way that each element in one group belongs to the same category. Decision tree graphs are easily interpreted.
The data is split according to a measure that scores how well a candidate split partitions it, so we need a way of measuring how good a split is. The most popular measures are:
- Gini index
- Information gain
Gini Index: This is used to measure the impurity of a node, and therefore the quality of a split. The scikit-learn implementation of the DecisionTreeClassifier uses gini by default.
It works with categorical target variables such as “Success” and “Failure” and performs only binary splits.
The Gini index falls between 0 and 1, where 0 denotes that all the elements belong to a single class and values approaching 1 denote that the elements are spread randomly across many classes. For a two-class problem the maximum is 0.5, reached when both classes are equally represented. When the value of Gini is equal to 0, the node is considered pure and no further split is done.
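To make the definition concrete, here is a minimal sketch of the Gini index computed as 1 minus the sum of squared class proportions. The function name `gini_impurity` and the toy label lists are illustrative assumptions, not part of scikit-learn.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Toy nodes (made up for illustration)
pure = ["Success"] * 10                               # one class only
mixed = ["Success"] * 5 + ["Failure"] * 5             # two classes, evenly mixed
three = ["setosa", "versicolor", "virginica"] * 4     # three classes, evenly mixed

print(gini_impurity(pure))    # 0.0, the node is pure
print(gini_impurity(mixed))   # 0.5, the maximum for two classes
print(gini_impurity(three))   # about 0.667, the maximum for three classes
```

The more classes that are evenly mixed in a node, the closer the value gets to 1.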
Information Gain: Information gain is derived from entropy. Entropy is a way of measuring the amount of impurity in a given set of data.
Information gain is used to determine which feature or attribute gives us the maximum information about a class.
High entropy means that we have a mix of different classes, and low entropy means that we have predominantly one class. We are therefore keen on splitting the node in a way that decreases the entropy.
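The relationship between entropy and information gain can be sketched in a few lines. This is only an illustration under simple assumptions: the helper names `entropy` and `information_gain` and the toy labels are made up here, and the gain is taken as the parent's entropy minus the size-weighted entropy of the child nodes.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    proportions = [n / total for n in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy node with two evenly mixed classes (made up for illustration)
parent = ["A"] * 5 + ["B"] * 5

# A split that separates the classes completely
perfect_split = [["A"] * 5, ["B"] * 5]

# A split that leaves both children as mixed as the parent
useless_split = [["A", "A", "B", "B"], ["A", "A", "A", "B", "B", "B"]]

print(entropy(parent))                          # 1.0 bit
print(information_gain(parent, perfect_split))  # 1.0, all uncertainty removed
print(information_gain(parent, useless_split))  # 0.0, nothing learned
```

A split is preferred when its information gain is high, that is, when the children are purer than the parent.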
In this blog, we will be using the popular Iris Data Set. This dataset is perhaps the best known database to be found in the pattern recognition literature. The dataset contains 3 classes of 50 instances each, where each class refers to a type of Iris plant, namely Setosa, Versicolor and Virginica. One class is linearly separable from the other two; the latter are not linearly separable from each other.
The predicted attribute is the class of Iris plant.
The dataset can be loaded from the sklearn library itself.
So let us begin our coding in Python.
We’ll import all the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
We will now load the dataset from the sklearn library
from sklearn.datasets import load_iris
iris = load_iris()
Initializing X and y as the predictors and the target variable respectively.
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Categorical.from_codes(iris.target, iris.target_names)
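Since the target variable is categorical, consisting of 3 flower species, we have used ‘Categorical.from_codes’. This constructor builds a categorical from integer codes and their category names, which is exactly how load_iris provides the target (iris.target holds the codes and iris.target_names holds the names).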
Checking the first few records of both the variables.
y = pd.get_dummies(y)
y.head()
Using get_dummies() function we have converted our categories of flower species into dummy variables.
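The two conversions used on the target can be seen on a tiny example. This is a small sketch with made-up codes; the variable names `codes`, `labels` and `dummies` are illustrative only.

```python
import pandas as pd

# Integer codes and the category names they index into (toy values)
codes = [0, 1, 2, 0]
names = ["setosa", "versicolor", "virginica"]

# from_codes maps every integer code to its category name
labels = pd.Categorical.from_codes(codes, names)
print(list(labels))

# get_dummies turns the categorical into one indicator column per category
dummies = pd.get_dummies(labels)
print(dummies)
```

Each row of the resulting frame has exactly one active column, the one matching that row's species.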
Checking the info of X and y respectively
Each column in our dataset has 150 non-null values.
Checking the statistical data of our predictor variables
Getting the unique values of the target variable
From the above output, we can see that we have 50 instances each of the three respective species of the plant.
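The inspection steps above can be reproduced along the following lines. This is one possible sketch, not necessarily the calls used originally; the `species` series is an assumption introduced here so that the class counts can be read directly from the species names.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Species name for every row, built from the integer target codes
species = pd.Series(iris.target_names[iris.target], name="species")

# Column types and non-null counts
X.info()

# Summary statistics of the predictor variables
summary = X.describe()
print(summary)

# Number of instances per species
class_counts = species.value_counts()
print(class_counts)
```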
Visualizing the independent feature
X['petal width (cm)'].plot.hist()
plt.show()
About 50 flowers in this dataset have a petal width between 0.1 and 0.5 cm.
Splitting the data into training and test data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
Fitting the model with the train data
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
prediction = dt.predict(X_test)
Importing the evaluation functions from sklearn.metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
print(classification_report(y_test, prediction))
As we can see, our decision tree classifier correctly classified 29/30 plants.
score = accuracy_score(y_test, prediction)
score
The accuracy of our model is roughly 96.7% (29 of the 30 test plants), which is pretty good.
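confusion_matrix is imported above but never called. The sketch below shows how it relates to the accuracy score. It makes two assumptions that differ from the code above: it uses the 1-D integer labels in iris.target rather than the dummy-encoded frame (confusion_matrix expects one label per sample), and it fixes random_state on the classifier so that repeated runs give the same tree. The names `clf`, `pred`, `cm` and `manual_acc` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Same 80/20 split as above, but on the integer class labels
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, pred, labels=[0, 1, 2])
print(cm)

# Accuracy is the share of samples on the diagonal of the matrix
acc = accuracy_score(y_test, pred)
manual_acc = np.trace(cm) / cm.sum()
print(acc, manual_acc)
```

Off-diagonal entries show which species are confused with each other, something a single accuracy number hides.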
Scikit-learn has some built-in visualization capabilities for decision trees. We might not use them often, as they require us to install Graphviz.
Graphviz is a graph visualization tool. It and its Python bindings can be installed using the commands below:
conda install graphviz
conda install python-graphviz
conda install pydot
from IPython.display import Image
from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn releases
from sklearn.tree import export_graphviz
import pydot
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(dt, out_file=None, filled=True, rounded=True,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names)
graph = graphviz.Source(dot_data)
graph
The export_graphviz function converts the decision tree classifier into DOT format, and graphviz.Source renders that DOT source in a displayable form in Jupyter.
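If installing Graphviz is not an option, scikit-learn also provides export_text, which prints the same decision rules as indented plain text. The sketch below is a standalone illustration: it fits a fresh classifier (named `clf` here) on the full Iris data with the integer labels, and fixes random_state so the printed tree is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Fit on the integer class labels; random_state makes the tree repeatable
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# One line per node: split rules for internal nodes, predicted class for leaves
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The text form carries less detail than the Graphviz chart (no gini, samples or value fields by default), but it needs no extra dependencies.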
In the decision tree chart, each internal node has a decision rule that splits the data. The gini value shown in each node is its Gini index, which measures the impurity of the node. A node is pure when all of its records belong to the same class; such nodes become leaf nodes.
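No pruning parameters (such as max_depth or ccp_alpha) were set, so this is a fully grown tree; on a dataset this small it still stays compact, explainable, and easy to understand.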
petal length (cm) <= 2.6 is the first question the decision tree asks: is the petal length less than or equal to 2.6 cm? Based on the answer, a sample follows either the True or the False branch.
gini = 0.443 is the Gini score, a metric that quantifies the purity of the node. A Gini score of 0 means that the node is pure.
samples tells us how many examples are at that node.
value is the vector of sample counts for each class at that node.
And this brings us to the end of our blog. I hope this helps you in understanding the decision trees classifier. Do leave us a comment for any query or suggestion.
Keep visiting our website for more blogs on Data Science and Data Analytics.