Decision Tree using R

A decision tree is one of the most popular supervised learning algorithms in machine learning. It can be used for both classification and regression problems.

A decision tree is built by repeatedly splitting the dataset according to conditions on its attributes, which can generally be read as if-then-else statements. The result is a tree-like graph in which the nodes represent the attributes we ask questions about, the edges represent the answers to those questions, and the leaves represent the actual outcomes.

Decision trees are applicable where there is uncertainty about which outcome will actually occur, or where the user has an objective to achieve, such as maximizing profit or optimizing costs.

As an example, suppose we have five days of data about a friend, recording whether or not he came to play under various weather conditions:

Day  Weather  Temperature  Humidity  Wind    Play
1    Sunny    Hot          High      Weak    No
2    Cloudy   Hot          High      Weak    Yes
3    Sunny    Mild         Normal    Strong  Yes
4    Cloudy   Mild         High      Strong  Yes
5    Rainy    Mild         High      Strong  No

A decision tree formed from the above table would look something like this:

In such a tree, each node represents an attribute or feature, the branches represent the outcomes of the test at that node, and the leaves are where the final decisions are made.
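The if-then-else reading of such a tree can be sketched in a few lines of base R. The rules below are one hypothetical tree that happens to agree with all five rows of the table; the helper name will_play and the choice of splits (Weather at the root, then Humidity) are assumptions made for illustration, not necessarily the tree pictured in the original post.

```r
# The five observations from the table above
play <- data.frame(
  Day         = 1:5,
  Weather     = c("Sunny", "Cloudy", "Sunny", "Cloudy", "Rainy"),
  Temperature = c("Hot", "Hot", "Mild", "Mild", "Mild"),
  Humidity    = c("High", "High", "Normal", "High", "High"),
  Wind        = c("Weak", "Weak", "Strong", "Strong", "Strong"),
  Play        = c("No", "Yes", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Hypothetical tree written as nested if-then-else rules:
# the root node asks about Weather, the Sunny branch asks about Humidity.
will_play <- function(weather, humidity) {
  if (weather == "Cloudy") {
    "Yes"
  } else if (weather == "Sunny") {
    if (humidity == "High") "No" else "Yes"
  } else {
    "No"  # Rainy
  }
}

# Apply the rules to every row and compare with the recorded outcome
predicted <- unname(mapply(will_play, play$Weather, play$Humidity))
print(data.frame(Day = play$Day, Actual = play$Play, Predicted = predicted))
```

Each if corresponds to a node, each condition to a branch, and each returned string to a leaf.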

In this blog, we will build a model using the Cardiotocography dataset. The dataset consists of fetal heart rate (FHR) and uterine contraction (UC) features measured on cardiotograms and classified by expert obstetricians. In total, 2126 fetal cardiotocograms (CTGs) were automatically processed and their diagnostic features measured. The CTGs were classified by three expert obstetricians, and each was assigned a consensus label of Normal, Suspect or Pathologic. You can get the dataset from the link below.

Dataset:  https://acadgildsite.s3.amazonaws.com/wordpress_images/r/cardiography/Cardiotocographic.csv

So let us begin our coding in R.

Loading the dataset and fetching the first few records. 

There are 22 columns in this dataset, described below:

  • LB: FHR (fetal heart rate) baseline (beats per minute)
  • AC: # of accelerations per second
  • FM: # of fetal movements per second
  • UC: # of uterine contractions per second
  • DL: # of light decelerations per second
  • DS: # of severe decelerations per second
  • DP: # of prolonged decelerations per second
  • ASTV: percentage of time with abnormal short term variability
  • MSTV: mean value of short term variability
  • ALTV: percentage of time with abnormal long term variability
  • MLTV: mean value of long term variability
  • Width: width of FHR histogram
  • Min: minimum of FHR histogram
  • Max: maximum of FHR histogram
  • Nmax: # of histogram peaks
  • Nzeros: # of histogram zeros
  • Mode: histogram mode
  • Mean: histogram mean
  • Median: histogram median
  • Variance: histogram variance
  • Tendency: histogram tendency
  • CLASS: FHR pattern class code (1 to 10)
  • Target variable:
    • NSP – fetal state class code (N=normal; S=suspect; P=pathologic)

Getting the structure or information about each variable of the dataset using the str() function.

All the variables are therefore of either integer or numeric (floating-point) type.

Getting the statistical summary of the dataset using the summary() function

Checking for null values

Hence there are no null values in the dataset.
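The loading and inspection steps above can be sketched as follows. This assumes the CSV has been downloaded next to the script under the name Cardiotocographic.csv; the variable names data and na_counts are illustrative. If the file is not found, the sketch falls back to a small synthetic stand-in with a few of the columns (not the real data) so that it still runs.

```r
csv_path <- "Cardiotocographic.csv"  # assumed local copy of the file linked above

if (file.exists(csv_path)) {
  data <- read.csv(csv_path)
} else {
  # Synthetic stand-in (NOT the real CTG data) so the sketch still runs
  set.seed(1)
  nsp  <- sample(1:3, 600, replace = TRUE, prob = c(0.78, 0.14, 0.08))
  data <- data.frame(LB  = round(rnorm(600, 120 + 10 * nsp, 6)),
                     AC  = pmax(0, rnorm(600, 0.006 - 0.002 * nsp, 0.002)),
                     FM  = pmax(0, rnorm(600, 0.005 * nsp, 0.01)),
                     NSP = nsp)
}

print(head(data))     # first six records
str(data)             # type and a short preview of every column
print(summary(data))  # min, quartiles, mean and max per column

# Count missing (NA) values per column; all zeros means nothing is missing
na_counts <- colSums(is.na(data))
print(na_counts)
```

colSums(is.na(data)) is only one way to check for missing values; sum(is.na(data)) gives a single total instead of a per-column count.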

The target variable takes three values: 1, 2 and 3. Since these are class labels rather than quantities, we convert it with the factor() function, which stores categorical data as a set of levels. Factors can hold both strings and integers.

After converting the target variable into a factor, we split the data into training and validation sets. Before doing so, we set the seed of R’s random number generator so that the random split can be reproduced.

The dataset has now been split into about 80% training data (index 1) and about 20% validation data (index 2).
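A minimal sketch of the factor conversion and the 80/20 split described above. The seed value 1234, the names pd and train, and the synthetic stand-in data are assumptions made so that the snippet runs on its own; with the real file, data would come from read.csv().

```r
# Synthetic stand-in (NOT the real CTG data) so the sketch runs on its own.
# With the real file: data <- read.csv("Cardiotocographic.csv")
set.seed(1)
nsp  <- sample(1:3, 600, replace = TRUE, prob = c(0.78, 0.14, 0.08))
data <- data.frame(LB  = round(rnorm(600, 120 + 10 * nsp, 6)),
                   AC  = pmax(0, rnorm(600, 0.006 - 0.002 * nsp, 0.002)),
                   FM  = pmax(0, rnorm(600, 0.005 * nsp, 0.01)),
                   NSP = nsp)

# Convert the target from integer codes to a factor with levels "1", "2", "3"
data$NSP <- factor(data$NSP)
print(levels(data$NSP))

# Fix the seed so the random split is reproducible (1234 is arbitrary)
set.seed(1234)

# Draw index 1 (train) or 2 (validate) for every row with probability 0.8 / 0.2
pd <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train    <- data[pd == 1, ]
validate <- data[pd == 2, ]

print(c(train = nrow(train), validate = nrow(validate)))
```

Because each row is assigned independently, the split is close to 80/20 but not exact.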

We will now load the ‘party’ package.

The R package ‘party’ is used to create decision trees.

It provides the function ctree(), which is used to create and analyze decision trees.

Here the independent variables are LB, AC and FM, and the dependent variable is NSP.

Using the plot() function to plot the decision tree graph. 

Here we can see that the nodes represent the independent variables, the branches refer to the values being compared, and the leaves represent the target variable with its 3 levels.

We will now make predictions on the ‘validate’ data with the predict() function, using the ‘tree’ model built with the ‘party’ package.
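The ‘party’ steps above (fit, plot, predict) can be sketched as follows. The names tree and validate come from the text; train, pd, the seed and the synthetic stand-in data are assumptions so that the snippet runs on its own, and the fitted tree will differ from the one obtained on the real dataset.

```r
library(party)  # install.packages("party") if it is not installed

# Synthetic stand-in (NOT the real CTG data) so the sketch runs on its own.
# With the real file: data <- read.csv("Cardiotocographic.csv")
set.seed(1)
nsp  <- sample(1:3, 600, replace = TRUE, prob = c(0.78, 0.14, 0.08))
data <- data.frame(LB  = round(rnorm(600, 120 + 10 * nsp, 6)),
                   AC  = pmax(0, rnorm(600, 0.006 - 0.002 * nsp, 0.002)),
                   FM  = pmax(0, rnorm(600, 0.005 * nsp, 0.01)),
                   NSP = factor(nsp))

# 80/20 split into training and validation data (arbitrary seed)
set.seed(1234)
pd <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train    <- data[pd == 1, ]
validate <- data[pd == 2, ]

# Fit a conditional inference tree: NSP explained by LB, AC and FM
tree <- ctree(NSP ~ LB + AC + FM, data = train)
print(tree)

# Draw the tree: inner nodes show the split variable, leaves the class distribution
plot(tree)

# Predicted class for every row of the validation data
pred_class <- predict(tree, newdata = validate)
print(head(pred_class))
```

Passing type = "prob" to predict() returns the class probabilities for each row instead of the predicted class.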

Decision tree using the package ‘rpart’.

The name rpart stands for Recursive Partitioning and Regression Trees. The models it produces can be represented as binary trees.

Here the binary tree has been created using the training dataset.

We then plot the tree with rpart.plot(), setting the argument ‘extra = 1’, which displays the number of observations that fall in each node (per class for class objects; prefixed by the number of events for poisson and exp models). This is similar to use.n = TRUE in text.rpart.

We plot the tree once more, this time with ‘extra = 2’. For class models, this displays the classification rate at the node, expressed as the number of correct classifications and the number of observations in the node. For poisson and exp models, it displays the number of events.

The tree will look something like the one below.

We will again make predictions on the ‘validate’ data with the predict() function, this time using the ‘tree1’ model built with the ‘rpart’ library.
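The ‘rpart’ steps above can be sketched as follows. The text does not show the exact call, so the formula (the same three predictors as before), the use of the rpart.plot package for drawing, and the names train and pred1 are assumptions; the synthetic stand-in data only makes the snippet runnable on its own.

```r
library(rpart)
library(rpart.plot)  # provides rpart.plot() and its 'extra' argument

# Synthetic stand-in (NOT the real CTG data) so the sketch runs on its own.
# With the real file: data <- read.csv("Cardiotocographic.csv")
set.seed(1)
nsp  <- sample(1:3, 600, replace = TRUE, prob = c(0.78, 0.14, 0.08))
data <- data.frame(LB  = round(rnorm(600, 120 + 10 * nsp, 6)),
                   AC  = pmax(0, rnorm(600, 0.006 - 0.002 * nsp, 0.002)),
                   FM  = pmax(0, rnorm(600, 0.005 * nsp, 0.01)),
                   NSP = factor(nsp))

# 80/20 split into training and validation data (arbitrary seed)
set.seed(1234)
pd <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train    <- data[pd == 1, ]
validate <- data[pd == 2, ]

# A factor response makes rpart fit a classification tree
tree1 <- rpart(NSP ~ LB + AC + FM, data = train)

# extra = 1: number of observations per class in each node
rpart.plot(tree1, extra = 1)

# extra = 2: correct classifications / observations in each node
rpart.plot(tree1, extra = 2)

# Predicted class for every row of the validation data
pred1 <- predict(tree1, newdata = validate, type = "class")
print(head(pred1))
```

Without type = "class", predict() on a classification tree returns a matrix of class probabilities rather than labels.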

Creating the confusion matrix using the table() function. The table() function is also useful for building frequency tables and cross-tabulations.

Here the values 1, 2 and 3 are the three levels of the target variable NSP, representing Normal, Suspect and Pathologic respectively.

Computing the accuracy as the proportion of correct predictions (the diagonal of the confusion matrix) over the sum of all entries in the matrix.

Calculating the misclassification error on the training data.

Again creating the confusion matrix using the validation dataset.

Calculating the  misclassification error for validation dataset.
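The confusion matrix, accuracy and misclassification error for both datasets can be sketched as follows. The text does not say which model these figures were computed for, so this sketch assumes an rpart tree on the same three predictors; all variable names other than validate are illustrative, and the synthetic stand-in data means the printed numbers will not match those reported for the real dataset.

```r
library(rpart)

# Synthetic stand-in (NOT the real CTG data) so the sketch runs on its own.
# With the real file: data <- read.csv("Cardiotocographic.csv")
set.seed(1)
nsp  <- sample(1:3, 600, replace = TRUE, prob = c(0.78, 0.14, 0.08))
data <- data.frame(LB  = round(rnorm(600, 120 + 10 * nsp, 6)),
                   AC  = pmax(0, rnorm(600, 0.006 - 0.002 * nsp, 0.002)),
                   FM  = pmax(0, rnorm(600, 0.005 * nsp, 0.01)),
                   NSP = factor(nsp))

# 80/20 split into training and validation data (arbitrary seed)
set.seed(1234)
pd <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train    <- data[pd == 1, ]
validate <- data[pd == 2, ]

tree1 <- rpart(NSP ~ LB + AC + FM, data = train)

# Confusion matrix on the training data: rows = predicted, columns = actual
tab_train <- table(Predicted = predict(tree1, train, type = "class"),
                   Actual    = train$NSP)
print(tab_train)

# Accuracy = correct predictions (diagonal) / all predictions
acc_train <- sum(diag(tab_train)) / sum(tab_train)
err_train <- 1 - acc_train  # misclassification error

# Same computation on the validation data
tab_valid <- table(Predicted = predict(tree1, validate, type = "class"),
                   Actual    = validate$NSP)
print(tab_valid)
err_valid <- 1 - sum(diag(tab_valid)) / sum(tab_valid)

print(c(train_error = err_train, validation_error = err_valid))
```

Comparing the two errors gives a rough idea of overfitting: a validation error much higher than the training error suggests the tree has fit noise in the training data.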

Here you can see that the misclassification error for the training data is 0.19, whereas for the validation data it is 0.21.

This brings us to the end of this blog. We hope you found it helpful.


Badal Kumar

Data Analyst at Aeon Learning
