In our previous blog, we explained the concept of a decision tree with the help of the Cardiotocography dataset.
In this blog we will apply the decision tree algorithm to the ‘Risk Factors associated with Low Infant Birth Weight’ dataset. The data were collected at Baystate Medical Center, Springfield, Mass., during 1986. The objective of this dataset is to assess the factors associated with low birth weight babies at Baystate Medical Center. Low birth weight is defined as an infant born with a weight of less than 2500 g. It is one of the major public health problems worldwide.
Therefore, we will predict whether an infant is born weighing less than 2.5 kg or not, based on the predictor (independent) variables.
Before moving further, I would suggest that our readers go through the previous post to understand the concepts better.
So let us begin our coding in R.
We will import the necessary libraries first
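A minimal version of this step is sketched below. Both packages normally ship with a standard R installation, so no separate install step is assumed.

```r
# MASS provides the birthwt data set; rpart provides recursive partitioning trees.
library(MASS)
library(rpart)
```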
The packages MASS and rpart have been imported.
The MASS package provides the ‘birthwt’ dataset, and rpart is used to create the decision tree for that dataset.
We will now load the data and fetch the first few records.
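Loading and inspecting the data could look like the sketch below. The `dim()` call is our addition, included only to confirm the size of the data frame.

```r
library(MASS)

data("birthwt")   # copy the data frame into the workspace
head(birthwt)     # first six records
dim(birthwt)      # number of rows and columns
```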
There are 189 rows and 10 columns, namely:
- Age: mother’s age in years.
- Lwt: mother’s weight in pounds at last menstrual period.
- Race: mother’s race (1 = white, 2 = black, 3 = other).
- Smoke: smoking status during pregnancy.
- Ptl: number of previous premature labours.
- Ht: history of hypertension.
- Ui: presence of uterine irritability.
- Ftv: number of physician visits during the first trimester.
- Bwt: birth weight in grams.
And target variable:
- Low: indicator of birth weight less than 2.5 kg.
Checking the percentage of unique values in each variable
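The call used for this check is not shown here, so the following is only one way to get such percentages, using base R. We assume "percentage of unique values" means the number of distinct values in a column divided by the number of rows, times 100.

```r
library(MASS)
data("birthwt")

# Percentage of distinct values per column: 100 * (distinct values) / (rows).
pct_unique <- sapply(birthwt, function(x) 100 * length(unique(x)) / length(x))
round(pct_unique, 1)
```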
Here the value for the feature low shows that its 2 levels, ‘0’ and ‘1’, amount to 1.1% of the 189 rows (2/189), so 1.1 is the percentage of unique values for this feature.
Likewise, for the column race, with its 3 levels ‘1’, ‘2’ and ‘3’, the percentage of unique values is 1.6 (3/189).
Converting all the categorical variables into factors.
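A possible version of the conversion is sketched below. Which columns to treat as categorical is our assumption: `cat_cols` is an illustrative name, and `ptl` and `ftv` are included because the tree is later read in terms of sets of `ptl` values.

```r
library(MASS)
data("birthwt")

# Columns assumed to be categorical (illustrative choice, adjust as needed).
cat_cols <- c("low", "race", "smoke", "ptl", "ht", "ui", "ftv")
birthwt[cat_cols] <- lapply(birthwt[cat_cols], factor)

str(birthwt)  # the listed columns now show up as factors
```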
Here, the variables that take a small set of discrete levels have been converted into factors.
Checking for null values, if any
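One simple way to do this in base R is to count the missing values per column, as in this sketch:

```r
library(MASS)
data("birthwt")

na_counts <- colSums(is.na(birthwt))  # number of NA values in each column
na_counts
anyNA(birthwt)                        # TRUE only if at least one value is missing
```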
Getting the summary of the dataset using the summary() function.
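A sketch of this step follows. The target is converted to a factor first so that the summary reports class counts rather than a mean. The `table()` call and the name `low_counts` are our additions.

```r
library(MASS)
data("birthwt")
birthwt$low <- factor(birthwt$low)

summary(birthwt)                  # per-column summary; factors are shown as counts
low_counts <- table(birthwt$low)  # class counts of the target on their own
low_counts
```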
In the target variable we can see 59 values of ‘1’, which is the number of infants born with a weight of less than 2.5 kg.
Splitting the data into training and test datasets.
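The splitting helper used for this post is not shown, so the sketch below uses plain base R instead: it samples 80% of the rows within each class of `low`. The seed and the names `train_idx`, `train` and `test` are arbitrary choices of ours.

```r
library(MASS)
data("birthwt")
birthwt$low <- factor(birthwt$low)

set.seed(123)  # arbitrary seed, only for reproducibility

# Sample 80% of the row indices separately within each class of `low`.
train_idx <- unlist(
  lapply(split(seq_len(nrow(birthwt)), birthwt$low),
         function(idx) sample(idx, round(0.8 * length(idx)))),
  use.names = FALSE
)

train <- birthwt[train_idx, ]
test  <- birthwt[-train_idx, ]

table(train$low)
table(test$low)
```

Sampling within each class keeps the share of low birth weight cases roughly the same in both sets.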
We have split our data into training and test sets in the ratio 80:20 according to our target variable, i.e., ‘birthwt$low’.
Let's fit the decision tree model.
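A self-contained sketch of the fitting step is given below. Two things in it are our assumptions rather than something stated in this post: `bwt` is dropped from the predictors because `low` is derived directly from it, and the split is the base R one with an arbitrary seed.

```r
library(MASS)
library(rpart)

data("birthwt")
cat_cols <- c("low", "race", "smoke", "ptl", "ht", "ui", "ftv")  # assumed categorical
birthwt[cat_cols] <- lapply(birthwt[cat_cols], factor)
model_df <- birthwt[, names(birthwt) != "bwt"]  # assumption: exclude bwt, it defines low

set.seed(123)  # arbitrary seed
train_idx <- unlist(
  lapply(split(seq_len(nrow(model_df)), model_df$low),
         function(idx) sample(idx, round(0.8 * length(idx)))),
  use.names = FALSE
)
train <- model_df[train_idx, ]
test  <- model_df[-train_idx, ]

# Classification tree on all remaining predictors.
fit <- rpart(low ~ ., data = train, method = "class")
print(fit)

# plot() fails on a tree that is only a root node, so guard the call.
if (nrow(fit$frame) > 1) {
  plot(fit, margin = 0.1)
  text(fit, use.n = TRUE, cex = 0.8)
}
```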
We have fitted the model to our training data using the rpart function and plotted the tree.
Again, visualizing the same decision tree using the rpart function
From the above graphs we can infer that if the value of ptl is 0, 2 or 3, we get a value of 46% corresponding to infants weighing less than 2.5 kg based on ‘race’, and 33% based on ‘lwt’.
Also, if the value of ptl is not 0, 2 or 3, we get a value of 12% corresponding to infants weighing more than 2.5 kg.
Making predictions using the test data.
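The prediction step can be sketched as follows. The setup repeats our assumed preprocessing (factor conversion, `bwt` excluded, base R split with an arbitrary seed), so the counts it produces will not necessarily match the ones reported in this post.

```r
library(MASS)
library(rpart)

data("birthwt")
cat_cols <- c("low", "race", "smoke", "ptl", "ht", "ui", "ftv")  # assumed categorical
birthwt[cat_cols] <- lapply(birthwt[cat_cols], factor)
model_df <- birthwt[, names(birthwt) != "bwt"]  # assumption: exclude bwt

set.seed(123)  # arbitrary seed
train_idx <- unlist(
  lapply(split(seq_len(nrow(model_df)), model_df$low),
         function(idx) sample(idx, round(0.8 * length(idx)))),
  use.names = FALSE
)
train <- model_df[train_idx, ]
test  <- model_df[-train_idx, ]
fit   <- rpart(low ~ ., data = train, method = "class")

# Predicted class labels for the held-out rows.
pred <- predict(fit, newdata = test, type = "class")

# Confusion matrix: rows are the actual classes, columns the predicted ones.
conf_mat <- table(Actual = test$low, Predicted = pred)
conf_mat
```

The diagonal of `conf_mat` holds the correctly classified records for class 0 and class 1.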
Hence our model has correctly predicted 23 records of class 0, i.e. infants weighing 2.5 kg or more, and 9 records of class 1, i.e. infants weighing less than 2.5 kg (low = 1 is the indicator of a birth weight below 2.5 kg).
Evaluating the accuracy
Our model has an accuracy of 84%.
Calculating the misclassification error
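Both figures come straight from the confusion matrix: accuracy is the share of correct predictions, and the misclassification error is one minus that. With the counts reported above (23 + 9 = 32 correct), an accuracy of 84% is consistent with 32/38, assuming a 38-row test set, which is what 20% of 189 rows rounds to. The error would then be about 16%. The sketch below repeats our assumed setup, so its numbers may differ.

```r
library(MASS)
library(rpart)

data("birthwt")
cat_cols <- c("low", "race", "smoke", "ptl", "ht", "ui", "ftv")  # assumed categorical
birthwt[cat_cols] <- lapply(birthwt[cat_cols], factor)
model_df <- birthwt[, names(birthwt) != "bwt"]  # assumption: exclude bwt

set.seed(123)  # arbitrary seed
train_idx <- unlist(
  lapply(split(seq_len(nrow(model_df)), model_df$low),
         function(idx) sample(idx, round(0.8 * length(idx)))),
  use.names = FALSE
)
train <- model_df[train_idx, ]
test  <- model_df[-train_idx, ]
fit   <- rpart(low ~ ., data = train, method = "class")

pred     <- predict(fit, newdata = test, type = "class")
conf_mat <- table(Actual = test$low, Predicted = pred)

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)  # correct predictions / all predictions
misclass <- 1 - accuracy                         # misclassification error

accuracy
misclass
```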
We will now calculate and plot the ROC-AUC curve
We use the AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve whenever we want to check or visualize the performance of a classification problem, here a binary one.
It is one of the most important evaluation metrics for checking any classification model’s performance.
Now that our ROC curve has been built, we will calculate the area under the ROC curve (AUC).
The higher the AUC, the better the model is at distinguishing between the two classes. Our model has an AUC value of 77%, which is quite good, given that an AUC of 50% corresponds to random guessing.
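The ROC and AUC steps can be sketched in base R, without a dedicated ROC package (our substitution, since the package used for this post is not shown). The sketch sweeps a threshold over the predicted probability of class 1, records the true and false positive rates, and integrates the curve with the trapezoidal rule. It repeats our assumed setup, so the AUC it prints may differ from the value reported above.

```r
library(MASS)
library(rpart)

data("birthwt")
cat_cols <- c("low", "race", "smoke", "ptl", "ht", "ui", "ftv")  # assumed categorical
birthwt[cat_cols] <- lapply(birthwt[cat_cols], factor)
model_df <- birthwt[, names(birthwt) != "bwt"]  # assumption: exclude bwt

set.seed(123)  # arbitrary seed
train_idx <- unlist(
  lapply(split(seq_len(nrow(model_df)), model_df$low),
         function(idx) sample(idx, round(0.8 * length(idx)))),
  use.names = FALSE
)
train <- model_df[train_idx, ]
test  <- model_df[-train_idx, ]
fit   <- rpart(low ~ ., data = train, method = "class")

# Predicted probability of class "1" (low birth weight) for each test row.
prob   <- predict(fit, newdata = test, type = "prob")[, "1"]
actual <- test$low == "1"

# Sweep thresholds from strictest (nothing predicted positive) to loosest.
thresholds <- c(Inf, sort(unique(prob), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(prob[actual] >= t))   # true positive rate
fpr <- sapply(thresholds, function(t) mean(prob[!actual] >= t))  # false positive rate

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate",
     main = "ROC curve")
abline(0, 1, lty = 2)  # reference line for random guessing

# Area under the ROC curve by the trapezoidal rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```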
And this brings us to the end of this blog. We hope you found it helpful.
Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies.