Random forest is one of the most widely used machine learning algorithms for regression and classification problems.
The core idea behind the random forest algorithm is to build multiple decision trees, each on a random subset of the original data, and then aggregate their results. The aggregated result of many predictors of varying depth usually gives a better prediction than the best individual predictor.
This group of decision trees, or predictors, is called an ensemble, and the technique is called ensemble learning.
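To see why aggregation helps, here is a small toy simulation in base R. It is only a sketch under strong assumptions: a two-class problem, predictors that are each right 60% of the time, and predictors that make their mistakes independently (real trees in a forest are correlated, so the gain is smaller in practice). All the numbers in it are made up for illustration.

```r
# Toy simulation: many weak, independent predictors vs. their majority vote.
set.seed(42)
n_obs    <- 2000   # number of observations (hypothetical)
n_voters <- 101    # number of predictors in the ensemble (hypothetical)
p_right  <- 0.6    # chance that one predictor is right on one observation

# correct[i, j] is TRUE when predictor j classifies observation i correctly
correct <- matrix(runif(n_obs * n_voters) < p_right, nrow = n_obs)

# Accuracy of each individual predictor
individual_acc <- colMeans(correct)

# In a two-class problem the majority vote is right
# whenever more than half of the predictors are right
ensemble_acc <- mean(rowMeans(correct) > 0.5)

cat("best individual accuracy:", max(individual_acc), "\n")
cat("majority-vote accuracy:  ", ensemble_acc, "\n")
```

The majority vote comes out well above the best single predictor, which is the effect a random forest relies on.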
In our previous blog we explained how decision trees work with the help of the cardiography dataset. Before proceeding further, we recommend that readers go through that blog to better understand decision trees and the dataset.
In this blog we will use the same cardiography dataset and build a random forest model to predict the NSP category a patient belongs to, and then check the accuracy of those predictions.
You can download the dataset from the link below:
So let us begin coding in R.
Loading the data and fetching the first few records.
Getting the structure of the dataset using the str() function
We will now use the as.factor() function to convert the target variable ‘NSP’ into a factor. A factor is the R data type for categorical data: it stores the distinct categories as levels, and the underlying values can be either strings or integers.
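As a quick illustration of what as.factor() does, here is a minimal sketch on a made-up vector. The values below are not rows from the cardiography dataset; they are just a hypothetical stand-in for the NSP column.

```r
# Made-up class codes standing in for the NSP column
nsp <- c(1, 2, 1, 3, 1, 2)

# Convert the numeric codes into a categorical variable
nsp_factor <- as.factor(nsp)

print(is.factor(nsp_factor))   # the vector is now a factor
print(levels(nsp_factor))      # the distinct categories, stored as levels
print(table(nsp_factor))       # how often each level occurs
```

Once the target is a factor, randomForest treats the problem as classification rather than regression.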
Summarizing the statistical figures using the summary() function.
Fetching the occurrence/frequency of each class present in the Target variable.
Level 1, that is, the ‘Normal’ state, occurs the greatest number of times.
Splitting the dataset into training and test data
Our data has now been split into training and test data in the ratio of 70:30.
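A 70:30 split can be sketched in base R as follows. This is one common way to do it and not necessarily the exact code used for the results in this post. Since the cardiography file is not bundled with R, the sketch uses the built-in iris data as a stand-in, and the seed value is arbitrary.

```r
set.seed(123)                      # arbitrary seed, for reproducibility
data(iris)                         # stand-in for the cardiography data

n <- nrow(iris)

# Pick 70% of the row numbers at random for training
train_idx <- sample(n, size = round(0.7 * n))

train <- iris[train_idx, ]         # 70% of the rows
test  <- iris[-train_idx, ]        # the remaining 30%

cat("training rows:", nrow(train), " test rows:", nrow(test), "\n")
```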
Applying Random Forest algorithm to build the model
To apply the random forest algorithm, we first import the ‘randomForest’ library.
We will then fit the model on training data
Here we can see that the OOB (out-of-bag) error rate is 5.84%.
Each tree is trained on a random sample drawn with replacement from the training data. The OOB data for a tree are the observations that were left out of that sample.
These samples are known as bootstrap samples, and the prediction error measured on the observations that are not in a tree's bootstrap sample is the OOB error rate.
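The bootstrap idea above can be sketched in a few lines of base R. The sketch draws one bootstrap sample of row numbers and checks which rows were never picked. The sample size of 1000 is made up; with sampling with replacement, roughly 1/e (about 37%) of the rows are expected to end up out of bag.

```r
set.seed(1)
n <- 1000                                     # hypothetical number of training rows

# One bootstrap sample: n row numbers drawn with replacement
in_bag <- sample(n, size = n, replace = TRUE)

# Rows that were never drawn are out of bag for this tree
oob <- setdiff(seq_len(n), in_bag)

oob_fraction <- length(oob) / n
cat("fraction of rows out of bag:", oob_fraction, "\n")
```

Because every tree has its own OOB rows, the forest can estimate its error without a separate validation set.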
Summarizing the attributes of random forest
Creating the confusion matrix
From the above result we can see that 1175, 144 and 115 observations have been correctly classified into classes 1, 2 and 3 respectively.
Class 2, the Suspect state, has the highest error rate at 28%, and class 1 has the lowest.
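To make the per-class error concrete, here is a small base R sketch that computes it from a confusion matrix. The counts are toy values chosen by hand for illustration; they are not the counts from our model.

```r
# Toy confusion matrix: rows are actual classes, columns are predicted classes
cm <- matrix(c(90,  8,  2,
               10, 36,  4,
                1,  2, 27),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = c("1", "2", "3"),
                             predicted = c("1", "2", "3")))

# Correct predictions sit on the diagonal.
# Class error = share of each actual class that was misclassified
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall accuracy = correct predictions / all predictions
accuracy <- sum(diag(cm)) / sum(cm)

print(round(class_error, 2))
cat("overall accuracy:", accuracy, "\n")
```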
Plotting graph for error rate
From the graph we can see that the error lines become roughly constant from about 300 trees onwards, so we will set ntree to 300.
ntree is the number of trees grown in the random forest. By default, ntree is 500.
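The effect of ntree can be checked directly on a fitted model. The sketch below assumes the randomForest package is installed and uses the built-in iris data as a stand-in for the cardiography data, so the error values it prints will differ from the ones quoted in this post.

```r
library(randomForest)

set.seed(222)                      # arbitrary seed
data(iris)                         # stand-in dataset

# Default forest and a forest with a fixed number of trees
rf_default <- randomForest(Species ~ ., data = iris)
rf_300     <- randomForest(Species ~ ., data = iris, ntree = 300)

cat("trees in default forest:", rf_default$ntree, "\n")
cat("trees in second forest: ", rf_300$ntree, "\n")

# err.rate has one row per tree; the "OOB" column holds the OOB error
# of the forest built from the first 1, 2, ..., ntree trees
oob_curve <- rf_300$err.rate[, "OOB"]
cat("OOB error after 50, 150 and 300 trees:", oob_curve[c(50, 150, 300)], "\n")
```

Calling plot() on the fitted model draws these error curves, which is how the point where the error flattens out can be read off.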
Tuning the random forest model for better accuracy.
Here the OOB error rate is lowest when mtry is equal to 8.
mtry is the number of variables randomly sampled as candidates for splitting at each tree node.
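One simple way to compare mtry values is to fit a forest for each candidate and read off the final OOB error. The sketch below does this with a plain loop; it is a hand-rolled alternative to the tuneRF() helper in the randomForest package, not the tuning code behind the results in this post. It again uses iris as a stand-in, which has only 4 predictors, so the candidate values are 1 to 4.

```r
library(randomForest)

set.seed(222)                      # arbitrary seed
data(iris)                         # stand-in dataset with 4 predictors

mtry_values <- 1:4                 # candidate values to compare

# Fit one forest per candidate and keep the OOB error after the last tree
oob_error <- sapply(mtry_values, function(m) {
  rf <- randomForest(Species ~ ., data = iris, ntree = 300, mtry = m)
  rf$err.rate[rf$ntree, "OOB"]
})
names(oob_error) <- mtry_values

print(oob_error)

# Candidate with the lowest OOB error (first one in case of ties)
best_mtry <- mtry_values[which.min(oob_error)]
cat("mtry with the lowest OOB error:", best_mtry, "\n")
```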
Fitting the random forest model on the training data again after tuning, with ntree = 300 and mtry = 8
After tuning, the OOB error rate has decreased slightly to 5.58%, which corresponds to an OOB accuracy of about 94.42%.
Checking for the number of nodes for the trees
The most frequent number of nodes per tree falls in the range 75-85.
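Tree sizes can be read from a fitted forest with treesize(). The sketch below assumes the randomForest package and uses iris as a stand-in, so its trees are much smaller than the ones in our model. Note that treesize() counts terminal nodes by default.

```r
library(randomForest)

set.seed(222)                      # arbitrary seed
data(iris)                         # stand-in dataset
rf <- randomForest(Species ~ ., data = iris, ntree = 300)

# One value per tree: the number of terminal nodes in that tree
sizes <- treesize(rf)

print(summary(sizes))              # range and typical size
print(table(sizes))                # how many trees have each size
```

Passing the same vector to hist() gives the histogram of tree sizes.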
Checking the model performance based on variables
Graph 1 (mean decrease in accuracy) shows how much worse the model performs without each variable.
Graph 2 (mean decrease in Gini) shows how much each variable contributes to the purity of the nodes at the end of the trees.
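The numbers behind these two graphs can be printed as a table. The sketch below assumes the randomForest package and uses iris as a stand-in dataset. The forest has to be fitted with importance = TRUE for the accuracy-based measure to be available.

```r
library(randomForest)

set.seed(222)                      # arbitrary seed
data(iris)                         # stand-in dataset

# importance = TRUE asks the forest to also compute the
# permutation-based (accuracy) importance
rf <- randomForest(Species ~ ., data = iris, ntree = 300, importance = TRUE)

imp <- importance(rf)              # one row per predictor

# The two measures shown in the graphs
print(imp[, c("MeanDecreaseAccuracy", "MeanDecreaseGini")])
```

Higher values mean the variable matters more; varImpPlot() draws the same two measures as dot charts.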
Quantifying the values of each predictor variable against the target variable in our dataset
Finding out which predictor variables are actually used in the random forest
Creating a partial dependence plot for the variable ASTV (i.e., percentage of time with abnormal short term variability) when the class value of the target variable (NSP) = 1
The model therefore tends to predict class 1 for NSP when the value of ASTV is less than 60.
Creating a partial dependence plot for the variable ASTV (i.e., percentage of time with abnormal short term variability) when the class value of the target variable (NSP) = 2
The model therefore tends to predict class 2 for NSP when the value of ASTV is between 50 and 70, although at these values of ASTV it is difficult to tell whether the patient is Suspect or not.
Creating a partial dependence plot for the variable ASTV (i.e., percentage of time with abnormal short term variability) when the class value of the target variable (NSP) = 3
The model therefore tends to predict class 3 for NSP when the value of ASTV is greater than 60.
Extracting the information of a single tree from the forest.
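A single tree can be pulled out of a fitted forest with getTree(). The sketch below assumes the randomForest package and uses iris as a stand-in, so the split variables it shows are iris columns and not cardiography variables.

```r
library(randomForest)

set.seed(222)                      # arbitrary seed
data(iris)                         # stand-in dataset
rf <- randomForest(Species ~ ., data = iris, ntree = 300)

# k selects the tree; labelVar = TRUE shows variable and class names
# instead of their numeric codes
tree1 <- getTree(rf, k = 1, labelVar = TRUE)

# Each row is a node: its two child nodes, the variable and value it
# splits on, whether it is terminal, and the class it predicts
print(colnames(tree1))
print(head(tree1))
```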
Plotting the multidimensional scaling plot of the proximity matrix for the target variable of the training data.
The data points for class 1 of NSP, shown in red, are more scattered than those for class 2, shown in blue, which are much less scattered, and those for class 3, shown in green, which are hardly scattered at all.
Getting the actual values
We can see that the actual and predicted values are similar.
We will now create the confusion matrix and check the accuracy on the training data
Creating the confusion matrix and checking the accuracy on the test data
We obtained an accuracy of 94.86% on our test data, with a 95% confidence interval of 92%-96%.
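The test-set evaluation can be sketched end to end as follows. The sketch assumes the randomForest package and uses iris as a stand-in for the cardiography data, so its accuracy will not match the figure above. For the interval it uses the exact binomial interval from base R's binom.test(); this is an assumption about how such an interval can be obtained, not necessarily the function that produced the interval quoted in this post.

```r
library(randomForest)

set.seed(123)                      # arbitrary seed
data(iris)                         # stand-in dataset

# 70:30 split into training and test rows
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit on the training data, predict on the unseen test data
rf   <- randomForest(Species ~ ., data = train, ntree = 300)
pred <- predict(rf, newdata = test)

# Confusion matrix: predicted class vs. actual class
cm <- table(Predicted = pred, Actual = test$Species)
print(cm)

# Accuracy and an exact 95% binomial confidence interval for it
correct  <- sum(diag(cm))
total    <- sum(cm)
accuracy <- correct / total
ci       <- binom.test(correct, total, conf.level = 0.95)$conf.int

cat("test accuracy:", round(accuracy, 4), "\n")
cat("95% CI:", round(ci[1], 4), "to", round(ci[2], 4), "\n")
```

Running the same steps with newdata = train gives the training accuracy, which is usually higher than the test accuracy.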
We hope this post has been helpful in understanding random forest. If you have any queries, feel free to comment below and we will get back to you at the earliest.