Data Science and Artificial Intelligence

Random Forest

This entry is part 17 of 17 in the series Machine Learning Algorithms

Introduction:

This blog deals with the ensemble machine learning algorithm called Random Forest. By the end of this blog, beginners will understand the fundamental concepts behind a Random Forest and be able to build their first Random Forest model.

Random Forest is a tree-based algorithm that builds several decision trees and combines their outputs to improve the model's ability to generalize.


Ensemble methods are supervised learning models which combine the predictions of multiple smaller models to improve predictive power and generalization.

Say you want to buy a car, but you are uncertain of its quality. You ask 20 people who have previously bought cars; 12 of them say, “The car is excellent.” Since the majority is in favor, you decide to go for it. Ensemble methods in machine learning work in much the same way.

The smaller models that combine to make the ensemble model are referred to as base models. Ensemble methods often result in a considerably higher performance than any of the individual base models can achieve.
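As a minimal sketch of the voting idea (illustrative only, with hypothetical labels), the ensemble prediction for a single sample can be taken as the most common vote of the base models:

import numpy as np

# Hypothetical votes from three base models for one sample: 1 = "buy", 0 = "don't buy"
base_predictions = np.array([1, 0, 1])

# The ensemble prediction is the majority vote
ensemble_prediction = np.argmax(np.bincount(base_predictions))
print(ensemble_prediction)   # 1 -> the majority says "buy"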

Bias-Variance Tradeoff

Bias and variance are two sources of error that prevent models from generalizing beyond the training set. Bias comes from simplifying assumptions made by the model to make learning easier. Parametric models like linear regression and logistic regression carry strong underlying assumptions and therefore have high bias, whereas models like K-nearest neighbors and decision trees have low bias. Variance is how much the model's predictions would change if different training data were used. If a model has high variance, it has learned the specifics of the particular training data supplied to it. Models like linear and logistic regression generally have low variance, while models like decision trees have high variance. For any model, we want to reduce both the bias and the variance errors, but in general decreasing one tends to increase the other. Therefore, these two sources of error need to be balanced so that the model produces reliable results.
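To make the tradeoff concrete, here is a small, hedged sketch on a synthetic dataset (the numbers are only illustrative): it compares the spread of cross-validated accuracy for a higher-bias, lower-variance model (logistic regression) against a lower-bias, higher-variance model (a fully grown decision tree).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, purely for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

for name, clf in [('Logistic Regression', LogisticRegression(max_iter=1000)),
                  ('Decision Tree', DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(clf, X_demo, y_demo, cv=10)
    # A larger standard deviation across folds hints at higher variance
    print('{}: mean accuracy = {:.3f}, std = {:.3f}'.format(name, scores.mean(), scores.std()))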

Two popular families of ensemble methods


BAGGING

Several estimators are built independently on subsets of the data, and their predictions are averaged. The combined estimator is usually better than any single base estimator.

Bagging can reduce variance with little to no effect on bias.

ex: Random Forests


BOOSTING

Base estimators are built sequentially. Each subsequent estimator focuses on the weaknesses of the previous estimators. In other words, several weak models “team up” to produce a powerful ensemble model.

Boosting can reduce bias without incurring higher variance.

ex: Gradient Boosted Trees, AdaBoost
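A minimal, hedged sketch of boosting using scikit-learn's AdaBoostClassifier on synthetic data (separate from the car dataset used later in this post):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Shallow trees (the default base estimator) are added sequentially, each
# one re-weighting the samples that the previous trees misclassified
booster = AdaBoostClassifier(n_estimators=100, random_state=0)
booster.fit(X_tr, y_tr)
print('Test accuracy: {:.2f}'.format(booster.score(X_te, y_te)))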

Conditions for ensembles to outperform base models

For an ensemble method to perform better than a base classifier, it must meet these two criteria:

  1. Accuracy: the combination of base classifiers must outperform random guessing.
  2. Diversity: base models must not be identical in classification/regression estimates.

Bagging

The ensemble method used in this post is called bagging, which is short for bootstrap aggregating.

Bagging builds multiple base models on training data that is resampled with replacement (bootstrap samples). We train k base classifiers, each on its own bootstrap sample of the training data. Using random subsets of the data to train the base models promotes differences between them.

Random Forests, which “bag” decision trees, can achieve very high classification accuracy.
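The same recipe can be written explicitly with scikit-learn's BaggingClassifier; the sketch below (parameter values are only illustrative) bags unpruned decision trees on bootstrap samples, which is essentially a random forest without the extra per-split feature randomness:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# k base trees, each fit on its own bootstrap sample (sampling with replacement)
bagger = BaggingClassifier(DecisionTreeClassifier(),
                           n_estimators=50,
                           bootstrap=True,
                           random_state=0)
bagger.fit(X_tr, y_tr)
print('Test accuracy: {:.2f}'.format(bagger.score(X_te, y_te)))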

Bagging’s magic – the decrease of model variance

One of the biggest advantages of Random Forests is that they decrease variance without increasing bias. Essentially, you can get a better model without having to trade off bias against variance.


VARIANCE DECREASE

Base model estimates are averaged together, so the variability of model predictions (across hypothetical samples) is lower.


NO/LITTLE BIAS INCREASE

The bias remains the same as the bias of the individual base models. The model is still able to model the “true function” since the base models’ complexity is unrestricted (low bias).

We will use the implementation provided by the Python machine learning library scikit-learn to look at bagging.

Problem Statement:

To build a simple Random Forest model for prediction of car quality given other attributes about the car.

Data details

==========================================
1. Title: Car Evaluation Database
==========================================

The dataset is available at: http://archive.ics.uci.edu/ml/datasets/Car+Evaluation

2. Sources:
  (a) Creator: Marko Bohanec
  (b) Donors: Marko Bohanec([email protected])
               Blaz Zupan      ([email protected])
   (c) Date: June, 1997

3. Past Usage:

   The hierarchical decision model, from which this dataset is
   derived, was first presented in 

   M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for
   multi-attribute decision making. In 8th Intl Workshop on Expert
   Systems and their Applications, Avignon, France. pages 59-78, 1988.

   Within machine-learning, this dataset was used for the evaluation
   of HINT (Hierarchy INduction Tool), which was proved to be able to
   completely reconstruct the original hierarchical model. This,
   together with a comparison with C4.5, is presented in

   B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by
   function decomposition. ICML-97, Nashville, TN. 1997 (to appear)

4. Relevant Information Paragraph:

   Car Evaluation Database was derived from a simple hierarchical
   decision model originally developed for the demonstration of DEX
   (M. Bohanec, V. Rajkovic: Expert system for decision
   making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates
   cars according to the following concept structure:

   CAR                      car acceptability
   . PRICE                  overall price
   . . buying               buying price
   . . maint                price of the maintenance
   . TECH                   technical characteristics
   . . COMFORT              comfort
   . . . doors              number of doors
   . . . persons            capacity in terms of persons to carry
   . . . lug_boot           the size of luggage boot
   . . safety               estimated safety of the car

   Input attributes are printed in lowercase. 
5. Number of Instances: 1728 (instances completely cover the attribute space)
6. Number of Attributes: 6
7. Attribute Values:

   buying       v-high, high, med, low
   maint        v-high, high, med, low
   doors        2, 3, 4, 5-more
   persons      2, 4, more
   lug_boot     small, med, big
   safety       low, med, high

8. Missing Attribute Values: none

9. Class Distribution (number of instances per class)

   class      N          N[%]
   -----------------------------
   unacc     1210     (70.023 %) 
   acc        384     (22.222 %) 
   good        69     ( 3.993 %) 
   v-good      65     ( 3.762 %)

Tools to be used:

Numpy, pandas, scikit-learn

Python Implementation with code:

1. Import necessary libraries

Import the necessary modules from specific libraries.

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics, model_selection, preprocessing
from sklearn.ensemble import RandomForestClassifier

2. Load the data set

Use the pandas module to read the car data from the file system. Check a few records of the dataset.

data = pd.read_csv('data/car_quality/car.data',
                   names=['buying','maint','doors','persons','lug_boot','safety','class'])
data.head()

  buying maint doors persons lug_boot safety class
0 vhigh  vhigh 2     2       small    low    unacc
1 vhigh  vhigh 2     2       small    med    unacc
2 vhigh  vhigh 2     2       small    high   unacc
3 vhigh  vhigh 2     2       med      low    unacc
4 vhigh  vhigh 2     2       med      med    unacc

3. Get some information about the data set

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null object
maint       1728 non-null object
doors       1728 non-null object
persons     1728 non-null object
lug_boot    1728 non-null object
safety      1728 non-null object
class       1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB

The dataset has 1728 rows and 7 columns.

There are no missing values in the dataset.

4. Identify the target variable

data['class'],class_names = pd.factorize(data['class'])

The target variable is the class column in the data frame. Its values are strings, but the algorithm requires the variable to be encoded as integers. We can convert the string categorical values into integer codes using the factorize method of the pandas library.

Let’s check the encoded values now.

print(class_names)
print(data['class'].unique())

Index([u'unacc', u'acc', u'vgood', u'good'], dtype='object')
[0 1 2 3]

As we can see the values have been encoded into 4 different numeric labels.

5. Identify the predictor variables and encode any string variables to equivalent integer codes

data['buying'],_ = pd.factorize(data['buying'])
data['maint'],_ = pd.factorize(data['maint'])
data['doors'],_ = pd.factorize(data['doors'])
data['persons'],_ = pd.factorize(data['persons'])
data['lug_boot'],_ = pd.factorize(data['lug_boot'])
data['safety'],_ = pd.factorize(data['safety'])
data.head()

        buying maint doors persons lug_boot safety class
0       0      0     0     0        0       0      0
1       0      0     0     0        0       1      0
2       0      0     0     0        0       2      0
3       0      0     0     0        1       0      0
4       0      0     0     0        1       1      0
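An equivalent, more compact way to do the same encoding (just a sketch) is to factorize every predictor column in a loop:

# Equivalent to the column-by-column calls above
for col in ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']:
    data[col], _ = pd.factorize(data[col])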

Check the data types now:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null int64
maint       1728 non-null int64
doors       1728 non-null int64
persons     1728 non-null int64
lug_boot    1728 non-null int64
safety      1728 non-null int64
class       1728 non-null int64
dtypes: int64(7)
memory usage: 94.6 KB

Everything is now converted in integer form.

6. Select the predictor features and the target variable

X = data.iloc[:,:-1]
y = data.iloc[:,-1]

Train test split :

# split data randomly into 70% training and 30% test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=0)

Training/model fitting

# Bagging: fit a Random Forest (an ensemble of bagged decision trees) on the training data
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)

Prediction and evaluation:

# use the model to make predictions with the test data
y_pred = model.predict(X_test)
# how did our model perform?
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 19
Accuracy: 0.96

As you can see, the algorithm achieved a classification accuracy of 96% on the held-out test set; only 19 samples were misclassified.
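The forest above uses scikit-learn's default settings. As a hedged sketch, its key hyperparameters (n_estimators, max_depth, max_features) could be tuned with a cross-validated grid search; the grid values below are illustrative assumptions, not recommendations:

from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values are assumptions, not tuned recommendations
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'max_features': ['sqrt', None],
}
grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)
print('Best cross-validated accuracy: {:.2f}'.format(grid.best_score_))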

Algorithm Advantages:

  • Random Forest can be used to solve both kinds of problems: regression and classification.
  • It is capable of handling high-dimensional data sets.
  • It can be used to extract the most relevant features (see the sketch after this list).
  • Many implementations can handle missing data internally.
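For instance, a fitted forest exposes per-feature importance scores; a minimal sketch using the model trained above:

# Impurity-based feature importances from the fitted model (higher = more influential)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))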

Algorithm Disadvantages:

  • Difficult to interpret because of the many trees involved internally.
  • It tends to return erratic predictions for observations outside the range of the training data. For example, suppose the training data contains two variables, x and y, and the range of x is 30 to 70. If a test observation has x = 200, the random forest would give an unreliable prediction (a small sketch of this behaviour follows this list).
  • It can take much longer than expected to grow a large number of trees.
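A minimal sketch of the extrapolation issue on a synthetic one-variable regression (illustrative values only):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on x values between 30 and 70, where the true relationship is y = 2 * x
x_demo = np.arange(30, 71).reshape(-1, 1)
y_demo = 2 * x_demo.ravel()

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(x_demo, y_demo)

# The forest cannot predict beyond the range of targets seen in training,
# so the prediction at x = 200 stays near 2 * 70 = 140 instead of 400
print(reg.predict([[200]]))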

Visit our Machine Learning page to know more.


Abhay Kumar

Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.

