Random Forest Algorithm for Regression

This entry is part 7 of 17 in the series Machine Learning Algorithms

Introduction to Random Forest Algorithm:

The goal of this blog post is to equip beginners with the basics of the Random Forest algorithm so that they can build their first model easily.

Ensemble methods are supervised learning models which combine the predictions of multiple smaller models to improve predictive power and generalization.

The smaller models that combine to make the ensemble model are referred to as base models. Ensemble methods often result in considerably higher performance than any of the individual base models.

Two popular families of ensemble methods


BAGGING

Several estimators are built independently on subsets of the data and their predictions are averaged. The combined estimator is usually better than any single base estimator.

Bagging can reduce variance with little to no effect on bias.

ex: Random Forests


BOOSTING

Base estimators are built sequentially. Each subsequent estimator focuses on the weaknesses of the previous estimators. In essence, several weak models “team up” to produce a powerful ensemble model. 

Boosting mainly reduces bias, often without a large increase in variance.

ex: Gradient Boosted Trees, AdaBoost
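
To make the sequential idea behind boosting more concrete, here is a minimal sketch for squared error in which each shallow tree is fit to the residuals left by the trees before it. The synthetic data, the learning rate of 0.1, the tree depth and the number of stages are all assumptions chosen for illustration; this is not the exact procedure of any particular library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative data, not from this post).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

learning_rate = 0.1  # assumed shrinkage factor
n_stages = 50        # assumed number of weak learners

# Start from a constant prediction: the mean of the target.
prediction = np.full_like(y, y.mean())
mse_per_stage = [np.mean((y - prediction) ** 2)]

for _ in range(n_stages):
    residuals = y - prediction  # what the ensemble still gets wrong
    weak = DecisionTreeRegressor(max_depth=2, random_state=0)
    weak.fit(X, residuals)      # the next learner targets the residuals
    prediction += learning_rate * weak.predict(X)
    mse_per_stage.append(np.mean((y - prediction) ** 2))

print("training MSE before boosting:", mse_per_stage[0])
print("training MSE after boosting: ", mse_per_stage[-1])
```

Each stage is a weak model on its own, but the training error shrinks as the stages accumulate.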

Bagging

The ensemble method we will be using today is called bagging, which is short for bootstrap aggregating.

Bagging builds multiple base models on training data resampled with replacement. We train k base models on k different bootstrap samples of the training data. Training each base model on a different random sample makes the base models differ from one another, which is what allows averaging to reduce variance.
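
The resample-then-average procedure can be sketched by hand in a few lines. The synthetic data, the choice of k = 25 and the use of unpruned decision trees as base models are assumptions made for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative, not the Boston dataset).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

k = 25      # number of base models (assumed)
n = len(X)
base_models = []
for i in range(k):
    # Bootstrap sample: n row indices drawn with replacement.
    idx = rng.integers(0, n, size=n)
    tree = DecisionTreeRegressor(random_state=i)
    tree.fit(X[idx], y[idx])
    base_models.append(tree)

# Aggregate: average the k predictions for each new point.
X_new = np.linspace(0, 10, 5).reshape(-1, 1)
all_preds = np.stack([m.predict(X_new) for m in base_models])  # shape (k, 5)
bagged_pred = all_preds.mean(axis=0)
print(bagged_pred)
```

Because every tree sees a slightly different sample, the individual predictions disagree, and the average is smoother than any single tree.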

We can use the BaggingRegressor class to form an ensemble of regressors. One such bagging algorithm is the random forest regressor. A random forest regressor is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. By default, each sub-sample has the same size as the original input sample, and the samples are drawn with replacement if bootstrap=True (default).
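
As a small sketch of how the two classes are used side by side, the snippet below fits a BaggingRegressor wrapped around a decision tree and a RandomForestRegressor on the same synthetic data. The data from make_regression, the 80/20 split and the choice of 50 estimators are assumptions for illustration, not settings used elsewhere in this post.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Generic bagging around a decision tree, next to the dedicated forest class.
bagging = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("random forest", forest)]:
    model.fit(X_train, y_train)
    print(name, "test R^2:", round(model.score(X_test, y_test), 3))
```

Both estimators expose the same fit, predict and score methods, so swapping one for the other requires no other changes.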

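The random forest regressor uses a splitting criterion to measure the quality of a split. Supported criteria include the mean squared error, which is equivalent to variance reduction as a feature selection criterion, and the mean absolute error. In scikit-learn these are selected through the criterion parameter; the accepted strings depend on the version ("mse" and "mae" in older releases, "squared_error" and "absolute_error" in newer ones).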
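
To see why the mean squared error criterion amounts to variance reduction, here is a minimal sketch that scores two candidate splits of a toy node. The helper names mse_impurity and variance_reduction and the toy target values are hypothetical and only illustrate the idea; scikit-learn's internal implementation is more elaborate.

```python
import numpy as np

def mse_impurity(y):
    """MSE of a node that predicts the mean of its targets (i.e. their variance)."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)

def variance_reduction(y, left_mask):
    """Parent impurity minus the size-weighted impurity of the two children."""
    y = np.asarray(y, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)
    left, right = y[left_mask], y[~left_mask]
    weighted_children = (
        len(left) * mse_impurity(left) + len(right) * mse_impurity(right)
    ) / len(y)
    return mse_impurity(y) - weighted_children

y = [1, 2, 3, 10, 11, 12]
good_split = [True, True, True, False, False, False]  # separates low from high values
bad_split = [True, False, True, False, True, False]   # mixes low and high values

print("good split:", variance_reduction(y, good_split))
print("bad split: ", variance_reduction(y, bad_split))
```

The split that separates low targets from high ones removes far more variance, so a tree grown with this criterion would prefer it.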

Problem Statement:

To predict the median prices of homes located in the Boston area given other attributes of the house.

Data details

Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM   per capita crime rate by town
        - ZN   proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS   proportion of non-retail business acres per town
        - CHAS   Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX   nitric oxides concentration (parts per 10 million)
        - RM   average number of rooms per dwelling
        - AGE   proportion of owner-occupied units built prior to 1940
        - DIS   weighted distances to five Boston employment centres
        - RAD   index of accessibility to radial highways
        - TAX   full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B   1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT   % lower status of the population
        - MEDV   Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.   

Tools used:

  • Pandas
  • Numpy
  • Matplotlib
  • scikit-learn

Import necessary libraries

Import the necessary modules from specific libraries.

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.metrics import mean_squared_error

from sklearn.ensemble import RandomForestRegressor

Load the data set

Load the Boston housing data from scikit-learn's bundled datasets, then convert it to a pandas DataFrame and check the first few records.

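# #############################################################################
# Load data
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call needs an older scikit-learn release.
boston = datasets.load_boston()
print(boston.data.shape, boston.target.shape)
print(boston.feature_names)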

(506, 13) (506,)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']

data = pd.DataFrame(boston.data, columns=boston.feature_names)
data = pd.concat([data, pd.Series(boston.target, name='MEDV')], axis=1)
data.head()

     CRIM    ZN    INDUS CHAS NOX   RM    AGE   DIS    RAD    TAX     PTRATIO B      LSTAT MEDV
0    0.00632 18.0  2.31  0.0  0.538 6.575 65.2  4.0900 1.0    296.0   15.3    396.90 4.98  24.0
1    0.02731 0.0   7.07  0.0  0.469 6.421 78.9  4.9671 2.0    242.0   17.8    396.90 9.14  21.6
2    0.02729 0.0   7.07  0.0  0.469 7.185 61.1  4.9671 2.0    242.0   17.8    392.83 4.03  34.7
3    0.03237 0.0   2.18  0.0  0.458 6.998 45.8  6.0622 3.0    222.0   18.7    394.63 2.94  33.4
4    0.06905 0.0   2.18  0.0  0.458 7.147 54.2  6.0622 3.0    222.0   18.7    396.90 5.33  36.2

Select the predictor and target variables

X = data.iloc[:,:-1]
y = data.iloc[:,-1]

Train test split :

x_training_set, x_test_set, y_training_set, y_test_set = train_test_split(X,y,test_size=0.10, 
                                                                          random_state=42,
                                                                          shuffle=True)
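
As a quick check of what a 10% test split means for a dataset of this size, the sketch below applies the same train_test_split arguments to placeholder arrays with 506 rows and 13 columns. The arrays X_demo and y_demo are stand-ins invented for illustration, not the real housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the Boston data (506 rows, 13 features).
X_demo = np.arange(506 * 13).reshape(506, 13)
y_demo = np.arange(506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=42, shuffle=True
)
print(X_tr.shape, X_te.shape)  # (455, 13) (51, 13)

# Reusing the same random_state reproduces exactly the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=42, shuffle=True
)
print(np.array_equal(X_te, X_te2))  # True
```

With 506 rows, the test set holds 51 rows and the remaining 455 are used for training.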

Training/model fitting:

Fit the model to selected supervised data

n_estimators = 100  # number of trees in the forest
# Fit the regression model on the training set only
model = RandomForestRegressor(random_state=0, n_estimators=n_estimators)
model.fit(x_training_set, y_training_set)

Model parameters study :

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().
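
The definition can be checked by hand on a tiny example. The four values below are made up for illustration; the snippet computes u and v as defined above and compares the result with scikit-learn's r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values (illustrative only).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
r2_manual = 1 - u / v

print("manual R^2:  ", r2_manual)
print("r2_score R^2:", r2_score(y_true, y_pred))
```

A model that always predicts the mean of y_true gives u = v and therefore R^2 = 0, while perfect predictions give u = 0 and R^2 = 1.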

from sklearn.metrics import mean_squared_error, r2_score

# R^2 on the training set: how closely the model fits the data it was trained on
model_score = model.score(x_training_set, y_training_set)
print('Coefficient of determination R^2 of the prediction : ', model_score)

# Predict on the held-out test set
y_predicted = model.predict(x_test_set)

# The mean squared error on the test set
print("Mean squared error: %.2f" % mean_squared_error(y_test_set, y_predicted))
# R^2 on the test set: 1 is perfect prediction
print('Test Variance score: %.2f' % r2_score(y_test_set, y_predicted))

Coefficient of determination R^2 of the prediction :  0.982022598521334
Mean squared error: 7.73
Test Variance score: 0.88

Accuracy report with test data :

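Let's visualize the goodness of fit by plotting the predicted values against the actual values. The dashed line marks perfect predictions (predicted = actual), so points close to the line are well predicted.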

# Plot the test-set predictions against the actual values
fig, ax = plt.subplots()
ax.scatter(y_test_set, y_predicted, edgecolors=(0, 0, 0))
# Dashed diagonal: where predicted equals actual
ax.plot([y_test_set.min(), y_test_set.max()], [y_test_set.min(), y_test_set.max()], 'k--', lw=4)
ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')
ax.set_title("Ground Truth vs Predicted")
plt.show()

Conclusion:

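On the held-out test set the model reaches an R^2 of 0.88 and a mean squared error of 7.73. That corresponds to a root mean squared error of about 2.78, or roughly $2,800, since MEDV is measured in $1000's. The training R^2 of 0.98 is noticeably higher than the test R^2, which suggests some overfitting. Overall, we have found a model that fits the median house price reasonably well. The metrics may improve further with some preprocessing before fitting the data.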

Abhay Kumar

Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.
