# Using Decision Trees for Regression Problems

The goal of this blog post is to give beginners a basic understanding of the Decision Tree Regressor algorithm and to help them quickly build their first model.

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (a procedure for estimating an unobserved quantity) measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values.

The MSE is a measure of the quality of an estimator. It is always non-negative, and values closer to zero are better.

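For $n$ observations with actual values $y_i$ and predicted values $\hat{y}_i$, the Mean Squared Error is given by:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$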
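As a quick check of the definition, the sketch below computes the MSE by hand with NumPy, as the average of the squared differences between actual and predicted values, and compares it with scikit-learn's `mean_squared_error`. The four value pairs are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values, invented for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Average of the squared differences between actual and predicted values
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```

Both numbers agree, which confirms that the library function implements the same average of squared errors.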

#### Problem Statement

Predict the median price of homes in the Boston area, given the other attributes in the dataset.

#### Data details

```
Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive

    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential zoned land for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles river dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centers
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.
```

This is a copy of the UCI ML housing dataset.

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D., and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics …', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

The implementation below uses the following Python libraries:

• pandas
• NumPy
• Matplotlib
• scikit-learn

#### Python Implementation with Code

##### Import necessary libraries

Import the modules used in the rest of the post.

```python
import numpy as np
import pandas as pd
# Jupyter magic: render plots inline (remove this line outside a notebook)
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
```

Load the Boston housing data that ships with scikit-learn, then check its shape and feature names.

```python
# Load the bundled dataset. Note: load_boston was removed in scikit-learn 1.2,
# so this call needs an older release.
boston = datasets.load_boston()
print(boston.data.shape, boston.target.shape)
print(boston.feature_names)

# Output:
# (506, 13) (506,)
# ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
```
Combine the features and the target into a single DataFrame and look at the first few records.

```python
# Features as columns, with the target appended as the last column (MEDV)
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data = pd.concat([data, pd.Series(boston.target, name='MEDV')], axis=1)
data.head()

# Output:
#       CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MEDV
# 0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98  24.0
# 1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14  21.6
# 2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03  34.7
# 3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94  33.4
# 4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33  36.2
```

#### Select the predictor and target variables

The target variable is MEDV, which is the median value of owner-occupied homes in \$1000's. The remaining columns are the predictor variables.

```python
X = data.iloc[:, :-1]  # every column except the last (the predictors)
y = data.iloc[:, -1]   # the last column (MEDV, the target)
```

#### Train test split

The dataset is split into a training set and a test set. The model is fit on the training set, and the held-out test set (10% of the rows here) is used to evaluate how well the model performs on data it has not seen.

```python
x_training_set, x_test_set, y_training_set, y_test_set = train_test_split(
    X, y, test_size=0.10, random_state=42, shuffle=True
)
```
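To make the effect of `test_size=0.10` concrete, the sketch below applies the same call to placeholder arrays with the dataset's shape (506 rows, 13 feature columns). The array contents are dummy values, not the housing data; only the resulting sizes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the Boston features and target
X_dummy = np.arange(506 * 13).reshape(506, 13)
y_dummy = np.arange(506)

X_train, X_test, y_train, y_test = train_test_split(
    X_dummy, y_dummy, test_size=0.10, random_state=42, shuffle=True
)

print(X_train.shape, X_test.shape)  # (455, 13) (51, 13)
```

Fixing `random_state` makes the split reproducible, so the same rows land in the test set on every run.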

#### Training/model fitting

Fit a decision tree regressor, limited to a maximum depth of 5, to the training set.

```python
# Fit the regression tree on the training set only
model = DecisionTreeRegressor(max_depth=5, random_state=0)
model.fit(x_training_set, y_training_set)

# Output:
# The coefficient of determination R^2 of the prediction: 0.9179598310471841
# Mean squared error: 7.95
# Test Variance score: 0.87
```
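To see what a fitted tree is doing, it helps to look at the smallest possible case. By default, `DecisionTreeRegressor` chooses splits that minimize the squared error, and each leaf predicts the mean target of the training samples that fall into it. The sketch below fits a depth-1 tree to six made-up points (not the housing data) to show both effects.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up 1-D data with two obvious groups
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])

# A depth-1 tree makes exactly one split
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(X_toy, y_toy)

# Threshold of the root split: it falls between the two groups
split_threshold = stump.tree_.threshold[0]

# Each side of the split predicts the mean of its training targets
left_pred, right_pred = stump.predict([[2.0], [11.0]])

print(split_threshold, left_pred, right_pred)
```

The split lands midway between 3 and 10, and the two predictions are the group means (1.0 and 5.0). A deeper tree, such as the depth-5 model above, repeats this step inside each branch.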

#### Model parameters study

The coefficient of determination R^2 is defined as (1 - u/v), where u is the residual sum of squares `((y_true - y_pred) ** 2).sum()` and v is the total sum of squares `((y_true - y_true.mean()) ** 2).sum()`. The best possible score is 1.0.
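This definition can be checked directly. The sketch below computes u and v with NumPy on made-up values (not model output from this post) and compares the result with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values, invented for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
r2_manual = 1 - u / v

r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```

Note that u divided by the number of samples is exactly the MSE, so for a fixed set of targets a lower MSE always means a higher R^2.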

```python
from sklearn.metrics import mean_squared_error, r2_score

# R^2 on the training set (score() returns R^2 for regressors; 1 is a perfect fit)
model_score = model.score(x_training_set, y_training_set)
print('The coefficient of determination R^2 of the prediction:', model_score)

# Predict on the held-out test set
y_predicted = model.predict(x_test_set)

# The mean squared error on the test set
print("Mean squared error: %.2f" % mean_squared_error(y_test_set, y_predicted))
# R^2 on the test set (1 is perfect prediction)
print('Test Variance score: %.2f' % r2_score(y_test_set, y_predicted))

# Output:
# The coefficient of determination R^2 of the prediction: 0.982022598521334
# Mean squared error: 7.73
# Test Variance score: 0.88
```
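Because MSE is expressed in squared units, taking its square root (the root mean squared error, RMSE) makes it easier to interpret. Since MEDV is recorded in \$1000's, the sketch below converts the test MSE of 7.73 reported above into dollars.

```python
import math

test_mse = 7.73  # test MSE reported above, in (thousands of dollars) squared

rmse_thousands = math.sqrt(test_mse)   # back in the units of MEDV
rmse_dollars = rmse_thousands * 1000   # MEDV is in $1000's

print(f"RMSE: {rmse_thousands:.2f} (about ${rmse_dollars:,.0f})")
```

In other words, the model's predictions of the median home value are typically off by a little under three thousand dollars.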

#### Accuracy report with test data:

Let's check the goodness of fit by plotting the predicted values against the actual values for the test set. Points that lie on the dashed diagonal line are perfect predictions.

```python
# Plot predicted against actual values for the test set
fig, ax = plt.subplots()
ax.scatter(y_test_set, y_predicted, edgecolors=(0, 0, 0))
# Dashed diagonal: where the predicted value equals the actual value
ax.plot([y_test_set.min(), y_test_set.max()], [y_test_set.min(), y_test_set.max()], 'k--', lw=4)
ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')
ax.set_title("Ground Truth vs Predicted")
plt.show()
```

#### Conclusion:

On the test set the model reaches an R^2 of roughly 0.88 and an MSE just under 8, which corresponds to a typical error of about \$2,800 on the median home value. That is a reasonable fit for a first model. The training R^2 is noticeably higher than the test R^2, which suggests some overfitting, so the metrics can likely be improved further by tuning the tree or by doing some preprocessing before fitting. However, the goal of this post was to give you enough knowledge to implement your first model. You can build on the existing pipeline and report your own results.
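As one possible next step, the sketch below compares training and test R^2 across several values of `max_depth`. It uses synthetic data from `make_regression` as a stand-in, so the numbers it prints say nothing about the housing data, and the depth values are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Boston dataset)
X_syn, y_syn = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.10, random_state=42)

results = {}
for depth in [2, 5, 10, None]:  # None places no limit on the depth
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # score() returns R^2 for regressors
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_r2, test_r2) in results.items():
    print(f"max_depth={depth}: train R^2={train_r2:.3f}, test R^2={test_r2:.3f}")
```

A training score that keeps rising while the test score stalls or drops is the usual sign of overfitting, which is why the depth is worth tuning against held-out data rather than the training set.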