Multiple Linear Regression

This entry is part 14 of 21 in the series Machine Learning Algorithms

Introduction

The goal of this blog post is to equip beginners with the basics of linear regression when multiple variables are used to predict the outcome of the target variable. This is also known as Multiple Linear Regression.

A simple linear regression model has a continuous outcome and one predictor, whereas a multiple linear regression model has a continuous outcome and multiple predictors (continuous or categorical). A simple linear regression model would have the form:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

A multivariable or multiple linear regression model would take the form:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$

where y is a continuous dependent variable, x is a single predictor in the simple regression model, and x1, x2, …, xk are the predictors in the multiple regression model. Here β0 is the intercept, β1, …, βk are the coefficients to be estimated, and ε is the error term.
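As a minimal sketch of the multiple regression form, the snippet below evaluates such a model for two observations. The intercept and coefficient values are made up purely for illustration and are not fitted to any data.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration.
intercept = 1.0
coefficients = np.array([2.0, -0.5, 0.25])  # one weight per predictor (k = 3)

# Each row is one observation; the columns are the predictors x1, x2, x3.
X = np.array([
    [1.0, 2.0, 4.0],
    [0.0, 1.0, 8.0],
])

# Prediction = intercept + sum of (coefficient * predictor), for every row at once.
y_hat = intercept + X @ coefficients
print(y_hat)
```

With a single column in `X` and a single coefficient, the same expression reduces to the simple linear regression case.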

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values.
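To make the definition concrete, here is a small sketch that computes the MSE by hand and compares it with scikit-learn's `mean_squared_error`. The actual and predicted values are invented for the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Square each error, then average the squares.
errors = y_true - y_pred
mse_manual = np.mean(errors ** 2)

# The library function should agree with the manual calculation.
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)
```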

Multiple linear regression can model more complex relationships that arise from several features acting together. It should be used when no single variable is sufficient to explain the relationship between the independent variables and the dependent variable.

Let's work on a case study to understand this better.

Problem Statement

To predict the relative performance of computer hardware given other associated attributes of the hardware.

Data details

```
Computer Hardware dataset
===========================
URL : https://archive.ics.uci.edu/ml/datasets/Computer+Hardware
1. Title: Relative CPU Performance Data
2. Source Information
   -- Creators: Phillip Ein-Dor and Jacob Feldmesser
     -- Ein-Dor: Faculty of Management; Tel Aviv University; Ramat-Aviv;
        Tel Aviv, 69978; Israel
   -- Donor: David W. Aha ([email protected]) (714) 856-8779
   -- Date: October, 1987

3. Past Usage:
    1. Ein-Dor and Feldmesser (CACM 4/87, pp 308-317)
       -- Results:
          -- linear regression prediction of relative cpu performance
          -- Recorded 34% average deviation from actual values
    2. Kibler,D. & Aha,D. (1988).  Instance-Based Prediction of
       Real-Valued Attributes.  In Proceedings of the CSCSI (Canadian
       AI) Conference.
       -- Results:
          -- instance-based prediction of relative cpu performance
          -- similar results; no transformations required
    - Predicted attribute: cpu relative performance (numeric)

4. Relevant Information:
   -- The estimated relative performance values were estimated by the authors
      using a linear regression method.  See their article (pp 308-313) for
      more details on how the relative performance values were set.

5. Number of Instances: 209

6. Number of Attributes: 10 (6 predictive attributes, 2 non-predictive,
                             1 goal field, and the linear regression guess)

7. Attribute Information:
   1. vendor name: 30
      (adviser, amdahl,apollo, basf, bti, burroughs, c.r.d, cambex, cdc, dec,
       dg, formation, four-phase, gould, honeywell, hp, ibm, ipl, magnuson,
       microdata, nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry,
       sratus, wang)
   2. Model Name: many unique symbols
   3. MYCT: machine cycle time in nanoseconds (integer)
   4. MMIN: minimum main memory in kilobytes (integer)
   5. MMAX: maximum main memory in kilobytes (integer)
   6. CACH: cache memory in kilobytes (integer)
   7. CHMIN: minimum channels in units (integer)
   8. CHMAX: maximum channels in units (integer)
   9. PRP: published relative performance (integer)
  10. ERP: estimated relative performance from the original article (integer)

8. Missing Attribute Values: None

9. Class Distribution: the class value (PRP) is continuously valued.
   PRP Value Range:   Number of Instances in Range:
   0-20               31
   21-100             121
   101-200            27
   201-300            13
   301-400            7
   401-500            4
   501-600            2
   above 600          4

Summary Statistics:
           Min    Max     Mean       SD    PRP Correlation
   MCYT:    17   1500    203.8    260.3    -0.3071
   MMIN:    64  32000   2868.0   3878.7     0.7949
   MMAX:    64  64000  11796.1  11726.6     0.8630
   CACH:     0    256     25.2     40.6     0.6626
   CHMIN:    0     52      4.7      6.8     0.6089
   CHMAX:    0    176     18.2     26.0     0.6052
   PRP:      6   1150    105.6    160.8     1.0000
   ERP:     15   1238     99.3    154.8     0.9665
```

Required libraries

• pandas
• NumPy
• Matplotlib
• scikit-learn

Import necessary libraries

Import the necessary modules from specific libraries.

```
import numpy as np
import pandas as pd
# Notebook-only magic: renders plots inline in Jupyter/IPython.
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.metrics import mean_squared_error

from sklearn import linear_model
```

Use the pandas module to read the computer hardware data from the file system. Check the first few records of the dataset.

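```
names = ['VENDOR', 'MODEL_NAME', 'MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX', 'PRP', 'ERP']

# Assumption: the UCI data file has been saved locally as 'machine.data'.
# The file has no header row, so the column names are supplied here.
data = pd.read_csv('machine.data', names=names)
data.head()

   VENDOR   MODEL_NAME  MYCT  MMIN  MMAX   CACH  CHMIN  CHMAX  PRP  ERP
0  adviser  32/60       125   256   6000   256   16     128    198  199
1  amdahl   470v/7      29    8000  32000  32    8      32     269  253
2  amdahl   470v/7a     29    8000  32000  32    8      32     220  253
3  amdahl   470v/7b     29    8000  32000  32    8      32     172  253
4  amdahl   470v/7c     29    8000  16000  32    8      16     132  132
```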

Feature selection

Let's select only the numerical fields for model fitting.

```
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 10 columns):
VENDOR        209 non-null object
MODEL_NAME    209 non-null object
MYCT          209 non-null int64
MMIN          209 non-null int64
MMAX          209 non-null int64
CACH          209 non-null int64
CHMIN         209 non-null int64
CHMAX         209 non-null int64
PRP           209 non-null int64
ERP           209 non-null int64
dtypes: int64(8), object(2)
```

We can see that, barring the first two variables, the rest are numeric. Let's pick only the numeric fields.

```
# The first two columns (VENDOR, MODEL_NAME) are text; the rest are numeric.
categorical_ = data.iloc[:,:2]
numerical_ = data.iloc[:,2:]
numerical_.head()

   MYCT  MMIN  MMAX   CACH  CHMIN  CHMAX  PRP  ERP
0  125   256   6000   256   16     128    198  199
1  29    8000  32000  32    8      32     269  253
2  29    8000  32000  32    8      32     220  253
3  29    8000  32000  32    8      32     172  253
4  29    8000  16000  32    8      16     132  132
```

Select the predictor and target variables

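```
# Predictors: every numeric column except the last (MYCT through PRP).
X = numerical_.iloc[:,:-1]
# Target: the last column (ERP).
y = numerical_.iloc[:,-1]
```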
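The positional selection above takes the last column, ERP, as the target and keeps PRP among the predictors. If the aim is instead to predict the published relative performance (PRP) from the six hardware attributes alone, one possible selection by column name is sketched below. To stay self-contained it rebuilds only the five sample rows shown earlier, and the names `feature_columns`, `X_prp` and `y_prp` are introduced here just for illustration.

```python
import pandas as pd

# The five sample rows displayed earlier (numeric columns only).
numerical_ = pd.DataFrame({
    'MYCT':  [125, 29, 29, 29, 29],
    'MMIN':  [256, 8000, 8000, 8000, 8000],
    'MMAX':  [6000, 32000, 32000, 32000, 16000],
    'CACH':  [256, 32, 32, 32, 32],
    'CHMIN': [16, 8, 8, 8, 8],
    'CHMAX': [128, 32, 32, 32, 16],
    'PRP':   [198, 269, 220, 172, 132],
    'ERP':   [199, 253, 253, 253, 132],
})

# Six hardware attributes as predictors. ERP is left out because it is
# itself a regression estimate from the original article.
feature_columns = ['MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX']
X_prp = numerical_[feature_columns]
y_prp = numerical_['PRP']

print(X_prp.shape, y_prp.shape)
```

The results reported later in this post come from the original selection, so they would change under this alternative.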

Train test split

```
x_training_set, x_test_set, y_training_set, y_test_set = train_test_split(
    X, y, test_size=0.10, random_state=42, shuffle=True)
```
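To get a feel for what a 10% test split means for a dataset of this size, here is a small sketch that runs the same call on placeholder arrays with 209 rows. The arrays `X_demo` and `y_demo` are stand-ins invented for the example, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (209)
# and seven predictor columns.
X_demo = np.arange(209 * 7).reshape(209, 7)
y_demo = np.arange(209)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=42, shuffle=True
)

# Number of rows that end up in each part.
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.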

Normalize the data

Before we do the fitting, let's standardize the data so that each variable is centered on its mean and has unit standard deviation.

```
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit on training set only.
scaler.fit(x_training_set)

# Apply transform to both the training set and the test set.
x_training_set = scaler.transform(x_training_set)
x_test_set = scaler.transform(x_test_set)
```

```
# StandardScaler expects 2D input, so reshape the targets into a single column.
y_training_set = y_training_set.values.reshape(-1, 1)
y_test_set = y_test_set.values.reshape(-1, 1)

y_scaler = StandardScaler()
# Fit on training set only.
y_scaler.fit(y_training_set)

# Apply transform to both the training set and the test set.
y_training_set = y_scaler.transform(y_training_set)
y_test_set = y_scaler.transform(y_test_set)
```
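Because the target is scaled too, predictions from the model come out in standardized units. The sketch below, using a few made-up target values, checks that the scaled training values have mean 0 and standard deviation 1, and shows how `inverse_transform` maps scaled values back to the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up target values, used only to illustrate the round trip.
y_train_demo = np.array([10.0, 20.0, 30.0, 40.0]).reshape(-1, 1)

y_scaler_demo = StandardScaler()
y_scaled = y_scaler_demo.fit_transform(y_train_demo)

# After scaling, the training values have mean 0 and standard deviation 1.
print(y_scaled.mean(), y_scaled.std())

# inverse_transform maps scaled values (for example, model predictions)
# back to the original units.
y_restored = y_scaler_demo.inverse_transform(y_scaled)
print(y_restored.ravel())
```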

Training/model fitting

Fit the model to the training data.

```
model = linear_model.LinearRegression()
model.fit(x_training_set, y_training_set)
```
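Once fitted, a `LinearRegression` model exposes the learned weights through its `coef_` and `intercept_` attributes. The sketch below illustrates this on synthetic data generated from a known linear rule, so the recovered values can be checked; the data and the name `demo_model` are assumptions made for the example.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data from a known linear rule, for illustration only:
# y = 4 + 3*x1 - 2*x2 (no noise).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 4.0 + 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]

demo_model = linear_model.LinearRegression()
demo_model.fit(X_demo, y_demo)

# coef_ holds one weight per predictor; intercept_ is the constant term.
print(demo_model.coef_)
print(demo_model.intercept_)
```

On standardized predictors, as used in this post, the size of each coefficient gives a rough indication of how strongly that feature moves the prediction.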

Model parameters study

The coefficient of determination R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().
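The definition can be checked directly. The following sketch computes u and v by hand on invented values and compares the result with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

u = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
r2_manual = 1 - u / v

# The library function should agree with the manual calculation.
print(r2_manual, r2_score(y_true, y_pred))
```

A model that always predicts the mean of y_true gives u equal to v and therefore an R^2 of 0, while perfect predictions give u equal to 0 and an R^2 of 1.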

```
from sklearn.metrics import mean_squared_error, r2_score

# R^2 on the training set; 1 is a perfect fit.
model_score = model.score(x_training_set, y_training_set)
print('Coefficient of determination R^2 of the prediction : ', model_score)

y_predicted = model.predict(x_test_set)

# Mean squared error on the test set (in standardized target units).
print("Mean squared error: %.2f" % mean_squared_error(y_test_set, y_predicted))
# R^2 on the test set; 1 is perfect prediction.
print('Test Variance score: %.2f' % r2_score(y_test_set, y_predicted))

Coefficient of determination R^2 of the prediction :  0.9583846753218253
Mean squared error: 0.39
Test Variance score: 0.93
```

Accuracy report with test data

Let's visualize the goodness of fit by plotting the predicted values against the actual values. The dashed line marks where the points would fall if every prediction were exact.

```
# Plot predicted against actual values for the test data.
fig, ax = plt.subplots()
ax.scatter(y_test_set, y_predicted, edgecolors=(0, 0, 0))
# Dashed diagonal: where points would fall if predictions were perfect.
ax.plot([y_test_set.min(), y_test_set.max()], [y_test_set.min(), y_test_set.max()], 'k--', lw=4)
ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')
ax.set_title("Ground Truth vs Predicted")
plt.show()
```

Conclusion

We can see that our R^2 score and MSE are both good: R^2 is about 0.96 on the training set and 0.93 on the test set, with a test MSE of 0.39 on the standardized target. This means that we have found a well-fitting model to predict the relative performance of computer hardware. The metrics could be improved further with additional preprocessing before fitting the data.


Abhay Kumar

Abhay Kumar, Lead Data Scientist (Computer Vision) at a startup, is an experienced data scientist specializing in Deep Learning in Computer Vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, JavaScript and with frameworks like TensorFlow, MXNet, Hadoop, Spark, MapReduce, NumPy, scikit-learn, and pandas.
