Data Science and Artificial Intelligence

Linear Regression on Boston Housing data

Linear Regression is one of the algorithms of Machine Learning that is categorized as a Supervised Learning algorithm.

Linear regression is used to find the relationship between the target and one or more predictors. Here the target is the dependent variable and the predictors are the independent variables.
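To make the idea concrete, here is a minimal sketch of fitting a straight line by ordinary least squares with NumPy. The data is synthetic (an assumption made for illustration, not part of the Boston dataset), generated from y = 2 + 3x plus a little noise, so we know which intercept and slope the fit should recover.

```python
import numpy as np

# Synthetic data (illustrative assumption): y = 2 + 3*x plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 0.1, size=200)

# Design matrix: a column of ones for the intercept, then the predictor
A = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: find beta minimizing ||A @ beta - y||^2
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta
print(intercept, slope)
```

The recovered intercept and slope should land close to 2 and 3. With several predictors, the design matrix simply gains one column per predictor.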

In this blog, we are using the Boston Housing dataset, which contains information about housing in different areas of Boston. This data is also available from the scikit-learn library. There are 506 samples and 13 feature variables in this dataset. The objective is to predict house prices using the given features.

The features of the dataset can be summarized as follows:

  • CRIM: This column represents per capita crime rate by town
  • ZN: This column represents the proportion of residential land zoned for lots larger than 25,000 sq.ft.
  • INDUS: This column represents the proportion of non-retail business acres per town.
  • CHAS: This column represents the Charles River dummy variable (this is equal to 1 if tract bounds river; 0 otherwise)
  • NOX: This column represents the concentration of the nitric oxide (parts per 10 million)
  • RM: This column represents the average number of rooms per dwelling
  • AGE: This column represents the proportion of owner-occupied units built prior to 1940
  • DIS: This column represents the weighted distances to five Boston employment centers
  • RAD: This column represents the index of accessibility to radial highways
  • TAX: This column represents the full-value property-tax rate per $10,000
  • PTRATIO: This column represents the pupil-teacher ratio by town
  • B: This is calculated as 1000(Bk - 0.63)², where Bk is the proportion of people of African American descent by town
  • LSTAT: This is the percentage of the population classified as lower status
  • MEDV: This is the median value of owner-occupied homes in $1000s

So let’s get started with our coding in Python.

First, we will import all the important libraries.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import sklearn

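We will then load the Boston dataset from the sklearn library. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so the code below needs an older scikit-learn release.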

from sklearn.datasets import load_boston
boston = load_boston()

Now we will load the data into a pandas dataframe and then will print the first few rows of the data using the head() function.

bos = pd.DataFrame(boston.data)
bos.head()

We will now rename the columns to match the dataset description given above.

bos.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
bos.head()
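As a side note, the column names can also be passed when the dataframe is created, which avoids the separate renaming step. The sketch below uses a placeholder array of zeros with the same shape as the feature matrix (506 x 13); the names data and df are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

# Placeholder array standing in for the real feature matrix (506 rows, 13 columns)
data = np.zeros((506, len(feature_names)))

# Name the columns at construction time instead of renaming afterwards
df = pd.DataFrame(data, columns=feature_names)
print(df.shape)
print(list(df.columns[:3]))
```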

The variable MEDV indicates the prices of the houses and is the target variable. The rest of the variables are the predictors based on which we will predict the value of the house.

In the above result, we can see that the target variable ‘MEDV’ is missing from the data. We will create a new column of target values and add them to the dataframe.

bos['MEDV'] = boston.target

Fetching more information about the dataset using the info() function.

bos.info()

From the above information, we can see that the 14 columns present in the dataset contain all non-null values with float data types.

Checking the statistical values of the dataset using the describe() function.

bos.describe()

We will now check whether any null values are present in the dataset.

bos.isnull().sum()

There are no null values in the dataset.

EDA

Exploratory Data Analysis is a very important step before training the model. We will use some visualizations to understand the relationship of the target variable with other variables.

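We will first plot the distribution of the target variable MEDV. For this we will use the distplot() function from the seaborn library. (distplot() has been deprecated since seaborn 0.11; in newer versions, histplot() with kde=True gives a similar plot.)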

sns.distplot(bos['MEDV'])
plt.show()

From the above output we can see that the values of MEDV are roughly normally distributed, with a few outliers.
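A visual impression of normality can be backed up with a number. One simple option is the sample skewness, which is near 0 for a symmetric distribution and positive when there is a long right tail. The sketch below uses two synthetic series (assumptions for illustration, not the real MEDV values); on the real data the equivalent call would be bos['MEDV'].skew().

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Two synthetic samples: one symmetric, one with a long right tail
symmetric = pd.Series(rng.normal(loc=22, scale=5, size=506))
right_skewed = pd.Series(rng.lognormal(mean=3, sigma=0.5, size=506))

# Series.skew() returns the sample skewness
sym_skew = symmetric.skew()
right_skew = right_skewed.skew()
print(round(sym_skew, 2), round(right_skew, 2))
```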

We will now visualize the pairplot which shows the relationships between all the features present in the dataset.

sns.pairplot(bos)

We will now use the heatmap function from the seaborn library to plot the correlation matrix.

corr_mat = bos.corr().round(2)
sns.heatmap(data=corr_mat, annot=True)

From the above two graphs, we can clearly see that the feature RM has a positive correlation with MEDV.
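With 14 columns, a heatmap can be hard to read at a glance. One option is to pull out the column of correlations with the target and sort it by absolute value. The sketch below does this on a small synthetic dataframe (columns a, b, c and target are made up for illustration); on the real data the same idea would start from bos.corr()['MEDV'].

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Synthetic features: target depends strongly on a, weakly on b, not on c
a = rng.normal(size=n)
b = rng.normal(size=n)
c = rng.normal(size=n)
target = 3.0 * a - 1.0 * b + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({'a': a, 'b': b, 'c': c, 'target': target})

# Correlation of every feature with the target, excluding the target itself
corr_with_target = df.corr()['target'].drop('target')

# Order by strength of correlation, keeping the sign in the values
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked.round(2))
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.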

Based on the above observations we will plot an lmplot between RM and MEDV to see the relationship between the two more clearly.

sns.lmplot(x = 'RM', y = 'MEDV', data = bos)

Splitting the data into Training and Test Data

We will now split the dataset into training and test data. We do this to train our model with 80% of the samples and test with the remaining 20%.

We are using the train_test_split function from the sklearn library to split the data.

X = bos[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX','PTRATIO', 'B', 'LSTAT']]

y = bos['MEDV']

X holds the independent variables (the 13 features) and y is the dependent variable (the target).

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
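As a quick sanity check, we can look at how many rows end up in each part. The sketch below uses placeholder arrays with the same shape as our data (506 samples, 13 features); the names X_demo and y_demo are assumptions for illustration. It also shows that a fixed random_state makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the Boston features and target
X_demo = np.arange(506 * 13).reshape(506, 13)
y_demo = np.arange(506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10)
same_split = np.array_equal(y_te, y_te2)
print(same_split)
```

Because 20% of 506 is not a whole number, the test set size is rounded up to 102 rows, which leaves 404 rows for training.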

Training the Model

We will now train our model using the LinearRegression class from the sklearn library.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
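After fitting, the learned parameters are available on the model as intercept_ and coef_ (one coefficient per feature). The sketch below fits a model on noise-free synthetic data (an assumption for illustration) built from y = 1.5 + 2*x1 - 0.5*x2, so the fitted values can be compared with the known ones. For our model, the same attributes would be lm.intercept_ and lm.coef_.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic, noise-free data: y = 1.5 + 2*x1 - 0.5*x2
X_demo = rng.normal(size=(100, 2))
y_demo = 1.5 + 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1]

model = LinearRegression()
model.fit(X_demo, y_demo)

# intercept_ is a scalar here; coef_ has one entry per feature column
print(model.intercept_, model.coef_)
```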

Prediction

We will now make predictions on the test data using the predict() method of the fitted model and draw a scatterplot of the actual test values against the predicted values.

prediction = lm.predict(X_test)

plt.scatter(y_test, prediction)

We will now build a dataframe of the actual and predicted values and plot a bar graph of its first 10 rows.

df1 = pd.DataFrame({'Actual': y_test, 'Predicted':prediction})
df2 = df1.head(10)
df2
df2.plot(kind = 'bar')

From the above graph, we can see that there is not much difference between the actual and predicted values for these 10 samples. Hence our model seems to work reasonably well.

Model Evaluation

We will now evaluate the model using the metrics module and the r2_score function from the sklearn library.

Here we will evaluate the Mean Absolute Error, Mean Squared Error, Root Mean Squared Error and R-squared value.

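The value of R-squared is at most 1, where a value of 1 (or near 1) indicates that the predictors account for nearly all the variation in y. A value of 0 means the model does no better than always predicting the mean of y, and on test data the value can even be negative if the model does worse than that.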

from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared', r2_score(y_test, prediction))
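To see what these four numbers measure, here is a minimal sketch that computes them by hand with NumPy on a tiny made-up example (the arrays y_true, y_pred and y_bad are assumptions for illustration, not results from our model).

```python
import numpy as np

# Tiny made-up example: four actual values and two sets of predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])   # reasonable predictions
y_bad = np.array([9.0, 7.0, 5.0, 3.0])    # predictions in reversed order

def regression_metrics(actual, predicted):
    errors = actual - predicted
    mae = np.mean(np.abs(errors))           # Mean Absolute Error
    mse = np.mean(errors ** 2)              # Mean Squared Error
    rmse = np.sqrt(mse)                     # Root Mean Squared Error
    ss_res = np.sum(errors ** 2)            # residual sum of squares
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot              # R-squared
    return mae, mse, rmse, r2

mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae, mse, rmse, r2)

# A model that does worse than predicting the mean gets a negative R-squared
r2_bad = regression_metrics(y_true, y_bad)[3]
print(r2_bad)
```

For the reasonable predictions this gives MAE 0.5, MSE 0.375 and R-squared 0.925, while the reversed predictions give an R-squared of -3.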

The R-squared value is moderately close to 1, which seems to be a good start. However, we will keep working on improving the model's performance with more examples in our upcoming blogs.

Do drop us a comment for any query or suggestion. Keep visiting our website for more blogs on Data Science and Data Analytics.

Mitali Singh

Python|| Machine Learning|| Statistics|| Data Science
