
Assumptions of Linear Regression

Linear Regression rests on some basic assumptions against which we must test our data before applying it. If these assumptions are violated, we may obtain biased and misleading results.

In this blog, we will briefly discuss these assumptions, verify them on the ‘Advertising’ dataset using Python, and look at ways to overcome violations when they occur.


Linear Regression is one of the important algorithms in Machine Learning and is mainly used for regression problems. In one of our previous blog posts, an end-to-end implementation of this algorithm was presented using the ‘Boston’ dataset. We assume our readers have a basic knowledge of Linear Regression and its implementation; if not, you can go through our previous blog to understand the implementation of Linear Regression in detail.

The dataset used here contains information about money spent on advertisements and the sales they generated. The advertisements were made through electronic media (TV and Radio) and print media (Newspaper), giving three features, ‘TV’, ‘Radio’ and ‘Newspaper’, and the target variable ‘Sales’.

The dataset contains the below fields.

Features:

  • TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
  • Radio: advertising dollars spent on Radio
  • Newspaper: advertising dollars spent on Newspaper

Target variable:

  • Sales: sales of a single product in a given market (in thousands of widgets)

Let us begin by loading our dataset and then verifying the assumptions one by one.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

advert = pd.read_csv(r'Aeon\Advertising.csv')
advert.head()

Assumptions:

1. Linearity: This assumption states that there should be a linear relationship between the independent variables and the dependent variable. We can check for linearity using scatter plots.

Since our dataset has three independent variables, ‘TV’, ‘Radio’ and ‘Newspaper’, and the dependent variable ‘Sales’, we will verify the linearity between each independent variable and the dependent variable using scatter plots.

for c in advert.columns[:-1]:
    plt.title("{} vs. \nSales".format(c))
    plt.scatter(x=advert[c],y=advert['Sales'],color='blue',edgecolor='k')
    plt.grid(True)
    plt.xlabel(c,fontsize=14)
    plt.ylabel('Sales')
    plt.show()

From the above output, we can see that there is a strong linear relationship between TV and Sales, a moderate linear relationship between Radio and Sales, and a weak, non-linear relationship between Newspaper and Sales.

A violation of this assumption can be fixed by applying a log transformation to the independent variable(s) and then re-plotting the scatter plot against the target.
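For example, a minimal sketch of log-transforming the weakly linear ‘Newspaper’ feature (using np.log1p to stay safe with near-zero spends) and re-checking its relationship with Sales might look like this; the new column name is purely illustrative.

# Hypothetical sketch: log-transform the 'Newspaper' spend and re-check linearity
advert['log_Newspaper'] = np.log1p(advert['Newspaper'])  # log(1 + x) handles zero values safely

plt.scatter(x=advert['log_Newspaper'], y=advert['Sales'], color='blue', edgecolor='k')
plt.xlabel('log(1 + Newspaper)')
plt.ylabel('Sales')
plt.title('log(Newspaper) vs. Sales')
plt.grid(True)
plt.show()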

2. No or Little Multicollinearity: Multicollinearity is a situation where the independent variables are highly correlated with each other. This assumption says that there should be little or no correlation between the independent variables. The presence of correlated independent variables poses a serious problem for our regression model, as the coefficients will be wrongly estimated.

We can check for multicollinearity with the help of a correlation matrix or the VIF.

Verifying multicollinearity using a correlation matrix or heat map

df = advert[['TV', 'Radio', 'Newspaper']]
sns.heatmap(df.corr(), annot = True)

If any pair of independent variables has an absolute correlation >= 0.8, the multicollinearity assumption is violated.
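As a quick programmatic check (a small sketch added here, not part of the original walkthrough), we can flag any feature pair whose absolute correlation crosses that threshold:

# Flag feature pairs with |correlation| >= 0.8
corr = df.corr()
for i in range(len(corr.columns)):
    for j in range(i + 1, len(corr.columns)):
        if abs(corr.iloc[i, j]) >= 0.8:
            print("High correlation between {} and {}: {:.2f}".format(
                corr.columns[i], corr.columns[j], corr.iloc[i, j]))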

VIF stands for Variance Inflation Factor and is the ratio of the variance of a coefficient in a model with multiple terms to its variance in a model with that term alone; equivalently, VIF_i = 1 / (1 − R_i²), where R_i² is the R-squared obtained by regressing feature i on the remaining features. A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies strong multicollinearity.

Calculating VIF values for the independent variables

from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

# Compute the VIF for each independent variable (all columns except 'Sales')
X = advert[advert.columns[:-1]].values
for i, col in enumerate(advert.columns[:-1]):
    v = vif(X, i)
    print("Variance inflation factor for {}: {}".format(col, round(v, 2)))

The feature ‘TV’ has a VIF value greater than 10, which indicates significant multicollinearity.

A violation of this assumption can be fixed by removing the independent variables with high VIF values or those which are highly correlated; however, removing a feature may eliminate necessary information from the dataset. We can also combine several variables into one, for example by taking their average, or use PCA to reduce the features to a smaller set of uncorrelated components, as sketched below.
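A minimal sketch of the PCA approach is shown here; it assumes scikit-learn is available (it is not used elsewhere in this post), and the choice of two components is purely illustrative.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features, then project them onto uncorrelated principal components
features = advert[['TV', 'Radio', 'Newspaper']]
features_scaled = StandardScaler().fit_transform(features)
features_pca = PCA(n_components=2).fit_transform(features_scaled)  # 2 components as an example
print(np.corrcoef(features_pca, rowvar=False).round(3))  # off-diagonal entries are ~0 by construction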

3. No Autocorrelation: Autocorrelation refers to the situation where the error terms are correlated, i.e. the residuals are dependent on each other. Hence this assumption says that there should be NO autocorrelation in the error terms of our data.

This kind of scenario usually occurs in time series or panel data, where one instant depends on the previous instant.

We can test for Autocorrelation with the Durbin-Watson test.

Since our dataset is neither time series nor panel data, we will not verify this assumption here and will move ahead to the next one. Still, a sketch of the check is shown below for reference.
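As a minimal sketch, the Durbin-Watson statistic can be computed from the residuals of a fitted statsmodels OLS model (such as the one fitted in the next section). The statistic lies between 0 and 4: values near 2 indicate no autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.

from statsmodels.stats.stattools import durbin_watson

# Assumes `model` is a fitted statsmodels OLS results object (fitted in the next section)
dw = durbin_watson(model.resid)
print("Durbin-Watson statistic: {:.2f}".format(dw))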

4. Normality: This assumption states that the residuals of the regression should be normally distributed. The residuals, also known as errors, are the differences between the observed values and the predicted values.

The test of normality applies to the model’s residuals and not the variables themselves. This can be tested visually by plotting the residuals as a histogram, and/or using a probability plot.

One way to visually test this assumption is the Q-Q (quantile-quantile) plot, a technique for comparing two probability distributions visually.

To generate the Q-Q plot we will use statsmodels’ qqplot function, which compares our residuals against a normal distribution.

Before plotting the Q-Q plot, we first fit a model using statsmodels’ formula API; the diagnostics will then be run on this model.

import statsmodels.formula.api as smf

model = smf.ols("Sales ~ TV + Radio + Newspaper", data= advert).fit()
model.summary()

Histogram of Normalized residuals

plt.figure(figsize=(8,5))
plt.hist(model.resid_pearson,bins=20,edgecolor='k')
plt.ylabel('Count')
plt.xlabel('Normalized residuals')
plt.title("Histogram of normalized residuals")
plt.show()

Visualizing the Q-Q plot of the residuals

from statsmodels.graphics.gofplots import qqplot

plt.figure(figsize=(8,5))
fig = qqplot(model.resid_pearson, line='45', fit=True)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.xlabel("Theoretical quantiles",fontsize=15)
plt.ylabel("Ordered Values",fontsize=15)
plt.title("Q-Q plot of normalized residuals",fontsize=18)
plt.grid(True)
plt.show()

In the above output, the blue points are the ordered quantiles of the observed residuals, while the red 45-degree line shows where they would fall if the residuals were perfectly normal.

Our residuals are approximately normally distributed, as most of the blue dots fall on the red line. The few points that stray from the line can be attributed to our small sample size.

The Q-Q plot and the histogram above show that the normality assumption is reasonably well satisfied.
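Besides these visual checks, a numerical test can be applied to the residuals; a minimal sketch using SciPy’s Shapiro-Wilk test (an addition to the original analysis) is shown below, where a p-value above 0.05 means we cannot reject normality.

from scipy.stats import shapiro

# Shapiro-Wilk test on the residuals; p > 0.05 => no evidence against normality
stat, p = shapiro(model.resid)
print("Shapiro-Wilk statistic: {:.3f}, p-value: {:.3f}".format(stat, p))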

If this assumption is violated, we can fix it by applying a nonlinear transformation to the target variable or the features, or by removing/treating potential outliers.

5. Homoscedasticity: This is the most vital assumption for linear regression; if it is violated, the standard errors will be biased. The standard errors are used to conduct significance tests and calculate confidence intervals.

Homoscedasticity means the error terms (residuals) have constant variance with respect to the independent or dependent variables. It can be easily tested with a scatter plot of the residuals.

If we notice a pattern in the scatter plot of the residuals from our linear regression analysis, this is a clear sign that the assumption is violated and the data is heteroscedastic. Refer to the plot below for a better understanding.

Plot to verify homoscedasticity

p=plt.scatter(x=model.fittedvalues,y=model.resid,edgecolor='k')
xmin=min(model.fittedvalues)
xmax = max(model.fittedvalues)
plt.hlines(y=0,xmin=xmin*0.9,xmax=xmax*1.1,color='red',linestyle='--',lw=3)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Fitted vs. residuals plot")
plt.grid(True)
plt.show()

From the above output, the residuals appear to have roughly constant variance, so the homoscedasticity assumption is not violated.
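A numerical check for heteroscedasticity, not covered in the original walkthrough, is the Breusch-Pagan test; a minimal sketch with statsmodels is shown below, where a small p-value would indicate heteroscedasticity.

from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test: regresses the squared residuals on the model's explanatory variables
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan LM p-value: {:.3f}".format(lm_pvalue))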

A violation of this assumption can often be fixed by a log transformation of the dependent variable.
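Since the statsmodels formula API accepts transformations directly, a sketch of refitting with a log-transformed target (had heteroscedasticity been present) could look like this:

# Hypothetical fix: refit the model with the log of the dependent variable
log_model = smf.ols("np.log(Sales) ~ TV + Radio + Newspaper", data=advert).fit()
print(log_model.summary())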

As we have tested, our model passes all the basic assumptions of linear regression and hence is qualified to make predictions. We also understand the influence of the independent (predictor) variables on our dependent variable.
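As an illustration of putting the model to use, a minimal sketch of predicting Sales for hypothetical advertising spends (the figures below are made up) would be:

# Predict Sales (in thousands of units) for hypothetical ad spends (in thousands of dollars)
new_spend = pd.DataFrame({'TV': [100.0], 'Radio': [25.0], 'Newspaper': [20.0]})
print(model.predict(new_spend))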


If you have any queries on the above blog post, please leave a comment and we will get back to you.

Keep visiting the Acadgild blog for more informative posts on data science, data analysis, and big data. Thank you.

Mitali Singh

Python|| Machine Learning|| Statistics|| Data Science

