
Linear Regression

This entry is part 1 of 9 in the series Machine Learning Algorithms


This blog helps beginners get started with the basics of linear regression so that they can build their first linear regression model. It focuses on the modeling aspect of linear regression.

Linear Regression is one of the most fundamental and widely used Machine Learning Algorithms. It’s usually among the first few topics that people pick up while learning predictive modeling. Linear Regression models the relationship between a dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line). The dependent variable is continuous. The independent variable(s) can be continuous or discrete, and the nature of the relationship is linear.


Linear relationships can be either positive or negative. A positive relationship between two variables means that an increase in the value of one variable goes with an increase in the value of the other variable. A negative relationship between two variables means that an increase in the value of one variable goes with a decrease in the value of the other variable.
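To make the idea of positive and negative relationships concrete, here is a minimal sketch using NumPy. The toy arrays (distance, fare, fuel_left) are hypothetical values made up for illustration and are not part of the taxi dataset used later.

```python
import numpy as np

# Hypothetical toy data (not from the taxi dataset)
distance = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
fare = np.array([5.0, 7.5, 10.0, 12.5, 15.0])         # rises as distance rises
fuel_left = np.array([40.0, 38.0, 36.0, 34.0, 32.0])  # falls as distance rises

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is the
# Pearson correlation between the two inputs.
r_positive = np.corrcoef(distance, fare)[0, 1]
r_negative = np.corrcoef(distance, fuel_left)[0, 1]

print("distance vs fare:", round(r_positive, 2))
print("distance vs fuel_left:", round(r_negative, 2))
```

A correlation close to +1 points to a positive linear relationship, and a value close to -1 points to a negative one.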

Mathematical Explanation:

A simple linear regression has one independent variable. Mathematically, the line representing a simple linear regression is expressed through a basic equation:

Y = mX + b + e


     m is the slope

     X is the predictor variable

     b is the intercept/bias term   

     Y is the target variable being modeled

     e is the error term
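As a rough illustration of how m and b can be estimated, the sketch below uses the standard closed-form least-squares formulas for a single predictor. The data points are hypothetical, and this is only one way to compute the line, not necessarily how scikit-learn does it internally.

```python
import numpy as np

# Hypothetical toy data that roughly follows Y = 2X + 1
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

x_mean, y_mean = X.mean(), Y.mean()

# Least-squares slope: co-variation of X and Y divided by the variation of X
m = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
# Intercept: forces the line through the point (x_mean, y_mean)
b = y_mean - m * x_mean
# Error term: what is left of Y after subtracting the fitted line
e = Y - (m * X + b)

print("slope m:", round(m, 2))
print("intercept b:", round(b, 2))
print("errors e:", np.round(e, 2))
```

With a least-squares fit that includes an intercept, the errors sum to (approximately) zero, which is a quick sanity check on the computation.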


Let's look at Linear Regression using Python’s machine learning library, scikit-learn.

Problem Statement:

Predict the fare of a taxi trip given the distance covered by the trip.

Data details:

fare_amount is the column that corresponds to the target variable, which needs to be predicted.

VendorID: The ID of the taxi vendor.

tpep_pickup_datetime: The time at which the passenger is picked up.

tpep_dropoff_datetime: The time at which the passenger is dropped off.

passenger_count: Number of passengers on the ride.

trip_distance: The distance covered by the trip.

RatecodeID: The rate type of the taxi trip.

store_and_fwd_flag: This flag indicates whether the trip record was held in vehicle memory before being sent to the vendor.

PULocationID: Pickup Location ID.

DOLocationID: Drop off Location ID.

payment_type: Mode of payment.

fare_amount: Fare for a trip.

extra: Extra charges.

mta_tax: Metropolitan Transit Authority Tax

tip_amount: Amount given as a tip.

tolls_amount: Amount given at toll booths.

improvement_surcharge: Surcharge in lieu of rate hike.

total_amount: Total amount to be paid.

Tools used:

Pandas, Numpy, Matplotlib, scikit-learn

Python Implementation with code:

0. Import necessary libraries

Import the necessary modules from specific libraries.

from sklearn import linear_model
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

1. Load the data set

Use pandas to read the taxi data from the file system, then check the first few records of the dataset.

taxi_train = "data/taxi-fare-train.csv"
taxi_test = "data/taxi-fare-test.csv"
tax_train = pd.read_csv(taxi_train)
# Show the first five rows
tax_train.head()

   vendor_id  rate_code  passenger_count  trip_time_in_secs  trip_distance  payment_type  fare_amount
0        CMT          1                1               1271            3.8           CRD         17.5
1        CMT          1                1                474            1.5           CRD          8.0
2        CMT          1                1                637            1.4           CRD          8.5
3        CMT          1                1                181            0.6           CSH          4.5
4        CMT          1                1                661            1.1           CRD          8.5

2. Select the predictor feature and the target variable for simple regression. trip_distance is chosen as the predictor, since the fare amount is expected to depend on the distance covered.

X = tax_train['trip_distance']
y = tax_train['fare_amount']

3. Check the summary statistics (including the 5-number summary) of the selected predictor feature

X.describe()

count    728541.000000
mean          2.741597
std           3.298091
min           0.000000
25%           1.000000
50%           1.700000
75%           3.000000
max          98.700000
Name: trip_distance, dtype: float64
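The min, 25%, 50%, 75% and max rows above make up the five-number summary. As a small sketch, the same five numbers can be computed with NumPy. The array below holds only the five trip_distance values from the sample rows shown earlier, so its results will not match the full-dataset summary.

```python
import numpy as np

# trip_distance values from the five sample rows shown earlier
sample = np.array([3.8, 1.5, 1.4, 0.6, 1.1])

# Minimum, first quartile, median, third quartile, maximum
five_num = np.percentile(sample, [0, 25, 50, 75, 100])

for label, value in zip(["min", "25%", "50%", "75%", "max"], five_num):
    print(label, value)
```

A maximum far above the 75% value, as in the summary above, is a hint that the feature may contain outliers worth inspecting.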

5. Train test split:

from sklearn.model_selection import train_test_split

x_training_set, x_test_set, y_training_set, y_test_set = train_test_split(X,y)
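When train_test_split is called without extra arguments, scikit-learn holds out 25% of the rows for testing and shuffles them randomly, so the split changes from run to run. The sketch below, on hypothetical stand-in arrays (X_demo, y_demo), shows how the test_size and random_state parameters can make the split explicit and repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples with a known linear relationship
X_demo = np.arange(100, dtype=float)
y_demo = 2.5 * X_demo + 3.0

# Hold out 20% for testing; a fixed random_state makes the shuffle repeatable
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("training samples:", len(x_tr))
print("test samples:", len(x_te))
```

Fixing random_state is optional, but it helps when you want the reported scores to be the same on every run.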

Do some reshaping of the variables for model fitting and visualization

# Convert each pandas Series to a NumPy array and reshape it into a single column
x_training_set = x_training_set.values.reshape(-1, 1)
x_test_set = x_test_set.values.reshape(-1, 1)
y_training_set = y_training_set.values.reshape(-1, 1)
y_test_set = y_test_set.values.reshape(-1, 1)
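The reshape is needed because scikit-learn estimators expect the predictor as a 2-D array of shape (n_samples, n_features), while a single pandas column gives a 1-D array. A minimal sketch of what reshape(-1, 1) does, using the five trip_distance values from the sample rows:

```python
import numpy as np

# trip_distance values from the five sample rows shown earlier
distances = np.array([3.8, 1.5, 1.4, 0.6, 1.1])
print(distances.shape)   # (5,)  -> 1-D: five values, no feature axis

# -1 lets NumPy infer the number of rows; 1 means one feature column
column = distances.reshape(-1, 1)
print(column.shape)      # (5, 1) -> 2-D: five samples, one feature
```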

Do an initial visual inspection of the relationship between the predictor and the target variable

# Plot the training data to see how fare_amount varies with trip_distance
plt.title('Relationship between predictor and target variable')
plt.xlabel('trip_distance')
plt.ylabel('fare_amount')
plt.scatter(x_training_set, y_training_set, color='black')
plt.show()

6. Training/model fitting:

Fit the model to the selected training data

# Create the model, then fit it on the training predictor and target
lm = linear_model.LinearRegression()
lm.fit(x_training_set, y_training_set)

7. Model parameters study:

from sklearn.metrics import mean_squared_error, r2_score

model_score = lm.score(x_training_set,y_training_set)
# R squared on the training data gives an idea of the fit;
# 1 is a perfect prediction
print('R sq: ',model_score)

y_predicted = lm.predict(x_test_set)

# The coefficients
print('Coefficients: ', lm.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(y_test_set, y_predicted))
# R squared (coefficient of determination) on the test data: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test_set, y_predicted))

('R sq: ', 0.7729861480277364)
('Coefficients: ', array([[2.55423554]]))
Mean squared error: 21.33
Variance score: 0.77
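To see what these two numbers measure, here is a small sketch that computes the mean squared error and R squared by hand with NumPy, following the same definitions that mean_squared_error and r2_score use. The y_true values are the fare_amount entries from the sample rows; the y_pred values are hypothetical predictions made up for illustration.

```python
import numpy as np

y_true = np.array([17.5, 8.0, 8.5, 4.5, 8.5])  # fare_amount from the sample rows
y_pred = np.array([16.0, 9.0, 8.0, 6.0, 8.0])  # hypothetical predictions

# Mean squared error: average of the squared prediction errors
mse = np.mean((y_true - y_pred) ** 2)

# R squared: 1 minus (unexplained variation / total variation around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("Mean squared error: %.2f" % mse)
print("R squared: %.2f" % r2)
```

A lower mean squared error and an R squared closer to 1 both indicate predictions that sit closer to the observed values.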

8. Accuracy report with test data:

Let’s visualize the goodness of fit by plotting the test data as points and the predictions as a line

# Run the model against the test data
y_predicted = lm.predict(x_test_set)

plt.title('Test data and the predicted regression line')
plt.xlabel('Trip distance')
plt.ylabel('Fare amount')
plt.scatter(x_test_set, y_test_set, color='black')
plt.plot(x_test_set, y_predicted, color='blue', linewidth=3)
plt.show()


9. Prediction:
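As a sketch of what prediction on new inputs could look like, the example below fits a small stand-in model (demo_model, a hypothetical name) on the five sample rows shown earlier and predicts fares for two made-up trip distances. With the model trained in the steps above, the equivalent call would be lm.predict on a 2-D array of new distances.

```python
import numpy as np
from sklearn import linear_model

# Stand-in training data: the five sample rows shown earlier
train_distance = np.array([[0.6], [1.1], [1.4], [1.5], [3.8]])
train_fare = np.array([4.5, 8.5, 8.5, 8.0, 17.5])

demo_model = linear_model.LinearRegression()
demo_model.fit(train_distance, train_fare)

# Hypothetical new trips of 2.0 and 5.0 distance units, one row per trip
new_trips = np.array([[2.0], [5.0]])
predicted_fares = demo_model.predict(new_trips)

# The same prediction written out as Y = mX + b
manual = demo_model.coef_[0] * new_trips[:, 0] + demo_model.intercept_

print("predicted fares:", np.round(predicted_fares, 2))
print("manual mX + b:  ", np.round(manual, 2))
```

Five rows are far too few for a reliable model; the point is only to show that predict applies the fitted slope and intercept to each new input.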

Algorithm Advantages:

  • Extremely simple method
  • It gives good results when the relationships between the independent variables and the dependent variable are almost linear.
  • Very easy and intuitive to use and understand
  • Even when it doesn’t fit the data exactly, we can use it to find the nature of the relationship between the two variables.

Algorithm Disadvantages:

  • Linear regression is limited to predicting numeric outputs.
  • It is very sensitive to anomalies in the data (outliers).
  • If the number of parameters exceeds the number of samples available, the model starts to fit the noise rather than the relationship between the variables.
  • Regression coefficients are biased by data imbalance.
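The sensitivity to outliers can be seen in a small experiment. The sketch below fits a straight line with np.polyfit to hypothetical toy data, once as-is and once with a single corrupted point, and compares the slopes.

```python
import numpy as np

# Hypothetical toy data lying exactly on Y = 2X + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_clean = 2.0 * x + 1.0

# Same data with one anomalous value: the last point becomes 50 instead of 11
y_outlier = y_clean.copy()
y_outlier[-1] = 50.0

# np.polyfit with degree 1 returns the least-squares [slope, intercept]
slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, 1)

print("slope without outlier:", round(slope_clean, 2))
print("slope with outlier:   ", round(slope_outlier, 2))
```

A single bad point out of five is enough to pull the fitted slope far away from the true value of 2, which is why inspecting and cleaning outliers matters before fitting.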


Abhay Kumar

Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.
