This blog helps beginners get started with the basics of linear regression so that they can build their first linear regression model. The focus is on the modeling aspect of linear regression.
Linear regression is one of the most fundamental and widely used machine learning algorithms. It is usually among the first topics people pick up when learning predictive modeling. Linear regression models the relationship between a dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line). The dependent variable is continuous. The independent variable(s) can be continuous or discrete, and the nature of the relationship is linear.
Linear relationships can be either positive or negative. In a positive relationship, an increase in the value of one variable goes with an increase in the value of the other. In a negative relationship, an increase in the value of one variable goes with a decrease in the value of the other.
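To make the idea of positive and negative relationships concrete, here is a minimal sketch that checks the sign of the Pearson correlation with NumPy. The arrays are made-up illustrative values (the `fuel_left` variable in particular is hypothetical and not part of the taxi data).

```python
import numpy as np

# Made-up illustrative values, not taken from the taxi data set.
distance = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
fare = np.array([5.0, 7.5, 10.0, 12.5, 15.0])          # rises with distance
fuel_left = np.array([40.0, 38.0, 36.5, 34.0, 32.0])   # falls with distance

# Pearson correlation: close to +1 for a strong positive linear
# relationship, close to -1 for a strong negative one.
r_positive = np.corrcoef(distance, fare)[0, 1]
r_negative = np.corrcoef(distance, fuel_left)[0, 1]

print("distance vs fare:", round(r_positive, 3))
print("distance vs fuel_left:", round(r_negative, 3))
```

A positive coefficient corresponds to an upward-sloping regression line and a negative one to a downward-sloping line.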
A simple linear regression has one independent variable. Mathematically, the line representing a simple linear regression is expressed through a basic equation:
Y = mX + b + e
Here:
- m is the slope
- X is the predictor variable
- b is the intercept/bias term
- Y is the target variable being predicted
- e is the error term
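As a rough illustration of how m and b can be estimated, the sketch below generates synthetic data from a known line and recovers the slope and intercept with the ordinary least squares formulas for one predictor. The true values 2.5 and 3.0, the noise level, and the random seed are all assumptions chosen for the example.

```python
import numpy as np

# Synthetic data from a known line, Y = 2.5*X + 3 + e (values are assumptions).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
e = rng.normal(0, 0.5, size=200)      # error term
Y = 2.5 * X + 3.0 + e

# Ordinary least squares estimates for a single predictor:
# slope = covariance(X, Y) / variance(X), intercept = mean(Y) - slope * mean(X)
m = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b = Y.mean() - m * X.mean()

print("estimated slope m:", m)
print("estimated intercept b:", b)
```

The estimates should land close to the values used to generate the data, which is what fitting a regression line amounts to.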
Let's look at linear regression using Python's machine learning framework, scikit-learn.
Problem statement: predict the fare of a taxi trip given the distance covered by the trip.
fare_amount is the column that corresponds to the target variable, which needs to be predicted.
VendorID: The ID of the taxi vendor.
tpep_pickup_datetime: The time at which the passenger is picked up.
tpep_dropoff_datetime: The time at which the passenger is dropped off.
passenger_count: Number of passengers on the ride.
trip_distance: The distance covered by the trip.
RatecodeID: The rate type of the taxi trip.
store_and_fwd_flag: This flag indicates whether the trip record was held in vehicle memory before being sent to the vendor.
PULocationID: Pickup Location ID.
DOLocationID: Drop off Location ID.
payment_type: Mode of payment.
fare_amount: Fare for a trip.
Extra: Extra charges.
mta_tax: Metropolitan Transit Authority tax.
tip_amount: Amount given as a tip.
tolls_amount: Amount given at toll booths.
improvement_surcharge: Surcharge in lieu of rate hike.
total_amount: Total amount to be paid.
Libraries used: pandas, NumPy, Matplotlib, scikit-learn
Python Implementation with code:
0. Import necessary libraries
Import the necessary modules from specific libraries.
from sklearn import linear_model
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
1. Load the data set
Use the pandas module to read the taxi data from the file system, then check the first few records of the dataset.
taxi_train = "data/taxi-fare-train.csv"
taxi_test = "data/taxi-fare-test.csv"
tax_train = pd.read_csv(taxi_train)
tax_train.head()

  vendor_id  rate_code  passenger_count  trip_time_in_secs  trip_distance payment_type  fare_amount
0       CMT          1                1               1271            3.8          CRD         17.5
1       CMT          1                1                474            1.5          CRD          8.0
2       CMT          1                1                637            1.4          CRD          8.5
3       CMT          1                1                181            0.6          CSH          4.5
4       CMT          1                1                661            1.1          CRD          8.5
2. Select the predictor feature and the target variable for simple regression
trip_distance is chosen as the predictor feature, since the fare amount is expected to depend on the distance covered.
X = tax_train['trip_distance']
y = tax_train['fare_amount']
3. Check 5-num summary of selected predictor feature
X.describe()

count    728541.000000
mean          2.741597
std           3.298091
min           0.000000
25%           1.000000
50%           1.700000
75%           3.000000
max          98.700000
Name: trip_distance, dtype: float64
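The summary above shows a maximum far above the 75th percentile. As a hedged sketch of how such a summary can be used, the snippet below computes the five-number summary for a small made-up sample and flags values above the common 1.5 × IQR rule of thumb. The sample values and the use of that rule are assumptions for illustration, not a step from the original workflow.

```python
import pandas as pd

# Small made-up sample of trip distances (hypothetical values).
distances = pd.Series([0.0, 0.6, 1.1, 1.4, 1.5, 3.8, 98.7], name="trip_distance")

# Five-number summary: min, 25%, 50%, 75%, max
five_num = distances.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five_num)

# Rule of thumb: values more than 1.5 * IQR above the 75th percentile
# are worth a closer look as potential outliers.
iqr = five_num[0.75] - five_num[0.25]
upper_fence = five_num[0.75] + 1.5 * iqr
flagged = distances[distances > upper_fence].tolist()
print("values above the upper fence:", flagged)
```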
5. Train test split:
from sklearn.model_selection import train_test_split
x_training_set, x_test_set, y_training_set, y_test_set = train_test_split(X, y)
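Called without extra arguments, train_test_split holds out 25% of the rows for testing and shuffles differently on every run. The sketch below, on a synthetic array, shows how the `test_size` and `random_state` parameters could be used to control the split size and make it reproducible. The values 0.2 and 42 are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the taxi data set).
X_demo = np.arange(100, dtype=float)
y_demo = 2.0 * X_demo + 1.0

# Hold out 20% for testing; a fixed random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))
```

Fixing the seed makes the reported scores comparable between runs, which helps when experimenting with the model.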
Do some reshaping of the variables, since scikit-learn expects the features as a 2-D array (one column per feature)
# Convert the pandas Series to NumPy arrays
x_training_set, x_test_set = x_training_set.values, x_test_set.values
y_training_set, y_test_set = y_training_set.values, y_test_set.values
# Reshape each array into a single column
x_training_set, x_test_set = x_training_set.reshape(-1, 1), x_test_set.reshape(-1, 1)
y_training_set, y_test_set = y_training_set.reshape(-1, 1), y_test_set.reshape(-1, 1)
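To see what the reshaping does, here is a small sketch using the five trip_distance values from the head() output shown earlier. The names `flat` and `column` are introduced only for this illustration.

```python
import numpy as np

# The five trip_distance values from the head() output.
flat = np.array([3.8, 1.5, 1.4, 0.6, 1.1])
print(flat.shape)        # (5,)

# -1 lets NumPy infer the number of rows; 1 means a single column.
column = flat.reshape(-1, 1)
print(column.shape)      # (5, 1)
```

scikit-learn estimators require this 2-D layout for the features, while the target can also be passed as a 1-D array.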
Do an initial visual inspection of the relationship between the predictor and the target variable
# So let's plot some of the data
# - this gives some core routines to experiment with different parameters
plt.title('Relationship between predictor and target variable')
plt.scatter(x_training_set, y_training_set, color='black')
plt.show()
6. Training/model fitting:
Fit the model to the selected training data
lm = linear_model.LinearRegression()
lm.fit(x_training_set, y_training_set)
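Once a model is fitted, its slope and intercept correspond to m and b in the regression equation. The sketch below fits a model on just the five rows shown in the head() output, which is far too few for a meaningful model and serves only to show where the parameters live. The names `X_small`, `y_small`, and `model` are introduced for this example.

```python
import numpy as np
from sklearn import linear_model

# First five rows from the head() output: trip_distance and fare_amount.
X_small = np.array([3.8, 1.5, 1.4, 0.6, 1.1]).reshape(-1, 1)
y_small = np.array([17.5, 8.0, 8.5, 4.5, 8.5])

model = linear_model.LinearRegression()
model.fit(X_small, y_small)

# coef_ holds the slope (m), intercept_ holds the intercept (b).
print("slope:", model.coef_[0])
print("intercept:", model.intercept_)

# Predict the fare for a trip of 2.0 distance units.
predicted = model.predict(np.array([[2.0]]))[0]
print("predicted fare:", predicted)
```

The prediction is simply slope × distance + intercept, which can be checked by hand from the printed values.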
7. Model parameters study:
from sklearn.metrics import mean_squared_error, r2_score

# Have a look at R sq on the training data to get an idea of the fit
# (1 is perfect prediction)
model_score = lm.score(x_training_set, y_training_set)
print('R sq: ', model_score)

y_predicted = lm.predict(x_test_set)
# The coefficients
print('Coefficients: ', lm.coef_)
# The mean squared error on the test data
print("Mean squared error: %.2f" % mean_squared_error(y_test_set, y_predicted))
# R sq on the test data: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test_set, y_predicted))

('R sq: ', 0.7729861480277364)
('Coefficients: ', array([[2.55423554]]))
Mean squared error: 21.33
Variance score: 0.77
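To show what these two metrics measure, the sketch below computes the mean squared error and R squared by hand and compares them with scikit-learn's functions. The `y_true` values are the five fares from the head() output; the `y_pred` values are hypothetical predictions made up for the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([17.5, 8.0, 8.5, 4.5, 8.5])   # fares from the head() output
y_pred = np.array([16.0, 9.0, 8.0, 5.5, 7.5])   # hypothetical predictions

# Mean squared error: average of the squared residuals.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse_manual, mean_squared_error(y_true, y_pred))
print("R sq:", r2_manual, r2_score(y_true, y_pred))
```

An R squared of 0.77 therefore means the model accounts for about 77% of the variation in the fares around their mean.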
8. Accuracy report with test data:
Let’s visualize the goodness of the fit, with the test data shown as points and the predictions shown as a line
# So let's run the model against the test data
y_predicted = lm.predict(x_test_set)
plt.title('Comparison of Y values in test and the Predicted values')
plt.xlabel('Trip distance')
plt.ylabel('Fare amount')
plt.scatter(x_test_set, y_test_set, color='black')  # actual test values
plt.plot(x_test_set, y_predicted, color='blue', linewidth=3)  # predicted values
plt.xticks(())
plt.yticks(())
plt.show()
Advantages of linear regression:
- Extremely simple method.
- Gives good results when the relationship between the independent variables and the dependent variable is close to linear.
- Very easy and intuitive to use and understand.
- Even when it doesn’t fit the data exactly, we can use it to find the nature of the relationship between the two variables.
Disadvantages of linear regression:
- Linear regression is limited to predicting numeric outputs.
- Very sensitive to anomalies in the data (outliers).
- If the number of parameters exceeds the number of samples available, the model starts to fit the noise rather than the relationship between the variables.
- Regression coefficients are biased by data imbalance.
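The sensitivity to outliers mentioned above can be illustrated with a small sketch: fit a line to ten points that lie exactly on Y = 2.5X + 3, then add a single extreme point and fit again. All values here are synthetic and chosen only to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Ten synthetic points lying exactly on the line Y = 2.5*X + 3.
X_clean = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_clean = 2.5 * X_clean.ravel() + 3.0
slope_clean = LinearRegression().fit(X_clean, y_clean).coef_[0]

# Add one extreme point: X = 10 with Y = 200 instead of the expected 28.
X_out = np.vstack([X_clean, [[10.0]]])
y_out = np.append(y_clean, 200.0)
slope_out = LinearRegression().fit(X_out, y_out).coef_[0]

print("slope without outlier:", slope_clean)
print("slope with one outlier:", slope_out)
```

A single point is enough to pull the slope well away from 2.5, which is why inspecting and cleaning the data before fitting matters.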