Data Source: Auto Insurance in Sweden
X = number of claims, Y = total payment for all the claims in thousands of Swedish Kronor for geographical zones in Sweden
Reference: Swedish Committee on Analysis of Risk Premium in Motor Insurance
Model Representation
In this problem we have one input variable, X, and one output variable, Y, and we want to model a linear relationship between them. The input variable is called the independent variable and the output variable is called the dependent variable. We can define this linear relationship as follows:
$$Y = \beta_0 + \beta_1 X$$
Here $\beta_1$ is the slope (the coefficient of X) and $\beta_0$ is the intercept. Fitting the model amounts to drawing a line between X and Y that estimates the relationship between them.
But how do we find these coefficients? That is the learning procedure. There are different approaches: one is the Ordinary Least Squares method and another is Gradient Descent.
Ordinary Least Squares Method
Earlier we said that we will approximate the relationship between X and Y with a line. Suppose we have a few inputs and outputs and plot them as scatter points in 2D space. The points will be spread around a straight line, and finding that line is what we aim to accomplish.
A good model has the smallest possible error, so we find the line by minimizing the error of the model. The error of each point is the vertical distance between the point and the line, and the total error is the sum of the squared errors over all points.
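Written out, the total squared error and the coefficient estimates that minimize it take the following form. The symbols $x_i$, $y_i$, $\bar{x}$, $\bar{y}$ and $m$ (the number of observations) are notation assumed here, chosen to match the variables in the implementation that follows; $\beta_0$ is the intercept and $\beta_1$ the slope.

```latex
\[
D = \sum_{i=1}^{m} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2
\]
\[
\beta_1 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m} (x_i - \bar{x})^2},
\qquad
\beta_0 = \bar{y} - \beta_1 \bar{x}
\]
```

Setting the partial derivatives of $D$ with respect to $\beta_0$ and $\beta_1$ to zero and solving gives these two expressions.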
This method is called the Ordinary Least Squares method. Now we will implement this model in Python.
Implementation:
# Importing Necessary Libraries
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20.0, 10.0)
# Reading Data
You can download the data set from the following link:
https://s3.amazonaws.com/acadgildsite/wordpress_images/datasets/slr06/slr06.xls
data = pd.read_excel('slr06.xls')
print(data.shape)
data.head()
Output:
*** No CODEPAGE record, no encoding_override: will use 'ascii'
(63, 2)
# Collecting X and Y
X = data.iloc[:,0].values
Y = data.iloc[:,1].values
# Mean X and Y
mean_x = np.mean(X)
mean_y = np.mean(Y)
# Total number of values
m = len(X)
numer = 0
denom = 0
# Accumulating the numerator and denominator of the slope estimate
for i in range(m):
    numer += (X[i] - mean_x) * (Y[i] - mean_y)
    denom += (X[i] - mean_x) ** 2
b1 = numer / denom
b0 = mean_y - (b1 * mean_x)
# Print coefficients
print(b1, b0)
Output:
3.4138235600663664 19.99448575911481
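The loop above can also be written without explicit iteration. The following is a minimal sketch, not the original implementation: the helper name `ols_fit` and the toy arrays are made up for illustration, and `np.polyfit` is used only as an independent cross-check.

```python
import numpy as np

def ols_fit(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    b1 = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
    # Intercept: the fitted line passes through the point of means
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Toy data (illustrative only) lying exactly on y = 1 + 2x
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
b0_demo, b1_demo = ols_fit(x_demo, y_demo)
print(b0_demo, b1_demo)  # prints 1.0 2.0

# Cross-check: np.polyfit with degree 1 returns [slope, intercept]
slope_check, intercept_check = np.polyfit(x_demo, y_demo, 1)
```

Calling `ols_fit(X, Y)` on the insurance arrays should reproduce the coefficients printed above, since it evaluates the same formula.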
How do we interpret the regression coefficients for linear relationships?
Regression coefficients represent the mean change in the response variable for one unit of change in the predictor variable. In a model with several predictors, each coefficient is interpreted while holding the other predictors constant; this statistical control isolates the role of one variable from all the others. Our model has a single predictor, so with the coefficients above it is:
$$Y = 19.994 + 3.414 X$$
That is the linear model: each additional claim is associated with an increase of about 3.41 thousand kronor in total payment.
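As a small illustration of how the fitted line is used, the sketch below plugs a claim count into the model. The coefficient values are copied from the output above; the function name `predict_payment` and the choice of 50 claims are hypothetical.

```python
# Coefficients as printed in the output above
b1 = 3.4138235600663664
b0 = 19.99448575911481

def predict_payment(n_claims):
    """Estimated total payment (thousands of kronor) for a given number of claims."""
    return b0 + b1 * n_claims

estimate = predict_payment(50)
print(round(estimate, 2))  # prints 190.69
```

The estimate is only as reliable as the linear fit itself, which is evaluated further on.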
Visualisation:
# Plotting Values and Regression Line
max_x = np.max(X) + 100
min_x = np.min(X) - 100
# Calculating line values x and y
x = np.linspace(min_x, max_x, 1000)
y = b0 + b1 * x
# Plotting the regression line
plt.plot(x, y, color='#58b970', label='Regression Line')
# Plotting the scatter points
plt.scatter(X, Y, c='#ef5423', label='Scatter Plot')
plt.xlabel('Number of claims')
plt.ylabel('Total payment in thousands of Swedish kronor')
plt.legend()
plt.show()
This model is not bad, but we need to quantify how good it is. There are many methods to evaluate models. We will use the Root Mean Squared Error and the Coefficient of Determination (R² score).
Root Mean Squared Error (RMSE)
RMSE is the square root of the sum of all squared errors divided by the number of values, or mathematically,
$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2}$$
where $\hat{y}_i$ is the predicted value for the i-th observation.
# Calculating Root Mean Squared Error
rmse = 0
for i in range(m):
    y_pred = b0 + b1 * X[i]
    rmse += (Y[i] - y_pred) ** 2
rmse = np.sqrt(rmse / m)
print(rmse)
Output:
35.365829968791466
Coefficient of Determination (R² Score)
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. The definition of R-squared is fairly straightforward: it is the percentage of the response variable variation that is explained by a linear model. Or:
R-squared = Explained variation / Total variation
For a least-squares line with an intercept, evaluated on the data it was fitted to, R-squared is always between 0 and 100%:
0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data. In the code below it is computed as 1 minus the ratio of the residual sum of squares (ss_r) to the total sum of squares (ss_t).
ss_t = 0
ss_r = 0
for i in range(m):
    y_pred = b0 + b1 * X[i]
    ss_t += (Y[i] - mean_y) ** 2
    ss_r += (Y[i] - y_pred) ** 2
r2 = 1 - (ss_r / ss_t)
print(r2)
Output:
0.8333466719794502
We have now implemented a Simple Linear Regression model using the Ordinary Least Squares method. Next we will see how to implement the same model using a machine learning library called scikit-learn.
The scikit-learn approach:
scikit-learn is a machine learning library for Python. Let's see how we can build the Simple Linear Regression model with it.
# Import libraries and tools
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# scikit-learn expects a 2D feature array of shape (n_samples, n_features)
X = X.reshape((m, 1))
# Creating Model
reg = LinearRegression()
# Fitting training data
reg = reg.fit(X, Y)
# Y Prediction
Y_pred = reg.predict(X)
# Calculating RMSE and Score
mse = mean_squared_error(Y, Y_pred)
rmse = np.sqrt(mse)
r2_score = reg.score(X, Y)
print(rmse)
print(r2_score)
Output:
35.365829968791466
0.8333466719794502
You can see that the results are exactly equal to those of the model we built from scratch, but this approach requires simpler code and fewer lines.
Now let us move forward to Multiple Linear Regression.
Multiple Linear Regression
Multiple Linear Regression is a type of Linear Regression in which the input has multiple features (variables).
Model Representation
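A compact way to write the model, consistent with the `X`, `W` and `cost_function` used in the implementation further on, is sketched here. The notation ($n$ features, $m$ samples, a constant feature $x_0 = 1$ for the intercept) is assumed for illustration.

```latex
\[
Y = \beta_0 x_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n, \qquad x_0 = 1
\]
\[
\hat{Y} = X W, \qquad
J(W) = \frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)} W - y^{(i)} \right)^2
\]
```

Here $X$ is the $m \times (n+1)$ matrix whose first column is all ones, $W$ is the vector of coefficients, $x^{(i)}$ is the i-th row of $X$, and $J(W)$ is the cost function that the learning procedure minimizes.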
Gradient Descent
Gradient Descent is an iterative optimization algorithm. We will use it to minimize our cost function.
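For a half mean squared error cost such as the one in the `cost_function` code further on, the standard batch update can be written as follows. This is the textbook form, given here as an assumption about the intended algorithm; $\alpha$ is the learning rate and $m$ the number of samples.

```latex
\[
\nabla J(W) = \frac{1}{m} X^{\top} (X W - Y)
\]
\[
W \leftarrow W - \alpha \, \nabla J(W)
\]
```

Each iteration moves $W$ a small step against the gradient, so the cost decreases as long as $\alpha$ is small enough.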
Implementation
We will use a student score dataset in this case study. In this particular dataset, we have math, reading and writing exam scores of 1000 students. We will try to predict the score of a writing exam from math and reading scores. Thus, we have 2 features (input variables). Let us first start by importing the dataset.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20.0, 10.0)
from mpl_toolkits.mplot3d import Axes3D
# Reading Data
You can download the data set from the following link:
https://drive.google.com/drive/u/0/folders/192X4XJbfiRkiLSTvKYSxYMrjm5u1dBYs
data = pd.read_csv('student.csv')
print(data.shape)
data.head()
Output:
(1000, 3)
 | Math | Reading | Writing |
0 | 48 | 68 | 63 |
1 | 62 | 81 | 72 |
2 | 79 | 80 | 78 |
3 | 76 | 83 | 79 |
4 | 59 | 64 | 62 |
Next, we extract the scores into arrays.
# Extracting the scores into arrays
math = data['Math'].values
read = data['Reading'].values
write = data['Writing'].values
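# Plotting the scores as a 3D scatter plot
fig = plt.figure()
# add_subplot attaches the 3D axes to the figure; Axes3D(fig) may not do so in recent Matplotlib versions
ax = fig.add_subplot(projection='3d')
ax.scatter(math, read, write, color='#ef1234')
ax.set_xlabel('math')
ax.set_ylabel('read')
ax.set_zlabel('write')
ax.set_title('3D plot of features')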
plt.show()
# Now we will generate our X, Y and W (the coefficient vector β).
m = len(math)
x0 = np.ones(m)
X = np.array([x0, math, read]).T
# Initial Coefficients
W = np.array([0, 0, 0])
Y = np.array(write)
alpha = 0.0001
# We define our cost function.
def cost_function(X, Y, W):
    m = len(Y)
    J = np.sum((X.dot(W) - Y) ** 2) / (2 * m)
    return J
initial_cost = cost_function(X, Y, W)
print(initial_cost)
Output: 2470.11
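The next step calls a `gradient_descent` function whose definition does not appear in this text. The following is a minimal sketch of what such a function could look like, assuming plain batch gradient descent and the argument order used in the call below (`X, Y, W, alpha, iterations`); it is not necessarily the original implementation. The synthetic data and the `_demo` names are made up for a quick sanity check.

```python
import numpy as np

def cost_function(X, Y, W):
    # Same half mean squared error as defined above
    m = len(Y)
    return np.sum((X.dot(W) - Y) ** 2) / (2 * m)

def gradient_descent(X, Y, W, alpha, iterations):
    """Batch gradient descent; returns the final W and the cost after each step."""
    m = len(Y)
    cost_history = [0.0] * iterations
    for it in range(iterations):
        residual = X.dot(W) - Y            # prediction error for every sample
        gradient = X.T.dot(residual) / m   # gradient of the cost with respect to W
        W = W - alpha * gradient           # step against the gradient
        cost_history[it] = cost_function(X, Y, W)
    return W, cost_history

# Sanity check on synthetic, noise-free data with known coefficients [2, 3, -1]
rng = np.random.default_rng(0)
features = rng.uniform(0.0, 1.0, size=(200, 2))
X_demo = np.column_stack([np.ones(200), features])
Y_demo = X_demo.dot(np.array([2.0, 3.0, -1.0]))
W_demo, history_demo = gradient_descent(X_demo, Y_demo, np.zeros(3), 0.1, 10000)
print(W_demo)
```

On the demo data the recovered coefficients should be very close to the true ones, and the cost history should decrease. The learning rate of 0.1 suits features in [0, 1]; the exam scores are on a larger scale, which is why a much smaller `alpha` is used for them.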
# 100000 Iterations
newW, cost_history = gradient_descent(X, Y, W, alpha, 100000)
# New values of W
print(newW)
# Final cost with the new W
print(cost_history[-1])
Output:
[-0.47889172 0.09137252 0.90144884]
10.475123473539167
# Model Evaluation - RMSE
def rmse(Y, Y_pred):
    rmse = np.sqrt(np.sum((Y - Y_pred) ** 2) / len(Y))
    return rmse
# Model Evaluation - R2 Score
def r2_score(Y, Y_pred):
    mean_y = np.mean(Y)
    ss_tot = np.sum((Y - mean_y) ** 2)
    ss_res = np.sum((Y - Y_pred) ** 2)
    r2 = 1 - (ss_res / ss_tot)
    return r2
Y_pred = X.dot(newW)
print(rmse(Y, Y_pred))
print(r2_score(Y, Y_pred))
Output:
4.5771439727277885
0.9097223273061554
We have a low RMSE and a high R² score, so the model fits the data well.
Now we will implement this model using scikit-learn for multiple linear regression.
The scikit-learn Approach
The scikit-learn approach is very similar to the one used for the Simple Linear Regression model, and just as simple. Let's implement it.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# X and Y
X = np.array([math, read]).T
Y = np.array(write)
# Model Initialization
reg = LinearRegression()
# Data Fitting
reg = reg.fit(X, Y)
# Y Prediction
Y_pred = reg.predict(X)
# Model Evaluation
rmse = np.sqrt(mean_squared_error(Y, Y_pred))
r2 = reg.score(X, Y)
print(rmse)
print(r2)
Output:
4.572887051836439
0.9098901726717316
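The scikit-learn scores are marginally better than the gradient descent ones (RMSE 4.5729 versus 4.5771). One plausible reason, stated here as an assumption rather than something verified on this data set, is that gradient descent with a small learning rate had not fully converged, whereas a direct least-squares solver reaches the minimum of the cost. The sketch below illustrates the idea on synthetic data (not the student scores), using `np.linalg.lstsq` for the exact solution; the data, seed and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 500
# Synthetic features on a score-like scale (illustrative only)
x1 = rng.uniform(40, 100, m)
x2 = rng.uniform(40, 100, m)
X = np.column_stack([np.ones(m), x1, x2])
Y = 5.0 + 0.3 * x1 + 0.6 * x2 + rng.normal(0, 4, m)

def cost(X, Y, W):
    # Half mean squared error, as in the cost function used earlier
    return np.sum((X.dot(W) - Y) ** 2) / (2 * len(Y))

# Exact least-squares solution
W_exact, *_ = np.linalg.lstsq(X, Y, rcond=None)

# A short batch gradient descent run from zero coefficients
W_gd = np.zeros(3)
alpha = 0.0001
for _ in range(2000):
    W_gd = W_gd - alpha * X.T.dot(X.dot(W_gd) - Y) / m

cost_exact = cost(X, Y, W_exact)
cost_gd = cost(X, Y, W_gd)
print(cost_exact, cost_gd)
```

The exact solution can never have a higher cost than the gradient descent result; how large the gap is depends on the learning rate, the number of iterations and the scale of the features.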