
Linear Regression Case Study

Predict the total insurance payment given the number of claims

Data Source: Auto Insurance in Sweden

X = number of claims, Y = total payment for all the claims in thousands of Swedish Kronor for geographical zones in Sweden

Reference: Swedish Committee on Analysis of Risk Premium in Motor Insurance

Model Representation

In this problem we have one input variable, X, and one output variable, Y, and we want to model a linear relationship between them. The input variable is called the independent variable and the output variable is called the dependent variable. We can define this linear relationship as follows:

$Y = b_0 + b_1 X$

Here $b_1$ is the slope (the coefficient of X) and $b_0$ is the intercept. Fitting the model amounts to drawing a line between X and Y that estimates the relationship between X and Y.

But how do we find these coefficients? That is the learning procedure. There are different approaches: one is the Ordinary Least Squares method and another is Gradient Descent.

Ordinary Least Squares Method

Earlier we said that we will approximate the relationship between X and Y with a line. Suppose we have a few inputs and outputs. If we plot them as a scatter plot in 2D space, we get a cloud of points with a straight line running through it.







That straight line is what we aim to find. A good model has a small error, so we choose the line that minimizes the total error. The error of each point is the vertical distance between the point and the line. Because these distances can be positive or negative, we square them and add them up, and we pick the coefficients that make this sum as small as possible.
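In symbols, the least-squares line has a standard closed form, written here with the names $b_0$, $b_1$ and $m$ that the Python code in this article uses. The bars denote sample means; this is a summary of the textbook result, not a derivation.

```latex
\min_{b_0,\, b_1} \; \sum_{i=1}^{m} \bigl( y_i - (b_0 + b_1 x_i) \bigr)^2

b_1 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```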

This method is called the Ordinary Least Squares method. Now we will implement this model in Python.







# Import necessary libraries
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20.0, 10.0)


# Reading Data

You can download the data set from the following link:

# Reading .xls files with pandas requires the xlrd package
data = pd.read_excel('slr06.xls')
print(data.shape)

Output:
*** No CODEPAGE record, no encoding_override: will use 'ascii'
(63, 2)









# Collecting X and Y

X = data.iloc[:,0].values
Y = data.iloc[:,1].values



# Mean of X and Y
mean_x = np.mean(X)
mean_y = np.mean(Y)
# Total number of values
m = len(X)
# Accumulate the numerator and denominator of the slope b1
numer = 0
denom = 0
for i in range(m):
    numer += (X[i] - mean_x) * (Y[i] - mean_y)
    denom += (X[i] - mean_x) ** 2
b1 = numer / denom
b0 = mean_y - (b1 * mean_x)
# Print coefficients
print(b1, b0)


3.4138235600663664 19.99448575911481
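The same calculation can be written without an explicit loop. The sketch below uses vectorized NumPy on a small made-up data set (not the Swedish insurance data); the names `ols_fit`, `x_demo` and `y_demo` are introduced here for illustration only.

```python
import numpy as np

# Made-up data for illustration only (not the insurance data set).
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.2, 4.1, 6.2, 7.9, 10.1])

def ols_fit(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    slope = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

slope_demo, intercept_demo = ols_fit(x_demo, y_demo)
print(slope_demo, intercept_demo)
```

Applied to the arrays X and Y from this case study, `ols_fit(X, Y)` should give the same b1 and b0 as the loop, up to floating-point rounding.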

How do we interpret the regression coefficients for linear relationships?

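Regression coefficients represent the mean change in the response variable for one unit of change in the predictor variable while holding the other predictors in the model constant. This statistical control is important because it isolates the role of one variable from all of the others in the model. In this simple model there is only one predictor, so the interpretation is direct: each additional claim is associated with an increase of about 3.41 thousand kronor in total payment, and the intercept of about 19.99 is the predicted payment when there are no claims. Here, we have our coefficients:

$Y = 19.99 + 3.41 X$

That is the linear model.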
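As a quick illustration of how the fitted line is used, the sketch below plugs a few claim counts into it. The function name `predict_payment` and the rounded constants are introduced here for illustration; the values come from the coefficients printed earlier.

```python
# Coefficients copied from the printed output, rounded to 4 decimals.
B0 = 19.9945  # intercept, in thousands of kronor
B1 = 3.4138   # slope, in thousands of kronor per claim

def predict_payment(num_claims):
    """Estimated total payment (thousands of kronor) for a number of claims."""
    return B0 + B1 * num_claims

for claims in (0, 10, 50):
    print(claims, round(predict_payment(claims), 2))
```

Predictions far outside the range of claim counts seen in the data should be treated with caution, since the line was only fitted to the observed zones.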


# Plotting values and regression line
max_x = np.max(X) + 100
min_x = np.min(X) - 100
# Calculating line values x and y
x = np.linspace(min_x, max_x, 1000)
y = b0 + b1 * x
# Plotting the line
plt.plot(x, y, color='#58b970', label='Regression Line')
# Plotting the scatter points
plt.scatter(X, Y, c='#ef5423', label='Scatter Plot')
# Axis labels match the insurance data: X = number of claims, Y = total payment
plt.xlabel('Number of claims')
plt.ylabel('Total payment in thousands of Swedish Kronor')
plt.legend()
plt.show()







This model is not bad, but we need to measure how good it is. There are many methods to evaluate models. We will use the Root Mean Squared Error and the Coefficient of Determination (R2 score).

Root Mean Squared Error (RMSE)

RMSE is the square root of the sum of all squared errors divided by the number of values, or mathematically,

$RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2}$

where $\hat{y}_i = b_0 + b_1 x_i$ is the predicted value for the i-th point.

# Calculating Root Mean Squared Error
rmse = 0
for i in range(m):
    y_pred = b0 + b1 * X[i]
    # Accumulate the squared error of each point
    rmse += (Y[i] - y_pred) ** 2
rmse = np.sqrt(rmse/m)
print(rmse)
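To make the arithmetic concrete, here is a small worked example on made-up numbers, using a vectorized helper. The name `rmse_vectorized` and the demo values are assumptions for illustration, not part of the case study.

```python
import numpy as np

def rmse_vectorized(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Errors are 1, -1, 2, -2, so the squared errors are 1, 1, 4, 4.
# Their mean is 2.5, and RMSE is the square root of 2.5.
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 6.0, 5.0, 11.0]
print(rmse_vectorized(y_true_demo, y_pred_demo))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result is in the same units as Y.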



Coefficient of Determination (R2 Score)

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. The definition of R-squared is fairly straightforward: it is the percentage of the response variable variation that is explained by a linear model. Or:

R-squared = Explained variation / Total variation

For a least-squares line with an intercept, evaluated on the data it was fitted to, R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its mean. 100% indicates that the model explains all the variability of the response data around its mean. In general, the higher the R-squared, the better the model fits your data.

In terms of the total sum of squares $SS_t$ and the residual sum of squares $SS_r$:

$SS_t = \sum_{i=1}^{m} (y_i - \bar{y})^2, \qquad SS_r = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2, \qquad R^2 = 1 - \frac{SS_r}{SS_t}$

# Calculating R2 Score
ss_t = 0
ss_r = 0
for i in range(m):
    y_pred = b0 + b1 * X[i]
    # Total and residual sums of squares
    ss_t += (Y[i] - mean_y) ** 2
    ss_r += (Y[i] - y_pred) ** 2
r2 = 1 - (ss_r/ss_t)
print(r2)
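Two boundary cases help in reading this score. The sketch below, on made-up data, shows that perfect predictions give a score of 1 and that always predicting the mean gives a score of 0. The helper name `r2_from_predictions` is introduced here for illustration.

```python
import numpy as np

def r2_from_predictions(y_true, y_pred):
    """Coefficient of determination computed as 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_total = np.sum((y_true - y_true.mean()) ** 2)
    ss_residual = np.sum((y_true - y_pred) ** 2)
    return 1 - ss_residual / ss_total

y_demo = np.array([1.0, 2.0, 3.0, 4.0])

# Predicting every value exactly leaves no residual error.
perfect = r2_from_predictions(y_demo, y_demo)

# Predicting the mean everywhere explains none of the variation.
baseline = r2_from_predictions(y_demo, np.full(4, y_demo.mean()))

print(perfect, baseline)
```

Predictions that are worse than the mean would make this expression negative, which is why the 0 to 100% range is tied to least-squares fits evaluated on their own training data.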



We have now implemented a Simple Linear Regression model using the Ordinary Least Squares method. Next we will see how to implement the same model using a machine learning library called scikit-learn.

The scikit-learn approach:

scikit-learn is a machine learning library for Python. Let's see how we can build the Simple Linear Regression model with it.

# Import libraries and tools
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# scikit-learn expects a 2D feature array, so reshape X from (m,) to (m, 1)
X = X.reshape((m, 1))
# Creating Model
reg = LinearRegression()
# Fitting training data
reg = reg.fit(X, Y)
# Y Prediction
Y_pred = reg.predict(X)
# Calculating RMSE and R2 Score
mse = mean_squared_error(Y, Y_pred)
rmse = np.sqrt(mse)
r2_score = reg.score(X, Y)





These values match the model we built from scratch, but this approach needs simpler code and fewer lines.

Now let us move forward to Multiple Linear Regression.

Multiple Linear Regression

Multiple Linear Regression is linear regression in which the input has multiple features (variables).

Model Representation
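With the two features used in this case study, and writing the coefficients as $W = (w_0, w_1, w_2)$ to match the array `W` in the code, the model can be written as follows. The constant feature $x_0 = 1$ corresponds to the column of ones that the code places in `X`; the symbols $w_j$ and $h_W$ are notation introduced here.

```latex
h_W(x) = w_0 x_0 + w_1 x_1 + w_2 x_2 = W^{\top} x, \qquad x_0 = 1
```

For all $m$ rows at once, the predictions are the matrix-vector product $XW$, which is what `X.dot(W)` computes in the code.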











Gradient Descent

Gradient Descent is an optimization algorithm. We will use it to minimize our cost function, starting from initial coefficients and repeatedly moving them a small step in the direction that reduces the cost.
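The cost below matches the `cost_function` defined in the code of this article. The gradient and update rule are the standard batch gradient descent step for that cost, stated here as an assumption about the intended algorithm; $\alpha$ is the learning rate (`alpha` in the code).

```latex
J(W) = \frac{1}{2m} \sum_{i=1}^{m} \bigl( (XW)_i - y_i \bigr)^2

\nabla J(W) = \frac{1}{m} X^{\top} (XW - Y)

W \leftarrow W - \alpha \, \nabla J(W)
```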





















We will use a student score dataset in this case study. It contains the math, reading and writing exam scores of 1000 students. We will try to predict the writing score from the math and reading scores, so we have 2 features (input variables). Let us start by importing the dataset.

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20.0, 10.0)
from mpl_toolkits.mplot3d import Axes3D

# Reading Data

You can download the data set from the following link:

data = pd.read_csv('student.csv')
print(data.shape)
data.head()

Output: (1000, 3)

   Math  Reading  Writing
0    48       68       63
1    62       81       72
2    79       80       78
3    76       83       79
4    59       64       62

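Next, we extract the scores into arrays and plot them.

# Extract the scores into arrays
math = data['Math'].values
read = data['Reading'].values
write = data['Writing'].values
# Plotting the scores as a 3D scatter plot
fig = plt.figure()
# add_subplot(projection='3d') is used instead of Axes3D(fig), which no longer
# attaches the axes to the figure in recent Matplotlib versions
ax = fig.add_subplot(projection='3d')
ax.scatter(math, read, write, color='#ef1234')
ax.set_title('3D plot of features')
plt.show()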










# Now we will generate our X, Y and W.
m = len(math)
# Column of ones for the intercept term
x0 = np.ones(m)
X = np.array([x0, math, read]).T
# Initial coefficients (float array of zeros)
W = np.zeros(3)
Y = np.array(write)
# Learning rate
alpha = 0.0001
# We define our cost function.
def cost_function(X, Y, W):
    m = len(Y)
    J = np.sum((X.dot(W) - Y) ** 2)/(2 * m)
    return J
initial_cost = cost_function(X, Y, W)
print(initial_cost)

Output: 2470.11
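The next step calls a function named `gradient_descent` whose definition is not shown. A minimal sketch consistent with that call's arguments and its two return values is given here, assuming plain batch gradient descent. `cost_function` is repeated so the block runs on its own, and the demo data at the bottom is made up.

```python
import numpy as np

def cost_function(X, Y, W):
    """Half the mean squared error of the predictions X.dot(W)."""
    m = len(Y)
    return np.sum((X.dot(W) - Y) ** 2) / (2 * m)

def gradient_descent(X, Y, W, alpha, iterations):
    """Batch gradient descent (assumed form); returns final W and cost per step."""
    cost_history = [0.0] * iterations
    m = len(Y)
    for iteration in range(iterations):
        # Difference between current predictions and targets
        loss = X.dot(W) - Y
        # Gradient of the cost with respect to W
        gradient = X.T.dot(loss) / m
        # Step against the gradient, scaled by the learning rate
        W = W - alpha * gradient
        cost_history[iteration] = cost_function(X, Y, W)
    return W, cost_history

# Made-up, noise-free demo: y = 2 + 3 * x1
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 200)
X_demo = np.column_stack([np.ones(200), x1])
Y_demo = 2.0 + 3.0 * x1
W_demo, history_demo = gradient_descent(X_demo, Y_demo, np.zeros(2), 0.5, 5000)
print(W_demo, history_demo[-1])
```

The learning rate of 0.5 suits the demo, where the feature lies between 0 and 1. With exam scores in the tens, a much smaller value such as the `alpha = 0.0001` used in this case study is needed to keep the updates from diverging.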












# 100000 iterations
newW, cost_history = gradient_descent(X, Y, W, alpha, 100000)
# New values of W
print(newW)
# Final cost with the new W
final_cost = cost_history[-1]

Output:
[-0.47889172  0.09137252  0.90144884]
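Because the cost is a least-squares objective, coefficients found by gradient descent can be sanity-checked against a direct least-squares solve. The sketch below shows such a solve with NumPy's `np.linalg.lstsq` on made-up data; the variable names and the data-generating coefficients are assumptions for illustration.

```python
import numpy as np

# Made-up data: two features on a 0-100 scale plus a column of ones.
rng = np.random.default_rng(1)
m_demo = 500
f1 = rng.uniform(0, 100, m_demo)
f2 = rng.uniform(0, 100, m_demo)
X_demo = np.column_stack([np.ones(m_demo), f1, f2])

# Targets generated from known coefficients with a little noise.
Y_demo = 1.5 + 0.2 * f1 + 0.7 * f2 + rng.normal(0, 1.0, m_demo)

# Direct least-squares solution of X_demo @ W = Y_demo.
W_exact, *_ = np.linalg.lstsq(X_demo, Y_demo, rcond=None)
print(W_exact)
```

Running the same solve on the X and Y of this case study gives a reference point for newW. Small differences, especially in the intercept, are expected if gradient descent has not fully converged.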






# Model Evaluation - RMSE
def rmse(Y, Y_pred):
    rmse = np.sqrt(sum((Y - Y_pred) ** 2) / len(Y))
    return rmse
# Model Evaluation - R2 Score
def r2_score(Y, Y_pred):
    mean_y = np.mean(Y)
    ss_tot = sum((Y - mean_y) ** 2)
    ss_res = sum((Y - Y_pred) ** 2)
    r2 = 1 - (ss_res / ss_tot)
    return r2
# Predictions from the coefficients found by gradient descent
Y_pred = X.dot(newW)
print(rmse(Y, Y_pred))
print(r2_score(Y, Y_pred))





We have a low RMSE and a good R2 score, so the model appears to fit the data well.

Now we will implement this model using scikit-learn for multiple linear regression.


The scikit-learn Approach

The scikit-learn approach is very similar to the one we used for the Simple Linear Regression model, and just as simple. Let's implement it.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# X and Y (no column of ones: LinearRegression fits the intercept itself)
X = np.array([math, read]).T
Y = np.array(write)
# Model Initialization
reg = LinearRegression()
# Data Fitting
reg = reg.fit(X, Y)
# Y Prediction
Y_pred = reg.predict(X)
# Model Evaluation
rmse = np.sqrt(mean_squared_error(Y, Y_pred))
r2 = reg.score(X, Y)





An alumnus of the NIE-Institute Of Technology, Mysore, Prateek is an ardent Data Science enthusiast. He has been working at Acadgild as a Data Engineer for the past 3 years. He is a Subject-matter expert in the field of Big Data, Hadoop ecosystem, and Spark.
