In this article, we will perform regression analysis in R on a cars data set to predict labour cost.
Linear regression is used to find a linear relationship between variables, which are divided into a target and one or more predictors.
There are two types of linear regression:
- Simple and
- Multiple
Simple Linear Regression
Simple linear regression is used to find the relationship between two continuous variables. One is the independent variable, also known as the predictor, and the other is the dependent variable, also known as the response or target.
For example, the relationship between height and weight is a typical case for simple linear regression.
Multiple Linear Regression
Multiple linear regression is a statistical technique that uses several independent variables to predict the outcome of a dependent or response variable. The goal of multiple linear regression is to model the linear relationship between the independent variables and dependent variable.
For example, a student's final exam score depends on several factors such as attendance, practical marks and internal tests, which makes it a typical case for multiple linear regression.
Data file:
The data set has 1624 observations (rows) and 7 variables (columns). We will perform multiple linear regression on it to predict Labour Cost from factors such as the Mileage of the car and Labour Hours.
While building the model we will carry out the following operations:
- Scatter Plot: A data visualization that uses dots to represent the values of two variables, one along the x-axis and the other along the y-axis.
It is used to show the relationship between two variables.
Here we will plot graphs for three variables (Mileage, Labour Cost and Labour Hours).
- Developing a Linear Model: After plotting the graphs, we will build a linear model using these three variables, check the significance of each predictor and drop any predictor with low significance.
- Comparing Full and Reduced Model using ANOVA: We will compare the two linear models we have built and check which one is the better choice.
- Prediction: After the ANOVA test, we will keep the better model and use it to predict Labour Cost for different values of Labour Hours at the 95% level of confidence.
- Confidence Interval: A range of values that is likely to contain the true value. Commonly used confidence levels are 90%, 95% and 99%.
The confidence level describes how reliable the procedure is: at the 95% level, intervals built this way would contain the true value in about 95% of repeated samples. A level of 95% or higher is usually considered “good enough” in statistics.
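To make the idea of a confidence interval concrete, here is a minimal sketch that computes a 95% interval for a mean in two ways. The sample is simulated and the numbers are made up for illustration; they have nothing to do with the cars data.

```r
# Simulated measurements (made-up values, for illustration only).
set.seed(1)
x <- rnorm(50, mean = 100, sd = 15)

# 95% confidence interval for the mean, as reported by t.test().
ci <- t.test(x, conf.level = 0.95)$conf.int
ci

# The same interval by hand: mean +/- t quantile * standard error.
manual <- mean(x) + c(-1, 1) * qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
manual
```

Changing conf.level to 0.90 or 0.99 gives a narrower or wider interval respectively.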
Let’s begin with the coding now:
We’ll first load the data set in R and process it:
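A minimal sketch of this step, assuming the data is stored in a CSV file. The file location, the column types and every value below are stand-ins invented for illustration, so the example first writes a simulated table with the same seven column names to a temporary file and then reads it back the way a real file would be read.

```r
# Simulated stand-in for the data file (all values are made up).
set.seed(42)
n  <- 1624
LH <- round(runif(n, 0.5, 25), 1)                    # labour hours
sim <- data.frame(
  Vehicle = seq_len(n),
  FM      = rbinom(n, 1, 0.1),                       # type and meaning assumed
  Mileage = round(runif(n, 1, 50000)),
  LH      = LH,
  LC      = round(70 * LH + rnorm(n, sd = 5), 2),    # hypothetical rate of 70 per hour
  mc      = round(runif(n, 50, 2000), 2),            # type and meaning assumed
  State   = sample(c("MH", "KA", "DL"), n, replace = TRUE)
)
csv_path <- tempfile(fileext = ".csv")               # hypothetical file location
write.csv(sim, csv_path, row.names = FALSE)

# Load the data set, then copy it to a more descriptive name.
dataset <- read.csv(csv_path)
cars <- dataset
dim(cars)   # number of rows and columns
```

Note that naming the data frame cars hides R's built-in cars data set for the rest of the session.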
In the above code we have loaded the data set into the variable dataset and then assigned it to a new variable, cars, for better readability.
The head() function returns the first few rows of the data set (six by default). As we can see, there are 7 columns: Vehicle, FM, Mileage, LH (Labour Hours), LC (Labour Cost), mc and State.
The str() function outputs the structure of the data.
The summary() function summarizes the data and gives six statistics for each numeric column: minimum, 1st quartile, median, mean, 3rd quartile and maximum. It also reports the number of missing values (NA) in a column, if any.
In our data set there are no missing values.
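The inspection calls described above can be sketched as follows. The data frame here is a simulated stand-in with only three of the seven columns, and all of its values (including the rate of 70 per hour) are assumptions made for illustration.

```r
# Stand-in for the cars data frame (simulated; all values are made up).
set.seed(42)
n <- 1624
cars <- data.frame(
  Mileage = round(runif(n, 1, 50000)),
  LH      = round(runif(n, 0.5, 25), 1)
)
cars$LC <- round(70 * cars$LH + rnorm(n, sd = 5), 2)  # hypothetical rate of 70 per hour

head(cars)             # first six rows
str(cars)              # column types and a preview of the values
summary(cars)          # min, quartiles, median, mean and max per column
colSums(is.na(cars))   # explicit count of missing values per column
```

The colSums(is.na(...)) call is an extra check that is not part of the original walkthrough; it makes the missing-value count explicit even when summary() prints no NA row.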
Now we will visualize the data using scatterplots to see the correlation between variables.
Here the pairs() function is used to draw a scatterplot matrix for columns 3 to 5, that is, Mileage, LH and LC.
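A minimal sketch of the plotting step, using a simulated stand-in for the data (all values are made up). Because the stand-in has only three columns, they are selected by name; with the full seven-column data frame, cars[3:5] selects the same three variables.

```r
# Stand-in for the cars data frame (simulated; all values are made up).
set.seed(42)
n <- 1624
cars <- data.frame(
  Mileage = round(runif(n, 1, 50000)),
  LH      = round(runif(n, 0.5, 25), 1)
)
cars$LC <- 70 * cars$LH + rnorm(n, sd = 5)   # hypothetical rate of 70 per hour plus noise

# Scatterplot matrix: one panel for every pair of variables.
pairs(cars[, c("Mileage", "LH", "LC")])

# Correlation matrix, to put numbers on what the plot shows.
cor_matrix <- cor(cars[, c("Mileage", "LH", "LC")])
cor_matrix
```

The cor() call is an addition to the walkthrough; a value close to 1 between LH and LC corresponds to points lying along a straight line.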
We can see from the above graph that LH and LC are almost perfectly correlated, as the points in their scatterplot lie on a straight line.
Building Linear model
The lm() function (short for linear model) is used to fit a linear model. It accepts a number of arguments; the most important are the model formula and the data. Here we create the model using the variables Mileage and LH to predict the target, LC, of the cars data set.
#multiple linear regression
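A minimal sketch of the model-building call, fitted on a simulated stand-in for the data. All values are made up, and in this stand-in Mileage is unrelated to LC by construction, which is an assumption of the example rather than a property of the real data.

```r
# Stand-in for the cars data frame (simulated; all values are made up).
set.seed(42)
n <- 1624
cars <- data.frame(
  Mileage = round(runif(n, 1, 50000)),
  LH      = round(runif(n, 0.5, 25), 1)
)
cars$LC <- 70 * cars$LH + rnorm(n, sd = 5)   # hypothetical rate of 70 per hour plus noise

# Multiple linear regression: LC explained by Mileage and LH.
model <- lm(LC ~ Mileage + LH, data = cars)
model          # prints the call and the fitted coefficients
coef(model)    # the same coefficients as a named vector
```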
From the above output we can read the intercept and the coefficients of Mileage and LH. We will now summarize the model to get more statistical information.
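The summary step can be sketched as below, again on a simulated stand-in (all values are made up, so the printed numbers will not match the real data). The last line pulls the p-values out of the coefficient table, which is the column used to judge significance.

```r
# Stand-in for the cars data frame (simulated; all values are made up).
set.seed(42)
n <- 1624
cars <- data.frame(
  Mileage = round(runif(n, 1, 50000)),
  LH      = round(runif(n, 0.5, 25), 1)
)
cars$LC <- 70 * cars$LH + rnorm(n, sd = 5)   # hypothetical rate of 70 per hour plus noise

model <- lm(LC ~ Mileage + LH, data = cars)

# Coefficient table with standard errors, t values, p-values and significance codes.
model_summary <- summary(model)
model_summary

# p-values only, one per coefficient.
p_values <- coef(model_summary)[, "Pr(>|t|)"]
p_values
```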
After summarizing the model, we can see that Mileage has low significance: its p-value (column Pr(>|t|)) is large, it carries no significance code in the table, and it does not reach even the 90% confidence level. Therefore we will drop the Mileage column.
On the other hand, the variable LH is highly significant, as indicated by ‘***’ (p-value below 0.001).
Now we’ll build another model using LH as the only predictor.
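A minimal sketch of the reduced model, using the name model_1 from the text and a simulated stand-in for the data (all values are made up).

```r
# Stand-in for the cars data frame (simulated; all values are made up).
set.seed(42)
n <- 1624
cars <- data.frame(
  Mileage = round(runif(n, 1, 50000)),
  LH      = round(runif(n, 0.5, 25), 1)
)
cars$LC <- 70 * cars$LH + rnorm(n, sd = 5)   # hypothetical rate of 70 per hour plus noise

# Simple linear regression: LC explained by LH alone.
model_1 <- lm(LC ~ LH, data = cars)
summary(model_1)
```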
We get a similar result, and since we already know that LH is significant, we will run an ANOVA test to compare the full and reduced models.
ANOVA test
Here we conduct an ANOVA test to compare two nested models: a “full” model with multiple predictors and a “reduced” model with a single predictor.
We use model (with predictors Mileage and LH) as full_model and model_1 (with predictor LH) as reduced_model.
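The comparison can be sketched as follows on a simulated stand-in for the data (all values are made up). The anova() call runs an F-test of whether the extra predictor in the full model reduces the residual sum of squares by more than chance would explain.

```r
# Stand-in for the cars data frame (simulated; all values are made up).
set.seed(42)
n <- 1624
cars <- data.frame(
  Mileage = round(runif(n, 1, 50000)),
  LH      = round(runif(n, 0.5, 25), 1)
)
cars$LC <- 70 * cars$LH + rnorm(n, sd = 5)   # hypothetical rate of 70 per hour plus noise

full_model    <- lm(LC ~ Mileage + LH, data = cars)
reduced_model <- lm(LC ~ LH, data = cars)

# F-test for nested models: smaller model first, larger model second.
comparison <- anova(reduced_model, full_model)
comparison
```

A large p-value in the Pr(>F) column means the extra predictor does not improve the fit significantly, which favours the reduced model. Because the two models differ by a single term, this p-value equals the p-value of Mileage in the summary of the full model.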
From the above result, we conclude that the simpler model_1 (LC ~ LH) is preferable to model (LC ~ Mileage + LH), since adding Mileage does not significantly improve the fit.
Prediction
Now we will make predictions with model_1 at the 95% level of confidence, using LH values of 12 and 15 hours as inputs.
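A minimal sketch of the prediction step on a simulated stand-in for the data (all values are made up, so the predicted costs are illustrative only). The choice of interval = "confidence" is an assumption; it gives the interval for the mean Labour Cost at each LH value, whereas interval = "prediction" would give a wider interval for an individual car.

```r
# Stand-in for the cars data frame (simulated; all values are made up).
set.seed(42)
n <- 1624
cars <- data.frame(
  Mileage = round(runif(n, 1, 50000)),
  LH      = round(runif(n, 0.5, 25), 1)
)
cars$LC <- 70 * cars$LH + rnorm(n, sd = 5)   # hypothetical rate of 70 per hour plus noise

model_1 <- lm(LC ~ LH, data = cars)

# New inputs must use the same column name as the predictor in the model.
new_hours <- data.frame(LH = c(12, 15))

# Point estimate (fit) with lower (lwr) and upper (upr) bounds at the 95% level.
predictions <- predict(model_1, newdata = new_hours,
                       interval = "confidence", level = 0.95)
predictions
```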
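From the above result we can see the predicted Labour Cost for each value of Labour Hours, reported as three values in Rupees: fit (the point estimate) and the lower and upper bounds of the 95% interval. Since model_1 has a single predictor, this prediction comes from simple linear regression.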
This is how we perform regression analysis to make predictions. For any query or suggestion, drop us a comment below. We hope you find this blog helpful.
Keep visiting our website Acadgild, for more blogs on Data Science and Data Analysis.
Suggested reading:
Linear Model Building Using Airquality Data Set with R: https://acadgild.com/blog/linear-model-building