In this article, we will predict premium insurance policyholders using Linear Regression with R.
We have already worked through a Multiple Linear Regression problem in a previous blog, which you can refer to for a better understanding:
In this blog, we use a dataset that records a person's age, sex, BMI, and the region where they live. It also states whether the person has any children and whether the person smokes.
We will determine which of these attributes contributes most to a person becoming a premium insurance holder. A person who is charged more is treated as a premium policyholder.
The dataset has the columns listed below. You can also download it from the given link:
Column details –
- age: age of primary beneficiary
- sex: gender (female, male)
- bmi: body mass index, an objective index of body weight relative to height (kg / m^2), computed as weight divided by the square of height. It indicates whether a person's weight is relatively high or low for their height.
- children: Number of children covered by health insurance
- smoker: whether the person smokes (yes / no)
- region: the policyholder’s residential area in the US (northeast, southeast, southwest, northwest)
- charges: Individual medical costs billed by health insurance
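As a quick illustration of the bmi column, the index is weight in kilograms divided by the square of height in metres. The weight and height below are made-up values for the example, not a row from the dataset.

```r
# Hypothetical person, not taken from the dataset.
weight_kg <- 70
height_m  <- 1.75

# BMI = weight (kg) / height (m)^2
bmi <- weight_kg / height_m^2
round(bmi, 1)
```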
Now that we have had a brief introduction to the dataset, let’s begin coding.
We will first load the dataset into R and process it:
To do that, we start by installing the required packages.
We then load the installed packages.
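A minimal sketch of the install-and-load step is shown below. The package list is an assumption: the text later uses describe() (assumed here to come from psych), the corrplot package, and mutate_all() (from dplyr), so those three are used. The install line is left commented out because it needs internet access.

```r
# Packages this walkthrough appears to rely on (assumed list).
pkgs <- c("psych", "corrplot", "dplyr")

# Work out which of them are not installed yet.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Install only what is missing (uncomment to run).
# install.packages(missing_pkgs)

# Load the packages that are available.
for (p in setdiff(pkgs, missing_pkgs)) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```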
We read the dataset and store it in a variable “insurance”.
The head() function returns the first few records of the dataset. As we can see, there are 7 columns, which we have already discussed.
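The reading step and the head() call might look like the sketch below. So that it runs anywhere, it first writes a tiny stand-in CSV with three made-up rows to a temporary file; with the real data you would pass the path of your downloaded file to read.csv() instead.

```r
# Stand-in file with invented rows; replace csv_path with your own download.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "age,sex,bmi,children,smoker,region,charges",
  "25,female,22.5,0,no,northeast,3200.50",
  "47,male,31.2,2,yes,southeast,38950.75",
  "36,female,27.8,1,no,southwest,6100.00"
), csv_path)

# Read the file; text columns become factors, which suits lm() later on.
insurance <- read.csv(csv_path, stringsAsFactors = TRUE)

# First few records (all three rows in this stand-in).
head(insurance)
```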
Next we perform exploratory data analysis (EDA) to get to know the dataset before modelling it. The describe() function gives a statistical description of the dataset.
The str() function shows the structure of the data.
We now know that the dataset has 7 variables and 1338 observations (rows).
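The sketch below shows these calls on a small synthetic data frame, so the row and column counts will not match the real dataset. describe() is assumed to come from the psych package and is skipped if that package is not installed.

```r
# Synthetic stand-in for the insurance data (not the real values).
set.seed(42)
n <- 50
insurance <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = round(rnorm(n, mean = 30, sd = 6), 1),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Structure: column names, types, and the first few values.
str(insurance)

# Number of observations (rows) and variables (columns).
dims <- dim(insurance)
dims

# Descriptive statistics for the numeric columns, if psych is available.
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(insurance[, c("age", "bmi")]))
}
```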
The summary() function summarizes each numeric column with six statistics: minimum, 1st quartile, median, mean, 3rd quartile, and maximum. It also reports the number of null (NA) values in a column, if there are any.
As we can see, there are no null values in our dataset.
We cross-check whether any null value is present in the whole dataset by using the any() function.
any() takes one or more logical vectors and returns TRUE if at least one of their values is TRUE, and FALSE otherwise. Combined with is.na(), it tells us whether any cell in the dataset is missing.
Hence we see that there is no null value present.
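To show what these checks report when a value is missing, the sketch below uses a tiny made-up data frame with one NA in it. On the real dataset the same calls would report no missing values, as described above.

```r
# Toy data with one deliberately missing age.
toy <- data.frame(
  age = c(19, 18, NA, 33),
  bmi = c(27.9, 33.8, 33.0, 22.7)
)

# summary() adds an "NA's" entry for columns that contain missing values.
summary(toy)

# TRUE if at least one cell in the whole data frame is missing.
has_na <- any(is.na(toy))
has_na

# Missing values per column, to see where they are.
na_per_col <- colSums(is.na(toy))
na_per_col
```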
Now we will visualize the data using boxplots. A boxplot summarizes how the data in a column is distributed by showing its median, quartiles, and outliers.
Here, we compare the distributions of columns 1 and 3, that is, age and BMI, by drawing their boxplots side by side. The outliers are shown in red.
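A possible version of this plot in base R is sketched below on synthetic age and BMI values. The columns are selected by name rather than by position, and outcol = "red" colours the outlier points.

```r
# Synthetic stand-in for the age and bmi columns.
set.seed(1)
n <- 100
insurance <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  bmi = rnorm(n, mean = 30, sd = 6)
)

# One box per column; outliers drawn as filled red points.
bp <- boxplot(insurance[, c("age", "bmi")],
              outcol = "red", pch = 19,
              main = "Distribution of age and BMI")

# Values flagged as outliers (may be empty).
bp$out
```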
Histograms are another useful visualization: they show the shape of a variable's distribution and how frequently each range of values occurs.
Here we create histograms for columns 1, 3 and 7, that is, age, BMI and charges.
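The histograms could be drawn as in the sketch below, which uses synthetic values for the three columns and places the plots side by side with par(mfrow = ...).

```r
# Synthetic stand-in for age, bmi and charges.
set.seed(2)
n <- 100
insurance <- data.frame(
  age     = sample(18:64, n, replace = TRUE),
  bmi     = rnorm(n, mean = 30, sd = 6),
  charges = rlnorm(n, meanlog = 9, sdlog = 0.8)
)

# Three plots in one row; remember the old settings so we can restore them.
old_par <- par(mfrow = c(1, 3))
hists <- lapply(c("age", "bmi", "charges"), function(col) {
  hist(insurance[[col]], main = paste("Histogram of", col), xlab = col)
})
par(old_par)
```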
A bar chart represents data as rectangular bars, with the length of each bar proportional to the value it represents, here the number of records in each category.
Here we plot bar charts for columns 2, 4, 5 and 6, that is, sex, children, smoker, and region.
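One way to produce these bar charts in base R is to count each category with table() and pass the counts to barplot(), as sketched below on synthetic data.

```r
# Synthetic stand-in for the four categorical/count columns.
set.seed(3)
n <- 100
insurance <- data.frame(
  sex      = sample(c("female", "male"), n, replace = TRUE),
  children = sample(0:5, n, replace = TRUE),
  smoker   = sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)),
  region   = sample(c("northeast", "northwest", "southeast", "southwest"),
                    n, replace = TRUE)
)

# 2 x 2 grid of bar charts, one per column.
old_par <- par(mfrow = c(2, 2))
counts <- lapply(names(insurance), function(col) {
  tab <- table(insurance[[col]])          # records per category
  barplot(tab, main = col, ylab = "count")
  tab
})
names(counts) <- names(insurance)
par(old_par)
```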
We will now find the correlation between age, BMI, children and charges.
We can see from the above graph that, among these variables, age has the strongest correlation with charges.
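A sketch of this step using base R is shown below: cor() gives the correlation matrix and pairs() draws a scatterplot matrix. The data is synthetic and is built so that charges depends on age, so the numbers will differ from the real dataset; the plotting function used in the original is not shown, so pairs() is an assumption.

```r
# Synthetic numeric columns; charges is constructed to depend on age and bmi.
set.seed(4)
n <- 200
num <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  bmi      = rnorm(n, mean = 30, sd = 6),
  children = sample(0:5, n, replace = TRUE)
)
num$charges <- 250 * num$age + 300 * num$bmi + rnorm(n, sd = 3000)

# Pairwise Pearson correlations between the four columns.
corr_num <- cor(num)
round(corr_num, 2)

# Scatterplot matrix for a visual check.
pairs(num)
```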
Next we extend the correlation analysis to all the columns.
The corrplot package provides a graphical display of a correlation matrix and confidence intervals. It also contains some algorithms for matrix reordering. The mutate_all() function from dplyr applies a given function to every column of a data frame, which is useful for converting the categorical columns to numeric before computing correlations. Positive correlations are displayed in blue and negative correlations in red.
From the above graph, we can see that there is a positive correlation between smoker and charges.
We then compute the correlation between the dependent variable, charges, and each of the independent variables.
Again, we can see that the correlation between smoker and the dependent variable is the highest.
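The two correlation steps above can be sketched as follows. This version uses base R to convert every column to numeric (the dplyr equivalent would be roughly mutate_all(insurance, as.numeric)), and only calls corrplot if the package is installed. The data is synthetic and deliberately built so that smokers are charged more.

```r
# Synthetic stand-in; charges is constructed to depend on smoker and age.
set.seed(5)
n <- 200
insurance <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)
insurance$charges <- 250 * insurance$age +
  24000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

# Convert every column to numeric; factor levels become integer codes.
insurance_num <- as.data.frame(lapply(insurance, as.numeric))

# Correlation matrix over all columns.
corr_all <- cor(insurance_num)

# Graphical display, if corrplot is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr_all, method = "circle")
}

# Correlation of charges with each independent variable, strongest first.
predictors <- setdiff(names(insurance_num), "charges")
corr_with_charges <- sort(corr_all[predictors, "charges"], decreasing = TRUE)
corr_with_charges
```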
We will now create a model with all the independent variables to check their significance and the R-squared value.
As we can see, the variables sex and region are not statistically significant, whereas all the other variables are highly significant. Dropping sex and region should therefore have little impact on our model. The R-squared value for this model is 0.7509.
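A minimal sketch of fitting the full model with lm() is shown below. It runs on synthetic data with the same seven column names, so the coefficients and R-squared will not match the 0.7509 reported above.

```r
# Synthetic stand-in with the same columns as the insurance dataset.
set.seed(6)
n <- 300
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = rnorm(n, mean = 30, sd = 6),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 250 * insurance$age + 320 * insurance$bmi +
  450 * insurance$children + 24000 * (insurance$smoker == "yes") +
  rnorm(n, sd = 6000)

# charges ~ . uses every other column as a predictor.
model_1 <- lm(charges ~ ., data = insurance)

# Coefficients, p-values (significance) and R-squared.
summary(model_1)
r2_full <- summary(model_1)$r.squared
r2_full
```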
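We will create one more model, this time leaving out the columns sex and region.
Here we see that the R-squared value has dropped to 0.7497, which is slightly lower than that of model_1. Because R-squared can never increase when predictors are removed, this small drop alone does not tell us whether model_1 is meaningfully better.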
The ANOVA test compares model_1 with model_2 to check which model gives better results.
We can see from the above result that the first model performs better than the second model, so we will use model_1 for our predictions.
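The reduced model and the ANOVA comparison might look like the sketch below, again on synthetic data. anova() on two nested lm fits runs an F-test of whether the extra terms (sex and region) improve the fit; in this synthetic data they have no real effect, so the outcome here says nothing about the real dataset.

```r
# Synthetic stand-in with the same columns as the insurance dataset.
set.seed(6)
n <- 300
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = rnorm(n, mean = 30, sd = 6),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 250 * insurance$age + 320 * insurance$bmi +
  450 * insurance$children + 24000 * (insurance$smoker == "yes") +
  rnorm(n, sd = 6000)

model_1 <- lm(charges ~ ., data = insurance)                              # full model
model_2 <- lm(charges ~ age + bmi + children + smoker, data = insurance)  # without sex, region

# R-squared of both models.
r2 <- c(model_1 = summary(model_1)$r.squared,
        model_2 = summary(model_2)$r.squared)
r2

# F-test comparing the nested models (smaller model first).
cmp <- anova(model_2, model_1)
cmp
```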
We now prepare our training and test data and make predictions with model_1.
The above results show the error, that is, the difference between the actual and predicted values, along with the error percentage.
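A sketch of the split, prediction and error calculation is given below. Several details are assumptions, since the original code is not shown: an 80/20 random split, refitting the full model on the training rows only, and defining the error percentage as the absolute error relative to the actual charge. The data is synthetic, so the resulting figure will not match the ~19% quoted in this blog.

```r
# Synthetic stand-in with the same columns as the insurance dataset.
set.seed(7)
n <- 300
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = rnorm(n, mean = 30, sd = 6),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 250 * insurance$age + 320 * insurance$bmi +
  450 * insurance$children + 24000 * (insurance$smoker == "yes") +
  rnorm(n, sd = 3000)

# Assumed 80/20 random split into training and test rows.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- insurance[train_idx, ]
test  <- insurance[-train_idx, ]

# Fit the full model on the training data and predict the test rows.
model_1 <- lm(charges ~ ., data = train)
pred <- predict(model_1, newdata = test)

# Actual vs predicted, the difference, and the error as a percentage.
results <- data.frame(actual = test$charges, predicted = pred)
results$error     <- results$actual - results$predicted
results$error_pct <- 100 * abs(results$error) / abs(results$actual)
head(results)

# Mean percentage error over the test set.
mean_error_pct <- mean(results$error_pct)
mean_error_pct
```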
We can also see that the 5th row, which corresponds to the column “Smoker”, has the lowest error value.
Our model was able to predict the insurance charges for policyholders with a mean difference of ~19%.
While sex and region are not major contributors to the model, the model without those variables actually performed slightly worse. If region were broken down further, for example by state, it might improve accuracy.
We can conclude from the above results that the variable “Smoker” is highly correlated with “Charges”. That is, a smoker is very likely to hold premium insurance.
This is how we perform regression analysis to make predictions. If you have any questions or suggestions, do drop us a comment below. We hope you find this blog helpful.
Keep visiting our website, Acadgild, for more blogs on Data Science and Data Analysis.
Related blog: Linear Model Building Using Airquality Data Set with R (https://acadgild.com/blog/linear-model-building)