In this article, we will predict the salary class of an individual using logistic regression in R.
We have already worked through a logistic regression problem in one of our previous blogs, which you can refer to for a better understanding:
Diabetes Prediction using Logistic Regression in R
In this blog, we use a dataset that records an individual’s annual income alongside factors such as education level, age, gender, and occupation.
The dataset contains 16 columns. The target field is income, which is divided into two classes: <=50K and >50K. We can explore how well income level can be predicted from an individual’s personal information.
The dataset “adult” was found in the UCI Machine Learning Repository.
This project explores logistic regression using the UCI Adult Income data set. We will try to predict the salary class of a person based on the given information. It is based on an assigned project from Data Science and Machine Learning with R.
Let us begin with the coding part. You can download the dataset from the link below:
Setting up filepath
Loading the dataset and reading the first few records using the head() function.
The dataset is stored in a variable “adult”; the output shows the first 6 rows and 8 of the 15 columns.
Fetching the structure of the dataset using the str() function.
Summarizing the dataset using the summary() function.
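The loading and inspection steps above can be sketched roughly as follows. Because the real file path depends on where you saved the download, this sketch writes a tiny made-up CSV to a temporary file first; the file contents and the name adult_demo are assumptions for illustration only.

```r
# Write a tiny made-up CSV to a temporary file (a stand-in for the real download)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "age,workclass,income",
  "39,State-gov,<=50K",
  "50,Self-emp-not-inc,<=50K",
  "38,Private,<=50K",
  "53,Private,>50K"
), csv_path)

# Load the file; text columns are read as factors
adult_demo <- read.csv(csv_path, stringsAsFactors = TRUE)

print(head(adult_demo))     # first few records
str(adult_demo)             # column types and factor levels
print(summary(adult_demo))  # per-column summary
```

With the real data, csv_path would simply point at the downloaded file.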
As we can see, there are no null values present in our dataset.
Cross-checking to see whether there is even a single null value in the whole dataset.
Hence, no null values are present.
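One way to run such a check is sketched below on a small made-up data frame (adult_demo and its values are assumptions, not the real data). Note that a ‘?’ entry is an ordinary string, so it is not counted as NA by this check.

```r
# Toy stand-in for the adult data; the values are made up for illustration
adult_demo <- data.frame(
  age = c(39, 50, 38, 53),
  workclass = c("State-gov", "Self-emp-not-inc", "Private", "?"),
  income = c("<=50K", "<=50K", "<=50K", ">50K"),
  stringsAsFactors = TRUE
)

# TRUE if at least one NA exists anywhere in the data frame
has_na <- any(is.na(adult_demo))

# Number of NA values in each column
na_per_column <- colSums(is.na(adult_demo))

print(has_na)
print(na_per_column)
```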
We can see from the structure output that some of the columns are factors with a large number of levels. We can clean these columns by combining similar levels, thus reducing the total number of levels.
a) Combining the workclass column
As we have seen, there are 9 levels in this column; we will combine them into 6 groups as shown below.
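A minimal sketch of this kind of level-combining follows. The level names and the particular grouping (unemployed, state/local government, self-employed) are assumptions chosen for illustration; the grouping actually used in this article may differ.

```r
# Assumed workclass values, one per level (9 levels in total)
workclass <- c("Private", "State-gov", "Local-gov", "Self-emp-inc",
               "Self-emp-not-inc", "Never-worked", "Without-pay",
               "Federal-gov", "?")

# Hypothetical helper: map one workclass value to its combined group
combine_workclass <- function(job) {
  if (job %in% c("Never-worked", "Without-pay")) {
    "Unemployed"
  } else if (job %in% c("State-gov", "Local-gov")) {
    "SL-gov"
  } else if (job %in% c("Self-emp-inc", "Self-emp-not-inc")) {
    "Self-emp"
  } else {
    job  # leave every other value unchanged
  }
}

# Apply the helper to every value, then turn the result back into a factor
combined <- vapply(workclass, combine_workclass, character(1), USE.NAMES = FALSE)
workclass_combined <- factor(combined)

print(levels(workclass_combined))
```

The same pattern can be reused for the other columns by swapping in a different mapping function.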
b) Combining the marital.status column
We can reduce these levels into the following groups:
c) Combining the country column
There are a lot of levels in the country column; we can reduce them to their respective regions, as shown in the output below.
Now we have to re-assign these altered columns to factors since we had to change them to characters:
Dealing with Missing Data
During the data cleaning process, we came across some missing values recorded as ‘?’. We can convert these values to NA so that we can deal with them more efficiently.
Converting ‘?’ to NA
Omitting the NA value
NA values have been omitted from the dataset.
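The two steps above can be sketched as follows on a small made-up data frame (adult_demo and its values are assumptions for illustration).

```r
# Toy data with "?" marking missing entries (rows 2 and 5)
adult_demo <- data.frame(
  age = c(39, 50, 38, 53, 28),
  workclass = c("Private", "?", "Private", "Federal-gov", "?"),
  occupation = c("Sales", "?", "Tech-support", "Sales", "Sales"),
  stringsAsFactors = FALSE
)

# Replace every "?" in the data frame with NA
adult_demo[adult_demo == "?"] <- NA

# Convert the text columns to factors; "?" is no longer a level
adult_demo$workclass <- factor(adult_demo$workclass)
adult_demo$occupation <- factor(adult_demo$occupation)

# Drop every row that contains at least one NA
rows_before <- nrow(adult_demo)
adult_clean <- na.omit(adult_demo)
rows_after <- nrow(adult_clean)

print(c(before = rows_before, after = rows_after))
```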
Exploratory Data Analysis
First, we will plot a histogram of ages, colored by income.
Here, the colored part of each bar indicates the share of people in each income class. From this plot, we can see that the percentage of people who make above 50K peaks at roughly 35% between ages 30 and 50.
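The share of higher earners per age band can also be computed directly instead of being read off a plot. The sketch below uses made-up ages and incomes and arbitrary 10-year bands, so its numbers say nothing about the real dataset.

```r
# Made-up ages and income classes, for illustration only
age <- c(22, 25, 28, 33, 37, 41, 45, 48, 52, 58, 63, 67)
income <- c("<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K",
            "<=50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K")

# Group ages into 10-year bands: (20,30], (30,40], ...
age_band <- cut(age, breaks = c(20, 30, 40, 50, 60, 70))

# Share of people earning >50K within each band
share_high <- tapply(income == ">50K", age_band, mean)

print(round(share_high, 2))
```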
Next, we will plot a histogram of hours worked per week.
From the above graph, it is clear that the most common number of hours worked per week is 40.
Now we will depict the income class by the region people live in. But first we need to rename the country column to region.
It shows the following output.
From the above output, it is clear that North America has the most people in the higher income class: around 11000 people earn more than 50K, and around 30000 earn 50K or less.
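The rename and the counts behind such a chart can be sketched as below. The region labels and rows are made up for illustration, and table() is used here only to show the counts that a bar chart would display.

```r
# Made-up rows; the region labels are assumptions for illustration
adult_demo <- data.frame(
  country = c("North.America", "North.America", "Asia", "Europe",
              "North.America", "Asia"),
  income = c("<=50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K")
)

# Rename the country column to region
names(adult_demo)[names(adult_demo) == "country"] <- "region"

# Count people per region and income class
region_income <- table(adult_demo$region, adult_demo$income)

print(region_income)
```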
Building the Model
The purpose of this model is to classify people into two groups, below 50k or above 50k in income. We will build the model using training data, and then predict the salary class using the test data.
Splitting the data into Train and Test
We will split the dataset into training data (80%) and test data (20%) using the caTools package.
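To show the idea without depending on any package, here is a base-R sketch of an 80/20 split on made-up data. It is a plain random split, whereas sample.split() from caTools also tries to preserve the class ratio of the outcome in both parts. The seed value and the data are arbitrary assumptions.

```r
set.seed(101)  # arbitrary seed, only so the split is reproducible

# Made-up data: 50 rows with an age and an income class
adult_demo <- data.frame(
  age = 21:70,
  income = factor(rep(c("<=50K", ">50K"), times = 25))
)

# Randomly pick 80% of the row numbers for training
n <- nrow(adult_demo)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train <- adult_demo[train_idx, ]   # 80% of the rows
test <- adult_demo[-train_idx, ]   # the remaining 20%

print(c(train = nrow(train), test = nrow(test)))
```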
Training the Model
While training our model, we use the glm() function, which tells R to fit a generalized linear model. ‘income ~ .’ means that we want to model income using every available feature. family = binomial() is used because we are predicting a binary outcome: 50K or below, or above 50K.
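A minimal sketch of such a glm() call is shown below on simulated data. The two predictors, their names, and the coefficients used to simulate the outcome are all assumptions for illustration. With a factor response, glm() treats the first level as failure, so here the model estimates the probability of “>50K”.

```r
set.seed(101)  # arbitrary seed for reproducibility

# Simulated training data; column names are hypothetical
n <- 200
train_demo <- data.frame(
  age = sample(18:65, n, replace = TRUE),
  hr_per_week = sample(20:60, n, replace = TRUE)
)

# Simulate the outcome from made-up log-odds
log_odds <- -6 + 0.08 * train_demo$age + 0.06 * train_demo$hr_per_week
train_demo$income <- factor(
  ifelse(runif(n) < plogis(log_odds), ">50K", "<=50K"),
  levels = c("<=50K", ">50K")
)

# Model income using every other column as a predictor
model <- glm(income ~ ., family = binomial(), data = train_demo)

print(coef(model))
```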
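Making predictions on the training data and evaluating them with the ROC curve and AUC, as shown below.
We get the output as:
The above graph shows that the AUC we got from the training data is about 0.90. Note that AUC measures how well the model ranks the two classes; it is not the same as classification accuracy.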
Making predictions on the Test data as shown below.
We now convert the predicted probabilities to class labels, as shown below.
Here we have made predictions on the test data using our logistic regression model. We specified type = “response” above to get predicted probabilities instead of predictions on the logit (log-odds) scale. The accuracy here comes out to 85%.
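The difference between the two prediction scales, and the conversion to class labels, can be sketched as follows. The data is simulated and the single predictor is an assumption, so the accuracy printed here is unrelated to the 85% reported above.

```r
set.seed(101)  # arbitrary seed for reproducibility

# Simulated data with one made-up predictor
n <- 300
demo <- data.frame(age = sample(18:65, n, replace = TRUE))
demo$income <- factor(
  ifelse(runif(n) < plogis(-4 + 0.1 * demo$age), ">50K", "<=50K"),
  levels = c("<=50K", ">50K")
)
train <- demo[1:240, ]
test <- demo[241:300, ]

model <- glm(income ~ age, family = binomial(), data = train)

# type = "response" gives probabilities; type = "link" gives log-odds
prob <- predict(model, newdata = test, type = "response")
log_odds <- predict(model, newdata = test, type = "link")

# Convert probabilities to class labels with a 0.5 cutoff
pred_class <- ifelse(prob >= 0.5, ">50K", "<=50K")

# Share of test rows where the predicted label matches the actual one
accuracy <- mean(pred_class == test$income)
print(accuracy)
```

Applying plogis() to the log-odds recovers the same probabilities, which is what type = “response” does for a binomial model.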
Plotting the ROC curve and computing the AUC on the test data.
It shows the below output.
We get an AUC of about 0.90 on the test data.
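To make clear what an ROC curve and its AUC measure, here is a base-R sketch on eight made-up predictions. It is not necessarily how the plots in this article were produced. The AUC is computed with the rank-based (Mann-Whitney) formula, which equals the fraction of positive/negative pairs that the model ranks correctly.

```r
# Made-up labels (1 = above 50K) and predicted probabilities
actual <- c(0, 0, 1, 1, 0, 1, 0, 1)
prob <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90)

# ROC points: true and false positive rate at each cutoff
cutoffs <- sort(unique(prob), decreasing = TRUE)
tpr <- sapply(cutoffs, function(t) mean(prob[actual == 1] >= t))
fpr <- sapply(cutoffs, function(t) mean(prob[actual == 0] >= t))

# Rank-based AUC (Mann-Whitney statistic scaled to 0..1)
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc <- (sum(rank(prob)[actual == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

print(data.frame(cutoff = cutoffs, fpr = fpr, tpr = tpr))
print(auc)  # 0.875 for this toy data
```

Plotting tpr against fpr gives the ROC curve itself.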
We will now compare our results using a confusion matrix.
A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.
The most basic terms used in this matrix are:
- true positives (TP): We predicted yes, and the actual result is also yes.
- true negatives (TN): We predicted no, and the actual result is also no.
- false positives (FP): We predicted yes, but the actual result is no. (Also known as a “Type I error.”)
- false negatives (FN): We predicted no, but the actual result is yes. (Also known as a “Type II error.”)
Since our predictions are predicted probabilities, we specify that probabilities greater than or equal to 50% will be TRUE (above 50K) and anything below 50% will be FALSE (50K or below).
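A small worked example of this thresholding and the resulting confusion matrix is sketched below. The eight labels and probabilities are made up for illustration.

```r
# Made-up actual classes (TRUE = above 50K) and predicted probabilities
actual <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
prob <- c(0.91, 0.12, 0.48, 0.77, 0.63, 0.05, 0.50, 0.30)

# Probabilities of 0.5 or more are predicted as TRUE
predicted <- prob >= 0.5

# Rows are the actual class, columns are the predicted class
cm <- table(actual = actual, predicted = predicted)
print(cm)

tp <- cm["TRUE", "TRUE"]    # predicted yes, actually yes
tn <- cm["FALSE", "FALSE"]  # predicted no, actually no
fp <- cm["FALSE", "TRUE"]   # predicted yes, actually no
fn <- cm["TRUE", "FALSE"]   # predicted no, actually yes

accuracy <- (tp + tn) / sum(cm)
print(accuracy)  # 0.75 for this toy data
```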
Hence, our logit model predicts the salary class of a person from the given information with an accuracy of about 85% and an AUC of about 0.90 on the test data.
Hope you find this blog helpful. In case of any queries or suggestions, drop us a comment below.
Keep visiting our website for more blogs on Data Science and Data Analytics.