In this blog, we will learn how to perform predictive analysis with the help of a dataset using the Logistic Regression Algorithm.
The dataset used in this blog is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
We will build a machine learning model to accurately predict whether the patients have diabetes or not.
Before moving further, we should first understand what logistic regression is and why we use it.
Logistic regression is a classification algorithm used to assign observations to a discrete set of classes.
Examples of classification problems include whether an email is spam or not, whether an online transaction is fraudulent or not, and whether a person is diabetic or not.
It is a machine learning algorithm for predictive analysis that is based on the concept of probability.
When we pass inputs to the model, it returns a probability score between 0 and 1 for each observation, and that score is then used to assign a class.
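As a minimal sketch of where that probability score comes from: logistic regression passes a weighted sum of the inputs through the logistic (sigmoid) function, which maps any real number into the range 0 to 1. The function name `sigmoid`, the coefficients, and the glucose readings below are illustrative assumptions, not values fitted to this dataset.

```r
# Logistic (sigmoid) function: maps any real number into the interval (0, 1).
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}

# Hypothetical intercept and slope, chosen only for illustration.
intercept <- -5
slope <- 0.04

# Linear score for three made-up glucose readings.
glucose <- c(80, 125, 180)
z <- intercept + slope * glucose

# Convert the linear scores into probabilities.
prob <- sigmoid(z)
print(round(prob, 3))
```

A score of zero maps to a probability of 0.5, negative scores map below 0.5, and positive scores map above it.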
Now that we have a brief understanding of logistic regression, let us begin with the coding part.
You can download the dataset from the link: Dataset
We will first set up the file path for the working directory of the R process.
getwd() returns an absolute filepath representing the current working directory of the R process.
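A short sketch of checking and changing the working directory is shown here. It uses R's temporary directory as a stand-in for the folder that holds the dataset, which is an assumption made so the example runs anywhere.

```r
# Show the current working directory of the R process.
current_dir <- getwd()
print(current_dir)

# Switch to another directory (tempdir() is only a stand-in for the
# folder that contains the dataset), then switch back.
setwd(tempdir())
print(getwd())
setwd(current_dir)
```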
Loading the required library packages.
Loading the dataset.
The head() function returns the first few records of the dataset.
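The loading step can be sketched as follows. The file name `diabetes.csv` is an assumption, and the rows are a handful of sample records in the dataset's column layout, written to a temporary file so the example runs without the download.

```r
# Sample rows in the dataset's column layout (illustrative only).
csv_lines <- c(
  "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome",
  "6,148,72,35,0,33.6,0.627,50,1",
  "1,85,66,29,0,26.6,0.351,31,0",
  "8,183,64,0,0,23.3,0.672,32,1",
  "1,89,66,23,94,28.1,0.167,21,0",
  "0,137,40,35,168,43.1,2.288,33,1"
)

# Hypothetical file name; point this at the real download instead.
csv_path <- file.path(tempdir(), "diabetes.csv")
writeLines(csv_lines, csv_path)

# Load the CSV into a data frame.
data <- read.csv(csv_path)

# Show the first three records (head() returns six by default).
print(head(data, 3))
```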
We will now check whether any missing (NA) values are present in the dataset.
We can see from the above result that there are no missing values in the dataset.
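A check of this kind can be sketched with `is.na()`. The small data frame below is made up and deliberately contains one missing value so that the counts are not all zero; on a dataset with no missing values, both results are zero.

```r
# Small made-up data frame with one missing value, for illustration only.
data <- data.frame(
  Glucose = c(148, 85, NA, 89),
  BMI = c(33.6, 26.6, 23.3, 28.1),
  Outcome = c(1, 0, 1, 0)
)

# Total number of missing values in the whole data frame.
total_missing <- sum(is.na(data))
print(total_missing)

# Missing values per column, to see where they occur.
missing_by_column <- colSums(is.na(data))
print(missing_by_column)
```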
Summarizing the dataset using the summary() function.
We will find the structure of the dataset using the str() function.
We can see from the above result that the dataset has 9 columns. The variables Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction and Age are the predictors of the variable Outcome, which states whether a person has diabetes or not: 1 means ‘Yes’ and 0 means ‘No’.
We will now check how the patients are distributed across age ranges.
We have made use of the factor() function, which represents categorical data. A factor can be ordered or unordered.
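One way to build such age ranges is sketched below, using `cut()` to bin the ages and `factor()` to mark the result as ordered. The ages, break points, and labels are assumptions for illustration and may differ from the ones used to produce the counts in this post.

```r
# Made-up ages for illustration; the real data has one Age value per patient.
age <- c(21, 25, 29, 33, 38, 45, 52, 61, 24, 30)

# Bin ages into ranges. right = TRUE makes each interval include its
# upper bound, so an age of 30 falls in "21-30".
age_cat <- cut(
  age,
  breaks = c(20, 30, 40, 50, 60, Inf),
  labels = c("21-30", "31-40", "41-50", "51-60", ">60"),
  right = TRUE
)

# cut() already returns a factor; mark it as ordered so the ranges
# keep their natural order in tables and plots.
age_cat <- factor(age_cat, levels = levels(age_cat), ordered = TRUE)

# Count how many people fall in each range.
age_counts <- table(age_cat)
print(age_counts)
```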
From the above result, we can see that most patients fall in the ranges 21 to 30 and 31 to 40. The largest group is 21 to 30, with 417 patients.
Visualizing the above age ranges with the help of a histogram.
The above code will show the following output.
Visualizing the same with the help of a bar plot for a better understanding of the dataset.
It will show the following output:
Plotting age category against BMI with the help of a box plot.
It gives the following output:
The 21 to 30 age group has the most outliers, which are shown as red dots.
Plotting a correlation matrix for all the variables present in the dataset.
The matrix shows that no strong correlation exists between the variables.
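A correlation matrix of this kind can be sketched with base R's `cor()`. The data below is randomly generated for illustration, and the 0.7 cutoff for calling a correlation strong is an arbitrary assumption.

```r
# Made-up numeric data for illustration only.
set.seed(42)
n <- 50
data <- data.frame(
  Glucose = rnorm(n, mean = 120, sd = 30),
  BMI = rnorm(n, mean = 32, sd = 7),
  Age = round(runif(n, min = 21, max = 70))
)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(data)
print(round(cor_matrix, 2))

# Flag pairs of distinct variables whose absolute correlation exceeds 0.7
# (an arbitrary threshold for this sketch).
strong <- abs(cor_matrix) > 0.7 & upper.tri(cor_matrix)
print(which(strong, arr.ind = TRUE))
```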
Plotting it using the corrplot() function, which gives a graphical representation of the above correlation matrix.
It shows the following graph.
The above graph shows that there is no strong correlation observed between variables. So we can do further analysis without dropping any columns.
Train and Test Data
We will now install caTools, which contains several basic utility functions, including moving window statistic functions, read/write for GIF and ENVI binary files, fast calculation of AUC, and a LogitBoost classifier. We use it here to split our data into train and test sets.
Splitting the dataset into train and test data, 80% and 20% respectively.
Calculating the total number of rows
Total number of Train data rows
Total number of Test data rows
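The split and the three row counts can be sketched as follows. This version uses base R's `sample()` as a stand-in for the caTools helper, and the data frame is randomly generated, so both are assumptions rather than the exact code behind this post.

```r
set.seed(123)  # make the random split reproducible

# Made-up data frame standing in for the real dataset.
n <- 100
data <- data.frame(
  Glucose = round(rnorm(n, mean = 120, sd = 30)),
  Outcome = rbinom(n, size = 1, prob = 0.35)
)

# Draw 80% of the row indices at random for training.
train_idx <- sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))

train <- data[train_idx, ]
test <- data[-train_idx, ]

# Row counts: total, train, test.
print(nrow(data))
print(nrow(train))
print(nrow(test))
```

Unlike a stratified split, a plain random draw does not guarantee the same share of positive outcomes in both parts, which can matter on small datasets.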
Fitting model using all the independent variables.
Here we have fitted our model based on Train data.
AIC (Akaike Information Criterion) is an estimator of the relative quality of statistical models for a given dataset, which makes it a means for model selection. Among the candidate models fitted to the same data, the one with the lowest AIC is preferred.
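A minimal sketch of fitting the model and comparing AIC values is given here. The training data is synthetic (the outcome depends on `Glucose` and not on the hypothetical `Noise` column), so the coefficients and AIC values will not match those of the real dataset.

```r
set.seed(1)

# Synthetic training data: the outcome depends on Glucose, not on Noise.
n <- 200
train <- data.frame(
  Glucose = rnorm(n, mean = 120, sd = 30),
  Noise = rnorm(n)
)
true_prob <- 1 / (1 + exp(-(-6 + 0.05 * train$Glucose)))
train$Outcome <- rbinom(n, size = 1, prob = true_prob)

# Fit logistic regression on all predictors; family = binomial selects
# the logit link.
model_all <- glm(Outcome ~ ., data = train, family = binomial)
print(summary(model_all))

# Fit a smaller model and compare AIC; the lower value is preferred.
model_small <- glm(Outcome ~ Glucose, data = train, family = binomial)
print(AIC(model_all, model_small))
```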
Predicting Outcome on Training dataset
The average prediction for each of the two outcomes
Now we will find the average predicted probability for each of the two outcomes (0 and 1), using the model fitted on all other variables of the dataset.
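This step can be sketched with `predict()` and `tapply()`. The data and the single-predictor model below are synthetic assumptions used only to keep the example self-contained.

```r
set.seed(2)

# Synthetic training data standing in for the real dataset.
n <- 200
train <- data.frame(Glucose = rnorm(n, mean = 120, sd = 30))
train$Outcome <- rbinom(n, 1, 1 / (1 + exp(-(-6 + 0.05 * train$Glucose))))

model <- glm(Outcome ~ Glucose, data = train, family = binomial)

# type = "response" returns probabilities instead of log-odds.
pred_train <- predict(model, type = "response")

# Mean predicted probability within each true outcome group (0 and 1).
avg_pred <- tapply(pred_train, train$Outcome, mean)
print(avg_pred)
```

If the model separates the classes at all, the average for outcome 1 should be clearly higher than the average for outcome 0.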
The ROC (Receiver Operating Characteristic) curve is used to assess how well a continuous measurement predicts a binary outcome. It shows the performance of a classification model at all classification thresholds.
This curve plots two parameters:
- True Positive Rate
- False Positive Rate
AUC stands for “Area under the ROC Curve.” That is, AUC measures the entire two-dimensional area underneath the ROC curve. It is used in classification analysis to determine which of the candidate models predicts the classes best.
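To make these definitions concrete, here is a small base R sketch that computes the ROC points and the AUC by hand. The labels and scores are made up, the helper `roc_point` is a hypothetical name, and a dedicated package would normally do this work.

```r
# Made-up true labels and predicted probabilities, for illustration only.
labels <- c(0, 0, 1, 0, 1, 1, 0, 1)
scores <- c(0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.75, 0.90)

# TPR and FPR at a given threshold: predict 1 when score >= threshold.
roc_point <- function(threshold) {
  predicted <- as.integer(scores >= threshold)
  tpr <- sum(predicted == 1 & labels == 1) / sum(labels == 1)
  fpr <- sum(predicted == 1 & labels == 0) / sum(labels == 0)
  c(fpr = fpr, tpr = tpr)
}

# Sweep thresholds from high to low so the curve runs from (0, 0) to (1, 1).
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
roc <- t(sapply(thresholds, roc_point))
print(roc)

# Area under the curve by the trapezoidal rule.
fpr <- roc[, "fpr"]
tpr <- roc[, "tpr"]
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)
```

For this toy data the AUC is 0.75: in 12 of the 16 possible positive/negative pairs, the positive case receives the higher score.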
Generating ROC curve on train data.
Computing the AUC
It gives the below graph
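From the above graph we get a value of 84% on our Train data. Since this figure is read from the ROC curve, it is best interpreted as an AUC of about 0.84 rather than as classification accuracy.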
Making predictions on our Test Data
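A sketch of this step, from prediction to confusion matrix to accuracy, is shown below. The data is synthetic and the 0.5 cutoff is an assumed threshold, so the printed accuracy will not match the figures quoted in this post.

```r
set.seed(3)

# Synthetic data standing in for the real dataset.
n <- 300
data <- data.frame(Glucose = rnorm(n, mean = 120, sd = 30))
data$Outcome <- rbinom(n, 1, 1 / (1 + exp(-(-6 + 0.05 * data$Glucose))))

# 80/20 split, then fit on the training rows only.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- data[train_idx, ]
test <- data[-train_idx, ]
model <- glm(Outcome ~ Glucose, data = train, family = binomial)

# Predicted probabilities for the unseen test rows.
pred_test <- predict(model, newdata = test, type = "response")

# Turn probabilities into classes with a 0.5 cutoff (an assumed threshold).
pred_class <- factor(as.integer(pred_test >= 0.5), levels = c(0, 1))
actual <- factor(test$Outcome, levels = c(0, 1))

# Confusion matrix: rows are actual classes, columns are predicted classes.
conf_matrix <- table(Actual = actual, Predicted = pred_class)
print(conf_matrix)

# Accuracy = correct predictions / all predictions.
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(accuracy)
```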
We see that the above output gives us an accuracy rate of 74%. Let's improve the performance of the model.
We get the following output
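From the above graph we get a value of 82% on our Test data. As this figure is read from the ROC curve, it is best interpreted as an AUC of about 0.82: the model ranks a randomly chosen diabetic patient above a randomly chosen non-diabetic one about 82% of the time, which is not the same as 82% of predictions being correct.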
This brings us to the end of this blog. We hope you found this article helpful. For any queries or suggestions, do drop a comment below.