Logistic Regression is a supervised Machine Learning algorithm and, despite the word ‘Regression’ in its name, it is used for binary classification. Binary classification means that the model categorizes each data point as either 1 (yes/success) or 0 (no/failure).
In other words, we can say that the Logistic Regression model predicts P(Y=1) as a function of X.
Assumptions in Logistic Regression
- The dependent variable in Logistic Regression must be binary.
- Only meaningful variables should be included.
- The model should have little or no multicollinearity, that is, the independent variables should not be strongly correlated with each other.
- Logistic Regression requires quite large sample sizes.
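One rough way to check the multicollinearity assumption is to look at pairwise correlations between the features. The sketch below is only an illustration on synthetic data: the variables x1, x2, x3 and the 0.9 cut-off are assumptions made for this example, and pairwise correlation will not catch every form of multicollinearity.

```python
import numpy as np

# Synthetic features, made up purely for illustration.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                         # independent of x1
x3 = 2.0 * x1 + rng.normal(scale=0.1, size=n)   # almost a multiple of x1

names = ["x1", "x2", "x3"]
features = np.column_stack([x1, x2, x3])

# Correlation matrix between columns (rowvar=False treats columns as variables).
corr = np.corrcoef(features, rowvar=False)

threshold = 0.9  # hypothetical cut-off, not a fixed rule
flagged = [
    (names[i], names[j], round(float(corr[i, j]), 3))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(flagged)  # pairs of features that are strongly correlated
```

Here only the pair (x1, x3) should be flagged, which suggests dropping or combining one of the two before fitting the model.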
Examples of Logistic Regression include:
- Predicting whether an email is spam or not
- Predicting whether a student will pass or fail an exam, etc
In Logistic Regression, we use the Sigmoid function to describe the probability that a sample belongs to one of the two classes. The shape of the Sigmoid function determines the probabilities predicted by our model.
Mathematically, the Sigmoid function is given by the equation below:
P = 1 / (1+e^(-y))
where y is the equation of a line: y = mx + c
No matter what value y takes, the output of the Sigmoid function always lies between 0 and 1.
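The formula above can be tried out directly. This is a minimal sketch in plain Python; the slope m = 0.8 and intercept c = -2.0 are hypothetical values chosen only for illustration, and the helper name predict_probability is not part of any library.

```python
import math

def sigmoid(y):
    """Map any real number y to a value between 0 and 1."""
    # Two branches avoid overflow in math.exp for large negative y.
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)

def predict_probability(x, m, c):
    """P(Y=1) for a single feature x, using the line y = m*x + c."""
    return sigmoid(m * x + c)

# Hypothetical slope and intercept, for illustration only.
m, c = 0.8, -2.0
for x in (-5, 0, 2.5, 10):
    print(x, round(predict_probability(x, m, c), 4))
```

Large negative values of y give probabilities close to 0, large positive values give probabilities close to 1, and y = 0 gives exactly 0.5.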
The Sigmoid function looks like below:
In this blog, we will understand the working of Logistic regression by building a model using the Advertising dataset. Please note that this is not a real dataset but a sample one that has been created for your understanding. You can download the dataset from this Link.
The dataset consists of 10 columns. The classification goal is to predict whether a user will click on an ad featured on a website (1) or not (0), based on various feature variables.
Let us begin our coding in Python.
We will begin by importing all the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Loading the CSV file by mentioning the correct path of the file.
advert = pd.read_csv(r'advertising.csv')
# Fetching the first few records of the data.
advert.head(10)
The input variables include:
- Daily Time Spent on Site: The time spent by the user on the website, in minutes
- Age: The age of the user (numeric)
- Area Income: The average income of the geographical area of the user
- Daily Internet Usage: The average number of minutes per day the user is on the internet
- Ad Topic Line: The headline of the ad displayed on the website
- City: The city where the user resides
- Male: Whether the user is male or not (1: yes and 0: no)
- Country: The country where the user lives
- Timestamp: The time at which the user clicked on the ad or closed the window
- Clicked on Ad: Whether the user clicked on an ad or not (1: yes and 0: no)
Checking the ‘info’ of the dataset
We can see that there are 1000 entries (rows) and 10 columns in total.
Checking the statistical figures of the dataset.
Getting the counts of unique values for the target column and the ‘Male’ column respectively.
advert['Clicked on Ad'].value_counts()
From the above output, it is clear that 50% of the users clicked on the ad while browsing the internet and the other 50% did not.
advert['Male'].value_counts()
Hence 481 of the 1000 users are male.
Checking the total number of null values in the dataset.
Luckily, our dataset doesn’t contain any missing values, so we do not have to handle them.
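The inspection steps described above (dataset info, summary statistics and null counts) could look roughly like the sketch below. It runs on a tiny made-up DataFrame named advert_sample so that it is self-contained. The values are invented, and the use of describe() and isnull().sum() is an assumption about how these checks are usually done in pandas, not necessarily the exact calls used for the outputs discussed here.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data comes from advertising.csv.
advert_sample = pd.DataFrame({
    "Age": [35, 31, 26, 29, 45],
    "Area Income": [61833.90, 68441.85, 59785.94, 54806.18, 73889.99],
    "Male": [0, 1, 0, 1, 0],
    "Clicked on Ad": [0, 0, 1, 1, 1],
})

# Column names, dtypes and non-null counts.
advert_sample.info()

# Count, mean, std, min, quartiles and max for each numeric column.
print(advert_sample.describe())

# Number of missing values in each column.
null_counts = advert_sample.isnull().sum()
print(null_counts)
```

With the real dataset, the same three calls would be made on advert instead of advert_sample.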
Grouping by the target variable with the mean of other feature variables.
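# numeric_only=True skips the text columns, which recent pandas versions refuse to average
advert.groupby('Clicked on Ad').mean(numeric_only=True)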
It can be inferred that:
- The average age of users who clicked on the ad is higher than that of users who didn’t.
- Users who clicked on the ad spent less time on the site per day, on average, than users who didn’t.
Grouping by the target variable and the ‘Male’ column together.
advert.groupby(['Clicked on Ad','Male']).size()
It is clear that, of the people who clicked on the ad, 231 are male and 269 are not.
Now we will visualize the data using the Matplotlib and seaborn libraries to see the patterns and trends in the dataset.
Creating a histogram for the Age column
sns.set_style('whitegrid')
# histplot replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(advert['Age'], kde = False, bins = 40)
The above graph shows that most users are between 25 and 45 years old.
Creating a jointplot for the columns ‘Age’ and ‘Area Income’
sns.jointplot(x = 'Age', y = 'Area Income', data = advert)
plt.show()
No visible linear relation was found between the two variables. However, users aged between 20 and 45 were found to have higher area incomes.
Creating a jointplot for the columns ‘Age’ and ‘Daily Time Spent on Site’
sns.jointplot(x = 'Age', y = 'Daily Time Spent on Site', data = advert, kind = 'kde', color = 'red')
plt.show()
Users aged between 20 and 45 spend more time on the site daily.
Visualizing the number of males
sns.countplot(x = 'Male', data = advert, palette= 'pastel')
Hence there are fewer males (categorized as 1) than others (categorized as 0).
Using a countplot to visualize how many males and others have clicked on the ad
sns.countplot(x = 'Clicked on Ad', data = advert, hue = 'Male', color = 'red')
In the above graph, 1 and 0 on the x-axis indicate whether the user clicked on an ad or not, whereas the red and pink colors refer to males and others respectively.
Therefore, fewer males clicked on the ad than ‘Others’.
Also, the number of males who clicked on the ad is lower than the number of males who didn’t.
Creating a pairplot for the whole dataset
sns.pairplot(advert, hue = 'Clicked on Ad')
plt.show()
Now that the data has been explored, we will select the features and then split the data into training and test datasets.
X = advert[['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage']]
y = advert['Clicked on Ad']
Here X and y are the independent and dependent features respectively. The columns ‘Ad Topic Line’, ‘City’, ‘Country’ and ‘Timestamp’ are not numeric and, together with ‘Male’ (which is encoded as 0/1), are assumed not to have much impact on the target. Hence we will not consider these features.
Splitting the data into training and test datasets using the sklearn library.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
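To see what the split produces, here is a small sketch on synthetic data with the same number of rows (1000) and features (4) as our selection. The arrays X_demo and y_demo are made up for illustration and stand in for X and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 4 numeric features, binary target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = rng.integers(0, 2, size=1000)

# test_size=0.2 keeps 20% of the rows for testing;
# random_state makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (800, 4) (200, 4)
print(y_tr.shape, y_te.shape)  # (800,) (200,)
```

Fixing random_state means the same rows end up in the test set on every run, which keeps later evaluation numbers comparable.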
Our data has now been split into training and test datasets in an 80:20 ratio. Next, we import the LogisticRegression class from the sklearn library and create an instance of it. We then call the fit() function to train the model on the training dataset.
from sklearn.linear_model import LogisticRegression
# Creating an instance of the LogisticRegression class
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Now we’ll check how the model performs against data that it hasn’t been trained on.
prediction = logreg.predict(X_test)
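The link between predict() and the Sigmoid function can be made explicit. The sketch below trains a model on synthetic two-feature data (X_demo, y_demo and the labelling rule are assumptions made only for this example) and checks that, for a binary problem with default settings, the probabilities from predict_proba() match the Sigmoid of the fitted linear score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, made up for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

model = LogisticRegression()
model.fit(X_demo, y_demo)

# Linear score for each row: w . x + b
scores = X_demo @ model.coef_.ravel() + model.intercept_[0]

# Sigmoid of the score gives P(Y=1).
manual_proba = 1.0 / (1.0 + np.exp(-scores))
sklearn_proba = model.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))  # True

# A probability above 0.5 is the same as a positive score,
# and that is the class that predict() returns as 1.
manual_labels = (scores > 0).astype(int)
print(np.array_equal(manual_labels, model.predict(X_demo)))
```

In other words, predict() applies a 0.5 threshold to the Sigmoid output, while predict_proba() exposes the probability itself.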
Since this is a classification problem, we use a confusion matrix to evaluate the performance of our model.
from sklearn.metrics import confusion_matrix
conf_Matrix = confusion_matrix(y_test, prediction)
print(conf_Matrix)
From our confusion matrix, we conclude that:
- True-positive: 86 (We predicted a positive result and it was positive)
- True-negative: 94 (We predicted a negative result and it was negative)
- False-positive: 3 (We predicted a positive result and it was negative)
- False-negative: 17 (We predicted a negative result and it was positive)
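The four categories above can be counted by hand, which helps when reading the matrix. This is a minimal sketch in plain Python; the helper confusion_counts and the two short label lists are hypothetical and only illustrate the definitions.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels, with 1 as the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1      # predicted positive, was positive
        elif predicted == 0 and actual == 0:
            tn += 1      # predicted negative, was negative
        elif predicted == 1 and actual == 0:
            fp += 1      # predicted positive, was negative
        else:
            fn += 1      # predicted negative, was positive
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}

# Made-up labels for illustration.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true_demo, y_pred_demo)
print(counts)  # {'tp': 3, 'tn': 3, 'fp': 1, 'fn': 1}
```

For labels 0 and 1, sklearn's confusion_matrix arranges the same counts as [[TN, FP], [FN, TP]], with rows for the actual class and columns for the predicted class.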
Computing the classification report which states the precision, recall, f1-score and support.
from sklearn.metrics import classification_report
print(classification_report(y_test, prediction))
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative. Said another way, “for all instances classified positive, what percent was correct?”
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples. Said another way, “for all instances that were actually positive, what percent was classified correctly?”
The F-beta score can be interpreted as a weighted harmonic mean of precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.
The support is the number of occurrences of each class in y_test.
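As a worked example, the definitions above can be applied to the counts read from our confusion matrix (86 true positives, 94 true negatives, 3 false positives, 17 false negatives). The values below are for the positive class (1) only; the F1 score shown is the F-beta score with beta = 1.

```python
# Counts taken from the confusion matrix discussed above.
tp, tn, fp, fn = 86, 94, 3, 17

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that were right
precision = tp / (tp + fp)                   # of predicted positives, how many were right
recall = tp / (tp + fn)                      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy  = {accuracy:.3f}")   # 0.900
print(f"precision = {precision:.3f}")  # 0.966
print(f"recall    = {recall:.3f}")     # 0.835
print(f"f1        = {f1:.3f}")         # 0.896
```

The high precision and lower recall reflect the fact that the model made only 3 false-positive errors but missed 17 users who did click on the ad.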
This brings us to the end of this blog. I hope you found it helpful. For any queries or suggestions, do drop us a comment. Keep visiting our website for more blogs on Data Science and Data Analytics.