This blog introduces the fundamental concepts of logistic regression so that you can build your first logistic regression model with confidence. Logistic regression is one of the most foundational and widely used machine learning algorithms. It is usually among the first topics people pick up while learning predictive modeling. Despite its name, it is not a regression algorithm but a probabilistic classification model.
Classification in machine learning is a learning technique in which an instance is mapped to one of several labels. The machine learns patterns from the data so that the learned representation maps the input features to the appropriate label/class without further intervention from a human expert.
Logistic regression in its plain form is used to model the relationship between one or more predictor variables and a binary categorical target variable. The target variable is coded as “1” and “0”. Since the target is binary, vanilla logistic regression is referred to as binary logistic regression. The binary logistic regression model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). For example, we may find that plants with a high level of a fungal infection (X1) fall into the category “the plant lives” (Y) less often than plants with a low level of infection. Thus, as the level of infection rises, the probability of the plant living decreases.
Logistic regression is a linear model whose output is passed through a nonlinear transform. The logistic regression formula is derived from the standard linear equation for a straight line. The linear score, which ranges over (-inf, +inf), is converted to a probability in the range (0, 1) using the sigmoid curve. To do this, we pass the linear function through another function that squeezes the generated score into a continuous number between 0 and 1. The sigmoid is one such function: it squeezes any number, in this case the score (w*x + b), into a value between 0 and 1.
Following is the graph for the sigmoidal function:
The sigmoid function, σ(z) = e^z / (1 + e^z), ensures that the generated number is always between 0 and 1, since the numerator is always smaller than the denominator by exactly 1.
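The squeezing behaviour described above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name sigmoid and the sample scores are chosen here for illustration and are not part of the original post.

```python
import math


def sigmoid(z: float) -> float:
    """Squeeze a real-valued score z into a value between 0 and 1."""
    if z >= 0:
        # Equivalent form 1 / (1 + e^-z); avoids overflow for large positive z.
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use e^z / (1 + e^z): the numerator is smaller
    # than the denominator by exactly 1.
    ez = math.exp(z)
    return ez / (1.0 + ez)


# A few illustrative scores, from strongly negative to strongly positive.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z = {z:6.1f} -> sigmoid(z) = {sigmoid(z):.4f}")
```

A score of 0 maps to 0.5, negative scores map below 0.5 and positive scores map above it.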
Derivation from linear function to a sigmoid function:
The outcome in binomial logistic regression can be a 0 or a 1. The idea is then to estimate the probability of an outcome being a 1 or a 0. Given that the probability of the outcome being a 1 is given by p then the probability of it not occurring is given by 1-p. This is a special case of Binomial distribution called the Bernoulli distribution.
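Written out, the Bernoulli distribution described above assigns probabilities to the two outcomes as follows. The symbol y for the observed outcome is introduced here for illustration:

```latex
P(Y = y) = p^{y}\,(1 - p)^{1 - y}, \qquad y \in \{0, 1\}
```

Setting y = 1 gives p and setting y = 0 gives 1 - p, matching the two cases in the text.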
The idea in logistic regression is to cast the problem in the form of a generalized linear regression model:
ŷ = β0 + β1x1 + β2x2 + … + βnxn
where ŷ = predicted value, x = independent variables and the β are coefficients to be learned.
This can be compactly expressed in vector form (taking x0 = 1 so that β0 is absorbed into the coefficient vector):
ŷ = βᵀx
Then
Logistic regression arises from the idea that a linear model can also be adapted to classification problems. It is part of a larger class of algorithms known as generalized linear models (GLMs). This class was proposed as a means of applying linear regression to problems that were not directly suited to it.
The fundamental equation of the generalized linear model is:
g(E(y)) = α + βx1 + γx2
Here, g() is the link function. The role of the link function is to ‘link’ the expectation of y to the linear predictor.
GLM does not assume a linear relationship between dependent and independent variables. However, it assumes a linear relationship between the link function and independent variables.
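Hence, logistic regression is a special case of the generalized linear model in which the outcome variable is categorical and the log of odds serves as the dependent variable.
In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function. Thus, the logit function acts as a link between logistic regression and linear regression. Writing p for the probability of the event and z = α + βx1 + γx2 for the linear predictor, the logit link gives:
log(p / (1 − p)) = z
Taking the natural exponential on both sides gives:
p / (1 − p) = e^z
Add 1 on both sides:
1 / (1 − p) = 1 + e^z
Solving for p gives:
p = e^z / (1 + e^z) = 1 / (1 + e^(−z))
This looks like the sigmoid function, doesn't it?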
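The relationship between a probability, its odds and its log odds can also be verified numerically. The sketch below uses only the Python standard library; the names logit and inverse_logit are chosen here for illustration, and the probabilities fed in are arbitrary examples.

```python
import math


def logit(p: float) -> float:
    """Log of the odds p / (1 - p); defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(z: float) -> float:
    """Undo the logit: exponentiate to get the odds, then solve for p.

    Fine for moderate z; math.exp overflows for very large positive z.
    """
    odds = math.exp(z)          # odds = p / (1 - p)
    return odds / (1.0 + odds)  # p = odds / (1 + odds)


# Round trip: probability -> log odds -> probability.
for p in (0.1, 0.5, 0.9):
    z = logit(p)
    print(f"p = {p:.1f}  log odds = {z:+.4f}  recovered p = {inverse_logit(z):.4f}")
```

A probability of 0.5 corresponds to log odds of 0, which is why a linear score of 0 sits exactly on the decision boundary.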
Below is an implementation of logistic regression using scikit-learn, a Python machine learning framework.
Problem Statement:
To build a simple logistic regression model that predicts whether the demand for bikes is high, given the wind speed.
Data details
==========================================
Bike Sharing Dataset
==========================================
The dataset is available at “https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset”

=========================================
Background
=========================================
Bike sharing systems are the new generation of traditional bike rentals where the entire process from membership to rental and return has become automated. Through these systems, the user can easily rent a bike from one location and return it to another location. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important roles in traffic, environmental and health issues.

Apart from interesting real-world applications of bike sharing systems, the characteristics of the data being generated by these systems make them attractive for research purposes. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns the bike sharing system into a virtual sensor network that can be used for keeping track of mobility in the city. Hence, it is expected that most of the important events in the city could be detected by monitoring these data.

=========================================
Data Set
=========================================
The bike-sharing rental process is highly correlated to the environmental and seasonal settings. For instance, weather conditions, precipitation, day of week, season, hour of the day, etc. can affect the rental behaviors. The core data set is related to the two-year historical log corresponding to years 2011 and 2012 from the Capital Bikeshare system, Washington D.C., USA, which is publicly available at http://capitalbikeshare.com/system-data. We aggregated the data on both an hourly and a daily basis and then extracted and added the corresponding weather and seasonal information. Weather information is extracted from http://www.freemeteo.com.

=========================================
Files
=========================================
- Readme.txt
- hour.csv : bike sharing counts aggregated on hourly basis. Records: 17379 hours
- day.csv : bike sharing counts aggregated on daily basis. Records: 731 days

=========================================
Dataset characteristics
=========================================
Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv

- instant : record index
- dteday : date
- season : season (1: spring, 2: summer, 3: fall, 4: winter)
- yr : year (0: 2011, 1: 2012)
- mnth : month (1 to 12)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit :
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : normalized temperature in Celsius. The values are divided by 41 (max)
- atemp : normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum : normalized humidity. The values are divided by 100 (max)
- windspeed : normalized wind speed. The values are divided by 67 (max)
- casual : count of casual users
- registered : count of registered users
- cnt : count of total rental bikes including both casual and registered

=========================================
License
=========================================
Use of this dataset in publications must be cited to the following publication:

[1] Fanaee-T, Hadi, and Gama, Joao, "Event labeling combining ensemble detectors and background knowledge", Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.

@article{
    year={2013},
    issn={2192-6352},
    journal={Progress in Artificial Intelligence},
    doi={10.1007/s13748-013-0040-3},
    title={Event labeling combining ensemble detectors and background knowledge},
    url={http://dx.doi.org/10.1007/s13748-013-0040-3},
    publisher={Springer Berlin Heidelberg},
    keywords={Event labeling; Event detection; Ensemble learning; Background knowledge},
    author={Fanaee-T, Hadi and Gama, Joao},
    pages={1-15}
}
Python Implementation with code:
0. Import necessary libraries
Import the necessary modules from specific libraries.
import pandas as pd
from glob import glob
from sklearn.linear_model import LogisticRegression
# train_test_split lives in sklearn.model_selection; the older
# sklearn.cross_validation module has been removed from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn import metrics
1. Load the data set
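Use the pandas module to read the bike data from the file system, then check the first few records of the dataset.
data_list = sorted(glob('data/bike-sharing/*'))
# pick day.csv by name rather than by position, since glob order is not guaranteed
day_path = [path for path in data_list if path.endswith('day.csv')][0]
day_df = pd.read_csv(day_path)
day_df.head()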
   instant      dteday  season  yr  mnth  holiday  weekday  workingday  weathersit      temp     atemp       hum  windspeed  casual  registered   cnt
0        1  2011-01-01       1   0     1        0        6           0           2  0.344167  0.363625  0.805833   0.160446     331         654   985
1        2  2011-01-02       1   0     1        0        0           0           2  0.363478  0.353739  0.696087   0.248539     131         670   801
2        3  2011-01-03       1   0     1        0        1           1           1  0.196364  0.189405  0.437273   0.248309     120        1229  1349
3        4  2011-01-04       1   0     1        0        2           1           1  0.200000  0.212122  0.590435   0.160296     108        1454  1562
4        5  2011-01-05       1   0     1        0        3           1           1  0.226957  0.229270  0.436957   0.186900      82        1518  1600
2. Identify the target variable and convert it into a categorical variable
The problem we are solving is to model the demand for bikes. In the dataset, the last column, “cnt”, gives the total number of bikes used on a particular day. This will be our target column. However, this column contains continuous values, and to build a logistic regression model we need a discrete target variable. Hence, let’s convert this column into a categorical column by thresholding it at a particular value. First, let’s check a five-number summary of the target column.
day_df.iloc[:,-1].describe()

count     731.000000
mean     4504.348837
std      1937.211452
min        22.000000
25%      3152.000000
50%      4548.000000
75%      5956.000000
max      8714.000000
Name: cnt, dtype: float64
This shows that the average demand has been around 4504 bikes per day. Since we want to predict whether demand is high, let’s pick a threshold close to the 50th percentile of the distribution (4548). We fix the threshold at 4600, i.e. any day with more than 4600 rentals is considered high demand.
day_df['High'] = day_df.cnt.map(lambda x: 1 if x>4600 else 0)
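The thresholding rule can be illustrated without the dataset. The sketch below applies it to the five summary values printed above (min, quartiles and max), using only plain Python; the helper name label_demand is hypothetical and chosen here for illustration.

```python
THRESHOLD = 4600  # same cut-off as in the post


def label_demand(count: int, threshold: int = THRESHOLD) -> int:
    """Return 1 (high demand) if count exceeds the threshold, else 0."""
    return 1 if count > threshold else 0


# min, 25%, 50%, 75% and max of the cnt column from the summary above
summary_counts = [22, 3152, 4548, 5956, 8714]
labels = [label_demand(count) for count in summary_counts]
print(labels)  # [0, 0, 0, 1, 1]
```

Note that the median (4548) falls just below the threshold, so slightly fewer than half of the days end up labeled as high demand. In pandas, an equivalent vectorized form of the mapping would be (day_df['cnt'] > 4600).astype(int).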
3. Select the predictor feature and the target variable
X = day_df[['windspeed']]  # 2-D feature matrix with a single predictor
y = day_df['High']         # binary target (1 = high demand)
4. Train-test split
# hold out 30% of the days for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)
5. Training/model fitting
model = LogisticRegression()
model.fit(X_train, y_train)
6. Model evaluation
predicted = model.predict(X_test)
# generate evaluation metrics
print(metrics.accuracy_score(y_test, predicted))

0.5727272727272728
The accuracy is ~57%, which means this is not a very good model. This is, however, expected for a univariate logistic regression, because a single predictor variable is often not sufficient for making a decision. Model tuning, feature engineering and data preprocessing are a few techniques that can help improve the model. This blog focuses on how to create a model based on a given dataset.
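One way to put an accuracy figure in context is to compare it against a majority-class baseline and to look at the confusion counts. The sketch below does this in plain Python on made-up labels (they are not taken from the bike data); the helper names are chosen here for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count true/false negatives and positives for 0/1 labels."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "tn": counts[(0, 0)],
        "fp": counts[(0, 1)],
        "fn": counts[(1, 0)],
        "tp": counts[(1, 1)],
    }


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print(accuracy(y_true, y_pred))            # 0.7
print(majority_baseline_accuracy(y_true))  # 0.5
print(confusion_counts(y_true, y_pred))    # {'tn': 4, 'fp': 1, 'fn': 2, 'tp': 3}
```

With scikit-learn, metrics.confusion_matrix(y_test, predicted) and metrics.classification_report(y_test, predicted) give the same kind of breakdown for the fitted model.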
Predicting the probability that bike demand will be high
Let’s predict the probability of high bike demand for a hypothetical day that is absent from the dataset. We consider two cases: a day on which the wind speed equals the minimum observed value, and a day on which it equals the maximum. The expectation is that people ride less when the wind speed is high and more when it is low. Let’s validate this with the model.
Case 1: when the wind speed is minimum
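# predict expects a 2-D input, so wrap the single value in a one-row DataFrame
test_data = pd.DataFrame({'windspeed': [day_df.windspeed.min()]})
print(model.predict(test_data))
print(model.predict_proba(test_data))

[1]
[[0.44937194 0.55062806]]
This means the model predicts high demand, with a predicted probability of ~55%.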
Case 2: when the wind speed is maximum
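# predict expects a 2-D input, so wrap the single value in a one-row DataFrame
test_data = pd.DataFrame({'windspeed': [day_df.windspeed.max()]})
print(model.predict(test_data))
print(model.predict_proba(test_data))

[0]
[[0.64939015 0.35060985]]
This means the model predicts low demand, with a predicted probability of ~65%.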
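To see where probabilities like these come from, here is a dependency-free sketch that fits a one-feature logistic regression by batch gradient descent and then evaluates sigmoid(w*x + b) at the lowest and highest wind speed. The toy data, learning rate and epoch count are assumptions made for illustration. scikit-learn's LogisticRegression applies L2 regularization by default and uses a different optimizer, so its coefficients will not match this sketch exactly.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_univariate(xs, ys, lr=0.5, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # derivative of the log loss w.r.t. the score
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Made-up toy data: normalized wind speed and a 0/1 "high demand" label.
wind = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
high = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

w, b = fit_univariate(wind, high)
p_low_wind = sigmoid(w * min(wind) + b)
p_high_wind = sigmoid(w * max(wind) + b)
print(f"w = {w:.3f}, b = {b:.3f}")
print(f"P(high demand | lowest wind)  = {p_low_wind:.3f}")
print(f"P(high demand | highest wind) = {p_high_wind:.3f}")
```

Because the learned weight is negative on this toy data, the predicted probability of high demand falls as wind speed rises, which mirrors the pattern in the two cases above. For the fitted scikit-learn model, the corresponding parameters are available as model.coef_ and model.intercept_.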
Logistic Regression Assumptions:
- A binary logistic regression requires the dependent variable to be binary.
- For a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.
- Only meaningful variables should be included.
- The independent variables should be independent of each other, i.e. the model should have little or no multicollinearity.
- The features are linearly related to the logit (log odds) of the target variable.
- Sample size should be sufficient. A common rule of thumb is at least 50 records per predictor.
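Two of the assumptions above lend themselves to a quick check. The sketch below, in plain Python, computes the Pearson correlation between two candidate predictors and the number of records per predictor. The temp and atemp values are the five rows shown in the data preview earlier, so the correlation is only indicative; the helper names are chosen here for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def records_per_predictor(n_records, n_predictors):
    """Average number of records available per predictor."""
    return n_records / n_predictors


# temp and atemp from the first five rows of day.csv shown earlier
temp = [0.344167, 0.363478, 0.196364, 0.200000, 0.226957]
atemp = [0.363625, 0.353739, 0.189405, 0.212122, 0.229270]

r = pearson_r(temp, atemp)
print(f"correlation(temp, atemp) on five rows: {r:.3f}")
print(f"records per predictor: {records_per_predictor(731, 1):.0f}")
```

A correlation this close to 1 suggests that temp and atemp carry nearly the same information, so including both in a multivariate model would conflict with the independence assumption. With 731 daily records and a single predictor, the sample-size rule of thumb is comfortably met.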