Data Science and Artificial Intelligence

Logistic Regression With PCA – Speeding Up and Benchmarking

This entry is part 5 of 17 in the series Machine Learning Algorithms

Introduction to PCA Algorithm:

When a dataset has too many dimensions, learning patterns from it becomes harder. High dimensionality hurts in two ways: it increases compute and execution time, and it can degrade the quality of the model fit. When the dimensionality is too high, we need a way to reduce it, and the reduction must preserve as much of the relevant information in the original data as possible. The algorithm discussed in this article does this job. It is well known and widely used in a variety of tasks: Principal Component Analysis, or PCA.

The main purpose of principal component analysis is to find the directions of maximum variance and to recast the data into these fewer dimensions, so that the information a machine learning algorithm needs to do its job stays largely intact.

PCA transforms a high-dimensional dataset into a lower-dimensional subspace with a new coordinate system. In the new coordinate system, the first axis corresponds to the first principal component, which is the component that explains the greatest amount of the variance in the data.

In simple words, principal component analysis derives a small set of new variables, known as principal components, from the large set of variables in a dataset. Each principal component is a linear combination of the original variables. Together they capture as much information as possible from the original high-dimensional data, which is then represented in terms of its principal components in a new, lower-dimensional space.
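
To make the idea concrete, here is a minimal sketch of PCA computed by hand with NumPy's singular value decomposition. The synthetic three-dimensional data and all variable names are assumptions made for illustration; they are not part of the MNIST experiment in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-D data whose variance is concentrated along the first axis
# (illustrative data, not from the article).
X = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 0.2])

# 1. Centre the data: PCA is defined on mean-centred features.
X_centered = X - X.mean(axis=0)

# 2. SVD of the centred data; the rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Variance explained by each component, as a fraction of the total.
explained_variance = S**2 / (len(X) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# 4. Project onto the first two principal components (3-D -> 2-D).
X_reduced = X_centered @ Vt[:2].T

print(explained_variance_ratio)
print(X_reduced.shape)
```

The ratios come out sorted from largest to smallest, which is why the first axis of the new coordinate system carries the most variance.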

Summary of PCA:

Applications of PCA :

  • Visualization
  • Denoising
  • Data Compression
  • Speeding up ML algorithms
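
As a rough illustration of the data compression use case, the sketch below compresses images to a few components and restores them with `inverse_transform`. It assumes scikit-learn's small bundled digits dataset (8x8 images) as a stand-in for MNIST, and the choice of 16 components is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small bundled dataset used as a stand-in: 1797 images, 64 pixels each.
X = load_digits().data

# Keep 16 of the 64 dimensions (an arbitrary choice for this sketch).
pca = PCA(n_components=16).fit(X)
X_compressed = pca.transform(X)                    # shape (1797, 16)
X_restored = pca.inverse_transform(X_compressed)   # back to shape (1797, 64)

# The reconstruction is lossy: measure how much was lost.
mse = np.mean((X - X_restored) ** 2)
retained = pca.explained_variance_ratio_.sum()
print(X_compressed.shape, X_restored.shape)
print(f"variance retained: {retained:.3f}, reconstruction MSE: {mse:.3f}")
```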

Problem Statement:

Speed up the training of a handwritten digit recognition model.

Solution:

We will solve this problem by building a classification pipeline on the MNIST dataset.

About the Dataset 

 

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

Four files are available on this site:

  • train-images-idx3-ubyte.gz: training set images (9912422 bytes)
  • train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
  • t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
  • t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)

Parameter            Value
Classes              10
Samples per class    ~7,000
Samples total        70,000
Dimensionality       784
Features             integer values from 0 to 255

The MNIST database of handwritten digits is available on the following website: MNIST Dataset
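
If you download the four files listed above directly, they are gzip-compressed files in the IDX format. The following is a minimal sketch of a reader, assuming the files sit in the current directory; the helper name `read_idx` is hypothetical and is not used elsewhere in this article.

```python
import gzip
import struct

import numpy as np


def read_idx(path):
    """Read a gzip-compressed IDX file of unsigned bytes into a NumPy array.

    Hypothetical helper written for this sketch.
    """
    with gzip.open(path, 'rb') as f:
        # Header: two zero bytes, a data-type code, and the number of dimensions.
        zero, dtype_code, ndim = struct.unpack('>HBB', f.read(4))
        if zero != 0 or dtype_code != 0x08:
            raise ValueError('not an unsigned-byte IDX file')
        # One big-endian 32-bit size per dimension.
        shape = struct.unpack('>' + 'I' * ndim, f.read(4 * ndim))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)


# Usage, assuming the files have been downloaded next to this script:
# images = read_idx('train-images-idx3-ubyte.gz').reshape(-1, 784)
# labels = read_idx('train-labels-idx1-ubyte.gz')
```

Flattening each 28x28 image with `reshape(-1, 784)` gives the 784-dimensional rows described in the table above.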

Train a model with all components

import time  # used below to time model training

from sklearn.datasets import fetch_mldata
from sklearn.decomposition import PCA
from sklearn import metrics
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

Load the Dataset :

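# Pass data_home to choose where the data is downloaded.
# Note: fetch_mldata was removed in scikit-learn 0.22. On newer versions,
# fetch_openml('mnist_784', version=1) loads MNIST (labels come back as strings).
mnist = fetch_mldata('MNIST original')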

Check data information:

print(mnist.data.shape)
print(mnist.COL_NAMES)
print(mnist.target.shape)
print(np.unique(mnist.target))  # the distinct class labels

(70000, 784)
['label', 'data']
(70000,)
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]

There are 70,000 records, each with 784 dimensions. The labels form a vector of 70,000 entries with values from 0 to 9. The features are exposed under the name ‘data’ and the labels under ‘target’.

Split the data into train/test :

# test_size: what proportion of original data is used for test set
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)

Standardize the data :

scaler = StandardScaler()

# Fit on training set only.
scaler.fit(train_img)

# Apply transform to both the training set and the test set.
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)

Notice that we fit the scaler on the training set only and then applied the same transformation to the test data, so no information from the test set leaks into the preprocessing.
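
The effect of fitting on the training set only can be checked on a small example. This is a sketch on synthetic data (an assumption for illustration, not MNIST): the standardized training features have a mean of zero, while the test features are only close to zero because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(700, 4))  # synthetic stand-in data

X_train, X_test = train_test_split(X, test_size=1/7.0, random_state=0)

scaler = StandardScaler().fit(X_train)    # statistics come from the training set only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)     # reuses the training mean and std

print(X_train_std.mean(axis=0))  # zero up to floating-point error
print(X_test_std.mean(axis=0))   # close to zero, but not exactly zero
```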

Initialize a benchmarking data frame:

Let’s initialize a pandas data frame that will hold:

  • Variance: the fraction of the original data’s variance that is retained
  • N_component: the number of principal components
  • Timing: the time taken to fit the model on the training set, in seconds
  • Accuracy: the fraction of test records correctly classified (0.9155 means 91.55%)

We will capture the above attributes from each experiment run.

benchmark_cols = ['Variance retained','n_Components','Time(s)','Accuracy_percentage']
benchmark = pd.DataFrame(columns = benchmark_cols)
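
On pandas 2.0 and later, `DataFrame.append` is no longer available, so an alternative is to collect one dict per run and build the data frame at the end. The sketch below assumes a hypothetical helper named `record_run`; the two rows use figures from the results reported later in this article.

```python
import pandas as pd

benchmark_cols = ['Variance retained', 'n_Components', 'Time(s)', 'Accuracy_percentage']

rows = []  # one dict per experiment run


def record_run(variance, n_components, timing, accuracy):
    """Store the metrics of one run (hypothetical helper, not in the original code)."""
    rows.append(dict(zip(benchmark_cols, [variance, n_components, timing, accuracy])))


# Two runs taken from the results table of this article.
record_run(1.00, 784, 72.379794, 0.9155)
record_run(0.95, 330, 39.592324, 0.9200)

# Build the data frame once, after all runs have been recorded.
benchmark = pd.DataFrame(rows, columns=benchmark_cols)
print(benchmark)
```

Building the frame once also avoids copying it on every run, which is what repeated appends do.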

Train the model with all data:

Train a logistic regression model on all the data and record the training time and accuracy.

With no reduction, the variance retained is 1.0 and the number of components is 784.

variance = 1.0
n_components = train_img.shape[1]

logisticRegr = LogisticRegression(solver = 'lbfgs')
start = time.time()
logisticRegr.fit(train_img, train_lbl)
end =  time.time()
timing = end-start
# Predict for Multiple Observations (images) at Once
predicted = logisticRegr.predict(test_img)
# generate evaluation metrics
accuracy = (metrics.accuracy_score(test_lbl, predicted))

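# Note: DataFrame.append was removed in pandas 2.0. On newer versions, use
# pd.concat([benchmark, pd.DataFrame([a])], ignore_index=True) instead.
a = dict(zip(benchmark_cols,[variance,n_components,timing,accuracy]))
benchmark = benchmark.append(a,ignore_index=True)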

print(benchmark)

   Variance retained  n_Components    Time(s)  Accuracy_percentage
0               1.00         784.0  72.379794               0.9155

Training on the full data took ~72 seconds and yielded an accuracy of 91.55%.
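
The timing and scoring steps above can be wrapped in a small reusable function. This is a sketch under stated assumptions: the helper name `fit_and_score` is hypothetical, `time.perf_counter` is swapped in for `time.time` because it is intended for measuring intervals, scikit-learn's small bundled digits dataset stands in for MNIST so the example runs quickly, and `max_iter=1000` is an assumption added to give the solver room to converge.

```python
import time

from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def fit_and_score(model, X_train, y_train, X_test, y_test):
    """Return (fit time in seconds, test accuracy) for one model.

    Hypothetical helper written for this sketch.
    """
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    accuracy = metrics.accuracy_score(y_test, model.predict(X_test))
    return elapsed, accuracy


# Small bundled dataset used as a stand-in for MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/7.0, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

elapsed, accuracy = fit_and_score(
    LogisticRegression(solver='lbfgs', max_iter=1000),
    X_train, y_train, X_test, y_test)
print(f"fit time: {elapsed:.3f}s, accuracy: {accuracy:.4f}")
```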

Now let’s train on data with reduced dimensionality. We will use PCA to reduce the number of components.

Decide on the variance percentages:

Fix the fractions of variance to retain in the experiments.

variance_list = [0.95,0.90,0.85,0.80,0.75,0.70]

We will check how long it takes to train the model at each of these retained-variance levels.
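
When `PCA` receives a float between 0 and 1 as `n_components`, scikit-learn keeps just enough components for the cumulative explained variance to exceed that fraction. The sketch below shows this on scikit-learn's small bundled digits dataset, which is a stand-in assumption so the example runs in seconds; `svd_solver='full'` is set explicitly here because a fractional `n_components` relies on the full solver.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small bundled dataset (64 features) used as a stand-in for MNIST.
X = StandardScaler().fit_transform(load_digits().data)

results = {}
for variance in [0.95, 0.90, 0.85, 0.80, 0.75, 0.70]:
    # A float in (0, 1) is read as the fraction of variance to retain.
    pca = PCA(n_components=variance, svd_solver='full').fit(X)
    retained = pca.explained_variance_ratio_.sum()
    results[variance] = (pca.n_components_, retained)
    print(f"target {variance:.2f}: {pca.n_components_} components, "
          f"retained {retained:.4f}")
```

The retained fraction is always at least the requested one, and lower targets need fewer components.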

Define a function to run the same model with various variances :

def benchmark_pca(variance,train_img,train_lbl,test_img,test_lbl):
    global benchmark

    # A float in (0, 1) tells PCA to keep just enough components
    # to retain that fraction of the variance.
    pca = PCA(variance)
    pca.fit(train_img)  # fit on the training set only
    n_components = pca.n_components_
    train_img = pca.transform(train_img)
    test_img = pca.transform(test_img)  # reuse the projection learned on the training set

    # Time only the classifier fit (PCA fitting time is not included)
    logisticRegr = LogisticRegression(solver = 'lbfgs')
    start = time.time()
    logisticRegr.fit(train_img, train_lbl)
    end = time.time()
    timing = end-start

    # Predict for Multiple Observations (images) at Once
    predicted = logisticRegr.predict(test_img)

    # generate evaluation metrics
    accuracy = (metrics.accuracy_score(test_lbl, predicted))

    # Note: DataFrame.append was removed in pandas 2.0. On newer versions, use
    # pd.concat([benchmark, pd.DataFrame([a])], ignore_index=True) instead.
    a = dict(zip(benchmark_cols,[variance,n_components,timing,accuracy]))
    benchmark = benchmark.append(a,ignore_index=True)

for variance in variance_list:
    benchmark_pca(variance,train_img,train_lbl,test_img,test_lbl)

print(benchmark)

   Variance retained  n_Components    Time(s)  Accuracy_percentage
0               1.00         784.0  72.379794               0.9155
1               0.95         330.0  39.592324               0.9200
2               0.90         236.0  30.176633               0.9169
3               0.85         184.0  23.074336               0.9154
4               0.80         148.0  19.963392               0.9127
5               0.75         120.0  19.286882               0.9105
6               0.70          98.0  17.231295               0.9075
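
To read the trade-off off the table, the sketch below computes the speed-up and the change in accuracy relative to the run with all 784 components. The numbers are copied from the results above; the column names are shortened for this example.

```python
import pandas as pd

# Figures copied from the results table above.
results = pd.DataFrame({
    'variance': [1.00, 0.95, 0.90, 0.85, 0.80, 0.75, 0.70],
    'n_components': [784, 330, 236, 184, 148, 120, 98],
    'time_s': [72.379794, 39.592324, 30.176633, 23.074336,
               19.963392, 19.286882, 17.231295],
    'accuracy': [0.9155, 0.9200, 0.9169, 0.9154, 0.9127, 0.9105, 0.9075],
})

# The first row (no reduction) is the baseline.
baseline = results.iloc[0]
results['speedup'] = baseline['time_s'] / results['time_s']
results['accuracy_change'] = results['accuracy'] - baseline['accuracy']

print(results[['variance', 'n_components', 'speedup', 'accuracy_change']].round(4))
```

With these figures, retaining 95% of the variance trains about 1.8 times faster with slightly higher accuracy, and retaining 70% trains about 4.2 times faster at a cost of 0.8 percentage points of accuracy.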

Let’s plot accuracy against the variance retained, the number of components, and the training time.

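import matplotlib.pyplot as plt

# Accuracy as a function of the variance retained
benchmark.plot(x='Variance retained', y='Accuracy_percentage')
plt.title("variance vs accuracy")

# Accuracy as a function of the number of components
benchmark.plot(x='n_Components', y='Accuracy_percentage')
plt.title("no of components vs accuracy")

# Accuracy as a function of the training time
benchmark.plot(x='Time(s)', y='Accuracy_percentage')
plt.title("time vs accuracy")

plt.show()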


 

 


Abhay Kumar

Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.
