Data Science and Artificial Intelligence

Naive Bayesian Model

This entry is part 13 of 17 in the series Machine Learning Algorithms

It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a feature in a class is unrelated to the presence of any other feature. For example, a dress may be considered to be a shirt if it is red, printed, and has a sleeve. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this cloth is a shirt and that is why it is known as ‘Naive’.

Naive Bayes model is easy to build and particularly useful for very large datasets. Along with simplicity, Naive Bayes is known to outperform even the most-sophisticated classification methods.

Free Step-by-step Guide To Become A Data Scientist

Subscribe and get this detailed guide absolutely FREE

4 Applications of Naive Bayes Algorithms

Below are the common applications of Naive Bayes Algorithms:

  1. Real-time Prediction: As Naive Bayes is super fast, it can be used for making predictions in real time.
  2. Multi-class Prediction: This algorithm can predict the posterior probability of multiple classes of the target variable.
  3. Text classification/ Spam Filtering/ Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification (due to their better results in multi-class problems and independence rule) have a higher success rate as compared to other algorithms. As a result, it is widely used in Spam filtering (identify spam e-mail) and Sentiment Analysis (in social media analysis, to identify positive and negative customer sentiments)
  4. Recommendation System: Naive Bayes Classifier along with algorithms like Collaborative Filtering makes a Recommendation System that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

Mathematical Explanation:

Problem Statement:

To build a simple generative classification model called Naive Bayes for predicting the quality of the car given few of other car attributes.

Data details

==========================================
1. Title: Car Evaluation Database
==========================================

The dataset is available at “http://archive.ics.uci.edu/ml/datasets/Car+Evaluation

2. Sources:
   (a) Creator: Marko Bohanec
   (b) Donors: Marko Bohanec   ([email protected])
               Blaz Zupan      ([email protected])
   (c) Date: June, 1997

3. Past Usage:

   The hierarchical decision model, from which this dataset is
   derived, was first presented in 

   M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for 
   multi-attribute decision making. In 8th Intl Workshop on Expert
   Systems and their Applications, Avignon, France. pages 59-78, 1988.

   Within machine-learning, this dataset was used for the evaluation of
   the HINT (Hierarchy INduction Tool), which was proved to be able to
   completely reconstruct the original hierarchical model. This,
   together with a comparison with C4.5, is presented in

   B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by
   function decomposition. ICML-97, Nashville, TN. 1997 (to appear)

4. Relevant Information Paragraph:

   Car Evaluation Database was derived from a simple hierarchical
   decision model originally developed for the demonstration of DEX
   (M. Bohanec, V. Rajkovic: Expert system for decision
   making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates
   cars according to the following concept structure:

   CAR                      car acceptability
   PRICE                  overall price
   buying               buying price
   maint                price of the maintenance
   TECH                   technical characteristics
   COMFORT              comfort
   doors              number of doors
   persons            capacity in terms of persons to carry
   lug_boot           the size of luggage boot
   safety               estimated safety of the car

   Input attributes are printed in lowercase. Besides the target
   concept (CAR), the model includes three intermediate concepts:
   PRICE, TECH, COMFORT. 
5. Number of Instances: 1728 (instances completely cover the attribute space)

6. Number of Attributes: 6

7. Attribute Values:

   buying       v-high, high, med, low
   maint        v-high, high, med, low
   doors        2, 3, 4, 5-more
   persons      2, 4, more
   lug_boot     small, med, big
   safety       low, med, high

8. Missing Attribute Values: none

9. Class Distribution (number of instances per class)

   class      N          N[%]
   -----------------------------
   unacc     1210     (70.023 %) 
   acc        384     (22.222 %) 
   good        69     ( 3.993 %) 
   v-good      65     ( 3.762 %)

Tools to be used:

Numpy, pandas, scikit-learn

Python Implementation with code:

1. Import necessary libraries

Import the necessary modules from specific libraries.

import os
import numpy as np
import pandas as pd
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics , model_selection
## Import the Classifier.
from sklearn.naive_bayes import GaussianNB

2. Load the data set

Use the pandas module to read the bike data from the file system. Check few records of the dataset.

data = 
pd.read_csv('data/car_quality/car.data',names=['buying','maint','doors','persons','lug_boot','safety','class'])
data.head()

  buying maint doors persons lug_boot safety class
0 vhigh  vhigh 2     2       small    low    unacc
1 vhigh  vhigh 2     2       small    med    unacc
2 vhigh  vhigh 2     2       small    high   unacc
3 vhigh  vhigh 2     2       med      low    unacc
4 vhigh  vhigh 2     2       med      med    unacc

3. Check a few information about the data set

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null object
maint       1728 non-null object
doors       1728 non-null object
persons     1728 non-null object
lug_boot    1728 non-null object
safety      1728 non-null object
class       1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB

The train dataset has 1728 rows and 7 columns.

There are no missing values in the dataset

4. Identify the target variable

data['class'],class_names = pd.factorize(data['class'])

The target variable is marked as a class in the data frame. The values are present in string format. However, the algorithm requires the variables to be coded into its equivalent integer codes. We can convert the string categorical values into an integer code using factorize method of the pandas library.

Let’s check the encoded values now.

print(class_names)
print(data['class'].unique())

Index([u'unacc', u'acc', u'vgood', u'good'], dtype='object')
[0 1 2 3]

The values have been encoded into 4 different numeric labels.

5. Identify the predictor variables and encode any string variables to equivalent integer codes

data['buying'],_ = pd.factorize(data['buying'])
data['maint'],_ = pd.factorize(data['maint'])
data['doors'],_ = pd.factorize(data['doors'])
data['persons'],_ = pd.factorize(data['persons'])
data['lug_boot'],_ = pd.factorize(data['lug_boot'])
data['safety'],_ = pd.factorize(data['safety'])
data.head()

  buying maint doors persons lug_boot safety class
0 0      0     0     0       0        0      0
1 0      0     0     0       0        1      0
2 0      0     0     0       0        2      0
3 0      0     0     0       1        0      0
4 0      0     0     0       1        1      0

Check the data types now:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null int64
maint       1728 non-null int64
doors       1728 non-null int64
persons     1728 non-null int64
lug_boot    1728 non-null int64
safety      1728 non-null int64
class       1728 non-null int64
dtypes: int64(7)
memory usage: 94.6 KB

Everything is now converted to integer form.

Select the predictor feature and the target variable

X = data.iloc[:,:-1]
y = data.iloc[:,-1]

6. Train test split:

# split data randomly into 70% training and 30% test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=123)

7. Training/model fitting

model = GaussianNB()
## Fit the model on the training data.
model.fit(X_train, y_train)

8. Model parameters study :

# use the model to make predictions with the test data
y_pred = model.predict(X_test)
# how did our model perform?
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 150
Accuracy: 0.71

Algorithm Advantages:

  • It is easy to apply and predicts the class of test data set fast. It also performs well in multi-class prediction
  • When the assumption of independence holds, a Naive Bayes classifier performs better compared to the other models like logistic regression as you need less training data.
  • It performs well in the case of categorical input variables compared to a numerical variable(s). For the numerical variable, a normal distribution is assumed (bell curve, which is a strong assumption).

Algorithm Disadvantages:

  • If the categorical variable has a category (in test data set), which was not observed in training data set, then the model will assign a 0 (zero) probability and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use the smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.
  • On the other side naive Bayes is also known as a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.
  • Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors which are completely independent.

This model has an accuracy score of 71% since this is a very simplistic dataset with distinctly separable classes. That’s how to implement Naive-Bayes with scikit-learn. Load your favorite dataset and give it a try!. From here on, all you need is practice.

Visit our website Acadgild to know more.

Series Navigation<< Decision TreesMultiple Linear Regression >>

Abhay Kumar

Abhay Kumar, lead Data Scientist – Computer Vision in a startup, is an experienced data scientist specializing in Deep Learning in Computer vision and has worked with a variety of programming languages like Python, Java, Pig, Hive, R, Shell, Javascript and with frameworks like Tensorflow, MXNet, Hadoop, Spark, MapReduce, Numpy, Scikit-learn, and pandas.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close
Close