Building a Spam Filtering Engine Using Spark MLlib

How to build a spam filtering engine using Spark MLlib?

In this blog, we will build two spam classification engines, one using logistic regression and the other using Naive Bayes, and then compare their accuracy. Before going further, we recommend reading our previous blog, ‘Introduction to Machine Learning Using Spark’, and learning what spam filtering is.

What is Spam Filtering?

Spam filtering is the process of detecting unwanted or unsolicited email or text and keeping it out of the user’s inbox. Spam filtering applications rely on text filters, which use algorithms to detect the words and phrases that occur most often in spam emails.

Now, let us build a spam filtering application using logistic regression.

First, we need a dataset of texts. You can download the datasets from the link below.

https://drive.google.com/open?id=0ByJLBTmJojjzdTdvRC10TnhCa2M

There are two files: one contains spam emails and the other contains ham (non-spam) emails. We will train our model on these datasets.

Load the datasets into the Spark shell as shown below:

val spam_mails = sc.textFile("file:///home/kiran/Documents/datasets/spam_filtering/spam")
val ham_mails = sc.textFile("file:///home/kiran/Documents/datasets/spam_filtering/ham")

Now we need to extract the features of this dataset. For feature extraction, we can use HashingTF in Spark.

In machine learning, feature hashing, also known as the hashing trick, is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array.

This can be done as follows:

import org.apache.spark.mllib.feature.HashingTF
val features = new HashingTF(numFeatures = 1000)

This creates a transformer that hashes each term of a text (in this blog, each word of an email) to an index and counts the occurrences, producing a term-frequency vector of length 1000 that can be passed to MLlib algorithms.
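To make the hashing trick concrete, here is a minimal sketch in plain Scala. It is not Spark's implementation: the object HashingSketch and the method termFrequencies are hypothetical names, and the sketch assumes Scala's built-in hashCode, which may differ from the hash function HashingTF uses internally.

```scala
object HashingSketch {
  // Hypothetical helper (not Spark's code): maps each term to a bucket using
  // Scala's built-in hashCode and counts how many terms land in each bucket.
  def termFrequencies(terms: Seq[String], numFeatures: Int): Array[Double] = {
    val vector = new Array[Double](numFeatures)
    for (term <- terms) {
      // floorMod keeps the index in [0, numFeatures) even for negative hashes.
      val index = Math.floorMod(term.hashCode, numFeatures)
      vector(index) += 1.0
    }
    vector
  }
}
```

For example, HashingSketch.termFrequencies("win cash now win".split(" ").toSeq, 1000) returns a vector of length 1000 whose entries sum to 4. Two different words can hash to the same index; a larger numFeatures makes such collisions less likely at the cost of longer vectors.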

Now we need to apply this transformer to our datasets. This can be done as follows:

val Features_spam = spam_mails.map(mail => features.transform(mail.split(" ")))
val Features_ham = ham_mails.map(mail => features.transform(mail.split(" ")))

Spam filtering is a supervised learning problem, so we need to give the algorithm labeled data. In MLlib, a labeled point is a local vector, either dense or sparse, paired with a label (response), and supervised learning algorithms take labeled points as input. The label is stored as a double, so labeled points can be used in both regression and classification. Here we label spam as 1 and ham as 0.

import org.apache.spark.mllib.regression.LabeledPoint
// Label spam as 1.0 and ham as 0.0.
val positive_data = Features_spam.map(vector => LabeledPoint(1.0, vector))
val negative_data = Features_ham.map(vector => LabeledPoint(0.0, vector))

Now we need to create the training data for our application. The training data will be 60% of the total data. We first combine the spam and ham datasets and then split the result into training and test data as follows.

val data = positive_data.union(negative_data)
data.cache()
val Array(training, test) = data.randomSplit(Array(0.6, 0.4))

We now have training and test data in a ratio of roughly 60% to 40%. The split is random, so the exact sizes vary slightly from run to run.
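The idea behind a weighted random split can be sketched in plain Scala. This is only an illustration under stated assumptions, not Spark's implementation: the object SplitSketch, its randomSplit method, and the seed value are hypothetical.

```scala
import scala.util.Random

object SplitSketch {
  // Hypothetical helper: each element goes to the training group with
  // probability trainFraction, independently of all other elements.
  def randomSplit[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    // Draw once per element so every element lands in exactly one group.
    val tagged = data.toVector.map(x => (x, rng.nextDouble() < trainFraction))
    val training = tagged.collect { case (x, true) => x }
    val test = tagged.collect { case (x, false) => x }
    (training, test)
  }
}
```

Because each element is assigned independently, the two groups only approximate the requested proportions. Spark's randomSplit also accepts a seed, for example data.randomSplit(Array(0.6, 0.4), seed = 11L), which makes the split repeatable across runs.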

Let’s create a logistic regression learner that uses the stochastic gradient descent (SGD) optimizer.

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
val logistic_Learner = new LogisticRegressionWithSGD()

Next, we train the model on the training data.

val model = logistic_Learner.run(training)

Next, we test the model by pairing each prediction on the test data with the actual label.

val predictionLabel = test.map(x=> (model.predict(x.features),x.label))

Calculate the accuracy of the model. Accuracy is the fraction of test examples for which the predicted label matches the actual label, so we divide the number of matches by the size of the test set. This can be done as follows:

val accuracy = 1.0 * predictionLabel.filter(x => x._1 == x._2).count() / test.count()

The complete program is shown in the screenshot below.

The screenshot reports an accuracy of 61.42% for logistic regression. That run appears to have divided by training.count() instead of test.count(). With a 60/40 split this caps the result near 66.7%, so the figure understates the share of test emails that were classified correctly.
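The accuracy calculation can be checked on a small in-memory example. The sketch below is plain Scala with no Spark dependency; the object AccuracySketch and the sample pairs are hypothetical and only illustrate the arithmetic.

```scala
object AccuracySketch {
  // pairs holds (prediction, label) tuples, one per evaluated example.
  // Accuracy = number of matching pairs / number of evaluated examples.
  def accuracy(pairs: Seq[(Double, Double)]): Double =
    pairs.count { case (prediction, label) => prediction == label }.toDouble / pairs.size
}
```

For the four pairs Seq((1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0)), three predictions match their labels, so AccuracySketch.accuracy returns 0.75. The denominator is the number of examples that were actually evaluated, which in this blog is the test set.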

Now we will build the same application using Naive Bayes and check its accuracy.

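To build the model using Naive Bayes, the only change needed is the training step, as shown below. The second argument, 1.0, is the smoothing parameter (lambda).

import org.apache.spark.mllib.classification.NaiveBayes
val model = NaiveBayes.train(training, 1.0)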

The prediction and label pairs are then computed with this model as follows.

val predictionLabel = test.map(x=> (model.predict(x.features),x.label))

The reported accuracy of this model using the Naive Bayes algorithm is 60.99%.

We can see that the accuracy of the spam filtering model is almost the same with both algorithms.
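For intuition about what a Naive Bayes classifier with additive smoothing computes, here is a minimal sketch in plain Scala. It is not MLlib's implementation: the object NaiveBayesSketch, its Model class, and the toy data in the note below are hypothetical, and as a simplification the sketch leaves the class priors unsmoothed.

```scala
object NaiveBayesSketch {
  // Hypothetical model: log prior per label and log word probabilities per label.
  final case class Model(logPrior: Map[Double, Double],
                         logLikelihood: Map[Double, Array[Double]]) {
    // Picks the label with the highest log prior + weighted log likelihood.
    def predict(x: Array[Double]): Double =
      logPrior.keys.maxBy { label =>
        logPrior(label) + x.indices.map(i => x(i) * logLikelihood(label)(i)).sum
      }
  }

  // data holds (label, term-count vector) pairs; lambda is the smoothing constant.
  def train(data: Seq[(Double, Array[Double])], lambda: Double): Model = {
    val numFeatures = data.head._2.length
    val byLabel = data.groupBy(_._1)
    val logPrior = byLabel.map { case (label, rows) =>
      label -> math.log(rows.size.toDouble / data.size)
    }
    val logLikelihood = byLabel.map { case (label, rows) =>
      // Total count of each feature over all rows with this label.
      val counts = Array.tabulate(numFeatures)(i => rows.map(_._2(i)).sum)
      // Adding lambda to every count keeps unseen features from getting probability 0.
      val total = counts.sum + lambda * numFeatures
      label -> counts.map(c => math.log((c + lambda) / total))
    }
    Model(logPrior, logLikelihood)
  }
}
```

Trained with lambda = 1.0 on two spam vectors, Array(2.0, 0.0) and Array(3.0, 0.0), and two ham vectors, Array(0.0, 2.0) and Array(0.0, 1.0), the sketch predicts 1.0 (spam) for Array(1.0, 0.0) and 0.0 (ham) for Array(0.0, 1.0). The smoothing constant gives a word never seen in one class a small nonzero probability, so a single unseen word does not rule that class out.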

We hope this blog helps you learn how to build your first machine learning application using Spark. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.

2 Comments

  1. How this spam filtering engine actually works in real world?
    And it would have become very simple to understand if you could have shown us a glimpse of its working in a real environment with its displayed input and output or screenshots of the working.
    Thanks and Regards,
    Pranav
