Introduction to Machine Learning Using Spark

Machine Learning Using Spark

Real-time machine learning applications built on Spark have been getting a lot of attention recently because they can solve real-world problems as the data arrives. Spark is a big data processing engine that is well suited to this kind of application.

Spark integrates a real-time streaming engine and a machine learning engine in one framework. This post is an introduction to machine learning using Spark. Spark’s machine learning library is called MLlib. It includes machine learning algorithms for clustering, classification, collaborative filtering and many other machine learning tasks.

Our previous blog post was ‘An Introduction to Beginner’s Guide for Spark.’

We can perform various functions with Spark:

  • SQL operations: Spark has its own SQL engine called Spark SQL. It covers the features of both SQL and Hive.

  • Machine Learning: Spark has its own machine learning library, MLlib, so it can perform machine learning without the help of Apache Mahout.

  • Graph processing: Spark performs graph processing with its GraphX component.

All of the above features are built into Spark.

Spark provides two machine learning APIs, spark.mllib and spark.ml. The spark.mllib package contains the machine learning APIs built on RDDs, and the spark.ml package contains the machine learning APIs built on DataFrames.

Now let us come to the concepts of Machine learning.

Machine learning means making the machine learn from our existing data so that it can make predictions about future data.

There are different types of machine learning techniques:

  • Supervised learning
  • Unsupervised learning
  • Semi-supervised Learning
  • Reinforcement learning

Let’s have a brief introduction to each of these techniques.

Supervised learning: This technique trains the application on an existing training dataset, i.e. labeled data, and then uses the application to predict the label for unlabeled data. Supervised learning includes classification and regression.

Unsupervised learning: This technique is used to find correlations and hidden parameters in the original data. It works on unlabeled data. Unsupervised learning applications include social networking applications, language prediction, etc.

Semi-supervised learning: This technique combines supervised and unsupervised learning techniques to perform predictive analytics, so it can use both labeled and unlabeled datasets. Semi-supervised applications include voice recognition, image categorization, etc.

Reinforcement learning: This technique learns how to maximize a numerical reward by analyzing the various actions that lead to the same goal and suggesting which action will produce the maximum reward. Reinforcement learning is used in AI applications.


Algorithms provided by Spark’s MLlib

MLlib contains many algorithms and utilities.

ML algorithms include:

  • Classification: Logistic regression, Naive Bayes
  • Regression: Generalized linear regression, survival regression
  • Decision trees, random forests, and gradient boosted trees
  • Recommendation: Alternating least squares (ALS)
  • Clustering: K-means, Gaussian mixtures (GMMs)
  • Topic Modeling: latent Dirichlet allocation (LDA)
  • Frequent itemsets, association rules, and sequential pattern mining

ML workflow utilities include:

  • Feature transformations: Standardization, normalization, hashing
  • ML Pipeline construction
  • Model evaluation and hyper-parameter tuning
  • ML persistence: saving and loading models and pipelines

Other utilities include:

  • Distributed linear algebra: SVD, PCA
  • Statistics: summary statistics, hypothesis testing

The best thing about MLlib is that it provides machine learning APIs in different languages such as Scala, Java and Python. You can develop your machine learning application in any of these languages.

Overview of Machine Learning Algorithms in Spark MLlib


Classification: Classification is a kind of supervised learning that assigns an input to one of a set of pre-defined classes. Applications such as fraudulent transaction detection and email spam detection can be developed using classification algorithms.

To perform classification, you need a labeled dataset from which to extract features and train your model. Based on these features, the model can predict, for example, whether an email is spam or not.

To build a classification model, we can use algorithms such as Naive Bayes or logistic regression.
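To make the idea concrete, here is a minimal sketch of a Naive Bayes spam classifier in plain Python. It is not MLlib's implementation; the function names (`train_naive_bayes`, `predict`) and the tiny training set are made up for illustration, and add-one smoothing is an assumption of this sketch.

```python
import math
from collections import Counter

def train_naive_bayes(labeled_docs):
    """Count words per label from (label, text) pairs."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    for label, text in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-posterior for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Start from the log prior of the class.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled dataset: "spam" and "ham" (not spam).
training_data = [
    ("spam", "win free money now"),
    ("spam", "free prize claim now"),
    ("ham", "meeting agenda for monday"),
    ("ham", "lunch on monday"),
]
model = train_naive_bayes(training_data)
print(predict(model, "claim free money"))
print(predict(model, "monday meeting"))
```

The labeled examples play the role of the training dataset, and the word counts are the features the model uses to decide between the two classes.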

Regression: Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or ‘predictors’). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or ‘criterion variable’) changes when any one of the independent variables is varied, while the other independent variables are held fixed.
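As a small illustration of the relationship between one dependent and one independent variable, the following sketch fits a straight line by ordinary least squares in plain Python. The function name `fit_line` and the sample data are assumptions for this example; MLlib's regression algorithms are more general.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The slope tells us how much the dependent variable changes when the independent variable increases by one unit.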

Decision Trees: Decision tree learning uses a decision tree as a predictive model which maps observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a finite set of values are called classification trees; those where it can take continuous values are called regression trees.
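A full decision tree is built by repeatedly choosing the best split. The sketch below shows only that single step, on one numeric feature, using Gini impurity as the split criterion. The helper names (`gini`, `best_split`) and the transaction amounts are hypothetical, and this is an illustration rather than how MLlib builds its trees.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Find the threshold on one feature that minimizes weighted Gini impurity."""
    best_threshold, best_impurity = None, float("inf")
    for threshold in sorted(set(values)):
        # One branch for values <= threshold, the other for values above it.
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity

# Hypothetical transaction amounts with their known labels.
amounts = [10, 20, 30, 500, 700, 900]
labels = ["ok", "ok", "ok", "fraud", "fraud", "fraud"]
threshold, impurity = best_split(amounts, labels)
print(threshold, impurity)
```

Each chosen threshold becomes a branch in the tree, and the labels that end up on each side become the leaves (or are split again).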

Recommendation: This is a kind of collaborative filtering which recommends items to users based on their own preferences and on information from other users. It relies on similarity: people who liked similar items in the past are likely to like similar items in the future. The recommendations provided by e-commerce giants like Amazon are an example.
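The similarity idea can be sketched with a simple user-based collaborative filter in plain Python. Note that this is not the alternating least squares (ALS) algorithm that MLlib provides; it is a simpler technique chosen for brevity, and the user names, item names and ratings are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dicts (missing ratings count as 0)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(ratings, user, top_n=1):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"book_a": 5, "book_b": 4},
    "bob": {"book_a": 5, "book_b": 5, "book_c": 4},
    "carol": {"book_d": 5, "book_a": 1},
}
print(recommend(ratings, "alice"))
```

Because bob's past ratings resemble alice's much more closely than carol's do, the item bob liked is ranked first for alice.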


Clustering: Clustering is a kind of unsupervised learning that groups similar things into one cluster. Clustering applications group things based on their similarity. For example, news applications categorize news stories into groups based on what they have in common.

To build a clustering model, we can use algorithms such as K-Means.
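Here is a minimal sketch of the K-Means idea on one-dimensional points in plain Python. The function name `k_means_1d`, the fixed starting centroids and the sample points are assumptions made to keep the example deterministic; MLlib's K-Means works on feature vectors and chooses its own initial centroids.

```python
def k_means_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]  # keep old centroid if cluster is empty
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the two groups emerge purely from the distances between the points.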

Topic Modeling:

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for the discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: “dog” and “bone” will appear more often in documents about dogs, “cat” and “meow” will appear more often in documents about cats, and “the” and “is” will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The “topics” produced by topic modeling techniques are clusters of similar words.

Steps to develop a Machine learning program

  • Feature Extraction and Domain knowledge

  • Feature Selection

  • Choice of Algorithm

  • Training

  • Choice of Metrics/Evaluation Criteria

  • Testing

In feature extraction, we need deep knowledge of the data we are handling, and based on that data we extract the features required for building the machine learning application. For example, to build a spam classifier, convert the e-mail text into useful features for training, e.g. by stemming, removing stop words and counting word frequencies. Then, in feature selection, evaluate these features (i.e. apply an attribute selection method) to select the most significant ones.
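As a rough sketch of this step, the following plain Python code lowercases an e-mail text, removes stop words and counts word frequencies. The stop-word list and the function name `extract_features` are illustrative assumptions, and stemming is left out to keep the example short.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "you", "your"}

def extract_features(text):
    """Turn raw text into a word-frequency feature map."""
    # Lowercase and keep only runs of letters (and apostrophes) as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Drop stop words, which carry little information about the class.
    kept = [t for t in tokens if t not in STOP_WORDS]
    return Counter(kept)

features = extract_features("Claim your FREE prize! The prize is free to claim.")
print(features)
```

The resulting counts are the kind of features a classifier can be trained on; feature selection would then keep only the words that best separate spam from non-spam.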

Machine learning provides a number of algorithms, so you need to select the appropriate one for your application. For example, to develop a spam classifier we can use Naive Bayes, logistic regression or another classification algorithm.

Training: The training set is a set of examples used for learning, that is, to fit the parameters (i.e. weights) of the classifier.

In the choice of metrics, you decide how to measure how accurate your model is. Depending on the outcomes, measures such as precision, recall, error rate, F1-measure and robustness are used to evaluate performance.
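Some of these measures can be computed directly from the actual and predicted labels. The sketch below does so in plain Python for a binary spam problem; the function name `evaluate` and the label lists are made up for illustration.

```python
def evaluate(actual, predicted, positive="spam"):
    """Compute precision, recall, F1 and error rate for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    error_rate = sum(1 for a, p in pairs if a != p) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "error_rate": error_rate}

# Hypothetical ground truth and model output for eight e-mails.
actual = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]
metrics = evaluate(actual, predicted)
print(metrics)
```

Precision answers "of the e-mails flagged as spam, how many really were spam?", while recall answers "of the real spam e-mails, how many did we catch?".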

In the testing phase, we test the model on an unseen dataset or on a held-out partition of the data. Generally, the original dataset can be partitioned into training and test datasets in a ratio of 60% to 40%.
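A 60/40 partition can be sketched in a few lines of plain Python. The function name `train_test_split` and the fixed seed are assumptions of this example; Spark DataFrames provide a `randomSplit` method for the same purpose on distributed data.

```python
import random

def train_test_split(rows, train_fraction=0.6, seed=42):
    """Shuffle the rows reproducibly and cut them into training and test sets."""
    shuffled = list(rows)                      # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)      # seeded shuffle for repeatable splits
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ten hypothetical records, identified by number.
rows = list(range(10))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting matters: if the original data is sorted (for example by date or by label), an unshuffled split would give the model a test set that looks nothing like its training set.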

This is how a machine learning application can be built.

In our next blog, we will build a spam classifier application using Spark. We hope this blog helped you in getting an introduction to machine learning. Keep visiting our site for more updates on Big data and other technologies.


