This is the first part of a series of posts that outlines the importance of Spark in solving Machine Learning problems. The series covers the complete set of steps necessary for a Data Science project. These steps, commonly known in the industry as a data pipeline, consist of various data transformations stitched together.
During this tutorial, I will be using the following technological components:
- Spark : RDD and DataFrame
- Python API
- MLlib and ML API
It is advisable to have a basic knowledge of SQL operations and Python programming concepts.
Before we take up the case study, let’s have a brief introduction to the RDD and DataFrame APIs. These two APIs are the most important part of the whole technology stack because they are mainly used for data preparation, which includes data cleaning and data pre-processing, and about 80% of a Machine Learning project involves data preparation. Therefore, you should have a good knowledge of the RDD and DataFrame APIs.
I will be discussing a few of the frequently used operations from the RDD and DataFrame APIs.
RDD stands for Resilient Distributed Dataset. RDDs are part of Spark’s core API and are its fundamental data structure. RDDs can be created in two ways:
- Parallelizing an existing collection in your driver program.
- Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
A Few Common RDD Operations:
- cache() : Persists the RDD at the default storage level (MEMORY_ONLY). Caching speeds up processing by keeping the data in main memory. This is highly useful when a specific RDD is used repeatedly, for example in interactive work. In that case, we should cache the RDD so that all repeated operations read it from memory and it does not have to be recomputed from its source each time. We can understand this operation more clearly through the examples below:
Case 1: We can check the time of execution of an RDD action without caching.
Here, it took 0.90 seconds to load 1000 records from the CSV file.
Case 2: We can check the time of execution of an RDD action with caching.
This time, the same operation took 0.1 seconds, which is about 9 times faster.
- collect() : Returns a list with all of the contents of the RDD. This method is only meant to be used on small RDDs, because it fetches the entire RDD into the driver’s memory, which can hold only a limited amount of data.
- take(n) : Takes the first n elements of the RDD. It works by first scanning one partition and then using the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
- first() : Fetches the first element from the RDD.
This gives the first record, i.e. the header.
- filter(f) : Returns a new RDD with all the elements that meet the criteria defined as per function f.
Now, for a business case, we do not need the headers to be included in the actual data. We need to filter out the headers.
If we check the first row, we see that the headers have been removed.
- map() : Returns a new RDD by applying a function to all elements of the RDD.
Here, the map function is used to transform an RDD of strings into an RDD of comma-separated fields.
- flatMap(f) : Returns a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
- distinct(): Returns a new RDD, containing the distinct elements in this RDD.
- aggregateByKey(zeroValue, seqFunc, combFunc) : Aggregates the values of each key, using the given combine functions and a neutral “zero value”.
One of the use cases of this function is to apply an aggregation to each category. For example, we can find the average salary for each age using this function.
- sample(withReplacement, fraction, seed) : Returns an RDD consisting of a random sample of the original RDD. The size of the sample is specified by fraction, which is an expected proportion rather than an exact count. The seed initializes the random number generator, so if you always use the same seed, you will always get the same sample. The withReplacement option specifies whether the sample is drawn from the parent RDD with or without replacement. With withReplacement=True, an element can be picked more than once, so the sample may contain repeated elements. With withReplacement=False, each element can be picked at most once.
- randomSplit(weights, seed) : Splits the RDD into a list of RDDs according to the weights specified. As the name suggests, the split is random, i.e. each element is randomly assigned to one of the resulting RDDs. This is mostly used to split data into training and test sets.
- count() : Counts the number of elements in the RDD.
This gives the total count of records available in the employee.csv file.
- reduce(f) : Reduces the elements of the RDD using the commutative and associative binary operation specified through f.
- sum() : Adds up the elements in this RDD.
- max() : Returns the maximum element of the RDD.
- mean() : Returns the average of all the elements in the RDD.
- min() : Returns the minimum element of the RDD.
- stdev() : Computes the standard deviation of this RDD’s elements.
- stats() : Provides a statistical summary of a numeric RDD, including the mean, max, min, stdev, and count of records.
The operations above are just a few of the most frequently used ones in the RDD API. There are many more that can be used for other data manipulation tasks; you can refer to the Spark API documentation for these.
In our next post, we will learn about another data manipulation API of Spark, its advantages, and the various functionalities available in it.