In this post, we will discuss the various operations related to Transformations and Actions in RDD, such as Map, FlatMap, ReduceByKey, and Sorting.
We recommend that readers go through our previous blog on Introduction to Spark before moving on to this one.
Let’s begin with a Python program for finding the square of a number using three different methods.
- Using User-Defined functions.
- Using Map functions.
- Using Lambda and Map functions.
Let’s now see the implementation of all the above three scenarios.
- Using User-Defined functions:
The screenshot below shows a program that finds the square of a number using a user-defined function.
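As a minimal sketch of this first approach (the function name `square` and the sample input 5 are illustrative and not taken from the original screenshot), a user-defined function might look like this:

```python
def square(x):
    """Return the square of a number."""
    return x * x


# Call the function with a sample value and print the result.
result = square(5)
print(result)
```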
- Using Map functions:
The map function applies a passed-in function to each item in an iterable object and returns the results (a list in Python 2, an iterator in Python 3). Here, the result contains the squares of all the integers.
In the example below, the integers are held in the list my_items, which is passed as the second argument of the map function.
Map calls the square function on each list item and collects all the return values into a new list.
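The following is a small sketch of the same idea, assuming Python 3 and some sample values for `my_items` (the actual values in the original screenshot are not known):

```python
def square(x):
    """Return the square of a number."""
    return x * x


my_items = [1, 2, 3, 4, 5]  # sample values, assumed for illustration

# In Python 3, map returns a lazy iterator, so list() collects the results.
squared_items = list(map(square, my_items))
print(squared_items)
```

Wrapping the call in `list()` is only needed in Python 3; in Python 2, `map` already returns a list.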
- Using Lambda and Map functions:
Using a construct called “lambda”, Python supports the creation of anonymous functions, that is, functions that are not bound to a name.
The lambda function squares each item in the items list.
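A short sketch of the lambda variant, assuming Python 3 and illustrative sample values for `items`:

```python
items = [1, 2, 3, 4, 5]  # sample values, assumed for illustration

# The lambda is an anonymous function; it replaces the named square function.
squared_items = list(map(lambda x: x * x, items))
print(squared_items)
```

The lambda keeps the squaring logic inline, which is the style typically used when passing small functions to Spark transformations.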
Now that we know what the Map and Lambda functions are, let’s discuss the implementation of Transformations and Actions in RDD by working through a practical example.
We have taken an input file named test_ip, and in the first step we will create an RDD from this dataset.
Refer to the screenshot below for the test_ip input file.
- Creation of the RDD from external Dataset test_ip:
The initial call to the textFile method of the variable sc (the SparkContext) creates the first resilient distributed dataset (RDD). In the example below, this first RDD is named my_file.
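A minimal sketch of this step is shown here. It assumes `sc` is an existing SparkContext (as provided by the PySpark shell) and that `test_ip` is reachable at the given path; the helper name `create_rdd` is hypothetical.

```python
def create_rdd(sc, path="test_ip"):
    """Create an RDD with one element per line of the text file at `path`.

    `sc` is assumed to be an existing SparkContext, such as the one the
    PySpark shell provides. The default path is an assumption.
    """
    return sc.textFile(path)
```

In a PySpark shell, `my_file = create_rdd(sc)` would correspond to the `my_file` RDD described above. Nothing is read at this point, because Spark only loads the data once an action is called.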
- Applying Map transformation to create a new RDD without punctuation:
In this step, we apply the Map function, which is a transformation, to the RDD we created. It returns a new RDD by applying the supplied function to each value in the original RDD. Here we use a lambda function that replaces some common punctuation characters with spaces and converts the text to lower case, producing a new RDD.
The content of the newly created RDD can be viewed using the Take operation which is an Action.
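The sketch below illustrates this step under some assumptions: the exact punctuation characters are not stated in the post, so comma, dot, and hyphen are used as examples, and the helper names `clean_line`, `remove_punctuation`, and `preview` are hypothetical. A named function is used in place of the lambda so it can be tried without Spark.

```python
def clean_line(line):
    """Replace a few punctuation characters with spaces and lower-case the line."""
    # The exact character set is an assumption; the post only says
    # "some common punctuation characters".
    for ch in (",", ".", "-"):
        line = line.replace(ch, " ")
    return line.lower()


def remove_punctuation(my_file):
    """Transformation: return a new RDD of cleaned lines (evaluated lazily)."""
    return my_file.map(clean_line)


def preview(rdd, n=5):
    """Action: bring the first n elements back to the driver as a list."""
    return rdd.take(n)
```

Because `map` is a transformation, no work happens until an action such as `take` is called, for example `preview(remove_punctuation(my_file))`.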
- Using flatMap transformation to split the records of the input file:
In this step, we use the flatMap transformation, which applies a function that takes each input value and returns a list. Each value of the list then becomes a new, separate value in the output RDD.
In our example, the lines are split into words and then each word becomes a separate value in the output RDD.
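To make the difference between Map and flatMap concrete, here is a small sketch. The helper names `split_line` and `to_words` are hypothetical, and the two sample lines are made up; the plain-Python part only mimics what the transformations do to a small list.

```python
def split_line(line):
    """Split a line into words on whitespace."""
    return line.split()


def to_words(cleaned_lines):
    """Transformation: one output element per word rather than per line."""
    return cleaned_lines.flatMap(split_line)


# Plain-Python illustration of map versus flatMap on sample data.
lines = ["spark makes rdds", "rdds are lazy"]

# map keeps one result per line, so the result is a list of lists.
mapped = [split_line(line) for line in lines]

# flatMap flattens those lists, so each word is its own element.
flat_mapped = [word for line in lines for word in split_line(line)]
```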
- Invoking Map transformation to create an RDD of key-value pairs:
In the step below, we invoke the Map transformation a second time. We use a function that replaces each original value in the input RDD with a tuple containing the word in the first position and the integer value 1 in the second position.
The resulting RDD now contains tuples of the form (<key>, <value>).
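A minimal sketch of the pairing step, with hypothetical helper names `to_pair` and `to_pairs`:

```python
def to_pair(word):
    """Turn a word into a (key, value) tuple with an initial count of 1."""
    return (word, 1)


def to_pairs(words):
    """Transformation: map every word in the RDD to a (word, 1) tuple."""
    return words.map(to_pair)
```

In a script this is usually written inline as `words.map(lambda word: (word, 1))`, where `words` stands for the RDD of individual words.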
In the step below, the reduceByKey transformation creates a new RDD containing one tuple for each unique <key> in the input. The value in the second position of each tuple is produced by applying the supplied Lambda function to all the <value>s that share that <key> in the input RDD.
Here, the key will be the word and the Lambda function will sum up the word counts for each word.
The output RDD consists of a word stored at the first position and its count stored in the second position.
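The sketch below shows how the counting step might be written with `reduceByKey`, along with a plain-Python stand-in that mimics the same behaviour on a small list. All helper names are hypothetical, and the stand-in is only an illustration of the semantics, not how Spark implements it.

```python
def add_counts(a, b):
    """Combine two partial counts for the same word."""
    return a + b


def count_words(pairs):
    """Transformation: merge the values of each key using add_counts."""
    return pairs.reduceByKey(add_counts)


def count_words_locally(pairs):
    """Plain-Python stand-in that mimics reduceByKey on a small list."""
    totals = {}
    for key, value in pairs:
        if key in totals:
            totals[key] = add_counts(totals[key], value)
        else:
            totals[key] = value
    return totals
```

For example, `count_words_locally([("a", 1), ("b", 1), ("a", 1)])` merges the two `"a"` entries into a single count.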
Next, the Map transformation applies a Lambda function that swaps the first and second values in each tuple. After this transformation, the word count appears in the first position, followed by the word in the second position.
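A small sketch of the swap, with hypothetical helper names `swap` and `swap_pairs`:

```python
def swap(pair):
    """Turn a (word, count) tuple into a (count, word) tuple."""
    word, count = pair
    return (count, word)


def swap_pairs(word_counts):
    """Transformation: make the count the key of every tuple."""
    return word_counts.map(swap)
```

The swap matters because key-based operations in Spark use the first element of each tuple as the key.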
In this step, the input RDD is sorted by key (the value in the first position of each tuple). Because the first position now holds the frequency of the word, and the sort-order parameter is set to False (descending) in the script, the most frequently occurring words come first in the RDD.
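The sorting step can be sketched as follows, assuming the script uses `sortByKey` with its `ascending` argument set to `False`. The helper names are hypothetical, and `sort_locally` is only a plain-Python stand-in for trying the idea on a small list.

```python
def sort_by_frequency(count_word_pairs):
    """Sort an RDD of (count, word) tuples by key, highest count first."""
    return count_word_pairs.sortByKey(ascending=False)


def sort_locally(count_word_pairs):
    """Plain-Python stand-in: sort a list of (count, word) tuples by count, descending."""
    return sorted(count_word_pairs, key=lambda pair: pair[0], reverse=True)
```

Passing the flag positionally, as in `sortByKey(False)`, has the same effect. An action such as `take` would then return the most frequent words first.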
We hope this post has helped you understand the various transformations and actions available on Spark RDDs.
In our next post, we will implement a case study using Spark.
Keep visiting our website www.acadgild.com for more posts on Big Data and other technologies.