Spark RDD Operations in Scala


In this blog, we will discuss operations on RDDs using the Scala programming language.

Before starting with RDD operations, have a look at our beginner's guide for Spark to understand the installation of Spark:

https://acadgild.com/blog/beginners-guide-fr-spark/

To perform operations on an RDD, we must first know how to create one. Let's have a look at the ways to create an RDD in the section below.

Methods to Create RDDs:

1. Referencing a data set in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

This method takes a URI for the file (either a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines. Here is an example invocation where we take a file test_file and create an RDD using SparkContext's textFile method.

Here we are creating a new RDD by loading a text file from HDFS.
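A minimal sketch of this step, as it would run in spark-shell (the HDFS path is an assumption; sc is the SparkContext the shell provides):

// Load a text file from HDFS as an RDD of lines; the exact path is an assumption
val testFile = sc.textFile("hdfs://localhost:9000/user/test_file")

// textFile is lazy: nothing is read until an action runs, for example
testFile.first()   // returns the first line of the file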


2. Parallelizing a collection of objects (such as a list or a set) in the driver program.

Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program. The elements of the collection are copied to form a distributed data set that can be operated on in parallel.

Below is a sample demonstration of the above scenario: we create some data in the variable data and load it as an RDD using the parallelize method.
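A minimal sketch, assuming a simple collection of numbers (the actual values used in the original demonstration are not shown here):

// Create a local collection in the driver program
val data = Array(1, 2, 3, 4, 5)

// Copy its elements into a distributed dataset (an RDD)
val distData = sc.parallelize(data)

// The RDD can now be operated on in parallel, e.g. summing the elements
distData.reduce(_ + _)   // 15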

These are the two ways to create RDDs in Spark using Scala.

Now we will look at RDD operations. RDDs support two types of operations: transformations and actions.

Operations on RDDs:

We will be using a tab-separated Olympics data set (athlete, country, sport, medal counts, and so on) for the operations in the following examples.

Transformations:

Any function that returns an RDD is a transformation. Elaborating further, transformations are functions that create a new dataset from an existing one by passing each element of the dataset through a function and returning a new RDD representing the results.

All transformations in Spark are lazy. They do not compute their results right away. Instead, they just remember the transformations applied to some base data set (e.g. a file). The transformations are only computed when an action requires a result that needs to be returned to the driver program.

In the example below, we apply a transformation using the filter function.
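A sketch of this idea, reusing the testFile RDD from above (the filter condition is an assumption chosen for illustration):

// Defining the transformation is instantaneous: no data is read yet
val filtered = testFile.filter(line => line.contains("India"))

// Only when an action is called does Spark actually compute the result
filtered.count()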

Now let us see some commonly used transformations: map, flatMap, and filter.

Map:
map takes each row as input, applies a function to it, and returns a new RDD containing the results.

Below is a sample demonstration of the above scenario.

Here we create a new RDD using the sc.textFile method and then use the map method to transform it.

In the first map, i.e., map_test, we split each record on the '\t' (tab) delimiter, since the data is tab-separated.

Then we transform the RDD again using the map method, i.e., map_test1, creating a pair from two of the columns (column 2 and column 3).
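A sketch of these two map steps (the exact column indices are assumptions about the dataset's layout):

// Split each tab-separated record into an array of fields
val map_test = testFile.map(line => line.split("\t"))

// Pair two of the columns; indices 1 and 2 are assumed here
val map_test1 = map_test.map(fields => (fields(1), fields(2)))

map_test1.first()   // one (column2, column3) pair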

FlatMap:
flatMap takes a function that returns an iterable collection for each input element and returns an RDD containing the contents of all those iterables, flattened into individual elements.

Below is a sample demonstration of the above scenario.

Previously, each element of map_test was itself an iterable (an array of fields). After performing flatMap, the elements are the individual fields themselves, no longer wrapped in iterables.
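A sketch contrasting the two, assuming the same tab-separated input:

// map produces one array of fields per input line: an RDD[Array[String]]
val mapped = testFile.map(line => line.split("\t"))

// flatMap flattens those arrays into individual fields: an RDD[String]
val flat_test = testFile.flatMap(line => line.split("\t"))

flat_test.take(5)   // the first five individual field values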

Filter:
filter returns a new RDD containing only the elements that meet the filter condition. Below is a sample demonstration of the above scenario.

You can see that all the records for India are present in the output.
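A sketch of the filter step (the index of the country column is an assumption):

// Keep only the records whose country column equals "India"
val india = testFile.filter(line => line.split("\t")(2) == "India")

india.collect().foreach(println)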

ReduceByKey:
reduceByKey takes an RDD of key-value pairs and combines all the values for each unique key. Below is a sample demonstration of the above scenario.

In this scenario, we take the Country and Total Medals columns as the key and value, and perform a reduceByKey operation on the RDD.

The final result is each country paired with the total number of medals it won across all the Olympic Games.
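A sketch of this aggregation (the column indices for country and total medals are assumptions):

// Build (country, totalMedals) pairs from each record
val medals = testFile.map { line =>
  val fields = line.split("\t")
  (fields(2), fields(9).toInt)
}

// Sum the medal counts for every unique country
val medalsByCountry = medals.reduceByKey(_ + _)

medalsByCountry.collect().foreach(println)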

Actions:

Actions return the final results of RDD computations. An action triggers execution using the lineage graph: Spark loads the data into the original RDD, carries out all the intermediate transformations, and returns the final result to the driver program or writes it out to the file system.

Collect:
collect is used to return all the elements in the RDD to the driver. Below is a sample demonstration of the above scenario.

Here we take the athlete's name and country as the two elements produced by the map method, and perform the collect action on the newly created RDD, map_test1.
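A sketch of this step (the column indices for athlete name and country are assumptions):

// Pair each athlete's name with their country
val map_test1 = testFile.map { line =>
  val fields = line.split("\t")
  (fields(0), fields(2))
}

// collect() brings every element back to the driver; use it only when
// the result is small enough to fit in the driver's memory
map_test1.collect()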

Count:
count is used to return the number of elements in the RDD. Below is a sample demonstration of the above scenario.

Counting map_test1 shows that there are 8618 records in the RDD.
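A sketch of the count action on the same RDD:

// count() runs the lineage and returns the number of elements
map_test1.count()   // 8618 on the dataset used in this post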

CountByValue:
countByValue is used to count the number of occurrences of each distinct element in the RDD. Below is a sample demonstration of the above scenario.

In this scenario, we take pairs of Country and Sport. By performing the countByValue action, we get the number of records for each country in each particular sport.
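A sketch of this step (the column indices for country and sport are assumptions):

// Build (country, sport) pairs from each record
val pairs = testFile.map { line =>
  val fields = line.split("\t")
  (fields(2), fields(5))
}

// countByValue returns a local Map from each distinct pair to its count
pairs.countByValue()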

Reduce:
reduce aggregates all the elements of the RDD into a single value by repeatedly applying the given function. Below is a sample demonstration of the above scenario, where we take the Total Medals column of the dataset and load it into the RDD map_test1. Performing a reduce operation on this RDD, we find that a total of 9529 medals were awarded to winners across the Olympic Games.
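A sketch of this step (the index of the total-medals column is an assumption):

// Load the total-medals column as integers
val map_test1 = testFile.map(line => line.split("\t")(9).toInt)

// reduce combines all the elements pairwise with the given function
map_test1.reduce(_ + _)   // 9529 on the dataset used in this post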

Take:
take returns the number of records we explicitly specify. Below is a sample demonstration of the above scenario.

When we perform the collect operation, all the elements of the RDD are displayed, whereas with the take operation we pass the argument 2 to limit the number of elements displayed.
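A sketch contrasting the two actions:

// collect() returns every element of the RDD
map_test1.collect()

// take(2) returns only the first two elements
map_test1.take(2)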

We hope this blog helped you understand RDDs and the most commonly used RDD operations in Scala. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
