
Spark RDD Operations in Scala Part – 2

In our previous post, we had discussed the basic RDD operations in Scala. Now, let’s discuss some of the advanced spark RDD operations in Scala.

Here, we have taken two datasets, dept and emp, to work on advanced operations. Datasets look as shown below:


Columns in both datasets are tab-separated, with the following schemas:

      dept: [DeptNo DeptName]
      emp:  [Emp_no DOB FName LName Gender HireDate DeptNo]

Union:

The union operation returns an RDD containing all the elements of both RDDs, including any duplicates.


Here, we created two RDDs by loading the two datasets, performed union on them, and printed the first 10 records of the resulting RDD. The 10th record is the first record of the second dataset, showing that the two datasets have been combined.
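The steps above can be sketched as follows in spark-shell (the HDFS paths are assumptions):

```scala
// Load the two tab-separated datasets (paths are hypothetical).
val dept = sc.textFile("hdfs:///data/dept.txt")
val emp  = sc.textFile("hdfs:///data/emp.txt")

// union keeps all elements of both RDDs, duplicates included.
val combined = dept.union(emp)
combined.take(10).foreach(println)
```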

Intersection:

Intersection returns only the elements common to both RDDs.

Here, we split both datasets on the tab delimiter, extracted the 1st column (DeptNo) from the dept dataset and the 7th column (DeptNo) from the emp dataset, and performed intersection on the two resulting RDDs.
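A minimal sketch of this, assuming spark-shell and hypothetical file paths:

```scala
// Extract the DeptNo column from each dataset (paths are assumptions).
val deptNos    = sc.textFile("hdfs:///data/dept.txt").map(_.split("\t")(0))
val empDeptNos = sc.textFile("hdfs:///data/emp.txt").map(_.split("\t")(6))

// intersection keeps only the DeptNos that occur in both RDDs.
deptNos.intersection(empDeptNos).collect().foreach(println)
```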

Cartesian:

The cartesian operation returns an RDD containing the Cartesian product of the elements of the two RDDs, i.e. every possible pair with one element taken from each.


Here, we again split the datasets on the tab delimiter, extracted the DeptNo columns from the two datasets, and performed the cartesian operation on the resulting RDDs.
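In spark-shell this could look like the following (paths are assumptions):

```scala
val deptNos    = sc.textFile("hdfs:///data/dept.txt").map(_.split("\t")(0))
val empDeptNos = sc.textFile("hdfs:///data/emp.txt").map(_.split("\t")(6))

// cartesian pairs every element of the first RDD with every element of the second.
deptNos.cartesian(empDeptNos).take(10).foreach(println)
```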

Subtract:

The subtract operation returns the elements of one RDD that are not present in another: rdd1.subtract(rdd2) keeps the elements of rdd1 that do not occur in rdd2.

Here, we extracted the DeptNo columns from the two datasets as before and performed the subtract operation on the resulting RDDs.
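A sketch of the same, assuming spark-shell and hypothetical paths:

```scala
val deptNos    = sc.textFile("hdfs:///data/dept.txt").map(_.split("\t")(0))
val empDeptNos = sc.textFile("hdfs:///data/emp.txt").map(_.split("\t")(6))

// Departments that have no employees: present in dept but not in emp.
deptNos.subtract(empDeptNos).collect().foreach(println)
```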

Foreach:

The foreach operation applies a function to every element of the Spark RDD. It is an action, so it triggers computation of the RDD.


Here, every element of the emp RDD is printed on a separate line.
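For example, in spark-shell (path is an assumption):

```scala
val emp = sc.textFile("hdfs:///data/emp.txt")

// In local mode this prints to the console; on a cluster, println output
// goes to the executors' logs, not the driver.
emp.foreach(println)
```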

Operations on Paired RDD’s:

Creating Pair RDD:


Here, we will create a pair RDD, i.e. an RDD whose elements are key-value tuples. To refer to the RDD type explicitly, import the RDD class with the below statement:

import org.apache.spark.rdd.RDD



Here, we split the dataset on the tab delimiter and formed (DeptNo, DeptName) key-value pairs.
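A minimal sketch of creating such a pair RDD in spark-shell (path is an assumption):

```scala
import org.apache.spark.rdd.RDD

// (DeptNo, DeptName) pairs from the tab-separated dept dataset.
val deptPairs: RDD[(String, String)] = sc.textFile("hdfs:///data/dept.txt").map { line =>
  val cols = line.split("\t")
  (cols(0), cols(1))
}
deptPairs.take(5).foreach(println)
```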

Keys:

The keys operation returns an RDD containing just the keys of the pair RDD.


Values:

The values operation returns an RDD containing just the values of the pair RDD.
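Both can be sketched on a (DeptNo, DeptName) pair RDD (path is an assumption):

```scala
val deptPairs = sc.textFile("hdfs:///data/dept.txt").map { line =>
  val cols = line.split("\t")
  (cols(0), cols(1))
}

deptPairs.keys.collect().foreach(println)   // all DeptNos
deptPairs.values.collect().foreach(println) // all DeptNames
```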

SortByKey:

The sortByKey operation returns an RDD of the key-value pairs sorted by key. It accepts a boolean argument: true (the default) sorts the keys in ascending order, and false sorts them in descending order.
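For example, on a (DeptNo, DeptName) pair RDD (path is an assumption):

```scala
val deptPairs = sc.textFile("hdfs:///data/dept.txt").map { line =>
  val cols = line.split("\t")
  (cols(0), cols(1))
}

deptPairs.sortByKey().collect().foreach(println)      // ascending (default)
deptPairs.sortByKey(false).collect().foreach(println) // descending
```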

RDD’s holding Objects:

Here, we declare a case class and map each record of the dataset to an instance of it, giving an RDD of objects.
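A sketch of this pattern (the case-class field names and the path are assumptions):

```scala
// A case class modelling a dept record.
case class Dept(deptNo: String, deptName: String)

val depts = sc.textFile("hdfs:///data/dept.txt").map { line =>
  val cols = line.split("\t")
  Dept(cols(0), cols(1))
}
depts.take(5).foreach(println)
```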

Join:

The join operation joins two pair RDDs on their keys. The default join is an inner join, so only keys present in both RDDs appear in the result.

Here, we defined a case class for each dataset, created two pair RDDs keyed on the common column (DeptNo) with the remaining fields as values, and performed the join operation on them.
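A simplified sketch of the join, with the values reduced to a single field each (paths and column choices are assumptions):

```scala
// Keyed on DeptNo: column 1 of dept, column 7 of emp.
val deptByNo = sc.textFile("hdfs:///data/dept.txt").map { line =>
  val c = line.split("\t"); (c(0), c(1))   // (DeptNo, DeptName)
}
val empByDept = sc.textFile("hdfs:///data/emp.txt").map { line =>
  val c = line.split("\t"); (c(6), c(2))   // (DeptNo, FName)
}

// Inner join: RDD[(DeptNo, (DeptName, FName))], only DeptNos present in both.
deptByNo.join(empByDept).take(5).foreach(println)
```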

RightOuterJoin:

rdd1.rightOuterJoin(rdd2) returns joined pairs for every key present in the second (right) RDD; for keys missing from the first RDD, the left-hand value is None.


Here, we again created two pair RDDs keyed on the common column (DeptNo) and performed the rightOuterJoin operation on them; employees whose DeptNo has no matching dept record still appear in the result.

LeftOuterJoin:

rdd1.leftOuterJoin(rdd2) returns joined pairs for every key present in the first (left) RDD; for keys missing from the second RDD, the right-hand value is None.

Here, we performed the leftOuterJoin operation on the same pair RDDs; departments with no matching employees still appear in the result.
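Both outer joins can be sketched together on pair RDDs keyed on DeptNo (paths and column choices are assumptions):

```scala
val deptByNo = sc.textFile("hdfs:///data/dept.txt").map { line =>
  val c = line.split("\t"); (c(0), c(1))   // (DeptNo, DeptName)
}
val empByDept = sc.textFile("hdfs:///data/emp.txt").map { line =>
  val c = line.split("\t"); (c(6), c(2))   // (DeptNo, FName)
}

// Keeps every key of empByDept; the dept side becomes an Option.
val right = deptByNo.rightOuterJoin(empByDept) // RDD[(String, (Option[String], String))]
// Keeps every key of deptByNo; the emp side becomes an Option.
val left  = deptByNo.leftOuterJoin(empByDept)  // RDD[(String, (String, Option[String]))]

right.take(5).foreach(println)
left.take(5).foreach(println)
```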

CountByKey:

The countByKey operation returns the number of elements present for each key. It is an action: the counts are returned to the driver as a Map.

Here, we loaded the dataset, split the records on the tab delimiter, created (DeptNo, DeptName) pairs, and performed the countByKey operation on them.
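A minimal sketch (path is an assumption):

```scala
val deptPairs = sc.textFile("hdfs:///data/dept.txt").map { line =>
  val cols = line.split("\t")
  (cols(0), cols(1))
}

// Returns a Map[DeptNo, Long] to the driver.
val counts = deptPairs.countByKey()
counts.foreach(println)
```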

SaveAsTextFile:

The saveAsTextFile operation stores the contents of the RDD as text files in the given output path. Note that the output directory must not already exist.
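For example (both paths are assumptions):

```scala
val deptNos = sc.textFile("hdfs:///data/dept.txt").map(_.split("\t")(0))

// Writes one part file per partition under the output directory,
// which must not already exist.
deptNos.saveAsTextFile("hdfs:///output/dept_nos")
```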

Persistence levels of Spark RDDs:

Whenever an RDD is going to be used multiple times, or was produced by a long chain of expensive transformations, you can take advantage of cache or persist to keep its data in memory.

You can mark a Spark RDD to be persisted using the persist() or cache() functions on it. The first time it is computed in an action, it will be kept in memory on the nodes.

When you call persist(), you can specify whether the RDD should be stored on disk, in memory, or both, and, if in memory, whether it should be stored in serialized or deserialized format.

cache() is equivalent to persist() with the storage level set to memory only.

Here is an example of how to cache or persist a Spark RDD.

Caching data into Memory:

To cache a Spark RDD in memory, call the .cache method on that RDD directly, as shown below:

scala> val data = sc.parallelize(1 to 10000000)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:27
scala> data.cache
res11: data.type = ParallelCollectionRDD[3] at parallelize at <console>:27

To remove the RDD from cache, you just call the method .unpersist on the RDD as shown below.

scala> data.unpersist()
res13: data.type = ParallelCollectionRDD[3] at parallelize at <console>:27

Now, to specify the storage level explicitly, use the persist method and pass the desired level as a parameter, as shown below:

data.persist(DISK_ONLY) //To  persist the data on the disk
data.persist(MEMORY_ONLY) //To persist the data in the memory only
data.persist(MEMORY_AND_DISK) //to persist the data in both Memory and Disk

To use all these storage levels, you have to import the below package

import org.apache.spark.storage.StorageLevel._

Hope this post has been helpful in understanding the advanced Spark RDD operations in Scala. In case of any queries, feel free to drop us a comment below or email us at [email protected].

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.

 
