In this post, we will be discussing Spark RDD and will be performing some basic operations on RDD.
Let’s start our discussion with MapReduce and the challenges associated with it which led to the innovation of Spark. MapReduce greatly simplified “Big Data” analysis on large clusters, but its inefficiency to handle iterative algorithm and interactive data mining tools led to the innovation of a new technology which could provide the abstractions for leveraging distributed memory.
In most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system. This results in substantial expenses due to data replication, disk I/O, and serialization, which can dominate application execution times.
When it comes to iterative distributed computing, i.e. processing data over multiple jobs in computations such as Logistic Regression, K-means clustering, and Page rank algorithms, it is fairly common to reuse or share the data among multiple jobs or it may involve multiple ad-hoc queries over a shared data set.
A framework like MapReduce is inefficient for an important class of emerging applications which reuse intermediate results across multiple computations. And the solution for this was to go for In-memory processing framework like Apache Spark.
Apache Spark is a general-purpose cluster computing platform designed for faster processing. When it comes to speed, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Spark is up to 10× faster than Hadoop for iterative applications, speeds up a real-world data analytics report by 40×, and can be used interactively to scan a 1 TB dataset with 5–7s latency.
Resilient Distributed Datasets (RDDs) are the core concepts in Spark. In order to understand how spark works, we should know what RDD’s are and how they work.
The Spark RDD is a fault tolerant, distributed collection of data that can be operated in parallel. Each RDD is split into multiple partitions, and spark runs one task for each partition. The Spark RDDs can contain any type of Python, Java or Scala objects, including user-defined classes. They are not actual data, but they are Objects, which contains information about data residing on the cluster. The RDDs try to solve these problems by enabling fault tolerant, distributed In-memory computations.
When we are going for In-memory processing, there are many constraints. RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
Resilient Distributed Datasets
Let’s look at the features of Resilient Distributed Datasets in the below explanation:
- In Hadoop, we store the data as blocks and store them in different data nodes. In Spark, instead of following the above approach, we make partitions of the RDDs and store in worker nodes (datanodes) which are computed in parallel across all the nodes.
- In Hadoop, we need to replicate the data for fault recovery, but in case of Spark, replication is not required as this is performed by the RDDs.
- RDDs load the data for us and are resilient, which means they can be recomputed.
- RDDs perform two types of operations: Transformations, which creates a new dataset from the previous RDD and Actions, which return a value to the driver program after performing the computation on the dataset.
- RDDs keeps track of Transformations and checks them periodically. If a node fails, it can rebuild the lost RDD partition on the other nodes, in parallel.
Methods to Create RDD:
RDDs can be created in two different ways:
- Referencing an external data set in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop Input Format.
This method takes an URI for the file (either a local path on the machine or a hdfs://, s3n://, etc URI) and reads it as a collection of lines. Here is an example invocation where we have taken a file test_file and have created RDD by using SparkContext’s textFile method.
2. By parallelizing a collection of Objects (a list or a set) in the driver program.
Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program. The elements of the collection are copied to form a distributed data set that can be operated on in parallel.
Below is the sample demonstration of the above scenario.
Operations in RDD:
We will be using the below dataset for operations in the following examples.
Transformations create a new data set from an existing one by passing each dataset element through a function and returns a new RDD representing the results. In short, creating an RDD from an existing RDD is ‘transformation’.
All transformations in Spark are lazy. They do not compute their results right away. Instead, they just remember the transformations applied to some base data set (e.g. a file). The transformations are only computed when an action requires a result that needs to be returned to the driver program.
In the below example shown in the screenshot, we have applied the transformation using filter function.
Relevant code in scala:
val myRDD = myfile.filter(_.contains(“acadgild”));
Actions return final results of RDD computations. Actions trigger execution using lineage graph to load the data into original RDD and carry out all intermediate transformations and return the final results to the Driver program or writes it out to the file system.
In the below example, we have counted the number of records from existing RDD created earlier and then have printed the first line of the RDD.
We hope this has been helpful in learning the basic operations in RDD. Check our next post on the advanced operations on RDD.
Keep visiting our website for more updates.