Hadoop and Spark are two names that come up frequently among Big Data professionals. The big question is whether to choose Hadoop or Spark as your Big Data framework.
In this blog, we will compare these two Big Data technologies, look at what each does best, and examine the factors behind Spark’s huge popularity.
Since its inception, the Hadoop distributed processing framework has evolved considerably, and many components have been added on top of its original core, i.e. HDFS and MapReduce.
To understand Spark, we should first understand the core components of Hadoop:
- HDFS – Storage solution in Hadoop
- MapReduce – Processing solution in Hadoop
- YARN – Cluster management
We recommend that you refer to our blog on Big Data Terminologies, which will help you understand all the components of the Hadoop ecosystem.
Earlier, MapReduce alone was responsible for several Big Data tasks: processing, scheduling and task allocation. With the development of the YARN cluster manager, Hadoop has evolved and is no longer totally dependent on Hadoop MapReduce. Scheduling and task allocation are now handled by YARN, and MapReduce is confined to running static batch processes.
How does MapReduce work and what is its constraint?
In MapReduce, all the data is written back to the physical storage medium after each operation. This means that each operation reads its input from disk and, once it has completed, writes its output back to disk.
This process becomes inefficient when we need low latency. Since every stage of the process has to read all of its data from disk before it can start, it is very time consuming.
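The disk round trip described above can be illustrated with a small sketch. This is a toy word count in plain Python, not Hadoop's actual API: the function names and the JSON files used for stage output are assumptions made for illustration only. It shows the shape of the problem, where every stage writes its result to disk and the next stage has to read it back.

```python
import json
import os
import tempfile
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_phase(pairs):
    """Group the pairs by word and sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def write_to_disk(path, data):
    """Persist a stage's output (hypothetical helper for this sketch)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle)


def read_from_disk(path):
    """Load the previous stage's output back from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def word_count_via_disk(lines, workdir):
    """Run map, then reduce, with a disk round trip between the stages."""
    map_output = os.path.join(workdir, "map_output.json")
    reduce_output = os.path.join(workdir, "reduce_output.json")

    write_to_disk(map_output, map_phase(lines))        # map stage writes to disk
    pairs = read_from_disk(map_output)                 # reduce stage reads from disk
    write_to_disk(reduce_output, reduce_phase(pairs))  # reduce stage writes to disk
    return read_from_disk(reduce_output)               # the caller reads the result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(word_count_via_disk(["spark and hadoop", "hadoop and yarn"], workdir))
```

Even in this tiny example there are two writes and two reads for a two-stage job. A longer chain of jobs repeats the same pattern at every step.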
Rise of Spark
Data held electronically in RAM is more volatile than data stored magnetically on disk, but it is much faster to access, and this is where Spark comes into play. Spark does not need to write intermediate data back to disk after each operation, and since most of the activity takes place in memory, it offers a faster way to process the data.
Spark arranges data into Resilient Distributed Datasets (RDDs), which are also fault tolerant. Where Hadoop provides fault tolerance by replicating data across nodes, an RDD keeps track of the operations that produced it, so a lost partition can be recomputed.
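One way Spark recovers lost RDD data is by replaying the recorded chain of transformations (the lineage) over the original input, rather than keeping extra copies of every intermediate result. The sketch below is a toy model of that idea in plain Python. `ToyRDD` and its methods are hypothetical names invented for this example and are not part of the Spark API.

```python
class ToyRDD:
    """A toy stand-in for an RDD: source partitions plus the recipe applied to them."""

    def __init__(self, source_partitions, transformations=()):
        self.source_partitions = source_partitions      # stable input, e.g. blocks in HDFS
        self.transformations = tuple(transformations)   # lineage: functions applied in order
        self._computed = {}                             # partition index -> in-memory result

    def map(self, func):
        """Return a new ToyRDD whose lineage has one more step."""
        return ToyRDD(self.source_partitions, self.transformations + (func,))

    def compute_partition(self, index):
        """Build one partition by replaying the lineage over its source data."""
        if index not in self._computed:
            records = self.source_partitions[index]
            for func in self.transformations:
                records = [func(record) for record in records]
            self._computed[index] = records
        return self._computed[index]

    def lose_partition(self, index):
        """Simulate a node failure that wipes one in-memory partition."""
        self._computed.pop(index, None)

    def collect(self):
        """Gather the records of every partition, computing any that are missing."""
        collected = []
        for index in range(len(self.source_partitions)):
            collected.extend(self.compute_partition(index))
        return collected


if __name__ == "__main__":
    rdd = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
    before = rdd.collect()
    rdd.lose_partition(1)      # the second partition is lost from memory
    after = rdd.collect()      # only that partition is rebuilt from its lineage
    print(before == after)
```

In this sketch only the lost partition is rebuilt, and the source data is assumed to remain available in stable storage.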
You can refer to our blog, Spark basics and RDD in Spark, to get a better understanding of RDDs.
How does Spark achieve a faster speed than MapReduce?
Spark performs in-memory operations by copying the data from distributed storage into RAM, which is much faster to access. As a result, the time spent on reads and writes is reduced.
With its in-memory caching abstraction, Spark caches input data sets in memory, so each operation does not require the data to be read from disk again.
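The effect of caching can be shown by counting how often the underlying storage is read. The following is a minimal plain-Python sketch, not Spark code: `CountingStorage` and `CachedDataset` are hypothetical classes standing in for disk storage and a cached data set.

```python
class CountingStorage:
    """Hypothetical stand-in for disk or HDFS that counts how often it is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._records)


class CachedDataset:
    """Loads from storage on first use, then serves later uses from memory."""

    def __init__(self, storage):
        self._storage = storage
        self._cache = None

    def records(self):
        if self._cache is None:
            self._cache = self._storage.read()
        return self._cache


def run_operations(get_records):
    """Run three independent operations over the same input data."""
    data_sum = sum(get_records())
    data_max = max(get_records())
    data_count = len(get_records())
    return data_sum, data_max, data_count


if __name__ == "__main__":
    readings = [4, 8, 15, 16, 23, 42]

    uncached = CountingStorage(readings)
    run_operations(uncached.read)
    print("reads without cache:", uncached.reads)

    backing = CountingStorage(readings)
    cached = CachedDataset(backing)
    run_operations(cached.records)
    print("reads with cache:", backing.reads)
```

Without the cache, each of the three operations goes back to storage. With it, storage is read once and the remaining operations are served from memory.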
Points to remember
It is often thought that Spark runs entirely in-memory while MapReduce does not.
This is a misconception, because Spark’s shuffle implementation includes disk reads and writes.
It is very similar to the shuffle operation in MapReduce. In Spark, each record is serialized and written to disk on the map side. The serialized data is then fetched and deserialized on the reduce side.
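The map-side write and reduce-side read can be sketched as follows. This is a simplified plain-Python model, not Spark's shuffle implementation. The file naming, the byte-sum partitioner and the use of `pickle` for serialization are all assumptions chosen to keep the example short.

```python
import os
import pickle
import tempfile
from collections import defaultdict


def map_side_write(records, num_reducers, workdir, map_id):
    """Serialize each record and write it to the file for its target reducer."""
    paths = [
        os.path.join(workdir, f"map{map_id}_reduce{reducer}.bin")
        for reducer in range(num_reducers)
    ]
    handles = [open(path, "wb") for path in paths]
    try:
        for key, value in records:
            # Deterministic toy partitioner: the same key always goes to the same reducer.
            reducer = sum(key.encode("utf-8")) % num_reducers
            pickle.dump((key, value), handles[reducer])
    finally:
        for handle in handles:
            handle.close()
    return paths


def reduce_side_read(paths):
    """Fetch the shuffle files for one reducer, deserialize them and sum by key."""
    totals = defaultdict(int)
    for path in paths:
        with open(path, "rb") as handle:
            while True:
                try:
                    key, value = pickle.load(handle)
                except EOFError:  # reached the end of this shuffle file
                    break
                totals[key] += value
    return dict(totals)


def shuffle_and_reduce(map_outputs, num_reducers, workdir):
    """Write every map task's output to disk, then let each reducer read its share."""
    files_per_map = [
        map_side_write(records, num_reducers, workdir, map_id)
        for map_id, records in enumerate(map_outputs)
    ]
    result = {}
    for reducer in range(num_reducers):
        reducer_files = [paths[reducer] for paths in files_per_map]
        result.update(reduce_side_read(reducer_files))
    return result


if __name__ == "__main__":
    map_outputs = [[("spark", 1), ("hadoop", 1)], [("spark", 1), ("yarn", 1)]]
    with tempfile.TemporaryDirectory() as workdir:
        print(shuffle_and_reduce(map_outputs, num_reducers=2, workdir=workdir))
```

Every record crosses the disk once here, even though the aggregation itself happens in memory.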
How do Spark and Hadoop complement each other?
From the above discussion, it is clear that Spark competes with MapReduce rather than the entire Hadoop ecosystem.
Let’s now discuss how a unified platform consisting of Hadoop and Spark is changing the way Big Data analytics is performed.
Spark doesn’t have its own distributed file system, but can use HDFS as its underlying storage. Although Hadoop and Spark do not perform exactly the same tasks, they are not mutually exclusive, owing to the unified platform where they work together.
According to Big Data professionals, Spark has been found to work 100 times faster in many scenarios. However, because Spark does not have its own distributed storage system, it has been included as a component of the Hadoop ecosystem, and many organisations are benefiting from the unification of the Hadoop and Spark frameworks.
So it would be unfair to call Spark a replacement for Hadoop, or to consider MapReduce redundant.
Spark can run on top of Hadoop, benefiting from Hadoop’s cluster manager (YARN) and underlying storage (HDFS, HBase, etc.), or it can run independently, integrating with alternative cluster managers such as Mesos and alternative storage platforms such as Cassandra and Amazon S3.
Spark has become another data processing engine in the Hadoop ecosystem, and it benefits businesses and communities by adding capabilities to the Hadoop stack.
What are the features that make Spark popular?
Refer to the screenshot below for the architecture of Spark.
Spark has its own machine learning library, called MLlib, whereas a Hadoop system must be interfaced with a third-party machine learning library, for example Apache Mahout.
Before moving ahead with the factors contributing to the success of Spark, let’s first understand what real-time data means, because the ability to process real-time data is one of Spark’s strongest features.
In real-time processing, data is fed into an analytical application and insights are sent back to the user immediately through a dashboard, which in turn enables action to be taken.
This sort of processing is increasingly being used in many Big Data applications, for example recommendation engines used by retailers and performance monitoring of industrial machinery in the manufacturing industry.
If the data operations and reporting requirements are static, MapReduce’s batch-mode processing will do the job. But if analytics has to be performed on streaming data, such as sensor data, or on data from applications that require multiple operations, then Spark is an apt choice.
A few common and popular use cases of Spark include real-time marketing campaigns, online product recommendations, cyber security analytics and machine log monitoring.
Big Data scientists expect Spark to replace Hadoop in some scenarios, particularly where fast access to processed data, especially real-time data, is critical.
So, we can say that Spark’s rise can be attributed to its speed and its capability to handle real-time streaming data, all through a unified platform. Earlier, the same set of operations was achieved in Hadoop by integrating it with other technologies such as Storm.
Spark’s functionality for advanced data processing tasks, such as real-time stream processing and machine learning, is far more advanced than that of Hadoop alone. This, along with the speed provided by in-memory operations and its efficiency in handling real-time data, is one of the main reasons for its popularity.
Hence, we can conclude that both Spark and Hadoop have found their niche in Big Data analytics, but because of its speed and its ability to harness real-time and streaming data, Spark is gaining popularity within the Big Data analytics community.
We hope this blog was useful.