Apache Hadoop and Apache Spark are the two Big Data frameworks most frequently discussed among Big Data professionals. But when it comes to selecting one framework for data processing, Big Data enthusiasts often face a dilemma.
In terms of the most active open-source Big Data community, Spark has overtaken Hadoop. While it is not entirely fair to compare the two frameworks head-to-head, they share many similarities and complement each other. Both Hadoop and Spark are Big Data frameworks with their own sets of tools to carry out the heavy lifting, but they do not really serve the same purpose.
For almost a decade, Hadoop stormed the Big Data world and ruled it. Now, however, a newer Apache project (Apache Spark) is gaining more and more popularity because of its incredible performance.
Distributed storage has become the basis for many Big Data projects because it allows huge amounts of data to be stored in a distributed fashion. Hadoop is one such framework: it provides a distributed file system to load and store data across multiple nodes in a cluster built from commodity hardware. This means you need not spend extra money on expensive hardware just to store data. One of Hadoop's important features is its scalability: if the data grows beyond the capacity of the existing cluster, you can easily add a few more machines.
Spark, on the other hand, is a data-processing engine that processes data far faster than MapReduce, but it has no storage layer of its own. It has to be integrated with a distributed storage system such as HDFS or Amazon S3 for Big Data processing.
Even though Spark lacks a storage layer, what really gives it an edge over Hadoop is its speed. Its in-memory computation capability makes Spark roughly 10 to 100 times faster than Hadoop. Because most operations happen in memory, data seek time drops immensely. This is the opposite of how MapReduce works, where data is read from and written to disk at each stage.
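The MapReduce model described above can be sketched in plain Python. This is a toy illustration of the map/shuffle/reduce stages (the function names and sample data are invented for illustration), not actual Hadoop or Spark code; in real MapReduce the shuffle output hits disk, whereas Spark keeps intermediate results in executor memory.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs, like a Hadoop mapper."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key. Hadoop writes this stage to
    disk; Spark holds it in memory."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark is fast", "hadoop is scalable", "spark is in memory"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["spark"], counts["is"])  # 2 3
```

The pipeline is the same in both frameworks; the performance difference comes from where the output of each stage lives between steps.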
As both Hadoop and Spark are open-source Apache projects, there is no licensing cost for either. Both frameworks are designed to run on commodity hardware, so how does a cost difference come into the picture?
MapReduce performs disk-based processing, so a company has to purchase faster disks to run it well. The number of disks required is also high, since Hadoop replicates data 3x by default. Spark, because it works on data in memory, requires machines with large amounts of RAM to keep everything resident. On the other hand, Spark can get the same job done with fewer machines, so beyond a certain scale it probably reduces the cost per unit of computation.
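A back-of-the-envelope calculation makes the trade-off concrete. All figures below are illustrative assumptions, not vendor numbers; only the 3x replication factor comes from HDFS defaults.

```python
# Disk-based cluster: HDFS stores 3 copies of every block by default.
raw_data_tb = 100                 # assumed dataset size
replication_factor = 3            # HDFS default replication
disk_needed_tb = raw_data_tb * replication_factor
print(disk_needed_tb)             # 300 TB of raw disk capacity

# In-memory cluster: Spark only needs RAM for the working set it
# caches, assumed here to be 10% of the raw data.
working_set_fraction = 0.10       # assumed "hot" fraction of the data
ram_needed_tb = raw_data_tb * working_set_fraction
print(ram_needed_tb)              # 10.0 TB of cluster RAM
```

RAM is far more expensive per terabyte than disk, but needing two orders of magnitude less of it, on fewer machines, is what can tip the cost balance toward Spark at scale.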
To illustrate, “Spark has been shown to work well up to petabytes. It has been used to sort 100 TB of data 3X faster than Hadoop MapReduce on one-tenth of the machines.” This feat won Spark the 2014 Daytona GraySort Benchmark.
Spark can handle real-time stream processing and machine learning far better than Hadoop can alone. This, along with its faster in-memory computation, is the real reason for the growth of Spark's popularity: many Big Data enterprises now use it as a real-time data-processing engine.
Apart from performance, Spark is also well known for its ease of use. It comes with user-friendly APIs in Scala, Java, Python, and R. For SQL users, Spark offers Spark SQL, which lets them work with Spark without stretching their existing knowledge much. Spark also runs in an interactive mode that returns the result of each action immediately. In addition, Spark includes its own machine learning library, MLlib, whereas Hadoop must be integrated with a third-party machine learning library such as Apache Mahout.
Just like Hadoop, Apache Spark is fault-tolerant, and the credit goes to RDDs. A Resilient Distributed Dataset (RDD) is a fault-tolerant collection of elements that can be operated on in parallel.
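The idea behind RDD fault tolerance is lineage: an RDD records how it was computed rather than relying on replicated copies, so a lost partition can be rebuilt from the original source. The toy class below sketches that idea in plain Python (the `ToyRDD` name and its methods are invented for illustration and are not Spark's actual implementation).

```python
class ToyRDD:
    """A toy stand-in for an RDD: it stores the lineage (how to
    compute the data), so lost results can always be rebuilt from
    the durable source."""

    def __init__(self, source, transforms=()):
        self.source = source          # original input, assumed durable
        self.transforms = transforms  # recorded lineage of operations

    def map(self, fn):
        # Like Spark, a transformation returns a new dataset that
        # only extends the lineage; nothing is computed yet.
        return ToyRDD(self.source, self.transforms + (fn,))

    def compute(self):
        # Replay the lineage from the source to materialize the data.
        data = list(self.source)
        for fn in self.transforms:
            data = [fn(x) for x in data]
        return data

rdd = ToyRDD([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
result = rdd.compute()        # [11, 21, 31]
# If the node holding `result` dies, the lineage alone is enough
# to recompute the same values from the source:
recovered = rdd.compute()
print(recovered == result)    # True
```

Real RDDs add partitioning, lazy evaluation, and caching on top of this, but lineage-based recomputation is the core of their fault tolerance.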
By now, it might seem that Spark is the default choice for any Big Data application. That is not true, however, because it all depends on the requirements. For example, if your enterprise has a huge amount of structured data (id, name, address) and time is not a factor in your business use case, you may not need Spark's advanced processing tools. For this kind of batch processing, Hadoop is the best fit, and it avoids the extra cost you would incur by using Spark.
Although the two Big Data frameworks, Hadoop and Spark, are often seen as competitors, in reality they complement each other. Hadoop provides features that Spark does not possess, such as a distributed file system, while Spark provides real-time, in-memory processing for the data sets that require it. The perfect Big Data scenario is exactly as the designers intended: Hadoop and Spark working together on the same team.
For more updates, keep visiting www.acadgild.com