The Hadoop community has put a lot of effort into the YARN architecture, and one of the most pressing challenges has been to open up this platform.
Over time, systems across a broad spectrum of technologies have been built on YARN. They have helped it emerge as a single cluster platform in which resources are managed in a uniform way and which supports multi-tenant users and applications.
In this post, we discuss some of the technologies that have been built to work with YARN.
NoSQL: NoSQL systems provide real-time CRUD operations and overcome the drawbacks of monolithic OLTP systems, which prevent system architectures from scaling out and providing responsive services. There are many NoSQL systems, but HBase is the one most often integrated with Hadoop.
Designed to use HDFS for its storage, HBase has also benefited from its close integration with MapReduce, which makes batch processing over HBase data simpler. YARN gives HBase the ability to run multiple HBase clusters on the same Hadoop cluster. This work has been carried out under a project called Hoya, short for HBase on YARN.
Interactive SQL: Until recently, Apache Hive was the established way to run SQL on the Hadoop platform. But because users often have to wait minutes for results after submitting a query, data scientists and analysts do not find it conducive to quick probing of, and experimenting with, the data. Many contributors in the Hadoop community have worked to mitigate this issue. For instance, Cloudera came up with the Impala project, which bypasses MapReduce and operates by running its own daemon on each slave node in the cluster.
To work with YARN, Cloudera has come up with another project called Llama, which coordinates with YARN so that YARN understands the resources the Impala daemons are using on the cluster.
Hortonworks, on the other hand, has followed a different approach and made Hive itself more interactive: Hive now works with Tez (a YARN DAG processing framework) and bypasses MapReduce to execute its work. Similarly, Facebook has started Presto, which provides a SQL solution on Hadoop.
Real-time Data Processing: Apache Storm has long been one of the most popular ways of doing real-time data processing. Yahoo started a project called Storm-YARN, which offers advantages such as running multiple Storm clusters on YARN and adding elasticity to Storm clusters. Similarly, Spark Streaming is currently one of the most popular real-time data streaming projects. It has been developed as an extension to the Spark API and works in tandem with data sources such as HDFS, Kafka, Flume, and more.
Other real-time data processing systems with YARN integration include Apache S4 and Samza.
Graph Processing: Modern graph processing systems permit the execution of distributed graph algorithms against large graphs containing billions of nodes and edges. Graph execution using traditional MapReduce was very slow because it required one job per iteration, and the entire graph data structure had to be serialized to and from disk on each iteration.
Apache Giraph, after a number of updates, now runs as a native YARN application and has overcome some of the drawbacks of the older graph processing techniques.
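To make the contrast with one-job-per-iteration MapReduce concrete, here is a minimal single-process sketch of vertex-centric iteration. It is not Giraph's API (Giraph is a Java framework); the function name `propagate_max` and the toy graph are our own assumptions for illustration. The point it tries to show is that the graph structure is built once and stays in memory, while only small messages move between iterations.

```python
from collections import defaultdict


def propagate_max(edges, values):
    """Spread the largest vertex value to every connected vertex.

    edges:  iterable of (a, b) pairs, treated as undirected (hypothetical input format)
    values: dict mapping every vertex to its starting value
    Returns (final values, number of iterations run).
    """
    # Build the adjacency list once; it stays in memory for all iterations.
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    values = dict(values)

    # First iteration: every vertex sends its value to its neighbours.
    inbox = defaultdict(list)
    for vertex, value in values.items():
        for n in neighbours[vertex]:
            inbox[n].append(value)
    iterations = 1

    # Later iterations: only vertices whose value changed send messages,
    # so the work shrinks as the computation converges.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for n in neighbours[vertex]:
                    outbox[n].append(best)
        inbox = outbox
        iterations += 1

    return values, iterations
```

Calling `propagate_max([(1, 2), (2, 3), (3, 4)], {1: 3, 2: 6, 3: 2, 4: 1})` leaves every vertex holding the value 6. A real system distributes the vertices across workers, but the loop has the same shape.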
Bulk Synchronous Parallel: Bulk synchronous parallel (BSP) is a distributed processing method in which multiple parallel workers each work on a part of the overall problem. The workers then exchange data with one another and wait for all workers to complete before repeating the process. This coordination between the workers is maintained by a global synchronization mechanism.
Google Pregel is inspired by BSP, and Apache Giraph uses the same technique for graph iteration. YARN also supports Apache Hama, which is a general-purpose BSP implementation.
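The compute, exchange, and wait cycle described above can be sketched on a single machine with threads. This is only an illustration under our own assumptions: `bsp_ring_max` is a hypothetical name, the workers are arranged in a ring, and Python's `threading.Barrier` stands in for the global synchronization mechanism. It is not how Hama or Giraph are implemented.

```python
import threading


def bsp_ring_max(values):
    """Find the global maximum with workers that only talk to their right neighbour."""
    n = len(values)
    barrier = threading.Barrier(n)   # stand-in for the global synchronization mechanism
    mailbox = [None] * n             # mailbox[i] holds the message addressed to worker i
    state = list(values)             # state[i] is read and written only by worker i

    def worker(rank):
        # After n - 1 rounds, the maximum has travelled all the way around the ring.
        for _ in range(n - 1):
            # Communication: send the current value to the right neighbour.
            mailbox[(rank + 1) % n] = state[rank]
            # Synchronization: wait until every worker has sent its message.
            barrier.wait()
            # Local computation: combine own value with the received message.
            state[rank] = max(state[rank], mailbox[rank])
            # Wait until every worker has read, so the next round's sends
            # do not overwrite a message that is still unread.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state
```

For example, `bsp_ring_max([3, 9, 4, 1])` ends with every worker holding 9. No worker ever sees the whole input; the result emerges from repeated rounds separated by barriers.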
In-memory: In-memory computing focuses on workloads such as iterative processing and interactive data mining. Apache Spark is one of the most successful implementations of in-memory processing. Cloudera's CDH5 distribution provides an environment in which Spark runs on YARN. For more details on running Apache Spark on YARN, refer to this link.
DAG Execution: DAG execution is a technique in which data processing logic is modelled as a directed acyclic graph (DAG) and then executed in parallel over a large dataset by a DAG execution engine.
Apache Tez is one of the most successful implementations of DAG execution. Tez gives developers an API framework for writing native YARN applications on Hadoop that bridge the spectrum of interactive and batch workloads. It allows these data access applications to work with petabytes of data over thousands of nodes.
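As a rough illustration of the idea (and not of the Tez API, which is Java), the sketch below runs a small DAG of tasks on one machine. The names `run_dag`, `tasks`, and `deps` are our own assumptions. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to find the tasks whose inputs are complete, and runs each such batch in parallel on a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps):
    """Execute a DAG of tasks and return each task's output.

    tasks: dict of name -> function taking a dict of upstream outputs (hypothetical format)
    deps:  dict of name -> set of upstream task names; tasks without an entry have no inputs
    """
    sorter = TopologicalSorter({name: deps.get(name, ()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic

    outputs = {}
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            # Every task returned here has all of its inputs available,
            # so the whole batch can run in parallel.
            ready = sorter.get_ready()
            futures = {
                name: pool.submit(
                    tasks[name], {d: outputs[d] for d in deps.get(name, ())}
                )
                for name in ready
            }
            for name, future in futures.items():
                outputs[name] = future.result()
                sorter.done(name)  # unlocks the tasks that depend on this one
    return outputs


# A toy pipeline: "count" and "distinct" both depend on "tokenize",
# so they are eligible to run at the same time.
tasks = {
    "load": lambda inputs: ["a b a", "b c"],
    "tokenize": lambda inputs: [w for line in inputs["load"] for w in line.split()],
    "count": lambda inputs: len(inputs["tokenize"]),
    "distinct": lambda inputs: len(set(inputs["tokenize"])),
    "report": lambda inputs: (inputs["count"], inputs["distinct"]),
}
deps = {
    "tokenize": {"load"},
    "count": {"tokenize"},
    "distinct": {"tokenize"},
    "report": {"count", "distinct"},
}
```

Running `run_dag(tasks, deps)["report"]` gives `(5, 3)`: five words in total, three of them distinct. A production engine additionally partitions the data behind each vertex and schedules the work across a cluster, which this sketch leaves out.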
We hope this post has helped you understand the various technologies that can be used with YARN. Keep visiting AcadGild for more blogs on Big Data and other technologies.