Big Data Hadoop & Spark

API Differences Between MapReduce Version 1 & 2

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Here, the users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all the intermediate values associated with the same intermediate key.

The Java MapReduce API 1, also known as MRV1, was released with the initial Hadoop versions. The flaw in these initial versions was that the MapReduce framework performed both data processing and cluster resource management.

MapReduce 2, or Next Generation MapReduce, was a long-awaited and much-needed upgrade to the scheduling, resource management, and execution techniques in Hadoop. Fundamentally, the improvements separate cluster resource management from MapReduce-specific logic; this separation of processing and resource management was achieved via the introduction of YARN in later versions of Hadoop.

In this blog, we will focus on the API differences between MapReduce Version 1 and MapReduce Version 2.

The new API, sometimes referred to as 'Context Objects', was designed to make the API easier to evolve in the future. It is type-incompatible with the old MRV1 API.

There are several notable differences between the two APIs. Let’s have a look at these differences below.


The MRV1 API uses interfaces, which means we can implement only the methods available in those interfaces.

The MRV2 API favours abstract classes over interfaces since they are easier to evolve.

This means that you can add a method (with a default implementation) to an abstract class without breaking old implementations of the class. For example, the Mapper and Reducer interfaces in the old API are abstract classes in the new API.
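A minimal sketch of this evolution argument, using hypothetical classes rather than the real Hadoop ones: an abstract class can later gain a method with a default body without breaking existing subclasses.

```java
// Hypothetical sketch (not the actual Hadoop classes): adding a method
// with a default body to an abstract class leaves old subclasses intact,
// which is why the new API prefers abstract classes over interfaces.
abstract class SketchMapper {
    abstract String map(String key, String value);

    // Added in a "later release"; UpperMapper below still compiles
    // because this method has a default body.
    String status() {
        return "ready";
    }
}

class UpperMapper extends SketchMapper {
    @Override
    String map(String key, String value) {
        return key + ":" + value.toUpperCase();
    }
}

public class AbstractEvolutionDemo {
    public static void main(String[] args) {
        UpperMapper m = new UpperMapper();
        System.out.println(m.map("word", "hadoop")); // word:HADOOP
        System.out.println(m.status());              // ready
    }
}
```

Had SketchMapper been an interface in an old Java version, adding status() would have forced every implementation to change.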

The MRV1 API can be found in the org.apache.hadoop.mapred package.

The MRV2 API is available in the org.apache.hadoop.mapreduce package (and its sub-packages).

MRV1 uses the OutputCollector and Reporter objects to communicate with the MapReduce system.

The MRV2 API makes extensive use of context objects that allow the user code to communicate with the MapReduce system. (The roles of the JobConf, the OutputCollector, and the Reporter from the old API are unified by context objects in MRV2.)
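A sketch of a new-API mapper illustrating the context object; this assumes the standard Hadoop 2.x classes are on the classpath and is not runnable standalone:

```java
// Sketch of an MRV2 mapper using the Context object; assumes the
// standard Hadoop 2.x classes in org.apache.hadoop.mapreduce.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            // In MRV1 this would be output.collect(word, ONE) on an
            // OutputCollector, with progress reported via a separate Reporter;
            // in MRV2 the single Context object handles both.
            context.write(word, ONE);
        }
    }
}
```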

In MRV1, we can control the execution of mappers by writing a MapRunnable, but no equivalent exists for reducers.

MRV2 allows both mappers and reducers to control the execution flow by overriding the run() method. For example, records can be processed in batches, or the execution can be terminated before all the records have been processed.
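A sketch of overriding run() to stop early, assuming the standard Hadoop 2.x Mapper class (the record limit here is a hypothetical example, and the block is not runnable without Hadoop):

```java
// Sketch: an MRV2 mapper that overrides run() to stop after a fixed
// number of records -- something MRV1 allowed only for mappers, via
// MapRunnable. Assumes standard Hadoop 2.x classes.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FirstNRecordsMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private static final int LIMIT = 1000; // hypothetical cut-off

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        int processed = 0;
        // Stop pulling records once the limit is reached; the inherited
        // identity map() writes each key/value pair through unchanged.
        while (processed < LIMIT && context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
            processed++;
        }
        cleanup(context);
    }
}
```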

In MRV1, the JobClient class performs job control.

In MRV2, job control is performed through the Job class; JobClient no longer exists in the new API.

The configuration has been unified: the old API has a special JobConf object for job configuration.

In MRV2, job configuration is done through a Configuration object, possibly via some of the helper methods on Job.
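A sketch of an MRV2 driver showing the Job and Configuration classes together; the input and output paths come from the command line, and the block assumes the standard Hadoop 2.x classes rather than being runnable standalone:

```java
// Sketch of MRV2 job setup: Job.getInstance(conf, ...) replaces the old
// JobClient/JobConf pair. Assumes standard Hadoop 2.x classes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // unified configuration object
        Job job = Job.getInstance(conf, "word count"); // job control via Job, not JobClient
        job.setJarByClass(WordCountDriver.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```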

In the MRV1 API, both map and reduce output files are named part-nnnnn.

In MRV2, map outputs are named part-m-nnnnn, and reduce outputs are named part-r-nnnnn (where nnnnn is an integer designating the part number, starting from zero).
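The naming scheme itself can be illustrated in plain Java, independent of Hadoop; the helper methods below are hypothetical, written only to show the zero-padded pattern:

```java
// Illustrates the MRV2 output file naming scheme (part-m-nnnnn / part-r-nnnnn)
// with hypothetical helpers; the part number is zero-padded to five digits.
public class PartFileNames {
    static String mapOutputName(int part) {
        return String.format("part-m-%05d", part);
    }

    static String reduceOutputName(int part) {
        return String.format("part-r-%05d", part);
    }

    public static void main(String[] args) {
        System.out.println(mapOutputName(0));    // part-m-00000
        System.out.println(reduceOutputName(3)); // part-r-00003
    }
}
```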

In the MRV1 API, the reduce() method receives its values as a java.util.Iterator, so we must use the Iterator object explicitly to retrieve the values.

In MRV2, the reduce() method receives its values as a java.lang.Iterable; this change makes it easier to iterate over the values using Java's for-each loop.
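The difference in iteration style can be shown in plain Java, without Hadoop; the two summing helpers below stand in for the body of an old-style and a new-style reduce():

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Contrasts the MRV1 iteration style (explicit Iterator) with the MRV2
// style (Iterable + for-each), using plain Java stand-ins for reduce().
public class IterationStyles {
    // MRV1 style: values arrive as an Iterator, advanced explicitly.
    static int sumWithIterator(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // MRV2 style: values arrive as an Iterable, usable in a for-each loop.
    static int sumWithIterable(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> counts = Arrays.asList(1, 2, 3);
        System.out.println(sumWithIterator(counts.iterator())); // 6
        System.out.println(sumWithIterable(counts));            // 6
    }
}
```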

We hope this blog helped you understand the API differences between MRV1 and MRV2. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
