In this post, we will see how to test a Hadoop job using MRUnit. Before we proceed, let us understand:
Why is testing important?
In production, when you are working with huge amounts of data, it is not advisable to run an untested job against the complete data set. The Hadoop job might run for hours and then fail. In that case, you need to debug the code, recompile the logic, and run it again, which delays data processing unnecessarily. Other jobs might also depend on, or start on, the completion of your job. If your Hadoop job doesn’t finish on time, it can hold up the overall analysis. So it is always advisable to test the logic before putting it into production.
How to test the Hadoop MR job?
There are 2 different ways to test the logic.
- MRUnit testing
- Running the job against sample data
We will focus on the first approach, i.e. MRUnit.
Unit testing is a software development process in which the smallest testable parts of an application, called units, are individually and independently scrutinized for proper operation.
MRUnit is a test framework you can use to unit test MapReduce code. It was developed by Cloudera (a vendor with its own Hadoop distribution). MRUnit supports both the old (org.apache.hadoop.mapred) and the new (org.apache.hadoop.mapreduce) MapReduce APIs. In this post, we will demonstrate examples with the new MapReduce API (org.apache.hadoop.mapreduce).
Types of MRUnit Tests
- A map test that only tests a map function (supported by the MapDriver class).
- A reduce test that only tests a reduce function (supported by the ReduceDriver class).
- A map and reduce test that tests both the map and reduce functions (supported by the MapReduceDriver class).
In this blog, we will see in depth how to test a map-only job. Before we dive into the code, let us do some initial setup to run MRUnit. Download the jars below and add them to the classpath of your environment.
- MRUnit jar
- Mockito jar
- JUnit jar
- Hadoop-common jar
- Hadoop-mapreduce-client-core jar
To be on the safe side, I suggest including all the jars present in the Hadoop…/share/hadoop/common/lib directory.
NOTE: Make sure that the jar versions are compatible with your Hadoop version.
That’s all for the setup part. You are good to test the code now.
The MapDriver harness allows you to test a Mapper instance. You provide the input (k, v)* pairs that should be sent to the Mapper, and the outputs you expect the Mapper to send to the collector for those inputs. When you call runTest(), the harness delivers the input to the Mapper and checks its outputs against the expected results. We will be testing the classic wordcount code.
Below is the mapper class for the wordcount program.
If you are reading this post, I assume that you are already aware of the fundamentals of MapReduce. If not, I suggest going through this blog.
At a high level, let me shed some light on how the above mapper code works.
The input data is fed to the Mapper class as <key, value> pairs. After some initial processing, the output is sent to the Reducer using context.write().
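To make the <key, value> input concrete, here is a minimal plain-Java sketch of how lines of text can be paired with their starting byte offsets. It assumes UTF-8 text with "\n" line endings. `LineOffsetModel` and `toKeyValuePairs` are hypothetical names for illustration only; in a real job, Hadoop's record reader produces these pairs for you.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of how text input becomes (offset, line) pairs.
// Hypothetical helper for illustration; not part of Hadoop or MRUnit.
class LineOffsetModel {

    // Maps each line's starting byte offset to its text.
    // Assumes UTF-8 content and "\n" line endings.
    static Map<Long, String> toKeyValuePairs(String fileContent) {
        Map<Long, String> pairs = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            pairs.put(offset, line);
            // +1 accounts for the newline byte that ended this line.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return pairs;
    }

    public static void main(String[] args) {
        String content = "Hi welcome to Acadgild\nAcadgild is into e-learning\n";
        toKeyValuePairs(content)
            .forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```

For the two sample lines used in this post, the keys come out as 0 and 23: the first line is 22 bytes long, plus one byte for the newline.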
We want our code to break each line into words and, against every word, write 1 (which denotes one occurrence). Consider this sample input:
Hi welcome to Acadgild
Acadgild is into e-learning
Expected output from mapper
As developers, we need to know this expected output in advance.
Using MRUnit, let’s check whether the code works as expected. Below is the code for testing.
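One way to work out the expected output by hand is to model the map step in plain Java. The sketch below assumes the mapper splits each line on whitespace, as the classic wordcount does, and emits one (word, 1) record per token. `WordCountMapModel` is a hypothetical name, and the class deliberately uses no Hadoop types so it runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Plain-Java model of the wordcount map step.
// Hypothetical class for illustration; a real mapper would extend Hadoop's Mapper.
class WordCountMapModel {

    // Emits one "word<TAB>1" record per whitespace-separated token, in input order.
    static List<String> map(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(tokens.nextToken() + "\t1");
        }
        return out;
    }

    public static void main(String[] args) {
        String[] input = {"Hi welcome to Acadgild", "Acadgild is into e-learning"};
        for (String line : input) {
            for (String record : map(line)) {
                System.out.println(record);
            }
        }
    }
}
```

Under that assumption, the first sample line yields (Hi, 1), (welcome, 1), (to, 1), (Acadgild, 1), and the second yields (Acadgild, 1), (is, 1), (into, 1), (e-learning, 1). Note that the mapper does not add up the two occurrences of Acadgild; summing is the reducer's job.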
Let us understand a few important steps:
Line 26: Create an object of the mapper class you want to test. In my case, the name is MyMapper.
Line 28: Declare the MRUnit driver class that we will use in our test. This is the MapDriver, so you need to specify the input and output key/value types of the mapper you are testing.
Line 34: The withInput() method specifies an input <key, value> pair, which will be fed to the Mapper class. The input to the mapper is a <KEY> (the offset of the line, a numeric value, typically a LongWritable) and a <VALUE> (the actual text).
Line 36: The withOutput() method specifies an expected output <key, value> pair, which MRUnit will compare against the output generated by the mapper.
Line 45: Run the test. If a failure is encountered, it logs the discrepancy and throws an exception.
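The steps above can be modeled in a few lines of plain Java. This is only a sketch of the idea behind a map-driver harness, not MRUnit's implementation: `MiniMapDriver` and `wordCountMap` are hypothetical names, records are plain strings, and the input key (the offset) is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Plain-Java model of a map-driver harness.
// Hypothetical class for illustration; it is not MRUnit's MapDriver.
class MiniMapDriver {
    private final Function<String, List<String>> mapper;
    private final List<String> inputs = new ArrayList<>();
    private final List<String> expected = new ArrayList<>();

    // Step 1: hand the harness the mapper under test.
    MiniMapDriver(Function<String, List<String>> mapper) {
        this.mapper = mapper;
    }

    // Step 2: queue an input value (the offset key is omitted in this model).
    MiniMapDriver withInput(String value) {
        inputs.add(value);
        return this;
    }

    // Step 3: record an expected output, in the order the mapper should emit it.
    MiniMapDriver withOutput(String word, int count) {
        expected.add(word + "=" + count);
        return this;
    }

    // Step 4: run the mapper and compare actual and expected output, order included.
    void runTest() {
        List<String> actual = new ArrayList<>();
        for (String value : inputs) {
            actual.addAll(mapper.apply(value));
        }
        if (!actual.equals(expected)) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }

    // Stand-in mapper: one "word=1" record per whitespace-separated token.
    static List<String> wordCountMap(String line) {
        List<String> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(word + "=1");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        new MiniMapDriver(MiniMapDriver::wordCountMap)
            .withInput("Hi welcome to Acadgild")
            .withOutput("Hi", 1)
            .withOutput("welcome", 1)
            .withOutput("to", 1)
            .withOutput("Acadgild", 1)
            .runTest();
        System.out.println("test passed");
    }
}
```

The fluent style mirrors how withInput(), withOutput() and runTest() are chained in the steps described above. The real MapDriver works with Hadoop Writable types rather than strings.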
After successful execution, the output looks like this:
NOTE: When using the withOutput() method in mapper testing, the sequence of the expected outputs is important.
Ex: If I take line 36, i.e. mapdriver.withOutput(new Text("welcome"), new IntWritable(1));
and move it to a different position among the withOutput() calls, my MRUnit test will fail.
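Why does moving one expected record break the test? A rough way to picture it is an ordered list comparison. The plain-Java sketch below contrasts an order-sensitive check with an order-insensitive one. `OutputOrderDemo` and its methods are hypothetical names, and this is a model of the behaviour described above rather than MRUnit's own comparison code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java model of ordered vs. unordered output comparison.
// Hypothetical class for illustration; not MRUnit code.
class OutputOrderDemo {

    // Order-sensitive check: same records in the same positions.
    static boolean matchesInOrder(List<String> expected, List<String> actual) {
        return expected.equals(actual);
    }

    // Order-insensitive check: same records once both lists are sorted.
    static boolean matchesAnyOrder(List<String> expected, List<String> actual) {
        List<String> sortedExpected = new ArrayList<>(expected);
        List<String> sortedActual = new ArrayList<>(actual);
        Collections.sort(sortedExpected);
        Collections.sort(sortedActual);
        return sortedExpected.equals(sortedActual);
    }

    public static void main(String[] args) {
        // What the mapper emits for "Hi welcome to Acadgild".
        List<String> actual = Arrays.asList("Hi=1", "welcome=1", "to=1", "Acadgild=1");
        // Same expectations, but with the "welcome" record moved to the end.
        List<String> moved = Arrays.asList("Hi=1", "to=1", "Acadgild=1", "welcome=1");

        System.out.println("in order:  " + matchesInOrder(moved, actual));
        System.out.println("any order: " + matchesAnyOrder(moved, actual));
    }
}
```

With the "welcome" record moved, the ordered check fails even though both lists contain exactly the same records. As far as I recall, MRUnit's drivers also have a runTest(boolean) overload that can relax the ordering requirement, but treat that as an assumption and check the Javadoc for your MRUnit version.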
Below is the error stack:
I hope this blog helps you test your code using MRUnit. For further updates, keep visiting www.acadgild.com