
Testing your Scripts with PigUnit

In this blog, we explain PigUnit, a framework for unit testing your Pig scripts. To get started, let’s first look at what unit testing is.

Unit Testing

Unit Testing is a software development process in which the smallest testable parts of an application, called units, are individually and independently scrutinized for proper operation. Unit testing is often automated but it can also be done manually.
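
As a minimal illustration of the idea (plain Java, not PigUnit), the sketch below treats one small method as the unit and checks it in isolation. The class name WordCounter and the method countWords are hypothetical names made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical example: the "unit" under test is one small method.
final class WordCounter {

    // Counts how often each whitespace-separated word occurs in the text.
    static Map<String, Long> countWords(String text) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    // A hand-rolled check; a framework such as JUnit automates this pattern.
    public static void main(String[] args) {
        Map<String, Long> counts = countWords("to be or not to be");
        if (counts.get("to") != 2L || counts.get("or") != 1L) {
            throw new AssertionError("unexpected counts: " + counts);
        }
        System.out.println("countWords passed: " + counts);
    }
}
```

The check in main is done by hand here; a test framework takes over the job of finding such checks, running them, and reporting the results.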

Now, let’s move on to PigUnit.

Pig Unit

According to the Apache Pig documentation, PigUnit is a simple xUnit framework that enables you to easily test your Pig scripts. With PigUnit you can perform unit testing, regression testing, and rapid prototyping. No cluster setup is required if you run Pig in local mode.

PigUnit builds on the well-known Java JUnit framework. You write a Java test class that JUnit runs, and inside that class you provide your Pig script, the input files, and the expected output. The test then runs the script and checks its results against what you expected. Catching errors this way can reduce the time spent debugging.

Unit Testing your own Pig script

To unit test your own Pig script as described here, you need the Maven and Pig plugins installed in Eclipse.

Maven Plugin for Eclipse

Open Eclipse IDE and follow the below steps:

  1. Click Help -> Install New Software…
  2. Click the ‘Add’ button at the top right corner.
  3. In the pop-up window, enter ‘M2Eclipse’ as the ‘Name’ and ‘http://download.eclipse.org/technology/m2e/releases’ or ‘http://download.eclipse.org/technology/m2e/milestones/1.0’ as the ‘Location’.
  4. Now, click ‘OK’.

Pig Eclipse plugin

Follow the steps below to install the Pig Eclipse plugin:

  1. Click Help -> Install New Software…
  2. Click the ‘Add’ button at the top right corner.
  3. In the pop-up window, enter ‘pigEditor’ as the ‘Name’ and ‘https://pig-eclipse.googlecode.com/svn/trunk/update_site’ as the ‘Location’.
  4. Now, click ‘OK’.

After installing the above plugins, restart Eclipse.

Now, follow the steps below to create a Maven project.

File–>New–>Other–>Maven–>Maven Project

Next, open the pom.xml file, which Maven places in the root folder of the project.

In it, add the repository below.

<repositories>
    <repository>
        <id>cloudera-releases</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
        <releases>
            <enabled>true</enabled>
        </releases>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
</repositories>

Then, between the <dependencies> tags, add the dependencies below.

<dependency>
    <groupId>org.javassist</groupId>
    <artifactId>javassist</artifactId>
    <version>3.18.1-GA</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <classifier>h2</classifier>
    <version>0.15.0</version>
</dependency>
<dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pigunit</artifactId>
    <version>0.15.0</version>
</dependency>
<dependency>
    <groupId>jline</groupId>
    <artifactId>jline</artifactId>
    <version>0.9.94</version>
</dependency>
<dependency>
    <groupId>org.antlr</groupId>
    <artifactId>antlr-runtime</artifactId>
    <version>3.5</version>
</dependency>
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>18.0</version>
</dependency>
<dependency>
    <groupId>joda-time</groupId>
    <artifactId>joda-time</artifactId>
    <version>2.2</version>
</dependency>
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>3.8.1</version>
    <scope>test</scope>
</dependency>

Note: In the pigunit dependency, you must specify the Pig version (0.15.0 here, the same version as the pig dependency).

Now, save your Pig script with the file extension ‘.pig’.

Let’s now run a sample word count program using JUnit.

Below is the Pig script for performing word count:

A = load '/home/kiran/workspace/pigunit/input.data';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
dump D;

We have saved the above lines in a file called wordcount.pig, and the input data is saved in a file named ‘input.data’.

Now, in the AppTest.java class, which is located in src/test/java, create your own test method. Inside it, create an object of the PigTest class and pass it the path of your Pig script.

PigTest pigTest = new PigTest("/home/kiran/workspace/pigunit/wordcount.pig");

Here, we have created an object called ‘pigTest’ and passed the path of our Pig script to it. The expected output is supplied through the assertOutput method.
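
The absolute path shown above exists only on the author’s machine. One possible alternative, sketched below, is to resolve the script name against a base directory. ScriptPaths and scriptPath are hypothetical names for this illustration, and the sketch assumes the tests are started with the project folder as the working directory and that wordcount.pig sits directly in that folder.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: resolves a script name against a base directory.
final class ScriptPaths {

    // Returns the absolute, normalized path of scriptName inside baseDir.
    static String scriptPath(String baseDir, String scriptName) {
        Path p = Paths.get(baseDir, scriptName).toAbsolutePath().normalize();
        return p.toString();
    }

    public static void main(String[] args) {
        // Assumption: the working directory is the project folder.
        String path = scriptPath(System.getProperty("user.dir"), "wordcount.pig");
        System.out.println(path);
    }
}
```

The resulting string could then be passed to new PigTest(...) in place of the hard-coded path.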


The assertOutput method accepts inputs and expected outputs in several forms. The screenshot below shows the different arguments that can be passed to it.

[Screenshot: assertOutput method signatures]

Inside assertOutput we give our expected output. Here we declare a String array that holds the expected output, as shown below:

String[] output = {
    "(1,all)", "(1,from)", "(1,Hello)", "(1,acadgild!)"
};

Now the assertOutput statement will be as follows:

pigTest.assertOutput("D", output);

Note: The alias passed to assertOutput must be an alias defined in your Pig script (here, the one named in the final dump statement), and the second argument is the array that holds your expected output.

In this case, the alias to be dumped is D, so we pass "D" in double quotes. The expected output is stored in the String array named output, so we pass output as the second argument.

Note: In the expected output, each tuple is written in parentheses as a double-quoted string, and the tuples are separated by commas.
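
Writing these strings by hand is error-prone, so the sketch below builds them from a map of counts. It is plain Java with no PigUnit dependency. ExpectedTuples and toTuples are hypothetical names for this illustration, and the sketch assumes each output record is a (count, word) pair, as produced by the word count script above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: formats word counts as "(count,word)" strings.
final class ExpectedTuples {

    // Produces one string per entry, in the map's iteration order.
    static String[] toTuples(Map<String, Long> counts) {
        List<String> tuples = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            tuples.add("(" + e.getValue() + "," + e.getKey() + ")");
        }
        return tuples.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("all", 1L);
        counts.put("from", 1L);
        System.out.println(String.join(" ", toTuples(counts)));
    }
}
```

A LinkedHashMap is used so that the strings come out in insertion order, which keeps the array predictable when it is compared with the script’s output.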

Now, let’s run this program as a JUnit test.

Right Click–>Run As–>Junit Test

While the program runs, you can watch the test cases execute in the console.

[Screenshot: PigUnit running in the Eclipse console]

While a test case is running, it is shown in blue, as in the above screenshot.

After the test case completes successfully, it turns green, as shown in the below screenshot.

[Screenshot: PigUnit test completion]

Now, let’s have a look at the complete program.

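package java1.pigunit1;

import junit.framework.TestCase;
import org.apache.pig.pigunit.PigTest;

public class AppTest extends TestCase {

    // JUnit 3 runs every public method whose name starts with "test".
    public void testwordcount() throws Exception {
        // Load the Pig script to be tested.
        PigTest pigTest = new PigTest("/home/kiran/workspace/pigunit/wordcount.pig");

        // Expected tuples for alias D.
        String[] output = {
            "(1,all)", "(1,from)", "(1,Hello)", "(1,acadgild!)"
        };

        // Check the contents of alias D against the expected output.
        pigTest.assertOutput("D", output);
    }
}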

Note: The test class should extend TestCase, and every test method name should start with the text test (here we have named the method testwordcount). Otherwise, you will get an error message saying Test case not found.

We can write any number of test cases in the same class. Each method becomes a test case, for which we just need to specify the Pig script and the expected output.

Now, let’s add another test case in the same class. We will add one more line to the Pig script to limit the output to 1, so that we get only one record.
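
The naming rule can be pictured with a few lines of reflection. This is only an illustrative sketch of name-based discovery, not JUnit’s actual implementation. TestDiscovery, SampleTests, and findTestMethods are hypothetical names.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: mimics name-based test discovery.
final class TestDiscovery {

    // Hypothetical test class with one conforming and one non-conforming method.
    static class SampleTests {
        public void testwordcount() { }
        public void wordcountCheck() { }   // skipped: name lacks the "test" prefix
    }

    // Returns the sorted names of public, no-argument, void methods
    // whose names start with "test".
    static List<String> findTestMethods(Class<?> cls) {
        List<String> names = new ArrayList<>();
        for (Method m : cls.getDeclaredMethods()) {
            if (Modifier.isPublic(m.getModifiers())
                    && m.getName().startsWith("test")
                    && m.getParameterCount() == 0
                    && m.getReturnType() == void.class) {
                names.add(m.getName());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findTestMethods(SampleTests.class));
    }
}
```

In this sketch only testwordcount is picked up, while wordcountCheck is skipped because of its name.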

A = load '/home/kiran/workspace/pigunit/input.data';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
E = limit D 1;
dump D;

We have named the new statement E, so the test case will be as follows:

public void testwordcount1() throws Exception {
    PigTest pigTest = new PigTest("/home/kiran/workspace/pigunit/wordcount.pig");
    String[] output = {
        "(1,all)"
    };
    pigTest.assertOutput("E", output);
}

Now, we can see the console and the test cases.

[Screenshot: Pig script test cases]

Here, both test cases have executed successfully, and the actual outputs match our expected outputs.

We hope this blog has been clear in explaining how to unit test your Pig scripts. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
