
Visualize the Outcomes of Pig Scripts Using Zeppelin

Data visualization is the representation of information in the form of graphs, charts, diagrams, etc. Big Data analytics is all about analyzing the large data sets you have and deriving valuable outcomes that help in all-round business development. The biggest challenge after Big Data analysis is how to visualize the outcomes. Zeppelin is one solution for visualizing the results of your analysis.

Big Data Visualization Using Zeppelin

Zeppelin is an open-source, multi-purpose notebook that provides the following capabilities for your data:

  • Data Ingestion

  • Data Discovery

  • Data Analytics

  • Data Visualization and Collaboration

Apache Zeppelin works across platforms and provides interpreters for many languages, so you can run your code through Zeppelin itself and visualize the outcomes.

In our previous blogs, we have shown how to visualize the results of a Hive query and the results of Spark jobs.

For more information about Zeppelin, we recommend our previous blogs: data visualization using Zeppelin and integrating Spark with Zeppelin.

In this blog, we will demonstrate how to visualize the output of Pig scripts using Zeppelin.

Zeppelin added the Pig interpreter in version 0.7.0. You can download the latest version of Zeppelin from here.

After downloading the Zeppelin tar file, untar it using the following command:

tar -xvzf zeppelin-0.7.0-bin-netinst.tgz

After untarring, open the conf directory inside zeppelin-0.7.0-bin-netinst and make a copy of zeppelin-env.sh.template as zeppelin-env.sh.
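The copy step can be sketched as below. The first two lines only simulate the untarred zeppelin-0.7.0-bin-netinst layout so the snippet is runnable anywhere; on a real install, only the cp line is needed.

```shell
# Simulate the conf/ directory of the untarred distribution
# (skip these two lines if you have the real download):
mkdir -p zeppelin-0.7.0-bin-netinst/conf
touch zeppelin-0.7.0-bin-netinst/conf/zeppelin-env.sh.template

# The actual step: copy the template to zeppelin-env.sh
cp zeppelin-0.7.0-bin-netinst/conf/zeppelin-env.sh.template \
   zeppelin-0.7.0-bin-netinst/conf/zeppelin-env.sh
```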

After creating a copy, open the zeppelin-env.sh file and add the following configurations:

export JAVA_HOME=/home/kiran/jdk1.8.0_65 #path to your JAVA_HOME
export ZEPPELIN_PORT=9900 #port number to run Zeppelin
export SPARK_HOME=/home/kiran/spark-1.5.1-bin-hadoop2.6 #path to your SPARK_HOME
export HADOOP_CONF_DIR=/home/kiran/hadoop-2.7.1/etc/hadoop #path to your HADOOP_CONF directory

After adding the above configurations, save and close the file.

Before starting Zeppelin, you need to install the interpreters to compile your programs or scripts.

To install the interpreters, move into the bin folder of the zeppelin-0.7.0-bin-netinst directory and run the following command:

./install-interpreter.sh --all

The above command installs all the interpreters that ship with zeppelin-0.7.0.

Note: If you have installed zeppelin-0.7.0-bin-all.tgz, then you need not install the interpreters separately.
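With the interpreters installed, you can start the Zeppelin server using the zeppelin-daemon.sh script that ships with the distribution. A sketch, assuming the untarred zeppelin-0.7.0-bin-netinst directory:

```shell
cd zeppelin-0.7.0-bin-netinst
bin/zeppelin-daemon.sh start    # serves the web UI on ZEPPELIN_PORT (9900 here)
# bin/zeppelin-daemon.sh stop   # stops the server when you are done
```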

Now, we will see how to run Pig scripts using Zeppelin and how to visualize them.

Open the Zeppelin web UI using the port number you set in the configuration file. Since we set the port to 9900, type localhost:9900 in the web browser.

The Zeppelin UI will look like this:


All code and scripts are written in a notebook in Zeppelin. To create a Zeppelin notebook, click on Create new note, give your notebook a name, and select the default interpreter for it below. We have selected pig as the default interpreter.

If you set a default interpreter, you need not specify the interpreter prefix when running code in that notebook.

A notebook will now be created with the name Zeppelin_pig. Open the notebook and you will see an empty paragraph.

Let us run a sample Pig script using Zeppelin. For that, we have taken the Daily Show guest-list analysis. You can read more about it here.

We will work through this problem: find the top five GoogleKnowlege_Occupation categories of guests who appeared on the show in a particular time period.

We will copy the Pig script into this paragraph and run it to check the output.

%pig
A = load '/dialy_show_guests' using PigStorage(',') AS (year:chararray, occupation:chararray, date:chararray, group:chararray, guestlist:chararray);
B = foreach A generate occupation, date;
C = foreach B generate occupation, ToDate(date,'MM/dd/yy') as date;
D = filter C by ((date > ToDate('1/11/99','MM/dd/yy')) AND (date < ToDate('6/11/99','MM/dd/yy')));
E = group D by occupation;
F = foreach E generate group, COUNT(D) as cnt;
G = order F by cnt desc;
H = limit G 5;
dump H;

In the above screenshot, we have successfully run the Pig script through Zeppelin and got the output, but the data is not yet visualized.

To visualize this data, run a foreach statement in another paragraph. Use the pig.query prefix and project the columns out of the relation, as shown below.

%pig.query
foreach H generate $0,$1;

There are two columns in the output and the relation name is H, so we have written a foreach statement that generates the two columns as output.

Now let us run this script. In the screenshot below, you can see that the results are visualized.

For the same results, you can see the pie chart displayed below:

This is how you can visualize the outcomes of Pig scripts using Zeppelin.

We hope this blog helped you in learning how to run Pig scripts and visualize the outcomes of Pig scripts using Zeppelin. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.

 
