Data visualization is the practice of representing data as graphs or charts so that decision makers can draw conclusions from pictorial representations more easily. This is where Apache Zeppelin comes in: an open source, multipurpose notebook that offers the following features for your data:
Data Visualization and Collaboration
Apache Zeppelin provides interpreters for many languages, so you can run your code through Zeppelin itself and visualize the results.
Let’s get deeper into Apache Zeppelin.
Download the Zeppelin tarball from the below link.
Untar the tarball using the following command:
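If you prefer fetching the tarball from the command line, the 0.6.2 binary bundle is available from the Apache archive. The URL below assumes the standard Apache archive layout, so verify it against the official downloads page for your version.

```shell
# Fetch the Zeppelin 0.6.2 binary bundle (URL assumes the Apache archive layout)
wget https://archive.apache.org/dist/zeppelin/zeppelin-0.6.2/zeppelin-0.6.2-bin-all.tgz
```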
tar -xvzf zeppelin-0.6.2-bin-all.tgz
Now move into the zeppelin-0.6.2-bin-all folder and install the interpreters using the below command.
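In Zeppelin 0.6.x the bundled installer script is bin/install-interpreter.sh, so a typical invocation would look like the following (note that the -bin-all bundle already ships most interpreters, so this step may add little on that package):

```shell
cd zeppelin-0.6.2-bin-all
# --all installs every community-maintained interpreter
./bin/install-interpreter.sh --all
```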
The above command installs all the community-maintained interpreters. After the installation completes, you can move into the configuration folder to set the necessary configurations for Zeppelin.
Move into the conf directory of Zeppelin and make a copy of the zeppelin-env.sh.template file as zeppelin-env.sh, as shown in the below screenshot.
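Concretely, the copy step amounts to:

```shell
cd zeppelin-0.6.2-bin-all/conf
# Create an editable configuration file from the shipped template
cp zeppelin-env.sh.template zeppelin-env.sh
```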
Now open that zeppelin-env.sh file and set the below properties:

export JAVA_HOME=/home/kiran/jdk1.8.0_65                    # path to your JAVA_HOME
export ZEPPELIN_PORT=9900                                   # port number to run Zeppelin
export SPARK_HOME=/home/kiran/spark-1.5.1-bin-hadoop2.6     # path to your SPARK_HOME
export HADOOP_CONF_DIR=/home/kiran/hadoop-2.7.1/etc/hadoop  # path to your HADOOP_CONF directory
After making the necessary changes, save and close the file. Now you can start the Zeppelin server using the below command
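The server is started with the daemon script shipped under bin/:

```shell
# Run from the Zeppelin installation directory
./bin/zeppelin-daemon.sh start
# ./bin/zeppelin-daemon.sh stop    # stops the server
# ./bin/zeppelin-daemon.sh status  # checks whether it is running
```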
After successfully starting the Zeppelin server, you can open the Zeppelin web UI on the port number specified in the zeppelin-env.sh file.
Open your browser and type localhost:9900. The web UI of Zeppelin is as shown in the below screenshot.
Now to see the list of interpreters available, click on the user as shown below.
On the next page, you can see the full list of available interpreters.
In the JDBC interpreter, you can see all the JDBC-supported tools like Hive, Phoenix, Tajo, PSQL, etc.
Now we will demonstrate how to query a table in Hive through Zeppelin and visualize the results. We will connect Hive and Zeppelin using the Hive JDBC connector.
To integrate Hive with Zeppelin, we need to provide Hive's dependencies to Zeppelin: copy the $HIVE_HOME/lib directory into the $ZEPPELIN_HOME/interpreter directory. You need to add the dependencies of Hadoop as well, so also copy the $HADOOP_HOME/share/hadoop directory into the $ZEPPELIN_HOME/interpreter directory.
Note: After adding the dependencies, you need to restart the Zeppelin server
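Assuming HIVE_HOME, HADOOP_HOME, and ZEPPELIN_HOME point at your installations, the copy-and-restart sequence described above would be:

```shell
# Make the Hive and Hadoop jars visible to Zeppelin's interpreters
cp -r $HIVE_HOME/lib $ZEPPELIN_HOME/interpreter/
cp -r $HADOOP_HOME/share/hadoop $ZEPPELIN_HOME/interpreter/
# Restart so the interpreter processes pick up the new classpath
$ZEPPELIN_HOME/bin/zeppelin-daemon.sh restart
```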
Now click on ‘Create New Note’ on the Zeppelin web UI start-up page as shown in the below screenshot.
After clicking on ‘Create New Note’, give your new notebook a name. Here we have used the name hive_zeppelin. Your notebook looks like this.
The place where you write the code is called a paragraph. Each paragraph is associated with its own output.
Before running Hive queries, make sure that all the Hadoop daemons are up and running and that you have successfully started the Hive metastore and HiveServer2.
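If the Hive services are not already running, they can be started from the shell (assuming $HIVE_HOME/bin is on your PATH); each runs in the foreground, so background them or use separate terminals:

```shell
# Start the Hive metastore service
hive --service metastore &
# Start HiveServer2, which accepts the JDBC connections Zeppelin uses
hive --service hiveserver2 &
```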
Let us run a sample Hive query, i.e., viewing the list of tables.
To execute Hive queries in Zeppelin, you need to prefix the paragraph with %hive.
The below script is present in our Zeppelin paragraph.
%hive
show tables
Click on the ‘Run’ symbol at the right end of the paragraph to execute the script. Below the paragraph you will find the associated output. The same is shown in the below screenshot.
Now let us work on queries that involve MapReduce (MR) jobs through Hive. To execute such queries through Hive JDBC connections, you need to include the below property in your hive-site.xml file:
<property>
  <name>mapreduce.job.reduces</name>
  <value>1</value>
  <description>The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapreduce.jobtracker.address is "local".</description>
</property>
Note: After setting this property, you need to restart your Hive server.
In our Hive, we have a table called olympics in which all the Olympics data is present. You can download the olympics data set from the below link.
Now let us see the description and the contents of the olympics table.
You can work with multiple language interpreters at the same time by using different paragraphs and simply declaring the interpreter you want to execute each one. In the below screenshot, you can see that we have run two Hive queries in two paragraphs.
In the first paragraph we ran
%hive
describe olympics
In the second paragraph we ran
%hive
select * from olympics
Now let us see how to visualize the data using Zeppelin. For that, we run the below query, which calculates the average age of athletes for five countries:
%hive
select olympic_country, AVG(olympic_age) from olympics group by olympic_country limit 5
In the below screenshot, you can see the output of this query.
You can click on the ‘Graph’ symbols associated with the output console. The first one shows the output in tabular format.
Here is the bar chart, which represents the output as bar graphs.
Here is what the pie chart looks like.
Similarly, you can visualize the data as area charts, line charts, and bubble charts. Apache Zeppelin provides many such options for working with many languages.
We hope this blog helped you understand how to work with Apache Zeppelin, an open source data analytics and visualization tool. Keep visiting our site www.acadgild.com for more updates on big data and other technologies.