
Data Visualization Using Apache Zeppelin

Data visualization is the practice of representing data as graphs or charts so that decision makers can reach conclusions from a pictorial view rather than from raw numbers. This is where Apache Zeppelin comes in: an open-source, multi-purpose notebook that offers the following capabilities for your data:

  • Data Ingestion

  • Data Discovery

  • Data Analytics

  • Data Visualization and Collaboration

Apache Zeppelin ships with interpreters for many languages, so you can run your code from Zeppelin itself and visualize the results.
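
Each notebook paragraph begins with a directive that selects the interpreter used to run it (the available directives depend on which interpreters you install). The snippets below are only illustrative examples of that mechanism:

%md Zeppelin renders this paragraph as **markdown**
%sh echo "this paragraph runs in the shell interpreter"

Later in this post we will use the %hive directive in the same way.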

Let’s get deeper into Apache Zeppelin.

Download the Zeppelin tarball from the link below:

http://zeppelin.apache.org/download.html
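
If you prefer working from the terminal, you can also fetch the binary package directly. The mirror URL below is only an example taken from the Apache archive; it may differ for the version you choose:

wget https://archive.apache.org/dist/zeppelin/zeppelin-0.6.2/zeppelin-0.6.2-bin-all.tgz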

Untar the tarball using the following command:

tar -xvzf zeppelin-0.6.2-bin-all.tgz

Now move into the zeppelin-0.6.2-bin-all folder and install the interpreters using the command below:

./bin/install-interpreter.sh --all

The above command installs all the community-maintained interpreters. Once the installation is complete, move into the configuration folder to set the configuration Zeppelin needs.
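
If you only need a few interpreters, the same script can also install them selectively by name instead of installing everything. A minimal sketch, where the interpreter names are just examples:

./bin/install-interpreter.sh --name md,shell,jdbc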

Move into the conf directory of Zeppelin and copy the zeppelin-env.sh.template file to zeppelin-env.sh, as shown in the screenshot below.
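
From the terminal, the copy looks like this (assuming you are inside the Zeppelin installation directory):

cd conf
cp zeppelin-env.sh.template zeppelin-env.sh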

Now open the zeppelin-env.sh file and set the properties below:

export JAVA_HOME=/home/kiran/jdk1.8.0_65 #path to your JAVA_HOME
export ZEPPELIN_PORT=9900 #port number to run Zeppelin
export SPARK_HOME=/home/kiran/spark-1.5.1-bin-hadoop2.6 #path to your SPARK_HOME
export HADOOP_CONF_DIR=/home/kiran/hadoop-2.7.1/etc/hadoop #path to your HADOOP_CONF directory

After making the necessary changes, save and close the file. Now you can start the Zeppelin server using the command below:

./bin/zeppelin-daemon.sh start
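
The same script can also be used later to check on the server or stop it, which is handy when troubleshooting:

./bin/zeppelin-daemon.sh status
./bin/zeppelin-daemon.sh stop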

After successfully starting the Zeppelin server, you can open the Zeppelin web UI on the port number specified in the zeppelin-env.sh file.

Open your browser and go to localhost:9900. The Zeppelin web UI looks like the screenshot below.

To see the list of available interpreters, click on the user menu as shown below.

On the next page, you can see the full list of available interpreters.

Under the JDBC interpreter, you can see the JDBC-backed engines it supports, such as Hive, Phoenix, Tajo, PostgreSQL, and so on.
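
For reference, a Hive connection in the JDBC interpreter is driven by a set of prefixed properties on the interpreter settings page. The values below are only an illustration; the host, port, and credentials will differ in your setup (one of the comments at the end of this post shows a working variant):

hive.driver    org.apache.hive.jdbc.HiveDriver
hive.url       jdbc:hive2://localhost:10000
hive.user      your_hive_user
hive.password  your_hive_password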

Now we will demonstrate how to query a Hive table through Zeppelin and visualize the results. We will connect Hive and Zeppelin using the Hive JDBC connector.

To integrate Hive with Zeppelin, we need to make the Hive dependencies available to Zeppelin: copy them from the $HIVE_HOME/lib directory into the $ZEPPELIN_HOME/interpreter directory. You need to add the Hadoop dependencies as well, so copy the jars under the $HADOOP_HOME/share/hadoop directory into the $ZEPPELIN_HOME/interpreter directory.
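
In practice (as the troubleshooting in the comments below also shows), copying just the Hive JDBC standalone jar and the hadoop-common jar into the JDBC interpreter directory is usually enough. A sketch, with jar versions that depend on your installation:

cp $HIVE_HOME/lib/hive-jdbc-*-standalone.jar $ZEPPELIN_HOME/interpreter/jdbc/
cp $HADOOP_HOME/share/hadoop/common/hadoop-common-*.jar $ZEPPELIN_HOME/interpreter/jdbc/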

Note: After adding the dependencies, you need to restart the Zeppelin server.

Now click on ‘Create New Note’ on the Zeppelin web UI start-up page as shown in the below screenshot.


After clicking on ‘Create New Note’, give your new notebook a name. Here we have named it hive_zeppelin. Your notebook will look like this.

The place where you write the code is called a paragraph. Each paragraph has its own output.

Before running Hive queries, make sure that all the Hadoop daemons are up and running and that you have successfully started the Hive metastore and HiveServer2.
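
If any of these services are down, a rough sequence for bringing them up looks like the following (assuming $HADOOP_HOME/sbin and $HIVE_HOME/bin are on your PATH; adjust the commands to your installation):

start-dfs.sh
start-yarn.sh
hive --service metastore &
hiveserver2 &

You can confirm that the Hadoop daemons are running with jps.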

Let us run a sample query on Hive, i.e., viewing the list of tables.

To execute Hive queries in a Zeppelin paragraph, you need to start it with the directive %hive.

Our Zeppelin paragraph contains the script below.

%hive
show tables

Click the ‘Run’ symbol at the right end of the paragraph to execute the script. Below the paragraph you will find the associated output, as shown in the screenshot below.

Now let us work on queries that involve MapReduce jobs. To execute Hive queries that launch MapReduce jobs over a Hive JDBC connection, you need to include the property below in your hive-site.xml file:

<property>
  <name>mapreduce.job.reduces</name>
  <value>1</value>
  <description>The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapreduce.jobtracker.address is "local".</description>
</property>

Note: After setting this property, you need to restart your Hive server.
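
If you would rather not edit hive-site.xml for this, the property can usually be set just for the current session from a Zeppelin paragraph before running your query. This is a hedged sketch; whether the setting persists depends on how the JDBC interpreter manages its connection:

%hive
set mapreduce.job.reduces=1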

In our Hive warehouse, we have a table called olympics that holds the Olympics data. You can download the olympics dataset from the link below:

https://drive.google.com/open?id=0ByJLBTmJojjzV1czX3Nha0R3bTQ
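
If the olympics table does not already exist in your Hive warehouse, you can create and load it along the lines below. This is only a sketch: the dataset is assumed to be tab-delimited, and apart from olympic_age and olympic_country (which the later queries rely on) the column names are illustrative, so match them to the file you download:

%hive
create table if not exists olympics (olympic_athlete string, olympic_age int, olympic_country string, olympic_year string, olympic_sport string) row format delimited fields terminated by '\t' stored as textfile

After that, a LOAD DATA LOCAL INPATH statement pointing at the downloaded file populates the table.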

Now let us see the description and the contents of the olympics table.

You can work with multiple language interpreters at the same time by using different paragraphs and simply specifying the interpreter that should execute each one. In the screenshot below, you can see that we have run two Hive queries in two paragraphs.

In the first paragraph we ran

%hive
describe olympics

In the second paragraph we ran

%hive
select * from olympics

Now let us see how to visualize the data using Zeppelin. For that, we run the query below, which calculates the average athlete age per country and limits the output to 5 countries:

%hive
select olympic_country,AVG(olympic_age) from olympics group by olympic_country limit 5

In the below screenshot, you can see the output of this query.

You can click on the graph symbols attached to the output console to switch views. The first view is the tabular format.

Here is the bar chart view, which represents the output as bar graphs.

Here is what the pie chart looks like.

Similarly, you can visualize the output as area charts, line charts, and scatter charts. Apache Zeppelin provides many such options for working with many languages.

We hope this blog helped you understand how to work with Apache Zeppelin, an open-source data analytics and visualization tool. Keep visiting our site www.acadgild.com for more updates on big data and other technologies.


Comments

  1. Is it necessary to configure Maven to install Zeppelin?
    I followed each and every step you mentioned, but I got the following error:
    Zeppelin daemon failed.
    Can you please help me.

  2. Thanks, Kiran, for your instant reply; you people are great.
    I have installed my Hadoop in /usr/lib/hadoop-2.2.0 and
    Java at /usr/lib/jdk-1.7.0_64.
    Zeppelin has been downloaded to Desktop/newfolder/zeppelin-tar-file.
    In the bash file I have set the path as mentioned above using
    export JAVA_HOME=/usr/lib/jdk-1.7.0_64
    Can you please send me the installation documentation for Linux in a Hadoop environment?
    [email protected]

  3. Hi Kiran,
    Thanks for explaining the steps. I followed them, but I am facing an error while executing the notebook paragraph %hive
    show tables
    error:
    org.apache.hive.jdbc.HiveDriver
    class java.lang.ClassNotFoundException
    java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    java.lang.Class.forName0(Native Method)
    java.lang.Class.forName(Class.java:264)
    org.apache.zeppelin.jdbc.JDBCInterpreter.getConnection(JDBCInterpreter.java:220)
    org.apache.zeppelin.jdbc.JDBCInterpreter.getStatement(JDBCInterpreter.java:233)
    org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:302)
    org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:408)
    org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
    org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:341)
    org.apache.zeppelin.scheduler.Job.run(Job.java:176)
    org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
    java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    java.util.concurrent.FutureTask.run(FutureTask.java:266)
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)
    ERROR
    Took 0 sec. Last updated by anonymous at January 31 2017, 6:54:28 AM.
    Please let me know how to fix this.

    1. Hi Sarveswara,
      It seems the Hive JDBC driver needed to execute Hive queries through your Zeppelin Hive interpreter is missing. Please copy hive-jdbc-$VERSION-standalone.jar, which is present in your $HIVE_HOME/lib directory, into the $ZEPPELIN_HOME/interpreter/jdbc/ directory.

      1. Hi Kiran,
        Thanks for your prompt reply.
        I am facing the same issue even though I copied the standalone jar as you mentioned.
        -rw-r--r-- 1 hadoop hadoop 2172168 Oct 12 14:22 guava-15.0.jar
        -rw-r--r-- 1 hadoop hadoop 94046 Jan 31 20:02 hadoop-auth-2.7.3.jar
        -rw-r--r-- 1 hadoop hadoop 3473404 Jan 31 18:58 hadoop-common-2.7.3.jar
        -rwxrwxr-x 1 hadoop hadoop 17491833 Jan 31 17:40 hive-jdbc-2.1.0-standalone.jar
        -rw-r--r-- 1 hadoop hadoop 213911 Oct 12 14:22 jline-2.12.1.jar
        -rw-r--r-- 1 hadoop hadoop 489884 Oct 12 14:22 log4j-1.2.17.jar
        -rw-r--r-- 1 hadoop hadoop 648487 Oct 12 14:22 postgresql-9.4-1201-jdbc41.jar
        -rw-r--r-- 1 hadoop hadoop 32119 Oct 12 14:22 slf4j-api-1.7.10.jar
        -rw-r--r-- 1 hadoop hadoop 8866 Oct 12 14:22 slf4j-log4j12-1.7.10.jar
        -rw-r--r-- 1 hadoop hadoop 26788 Oct 12 14:22 zeppelin-jdbc-0.6.2.jar
        Pwd:
        [[email protected] jdbc]$ pwd
        /opt/zeppelin/zeppelin-0.6.2-bin-all/interpreter/jdbc
        I am using a 2-node cluster with Hive 2.1 and Hadoop 2.7 from the Apache Software Foundation; I am not using any commercial distributions.
        Can you suggest the next steps to proceed further? Thanks in advance.

        1. Hi Sarveswara,
          According to the error message you got, it is still due to the missing hive-jdbc jar file. Please try restarting your Zeppelin daemon after copying the file, and also check the log file in the $ZEPPELIN_HOME/logs directory to find the exact issue.

          1. Hi Kiran,
            Thank you very much for guiding me.
            I have tried that, and now I am getting the error below:
            Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000: null
            class java.sql.SQLException
            org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:219)
            org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:157)
            org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:107)
            java.sql.DriverManager.getConnection(DriverManager.java:664)
            java.sql.DriverManager.getConnection(DriverManager.java:208)
            org.apache.zeppelin.jdbc.JDBCInterpreter.getConnection(JDBCInterpreter.java:222)
            org.apache.zeppelin.jdbc.JDBCInterpreter.getStatement(JDBCInterpreter.java:233)
            org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:302)
            org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:408)
            org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
            org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:341)
            org.apache.zeppelin.scheduler.Job.run(Job.java:176)
            org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
            java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            java.util.concurrent.FutureTask.run(FutureTask.java:266)
            java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
            java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
            java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            java.lang.Thread.run(Thread.java:745)
            Looks like a connectivity issue; your inputs are greatly appreciated!
            Thanks in advance

          2. Hi Sarveswara,
            Please make sure that your Hive metastore and HiveServer2 services are up and running. Go to $HIVE_HOME/bin and type ./hive --service metastore to start the Hive metastore. To start HiveServer2, open another terminal and type ./hiveserver2 in the same $HIVE_HOME/bin directory. Now try running Hive queries through Zeppelin.

  4. Hi Kiran,
    I have started HiveServer2 and the metastore and restarted the Zeppelin server, but I have observed that the ZeppelinServer process is visible for only a fraction of a second and then disappears.
    [[email protected] bin]$ jps
    49393 SecondaryNameNode
    53072 RunJar
    54361 ZeppelinServer
    54406 Jps
    53254 RunJar
    49712 NodeManager
    49577 ResourceManager
    49186 DataNode
    49025 NameNode
    53504 RunJar
    [[email protected] bin]$ jps
    49393 SecondaryNameNode
    54425 Jps
    53072 RunJar
    53254 RunJar
    49712 NodeManager
    49577 ResourceManager
    49186 DataNode
    49025 NameNode
    53504 RunJar
    [[email protected] bin]$ ./zeppelin-daemon.sh status
    Zeppelin running but process is dead [FAILED]
    [[email protected] bin]$
    I am still facing the same issue:
    Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000: null
    class java.sql.SQLException
    org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:219)
    org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:157)
    org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:107)
    java.sql.DriverManager.getConnection(DriverManager.java:664)
    java.sql.DriverManager.getConnection(DriverManager.java:208)
    org.apache.zeppelin.jdbc.JDBCInterpreter.getConnection(JDBCInterpreter.java:222)
    org.apache.zeppelin.jdbc.JDBCInterpreter.getStatement(JDBCInterpreter.java:233)
    org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:302)
    org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:408)
    org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
    org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:341)
    org.apache.zeppelin.scheduler.Job.run(Job.java:176)
    org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
    java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    java.util.concurrent.FutureTask.run(FutureTask.java:266)
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)
    ERROR
    Took 0 sec. Last updated by anonymous at January 31 2017, 11:43:28 PM.
    Can you please suggest how to proceed?
    Thanks in advance

  5. Hi Kiran,
    Many thanks for your guidance. Finally, I am able to connect to Hive.
    I made changes to the JDBC interpreter parameters and it got connected.
    Solution:
    hive.password xxxxxxx
    hive.url jdbc:hive2://xx.xxx.x.xxx:10000/default;auth=noSasl
    hive.user xxxxxxx
    I restarted the interpreter and it worked.
    Thanks
    Sarvesh

  6. Hi Kiran,
    I am getting the error below:
    java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.hive.jdbc.HiveConnection.createBinaryTransport(HiveConnection.java:478)
    at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:201)
    at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:176)
    at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
    at java.sql.DriverManager.getConnection(DriverManager.java:664)
    at java.sql.DriverManager.getConnection(DriverManager.java:208)
    at org.apache.zeppelin.jdbc.JDBCInterpreter.getConnection(JDBCInterpreter.java:222)
    at org.apache.zeppelin.jdbc.JDBCInterpreter.open(JDBCInterpreter.java:176)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:341)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
    at org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

    1. Hi Zeeshan,
      You need to add the hadoop-common jar to your interpreter classpath.
      Copy the hadoop-common jar, which is present in your $HADOOP_HOME/share/hadoop/common directory, into the $ZEPPELIN_HOME/interpreter/jdbc directory. Now restart the Zeppelin daemon.

  7. Hey Kiran, after adding it I get:
    Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: zeeshan is not allowed to impersonate hive
    class org.apache.hive.service.cli.HiveSQLException
    org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
    org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:247)
    org.apache.hive.jdbc.HiveConnection.openSession(HiveConnection.java:586)
    org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:192)
    org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
    java.sql.DriverManager.getConnection(DriverManager.java:664)
    java.sql.DriverManager.getConnection(DriverManager.java:208)
    org.apache.zeppelin.jdbc.JDBCInterpreter.getConnection(JDBCInterpreter.java:222)
    org.apache.zeppelin.jdbc.JDBCInterpreter.getStatement(JDBCInterpreter.java:233)
    org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql(JDBCInterpreter.java:302)
    org.apache.zeppelin.jdbc.JDBCInterpreter.interpret(JDBCInterpreter.java:408)
    org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
    org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:341)
    org.apache.zeppelin.scheduler.Job.run(Job.java:176)
    org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
    java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    java.util.concurrent.FutureTask.run(FutureTask.java:266)
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)

    1. Hi Zeeshan,
      You need to give proxy permissions to your Hadoop user. Follow the steps below to resolve the error.
      Stop all the Hadoop daemons first and then add the properties below to your core-site.xml file:
      <property>
      <name>hadoop.proxyuser.your_machine_user_name.hosts</name>
      <value>*</value>
      </property>
      <property>
      <name>hadoop.proxyuser.your_machine_user_name.groups</name>
      <value>*</value>
      </property>
      Now add the property below to your hive-site.xml file:
      <property>
      <name>hive.server2.enable.doAs</name>
      <value>true</value>
      </property>
      Now start all your Hadoop daemons and try performing the same operation.

  8. Hi Kiran,
    I have followed all the steps listed above, but Hive is not listed under the JDBC interpreter. Hive is installed on my machine and HIVE_HOME is set to the Hive directory.

    1. Hi Srivastava,
      In Zeppelin, the Hive interpreter comes under the JDBC interpreter. You just need to add the supporting jar files, i.e., hive-jdbc-*.jar, to the $ZEPPELIN_HOME/interpreter/jdbc directory and set HIVE_HOME in the zeppelin-env.sh file.
