Visualize the outcomes of Spark using Zeppelin

Apache Spark is a fast, general-purpose engine for processing Big Data, with most of the tools you need built in. It has modules for SQL, streaming, machine learning, and graph processing, and it can be integrated with many data sources such as HDFS, S3, MongoDB, and Cassandra. Spark is also one of the most active Apache projects, with a large number of committers.

Spark is also widely used in the field of data science. Data scientists need powerful visualization tools to explore their data and fine-tune their models, and Apache Zeppelin helps in this regard.

Apache Zeppelin can be integrated with Spark to execute Spark jobs through the Zeppelin server and to query and visualize the results.

Apache Zeppelin provides interpreters for many languages, so you can run your code through Zeppelin itself and visualize the outcomes.

In this blog, we will demonstrate how to integrate Spark with Zeppelin and how to visualize your outcomes.

We recommend that you go through our previous blog on how to get started with Zeppelin.

Data visualization using Zeppelin

Install Zeppelin by following the steps specified in the above blog. To integrate it with Spark, open the zeppelin-env.sh file in the $ZEPPELIN_HOME/conf directory and add the configurations specified below.

export HADOOP_CONF_DIR=/home/kiran/hadoop-2.7.1/etc/hadoop    # yarn-site.xml is located in this configuration directory
export SPARK_HOME=/home/kiran/spark-1.5.1-bin-hadoop2.6       # (required) when defined, this Spark is loaded instead of Zeppelin's embedded Spark libraries

After setting the configurations, save and close the file. Now open your Zeppelin dashboard, go to the list of interpreters, and search for the Spark interpreter.

Click on edit to change the configurations based on your requirements. By default, Zeppelin sets its Spark master to local. If you want to change the master, you can do so through these Spark interpreter settings. After making your changes, click on save and restart the interpreter.
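Once the interpreter has restarted, you can confirm which settings it picked up directly from a notebook paragraph. Here is a minimal sketch, assuming only that sc is the SparkContext injected by Zeppelin's Spark interpreter:

// Confirm the interpreter's settings from a Zeppelin paragraph;
// sc is the SparkContext provided by the Spark interpreter.
println(sc.master)     // e.g. local[*] unless you changed the master setting
println(sc.version)    // the Spark version that was actually loaded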

If the Zeppelin server is not already running, start it using the below command. Move into the $ZEPPELIN_HOME/bin directory and type:

./zeppelin-daemon.sh start

Now create a new notebook for Spark and write your Spark code. We have created a new notebook named Spark Zeppelin for running Spark jobs, and we are running some sample code in it, as shown in the below screenshot.

Here is the Scala code we have used:

// Load the sensor data and drop the header row
val data = sc.textFile("/home/kiran/Documents/SensorFiles/HVAC.csv")
val header = data.first()
val data1 = data.filter(row => row != header)
// Print the first 10 data rows
data1.take(10).foreach(println)

After running this paragraph, you will see the first 10 lines of the dataset printed.

The default interpreter for Zeppelin is Spark, so you need not put %spark at the start of the paragraph; Zeppelin will automatically understand that the code is Spark code.
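For example, the following two paragraphs are equivalent when Spark is the default interpreter (a small sketch; sc.version simply returns the version of the loaded Spark):

// Paragraph with no binding: Spark is the default interpreter
sc.version

// The same code with an explicit interpreter binding:
// %spark
// sc.version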

To visualize the data, you need to convert your output into a DataFrame, register it as a table, and then fire a SQL query on top of it. Here is an example of machine and sensor data analysis using Spark.
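The general pattern looks like the minimal sketch below; the Reading case class and its rows are hypothetical, purely for illustration (this assumes Spark 1.x, where DataFrames are registered with registerTempTable, as in the rest of this post):

// 1. Build a DataFrame, 2. register it as a temp table, 3. query it from a %sql paragraph.
case class Reading(country: String, temp: Int)    // hypothetical schema, for illustration
val readings = sc.parallelize(Seq(Reading("Finland", 18), Reading("India", 31))).toDF()
readings.registerTempTable("readings")
// In a separate paragraph, run the query that Zeppelin will chart:
// %sql
// select country, temp from readings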

Here, we have temperature readings collected every minute from 20 top buildings all over the world. From this analysis, we can conclude which country's building shows the greatest number of temperature variations.

Here is the code for the machine and sensor data analysis performed through Zeppelin:

// Load the HVAC sensor readings and drop the header row
val data = sc.textFile("/home/kiran/Documents/SensorFiles/HVAC.csv")
val header = data.first()
val data1 = data.filter(row => row != header)
// Schema for the HVAC readings
case class hvac_cls(Date: String, Time: String, TargetTemp: Int, ActualTemp: Int, System: Int, SystemAge: Int, BuildingId: Int)
val hvac = data1.map(x => x.split(",")).map(x => hvac_cls(x(0), x(1), x(2).toInt, x(3).toInt, x(4).toInt, x(5).toInt, x(6).toInt)).toDF
hvac.registerTempTable("HVAC")
// Flag a temperature change whenever the actual temperature deviates from the target by more than 5 degrees
val hvac1 = sqlContext.sql("select *, IF((TargetTemp - ActualTemp) > 5, '1', IF((TargetTemp - ActualTemp) < -5, '1', '0')) AS tempchange from HVAC")
hvac1.registerTempTable("HVAC1")
// Load the building master data and drop its header row
val data2 = sc.textFile("/home/kiran/Documents/SensorFiles/building.csv")
val header1 = data2.first()
val data3 = data2.filter(row => row != header1)
case class building(buildid: Int, buildmgr: String, buildAge: Int, hvacproduct: String, Country: String)
val build = data3.map(x => x.split(",")).map(x => building(x(0).toInt, x(1), x(2).toInt, x(3), x(4))).toDF
build.registerTempTable("building")
// Join the readings with the buildings; column 7 is tempchange, column 8 is Country
val build1 = sqlContext.sql("select h.*, b.Country, b.hvacproduct from building b join HVAC1 h on buildid = BuildingId")
val test = build1.map(x => (x(7).toString.toInt, x(8).toString))
// Keep only the rows flagged as a temperature change and count them per country
val test1 = test.filter(x => x._1 == 1)
val test2 = test1.map(x => (x._2, 1)).reduceByKey(_ + _).toDF
test2.registerTempTable("final")

In the below screenshot, you can see the code of the machine and sensor data analysis to find the building with the most temperature changes, along with the associated console output.

We have registered the final output data in the temp table final. Now we fire a SQL query on top of it to view the final result.

%sql
select * from final

Execute the above query in the next paragraph. Here is the result as a bar graph: the building in Finland shows the most temperature changes.
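If you want to check the numbers behind the chart first, the same counts can be printed from Scala. A quick sketch, relying on the fact that test2 was built from (country, count) tuples, so its columns are named _1 and _2:

// Print the per-country change counts behind the chart
test2.orderBy(test2("_2").desc).show()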

You can visualize the same data using a pie chart as well, as shown in the below screenshot.

For the explanation of the code, you can visit this blog.

Note: while the Spark shell can be lenient about the case of SQL identifiers, code run through Zeppelin is case sensitive, including variable and table names.
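For instance, a trivial sketch of the variable-name case sensitivity:

// Identifier case matters in a Zeppelin paragraph:
val myData = sc.parallelize(1 to 10)
myData.count()      // works
// mydata.count()   // fails: not found: value mydata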

We hope this blog helped you understand how to visualize the outcomes of Spark using Zeppelin. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
