
Apache Spark is a fast, general-purpose engine for processing Big Data, with most of the tools you need built in. It ships with modules for SQL, streaming, machine learning, and graph processing, and it can be integrated with many data sources such as HDFS, S3, MongoDB, and Cassandra. Spark is also one of the Apache projects with the largest number of committers.
Spark is also widely used in data science. Data scientists need powerful visualization tools to explore their data and fine-tune their models, and Apache Zeppelin helps in this regard.
Apache Zeppelin can be integrated with Spark to execute Spark jobs through the Zeppelin server and to query and visualize the results.
Zeppelin provides interpreters for many languages, so you can run your code from Zeppelin itself and visualize the output.
In this blog, we will demonstrate how to integrate Spark with Zeppelin and how to visualize the results.
We recommend going through our previous blog on getting started with Zeppelin first.
Data visualization using Zeppelin
Install Zeppelin by following the steps in the above blog. To integrate it with Spark, open the zeppelin-env.sh file in the $ZEPPELIN_HOME/conf directory and add the configurations specified below.
# yarn-site.xml is located in the configuration directory pointed to by HADOOP_CONF_DIR
export HADOOP_CONF_DIR=/home/kiran/hadoop-2.7.1/etc/hadoop
# (required) When SPARK_HOME is defined, it is loaded instead of Zeppelin's embedded Spark libraries
export SPARK_HOME=/home/kiran/Spark-1.5.1-bin-hadoop2.6
After setting these configurations, save and close the file. Now open your Zeppelin dashboard, go to the list of interpreters, and search for the Spark interpreter.
Click edit to change the configurations based on your requirements. By default, Zeppelin sets the Spark master to local. If you want to change the master, you can do so through these Spark interpreter properties, as shown below. After making the changes, click save and restart the interpreter.
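As an illustration, these are common values for the master property in the Spark interpreter settings; which one you need depends on your cluster setup, and host and port below are placeholders:
master = local[*]              # run Spark locally using all available cores
master = yarn-client           # submit jobs to a YARN cluster (requires HADOOP_CONF_DIR to be set)
master = spark://host:7077     # connect to a standalone Spark cluster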
Now start your Zeppelin server. Move into the $ZEPPELIN_HOME/bin directory and run the below command:
./zeppelin-daemon.sh start
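The same script can also be used to stop the server or check whether it is running:
./zeppelin-daemon.sh stop
./zeppelin-daemon.sh status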
Now create a new notebook for Spark and write your Spark code. We have created a notebook named Spark Zeppelin for running Spark jobs, and we are running some sample code in it as shown in the screenshot below.
Here is the Scala code we have used:
val data = sc.textFile("/home/kiran/Documents/SensorFiles/HVAC.csv")
val header = data.first()
val data1 = data.filter(row => row != header)
data1.take(10).foreach(println)
After running this paragraph, you will see the first 10 lines of the dataset printed.
The default interpreter in Zeppelin is Spark, so you need not put %spark at the start of the paragraph; Zeppelin will automatically treat the code as Spark code.
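If you do want to be explicit, for example when mixing interpreters in the same notebook, you can still begin a paragraph with the interpreter binding. A minimal sketch (the values are arbitrary):
%spark
// sc and sqlContext are provided by the Zeppelin Spark interpreter
val sample = sc.parallelize(1 to 5)
println(sample.sum())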
To visualize the data, you need to convert your output into a DataFrame, register it as a table, and fire a SQL query on top of it. Here is an example of machine and sensor data analysis using Spark.
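Before walking through the full analysis, here is a minimal sketch of that pattern; the CSV path, the Record case class, and the table name are hypothetical placeholders, not part of the original example:
import sqlContext.implicits._                  // for .toDF(); usually already imported in the Spark interpreter

// a hypothetical two-column CSV: name,score
case class Record(name: String, score: Int)

val lines = sc.textFile("/path/to/sample.csv")
val records = lines.map(_.split(","))
  .map(cols => Record(cols(0), cols(1).toInt))
  .toDF()                                      // convert the RDD of case classes to a DataFrame
records.registerTempTable("records")           // expose it to the %sql interpreter

// then, in a separate paragraph:
// %sql select name, score from records order by score desc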
Here, we have temperatures collected every minute from 20 buildings around the world. From this analysis, we can conclude which country's building has the highest number of temperature variations.
Here is the code for the machine and sensor data analysis, run through Zeppelin:
// Load the HVAC sensor readings and drop the header row
val data = sc.textFile("/home/kiran/Documents/SensorFiles/HVAC.csv")
val header = data.first()
val data1 = data.filter(row => row != header)

case class hvac_cls(Date: String, Time: String, TargetTemp: Int, ActualTemp: Int, System: Int, SystemAge: Int, BuildingId: Int)
val hvac = data1.map(x => x.split(",")).map(x => hvac_cls(x(0), x(1), x(2).toInt, x(3).toInt, x(4).toInt, x(5).toInt, x(6).toInt)).toDF
hvac.registerTempTable("HVAC")

// Flag a reading as a temperature change ('1') when the actual temperature is more than 5 degrees above or below the target
val hvac1 = sqlContext.sql("select *, IF((TargetTemp - ActualTemp) > 5, '1', IF((TargetTemp - ActualTemp) < -5, '1', 0)) AS tempchange from HVAC")
hvac1.registerTempTable("HVAC1")

// Load the building metadata and drop its header row
val data2 = sc.textFile("/home/kiran/Documents/SensorFiles/building.csv")
val header1 = data2.first()
val data3 = data2.filter(row => row != header1)

case class building(buildid: Int, buildmgr: String, buildAge: Int, hvacproduct: String, Country: String)
val build = data3.map(x => x.split(",")).map(x => building(x(0).toInt, x(1), x(2).toInt, x(3), x(4))).toDF
build.registerTempTable("building")

// Join the flagged readings with the building metadata to attach the country of each building
val build1 = sqlContext.sql("select h.*, b.Country, b.hvacproduct from building b join HVAC1 h on buildid = BuildingId")

// Keep only the flagged readings and count them per country; x(7) is tempchange and x(8) is Country
val test = build1.map(x => (new Integer(x(7).toString), x(8).toString))
val test1 = test.filter(x => x._1 == 1)
val test2 = test1.map(x => (x._2, 1)).reduceByKey(_ + _).toDF

// Register the per-country counts so they can be queried with %sql
test2.registerTempTable("final")
In the screenshot below, you can see the code for the machine and sensor data analysis that finds the building with the most temperature changes, along with the associated console output.
We have registered the output data in the table final. Now we need to fire a SQL query on top of this table to view the final result.
%sql select * from final
You can execute the above query in the next paragraph. Here is the result as a bar graph; the building in Finland shows the most temperature changes.
You can also visualize the same result as a pie chart, as shown in the screenshot below.
For the explanation of the code, you can visit this blog.
Note: Spark Scala code in the Spark shell is not case-sensitive, but when run through Zeppelin, the code and variable names are case-sensitive.
We hope this blog helped you understand how to visualize Spark output using Zeppelin. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.