
Apache Spark GraphX Tutorial 2017

In this blog, we are going to discuss a component of the Spark ecosystem, Spark GraphX, through a use case. This Spark component is used for graphs and graph-parallel computation. Like the RDD in Spark Core, the Graph, a directed multigraph with properties attached to each Vertex (V) and Edge (E), is the abstraction GraphX provides for graph processing.
We recommend that readers go through the SparkSQL tutorial first for a better understanding of this one.
As part of this blog, we are going to analyze a flight use case and perform analytics on its data. In this use case, we have several sources and destinations. We will also look at the configuration files and code needed to complete the analytics.
Before we go further, let us lay out the flight use case with all the flight details, such as source, destination and the distance between them.

The flight network in this example connects three places, namely Mumbai, Delhi and Kolkata, and the distances between these locations are given below.
In order to analyze the flight routes, we use Spark GraphX. In Spark GraphX parlance, every location is tagged as a Vertex (V) and every connecting route is tagged as an Edge (E).
To understand this better, let us create two tables, an Airport table and a Routes table. The following table represents the Airport table (i.e. the vertex table):

ID Property
1 BOM
2 DEL
3 CCU

And, the following table represents the Routes table (i.e. Edge table):

srcId dstId Property (distance in km)
1 2 1000
2 3 800
3 1 1200

So far, we have created the desired tables, Airport and Routes, so that Spark GraphX can be understood easily.
Now, we will try to answer the following five questions using GraphX APIs:

  1. How many airports are there?
  2. How many routes are there?
  3. Which routes have a distance greater than 1000 km?
  4. Get the triplets of all the Vertices and Edges.
  5. Sort and get the longest route.

Before we start writing Spark GraphX code, we first need to include the “spark-graphx” dependency in the build.sbt file.
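A minimal build.sbt sketch with this dependency could look like the following (the Scala and Spark versions shown here are only illustrative; use the ones matching your environment):

  // build.sbt (illustrative versions)
  scalaVersion := "2.11.12"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"   % "2.2.0",
    "org.apache.spark" %% "spark-sql"    % "2.2.0",
    "org.apache.spark" %% "spark-graphx" % "2.2.0"
  )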

Once we provide the Spark GraphX dependency in the build.sbt file, all the dependent libraries and jars will be downloaded automatically.
Once the library and jar files are downloaded successfully, the next step is to write Scala programs implementing the logic to solve the five problems mentioned above.

To begin, we create an object of SparkSession, which provides the SparkContext, SQLContext and HiveContext together in Spark 2.x. Thus, there is no need to create the SparkContext and SQLContext separately as we would do in Spark 1.x.

A separate SparkContext is therefore not required, because it can be obtained from the SparkSession object, as shown in the code below.
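A minimal sketch of this setup, assuming a local run (the application name and master URL are placeholders):

  import org.apache.spark.sql.SparkSession

  // Spark 2.x entry point; appName and master are placeholders
  val spark = SparkSession.builder()
    .appName("FlightGraphX")
    .master("local[*]")
    .getOrCreate()

  // Obtain the SparkContext from the SparkSession instead of creating it separately
  val sc = spark.sparkContext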

Now, moving forward with the program, let us first create the vertex table as an RDD using the SparkContext (from Spark Core).

Having obtained the SparkContext from the SparkSession object, we can create an RDD by passing a Seq/Array/List of values to the parallelize method of SparkContext. Here we pass the complete vertex table containing the id and name of every airport.
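One way to write this (a sketch; vertex ids are Longs in GraphX):

  import org.apache.spark.rdd.RDD

  // Vertex table as (vertex id, airport code) pairs
  val verticesRDD: RDD[(Long, String)] =
    sc.parallelize(Array((1L, "BOM"), (2L, "DEL"), (3L, "CCU")))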
The code below checks the first record of verticesRDD.
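For instance:

  // Print the first record of verticesRDD
  verticesRDD.take(1).foreach(println)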
And, it results in (1,BOM).

Now, we have to define a default vertex property, which is one of the mandatory arguments for Graph().
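For example (the value "Missing" is an arbitrary placeholder):

  // Default vertex property, used when an edge refers to a vertex id that is not in verticesRDD
  val defaultLocation = "Missing"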

Next, it is time to create an edge table representing the distances between the vertices. Similar to verticesRDD, we create an edgeRDD to represent them.
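A sketch, reusing the routes table defined above:

  import org.apache.spark.graphx.Edge

  // Edge table: Edge(srcId, dstId, distance in km)
  val edgeRDD: RDD[Edge[Int]] = sc.parallelize(Array(
    Edge(1L, 2L, 1000),  // BOM -> DEL
    Edge(2L, 3L, 800),   // DEL -> CCU
    Edge(3L, 1L, 1200)   // CCU -> BOM
  ))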

The following code helps to check the top three records:
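For instance:

  // Print the top three edge records
  edgeRDD.take(3).foreach(println)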

Now, finally, we create the graph by passing the three objects/RDDs, namely verticesRDD, edgeRDD and defaultLocation, to Graph().
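A sketch of the construction:

  import org.apache.spark.graphx.Graph

  // Build the property graph from the vertex RDD, the edge RDD and the default vertex property
  val graph = Graph(verticesRDD, edgeRDD, defaultLocation)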
The type returned after creating the graph is:
Graph[String, Int]
 
All the vertex and edge values can be retrieved from the Graph() object. The following code is used if you want all the vertices:
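For example, using the graph object built above:

  // Retrieve and print all the vertices of the graph
  graph.vertices.collect.foreach(println)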

And the output lists the three vertices: (1,BOM), (2,DEL) and (3,CCU).

Similarly, to retrieve all the edges, write the following code:
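For example:

  // Retrieve and print all the edges of the graph
  graph.edges.collect.foreach(println)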

And the output lists the three edges: Edge(1,2,1000), Edge(2,3,800) and Edge(3,1,1200).

Now, let us solve the above five questions:

  1. How many airports are there?

Answer: The following code helps you to find out the number of airports:
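One way to express this query, using the graph object built earlier:

  // Number of airports = number of vertices in the graph
  val numAirports = graph.numVertices
  println(numAirports)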

And, the output is 3

  2. How many routes are there?

Answer: The following code helps you to find the number of routes:
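A sketch of this query:

  // Number of routes = number of edges in the graph
  val numRoutes = graph.numEdges
  println(numRoutes)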

And, the output is 3

  3. Which routes have a distance greater than 1000 km?

Answer: The following code finds all the routes with a distance greater than 1000 km:
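One way to express this filter:

  // Keep only the edges whose distance attribute exceeds 1000 km
  graph.edges.filter(edge => edge.attr > 1000).collect.foreach(println)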

And, the output is Edge(3,1,1200), the 1200 km Kolkata to Mumbai route.

  4. Get triplets – combination of vertices and edges.

Answer: The following code helps us to retrieve the triplets i.e. the combination of vertices and edges:
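A sketch of this step:

  // A triplet combines an edge with the properties of its source and destination vertices
  graph.triplets.collect.foreach(println)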

And, the output shows each route together with its source and destination airports, such as ((1,BOM),(2,DEL),1000) for the Mumbai to Delhi route.

  5. Find the longest route in the sorted order.

Answer: The following code retrieves the longest route in the sorted order:
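One way to express this, sorting the triplets by their distance attribute:

  // Sort the triplets by distance in descending order; the first element is the longest route
  graph.triplets.sortBy(_.attr, ascending = false).collect.foreach(println)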

And, the output is the routes sorted by distance in descending order, with the 1200 km Kolkata to Mumbai route first.

Hope this post helped you gain some knowledge of Spark GraphX. Enroll for Big Data and Hadoop Spark Training with Acadgild and become a successful Hadoop developer.
