
A Brief Introduction to GraphX

GraphX is the Apache Spark component for graphs and graph-parallel computation. It is a distributed graph processing framework built on top of Spark Core.
GraphX provides a set of operators (e.g., subgraph, joinVertices, and aggregateMessages) to support graph computation. It is not, however, a graph database.

Basics of Graph

  • A graph is a mathematical structure that models the relationships between objects. It consists of vertices and edges, where vertices are the objects (also called nodes) and edges are the lines connecting pairs of vertices.


  • Directed Graph: A directed graph has directional edges (edges that have a direction associated with them). A simple example is the follower relationship on Twitter.

User Bob follows Carol without implying that user Carol also follows Bob.

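  • Regular graph: A graph in which every vertex has the same number of neighbors (the same degree). This is a different property from being undirected, where every connection is mutual: for example, if user A is friends with user B, then B is also friends with A.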
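The three ideas above (directed edges, mutual edges, and regularity) can be sketched in plain Scala without Spark. The user names and the follows set below are hypothetical, chosen only to make the checks concrete.

```scala
// Hypothetical users; each pair (a, b) is a directed edge meaning "a follows b".
val follows: Set[(String, String)] = Set(
  ("Bob", "Carol"),
  ("Carol", "Dave"),
  ("Dave", "Bob")
)

// A directed edge does not imply its reverse.
val bobFollowsCarol = follows.contains(("Bob", "Carol")) // true
val carolFollowsBob = follows.contains(("Carol", "Bob")) // false

// Treating every edge as mutual (undirected) means adding each reverse edge.
val friends: Set[(String, String)] = follows ++ follows.map(_.swap)

// Neighbors of each vertex in the undirected view.
val neighbors: Map[String, Set[String]] =
  friends.groupBy(_._1).map { case (user, edges) => user -> edges.map(_._2) }

// The graph is regular when every vertex has the same number of neighbors.
val isRegular = neighbors.values.map(_.size).toSet.size == 1
```

In this small triangle every user ends up with two neighbors, so the undirected view is regular, while the directed view still distinguishes who follows whom.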


Introduction to GraphX API

  • First, we need to run the following commands in the Spark shell to import the GraphX packages. This lets us use the classes and methods defined in the GraphX package.

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

  • Property Graph: It is a directed multigraph (a directed graph that may have multiple parallel edges sharing the same source and destination vertex) with properties attached to each vertex and edge.
    • Each vertex is keyed by a unique 64-bit long identifier (VertexId).
    • Each edge has a corresponding source and destination vertex identifier.
    • No ordering constraints are imposed on the VertexId values.
    • Properties are stored as Scala/Java objects with each edge and vertex in the graph.
    • It is immutable, distributed, and fault tolerant. An existing graph cannot be changed in place; changing its values or structure produces a new graph.


    • In the example used in this post, each vertex holds a user's name and age, and each edge holds a number of likes.
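The immutability point can be illustrated with a minimal sketch. It assumes a Spark installation with GraphX on the classpath; the two users and the single edge are placeholder data, and SparkContext.getOrCreate is used so the snippet reuses the Spark shell's context when one already exists.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Reuses the spark-shell context if one exists, otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graphx-immutability-sketch").setMaster("local[*]"))

// Placeholder data: (id, (name, age)) vertices and one "likes" edge.
val users: RDD[(VertexId, (String, Int))] =
  sc.parallelize(Seq((1L, ("Alice", 28)), (2L, ("Bob", 27))))
val likes: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(2L, 1L, 7)))
val graph: Graph[(String, Int), Int] = Graph(users, likes)

// mapVertices does not modify `graph`; it returns a new graph.
val older: Graph[(String, Int), Int] =
  graph.mapVertices((id, attr) => (attr._1, attr._2 + 1))

// Collect name -> age from both graphs to compare them.
val agesBefore = graph.vertices.collect().map { case (_, (name, age)) => name -> age }.toMap
val agesAfter = older.vertices.collect().map { case (_, (name, age)) => name -> age }.toMap
```

After this runs, agesBefore still holds the original ages, while agesAfter holds the incremented ones.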
  • Next, we start by creating a property graph using arrays. The Scala code below creates an array containing a single vertex:

val vertexArray = Array((1L, ("Alice", 28)))

  • The complete vertex array is built the same way, with one (id, (name, age)) tuple per user.
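As a sketch of what a complete vertex array might look like: only the Alice entry comes from this post, and the other users are made-up placeholders for illustration.

```scala
// Each element is (vertex id, (name, age)).
// Only Alice is taken from the post; the remaining users are hypothetical.
val vertexArray: Array[(Long, (String, Int))] = Array(
  (1L, ("Alice", 28)),
  (2L, ("Bob", 27)),
  (3L, ("Charlie", 65)),
  (4L, ("David", 42))
)
```

The explicit type annotation is optional, but it makes the shape of each vertex clear: a Long identifier paired with a (String, Int) property.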

  • To create an edge, we use the Edge class with the following parameters:

Edge(srcId, dstId, attr)

val edgeArray = Array(Edge(3L, 2L, 4))
// srcId = 3 (source vertex)
// dstId = 2 (destination vertex)
// attr  = 4 (edge attribute)

  • The complete edge array is built the same way, with one Edge per relationship.
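A complete edge array could be sketched as follows. Edge(3L, 2L, 4) is the edge from this post; the other edges and their like counts are made-up placeholders. The snippet assumes GraphX is on the classpath, as it is in the Spark shell.

```scala
import org.apache.spark.graphx.Edge

// Edge(srcId, dstId, attr): here the attribute is the number of likes.
// Only Edge(3L, 2L, 4) is taken from the post; the others are hypothetical.
val edgeArray: Array[Edge[Int]] = Array(
  Edge(2L, 1L, 7),
  Edge(3L, 2L, 4),
  Edge(4L, 1L, 1),
  Edge(3L, 4L, 8)
)
```

Every srcId and dstId should match the identifier of a vertex in the vertex array, since edges refer to vertices only by id.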

  • To create the property graph, we need to create RDDs from vertexArray and edgeArray so that we can pass them to the Graph constructor.

// Create RDDs from the vertex and edge arrays.
val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)

  • After creating the RDDs, we can build a property graph by passing the vertex RDD and the edge RDD to the Graph constructor. The resulting type is Graph[VD, ED], where VD and ED are the vertex and edge property types; here that is Graph[(String, Int), Int].
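The construction step can be sketched as below. It assumes a Spark installation with GraphX available; the vertex and edge values other than Alice and Edge(3L, 2L, 4) are placeholders, and SparkContext.getOrCreate simply reuses the Spark shell's existing context when there is one.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Reuses the spark-shell context if one exists, otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graphx-construct-sketch").setMaster("local[*]"))

// Placeholder data in the same shape as vertexArray and edgeArray.
val vertexRDD: RDD[(VertexId, (String, Int))] = sc.parallelize(Seq(
  (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65))))
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(3L, 2L, 4), Edge(2L, 1L, 7)))

// Vertex property type is (String, Int); edge property type is Int.
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

println(s"vertices: ${graph.vertices.count()}, edges: ${graph.edges.count()}")
```

VertexId is GraphX's alias for Long, so RDD[(Long, (String, Int))] and RDD[(VertexId, (String, Int))] describe the same type.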

  • Using graph.vertices and graph.edges, we can deconstruct a graph into its vertices and edges.
    • The following command uses graph.vertices to find the users who are older than 40.

graph.vertices.filter { case (id, (name, age)) => age > 40 }.collect.foreach { case (id, (name, age)) => println(s"$name is $age") }

    • The following command uses graph.edges to find the edges with more than 5 likes. Replacing .collect with .count returns how many such edges there are.

graph.edges.filter { case Edge(src, dst, attr) => attr > 5 }.collect
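To get a total rather than the matching edges themselves, the same filter can be followed by count. This is a minimal sketch on placeholder data (the users and like counts are made up), assuming Spark with GraphX is available.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Reuses the spark-shell context if one exists, otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graphx-count-sketch").setMaster("local[*]"))

// Placeholder graph: three users and three "likes" edges.
val users: RDD[(VertexId, (String, Int))] = sc.parallelize(Seq(
  (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65))))
val likes: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(2L, 1L, 7), Edge(3L, 2L, 4), Edge(3L, 1L, 8)))
val graph: Graph[(String, Int), Int] = Graph(users, likes)

// Keep only edges whose attribute (likes) is greater than 5.
val popular: RDD[Edge[Int]] = graph.edges.filter { case Edge(_, _, attr) => attr > 5 }

// count() returns the number of matching edges as a Long.
val popularCount: Long = popular.count()
```

With this placeholder data, two of the three edges pass the filter. Using count avoids pulling every matching edge back to the driver, which matters once the graph is large.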

We hope this post has been helpful in understanding GraphX. If you have any questions, feel free to comment below, and we will get back to you as soon as possible. Keep visiting for more updates on Big Data and other technologies.
