Graph computation: Processing hierarchical data using the Spark GraphX Pregel API

Keywords: Big Data Spark

Today, distributed computing engines are the backbone of many analytics, batch, and streaming applications. Spark provides many advanced functions (pivot, analytic window functions, etc.) to transform data out of the box. Sometimes you need to process hierarchical data or perform hierarchical calculations. Many database vendors provide features such as recursive CTEs (common table expressions) or the "connect by" SQL clause to query and transform hierarchical data. A recursive CTE is also called a recursive query or parent-child query. In this article, we will look at how to solve this problem with Spark.

Hierarchical data overview

Hierarchical data contains parent-child relationships, where one item of data is the parent of another. It can be represented with a graph property object model, in which each row is a vertex (node), each connection is an edge (relationship) linking two vertices, and the columns are the properties of the vertices.
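
As a minimal sketch of this model (plain Scala collections, no Spark), the snippet below turns rows into vertices and parent-to-child edges. The names Record, Vertex, Link and vertexIdOf are hypothetical and only illustrate the idea. The hashing step mirrors the MurmurHash3 call that the article's code uses to derive numeric vertex ids from string keys.

```scala
import scala.util.hashing.MurmurHash3

object GraphModelSketch {
  // One row of hierarchical data: a key, one attribute, and an optional parent key.
  final case class Record(id: String, name: String, parentId: Option[String])

  // A vertex carries a numeric id plus the row's columns as properties.
  final case class Vertex(vertexId: Long, id: String, name: String)

  // A directed edge from parent to child.
  final case class Link(srcId: Long, dstId: Long)

  // Derive a numeric vertex id from a string key.
  // stringHash returns a 32-bit Int, so distinct keys can collide on large datasets.
  def vertexIdOf(key: String): Long = MurmurHash3.stringHash(key).toLong

  // Every row becomes exactly one vertex.
  def toVertices(rows: Seq[Record]): Seq[Vertex] =
    rows.map(r => Vertex(vertexIdOf(r.id), r.id, r.name))

  // Rows without a parent (the roots) produce no edge.
  def toEdges(rows: Seq[Record]): Seq[Link] =
    rows.flatMap(r => r.parentId.map(p => Link(vertexIdOf(p), vertexIdOf(r.id))))
}
```

With three rows A, B (child of A) and C (child of B), toVertices yields three vertices and toEdges yields two edges, A to B and B to C.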

Some use cases

  • Financial calculations - sub-account balances roll up into parent accounts, all the way to the topmost account
  • Building an organizational hierarchy - manager-employee relationships, with the path
  • Generating a graph of links between web pages, with the path
  • Any kind of iterative calculation involving linked data


There are some challenges in querying hierarchical data in distributed systems:

  • Data is connected, but it is distributed across partitions and nodes. An implementation that solves this problem should be optimized for iterating and for shuffling data as needed.
  • The depth of the graph changes over time. The solution should handle varying depths and should not force the user to define the depth before processing.


One way to implement recursive CTE-style processing in Spark is to use the GraphX Pregel API.

What is the GraphX Pregel API?

GraphX is Spark's API for graphs and graph-parallel computation. Graph algorithms are iterative in nature, and the properties of a vertex depend on the properties of the vertices it is connected to, either directly or indirectly (through other vertices). Pregel is a vertex-centric graph processing model developed by Google, and Spark GraphX provides an optimized variant of the Pregel API.

How does the Pregel API work?

Processing with the Pregel API consists of executing supersteps.

Superstep 0:

  • Pass the initial message to all vertices
  • Each vertex sends its value as a message to the vertices it is directly connected to

Superstep 1:

  • Receive the messages from the previous superstep
  • Mutate the value
  • Send the value as a message to the directly connected vertices

Repeat superstep 1 for as long as messages are being passed, and stop when there are no more messages.
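
The superstep loop can be sketched in plain Scala, without Spark, to show the control flow. This is an illustration under stated assumptions, not how GraphX implements it: the object SuperstepSketch, its levels function and the maxSupersteps guard are hypothetical names, the edges are assumed to form a forest, and only the roots are seeded in superstep 0 to keep the sketch short.

```scala
object SuperstepSketch {
  // Propagate a level (depth) down a list of (parent, child) edges, Pregel-style.
  // roots: the vertices that start the propagation.
  // maxSupersteps: safety guard in case bad data contains a cycle.
  def levels(
      edges: Seq[(String, String)],
      roots: Set[String],
      maxSupersteps: Int = 100): Map[String, Int] = {
    // Superstep 0: the initial message puts every root at level 1.
    var state: Map[String, Int] = roots.map(r => r -> 1).toMap
    // Vertices that changed in the last superstep; only these send messages.
    var active: Set[String] = roots
    var step = 0
    while (active.nonEmpty && step < maxSupersteps) {
      // Send: each active vertex sends its level to its direct children.
      val messages: Seq[(String, Int)] =
        edges.collect { case (src, dst) if active(src) => dst -> state(src) }
      // Merge: combine messages addressed to the same vertex (here: keep the largest).
      val merged: Map[String, Int] =
        messages.groupBy(_._1).map { case (dst, ms) => dst -> ms.map(_._2).max }
      // Vertex program: a vertex that received a message updates its own value.
      state = state ++ merged.map { case (dst, parentLevel) => dst -> (parentLevel + 1) }
      // Stop condition: the loop ends once no vertex received a message.
      active = merged.keySet.toSet
      step += 1
    }
    state
  }
}
```

For the edges A to B, A to C and B to D with root A, the loop runs three supersteps and ends with A at level 1, B and C at level 2, and D at level 3. The depth never has to be declared up front, because the loop ends by itself when no messages are left.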

Hierarchical data for the use case

The sample employee data (the empData array in the code below) is used to generate a top-down hierarchy. Here, an employee's manager is identified by the mgr_id field, which holds an emp_id value.

The following columns are added as part of the processing:

  • Level (Depth): the level of the vertex in the hierarchy
  • Path: the path from the topmost vertex to the current vertex in the hierarchy
  • Root: the topmost vertex in a hierarchy, which is useful when there are multiple hierarchies in the dataset
  • IsCyclic: if bad data introduces a cyclic relationship, flag it
  • IsLeaf: if a vertex has no children, flag it
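
To make the meaning of these columns concrete, here is a small non-distributed sketch that derives them by walking up a parent map. It is a reference for checking results on small inputs, not the Pregel implementation. The names HierarchyColumnsSketch, Derived and derive are hypothetical, the path is built from keys rather than names, and for cyclic input the level, root and path only describe the part of the chain walked before a key repeats.

```scala
object HierarchyColumnsSketch {
  final case class Derived(
      level: Int,
      root: String,
      path: String,
      isCyclic: Boolean,
      isLeaf: Boolean)

  // parentOf maps each key to its parent key; roots map to None.
  def derive(parentOf: Map[String, Option[String]]): Map[String, Derived] = {
    // A key is a leaf when no other key names it as its parent.
    val parents: Set[String] = parentOf.values.flatten.toSet

    parentOf.keys.map { key =>
      // Walk up towards the root, stopping if a key repeats (a cycle).
      var chain = List(key) // current vertex first, topmost vertex last
      var cyclic = false
      var next: Option[String] = parentOf(key)
      while (next.isDefined && !cyclic) {
        val p = next.get
        if (chain.contains(p)) cyclic = true
        else {
          chain = chain :+ p
          // A parent that is not a key of the map is treated as a root.
          next = parentOf.getOrElse(p, None)
        }
      }
      val topDown = chain.reverse
      key -> Derived(
        level = topDown.size,                  // the root is at level 1
        root = topDown.head,
        path = topDown.mkString("/", "/", ""), // e.g. /A/B/C
        isCyclic = cyclic,
        isLeaf = !parents.contains(key))
    }.toMap
  }
}
```

For a chain A, B (child of A), C (child of B), the vertex C gets level 3, root A, path /A/B/C, no cycle flag, and the leaf flag set.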


import org.apache.log4j.{Level, Logger}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

import scala.util.hashing.MurmurHash3

/**
 * Pregel API
 * @author zyh
 */
object PregelTest {

  // The code below demonstrates use of Graphx Pregel API - Scala 2.11+
  // functions to build the top down hierarchy

  // set up and call the Pregel API
  def calcTopLevelHierarcy(vertexDF: DataFrame, edgeDF: DataFrame): RDD[(Any,(Int,Any,String,Int,Int))] = {

    // create the vertex RDD
    // primary key, root, path
    val verticesRDD: RDD[(VertexId, (Any, Any, String))] = vertexDF
      .rdd // work on the underlying RDD[Row]: the Any-typed tuples below have no Dataset encoder
      .map{x=> (x.get(0),x.get(1) , x.get(2))}
      .map{ x => (MurmurHash3.stringHash(x._1.toString).toLong, ( x._1.asInstanceOf[Any], x._2.asInstanceOf[Any] , x._3.asInstanceOf[String]) ) }

    // create the edge RDD
    // top down relationship
    val EdgesRDD = edgeDF
      .rdd // work on the underlying RDD[Row], as for the vertices
      .map{x=> (x.get(0),x.get(1))}
      .map{ x => Edge(MurmurHash3.stringHash(x._1.toString).toLong, MurmurHash3.stringHash(x._2.toString).toLong,"topdown" )}

    // create graph
    val graph = Graph(verticesRDD, EdgesRDD).cache()

    val pathSeperator = """/"""

    // Initialization message
    // initialize id,level,root,path,iscyclic, isleaf
    val initialMsg = (0L,0,0.asInstanceOf[Any], List("dummy"),0,1)

    // add more dummy attributes to the vertices - id, level, root, path, isCyclic, existing value of current vertex to build path, isleaf, pk
    val initialGraph = graph.mapVertices((id, v) => (id, 0, v._2, List(v._3), 0, v._3, 1, v._1) )

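    val hrchyRDD = initialGraph.pregel(
      initialMsg,              // message delivered to every vertex in superstep 0
      Int.MaxValue,            // maximum number of iterations; Int.MaxValue means iterate until no messages remain
      EdgeDirection.Out)(      // send messages along the outgoing (parent to child) edges
      setMsg,
      sendMsg,
      mergeMsg)

    // build the path from the list
    val hrchyOutRDD = hrchyRDD.vertices.map{case(id,v) => (v._8,(v._2,v._3,pathSeperator + v._4.reverse.mkString(pathSeperator),v._5, v._7 )) }

    hrchyOutRDD
  }
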

  //Change the value of the vertex
  def setMsg(vertexId: VertexId, value: (Long,Int,Any,List[String], Int,String,Int,Any), message: (Long,Int, Any,List[String],Int,Int)): (Long,Int, Any,List[String],Int,String,Int,Any) = {

    // The first message received is the initialization message initialMsg
    println(s"Set value: $value  Received message:  $message")

    if (message._2 < 1) { //superstep 0 - initialize
      // every vertex receives initialMsg (level 0): move the vertex to level 1 and keep the rest
      (value._1, value._2 + 1, value._3, value._4, value._5, value._6, value._7, value._8)
    } else if ( message._5 == 1) { // set isCyclic (the message flags a cycle)
      (value._1, value._2, value._3, value._4, message._5, value._6, value._7,value._8)
    } else if ( message._6 == 0 ) { // set isleaf
      (value._1, value._2, value._3, value._4, value._5, value._6, message._6,value._8)
    } else { // set new values
      //( message._1,value._2+1, value._3, value._6 :: message._4 , value._5,value._6,value._7,value._8)

      ( message._1,value._2+1, message._3, value._6 :: message._4 , value._5,value._6,value._7,value._8)
    }
  }

  // Send values to vertices
  def sendMsg(triplet: EdgeTriplet[(Long,Int,Any,List[String],Int,String,Int,Any), _]): Iterator[(VertexId, (Long,Int,Any,List[String],Int,Int))] = {

    val sourceVertex: (VertexId, Int, Any, List[String], Int, String, Int, Any) = triplet.srcAttr
    val destinationVertex: (VertexId, Int, Any, List[String], Int, String, Int, Any) = triplet.dstAttr

    println(s" source: $sourceVertex   destination:   $destinationVertex")

    // check for a cycle, e.g. A manages B and B manages A
    if (sourceVertex._1 == triplet.dstId || sourceVertex._1 == destinationVertex._1) {

      println(s"Cycle detected    source: ${sourceVertex._1}        destination:  ${triplet.dstId}")

      if (destinationVertex._5 == 0) { //set iscyclic
        Iterator((triplet.dstId, (sourceVertex._1, sourceVertex._2, sourceVertex._3, sourceVertex._4, 1, sourceVertex._7)))
      } else {
        Iterator.empty // the cycle is already flagged on the destination, nothing more to send
      }
    } else {

      // A vertex without child nodes is a leaf node; the root node does not count. Therefore, the leaf nodes in the sample data are 3, 8 and 10
      if (sourceVertex._7==1) { // the source has a child but still carries the default isleaf = 1, so tell it that it is NOT a leaf
        Iterator((triplet.srcId, (sourceVertex._1,sourceVertex._2,sourceVertex._3, sourceVertex._4 ,0, 0 )))
      } else { // set new values
        Iterator((triplet.dstId, (sourceVertex._1, sourceVertex._2, sourceVertex._3, sourceVertex._4, 0, 1)))
      }
    }
  }

  // Receive values from all connected vertices
  def mergeMsg(msg1: (Long,Int,Any,List[String],Int,Int), msg2: (Long,Int, Any,List[String],Int,Int)): (Long,Int,Any,List[String],Int,Int) = {

    println(s"Merge value:   $msg1     $msg2")

    // dummy logic not applicable to the data in this usecase
    msg2
  }

  // Test with some sample data
  def main(args: Array[String]): Unit = {

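    // Mask log
    Logger.getLogger("org").setLevel(Level.ERROR)

    // assumption: local master and an illustrative app name; adjust both for your environment
    val spark: SparkSession = SparkSession
      .builder()
      .appName("PregelTest")
      .master("local[*]")
      .getOrCreate()
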
    val sc = spark.sparkContext

    // RDD to DF, implicit conversion
    import spark.implicits._

    val empData = Array(

      // If an edge refers to a parent that is not in the vertex list, GraphX creates that vertex with a null attribute while building the graph, and a NullPointerException follows
      ("EMP001", "Bob", "Baker", "CEO", null.asInstanceOf[String])
      , ("EMP002", "Jim", "Lake", "CIO", "EMP001")
      , ("EMP003", "Tim", "Gorab", "MGR", "EMP002")
      , ("EMP004", "Rick", "Summer", "MGR", "EMP002")
      , ("EMP005", "Sam", "Cap", "Lead", "EMP004")
      , ("EMP006", "Ron", "Hubb", "Sr.Dev", "EMP005")
      , ("EMP007", "Cathy", "Watson", "Dev", "EMP006")
      , ("EMP008", "Samantha", "Lion", "Dev", "EMP007")
      , ("EMP009", "Jimmy", "Copper", "Dev", "EMP007")
      , ("EMP010", "Shon", "Taylor", "Intern", "EMP009")
      // The NullPointerException is not caused by duplicate vertex data
      // It occurs when an edge refers to a parent that is missing from the vertex list; a null mgr_id is fine because such rows produce no edge
      , ("EMP011", "zhang", "xiaoming", "CTO", null)
    )

    // create dataframe with some partitions
    val empDF = sc.parallelize(empData, 3)
      .toDF("emp_id", "first_name", "last_name", "title", "mgr_id")
      .cache()

    // primary key , root, path - dataframe to graphx for vertices
    val empVertexDF = empDF.selectExpr("emp_id","concat(first_name,' ',last_name)","concat(last_name,' ',first_name)")

    // parent to child - dataframe to graphx for edges
    val empEdgeDF = empDF.selectExpr("mgr_id","emp_id").filter("mgr_id is not null")

    // call the function
    val empHirearchyExtDF: DataFrame = calcTopLevelHierarcy(empVertexDF,empEdgeDF)
      .map{ case(pk,(level,root,path,iscyclic,isleaf)) => (pk.asInstanceOf[String],level,root.asInstanceOf[String],path,iscyclic,isleaf)}
      .toDF("emp_id_pk", "level", "root", "path", "iscyclic", "isleaf")
      .cache()

    // extend original table with new columns
    val empHirearchyDF = empHirearchyExtDF.join(empDF , empDF.col("emp_id") === empHirearchyExtDF.col("emp_id_pk"))
      .drop("emp_id_pk") // the join key duplicates emp_id

    // print
    empHirearchyDF.show()
  }
}


Task execution

Spark applications are broken down into jobs, stages, and tasks. Because of its iterative nature, the Pregel API generates multiple jobs internally. A job is generated each time messages are passed to the vertices. Since the data may reside on different nodes, each job may end up with multiple shuffles.

Watch out for the long RDD lineage that builds up when working with large datasets.


The GraphX Pregel API is very powerful and can be used to solve iterative problems or any kind of graph computation.

Posted by newyear498 on Thu, 25 Nov 2021 15:14:49 -0800