# Graph computation: Processing hierarchical data using the Spark GraphX Pregel API

Keywords: Big Data Spark

Today, distributed computing engines are the backbone of many analytic, batch, and streaming applications. Spark provides many advanced functions (pivot, analytic window functions, etc.) to transform data out of the box. Sometimes you need to process hierarchical data or perform hierarchical calculations. Many database vendors provide features such as "recursive CTE (common table expression)" or the "CONNECT BY" SQL clause to query and transform hierarchical data. A recursive CTE is also called a recursive query or a parent-child query. In this article, we will look at how to solve this problem with Spark.

### Hierarchical data overview

Hierarchical data is data in which one item is the parent of another. It can be represented with a property graph object model: each row is a vertex (node), each connection is an edge (relationship) linking two vertices, and the columns are the properties of the vertices.

### Some use cases

• Financial calculations: sub-accounts roll up into their parent accounts, all the way to the topmost account
• Building an organizational hierarchy: manager-employee relationships together with the path from the top
• Generating a graph of links between web pages, with paths
• Any kind of iterative computation involving linked data

### Challenge

Querying hierarchical data in a distributed system poses some challenges:

• The data is connected, but it is distributed across partitions and nodes. An implementation should be optimized for iterating and for shuffling data only as needed.
• The depth of the graph changes over time. The solution should handle varying depths and should not force the user to define the depth before processing.

### Solution

One way to implement a recursive CTE in Spark is to use the GraphX Pregel API.

## What is the GraphX Pregel API?

GraphX is Spark's API for graphs and graph-parallel computation. Graph algorithms are iterative in nature: the properties of a vertex depend on the properties of the vertices it is connected to, either directly or indirectly (through other vertices). Pregel is a vertex-centric graph processing model developed by Google, and Spark GraphX provides an optimized variant of the Pregel API.

### How does the Pregel API work?

Pregel API processing consists of executing a sequence of supersteps.

#### Superstep 0:

• Pass the initial message to all vertices
• Each vertex sends its value as a message to the vertices it is directly connected to

#### Superstep 1:

• Receive the messages from the previous step
• Change (mutate) the vertex value
• Each vertex sends its value as a message to the vertices it is directly connected to

Repeat superstep 1 as long as there are messages to pass, and stop when no more messages are sent.
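
The superstep loop described above can be sketched without Spark at all. The following is a minimal single-machine illustration, not GraphX's implementation: the object `PregelSketch`, the function `levels`, the sample `edges`, and the choice of an integer "level" as the message are all assumptions made for this example. Every vertex starts at level 1, and in each superstep the vertices that just received a message send `level + 1` to their direct children, until no messages remain.

```scala
object PregelSketch {
  // Hypothetical parent -> child edges for a tiny hierarchy (illustrative data only).
  val edges: List[(String, String)] = List("A" -> "B", "B" -> "C", "A" -> "D")

  /** Computes the depth of every vertex by passing messages along parent -> child edges. */
  def levels(edges: List[(String, String)]): Map[String, Int] = {
    val vertices: Set[String] = edges.flatMap { case (src, dst) => List(src, dst) }.toSet

    // Superstep 0: every vertex receives the initial message and sets its level to 1.
    var state: Map[String, Int] = vertices.map(v => v -> 1).toMap
    var active: Set[String] = vertices

    // Guard against cycles in bad data (an assumption of this sketch).
    val maxSupersteps = vertices.size
    var supersteps = 0

    while (active.nonEmpty && supersteps < maxSupersteps) {
      // Send: each active vertex offers its level + 1 to the children it points to.
      val sent: List[(String, Int)] =
        edges.collect { case (src, dst) if active(src) => dst -> (state(src) + 1) }

      // Merge: keep one message per destination (here, the largest level).
      val inbox: Map[String, Int] =
        sent.groupBy(_._1).map { case (dst, msgs) => dst -> msgs.map(_._2).max }

      // Vertex program: each vertex that received a message updates its value.
      state = state ++ inbox.map { case (v, lvl) => v -> math.max(lvl, state(v)) }

      // Only vertices that received a message take part in the next superstep.
      active = inbox.keySet
      supersteps += 1
    }
    state
  }
}
```

Calling `PregelSketch.levels(PregelSketch.edges)` walks the tree one level per superstep and stops on its own once the deepest vertex has no children left to notify, which is why the depth does not need to be known in advance.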

##### Hierarchical data for the use case

We will use sample employee data (the `empData` array in the code below) to generate a top-down hierarchy. Each employee's manager is identified by the `mgr_id` field, which holds the manager's `emp_id` value. The following columns are added as part of the processing:

| Column | Description |
|---|---|
| Level (Depth) | The level of the vertex in the hierarchy |
| Path | The path from the topmost vertex to the current vertex in the hierarchy |
| Root | The topmost vertex in a hierarchy, which is useful when the dataset contains multiple hierarchies |
| IsCyclic | Flags a circular relationship, which indicates bad data |
| IsLeaf | Flags a vertex that has no child vertices |
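
To make these columns concrete, here is a small plain-Scala sketch (no Spark) that derives them for a parent-child map by walking up the parent links. The object `HierarchyColumns`, the `Node` case class, the `describe` function, and the sample ids in the usage note are assumptions for illustration only; the GraphX code later in this article computes the same kind of values by message passing rather than recursion.

```scala
object HierarchyColumns {
  // Derived columns for one row; the field names mirror the columns listed above.
  final case class Node(id: String, level: Int, path: String, root: String,
                        isCyclic: Boolean, isLeaf: Boolean)

  /** parentOf maps each id to its optional parent id (None marks a top-level row). */
  def describe(parentOf: Map[String, Option[String]]): Map[String, Node] = {
    // Every id that appears as somebody's parent has at least one child.
    val parents: Set[String] = parentOf.values.collect { case Some(p) => p }.toSet

    // Walk from a row up to its root, collecting ids; stop if an id repeats (a cycle).
    @annotation.tailrec
    def chain(id: String, seen: List[String]): (List[String], Boolean) =
      parentOf.getOrElse(id, None) match {
        case Some(p) if (id :: seen).contains(p) => (id :: seen, true)
        case Some(p)                             => chain(p, id :: seen)
        case None                                => (id :: seen, false)
      }

    parentOf.keys.map { id =>
      // topDown lists the ids from the root down to the current row.
      // For rows caught in a cycle, level and root are not meaningful.
      val (topDown, cyclic) = chain(id, Nil)
      id -> Node(id, topDown.size, topDown.mkString("/", "/", ""), topDown.head,
                 cyclic, !parents.contains(id))
    }.toMap
  }
}
```

For example, `HierarchyColumns.describe(Map("A" -> None, "B" -> Some("A"), "C" -> Some("B")))` gives row `C` a level of 3, the path `/A/B/C`, the root `A`, and marks it as a leaf, while `A` sits at level 1 and is not a leaf.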

### Code

```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

import scala.util.hashing.MurmurHash3

/**
 * Pregel API
 * @author zyh
 */
object PregelTest {

  // The code below demonstrates use of the GraphX Pregel API (Scala 2.11+)
  // and defines the functions that build the top-down hierarchy.

  // Set up and call the Pregel API
  def calcTopLevelHierarcy(vertexDF: DataFrame, edgeDF: DataFrame): RDD[(Any, (Int, Any, String, Int, Int))] = {

    // create the vertex RDD
    // columns: primary key, root, path
    val verticesRDD: RDD[(VertexId, (Any, Any, String))] = vertexDF
      .rdd
      .map { x => (x.get(0), x.get(1), x.get(2)) }
      .map { x => (MurmurHash3.stringHash(x._1.toString).toLong, (x._1.asInstanceOf[Any], x._2.asInstanceOf[Any], x._3.asInstanceOf[String])) }

    // create the edge RDD
    // top-down relationship (parent -> child)
    val EdgesRDD = edgeDF
      .rdd
      .map { x => (x.get(0), x.get(1)) }
      .map { x => Edge(MurmurHash3.stringHash(x._1.toString).toLong, MurmurHash3.stringHash(x._2.toString).toLong, "topdown") }

    // create graph
    val graph = Graph(verticesRDD, EdgesRDD).cache()

    val pathSeperator = """/"""

    // initial message: id, level, root, path, isCyclic, isLeaf
    val initialMsg = (0L, 0, 0.asInstanceOf[Any], List("dummy"), 0, 1)

    // add more dummy attributes to the vertices:
    // id, level, root, path, isCyclic, existing value of current vertex to build path, isLeaf, pk
    val initialGraph = graph.mapVertices((id, v) => (id, 0, v._2, List(v._3), 0, v._3, 1, v._1))

    val hrchyRDD = initialGraph.pregel(
      initialMsg,
      Int.MaxValue, // maximum number of iterations; Int.MaxValue means iterate until no messages remain
      EdgeDirection.Out)(
      setMsg,
      sendMsg,
      mergeMsg)

    // build the path from the list
    val hrchyOutRDD = hrchyRDD.vertices.map { case (id, v) => (v._8, (v._2, v._3, pathSeperator + v._4.reverse.mkString(pathSeperator), v._5, v._7)) }

    hrchyOutRDD
  }

  // Vertex program: update the value of a vertex from the received message
  def setMsg(vertexId: VertexId, value: (Long, Int, Any, List[String], Int, String, Int, Any), message: (Long, Int, Any, List[String], Int, Int)): (Long, Int, Any, List[String], Int, String, Int, Any) = {

    // The first message received is the initialization message initialMsg
    println(s"Set value: $value  Received message:  $message")

    if (message._2 < 1) { // superstep 0 - initialize
      (value._1, value._2 + 1, value._3, value._4, value._5, value._6, value._7, value._8)
    }
    else if (message._5 == 1) { // set isCyclic (the message reports a cycle)
      (value._1, value._2, value._3, value._4, message._5, value._6, value._7, value._8)
    } else if (message._6 == 0) { // set isLeaf
      (value._1, value._2, value._3, value._4, value._5, value._6, message._6, value._8)
    }
    else { // set new values
      //( message._1,value._2+1, value._3, value._6 :: message._4 , value._5,value._6,value._7,value._8)

      (message._1, value._2 + 1, message._3, value._6 :: message._4, value._5, value._6, value._7, value._8)
    }
  }

  // Send messages along an edge
  def sendMsg(triplet: EdgeTriplet[(Long, Int, Any, List[String], Int, String, Int, Any), _]): Iterator[(VertexId, (Long, Int, Any, List[String], Int, Int))] = {

    val sourceVertex: (VertexId, Int, Any, List[String], Int, String, Int, Any) = triplet.srcAttr
    val destinationVertex: (VertexId, Int, Any, List[String], Int, String, Int, Any) = triplet.dstAttr

    println(s" source: $sourceVertex   destination:   $destinationVertex")

    // Check for a cycle, e.g. A is the manager of B and B is the manager of A
    if (sourceVertex._1 == triplet.dstId || sourceVertex._1 == destinationVertex._1) {

      println(s"Cycle detected    source: ${sourceVertex._1}        destination:  ${triplet.dstId}")

      if (destinationVertex._5 == 0) { // set isCyclic
        Iterator((triplet.dstId, (sourceVertex._1, sourceVertex._2, sourceVertex._3, sourceVertex._4, 1, sourceVertex._7)))
      } else {
        Iterator.empty
      }
    }
    else {

      // Leaf check: a vertex with no child vertices is a leaf (the root does not count).
      // In the sample data the leaves are therefore EMP003, EMP008 and EMP010.
      if (sourceVertex._7 == 1) // the source still carries the leaf flag but has a child: tell it that it is NOT a leaf
      {
        Iterator((triplet.srcId, (sourceVertex._1, sourceVertex._2, sourceVertex._3, sourceVertex._4, 0, 0)))
      }
      else { // set new values
        Iterator((triplet.dstId, (sourceVertex._1, sourceVertex._2, sourceVertex._3, sourceVertex._4, 0, 1)))
      }
    }
  }

  // Merge the messages a vertex receives from all of its connected vertices
  def mergeMsg(msg1: (Long, Int, Any, List[String], Int, Int), msg2: (Long, Int, Any, List[String], Int, Int)): (Long, Int, Any, List[String], Int, Int) = {

    println(s"Merge value:   $msg1     $msg2")

    // dummy logic, not applicable to the data in this use case
    msg2
  }

  // Test with some sample data
  def main(args: Array[String]): Unit = {

    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

    val spark: SparkSession = SparkSession
      .builder
      .appName(s"${this.getClass.getSimpleName}")
      .master("local")
      .getOrCreate()

    val sc = spark.sparkContext

    // RDD to DF, implicit conversion
    import spark.implicits._

    val empData = Array(

      // The top-level row must exist as a vertex. GraphX creates a vertex with a null attribute
      // for any edge endpoint missing from the vertex list, which leads to a NullPointerException.
      ("EMP001", "Bob", "Baker", "CEO", null.asInstanceOf[String])
      , ("EMP002", "Jim", "Lake", "CIO", "EMP001")
      , ("EMP003", "Tim", "Gorab", "MGR", "EMP002")
      , ("EMP004", "Rick", "Summer", "MGR", "EMP002")
      , ("EMP005", "Sam", "Cap", "Lead", "EMP004")
      , ("EMP006", "Ron", "Hubb", "Sr.Dev", "EMP005")
      , ("EMP007", "Cathy", "Watson", "Dev", "EMP006")
      , ("EMP008", "Samantha", "Lion", "Dev", "EMP007")
      , ("EMP009", "Jimmy", "Copper", "Dev", "EMP007")
      , ("EMP010", "Shon", "Taylor", "Intern", "EMP009")
      // The NullPointerException is unrelated to duplicate vertex data: it occurs when a mgr_id
      // cannot be found in the vertex list. A null mgr_id is fine, but every non-null mgr_id
      // must exist as an emp_id.
      , ("EMP011", "zhang", "xiaoming", "CTO", null)
    )

    // create dataframe with some partitions
    val empDF = sc.parallelize(empData, 3)
      .toDF("emp_id", "first_name", "last_name", "title", "mgr_id")
      .cache()

    // primary key, root, path - dataframe to graphx for vertices
    val empVertexDF = empDF.selectExpr("emp_id", "concat(first_name,' ',last_name)", "concat(last_name,' ',first_name)")

    // parent to child - dataframe to graphx for edges
    val empEdgeDF = empDF.selectExpr("mgr_id", "emp_id").filter("mgr_id is not null")

    // call the function
    val empHirearchyExtDF: DataFrame = calcTopLevelHierarcy(empVertexDF, empEdgeDF)
      .map { case (pk, (level, root, path, iscyclic, isleaf)) => (pk.asInstanceOf[String], level, root.asInstanceOf[String], path, iscyclic, isleaf) }
      .toDF("emp_id_pk", "level", "root", "path", "iscyclic", "isleaf").cache()

    // extend original table with new columns
    val empHirearchyDF = empHirearchyExtDF.join(empDF, empDF.col("emp_id") === empHirearchyExtDF.col("emp_id_pk"))
      .selectExpr(
        "emp_id", "first_name", "last_name",
        "title", "mgr_id",
        "level",
        "root",
        "path",
        "iscyclic", "isleaf"
      )

    // print
    empHirearchyDF.show()

  }
}
```
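
One detail in the code above deserves a note: GraphX vertex ids are `Long` values, so the string keys are converted with `MurmurHash3.stringHash`. The sketch below isolates that conversion. The object `VertexIds` and the helpers `toVertexId` and `collisions` are assumptions made for this example, not part of GraphX. Because the hash is only 32 bits wide, two different keys could in principle map to the same id, so on large datasets it may be worth checking for duplicates before building the graph.

```scala
import scala.util.hashing.MurmurHash3

object VertexIds {
  // Hypothetical helper: the same conversion the code above applies to emp_id and mgr_id.
  def toVertexId(key: String): Long = MurmurHash3.stringHash(key).toLong

  // Groups of distinct keys that hash to the same id (empty when there are no collisions).
  def collisions(keys: Seq[String]): Map[Long, Seq[String]] =
    keys.distinct.groupBy(toVertexId).filter { case (_, ks) => ks.size > 1 }
}
```

Running `VertexIds.collisions` over the full list of keys before calling `Graph(...)` is one inexpensive way to confirm that every key received its own vertex id.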

### Output