SparkSQL: Viewing and Debugging the Generated Code

Keywords: Big Data, Spark, SQL, Scala, Apache

Websites and quite a few books introduce how Spark SQL (DataFrames) turns the operations you write into the statements that finally run. This article starts from a simple, low-level problem and ends with a look at the generated code to find its root cause, giving along the way a brief introduction to how to debug SparkSQL.

Where the problem came from:

case class Access(id: String, url: String, time: String) {
  def compute(): (String, Int) = ???
}
object Access {
  def apply(row: Row): Option[Access] = ???
}

// main
df.map(Access(_)).filter(!_.isEmpty).map(_.get).map(_.compute)

After running this, compute always throws a NullPointerException. Coming from RDDs and plain Scala this is baffling: how can the record turn into Access(null,null,null)? Although df.flatMap(Access(_)).map(_.compute) works fine, I still want to see what SparkSQL has actually done!
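For reference, here is a minimal sketch of the two routes, with hypothetical bodies for compute() and Access.apply(Row) (the original post omits them); it only illustrates the shape of the problem, not the real parsing logic:

import org.apache.spark.sql.Row

case class Access(id: String, url: String, time: String) {
  // Hypothetical body: dereferences the fields, so null fields blow up right here.
  def compute(): (String, Int) = (id, url.length)
}

object Access {
  // Hypothetical body: keep only rows that parse into a complete record.
  def apply(row: Row): Option[Access] =
    if (row.anyNull) None
    else Some(Access(row.getString(0), row.getString(1), row.getString(2)))
}

// With sqlContext.implicits._ in scope:
//
// NPE route: every map/filter hop goes through an encoder, and a None that has been
// serialized and deserialized can come back as Some(Access(null,null,null)), so the
// isEmpty filter no longer protects compute().
//   df.map(Access(_)).filter(!_.isEmpty).map(_.get).map(_.compute())
//
// Working route: flatMap drops the None values before any Option is ever encoded.
//   df.flatMap(Access(_)).map(_.compute())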

What did SparkSQL do?

With Spark RDDs, the operation is written out explicitly in each RDD's compute method. SparkSQL's operations, however, are eventually turned into a logical plan, and at first glance it is impossible to see what it actually does.

In fact, SparkSQL provides an explain method for viewing the execution plan, much like EXPLAIN for database SQL. (The full code is posted here; uncomment the pieces yourself as needed.)

import java.io.File

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object AccessAnalyser {

  def main(args: Array[String]): Unit = {

    // conf

    // clean
    new File("target/generated-sources").listFiles().filter(_.isFile()).foreach(_.delete)

    sys.props("org.codehaus.janino.source_debugging.enable") = "true"
    sys.props("org.codehaus.janino.source_debugging.dir") = "target/generated-sources"

    val input = "r:/match10.dat"
    val output = "r:/output"
    def delete(f: File): Unit = {
      if (f.isDirectory) f.listFiles().foreach(delete)
      f.delete()
    }
    delete(new File(output))

    // program

    val conf = new SparkConf().setAppName("DPI Analyser").setMaster("local[10]")
    // fix windows path.
    conf.set(/*SQLConf.WAREHOUSE_PATH*/ "spark.sql.warehouse.dir", "spark-warehouse")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false") // Use first line of all files as header
      .option("quote", "'")
      .option("escape", "'")
      .option("delimiter", ",")
      .load(input)

    df
      .flatMap(Access(_))
      //      .map(Access(_)).filter((t: Option[Access]) => !t.isEmpty).map(_.get) // SparkSQL does not handle Option well
      .map(_.compute)
      .explain(true)
      //      .toDF("id", "score")
      //      .groupBy("id").agg(sum("score") as "score")
      //      .sort("score", "id")
      //      .repartition(1)
      //      .write.format("com.databricks.spark.csv").save(output)

    sc.stop()
  }

}

Running the code above prints the task's execution plan in the console window:

== Parsed Logical Plan ==
'SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true], top level non-flat input object)._1, true) AS _1#20, assertnotnull(input[0, scala.Tuple2, true], top level non-flat input object)._2 AS _2#21]
+- 'MapElements <function1>, obj#19: scala.Tuple2
   +- 'DeserializeToObject unresolveddeserializer(newInstance(class com.github.winse.spark.access.Access)), obj#18: com.github.winse.spark.access.Access
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.github.winse.spark.access.Access, true], top level non-flat input object).id, true) AS id#12, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.github.winse.spark.access.Access, true], top level non-flat input object).url, true) AS url#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.github.winse.spark.access.Access, true], top level non-flat input object).time, true) AS time#14]
         +- MapPartitions <function1>, obj#11: com.github.winse.spark.access.Access
            +- DeserializeToObject createexternalrow(_c0#0.toString, _c1#1.toString, _c2#2.toString, StructField(_c0,StringType,true), StructField(_c1,StringType,true), StructField(_c2,StringType,true)), obj#10: org.apache.spark.sql.Row
               +- Relation[_c0#0,_c1#1,_c2#2] csv

== Physical Plan ==
*SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true], top level non-flat input object)._1, true) AS _1#20, assertnotnull(input[0, scala.Tuple2, true], top level non-flat input object)._2 AS _2#21]
+- *MapElements <function1>, obj#19: scala.Tuple2
   +- MapPartitions <function1>, obj#11: com.github.winse.spark.access.Access
      +- DeserializeToObject createexternalrow(_c0#0.toString, _c1#1.toString, _c2#2.toString, StructField(_c0,StringType,true), StructField(_c1,StringType,true), StructField(_c2,StringType,true)), obj#10: org.apache.spark.sql.Row
         +- *Scan csv [_c0#0,_c1#1,_c2#2] Format: CSV, InputPaths: file:/r:/match10.dat, PushedFilters: [], ReadSchema: struct<_c0:string,_c1:string,_c2:string>

OK, now that we can see the execution plan: what does the generated code look like, and how do we debug it?
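As a side note (a sketch based on the Spark 2.x API, not something from the original post): the generated Java source can already be printed without patching anything, via the helpers in org.apache.spark.sql.execution.debug. Stepping through that code in a debugger is what the hack below is for.

import org.apache.spark.sql.execution.debug._   // adds debugCodegen() to Datasets/DataFrames

// with the same sqlContext, implicits and df as in the program above
df.flatMap(Access(_))
  .map(_.compute)
  .debugCodegen()   // prints the generated Java source for each whole-stage-codegen subtree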

Hack source code

Before debugging, patch the code below, recompile catalyst for debugging, and replace the spark-catalyst_2.11 jar under maven:

winse@Lenovo-PC ~/git/spark/sql/catalyst
$ git diff .
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index 16fb1f6..56bfbf7 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -854,7 +854,7 @@ object CodeGenerator extends Logging {
     val parentClassLoader = new ParentClassLoader(Utils.getContextOrSparkClassLoader)
     evaluator.setParentClassLoader(parentClassLoader)
     // Cannot be under package codegen, or fail with java.lang.InstantiationException
-    evaluator.setClassName("org.apache.spark.sql.catalyst.expressions.GeneratedClass")
     evaluator.setDefaultImports(Array(
       classOf[Platform].getName,
       classOf[InternalRow].getName,
@@ -875,12 +875,14 @@ object CodeGenerator extends Logging {

     logDebug({
       // Only add extra debugging info to byte code when we are going to print the source code.
-      evaluator.setDebuggingInformation(true, true, false)
+      evaluator.setDebuggingInformation(true, true, true)
       s"\n$formatted"
     })

     try {
-      evaluator.cook("generated.java", code.body)
+      evaluator.cook(code.body)
       recordCompilationStats(evaluator)
     } catch {
       case e: Exception =>

E:\git\spark\sql\catalyst>mvn clean package -DskipTests -Dmaven.test.skip=true

SparkSQL generates code with janino, and the official documentation describes how to debug it: http://janino-compiler.github.io/janino/#debugging . A brief explanation of the three modifications (a standalone janino sketch follows this list):

  • Look at the org.codehaus.janino.Scanner constructor: when source debugging is enabled and optionalFileName == null, the generated source code is saved to a temporary file.
  • Commenting out setClassName was not my first idea. While slowly reworking CodeGenerator#doCompile toward the example provided on the official website, I changed setClassName to setExtendedClass and the source page popped up in the debugger. Since setExtendedClass is already called a few lines below, simply commenting out setClassName is enough.
  • The reason variables in the generated source could not be inspected is that this information is dropped at compile time; set debugVars to true.
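A minimal, standalone sketch of that janino setup (my own illustration following the janino debugging doc linked above, not Spark's code; it assumes janino is on the classpath):

import java.io.File

import org.codehaus.janino.ClassBodyEvaluator

object JaninoDebugSketch {
  def main(args: Array[String]): Unit = {
    // Tell janino to keep the generated source where the IDE can later find it.
    sys.props("org.codehaus.janino.source_debugging.enable") = "true"
    sys.props("org.codehaus.janino.source_debugging.dir") = "target/generated-sources"
    new File("target/generated-sources").mkdirs()

    val evaluator = new ClassBodyEvaluator()
    // As in the patched CodeGenerator: no setClassName(...), so janino's temporary
    // source file is the one the debugger will later open.
    // debugSource, debugLines, debugVars -- the third flag makes local variables visible.
    evaluator.setDebuggingInformation(true, true, true)
    // cook(code) instead of cook("generated.java", code), as in the diff above.
    evaluator.cook("public int answer() { int a = 40; int b = 2; return a + b; }")

    val clazz = evaluator.getClazz
    val instance = clazz.newInstance().asInstanceOf[AnyRef]
    // prints 42 -- put a breakpoint inside answer() to step into the generated source
    println(clazz.getMethod("answer").invoke(instance))
  }
}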

Running and debugging

First, prepare for debugging:

  • Set a breakpoint in the compute method and run in debug mode.
  • Modify the log4j log level: log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen=DEBUG (a minimal log4j.properties fragment follows this list)
  • Import the project into Eclipse (IDEA does not pop up the generated source).
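For reference, a minimal log4j.properties with that logger turned up; the appender lines are the usual Spark conf/log4j.properties.template defaults, so adjust them to whatever your project already uses:

log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# print the generated code (CodeGenerator logs it at DEBUG)
log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen=DEBUG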

Then run. Click the GeneratedIterator frame in the Debug view, click the Find Source button in the source view that pops up, then use Edit Source Lookup Path to add the path target/generated-sources (note: use an absolute path here)! From there, just step through.

Debugging the generated code gives a much better understanding of the execution plan that explain printed earlier. Once you see the code, the original Access(null,null,null) is easy to understand: it is a problem of deserializing the serialized object back into its fields.

From: http://www.winseliu.com/blog/2016/10/12/sparksql-view-and-debug-generatecode/
