Solutions for Spark data skew

Keywords: Big Data, Hive, Spark

Data skew caused by Shuffle

When data skew occurs during a Shuffle, we generally follow these troubleshooting steps

① Check the Web UI to review how tasks are executing within each Job's Stages, and look for tasks whose execution time is obviously too long

② If a task fails, check the exception stack trace in the corresponding log to see whether an out-of-memory error occurred

③ Sample the data to identify the skewed keys

val result = rdd
  // withReplacement: whether sampled elements are put back; true means the same element may be drawn more than once
  // fraction: the fraction of the data to sample
  // seed: the random seed used for sampling (defaulted here)
  // Sample 50% of the data
  .sample(true, 0.5)
  .map((_, 1))
  .reduceByKey(_ + _)
  // Sort by count descending and take the Top 3 skewed keys
  .sortBy(_._2, ascending = false)
  .take(3)

Solution 1: use Hive ETL to preprocess data

Scenario description

The skew originates from the Hive table: the data in the Hive table is unevenly distributed (for example, one key corresponds to 1 million rows while other keys correspond to only a few dozen), and we need to frequently analyze this data with Spark, which leads to data skew

Scheme description

Evaluate whether the data can be preprocessed in Hive (aggregating by key during ETL, or joining with other tables in advance). The subsequent Spark job then no longer needs to perform the original skew-prone operation, because the preprocessing has already been done
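For illustration, a minimal Spark-side sketch, assuming a hypothetical Hive table orders_agg that was already aggregated by key during the Hive ETL stage (the table and object names are placeholders, not from the original article):

import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadPreAggregated {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets Spark read tables registered in the Hive metastore
    val spark: SparkSession = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      .enableHiveSupport()
      .getOrCreate()

    // The heavy, skew-prone aggregation was done during Hive ETL;
    // the Spark job only reads the pre-aggregated result, so no skewed shuffle runs here
    val preAggregated: DataFrame = spark.table("orders_agg")
    preAggregated.show()

    spark.stop()
  }
}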

Advantages and disadvantages of the scheme

① Advantages

Spark no longer experiences the skew, because the skewed computation is moved into Hive's ETL stage

② Shortcomings

The skew still occurs, only now in Hive's ETL phase; refer to Hive SQL optimization to handle it there

Solution 2: filter a few keys that cause skew

Scenario description

A small number of keys cause the data skew

Scheme description

If the skewed keys are useless, or filtering out their data does not affect the result, consider simply filtering out the skewed keys

// Keep only the records whose key is not the skewed key "xxx"
inputRdd.filter(!_.equals("xxx"))
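If the skewed keys are not known in advance, a minimal sketch that finds them by sampling on each run and then filters them out (assuming an existing SparkContext sc and the inputRdd above; the 0.1 sample fraction and Top-3 cutoff are illustrative assumptions):

// Sample the RDD, count keys, and treat the most frequent ones as skewed
val skewedKeys: Set[String] = inputRdd
  .sample(withReplacement = false, 0.1)
  .map((_, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .take(3)
  .map(_._1)
  .toSet

// Broadcast the small set of skewed keys and drop their records
val skewedKeysBc = sc.broadcast(skewedKeys)
val filteredRdd = inputRdd.filter(key => !skewedKeysBc.value.contains(key))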

Advantages and disadvantages of the scheme

① Advantages

The skewed keys are removed entirely, so the skew is avoided altogether

② Shortcomings

Applicable scenarios are rare

Solution 3: increase shuffle parallelism

Scenario description

The skewed keys must still be processed; in that case, increasing parallelism is the first thing to try

Scheme description

Pass a higher parallelism directly to the shuffle operator when it is executed; this per-operator setting takes the highest priority. More tasks then share the keys that previously went to a single task, which spreads the work and reduces computing time

val result = rdd
  .map((_, 1))
  // Increase parallelism to 500
  .reduceByKey(_ + _, 500)
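For Spark SQL / DataFrame jobs, the comparable knob is the shuffle partition count (default 200); a minimal sketch, assuming an existing SparkSession named spark and an illustrative value of 500:

// Raise the number of partitions used by Spark SQL shuffles
spark.conf.set("spark.sql.shuffle.partitions", "500")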

Advantages and disadvantages of the scheme

① Advantages

Effectively mitigates the impact of data skew

② Shortcomings

The skew is only alleviated, not eliminated. An extreme case can still occur: no matter how many tasks are added, all the data for a single skewed key is still assigned to one task

This solution is therefore generally used in combination with other solutions

Solution 4: two-stage aggregation (local aggregation + global aggregation)

Scheme description

Local aggregation: first prefix each key with a random number, so that one skewed key is split into several different keys

# Before adding random numbers
(hello,1)(hello,1)(hello,1)(hello,1)(hello,1)

# After adding random numbers
(1_hello,1)(1_hello,1)(2_hello,1)(2_hello,1)(3_hello,1)

# The result of local aggregation, such as the reduceByKey operation
(1_hello,2)(2_hello,2)(3_hello,1)

Global aggregation: strip the random prefixes and aggregate again over the original keys

# The result of local aggregation, such as the reduceByKey operation
(1_hello,2)(2_hello,2)(3_hello,1)

# Global aggregation
(hello,5)

Code example

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

import scala.util.Random

object TwoStageAggregation {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[2]").getOrCreate()

    val sc: SparkContext = spark.sparkContext

    // Suppose the skewed key is A
    val inputRdd: RDD[String] = sc.parallelize(Array(
      "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
      "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
      "A", "A", "A", "A", "B", "B", "B", "B", "B", "B",
      "B", "B", "C", "D", "E", "F", "G", "B", "B", "B"
    ))

    val random = new Random(10)

    // Local aggregation
    val mapRdd: RDD[(String, Int)] = inputRdd
      // Prefix each key with a random number in [0, 10) so a skewed key is spread over several keys
      .map(ele => (random.nextInt(10) + "_" + ele, 1))
      // Aggregate the prefixed keys
      .reduceByKey(_ + _, 1000)

    // Global aggregation
    val resultRdd: RDD[(String, Int)] = mapRdd
      // Strip the random prefix to restore the original key
      .map(ele => (ele._1.split("_")(1), ele._2))
      // Aggregate again over the original keys
      .reduceByKey(_ + _)

    println(resultRdd.collect().toBuffer)

    sc.stop()

  }
}

Advantages and disadvantages of the scheme

① Advantages

For aggregation-style shuffles (such as reduceByKey), this directly solves or greatly reduces data skew and significantly improves Spark performance

② Shortcomings

Applicable scenarios are limited: it only works for aggregation-style shuffles and cannot solve the skew produced by join operations

Data skew caused by Join shuffles

Solution 1: convert reduce join to map join

Usage scenario (large table joined with small table)

When joining RDDs or using join statements in Spark SQL, one RDD or table is relatively small (a few hundred megabytes, or one or two gigabytes)

Scheme description

Instead of using the join operator, implement the join with a broadcast variable and a map operator. This completely avoids any shuffle, and therefore completely avoids data skew

In other words, the data of the smaller RDD is pulled into the Driver's memory (for example with collect), a broadcast variable is created from it, and that variable is broadcast to every Executor

Code example

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object MapJoin {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[2]").getOrCreate()

    val sc: SparkContext = spark.sparkContext

    // Small table
    val student: List[(String, String)] = List(
      ("1", "Kyle"), ("2", "Jack"), ("3", "Lucy"), ("4", "Amy")
    )

    // Large table
    val score: RDD[(String, Int)] = sc.parallelize(List(
      ("1", 90), ("2", 80), ("3", 65), ("4", 77),
      ("5", 68), ("6", 69), ("7", 57), ("8", 99)
    ))

    // Broadcast the small table to every Executor
    val broadcast: Broadcast[List[(String, String)]] = sc.broadcast(student)

    val result: RDD[(String, Int)] = score
      .map {
        case (id, stuScore) =>
          var name = ""
          // Read the broadcast data and match it against the current record by hand
          for ((k, v) <- broadcast.value) {
            if (k.equals(id)) {
              // Look up the student's name by id
              name = v
            }
          }
          (name, stuScore)
      }
      // Filter out records that did not match any student
      .filter(_._1.nonEmpty)

    println(result.collect().toBuffer)

    sc.stop()
  }
}
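The same idea applies in the DataFrame/Spark SQL API. A minimal sketch using the broadcast hint, where scoreDf (large) and studentDf (small) are assumed DataFrames sharing an "id" column (names are placeholders, not from the original article):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// Hint Spark to broadcast the small side so the join runs as a broadcast (map-side) join, with no shuffle
val joined: DataFrame = scoreDf.join(broadcast(studentDf), "id")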

Advantages and disadvantages of the scheme

① Advantages

It works very well for skew caused by join operations: since there is no shuffle at all, there is no data skew

② Shortcomings

Applicable scenarios are limited: it only works for one large table joined with one small table. When the small table is broadcast, both the Driver and every Executor keep a full copy of its data; if that data is too large, an OOM will occur. This solution is therefore not suitable for joining two large tables

Solution 2: sample the skewed keys and split the join (only a few keys are skewed)

Scheme description

When joining two RDDs/Hive tables that are both large, first inspect the key distribution in each of them

If the skew occurs because a few keys in one RDD/Hive table have too much data, while the keys in the other RDD/Hive table are evenly distributed, use this solution

Implementation idea

Note: both rdd1 and rdd2 are large, but rdd1 is skewed and rdd2 is evenly distributed

① Sample the RDD that contains the few skewed keys to find those keys

// rdd1 is the skewed RDD described above; count key occurrences in a sample of it
val skewedKey: Array[String] = rdd1
  .sample(false, 0.5)
  .map(ele => (ele._1, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, false)
  // Sort by the number of records per key and take the Top 3 skewed keys
  .take(3)
  .map(_._1)

② From the skewed RDD, filter the records whose keys are skewed into a separate RDD

val sc: SparkContext = spark.sparkContext
// rdd1 contains the skewed keys; assume it is actually large
val rdd1: RDD[(String, String)] = sc.parallelize(List(
      ("aa", "Kyle"), ("bb", "Jack"), ("cc", "Lucy"), ("aa", "Amy")
    ))
// Keep only the records whose key is one of the skewed keys
val skewedRdd: RDD[(String, String)] = rdd1.filter(ele => skewedKey.contains(ele._1))

③ From the same skewed RDD, filter the records whose keys are not skewed into another separate RDD

// Keep only the records whose key is not skewed
val notSkewedRdd: RDD[(String, String)] = rdd1.filter(ele => !skewedKey.contains(ele._1))

④ From the evenly distributed RDD (rdd2), filter the records whose keys are skewed, expand each of them n times with numeric prefixes, and form a separate RDD

// From the even RDD, keep only the records whose key is skewed, then expand them
val expandRdd: RDD[(String, String)] = rdd2
  .filter(ele => skewedKey.contains(ele._1))
  .flatMap(ele => {
    import scala.collection.mutable.ListBuffer
    val temp: ListBuffer[(String, String)] = ListBuffer()
    // Expand each record 100 times, one copy per prefix in [0, 100),
    // matching the Random.nextInt(100) prefixes used in the next step
    for (i <- 0 until 100) {
      temp += ((i + "_" + ele._1, ele._2))
    }
    temp
  })

⑤ Prefix each record of the skewed RDD with a random number in the same range, join it with the expanded RDD, and strip the prefixes afterwards

// Needs: import scala.util.Random
val random = new Random()

val joinRdd1: RDD[(String, (String, String))] = skewedRdd
  // Prefix each skewed record with a random number in [0, 100), matching the expansion prefixes
  .map(ele => (random.nextInt(100) + "_" + ele._1, ele._2))
  .join(expandRdd)
  // Strip the random prefix to restore the original key
  .map(ele => (ele._1.split("_")(1), ele._2))

⑥ Join the non-skewed part of rdd1 with rdd2 directly

val joinRdd2: RDD[(String, (String, String))] = notSkewedRdd.join(rdd2)

⑦ Union the results of the two joins

val result: RDD[(String, (String, String))] = joinRdd1.union(joinRdd2)

Advantages and disadvantages of the scheme

① Advantages

This solution works well when only a few keys are skewed: only the data for those keys is expanded, so the extra resource cost stays limited

② Shortcomings

If a large number of keys are skewed, this solution is not applicable

Solution 3: join using random prefixes and an expanded RDD

Scheme description

When joining two large RDDs/Hive tables, if one side contains many skewed keys, each of which may correspond to more than 10,000 records, use this solution

Implementation idea

① Expand the evenly distributed RDD n times, adding every numeric prefix to each record

// Expand every record of the evenly distributed RDD (rdd1 here)
val expandRdd: RDD[(String, String)] = rdd1
  .flatMap(ele => {
    import scala.collection.mutable.ListBuffer
    val temp: ListBuffer[(String, String)] = ListBuffer()
    // Expand each record 100 times, one copy per prefix in [0, 100),
    // matching the Random.nextInt(100) prefixes used in the next step
    for (i <- 0 until 100) {
      temp += ((i + "_" + ele._1, ele._2))
    }
    temp
  })

② Prefix every record of the skewed RDD with a random number in the same range

// Needs: import scala.util.Random
val random = new Random()

val skewedRdd: RDD[(String, String)] = rdd2
  .map(ele => (random.nextInt(100) + "_" + ele._1, ele._2))

③ Join the two RDDs, then strip the prefixes

val result: RDD[(String, (String, String))] = expandRdd
  .join(skewedRdd)
  // Strip the random prefix to restore the original key
  .map(ele => (ele._1.split("_")(1), ele._2))

Advantages and disadvantages of the scheme

① Advantages

Handles join skew effectively and delivers a noticeable performance improvement

② Shortcomings

Because the whole RDD is expanded, the memory and other resource consumption is relatively high
