Data skew caused by Shuffle
When data skew occurs during a Shuffle, we generally follow these troubleshooting steps:
① Check the Web UI to see how the tasks in each Stage of each Job are executing, and whether some tasks take noticeably longer than others
② If a task fails, check the exception stack trace in the corresponding log to see whether there is a memory overflow (OOM)
③ Sample the data to find the skewed keys
val result = rdd
  // withReplacement: whether a record is put back after being drawn.
  //   true means it is put back, so the same record may be drawn more than once
  // fraction: the fraction of the data to sample
  // seed: the random seed used for sampling
  // Sample 50% of the data
  .sample(true, 0.5)
  .map((_, 1))
  .reduceByKey(_ + _)
  // Sort by count in descending order, then view the Top 3 skewed keys
  .sortBy(_._2, false)
  .take(3)
Solution 1: use Hive ETL to preprocess data
Scenario description
The data in a Hive table is unevenly distributed (for example, one key corresponds to 1,000,000 rows while other keys correspond to only a few dozen), and we need to analyze that table frequently with Spark; this leads to data skew
Scheme description
At this time, evaluate whether the data can be preprocessed in Hive (aggregating during ETL, or joining with other tables in advance). The subsequent Spark job then no longer needs to perform the original shuffle operation, because the preprocessing has already been done
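As a sketch of the idea, assuming two hypothetical Hive tables user_info and user_visits: the skew-prone join is executed once in the scheduled ETL job and materialized, so the downstream Spark job only reads the pre-joined table and never shuffles on the skewed key.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveEtlPrejoin")
  .enableHiveSupport()
  .getOrCreate()

// Run once in the scheduled ETL job: materialize the skew-prone join
spark.sql(
  """
    |INSERT OVERWRITE TABLE user_visits_joined
    |SELECT v.user_id, v.url, u.name
    |FROM user_visits v
    |JOIN user_info u ON v.user_id = u.user_id
  """.stripMargin)

// The recurring analysis job reads the pre-joined table: no join, no shuffle on user_id
val joined = spark.table("user_visits_joined")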
Advantages and disadvantages of the scheme
① Advantages
Spark itself no longer experiences the skew; it is shifted upstream to Hive's ETL stage, so the Spark job runs quickly
② Shortcomings
The skew is not eliminated, only moved: the ETL phase in Hive will now be skewed. Refer to Hive SQL optimization to handle it there
Solution 2: filter a few keys that cause skew
Scenario description
A small number of keys cause the data skew
Scheme description
If the skewed keys are useless data, or filtering out the data for those keys does not affect the result, consider filtering out the skewed keys directly
// Keep everything except the skewed key "xxx"
inputRdd.filter(!_.equals("xxx"))
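If the skewed keys are not known in advance, they can be discovered by sampling first and then filtered out. A minimal sketch, assuming a hypothetical pair RDD pairRdd: RDD[(String, String)] and a hypothetical count threshold:
// Any key whose sampled count exceeds the (hypothetical) threshold is treated as skewed
val threshold = 10000

val skewedKeys: Set[String] = pairRdd
  .sample(false, 0.1)        // sample 10% of the data without replacement
  .map(ele => (ele._1, 1))
  .reduceByKey(_ + _)
  .filter(_._2 > threshold)  // keep only heavily over-represented keys
  .keys
  .collect()
  .toSet

// Drop every record belonging to a skewed key
val filteredRdd = pairRdd.filter(ele => !skewedKeys.contains(ele._1))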
Advantages and disadvantages of the scheme
① Advantages
The skewed keys are removed outright, so the skew never occurs
② Shortcomings
Applicable scenarios are rare; the skewed keys can seldom simply be discarded
Solution 3: increase shuffle parallelism
Scenario description
The skewed keys must be processed and cannot be filtered out; increasing the parallelism is the first remedy to try
Scheme description
When a shuffle operator is executed, pass the parallelism directly to the operator; this explicit argument takes precedence over the default parallelism settings. Increasing the number of tasks that process the skewed keys spreads the data over more tasks and reduces the computing time of each one
val result = rdd
  .map((_, 1))
  // Increase the shuffle parallelism to 500
  .reduceByKey(_ + _, 500)
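For Spark SQL, the same idea is applied through a configuration value rather than an operator argument; a minimal sketch, assuming a SparkSession named spark:
// In Spark SQL / DataFrame jobs the shuffle parallelism is controlled by
// spark.sql.shuffle.partitions (200 by default) instead of an operator argument
spark.conf.set("spark.sql.shuffle.partitions", "500")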
Advantages and disadvantages of the scheme
① Advantages
Effectively mitigates the impact of data skew at very low cost
② Shortcomings
Data skew is not eradicated, only alleviated. In the extreme case it does not help at all: no matter how many tasks there are, all the data of the skewed key is still assigned to a single task
It is generally used in combination with other schemes
Solution 4: two-stage aggregation (local aggregation + global aggregation)
Scheme description
Local aggregation: first prefix each key with a random number, so that identical keys become different keys and are spread across multiple tasks
# Before adding random prefixes
(hello,1)(hello,1)(hello,1)(hello,1)(hello,1)
# After adding random prefixes
(1_hello,1)(1_hello,1)(2_hello,1)(2_hello,1)(3_hello,1)
# Result of local aggregation, e.g. a reduceByKey
(1_hello,2)(2_hello,2)(3_hello,1)
Global aggregation: strip the random prefixes and aggregate again globally
# Result of local aggregation, e.g. a reduceByKey
(1_hello,2)(2_hello,2)(3_hello,1)
# Global aggregation after stripping the prefixes
(hello,5)
Code example
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.util.Random

def main(args: Array[String]): Unit = {
  val spark: SparkSession = SparkSession.builder()
    .appName(this.getClass.getSimpleName)
    .master("local[2]")
    .getOrCreate()
  val sc: SparkContext = spark.sparkContext
  // Suppose the skewed key is A
  val inputRdd: RDD[String] = sc.parallelize(Array(
    "A", "A", "A", "A", "A", "A", "A", "A",
    "A", "A", "A", "A", "A", "A", "A", "A",
    "A", "A", "A", "A", "A", "A", "A", "A",
    "B", "B", "B", "B", "B", "B", "B", "B",
    "C", "D", "E", "F", "G", "B", "B", "B"
  ))
  val random = new Random(10)
  // Local aggregation
  val mapRdd: RDD[(String, Int)] = inputRdd
    // Prefix each key with a bounded random number (0-9)
    .map(ele => (random.nextInt(10) + "_" + ele, 1))
    // Aggregate the prefixed keys (with increased parallelism)
    .reduceByKey(_ + _, 1000)
  // Global aggregation
  val resultRdd: RDD[(String, Int)] = mapRdd
    // Strip the random prefix from each key
    .map(ele => (ele._1.split("_")(1), ele._2))
    // Aggregate again on the original keys
    .reduceByKey(_ + _)
  println(resultRdd.collect().toBuffer)
  sc.stop()
}
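A note on the prefix range: random.nextInt(10) splits a hot key into at most 10 groups during local aggregation, so the hot key can be spread over at most 10 tasks. Choose a range roughly matching the parallelism you want for the hot key; too small a range leaves it concentrated in a few tasks.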
Advantages and disadvantages of the scheme
① Advantages
For aggregation-type shuffles (such as reduceByKey or a SQL group by), it can directly solve the data skew or at least greatly reduce it, significantly improving the performance of the Spark job
② Shortcomings
Applicable scenarios are limited: it only works for aggregation-type shuffles and cannot solve the data skew produced by a Join shuffle
Data skew caused by Shuffle (Join)
Solution 1: convert reduce join to map join
Usage scenario (a large table joined with a small table)
When using the Join operator on RDDs or a Join statement in Spark SQL, and the amount of data in one of the RDDs or tables is relatively small (for example several hundred megabytes, or one to two gigabytes)
Scheme description
Instead of the Join operator, use a Broadcast variable plus a map-type operator to implement the join. This completely avoids shuffle-type operations and therefore completely avoids data skew
That is, pull the data of the smaller RDD into the Driver's memory with the collect operator, create a Broadcast variable from it, and broadcast it to every Executor node
Code example
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val spark: SparkSession = SparkSession.builder()
    .appName(this.getClass.getSimpleName)
    .master("local[2]")
    .getOrCreate()
  val sc: SparkContext = spark.sparkContext
  // Small table
  val student: List[(String, String)] = List(
    ("1", "Kyle"), ("2", "Jack"), ("3", "Lucy"), ("4", "Amy")
  )
  // Large table
  val score: RDD[(String, Int)] = sc.parallelize(List(
    ("1", 90), ("2", 80), ("3", 65), ("4", 77),
    ("5", 68), ("6", 69), ("7", 57), ("8", 99)
  ))
  // Broadcast the small table
  val broadcast: Broadcast[List[(String, String)]] = sc.broadcast(student)
  val result: RDD[(String, Int)] = score
    .map { case (id, stuScore) =>
      var temp = ""
      // Read the broadcast data and match the records manually
      for ((k, v) <- broadcast.value) {
        if (k.equals(id)) {
          // Get the student's name by id
          temp = v
        }
      }
      (temp, stuScore)
    }
    // Filter out records that did not match anything
    .filter(_._1.nonEmpty)
  println(result.collect().toBuffer)
  sc.stop()
}
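A note on the lookup: the for-loop above scans the entire broadcast list once per record. A common refinement, sketched here with the same student and score data, is to broadcast a Map so each lookup is constant-time:
// Broadcast the small table as a Map for O(1) lookups per record
val broadcastMap: Broadcast[Map[String, String]] = sc.broadcast(student.toMap)

val result2: RDD[(String, Int)] = score
  .map { case (id, stuScore) =>
    // Look the id up directly in the broadcast Map; "" if not found
    (broadcastMap.value.getOrElse(id, ""), stuScore)
  }
  .filter(_._1.nonEmpty)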
Advantages and disadvantages of the scheme
① Advantages
Works very well for data skew caused by Join operations: there is no shuffle at all, so no data skew can occur
② Shortcomings
Applicable scenarios are limited: it only works for one large table joined with one small table. When the small table is broadcast, the Driver and every Executor each hold a full copy of it, so if it is too large an OOM will occur. The scheme is therefore not suitable for a large-table Join large-table situation.
Solution 2: sample the skewed keys and split the Join operation (a small number of skewed keys)
Scheme description
When joining two RDDs / Hive tables whose data volumes are both large, first inspect the distribution of the keys in the two RDDs / Hive tables
If the skew exists because a few keys in one RDD / Hive table carry too much data while the keys in the other RDD / Hive table are evenly distributed, adopt this scheme
Implementation idea
Note: both rdd1 and rdd2 carry a large amount of data; rdd1 is skewed while rdd2 is evenly distributed
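So that the step-by-step snippets below fit together, assume the following setup; the contents of rdd2 are illustrative only:
import scala.util.Random

// rdd1: the skewed side (assumed large in practice; "aa" is the hot key here)
val rdd1: RDD[(String, String)] = sc.parallelize(List(
  ("aa", "Kyle"), ("bb", "Jack"), ("cc", "Lucy"), ("aa", "Amy")
))
// rdd2: the evenly distributed side (illustrative data)
val rdd2: RDD[(String, String)] = sc.parallelize(List(
  ("aa", "class1"), ("bb", "class2"), ("cc", "class3")
))
// Random generator used when prefixing the skewed keys in step ⑤
val random = new Random()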
① Sample the RDD that contains the small number of skewed keys to find those keys
val skewedKey: Array[String] = rdd1
  .sample(false, 0.5)
  .map(ele => (ele._1, 1))
  .reduceByKey(_ + _)
  // Sort by the amount of data per key and take the Top 3 skewed keys
  .sortBy(_._2, false)
  .take(3)
  .map(_._1)
② Filter the data of the skewed keys out of rdd1 to form an independent RDD
// Filter out the data belonging to the skewed keys
val skewedRdd: RDD[(String, String)] = rdd1.filter(ele => skewedKey.contains(ele._1))
③ Filter the data of the non-skewed keys out of rdd1 to form another independent RDD
// Filter out the data belonging to the non-skewed keys
val notSkewedRdd: RDD[(String, String)] = rdd1.filter(ele => !skewedKey.contains(ele._1))
④ From the evenly distributed rdd2, filter the data of the skewed keys and expand it n times (here 100) into an independent RDD, one copy per prefix
// Filter the data of the skewed keys out of the uniform RDD and expand it
val expandRdd: RDD[(String, String)] = rdd2
  .filter(ele => skewedKey.contains(ele._1))
  .flatMap(ele => {
    import scala.collection.mutable.ListBuffer
    val temp: ListBuffer[(String, String)] = ListBuffer()
    // Expand each record of the skewed keys 100 times, one copy per prefix 1..100
    for (i <- 1 to 100) {
      temp += ((i + "_" + ele._1, ele._2))
    }
    temp
  })
⑤ Prefix each record of the skewed part of rdd1 with a random number in the same 1-to-100 range, join it with the expanded RDD, and strip the prefix
val joinRdd1: RDD[(String, String)] = skewedRdd
  // random.nextInt(100) + 1 yields 1..100, matching the prefixes used in the expansion
  .map(ele => (random.nextInt(100) + 1 + "_" + ele._1, ele._2))
  .join(expandRdd)
  // Strip the random prefix from the key and keep the value from rdd1
  .map(ele => (ele._1.split("_")(1), ele._2._1))
⑥ Join the non-skewed part of rdd1 with rdd2 as usual
val joinRdd2: RDD[(String, String)] = notSkewedRdd
  .join(rdd2)
  // Keep the value from rdd1 so the element type matches joinRdd1
  .map(ele => (ele._1, ele._2._1))
⑦ Union the results of the two joins
val result: RDD[(String, String)] = joinRdd1.union(joinRdd2)
Advantages and disadvantages of the scheme
① Advantages
Works for a small number of skewed keys: only the data of those keys is prefixed and expanded, so the extra memory cost is limited
② Shortcomings
If there are very many skewed keys, this scheme is not applicable
Solution 3: join with random prefixes and an expanded RDD
Scheme description
When joining two RDDs / Hive tables that both hold large amounts of data, one side contains many skewed keys, and each skewed key may correspond to more than 10,000 records; splitting out individual keys no longer helps, so this scheme is needed
Implementation idea
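The snippets below assume the following setup, where rdd1 is the evenly distributed side and rdd2 the heavily skewed side (the data is illustrative only):
import scala.util.Random

// rdd1: the evenly distributed side (illustrative data)
val rdd1: RDD[(String, String)] = sc.parallelize(List(
  ("aa", "class1"), ("bb", "class2"), ("cc", "class3")
))
// rdd2: the skewed side; "aa" is the hot key, with far more records in practice
val rdd2: RDD[(String, String)] = sc.parallelize(List(
  ("aa", "Kyle"), ("aa", "Amy"), ("aa", "Tom"), ("bb", "Jack"), ("cc", "Lucy")
))
val random = new Random()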
① Expand the entire evenly distributed RDD n times (here 100), one copy per prefix
// Expand every record of the uniform RDD (no filtering here, unlike the previous scheme)
val expandRdd: RDD[(String, String)] = rdd1
  .flatMap(ele => {
    import scala.collection.mutable.ListBuffer
    val temp: ListBuffer[(String, String)] = ListBuffer()
    // One copy per prefix 1..100
    for (i <- 1 to 100) {
      temp += ((i + "_" + ele._1, ele._2))
    }
    temp
  })
② Prefix each record of the skewed RDD with a random number in the same 1-to-100 range
val skewedRdd: RDD[(String, String)] = rdd2
  // random.nextInt(100) + 1 yields 1..100, matching the prefixes of the expansion
  .map(ele => (random.nextInt(100) + 1 + "_" + ele._1, ele._2))
③ Join the two RDDs and strip the prefix
val result: RDD[(String, String)] = skewedRdd
  .join(expandRdd)
  // Strip the random prefix from the key
  .map(ele => (ele._1.split("_")(1), ele._2._1))
Advantages and disadvantages of the scheme
① Advantages
Handles Join-induced skew that the other schemes cannot, and the effect is usually a significant improvement
② Shortcomings
Because the entire RDD is expanded n times, the resource consumption (especially memory) is high