1. mapPartitionsWithIndex
Create an RDD, explicitly specifying 2 partitions
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7),2)
View the partitions
scala> rdd1.partitions
The output is as follows:
res0: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@691, org.apache.spark.rdd.ParallelCollectionPartition@692)
View the number of partitions
scala> rdd1.partitions.length
res1: Int = 2
Create an iteration function
def func(index : Int, iter : Iterator[Int]) : Iterator[String] = { iter.toList.map(x => "[partID:" + index + ",val: " + x +"]").iterator }
View partition content
scala> rdd1.mapPartitionsWithIndex(func).collect()
The contents are as follows:
res2: Array[String] = Array([partID:0,val: 1], [partID:0,val: 2], [partID:0,val: 3], [partID:1,val: 4], [partID:1,val: 5], [partID:1,val: 6], [partID:1,val: 7])
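The same output can be produced without a named helper by passing an anonymous function directly; a minimal sketch, assuming the same rdd1 and session as above:
scala> rdd1.mapPartitionsWithIndex((index, iter) => iter.map(x => "[partID:" + index + ",val: " + x + "]")).collect()
// same Array as res2 above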
2. aggregate: a more flexible aggregation
Create RDD
scala> val rdd = sc.parallelize(List(1,2,3,4,5,6,7,8,9),2)
RDD summation
scala> rdd.aggregate(0)(_+_,_+_)
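For reference, this is the signature of aggregate in the Spark RDD API: seqOp folds the elements within each partition starting from zeroValue, and combOp merges the per-partition results (zeroValue is applied once more in that final combine):
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
Here both functions are _+_, so the call simply sums 1 through 9 and returns 45.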
Task: find the maximum of each partition, then sum those maxima.
Warm-up: to find the maximum of an array: val arr = Array(1,2,3)
arr.reduce(math.max(_,_)) // returns the maximum, 3
scala> rdd.aggregate(0)(math.max(_,_),_+_)
Result: res6: Int = 13
Note: the maximum of the first partition is 4 and of the second is 9; the final combine also adds the initial value once, so the result is 0 + 4 + 9 = 13.
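To verify these partition maxima, we can reuse func from section 1 (assuming it is still defined in the same session); elements 1-4 land in partition 0 and 5-9 in partition 1:
scala> rdd.mapPartitionsWithIndex(func).collect()
// Array([partID:0,val: 1], [partID:0,val: 2], [partID:0,val: 3], [partID:0,val: 4],
//       [partID:1,val: 5], [partID:1,val: 6], [partID:1,val: 7], [partID:1,val: 8], [partID:1,val: 9])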
Find the maximum:
scala> rdd.aggregate(0)(math.max(_,_),math.max(_,_))
res0: Int = 9
There is a problem here: this works only because the initial value 0 is no greater than the maximum element; if every element were negative, the result would wrongly be 0. To avoid the problem, use the first element of the RDD as the initial value, as follows:
scala> rdd.aggregate(rdd.first)(math.max(_,_),math.max(_,_))
res6: Int = 9
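A side note (not part of the original session): when only a global maximum is wanted, Spark's built-in actions avoid the zero-value pitfall entirely:
scala> rdd.reduce(math.max(_,_)) // no zero value involved; returns 9
scala> rdd.max() // uses the implicit Ordering[Int]; returns 9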
Practice:
scala> rdd.aggregate(10)(math.max(_,_),_+_)
res7: Int = 30
Explain:
The initial value is 10.
Comparing 10 with the first partition's maximum 4 gives 10.
Comparing 10 with the second partition's maximum 9 gives 10.
The final combine adds the initial value once more: 10 + 10 + 10 = 30.
scala> rdd.aggregate(6)(math.max(_,_),_+_)
res8: Int = 21
Explain:
The initial value is 6.
Comparing 6 with the first partition's maximum 4 gives 6.
Comparing 6 with the second partition's maximum 9 gives 9.
The result is: 6 + 6 + 9 = 21.
scala> rdd.aggregate(3)(math.max(_,_),_+_)
res9: Int = 16
Explain:
The initial value is 3.
Comparing 3 with the first partition's maximum 4 gives 4.
Comparing 3 with the second partition's maximum 9 gives 9.
The result is: 3 + 4 + 9 = 16.
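The arithmetic above can be checked without a cluster; a minimal pure-Scala sketch, assuming the same split of 1-9 into partitions List(1,2,3,4) and List(5,6,7,8,9):
val parts = List(List(1, 2, 3, 4), List(5, 6, 7, 8, 9))
def aggregateLocally(init: Int): Int = {
  val perPartition = parts.map(_.foldLeft(init)(math.max)) // seqOp within each partition
  perPartition.foldLeft(init)(_ + _) // combOp: the initial value is applied once more
}
aggregateLocally(10) // 30
aggregateLocally(6)  // 21
aggregateLocally(3)  // 16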
scala> val rdd1 = sc.parallelize(List("a","b","c","d","e"),2)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd1.aggregate("")(_+_,_+_) res10: String = abcde
View the partition contents:
def func2(index : Int, iter : Iterator[String]) : Iterator[String] = { iter.toList.map(x => "[partID:" + index + ",val: " + x +"]").iterator }
scala> rdd1.mapPartitionsWithIndex(func2).collect()
res13: Array[String] = Array([partID:0,val: a], [partID:0,val: b], [partID:1,val: c], [partID:1,val: d], [partID:1,val: e])
scala> rdd1.aggregate("|")(_+_,_+_) res15: String = ||ab|cde
scala> val rdd2 = sc.parallelize(List("12","23","345","4567"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[6] at parallelize at <console>:24
scala> rdd2.aggregate("")((x,y) => math.max(x.length,y.length).toString, (x,y) => x+y)
res16: String = 24
scala> rdd2.aggregate("")((x,y) => math.max(x.length,y.length).toString, (x,y) => x+y)
res17: String = 42
scala> rdd2.aggregate("")((x,y) => math.max(x.length,y.length).toString, (x,y) => x+y)
res18: String = 24
scala> rdd2.aggregate("")((x,y) => math.max(x.length,y.length).toString, (x,y) => x+y)
res19: String = 42
Explain:
Running the same code several times produces two different results.
The first partition reduces to the string "2" (its longest element has length 2) and the second to "4" (longest element has length 4).
Because the two partitions are computed in parallel and either may finish first, combOp concatenates "2" and "4" in either order, giving "24" or "42".
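To see which elements each partition actually holds, glom() (a standard RDD method that gathers each partition into an Array) gives a quick check; illustrative output:
scala> rdd2.glom().collect()
// Array(Array(12, 23), Array(345, 4567))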
scala> val rdd3 = sc.parallelize(List("12","23","345",""),2)
scala> rdd3.aggregate("")((x,y) => math.min(x.length,y.length).toString, (x,y) => x+y)
res25: String = 01
scala> rdd3.aggregate("")((x,y) => math.min(x.length,y.length).toString, (x,y) => x+y)
res26: String = 10
scala> rdd3.aggregate("")((x,y) => math.min(x.length,y.length).toString, (x,y) => x+y)
res27: String = 10
scala> rdd3.aggregate("")((x,y) => math.min(x.length,y.length).toString, (x,y) => x+y)
res28: String = 01
Explain:
The first partition ("12" and "23"):
Comparing the length of the initial value "" (0) with the length of "12" (2): the minimum is 0, and toString gives "0".
Comparing the length of "0" (1) with the length of "23" (2): the minimum is 1, and toString gives "1".
So the first partition yields "1".
The second partition ("345" and ""):
Comparing the length of the initial value "" (0) with the length of "345" (3): the minimum is 0, and toString gives "0".
Comparing the length of "0" (1) with the length of "" (0): the minimum is 0, and toString gives "0".
So the second partition yields "0".
Again, because either partition may finish first, the concatenation is "10" or "01".
3. aggregateByKey: aggregate values that share the same key
//1. Create an RDD of key-value pairs
scala> val pairRDD = sc.parallelize(List(("cat",2),("cat",5),("mouse",4),("cat",12),("cat",13),("mouse",2)))
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[9] at parallelize at <console>:24
//2. View the contents of each partition
def func3(index : Int, it : Iterator[Any]) : Iterator[Any] = { it.toList.map(x => "[partID:" + index + ",val: " + x +"]").iterator }
scala> pairRDD.mapPartitionsWithIndex(func3).collect()
res30: Array[Any] = Array([partID:0,val: (cat,2)], [partID:0,val: (cat,5)], [partID:0,val: (mouse,4)], [partID:1,val: (cat,12)], [partID:1,val: (cat,13)], [partID:1,val: (mouse,2)])
//Total count per animal: sum within each partition, then sum the partition results
scala> pairRDD.aggregateByKey(0)(_+_,_+_).collect()
res31: Array[(String, Int)] = Array((cat,32), (mouse,6))
//Per-partition maximum for each animal, then sum those maxima
scala> pairRDD.aggregateByKey(0)(math.max(_,_),_+_).collect()
res32: Array[(String, Int)] = Array((cat,18), (mouse,6))
//Overall maximum for each animal
scala> pairRDD.aggregateByKey(0)(math.max(_,_),math.max(_,_)).collect()
res33: Array[(String, Int)] = Array((cat,13), (mouse,4))
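A closing note: with a zero value of 0 and _+_ for both functions, aggregateByKey behaves like the more common reduceByKey; a quick sketch with illustrative output:
scala> pairRDD.reduceByKey(_+_).collect()
// Array((cat,32), (mouse,6))
Unlike aggregate, the zero value of aggregateByKey is applied once per key within each partition and is not applied again in the final combine.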
Finish...