Introduction to Spark (4): Advanced RDD Operators (1)

Keywords: Programming, Scala, Apache Spark, Java

1. mapPartitionsWithIndex

Create an RDD, explicitly setting the number of partitions to 2:

scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7),2)

View the partitions:

scala> rdd1.partitions

The result is as follows:

res0: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@691, org.apache.spark.rdd.ParallelCollectionPartition@692)

View the number of partitions

scala> rdd1.partitions.length   //Result: res1: Int = 2

Define a function that labels each element with its partition index:

def func(index : Int, iter : Iterator[Int]) : Iterator[String] = {
	iter.toList.map(x => "[partID:" + index + ",val: " + x +"]").iterator
}

View the contents of each partition:

scala> rdd1.mapPartitionsWithIndex(func).collect()

The contents are as follows:

res2: Array[String] = Array([partID:0,val: 1], [partID:0,val: 2], [partID:0,val: 3], [partID:1,val: 4], [partID:1,val: 5], [partID:1,val: 6], [partID:1,val: 7])
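The same labeling can also be written inline with an anonymous function, no named helper needed (a minimal sketch of the same call):

scala> rdd1.mapPartitionsWithIndex((index, iter) => iter.map(x => "[partID:" + index + ",val: " + x + "]")).collect()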

2. aggregate: more flexible aggregation

Create RDD

scala>  val rdd = sc.parallelize(List(1,2,3,4,5,6,7,8,9),2)	

Sum all elements of the RDD:

scala> rdd.aggregate(0)(_+_,_+_)   // Result: 45
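Here aggregate takes a zero value and two functions: seqOp folds the elements within each partition, and combOp merges the per-partition results. Simplified from the Spark RDD API:

def aggregate[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

Note that zeroValue is used both by seqOp inside every partition and by combOp when the partition results are combined; this matters in the examples below.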

Find the maximum of each partition, then sum the per-partition maxima.

Hint: this is how to find the maximum of a plain Scala array:

scala> val arr = Array(1,2,3)
scala> arr.reduce(math.max(_,_))   // Result: 3

scala> rdd.aggregate(0)(math.max(_,_),_+_)  

Result: res6: Int = 13
Note: the maximum of the first partition (1,2,3,4) is 4, the maximum of the second partition (5,6,7,8,9) is 9, and 4 + 9 = 13.
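To verify the per-partition maxima, inspect the partitions with mapPartitionsWithIndex (a quick check; the expected output assumes the two-partition layout above):

scala> rdd.mapPartitionsWithIndex((i, it) => Iterator((i, it.max))).collect()
// Expected: Array((0,4), (1,9))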

Find the maximum:

scala> rdd.aggregate(0)(math.max(_,_),math.max(_,_))
res0: Int = 9 

There is a pitfall here: the initial value 0 only gives the right answer because it does not exceed the maximum element; for an RDD of all-negative numbers, for example, the result would wrongly be 0. To avoid this, use the first element of the RDD as the initial value, as follows:

scala> rdd.aggregate(rdd.first)(math.max(_,_),math.max(_,_))
res6: Int = 9
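To see the pitfall concretely, here is a hypothetical all-negative RDD (illustrative values): the 0 seed wins every max comparison, while seeding with first gives the correct answer:

scala> val neg = sc.parallelize(List(-3,-1,-7),2)
scala> neg.aggregate(0)(math.max(_,_),math.max(_,_))          // Wrong: returns 0, the seed
scala> neg.aggregate(neg.first)(math.max(_,_),math.max(_,_))  // Correct: returns -1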

Practice:

scala> rdd.aggregate(10)(math.max(_,_),_+_)
res7: Int = 30

Explain:
Initial value is 10
Comparing 10 with the maximum 4 of the first partition, the result is 10.
Comparing 10 with the maximum 9 of the second partition, the result is 10.
The result is: 10 + 10 + 10 = 30.

scala> rdd.aggregate(6)(math.max(_,_),_+_)
res8: Int = 21

Explain:
Initial value is 6
Comparing 6 with the maximum 4 of the first partition, the result is 6.
Comparing 6 with the maximum 9 of the second partition, the result is 9.
The result is: 6 + 6 + 9 = 21.

scala> rdd.aggregate(3)(math.max(_,_),_+_)
res9: Int = 16

Explain:
Initial value is 3
Comparing 3 with the maximum 4 of the first partition, the result is 4.
Comparing 3 with the maximum 9 of the second partition, the result is 9.
The result is: 3 + 4 + 9 = 16.
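The same arithmetic can be reproduced in plain Scala without Spark (a sketch assuming the partition layout (1,2,3,4) / (5,6,7,8,9) shown earlier):

scala> val parts = List(List(1,2,3,4), List(5,6,7,8,9))   // partition layout of rdd
scala> val partMax = parts.map(_.foldLeft(3)(math.max))   // per-partition seqOp: List(4, 9)
scala> partMax.foldLeft(3)(_ + _)                         // combOp: 3 + 4 + 9 = 16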

scala> val rdd1 = sc.parallelize(List("a","b","c","d","e"),2)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd1.aggregate("")(_+_,_+_)
res10: String = abcde
Define a labeling function for String elements (the same pattern as func above):

def func2(index : Int, iter : Iterator[String]) : Iterator[String] = {
	iter.toList.map(x => "[partID:" + index + ",val: " + x +"]").iterator
}
scala> rdd1.mapPartitionsWithIndex(func2).collect()
res13: Array[String] = Array([partID:0,val: a], [partID:0,val: b], [partID:1,val: c], [partID:1,val: d], [partID:1,val: e])

scala> rdd1.aggregate("|")(_+_,_+_)
res15: String = ||ab|cde

Note: the initial value "|" is used once at the start of each partition ("|ab" and "|cde") and once more when the partition results are combined, hence the two leading pipes: "|" + "|ab" + "|cde".
scala> val rdd2 = sc.parallelize(List("12","23","345","4567"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> rdd2.aggregate("")((x,y)=>math.max(x.length,y.length).toString,(x,y)=>x+y);
res16: String = 24

scala> rdd2.aggregate("")((x,y)=>math.max(x.length,y.length).toString,(x,y)=>x+y);
res17: String = 42

scala> rdd2.aggregate("")((x,y)=>math.max(x.length,y.length).toString,(x,y)=>x+y);
res18: String = 24

scala> rdd2.aggregate("")((x,y)=>math.max(x.length,y.length).toString,(x,y)=>x+y);
res19: String = 42

Explain:
The same code returns two different results on different runs.
The maximum string length in the first partition ("12","23") is 2, and in the second partition ("345","4567") it is 4, so the partitions yield "2" and "4".
Because the two partitions are computed in parallel and either result may arrive first, the final concatenation can be "24" or "42".
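To see the layout behind the two maxima, the labeling function func2 from above can be re-used (the expected output assumes the same two-partition split):

scala> rdd2.mapPartitionsWithIndex(func2).collect()
// Expected: Array([partID:0,val: 12], [partID:0,val: 23], [partID:1,val: 345], [partID:1,val: 4567])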

scala> val rdd3 = sc.parallelize(List("12","23","345",""),2)	

scala> rdd3.aggregate("")((x,y)=>math.min(x.length,y.length).toString,(x,y)=>x+y);
res25: String = 01

scala> rdd3.aggregate("")((x,y)=>math.min(x.length,y.length).toString,(x,y)=>x+y);
res26: String = 10

scala> rdd3.aggregate("")((x,y)=>math.min(x.length,y.length).toString,(x,y)=>x+y);
res27: String = 10

scala> rdd3.aggregate("")((x,y)=>math.min(x.length,y.length).toString,(x,y)=>x+y);
res28: String = 01

Explain:
The first partition ("12","23"):
min of the initial value "" (length 0) and "12" (length 2) is 0; toString gives "0".
min of "0" (length 1) and "23" (length 2) is 1; toString gives "1", the partition result.
The second partition ("345",""):
min of the initial value "" (length 0) and "345" (length 3) is 0; toString gives "0".
min of "0" (length 1) and "" (length 0) is 0; toString gives "0", the partition result.
Because the two partitions are computed in parallel and either result may arrive first, the concatenation can be "10" or "01".

3. aggregateByKey: aggregate values that share the same key

//1. Create an RDD of key-value pairs (2 partitions, matching the output below)
scala> val pairRDD = sc.parallelize(List(("cat",2),("cat",5),("mouse",4),("cat",12),("cat",13),("mouse",2)), 2)
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[9] at parallelize at <console>:24
//2. View the contents of each partition
def func3(index : Int, it : Iterator[Any]) : Iterator[Any] = {
    it.toList.map(x => "[partID:" + index + ",val: " + x +"]").iterator
}

scala> pairRDD.mapPartitionsWithIndex(func3).collect()
res30: Array[Any] = Array([partID:0,val: (cat,2)], [partID:0,val: (cat,5)], [partID:0,val: (mouse,4)], [partID:1,val: (cat,12)], [partID:1,val: (cat,13)], [partID:1,val: (mouse,2)])
//Count the total for each animal: sum within each partition, then sum the per-partition results
scala> pairRDD.aggregateByKey(0)(_+_,_+_).collect()
res31: Array[(String, Int)] = Array((cat,32), (mouse,6))

//For each animal, find its maximum value within each partition, then sum those per-partition maxima (cat: 5 + 13 = 18, mouse: 4 + 2 = 6)
scala> pairRDD.aggregateByKey(0)(math.max(_,_),_+_).collect()
res32: Array[(String, Int)] = Array((cat,18), (mouse,6))

//Find the overall maximum value for each animal
scala> pairRDD.aggregateByKey(0)(math.max(_,_),math.max(_,_)).collect()
res33: Array[(String, Int)] = Array((cat,13), (mouse,4))
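Unlike aggregate, the zero value of aggregateByKey is applied once per key within each partition and is not used again when partitions are combined. A quick check (the expected numbers assume the two-partition layout shown above):

scala> pairRDD.aggregateByKey(100)(_+_,_+_).collect()
// Expected: cat = (100+2+5) + (100+12+13) = 232, mouse = (100+4) + (100+2) = 206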

Finish...
