2. Principles and Use of Spark -- Spark Core

Keywords: Big Data Spark Scala Apache JDBC

[TOC]

1. Some basic terms in Spark

RDD: Resilient Distributed Dataset, the core abstraction in Spark
Operators: functions that operate on RDDs
application: a user-written Spark program (driver program + executor programs)
job: the work triggered by one action operator
stage: a set of tasks; a job is divided into several stages according to its dependencies
task: within one stage there are multiple tasks performing the same operations (on different data); the task is the smallest execution unit in the cluster

These concepts may still be unclear after just reading the definitions. That's fine; for now it is enough to form a first impression.

2. Basic Principles and Use of RDD

2.1 What is RDD

RDD stands for Resilient Distributed Dataset. It is the most basic data abstraction in Spark and represents an immutable, partitioned collection of elements that can be computed in parallel. RDDs follow a data-flow model and provide automatic fault tolerance, location-aware scheduling, and scalability. If that is still not clear, here is an example:
Suppose sc.textFile(xxxx) is used to read a file from HDFS. The file's data corresponds to one RDD, but in reality that data is processed on several different worker nodes; logically, within the Spark cluster, it all belongs to a single RDD. In other words, an RDD is a logical concept, an abstraction of data distributed across the cluster, and it is the key to Spark's distributed data processing. For example:

Figure 2.1 RDD principles

2.2 RDD Properties

For RDD properties, there is a comment in the source code as follows:

* Internally, each RDD is characterized by five main properties:
*  - A list of partitions
 1. An RDD is a group of partitions
 Understanding: an RDD is composed of partitions, and each partition runs on a different worker, which is what makes distributed computing possible; the partition is the basic unit of the dataset. Each partition is processed by one compute task, so the number of partitions determines the granularity of parallelism. Users can specify the number of partitions when creating an RDD; if not specified, a default value is used, namely the number of CPU cores assigned to the program.

*  - A function for computing each split
 2. A function for computing each split (a split can be understood as a partition)
 An RDD carries a set of functions used to process the data of each partition being computed; these functions are called operators. Spark computes RDDs partition by partition, and every RDD implements a compute function for this purpose. compute composes iterators rather than saving the result of each step.
Operator types:
transformation   action

*  - A list of dependencies on other RDDs
 3. An RDD has a list of dependencies on other RDDs: narrow dependencies and wide dependencies.
Stages are divided according to these dependencies, and tasks are executed stage by stage. Each transformation of an RDD produces a new RDD, so RDDs form a pipeline-like chain of dependencies. When the data of some partition is lost, Spark can recompute just that partition through the dependency chain instead of recomputing all partitions of the RDD.

*  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
4. A Partitioner can be attached to key-value RDDs (a minimal sketch is shown after this list)
 When creating an RDD you can rely on the default partitioning or supply custom partitioning rules.
Spark currently ships two partitioning functions: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs have a Partitioner; for non-key-value RDDs the Partitioner is None. The Partitioner determines not only the number of partitions of the RDD itself, but also the number of partitions of the parent RDD's shuffle output.

*  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
*    an HDFS file)
5. Prefer running tasks on nodes close to where the data lives:
move the computation, not the data
 An explanation: Spark is usually built on top of HDFS and reads the data it processes from HDFS. HDFS is distributed storage; say there are three data nodes A, B and C, and the data Spark needs happens to be stored on node C. If Spark placed the task on node A or B, the data would first have to be read from node C and transferred over the network before it could be processed, which is costly. So Spark prefers the node that already holds the data, node C in this case, saving the time and cost of data transfer. That is what "move the computation, not the data" means.
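
To make property 4 concrete, here is a minimal spark-shell sketch (sc already exists there; the values are illustrative, not from the original article): applying a HashPartitioner to a key-value RDD and inspecting the partitioner property.

import org.apache.spark.HashPartitioner

val kv = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)))
kv.partitioner                                   // None: no partitioner yet
val hashed = kv.partitionBy(new HashPartitioner(3))
hashed.partitioner                               // Some(HashPartitioner) with 3 partitions
hashed.partitions.length                         // 3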

2.3 Create RDD

To create an RDD, you first need to create a SparkContext object:
//Create the Spark configuration object: set the app name and the master address; "local" means local mode
//When submitting to a cluster the master is usually not hard-coded here, so the same program can run on different clusters
val conf = new SparkConf().setAppName("wordCount").setMaster("local")
//Create spark context object
val sc = new SparkContext(conf)

Create an RDD from sc.parallelize():

sc.parallelize(seq,numPartitions)
seq: a sequence object, such as a List or an Array
numPartitions: the number of partitions; optional, defaulting to the context's default parallelism (at least 2 on a cluster)

Example:
val rdd1 = sc.parallelize(Array(1,2,3,4,5),3)
rdd1.partitions.length

Create from an external data source

val rdd1 = sc.textFile("/usr/local/tmp_files/test_WordCount.txt")

2.4 Operator Type

Operators are classified into transformations and actions.
transformation:

Lazily evaluated: a transformation by itself does not trigger any computation; computation is triggered only when an action operator is encountered. Transformations simply remember the operations applied to the underlying dataset (for example a file), and they are actually executed only when an action requires a result to be returned to the driver. This design lets Spark run more efficiently.

action:

Unlike a transformation, an action triggers the computation immediately, without waiting.
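
A minimal sketch of this laziness (spark-shell style; names are illustrative):

val nums = sc.parallelize(1 to 5)
val doubled = nums.map { x => println(s"computing $x"); x * 2 }   // nothing runs yet: map is a lazy transformation
doubled.collect()   // collect is an action: only now is the map actually computed
// Note: the println inside map runs on the executors, so in cluster mode it shows up in executor logs, not the driver console.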

2.5 transformation operator

For the sake of illustration, creating an rdd is demonstrated using spark-shell:

scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,8,34,100,79))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

2.5.1 map(func)

map[U](f: T => U)
The parameter is a function that takes a single argument and returns a single value. map applies the function to each incoming element and returns the processed data.

Example:
//An anonymous function is passed in: each value in rdd1 is multiplied by 2, and a new RDD with the processed values is returned
scala> val rdd2 = rdd1.map(_*2)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26

//Here collect is an action operator that triggers the calculation and prints the result
scala> rdd2.collect
res0: Array[Int] = Array(2, 4, 6, 8, 10, 16, 68, 200, 158)

2.5.2 filter

filter(f: T => Boolean)
The parameter is a predicate function that returns true or false for each incoming element; it is commonly used to filter data. Only the elements for which it returns true are kept.

Example:
//Keep only the values greater than 20
scala> rdd2.filter(_>20).collect
res4: Array[Int] = Array(68, 200, 158)

2.5.3 flatMap

flatMap(f: T => TraversableOnce[U])
map followed by flatten: flatten expands nested collections such as lists and merges them into one flat collection, which is returned as the processed data. This operator is typically used when each element itself maps to a collection.

Example:
scala> val rdd4 = sc.parallelize(Array("a b c","d e f","x y z"))
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:24

//The processing logic: split each string in the array by spaces, producing several arrays, then flatten and merge them into one new collection
scala> val rdd5 = rdd4.flatMap(_.split(" "))
rdd5: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at flatMap at <console>:26

scala> rdd5.collect
res5: Array[String] = Array(a, b, c, d, e, f, x, y, z)

2.5.4 Collection Operations

union(otherDataset)          union
intersection(otherDataset)   intersection
distinct([numTasks])         deduplication

//Example:
scala> val rdd6 = sc.parallelize(List(5,6,7,8,9,10))
rdd6: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:24

scala> val rdd7 = sc.parallelize(List(1,2,3,4,5,6))
rdd7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <console>:24

//Union
scala> val rdd8 = rdd6.union(rdd7)
rdd8: org.apache.spark.rdd.RDD[Int] = UnionRDD[22] at union at <console>:28

scala> rdd8.collect
res6: Array[Int] = Array(5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6)

//Duplicate removal
scala> rdd8.distinct.collect
res7: Array[Int] = Array(4, 8, 1, 9, 5, 6, 10, 2, 7, 3)                         
//intersection
scala> val rdd9 = rdd6.intersection(rdd7)
rdd9: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[31] at intersection at <console>:28

scala> rdd9.collect
res8: Array[Int] = Array(6, 5)

2.5.5 Grouping Operations

groupByKey([numTasks]): groups the values that share the same key
reduceByKey(f:(V,V)=>V, [numTasks]): first groups the KV pairs by key, then reduces the values of each group with the given function

scala> val rdd1 = sc.parallelize(List(("Tom",1000),("Jerry",3000),("Mary",2000)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(List(("Jerry",1000),("Tom",3000),("Mike",2000)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[33] at parallelize at <console>:24

scala> val rdd3 = rdd1 union rdd2
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[34] at union at <console>:28

scala> rdd3.collect
res9: Array[(String, Int)] = Array((Tom,1000), (Jerry,3000), (Mary,2000), (Jerry,1000), (Tom,3000), (Mike,2000))

scala> val rdd4 = rdd3.groupByKey
rdd4: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[35] at groupByKey at <console>:30

//Grouping
scala> rdd4.collect
res10: Array[(String, Iterable[Int])] = 
Array(
(Tom,CompactBuffer(1000, 3000)), 
(Jerry,CompactBuffer(3000, 1000)), 
(Mike,CompactBuffer(2000)), 
(Mary,CompactBuffer(2000)))

//Note: groupByKey is not recommended for grouped aggregation because of its poor performance; reduceByKey is officially recommended instead
//Group and Aggregate
scala> rdd3.reduceByKey(_+_).collect
res11: Array[(String, Int)] = Array((Tom,4000), (Jerry,4000), (Mike,2000), (Mary,2000))

2.5.6 cogroup

cogroup groups the values of both RDDs by key, producing (key, (values from this RDD, values from the other RDD)); it is easiest to understand from the example:
scala> val rdd1 = sc.parallelize(List(("Tom",1),("Tom",2),("jerry",1),("Mike",2)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[37] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(List(("jerry",2),("Tom",1),("Bob",2)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[38] at parallelize at <console>:24

scala> val rdd3 = rdd1.cogroup(rdd2)
rdd3: org.apache.spark.rdd.RDD[(String, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[40] at cogroup at <console>:28

scala> rdd3.collect
res12: Array[(String, (Iterable[Int], Iterable[Int]))] = 
Array(
(Tom,(CompactBuffer(1, 2),CompactBuffer(1))), 
(Mike,(CompactBuffer(2),CompactBuffer())), 
(jerry,(CompactBuffer(1),CompactBuffer(2))), 
(Bob,(CompactBuffer(),CompactBuffer(2))))

2.5.7 Sort

sortByKey(ascending: true/false): sorts a KV RDD by key
sortBy(f: T => U, ascending: true/false): general-purpose sorting on whatever f extracts from each element; it can be used to sort a KV RDD by value

//Example:
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,8,34,100,79))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val rdd2 = rdd1.map(_*2)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26

scala> rdd2.collect
res0: Array[Int] = Array(2, 4, 6, 8, 10, 16, 68, 200, 158)

scala> rdd2.sortBy(x=>x,true)
res1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at sortBy at <console>:29

scala> rdd2.sortBy(x=>x,true).collect
res2: Array[Int] = Array(2, 4, 6, 8, 10, 16, 68, 158, 200)                      

scala> rdd2.sortBy(x=>x,false).collect
res3: Array[Int] = Array(200, 158, 68, 16, 10, 8, 6, 4, 2)

Another example:

Requirement:
//We want to sort KV pairs by value, but sortByKey sorts by key.

//Approach one:
1. Swap the key and value, then call sortByKey
2. Swap the key and value back again
scala> val rdd1 = sc.parallelize(List(("tom",1),("jerry",1),("kitty",2),("bob",1)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[42] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(List(("jerry",2),("tom",3),("kitty",5),("bob",2)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[43] at parallelize at <console>:24

scala> val rdd3 = rdd1 union(rdd2)
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[44] at union at <console>:28

scala> val rdd4 = rdd3.reduceByKey(_+_)
rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[45] at reduceByKey at <console>:30

scala> rdd4.collect
res13: Array[(String, Int)] = Array((bob,3), (tom,4), (jerry,3), (kitty,7))

//Swap key and value, sort by the (swapped) key, then swap back
scala> val rdd5 = rdd4.map(t => (t._2,t._1)).sortByKey(false).map(t=>(t._2,t._1))
rdd5: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[50] at map at <console>:32

scala> rdd5.collect
res14: Array[(String, Int)] = Array((kitty,7), (tom,4), (bob,3), (jerry,3)) 

//Approach two:
//Use sortBy to sort by value directly, as sketched below
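
A sketch of approach two, continuing from rdd4 above (the expected output matches the rdd5 result shown earlier):

val byValueDesc = rdd4.sortBy(_._2, ascending = false)
byValueDesc.collect   // Array((kitty,7), (tom,4), (bob,3), (jerry,3))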

2.6 action operator

reduce

Similar to reduceByKey above, but it aggregates non-KV data, and it is an action operator.

scala> val rdd1 = sc.parallelize(List(1,2,3,4,5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[41] at parallelize at <console>:24

scala> val rdd2 = rdd1.reduce(_+_)
rdd2: Int = 15

There are also some action operators:

reduce(func): aggregates all elements of the RDD using the function func
collect(): returns all elements of the dataset to the driver as an array; often used simply to trigger the computation
count(): returns the number of elements in the RDD
first(): returns the first element of the RDD (similar to take(1))
take(n): returns an array of the first n elements of the dataset
takeSample(withReplacement, num, [seed]): returns an array of num elements randomly sampled from the dataset; withReplacement chooses whether to sample with replacement, and seed specifies the random number generator seed
takeOrdered(n, [ordering]): returns an array of the first n elements of the dataset, in sorted order
saveAsTextFile(path): saves the elements of the dataset as a text file on HDFS or another supported file system; Spark calls toString on each element to turn it into a line of text
saveAsSequenceFile(path): saves the elements of the dataset in Hadoop SequenceFile format to the specified directory on HDFS or another Hadoop-supported file system
saveAsObjectFile(path): saves the elements of the dataset as serialized objects to the specified path
countByKey(): for an RDD of type (K,V), returns a Map of (K, Int) giving the number of elements for each key
foreach(func): runs the function func on each element of the dataset
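
A quick spark-shell sketch (values are illustrative) exercising a few of the actions above:

val nums = sc.parallelize(List(5, 1, 4, 2, 3))
nums.count            // 5
nums.first            // 5  (first element of the first partition)
nums.take(3)          // Array(5, 1, 4)
nums.takeOrdered(3)   // Array(1, 2, 3)

val kv = sc.parallelize(List(("a", 1), ("a", 2), ("b", 1)))
kv.countByKey()       // Map(a -> 2, b -> 1)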

2.7 RDD Cache Features

RDDs also have a caching mechanism: an RDD can be cached in memory or on disk so that it does not have to be recomputed.
Several operators are involved here:

cache() marks the RDD as cacheable; the default cache location is memory, and persist() is called under the hood.
persist() marks the RDD as cacheable; by default it caches in memory.
persist(newLevel: org.apache.spark.storage.StorageLevel) is similar, but lets you specify the storage level of the cache.

The storage levels that can be used are:
val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val OFF_HEAP = new StorageLevel(true, true, true, false, 1)

The levels fall into three basic categories:
Pure memory cache
Pure disk cache
Disk + memory cache

In general, the default is to cache in memory, which performs best but consumes a lot of memory, so be careful: if you do not need the cache, do not use it.
Give an example:

Read a large file, count lines

scala> val rdd1 = sc.textFile("hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt")
rdd1: org.apache.spark.rdd.RDD[String] = hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt MapPartitionsRDD[52] at textFile at <console>:24

scala> rdd1.count
res15: Long = 923452 
Trigger calculation, count rows

scala> rdd1.cache
res16: rdd1.type = hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt MapPartitionsRDD[52] at textFile at <console>:24
 Marks this RDD as cacheable; does not trigger a computation

scala> rdd1.count
res17: Long = 923452 
Trigger the calculation and cache the results

scala> rdd1.count
res18: Long = 923452
 Read data directly from the cache.

Note that calling cache only marks the RDD as cacheable; the data is actually cached when a later action triggers the computation, not at the moment cache is called.
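
A minimal sketch of persist with an explicit storage level (and of releasing the cache afterwards); StorageLevel is part of the public Spark API:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt")
rdd.persist(StorageLevel.MEMORY_AND_DISK)   // mark for caching: memory first, spill to disk if it does not fit
rdd.count                                   // first action computes the RDD and fills the cache
rdd.count                                   // served from the cache
rdd.unpersist()                             // release the cached blocks when they are no longer needed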

2.8 RDD Fault Tolerance Mechanism--checkpoint

A Spark computation involves multiple RDD transformations. If a partition of an RDD fails to compute at some point, its result is lost. The simplest remedy is to recompute everything from the beginning, but that wastes time. Checkpointing saves the state of an RDD at a checkpoint when a computation is triggered, so that if a later computation goes wrong it can be restarted from the checkpoint.
The checkpoint path is usually set on a fault-tolerant, highly reliable file system (such as HDFS or S3) to store the checkpoint data; when an error occurs, the data is read back directly from the checkpoint directory. There are two modes: a local directory and a remote directory.

2.8.1 Local Directory

This mode requires running in local mode (not cluster mode) and is generally used for testing and development.

sc.setCheckpointDir(localPath) sets the local checkpoint path
 rdd1.checkpoint marks the RDD for checkpointing
 rdd1.count hits an action operator, triggers the computation, and writes a checkpoint into the checkpoint directory

2.8.2 Remote Directory (hdfs example)

This mode requires running in cluster mode and is intended for production environments.

scala> sc.setCheckpointDir("hdfs://192.168.109.132:8020/sparkckpt0619")

scala> rdd1.checkpoint

scala> rdd1.count
res22: Long = 923452

The usage is the same; only the directory differs.

Note that when checkpoint is used, a line in the source code reads as follows:

this function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.

Roughly, it means:
checkpoint must be called before the computation starts, i.e. before any action operator. It is also strongly recommended to cache this RDD in memory; otherwise saving the checkpoint will recompute the whole RDD.
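
A sketch of the recommended pattern described above: cache the RDD and call checkpoint before any action, so writing the checkpoint does not force a recomputation:

sc.setCheckpointDir("hdfs://192.168.109.132:8020/sparkckpt0619")

val rdd1 = sc.textFile("hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt")
rdd1.cache()        // keep the data in memory so the checkpoint write can reuse it
rdd1.checkpoint()   // only marks the RDD; nothing is written yet
rdd1.count()        // the action computes the result and materializes the checkpoint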

2.9 Dependency and stage principles in RDD

2.9.1 RDD Dependency

This is a key concept in the working principle of RDD.
First of all, because a Spark computation involves transformations across multiple RDDs, dependencies exist between the RDDs. There are two different types of relationship between an RDD and the parent RDD(s) it depends on: narrow dependency and wide dependency. See the figure below.

Figure 2.2 Narrow and wide dependencies of RDDs

Wide dependency:
A partition of the parent RDD is depended on by multiple partitions of the child RDD. This means the parent RDD's data must be redistributed across several child partitions, and that redistribution is exactly what shuffle is. In practice, partitions of multiple parent RDDs and partitions of multiple child RDDs often depend on each other in an interleaved way.

Narrow dependency:
Each partition of the parent RDD is depended on by at most one partition of the child RDD.

2.9.2 stage division

Figure 2.3 RDD Dependency

A DAG (Directed Acyclic Graph) is formed as the original RDD goes through a series of transformations, and the DAG is divided into different stages according to the dependencies between RDDs. The role of narrow and wide dependencies is precisely this stage division: dependencies between stages are wide, and dependencies within a stage are narrow.
For narrow dependencies, since partitions of the parent and child RDDs have a one-to-one relationship, the parent-to-child transformations can be executed within a single task. In task0 above, for example, C -> D -> F are all narrow dependencies, so the C -> D -> F transformations run in one task. Inside a stage there are only narrow dependencies.
For wide dependencies, because of the shuffle, all partitions of the parent RDD must be processed before the shuffle can run and the child RDD can be computed. The shuffle breaks the task chain, which then has to be re-planned, so wide dependencies are the basis for dividing stages.
Going a bit deeper: why divide stages at all?
After stages are divided along wide dependencies, only narrow dependencies remain inside a stage, and narrow dependencies are one-to-one, so the task chain inside a stage is continuous and involves no shuffle. In task0 above, for example, one partition flows through C -> D -> F one-to-one, so it forms a continuous task chain placed in a single task; the other partition is handled similarly in task1. Because F -> G is a wide dependency and requires a shuffle, the task chain cannot continue across it. A task, then, is a line that strings together RDD transformation logic until it hits a wide dependency; in fact, a task is the transformation process of one partition's data. In Spark, the task is the smallest scheduling unit, and Spark assigns each task to a worker node close to the partition's data. So what Spark actually schedules is tasks.
Back to the original question: we divide stages because, once the DAG is split along wide dependencies, it becomes easy to divide tasks within each stage, where each task processes the data of one partition; Spark then schedules those tasks onto the corresponding worker nodes. From dividing stages to dividing tasks, the core goal is parallel computing.
In short, the purpose of dividing stages is to make it easier to divide tasks.

2.9.3 What does an RDD store?

This naturally raises a question: does an RDD store data? In fact it does not store the actual data; it stores the transformation chain, that is, the chain of partition transformations, i.e. the operators contained in a task. Once stages and then tasks have been divided, it is clear which operators belong to each task, and that computation is shipped to the worker nodes for execution. This style of computation is called the pipeline model, with the operators in the pipeline.
So although RDD is called a resilient distributed dataset, that does not mean it stores data; rather, it stores the functions, or operators, that operate on the data.
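
A small sketch for seeing this in practice: toDebugString prints an RDD's lineage (the chain of operators it stores), and each change of indentation marks a shuffle, i.e. a stage boundary. The file path reuses the one from section 2.3 and is only illustrative.

val words  = sc.textFile("/usr/local/tmp_files/test_WordCount.txt")
val counts = words.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
println(counts.toDebugString)
// The ShuffledRDD produced by reduceByKey sits at its own indentation level:
// that is the wide dependency where the job is split into two stages.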

2.10 RDD Advanced Operators

2.10.1 mapPartitionsWithIndex

def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) ⇒ Iterator[U])

//Parameter description:
f is a user-defined function.
f receives two parameters: the first is an Int representing the partition index, and the second is an Iterator[T] over all elements of that partition.

//With these two parameters you can define a function that processes a whole partition.
Iterator[U]: the result returned when the operation completes.

//Give an example:
//Print out the elements in each partition, including the partition number.

//Create an rdd first with a specified number of partitions of 3
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> def fun1(index:Int,iter:Iterator[Int]):Iterator[String]={
| iter.toList.map( x => "[PartId: " + index + " , value = " + x + " ]").iterator
| }
fun1: (index: Int, iter: Iterator[Int])Iterator[String]

scala> rdd1.mapPartitionsWithIndex(fun1).collect
res0: Array[String] = Array(
[PartId: 0 , value = 1 ], [PartId: 0 , value = 2 ], 
[PartId: 1 , value = 3 ], [PartId: 1 , value = 4 ], [PartId: 1 , value = 5 ], 
[PartId: 2 , value = 6 ], [PartId: 2 , value = 7 ], [PartId: 2 , value = 8 ]
)

2.10.2 aggregate

An aggregation operator. aggregate first aggregates locally within each partition (similar to the combiner in MapReduce) and only then aggregates globally. This can perform better than calling reduce directly, because reduce goes straight to a global aggregation.

def aggregate[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U)
Parameter description:
zeroValue: the initial value; it participates in each partition's local aggregation and again in the final global aggregation
seqOp: (U, T) => U: the local (per-partition) aggregation function
combOp: (U, U) => U: the global aggregation function

=================================================
Example 1:
Initial value is 10
scala> val rdd2 = sc.parallelize(List(1,2,3,4,5),2)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:27

//Print to see partitions
scala> rdd2.mapPartitionsWithIndex(fun1).collect
res7: Array[String] = Array([PartId: 0 , value = 1 ], [PartId: 0 , value = 2 ], [PartId: 1 , value = 3 ], [PartId: 1 , value = 4 ], [PartId: 1 , value = 5 ])

//Take the maximum within each partition (locally), then add the per-partition maxima together globally
scala> rdd2.aggregate(10)(math.max(_,_),_+_)
res8: Int = 30

Why is the result 30?
The initial value of 10 means each partition gets an extra 10.
In the local step, the maximum of each partition is therefore 10.
In the global step there is one more 10, i.e. 10 (local maximum of partition 0) + 10 (local maximum of partition 1) + 10 (the initial value applied again globally) = 30.

=================================================
Example 2:
There are two equivalent ways to sum the data of all partitions globally:
1. reduce(_+_)
2. aggregate(0)(_+_, _+_)
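
A quick check of this equivalence on the rdd2 created above (List(1,2,3,4,5) in 2 partitions):

rdd2.reduce(_ + _)                // 15
rdd2.aggregate(0)(_ + _, _ + _)   // 15: with a zero value of 0, local sums plus the global sum match reduce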

2.10.3 aggregateByKey

Similar to aggregate, except that it operates on <key, value> data and only combines the values that share the same key. KV pairs with the same key are first grouped locally and their values aggregated; the groups are then aggregated again globally.

aggregateByKey achieves functionality similar to reduceByKey, but can be more efficient than reduceByKey.

Example:

scala> val pairRDD = sc.parallelize(List(("cat",2),("cat",5),("mouse",4),("cat",12),("dog",12),("mouse",2)),2)
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[16] at parallelize at <console>:27

scala> def fun1(index:Int,iter:Iterator[(String,Int)]):Iterator[String]={
| iter.toList.map( x => "[PartId: " + index + " , value = " + x + " ]").iterator
| }
fun1: (index: Int, iter: Iterator[(String, Int)])Iterator[String]

scala> pairRDD.mapPartitionsWithIndex(fun1).collect
res31: Array[String] = Array(
[PartId: 0 , value = (cat,2) ], [PartId: 0 , value = (cat,5) ], [PartId: 0 , value = (mouse,4) ],
[PartId: 1 , value = (cat,12) ], [PartId: 1 , value = (dog,12) ], [PartId: 1 , value = (mouse,2)
])

Requirement:
Within each partition, take the maximum count per key, then sum those per-partition maxima across partitions:
pairRDD.aggregateByKey(0)(math.max(_,_),_+_).collect

Partition 0: (cat,2) and (cat,5) --> (cat,5); (mouse,4)
Partition 1: (cat,12), (dog,12), (mouse,2)

Sum of per-partition maxima: (cat,17), (mouse,6), (dog,12)
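
For contrast, a plain per-key sum over the same pairRDD (no per-partition max), as reduceByKey would produce:

pairRDD.reduceByKey(_ + _).collect   // Array((dog,12), (cat,19), (mouse,6)) (order may vary)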

2.10.4 coalesce and repartition

Both are used to repartition an RDD:
 repartition(numPartitions) repartitions to the given number of partitions; a shuffle always occurs
 coalesce(numPartitions, shuffleOrNot) repartitions to the given number of partitions; no shuffle occurs by default, but one can be requested
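
A minimal sketch of the two operators (partition counts are illustrative):

val rdd = sc.parallelize(1 to 100, 4)
rdd.partitions.length                                 // 4

val more = rdd.repartition(8)                         // always shuffles
more.partitions.length                                // 8

val fewer = rdd.coalesce(2)                           // no shuffle by default: partitions are merged locally
fewer.partitions.length                               // 2

val fewerShuffled = rdd.coalesce(2, shuffle = true)   // force a shuffle for a more even redistribution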

For more operator usage, see <http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html>, which is written in great detail.

2.11 Partition

Spark ships with two partitioner classes:
HashPartitioner: the default partitioner, used by operators that involve a shuffle; some of those operators let you specify a minimum number of partitions. Partitioners only apply to KV RDDs.
RangePartitioner: partitions by key range, e.g. keys 1-100 and keys 101-200 go to different partitions.
Users can also write their own custom partitioner by following these steps:
1. Extend the Partitioner class and implement the partition logic inside it, forming a new partitioner class
2. Call rdd.partitionBy(new partitionerClassXxx())
Example:

The data format is as follows:
192.168.88.1 - - [30/Jul/2017:12:55:02 +0800] "GET /MyDemoWeb/web.jsp HTTP/1.1" 200 239
192.168.88.1 - - [30/Jul/2017:12:55:02 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242

//Requirements:
//Write access logs for the same page to a separate file 

//Code:
package SparkExer

import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import scala.collection.mutable

/**
  * Custom partitions:
  * 1,Inherit the Partitioner class, write partition logic inside, and form a new partition class
  * 2,rdd.partitionBy(new partiotionerClassxxx())
  */
object CustomPart {
  def main(args: Array[String]): Unit = {
    //Specify Hadoop's home directory (needed on Windows for writing files locally)
    System.setProperty("hadoop.home.dir","F:\\hadoop-2.7.2")

    val conf = new SparkConf().setAppName("Tomcat Log Partitioner").setMaster("local")
    val sc = new SparkContext(conf)
    //Cut File
    val rdd1 = sc.textFile("G:\\test\\tomcat_localhost_access_log.2017-07-30.txt").map(
      line => {
        val jspName = line.split(" ")(6)
        (jspName,line)
      }
    )

    //Extract all distinct keys, i.e. the page names
    val rdd2 = rdd1.map(_._1).distinct().collect()
    //partition
    val rdd3 = rdd1.partitionBy(new TomcatWebPartitioner(rdd2))
    //Write partition data to a file
    rdd3.saveAsTextFile("G:\\test\\tomcat_localhost")
  }
}

class TomcatWebPartitioner(jspList:Array[String]) extends Partitioner{
  private val listMap = new mutable.HashMap[String,Int]()
  var partitionNum = 0

  //Plan the number of partitions based on the page name
  for (s<-jspList) {
    listMap.put(s, partitionNum)
    partitionNum += 1
  }

  //Return total number of partitions
  override def numPartitions: Int = listMap.size

  //Return a partition number by key
  override def getPartition(key: Any): Int = listMap.getOrElse(key.toString, 0)
}

2.12 Serialization issues

First of all, a Spark program is really divided into two parts: the driver, which runs the main program, and the executors, which run the tasks that operate on RDDs. It follows that only code operating on RDDs runs distributed, shipped to multiple executors; code that is not part of an RDD operation is executed only in the driver. That is the key point.
Example:

object Test {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("serialization demo").setMaster("local"))
    println("xxxx1")                      // not part of any RDD operation: runs only in the driver

    val rdd1 = sc.textFile("xxxx")        // path elided as in the original
    rdd1.foreach(x => println("xxx2"))    // inside the RDD operation: shipped to and run on the executors
  }
}

In the example above, the println("xxx2") inside rdd1's operation is executed on multiple executors because it runs inside the RDD operation, while the outer println("xxxx1") is executed only in the driver; it is not serialized, so it is never shipped over the network. This difference must be understood. It also means that a variable defined outside an RDD operation cannot simply be read by the code running on multiple executors. But what if we want exactly that, without having to define the variable inside the RDD operation? That is what the shared variables below are for.

2.13 Broadcast variables in spark (shared variables)

A broadcast variable is a variable that is not defined inside any RDD operator yet can still be read by RDD operators running on different executors. Read-only data that many tasks need, such as the connection settings for a database like MySQL, can be broadcast so that it only needs to be created and shipped once.
Example usage:

//Define a shared variable for sharing data read from mongodb that needs to be encapsulated as a map(mid1,[map(mid2,score),map(mid3,score)....)

    val moviesRecsMap = spark.read
      .option("uri", mongoConfig.uri)
      .option("collection", MOVIES_RECS)
      .format("com.mongodb.spark.sql")
      .load().as[MoviesRecs].rdd.map(item=> {
      (item.mid, item.recs.map(itemRecs=>(itemRecs.mid,itemRecs.socre)).toMap)
    }).collectAsMap()

    //This is the key step, broadcasting variables out
    //Broadcast this variable and you can call it later in any executor
    val moviesRecsMapBroadcast = spark.sparkContext.broadcast(moviesRecsMap)
    //Since it's lazy loading, you need to call it manually once to actually broadcast it
    moviesRecsMapBroadcast.id
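
A smaller, self-contained sketch of the same pattern (illustrative names, using the SparkContext sc): build a read-only lookup map on the driver, broadcast it once, and read it through .value inside executor-side code.

val lookup = Map("GET" -> "read", "POST" -> "write")
val lookupBc = sc.broadcast(lookup)

val methods  = sc.parallelize(List("GET", "POST", "GET"))
val labelled = methods.map(m => (m, lookupBc.value.getOrElse(m, "other")))   // .value is called on the executors
labelled.collect   // Array((GET,read), (POST,write), (GET,read))

lookupBc.unpersist()   // optionally release the broadcast blocks on the executors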

3. Small spark Cases

3.1 Counting the top N most-visited site pages

Requirement: based on the site access log, find the names of the top N pages by number of visits.
//The data format is as follows:
192.168.88.1 - - [30/Jul/2017:12:55:02 +0800] "GET /MyDemoWeb/web.jsp HTTP/1.1" 200 239
192.168.88.1 - - [30/Jul/2017:12:55:02 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242

//Code:
package SparkExer

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Analyzing tomcat logs
  * Log example:
  * 192.168.88.1 - - [30/Jul/2017:12:53:43 +0800] "GET /MyDemoWeb/ HTTP/1.1" 200 259
  *
  * Count visits to each page
  */
object TomcatLog {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Tomcat Log analysis").setMaster("local")
    val sc = new SparkContext(conf)

    val rdd1 = sc.textFile("G:\\test\\tomcat_localhost_access_log.2017-07-30.txt")
      .map(_.split(" ")(6))
      .map((_,1))
      .reduceByKey(_+_)
      .map(t=>(t._2,t._1))
      .sortByKey(false)
      .map(t=>(t._2,t._1))
      .collect()
    //Or sort directly by value using sortBy(_._2, false)

    //Take the first N entries from the result
    rdd1.take(2).foreach(x=>println(x._1 + ":" + x._2))
    println("=========================================")
    //Take the last N entries from the result
    rdd1.takeRight(2).foreach(x=>println(x._1 + ":" + x._2))
    sc.stop()
  }
}

3.2 Example of custom partitions

See the previous 2.11 partition example

3.3 spark connection mysql

package SparkExer

import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.{SparkConf, SparkContext}

object SparkConMysql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Tomcat Log To Mysql").setMaster("local")
    val sc = new SparkContext(conf)
    val rdd1 = sc.textFile("G:\\test\\tomcat_localhost_access_log.2017-07-30.txt")
      .map(_.split(" ")(6))

    rdd1.foreach(l=>{
      //The JDBC operations must be inside the RDD operation so they can be executed by the executors on all workers, i.e. they ride along with the RDD operation instead of being serialized from the driver
      val jdbcUrl = "jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8"
      var conn:Connection = null
      //sql statement edit object
      var ps:PreparedStatement = null

      conn = DriverManager.getConnection(jdbcUrl, "root", "wjt86912572")
      //? is a placeholder; values are filled in order with ps.setXxx(index, value)
      ps = conn.prepareStatement("insert into customer values (?,?)")
      ps.setString(1,l)
      ps.setInt(2,1)
      ps.executeUpdate()
      ps.close()
      conn.close()
    })
  }
}

Be careful:
When Spark works with JDBC, operating on the database with a JDBC object created outside the RDD causes serialization problems,
because in Spark's distributed framework every object used inside an RDD operation must be usable across the whole cluster,
which means it would have to be serializable, and a JDBC connection object is not. Workers cannot share a single JDBC connection object; each one has to create its own.
So the JDBC connection object needs to be defined inside the RDD operation.

The approach above is wasteful: a new JDBC connection object is created for every record.
Optimization: use rdd1.foreachPartition() to operate on each partition rather than on each record.
This saves database resources because only one JDBC connection object is created per partition.

package SparkExer

import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.{SparkConf, SparkContext}

object SparkConMysql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Tomcat Log To Mysql").setMaster("local")
    val sc = new SparkContext(conf)
    val rdd1 = sc.textFile("G:\\test\\tomcat_localhost_access_log.2017-07-30.txt")
      .map(_.split(" ")(6))

    rdd1.foreachPartition(updateMysql)
    /**
      * The above approach is cumbersome, and each data creates a new jdbc connection object
      * Optimize: Use rdd1.foreachPartition() to operate on each partition, not on each data
      * This saves database resources by creating only one jdbc connection object for each partition
      */

  }

  def updateMysql(it:Iterator[String]) = {
    val jdbcUrl = "jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8"
    var conn:Connection = null
    //sql statement edit object
    var ps:PreparedStatement = null

    conn = DriverManager.getConnection(jdbcUrl, "root", "wjt86912572")
    //conn.createStatement()

    //ps = conn.prepareStatement("select * from customer")
    //? is a placeholder; values are filled in order with ps.setXxx(index, value)
    ps = conn.prepareStatement("insert into customer values (?,?)")
    it.foreach(data=>{
      ps.setString(1,data)
      ps.setInt(2,1)
      ps.executeUpdate()
    })
    ps.close()
    conn.close()
  }
}

Another way to connect to MySQL is through a JdbcRDD object:

package SparkExer

import java.sql.DriverManager

import org.apache
import org.apache.spark
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}

object MysqlJDBCRdd {
  def main(args: Array[String]): Unit = {
    val conn = () => {
      Class.forName("com.mysql.jdbc.Driver").newInstance()
      DriverManager.getConnection("jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8",
      "root",
      "wjt86912572")
    }
    val conf = new SparkConf().setAppName("Tomcat Log To Mysql").setMaster("local")
    val sc = new SparkContext(conf)
    //Create a jdbcrdd object
    val mysqlRdd = new JdbcRDD(sc,conn,"select * from customer where id>? and id<?", 2,7,2,r=> {
      r.getString(2)
    })

  }
}

//This object is quite limited: it can only run SELECT statements, and the query must contain two '?' placeholders as lower and upper bounds in the WHERE clause, plus the number of partitions must be specified.

4. shuffle problem

4.1 shuffle-induced data skew analysis

https://www.cnblogs.com/diaozhaojian/p/9635829.html

1. How data skew arises
 (1) During a shuffle, all records with the same key on every node must be pulled to one task on one node for processing. If a single key has a very large amount of data, skew occurs.
(2) The partitioning rules applied after the shuffle can also put too much data into one partition, which likewise skews the data.

2. Detecting and locating data skew
 Use the Spark Web UI to check how much data each task in the currently running stage is assigned, to determine whether unevenly distributed task input is causing the skew.
    Once you know in which stage the skew occurs, use the stage-division principle to work out which part of the code that stage corresponds to; that code necessarily contains a shuffle operator. The distribution of keys can then be inspected with countByKey.

3. Data skew solutions
 Filter out the few keys that cause the skew
 Increase the parallelism of the shuffle operation
 Local aggregation followed by global aggregation (key salting), sketched below
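
A hedged sketch of the "local + global aggregation" (key salting) idea for a skewed reduceByKey: prefix each key with a small random salt so that a hot key is spread across several tasks, aggregate once, strip the salt, then aggregate again. The data and the salt range are illustrative.

import scala.util.Random

val skewed = sc.parallelize(Seq(("hot", 1), ("hot", 1), ("hot", 1), ("cold", 1)))

val salted     = skewed.map { case (k, v) => (s"${Random.nextInt(10)}_$k", v) }   // e.g. "3_hot"
val partialAgg = salted.reduceByKey(_ + _)                                        // first aggregation over salted keys
val finalAgg   = partialAgg
  .map { case (saltedKey, v) => (saltedKey.split("_", 2)(1), v) }                 // strip the salt
  .reduceByKey(_ + _)                                                             // second, global aggregation per real key

finalAgg.collect   // Array((hot,3), (cold,1)) in some order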

4.2 shuffle class operator

1. Deduplication:
def distinct()
def distinct(numPartitions: Int)

2. Aggregation
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
def groupBy[K](f: T => K, p: Partitioner):RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner):RDD[(K, Iterable[V])]
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner): RDD[(K, U)]
def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int): RDD[(K, U)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializer: Serializer = null): RDD[(K, C)]

3. Sorting
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)]

def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]

4. Repartitioning

def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null)

5. Set and join operations
def intersection(other: RDD[T]): RDD[T]

def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

def intersection(other: RDD[T], numPartitions: Int): RDD[T]

def subtract(other: RDD[T], numPartitions: Int): RDD[T]

def subtract(other: RDD[T], p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]

def subtractByKey[W: ClassTag](other: RDD[(K, W)], numPartitions: Int): RDD[(K, V)]

def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)]

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]

def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
