[TOC]
1. Some basic terms in spark
RDD: Resilient Distributed Dataset, the core abstraction in Spark
Operators: functions that operate on RDDs
application: a user-written Spark program (driver program + executor program)
job: a unit of work triggered by an action-class operator
stage: a job is divided into several stages according to the dependencies between RDDs; each stage is a set of tasks
task: within one stage there are multiple tasks that perform the same operations on different data; the task is the smallest execution unit in the cluster
These concepts may not make much sense yet. That's OK; just keep a first impression, they will become clearer below.
2. Basic Principles and Use of RDD
2.1 What is RDD
RDD, in full Resilient Distributed Dataset, is the most basic data abstraction in Spark. It represents an immutable, partitioned collection of elements that can be computed in parallel. RDDs follow a data-flow model and provide automatic fault tolerance, location-aware scheduling, and scalability. If that is still not clear, here is an example:
Suppose I use sc.textFile(xxxx) to read a file from HDFS. The file's data corresponds to one RDD, but physically that data is processed on several different worker nodes; logically, within the Spark cluster, it all belongs to a single RDD. In other words, an RDD is a logical concept, an abstraction over data distributed across the cluster, and it is the key to Spark's distributed data processing. For example:
Figure 2.1 RDD principles
2.2 RDD Properties
For RDD properties, there is a comment in the source code as follows:
* Internally, each RDD is characterized by five main properties:
* - A list of partitions
1. A list of partitions
Understanding: an RDD is made up of partitions, each running on a different worker, which is what enables distributed computing; the partition is the basic unit of the dataset. Each partition is processed by one task, so the number of partitions determines the granularity of parallelism. Users can specify the number of partitions when the RDD is created; if not, a default value is used (the number of CPU cores assigned to the program).
* - A function for computing each split
2. A function for computing each split (split can be understood as partition)
An RDD carries a set of functions that process the data in each partition; these are called operators. Spark computes RDDs partition by partition, and each RDD implements a compute function for this purpose. The compute function composes iterators and does not save intermediate results. Operator types: transformation and action.
* - A list of dependencies on other RDDs
3. A list of dependencies on other RDDs
Dependencies are either narrow or wide. Stages are divided according to dependencies, and tasks are executed stage by stage. Each transformation of an RDD generates a new RDD, so RDDs form a pipeline-like dependency chain (lineage). When some partition data is lost, Spark can recompute just the missing partitions through this lineage instead of recomputing all partitions of the RDD.
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
4. Optionally, a Partitioner
When creating an RDD you can specify a partitioner or define your own partitioning rules. Spark currently provides two partitioners: the hash-based HashPartitioner and the range-based RangePartitioner. A Partitioner only exists for key-value RDDs; for non-key-value RDDs the partitioner is None. The Partitioner determines the number of partitions of the RDD itself as well as the number of partitions of the parent RDD's shuffle output.
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
5. A list of preferred locations: run tasks on the nodes closest to the data ("move computation, not data")
An explanation: Spark is usually built on top of HDFS and reads its input from HDFS, which is distributed storage. Suppose there are three data nodes A, B and C, and the data Spark needs happens to be stored on node C. If Spark placed the task on node A or B, the data would first have to be read from node C and transferred over the network before it could be processed, which is expensive. Spark therefore prefers to place the task on the node closest to the data, here node C, saving the time and cost of data transfer. That is, move the computation rather than the data.
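The five properties above can also be inspected directly in spark-shell (where sc is predefined). A minimal sketch; the path and the resulting values are only assumptions for illustration:

```scala
// A minimal sketch: inspecting the five RDD properties (assumed HDFS path)
val rdd = sc.textFile("hdfs://namenode:8020/tmp/some_file.txt").map(line => (line, 1))

rdd.partitions.length                      // 1. the list of partitions (here: how many there are)
rdd.dependencies                           // 3. dependencies on parent RDDs, e.g. OneToOneDependency
rdd.partitioner                            // 4. Some(partitioner) for shuffled key-value RDDs; None here
rdd.preferredLocations(rdd.partitions(0))  // 5. preferred locations, e.g. the HDFS block hosts
// 2. the compute function is what map/filter/... attach to each partition
```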
2.3 Create RDD
To create an RDD, you first need to create a SparkContext object:

//Create the Spark configuration object. Set the app name and the master address; "local" means local mode.
//When submitting to a cluster, the master is usually not specified here: hard-coding it is inconvenient because the program may run on different clusters.
val conf = new SparkConf().setAppName("wordCount").setMaster("local")
//Create the Spark context object
val sc = new SparkContext(conf)
Create an RDD from sc.parallelize():
sc.parallelize(seq, numPartitions)
seq: a sequence object, such as a List or Array
numPartitions: the number of partitions; if not specified, Spark's default parallelism is used
Example:
val rdd1 = sc.parallelize(Array(1,2,3,4,5),3)
rdd1.partitions.length
Create from an external data source
val rdd1 = sc.textFile("/usr/local/tmp_files/test_WordCount.txt")
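textFile also accepts a minimum number of partitions as a second argument. A small sketch, reusing the path above; the partition count of 3 is just an example:

```scala
// Ask for at least 3 partitions when reading the file
val rdd2 = sc.textFile("/usr/local/tmp_files/test_WordCount.txt", 3)
rdd2.partitions.length   // at least 3, depending on the file's block layout
```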
2.4 Operator Type
Operators are classified into transformations and actions.
transformation:
Lazily evaluated: a transformation does not trigger any computation by itself. Computation is triggered only when an action operator is encountered. Transformations simply record the operations to be applied to the underlying dataset (for example, a file); they are only actually executed when an action requires a result to be returned to the driver. This design makes Spark run more efficiently.
action:
Unlike a transformation, an action triggers the computation immediately rather than deferring it.
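A quick way to see the difference in spark-shell; a minimal sketch:

```scala
// Transformations are lazy, actions trigger the computation
val nums = sc.parallelize(1 to 5)
val doubled = nums.map(_ * 2)   // transformation: nothing runs yet
doubled.count()                 // action: a job is actually executed now
doubled.collect()               // another action: Array(2, 4, 6, 8, 10)
```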
2.5 transformation operator
For illustration, the following examples use spark-shell. First create an RDD:
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,8,34,100,79))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
2.5.1 map(func)
map[U](f: T => U)
The parameter is a function that takes a single argument and returns a single value. map applies the function to each element and returns a new RDD of the processed data.
Example:
//An anonymous function is passed in; each value in rdd1 is multiplied by 2 and a new RDD is returned
scala> val rdd2 = rdd1.map(_*2)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26
//collect is an action operator: it triggers the computation and returns the result
scala> rdd2.collect
res0: Array[Int] = Array(2, 4, 6, 8, 10, 16, 68, 200, 158)
2.5.2 filter
filter(f: T => Boolean)
The parameter is a predicate that returns true or false for each element; filter is typically used to filter data, keeping only the elements for which the predicate returns true.
Example:
//Keep only the values larger than 20
scala> rdd2.filter(_>20).collect
res4: Array[Int] = Array(68, 200, 158)
2.5.3 flatMap
flatMap(f: T => U)
map followed by flatten: the per-element results (such as lists) are expanded and merged into one flat collection, and the processed data is returned. This function is typically used when each element itself produces a collection.
Example:
scala> val rdd4 = sc.parallelize(Array("a b c","d e f","x y z"))
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:24
//Each string in the array is split by spaces, producing several arrays, which are then flattened and merged into one new collection
scala> val rdd5 = rdd4.flatMap(_.split(" "))
rdd5: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at flatMap at <console>:26
scala> rdd5.collect
res5: Array[String] = Array(a, b, c, d, e, f, x, y, z)
2.5.4 Collection Operations
union(otherDataset): union
intersection(otherDataset): intersection
distinct([numTasks]): duplicate removal
//Example:
scala> val rdd6 = sc.parallelize(List(5,6,7,8,9,10))
rdd6: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:24
scala> val rdd7 = sc.parallelize(List(1,2,3,4,5,6))
rdd7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <console>:24
//Union
scala> val rdd8 = rdd6.union(rdd7)
rdd8: org.apache.spark.rdd.RDD[Int] = UnionRDD[22] at union at <console>:28
scala> rdd8.collect
res6: Array[Int] = Array(5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6)
//Duplicate removal
scala> rdd8.distinct.collect
res7: Array[Int] = Array(4, 8, 1, 9, 5, 6, 10, 2, 7, 3)
//Intersection
scala> val rdd9 = rdd6.intersection(rdd7)
rdd9: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[31] at intersection at <console>:28
scala> rdd9.collect
res8: Array[Int] = Array(6, 5)
2.5.5 Grouping Operations
groupByKey([numTasks]): groups the KV pairs that share the same key
reduceByKey(f: (V,V) => V, [numTasks]): first groups the KV pairs with the same key, then aggregates their values with the given function
scala> val rdd1 = sc.parallelize(List(("Tom",1000),("Jerry",3000),("Mary",2000)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[32] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List(("Jerry",1000),("Tom",3000),("Mike",2000)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[33] at parallelize at <console>:24
scala> val rdd3 = rdd1 union rdd2
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[34] at union at <console>:28
scala> rdd3.collect
res9: Array[(String, Int)] = Array((Tom,1000), (Jerry,3000), (Mary,2000), (Jerry,1000), (Tom,3000), (Mike,2000))
//Grouping
scala> val rdd4 = rdd3.groupByKey
rdd4: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[35] at groupByKey at <console>:30
scala> rdd4.collect
res10: Array[(String, Iterable[Int])] = Array(
(Tom,CompactBuffer(1000, 3000)),
(Jerry,CompactBuffer(3000, 1000)),
(Mike,CompactBuffer(2000)),
(Mary,CompactBuffer(2000)))
//Note: groupByKey is not recommended for grouping-and-aggregation because of its poor performance; reduceByKey is officially recommended
//Group and aggregate
scala> rdd3.reduceByKey(_+_).collect
res11: Array[(String, Int)] = Array((Tom,4000), (Jerry,4000), (Mike,2000), (Mary,2000))
2.5.6 cogroup
cogroup groups two key-value RDDs by key: for each key it returns a tuple containing the values from the first RDD and the values from the second RDD, each as an Iterable. An example makes this clearer:
scala> val rdd1 = sc.parallelize(List(("Tom",1),("Tom",2),("jerry",1),("Mike",2)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[37] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List(("jerry",2),("Tom",1),("Bob",2)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[38] at parallelize at <console>:24
scala> val rdd3 = rdd1.cogroup(rdd2)
rdd3: org.apache.spark.rdd.RDD[(String, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[40] at cogroup at <console>:28
scala> rdd3.collect
res12: Array[(String, (Iterable[Int], Iterable[Int]))] = Array(
(Tom,(CompactBuffer(1, 2),CompactBuffer(1))),
(Mike,(CompactBuffer(2),CompactBuffer())),
(jerry,(CompactBuffer(1),CompactBuffer(2))),
(Bob,(CompactBuffer(),CompactBuffer(2))))
2.5.7 Sort
sortByKey(ascending: true/false): sorts a KV RDD by key
sortBy(f: T => U, ascending: true/false): general-purpose sort that sorts by the result of applying f; it can be used, for example, to sort a KV RDD by value
//Example:
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,8,34,100,79))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val rdd2 = rdd1.map(_*2)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26
scala> rdd2.collect
res0: Array[Int] = Array(2, 4, 6, 8, 10, 16, 68, 200, 158)
scala> rdd2.sortBy(x=>x,true)
res1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at sortBy at <console>:29
scala> rdd2.sortBy(x=>x,true).collect
res2: Array[Int] = Array(2, 4, 6, 8, 10, 16, 68, 158, 200)
scala> rdd2.sortBy(x=>x,false).collect
res3: Array[Int] = Array(200, 158, 68, 16, 10, 8, 6, 4, 2)
Another example:
Requirement: sort a KV RDD by value, while sortByKey only sorts by key.
//Approach 1:
1. First swap key and value, then call sortByKey
2. Swap key and value back
scala> val rdd1 = sc.parallelize(List(("tom",1),("jerry",1),("kitty",2),("bob",1)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[42] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List(("jerry",2),("tom",3),("kitty",5),("bob",2)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[43] at parallelize at <console>:24
scala> val rdd3 = rdd1 union(rdd2)
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[44] at union at <console>:28
scala> val rdd4 = rdd3.reduceByKey(_+_)
rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[45] at reduceByKey at <console>:30
scala> rdd4.collect
res13: Array[(String, Int)] = Array((bob,3), (tom,4), (jerry,3), (kitty,7))
//Swap positions, sort, then swap back
scala> val rdd5 = rdd4.map(t => (t._2,t._1)).sortByKey(false).map(t=>(t._2,t._1))
rdd5: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[50] at map at <console>:32
scala> rdd5.collect
res14: Array[(String, Int)] = Array((kitty,7), (tom,4), (bob,3), (jerry,3))
//Approach 2:
//Use sortBy directly, which can sort by value without the swapping (see the sketch below)
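A minimal sketch of approach 2, continuing with the rdd4 defined above:

```scala
// Approach 2: sortBy lets us sort by value directly (descending here)
rdd4.sortBy(_._2, false).collect
// expected: Array((kitty,7), (tom,4), (bob,3), (jerry,3)) -- ties may appear in either order
```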
2.6 action operator
reduce
Similar to the reduceByKey seen earlier, but it merges non-KV data and is an action operator:
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[41] at parallelize at <console>:24
scala> val rdd2 = rdd1.reduce(_+_)
rdd2: Int = 15
There are also some action operators:
reduce(func): aggregates all elements of the RDD using the function func
collect(): returns all elements of the dataset to the driver as an array; usually used only to trigger the computation or on small results
count(): returns the number of elements in the RDD
first(): returns the first element of the RDD (similar to take(1))
take(n): returns an array of the first n elements of the dataset
takeSample(withReplacement, num, [seed]): returns an array of num elements sampled randomly from the dataset; withReplacement controls whether sampling is done with replacement, and seed specifies the random number generator seed
takeOrdered(n, [ordering]): returns an array of the first n elements of the dataset in sorted order
saveAsTextFile(path): saves the elements of the dataset as text files to HDFS or another supported file system; for each element, Spark calls toString to convert it into a line of text
saveAsSequenceFile(path): saves the elements of the dataset in Hadoop SequenceFile format to the specified directory on HDFS or another Hadoop-supported file system
saveAsObjectFile(path): saves the elements of the dataset as serialized objects
countByKey(): for an RDD of type (K,V), returns a map of (K, count) giving the number of elements for each key
foreach(func): runs the function func on each element of the dataset
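A small sketch exercising a few of these actions in spark-shell; the values shown in comments are what one would expect for this data:

```scala
// A minimal sketch of a few action operators
val rdd = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
rdd.count()            // 3
rdd.first()            // (a,1)
rdd.take(2)            // Array((a,1), (b,2))
rdd.countByKey()       // Map(a -> 2, b -> 1)
rdd.foreach(println)   // runs on the executors; on a cluster the output appears in executor logs, not the driver console
```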
2.7 RDD Cache Features
RDDs also have a caching mechanism: an RDD can be cached in memory or on disk so that it does not have to be recomputed every time it is used.
Several operators are involved here:
cache(): marks the RDD as cacheable; the default cache location is memory, and persist() is called underneath
persist(): marks the RDD as cacheable, cached in memory by default
persist(newLevel: org.apache.spark.storage.StorageLevel): like the above, but the storage level (cache location) can be specified
The available storage levels are:
val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
They fall into three basic categories: memory-only cache, disk-only cache, and memory + disk cache.
The default location, memory, performs best but consumes a lot of memory, so be careful: if you don't need to cache, don't.
Give an example:
Read a large file and count its lines:
scala> val rdd1 = sc.textFile("hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt")
rdd1: org.apache.spark.rdd.RDD[String] = hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt MapPartitionsRDD[52] at textFile at <console>:24
scala> rdd1.count
res15: Long = 923452
Triggers the computation and counts the rows.
scala> rdd1.cache
res16: rdd1.type = hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt MapPartitionsRDD[52] at textFile at <console>:24
Marks the RDD as cacheable; this does not trigger any computation.
scala> rdd1.count
res17: Long = 923452
Triggers the computation and caches the result.
scala> rdd1.count
res18: Long = 923452
Reads the data directly from the cache.
One thing to note: calling cache only marks the RDD as cacheable; the data is actually cached when a subsequent action triggers computation, not at the moment cache is called.
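To cache somewhere other than memory, persist can be given an explicit storage level. A minimal sketch, reusing the file path from the example above:

```scala
import org.apache.spark.storage.StorageLevel

// Cache in memory, spilling to disk if memory is insufficient
val rdd2 = sc.textFile("hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt")
rdd2.persist(StorageLevel.MEMORY_AND_DISK)
rdd2.count        // first action: computes and caches
rdd2.count        // served from the cache
rdd2.unpersist()  // release the cached data when it is no longer needed
```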
2.8 RDD Fault Tolerance Mechanism--checkpoint
A Spark computation involves multiple RDD transformations. If a partition of an RDD fails to compute, its result is lost. The simplest recovery is to recompute everything from the beginning, but that wastes time. A checkpoint saves the state of the RDD when computation is triggered, so that if a later computation goes wrong, work can resume from the checkpoint instead.
The checkpoint directory is normally set on a fault-tolerant, highly reliable file system (such as HDFS or S3) to store the checkpoint data; when an error occurs, the data is read back directly from the checkpoint directory. There are two modes: a local directory and a remote directory.
2.8.1 Local Directory
This mode requires running in local mode rather than cluster mode, and is generally used for testing and development.
sc.setCheckpointDir(localPath)   sets the local checkpoint directory
rdd1.checkpoint                  marks the RDD for checkpointing
rdd1.count                       an action operator triggers the computation and writes a checkpoint into the checkpoint directory
2.8.2 Remote Directory (hdfs example)
This mode requires running in cluster mode and is intended for production environments.
scala> sc.setCheckpointDir("hdfs://192.168.109.132:8020/sparkckpt0619")
scala> rdd1.checkpoint
scala> rdd1.count
res22: Long = 923452
Usage is the same as the local case; only the directory differs.
Note that when checkpoint is used, a line in the source code reads as follows:
"this function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation."
Roughly: checkpoint must be called before the computation starts, i.e. before any action operator. It is also a good idea to persist the RDD in memory; otherwise writing the checkpoint file will require recomputing the RDD.
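Putting that recommendation into practice, a minimal sketch (reusing the example paths from above):

```scala
// Persist before checkpointing so the checkpoint write does not recompute the RDD
sc.setCheckpointDir("hdfs://192.168.109.132:8020/sparkckpt0619")
val rdd1 = sc.textFile("hdfs://192.168.109.132:8020/tmp_files/test_Cache.txt")
rdd1.cache()        // keep it in memory
rdd1.checkpoint()   // mark for checkpointing, before any action
rdd1.count()        // triggers the job, caches the data, and writes the checkpoint
```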
2.9 Dependency and stage principles in RDD
2.9.1 RDD Dependency
This is a key concept in the working principle of RDD.
First, dependency means that RDDs depend on one another, because a Spark computation involves a chain of RDD transformations. There are two kinds of relationship between an RDD and the parent RDD(s) it depends on: narrow dependency and wide dependency. See the figure:
Figure 2.2 Width-narrow dependence of RDD
Wide dependency:
A partition of the parent RDD is depended on by multiple partitions of the child RDD; that is, one parent partition's data is needed by several child partitions. This means the parent RDD's data has to be redistributed across the child partitions, and that redistribution is exactly what a shuffle is. In practice, the partitions of several parent RDDs and the partitions of the child RDD usually depend on each other in an interleaved way.
Narrow dependency:
Each partition of the parent RDD is used by at most one partition of the child RDD, so no shuffle is needed.
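The dependency type can be inspected in spark-shell; a minimal sketch:

```scala
// map is a narrow dependency, reduceByKey introduces a wide (shuffle) dependency
val pairs = sc.parallelize(List(("a", 1), ("b", 1), ("a", 2)))
val mapped = pairs.map { case (k, v) => (k, v * 2) }
mapped.dependencies             // List(org.apache.spark.OneToOneDependency@...)
val reduced = mapped.reduceByKey(_ + _)
reduced.dependencies            // List(org.apache.spark.ShuffleDependency@...)
println(reduced.toDebugString)  // shows the lineage; shuffle boundaries are where stages are divided
```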
2.9.2 stage division
Figure 2.3 RDD Dependency
DAG (Directed Acyclic Graph): the original RDD goes through a series of transformations to form a DAG, and the DAG is divided into stages according to the dependencies between the RDDs. Narrow and wide dependencies are precisely what is used to divide stages: dependencies between stages are wide, and dependencies inside a stage are narrow.
For narrow dependencies, since the partitions of parent and child RDDs are in a one-to-one relationship, the parent-to-child transformations can be executed within a single task. For example, in task0 above, C, D and F are all connected by narrow dependencies, so the C->D->F transformations can run directly in one task. Inside a stage there are only narrow dependencies.
For wide dependencies, because a shuffle is involved, all the parent partitions must be processed before the shuffle can run and the child RDD can be computed. The shuffle breaks the continuity of the task chain, so the chain has to be re-planned; that is why wide dependencies are the basis for dividing stages.
Going deeper: why divide stages at all?
After dividing stages along wide dependencies, only narrow dependencies remain inside a stage, and narrow dependencies are one-to-one, so the task chain inside a stage is continuous and involves no shuffle. For example, in task0 above, one partition flows through C->D->F; each conversion is one-to-one, so it forms a continuous task chain and is placed in a single task, and the other partition is handled similarly and placed in task1. Because F->G is a wide dependency and requires a shuffle, the task chain cannot continue across it. A chain like this strings the RDD transformation logic together until a wide dependency is reached; that chain is one task, and a task is really the transformation pipeline for one partition's data. In Spark the task is the smallest scheduling unit: Spark assigns each task to a worker node close to the partition's data, so what Spark actually schedules is tasks.
Back to the question: we divide stages because, once the DAG is split along wide dependencies, it is easy to divide the tasks within each stage, where each task processes the data of one partition, and Spark then schedules those tasks onto the appropriate worker nodes. From dividing stages to dividing tasks, the core goal is parallel computation.
In short, the purpose of dividing stages is to make it easy to divide tasks.
2.9.3 What does an RDD store?
This raises a question: does an RDD actually store data? In fact it does not; it stores the transformation chain of the data, that is, the chain of operations applied to each partition, i.e. the operators contained in a task. Once stages and then tasks have been divided, it is clear which operators belong to each task, and the computation is shipped to the worker nodes for execution. This style of computation is called the pipeline model: the operators flow through the pipeline.
So although RDD is called a Resilient Distributed Dataset, that does not mean the RDD itself stores the data; rather, it stores the functions, i.e. the operators, that operate on the data.
2.10 RDD Advanced Operators
2.10.1 mapPartitionsWithIndex
def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) => Iterator[U])
//Parameter description:
//f is a user-defined function. f receives two parameters: the first is an Int, the partition number; the second is an Iterator[T], all the elements in that partition.
//With these two parameters you can define a function that processes a whole partition. Iterator[U] is the result returned when the operation completes.
//Example: print the elements of each partition together with the partition number.
//First create an RDD with 3 partitions:
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8),3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> def fun1(index:Int,iter:Iterator[Int]):Iterator[String]={
     |   iter.toList.map( x => "[PartId: " + index + " , value = " + x + " ]").iterator
     | }
fun1: (index: Int, iter: Iterator[Int])Iterator[String]
scala> rdd1.mapPartitionsWithIndex(fun1).collect
res0: Array[String] = Array(
[PartId: 0 , value = 1 ], [PartId: 0 , value = 2 ],
[PartId: 1 , value = 3 ], [PartId: 1 , value = 4 ], [PartId: 1 , value = 5 ],
[PartId: 2 , value = 6 ], [PartId: 2 , value = 7 ], [PartId: 2 , value = 8 ] )
2.10.2 aggregate
Aggregation, somewhat like a group-by: aggregates are computed locally within each partition first (similar to a combiner in MapReduce) and then globally. This can perform better than using the reduce operator directly, because reduce aggregates globally in one step.
def aggregate[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U)
Parameter description:
zeroValue: the initial value, applied in each partition's local operation and again in the global operation
seqOp: (U, T) => U: the local (per-partition) aggregation function
combOp: (U, U) => U: the global aggregation function
=================================================
Example 1: initial value of 10
scala> val rdd2 = sc.parallelize(List(1,2,3,4,5),2)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:27
//Print the partitions to see the layout
scala> rdd2.mapPartitionsWithIndex(fun1).collect
res7: Array[String] = Array([PartId: 0 , value = 1 ], [PartId: 0 , value = 2 ], [PartId: 1 , value = 3 ], [PartId: 1 , value = 4 ], [PartId: 1 , value = 5 ])
//Find the maximum value of each partition, then add the per-partition maxima globally
scala> rdd2.aggregate(10)(max(_,_),_+_)
res8: Int = 30
Why 30? The initial value of 10 participates in each partition's local operation and again in the global operation:
each partition's local maximum is 10 (10 is larger than every element in its partition),
and the global operation adds one more 10: 10 (local max) + 10 (local max) + 10 (initial value in the global step) = 30
=================================================
Example 2: summing all data across partitions can be done in two ways with the same result:
1. reduce(_+_)
2. aggregate(0)(_+_,_+_)
2.10.3 aggregateByKey
Similar to aggregate, except that it operates on <key, value> data and only aggregates the values that share the same key: KV pairs with the same key are grouped locally and their values aggregated, and then grouped and aggregated again globally.
aggregateByKey and reduceByKey can achieve similar results; aggregateByKey is more general, since the aggregated value can be of a different type and the local and global functions can differ.
Example:
scala> val pairRDD = sc.parallelize(List(("cat",2),("cat",5),("mouse",4),("cat",12),("dog",12),("mouse",2)),2)
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[16] at parallelize at <console>:27
scala> def fun1(index:Int,iter:Iterator[(String,Int)]):Iterator[String]={
     |   iter.toList.map( x => "[PartId: " + index + " , value = " + x + " ]").iterator
     | }
fun1: (index: Int, iter: Iterator[(String, Int)])Iterator[String]
scala> pairRDD.mapPartitionsWithIndex(fun1).collect
res31: Array[String] = Array(
[PartId: 0 , value = (cat,2) ],
[PartId: 0 , value = (cat,5) ],
[PartId: 0 , value = (mouse,4) ],
[PartId: 1 , value = (cat,12) ],
[PartId: 1 , value = (dog,12) ],
[PartId: 1 , value = (mouse,2) ])
Requirement: find the largest count per animal within each partition, then add those per-partition maxima:
pairRDD.aggregateByKey(0)(math.max(_,_),_+_).collect
Partition 0: (cat,2) and (cat,5) --> (cat,5); (mouse,4)
Partition 1: (cat,12); (dog,12); (mouse,2)
Sum: (cat,17), (mouse,6), (dog,12)
2.10.4 coalesce and repartition
Both are used to repartition an RDD:
repartition(numPartitions): repartitions to the given number of partitions; a shuffle always occurs
coalesce(numPartitions, shuffleOrNot): repartitions to the given number of partitions; by default no shuffle occurs (so it can only reduce the number of partitions), but a shuffle can be requested with the second argument
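A minimal sketch of the difference in spark-shell:

```scala
// repartition always shuffles; coalesce without shuffle can only shrink the partition count
val rdd = sc.parallelize(1 to 100, 4)
rdd.partitions.length                    // 4
rdd.repartition(8).partitions.length     // 8 (shuffle)
rdd.coalesce(2).partitions.length        // 2 (no shuffle)
rdd.coalesce(8).partitions.length        // still 4: growing the count needs a shuffle
rdd.coalesce(8, true).partitions.length  // 8 (shuffle requested explicitly)
```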
For more operator usage, see <http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html>, which documents the RDD API in great detail.
2.11 Partition
Spark comes with two built-in partitioner classes:
HashPartitioner: the default partitioner, used by operators that involve a shuffle. Many of these operators accept a parameter specifying the number of partitions. Partitioners only apply to KV (key-value) RDDs.
RangePartitioner: partitions by ranges of keys, e.g. keys 1-100 and 101-200 go to different partitions.
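A minimal sketch of applying the two built-in partitioners to a key-value RDD (the data and partition counts are arbitrary):

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner}

// Apply the built-in partitioners with partitionBy
val kv = sc.parallelize(1 to 200).map(i => (i, i.toString))
val hashed = kv.partitionBy(new HashPartitioner(3))
hashed.partitions.length        // 3
val ranged = kv.partitionBy(new RangePartitioner(2, kv))
ranged.partitions.length        // 2; keys are split into ranges across the partitions
```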
Users can also define their own partitioner by following these steps:
1. Extend the Partitioner class and implement the partitioning logic in a new partitioner class
2. Call rdd.partitionBy(new partitionerClassXxx())
Example:
The data format is as follows:
192.168.88.1 - - [30/Jul/2017:12:55:02 +0800] "GET /MyDemoWeb/web.jsp HTTP/1.1" 200 239
192.168.88.1 - - [30/Jul/2017:12:55:02 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242
//Requirement:
//Write the access logs for the same page into their own file
//Code:
package SparkExer

import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import scala.collection.mutable

/**
  * Custom partitioning:
  * 1. Extend the Partitioner class, implement the partitioning logic, and form a new partitioner class
  * 2. rdd.partitionBy(new partitionerClassXxx())
  */
object CustomPart {
  def main(args: Array[String]): Unit = {
    //Specify hadoop's home directory; some of the libraries used when writing files locally need it
    System.setProperty("hadoop.home.dir","F:\\hadoop-2.7.2")

    val conf = new SparkConf().setAppName("Tomcat Log Partitioner").setMaster("local")
    val sc = new SparkContext(conf)

    //Split the file: extract the page name as the key
    val rdd1 = sc.textFile("G:\\test\\tomcat_localhost_access_log.2017-07-30.txt").map(
      line => {
        val jspName = line.split(" ")(6)
        (jspName,line)
      }
    )

    //Extract all the keys, i.e. the page names
    val rdd2 = rdd1.map(_._1).distinct().collect()

    //Partition by page name
    val rdd3 = rdd1.partitionBy(new TomcatWebPartitioner(rdd2))

    //Write the partitioned data to files
    rdd3.saveAsTextFile("G:\\test\\tomcat_localhost")
  }
}

class TomcatWebPartitioner(jspList:Array[String]) extends Partitioner{
  private val listMap = new mutable.HashMap[String,Int]()
  var partitionNum = 0

  //Plan the number of partitions based on the page names
  for (s<-jspList) {
    listMap.put(s, partitionNum)
    partitionNum += 1
  }

  //Return the total number of partitions
  override def numPartitions: Int = listMap.size

  //Return a partition number for a given key
  override def getPartition(key: Any): Int = listMap.getOrElse(key.toString, 0)
}
2.12 Serialization issues
First, recall that a Spark program really has two parts: the driver program, and the executors that run the tasks operating on RDDs. Only the code that operates on RDDs runs in a distributed way, i.e. is shipped to and executed in multiple executors; code that is not part of an RDD operation is executed only in the driver. That is the key point.
Example:
object test {
  val sc = new SparkContext()
  print("xxxx1")
  val rdd1 = sc.textFile(xxxx)
  rdd1.map(print(xxx2))
}
In the example above, print(xxx2) is executed in multiple executors because it runs inside the RDD operation, whereas the outer print("xxxx1") is executed only in the driver; it is never serialized and shipped over the network. The distinction matters: a variable that is not used inside an RDD operation is not shipped to the executors, so code running there cannot use it. But what if we want the executors to be able to use such a value, without defining it inside the RDD operation? That is what the following shared variables are for.
2.13 Broadcast variables in spark (shared variables)
A broadcast variable is a variable that can be used by RDD operators running in different executors without being defined inside the RDD operator. Commonly shared values, such as the configuration or lookup data used when connecting to databases like MySQL, can be made broadcast variables so that they only need to be created once.
Example usage:
//Define a shared variable for the data read from mongodb, which needs to be packaged as a map: (mid1, Map(mid2 -> score, mid3 -> score, ...))
val moviesRecsMap = spark.read
  .option("uri", mongoConfig.uri)
  .option("collection", MOVIES_RECS)
  .format("com.mongodb.spark.sql")
  .load().as[MoviesRecs].rdd.map(item=> {
    (item.mid, item.recs.map(itemRecs=>(itemRecs.mid,itemRecs.socre)).toMap)
  }).collectAsMap()

//This is the key step: broadcast the variable
//Once broadcast, it can be used later in any executor
val moviesRecsMapBroadcast = spark.sparkContext.broadcast(moviesRecsMap)

//Because broadcasting is lazy, the variable is referenced once here to trigger it
moviesRecsMapBroadcast.id
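A smaller self-contained sketch of the same idea, showing how executors read a broadcast variable through .value; the names and data are made up:

```scala
// Broadcast a lookup map and read it inside RDD operations via .value
val lookup = Map("a" -> 1, "b" -> 2, "c" -> 3)   // built in the driver
val lookupBc = sc.broadcast(lookup)              // shipped once per executor

val words = sc.parallelize(List("a", "b", "c", "a"))
val scored = words.map(w => (w, lookupBc.value.getOrElse(w, 0)))  // executors read the broadcast copy
scored.collect()   // Array((a,1), (b,2), (c,3), (a,1))
```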
3. Small spark Cases
3.1 Statistics: top N most visited site pages
Requirement: from the site access log, compute the names of the top N most visited pages
//The data format is as follows:
192.168.88.1 - - [30/Jul/2017:12:55:02 +0800] "GET /MyDemoWeb/web.jsp HTTP/1.1" 200 239
192.168.88.1 - - [30/Jul/2017:12:55:02 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242
//Code:
package SparkExer

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Analyzing tomcat logs
  * Log example:
  * 192.168.88.1 - - [30/Jul/2017:12:53:43 +0800] "GET /MyDemoWeb/ HTTP/1.1" 200 259
  *
  * Count the visits to each page
  */
object TomcatLog {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Tomcat Log analysis").setMaster("local")
    val sc = new SparkContext(conf)
    val rdd1 = sc.textFile("G:\\test\\tomcat_localhost_access_log.2017-07-30.txt")
      .map(_.split(" ")(6))
      .map((_,1))
      .reduceByKey(_+_)
      .map(t=>(t._2,t._1))
      .sortByKey(false)
      .map(t=>(t._2,t._1))
      .collect()
    //Alternatively, sort directly by value using sortBy(_._2, false)

    //Take the first N entries
    rdd1.take(2).foreach(x=>println(x._1 + ":" + x._2))
    println("=========================================")
    //Take the last N entries
    rdd1.takeRight(2).foreach(x=>println(x._1 + ":" + x._2))

    sc.stop()
  }
}
3.2 Example of custom partitions
See the custom partitioner example in section 2.11.
3.3 Spark connecting to MySQL
package SparkExer

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.{SparkConf, SparkContext}

object SparkConMysql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Tomcat Log To Mysql").setMaster("local")
    val sc = new SparkContext(conf)
    val rdd1 = sc.textFile("G:\\test\\tomcat_localhost_access_log.2017-07-30.txt")
      .map(_.split(" ")(6))
    rdd1.foreach(l=>{
      //The jdbc work has to happen inside the RDD operation so it can be executed by the executors on every worker, i.e. shipped as part of the serialized task
      val jdbcUrl = "jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8"
      var conn:Connection = null
      //sql statement object
      var ps:PreparedStatement = null
      conn = DriverManager.getConnection(jdbcUrl, "root", "wjt86912572")
      //? is a placeholder; values are bound with ps.setXxx(position, value), in order
      ps = conn.prepareStatement("insert into customer values (?,?)")
      ps.setString(1,l)
      ps.setInt(2,1)
      ps.executeUpdate()
      ps.close()
      conn.close()
    })
  }
}

Be careful:
When Spark works with JDBC, operating on the database directly from the driver causes serialization problems,
because in Spark's distributed framework, any object used to operate on an RDD's data must be usable inside the RDD operation
across the whole cluster; that is, it must be serializable or created inside the task.
In practice the workers do not share a single JDBC connection object; each one creates its own connection,
so the JDBC connection object has to be defined within the RDD operation.
The approach above is wasteful: a new JDBC connection object is created for every record.
Optimization: use rdd1.foreachPartition() to operate on each partition instead of on each record.
This saves database resources, because only one JDBC connection object is created per partition.
package SparkExer

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.{SparkConf, SparkContext}

object SparkConMysql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Tomcat Log To Mysql").setMaster("local")
    val sc = new SparkContext(conf)
    val rdd1 = sc.textFile("G:\\test\\tomcat_localhost_access_log.2017-07-30.txt")
      .map(_.split(" ")(6))
    rdd1.foreachPartition(updateMysql)
    /**
      * The previous approach is wasteful because every record creates a new jdbc connection object.
      * Optimization: use rdd1.foreachPartition() to operate on each partition rather than on each record.
      * This saves database resources: only one jdbc connection object is created per partition.
      */
  }

  def updateMysql(it:Iterator[String]) = {
    val jdbcUrl = "jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8"
    var conn:Connection = null
    //sql statement object
    var ps:PreparedStatement = null
    conn = DriverManager.getConnection(jdbcUrl, "root", "wjt86912572")
    //conn.createStatement()
    //ps = conn.prepareStatement("select * from customer")
    //? is a placeholder; values are bound with ps.setXxx(position, value), in order
    ps = conn.prepareStatement("insert into customer values (?,?)")
    it.foreach(data=>{
      ps.setString(1,data)
      ps.setInt(2,1)
      ps.executeUpdate()
    })
    ps.close()
    conn.close()
  }
}
Another way to read from MySQL is through a JdbcRDD object:
package SparkExer

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}

object MysqlJDBCRdd {
  def main(args: Array[String]): Unit = {
    val conn = () => {
      Class.forName("com.mysql.jdbc.Driver").newInstance()
      DriverManager.getConnection("jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8",
        "root", "wjt86912572")
    }
    val conf = new SparkConf().setAppName("Tomcat Log To Mysql").setMaster("local")
    val sc = new SparkContext(conf)

    //Create a JdbcRDD object
    val mysqlRdd = new JdbcRDD(sc, conn, "select * from customer where id>? and id<?", 2, 7, 2, r => {
      r.getString(2)
    })
  }
}

//JdbcRDD is quite limited: it can only be used for select statements, the query must accept two bound limits in the where clause, and the number of partitions must be specified.
4. shuffle problem
4.1 shuffle-induced data skew analysis
https://www.cnblogs.com/diaozhaojian/p/9635829.html
1. How data skew arises
(1) During a shuffle, all the records with the same key on every node must be pulled to a single task on one node for processing. If one key has a very large amount of data, that task is overloaded and data skew occurs.
(2) Because of the partitioning rules applied after the shuffle, one partition may receive far more data than the others, which also skews the data.
2. Finding and locating data skew
Use the Spark Web UI to see how much data each task in the currently running stage is processing, and determine whether uneven allocation of data among tasks is causing the skew.
Once you know in which stage the skew occurs, use the stage-division principle to work out which part of the code that stage corresponds to; that part of the code will contain a shuffle-class operator. The distribution of keys can also be inspected with countByKey.
3. Mitigating data skew
Filter out the few keys that cause the skew
Increase the parallelism of the shuffle operation
Use local aggregation followed by global aggregation (see the sketch below)
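A minimal sketch of the "local then global aggregation" idea using key salting; the prefix count of 10 and the sample data are arbitrary assumptions:

```scala
import scala.util.Random

// Two-stage aggregation for skewed keys:
// 1) add a random prefix so one hot key is spread over several reduce tasks (local aggregation)
// 2) strip the prefix and aggregate again (global aggregation)
val pairs = sc.parallelize(Seq(("hot", 1), ("hot", 1), ("hot", 1), ("cold", 1)))

val salted  = pairs.map { case (k, v) => (Random.nextInt(10) + "_" + k, v) }
val partial = salted.reduceByKey(_ + _)                            // local aggregation on salted keys
val result  = partial
  .map { case (saltedKey, v) => (saltedKey.split("_", 2)(1), v) }  // remove the salt
  .reduceByKey(_ + _)                                              // global aggregation on the real keys

result.collect()   // Array((hot,3), (cold,1)) in some order
```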
4.2 shuffle class operator
1. Deduplication
def distinct()
def distinct(numPartitions: Int)
2. Aggregation
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
def groupBy[K](f: T => K, p: Partitioner): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner): RDD[(K, U)]
def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int): RDD[(K, U)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializer: Serializer = null): RDD[(K, C)]
3. Sorting
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)]
def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
4. Repartitioning
def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null)
5. Set and join operations
def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]
def subtractByKey[W: ClassTag](other: RDD[(K, W)], numPartitions: Int): RDD[(K, V)]
def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]