Hello, everyone. I won't introduce myself here. Let's talk about WordCount, that is, word-frequency counting. You may have heard from all kinds of channels that WordCount is the first thing you run into in data processing. Why? Because it is simple, yet it captures the essence of data processing and data statistics very well. Today we will follow the trend and talk about WordCount too, but not in a vague, hand-wavy way: we will approach it with an attitude of systematic learning, because there are many ways to implement WordCount, each method is built around a different operator, and each one will give you a different takeaway. That's it; let's begin.
1. Data source
```
[root@host juana]# touch data.txt
[root@host juana]# vim data.txt
liubei,sunshangxiang,zhaoyun
minyue,guanyu,juyoujin,nakelulu
liubei,libai
libai,guanyu,bailishouyue
```
2. Specific implementation
1. Method 1: reduceByKey

This is the most primitive approach: split the lines, map each word to a pair, and sum the counts with reduceByKey.
```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    sc.textFile("data/data.txt")
      // Flatten each line into words
      .flatMap(line => line.split(","))
      // Pair every word with 1, then sum the counts per key
      .map(x => (x, 1))
      .reduceByKey(_ + _)
      // Print each result
      .foreach(println)
    // Shut down the environment
    sc.stop()
  }
}
```
Output:

```
(liubei,2)
(zhaoyun,1)
(sunshangxiang,1)
(nakelulu,1)
(libai,2)
(juyoujin,1)
(guanyu,2)
(bailishouyue,1)
(minyue,1)
```
2. Method 2: groupBy
```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    sc.textFile("data/data.txt")
      .flatMap(line => line.split(","))
      .map(data => (data, 1))
      // Group the (word, 1) pairs by the word itself
      .groupBy(_._1)
      // The size of each group is the word count
      .map(data => (data._1, data._2.size))
      .foreach(println)
    sc.stop()
  }
}
```
Output:

```
(liubei,2)
(zhaoyun,1)
(sunshangxiang,1)
(nakelulu,1)
(libai,2)
(juyoujin,1)
(guanyu,2)
(bailishouyue,1)
(minyue,1)
```
3. Method 3: groupByKey
```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    sc.textFile("data/data.txt")
      .flatMap(line => line.split(","))
      .map(data => (data, 1))
      // Group directly by key; each value becomes an iterable of 1s
      .groupByKey()
      // The number of 1s per key is the word count
      .map(data => (data._1, data._2.size))
      .foreach(println)
    sc.stop()
  }
}
```
Output:

```
(liubei,2)
(zhaoyun,1)
(sunshangxiang,1)
(nakelulu,1)
(libai,2)
(juyoujin,1)
(guanyu,2)
(bailishouyue,1)
(minyue,1)
```
4. Method 4: aggregateByKey
```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    sc.textFile("data/data.txt")
      .flatMap(line => line.split(","))
      .map(data => (data, 1))
      // Initial value 0; the same sum is used within and across partitions
      .aggregateByKey(0)(_ + _, _ + _)
      .foreach(println)
    sc.stop()
  }
}
```
Output:

```
(liubei,2)
(zhaoyun,1)
(sunshangxiang,1)
(nakelulu,1)
(libai,2)
(juyoujin,1)
(guanyu,2)
(bailishouyue,1)
(minyue,1)
```
5. Method 5: foldByKey
```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    sc.textFile("data/data.txt")
      .flatMap(line => line.split(","))
      .map(data => (data, 1))
      // Initial value 0; one function is used both within and across partitions
      .foldByKey(0)(_ + _)
      .foreach(println)
    sc.stop()
  }
}
```
Output:

```
(liubei,2)
(zhaoyun,1)
(sunshangxiang,1)
(nakelulu,1)
(libai,2)
(juyoujin,1)
(guanyu,2)
(bailishouyue,1)
(minyue,1)
```
6. Method 6: combineByKey
```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    sc.textFile("data/data.txt")
      .flatMap(line => line.split(","))
      .map(data => (data, 1))
      .combineByKey(
        v => v,                      // createCombiner: keep the first value of each key as-is
        (x: Int, y: Int) => x + y,   // mergeValue: aggregate within a partition
        (x: Int, y: Int) => x + y    // mergeCombiners: aggregate across partitions
      )
      .foreach(println)
    sc.stop()
  }
}
```
Output:

```
(liubei,2)
(zhaoyun,1)
(sunshangxiang,1)
(nakelulu,1)
(libai,2)
(juyoujin,1)
(guanyu,2)
(bailishouyue,1)
(minyue,1)
```
7. Method 7: countByKey
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    val lineRDD: RDD[String] = sc.textFile("data/data.txt")
    // Flatten each line into words
    val wordRDD: RDD[String] = lineRDD.flatMap(line => line.split(","))
    // Pair every word with 1, then let countByKey count the occurrences of each key
    val stringToLong: collection.Map[String, Long] = wordRDD.map(data => (data, 1)).countByKey()
    println(stringToLong)
    sc.stop()
  }
}
```
Output:

```
Map(nakelulu -> 1, juyoujin -> 1, sunshangxiang -> 1, libai -> 2, minyue -> 1, zhaoyun -> 1, liubei -> 2, guanyu -> 2, bailishouyue -> 1)
```
8. Method 8: countByValue
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    val lineRDD: RDD[String] = sc.textFile("data/data.txt")
    // Flatten each line into words
    val wordRDD: RDD[String] = lineRDD.flatMap(line => line.split(","))
    // countByValue counts each element directly; no (word, 1) pairing is needed
    val stringToLong: collection.Map[String, Long] = wordRDD.countByValue()
    println(stringToLong)
    sc.stop()
  }
}
```
Output:

```
Map(nakelulu -> 1, juyoujin -> 1, sunshangxiang -> 1, libai -> 2, minyue -> 1, zhaoyun -> 1, liubei -> 2, guanyu -> 2, bailishouyue -> 1)
```
9. Method 9: reduce
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    val lineRDD: RDD[String] = sc.textFile("data/data.txt")
    // Flatten each line into words
    val wordRDD: RDD[String] = lineRDD.flatMap(line => line.split(","))
    // Wrap every word in a single-entry mutable Map
    val mapRDD: RDD[mutable.Map[String, Long]] = wordRDD.map(word => mutable.Map[String, Long]((word, 1L)))
    // reduce merges the maps pairwise, summing the counts of identical words
    val stringToLong: mutable.Map[String, Long] = mapRDD.reduce((map1, map2) => {
      map2.foreach { case (word, count) =>
        val newCount: Long = map1.getOrElse(word, 0L) + count
        map1.update(word, newCount)
      }
      map1
    })
    println(stringToLong)
    sc.stop()
  }
}
```
Output:

```
Map(nakelulu -> 1, juyoujin -> 1, sunshangxiang -> 1, libai -> 2, minyue -> 1, zhaoyun -> 1, liubei -> 2, guanyu -> 2, bailishouyue -> 1)
```
10. Method 10: aggregate
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    val lineRDD: RDD[String] = sc.textFile("data/data.txt")
    // Flatten each line into words
    val wordRDD: RDD[String] = lineRDD.flatMap(line => line.split(","))
    // Wrap every word in a single-entry mutable Map
    val mapRDD: RDD[mutable.Map[String, Int]] = wordRDD.map(word => mutable.Map[String, Int]((word, 1)))
    // aggregate takes an initial value plus intra-partition and inter-partition functions
    val stringToInt: mutable.Map[String, Int] = mapRDD.aggregate(mutable.Map[String, Int]())(
      // Intra-partition merge: fold map2 into map1, summing counts of identical words
      (map1, map2) => {
        map2.foreach { case (word, count) =>
          val newCount: Int = map1.getOrElse(word, 0) + count
          map1.update(word, newCount)
        }
        map1
      },
      // Inter-partition merge: same logic here, but it could differ
      (map1, map2) => {
        map2.foreach { case (word, count) =>
          val newCount: Int = map1.getOrElse(word, 0) + count
          map1.update(word, newCount)
        }
        map1
      })
    println(stringToInt)
    sc.stop()
  }
}
```
Output:

```
Map(nakelulu -> 1, juyoujin -> 1, sunshangxiang -> 1, libai -> 2, minyue -> 1, zhaoyun -> 1, liubei -> 2, guanyu -> 2, bailishouyue -> 1)
```
11. Method 11: fold
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    val lineRDD: RDD[String] = sc.textFile("data/data.txt")
    // Flatten each line into words
    val wordRDD: RDD[String] = lineRDD.flatMap(line => line.split(","))
    // Wrap every word in a single-entry mutable Map
    val mapRDD: RDD[mutable.Map[String, Int]] = wordRDD.map(word => mutable.Map[String, Int]((word, 1)))
    // fold is like reduce with an initial value; one merge function is used throughout
    val stringToInt: mutable.Map[String, Int] = mapRDD.fold(mutable.Map[String, Int]())((map1, map2) => {
      map2.foreach { case (word, count) =>
        val newCount: Int = map1.getOrElse(word, 0) + count
        map1.update(word, newCount)
      }
      map1
    })
    println(stringToInt)
    sc.stop()
  }
}
```
Output:

```
Map(nakelulu -> 1, juyoujin -> 1, sunshangxiang -> 1, libai -> 2, minyue -> 1, zhaoyun -> 1, liubei -> 2, guanyu -> 2, bailishouyue -> 1)
```
Well, those are our 11 implementations. Now let's make a summary.

You may have noticed that methods 1 through 6 print the results one pair at a time, while from method 7 onward the result is a single Map. Is that a coincidence, or is something else going on?

No suspense here: it is because the core operators behind them are different. The first group is built on transformation operators, the second group on action operators.

Come on, have a look.

A transformation is a lazy operator: it only describes the computation and is executed when a result is actually needed. An action is an eager operator: it immediately triggers execution, and each action corresponds to one job.
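To make that difference concrete, here is a minimal sketch (my own illustration, not one of the 11 methods above; the object name LazyVsEager is made up) showing that transformations only build the lineage, while the action is what actually runs a job:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyVsEager {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("lazy-demo"))

    // Transformations: nothing is computed yet, Spark only records the lineage
    val counts = sc.textFile("data/data.txt")
      .flatMap(line => line.split(","))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    println("No job has been submitted yet")

    // Action: collect() submits a job that executes the whole lineage above
    println(counts.collect().mkString(", "))

    sc.stop()
  }
}
```

With that distinction in mind, the table below summarizes the operators used in the 11 methods.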
| Operator type | Main operator | Description |
|---|---|---|
| Transformation | groupBy | Grouping operator for single-value RDDs; you must specify how to derive the grouping key, and the return value is a pair (key, iterable of the original elements) |
| Transformation | groupByKey | Grouping operator for key-value RDDs; no grouping key needs to be supplied, the data is grouped directly by key, and the return value is a pair (key, iterable of values) |
| Transformation | reduceByKey | Grouping-and-aggregation operator for key-value RDDs; data is grouped directly by key, and you supply one aggregation function (the intra-partition and inter-partition functions are the same) |
| Transformation | aggregateByKey | Grouping-and-aggregation operator for key-value RDDs; data is grouped directly by key, and you supply an initial value plus separate intra-partition and inter-partition aggregation functions |
| Transformation | foldByKey | Grouping and aggregation similar to aggregateByKey, except that foldByKey is used when the intra-partition and inter-partition logic is the same; it is curried (initial value in the first parameter list, the function in the second) |
| Transformation | combineByKey | Grouping-and-aggregation operator for key-value RDDs with three parameters: the first transforms the first value of each key, the second aggregates within a partition, the third aggregates across partitions. The type of the intermediate value may not be inferred at compile time, so it often needs explicit type annotations |
| Action | countByKey | Counts the occurrences of each key; under the hood it calls mapValues(_ => 1L).reduceByKey(_ + _) and collects the result to the driver |
| Action | countByValue | Counts the occurrences of each element directly; under the hood it calls map(value => (value, null)).countByKey() |
| Action | reduce | Aggregation operator; you implement the combining logic yourself, as in the example above |
| Action | fold | Like reduce but with one extra parameter: the initial value of the accumulator used during aggregation |
| Action | aggregate | Performs intra-partition and inter-partition aggregation separately; compared to fold it takes one more function, so the local (per-partition) and global combine functions can differ |
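To tie the table back to the code, here is a small side-by-side sketch (again my own example, assuming data/data.txt is the file created above; the object name OperatorComparison is made up): the transformation operators return an RDD you still have to act on, while countByKey and countByValue hand a Map straight back to the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OperatorComparison {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("op-demo"))

    val pairs = sc.textFile("data/data.txt")
      .flatMap(line => line.split(","))
      .map(word => (word, 1))

    // Transformations: same result, different ways of supplying the initial value
    // and the intra-partition / inter-partition functions
    val byReduce    = pairs.reduceByKey(_ + _)
    val byAggregate = pairs.aggregateByKey(0)(_ + _, _ + _)
    val byFold      = pairs.foldByKey(0)(_ + _)

    // Actions: the result comes back to the driver as a Map directly
    val byCountByKey   = pairs.countByKey()
    val byCountByValue = pairs.keys.countByValue()

    println(byReduce.collect().toMap)
    println(byAggregate.collect().toMap)
    println(byFold.collect().toMap)
    println(byCountByKey)
    println(byCountByValue)

    sc.stop()
  }
}
```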
The above is just a brief introduction to these operators; we will explain their principles and source code in later posts. For now, read the table together with the examples above and get a general feel for how they fit together. Ladies and gentlemen, I'll take my leave.