Spark - WordCount, the source of all things

Keywords: Big Data, GitLab, Spark

11 ways of implementing WordCount in Spark

   Hello, everyone. I'll skip the introductions. Let's talk about WordCount, that is, word-frequency counting. You have probably heard from all kinds of sources that WordCount is the first thing you run into in data processing. Why? Because WordCount is simple, yet it captures what data processing and data statistics are really about. Today we'll follow that tradition and talk about WordCount too, but not casually: we'll approach it with a mind to learn systematically, because there are many ways to implement it, each one uses a different operator, and each one teaches you something different. That's the plan.

1. Data source

[root@host juana]# touch data.txt
[root@host juana]# vim data.txt
liubei,sunshangxiang,zhaoyun
minyue,guanyu,juyoujin,nakelulu
liubei,libai
libai,guanyu,bailishouyue

2. Specific implementation

1. Method 1: reduceByKey

  This is the most basic approach: pair each word with 1 and sum the counts per key with reduceByKey.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create a SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    sc.textFile("data/data.txt")
      // Flatten each line into words (the file is comma-separated)
      .flatMap(line => line.split(","))
      // Pair each word with 1, then sum the counts per key
      .map(x => (x, 1))
      .reduceByKey(_ + _)
      // Print each (word, count) pair
      .foreach(println)
    // Shut down the environment
    sc.stop()
  }
}

output

(liubei,2)
(zhaoyun,1)
(sunshangxiang,1)
(nakelulu,1)
(libai,2)
(juyoujin,1)
(guanyu,2)
(bailishouyue,1)
(minyue,1)

2. Method 2: groupBy

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create a SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    sc.textFile("data/data.txt")
      // Flatten each line into words (the file is comma-separated)
      .flatMap(line => line.split(","))
      .map(data => (data, 1))
      // Group by the word itself; each group is (word, Iterable[(word, 1)])
      .groupBy(_._1)
      // The size of each group is the word count
      .map(data => (data._1, data._2.size))
      .foreach(println)
    sc.stop()
  }
}

output

(liubei,2)
(zhaoyun,1)
(sunshangxiang,1)
(nakelulu,1)
(libai,2)
(juyoujin,1)
(guanyu,2)
(bailishouyue,1)
(minyue,1)

3. Method 3: groupByKey

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create a SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    sc.textFile("data/data.txt")
      .flatMap(line => line.split(","))
      .map(data => (data, 1))
      // Group by key; each group is (word, Iterable[1, 1, ...])
      .groupByKey()
      .map(data => (data._1, data._2.size))
      .foreach(println)
    sc.stop()
  }
}

output

(liubei,2)
(zhaoyun,1)
(sunshangxiang,1)
(nakelulu,1)
(libai,2)
(juyoujin,1)
(guanyu,2)
(bailishouyue,1)
(minyue,1)

4. Method 4: aggregateByKey

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create a SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    sc.textFile("data/data.txt")
      .flatMap(line => line.split(","))
      .map(data => (data, 1))
      // Initial value 0, then the intra-partition and inter-partition functions
      .aggregateByKey(0)(_ + _, _ + _)
      .foreach(println)
    sc.stop()
  }
}

output

(liubei,2)
(zhaoyun,1)
(sunshangxiang,1)
(nakelulu,1)
(libai,2)
(juyoujin,1)
(guanyu,2)
(bailishouyue,1)
(minyue,1)

5. Method 5: foldByKey

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create a SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    sc.textFile("data/data.txt")
      .flatMap(line => line.split(","))
      .map(data => (data, 1))
      // Initial value 0; the same function is used within and between partitions
      .foldByKey(0)(_ + _)
      .foreach(println)
    sc.stop()
  }
}

output

(liubei,2)
(zhaoyun,1)
(sunshangxiang,1)
(nakelulu,1)
(libai,2)
(juyoujin,1)
(guanyu,2)
(bailishouyue,1)
(minyue,1)

6. Method 6: combineByKey

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create a SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    sc.textFile("data/data.txt")
      .flatMap(line => line.split(","))
      .map(data => (data, 1))
      // createCombiner, mergeValue (within a partition), mergeCombiners (between partitions)
      .combineByKey(v => v, (x: Int, y) => x + y, (x: Int, y) => x + y)
      .foreach(println)
    sc.stop()
  }
}

output

(liubei,2)
(zhaoyun,1)
(sunshangxiang,1)
(nakelulu,1)
(libai,2)
(juyoujin,1)
(guanyu,2)
(bailishouyue,1)
(minyue,1)

7. Method 7: countByKey

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create a SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    val lineRdd: RDD[String] = sc.textFile("data/data.txt")
    // Flatten each line into words
    val wordRdd: RDD[String] = lineRdd.flatMap(line => line.split(","))
    // countByKey is an action: it returns a Map on the driver
    val stringToLong: collection.Map[String, Long] = wordRdd.map(data => (data, 1)).countByKey()
    println(stringToLong)
    sc.stop()
  }
}

output

Map(
 nakelulu -> 1,
 juyoujin -> 1,
 sunshangxiang -> 1,
 libai -> 2,
 minyue -> 1,
 zhaoyun -> 1,
 liubei -> 2, 
 guanyu -> 2, 
 bailishouyue -> 1
 )

8. Method 8: countByValue

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create a SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    val lineRdd: RDD[String] = sc.textFile("data/data.txt")
    // Flatten each line into words
    val wordRdd: RDD[String] = lineRdd.flatMap(line => line.split(","))
    // countByValue is an action: it counts each distinct element and returns a Map
    val stringToLong: collection.Map[String, Long] = wordRdd.countByValue()
    println(stringToLong)
    sc.stop()
  }
}

output

Map(
 nakelulu -> 1,
 juyoujin -> 1,
 sunshangxiang -> 1,
 libai -> 2,
 minyue -> 1,
 zhaoyun -> 1,
 liubei -> 2, 
 guanyu -> 2, 
 bailishouyue -> 1
 )

9. Method 9: reduce

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create a SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    val lineRdd: RDD[String] = sc.textFile("data/data.txt")
    // Flatten each line into words
    val wordRdd: RDD[String] = lineRdd.flatMap(line => line.split(","))
    // Wrap every word in a single-entry mutable Map
    val mapRdd: RDD[mutable.Map[String, Long]] = wordRdd.map(word => mutable.Map[String, Long]((word, 1L)))
    // reduce is an action: merge the single-entry maps pairwise into one Map
    val stringToLong: mutable.Map[String, Long] = mapRdd.reduce((map1, map2) => {
      map2.foreach {
        case (word, count) =>
          val newCount: Long = map1.getOrElse(word, 0L) + count
          map1.update(word, newCount)
      }
      map1
    })
    println(stringToLong)
    sc.stop()
  }
}

output

Map(
 nakelulu -> 1,
 juyoujin -> 1,
 sunshangxiang -> 1,
 libai -> 2,
 minyue -> 1,
 zhaoyun -> 1,
 liubei -> 2, 
 guanyu -> 2, 
 bailishouyue -> 1
 )

10. Method 10: aggregate

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create a SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    val lineRdd: RDD[String] = sc.textFile("data/data.txt")
    // Flatten each line into words
    val wordRdd: RDD[String] = lineRdd.flatMap(line => line.split(","))
    // Wrap every word in a single-entry mutable Map
    val mapRdd: RDD[mutable.Map[String, Int]] = wordRdd.map(word => mutable.Map[String, Int]((word, 1)))

    // aggregate is an action: the first function merges within a partition,
    // the second merges the per-partition results; both start from the empty Map
    val stringToInt: mutable.Map[String, Int] = mapRdd.aggregate(mutable.Map[String, Int]())((map1, map2) => {
      map2.foreach {
        case (word, count) =>
          val newCount: Int = map1.getOrElse(word, 0) + count
          map1.update(word, newCount)
      }
      map1
    }, (map1, map2) => {
      map2.foreach {
        case (word, count) =>
          val newCount: Int = map1.getOrElse(word, 0) + count
          map1.update(word, newCount)
      }
      map1
    })
    println(stringToInt)
    sc.stop()
  }
}

output

Map(
 nakelulu -> 1,
 juyoujin -> 1,
 sunshangxiang -> 1,
 libai -> 2,
 minyue -> 1,
 zhaoyun -> 1,
 liubei -> 2, 
 guanyu -> 2, 
 bailishouyue -> 1
 )

11. Method 11: fold

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the Spark environment
    val conf = new SparkConf().setMaster("local[*]").setAppName("wc")
    // Create a SparkContext
    val sc = new SparkContext(conf)
    // Read the file
    val lineRdd: RDD[String] = sc.textFile("data/data.txt")
    // Flatten each line into words
    val wordRdd: RDD[String] = lineRdd.flatMap(line => line.split(","))
    // Wrap every word in a single-entry mutable Map
    val mapRdd: RDD[mutable.Map[String, Int]] = wordRdd.map(word => mutable.Map[String, Int]((word, 1)))
    // fold is an action: like reduce, but with an initial value (here the empty Map)
    val stringToInt: mutable.Map[String, Int] = mapRdd.fold(mutable.Map[String, Int]())((map1, map2) => {
      map2.foreach {
        case (word, count) =>
          val newCount: Int = map1.getOrElse(word, 0) + count
          map1.update(word, newCount)
      }
      map1
    })
    println(stringToInt)
    sc.stop()
  }
}

output

Map(
 nakelulu -> 1,
 juyoujin -> 1,
 sunshangxiang -> 1,
 libai -> 2,
 minyue -> 1,
 zhaoyun -> 1,
 liubei -> 2, 
 guanyu -> 2, 
 bailishouyue -> 1
 )

Well, those are our 11 implementations. Time for a summary.
You may have noticed that methods 1 through 6 print one (word, count) pair at a time, while methods 7 through 11 return a whole Map. Is that a coincidence?

Of course not. It comes down to the kind of operator that does the counting: the first group relies on Transformation operators, the second on Action operators.
Let's take a closer look.

A Transformation is lazy: it only describes a computation and is executed when it is actually needed. An Action is eager: it triggers job execution immediately, and each action corresponds to one job.
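
A minimal sketch, assuming the same SparkContext sc and data file as in the examples above, makes this concrete: the chained transformations only record a lineage, and nothing executes until the count() action at the end.

// Assumes the same sc and data/data.txt as in the examples above
val wordCounts = sc.textFile("data/data.txt")
  .flatMap(line => line.split(","))    // Transformation: lazy
  .map(word => (word, 1))              // Transformation: lazy
  .reduceByKey(_ + _)                  // Transformation: still nothing has run

println(wordCounts.toDebugString)      // prints the lineage only, no job yet

val distinctWords = wordCounts.count() // Action: this line triggers a job
println(distinctWords)                 // 9 for the sample data above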

Operator type | Main operator | Introduction
Transformation | groupBy | Grouping operator for single-value RDDs; the grouping key function must be supplied; returns (key, iterator) pairs.
Transformation | groupByKey | Grouping operator for key-value RDDs; no grouping function is needed, grouping is done directly by key; returns (key, iterator) pairs.
Transformation | reduceByKey | Grouping-and-aggregation operator for key-value RDDs; groups directly by key; takes one aggregation function that is used both within and between partitions.
Transformation | aggregateByKey | Grouping-and-aggregation operator for key-value RDDs; groups directly by key; takes an initial value plus separate intra-partition and inter-partition functions.
Transformation | foldByKey | Grouping aggregation similar to aggregateByKey, for the case where the intra-partition and inter-partition logic is the same function; curried: the initial value comes first, then the function.
Transformation | combineByKey | Grouping-and-aggregation operator for key-value RDDs with three parameters: the first transforms the first value of each key, the second aggregates within a partition, the third aggregates between partitions. The intermediate (combiner) type may not be inferred at compile time, so it often needs explicit type annotations (see the sketch below).
Action | countByKey | Counts records per key; under the hood it is based on reduceByKey(_ + _).
Action | countByValue | Counts each distinct element; under the hood it calls map(value => (value, null)).countByKey().
Action | reduce | General aggregation operator; the merge logic is written by hand (see the usage above).
Action | fold | Like reduce, but with one extra parameter: the initial value of the intermediate accumulator.
Action | aggregate | Performs both intra-partition and inter-partition aggregation; compared with fold it takes one more function, so the local and global aggregation functions can be set separately.
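
One row in the table is worth a concrete illustration: with combineByKey the intermediate (combiner) type can differ from the value type, which is exactly when the explicit type annotations become necessary. The sketch below uses made-up score data purely for illustration (it is not part of data.txt) and computes a per-key average with a (sum, count) accumulator.

// Illustration only: the score data is made up, not part of data.txt
val scores = sc.parallelize(Seq(("libai", 80), ("libai", 90), ("guanyu", 70)))
val avgByKey = scores.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: first value of a key in a partition
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold a value into the accumulator
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge accumulators across partitions
).mapValues { case (sum, cnt) => sum.toDouble / cnt }
avgByKey.foreach(println)  // e.g. (libai,85.0) and (guanyu,70.0)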


The above is only a brief introduction to these operators; their principles and source code will be covered later. They are meant to be read alongside the examples above, so go through them first and get a feel for how each one is used. That's all for today, everyone.

Posted by mash on Mon, 22 Nov 2021 11:21:45 -0800