Spark Exercises: Finding the TopN Teachers in Each Subject

Keywords: Big Data Scala Spark network

[Note] This article is based on learning videos from Calf School.


Data format (one URL per line, where the last path segment is the teacher and the first host segment is the subject): http://bigdata.edu360.cn/laozhang

1. Data segmentation

import java.net.URL

val func = (line: String) => {
  val index = line.lastIndexOf("/")
  val teacher = line.substring(index + 1)
  val httpHost = line.substring(0, index)
  val subject = new URL(httpHost).getHost.split("[.]")(0)
  //Return (teacher, 1) for the overall topN in 2.1,
  //or (subject, teacher) for the per-subject topN in 2.2:
  (subject, teacher)
}
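As a quick sanity check of the parsing steps, the snippet below runs them in plain Scala without Spark, on the sample URL from the data format above (`parseLine` is a hypothetical standalone copy of the same logic):

```scala
import java.net.URL

// Parse one log line into (subject, teacher) -- the same steps as func above
def parseLine(line: String): (String, String) = {
  val index    = line.lastIndexOf("/")          // last "/" separates host from teacher
  val teacher  = line.substring(index + 1)
  val httpHost = line.substring(0, index)
  val subject  = new URL(httpHost).getHost.split("[.]")(0)  // first host segment
  (subject, teacher)
}

val (subject, teacher) = parseLine("http://bigdata.edu360.cn/laozhang")
println(s"subject=$subject, teacher=$teacher")  // subject=bigdata, teacher=laozhang
```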

2. Logical Computing

2.1 Find the topN most popular teachers across all subjects

//Get the data source
val lines = sc.textFile(path)
//here func is the variant that returns (teacher, 1)
val teacherAndOne = lines.map(func)
val reduced = teacherAndOne.reduceByKey(_ + _)
val sorted = reduced.sortBy(_._2, false)
val result = sorted.take(topN)
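The same map, reduce, sort, take pipeline can be sketched on a plain Scala collection; the sample click data below is hypothetical and only illustrates the shape of the computation:

```scala
// Hypothetical click log: each element is one visit to a teacher's page
val clicks = List("laozhang", "laowang", "laozhang", "laoli", "laozhang", "laowang")
val topN = 2

val result = clicks
  .map((_, 1))                                              // (teacher, 1)
  .groupBy(_._1)                                            // group by teacher
  .map { case (teacher, ones) => (teacher, ones.map(_._2).sum) } // reduceByKey(_ + _)
  .toList
  .sortBy(-_._2)                                            // descending by count
  .take(topN)

println(result)  // List((laozhang,3), (laowang,2))
```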

2.2 Find the topN most popular teachers in each subject

 

(1) Use Scala's sortBy method (suitable for small amounts of data)

import org.apache.spark.rdd.RDD

val lines = sc.textFile(path)
val subjectAndTeacher = lines.map(func)
val maped = subjectAndTeacher.map((_, 1))
val reduced = maped.reduceByKey(_ + _)
//Group by subject: the key is the subject and the value is an iterable
//over that subject's ((subject, teacher), count) records.
val grouped: RDD[(String, Iterable[((String, String), Int)])] = reduced.groupBy(_._1._1)
//Process each group separately.
//We can call Scala's sortBy here because, after grouping, all of one
//subject's data sits in a single in-memory collection on one machine.
//With a large amount of data per subject this can cause memory problems.
val sorted = grouped.mapValues(_.toList.sortBy(_._2).reverse.take(topN))
//The amount of data is small, so it can be collected directly or saved to a file.
val result = sorted.collect()
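The group-then-sort step above can be sketched on plain Scala collections; the pre-aggregated ((subject, teacher), count) records below are hypothetical sample data:

```scala
// Hypothetical pre-aggregated counts, as produced by reduceByKey
val reduced = List(
  (("bigdata", "laozhang"), 10),
  (("bigdata", "laowang"),   5),
  (("bigdata", "laoli"),     8),
  (("javaee",  "laoduan"),   7),
  (("javaee",  "laoyang"),   9)
)
val topN = 2

val result: Map[String, List[((String, String), Int)]] =
  reduced
    .groupBy(_._1._1)                              // group by subject
    .map { case (sb, recs) =>
      (sb, recs.sortBy(_._2).reverse.take(topN))   // topN inside each group
    }

println(result("bigdata"))  // List(((bigdata,laozhang),10), ((bigdata,laoli),8))
```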

 

(2) Use the RDD sortBy method (suitable for large amounts of data)

val lines = sc.textFile(path)
val subjectAndTeacher = lines.map(func)
//Collect the distinct subjects to the Driver as an Array
val subjects: Array[String] = subjectAndTeacher.keys.distinct().collect()
val maped = subjectAndTeacher.map((_, 1))
val reduced = maped.reduceByKey(_ + _)
for (sb <- subjects) {
  val filtered = reduced.filter(_._1._1.equals(sb))
  //sortBy is now called on an RDD, which can use both memory and disk.
  //take is an Action: it fetches the first elements from the Executors
  //and sends them to the Driver over the network.
  val r = filtered.sortBy(_._2, false).take(topN)
  //Then collect or store r
}

(3) Custom partitioner, partitioning by subject.

  i. The partitioner: SubjectPartitioner

import org.apache.spark.Partitioner
import scala.collection.mutable

class SubjectPartitioner(subjects: Array[String]) extends Partitioner {
  //Runs once when the partitioner is constructed (part of the primary constructor):
  //build a map that stores the partitioning rules, subject -> partition index
  val rules = new mutable.HashMap[String, Int]()
  var i = 0
  for (sb <- subjects) {
    rules(sb) = i
    i = i + 1
  }

  //The number of partitions (how many partitions the next RDD will have)
  override def numPartitions: Int = subjects.length

  //Compute the partition index from the incoming key
  override def getPartition(key: Any): Int = {
    //The key is a (subject, teacher) tuple
    val sb = key.asInstanceOf[(String, String)]._1
    //Look up the partition index in the rules
    rules(sb)
  }
}
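The rule-building and lookup logic of the partitioner can be checked on its own, without a Spark dependency; `buildRules` below is a hypothetical standalone copy of the constructor loop:

```scala
import scala.collection.mutable

// Build the rule map the same way SubjectPartitioner's constructor does:
// each subject gets the next free partition index
def buildRules(subjects: Array[String]): mutable.HashMap[String, Int] = {
  val rules = new mutable.HashMap[String, Int]()
  var i = 0
  for (sb <- subjects) {
    rules(sb) = i
    i = i + 1
  }
  rules
}

val rules = buildRules(Array("bigdata", "javaee", "php"))
// A key (subject, teacher) is routed by its subject, as in getPartition:
val partition = rules(("bigdata", "laozhang")._1)
println(partition)  // 0
```

The explicit loop mirrors the class above; in idiomatic Scala the same map could be built with `subjects.zipWithIndex.toMap`.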

ii. Logic

import org.apache.spark.rdd.RDD

val lines = sc.textFile(path)
val subjectAndTeacher = lines.map(func)
//Collect the distinct subjects to the Driver as an Array
val subjects: Array[String] = subjectAndTeacher.keys.distinct().collect()
val maped = subjectAndTeacher.map((_, 1))
//Aggregation
//First shuffle
val reduced = maped.reduceByKey(_ + _)
//Repartition with the custom partitioner
//partitionBy partitions according to the specified partitioning rules
//Second shuffle
val partitioned: RDD[((String, String), Int)] = reduced.partitionBy(new SubjectPartitioner(subjects))



//Operate on one partition at a time
val sorted = partitioned.mapPartitions(it => {
  //Convert the iterator to a List, sort it, and return an iterator
  it.toList.sortBy(_._2).reverse.take(topN).iterator
  //Disadvantage: the whole partition is loaded into memory and sorted again
})

(4) Reduce the number of shuffles

val lines = sc.textFile(path)
val subjectAndTeacher = lines.map(func)
//Collect the distinct subjects to the Driver as an Array
val subjects: Array[String] = subjectAndTeacher.keys.distinct().collect()
val maped = subjectAndTeacher.map((_, 1))
//The partitioner
val sbPartitioner = new SubjectPartitioner(subjects)
//Aggregation and partitioning in one step: reduceByKey accepts a
//partitioner, so this is the only shuffle
val reduced = maped.reduceByKey(sbPartitioner, _ + _)



//Operate on one partition at a time
val sorted = reduced.mapPartitions(it => {
  //Convert the iterator to a List, sort it, and return an iterator
  it.toList.sortBy(_._2).reverse.take(topN).iterator
  //Disadvantage: the whole partition is loaded into memory and sorted again
  //Optimization: sort without loading everything into memory.
  //Use a bounded, sorted structure (e.g. a TreeSet capped at topN elements):
  //insert records from the iterator one at a time, and whenever the set
  //grows past topN drop the smallest, until the subject's data is exhausted.
})
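The TreeSet optimization described in the comments can be sketched in plain Scala as follows; the records are hypothetical sample data, and ties on the count are broken by teacher name so that the ordering can distinguish distinct records:

```scala
import scala.collection.mutable

// Bounded topN: keep at most topN elements in a sorted set instead of
// sorting the whole partition in memory
val topN = 2
val ord = Ordering.by[((String, String), Int), (Int, String)] {
  case ((_, teacher), count) => (count, teacher)
}
val treeSet = mutable.TreeSet.empty(ord)

// Hypothetical iterator over one subject's ((subject, teacher), count) records
val it = Iterator(
  (("bigdata", "laozhang"), 10),
  (("bigdata", "laowang"),   5),
  (("bigdata", "laoli"),     8),
  (("bigdata", "laozhao"),   3)
)
for (rec <- it) {
  treeSet += rec
  if (treeSet.size > topN) treeSet -= treeSet.head  // drop the smallest
}

val result = treeSet.toList.reverse  // descending by count
println(result)  // List(((bigdata,laozhang),10), ((bigdata,laoli),8))
```

At any moment the set holds at most topN + 1 elements, so memory use no longer grows with the size of the partition.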

 

Posted by SureFire on Tue, 29 Jan 2019 23:33:14 -0800