[Note] This article is based on the learning videos from Calf School.
Spark Exercises: Finding the TopN Most Popular Teachers in Each Subject
Data format (one access-log URL per line; the first segment of the host is the subject and the path is the teacher): http://bigdata.edu360.cn/laozhang
1. Splitting the data
import java.net.URL

// Parse one log line into a (subject, teacher) tuple
val func = (line: String) => {
  // The teacher is everything after the last "/"
  val index = line.lastIndexOf("/")
  val teacher = line.substring(index + 1)
  // The rest is the URL, e.g. "http://bigdata.edu360.cn"
  val httpHost = line.substring(0, index)
  // The first segment of the host name is the subject
  val subject = new URL(httpHost).getHost.split("[.]")(0)
  (subject, teacher)
}
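For the sample line above, the function produces the following (a quick check, e.g. in spark-shell):

func("http://bigdata.edu360.cn/laozhang")  // => (bigdata,laozhang)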
2. Computation logic
2.1 Find the topN most popular teachers across all subjects
// Get the data source
val lines = sc.textFile(path)
// Keep only the teacher and pair it with a count of 1
val teacherAndOne = lines.map(func).map(t => (t._2, 1))
// Aggregate the counts per teacher
val reduced = teacherAndOne.reduceByKey(_ + _)
// Sort by count in descending order
val sorted = reduced.sortBy(_._2, false)
// The RDD is already sorted, so taking the first topN elements is enough
// (top(topN) would re-sort by the tuples' natural ordering)
val result = sorted.take(topN)
2.2 Find the topN most popular teachers in each subject
(1) Use Scala's sortBy method (suitable for small amounts of data)
import org.apache.spark.rdd.RDD

val lines = sc.textFile(path)
val subjectAndTeacher = lines.map(func)
val mapped = subjectAndTeacher.map((_, 1))
val reduced = mapped.reduceByKey(_ + _)
// Group by subject: the key is the subject and the value is an iterable of
// that subject's ((subject, teacher), count) records
val grouped: RDD[(String, Iterable[((String, String), Int)])] = reduced.groupBy(_._1._1)
// Process each group separately.
// Why can Scala's sortBy be called here? Because one subject's data has
// already been gathered into a local collection on a single machine.
// With a large amount of data per subject this can run out of memory.
val sorted = grouped.mapValues(_.toList.sortBy(_._2).reverse.take(topN))
// The amount of data is small, so it can be collected directly or saved to a file
val result = sorted.collect()
(2) Use the RDD sortBy method (suitable for large amounts of data)
val lines = sc.textFile(path)
val subjectAndTeacher = lines.map(func)
// Collect the distinct subjects to the Driver
val subjects = subjectAndTeacher.keys.distinct().collect()
val mapped = subjectAndTeacher.map((_, 1))
val reduced = mapped.reduceByKey(_ + _)
// One Spark job is launched per subject
for (sb <- subjects) {
  // Keep only this subject's records; the key is a (subject, teacher) tuple
  val filtered = reduced.filter(_._1._1.equals(sb))
  // sortBy is now called on an RDD, which can use both memory and disk.
  // take is an Action: it fetches the first topN elements from the
  // Executors and ships them to the Driver over the network.
  val r = filtered.sortBy(_._2, false).take(topN)
  // Then collect or store r
}
(3) Custom partitioner that partitions by subject
i. The partitioner SubjectPartitioner

import org.apache.spark.Partitioner
import scala.collection.mutable

class SubjectPartitioner(subjects: Array[String]) extends Partitioner {
  // The class body acts as the primary constructor (runs once per new).
  // A map that stores the subject -> partition-number rules.
  val rules = new mutable.HashMap[String, Int]()
  var i = 0
  for (sb <- subjects) {
    rules(sb) = i
    i += 1
  }

  // The number of partitions (i.e. how many partitions the next RDD has)
  override def numPartitions: Int = subjects.length

  // Compute the partition index from the incoming key
  override def getPartition(key: Any): Int = {
    // The key is a (subject, teacher) tuple; sb is the subject
    val sb = key.asInstanceOf[(String, String)]._1
    // Look up the partition number in the rules map
    rules(sb)
  }
}
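A quick sanity check of the partitioner (a minimal sketch; "javaee" and "php" are hypothetical subject names added for illustration, as is the teacher "laoli" — only "bigdata" appears in the sample data above):

// Hypothetical subjects besides "bigdata", purely for illustration
val partitioner = new SubjectPartitioner(Array("bigdata", "javaee", "php"))
partitioner.numPartitions                      // => 3
partitioner.getPartition(("javaee", "laoli"))  // => 1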
ii. Logic
import org.apache.spark.rdd.RDD

val lines = sc.textFile(path)
val subjectAndTeacher = lines.map(func)
// Collect the distinct subjects to the Driver
val subjects = subjectAndTeacher.keys.distinct().collect()
val mapped = subjectAndTeacher.map((_, 1))
// Aggregation: the first shuffle
val reduced = mapped.reduceByKey(_ + _)
// partitionBy repartitions the data according to the custom partitioner:
// the second shuffle
val partitioned: RDD[((String, String), Int)] = reduced.partitionBy(new SubjectPartitioner(subjects))
// Process one partition (one subject) at a time
val sorted = partitioned.mapPartitions(it => {
  // Convert the iterator to a List, sort it, and return an iterator again.
  // Disadvantage: the whole partition is loaded into memory for sorting.
  it.toList.sortBy(_._2).reverse.take(topN).iterator
})
(4) Reduce the number of shuffles
val lines = sc.textFile(path)
val subjectAndTeacher = lines.map(func)
// Collect the distinct subjects to the Driver
val subjects = subjectAndTeacher.keys.distinct().collect()
val mapped = subjectAndTeacher.map((_, 1))
// The custom partitioner
val sbPartitioner = new SubjectPartitioner(subjects)
// Aggregate and partition in a single step: passing the partitioner to
// reduceByKey leaves only one shuffle in total
val reduced = mapped.reduceByKey(sbPartitioner, _ + _)
// Process one partition (one subject) at a time
val sorted = reduced.mapPartitions(it => {
  // Convert the iterator to a List, sort it, and return an iterator again.
  // Disadvantage: the whole partition is loaded into memory again for sorting.
  // Optimization: sort without loading everything into memory. Consider a
  // fixed-length TreeSet: feed it records from the iterator, keep only the
  // topN after each insertion, and repeat until the subject's data is
  // exhausted (see the sketch below).
  it.toList.sortBy(_._2).reverse.take(topN).iterator
})
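A minimal sketch of the fixed-length TreeSet optimization described in the comment above. It reuses `reduced` and `topN` from the previous block; the tie-breaking ordering is an assumption added here so that records with equal counts are not treated as duplicates by the set:

import scala.collection.mutable

val sorted = reduced.mapPartitions(it => {
  // Order primarily by count, then by the (subject, teacher) key,
  // otherwise the TreeSet would drop records with equal counts
  implicit val ord: Ordering[((String, String), Int)] =
    Ordering.by((t: ((String, String), Int)) => (t._2, t._1))
  val topSet = mutable.TreeSet.empty[((String, String), Int)]
  for (record <- it) {
    topSet += record
    // Keep the set bounded: once it grows past topN elements,
    // evict the record with the smallest count
    if (topSet.size > topN) topSet -= topSet.head
  }
  // Emit the topN records in descending order of count
  topSet.toList.reverse.iterator
})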