MapReduce comprehensive experiment -- ranking statistics of Chinese Universities

Keywords: Big Data, Hadoop, MapReduce

Ranking statistics of Chinese Universities Based on MapReduce

Overall approach

① FileInputFormat reads the data
② The Mapper stage does simple processing of the data
③ A serializable bean implements custom sorting
④ A Partitioner splits the data into partitions
⑤ The Reducer writes out the data
⑥ The driver (main) class configures the job

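The specific implementation is as follows.

The driver (main) class sets the jar by class, the Mapper and Reducer classes, the output types, the Partitioner and the number of reduce tasks, and the file input and output paths. Note that the number of reduce tasks should match the number of partitions the Partitioner can return. If there are fewer reduce tasks than partitions (but more than one), the job fails with an "Illegal partition" IOException and the MapReduce program stops. If there are more, the job still runs, but the extra reduce tasks produce empty output files. With exactly one reduce task, the custom Partitioner is not used at all.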

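import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RankDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // Get the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // Load the main class (locates the jar that contains it)
        job.setJarByClass(RankDriver.class);

        // Set the Mapper and Reducer classes
        job.setMapperClass(RankMapper.class);
        job.setReducerClass(RankReducer.class);

        // Set the key/value types of the Mapper output
        job.setMapOutputKeyClass(RankBean.class);
        job.setMapOutputValueClass(Text.class);

        // Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(RankBean.class);

        // Set the Partitioner and the number of reduce tasks (6 matches partitions 0-5 in RankPartitioner)
        job.setPartitionerClass(RankPartitioner.class);
        job.setNumReduceTasks(6);

        // File input / output paths
        FileInputFormat.setInputPaths(job, new Path("E:\\test\\data\\*"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\test\\RankTopKOut"));

        // Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        // Exit with 0 on success, 1 on failure
        System.exit(result ? 0 : 1);
    }
}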
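
The relationship between the number of reduce tasks and the partition index can be tried out without a cluster. The following is a minimal stand-alone model, not Hadoop code: the class name `ReduceCountSketch` and the methods `route` and `describe` are hypothetical, and the check only imitates, as far as I know, how the framework validates the value returned by a Partitioner.

```java
import java.io.IOException;

// Stand-alone model (not Hadoop code) of how a partition index is checked
// against the number of reduce tasks. All names here are hypothetical.
class ReduceCountSketch {

    // Returns the reduce task that would receive a record with this partition index.
    static int route(int partition, int numReduceTasks) throws IOException {
        if (numReduceTasks == 1) {
            // With a single reduce task the custom partitioner is bypassed.
            return 0;
        }
        if (partition < 0 || partition >= numReduceTasks) {
            // Imitates the "Illegal partition" failure seen with too few reduce tasks.
            throw new IOException("Illegal partition (" + partition + ")");
        }
        return partition;
    }

    // Same as route, but turns the failure into a readable string.
    static String describe(int partition, int numReduceTasks) {
        try {
            return "reduce task " + route(partition, numReduceTasks);
        } catch (IOException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(5, 6)); // six reduce tasks: partitions 0-5 are all valid
        System.out.println(describe(5, 3)); // too few reduce tasks: partition 5 is rejected
        System.out.println(describe(5, 8)); // extra reduce tasks: 6 and 7 simply get no data
        System.out.println(describe(5, 1)); // one reduce task: everything goes to task 0
    }
}
```

For this job, partition indexes run from 0 to 5, so six reduce tasks is the setting that gives one non-empty output file per partition.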

For the serializable Bean class, pay attention to the following points:

① Implement the WritableComparable interface, with the type to compare against as its type parameter. Generally speaking, this is the bean class itself.
② Provide a no-argument constructor.
③ Override the serialization methods (write and readFields); the fields must be read in the same order in which they are written.
④ Override the compareTo method; its body implements the custom sort order.
⑤ Override the toString method, which determines how the bean is written to the final output.



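import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class RankBean implements WritableComparable<RankBean> {

    private String module; // School type
    private double score;  // School score
    private String position;  // School location

    // No-argument constructor, needed so the framework can create the bean during deserialization
    public RankBean() {
    }

    public String getModule() {
        return module;
    }

    public void setModule(String module) {
        this.module = module;
    }

    public double getScore() {
        return score;
    }

    public void setScore(double score) {
        this.score = score;
    }

    public String getPosition() {
        return position;
    }

    public void setPosition(String position) {
        this.position = position;
    }

    // Sort by score in descending order
    @Override
    public int compareTo(RankBean o) {
        if (this.score > o.score) {
            return -1;
        } else if (this.score < o.score) {
            return 1;
        } else {
            return 0;
        }
    }

    // Serialize the fields in the same order in which readFields reads them
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(module);
        out.writeDouble(score);
        out.writeUTF(position);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.module = in.readUTF();
        this.score = in.readDouble();
        this.position = in.readUTF();
    }

    @Override
    public String toString() {
        return module + "\t" + position + "\t" + score;
    }
}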
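
The two properties that matter most in the bean, matching field order in write and readFields and a descending compareTo, can be checked with plain java.io streams. The following is a rough sketch under stated assumptions: `ScoreRecord` is a hypothetical Hadoop-free stand-in for RankBean, `roundTrip` is a helper invented for the demonstration, and the three sample records are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical Hadoop-free stand-in for RankBean, used only to test the logic.
class ScoreRecord implements Comparable<ScoreRecord> {
    String module;
    double score;
    String position;

    ScoreRecord() {
    }

    ScoreRecord(String module, double score, String position) {
        this.module = module;
        this.score = score;
        this.position = position;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(module);
        out.writeDouble(score);
        out.writeUTF(position);
    }

    // Reads the fields in exactly the order write() emitted them.
    void readFields(DataInput in) throws IOException {
        module = in.readUTF();
        score = in.readDouble();
        position = in.readUTF();
    }

    @Override
    public int compareTo(ScoreRecord o) {
        return Double.compare(o.score, score); // higher score sorts first
    }

    @Override
    public String toString() {
        return module + "\t" + position + "\t" + score;
    }

    // Serializes a record to bytes and reads it back into a fresh object.
    static ScoreRecord roundTrip(ScoreRecord r) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            r.write(new DataOutputStream(bytes));
            ScoreRecord copy = new ScoreRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<ScoreRecord> records = new ArrayList<>();
        records.add(new ScoreRecord("typeA", 71.5, "Shanghai")); // made-up sample values
        records.add(new ScoreRecord("typeB", 93.0, "Beijing"));
        records.add(new ScoreRecord("typeA", 82.25, "Jiangsu"));
        Collections.sort(records); // uses compareTo, so the highest score comes first
        for (ScoreRecord r : records) {
            System.out.println(roundTrip(r));
        }
    }
}
```

If readFields read the fields in a different order than write emitted them, the round trip would fail or return scrambled values, which is the mistake point ③ above guards against.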


The Mapper class reads, processes, and writes out the data. MapReduce always sorts the map output by key during the shuffle (the in-memory sort uses quicksort by default). To get a custom sort order, the output key (outK) must therefore be a serializable bean that implements WritableComparable and defines compareTo. With a built-in key type, as in the WordCount program, the keys are simply sorted in that type's default order.

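import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RankMapper extends Mapper<LongWritable, Text, RankBean, Text> {
    private RankBean outK = new RankBean();
    private Text outV = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] split = line.split("\t");
        // Get the fields by position: name, location, type, score
        String name = split[0];
        String position = split[1];
        String mold = split[2];
        String score = split[3];

        // Store the data: the bean is the key (so it gets sorted), the school name is the value
        outK.setPosition(position);
        outK.setModule(mold);
        outK.setScore(Double.parseDouble(score));
        outV.set(name);

        // Write the data
        context.write(outK, outV);
    }
}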
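
The mapper assumes every line has four tab-separated fields and a numeric score. A header row or a short line would make it throw. The following is a minimal sketch of a more defensive parsing step, written without Hadoop so it can be run on its own. `LineParseSketch`, `ParsedLine` and the sample lines are hypothetical; only the column order (name, location, type, score) is taken from the mapper.

```java
import java.util.Optional;

// Hadoop-free sketch of the parsing step in map(). The names LineParseSketch and
// ParsedLine, and the sample lines in main, are hypothetical.
class LineParseSketch {

    static final class ParsedLine {
        final String name;
        final String position;
        final String mold;
        final double score;

        ParsedLine(String name, String position, String mold, double score) {
            this.name = name;
            this.position = position;
            this.mold = mold;
            this.score = score;
        }
    }

    // Returns the parsed fields, or an empty Optional if the line is malformed.
    static Optional<ParsedLine> parse(String line) {
        String[] split = line.split("\t");
        if (split.length < 4) {
            return Optional.empty(); // too few columns
        }
        try {
            double score = Double.parseDouble(split[3]);
            return Optional.of(new ParsedLine(split[0], split[1], split[2], score));
        } catch (NumberFormatException e) {
            return Optional.empty(); // score column is not a number, e.g. a header row
        }
    }

    public static void main(String[] args) {
        String[] lines = {
            "U1\tBeijing\ttypeA\t93.5",    // well-formed
            "name\tposition\ttype\tscore", // header row
            "U2\tShanghai"                 // too few columns
        };
        for (String line : lines) {
            System.out.println(parse(line).isPresent() ? "kept" : "skipped");
        }
    }
}
```

In the real mapper, the equivalent of "skipped" would be returning from map before context.write is called.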


The Partitioner class routes records to partitions by the school's location, so that the data for each region is finally stored in a separate file. The specific implementation steps are as follows:

① Extend the Partitioner class; its generic types are the Mapper's output key and value types.
② Override the getPartition method to implement the partitioning.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class RankPartitioner extends Partitioner<RankBean, Text> {

    @Override
    public int getPartition(RankBean rankBean, Text text, int numPartitions) {
        int partition;
        if ("Beijing".equals(rankBean.getPosition())) {
            partition = 0;
        } else if ("Shanghai".equals(rankBean.getPosition())) {
            partition = 1;
        } else if ("Tianjin".equals(rankBean.getPosition())) {
            partition = 2;
        } else if ("Jiangsu".equals(rankBean.getPosition())) {
            partition = 3;
        } else if ("Henan".equals(rankBean.getPosition())) {
            partition = 4;
        } else {
            // Every other location shares the last partition
            partition = 5;
        }
        return partition;
    }
}
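
The if/else chain can also be written as a lookup table, which keeps the region list and the partition count in one place. The following is a sketch of that alternative, not the original implementation: `PartitionTableSketch`, `partitionFor`, `partitionCount` and `outputFileName` are hypothetical names, and the file-name helper assumes the default `part-r-NNNNN` naming of reducer output files.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hadoop-free sketch of a table-driven partition lookup. All names are hypothetical.
class PartitionTableSketch {

    // Region name -> partition index, in the same order as the if/else chain.
    private static final Map<String, Integer> PARTITIONS = new LinkedHashMap<>();

    static {
        PARTITIONS.put("Beijing", 0);
        PARTITIONS.put("Shanghai", 1);
        PARTITIONS.put("Tianjin", 2);
        PARTITIONS.put("Jiangsu", 3);
        PARTITIONS.put("Henan", 4);
    }

    // Index of the catch-all partition for every other location.
    static final int OTHER = PARTITIONS.size();

    static int partitionFor(String position) {
        return PARTITIONS.getOrDefault(position, OTHER);
    }

    // Number of reduce tasks the job would need for this table.
    static int partitionCount() {
        return OTHER + 1;
    }

    // Assumed default name of the output file written by the given reduce task.
    static String outputFileName(int partition) {
        return String.format("part-r-%05d", partition);
    }

    public static void main(String[] args) {
        for (String position : new String[] {"Beijing", "Henan", "Sichuan"}) {
            int p = partitionFor(position);
            System.out.println(position + " -> partition " + p + " -> " + outputFileName(p));
        }
        System.out.println("reduce tasks needed: " + partitionCount());
    }
}
```

Inside a real getPartition, the body would reduce to a single call such as `partitionFor(rankBean.getPosition())`, and the driver could use `partitionCount()` instead of a hard-coded number of reduce tasks.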

The Reducer class writes out the data, emitting each school name (the incoming value) as the output key and its bean (the incoming key) as the output value.

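import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RankReducer extends Reducer<RankBean, Text, Text, RankBean> {
    @Override
    protected void reduce(RankBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            // Write the school name followed by its type, location and score
            context.write(value, key);
        }
    }
}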
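
The reducer loops over the values because keys that compare as equal are delivered to a single reduce call. Since compareTo looks only at the score, two schools with the same score end up in the same group. The following is a small model of that grouping in plain Java, not Hadoop code: `GroupingSketch`, `group` and the sample names and scores are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of how records with equal sort keys form one group.
// GroupingSketch and the sample data are hypothetical.
class GroupingSketch {

    // Groups names by score, highest score first; equal scores share one group.
    static Map<Double, List<String>> group(String[] names, double[] scores) {
        Map<Double, List<String>> groups = new TreeMap<>(Comparator.reverseOrder());
        for (int i = 0; i < names.length; i++) {
            groups.computeIfAbsent(scores[i], k -> new ArrayList<>()).add(names[i]);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] names = {"U1", "U2", "U3"};
        double[] scores = {80.0, 90.0, 80.0};
        // Each map entry corresponds to one reduce call; the list is its values.
        group(names, scores).forEach(
            (score, members) -> System.out.println(score + " -> " + members));
    }
}
```

Here U1 and U3 share a group because their scores are equal, so a reducer that wrote only the first value of each group would silently drop tied schools.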

At this point, the ranking of Chinese universities has been written to separate files, partitioned by the selected provinces. The final output is shown below.

The ranking data of Chinese universities are attached.

Data and source code download address:

Extraction code: 9q88

I hope this helps.

Posted by ursvmg on Tue, 30 Nov 2021 09:20:18 -0800