MapReduce Comprehensive Experiment: Ranking Statistics of Chinese Universities

Keywords: Big Data, Hadoop, MapReduce

Ranking statistics of Chinese Universities Based on MapReduce

Overall approach

① FileInputFormat reads the input data
② The Mapper stage performs simple processing of each record
③ A serializable bean implements custom sorting
④ The Partitioner assigns records to partitions
⑤ The Reducer writes out the data
⑥ The main (driver) class configures the job

The specific implementation is as follows

The Driver main class sets the jar by class, the Mapper and Reducer classes, the output types, the Partitioner, and the file input and output paths. Note that the number of reduce tasks should match the number of partitions the Partitioner produces. If there are fewer reduce tasks than partitions (but more than one), the job fails with an "Illegal partition" error. If there are more, the job still runs, but the extra reduce tasks write empty output files. With exactly one reduce task, the custom Partitioner is not used at all.

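import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class RankDriver {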
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // Get job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // Locate the job jar via the driver class
        job.setJarByClass(RankDriver.class);

        // Set Mapper and Reducer classes
        job.setMapperClass(RankMapper.class);
        job.setReducerClass(RankReducer.class);
        
        // Set the key/value types of the Mapper output
        job.setMapOutputKeyClass(RankBean.class);
        job.setMapOutputValueClass(Text.class);

        // Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(RankBean.class);

        // Set Partition and number of partitions
        job.setPartitionerClass(RankPartitioner.class);
        job.setNumReduceTasks(6);

        // File input / output path
        FileInputFormat.setInputPaths(job, new Path("E:\\test\\data\\*"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\test\\RankTopKOut"));

        // Submit job
        boolean result = job.waitForCompletion(true);
        // Exit with 0 on success, 1 on failure
        System.exit(result ? 0 : 1);
    }
}
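
The relationship between setNumReduceTasks and the partition numbers returned by the Partitioner can be modelled without Hadoop. The sketch below is a simplified stand-in for the range check applied to each map output record, not Hadoop's own code. The class PartitionRangeCheck and its method are hypothetical names, and where Hadoop reports the failure as an IOException, the sketch uses an unchecked exception for brevity.

```java
// Simplified model (not Hadoop source code) of how the number of reduce
// tasks constrains the partition numbers a custom Partitioner may return.
final class PartitionRangeCheck {

    // Returns the partition a record ends up in, given the number the
    // partitioner returned and the configured number of reduce tasks (>= 1).
    static int effectivePartition(int requested, int numReduceTasks) {
        if (numReduceTasks == 1) {
            // A single reduce task bypasses the custom partitioner:
            // every record goes to partition 0.
            return 0;
        }
        if (requested < 0 || requested >= numReduceTasks) {
            // Fewer reduce tasks than partitions: the job fails.
            throw new IllegalStateException("Illegal partition (" + requested + ")");
        }
        // More reduce tasks than partitions is legal; the surplus tasks
        // receive no records and write empty output files.
        return requested;
    }
}
```

With the six partitions used in this experiment, effectivePartition(5, 6) and effectivePartition(5, 8) both return 5, while effectivePartition(5, 3) throws.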

For the Bean serialization class, pay attention to the following points:

① Implement the WritableComparable interface, with the type to compare against as the generic parameter. Usually this is the bean class itself.
② Provide a no-argument constructor.
③ Override the serialization methods (write and readFields). Fields must be read in the same order in which they are written.
④ Override the compareTo method. Its body implements the custom sort order.
⑤ Override the toString method, which determines how the bean appears in the final output.

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class RankBean implements WritableComparable<RankBean> {

    private String module; // School type
    private double score;  // School score
    private String position;  // School location

    public RankBean() {
    }

    public String getModule() {
        return module;
    }

    public void setModule(String module) {
        this.module = module;
    }

    public double getScore() {
        return score;
    }

    public void setScore(double score) {
        this.score = score;
    }

    public String getPosition() {
        return position;
    }

    public void setPosition(String position) {
        this.position = position;
    }


    // Sort by score in descending order
    @Override
    public int compareTo(RankBean o) {
        return Double.compare(o.score, this.score);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(module);
        out.writeDouble(score);
        out.writeUTF(position);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.module = in.readUTF();
        this.score = in.readDouble();
        this.position = in.readUTF();
    }

    @Override
    public String toString() {
        return module + "\t" + position + "\t" + score;
    }

}
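
The write/readFields contract can be checked outside a cluster with plain java.io streams. The following is a minimal sketch, assuming a Hadoop-free stand-in class named RankRecord (a hypothetical name) that mirrors the three fields of RankBean. It serializes a record to bytes and reads it back into a fresh object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hadoop-free stand-in for RankBean: same fields, same field order.
final class RankRecord {
    String module;
    double score;
    String position;

    // No-argument constructor: an empty object must exist before readFields.
    RankRecord() { }

    void write(DataOutput out) throws IOException {
        out.writeUTF(module);
        out.writeDouble(score);
        out.writeUTF(position);
    }

    // Reads the fields in exactly the order write() emitted them.
    void readFields(DataInput in) throws IOException {
        module = in.readUTF();
        score = in.readDouble();
        position = in.readUTF();
    }

    // Serializes the record to a byte array and deserializes a copy from it.
    static RankRecord roundTrip(RankRecord original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            RankRecord copy = new RankRecord();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A record with all three fields set should come back from roundTrip unchanged. Note that writeUTF does not accept null, so every String field has to be populated before the bean is written.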

The Mapper class reads, processes and writes out the data. MapReduce always sorts the map output by key during the shuffle (using quicksort on the map side). With a built-in key type such as Text, as in the WordCount program, the keys are sorted in their default order. To sort by score instead, the output key (outK) must be a custom object that implements WritableComparable, so that its compareTo method defines the order.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class RankMapper extends Mapper<LongWritable, Text, RankBean, Text> {
    private RankBean outK = new RankBean();
    private Text outV = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] split = line.split("\t");
        // Extract the fields: name, location, type, score
        String name = split[0];
        String position = split[1];
        String mold = split[2];
        String score = split[3];

        // Populate the output value and key
        outV.set(name);
        outK.setModule(mold);
        outK.setPosition(position);
        outK.setScore(Double.parseDouble(score));

        // Write out the key/value pair
        context.write(outK, outV);
    }
}
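
The effect of the key's compareTo on the sort order can be seen in isolation. Below is a minimal sketch, assuming a plain Comparable class named ScoreKey (a hypothetical stand-in for RankBean, without Hadoop types) and using Collections.sort in place of the framework's shuffle sort.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hadoop-free stand-in for the sort key: only the score decides the order.
final class ScoreKey implements Comparable<ScoreKey> {
    final String name;   // carried along only to label the result
    final double score;

    ScoreKey(String name, double score) {
        this.name = name;
        this.score = score;
    }

    @Override
    public int compareTo(ScoreKey other) {
        // Arguments swapped on purpose: higher scores sort first.
        return Double.compare(other.score, this.score);
    }

    // Returns the names in the order dictated by compareTo.
    static List<String> sortedNames(List<ScoreKey> keys) {
        List<ScoreKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        List<String> names = new ArrayList<>();
        for (ScoreKey key : copy) {
            names.add(key.name);
        }
        return names;
    }
}
```

Swapping the two arguments of Double.compare back would yield ascending order, which is the only change needed to reverse the ranking.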

The Partitioner class assigns records to partitions based on the location field, so that the data of each partition ends up in its own output file. The implementation steps are as follows:

① Extend the Partitioner class; its generic types are the Mapper's output key and value types
② Override the getPartition method to implement the partitioning

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class RankPartitioner extends Partitioner<RankBean, Text> {

    @Override
    public int getPartition(RankBean rankBean, Text text, int numPartitions) {
        int partition;
        if ("Beijing".equals(rankBean.getPosition())) {
            partition = 0;
        }else if ("Shanghai".equals(rankBean.getPosition())) {
            partition = 1;
        }else if ("Tianjin".equals(rankBean.getPosition())) {
            partition = 2;
        }else if ("Jiangsu".equals(rankBean.getPosition())) {
            partition = 3;
        }else if ("Henan".equals(rankBean.getPosition())) {
            partition = 4;
        }else {
            partition = 5;
        }
        return partition;
    }
}
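
As the list of provinces grows, the if/else chain can be replaced by a lookup table. The following is a sketch of that alternative, not the author's implementation. The class ProvincePartitions and its members are hypothetical names, and Hadoop types are left out so the mapping can be tested on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Table-driven variant of the province-to-partition mapping.
final class ProvincePartitions {
    private static final Map<String, Integer> BY_PROVINCE = new LinkedHashMap<>();
    static {
        BY_PROVINCE.put("Beijing", 0);
        BY_PROVINCE.put("Shanghai", 1);
        BY_PROVINCE.put("Tianjin", 2);
        BY_PROVINCE.put("Jiangsu", 3);
        BY_PROVINCE.put("Henan", 4);
    }

    // Every province not listed above shares the last partition.
    static final int OTHER = BY_PROVINCE.size();

    // Number of reduce tasks the driver has to configure:
    // one per listed province plus the catch-all partition.
    static final int NUM_PARTITIONS = OTHER + 1;

    // Returns the partition number for a province name.
    static int partitionFor(String province) {
        return BY_PROVINCE.getOrDefault(province, OTHER);
    }
}
```

Deriving NUM_PARTITIONS from the table keeps the reduce-task count and the partition numbers in step when a province is added.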

The Reducer class writes out the data. It swaps key and value so that each output line starts with the university name.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class RankReducer extends Reducer<RankBean, Text, Text, RankBean> {


    @Override
    protected void reduce(RankBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(value,key);
        }
    }
}
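
The loop over values matters because RankBean.compareTo returns 0 for equal scores, so universities with the same score reach the reducer in a single reduce call. The sketch below models that grouping without Hadoop, using a TreeMap ordered by descending score. The class ScoreGrouping is a hypothetical name, and the names and scores passed to it are illustrative inputs only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free model of reduce-side grouping: records whose keys compare
// as equal (same score) are collected under one entry.
final class ScoreGrouping {

    // Groups university names by score, highest score first.
    static Map<Double, List<String>> group(String[] names, double[] scores) {
        Map<Double, List<String>> groups =
                new TreeMap<Double, List<String>>(Comparator.<Double>reverseOrder());
        for (int i = 0; i < names.length; i++) {
            groups.computeIfAbsent(scores[i], s -> new ArrayList<>()).add(names[i]);
        }
        return groups;
    }
}
```

Each map entry corresponds to one reduce call, and each name in its list to one iteration of the loop. Writing only once per reduce call would silently drop every university after the first in a tie.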

At this point, the rankings of Chinese universities have been written to separate files, partitioned by province. The final output is shown below.


The ranking data of Chinese universities are attached.

Data and source code download address: https://pan.baidu.com/s/1rd5_7MwPtDptGm1u3QJnJw

Extraction code: 9q88

I hope this helps.

Posted by ursvmg on Tue, 30 Nov 2021 09:20:18 -0800