I. MapReduce basic principle

Keywords: Big Data Programming Java Hadoop Spark

I. MapReduce overview

1. Definition

MapReduce is a programming framework for distributed computing. Its core function is to integrate the business logic code written by the user with the framework's default components into a complete distributed program that runs concurrently on a Hadoop cluster.

2. Advantages and disadvantages

(1) advantages
1 > easy to program: by writing ordinary program code against the interfaces provided by MapReduce, a distributed program can be completed quickly.
2 > good scalability: when computing resources run short, computing power can be expanded simply by adding machines.
3 > high fault tolerance: if the node running a task goes down, that task is automatically transferred to another node for execution, i.e. automatic failover. This is handled internally without human intervention.
4 > suitable for offline processing of data at the PB level and above.

(2) disadvantages
1 > real-time computation: MapReduce cannot return results in milliseconds or seconds the way MySQL can.
2 > streaming computation: the input of streaming computation is dynamic, while MapReduce requires the input data to be static and already persisted on storage.
3 > DAG (directed acyclic graph) computation: when multiple applications have dependencies and the input of one application is the output of the previous one, MapReduce performs poorly, because the output of every MapReduce stage is first written to disk, and the large amount of disk IO causes a sharp drop in performance.

3. MapReduce core idea

The core idea is divided into two stages: map and reduce.
1) first, the input data is split into slices, and each slice is handed to an independent map task. Each map task processes its slice according to the business logic, and map tasks do not affect each other.
2) next, the outputs of all map tasks are used as the input of the reduce tasks (the number of reduce tasks is related to the number of partitions, discussed in detail later), which aggregate the local results of the map tasks into a global result and write the final output.
3) the MapReduce programming model has only one map phase and one reduce phase; if the business logic needs more, multiple MapReduce programs can only run serially, not in parallel.
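
To make the two phases concrete, here is a minimal word-count style sketch (not taken from the original article; the class names and the whitespace tokenizer are placeholders): each map task emits a (word, 1) pair for every word in its slice, and each reduce task sums the counts of one word after the shuffle.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

//Map phase: each map task processes its slice line by line and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

//Reduce phase: after the shuffle, each reduce task receives one key with all of its values and sums them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}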

II. MapReduce basic structure

1. MapReduce1.x architecture

Basic overview:
1) after we write an MR job, we submit it through the JobClient. The submitted information is sent to the JobTracker module, one of the cores of the first-generation MapReduce computing framework. It is responsible for maintaining heartbeats with the other nodes in the cluster, allocating resources for submitted jobs, and managing submitted jobs (failure, restart, etc.).
2) the other core of the first-generation MapReduce is the TaskTracker, which is installed on every worker node. Its main function is to monitor the resource usage of its own node.
3) the TaskTracker also monitors the Tasks running on the current node, including Map Tasks and Reduce Tasks; in the final Reduce stage the results are written to HDFS. The specific process is shown in steps 1-7 of the figure. During monitoring, the TaskTracker sends this information to the JobTracker through the heartbeat mechanism, and the JobTracker uses the collected information when allocating resources to newly submitted jobs, so that resources are not allocated twice.

Disadvantages:
1) JobTracker is the single entry point of the first-generation MapReduce. If the JobTracker service goes down, the whole service is paralyzed; it is a single point of failure.
2) JobTracker takes on too many responsibilities and uses too many resources. When there are too many jobs it consumes a lot of memory and easily becomes a performance bottleneck.
3) on the TaskTracker side, the description of a Task is too simple to take CPU and memory usage into account. If many memory-hungry Tasks are scheduled onto the same node, memory overflow occurs easily.
4) in addition, the TaskTracker rigidly divides resources into map task slots and reduce task slots. If an MR job needs only one kind of task (map or reduce), the other kind of slot is wasted and resource utilization is low. In other words, resources are allocated statically.

2. MapReduce2.x architecture


The biggest difference between V2 and V1 is the addition of YARN.
The basic idea of the architecture refactoring is to separate the two core functions of the JobTracker, resource scheduling and application management, into independent components. The new Resource Manager manages resource allocation for the whole system, and the App Master running under a Node Manager is responsible for the scheduling and coordination of its own application (each MapReduce job has a corresponding App Master). In practice, the App Master obtains resources from the Resource Manager and works with the Node Managers to run and monitor the tasks.

Compared with the Task monitoring in MR V1, internal handling such as restarting failed tasks is now done by the App Master. The Resource Manager provides a central service responsible for resource allocation and scheduling. The Node Manager is responsible for maintaining the state of the Containers on its node, reporting the collected information to the Resource Manager, and keeping a heartbeat with the Resource Manager.

Advantages:
1) resource consumption is reduced and the monitoring of each job is more distributed.
2) with YARN added, more programming models are supported, such as Spark.
3) describing resources by the amount of memory is more reasonable than the slots in V1, and resources are allocated dynamically.
4) resource scheduling and allocation are more hierarchical: the RM is responsible for overall resource management and scheduling, while each App Master is responsible for the resource management and scheduling of its own application.

III. MapReduce framework principle

1. Workflow

Steps 7 to 16 of the workflow figure are collectively called the shuffle mechanism.
1) the map task collects the kv pairs output by our map() method and puts them into a memory buffer.
2) when the buffer fills up, its contents are spilled to local disk files; multiple spill files may be produced.
3) multiple spill files are merged into larger ones.
4) during the spill and merge process, the partitioner is called to partition the data and sort it by key.
5) each reduce task goes to every map task machine and fetches the data of the partition that corresponds to its own partition number.
6) the reduce task collects the result files of the same partition from the different map tasks and merges them (merge sort).
7) after merging into one large file, the shuffle process is over. Then the logical processing of the reduce task begins: one group of key-value pairs is taken from the file at a time and the user-defined reduce() method is called.
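
The buffer mentioned in steps 1) and 2) can be tuned from the driver. A minimal sketch, assuming the standard Hadoop 2.x configuration keys mapreduce.task.io.sort.mb (buffer size in MB, default 100) and mapreduce.map.sort.spill.percent (spill threshold, default 0.80); the concrete values below are examples only:

//Driver-side sketch (Configuration is org.apache.hadoop.conf.Configuration, Job is org.apache.hadoop.mapreduce.Job).
Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 200);            //size of the in-memory sort buffer, in MB
conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); //usage ratio at which the buffer starts spilling to disk
Job job = Job.getInstance(conf, "shuffle-tuning-demo");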

2. Slicing mechanism

As can be seen from the MapReduce workflow, the number of map tasks depends on the number of slices, so let's look at how slicing works.

(1) slice code analysis

In the MapReduce workflow, the data is sliced before the map operation, and each slice is handed to an independent map task for processing. So how does the map task obtain the slicing implementation class?
First, the map task starts with the run method as its entry point.

/*
MapTask.java
*/
public void run(JobConf job, TaskUmbilicalProtocol umbilical) throws IOException, ClassNotFoundException, InterruptedException {
        //A lot of code is omitted here. The part that matters is this branch, which keeps the old and new APIs compatible.
        if (useNewApi) {
            this.runNewMapper(job, this.splitMetaInfo, umbilical, reporter);
        } else {
            this.runOldMapper(job, this.splitMetaInfo, umbilical, reporter);
        }

        this.done(umbilical, reporter);
    }

//Here is the runNewMapper method
private <INKEY, INVALUE, OUTKEY, OUTVALUE> void runNewMapper(JobConf job, TaskSplitIndex splitIndex, TaskUmbilicalProtocol umbilical, TaskReporter reporter) throws IOException, ClassNotFoundException, InterruptedException {
    ................
        //Here we can see how the InputFormat implementation class is obtained. The key is the taskContext object, whose class is TaskAttemptContextImpl.
        InputFormat<INKEY, INVALUE> inputFormat = (InputFormat)ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
    }

/*
TaskAttemptContextImpl.java inherits from the JobContextImpl class.
JobContextImpl implements the JobContext interface, which defines many set and get methods used to configure the job.
*/
public class JobContextImpl implements JobContext {
     public Class<? extends InputFormat<?, ?>> getInputFormatClass() throws ClassNotFoundException {

        //As you can see, the InputFormat class is read from the conf object, and the default value is TextInputFormat.
            return this.conf.getClass("mapreduce.job.inputformat.class", TextInputFormat.class);
        }
}
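
Since the class is looked up under the configuration key mapreduce.job.inputformat.class, the driver can override the default. A one-line sketch (KeyValueTextInputFormat is used only as an example of an alternative that ships with Hadoop):

//job.setInputFormatClass(...) stores the chosen class under mapreduce.job.inputformat.class.
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat.class);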

From this we can see that the default class for processing input data is TextInputFormat. This class, however, does not implement the slicing method itself; the slicing method is implemented in its parent class FileInputFormat:

/*
FileInputFormat.java
*/
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        StopWatch sw = (new StopWatch()).start();
        long minSize = Math.max(this.getFormatMinSplitSize(), getMinSplitSize(job));
        long maxSize = getMaxSplitSize(job);

        //This is the list that stores the slice information
        List<InputSplit> splits = new ArrayList();
        //Get all the files of the input path
        List<FileStatus> files = this.listStatus(job);

        //Loop over every input file and slice it
        for (FileStatus file : files) {
            Path path = file.getPath();
            long length = file.getLen();
            if (length != 0L) {
                BlockLocation[] blkLocations;
                if (file instanceof LocatedFileStatus) {
                    //Get the block locations of the file
                    blkLocations = ((LocatedFileStatus)file).getBlockLocations();
                } else {
                    FileSystem fs = path.getFileSystem(job.getConfiguration());
                    blkLocations = fs.getFileBlockLocations(file, 0L, length);
                }

                //From here on, the actual slicing starts.
                if (this.isSplitable(job, path)) {
                    long blockSize = file.getBlockSize();
                    //Compute the slice size
                    long splitSize = this.computeSplitSize(blockSize, minSize, maxSize);

                    long bytesRemaining;
                    int blkIndex;
                    //Slice the file in a loop. Note the condition: slicing continues only while the remaining part is larger than 1.1 times the slice size.
                    for(bytesRemaining = length; (double)bytesRemaining / (double)splitSize > 1.1D; bytesRemaining -= splitSize) {
                        blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
                        //Record the file, the start offset, the slice size and the hosts of the block as one piece of slice information.
                        splits.add(this.makeSplit(path, length - bytesRemaining, splitSize, blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
                    }

                    //Add whatever is left of the file as the last slice.
                    if (bytesRemaining != 0L) {
                        blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
                        splits.add(this.makeSplit(path, length - bytesRemaining, bytesRemaining, blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
                    }
                } else {
                    //A non-splittable file becomes a single slice.
                    splits.add(this.makeSplit(path, 0L, length, blkLocations[0].getHosts(), blkLocations[0].getCachedHosts()));
                }
            } else {
                //An empty file produces one empty slice.
                splits.add(this.makeSplit(path, 0L, length, new String[0]));
            }
        }

        job.getConfiguration().setLong("mapreduce.input.fileinputformat.numinputfiles", (long)files.size());
        sw.stop();
        if (LOG.isDebugEnabled()) {
            LOG.debug("Total # of splits generated by getSplits: " + splits.size() + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
        }

        return splits;
    }

/*
This method determines the slice size. In short, it depends on how maxSize and minSize compare with blockSize:
maxSize > blockSize  =>  splitSize = blockSize
maxSize < blockSize  =>  splitSize = maxSize

minSize > blockSize  =>  splitSize = minSize
minSize < blockSize  =>  splitSize = blockSize

Of course, note that maxSize should always be larger than minSize.
*/
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

(2) slice calculation summary -- FileInputFormat method

The slicing above is only a plan: no real slicing happens until the job is submitted to YARN for execution, and only then is the data read according to the slice plan. The characteristics of this slicing are summarized as follows:
1) slicing is based on the length of the file content
2) slicing is done for each file independently; the input files are not sliced as a whole, which has drawbacks (discussed later)
3) slice size: the default is blockSize. The computation rule is the computeSplitSize() method shown above, so it is not repeated here.

FileInputFormat.setMaxInputSplitSize();  // maxSize
FileInputFormat.setMinInputSplitSize();  // minSize

The slice size can be changed by setting these two values.

4) slicing method: as the source code shows, before each cut the framework checks whether the remaining part is greater than 1.1 times splitSize; if it is not, slicing stops and the remaining part becomes the last slice.
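
As a worked example of rules 3) and 4) (the numbers are chosen purely for illustration): suppose blockSize = 128 MB, minSize and maxSize keep their defaults, and a file is 260 MB. Then splitSize = 128 MB. Since 260 / 128 ≈ 2.03 > 1.1, a 128 MB slice is cut, leaving 132 MB; since 132 / 128 ≈ 1.03 <= 1.1, slicing stops and the remaining 132 MB becomes the last slice. The file therefore yields 2 slices (128 MB + 132 MB) instead of 3.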

(3) slicing optimization for a large number of small files -- CombineTextInputFormat

We know from (2) that TextInputFormat (FileInputFormat) slices file by file, which means a file produces at least one slice no matter how small it is. If there are a large number of small files, many map tasks are generated and processing efficiency becomes very low. In this case the solutions are:
1) solve it at the data source: merge the data before uploading it to HDFS so that large numbers of small files are not produced in the first place.
2) if a large number of small files must be processed, use CombineTextInputFormat for slicing.

The slicing logic is as follows (the source code is quite long; these are the conclusions after studying it):
First of all, CombineTextInputFormat does not implement the getSplits() method itself; it is implemented by its parent class CombineFileInputFormat. It treats the multiple files under a directory as one data source and slices them as a whole. The slice size depends on the maximum slice size set through setMaxInputSplitSize, in bytes. The slicing logic is:

totalSize <= 1.5 * maxSplitSize                    ->  1 slice,  splitSize = totalSize
1.5 * maxSplitSize < totalSize < 2 * maxSplitSize  ->  2 slices, splitSize = maxSplitSize
totalSize > 2 * maxSplitSize                       ->  n slices, splitSize = maxSplitSize

Note that:
If the total data size is much larger than maxSplitSize, then when cutting reaches the last part, the framework checks whether the remaining part is larger than twice maxSplitSize; if it is not, the remainder becomes one slice, and if it is, it is cut into two slices.

Use CombineTextInputFormat as the InputFormat implementation class:

//Set the class of InputFormat to CombineTextInputFormat
job.setInputFormatClass(CombineTextInputFormat.class);
//Set slice maximum and minimum values respectively
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);// 4m
CombineTextInputFormat.setMinInputSplitSize(job, 2097152);// 2m

3. Partitioning mechanism

As mentioned earlier, the number of map tasks depends on the number of slices. So what does the number of reduce tasks depend on? It depends on the number of partitions.

(1) basic partitioning mechanism

1) first, define a partitioner class that extends Partitioner<KEY, VALUE>
2) override the public int getPartition() method, which returns the partition number
3) set the custom class as the partitioner class in the job; otherwise the default partitioner HashPartitioner is used.

job.setPartitionerClass(CustomPartitioner.class);

4) set the number of reduce tasks, which is generally the same as the number of partitions.

job.setNumReduceTasks(N);

Note: the relationship between the number of partitions and the number of reduce tasks
If the number of reduceTasks > the number of getPartition() results, several empty output files part-r-000xx will be generated.
If 1 < the number of reduceTasks < the number of getPartition() results, some of the partitioned data will have nowhere to go, and an Exception will be thrown.
If the number of reduceTasks = 1, no matter how many partitioned files the mapTasks output, they are all handed to this single reduceTask, and only one result file part-r-00000 is generated.

(2) partitioning example

public class ProvincePartitioner extends Partitioner<Text, FlowBean> {

    @Override
    public int getPartition(Text key, FlowBean value, int numPartitions) {

// 1. get the first three digits of the phone number
        String preNum = key.toString().substring(0, 3);

        //Default partition number: if none of the conditions below match, the KV pair goes to this partition.
        int partition = 4;

        // 2 determine which province
        if ("136".equals(preNum)) {
            partition = 0;
        }else if ("137".equals(preNum)) {
            partition = 1;
        }else if ("138".equals(preNum)) {
            partition = 2;
        }else if ("139".equals(preNum)) {
            partition = 3;
        }

        return partition;
    }
}
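
A driver-side usage sketch to go with this partitioner (job is assumed to be an already configured org.apache.hadoop.mapreduce.Job; 5 matches the five partition numbers 0-4 returned above):

//Wire the custom partitioner into the job and match the number of reduce tasks to the number of partitions.
job.setPartitionerClass(ProvincePartitioner.class);
job.setNumReduceTasks(5);  //partitions 0-4 -> five output files part-r-00000 ... part-r-00004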
