Hadoop Part 2: MapReduce

Keywords: Hadoop, Spark, Hive, HBase

MapReduce (3)

Project address: https://github.com/KingBobTitan/hadoop.git

MR's Shuffle explanation and Join implementation

1, Review

1. MapReduce's history monitoring service: JobHistoryServer

  • Function: used to monitor the information of all MapReduce programs that have run on YARN
  • Requires YARN log aggregation to be configured: the aggregated logs are stored on HDFS
  • Launch the service; its web UI listens on port 19888 (a sample configuration follows)
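  • A minimal configuration sketch (the ports are the Hadoop defaults; the hostname node01 is an assumption):

    #mapred-site.xml
    mapreduce.jobhistory.address=node01:10020
    mapreduce.jobhistory.webapp.address=node01:19888

    #yarn-site.xml: enable log aggregation so finished job logs are kept on HDFS
    yarn.log-aggregation-enable=true

    #Start the service (Hadoop 2.x)
    sbin/mr-jobhistory-daemon.sh start historyserver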

2. Custom data types: encapsulating JavaBeans in Hadoop

  • Encapsulation requires serialization

  • Implementation interface

    • Writable: implements serialization only
      • write: serialization
      • readFields: deserialization
    • WritableComparable: implements serialization and comparison
      • write: serialization
      • readFields: deserialization
      • compareTo: the comparison method
    • If the custom type will go through the Shuffle process as a key, it needs to implement the WritableComparable interface (a minimal sketch is shown at the end of this item)
  • Sorting: custom data types

    • First check whether a sorter is set. If there is one, use the sorter

      job.setSortComparatorClass(RawComparator<T>) => usually a class that extends WritableComparator
      
    • If not, the type's compareTo method is called

    • If neither exists, an error is thrown

  • Sorting: default types such as Text and IntWritable

    • First check whether a sorter is set. If there is one, use the sorter
    • If there is no sorter, the default comparator of the type is called
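
  • A minimal sketch of such a custom key type (the word/count fields and the ordering are illustrative assumptions, not the course's code):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class WordCountBean implements WritableComparable<WordCountBean> {
        private String word;
        private int count;

        //Serialization: write the fields in a fixed order
        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(word);
            out.writeInt(count);
        }

        //Deserialization: read the fields back in the same order
        @Override
        public void readFields(DataInput in) throws IOException {
            this.word = in.readUTF();
            this.count = in.readInt();
        }

        //Comparison used by the sort in Shuffle when no sorter is set
        @Override
        public int compareTo(WordCountBean other) {
            return this.word.compareTo(other.word);
        }

        public void set(String word, int count) { this.word = word; this.count = count; }

        //What TextOutputFormat writes if this type is used as output
        @Override
        public String toString() { return word + "\t" + count; }
    }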

3. Basic Shuffle process

  • Input

    • Function
      • Convert all input data to KeyValue pairs
      • Split the job: divide the input into splits (slices), each of which becomes one computing task
  • Map

    • Function: divide
      • Start MapTasks according to the splits produced by Input; one split corresponds to one MapTask
      • Each MapTask calls the map method once for every keyvalue in its split
    • Implementation logic
      • map method customization
  • Shuffle

    • Function

      • Partition: if there are multiple Reduces, decide which Reduce will process the current keyvalue
      • Sort: sort the keys according to the sorting rules
      • Group: the values of the same key are combined into one iterator, so each key appears only once
    • Implementation logic

      • Partition: hash partitioning by default

        • Custom: extend Partitioner<KEY, VALUE> and override

          getPartition(key, value, numReduceTasks)
          
        • a minimal sketch of a custom partitioner is shown at the end of this review item
      • Sort: ascending by key in dictionary order by default

        • Option 1: define a sorter [highest priority]
        • Option 2: have the custom data type implement WritableComparable
      • Group: group by key by default

  • Reduce

    • Function: combine
      • Aggregate the data output by Shuffle
    • Implementation logic
      • the customized reduce method
  • Output

    • Output the result of Reduce and save it to the corresponding file system
    • Default TextOutputFormat: key and value are tab separated
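
  • A minimal sketch of the custom partitioner mentioned above (the rule of sending the word "hadoop" to reduce 0 is an illustrative assumption; the fallback is the same formula the default HashPartitioner uses):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        //Decide which ReduceTask will process the current keyvalue
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            //Illustrative rule: all "hadoop" records go to reduce 0
            if ("hadoop".equals(key.toString())) {
                return 0;
            }
            //Everything else falls back to the default hash rule
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

    //Enable it in the job configuration:
    //job.setPartitionerClass(WordPartitioner.class);
    //job.setNumReduceTasks(2);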

2, Course objectives

  1. Detailed explanation of Shuffle process [key points]

  2. Two optimizations in Shuffle

  3. Join scheme in MapReduce [Master]

  4. Read write database [understand]

3, Detailed explanation of Shuffle process

1. Function

  • partition
  • sort
  • Grouping

2. Stages

  • Input

    • Input data:

      • file1

        hadoop hive hbase spark spark
        hadoop hadoop hadoop
        
      • file2

        hue hive hive spark spark
        hadoop spark spark hbase
        
    • Function: slice the input and convert it to keyvalue pairs

    • output

      • split1

        key							value
        0							hadoop hive hbase spark spark
        30							hadoop hadoop hadoop
        
      • split2

        key							value
        0							hue hive hive spark spark
        26							hadoop spark spark hbase
        
  • Map

    • Function: start one MapTask per split, then call the map method for each keyvalue in that MapTask

      String[] arr = value.toString().split(" ");
      for (String word : arr) {
          this.outputKey.set(word);
          this.outputValue.set(1);
          context.write(this.outputKey, this.outputValue);
      }
      
    • MapTask1

      hadoop      1			
      hive        1   
      hbase       1     
      spark       1          
      spark       1         
      hadoop      1         
      hadoop      1      
      hadoop      1    
      
    • MapTask2

      hue 	1
      hive 	1
      hive 	1
      spark 	1
      spark	1
      hadoop 	1
      spark 	1
      spark 	1
      hbase	1
      
  • Shuffle: partition, sort, group

    • Shuffle on the Map side: processing the result of Map

      • Spill: overflow write [write data from memory to files on disk]

        • Each MapTask puts its own processing results into a circular memory buffer [100 MB by default]

        • When the buffer reaches 80%, a spill is triggered, and the data to be spilled is first partitioned and sorted

          • Partition: by default, the hash of the key modulo the number of reducers; in essence, each record is just labeled with the reduce that will process it

            • MapTask1: filea

              hadoop      1		reduce1
              hive        1   	reduce2
              hbase       1     	reduce1
              spark       1    	reduce2
              
            • MapTask1: fileb

              spark       1       reduce2
              hadoop      1       reduce1
              hadoop      1		reduce1
              hadoop      1    	reduce1
              
            • MapTask2: filea

              hue 	1			reduce1
              hive 	1			reduce2
              hive 	1			reduce2
              spark 	1			reduce2
              spark	1			reduce2
              
            • MapTask2:fileb

              hadoop 	1			reduce1
              spark 	1			reduce2
              spark 	1			reduce2
              hbase	1			reduce1
              
          • Sorting: call the sorter or the compareTo method; implemented with quick sort

            • The data is not globally ordered across the whole batch; it is ordered within each partition

            • MapTask1: filea

              hadoop      1		reduce1
              hbase       1     	reduce1
              hive        1   	reduce2
              spark       1    	reduce2
              
            • MapTask1: fileb

              spark       1       reduce2
              hadoop      1       reduce1
              hadoop      1		reduce1
              hadoop      1    	reduce1
              
            • MapTask2: filea

              hue 	1			reduce1
              hive 	1			reduce2
              hive 	1			reduce2
              spark 	1			reduce2
              spark	1			reduce2
              
            • MapTask2:fileb

              hadoop 	1			reduce1
              hbase	1			reduce1
              spark 	1			reduce2
              spark 	1			reduce2
              

      • Merge: each MapTask merges all the small spill files it generated, so that every MapTask ends up with exactly one large file

        • Sorting also happens during this merge, implemented as a merge sort; only the index is kept in memory

          • The sorting logic still calls the sorter or compareTo

          • MapTask1

            hadoop      1		reduce1
            hadoop      1       reduce1
            hadoop      1		reduce1
            hadoop      1    	reduce1
            hbase       1     	reduce1
            hive        1   	reduce2
            spark       1    	reduce2
            spark       1       reduce2
            
          • MapTask2

            hadoop 	1			reduce1
            hbase	1			reduce1
            hue 	1			reduce1
            hive 	1			reduce2
            hive 	1			reduce2
            spark 	1			reduce2
            spark	1			reduce2
            spark 	1			reduce2
            spark 	1			reduce2
            
    • Shuffle on the Reduce side: deliver the Map-side results to Reduce

      • merge

        • Pull: over the HTTP protocol, each ReduceTask fetches its own partition of data from the output of every MapTask

          • reduceTask1

            • MapTask1

              hadoop      1
              hadoop      1 
              hadoop      1
              hadoop      1
              hbase       1
              
            • MapTask2

              hadoop 	1
              hbase	1
              hue 	1
              
          • reduceTask2

            • MapTask1

              hive        1
              spark       1
              spark       1
              
            • MapTask2

              hive 	1
              hive 	1
              spark 	1
              spark	1
              spark 	1
              spark 	1
              
        • Merge: sorting also happens during this merge, implemented as a merge sort [only the index is kept in memory]

          • The sorting logic still calls the sorter or compareTo

          • reduceTask1

            hadoop      1
            hadoop      1 
            hadoop      1
            hadoop      1
            hadoop 		1
            hbase       1
            hbase		1
            hue 		1
            
          • reduceTask2

            hive        1
            hive 		1
            hive 		1
            spark       1
            spark       1
            spark 		1
            spark		1
            spark 		1
            spark 		1
            
      • Group: merge the values of the same key into one iterator

        • reduceTask1

          hadoop      1,1,1,1,1
          hbase       1,1
          hue 		1
          
        • reduceTask2

          hive        1,1,1
          spark       1,1,1,1,1,1
          
  • Reduce

    • Function: aggregate the results of Shuffle, calling the reduce method once for each key and its values

      protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable value : values) {
              sum += value.get();
          }
          context.write(key, new IntWritable(sum));
      }
      
    • ReduceTask1

      hadoop	5
      hbase	2
      hue		1
      
    • ReduceTask2

      hive	3
      spark	6
      
  • Output

    • part-r-00000

      hadoop	5
      hbase	2
      hue		1
      
    • part-r-00001

      hive	3
      spark	6
      

3. Flow chart

  • Shuffle

    • Process: distributed memory => disk [do the work that cannot be done entirely in memory] => memory

      • map1: 1 3 => 3 1

      • map2: 4 5 => 5 4

      • map3: 1 7 => 7 1

      • Requirement: global reverse order

        • Merge on disk: 3 1, 5 4, 7 1 => 7 5 4 3 1 1 => the globally ordered result is then read back into memory

4, Two optimizations in Shuffle

1. Combiner: aggregation on the Map side

  • An optional step in the Shuffle process; it is off by default and must be enabled manually

    • Not all programs can use a Combiner

    • The aggregation logic must be splittable into partial aggregations, analogous to the distributive law

      (a + b) * c = a * c + b * c

  • Without a Combiner

  • With a Combiner

  • Main purpose

    • The number of concurrent MapTasks is usually far greater than the number of ReduceTasks

    • Each Map completes part of the aggregation, and Reduce only does the final aggregation, which reduces the load on Reduce

    • The official wordcount uses a Combiner

  • wordcount developed by ourselves

  • You can also enable Combiner when developing by yourself

    job.setCombinerClass(WordCountReduce.class);//Set up Combiner
    
    • The Combiner class is usually the Reducer class, and the aggregation logic is the same
    • If you cannot tell whether the logic can be split, run the job with and without the Combiner and test
      • whether the two results are consistent (a small plain-Java illustration follows this list)
    • The types must line up when the Reducer is reused as the Combiner
      • Map output type => Combiner input type
      • Combiner output type => Reduce input type
      • so when Combiner = Reducer, the Reducer's input and output types must be the same
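
    • A small plain-Java illustration of the "splittable" condition (not MapReduce code; the numbers are arbitrary): a sum can be partially aggregated per Map and then combined, while a naive average cannot

      public class CombinerCondition {
          public static void main(String[] args) {
              int[] map1 = {1, 2, 3};   //values seen by MapTask1
              int[] map2 = {4, 5};      //values seen by MapTask2

              //Global sum computed in one place (what a single Reduce would do)
              int globalSum = sum(map1) + sum(map2);
              //Partial sums per Map (the Combiner), then combined by Reduce: same result
              int combinedSum = sum(new int[]{sum(map1), sum(map2)});
              System.out.println(globalSum == combinedSum);   //true: sum is safe for a Combiner

              //Average is NOT safe: averaging the per-Map averages gives a different answer
              double globalAvg = (double) globalSum / (map1.length + map2.length);   //3.0
              double avgOfAvgs = (avg(map1) + avg(map2)) / 2;                        //3.25
              System.out.println(globalAvg == avgOfAvgs);     //false
          }

          static int sum(int[] a) { int s = 0; for (int v : a) s += v; return s; }
          static double avg(int[] a) { return (double) sum(a) / a.length; }
      }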

2. Compress: compress

  • Compression type in life: zip/rar/7z

  • Compression in big data: prefer splittable compression types so the splits can still be processed in parallel

    • 300 MB => compression => 200 MB
    • block1: 128 MB => split1 => MapTask1 [node01]
    • block2: 72 MB => split2 => MapTask2 [node02]
    • Commonly chosen types: snappy, lz4, lzo
  • Choosing a codec is a tradeoff

    • Some codecs compress and decompress slowly with complex algorithms but achieve a high compression ratio [compressed size / original size]
    • Others compress and decompress quickly with simpler algorithms but a lower compression ratio
  • Advantage: compression in MapReduce

    • Reduce disk and network IO, improving data transmission and storage efficiency
      • Shuffle Write: 1 TB => writing to disk at 1 s/GB => 1024 s
        • With compression: 1 TB => compress => 700 GB => 1 s/GB => 700 s + compression time => about 750 s
      • Shuffle Read: reading 1 TB from disk => 1024 s
        • With compression: read 700 GB from disk => 700 s + decompression time => about 750 s
  • All the storage and computing frameworks covered later in this course also support compression

  • Places in MapReduce where data can be compressed

    • Input: MapReduce can read a compressed file as input [rarely configured here]

      • The file format is determined by whatever produced the data
      • When reading a compressed file, MapReduce picks the codec from the file's metadata (its extension)
    • Shuffle phase: compress the output of Map [the main place to apply compression]

      #Enable compression
      mapreduce.map.output.compress=true
      #Specify the type of compression
      mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec
      
      org.apache.hadoop.io.compress.Lz4Codec
      org.apache.hadoop.io.compress.SnappyCodec
      
      • Three ways to set the configuration

        • mapred-site.xml: global, compresses for all programs

        • In the MapReduce program: conf.set(key, value)

        • As parameters of the submit command: temporary, only for that run

          yarn jar sougou.jar cn.itcast.hadoop.mapreduce.compress.HotKeyMR -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec /app/sougou /app/output/sougou1
          
    • Output: MapReduce can output compressed files [less commonly used]

      mapreduce.output.fileoutputformat.compress=true
      mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
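
      • Equivalently, the output compression can be set in code; a minimal sketch using the FileOutputFormat helpers (SnappyCodec here is only an example codec):

        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);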
      
  • bin/hadoop checknative

    hadoop:  true /export/servers/hadoop-2.6.0-cdh5.14.0/lib/native/libhadoop.so.1.0.0
    zlib:    true /lib64/libz.so.1
    snappy:  true /usr/lib64/libsnappy.so.1
    lz4:     true revision:10301
    bzip2:   true /lib64/libbz2.so.1
    
  • Demonstration

    • Demand: count the number of times each search term appears

      /**
       * @ClassName WordCount
       * @Description TODO Statistics of popular search terms based on Sogou data
       * @Date 2020/1/9 16:00
       * @Create By     Frank
       */
      public class HotKeyMR extends Configured implements Tool {
      
          /**
           * Build a MapReduce program, configuration program and submission program
           * @param args
           * @return
           * @throws Exception
           */
          @Override
          public int run(String[] args) throws Exception {
              /**
               * First: construct a MapReduce Job
               */
              //Construct a job object
              Job job = Job.getInstance(this.getConf(),"mrword");
              //Set the class the job runs
              job.setJarByClass(HotKeyMR.class);
              /**
               * Second: configure job
               */
              //Input: set the input class and input path
      //        job.setInputFormatClass(TextInputFormat.class); this is the default
              Path inputPath = new Path(args[0]);//Take the first parameter of the program as the input path
              TextInputFormat.setInputPaths(job,inputPath);
              //map
              job.setMapperClass(WordCountMapper.class);//Specify Mapper's class
              job.setMapOutputKeyClass(Text.class);//Specifies the type of key output by the map
              job.setMapOutputValueClass(IntWritable.class);//Specifies the value type of the map output
              //shuffle
              job.setCombinerClass(WordCountReduce.class);//Set up Combiner
              //reduce
              job.setReducerClass(WordCountReduce.class);//Specify the class of reduce
              job.setOutputKeyClass(Text.class);//Specify the key type of reduce output
              job.setOutputValueClass(IntWritable.class);//Specify the value type of the reduce output
      //        job.setNumReduceTasks(1); // this is the default
              //output
      //        job.setOutputFormatClass(TextOutputFormat.class); // this is the default output class
              Path outputPath = new Path(args[1]);//Use the second parameter of the program as the output path
              //Delete if output directory already exists
              FileSystem hdfs = FileSystem.get(this.getConf());
              if(hdfs.exists(outputPath)){
                  hdfs.delete(outputPath,true);
              }
              //Set the output address
              TextOutputFormat.setOutputPath(job,outputPath);
      
              /**
               * Third: submit job
               */
              //Submit the job to run, return the boolean value, return true for success and false for failure
              return job.waitForCompletion(true) ? 0 : -1;
          }
      
          /**
           * The entry of the whole program, which is responsible for calling the run method of the current class
           * @param args
           */
          public static void main(String[] args) {
              //Construct a conf object to manage all the configuration of the current program
              Configuration conf = new Configuration();
              //Configuration compression
              conf.set("mapreduce.map.output.compress","true");
              conf.set("mapreduce.map.output.compress.codec","org.apache.hadoop.io.compress.Lz4Codec");
              try {
                  //Call the run method of the current class
                  int status = ToolRunner.run(conf, new HotKeyMR(), args);
                  //Exit according to the result of program operation
                  System.exit(status);
              } catch (Exception e) {
                  e.printStackTrace();
              }
          }
      
      
          /**
           * Mapper Four generics, inputkey,inputValue,outputKey,outputValue
           * Generics of input: determined by the input class: TextInputFormat:Longwritable Text
           * Generics of output: determined by code logic: Text, IntWritable
           * Rewrite map method
           */
          public static class WordCountMapper extends Mapper<LongWritable, Text,Text, IntWritable>{
              //Construct key and value for output
              private Text outputKey = new Text();
              private IntWritable outputValue = new IntWritable(1);
      
              /**
               * map Method: every keyvalue passed by Input calls the map method once
               * @param key: Current key
               * @param value: Current value
               * @param context: Context, responsible for outputting the new keyvalue
               * @throws IOException
               * @throws InterruptedException
               */
              @Override
              protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                  //Convert the contents of each line to String
                  String line = value.toString();
                  //Split the contents of each line
                  String[] words = line.split("\t");
                //Take the third field (index 2): the user's search keyword is the key
                  this.outputKey.set(words[2]);
                  //output
                  context.write(this.outputKey,this.outputValue);
              }
          }
      
          /**
           * All Reduce needs to implement four generics
           * keyvalue of input: the keyvalue type of Map output
           * Output keyvalue: determined by code logic
           * Override reduce method
           */
          public static class WordCountReduce extends Reducer<Text, IntWritable,Text, IntWritable>{
      
              private IntWritable outputValue = new IntWritable();
      
              /**
               * reduce Method, every keyvalue will call the reduce method once
               * @param key: key passed in
               * @param values: Iterator, all value s of the current key
               * @param context
               * @throws IOException
               * @throws InterruptedException
               */
              @Override
              protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
                  //Take the value of the iterator for accumulation
                  int sum = 0;
                  for (IntWritable value : values) {
                      sum += value.get();
                  }
                  //value encapsulated as output
                  this.outputValue.set(sum);
                  //Output the result of each key
                  context.write(key,outputValue);
              }
          }
      
      
      }
      
      

5, Join scheme in MapReduce

1,Reduce Join

  • join occurs on the reduce side

  • join in SQL

    inner join: a row appears in the result only if it matches on both sides
    left join: every row from the left side appears in the result
    right join: every row from the right side appears in the result
    full join: a row from either side appears in the result
    
    
    select * from a join b  =  select * from a, b  =>  a Cartesian product when no join condition is given
    
    
    join: association of columns
    	a		id		name 		age		sex
    	b		id		phone		addr
    	Query Zhang San's mobile number
    	
    union: association of rows
    	a id name age sex: Freshman data
        b id name age sex: sophomore data
       
    
  • Demand: two pieces of data

    • Order data

      1001,20150710,p0001,2
      1002,20150710,p0002,3
      1002,20150710,p0003,3
      
      Order No., date, commodity id, commodity quantity
      
    • Commodity data

      p0001,helicopter,1000,2000
      p0002,tank,1000,3000
      p0003,rocket,10000,2000
      
      Commodity id, commodity name, price, inventory
      
    • Association: get the product name of each order

      1001,20150710,p0001,2,helicopter
      1002,20150710,p0002,3,tank
      1002,20150710,p0003,3,rocket
      
    • Analysis: MapReduce implementation

      • Step 1: results

        • Include product name in addition to order data
      • Step 2: see if there is grouping or sorting

        • Grouping: fields of a join
        • key: Commodity id
      • Step 3: value

        • For order data, other fields besides commodity id are value
        • For commodity data, except for commodity id, only commodity name is required
      • Step 4: Verification

        • Input

          1001,20150710,p0001,2
          1002,20150710,p0002,3
          1002,20150710,p0003,3
           p0001, helicopter, 10002000
           p0002, tank, 10003000
           p0003, rocket, 100002000
          
        • map

          • key: Commodity id
          • value:
            • If it is an order: order number, date, product quantity
            • In case of commodity: commodity name
        • shuffle

          • Grouping

            • All product and order data corresponding to the same commodity id end up in one iterator

              p0001	helicopter, 1001,20150710,2
              p0002	tank, 1002,20150710,3
              p0003	rocket, 1002,20150710,3
            
  • Realization

    /**
     * @ClassName ReduceJoin
     * @Description TODO Reduce The process of implementing join
     * @Date 2020/1/9 17:25
     * @Create By     Frank
     */
    public class ReduceJoin extends Configured implements Tool {
    
    
        /**
         * Specific definition of the whole MapReduce job: build, configure and submit
         * @param args
         * @return
         * @throws Exception
         */
        @Override
        public int run(String[] args) throws Exception {
            /**
             * Build a job
             */
            //Create an instance of a job
            Job job = Job.getInstance(this.getConf(),"mrjob");
            //Set the class the job runs
            job.setJarByClass(ReduceJoin.class);
    
            /**
             * Configure job
             */
            //Input: define the input mode and input path
            Path orderPath = new Path("datas/mrjoin/orders.txt");
            Path productPath = new Path("datas/mrjoin/product.txt");
            TextInputFormat.setInputPaths(job,orderPath,productPath);
            //Map: define the class and output type of map stage
            job.setMapperClass(MRJoinMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            //Shuffle: defines the class implemented in the shuffle phase
            //Reduce: define the class and output type of the reduce phase
            job.setReducerClass(MRJoinReduce.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(1);//Set the number of Reduce, that is, the number of partitions
            //Output: defines the output class and output path
            Path outputPath = new Path("datas/output/join/reducejoin");
            //Delete if output exists
            FileSystem hdfs = FileSystem.get(this.getConf());
            if(hdfs.exists(outputPath)){
                hdfs.delete(outputPath,true);
            }
            TextOutputFormat.setOutputPath(job,outputPath);
    
            /**
             * Submit job: and return according to the result of job running
             */
            return job.waitForCompletion(true) ? 0:-1;
        }
    
    
        /**
         * Program entry
         * @param args
         * @throws Exception
         */
        public static void main(String[] args) throws Exception {
            //Build a Conf object to manage all the configuration of the current program
            Configuration conf = new Configuration();
            //Call the run method of the current class
            int status = ToolRunner.run(conf, new ReduceJoin(), args);
            //Exit the whole program according to the running state of the job
            System.exit(status);
        }
    
        /**
         * Define Mapper implementation class and processing logic in Map process
         */
        public static class MRJoinMapper extends Mapper<LongWritable,Text,Text, Text>{
    
            private Text outputKey = new Text();
            private Text outputValue = new Text();
    
            @Override
            protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                //First determine which file the current data is from
                FileSplit inputSplit = (FileSplit) context.getInputSplit();//First, obtain which file segments this data belongs to
                String fileName = inputSplit.getPath().getName();//Get file name
                //If it is order data, key is the third field and value is the other remaining fields
                if("orders.txt".equals(fileName)){
                    String[] items = value.toString().split(",");
                    this.outputKey.set(items[2]);//Commodity id
                    this.outputValue.set(items[0]+"\t"+items[1]+"\t"+items[3]);//Other fields
                    context.write(this.outputKey,this.outputValue);//Output order data
                }else{
                    //If it is commodity data, key is the first field and value is the second field
                    String[] split = value.toString().split(",");
                    this.outputKey.set(split[0]);//Commodity id
                    this.outputValue.set(split[1]);//Trade name
                    context.write(this.outputKey,this.outputValue);//Export product data
                }
    
    
            }
        }
    
        /**
         * Define the implementation class of Reducer and the processing logic in the process of reduction
         */
        public static class MRJoinReduce extends Reducer<Text,Text,Text,Text>{
    
            private Text outputValue = new Text();
    
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
                StringBuilder stringBuilder = new StringBuilder();
                for (Text value : values) {
                    stringBuilder.append(value.toString()+"\t");//Splice the names of all products and all corresponding orders
                }
                this.outputValue.set(stringBuilder.toString());//Take product name and order as value
                context.write(key,this.outputValue);
            }
        }
    
    }
    
    
  • Application scenario

    • Big data join big data

    • Relatively inefficient, because all of the data has to go through Shuffle

    • Commodity data: 60000

    • Order data: 90000

2,Map Join

  • Commodity data: 10000

  • Order data: 20 million

  • Small data join big data

  • Idea: put the small dataset into the distributed cache. When processing each record of the big dataset, the Map takes the small data directly from the distributed cache and completes the join on the Map side, with no Shuffle needed

  • Realization

    public class MapJoin extends Configured implements Tool {
        /**
         * Specific definition of the whole MapReduce job: build, configure and submit
         * @param args
         * @return
         * @throws Exception
         */
        @Override
        public int run(String[] args) throws Exception {
            /**
             * Build a job
             */
            //Create an instance of a job
            Job job = Job.getInstance(this.getConf(),"mrjob");
            //Set the class the job runs
            job.setJarByClass(MapJoin.class);
    
            /**
             * Configure job
             */
            //Input: define the input mode and input path
            Path orderPath = new Path("datas/mrjoin/orders.txt");
            TextInputFormat.setInputPaths(job,orderPath);
            //Put product data into distributed cache
            Path productPath = new Path("datas/mrjoin/product.txt");
            job.addCacheFile(productPath.toUri());
            //Map: define the class and output type of map stage
            job.setMapperClass(MRJoinMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            //Shuffle: defines the class implemented in the shuffle phase
            //Reduce: define the class and output type of the reduce phase
    //        job.setReducerClass(MRJoinReduce.class);
    //        job.setOutputKeyClass(Text.class);
    //        job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(0);//Set the number of Reduce, that is, the number of partitions
            //Output: defines the output class and output path
            Path outputPath = new Path("datas/output/join/mapjoin");
            //Delete if output exists
            FileSystem hdfs = FileSystem.get(this.getConf());
            if(hdfs.exists(outputPath)){
                hdfs.delete(outputPath,true);
            }
            TextOutputFormat.setOutputPath(job,outputPath);
    
            /**
             * Submit job: and return according to the result of job running
             */
            return job.waitForCompletion(true) ? 0:-1;
        }
    
    
        /**
         * Program entry
         * @param args
         * @throws Exception
         */
        public static void main(String[] args) throws Exception {
            //Build a Conf object to manage all the configuration of the current program
            Configuration conf = new Configuration();
            //Call the run method of the current class
            int status = ToolRunner.run(conf, new MapJoin(), args);
            //Exit the whole program according to the running state of the job
            System.exit(status);
        }
    
        /**
         * Define Mapper implementation class and processing logic in Map process
         */
        public static class MRJoinMapper extends Mapper<LongWritable,Text,Text, Text>{
    
            private Text outputKey = new Text();
            private Text outputValue = new Text();
            Map<String,String> maps = new HashMap<>();
    
            /**
             * Map And Reduce: three methods
             *      1-setup: It will be executed before the map or reduce method
             *      2-map/reduce: map Logic or reduce logic
             *      3-close: Final execution method
             * @param context
             * @throws IOException
             * @throws InterruptedException
             */
    
            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
                //Read data from distributed cache
                URI[] cacheFiles = context.getCacheFiles();//Get all cached data
                //Read file contents
                BufferedReader bufferedReader = new BufferedReader(new FileReader(cacheFiles[0].getPath()));
                String line = null;
                while(StringUtils.isNotBlank(line = bufferedReader.readLine())){
                    //Read to each line
                    String pid = line.split(",")[0];//Commodity id
                    String productName = line.split(",")[1];//Trade name
                    //Put commodity id and name into map set
                    maps.put(pid,productName);
                }
                //Close the reader of the cached file
                bufferedReader.close();
            }
    
            @Override
            protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                //Get order data
                String[] items = value.toString().split(",");
                String pid = items[2]; //Item id in order
                String productName = maps.get(pid);
                this.outputKey.set(productName);
                this.outputValue.set(value.toString());
                context.write(this.outputKey,this.outputValue);
            }
        }
    
        /**
         * Define the implementation class of Reducer and the processing logic in the process of reduction
         */
        public static class MRJoinReduce extends Reducer<Text,Text,Text,Text>{
    
            private Text outputValue = new Text();
    
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
                StringBuilder stringBuilder = new StringBuilder();
                for (Text value : values) {
                    stringBuilder.append(value.toString()+"\t");//Splice the names of all products and all corresponding orders
                }
                this.outputValue.set(stringBuilder.toString());//Take product name and order as value
                context.write(key,this.outputValue);
            }
        }
    
    }
    
    
  • Application scenario

    • Suitable for small data join large data scenario

6, Read write database

1. Input and Output

  • Input: every input class inherits from InputFormat
    • Default: TextInputFormat extends FileInputFormat extends InputFormat
    • File
    • Database
  • Output: every output class inherits from OutputFormat
    • Default: TextOutputFormat extends FileOutputFormat extends OutputFormat
    • File
    • Database
  • Database access goes through JDBC

2, read MySQL

  • Change the input class

    job.setInputFormatClass(DBInputFormat.class);
    
  • Customize a data type to receive data from MySQL

    • In addition to the implementation of Writable interface, DBWritable should also be implemented

      public  class DBReader implements Writable,DBWritable
      
    • Realize the serialization and deserialization of database objects

      	public void write(PreparedStatement statement) throws SQLException {
      		//Serialization to the database: set each field on the SQL statement, in column order
      		statement.setString(1, word);
      		statement.setInt(2, number);
      	}
      
      	public void readFields(ResultSet resultSet) throws SQLException {
      		//Deserialization from the database: read each field from the result set, in column order
      		this.word = resultSet.getString(1);
      		this.number = resultSet.getInt(2);
      	}
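
    • For completeness, a minimal sketch of the whole DBReader bean (the word and number fields follow the snippets above and the setters used in the write-MySQL mapper; the exact layout is an assumption):

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import java.sql.PreparedStatement;
      import java.sql.ResultSet;
      import java.sql.SQLException;
      import org.apache.hadoop.io.Writable;
      import org.apache.hadoop.mapreduce.lib.db.DBWritable;

      public class DBReader implements Writable, DBWritable {
          private String word;
          private int number;

          //Hadoop serialization (used when the object travels through Shuffle)
          public void write(DataOutput out) throws IOException {
              out.writeUTF(word);
              out.writeInt(number);
          }
          public void readFields(DataInput in) throws IOException {
              this.word = in.readUTF();
              this.number = in.readInt();
          }

          //Database serialization (used by DBOutputFormat when writing a row)
          public void write(PreparedStatement statement) throws SQLException {
              statement.setString(1, word);
              statement.setInt(2, number);
          }
          //Database deserialization (used by DBInputFormat when reading a row)
          public void readFields(ResultSet resultSet) throws SQLException {
              this.word = resultSet.getString(1);
              this.number = resultSet.getInt(2);
          }

          public void setWord(String word) { this.word = word; }
          public void setNumber(int number) { this.number = number; }

          //What TextOutputFormat writes if this object is output as text
          public String toString() { return word + "\t" + number; }
      }
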
      
  • After creating the conf object, you should immediately configure the jdbc connection parameters

    Configuration conf = new Configuration();
    DBConfiguration.configureDB(
        conf, 
        "com.mysql.jdbc.Driver", 
        "jdbc:mysql://localhost:3306/test",
        "root", 
        "123456"
    );
    
  • Configure read data: tables, SQL statements, fields

    DBInputFormat.setInput(
        job, 
        DBReader.class, //Class that holds each row that is read
        "wcresult",     //Table name
        null,           //Filter condition (where clause)
        "number",       //Field to order by
        fields          //Which fields to read
    );
    
    //Alternative overload: read with a custom SQL statement instead of a table name
    public static void setInput(
          Job job,
          Class<? extends DBWritable> inputClass,  //Class that holds each row that is read
          String inputQuery,       //SQL query; the fields it returns must match the inputClass properties
          String inputCountQuery   //SQL that counts the total number of rows
          ) {
      }
    

3, write MySQL

  • Configure the output class

    job.setOutputFormatClass(DBOutputFormat.class);
    
  • Configure output parameters: table, field

    DBOutputFormat.setOutput(job, "wcresult", fields);
    
  • The key object that the program outputs to MySQL must implement the DBWritable interface [its write(PreparedStatement) method does the serialization]

    public static class WriteMap extends Mapper<LongWritable, Text, DBReader, NullWritable>{
    	
    		private DBReader outputKey = new DBReader();
    		private NullWritable outputValue = NullWritable.get();
    		
    		@Override
    		protected void map(LongWritable key, Text value,
    				Context context)
    						throws IOException, InterruptedException {
    			//Parse each line of the wordcount result: word and count separated by a tab
    			String line = value.toString();
    			this.outputKey.setWord(line.split("\t")[0]);
    			this.outputKey.setNumber(Integer.valueOf(line.split("\t")[1]));
    			context.write(outputKey, outputValue);
    		}
    	}
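
  • A minimal sketch of wiring the write job together (the table name wcresult and the columns word/number come from the snippets above; the JDBC driver, URL, credentials and the input path repeat the read example and are illustrative assumptions):

    //Configure the JDBC connection before building the job
    Configuration conf = new Configuration();
    DBConfiguration.configureDB(
        conf,
        "com.mysql.jdbc.Driver",
        "jdbc:mysql://localhost:3306/test",
        "root",
        "123456"
    );
    Job job = Job.getInstance(conf, "writeMySQL");

    //Input: the wordcount result file (assumed path), one "word \t count" per line
    TextInputFormat.setInputPaths(job, new Path("datas/output/wordcount"));

    //Map-only job: the DBReader key is written straight to the database
    job.setMapperClass(WriteMap.class);
    job.setOutputKeyClass(DBReader.class);
    job.setOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(0);

    //Output: write to the wcresult table, filling the word and number columns
    job.setOutputFormatClass(DBOutputFormat.class);
    DBOutputFormat.setOutput(job, "wcresult", "word", "number");

    job.waitForCompletion(true);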
    

Posted by nick1 on Tue, 14 Jan 2020 02:21:13 -0800