Hadoop Part 2: MapReduce

Keywords: Hadoop, Spark, Hive, HBase

MapReduce (3)

Project address: https://github.com/KingBobTitan/hadoop.git

MR's Shuffle explanation and Join implementation

1, Review

1. MapReduce's history monitoring service: JobHistoryServer

  • Function: used to monitor the information of all MapReduce programs that have run on YARN
  • Requires YARN log aggregation to be configured: the aggregated logs are stored on HDFS
  • Launch the service; its web UI listens on port 19888 (a sample configuration follows)
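  • A minimal configuration sketch (the ports are the Hadoop defaults; the hostname node01 is an assumption):

    #mapred-site.xml
    mapreduce.jobhistory.address=node01:10020
    mapreduce.jobhistory.webapp.address=node01:19888

    #yarn-site.xml: enable log aggregation so finished job logs are kept on HDFS
    yarn.log-aggregation-enable=true

    #Start the service (Hadoop 2.x)
    sbin/mr-jobhistory-daemon.sh start historyserver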

2. Custom data types: encapsulating JavaBeans in Hadoop

  • Encapsulation requires serialization

  • Implementation interface

    • Writable: implements serialization only
      • write: serialization
      • readFields: deserialization
    • WritableComparable: implements serialization and comparison
      • write: serialization
      • readFields: deserialization
      • compareTo: the comparison method
    • If the custom type will go through the Shuffle process as a key, it needs to implement the WritableComparable interface (a minimal sketch is shown at the end of this item)
  • Sorting: custom data types

    • First check whether a sorter is set. If there is one, use the sorter

      job.setSortComparatorClass(RawComparator<T>) => usually a class that extends WritableComparator
      
    • If not, the type's compareTo method is called

    • If neither exists, an error is thrown

  • Sorting: default types such as Text and IntWritable

    • First check whether a sorter is set. If there is one, use the sorter
    • If there is no sorter, the default comparator of the type is called
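
  • A minimal sketch of such a custom key type (the word/count fields and the ordering are illustrative assumptions, not the course's code):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class WordCountBean implements WritableComparable<WordCountBean> {
        private String word;
        private int count;

        //Serialization: write the fields in a fixed order
        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(word);
            out.writeInt(count);
        }

        //Deserialization: read the fields back in the same order
        @Override
        public void readFields(DataInput in) throws IOException {
            this.word = in.readUTF();
            this.count = in.readInt();
        }

        //Comparison used by the sort in Shuffle when no sorter is set
        @Override
        public int compareTo(WordCountBean other) {
            return this.word.compareTo(other.word);
        }

        public void set(String word, int count) { this.word = word; this.count = count; }

        //What TextOutputFormat writes if this type is used as output
        @Override
        public String toString() { return word + "\t" + count; }
    }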

3. Basic Shuffle process

  • Input

    • Function
      • Convert all input data to KeyValue pairs
      • Split the job: divide the input into splits (slices), each of which becomes one computing task
  • Map

    • Function: divide
      • Start MapTasks according to the splits produced by Input; one split corresponds to one MapTask
      • Each MapTask calls the map method once for every keyvalue in its split
    • Implementation logic
      • map method customization
  • Shuffle

    • Function

      • Partition: if there are multiple Reduces, decide which Reduce will process the current keyvalue
      • Sort: sort the keys according to the sorting rules
      • Group: the values of the same key are combined into one iterator, so each key appears only once
    • Implementation logic

      • Partition: hash partitioning by default

        • Custom: extend Partitioner<KEY, VALUE> and override

          getPartition(key, value, numReduceTasks)
          
        • a minimal sketch of a custom partitioner is shown at the end of this review item
      • Sort: ascending by key in dictionary order by default

        • Option 1: define a sorter [highest priority]
        • Option 2: have the custom data type implement WritableComparable
      • Group: group by key by default

  • Reduce

    • Function: combine
      • Aggregate the data output by Shuffle
    • Implementation logic
      • the customized reduce method
  • Output

    • Output the result of Reduce and save it to the corresponding file system
    • Default TextOutputFormat: key and value are tab separated
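
  • A minimal sketch of the custom partitioner mentioned above (the rule of sending the word "hadoop" to reduce 0 is an illustrative assumption; the fallback is the same formula the default HashPartitioner uses):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        //Decide which ReduceTask will process the current keyvalue
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            //Illustrative rule: all "hadoop" records go to reduce 0
            if ("hadoop".equals(key.toString())) {
                return 0;
            }
            //Everything else falls back to the default hash rule
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

    //Enable it in the job configuration:
    //job.setPartitionerClass(WordPartitioner.class);
    //job.setNumReduceTasks(2);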

2, Course objectives

  1. Detailed explanation of Shuffle process [key points]

  2. Two optimizations in Shuffle

  3. Join scheme in MapReduce [Master]

  4. Read write database [understand]

3, Detailed explanation of Shuffle process

1. Function

  • partition
  • sort
  • Grouping

2. Stages

  • Input

    • Input data:

      • file1

        hadoop hive hbase spark spark
        hadoop hadoop hadoop
        
      • file2

        hue hive hive spark spark
        hadoop spark spark hbase
        
    • Function: slice the input and convert it to keyvalue pairs

    • output

      • split1

        key							value
        0							hadoop hive hbase spark spark
        30							hadoop hadoop hadoop
        
      • split2

        key							value
        0							hue hive hive spark spark
        26							hadoop spark spark hbase
        
  • Map

    • Function: start one MapTask per split, then call the map method for each keyvalue in that MapTask

      String[] arr = value.toString().split(" ");
      for (String word : arr) {
          this.outputKey.set(word);
          this.outputValue.set(1);
          context.write(this.outputKey, this.outputValue);
      }
      
    • MapTask1

      hadoop      1			
      hive        1   
      hbase       1     
      spark       1          
      spark       1         
      hadoop      1         
      hadoop      1      
      hadoop      1    
      
    • MapTask2

      hue 	1
      hive 	1
      hive 	1
      spark 	1
      spark	1
      hadoop 	1
      spark 	1
      spark 	1
      hbase	1
      
  • Shuffle: partition, sort, group

    • Shuffle on the Map side: processing the result of Map

      • Spill: overflow write [write data from memory to files on disk]

        • Each MapTask puts its own processing results into a circular memory buffer [100 MB by default]

        • When the buffer reaches 80%, a spill is triggered, and the data to be spilled is first partitioned and sorted

          • Partition: by default, the hash of the key modulo the number of reducers; in essence, each record is just labeled with the reduce that will process it

            • MapTask1: filea

              hadoop      1		reduce1
              hive        1   	reduce2
              hbase       1     	reduce1
              spark       1    	reduce2
              
            • MapTask1: fileb

              spark       1       reduce2
              hadoop      1       reduce1
              hadoop      1		reduce1
              hadoop      1    	reduce1
              
            • MapTask2: filea

              hue 	1			reduce1
              hive 	1			reduce2
              hive 	1			reduce2
              spark 	1			reduce2
              spark	1			reduce2
              
            • MapTask2:fileb

              hadoop 	1			reduce1
              spark 	1			reduce2
              spark 	1			reduce2
              hbase	1			reduce1
              
          • Sorting: call the sorter or the compareTo method; implemented with quick sort

            • The data is not globally ordered across the whole batch; it is ordered within each partition

            • MapTask1: filea

              hadoop      1		reduce1
              hbase       1     	reduce1
              hive        1   	reduce2
              spark       1    	reduce2
              
            • MapTask1: fileb

              spark       1       reduce2
              hadoop      1       reduce1
              hadoop      1		reduce1
              hadoop      1    	reduce1
              
            • MapTask2: filea

              hue 	1			reduce1
              hive 	1			reduce2
              hive 	1			reduce2
              spark 	1			reduce2
              spark	1			reduce2
              
            • MapTask2:fileb

              hadoop 	1			reduce1
              hbase	1			reduce1
              spark 	1			reduce2
              spark 	1			reduce2
              

      • Merge: each MapTask merges all the small spill files it generated, so that every MapTask ends up with exactly one large file

        • Sorting also happens during this merge, implemented as a merge sort; only the index is kept in memory

          • The sorting logic still calls the sorter or compareTo

          • MapTask1

            hadoop      1		reduce1
            hadoop      1       reduce1
            hadoop      1		reduce1
            hadoop      1    	reduce1
            hbase       1     	reduce1
            hive        1   	reduce2
            spark       1    	reduce2
            spark       1       reduce2
            
          • MapTask2

            hadoop 	1			reduce1
            hbase	1			reduce1
            hue 	1			reduce1
            hive 	1			reduce2
            hive 	1			reduce2
            spark 	1			reduce2
            spark	1			reduce2
            spark 	1			reduce2
            spark 	1			reduce2
            
    • Shuffle on the Reduce side: deliver the Map-side results to Reduce

      • merge

        • Pull: over the HTTP protocol, each ReduceTask fetches its own partition of data from the output of every MapTask

          • reduceTask1

            • MapTask1

              hadoop      1
              hadoop      1 
              hadoop      1
              hadoop      1
              hbase       1
              
            • MapTask2

              hadoop 	1
              hbase	1
              hue 	1
              
          • reduceTask2

            • MapTask1

              hive        1
              spark       1
              spark       1
              
            • MapTask2

              hive 	1
              hive 	1
              spark 	1
              spark	1
              spark 	1
              spark 	1
              
        • Merge: sorting also happens during this merge, implemented as a merge sort [only the index is kept in memory]

          • The sorting logic still calls the sorter or compareTo

          • reduceTask1

            hadoop      1
            hadoop      1 
            hadoop      1
            hadoop      1
            hadoop 		1
            hbase       1
            hbase		1
            hue 		1
            
          • reduceTask2

            hive        1
            hive 		1
            hive 		1
            spark       1
            spark       1
            spark 		1
            spark		1
            spark 		1
            spark 		1
            
      • Group: merge the values of the same key into one iterator

        • reduceTask1

          hadoop      1,1,1,1,1
          hbase       1,1
          hue 		1
          
        • reduceTask2

          hive        1,1,1
          spark       1,1,1,1,1,1
          
  • Reduce

    • Function: aggregate the results of Shuffle, calling the reduce method once for each key and its values

      protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable value : values) {
              sum += value.get();
          }
          context.write(key, new IntWritable(sum));
      }
      
    • ReduceTask1

      hadoop	5
      hbase	2
      hue		1
      
    • ReduceTask2

      hive	3
      spark	6
      
  • Output

    • part-r-00000

      hadoop	5
      hbase	2
      hue		1
      
    • part-r-00001

      hive	3
      spark	6
      

3. Flow chart

  • Shuffle

    • Process: distributed memory => disk [do the work that cannot be done entirely in memory] => memory

      • map1: 1 3 => 3 1

      • map2: 4 5 => 5 4

      • map3: 1 7 => 7 1

      • Requirement: global reverse order

        • Merge on disk: 3 1, 5 4, 7 1 => 7 5 4 3 1 1 => the globally ordered result is then read back into memory

4, Two optimizations in Shuffle

1. Combiner: aggregation on the Map side

  • An optional step in the Shuffle process; it is off by default and must be enabled manually

    • Not all programs can use a Combiner

    • The aggregation logic must be splittable into partial aggregations, analogous to the distributive law

      (a + b) * c = a * c + b * c

  • Without a Combiner

  • With a Combiner

  • Main purpose

    • The number of concurrent MapTasks is usually far greater than the number of ReduceTasks

    • Each Map completes part of the aggregation, and Reduce only does the final aggregation, which reduces the load on Reduce

    • The official wordcount uses a Combiner

  • wordcount developed by ourselves

  • You can also enable Combiner when developing by yourself

    job.setCombinerClass(WordCountReduce.class);//Set up Combiner
    
    • The Combiner class is usually the Reducer class, and the aggregation logic is the same
    • If you cannot tell whether the logic can be split, run the job with and without the Combiner and test
      • whether the two results are consistent (a small plain-Java illustration follows this list)
    • The types must line up when the Reducer is reused as the Combiner
      • Map output type => Combiner input type
      • Combiner output type => Reduce input type
      • so when Combiner = Reducer, the Reducer's input and output types must be the same
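
    • A small plain-Java illustration of the "splittable" condition (not MapReduce code; the numbers are arbitrary): a sum can be partially aggregated per Map and then combined, while a naive average cannot

      public class CombinerCondition {
          public static void main(String[] args) {
              int[] map1 = {1, 2, 3};   //values seen by MapTask1
              int[] map2 = {4, 5};      //values seen by MapTask2

              //Global sum computed in one place (what a single Reduce would do)
              int globalSum = sum(map1) + sum(map2);
              //Partial sums per Map (the Combiner), then combined by Reduce: same result
              int combinedSum = sum(new int[]{sum(map1), sum(map2)});
              System.out.println(globalSum == combinedSum);   //true: sum is safe for a Combiner

              //Average is NOT safe: averaging the per-Map averages gives a different answer
              double globalAvg = (double) globalSum / (map1.length + map2.length);   //3.0
              double avgOfAvgs = (avg(map1) + avg(map2)) / 2;                        //3.25
              System.out.println(globalAvg == avgOfAvgs);     //false
          }

          static int sum(int[] a) { int s = 0; for (int v : a) s += v; return s; }
          static double avg(int[] a) { return (double) sum(a) / a.length; }
      }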

2. Compress: compress

  • Compression type in life: zip/rar/7z

  • Compression in big data: prefer splittable compression types so the splits can still be processed in parallel

    • 300 MB => compression => 200 MB
    • block1: 128 MB => split1 => MapTask1 [node01]
    • block2: 72 MB => split2 => MapTask2 [node02]
    • Commonly chosen types: snappy, lz4, lzo
  • Choosing a codec is a tradeoff

    • Some codecs compress and decompress slowly with complex algorithms but achieve a high compression ratio [compressed size / original size]
    • Others compress and decompress quickly with simpler algorithms but a lower compression ratio
  • Advantage: compression in MapReduce

    • Reduce disk and network IO, improving data transmission and storage efficiency
      • Shuffle Write: 1 TB => writing to disk at 1 s/GB => 1024 s
        • With compression: 1 TB => compress => 700 GB => 1 s/GB => 700 s + compression time => about 750 s
      • Shuffle Read: reading 1 TB from disk => 1024 s
        • With compression: read 700 GB from disk => 700 s + decompression time => about 750 s
  • All the storage and computing frameworks covered later in this course also support compression

  • Places in MapReduce where data can be compressed

    • Input: MapReduce can read a compressed file as input [rarely configured here]

      • The file format is determined by whatever produced the data
      • When reading a compressed file, MapReduce picks the codec from the file's metadata (its extension)
    • Shuffle phase: compress the output of Map [the main place to apply compression]

      #Enable compression
      mapreduce.map.output.compress=true
      #Specify the type of compression
      mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec
      
      org.apache.hadoop.io.compress.Lz4Codec
      org.apache.hadoop.io.compress.SnappyCodec
      
      • Three ways to set the configuration

        • mapred-site.xml: global, compresses for all programs

        • In the MapReduce program: conf.set(key, value)

        • As parameters of the submit command: temporary, only for that run

          yarn jar sougou.jar cn.itcast.hadoop.mapreduce.compress.HotKeyMR -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec /app/sougou /app/output/sougou1
          
    • Output: MapReduce can output compressed files [less commonly used]

      mapreduce.output.fileoutputformat.compress=true
      mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
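
      • Equivalently, the output compression can be set in code; a minimal sketch using the FileOutputFormat helpers (SnappyCodec here is only an example codec):

        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);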
      
  • bin/hadoop checknative

    hadoop:  true /export/servers/hadoop-2.6.0-cdh5.14.0/lib/native/libhadoop.so.1.0.0
    zlib:    true /lib64/libz.so.1
    snappy:  true /usr/lib64/libsnappy.so.1
    lz4:     true revision:10301
    bzip2:   true /lib64/libbz2.so.1
    
  • Demonstration

    • Demand: count the number of times each search term appears

      /**
       * @ClassName WordCount
       * @Description TODO Statistics of popular search terms based on Sogou data
       * @Date 2020/1/9 16:00
       * @Create By     Frank
       */
      public class HotKeyMR extends Configured implements Tool {
      
          /**
           * Build a MapReduce program, configuration program and submission program
           * @param args
           * @return
           * @throws Exception
           */
          @Override
          public int run(String[] args) throws Exception {
              /**
               * First: construct a MapReduce Job
               */
              //Construct a job object
              Job job = Job.getInstance(this.getConf(),"mrword");
              //Set the class the job runs
              job.setJarByClass(HotKeyMR.class);
              /**
               * Second: configure job
               */
              //Input: set the input class and input path
      //        job.setInputFormatClass(TextInputFormat.class); this is the default
              Path inputPath = new Path(args[0]);//Take the first parameter of the program as the input path
              TextInputFormat.setInputPaths(job,inputPath);
              //map
              job.setMapperClass(WordCountMapper.class);//Specify Mapper's class
              job.setMapOutputKeyClass(Text.class);//Specifies the type of key output by the map
              job.setMapOutputValueClass(IntWritable.class);//Specifies the value type of the map output
              //shuffle
              job.setCombinerClass(WordCountReduce.class);//Set up Combiner
              //reduce
              job.setReducerClass(WordCountReduce.class);//Specify the class of reduce
              job.setOutputKeyClass(Text.class);//Specify the key type of reduce output
              job.setOutputValueClass(IntWritable.class);//Specify the value type of the reduce output
      //        job.setNumReduceTasks(1); // this is the default
              //output
      //        job.setOutputFormatClass(TextOutputFormat.class); // this is the default output class
              Path outputPath = new Path(args[1]);//Use the second parameter of the program as the output path
              //Delete if output directory already exists
              FileSystem hdfs = FileSystem.get(this.getConf());
              if(hdfs.exists(outputPath)){
                  hdfs.delete(outputPath,true);
              }
              //Set the output address
              TextOutputFormat.setOutputPath(job,outputPath);
      
              /**
               * Third: submit job
               */
              //Submit the job to run, return the boolean value, return true for success and false for failure
              return job.waitForCompletion(true) ? 0 : -1;
          }
      
          /**
           * The entry of the whole program, which is responsible for calling the run method of the current class
           * @param args
           */
          public static void main(String[] args) {
              //Construct a conf object to manage all the configuration of the current program
              Configuration conf = new Configuration();
              //Configuration compression
              conf.set("mapreduce.map.output.compress","true");
              conf.set("mapreduce.map.output.compress.codec","org.apache.hadoop.io.compress.Lz4Codec");
              try {
                  //Call the run method of the current class
                  int status = ToolRunner.run(conf, new HotKeyMR(), args);
                  //Exit according to the result of program operation
                  System.exit(status);
              } catch (Exception e) {
                  e.printStackTrace();
              }
          }
      
      
          /**
           * Mapper Four generics, inputkey,inputValue,outputKey,outputValue
           * Generics of input: determined by the input class: TextInputFormat:Longwritable Text
           * Generics of output: determined by code logic: Text, IntWritable
           * Rewrite map method
           */
          public static class WordCountMapper extends Mapper<LongWritable, Text,Text, IntWritable>{
              //Construct key and value for output
              private Text outputKey = new Text();
              private IntWritable outputValue = new IntWritable(1);
      
              /**
               * map Method: every keyvalue passed by Input calls the map method once
               * @param key: Current key
               * @param value: Current value
               * @param context: Context, responsible for outputting the new keyvalue
               * @throws IOException
               * @throws InterruptedException
               */
              @Override
              protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                  //Convert the contents of each line to String
                  String line = value.toString();
                  //Split the contents of each line
                  String[] words = line.split("\t");
                //Take the third field (index 2): the user's search keyword is the key
                  this.outputKey.set(words[2]);
                  //output
                  context.write(this.outputKey,this.outputValue);
              }
          }
      
          /**
           * All Reduce needs to implement four generics
           * keyvalue of input: the keyvalue type of Map output
           * Output keyvalue: determined by code logic
           * Override reduce method
           */
          public static class WordCountReduce extends Reducer<Text, IntWritable,Text, IntWritable>{
      
              private IntWritable outputValue = new IntWritable();
      
              /**
               * reduce Method, every keyvalue will call the reduce method once
               * @param key: key passed in
               * @param values: Iterator, all value s of the current key
               * @param context
               * @throws IOException
               * @throws InterruptedException
               */
              @Override
              protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
                  //Take the value of the iterator for accumulation
                  int sum = 0;
                  for (IntWritable value : values) {
                      sum += value.get();
                  }
                  //value encapsulated as output
                  this.outputValue.set(sum);
                  //Output the result of each key
                  context.write(key,outputValue);
              }
          }
      
      
      }
      
      

5, Join scheme in MapReduce

1,Reduce Join

  • join occurs on the reduce side

  • join in SQL

    inner join: a row appears in the result only if it matches on both sides
    left join: every row from the left side appears in the result
    right join: every row from the right side appears in the result
    full join: a row from either side appears in the result
    
    
    select * from a join b  =  select * from a, b  =>  a Cartesian product when no join condition is given
    
    
    join: association of columns
    	a		id		name 		age		sex
    	b		id		phone		addr
    	Query Zhang San's mobile number
    	
    union: association of rows
    	a id name age sex: Freshman data
        b id name age sex: sophomore data
       
    
  • Demand: two pieces of data

    • Order data

      1001,20150710,p0001,2
      1002,20150710,p0002,3
      1002,20150710,p0003,3
      
      Order No., date, commodity id, commodity quantity
      
    • Commodity data

      p0001,helicopter,1000,2000
      p0002,tank,1000,3000
      p0003,rocket,10000,2000
      
      Commodity id, commodity name, price, inventory
      
    • Association: get the product name of each order

      1001,20150710,p0001,2,helicopter
      1002,20150710,p0002,3,tank
      1002,20150710,p0003,3,rocket
      
    • Analysis: MapReduce implementation

      • Step 1: results

        • Include product name in addition to order data
      • Step 2: see if there is grouping or sorting

        • Grouping: fields of a join
        • key: Commodity id
      • Step 3: value

        • For order data, other fields besides commodity id are value
        • For commodity data, except for commodity id, only commodity name is required
      • Step 4: Verification

        • Input

          1001,20150710,p0001,2
          1002,20150710,p0002,3
          1002,20150710,p0003,3
           p0001, helicopter, 10002000
           p0002, tank, 10003000
           p0003, rocket, 100002000
          
        • map

          • key: Commodity id
          • value:
            • If it is an order: order number, date, product quantity
            • In case of commodity: commodity name
        • shuffle

          • Grouping

            • All product and order data corresponding to the same commodity id end up in one iterator

              p0001	helicopter, 1001,20150710,2
              p0002	tank, 1002,20150710,3
              p0003	rocket, 1002,20150710,3
            
  • Realization

    /**
     * @ClassName ReduceJoin
     * @Description TODO Reduce The process of implementing join
     * @Date 2020/1/9 17:25
     * @Create By     Frank
     */
    public class ReduceJoin extends Configured implements Tool {
    
    
        /**
         * Specific definition of the whole MapReduce job: build, configure and submit
         * @param args
         * @return
         * @throws Exception
         */
        @Override
        public int run(String[] args) throws Exception {
            /**
             * Build a job
             */
            //Create an instance of a job
            Job job = Job.getInstance(this.getConf(),"mrjob");
            //Set the class the job runs
            job.setJarByClass(ReduceJoin.class);
    
            /**
             * Configure job
             */
            //Input: define the input mode and input path
            Path orderPath = new Path("datas/mrjoin/orders.txt");
            Path productPath = new Path("datas/mrjoin/product.txt");
            TextInputFormat.setInputPaths(job,orderPath,productPath);
            //Map: define the class and output type of map stage
            job.setMapperClass(MRJoinMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            //Shuffle: defines the class implemented in the shuffle phase
            //Reduce: define the class and output type of the reduce phase
            job.setReducerClass(MRJoinReduce.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(1);//Set the number of Reduce, that is, the number of partitions
            //Output: defines the output class and output path
            Path outputPath = new Path("datas/output/join/reducejoin");
            //Delete if output exists
            FileSystem hdfs = FileSystem.get(this.getConf());
            if(hdfs.exists(outputPath)){
                hdfs.delete(outputPath,true);
            }
            TextOutputFormat.setOutputPath(job,outputPath);
    
            /**
             * Submit job: and return according to the result of job running
             */
            return job.waitForCompletion(true) ? 0:-1;
        }
    
    
        /**
         * Program entry
         * @param args
         * @throws Exception
         */
        public static void main(String[] args) throws Exception {
            //Build a Conf object to manage all the configuration of the current program
            Configuration conf = new Configuration();
            //Call the run method of the current class
            int status = ToolRunner.run(conf, new ReduceJoin(), args);
            //Exit the whole program according to the running state of the job
            System.exit(status);
        }
    
        /**
         * Define Mapper implementation class and processing logic in Map process
         */
        public static class MRJoinMapper extends Mapper<LongWritable,Text,Text, Text>{
    
            private Text outputKey = new Text();
            private Text outputValue = new Text();
    
            @Override
            protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                //First determine which file the current data is from
                FileSplit inputSplit = (FileSplit) context.getInputSplit();//First, obtain which file segments this data belongs to
                String fileName = inputSplit.getPath().getName();//Get file name
                //If it is order data, key is the third field and value is the other remaining fields
                if("orders.txt".equals(fileName)){
                    String[] items = value.toString().split(",");
                    this.outputKey.set(items[2]);//Commodity id
                    this.outputValue.set(items[0]+"\t"+items[1]+"\t"+items[3]);//Other fields
                    context.write(this.outputKey,this.outputValue);//Output order data
                }else{
                    //If it is commodity data, key is the first field and value is the second field
                    String[] split = value.toString().split(",");
                    this.outputKey.set(split[0]);//Commodity id
                    this.outputValue.set(split[1]);//Trade name
                    context.write(this.outputKey,this.outputValue);//Export product data
                }
    
    
            }
        }
    
        /**
         * Define the implementation class of Reducer and the processing logic in the process of reduction
         */
        public static class MRJoinReduce extends Reducer<Text,Text,Text,Text>{
    
            private Text outputValue = new Text();
    
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
                StringBuilder stringBuilder = new StringBuilder();
                for (Text value : values) {
                    stringBuilder.append(value.toString()+"\t");//Splice the names of all products and all corresponding orders
                }
                this.outputValue.set(stringBuilder.toString());//Take product name and order as value
                context.write(key,this.outputValue);
            }
        }
    
    }
    
    
  • Application scenario

    • Big data join big data

    • Relatively inefficient, because all of the data has to go through Shuffle

    • Commodity data: 60000

    • Order data: 90000

2,Map Join

  • Commodity data: 10000

  • Order data: 20 million

  • Small data join big data

  • Idea: put the small dataset into the distributed cache. When processing each record of the big dataset, the Map takes the small data directly from the distributed cache and completes the join on the Map side, with no Shuffle needed

  • Realization

    public class MapJoin extends Configured implements Tool {
        /**
         * Specific definition of the whole MapReduce job: build, configure and submit
         * @param args
         * @return
         * @throws Exception
         */
        @Override
        public int run(String[] args) throws Exception {
            /**
             * Build a job
             */
            //Create an instance of a job
            Job job = Job.getInstance(this.getConf(),"mrjob");
            //Set the class the job runs
            job.setJarByClass(MapJoin.class);
    
            /**
             * Configure job
             */
            //Input: define the input mode and input path
            Path orderPath = new Path("datas/mrjoin/orders.txt");
            TextInputFormat.setInputPaths(job,orderPath);
            //Put product data into distributed cache
            Path productPath = new Path("datas/mrjoin/product.txt");
            job.addCacheFile(productPath.toUri());
            //Map: define the class and output type of map stage
            job.setMapperClass(MRJoinMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            //Shuffle: defines the class implemented in the shuffle phase
            //Reduce: define the class and output type of the reduce phase
    //        job.setReducerClass(MRJoinReduce.class);
    //        job.setOutputKeyClass(Text.class);
    //        job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(0);//Set the number of Reduce, that is, the number of partitions
            //Output: defines the output class and output path
            Path outputPath = new Path("datas/output/join/mapjoin");
            //Delete if output exists
            FileSystem hdfs = FileSystem.get(this.getConf());
            if(hdfs.exists(outputPath)){
                hdfs.delete(outputPath,true);
            }
            TextOutputFormat.setOutputPath(job,outputPath);
    
            /**
             * Submit job: and return according to the result of job running
             */
            return job.waitForCompletion(true) ? 0:-1;
        }
    
    
        /**
         * Program entry
         * @param args
         * @throws Exception
         */
        public static void main(String[] args) throws Exception {
            //Build a Conf object to manage all the configuration of the current program
            Configuration conf = new Configuration();
            //Call the run method of the current class
            int status = ToolRunner.run(conf, new MapJoin(), args);
            //Exit the whole program according to the running state of the job
            System.exit(status);
        }
    
        /**
         * Define Mapper implementation class and processing logic in Map process
         */
        public static class MRJoinMapper extends Mapper<LongWritable,Text,Text, Text>{
    
            private Text outputKey = new Text();
            private Text outputValue = new Text();
            Map<String,String> maps = new HashMap<>();
    
            /**
             * Map And Reduce: three methods
             *      1-setup: It will be executed before the map or reduce method
             *      2-map/reduce: map Logic or reduce logic
             *      3-close: Final execution method
             * @param context
             * @throws IOException
             * @throws InterruptedException
             */
    
            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
                //Read data from distributed cache
                URI[] cacheFiles = context.getCacheFiles();//Get all cached data
                //Read file contents
                BufferedReader bufferedReader = new BufferedReader(new FileReader(cacheFiles[0].getPath()));
                String line = null;
                while(StringUtils.isNotBlank(line = bufferedReader.readLine())){
                    //Read to each line
                    String pid = line.split(",")[0];//Commodity id
                    String productName = line.split(",")[1];//Trade name
                    //Put commodity id and name into map set
                    maps.put(pid,productName);
                }
                //Close the reader of the cached file
                bufferedReader.close();
            }
    
            @Override
            protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                //Get order data
                String[] items = value.toString().split(",");
                String pid = items[2]; //Item id in order
                String productName = maps.get(pid);
                this.outputKey.set(productName);
                this.outputValue.set(value.toString());
                context.write(this.outputKey,this.outputValue);
            }
        }
    
        /**
         * Define the implementation class of Reducer and the processing logic in the process of reduction
         */
        public static class MRJoinReduce extends Reducer<Text,Text,Text,Text>{
    
            private Text outputValue = new Text();
    
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
                StringBuilder stringBuilder = new StringBuilder();
                for (Text value : values) {
                    stringBuilder.append(value.toString()+"\t");//Splice the names of all products and all corresponding orders
                }
                this.outputValue.set(stringBuilder.toString());//Take product name and order as value
                context.write(key,this.outputValue);
            }
        }
    
    }
    
    
  • Application scenario

    • Suitable for small data join large data scenario

6, Read write database

1. Input and Output

  • Input: every input class inherits from InputFormat
    • Default: TextInputFormat extends FileInputFormat extends InputFormat
    • File
    • Database
  • Output: every output class inherits from OutputFormat
    • Default: TextOutputFormat extends FileOutputFormat extends OutputFormat
    • File
    • Database
  • Database access goes through JDBC

2, read MySQL

  • Change the input class

    job.setInputFormatClass(DBInputFormat.class);
    
  • Customize a data type to receive data from MySQL

    • In addition to the implementation of Writable interface, DBWritable should also be implemented

      public  class DBReader implements Writable,DBWritable
      
    • Realize the serialization and deserialization of database objects

      	public void write(PreparedStatement statement) throws SQLException {
      		//Serialization to the database: set each field on the SQL statement, in column order
      		statement.setString(1, word);
      		statement.setInt(2, number);
      	}
      
      	public void readFields(ResultSet resultSet) throws SQLException {
      		//Deserialization from the database: read each field from the result set, in column order
      		this.word = resultSet.getString(1);
      		this.number = resultSet.getInt(2);
      	}
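
    • For completeness, a minimal sketch of the whole DBReader bean (the word and number fields follow the snippets above and the setters used in the write-MySQL mapper; the exact layout is an assumption):

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import java.sql.PreparedStatement;
      import java.sql.ResultSet;
      import java.sql.SQLException;
      import org.apache.hadoop.io.Writable;
      import org.apache.hadoop.mapreduce.lib.db.DBWritable;

      public class DBReader implements Writable, DBWritable {
          private String word;
          private int number;

          //Hadoop serialization (used when the object travels through Shuffle)
          public void write(DataOutput out) throws IOException {
              out.writeUTF(word);
              out.writeInt(number);
          }
          public void readFields(DataInput in) throws IOException {
              this.word = in.readUTF();
              this.number = in.readInt();
          }

          //Database serialization (used by DBOutputFormat when writing a row)
          public void write(PreparedStatement statement) throws SQLException {
              statement.setString(1, word);
              statement.setInt(2, number);
          }
          //Database deserialization (used by DBInputFormat when reading a row)
          public void readFields(ResultSet resultSet) throws SQLException {
              this.word = resultSet.getString(1);
              this.number = resultSet.getInt(2);
          }

          public void setWord(String word) { this.word = word; }
          public void setNumber(int number) { this.number = number; }

          //What TextOutputFormat writes if this object is output as text
          public String toString() { return word + "\t" + number; }
      }
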
      
  • After creating the conf object, you should immediately configure the jdbc connection parameters

    Configuration conf = new Configuration();
    DBConfiguration.configureDB(
        conf, 
        "com.mysql.jdbc.Driver", 
        "jdbc:mysql://localhost:3306/test",
        "root", 
        "123456"
    );
    
  • Configure read data: tables, SQL statements, fields

    DBInputFormat.setInput(
        job, 
        DBReader.class, //Class that holds each row that is read
        "wcresult",     //Table name
        null,           //Filter condition (where clause)
        "number",       //Field to order by
        fields          //Which fields to read
    );
    
    //Alternative overload: read with a custom SQL statement instead of a table name
    public static void setInput(
          Job job,
          Class<? extends DBWritable> inputClass,  //Class that holds each row that is read
          String inputQuery,       //SQL query; the fields it returns must match the inputClass properties
          String inputCountQuery   //SQL that counts the total number of rows
          ) {
      }
    

3, write MySQL

  • Configure the output class

    job.setOutputFormatClass(DBOutputFormat.class);
    
  • Configure output parameters: table, field

    DBOutputFormat.setOutput(job, "wcresult", fields);
    
  • The key object that the program outputs to MySQL must implement the DBWritable interface [its write(PreparedStatement) method does the serialization]

    public static class WriteMap extends Mapper<LongWritable, Text, DBReader, NullWritable>{
    	
    		private DBReader outputKey = new DBReader();
    		private NullWritable outputValue = NullWritable.get();
    		
    		@Override
    		protected void map(LongWritable key, Text value,
    				Context context)
    						throws IOException, InterruptedException {
    			//Parse each line of the wordcount result: word and count separated by a tab
    			String line = value.toString();
    			this.outputKey.setWord(line.split("\t")[0]);
    			this.outputKey.setNumber(Integer.valueOf(line.split("\t")[1]));
    			context.write(outputKey, outputValue);
    		}
    	}
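
  • A minimal sketch of wiring the write job together (the table name wcresult and the columns word/number come from the snippets above; the JDBC driver, URL, credentials and the input path repeat the read example and are illustrative assumptions):

    //Configure the JDBC connection before building the job
    Configuration conf = new Configuration();
    DBConfiguration.configureDB(
        conf,
        "com.mysql.jdbc.Driver",
        "jdbc:mysql://localhost:3306/test",
        "root",
        "123456"
    );
    Job job = Job.getInstance(conf, "writeMySQL");

    //Input: the wordcount result file (assumed path), one "word \t count" per line
    TextInputFormat.setInputPaths(job, new Path("datas/output/wordcount"));

    //Map-only job: the DBReader key is written straight to the database
    job.setMapperClass(WriteMap.class);
    job.setOutputKeyClass(DBReader.class);
    job.setOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(0);

    //Output: write to the wcresult table, filling the word and number columns
    job.setOutputFormatClass(DBOutputFormat.class);
    DBOutputFormat.setOutput(job, "wcresult", "word", "number");

    job.waitForCompletion(true);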
    

Posted by nick1 on Tue, 14 Jan 2020 02:21:13 -0800