Hadoop entry note 23: MapReduce performance optimization - data compression optimization

Keywords: Hadoop

1, Compression optimization design

When a MapReduce program runs, disk I/O, network data transfer, and the shuffle and merge phases take a lot of time, especially when the data is large and the workload is intensive. Since disk I/O and network bandwidth are valuable resources in a Hadoop cluster, data compression is very helpful for saving those resources and minimizing disk I/O and network transfer. If disk I/O or network bandwidth limits MapReduce job performance, enabling compression at any MapReduce stage can improve end-to-end processing time and reduce I/O and network traffic.

Compression is an optimization strategy for MapReduce: the output of the Mapper or the Reducer is compressed with a compression codec in order to reduce disk I/O and speed up the MR program. Its advantages and disadvantages are as follows (a sketch of the per-stage configuration properties follows the lists below):
Advantages of compression:

  • Reduce file storage space
  • Speed up file transfer, and therefore the overall processing speed of the system
  • Reduce the number of I/O reads and writes

Disadvantages of compression:

  • Data must be decompressed before it can be used, which adds CPU load; the more complex the compression algorithm, the longer decompression takes
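
To make the stages concrete, the following minimal sketch (not part of the original note) lists the configuration properties that control compression at the two stages that can be configured per job; complete, runnable examples using these properties appear in the later sections, and the codec class names here are only illustrative choices.

import org.apache.hadoop.conf.Configuration;

public class CompressionStages {
    //A sketch of the per-stage compression switches; the codecs shown are illustrative
    public static void configure(Configuration conf) {
        //Shuffle stage: compress map output before it is spilled and transferred to reducers
        conf.set("mapreduce.map.output.compress", "true");
        conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
        //Output stage: compress the final files written by the job
        conf.set("mapreduce.output.fileoutputformat.compress", "true");
        conf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");
        //Input stage needs no property: TextInputFormat picks a codec from the file extension (e.g. .gz)
    }
}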

2, Compression support

1. Check the compression algorithms supported by Hadoop: hadoop checknative

2. Compression algorithm supported by Hadoop

3. Comparison of compression performance of each compression algorithm

Gzip

  • Advantages: the highest compression ratio of the four; supported by Hadoop out of the box, so an application can process gzip files just as if they were plain text; supported by the Hadoop native library; most Linux systems ship the gzip command, so it is easy to use
  • Shortcomings: split is not supported

Lzo

  • Advantages: relatively fast compression and decompression with a reasonable compression ratio; supports split and is the most popular compression format in Hadoop; supported by the Hadoop native library; easy to use once the lzop command is installed on Linux
  • Shortcomings: the compression ratio is lower than gzip; not supported by Hadoop out of the box and must be installed; lzo supports split only when the lzo file has been indexed, otherwise Hadoop treats it as an ordinary file (and the input format must be set to the lzo format to support split)

Bzip2

  • Advantages: supports split; the compression ratio is high, higher than that of gzip; supported by Hadoop out of the box; the bzip2 command ships with Linux, so it is easy to use
  • Shortcomings: slow compression and decompression; the native library is not supported

Snappy

  • Advantages: very fast compression; supported by the Hadoop native library
  • Shortcomings: split is not supported; the compression ratio is low; not supported by Hadoop out of the box and must be installed; there is no corresponding command on Linux

4. Compression ratio for data of the same size

5. Compression time and decompression time


The comparison above shows that the higher the compression ratio, the longer the compression takes. In practice, choose a compression algorithm with a balanced compression ratio and compression time.

3, Gzip compression

1. Generate Gzip compressed file

1. Requirements: read an ordinary text file and compress it into Gzip format

2. Ideas

  1. Input reads an ordinary text file
  2. Map and Reduce output the data directly
  3. Configure the Output
  4. The output is compressed into Gzip format

3. Code implementation

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

/**
 * @ClassName MRWriteGzip
 * @Description TODO Read ordinary file data and compress the data in Gzip format
 */
public class MRWriteGzip extends Configured implements Tool {

    //Build, configure, and submit a MapReduce Job
    public int run(String[] args) throws Exception {

        //Build Job
        Job job = Job.getInstance(this.getConf(),this.getClass().getSimpleName());
        job.setJarByClass(MRWriteGzip.class);

        //Input: configuration input
        Path inputPath = new Path(args[0]);
        TextInputFormat.setInputPaths(job,inputPath);

        //Map: configure map
        job.setMapperClass(MrMapper.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        //Reduce: configure reduce
        job.setReducerClass(MrReduce.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        //Output: configure output
        Path outputPath = new Path(args[1]);
        TextOutputFormat.setOutputPath(job,outputPath);

        return job.waitForCompletion(true) ? 0 : -1;
    }

    //Program entry, call run
    public static void main(String[] args) throws Exception {
        //Used to manage all configurations of the current program
        Configuration conf = new Configuration();
        //Configure the output to be compressed in Gzip format
        conf.set("mapreduce.output.fileoutputformat.compress","true");
        conf.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.GzipCodec");
        //Call the run method to submit and run the Job
        int status = ToolRunner.run(conf, new MRWriteGzip(), args);
        System.exit(status);
    }


    /**
     * Define Mapper class
     */
    public static class MrMapper extends Mapper<LongWritable, Text, NullWritable, Text>{

        private NullWritable outputKey = NullWritable.get();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            //Direct output of each data
            context.write(this.outputKey,value);
        }
    }

    /**
     * Define Reduce class
     */
    public static class MrReduce extends Reducer<NullWritable, Text, NullWritable, Text> {

        @Override
        protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            //Direct output of each data
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }

}
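
An equivalent way to enable Gzip output compression, instead of setting the raw property names in main(), is to call the FileOutputFormat helper methods inside run(). The sketch below is a hypothetical helper (not part of the original code) showing those calls.

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzipOutputHelper {
    //Equivalent to setting mapreduce.output.fileoutputformat.compress(.codec) by hand
    public static void enableGzipOutput(Job job) {
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}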

2. Read Gzip compressed file

1. Requirements: read Gzip compressed file and restore it to normal text file

2. Ideas

  1. Input directly reads the compression result file of the previous step
  2. Direct output of Map and Reduce
  3. Output saves the results as a normal text file

3. Code development

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    import java.io.IOException;
    
    /**
     * @ClassName MRReadGzip
     * @Description TODO Read the data in Gzip format and restore it to a normal text file
     */
    public class MRReadGzip extends Configured implements Tool {
    
     //Build, configure, and submit a MapReduce Job
     public int run(String[] args) throws Exception {
    
         //Build Job
         Job job = Job.getInstance(this.getConf(),this.getClass().getSimpleName());
         job.setJarByClass(MRReadGzip.class);
    
         //Input: configuration input
         Path inputPath = new Path(args[0]);
         TextInputFormat.setInputPaths(job,inputPath);
    
         //Map: configure map
         job.setMapperClass(MrMapper.class);
         job.setMapOutputKeyClass(NullWritable.class);
         job.setMapOutputValueClass(Text.class);
    
         //Reduce: configure reduce
         job.setReducerClass(MrReduce.class);
         job.setOutputKeyClass(NullWritable.class);
         job.setOutputValueClass(Text.class);
    
    
         //Output: configure output
         Path outputPath = new Path(args[1]);
         TextOutputFormat.setOutputPath(job,outputPath);
    
         return job.waitForCompletion(true) ? 0 : -1;
     }
    
     //Program entry, call run
     public static void main(String[] args) throws Exception {
         //Used to manage all configurations of the current program
         Configuration conf = new Configuration();
         //Output compression is not needed when reading, so the Gzip settings stay commented out
         //conf.set("mapreduce.output.fileoutputformat.compress","true");
         //conf.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.GzipCodec");
     //Call the run method to submit and run the Job
         int status = ToolRunner.run(conf, new MRReadGzip(), args);
         System.exit(status);
     }
    
    
     /**
      * Define Mapper class
      */
     public static class MrMapper extends Mapper<LongWritable, Text, NullWritable, Text>{
    
         private NullWritable outputKey = NullWritable.get();
    
         @Override
         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             context.write(this.outputKey,value);
         }
     }
    
     /**
      * Define Reduce class
      */
     public static class MrReduce extends Reducer<NullWritable, Text,NullWritable, Text> {
    
         @Override
         protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             for (Text value : values) {
                 context.write(key, value);
             }
         }
     }
    
    }
    

4, Snappy compression

1. Configure Hadoop to support Snappy

Hadoop supports the Snappy compression algorithm, and Snappy is also the most commonly used codec. However, the official pre-built Hadoop packages do not include Snappy support, so to use Snappy compression you must download the Hadoop source code and compile it yourself, enabling Snappy support during compilation. For the specific build procedure, refer to the Hadoop 3 compilation and installation manual.
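
After recompiling, it is worth verifying that the native library with Snappy support is actually being loaded. Besides running hadoop checknative, a small check like the sketch below can be used (an illustrative class, assuming a Hadoop 3.1.x client with the native library on java.library.path).

import org.apache.hadoop.util.NativeCodeLoader;

public class SnappyNativeCheck {
    public static void main(String[] args) {
        //True only if the native libhadoop library was found and loaded
        boolean loaded = NativeCodeLoader.isNativeCodeLoaded();
        System.out.println("native hadoop library loaded: " + loaded);
        if (loaded) {
            //True only if libhadoop was compiled with Snappy support
            System.out.println("native snappy supported: " + NativeCodeLoader.buildSupportsSnappy());
        }
    }
}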

2. Generate Snappy compressed file: Map output is not compressed

1. Requirements: read ordinary text files and convert them into Snappy compressed files

2. Ideas

  1. Input reads a normal text file
  2. Direct output of Map and Reduce
  3. Output configures that the output is compressed to Snappy type

3. Code development

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    import java.io.IOException;
    
    /**
     * @ClassName MRWriteSnappy
     * @Description TODO Read ordinary file data and compress the data in Snappy format
     */
    public class MRWriteSnappy extends Configured implements Tool {
    
     //Build, configure, and submit a MapReduce Job
     public int run(String[] args) throws Exception {
    
         //Build Job
         Job job = Job.getInstance(this.getConf(),this.getClass().getSimpleName());
         job.setJarByClass(MRWriteSnappy.class);
    
         //Input: configuration input
         Path inputPath = new Path(args[0]);
         TextInputFormat.setInputPaths(job,inputPath);

         //Map: configure map
         job.setMapperClass(MrMapper.class);
         job.setMapOutputKeyClass(NullWritable.class);
         job.setMapOutputValueClass(Text.class);
    
         //Reduce: configure reduce
         job.setReducerClass(MrReduce.class);
         job.setOutputKeyClass(NullWritable.class);
         job.setOutputValueClass(Text.class);
    
    
         //Output: configure output
         Path outputPath = new Path(args[1]);
         TextOutputFormat.setOutputPath(job,outputPath);
    
         return job.waitForCompletion(true) ? 0 : -1;
     }
    
     //Program entry, call run
     public static void main(String[] args) throws Exception {
         //Used to manage all configurations of the current program
         Configuration conf = new Configuration();
         //Configure the output results to be compressed into Snappy format
         conf.set("mapreduce.output.fileoutputformat.compress","true");
         conf.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");
         //Call the run method to submit and run the Job
         int status = ToolRunner.run(conf, new MRWriteSnappy(), args);
         System.exit(status);
     }
    
    
     /**
      * Define Mapper class
      */
     public static class MrMapper extends Mapper<LongWritable, Text, NullWritable, Text>{
    
         private NullWritable outputKey = NullWritable.get();

         @Override
         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             context.write(this.outputKey,value);
         }
     }
    
     /**
      * Define Reduce class
      */
     public static class MrReduce extends Reducer<NullWritable, Text,NullWritable, Text> {
    
         @Override
         protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             for (Text value : values) {
                 context.write(key, value);
             }
         }
     }
    }

3. Generate Snappy compressed file: Map output compression

1. Requirements: read ordinary text files, convert them into Snappy compressed files, and use Snappy compression for the results output from Map

2. Idea: add the configuration of Map output compression to the code in the previous step

3. Code development

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

/**
 * @ClassName MRMapOutputSnappy
 * @Description TODO Read ordinary file data and compress the data output from Map in Snappy format
 */
public class MRMapOutputSnappy extends Configured implements Tool {

    //Build, configure, and submit a MapReduce Job
    public int run(String[] args) throws Exception {

        //Build Job
        Job job = Job.getInstance(this.getConf(),this.getClass().getSimpleName());
        job.setJarByClass(MRMapOutputSnappy.class);

        //Input: configuration input
        Path inputPath = new Path(args[0]);
        TextInputFormat.setInputPaths(job,inputPath);

        //Map: configure map
        job.setMapperClass(MrMapper.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        //Reduce: configure reduce
        job.setReducerClass(MrReduce.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);


        //Output: configure output
        Path outputPath = new Path(args[1]);
        TextOutputFormat.setOutputPath(job,outputPath);

        return job.waitForCompletion(true) ? 0 : -1;
    }

    //Program entry, call run
    public static void main(String[] args) throws Exception {
        //Used to manage all configurations of the current program
        Configuration conf = new Configuration();
        //Configure Map output results to be compressed into Snappy format
        conf.set("mapreduce.map.output.compress","true");
        conf.set("mapreduce.map.output.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");
        //Configure the compression of Reduce output results into Snappy format
        conf.set("mapreduce.output.fileoutputformat.compress","true");
        conf.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");
        //Call the run method to submit and run the Job
        int status = ToolRunner.run(conf, new MRMapOutputSnappy(), args);
        System.exit(status);
    }


    /**
     * Define Mapper class
     */
    public static class MrMapper extends Mapper<LongWritable, Text, NullWritable, Text>{

        private NullWritable outputKey = NullWritable.get();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            //Direct output of each data
            context.write(this.outputKey,value);
        }
    }

    /**
     * Define Reduce class
     */
    public static class MrReduce extends Reducer<NullWritable, Text, NullWritable, Text> {

        @Override
        protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            //Direct output of each data
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }
}
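
The same map-output compression settings can also be applied inside run() on the Job's own Configuration, using typed setters instead of plain strings. The sketch below is a hypothetical helper, not part of the original code.

import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class MapOutputCompression {
    //Equivalent to setting mapreduce.map.output.compress(.codec) in main()
    public static void enableSnappyMapOutput(Job job) {
        job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
        job.getConfiguration().setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
    }
}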

4. Read Snappy compressed file

1. Requirements: read the Snappy file generated in the previous step and restore it to a normal text file

2. Ideas:

  1. Input read Snappy file
  2. Direct output of Map and Reduce
  3. Output outputs directly to the normal text type

3. Code development

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    import java.io.IOException;
    /**
     * @ClassName MRReadSnappy
     * @Description TODO Read the data in Snappy format and restore it to a normal text file
     */
    public class MRReadSnappy extends Configured implements Tool {
    
     //Build, configure, and submit a MapReduce Job
     public int run(String[] args) throws Exception {
    
         //Build Job
         Job job = Job.getInstance(this.getConf(),this.getClass().getSimpleName());
         job.setJarByClass(MRReadSnappy.class);
    
         //Input: configuration input
         Path inputPath = new Path(args[0]);
         TextInputFormat.setInputPaths(job,inputPath);
    
         //Map: configure map
         job.setMapperClass(MrMapper.class);
         job.setMapOutputKeyClass(NullWritable.class);
         job.setMapOutputValueClass(Text.class);
    
         //Reduce: configure reduce
         job.setReducerClass(MrReduce.class);
         job.setOutputKeyClass(NullWritable.class);
         job.setOutputValueClass(Text.class);
    
    
         //Output: configure output
         Path outputPath = new Path(args[1]);
         TextOutputFormat.setOutputPath(job,outputPath);
    
         return job.waitForCompletion(true) ? 0 : -1;
     }
    
     //Program entry, call run
     public static void main(String[] args) throws Exception {
         //Used to manage all configurations of the current program
         Configuration conf = new Configuration();
         //Call the run method to submit and run the Job
         int status = ToolRunner.run(conf, new MRReadSnappy(), args);
         System.exit(status);
     }
    
    
     /**
      * Define Mapper class
      */
     public static class MrMapper extends Mapper<LongWritable, Text, NullWritable, Text>{
    
         private NullWritable outputKey = NullWritable.get();
    
         @Override
         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             context.write(this.outputKey,value);
         }
     }
    
     /**
      * Define Reduce class
      */
     public static class MrReduce extends Reducer<NullWritable, Text,NullWritable, Text> {
    
         @Override
         protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             for (Text value : values) {
                 context.write(key, value);
             }
         }
     }
    }

5, Lzo compression

1. Configure Hadoop to support Lzo

Hadoop itself does not support LZO compression; LZO must be installed separately and support for it added when Hadoop is compiled. For the build procedure, refer to the compilation manual "Apache Hadoop 3-1-3 compilation, installation and deployment LZO compression guide".
After compilation, apply the following configuration so that the current Hadoop installation supports Lzo compression:

    • Add lzo support jar package
    cp hadoop-lzo-0.4.21-SNAPSHOT.jar /export/server/hadoop-3.1.4/share/hadoop/common/

    • Synchronize to all nodes
    cd  /export/server/hadoop-3.1.4/share/hadoop/common/
    scp hadoop-lzo-0.4.21-SNAPSHOT.jar node2:$PWD
    scp hadoop-lzo-0.4.21-SNAPSHOT.jar node3:$PWD
    
    • Modify core-site.xml
    <property>
     <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
    </property>
    <property>
     <name>io.compression.codec.lzo.class</name>
     <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
    
    • Synchronize core-site.xml to all other nodes
    cd  /export/server/hadoop-3.1.4/etc/hadoop
    scp  core-site.xml node2:$PWD
    scp  core-site.xml node3:$PWD
    • Restart Hadoop cluster

2. Generate Lzo compressed file

1. Requirements: read an ordinary text file and generate an Lzo-compressed result file

2. Ideas

  1. Input reads a normal text file
  2. Map and Reduce output the data directly
  3. Configure the Output to compress to the Lzo format

3. Code development

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    import java.io.IOException;
    
    /**
     * @ClassName MRWriteLzo
     * @Description TODO Read ordinary file data and compress the data in Lzo format
     */
    public class MRWriteLzo extends Configured implements Tool {
    
     //Build, configure, and submit a MapReduce Job
     public int run(String[] args) throws Exception {
    
         //Build Job
         Job job = Job.getInstance(this.getConf(),this.getClass().getSimpleName());
         job.setJarByClass(MRWriteLzo.class);
    
         //Input: configuration input
         Path inputPath = new Path(args[0]);
         TextInputFormat.setInputPaths(job,inputPath);

         //Map: configure map
         job.setMapperClass(MrMapper.class);
         job.setMapOutputKeyClass(NullWritable.class);
         job.setMapOutputValueClass(Text.class);
    
         //Reduce: configure reduce
         job.setReducerClass(MrReduce.class);
         job.setOutputKeyClass(NullWritable.class);
         job.setOutputValueClass(Text.class);
    
    
         //Output: configure output
         Path outputPath = new Path(args[1]);
         TextOutputFormat.setOutputPath(job,outputPath);
    
         return job.waitForCompletion(true) ? 0 : -1;
     }
    
     //Program entry, call run
     public static void main(String[] args) throws Exception {
         //Used to manage all configurations of the current program
         Configuration conf = new Configuration();
         //Configure the output results to be compressed into Lzo format
         conf.set("mapreduce.output.fileoutputformat.compress","true");
         conf.set("mapreduce.output.fileoutputformat.compress.codec","com.hadoop.compression.lzo.LzopCodec");
         //Call the run method to submit and run the Job
         int status = ToolRunner.run(conf, new MRWriteLzo(), args);
         System.exit(status);
     }
    
    
     /**
      * Define Mapper class
      */
     public static class MrMapper extends Mapper<LongWritable, Text, NullWritable, Text>{
    
         private NullWritable outputKey = NullWritable.get();
    
         @Override
         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             context.write(this.outputKey,value);
         }
     }
    
     /**
      * Define Reduce class
      */
     public static class MrReduce extends Reducer<NullWritable, Text,NullWritable, Text> {
    
         @Override
         protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             for (Text value : values) {
                 context.write(key, value);
             }
         }
     }
    }
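
Because Lzo files only become splittable after they are indexed (see the comparison table above), the generated .lzo result usually needs an index before another job can read it with splits. Below is a sketch of the indexing step using the DistributedLzoIndexer tool shipped with hadoop-lzo; the jar path is the one used in the configuration step above, and the output file name is an assumption.

hadoop jar /export/server/hadoop-3.1.4/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer /output/lzo/part-r-00000.lzo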
    

3. Read Lzo compressed file

1. Requirements: read the Lzo-compressed file and restore it to a normal text file

2. Code development

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    import java.io.IOException;
    
    /**
     * @ClassName MRReadLzo
     * @Description TODO Read the data in Lzo format and restore it to a normal text file
     */
    public class MRReadLzo extends Configured implements Tool {
    
     //Build, configure, and submit a MapReduce Job
     public int run(String[] args) throws Exception {
    
         //Build Job
         Job job = Job.getInstance(this.getConf(),this.getClass().getSimpleName());
         job.setJarByClass(MRReadLzo.class);
    
         //Input: configuration input
         Path inputPath = new Path(args[0]);
         TextInputFormat.setInputPaths(job,inputPath);
    
         //Map: configure map
         job.setMapperClass(MrMapper.class);
         job.setMapOutputKeyClass(NullWritable.class);
         job.setMapOutputValueClass(Text.class);
    
         //Reduce: configure reduce
         job.setReducerClass(MrReduce.class);
         job.setOutputKeyClass(NullWritable.class);
         job.setOutputValueClass(Text.class);
    
    
         //Output: configure output
         Path outputPath = new Path(args[1]);
         TextOutputFormat.setOutputPath(job,outputPath);
    
         return job.waitForCompletion(true) ? 0 : -1;
     }

     //Program entry, call run
     public static void main(String[] args) throws Exception {
         //Used to manage all configurations of the current program
         Configuration conf = new Configuration();
         //Output compression settings (left over from the Gzip example) are not needed for reading and stay commented out
         //conf.set("mapreduce.output.fileoutputformat.compress","true");
         //conf.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.GzipCodec");
         //Call the run method to submit and run the Job
         int status = ToolRunner.run(conf, new MRReadLzo(), args);
         System.exit(status);
     }
    
    
     /**
      * Define Mapper class
      */
     public static class MrMapper extends Mapper<LongWritable, Text, NullWritable, Text>{
    
         private NullWritable outputKey = NullWritable.get();
    
         @Override
         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             context.write(this.outputKey,value);
         }
     }
    
     /**
      * Define Reduce class
      */
     public static class MrReduce extends Reducer<NullWritable, Text,NullWritable, Text> {
    
         @Override
         protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             for (Text value : values) {
                 context.write(key, value);
             }
         }
     }
    }
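
MRReadLzo above reads the .lzo input through the default TextInputFormat, so each .lzo file is processed by a single map task. To read indexed .lzo files with splits, the hadoop-lzo input format can be set instead; the sketch below shows the one-line change that would go into run() (assuming the hadoop-lzo jar is on the job classpath and the input has been indexed).

import com.hadoop.mapreduce.LzoTextInputFormat;
import org.apache.hadoop.mapreduce.Job;

public class LzoSplittableInput {
    //Replaces TextInputFormat so that indexed .lzo input files can be split across map tasks
    public static void useLzoInputFormat(Job job) {
        job.setInputFormatClass(LzoTextInputFormat.class);
    }
}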
    
