Hadoop entry note 23: MapReduce performance optimization - data compression optimization

Keywords: Hadoop

1, Compression optimization design

When a MapReduce program runs, disk I/O, network data transfer, and the shuffle and merge phases take a lot of time, especially when the data is large and the workload is intensive. Since disk I/O and network bandwidth are valuable resources in a Hadoop cluster, data compression is very helpful for saving those resources and minimizing disk I/O and network transfer. If disk I/O or network bandwidth limits MapReduce job performance, enabling compression at any MapReduce stage can improve end-to-end processing time and reduce I/O and network traffic.

Compression is an optimization strategy for MapReduce: the output of the Mapper or the Reducer is compressed with a compression codec in order to reduce disk I/O and speed up the MR program. Its advantages and disadvantages are as follows (a sketch of the per-stage configuration properties follows the lists below):
Advantages of compression:

  • Reduce file storage space
  • Speed up file transfer, and therefore the overall processing speed of the system
  • Reduce the number of I/O reads and writes

Disadvantages of compression:

  • Data must be decompressed before it can be used, which adds CPU load; the more complex the compression algorithm, the longer decompression takes
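
To make the stages concrete, the following minimal sketch (not part of the original note) lists the configuration properties that control compression at the two stages that can be configured per job; complete, runnable examples using these properties appear in the later sections, and the codec class names here are only illustrative choices.

import org.apache.hadoop.conf.Configuration;

public class CompressionStages {
    //A sketch of the per-stage compression switches; the codecs shown are illustrative
    public static void configure(Configuration conf) {
        //Shuffle stage: compress map output before it is spilled and transferred to reducers
        conf.set("mapreduce.map.output.compress", "true");
        conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
        //Output stage: compress the final files written by the job
        conf.set("mapreduce.output.fileoutputformat.compress", "true");
        conf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");
        //Input stage needs no property: TextInputFormat picks a codec from the file extension (e.g. .gz)
    }
}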

2, Compression support

1. Check the compression algorithms supported by Hadoop: hadoop checknative

2. Compression algorithm supported by Hadoop

3. Comparison of compression performance of each compression algorithm

Gzip

  • Advantages: the highest compression ratio of the four; supported by Hadoop out of the box, so an application can process gzip files just as if they were plain text; supported by the Hadoop native library; most Linux systems ship the gzip command, so it is easy to use
  • Shortcomings: split is not supported

Lzo

  • Advantages: relatively fast compression and decompression with a reasonable compression ratio; supports split and is the most popular compression format in Hadoop; supported by the Hadoop native library; easy to use once the lzop command is installed on Linux
  • Shortcomings: the compression ratio is lower than gzip; not supported by Hadoop out of the box and must be installed; lzo supports split only when the lzo file has been indexed, otherwise Hadoop treats it as an ordinary file (and the input format must be set to the lzo format to support split)

Bzip2

  • Advantages: supports split; the compression ratio is high, higher than that of gzip; supported by Hadoop out of the box; the bzip2 command ships with Linux, so it is easy to use
  • Shortcomings: slow compression and decompression; the native library is not supported

Snappy

  • Advantages: very fast compression; supported by the Hadoop native library
  • Shortcomings: split is not supported; the compression ratio is low; not supported by Hadoop out of the box and must be installed; there is no corresponding command on Linux

4. Compression ratio for data of the same size

5. Compression time and decompression time


The comparison above shows that the higher the compression ratio, the longer the compression takes. In practice, choose a compression algorithm with a balanced compression ratio and compression time.

3, Gzip compression

1. Generate Gzip compressed file

1. Requirements: read an ordinary text file and compress it into Gzip format

2. Ideas

  1. Input reads an ordinary text file
  2. Map and Reduce output the data directly
  3. Configure the Output
  4. The output is compressed into Gzip format

3. Code implementation

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

/**
 * @ClassName MRWriteGzip
 * @Description TODO Read ordinary file data and compress the data in Gzip format
 */
public class MRWriteGzip extends Configured implements Tool {

    //Build, configure, and submit a MapReduce Job
    public int run(String[] args) throws Exception {

        //Build Job
        Job job = Job.getInstance(this.getConf(),this.getClass().getSimpleName());
        job.setJarByClass(MRWriteGzip.class);

        //Input: configuration input
        Path inputPath = new Path(args[0]);
        TextInputFormat.setInputPaths(job,inputPath);

        //Map: configure map
        job.setMapperClass(MrMapper.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        //Reduce: configure reduce
        job.setReducerClass(MrReduce.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        //Output: configure output
        Path outputPath = new Path(args[1]);
        TextOutputFormat.setOutputPath(job,outputPath);

        return job.waitForCompletion(true) ? 0 : -1;
    }

    //Program entry, call run
    public static void main(String[] args) throws Exception {
        //Used to manage all configurations of the current program
        Configuration conf = new Configuration();
        //Configure the output to be compressed in Gzip format
        conf.set("mapreduce.output.fileoutputformat.compress","true");
        conf.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.GzipCodec");
        //Call the run method to submit and run the Job
        int status = ToolRunner.run(conf, new MRWriteGzip(), args);
        System.exit(status);
    }


    /**
     * Define Mapper class
     */
    public static class MrMapper extends Mapper<LongWritable, Text, NullWritable, Text>{

        private NullWritable outputKey = NullWritable.get();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            //Direct output of each data
            context.write(this.outputKey,value);
        }
    }

    /**
     * Define Reduce class
     */
    public static class MrReduce extends Reducer<NullWritable, Text, NullWritable, Text> {

        @Override
        protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            //Direct output of each data
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }

}
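
An equivalent way to enable Gzip output compression, instead of setting the raw property names in main(), is to call the FileOutputFormat helper methods inside run(). The sketch below is a hypothetical helper (not part of the original code) showing those calls.

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzipOutputHelper {
    //Equivalent to setting mapreduce.output.fileoutputformat.compress(.codec) by hand
    public static void enableGzipOutput(Job job) {
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}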

2. Read Gzip compressed file

1. Requirements: read Gzip compressed file and restore it to normal text file

2. Ideas

  1. Input directly reads the compression result file of the previous step
  2. Direct output of Map and Reduce
  3. Output saves the results as a normal text file

3. Code development

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    import java.io.IOException;
    
    /**
     * @ClassName MRReadGzip
     * @Description TODO Read the data in Gzip format and restore it to a normal text file
     */
    public class MRReadGzip extends Configured implements Tool {
    
     //Build, configure, and submit a MapReduce Job
     public int run(String[] args) throws Exception {
    
         //Build Job
         Job job = Job.getInstance(this.getConf(),this.getClass().getSimpleName());
         job.setJarByClass(MRReadGzip.class);
    
         //Input: configuration input
         Path inputPath = new Path(args[0]);
         TextInputFormat.setInputPaths(job,inputPath);
    
         //Map: configure map
         job.setMapperClass(MrMapper.class);
         job.setMapOutputKeyClass(NullWritable.class);
         job.setMapOutputValueClass(Text.class);
    
         //Reduce: configure reduce
         job.setReducerClass(MrReduce.class);
         job.setOutputKeyClass(NullWritable.class);
         job.setOutputValueClass(Text.class);
    
    
         //Output: configure output
         Path outputPath = new Path(args[1]);
         TextOutputFormat.setOutputPath(job,outputPath);
    
         return job.waitForCompletion(true) ? 0 : -1;
     }
    
     //Program entry, call run
     public static void main(String[] args) throws Exception {
         //Used to manage all configurations of the current program
         Configuration conf = new Configuration();
         //Output compression is not needed when reading, so the Gzip settings stay commented out
         //conf.set("mapreduce.output.fileoutputformat.compress","true");
         //conf.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.GzipCodec");
     //Call the run method to submit and run the Job
         int status = ToolRunner.run(conf, new MRReadGzip(), args);
         System.exit(status);
     }
    
    
     /**
      * Define Mapper class
      */
     public static class MrMapper extends Mapper<LongWritable, Text, NullWritable, Text>{
    
         private NullWritable outputKey = NullWritable.get();
    
         @Override
         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             context.write(this.outputKey,value);
         }
     }
    
     /**
      * Define Reduce class
      */
     public static class MrReduce extends Reducer<NullWritable, Text,NullWritable, Text> {
    
         @Override
         protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             for (Text value : values) {
                 context.write(key, value);
             }
         }
     }
    
    }
    

4, Snappy compression

1. Configure Hadoop to support Snappy

Hadoop supports the Snappy compression algorithm, and Snappy is also the most commonly used codec. However, the official pre-built Hadoop packages do not include Snappy support, so to use Snappy compression you must download the Hadoop source code and compile it yourself, enabling Snappy support during compilation. For the specific build procedure, refer to the Hadoop 3 compilation and installation manual.
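
After recompiling, it is worth verifying that the native library with Snappy support is actually being loaded. Besides running hadoop checknative, a small check like the sketch below can be used (an illustrative class, assuming a Hadoop 3.1.x client with the native library on java.library.path).

import org.apache.hadoop.util.NativeCodeLoader;

public class SnappyNativeCheck {
    public static void main(String[] args) {
        //True only if the native libhadoop library was found and loaded
        boolean loaded = NativeCodeLoader.isNativeCodeLoaded();
        System.out.println("native hadoop library loaded: " + loaded);
        if (loaded) {
            //True only if libhadoop was compiled with Snappy support
            System.out.println("native snappy supported: " + NativeCodeLoader.buildSupportsSnappy());
        }
    }
}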

2. Generate Snappy compressed file: Map output is not compressed

1. Requirements: read ordinary text files and convert them into Snappy compressed files

2. Ideas

  1. Input reads a normal text file
  2. Direct output of Map and Reduce
  3. Output configures that the output is compressed to Snappy type

3. Code development

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    import java.io.IOException;
    
    /**
     * @ClassName MRWriteSnappy
     * @Description TODO Read ordinary file data and compress the data in Snappy format
     */
    public class MRWriteSnappy extends Configured implements Tool {
    
     //Build, configure, and submit a MapReduce Job
     public int run(String[] args) throws Exception {
    
         //Build Job
         Job job = Job.getInstance(this.getConf(),this.getClass().getSimpleName());
         job.setJarByClass(MRWriteSnappy.class);
    
         //Input: configuration input
         Path inputPath = new Path(args[0]);
         TextInputFormat.setInputPaths(job,inputPath);

         //Map: configure map
         job.setMapperClass(MrMapper.class);
         job.setMapOutputKeyClass(NullWritable.class);
         job.setMapOutputValueClass(Text.class);
    
         //Reduce: configure reduce
         job.setReducerClass(MrReduce.class);
         job.setOutputKeyClass(NullWritable.class);
         job.setOutputValueClass(Text.class);
    
    
         //Output: configure output
         Path outputPath = new Path(args[1]);
         TextOutputFormat.setOutputPath(job,outputPath);
    
         return job.waitForCompletion(true) ? 0 : -1;
     }
    
     //Program entry, call run
     public static void main(String[] args) throws Exception {
         //Used to manage all configurations of the current program
         Configuration conf = new Configuration();
         //Configure the output results to be compressed into Snappy format
         conf.set("mapreduce.output.fileoutputformat.compress","true");
         conf.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");
         //Call the run method to submit and run the Job
         int status = ToolRunner.run(conf, new MRWriteSnappy(), args);
         System.exit(status);
     }
    
    
     /**
      * Define Mapper class
      */
     public static class MrMapper extends Mapper<LongWritable, Text, NullWritable, Text>{
    
         private NullWritable outputKey = NullWritable.get();

         @Override
         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             context.write(this.outputKey,value);
         }
     }
    
     /**
      * Define Reduce class
      */
     public static class MrReduce extends Reducer<NullWritable, Text,NullWritable, Text> {
    
         @Override
         protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             for (Text value : values) {
                 context.write(key, value);
             }
         }
     }
    }

3. Generate Snappy compressed file: Map output compression

1. Requirements: read ordinary text files, convert them into Snappy compressed files, and use Snappy compression for the results output from Map

2. Idea: add the configuration of Map output compression to the code in the previous step

3. Code development

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

/**
 * @ClassName MRMapOutputSnappy
 * @Description TODO Read ordinary file data and compress the data output from Map in Snappy format
 */
public class MRMapOutputSnappy extends Configured implements Tool {

    //Build, configure, and submit a MapReduce Job
    public int run(String[] args) throws Exception {

        //Build Job
        Job job = Job.getInstance(this.getConf(),this.getClass().getSimpleName());
        job.setJarByClass(MRMapOutputSnappy.class);

        //Input: configuration input
        Path inputPath = new Path(args[0]);
        TextInputFormat.setInputPaths(job,inputPath);

        //Map: configure map
        job.setMapperClass(MrMapper.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        //Reduce: configure reduce
        job.setReducerClass(MrReduce.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);


        //Output: configure output
        Path outputPath = new Path(args[1]);
        TextOutputFormat.setOutputPath(job,outputPath);

        return job.waitForCompletion(true) ? 0 : -1;
    }

    //Program entry, call run
    public static void main(String[] args) throws Exception {
        //Used to manage all configurations of the current program
        Configuration conf = new Configuration();
        //Configure Map output results to be compressed into Snappy format
        conf.set("mapreduce.map.output.compress","true");
        conf.set("mapreduce.map.output.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");
        //Configure the compression of Reduce output results into Snappy format
        conf.set("mapreduce.output.fileoutputformat.compress","true");
        conf.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");
        //Call the run method to submit and run the Job
        int status = ToolRunner.run(conf, new MRMapOutputSnappy(), args);
        System.exit(status);
    }


    /**
     * Define Mapper class
     */
    public static class MrMapper extends Mapper<LongWritable, Text, NullWritable, Text>{

        private NullWritable outputKey = NullWritable.get();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            //Direct output of each data
            context.write(this.outputKey,value);
        }
    }

    /**
     * Define Reduce class
     */
    public static class MrReduce extends Reducer<NullWritable, Text, NullWritable, Text> {

        @Override
        protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            //Direct output of each data
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }
}
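
The same map-output compression settings can also be applied inside run() on the Job's own Configuration, using typed setters instead of plain strings. The sketch below is a hypothetical helper, not part of the original code.

import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class MapOutputCompression {
    //Equivalent to setting mapreduce.map.output.compress(.codec) in main()
    public static void enableSnappyMapOutput(Job job) {
        job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
        job.getConfiguration().setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
    }
}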

4. Read Snappy compressed file

1. Requirements: read the Snappy file generated in the previous step and restore it to a normal text file

2. Ideas:

  1. Input read Snappy file
  2. Direct output of Map and Reduce
  3. Output outputs directly to the normal text type

3. Code development

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    import java.io.IOException;
    /**
     * @ClassName MRReadSnappy
     * @Description TODO Read the data in Snappy format and restore it to a normal text file
     */
    public class MRReadSnappy extends Configured implements Tool {
    
     //Build, configure, and submit a MapReduce Job
     public int run(String[] args) throws Exception {
    
         //Build Job
         Job job = Job.getInstance(this.getConf(),this.getClass().getSimpleName());
         job.setJarByClass(MRReadSnappy.class);
    
         //Input: configuration input
         Path inputPath = new Path(args[0]);
         TextInputFormat.setInputPaths(job,inputPath);
    
         //Map: configure map
         job.setMapperClass(MrMapper.class);
         job.setMapOutputKeyClass(NullWritable.class);
         job.setMapOutputValueClass(Text.class);
    
         //Reduce: configure reduce
         job.setReducerClass(MrReduce.class);
         job.setOutputKeyClass(NullWritable.class);
         job.setOutputValueClass(Text.class);
    
    
         //Output: configure output
         Path outputPath = new Path(args[1]);
         TextOutputFormat.setOutputPath(job,outputPath);
    
         return job.waitForCompletion(true) ? 0 : -1;
     }
    
     //Program entry, call run
     public static void main(String[] args) throws Exception {
         //Used to manage all configurations of the current program
         Configuration conf = new Configuration();
         //Call the run method to submit and run the Job
         int status = ToolRunner.run(conf, new MRReadSnappy(), args);
         System.exit(status);
     }
    
    
     /**
      * Define Mapper class
      */
     public static class MrMapper extends Mapper<LongWritable, Text, NullWritable, Text>{
    
         private NullWritable outputKey = NullWritable.get();
    
         @Override
         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             context.write(this.outputKey,value);
         }
     }
    
     /**
      * Define Reduce class
      */
     public static class MrReduce extends Reducer<NullWritable, Text,NullWritable, Text> {
    
         @Override
         protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             for (Text value : values) {
                 context.write(key, value);
             }
         }
     }
    }

5, Lzo compression

1. Configure Hadoop to support Lzo

Hadoop itself does not support LZO compression; LZO must be installed separately and support for it added when Hadoop is compiled. For the build procedure, refer to the compilation manual "Apache Hadoop 3-1-3 compilation, installation and deployment LZO compression guide".
After compilation, apply the following configuration so that the current Hadoop installation supports Lzo compression:

    • Add lzo support jar package
    cp hadoop-lzo-0.4.21-SNAPSHOT.jar /export/server/hadoop-3.1.4/share/hadoop/common/

    • Synchronize to all nodes
    cd  /export/server/hadoop-3.1.4/share/hadoop/common/
    scp hadoop-lzo-0.4.21-SNAPSHOT.jar node2:$PWD
    scp hadoop-lzo-0.4.21-SNAPSHOT.jar node3:$PWD
    
    • Modify core-site.xml
    <property>
     <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
    </property>
    <property>
     <name>io.compression.codec.lzo.class</name>
     <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
    
    • Synchronize core-site.xml to all other nodes
    cd  /export/server/hadoop-3.1.4/etc/hadoop
    scp  core-site.xml node2:$PWD
    scp  core-site.xml node3:$PWD
    • Restart Hadoop cluster

2. Generate Lzo compressed file

1. Requirements: read an ordinary text file and generate an Lzo-compressed result file

2. Ideas

  1. Input reads a normal text file
  2. Map and Reduce output the data directly
  3. Configure the Output to compress to the Lzo format

3. Code development

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    import java.io.IOException;
    
    /**
     * @ClassName MRWriteLzo
     * @Description TODO Read ordinary file data and compress the data in Lzo format
     */
    public class MRWriteLzo extends Configured implements Tool {
    
     //Build, configure, and submit a MapReduce Job
     public int run(String[] args) throws Exception {
    
         //Build Job
         Job job = Job.getInstance(this.getConf(),this.getClass().getSimpleName());
         job.setJarByClass(MRWriteLzo.class);
    
         //Input: configuration input
         Path inputPath = new Path(args[0]);
         TextInputFormat.setInputPaths(job,inputPath);

         //Map: configure map
         job.setMapperClass(MrMapper.class);
         job.setMapOutputKeyClass(NullWritable.class);
         job.setMapOutputValueClass(Text.class);
    
         //Reduce: configure reduce
         job.setReducerClass(MrReduce.class);
         job.setOutputKeyClass(NullWritable.class);
         job.setOutputValueClass(Text.class);
    
    
         //Output: configure output
         Path outputPath = new Path(args[1]);
         TextOutputFormat.setOutputPath(job,outputPath);
    
         return job.waitForCompletion(true) ? 0 : -1;
     }
    
     //Program entry, call run
     public static void main(String[] args) throws Exception {
         //Used to manage all configurations of the current program
         Configuration conf = new Configuration();
         //Configure the output results to be compressed into Lzo format
         conf.set("mapreduce.output.fileoutputformat.compress","true");
         conf.set("mapreduce.output.fileoutputformat.compress.codec","com.hadoop.compression.lzo.LzopCodec");
         //Call the run method to submit and run the Job
         int status = ToolRunner.run(conf, new MRWriteLzo(), args);
         System.exit(status);
     }
    
    
     /**
      * Define Mapper class
      */
     public static class MrMapper extends Mapper<LongWritable, Text, NullWritable, Text>{
    
         private NullWritable outputKey = NullWritable.get();
    
         @Override
         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             context.write(this.outputKey,value);
         }
     }
    
     /**
      * Define Reduce class
      */
     public static class MrReduce extends Reducer<NullWritable, Text,NullWritable, Text> {
    
         @Override
         protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             for (Text value : values) {
                 context.write(key, value);
             }
         }
     }
    }
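
Because Lzo files only become splittable after they are indexed (see the comparison table above), the generated .lzo result usually needs an index before another job can read it with splits. Below is a sketch of the indexing step using the DistributedLzoIndexer tool shipped with hadoop-lzo; the jar path is the one used in the configuration step above, and the output file name is an assumption.

hadoop jar /export/server/hadoop-3.1.4/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer /output/lzo/part-r-00000.lzo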
    

3. Read Lzo compressed file

1. Requirements: read the Lzo-compressed file and restore it to a normal text file

2. Code development

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    import java.io.IOException;
    
    /**
     * @ClassName MRReadLzo
     * @Description TODO Read the data in Lzo format and restore it to a normal text file
     */
    public class MRReadLzo extends Configured implements Tool {
    
     //Build, configure, and submit a MapReduce Job
     public int run(String[] args) throws Exception {
    
         //Build Job
         Job job = Job.getInstance(this.getConf(),this.getClass().getSimpleName());
         job.setJarByClass(MRReadLzo.class);
    
         //Input: configuration input
         Path inputPath = new Path(args[0]);
         TextInputFormat.setInputPaths(job,inputPath);
    
         //Map: configure map
         job.setMapperClass(MrMapper.class);
         job.setMapOutputKeyClass(NullWritable.class);
         job.setMapOutputValueClass(Text.class);
    
         //Reduce: configure reduce
         job.setReducerClass(MrReduce.class);
         job.setOutputKeyClass(NullWritable.class);
         job.setOutputValueClass(Text.class);
    
    
         //Output: configure output
         Path outputPath = new Path(args[1]);
         TextOutputFormat.setOutputPath(job,outputPath);
    
         return job.waitForCompletion(true) ? 0 : -1;
     }

     //Program entry, call run
     public static void main(String[] args) throws Exception {
         //Used to manage all configurations of the current program
         Configuration conf = new Configuration();
         //Output compression settings (left over from the Gzip example) are not needed for reading and stay commented out
         //conf.set("mapreduce.output.fileoutputformat.compress","true");
         //conf.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.GzipCodec");
         //Call the run method to submit and run the Job
         int status = ToolRunner.run(conf, new MRReadLzo(), args);
         System.exit(status);
     }
    
    
     /**
      * Define Mapper class
      */
     public static class MrMapper extends Mapper<LongWritable, Text, NullWritable, Text>{
    
         private NullWritable outputKey = NullWritable.get();
    
         @Override
         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             context.write(this.outputKey,value);
         }
     }
    
     /**
      * Define Reduce class
      */
     public static class MrReduce extends Reducer<NullWritable, Text,NullWritable, Text> {
    
         @Override
         protected void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
             //Direct output of each data
             for (Text value : values) {
                 context.write(key, value);
             }
         }
     }
    }
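
MRReadLzo above reads the .lzo input through the default TextInputFormat, so each .lzo file is processed by a single map task. To read indexed .lzo files with splits, the hadoop-lzo input format can be set instead; the sketch below shows the one-line change that would go into run() (assuming the hadoop-lzo jar is on the job classpath and the input has been indexed).

import com.hadoop.mapreduce.LzoTextInputFormat;
import org.apache.hadoop.mapreduce.Job;

public class LzoSplittableInput {
    //Replaces TextInputFormat so that indexed .lzo input files can be split across map tasks
    public static void useLzoInputFormat(Job job) {
        job.setInputFormatClass(LzoTextInputFormat.class);
    }
}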
    
