Hive Quick Start Series (12) | Introduction to Hive Data Compression and How to Use It

Keywords: Hadoop Apache hive codec

1, Compiling Hadoop from source with Snappy compression support

1.1 Resource preparation

  • 1. CentOS networking

Configure CentOS so that it can reach the Internet. From the Linux virtual machine, ping www.baidu.com should succeed.
Note: compile as the root user to avoid folder permission problems.

  • 2. Package preparation (Hadoop source code, JDK 8, Snappy, Maven, protobuf)

(1)hadoop-2.7.2-src.tar.gz
(2)jdk-8u144-linux-x64.tar.gz
(3)snappy-1.1.3.tar.gz
(4)apache-maven-3.0.5-bin.tar.gz
(5)protobuf-2.5.0.tar.gz

If you need these files, you can download them through the link shared by the blogger:
Link: https://pan.baidu.com/s/19lM5UgctzCgEkF5S7ZKBtA
Extraction code: drql

1.2 Package installation

Note: all operations must be completed under root

  • 1. Extract the JDK, set the JAVA_HOME and PATH environment variables, and check the Java version to confirm the configuration works
[root@hadoop001 software]# tar -zxf jdk-8u144-linux-x64.tar.gz -C /opt/module/
[root@hadoop001 software]# vi /etc/profile
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
[root@hadoop001 software]# source /etc/profile

Validation command: java -version

  • 2. Extract and configure MAVEN_HOME and PATH
[root@hadoop001 software]# tar -zxvf apache-maven-3.0.5-bin.tar.gz -C /opt/module/
[root@hadoop001 apache-maven-3.0.5]# vi /etc/profile
#MAVEN_HOME
export MAVEN_HOME=/opt/module/apache-maven-3.0.5
export PATH=$PATH:$MAVEN_HOME/bin
[root@hadoop001 software]# source /etc/profile

Validation command: mvn -version

1.3 Compiling the source code

  • 1. Prepare the compilation environment
[root@hadoop001 software]# yum install svn
[root@hadoop001 software]# yum install autoconf automake libtool cmake
[root@hadoop001 software]# yum install ncurses-devel
[root@hadoop001 software]# yum install openssl-devel
[root@hadoop001 software]# yum install gcc*
  • 2. Compile and install snappy
[root@hadoop001 software]# tar -zxvf snappy-1.1.3.tar.gz -C /opt/module/
[root@hadoop001 module]# cd snappy-1.1.3/
[root@hadoop001 snappy-1.1.3]# ./configure
[root@hadoop001 snappy-1.1.3]# make
[root@hadoop001 snappy-1.1.3]# make install
# View snappy library files
[root@hadoop001 snappy-1.1.3]# ls -lh /usr/local/lib |grep snappy
  • 3. Compile and install protobuf
[root@hadoop001 software]# tar -zxvf protobuf-2.5.0.tar.gz -C /opt/module/
[root@hadoop001 module]# cd protobuf-2.5.0/
[root@hadoop001 protobuf-2.5.0]# ./configure 
[root@hadoop001 protobuf-2.5.0]# make
[root@hadoop001 protobuf-2.5.0]# make install
# Check the protobuf version to see if the installation is successful
[root@hadoop001 protobuf-2.5.0]# protoc --version
  • 4. Compile hadoop native
[root@hadoop001 software]# tar -zxvf hadoop-2.7.2-src.tar.gz
[root@hadoop001 software]# cd hadoop-2.7.2-src/
[root@hadoop001 hadoop-2.7.2-src]# mvn clean package -DskipTests -Pdist,native -Dtar -Dsnappy.lib=/usr/local/lib -Dbundle.snappy

After the build succeeds, /opt/software/hadoop-2.7.2-src/hadoop-dist/target/hadoop-2.7.2.tar.gz is the newly generated binary installation package with Snappy compression support.
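
Once the rebuilt package is installed, it is worth confirming that the native Snappy library was actually picked up. The following is a minimal sketch, assuming the new build's hadoop command is on PATH. The helper name snappy_enabled is made up for this example, and the grep pattern assumes that hadoop checknative -a prints a line starting with "snappy:" followed by true or false.

```shell
# Hypothetical helper: reads `hadoop checknative -a` style output on stdin
# and succeeds only if the snappy line reports true.
snappy_enabled() {
  grep -Eq '^snappy:[[:space:]]+true'
}

if command -v hadoop >/dev/null 2>&1; then
  if hadoop checknative -a 2>/dev/null | snappy_enabled; then
    echo "snappy native library: available"
  else
    echo "snappy native library: NOT available"
  fi
else
  echo "hadoop not on PATH; install the rebuilt package first"
fi
```

If the check reports that Snappy is not available, the usual suspects are a missing libsnappy under /usr/local/lib or an installation that still uses the old, non-native build.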

2, Hadoop compression configuration

2.1 Compression codecs supported by MapReduce

Compression format   Tool    Algorithm   File extension   Splittable
DEFLATE              none    DEFLATE     .deflate         no
Gzip                 gzip    DEFLATE     .gz              no
bzip2                bzip2   bzip2       .bz2             yes
LZO                  lzop    LZO         .lzo             yes (if indexed)
Snappy               none    Snappy      .snappy          no
  • To support a variety of compression/decompression algorithms, Hadoop provides a codec (encoder/decoder) class for each format:
Compression format   Codec (encoder/decoder) class
DEFLATE              org.apache.hadoop.io.compress.DefaultCodec
gzip                 org.apache.hadoop.io.compress.GzipCodec
bzip2                org.apache.hadoop.io.compress.BZip2Codec
LZO                  com.hadoop.compression.lzo.LzopCodec
Snappy               org.apache.hadoop.io.compress.SnappyCodec
  • Comparison of compression performance
Compression algorithm   Original size   Compressed size   Compression speed   Decompression speed
gzip                    8.3GB           1.8GB             17.5MB/s            58MB/s
bzip2                   8.3GB           1.1GB             2.4MB/s             9.5MB/s
LZO                     8.3GB           2.9GB             49.3MB/s            74.6MB/s

Snappy is not listed in the table above. Its open-source project page gives the following figures:
http://google.github.io/snappy/

On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

In other words, Snappy compresses at about 250MB/s and decompresses at about 500MB/s, far faster than any codec in the table above.
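
The trade-off between size and speed in the comparison table is easier to see as a ratio. The sketch below only redoes the arithmetic on the gzip, bzip2, and LZO figures quoted above. The function name compression_ratios is made up for this example.

```shell
# Hypothetical helper: for each input line "codec original_GB compressed_GB",
# print the compressed size as a percentage of the original.
compression_ratios() {
  awk '{ printf "%s %.1f%%\n", $1, 100 * $3 / $2 }'
}

# Figures taken from the comparison table above.
compression_ratios <<'EOF'
gzip 8.3 1.8
bzip2 8.3 1.1
LZO 8.3 2.9
EOF
```

This prints 21.7% for gzip, 13.3% for bzip2, and 34.9% for LZO. bzip2 produces the smallest file but is by far the slowest, while LZO is the fastest of the three and produces the largest file.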

2.2 Compression parameter configuration

To enable compression in Hadoop, you can configure the following parameters (in the mapred-site.xml file):

Parameter: io.compression.codecs (configured in core-site.xml)
  Default: org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.Lz4Codec
  Stage: input compression
  Recommendation: Hadoop uses the file extension to determine whether it supports a codec

Parameter: mapreduce.map.output.compress
  Default: false
  Stage: mapper output
  Recommendation: set this parameter to true to enable compression

Parameter: mapreduce.map.output.compress.codec
  Default: org.apache.hadoop.io.compress.DefaultCodec
  Stage: mapper output
  Recommendation: use the LZO, LZ4, or Snappy codec to compress data at this stage

Parameter: mapreduce.output.fileoutputformat.compress
  Default: false
  Stage: reducer output
  Recommendation: set this parameter to true to enable compression

Parameter: mapreduce.output.fileoutputformat.compress.codec
  Default: org.apache.hadoop.io.compress.DefaultCodec
  Stage: reducer output
  Recommendation: use standard tools or codecs such as gzip and bzip2

Parameter: mapreduce.output.fileoutputformat.compress.type
  Default: RECORD
  Stage: reducer output
  Recommendation: compression type used for SequenceFile output: NONE, RECORD, or BLOCK
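
As an illustration of how these parameters might look in mapred-site.xml, the sketch below writes a sample fragment to a scratch file and prints it. The codec choices (Snappy for map output, gzip for reducer output) are assumptions that follow the recommendations listed above, not settings taken from a real cluster. Nothing in the Hadoop installation is modified.

```shell
# Write a sample fragment to a scratch file; the real mapred-site.xml is untouched.
fragment=$(mktemp)
cat > "$fragment" <<'EOF'
<!-- Merge these properties into the <configuration> element of mapred-site.xml -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
EOF
cat "$fragment"
```

Settings placed in mapred-site.xml become defaults for jobs that read that configuration. The Hive examples in the following sections set the same properties per session instead.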

3, Enabling map output compression

Compressing the map output reduces the amount of data transferred between the map and reduce tasks of a job. The specific configuration is as follows:

Hands-on example:

  • 1. Enable the data compression function of hive intermediate transmission
hive (default)> set hive.exec.compress.intermediate=true;
  • 2. Enable the map output compression function in mapreduce
hive (default)> set mapreduce.map.output.compress=true;
  • 3. Set the compression method of map output data in mapreduce
hive (default)> set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

  • 4. Execute query statement
hive (default)> select count(ename) name from emp;
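
The same steps can also be kept in a script instead of being typed into the CLI each time. This is a minimal sketch, assuming the hive command is on PATH and the emp table from the example exists. The scratch file created by mktemp is only a placeholder location.

```shell
# Save the session settings and the test query to a script file.
hql_file=$(mktemp)   # placeholder location; any writable path works
cat > "$hql_file" <<'EOF'
set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
select count(ename) name from emp;
EOF

# Run the script only where the hive CLI is installed.
if command -v hive >/dev/null 2>&1; then
  hive -f "$hql_file"
else
  echo "hive not on PATH; script saved to $hql_file"
fi
```

Because the set statements live in the script, they apply only to that run and leave the cluster-wide defaults unchanged.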

4, Enabling reduce output compression

When Hive writes query output to a table, that output can also be compressed. The property hive.exec.compress.output controls this feature. You may want to keep its default value of false in the default settings file, so that output is uncompressed plain text by default, and turn compression on for individual queries or scripts by setting the property to true.

Hands-on example:

  • 1. Enable the compression function of hive final output data
hive (default)> set hive.exec.compress.output=true;
  • 2. Enable mapreduce final output data compression
hive (default)> set mapreduce.output.fileoutputformat.compress=true;
  • 3. Set the final data output compression mode of mapreduce
hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

  • 4. Set the mapreduce final data output compression to block compression
hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
  • 5. Test whether the output result is a compressed file
hive (default)> insert overwrite local directory
 '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;
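
To check the result, look at the file names in the output directory. The sketch below assumes that Hive names Snappy-compressed output files with a .snappy extension. The helper name count_snappy_files is made up for this example.

```shell
# Hypothetical helper: count files ending in .snappy directly inside a directory.
count_snappy_files() {
  find "$1" -maxdepth 1 -type f -name '*.snappy' | wc -l
}

out_dir=/opt/module/datas/distribute-result
if [ -d "$out_dir" ]; then
  echo "compressed files in $out_dir: $(count_snappy_files "$out_dir")"
else
  echo "$out_dir not found; run the insert statement first"
fi
```

A count of zero after the insert has run suggests that the compression settings were not in effect for that session.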

That is the end of this installment.

If you found it useful, a like after reading is appreciated. Writing these posts takes effort, and your support is what keeps me going. Don't forget to follow me!


Posted by pavanpuligandla on Tue, 05 May 2020 23:29:55 -0700