Hive Quick Start Series (12) | Introduction to and Use of Hive Data Compression

Keywords: Hadoop Apache hive codec

Catalog

1, Hadoop source compilation supports Snappy compression

1.1 resource preparation
1.2 jar package installation
1.3 compiling source code

2, Hadoop compression configuration

2.1 compression codecs supported by MR
2.2 compression parameter configuration

3, Turn on Map output phase compression
4, Enable Reduce output stage compression

1, Hadoop source compilation supports Snappy compression

1.1 resource preparation

1. CentOS networking

Configure CentOS so that it can reach the Internet; from the Linux virtual machine, ping www.baidu.com should succeed.
Note: compile as the root user to avoid folder permission problems.

2. Package preparation (Hadoop source code, JDK 8, Snappy, Maven, protobuf)

(1)hadoop-2.7.2-src.tar.gz
(2)jdk-8u144-linux-x64.tar.gz
(3)snappy-1.1.3.tar.gz
(4)apache-maven-3.0.5-bin.tar.gz
(5)protobuf-2.5.0.tar.gz

If you need these files, you can download them through the link shared by the blogger:
Link: https://pan.baidu.com/s/19lM5UgctzCgEkF5S7ZKBtA
Extraction code: drql

1.2 jar package installation

Note: all operations must be performed as root.

1. Decompress the JDK, configure the JAVA_HOME and PATH environment variables, and verify the Java version to confirm that the configuration works

[root@hadoop001 software]# tar -zxf jdk-8u144-linux-x64.tar.gz -C /opt/module/
[root@hadoop001 software]# vi /etc/profile
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
[root@hadoop001 software]# source /etc/profile

Validation command: java -version

2. Extract and configure MAVEN_HOME and PATH

[root@hadoop001 software]# tar -zxvf apache-maven-3.0.5-bin.tar.gz -C /opt/module/
[root@hadoop001 apache-maven-3.0.5]# vi /etc/profile
#MAVEN_HOME
export MAVEN_HOME=/opt/module/apache-maven-3.0.5
export PATH=$PATH:$MAVEN_HOME/bin
[root@hadoop001 software]# source /etc/profile

Validation command: mvn -version
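
Both validation commands can be wrapped in one check that reports clearly when a tool is missing. This is a minimal sketch, assuming a POSIX shell; the helper name check_tool is hypothetical and not part of the original steps:

```shell
#!/bin/sh
# check_tool NAME: report whether NAME resolves on the current PATH.
# Returns 0 if found, 1 otherwise. (The helper name is illustrative.)
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
        return 0
    fi
    echo "$1: NOT found on PATH; re-check /etc/profile and run 'source /etc/profile'" >&2
    return 1
}

# Check the two tools configured in this section.
for tool in java mvn; do
    check_tool "$tool" || true   # keep going so every missing tool is reported
done
```

If either tool is reported as missing, the export lines in /etc/profile are the first place to look.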

1.3 compiling source code

1. Prepare the compilation environment

[root@hadoop001 software]# yum install svn
[root@hadoop001 software]# yum install autoconf automake libtool cmake
[root@hadoop001 software]# yum install ncurses-devel
[root@hadoop001 software]# yum install openssl-devel
[root@hadoop001 software]# yum install 'gcc*'

2. Compile and install snappy

[root@hadoop001 software]# tar -zxvf snappy-1.1.3.tar.gz -C /opt/module/
[root@hadoop001 module]# cd snappy-1.1.3/
[root@hadoop001 snappy-1.1.3]# ./configure
[root@hadoop001 snappy-1.1.3]# make
[root@hadoop001 snappy-1.1.3]# make install
# View snappy library files
[root@hadoop001 snappy-1.1.3]# ls -lh /usr/local/lib |grep snappy
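
The ls check above can also be scripted so that a build script stops when the library is absent. This is a rough sketch, assuming the default install prefix /usr/local used by make install; the function name has_snappy_lib is hypothetical:

```shell
#!/bin/sh
# has_snappy_lib DIR: succeed if DIR contains at least one libsnappy* file.
# DIR defaults to /usr/local/lib, the directory listed in the step above.
has_snappy_lib() {
    dir="${1:-/usr/local/lib}"
    for f in "$dir"/libsnappy*; do
        # If the glob matched nothing, $f is the literal pattern and -e fails.
        if [ -e "$f" ]; then
            return 0
        fi
    done
    return 1
}

if has_snappy_lib; then
    echo "snappy libraries found in /usr/local/lib"
else
    echo "no libsnappy* files in /usr/local/lib; re-run 'make install'" >&2
fi
```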

3. Compile and install protobuf

[root@hadoop001 software]# tar -zxvf protobuf-2.5.0.tar.gz -C /opt/module/
[root@hadoop001 module]# cd protobuf-2.5.0/
[root@hadoop001 protobuf-2.5.0]# ./configure
[root@hadoop001 protobuf-2.5.0]# make
[root@hadoop001 protobuf-2.5.0]# make install
# Check the protobuf version to see if the installation is successful
[root@hadoop001 protobuf-2.5.0]# protoc --version
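
The Hadoop 2.7.x build expects protobuf 2.5.0, so it can help to compare the reported version before starting the long Maven build. This is a minimal sketch, assuming protoc --version prints a line such as "libprotoc 2.5.0"; the function name protoc_version_ok is hypothetical:

```shell
#!/bin/sh
# protoc_version_ok OUTPUT EXPECTED: succeed if OUTPUT (the text printed by
# 'protoc --version', e.g. "libprotoc 2.5.0") ends with the EXPECTED version.
protoc_version_ok() {
    actual="${1##* }"      # keep only the text after the last space
    [ "$actual" = "$2" ]
}

if command -v protoc >/dev/null 2>&1; then
    if protoc_version_ok "$(protoc --version)" "2.5.0"; then
        echo "protoc 2.5.0 is installed"
    else
        echo "unexpected protoc version: $(protoc --version)" >&2
    fi
else
    echo "protoc not found on PATH" >&2
fi
```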

4. Compile hadoop native

[root@hadoop001 software]# tar -zxvf hadoop-2.7.2-src.tar.gz
[root@hadoop001 software]# cd hadoop-2.7.2-src/
[root@hadoop001 hadoop-2.7.2-src]# mvn clean package -DskipTests -Pdist,native -Dtar -Dsnappy.lib=/usr/local/lib -Dbundle.snappy

After successful execution, /opt/software/hadoop-2.7.2-src/hadoop-dist/target/hadoop-2.7.2.tar.gz is the newly generated binary installation package that supports snappy compression.
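
One way to confirm that the snappy library really was bundled is to list the archive and look for libsnappy files. This is a sketch under the assumption that -Dbundle.snappy copies the library into the distribution; the function name tarball_has_snappy is hypothetical:

```shell
#!/bin/sh
# tarball_has_snappy TARBALL: succeed if the gzip-compressed tar archive
# lists at least one path containing "libsnappy".
tarball_has_snappy() {
    tar -tzf "$1" 2>/dev/null | grep 'libsnappy' >/dev/null
}

# Path taken from the build output location described above.
dist=/opt/software/hadoop-2.7.2-src/hadoop-dist/target/hadoop-2.7.2.tar.gz
if [ -f "$dist" ]; then
    if tarball_has_snappy "$dist"; then
        echo "snappy native library is bundled"
    else
        echo "snappy native library is missing from $dist" >&2
    fi
fi
```

Once the package is installed, running hadoop checknative gives a second confirmation, since it reports which native libraries (including snappy) Hadoop can load.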

2, Hadoop compression configuration

2.1 compression codecs supported by MR

Compression format	Tool	Algorithm	File extension	Splittable
DEFLATE	none	DEFLATE	.deflate	no
Gzip	gzip	DEFLATE	.gz	no
bzip2	bzip2	bzip2	.bz2	yes
LZO	lzop	LZO	.lzo	yes (when indexed)
Snappy	none	Snappy	.snappy	no

To support a variety of compression and decompression algorithms, Hadoop provides a codec (coder/decoder) class for each format:

Compression format	Corresponding codec class
DEFLATE	org.apache.hadoop.io.compress.DefaultCodec
gzip	org.apache.hadoop.io.compress.GzipCodec
bzip2	org.apache.hadoop.io.compress.BZip2Codec
LZO	com.hadoop.compression.lzo.LzopCodec
Snappy	org.apache.hadoop.io.compress.SnappyCodec
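
Read together, the two tables map a file extension to a codec class. The lookup below only illustrates that mapping, assuming the extensions and class names listed above; it is not how Hadoop itself resolves codecs, and the function name codec_for_file is hypothetical:

```shell
#!/bin/sh
# codec_for_file NAME: print the codec class that the tables above associate
# with NAME's extension, or "none" for an unrecognized extension.
codec_for_file() {
    case "$1" in
        *.deflate) echo "org.apache.hadoop.io.compress.DefaultCodec" ;;
        *.gz)      echo "org.apache.hadoop.io.compress.GzipCodec" ;;
        *.bz2)     echo "org.apache.hadoop.io.compress.BZip2Codec" ;;
        *.lzo)     echo "com.hadoop.compression.lzo.LzopCodec" ;;
        *.snappy)  echo "org.apache.hadoop.io.compress.SnappyCodec" ;;
        *)         echo "none" ;;
    esac
}

codec_for_file part-00000.gz
codec_for_file 000000_0.snappy
codec_for_file notes.txt
```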

Comparison of compression performance

Compression algorithm	Original file size	Compressed file size	Compression speed	Decompression speed
gzip	8.3GB	1.8GB	17.5MB/s	58MB/s
bzip2	8.3GB	1.1GB	2.4MB/s	9.5MB/s
LZO	8.3GB	2.9GB	49.3MB/s	74.6MB/s
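
The size columns are easier to compare as ratios. The sketch below computes the compressed size as a percentage of the original from the figures in the table, assuming awk is available; the function name ratio_percent is hypothetical:

```shell
#!/bin/sh
# ratio_percent ORIGINAL COMPRESSED: print the compressed size as a
# percentage of the original size, rounded to one decimal place.
ratio_percent() {
    awk -v o="$1" -v c="$2" 'BEGIN { printf "%.1f\n", c / o * 100 }'
}

# Sizes in GB, copied from the table above.
echo "gzip:  $(ratio_percent 8.3 1.8)%"   # 21.7%
echo "bzip2: $(ratio_percent 8.3 1.1)%"   # 13.3%
echo "LZO:   $(ratio_percent 8.3 2.9)%"   # 34.9%
```

On these figures bzip2 produces the smallest output but is by far the slowest, while LZO is the fastest of the three and produces the largest output.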

Snappy is not included in the table above. Its open source website reports the following:
http://google.github.io/snappy/

On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

In other words, Snappy compresses at about 250 MB/s and decompresses at about 500 MB/s, far faster than any of the codecs in the table above.

2.2 compression parameter configuration

To enable compression in Hadoop, configure the following parameters (in mapred-site.xml unless noted otherwise):

Parameter	Default	Stage	Recommendation
io.compression.codecs (configured in core-site.xml)	org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.Lz4Codec	Input compression	Hadoop uses the file extension to decide which codec, if any, applies
mapreduce.map.output.compress	false	Mapper output	Set this parameter to true to enable compression
mapreduce.map.output.compress.codec	org.apache.hadoop.io.compress.DefaultCodec	Mapper output	Use the LZO, LZ4, or Snappy codec to compress data at this stage
mapreduce.output.fileoutputformat.compress	false	Reducer output	Set this parameter to true to enable compression
mapreduce.output.fileoutputformat.compress.codec	org.apache.hadoop.io.compress.DefaultCodec	Reducer output	Use standard tools or codecs such as gzip and bzip2
mapreduce.output.fileoutputformat.compress.type	RECORD	Reducer output	Compression type for SequenceFile output: NONE, RECORD, or BLOCK
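
To make these settings permanent rather than per session, each parameter becomes a property element in mapred-site.xml. The sketch below prints such a fragment for pasting inside the configuration element. The helper name xml_property is hypothetical, and choosing SnappyCodec for both stages is an assumption made for illustration:

```shell
#!/bin/sh
# xml_property NAME VALUE: print one Hadoop <property> element.
xml_property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Fragment to paste inside <configuration> in mapred-site.xml.
# SnappyCodec is an illustrative choice, not a required value.
xml_property mapreduce.map.output.compress true
xml_property mapreduce.map.output.compress.codec org.apache.hadoop.io.compress.SnappyCodec
xml_property mapreduce.output.fileoutputformat.compress true
xml_property mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.SnappyCodec
xml_property mapreduce.output.fileoutputformat.compress.type BLOCK
```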

3, Turn on Map output phase compression

Enabling compression of map output reduces the amount of data transferred between the map and reduce tasks of a job. The specific configuration is as follows:

Case practice:

1. Enable the data compression function of hive intermediate transmission

hive (default)>set hive.exec.compress.intermediate=true;

2. Enable the map output compression function in mapreduce

hive (default)>set mapreduce.map.output.compress=true;

3. Set the compression method of map output data in mapreduce

hive (default)>set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4. Execute query statement

hive (default)> select count(ename) name from emp;
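
The same four steps can be run non-interactively by passing the statements to hive -e. This is a minimal sketch, assuming the hive CLI is on PATH and the emp table exists; the function name map_compress_sql is hypothetical:

```shell
#!/bin/sh
# map_compress_sql QUERY: print the three SET statements from the steps above
# followed by QUERY, ready to pass to 'hive -e'.
map_compress_sql() {
    cat <<EOF
set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
$1
EOF
}

sql=$(map_compress_sql "select count(ename) name from emp;")
echo "$sql"

# Run the statements only where the hive CLI is available.
if command -v hive >/dev/null 2>&1; then
    hive -e "$sql"
fi
```

Settings applied this way last only for that one invocation, which matches the session-level scope of the interactive set commands.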

4, Enable Reduce output stage compression

When Hive writes output to a table, that output can also be compressed. The property hive.exec.compress.output controls this behavior. You may want to keep its default value of false in the default settings file so that output is an uncompressed plain text file by default, and then turn on output compression by setting the property to true in a query or execution script.

Case practice:

1. Enable the compression function of hive final output data

hive (default)>set hive.exec.compress.output=true;

2. Enable mapreduce final output data compression

hive (default)>set mapreduce.output.fileoutputformat.compress=true;

3. Set the final data output compression mode of mapreduce

hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4. Set the mapreduce final data output compression to block compression

hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;

5. Test whether the output result is a compressed file

hive (default)> insert overwrite local directory
 '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;
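
To check the result, look at the file names in the output directory. The sketch below counts the files that carry the .snappy extension, assuming the output files take the codec's default extension; the function name count_with_suffix is hypothetical:

```shell
#!/bin/sh
# count_with_suffix DIR SUFFIX: print how many regular files directly under
# DIR have names ending in SUFFIX.
count_with_suffix() {
    n=0
    for f in "$1"/*"$2"; do
        # If the glob matched nothing, $f is the literal pattern and -f fails.
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Output directory used by the insert overwrite statement above.
out=/opt/module/datas/distribute-result
if [ -d "$out" ]; then
    echo "snappy-compressed files: $(count_with_suffix "$out" .snappy)"
fi
```

A count of zero would suggest that one of the set statements did not take effect in the session that ran the insert.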

That concludes this post.



Posted by pavanpuligandla on Tue, 05 May 2020 23:29:55 -0700
