1. Compiling the Hadoop source with Snappy compression support
1.1 Resource preparation
- 1. CentOS networking
Configure CentOS so it can access the Internet; the Linux virtual machine should be able to ping www.baidu.com successfully.
Note: compile as the root user to avoid folder permission problems.
- 2. Package preparation (Hadoop source code, JDK 8, Maven, Snappy, Protobuf)
(1)hadoop-2.7.2-src.tar.gz
(2)jdk-8u144-linux-x64.tar.gz
(3)snappy-1.1.3.tar.gz
(4)apache-maven-3.0.5-bin.tar.gz
(5)protobuf-2.5.0.tar.gz
If you need these files, you can download them through the link shared by the blogger:
Link: https://pan.baidu.com/s/19lM5UgctzCgEkF5S7ZKBtA
Extraction code: drql
1.2 Package installation
Note: all operations must be performed as the root user.
- 1. Extract the JDK, configure the JAVA_HOME and PATH environment variables, and verify the Java version (check that the configuration succeeded as follows).
[root@hadoop001 software]# tar -zxf jdk-8u144-linux-x64.tar.gz -C /opt/module/
[root@hadoop001 software]# vi /etc/profile
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
[root@hadoop001 software]# source /etc/profile
Validation command: java -version
- 2. Extract Maven and configure MAVEN_HOME and PATH
[root@hadoop001 software]# tar -zxvf apache-maven-3.0.5-bin.tar.gz -C /opt/module/
[root@hadoop001 apache-maven-3.0.5]# vi /etc/profile
#MAVEN_HOME
export MAVEN_HOME=/opt/module/apache-maven-3.0.5
export PATH=$PATH:$MAVEN_HOME/bin
[root@hadoop001 software]# source /etc/profile
Validation command: mvn -version
1.3 Compiling the source code
- 1. Prepare the compilation environment
[root@hadoop001 software]# yum install svn
[root@hadoop001 software]# yum install autoconf automake libtool cmake
[root@hadoop001 software]# yum install ncurses-devel
[root@hadoop001 software]# yum install openssl-devel
[root@hadoop001 software]# yum install gcc*
- 2. Compile and install snappy
[root@hadoop001 software]# tar -zxvf snappy-1.1.3.tar.gz -C /opt/module/
[root@hadoop001 module]# cd snappy-1.1.3/
[root@hadoop001 snappy-1.1.3]# ./configure
[root@hadoop001 snappy-1.1.3]# make
[root@hadoop001 snappy-1.1.3]# make install
# View the snappy library files
[root@hadoop001 snappy-1.1.3]# ls -lh /usr/local/lib | grep snappy
- 3. Compile and install protobuf
[root@hadoop001 software]# tar -zxvf protobuf-2.5.0.tar.gz -C /opt/module/
[root@hadoop001 module]# cd protobuf-2.5.0/
[root@hadoop001 protobuf-2.5.0]# ./configure
[root@hadoop001 protobuf-2.5.0]# make
[root@hadoop001 protobuf-2.5.0]# make install
# Check the protobuf version to see if the installation succeeded
[root@hadoop001 protobuf-2.5.0]# protoc --version
- 4. Compile hadoop native
[root@hadoop001 software]# tar -zxvf hadoop-2.7.2-src.tar.gz
[root@hadoop001 software]# cd hadoop-2.7.2-src/
[root@hadoop001 software]# mvn clean package -DskipTests -Pdist,native -Dtar -Dsnappy.lib=/usr/local/lib -Dbundle.snappy
After successful execution, /opt/software/hadoop-2.7.2-src/hadoop-dist/target/hadoop-2.7.2.tar.gz is the newly generated binary installation package with Snappy compression support.
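As a quick sanity check, you can unpack the newly built package and ask Hadoop which native libraries it can load. The install path below (/opt/module/hadoop-2.7.2) is only an assumed example, and the exact output wording varies slightly between versions:

```
# Unpack the freshly built tarball (the target path is just an example)
[root@hadoop001 software]# tar -zxvf /opt/software/hadoop-2.7.2-src/hadoop-dist/target/hadoop-2.7.2.tar.gz -C /opt/module/
# List the native libraries Hadoop can see; the snappy entry should report "true"
[root@hadoop001 software]# /opt/module/hadoop-2.7.2/bin/hadoop checknative -a
```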
2. Hadoop compression configuration
2.1 Compression codecs supported by MapReduce
Compression format | Tool | Algorithm | File extension | Splittable |
---|---|---|---|---|
DEFLATE | none | DEFLATE | .deflate | no |
Gzip | gzip | DEFLATE | .gz | no |
bzip2 | bzip2 | bzip2 | .bz2 | yes |
LZO | lzop | LZO | .lzo | yes |
Snappy | none | Snappy | .snappy | no |
- To support multiple compression/decompression algorithms, Hadoop provides a corresponding codec for each format:
Compression format | Corresponding codec |
---|---|
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
gzip | org.apache.hadoop.io.compress.GzipCodec |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
LZO | com.hadoop.compression.lzo.LzopCodec |
Snappy | org.apache.hadoop.io.compress.SnappyCodec |
- Comparison of compression performance
Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
---|---|---|---|---|
gzip | 8.3GB | 1.8GB | 17.5MB/s | 58MB/s |
bzip2 | 8.3GB | 1.1GB | 2.4MB/s | 9.5MB/s |
LZO | 8.3GB | 2.9GB | 49.3MB/s | 74.6MB/s |
Snappy is not listed here; let's first look at Snappy's official open-source site:
http://google.github.io/snappy/
On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.
We can see that Snappy compresses at about 250 MB/s and decompresses at about 500 MB/s, which leaves the algorithms above far behind!
2.2 Compression parameter configuration
To enable compression in Hadoop, configure the following parameters (in mapred-site.xml unless noted otherwise; a sample configuration snippet follows the table):
Parameter | Default value | Stage | Recommendation |
---|---|---|---|
io.compression.codecs (configured in core-site.xml) | org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.Lz4Codec | Input compression | Hadoop uses the file extension to determine whether a given codec is supported |
mapreduce.map.output.compress | false | mapper output | Set this parameter to true to enable compression |
mapreduce.map.output.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | mapper output | Use LZO, LZ4, or snappy codecs to compress data at this stage |
mapreduce.output.fileoutputformat.compress | false | reducer output | Set this parameter to true to enable compression |
mapreduce.output.fileoutputformat.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | reducer output | Use a standard tool or codec such as gzip or bzip2 |
mapreduce.output.fileoutputformat.compress.type | RECORD | reducer output | Compression type used for SequenceFile output: NONE, RECORD, or BLOCK |
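For reference, here is a minimal sketch of how the map-output rows of this table could be set cluster-wide in mapred-site.xml; the property names come straight from the table, while the choice of SnappyCodec and the decision to set them globally are just an example:

```xml
<!-- Example mapred-site.xml fragment: compress map output with Snappy.
     Property names are from the table above; values are illustrative. -->
<configuration>
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
</configuration>
```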
3. Enabling Map output phase compression
Enabling compression at the map output stage reduces the amount of data transferred between the map and reduce tasks of a job. The specific configuration is as follows:
Case practice:
- 1. Enable the data compression function of hive intermediate transmission
hive (default)>set hive.exec.compress.intermediate=true;
- 2. Enable the map output compression function in mapreduce
hive (default)>set mapreduce.map.output.compress=true;
- 3. Set the compression method of map output data in mapreduce
hive (default)>set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
- 4. Execute query statement
hive (default)> select count(ename) name from emp;
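Before running the query, you can also confirm what the session will actually use: issuing set with just a property name makes Hive print its current value (the exact echo format may differ slightly across Hive versions):

```
hive (default)> set mapreduce.map.output.compress;
mapreduce.map.output.compress=true
hive (default)> set mapreduce.map.output.compress.codec;
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
```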
4. Enabling Reduce output phase compression
When Hive writes output to a table, that output can also be compressed. The property hive.exec.compress.output controls this behavior. You may want to keep the default value of false in the default settings file, so that the default output is an uncompressed plain-text file, and then turn compression on by setting the property to true in a query statement or execution script.
Case practice:
- 1. Enable the compression function of hive final output data
hive (default)>set hive.exec.compress.output=true;
- 2. Enable mapreduce final output data compression
hive (default)>set mapreduce.output.fileoutputformat.compress=true;
- 3. Set the final data output compression mode of mapreduce
hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
- 4. Set the mapreduce final data output compression to block compression
hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
- 5. Test whether the output result is a compressed file
hive (default)> insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;
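To confirm the result, list the export directory on the local filesystem; with SnappyCodec enabled the written files should carry the .snappy extension (the exact file names, such as 000000_0.snappy, depend on the job):

```
# Check the files Hive wrote to the local directory
[root@hadoop001 ~]# ls -lh /opt/module/datas/distribute-result
# Expect files ending in .snappy, e.g. 000000_0.snappy
```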
That's all for this sharing.
If you found it helpful, give it a like after reading and make it a habit!
Writing these posts is not easy; your support is what keeps me going, so don't forget to follow me!