Using LZO compression with split support in Hadoop

Keywords: Hadoop Hive Apache XML

1. Introduction:

Install LZO:

LZO is not shipped with Linux systems, so you need to download and install it yourself. At least three packages are involved:

lzo, lzop, and hadoop-gpl-packaging.

Adding an index:
The main purpose of hadoop-gpl-packaging is to create indexes for LZO-compressed files. Without an index, even if the compressed file is larger than the HDFS block size, it will still be processed as a single split.

2. Install lzo and generate data:
2.1 Generating uncompressed test data
Generate a test file larger than 128 MB. This ensures the file is still larger than the HDFS block size even after LZO compression, which makes it easy to verify the splitting behaviour later. A quick way is to repeatedly concatenate the file with itself (e.g. cat a a > b, then cat b b > a).
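The doubling trick above can be sketched as a short loop. This is an illustrative sketch: the seed line, file name, and 128 MB threshold are assumptions; substitute your own data.

```shell
# Sketch: double a seed file until it exceeds 128 MB.
# The seed record below is hypothetical sample data.
seed=page_views.dat
[ -f "$seed" ] || printf '1\tus\thigh\t2019-04-14\t1.2.3.4\texample.com\t/index\t100\n' > "$seed"
target=$((128 * 1024 * 1024))
# Each pass concatenates the file with itself, doubling its size.
while [ "$(wc -c < "$seed")" -lt "$target" ]; do
    cat "$seed" "$seed" > "$seed.tmp" && mv "$seed.tmp" "$seed"
done
ls -lh "$seed"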
2.2 Installation of lzo-related tools

[root@hadoop-01 ~]# yum install -y gcc gcc-c++ make cmake autoconf automake libtool
[root@hadoop-01 ~]# yum install -y svn ncurses-devel zlib-devel openssl openssl-devel
[root@hadoop-01 ~]# yum install -y lzo lzo-devel lzop

2.3 Compressing the test data with lzop
LZO compression: lzop -v file
LZO decompression: lzop -dv file

[hadoop@hadoop-01 data]$ lzop -v page_views.dat
compressing page_views.dat into page_views.dat.lzo

[hadoop@hadoop-01 data]$ du -sh page_views.dat.lzo 
276M    page_views.dat.lzo

#page_views.dat has been compressed into lzo format

3. Compiling hadoop-lzo

3.1 Download and configure hadoop-lzo

[hadoop@hadoop-01 software]$ wget https://github.com/twitter/hadoop-lzo/archive/master.zip

#Decompression:
[hadoop@hadoop-01 software]$ unzip master.zip

#Enter the unzipped directory
[hadoop@hadoop-01 app]$ cd hadoop-lzo-master/
[hadoop@hadoop-01 hadoop-lzo-master]$ 

#Because our Hadoop version is 2.6.0, change hadoop.current.version in pom.xml to 2.6.0:
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.current.version>2.6.0</hadoop.current.version>
    <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
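Editing pom.xml by hand works fine; as a sketch, the same change can also be scripted with sed, which is handy when rebuilding repeatedly. The stub pom created below is purely for illustration when no real pom.xml is present.

```shell
# Sketch: pin hadoop.current.version in pom.xml non-interactively with sed.
# (For illustration only, create a minimal stub if no pom.xml exists here.)
[ -f pom.xml ] || printf '<properties>\n  <hadoop.current.version>2.4.0</hadoop.current.version>\n</properties>\n' > pom.xml
# Replace whatever version is currently set with 2.6.0.
sed -i 's|<hadoop.current.version>[^<]*</hadoop.current.version>|<hadoop.current.version>2.6.0</hadoop.current.version>|' pom.xml
grep 'hadoop.current.version' pom.xml
```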

[hadoop@hadoop-01 hadoop-lzo-master]$ export CFLAGS=-m64
[hadoop@hadoop-01 hadoop-lzo-master]$ export CXXFLAGS=-m64

#Adjust these to the actual lzo include/lib paths on your machine
[hadoop@hadoop-01 hadoop-lzo-master]$ export C_INCLUDE_PATH=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lzo/include
[hadoop@hadoop-01 hadoop-lzo-master]$ export LIBRARY_PATH=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lzo/lib

3.2 Building the source with Maven

[hadoop@hadoop-01 hadoop-lzo-master]$ mvn clean package -Dmaven.test.skip=true

[INFO] Building jar: /home/hadoop/software/hadoop-lzo-master/target/hadoop-lzo-0.4.21-SNAPSHOT-javadoc.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8:43.929s
[INFO] Finished at: Sun Apr 14 16:42:02 CST 2019
[INFO] Final Memory: 25M/61M
[INFO] ------------------------------------------------------------------------

#Enter the target folder
[hadoop@hadoop-01 hadoop-lzo-master]$ cd target/native/Linux-amd64-64/

#Copy all the native library files to ~/app/hadoop-lzo-files, then into Hadoop's native lib directory
[hadoop@hadoop-01 Linux-amd64-64]$ mkdir ~/app/hadoop-lzo-files
[hadoop@hadoop-01 Linux-amd64-64]$ tar -cBf - -C lib . | tar -xBvf - -C ~/app/hadoop-lzo-files

[hadoop@hadoop-01 hadoop-lzo-files]$ cp ~/app/hadoop-lzo-files/libgplcompression* $HADOOP_HOME/lib/native/

Next, the hadoop-lzo jar itself must be put on Hadoop's classpath.

#Copy hadoop-lzo-0.4.21-SNAPSHOT.jar into the common directory of every Hadoop node

[hadoop@hadoop-01 hadoop-lzo-master]$ cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar ~/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/ 

[hadoop@hadoop-01 hadoop-lzo-master]$ ll ~/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo*
-rw-rw-r--. 1 hadoop hadoop 180667 Apr 14 08:52 /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar
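Before moving on, it is worth confirming both artifacts landed where Hadoop looks for them. A minimal sanity-check sketch, assuming $HADOOP_HOME is set:

```shell
# Sketch: verify the native library and the hadoop-lzo jar are in place.
# Prints "MISSING: ..." for anything not found instead of failing hard.
for f in "$HADOOP_HOME"/lib/native/libgplcompression* \
         "$HADOOP_HOME"/share/hadoop/common/hadoop-lzo-*.jar; do
    [ -e "$f" ] && echo "found: $f" || echo "MISSING: $f"
done
```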

3.3 Configure core-site.xml

# Stop hadoop
[hadoop@hadoop-01 hadoop-lzo-master]$ stop-all.sh 

#Edit core-site.xml and add or modify the following
[hadoop@hadoop-01 ~]$ vim ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/core-site.xml

-----------------------start-------------------------------
<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,
        org.apache.hadoop.io.compress.DefaultCodec,
        org.apache.hadoop.io.compress.BZip2Codec,
        org.apache.hadoop.io.compress.SnappyCodec,
        com.hadoop.compression.lzo.LzoCodec,
        com.hadoop.compression.lzo.LzopCodec
    </value>
</property>
<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
-----------------------End-------------------------------
	
	
#Explanation: this registers the com.hadoop.compression.lzo.LzoCodec and
#com.hadoop.compression.lzo.LzopCodec compression classes.

io.compression.codec.lzo.class must be set to LzoCodec, not LzopCodec; otherwise the compressed files will not support splitting.
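Since a typo here only surfaces later as a missing-codec error, a quick grep of the config before restarting can save a debugging round-trip. A sketch, assuming $HADOOP_HOME points at the install:

```shell
# Sketch: confirm LzoCodec is registered in core-site.xml before restarting.
conf="$HADOOP_HOME/etc/hadoop/core-site.xml"
if grep -q 'com.hadoop.compression.lzo.LzoCodec' "$conf" 2>/dev/null; then
    echo "lzo codec configured"
else
    echo "lzo codec NOT configured in $conf"
fi
```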

3.4 Configure mapred-site.xml

#Edit mapred-site.xml to add or modify the following 

[hadoop@hadoop-01 ~]$ vim ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/mapred-site.xml 

-----------------------start-------------------------------
<!-- Compression of intermediate (map) output -->
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

<!-- Compression of final job output -->
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
-----------------------End-------------------------------


#Start hadoop
[hadoop@hadoop-01 ~]$ start-all.sh

4. LZO File Testing
4.1 LZO files do not support splitting by default

#Create a test table over LZO-compressed files. If the hadoop-lzo jar is missing from Hadoop's common directory, a ClassNotFoundException will be thrown for DeprecatedLzoTextInputFormat.

create table g6_access_copy_lzo (
cdn string,
region string,
level string,
time string,
ip string,
domain string,
url string,
traffic bigint
)row format delimited fields terminated by '\t'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

#Load the lzo-format test data:
LOAD DATA LOCAL INPATH '/home/hadoop/data/page_views.dat.lzo' OVERWRITE INTO TABLE g6_access_copy_lzo;

#View data:
[hadoop@hadoop-01 hadoop-2.6.0-cdh5.7.0]$ hadoop fs -du -s -h /user/hive/warehouse/myhive.db/g6_access_copy_lzo
275.9 M  275.9 M  /user/hive/warehouse/myhive.db/g6_access_copy_lzo

#Query testing
select count(1) from g6_access_copy_lzo;

//Console log excerpt:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 11.37 sec   HDFS Read: 289332141 HDFS Write: 8 SUCCESS

The log shows only one map task, even though our data file is much larger than 128 MB, which confirms that an LZO file does not support splitting by default.

4.2 Making LZO files splittable

Note: unless you load an existing lzo file directly, you need to enable output compression and set the codec to LzopCodec, because LOAD DATA does not change a file's format or compression.

#With output compression enabled, the codec must be set to LzopCodec. Files written by LzoCodec get the .lzo_deflate suffix and cannot be indexed.

SET hive.exec.compress.output=true;

SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;

#Create LZO Compressed File Test Table
create table g6_access_copy_lzo_split
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
as select *  from g6_access_copy_lzo;

#Build the LZO file index using the tool class in the jar we built earlier
hadoop jar ~/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
    com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/myhive.db/g6_access_copy_lzo_split

#Listing the HDFS data directory now shows a .index file alongside the lzo file.
[hadoop@hadoop-01 hadoop-2.6.0-cdh5.7.0]$ hadoop fs -ls /user/hive/warehouse/myhive.db/g6_access_copy_lzo_split
Found 2 items

-rwxr-xr-x   1 hadoop supergroup  190593490 2019-04-23 17:32 /user/hive/warehouse/myhive.db/g6_access_copy_lzo_split/000000_0.lzo

-rw-r--r--   1 hadoop supergroup      22256 2019-04-23 17:32 /user/hive/warehouse/myhive.db/g6_access_copy_lzo_split/000000_0.lzo.index


#Implementing statistical analysis
select count(1) from g6_access_copy_lzo_split;

//Console logs:
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 9.37 sec   HDFS Read: 190593490 HDFS Write: 8 SUCCESS

The log shows two map tasks this time; in other words, the data supports splitting once the index has been built.
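The map-task count lines up with simple arithmetic: once a file is splittable, the input format creates roughly ceil(file_size / block_size) splits. A back-of-the-envelope sketch, assuming the default 128 MB HDFS block size:

```shell
# Sketch: expected split count = ceil(file_size / block_size).
file_size=190593490                 # size of 000000_0.lzo listed above
block_size=$((128 * 1024 * 1024))   # assumed default HDFS block size
splits=$(( (file_size + block_size - 1) / block_size ))
echo "$splits"   # 2, matching the two map tasks in the log
```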

Conclusion:
Among the compression formats commonly used in big data, only bzip2 natively supports splitting; LZO supports splitting only after the file has been indexed.

Posted by oaf357 on Tue, 23 Apr 2019 17:00:34 -0700