3. File storage format
1. Column Storage and Row Storage
(1) Features of column storage:
Because the data for each field is stored together, the amount of data read is greatly reduced when a query needs only a few fields.
All values of a field share the same data type, so compression algorithms can be tailored to each column.
(2) Features of row storage:
When querying an entire row that meets the criteria, column storage must visit each field's clustered block to collect the value of every column.
Row storage only needs to locate one value; the rest of the row lies adjacent to it, so row storage serves such queries faster.
(3) Storage basis of the common formats:
The ORC and PARQUET storage formats are column-based;
TEXTFILE and SEQUENCEFILE are row-based (a minimal declaration sketch follows below).
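In Hive, the storage basis is chosen per table with the STORED AS clause. A minimal sketch; the table and column names here are illustrative and not part of the experiments below:
-- Row-based storage: whole rows are kept together.
create table demo_row(id string, msg string)
stored as textfile;
-- Column-based storage: each column's values are kept together.
create table demo_columnar(id string, msg string)
stored as orc;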
2. TextFile Format
Default format, no data compression, high disk overhead, high data parsing overhead.
It can be used in combination with Gzip or Bzip2 compression. With Gzip, however, Hive cannot split the data, so it cannot be processed in parallel.
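As a sketch of the Gzip point (the .gz file name here is illustrative): Hive reads gzip-compressed text files transparently, but each .gz file is consumed as one unsplittable input.
create table log_text_gz(line string)
stored as textfile;
-- Loaded as-is; Hive decompresses the file on read but cannot split it,
-- so the whole file is handled by a single task.
load data local inpath '/root/hivedata/log.data.gz' into table log_text_gz;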
3. ORC Format
Each ORC file consists of one or more stripes, and each stripe has three parts:
Index Data, Row Data, and Stripe Footer.
(1) Index Data:
A lightweight index, with one index entry every 10,000 rows by default.
The index simply records the offset of each field of a row within the Row Data.
(2) Row Data:
Stores the actual data: a group of rows is taken first, and those rows are then stored column by column.
Each column is encoded and split into multiple Streams for storage.
(3) Stripe Footer:
Stores metadata about each Stream: its type, length, and so on.
Each file has a File Footer, which records the number of rows in each Stripe, the data type information of each Column, and so on.
At the very end of each file is a PostScript, which records the compression type of the whole file and the length of the File Footer.
When reading a file, the reader first seeks to the end of the file to read the PostScript and parse the File Footer's length from it, then reads the File Footer and parses each Stripe's information from it, and finally reads each Stripe, from back to front.
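These internals can be tuned through table properties. A minimal sketch, assuming the standard ORC table properties orc.compress and orc.row.index.stride (the stride value of 10000 matches the default index interval described above; the table and column names are illustrative):
create table log_orc_zlib(track_time string, url string)
stored as orc
-- ZLIB compression for the file; one index entry every 10,000 rows.
tblproperties("orc.compress"="ZLIB", "orc.row.index.stride"="10000");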
4. Parquet Format
Parquet files are stored in binary form, so they cannot be read directly. A Parquet file contains both the data and its metadata, so the Parquet format is self-describing.
When storing Parquet data, the row-group size is usually set to the HDFS block size, because a block is the smallest unit of data each Mapper task processes.
This allows each row group to be handled by one Mapper task, increasing the parallelism of task execution.
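A minimal sketch of aligning row groups with the block size, assuming Hive's Parquet writer honors the parquet.block.size setting (128 MB is a common HDFS block size, not a value from the text; the table name is illustrative):
-- 134217728 bytes = 128 MB; applies to Parquet files written in this session.
set parquet.block.size=134217728;
create table log_parquet_128m(track_time string, url string)
stored as parquet;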
4. Comparison Experiments on Mainstream File Storage Formats
1. TextFile
(1) Create a table stored in TEXTFILE format
create table log_text(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as textfile;
(2) Load data into the table
load data local inpath '/root/hivedata/log.data' into table log_text;
(3) View the data size in the table
dfs -du -h /user/hive/warehouse/log_text;
2. ORC
(1) Create a table stored in ORC format
create table log_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc;
(2) Load data into the table (a plain text file cannot be loaded directly into an ORC table, so the data is inserted from log_text)
insert into table log_orc select * from log_text;
(3) View the data size in the table
dfs -du -h /user/hive/warehouse/log_orc/;
3. Parquet
(1) Create a table stored in PARQUET format
create table log_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as parquet;
(2) Load data into the table
insert into table log_parquet select * from log_text;
(3) View the data size in the table
dfs -du -h /user/hive/warehouse/log_parquet/;
4. Summary of the compression ratios of the stored files:
ORC > Parquet > TextFile
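To reproduce the comparison on your own data, the three warehouse directories can be checked in a single command (a sketch; the paths follow the defaults used above):
dfs -du -h /user/hive/warehouse/log_text /user/hive/warehouse/log_orc /user/hive/warehouse/log_parquet;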
5. Query speed test for stored files:
select count(*) from log_text;
select count(*) from log_orc;
select count(*) from log_parquet;
// Summary: query speeds are similar across the three formats.
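One caveat when reading such results: count(*) touches little per-column data, so it does not show the columnar advantage. A follow-up query that projects a single column (not part of the original test) would be expected to favor the columnar formats:
-- Only the url column's streams need to be read from the ORC file.
select count(url) from log_orc;
-- The entire row must be parsed in the text file.
select count(url) from log_text;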