Hive storage format
-
textfile
hive's default storage format
Storage: row storage
Disk overhead, data parsing overhead
Compressed text file hive cannot be merged and split -
SequenceFile
Binary files are serialized into files in the form of key and value
Storage: row storage
Separable compression
Generally selected block compression
The advantage is that files and mapfile s in Hadoop APIs are compatible -
rcfile
Storage: Data is divided into rows, blocks and columns.
Compressed Fast Column Access
Reading records involves as little block s as possible
Reading the required columns requires only reading the header definition of each row group.
The operational performance of reading full data may not have obvious advantages over sequencefile -
orc
Storage: Data is divided into rows, blocks and columns.
Compressed Fast Column Access
Efficiency is higher than rcfile, which is an improved version of rcfile -
Custom Format
Users can customize input and output formats by implementing input format and output format
Several Ways for Hive to Import Data
- Import from the local file system
load data local inpath 'customer .data' into table customer;
- Import from HDFS
load data inpath '/hive/customer .data' into table customer;
- Query out the corresponding data import from other tables
insert into table customer select * from customer_tmp;
Convert storage format to orc
Because data of textfile type can not be saved directly in orc type tables, we can do some transformation according to the above several ways of importing data. First, we can import it into a textfile type table and then import it into another table by query.
Example:
use tpcds_orc;
drop table if exists customer_tmp;
create table customer_tmp
(
c_customer_sk int ,
c_customer_id char(16) ,
c_current_cdemo_sk int ,
c_current_hdemo_sk int ,
c_current_addr_sk int ,
c_first_shipto_date_sk int ,
c_first_sales_date_sk int ,
c_salutation char(10) ,
c_first_name char(20) ,
c_last_name char(30) ,
c_preferred_cust_flag char(1) ,
c_birth_day int ,
c_birth_month int ,
c_birth_year int ,
c_birth_country varchar(20) ,
c_login char(13) ,
c_email_address char(50) ,
c_last_review_date char(10)
)
row format delimited fields terminated by '|';
load data local inpath '/data1/tpcds/data100/customer.dat' into table customer_tmp;
drop table if exists customer;
create table customer
(
c_customer_sk int ,
c_customer_id char(16) ,
c_current_cdemo_sk int ,
c_current_hdemo_sk int ,
c_current_addr_sk int ,
c_first_shipto_date_sk int ,
c_first_sales_date_sk int ,
c_salutation char(10) ,
c_first_name char(20) ,
c_last_name char(30) ,
c_preferred_cust_flag char(1) ,
c_birth_day int ,
c_birth_month int ,
c_birth_year int ,
c_birth_country varchar(20) ,
c_login char(13) ,
c_email_address char(50) ,
c_last_review_date char(10)
)
row format delimited fields terminated by '|'
stored as orc ;
insert into table customer select * from customer_tmp;
drop table customer_tmp;