Converting Hive database storage format to orc

Keywords: hive Hadoop

Hive storage format

  • textfile

    hive's default storage format
    Storage: row storage
    Disk overhead, data parsing overhead
    Compressed text file hive cannot be merged and split

  • SequenceFile

    Binary files are serialized into files in the form of key and value
    Storage: row storage
    Separable compression
    Generally selected block compression
    The advantage is that files and mapfile s in Hadoop APIs are compatible

  • rcfile

    Storage: Data is divided into rows, blocks and columns.
    Compressed Fast Column Access
    Reading records involves as little block s as possible
    Reading the required columns requires only reading the header definition of each row group.
    The operational performance of reading full data may not have obvious advantages over sequencefile

  • orc

    Storage: Data is divided into rows, blocks and columns.
    Compressed Fast Column Access
    Efficiency is higher than rcfile, which is an improved version of rcfile

  • Custom Format

    Users can customize input and output formats by implementing input format and output format

Several Ways for Hive to Import Data

  • Import from the local file system
load data local inpath 'customer .data' into table customer;
  • Import from HDFS
load data inpath '/hive/customer .data' into table customer;
  • Query out the corresponding data import from other tables
insert into table customer select * from customer_tmp;

Convert storage format to orc

Because data of textfile type can not be saved directly in orc type tables, we can do some transformation according to the above several ways of importing data. First, we can import it into a textfile type table and then import it into another table by query.

Example:

use tpcds_orc;
drop table if exists customer_tmp;
create table customer_tmp
(
    c_customer_sk             int                           ,
    c_customer_id             char(16)                      ,
    c_current_cdemo_sk        int                           ,
    c_current_hdemo_sk        int                           ,
    c_current_addr_sk         int                           ,
    c_first_shipto_date_sk    int                           ,
    c_first_sales_date_sk     int                           ,
    c_salutation              char(10)                      ,
    c_first_name              char(20)                      ,
    c_last_name               char(30)                      ,
    c_preferred_cust_flag     char(1)                       ,
    c_birth_day               int                           ,
    c_birth_month             int                           ,
    c_birth_year              int                           ,
    c_birth_country           varchar(20)                   ,
    c_login                   char(13)                      ,
    c_email_address           char(50)                      ,
    c_last_review_date        char(10)
)
row format delimited fields terminated by '|';
load data local inpath '/data1/tpcds/data100/customer.dat' into table customer_tmp;
drop table if exists customer;
create table customer
(
    c_customer_sk             int                           ,
    c_customer_id             char(16)                      ,
    c_current_cdemo_sk        int                           ,
    c_current_hdemo_sk        int                           ,
    c_current_addr_sk         int                           ,
    c_first_shipto_date_sk    int                           ,
    c_first_sales_date_sk     int                           ,
    c_salutation              char(10)                      ,
    c_first_name              char(20)                      ,
    c_last_name               char(30)                      ,
    c_preferred_cust_flag     char(1)                       ,
    c_birth_day               int                           ,
    c_birth_month             int                           ,
    c_birth_year              int                           ,
    c_birth_country           varchar(20)                   ,
    c_login                   char(13)                      ,
    c_email_address           char(50)                      ,
    c_last_review_date        char(10)
)
row format delimited fields terminated by '|'
stored as orc ;
insert into table customer select * from customer_tmp;
drop table customer_tmp;

Posted by WendyLady on Sun, 21 Apr 2019 18:30:35 -0700