Hive Integrated HBase Detailed

Keywords: Big Data HBase hive Apache Hadoop

Reproduced from: https://www.cnblogs.com/MOBIN/p/5704001.html

1. Create HBase tables from Hive

Create a Live table pointing to HBase using the HQL statement

CREATE TABLE hbase_table_1(key int, value string) //Table name hbase_table_1 in Hive
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  //Designated Storage Processor
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") //Declare column family, column name
TBLPROPERTIES ("hbase.table.name" = "xyz", "hbase.mapred.output.outputtable" = "xyz");  
//hbase.table.name declares the HBase table name, which defaults to the same table name as Hive for optional attributes.
//Hbase. mapred. output. output table specifies the table to be written when inserting data, and if you need to insert data into the table later, you need to specify the value.

The properties of the newly created Base table can be viewed through the HBase shell

$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Version: 0.20.3, r902334, Mon Jan 25 13:13:08 PST 2010
hbase(main):001:0> list
xyz                                                                                                           
1 row(s) in 0.0530 seconds
hbase(main):002:0> describe "xyz"
DESCRIPTION                                                           ENABLED                               
  {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', COMPRESSION => 'NONE', VE true                                  
  RSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY =>                                       
  'false', BLOCKCACHE => 'true'}]}                                                                            
1 row(s) in 0.0220 seconds

hbase(main):003:0> scan "xyz"
ROW                          COLUMN+CELL                                                                      
0 row(s) in 0.0060 seconds

Insert data into HBase tables using HQL

INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98;

View the inserted data on the HBase side

hbase(main):009:0> scan "xyz"
ROW                          COLUMN+CELL                                                                      
 98                          column=cf1:val, timestamp=1267737987733, value=val_98                            
1 row(s) in 0.0110 seconds

2. Mapping HBase from Hive

Create a Hive table that points to an existing HBase table

CREATE EXTERNAL TABLE hbase_table_2(key int, value string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
TBLPROPERTIES("hbase.table.name" = "some_existing_table", "hbase.mapred.output.outputtable" = "some_existing_table");

The Hive table has an external table, so deleting the table does not delete the data in the HBase table

Be careful:

If no key is specified when creating a table or mapping table, the first column defaults to row keys.
There is no concept of timestamp in the Hive table corresponding to HBase. The default return is the latest version of the value.
Since there is no data type information in HBase, it is converted to String type when storing data.

3. Mapping of Multi-column and Multi-column Families

The following table: value1 and value2 are from the b c column corresponding to column family a, and value3 is from the column corresponding to column family d

CREATE TABLE hbase_table_1(key int, value1 string, value2 int, value3 int) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,a:b,a:c,d:e"
);
INSERT OVERWRITE TABLE hbase_table_1 SELECT foo, bar, foo+1, foo+2 
FROM pokes WHERE foo=98 OR foo=100;

4. Mapping rules of Hive Map type in HBase

The following table: Map the HBase table through Hive's Map data type so that each row can have different column combinations, column names correspond to key s in maps, and column values correspond to values in maps

CREATE TABLE hbase_table_1(value map<string,int>, row_key int) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "cf:,:key"
);
INSERT OVERWRITE TABLE hbase_table_1 SELECT map(bar, foo), foo FROM pokes 
WHERE foo=98 OR foo=100;

cf is a column family whose column name corresponds to bar in map and column value corresponds to foo in map.

Viewing data under HBase

hbase(main):012:0> scan "hbase_table_1"
ROW                          COLUMN+CELL                                                                      
 100                         column=cf:val_100, timestamp=1267739509194, value=100                            
 98                          column=cf:val_98, timestamp=1267739509194, value=98                              
2 row(s) in 0.0080 seconds

View data under Hive

hive> select * from hbase_table_1;
Total MapReduce jobs = 1
Launching Job 1 out of 1
...
OK
{"val_100":100}    100
{"val_98":98}    98
Time taken: 3.808 seconds

Note: Because the key in the map is used as the column name of HBase, the key type in the map must be String type.

The following mapping statements all report errors

CREATE TABLE hbase_table_1(key int, value map<int,int>) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,cf:"
);

Reason: The key in the map must be String

CREATE TABLE hbase_table_1(key int, value string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,cf:"
);

Cause: When the column family in hbase.columns.mapping is empty (like cf:), the corresponding data type in Hive is map, and the corresponding String in this statement causes an error.

5.Hive also supports simple composite line keys

Here's how to create a Live table pointing to HBase with two fields for row keys and ~separating between fields

CREATE EXTERNAL TABLE delimited_example(key struct<f1:string, f2:string>, value string) 
ROW FORMAT DELIMITED 
COLLECTION ITEMS TERMINATED BY '~' 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'=':key,f:c1');

6. Attention should be paid to integrating HBase tables with Hive

Pre-partitioning the HBase table, increasing the parallelism of its MapReduce job and designing rowkey reasonably to distribute it as far as possible on the pre-partitioned Region by setting hbase. client. scanner. caching a reasonable scan cache

Reference material:

Hive HBase Integration

Posted by maxpagels on Sat, 02 Feb 2019 02:45:15 -0800

Programmer Group