Hive HQL Data Operation and Data Query

Keywords: hive Hadoop

HQL Data Operation

The source of our content is Hadoop Mass Data Processing Technology Details and Project Practice, People's Posts and Telecommunications Publishing House

1. Loading data

LOAD DATA INPATH '/user/hadoop/o' INTO TABLE test;

If the test table is a partition table, specify a distinction in HQL:

LOAD DATA INPATH '/USER/HADOOP/O' OVERWRITE INTO TABLE test3 PARTITION(part="a");

Loading table tables directly from the local

LOAD DATA LOCAL INPATH '/HOME/HADOOP/O' INTO TABLE test;

Hive does not validate the data format when loading data. Users need to ensure that the data format is consistent with the format defined by the table.

2. Insert data into tables through query statements

INSERT OVERWRITE TABLE test SELECT * FROM source;

Similarly, when test is a partition table, you must specify partitions

INSERT OVERWRITE TABLE test PARTITION (part='a') SELECT id,name FROM source;

Hive started supporting INSERT INTO after 0.7.

Multiple disjoint outputs are generated by one query

FROM source
INSERT OVERWRITE TABLE test PARTITION (part='a')
SELECT id,name WHERE id>=0 AND id<100
INSERT OVERWRITE TABLE test PARTITION (part='b')
SELECT id,name WHERE id>=100 AND id<200
...

This inserts eligible data into each partition of the test table by querying the source table only once. To use this feature, the FROM clause should be written in front.

3. Insert data into tables using dynamic partitioning

Support automatic inference of partitions to be created based on query parameters

INSERT OVERWRITE TABLE test PARTITION(time) SELECT id,modify_time FROM source;

Hive uses the last query field in the SELECT statement as the basis for dynamic partitioning, rather than the field name. If n dynamic partitioning fields are specified, Hive will use the last n fields in the select statement as the basis for dynamic partitioning.
In the above statement, Hive creates partitions based on different values of modify_time.
Hive does not open dynamic partition by default. Open statement:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nostrict; Allow all partitions to be dynamic
  • hive.exec.max.dynamic.partitions.pernode The maximum number of partitions that each Mapper or Reducer can create
  • Maximum number of partitions that a dynamic partition creation statement can create
  • hive.exec.max.created.files The maximum number of files that a MapReduce job can create

4. Loading data through CTAS

CREATE TABLE test AS SELECT id,name FROM source;

5. Export data

INSERT OVERWRITE DIRECTORY '/user/hadoop/r' SELECT * FROM test;
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/r' SELECT * FROM test;

If the data in the Hive table is exactly the data format that the user needs, it can be copied directly.

hadoop dfs -cp /user/hive/warehouse/source_table /user/hadoop/

II. Data Query

1.SELECT

SELECT col1,col2 FROM table;
SELECT t.col1 c1,t.col2 c2 FROM table t;

2. Nested Query

SELECT l.name,r.course FROM (SELECT id,name FROM left) l JOIN (select id ,course FROM right) r ON l.id=r.id;

3. Regular expression selection column

SELECT'user. *'FROM test; for example, column: user.name,user.age will be returned

4. Limit quantity

SELECT * FROM test LIMIT 100;

5. Need to process a column value

SELECT id,name,sex,
CASE
 WHEN sex='M' THEN 'male'
 WHEN sex='F' THEN 'female'
 ELSE 'Invalid data'
END
FROM student;

6. Conditional constraints

Support common =,<>,!=, A IS NULL, A [NOT] LIKE B, A RLIKE B, A REGEXP B, etc.

SELECT * FROM student WHERE age=18;

7. Grouping and aggregation

SELECT COUNT(*) FROM student GROUP BY age;
SELECT AVG(age) FROM student GROUP BY classId;
SELECT name,AVG(age) FROM student where sex='F' GROUP BY classId HAVING AVG(age)>18;

8.JOIN

INNER JOIN Intranet Connection
 LEFt/RIGHT OUTER JOIN Left/Right External Connection
 FULL OUTER JOIN External Connection
 LEFT-SEMI JOIN Left Half Connection
 In the map-side JOIN map stage, small tables are read into memory and joined directly on the map side. Users can turn on automatic optimization by setting hive.auto.convert.join=true, and define the size of small tables by hive.mapjoin.smalltable.filesize, which defaults to 250000 bytes.
Multi-table JOIN

9.ORDER BY and SORTBY

When the number of reducer s is 1, they are the same.
When the number of reducers is more than one, the output will overlap. SORT BY controls the sort within each reducer.

10.DISTRIBUTE BY and SORTBY

DISTRIBUTE BYSome data can be controlled to enter the same Reducer,This goes by SORT BYLater, the result of global ordering can be obtained.
SELECT col1,col2 FROM ss DISTRIBUTE BY col1 SORT BY col1,col2;

11.CLUSTER BY

If DISTRIBUTE BY and SORTBY statements are used, the columns involved in DISTRIBUTE BY and SORTBY are identical, and in ascending order, CLSTER BY can be used instead of DISTRIBUTE BY and SORTBY.

12. Bucketing and sampling

Sometimes data needs to be sampled, and Hive provides bucket sampling of tables.

SELECT * FROM  test TABLESAMPLE(BUCKET 3 OUT OF 10 ON id);

For BUCKET x OUT OF Y ON z, y denotes y barrels, x denotes x barrels, and Z denotes barrels on the basis of hashing the hash values of Z columns divided by the remainder of Y. If Z is not specified, random column sampling can be used.

SELECT * FROM test TABLESAMPLE(BUCKET 3 OUT OF 10 ON RAND());

When tabulating, you can specify a bucket table, which is more efficient in sampling.

set hive.enforce.bucketing=true;
CREATE TABLE buckettable (id INT) CLUSTERED BY (id) INTO 4 BUCKETS;

The table will be divided into four buckets and the INSERT statement will be executed

INSET OVERWRITE TABLE buckettable SELECT * FROM source;

The data will be divided into four files stored under the table path, each representing a bucket.

13.UNION ALL

SELECT r.id,r.price
FROM(
 SELECT m.id,m.price FROM monday m
UNION ALL
 SELECT t.id,t.price FROM tuesday t)r

Hive does not support direct UNION ALL and must perform nested queries.

Posted by altis88 on Tue, 02 Jul 2019 15:05:17 -0700