HQL Data Operation
The source of our content is Hadoop Mass Data Processing Technology Details and Project Practice, People's Posts and Telecommunications Publishing House
1. Loading data
LOAD DATA INPATH '/user/hadoop/o' INTO TABLE test;
If the test table is a partition table, specify a distinction in HQL:
LOAD DATA INPATH '/USER/HADOOP/O' OVERWRITE INTO TABLE test3 PARTITION(part="a");
Loading table tables directly from the local
LOAD DATA LOCAL INPATH '/HOME/HADOOP/O' INTO TABLE test;
Hive does not validate the data format when loading data. Users need to ensure that the data format is consistent with the format defined by the table.
2. Insert data into tables through query statements
INSERT OVERWRITE TABLE test SELECT * FROM source;
Similarly, when test is a partition table, you must specify partitions
INSERT OVERWRITE TABLE test PARTITION (part='a') SELECT id,name FROM source;
Hive started supporting INSERT INTO after 0.7.
Multiple disjoint outputs are generated by one query
FROM source
INSERT OVERWRITE TABLE test PARTITION (part='a')
SELECT id,name WHERE id>=0 AND id<100
INSERT OVERWRITE TABLE test PARTITION (part='b')
SELECT id,name WHERE id>=100 AND id<200
...
This inserts eligible data into each partition of the test table by querying the source table only once. To use this feature, the FROM clause should be written in front.
3. Insert data into tables using dynamic partitioning
Support automatic inference of partitions to be created based on query parameters
INSERT OVERWRITE TABLE test PARTITION(time) SELECT id,modify_time FROM source;
Hive uses the last query field in the SELECT statement as the basis for dynamic partitioning, rather than the field name. If n dynamic partitioning fields are specified, Hive will use the last n fields in the select statement as the basis for dynamic partitioning.
In the above statement, Hive creates partitions based on different values of modify_time.
Hive does not open dynamic partition by default. Open statement:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nostrict; Allow all partitions to be dynamic
- hive.exec.max.dynamic.partitions.pernode The maximum number of partitions that each Mapper or Reducer can create
- Maximum number of partitions that a dynamic partition creation statement can create
- hive.exec.max.created.files The maximum number of files that a MapReduce job can create
4. Loading data through CTAS
CREATE TABLE test AS SELECT id,name FROM source;
5. Export data
INSERT OVERWRITE DIRECTORY '/user/hadoop/r' SELECT * FROM test;
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/r' SELECT * FROM test;
If the data in the Hive table is exactly the data format that the user needs, it can be copied directly.
hadoop dfs -cp /user/hive/warehouse/source_table /user/hadoop/
II. Data Query
1.SELECT
SELECT col1,col2 FROM table;
SELECT t.col1 c1,t.col2 c2 FROM table t;
2. Nested Query
SELECT l.name,r.course FROM (SELECT id,name FROM left) l JOIN (select id ,course FROM right) r ON l.id=r.id;
3. Regular expression selection column
SELECT'user. *'FROM test; for example, column: user.name,user.age will be returned
4. Limit quantity
SELECT * FROM test LIMIT 100;
5. Need to process a column value
SELECT id,name,sex,
CASE
WHEN sex='M' THEN 'male'
WHEN sex='F' THEN 'female'
ELSE 'Invalid data'
END
FROM student;
6. Conditional constraints
Support common =,<>,!=, A IS NULL, A [NOT] LIKE B, A RLIKE B, A REGEXP B, etc.
SELECT * FROM student WHERE age=18;
7. Grouping and aggregation
SELECT COUNT(*) FROM student GROUP BY age;
SELECT AVG(age) FROM student GROUP BY classId;
SELECT name,AVG(age) FROM student where sex='F' GROUP BY classId HAVING AVG(age)>18;
8.JOIN
INNER JOIN Intranet Connection LEFt/RIGHT OUTER JOIN Left/Right External Connection FULL OUTER JOIN External Connection LEFT-SEMI JOIN Left Half Connection In the map-side JOIN map stage, small tables are read into memory and joined directly on the map side. Users can turn on automatic optimization by setting hive.auto.convert.join=true, and define the size of small tables by hive.mapjoin.smalltable.filesize, which defaults to 250000 bytes. Multi-table JOIN
9.ORDER BY and SORTBY
When the number of reducer s is 1, they are the same. When the number of reducers is more than one, the output will overlap. SORT BY controls the sort within each reducer.
10.DISTRIBUTE BY and SORTBY
DISTRIBUTE BYSome data can be controlled to enter the same Reducer,This goes by SORT BYLater, the result of global ordering can be obtained.
SELECT col1,col2 FROM ss DISTRIBUTE BY col1 SORT BY col1,col2;
11.CLUSTER BY
If DISTRIBUTE BY and SORTBY statements are used, the columns involved in DISTRIBUTE BY and SORTBY are identical, and in ascending order, CLSTER BY can be used instead of DISTRIBUTE BY and SORTBY.
12. Bucketing and sampling
Sometimes data needs to be sampled, and Hive provides bucket sampling of tables.
SELECT * FROM test TABLESAMPLE(BUCKET 3 OUT OF 10 ON id);
For BUCKET x OUT OF Y ON z, y denotes y barrels, x denotes x barrels, and Z denotes barrels on the basis of hashing the hash values of Z columns divided by the remainder of Y. If Z is not specified, random column sampling can be used.
SELECT * FROM test TABLESAMPLE(BUCKET 3 OUT OF 10 ON RAND());
When tabulating, you can specify a bucket table, which is more efficient in sampling.
set hive.enforce.bucketing=true;
CREATE TABLE buckettable (id INT) CLUSTERED BY (id) INTO 4 BUCKETS;
The table will be divided into four buckets and the INSERT statement will be executed
INSET OVERWRITE TABLE buckettable SELECT * FROM source;
The data will be divided into four files stored under the table path, each representing a bucket.
13.UNION ALL
SELECT r.id,r.price
FROM(
SELECT m.id,m.price FROM monday m
UNION ALL
SELECT t.id,t.price FROM tuesday t)r
Hive does not support direct UNION ALL and must perform nested queries.