Hive query operation

query

1. Data preparation

1.1 create table

// Create department table
create table if not exists dept (
    deptno int,
    dname string,
    loc int
)
row format delimited fields terminated by '\t';

// Create employee table
create table if not exists emp (
    empno int,
    ename string,
    job string,
    mgr int,
    hiredate string,
    sal double,
    comm double,
    deptno int
)
row format delimited fields terminated by '\t';

1.2 loading data

// Import data
load data local inpath '/opt/module/hive-3.1.2/datas/dept.txt' into table dept;
load data local inpath '/opt/module/hive-3.1.2/datas/emp.txt' into table emp;

1.3 data display

2. Basic query

2.1 basic query

Full table query
```
select * from emp;
```
Select a specific column query
```
select empno, ename from emp;
```

2.2 column alias

Rename a column
Easy to calculate
Immediately following the column name, you can also add the keyword AS between the column name and alias

select ename AS name, deptno dn from emp;

2.3 arithmetic operators

operator	describe
A+B	Add A and B
A-B	A minus B
A*B	Multiply A and B
A/B	A divided by B
A%B	A to B remainder
A&B	A and B are bitwise AND
A\|B	A and B take or by bit
A^B	A and B bitwise XOR
~A	A reverse by bit

// + 1 after querying the salary of all employees
select sal +1 AS sal from emp;

2.4 common functions

count the total number of rows
```
select count(*) cnt from emp;
```
Find the maximum value of salary max
```
select max(sal) max_sal from emp;
```
Find the minimum value of salary min
```
select min(sal) min_sal from emp;
```
sum of wages
```
select sum(sal) sum_sal from emp;
```
Average wages avg
```
select avg(sal) avg_sal from emp;
```

2.5 Limit statement

The LIMIT clause is used to LIMIT the number of rows returned

select * from emp limit 5;

2.6 Where statement

Use the WHERE clause to filter out rows that do not meet the conditions
The WHERE clause follows the FROM clause

// Query all employees with salary greater than 1000
select * from emp where sal > 1000;

2.7 comparison operators

Operator	Supported data types	describe
A=B	Basic data type	If A is equal to B, it returns TRUE, otherwise it returns FALSE
A<=>B	Basic data type	Returns TRUE if both A and B are NULL, and False if one side is NULL
A<>B, A!=B	Basic data type	If A or B is NULL, NULL is returned; If A is not equal to B, it returns TRUE, otherwise it returns FALSE
A<B	Basic data type	If A or B is NULL, NULL is returned; If A is less than B, it returns TRUE, otherwise it returns FALSE
A<=B	Basic data type	If A or B is NULL, NULL is returned; If A is less than or equal to B, it returns TRUE, otherwise it returns FALSE
A>B	Basic data type	If A or B is NULL, NULL is returned; If A is greater than B, it returns TRUE, otherwise it returns FALSE
A>=B	Basic data type	If A or B is NULL, NULL is returned; If A is greater than or equal to B, it returns TRUE, otherwise it returns FALSE
A [NOT] BETWEEN B AND C	Basic data type	If any of A, B or C is NULL, the result is NULL. If the value of A is greater than or equal to B and less than or equal to C, the result is TRUE, otherwise it is FALSE. If you use the NOT keyword, the opposite effect can be achieved.
A IS NULL	All data types	If A is equal to NULL, it returns TRUE, otherwise it returns FALSE
A IS NOT NULL	All data types	If A is not equal to NULL, it returns TRUE, otherwise it returns FALSE
In (value 1, value 2)	All data types	Use the IN operation to display the values IN the list
A [NOT] LIKE B	STRING type	B is A simple regular expression under SQL, also known as wildcard pattern. If A matches it, it returns TRUE; Otherwise, FALSE is returned.
A RLIKE B, A REGEXP B	STRING type	B is A regular expression based on java. If A matches it, it returns TRUE; Otherwise, FALSE is returned.

// Query employees with salary equal to 5000
select * from emp where sal = 5000;

// Query employee information with salary between 500 and 1000
select * from emp where sal between 500 and 1000;

// Query employee information with comm blank
select * from emp where comm is null;

// Query employee information with salary of 1500 or 5000
select * from emp where sal in (1500, 5000);

2.8 Like and RLike

Use the LIKE operator to select a similar value
- %Represents zero or more characters
- _ Represents a character
RLIKE clause is an extension of this function in Hive, which can specify matching conditions through Java regular expressions

// Query employee information whose name starts with A
select * from emp where ename like 'A%';

// Query the employee information with the second letter A in the name
select * from emp where ename like '_A%';

// Query employee information with A in name
select * from emp where ename rlike '[A]';

2.9 logical operators

Operator	meaning
AND	Logical Union
OR	Logical or
NOT	Logical no

// The inquiry salary is greater than 1000 and the Department is 30
select * from emp where sal > 1000 and deptno = 30;

// Query salary is greater than 1000 or department is 30
select * from emp where sal > 1000 or deptno = 30;

// Query employee information except 20 departments and 30 departments
select * from emp where deptno not in (30, 20);

3. Grouping

3.1 Group By statement

The GROUP BY statement is usually used with an aggregate function to group one or more queued results, and then aggregate each group.

// Calculate the average salary of each department
select deptno, avg(sal) avg_sal from emp group by deptno;

// Calculate the maximum salary for each position in each department
select t.deptno, t.job, max(t.sal) max_sal
from emp t
group by t.deptno, t.job;

3.2 Having statement

Differences between having and where:\

having is to filter the data after grouping, and where is to filter the data before grouping
Aggregate functions can be used after having, but not after where
The execution order during query: from > where > group (including aggregation) > having > order > select

// Find the average salary of each department
select deptno, avg(sal) from emp group by deptno;

// The average salary of each department is greater than 2000
select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;

4. Join statement

Benefits of table aliases:

Using aliases can simplify queries
Using table name prefix can improve execution efficiency

4.1 internal connection

Only if there are data matching the connection conditions in the two tables to be connected, the data will be retained

select e.empno, e.ename, d.dname
from emp e
join dept d on e.deptno = d.deptno;

4.2 left outer connection

The data in the left table that matches the connection conditions will be retained

select e.empno, e.ename, d.dname
from emp e
left join dept d on e.deptno = d.deptno;

4.3 right outer connection

The data in the right table that matches the connection conditions will be retained

select e.empno, e.ename, d.dname
from emp e
right join dept d on e.deptno = d.deptno;

4.4 full external connection

The data matching the connection conditions in the two tables to be connected will be retained

select e.empno, e.ename, d.dname
from emp e
full join dept d on e.deptno = d.deptno;

4.5 Cartesian product

Cartesian assemblies are generated under the following conditions:

Omit connection conditions
Invalid connection condition
All rows in all tables are connected to each other

select empno, dname from emp, dept;

5. Sorting

5.1 global sort Order By

Global sorting, only one Reducer

// Query employee information in ascending order of salary
select * from emp order by sal;

// Query employee information in descending order of salary
select * from emp order by sal desc;

5.2 sorting by alias

// Sort by twice the employee's salary
select ename, sal*2 twosal from emp order by twosal;

5.3 sorting multiple columns

// Sort by multiple columns: sort by department and salary in ascending order
select ename, deptno, sal from emp order by deptno, sal;

5.4 Sort By within each Reduce

Sort by: for large-scale data sets, order by is very inefficient. In many cases, global sorting is not required. In this case, Sort by can be used. Sort by generates a sort file for each Reducer. Each Reducer is sorted internally, not for the global result set

// Set the number of reduce
set mapreduce.job.reduces=3;

// Import query results into a file
insert overwrite local directory
'/opt/module/hive-3.1.2/datas/export/sortby-result'
row format delimited fields terminated by '\t'
select * from emp sort by deptno desc ;

5.5 partition distribution by

Distribution by: in some cases, we need to control which reducer a particular row should go to, usually for subsequent aggregation operations. The distribute by clause can do this. distribute by is similar to partition (custom partition) in MR. it is used in combination with sort by.

// Set the number of reduce
set mapreduce.job.reduces=3;

// First, divide by department number, and then sort by employee number in descending order
insert overwrite local directory
'/opt/module/hive-3.1.2/datas/export/distribute-result_4'
row format delimited fields terminated by '\t'
select * from emp
distribute by deptno
sort by empno desc ;

5.6 Cluster By

When the distribution by and sort by fields are the same, the cluster by mode can be used

// The following two expressions are equivalent
select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;

Posted by sean.mccoy on Wed, 06 Oct 2021 06:41:53 -0700

Programmer Group