Hive analytic functions and window functions
Supported since Hive 0.11, these functions scan multiple input rows to compute a result for each row. They are usually used together with OVER, PARTITION BY, ORDER BY, and a window clause. Unlike traditional GROUP BY aggregation, which returns only one result row per group, an analytic function returns a value for every row, so each input record appears in the output together with its result.
The general syntax is:
Function(arg1, ..., argn) OVER ([PARTITION BY <...>] [ORDER BY <...>] [window_clause])
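For example, using the orders table created later in this article, a per-group running total can be written as:
select
    user_id,
    sum(sales) over(partition by user_type order by sales) as running_sales
from
    orders;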
Window functions
Function | Description |
---|---|
FIRST_VALUE | After sorting within the group, returns the first value from the start of the window up to the current row. |
LAST_VALUE | After sorting within the group, returns the last value from the start of the window up to the current row. |
LEAD(col, n, DEFAULT) | Returns the value of col from the nth row after the current row in the window. The first parameter is the column name, the second is the offset n (optional, default 1), and the third is the default value returned when the nth following row does not exist (default NULL). |
LAG(col, n, DEFAULT) | The counterpart of LEAD: returns the value of col from the nth row before the current row. The first parameter is the column name, the second is the offset n (optional, default 1), and the third is the default value returned when the nth preceding row does not exist. |
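As a quick illustration of the LEAD and LAG parameters, here is a minimal sketch against the orders table defined later; the aliases are illustrative only:
select
    user_id,
    sales,
    -- sales of the previous row in the ordering, 0 when there is no previous row
    lag(sales, 1, 0) over(order by sales) as prev_sales,
    -- sales of the next row in the ordering, 0 when there is no next row
    lead(sales, 1, 0) over(order by sales) as next_sales
from
    orders;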
OVER clause
- Standard aggregation functions COUNT, SUM, MIN, MAX, and AVG can be used over a window
- PARTITION BY can be used with one or more columns of primitive data types
- PARTITION BY and ORDER BY can be combined, partitioning and sorting on one or more columns of primitive data types
- A window specification can be used; the window specification supports the following formats:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] PRECEDING AND (UNBOUNDED | [num]) FOLLOWING
When ORDER BY is specified but the window clause is omitted, the window specification defaults to:
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
When both ORDER BY and the window clause are omitted, the window specification defaults to:
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
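For example, the following two expressions are equivalent, since the second merely spells out the default window (this equivalence is also visible in the sales_1 and sales_2 columns of the case study below):
sum(sales) over(partition by user_type order by sales)
sum(sales) over(partition by user_type order by sales range between unbounded preceding and current row)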
Analytic functions
Function | Description |
---|---|
ROW_NUMBER() | Generates a sequence number for each row within the group, starting from 1. For example, ordering rows within each group by pv in descending order produces each day's pv ranking within the group. ROW_NUMBER() has many uses, such as picking the first record in a group or the first referrer in a session. |
RANK() | Generates the rank of each row within the group. Tied values receive the same rank and leave a gap in the subsequent ranking. |
DENSE_RANK() | Generates the rank of each row within the group. Tied values receive the same rank without leaving a gap in the subsequent ranking. |
CUME_DIST() | The number of rows less than or equal to the current value divided by the total number of rows in the group, for example the proportion of people whose salary is less than or equal to the current salary. |
PERCENT_RANK() | (RANK of the current row within the group - 1) / (total rows in the group - 1) |
NTILE(n) | Splits the ordered rows of the group into n slices and returns the slice number of the current row. If the rows cannot be split evenly, the earlier slices receive one extra row each. NTILE does not support ROWS BETWEEN, e.g. NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is invalid. |
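As a quick sanity check of the two ranking formulas: in a data set of 10 rows, a row whose RANK is 4 has PERCENT_RANK = (4 - 1) / (10 - 1) ≈ 0.33, and a value that 5 of the 10 rows are less than or equal to has CUME_DIST = 5 / 10 = 0.5; both values reappear in the result tables at the end of this article.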
Hive 2.1.0 and later support distinct
distinct can be used inside aggregate functions (sum, count, avg), but not together with ORDER BY or a window clause, e.g.
count(distinct a) over(partition by c)
Hive 2.1.0 and later support aggregate functions in the OVER clause, e.g.
select rank() over(order by sum(b))
Hive 2.2.0 and later support distinct together with ORDER BY and a window clause, e.g.
count(distinct a) over (partition by c order by d rows between 1 preceding and 1 following)
Understanding window functions and analytic functions through examples
Case Analysis of COUNT, SUM, MIN, MAX and AVG
## Create the data table
create table orders(
user_id string,
device_id string,
user_type string,
price float,
sales int);
## Sample data (orders.txt)
zhangsa test1 new 67.1 2
lisi test2 old 43.32 1
wanger test3 new 88.88 3
liliu test4 new 66.0 1
tom test5 new 54.32 1
tomas test6 old 77.77 2
tomson test7 old 88.44 3
tom1 test8 new 56.55 6
tom2 test9 new 88.88 5
tom3 test10 new 66.66 5
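A minimal sketch of loading the sample file, assuming orders.txt is tab-delimited and sits at an illustrative local path (the create table statement above would then also need row format delimited fields terminated by '\t', since the Hive default delimiter is ^A):
-- illustrative local path; adjust to where orders.txt actually lives
load data local inpath '/tmp/orders.txt' into table orders;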
## Window function case study
select
user_id,
user_type,
sales,
-- Default window: from the start of the partition to the current row (RANGE, so rows tied on sales are included)
sum(sales) over(partition by user_type order by sales asc) as sales_1,
-- Explicit RANGE from the start to the current row; same result as sales_1
sum(sales) over(partition by user_type order by sales asc range between unbounded preceding and current row) as sales_2,
-- ROWS from the start to the current row; differs from sales_1 when there are ties
sum(sales) over(partition by user_type order by sales asc rows between unbounded preceding and current row) as sales_3,
-- Current row plus the 3 preceding rows
sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and current row) as sales_4,
-- RANGE: rows whose sales value lies within 3 below the current row's value, up to the current value
sum(sales) over(partition by user_type order by sales asc range between 3 preceding and current row) as sales_5,
-- Current row plus the 3 preceding rows and 1 following row
sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and 1 following) as sales_6,
-- Same bounds as sales_6 but with RANGE (by value)
sum(sales) over(partition by user_type order by sales asc range between 3 preceding and 1 following) as sales_7,
-- Current row plus all following rows
sum(sales) over(partition by user_type order by sales asc rows between current row and unbounded following) as sales_8,
-- Same bounds as sales_8 but with RANGE (by value)
sum(sales) over(partition by user_type order by sales asc range between current row and unbounded following) as sales_9,
-- All rows in the partition (no ORDER BY)
sum(sales) over(partition by user_type) as sales_10
from
orders
order by
user_type,
sales,
user_id;
## The results of the above query are as follows:
| user_id | user_type | sales | sales_1 | sales_2 | sales_3 | sales_4 | sales_5 | sales_6 | sales_7 | sales_8 | sales_9 | sales_10 |
|----------|------------|--------|----------|----------|----------|----------|----------|----------|----------|----------|----------|-----------|
| liliu | new | 1 | 2 | 2 | 2 | 2 | 2 | 4 | 4 | 22 | 23 | 23 |
| tom | new | 1 | 2 | 2 | 1 | 1 | 2 | 2 | 4 | 23 | 23 | 23 |
| zhangsa | new | 2 | 4 | 4 | 4 | 4 | 4 | 7 | 7 | 21 | 21 | 23 |
| wanger | new | 3 | 7 | 7 | 7 | 7 | 7 | 12 | 7 | 19 | 19 | 23 |
| tom2 | new | 5 | 17 | 17 | 17 | 15 | 15 | 21 | 21 | 11 | 16 | 23 |
| tom3 | new | 5 | 17 | 17 | 12 | 11 | 15 | 16 | 21 | 16 | 16 | 23 |
| tom1 | new | 6 | 23 | 23 | 23 | 19 | 19 | 19 | 19 | 6 | 6 | 23 |
| lisi | old | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 3 | 6 | 6 | 6 |
| tomas | old | 2 | 3 | 3 | 3 | 3 | 3 | 6 | 6 | 5 | 5 | 6 |
| tomson | old | 3 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 3 | 3 | 6 |
Notes
1. The result depends on ORDER BY; the default sort order is ascending.
2. If ROWS/RANGE BETWEEN is not specified, the default window is from the start of the partition to the current row.
3. If ORDER BY is not specified, all values in the partition are accumulated.
4. PRECEDING: before the current row
5. FOLLOWING: after the current row
6. CURRENT ROW: the current row
7. UNBOUNDED: unbounded (toward the start or the end)
8. UNBOUNDED PRECEDING: from the start of the partition
9. UNBOUNDED FOLLOWING: to the end of the partition
10. COUNT, AVG, MIN, and MAX are used in the same way as SUM.
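To see the difference between ROWS and RANGE concretely, compare tom and liliu, which are both rows with sales = 1 in the new group: under ROWS (sales_3) tom gets 1 and liliu gets 1 + 1 = 2, because ROWS counts physical rows up to the current one, while under RANGE (sales_1 and sales_2) both get 2, because RANGE is evaluated on the ORDER BY value and therefore includes every row tied on sales = 1.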
Case Analysis of FIRST_VALUE and LAST_VALUE
select
user_id,
user_type,
sales,
ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,
first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from
orders
order by
user_type,
sales;
## The results of the above query are as follows:
| user_id | user_type | sales | row_num | max_sales_user | min_sales_user | curr_last_min_user | curr_last_max_user |
| ------- | --------- | ----- | ------- | -------------- | -------------- | ------------------ | ------------------ |
| tom | new | 1 | 1 | tom1 | tom | tom | liliu |
| liliu | new | 1 | 2 | tom1 | tom | tom | liliu |
| zhangsa | new | 2 | 3 | tom1 | tom | zhangsa | zhangsa |
| wanger | new | 3 | 4 | tom1 | tom | wanger | wanger |
| tom3 | new | 5 | 5 | tom1 | tom | tom3 | tom2 |
| tom2 | new | 5 | 6 | tom1 | tom | tom3 | tom2 |
| tom1 | new | 6 | 7 | tom1 | tom | tom1 | tom1 |
| lisi | old | 1 | 1 | tomson | lisi | lisi | lisi |
| tomas | old | 2 | 2 | tomson | lisi | tomas | tomas |
| tomson | old | 3 | 3 | tomson | lisi | tomson | tomson |
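Note that curr_last_min_user and curr_last_max_user mostly echo the current row (or its last tied row): because no window clause is given, LAST_VALUE runs over the default window from the start of the partition to the current row and its ties, so it returns the last value seen so far rather than the last value of the whole partition. A minimal sketch of getting the true per-partition last value is to widen the window explicitly (the alias is illustrative):
last_value(user_id) over (partition by user_type order by sales asc rows between unbounded preceding and unbounded following) as overall_last_user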
LEAD and LAG
select
user_id,
device_id,
sales,
ROW_NUMBER() OVER(ORDER BY sales) AS row_num,
lead(device_id) over (order by sales) as default_after_one_line,
lag(device_id) over (order by sales) as default_before_one_line,
lead(device_id,2) over (order by sales) as after_two_line,
lag(device_id,2,'abc') over (order by sales) as before_two_line
from
orders
order by
sales;
## The results of the above query are as follows
| user_id | device_id | sales | row_num | default_after_one_line | default_before_one_line | after_two_line | before_two_line |
|----------|------------|--------|----------|-------------------------|--------------------------|-----------------|------------------|
| lisi | test2 | 1 | 3 | test6 | test4 | test1 | test5 |
| liliu | test4 | 1 | 2 | test2 | test5 | test6 | abc |
| tom | test5 | 1 | 1 | test4 | NULL | test2 | abc |
| zhangsa | test1 | 2 | 5 | test7 | test6 | test3 | test2 |
| tomas | test6 | 2 | 4 | test1 | test2 | test7 | test4 |
| wanger | test3 | 3 | 7 | test10 | test7 | test9 | test1 |
| tomson | test7 | 3 | 6 | test3 | test1 | test10 | test6 |
| tom2 | test9 | 5 | 9 | test8 | test10 | NULL | test3 |
| tom3 | test10 | 5 | 8 | test9 | test3 | test8 | test7 |
| tom1 | test8 | 6 | 10 | NULL | test9 | NULL | test10 |
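A common use of LAG is computing the change from the previous row, for example the difference between the current sales and the previous sales in the same ordering (a minimal sketch; the alias is illustrative):
select
    user_id,
    sales,
    -- 0 is used as the default so the first row yields its own sales value
    sales - lag(sales, 1, 0) over(order by sales) as sales_delta
from
    orders;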
RANK, ROW_NUMBER, DENSE_RANK
select
user_id,user_type,sales,
RANK() over (partition by user_type order by sales desc) as r,
ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
DENSE_RANK() over (partition by user_type order by sales desc) as dr
from
orders;
## The results of the above query are as follows
| user_id | user_type | sales | r | rn | dr |
| ------- | --------- | ----- | --- | --- | --- |
| tom1 | new | 6 | 1 | 1 | 1 |
| tom3 | new | 5 | 2 | 2 | 2 |
| tom2 | new | 5 | 2 | 3 | 2 |
| wanger | new | 3 | 4 | 4 | 3 |
| zhangsa | new | 2 | 5 | 5 | 4 |
| tom | new | 1 | 6 | 6 | 5 |
| liliu | new | 1 | 6 | 7 | 5 |
| tomson | old | 3 | 1 | 1 | 1 |
| tomas | old | 2 | 2 | 2 | 2 |
| lisi | old | 1 | 3 | 3 | 3 |
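ROW_NUMBER is often wrapped in a subquery to pick the top N rows per group, for example the two largest sales in each user_type (a minimal sketch; the alias and the value of N are illustrative):
select
    user_id, user_type, sales
from
(
    select
        user_id, user_type, sales,
        ROW_NUMBER() over (partition by user_type order by sales desc) as rn
    from
        orders
) t
where rn <= 2;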
NTILE
select
user_type,sales,
-- Split the data within each group into 2 slices
NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
-- Split the data within each group into 3 slices
NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
-- Split the data within each group into 4 slices
NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
-- Split all the data into 4 slices (no partition)
NTILE(4) OVER(ORDER BY sales) AS all_nt4
from
orders
order by
user_type,
sales;
## The results of the above query are as follows
| user_type | sales | nt2 | nt3 | nt4 | all_nt4 |
| --------- | ----- | --- | --- | --- | ------- |
| new | 1 | 1 | 1 | 1 | 1 |
| new | 1 | 1 | 1 | 1 | 1 |
| new | 2 | 1 | 1 | 2 | 2 |
| new | 3 | 1 | 2 | 2 | 3 |
| new | 5 | 2 | 2 | 3 | 4 |
| new | 5 | 2 | 3 | 3 | 3 |
| new | 6 | 2 | 3 | 4 | 4 |
| old | 1 | 1 | 1 | 1 | 1 |
| old | 2 | 1 | 2 | 2 | 2 |
| old | 3 | 2 | 3 | 3 | 2 |
Get the user IDs in the top 20% by sales
select
user_id
from
(
select
user_id,
NTILE(5) OVER(ORDER BY sales desc) AS nt
from
orders
)A
where nt=1;
## The result is as follows
+----------+
| user_id |
+----------+
| tom1 |
| tom3 |
+----------+
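This works because the table holds 10 rows, so NTILE(5) places exactly 2 rows in each slice; slice 1 therefore contains the 2 rows with the highest sales, i.e. the top 20% (ties such as the two rows with sales = 5 are split between slices arbitrarily).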
CUME_DIST, PERCENT_RANK
select
user_id,user_type,sales,
-- No PARTITION BY: all rows form a single group
CUME_DIST() OVER(ORDER BY sales) AS cd1,
-- Grouped by user_type
CUME_DIST() OVER(PARTITION BY user_type ORDER BY sales) AS cd2
from
orders;
## The results are as follows.
+----------+------------+--------+------+----------------------+--+
| user_id | user_type | sales | cd1 | cd2 |
+----------+------------+--------+------+----------------------+--+
| liliu | new | 1 | 0.3 | 0.2857142857142857 |
| tom | new | 1 | 0.3 | 0.2857142857142857 |
| zhangsa | new | 2 | 0.5 | 0.42857142857142855 |
| wanger | new | 3 | 0.7 | 0.5714285714285714 |
| tom2 | new | 5 | 0.9 | 0.8571428571428571 |
| tom3 | new | 5 | 0.9 | 0.8571428571428571 |
| tom1 | new | 6 | 1.0 | 1.0 |
| lisi | old | 1 | 0.3 | 0.3333333333333333 |
| tomas | old | 2 | 0.5 | 0.6666666666666666 |
| tomson | old | 3 | 0.7 | 1.0 |
+----------+------------+--------+------+----------------------+--+
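As a quick check against the data: for zhangsa (sales = 2), 5 of the 10 rows overall have sales <= 2, so cd1 = 5 / 10 = 0.5, and within the new group 3 of the 7 rows have sales <= 2, so cd2 = 3 / 7 ≈ 0.4286.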
select
user_type,sales,
-- Total number of rows in the group
SUM(1) OVER(PARTITION BY user_type) AS s,
-- RANK over all rows
RANK() OVER(ORDER BY sales) AS r,
-- PERCENT_RANK over all rows
PERCENT_RANK() OVER(ORDER BY sales) AS pr,
-- PERCENT_RANK within each user_type group
PERCENT_RANK() OVER(PARTITION BY user_type ORDER BY sales) AS prg
from
orders;
## The results are as follows.
+------------+--------+----+-----+---------------------+---------------------+--+
| user_type | sales | s | r | pr | prg |
+------------+--------+----+-----+---------------------+---------------------+--+
| new | 1 | 7 | 1 | 0.0 | 0.0 |
| new | 1 | 7 | 1 | 0.0 | 0.0 |
| new | 2 | 7 | 4 | 0.3333333333333333 | 0.3333333333333333 |
| new | 3 | 7 | 6 | 0.5555555555555556 | 0.5 |
| new | 5 | 7 | 8 | 0.7777777777777778 | 0.6666666666666666 |
| new | 5 | 7 | 8 | 0.7777777777777778 | 0.6666666666666666 |
| new | 6 | 7 | 10 | 1.0 | 1.0 |
| old | 1 | 3 | 1 | 0.0 | 0.0 |
| old | 2 | 3 | 4 | 0.3333333333333333 | 0.5 |
| old | 3 | 3 | 6 | 0.5555555555555556 | 1.0 |
+------------+--------+----+-----+---------------------+---------------------+--+
Reference blog
http://blog.csdn.net/scgaliguodong123_/article/details/60135385