Hive analysis function and window function

Keywords: hive less Session

Hive analysis function and window function

Supported after Hive 0.11, scan multiple input rows to calculate the results of each row. Usually used with OVER,PARTITION BY, ORDER BY, WINDOWING. Unlike traditional grouping results, traditional results have only one result in each group. The results of the analysis function will appear many times, and each record will be connected to the output.

The grammatical forms are as follows:

Function(arg1,....argn) OVER([PARTITION BY<...>] [ORDER BY<...>] [window_clause])

Window function

Function name Explain
FIRST_VALUE After taking out the sorting within the group, the first value is as close as the current row.
LAST_VALUE After taking out the sorting within the group, the current row is closed, and the last value is obtained.
LEAD(col, n, DEFAULT) Used to count the value of line n down the window. The first parameter is the column name, the second parameter is the next nth row (optional, default 1), and the third parameter is the default value (when NULL is the next nth action, default value)
LAG(col,n,DEFAULT) Contrary to lead, it is used to count the nth value down the window. The first parameter is column name, and the second parameter is line n up (optional, default 1)

OVER clause

  1. Use standard aggregation functions COUNT,SUM,MIN,MAX,AVG
  2. Use the PARTITION BY statement to use columns of one or more original data types
  3. Using PARTITION BY and ORDER BY statements, using partitions or beat sequences of one or more data types
  4. Using the window specification, the window specification supports a format:
(ROW | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROW | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROW | RANGE) BETWEEN [num] PRECEDING AND (UNBOUNDED | [num]) FOLLOWING

When the window clause condition is missing after ORDER BY, the window specification defaults to

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

When ORDER BY and window clauses are missing, the window specification defaults to:

ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

Analysis function

function Explain
ROW_NUMBER() From the beginning, the sequence of records in a group is generated in order. For example, the sequence of records in a group is arranged in descending order according to pv, and the PV ranking in a group is generated every day. There are many application scenarios of ROW_NUMBER(). For example, the first record in a group is obtained, and the first refer in a session is obtained.
RANK() Generate the ranking of data items in the group. Equal ranking will leave a vacancy in the ranking.
DENSE_RANK() Generate the ranking of data items in the group. Equal ranking will not leave a vacancy in the ranking.
CUME_DIST() The number of rows less than or equal to the current value divided by the total number of rows in the group. For example, the proportion of people whose salaries are less than or equal to their current salaries to the total number of people.
PERCENT_RANK() RANK value of the current row in the group - 1 / total row number in the group - 1
NTILE(n) Used to divide the grouped data into n slices in order and return the current slice value. If the slice is not uniform, the distribution of the first slice is increased by default. NTILE does not support ROWS BETWEEN, such as NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PERCEDING AND CURRENT ROW)

Hive 2.1.0 and later supports distinct

distinct is supported in aggregate functions (sum, count, avg), but not in order by or window restrictions.
conut(distinct a) over(partition by c)

Support aggregation function in OVER clause after Hive 2.1.0

select rank() over(order by sum(b))

Hive 2.2.0 supports distinct when using ORDER BY and window restrictions

count(distinct a) over (partition by c order by d rows between 1 preceding and 1 following)

Deep Understanding of Window Function and Analysis Function by Examples

Case Analysis of COUNT, SUM, MIN, MAX and AVG

## Create data tables
create table orders(
    user_id string,
    device_id string,
    user_type string,
    price float,
    sales int);

## Add data orders.txt
zhangsa test1   new     67.1    2
lisi    test2   old     43.32   1
wanger  test3   new     88.88   3
liliu   test4   new     66.0    1
tom     test5   new     54.32   1
tomas   test6   old     77.77   2
tomson  test7   old     88.44   3
tom1    test8   new     56.55   6
tom2    test9   new     88.88   5
tom3    test10  new     66.66   5

## Case study of window opening function
select
    user_id,
    user_type,
    sales,
    -- Default all duplicate rows from start to current
    sum(sales) over(partition by user_type order by sales asc) as sales_1,
    -- All duplicate rows and sales_1 Same result
    sum(sales) over(partition by user_type order by sales asc range between unbounded preceding and current row) as sales_2,
    -- From the starting point to the current line, results and results sale_1 Different results
    sum(sales) over(partition by user_type order by sales asc rows between unbounded preceding and current row) as sales_3,
    -- Current line plus forward3That's ok
    sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and current row) as sales_4,
    -- Increase the current scope3That's ok
    sum(sales) over(partition by user_type order by sales asc range between 3 preceding and current row) as sales_5,
    -- Current row+Forward3That's ok+In the future1That's ok
    sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and 1 following) as sales_6,
    --
    sum(sales) over(partition by user_type order by sales asc range between 3 preceding and 1 following) as sales_7,
    -- Current row+All subsequent rows
    sum(sales) over(partition by user_type order by sales asc rows between current row and unbounded following) as sales_8,
    --
    sum(sales) over(partition by user_type order by sales asc range between current row and unbounded following) as sales_9,
    -- All rows in a group
    sum(sales) over(partition by user_type) as sales_10
from
    orders
order by
    user_type,
    sales,
    user_id;

##The results of the above query are as follows:

| user_id  | user_type  | sales  | sales_1  | sales_2  | sales_3  | sales_4  | sales_5  | sales_6  | sales_7  | sales_8  | sales_9  | sales_10  |
|----------|------------|--------|----------|----------|----------|----------|----------|----------|----------|----------|----------|-----------|
| liliu    | new        | 1      | 2        | 2        | 2        | 2        | 2        | 4        | 4        | 22       | 23       | 23        |
| tom      | new        | 1      | 2        | 2        | 1        | 1        | 2        | 2        | 4        | 23       | 23       | 23        |
| zhangsa  | new        | 2      | 4        | 4        | 4        | 4        | 4        | 7        | 7        | 21       | 21       | 23        |
| wanger   | new        | 3      | 7        | 7        | 7        | 7        | 7        | 12       | 7        | 19       | 19       | 23        |
| tom2     | new        | 5      | 17       | 17       | 17       | 15       | 15       | 21       | 21       | 11       | 16       | 23        |
| tom3     | new        | 5      | 17       | 17       | 12       | 11       | 15       | 16       | 21       | 16       | 16       | 23        |
| tom1     | new        | 6      | 23       | 23       | 23       | 19       | 19       | 19       | 19       | 6        | 6        | 23        |
| lisi     | old        | 1      | 1        | 1        | 1        | 1        | 1        | 3        | 3        | 6        | 6        | 6         |
| tomas    | old        | 2      | 3        | 3        | 3        | 3        | 3        | 6        | 6        | 5        | 5        | 6         |
| tomson   | old        | 3      | 6        | 6        | 6        | 6        | 6        | 6        | 6        | 3        | 3        | 6         |

Be careful
1. Result is related to ORDER BY, default is ascending order.
2. If ROWS BETWEEN is not specified, the default is from the starting point to the current line.
3. If ORDER BY is not specified, all values in the grouping are accumulated.
4. PRECEDING: Forward
5. FOLLOWING: Later
6. CURRENT ROW: Current row
7. UNBOUNDED: Unbounded (starting or ending)
8. UNBOUNDED PRECEDING: Represents the starting point from the front
9. UNBOUNDED FOLLOWING: Represents the end point later
10. Other COUNT, AVG, MIN, MAX are used in the same way as SUM.

Case Analysis of FIRST_VALUE and LAST_VALUE

select
    user_id,
    user_type,
    sales,
    ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,
    first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
    first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
    last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
    last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from
    orders
order by
    user_type,
    sales;

##The results of the above query are as follows:

| user_id | user_type | sales | row_num | max_sales_user | min_sales_user | curr_last_min_user | curr_last_max_user |
| ------- | --------- | ----- | ------- | -------------- | -------------- | ------------------ | ------------------ |
| tom     | new       | 1     | 1       | tom1           | tom            | tom                | liliu              |
| liliu   | new       | 1     | 2       | tom1           | tom            | tom                | liliu              |
| zhangsa | new       | 2     | 3       | tom1           | tom            | zhangsa            | zhangsa            |
| wanger  | new       | 3     | 4       | tom1           | tom            | wanger             | wanger             |
| tom3    | new       | 5     | 5       | tom1           | tom            | tom3               | tom2               |
| tom2    | new       | 5     | 6       | tom1           | tom            | tom3               | tom2               |
| tom1    | new       | 6     | 7       | tom1           | tom            | tom1               | tom1               |
| lisi    | old       | 1     | 1       | tomson         | lisi           | lisi               | lisi               |
| tomas   | old       | 2     | 2       | tomson         | lisi           | tomas              | tomas              |
| tomson  | old       | 3     | 3       | tomson         | lisi           | tomson             | tomson             |

LEAD and LAG

select
    user_id,
    device_id,
    sales,
    ROW_NUMBER() OVER(ORDER BY sales) AS row_num,
    lead(device_id) over (order by sales) as default_after_one_line,
    lag(device_id) over (order by sales) as default_before_one_line,
    lead(device_id,2) over (order by sales) as after_two_line,
    lag(device_id,2,'abc') over (order by sales) as before_two_line
from
    orders
order by
    sales;

//The results of the above query are as follows

| user_id  | device_id  | sales  | row_num  | default_after_one_line  | default_before_one_line  | after_two_line  | before_two_line  |
|----------|------------|--------|----------|-------------------------|--------------------------|-----------------|------------------|
| lisi     | test2      | 1      | 3        | test6                   | test4                    | test1           | test5            |
| liliu    | test4      | 1      | 2        | test2                   | test5                    | test6           | abc              |
| tom      | test5      | 1      | 1        | test4                   | NULL                     | test2           | abc              |
| zhangsa  | test1      | 2      | 5        | test7                   | test6                    | test3           | test2            |
| tomas    | test6      | 2      | 4        | test1                   | test2                    | test7           | test4            |
| wanger   | test3      | 3      | 7        | test10                  | test7                    | test9           | test1            |
| tomson   | test7      | 3      | 6        | test3                   | test1                    | test10          | test6            |
| tom2     | test9      | 5      | 9        | test8                   | test10                   | NULL            | test3            |
| tom3     | test10     | 5      | 8        | test9                   | test3                    | test8           | test7            |
| tom1     | test8      | 6      | 10       | NULL                    | test9                    | NULL            | test10           |

RANK,ROW_NUMBER,DENSE_RANK

select
user_id,user_type,sales,
RANK() over (partition by user_type order by sales desc) as r,
ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
DENSE_RANK() over (partition by user_type order by sales desc) as dr
from
orders;

##The results of the above query are as follows

| user_id | user_type | sales |  r  | rn  | dr  |
| ------- | --------- | ----- | --- | --- | --- |
| tom1    | new       | 6     | 1   | 1   | 1   |
| tom3    | new       | 5     | 2   | 2   | 2   |
| tom2    | new       | 5     | 2   | 3   | 2   |
| wanger  | new       | 3     | 4   | 4   | 3   |
| zhangsa | new       | 2     | 5   | 5   | 4   |
| tom     | new       | 1     | 6   | 6   | 5   |
| liliu   | new       | 1     | 6   | 7   | 5   |
| tomson  | old       | 3     | 1   | 1   | 1   |
| tomas   | old       | 2     | 2   | 2   | 2   |
| lisi    | old       | 1     | 3   | 3   | 3   |

NTILE

select
    user_type,sales,
    --Divide data into groups2slice
    NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
    --Divide data into groups3slice
    NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
    --Divide data into groups4slice
    NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
    --Divide all data into4slice
    NTILE(4) OVER(ORDER BY sales) AS all_nt4
from
    orders
order by
    user_type,
    sales;

##The results of the above query are as follows
| user_type | sales | nt2 | nt3 | nt4 | all_nt4 |
| --------- | ----- | --- | --- | --- | ------- |
| new       | 1     | 1   | 1   | 1   | 1       |
| new       | 1     | 1   | 1   | 1   | 1       |
| new       | 2     | 1   | 1   | 2   | 2       |
| new       | 3     | 1   | 2   | 2   | 3       |
| new       | 5     | 2   | 2   | 3   | 4       |
| new       | 5     | 2   | 3   | 3   | 3       |
| new       | 6     | 2   | 3   | 4   | 4       |
| old       | 1     | 1   | 1   | 1   | 1       |
| old       | 2     | 1   | 2   | 2   | 2       |
| old       | 3     | 2   | 3   | 3   | 2       |

Get the top 20% user ID of sale

select
    user_id
from
(
    select
        user_id,
        NTILE(5) OVER(ORDER BY sales desc) AS nt
    from
        orders
)A
where nt=1;
## give the result as follows
+----------+
| user_id  |
+----------+
| tom1     |
| tom3     |
+----------+

CUME_DIST,PERCENT_RANK

select
    user_id,user_type,sales,
    --No, partition,All data are1group
    CUME_DIST() OVER(ORDER BY sales) AS cd1,
    --according to user_type Grouping
    CUME_DIST() OVER(PARTITION BY user_type ORDER BY sales) AS cd2
from
    orders;

## The results are as follows.
+----------+------------+--------+------+----------------------+--+
| user_id  | user_type  | sales  | cd1  |         cd2          |
+----------+------------+--------+------+----------------------+--+
| liliu    | new        | 1      | 0.3  | 0.2857142857142857   |
| tom      | new        | 1      | 0.3  | 0.2857142857142857   |
| zhangsa  | new        | 2      | 0.5  | 0.42857142857142855  |
| wanger   | new        | 3      | 0.7  | 0.5714285714285714   |
| tom2     | new        | 5      | 0.9  | 0.8571428571428571   |
| tom3     | new        | 5      | 0.9  | 0.8571428571428571   |
| tom1     | new        | 6      | 1.0  | 1.0                  |
| lisi     | old        | 1      | 0.3  | 0.3333333333333333   |
| tomas    | old        | 2      | 0.5  | 0.6666666666666666   |
| tomson   | old        | 3      | 0.7  | 1.0                  |
+----------+------------+--------+------+----------------------+--+

select
    user_type,sales,
    --Number of total rows in a group
    SUM(1) OVER(PARTITION BY user_type) AS s,
    --RANKvalue
    RANK() OVER(ORDER BY sales) AS r,
    PERCENT_RANK() OVER(ORDER BY sales) AS pr,
    --In group
    PERCENT_RANK() OVER(PARTITION BY user_type ORDER BY sales) AS prg
from
    orders;

## The results are as follows.
+------------+--------+----+-----+---------------------+---------------------+--+
| user_type  | sales  | s  |  r  |         pr          |         prg         |
+------------+--------+----+-----+---------------------+---------------------+--+
| new        | 1      | 7  | 1   | 0.0                 | 0.0                 |
| new        | 1      | 7  | 1   | 0.0                 | 0.0                 |
| new        | 2      | 7  | 4   | 0.3333333333333333  | 0.3333333333333333  |
| new        | 3      | 7  | 6   | 0.5555555555555556  | 0.5                 |
| new        | 5      | 7  | 8   | 0.7777777777777778  | 0.6666666666666666  |
| new        | 5      | 7  | 8   | 0.7777777777777778  | 0.6666666666666666  |
| new        | 6      | 7  | 10  | 1.0                 | 1.0                 |
| old        | 1      | 3  | 1   | 0.0                 | 0.0                 |
| old        | 2      | 3  | 4   | 0.3333333333333333  | 0.5                 |
| old        | 3      | 3  | 6   | 0.5555555555555556  | 1.0                 |
+------------+--------+----+-----+---------------------+---------------------+--+

Reference blog

http://blog.csdn.net/scgaliguodong123_/article/details/60135385

Posted by Vincent III on Fri, 31 May 2019 12:56:36 -0700