Hive QL: Window Opening Function (Cumulative Statistics)

Keywords: hive Hadoop vim

Catalog

 

Preface

1. What is a windowing function

2. Window-opening function syntax

3. Classification of window-opening functions

4. Cumulative Statistics Window Opening Function

4.1 Cumulative sum(xx) over

4.2 Cumulative Average avg(xx) over

4.3 Cumulative max(xx) over

4.4 Cumulative Minimum min(xx) over

 

Preface

Starting with this article, the series of window-opening functions in Hive is introduced.These functions are called many ways, including analysis function, window function, open window function, analysis window function. In fact, they all refer to a kind of function. This series of blogs follow the naming of open window function.

 

1. What is a windowing function

Here is the definition of Baidu Encyclopedia:

The windowing function defines a window for a row (in this case, the window is the set of rows that the operation will operate on), it operates on a set of values, does not need to group data using a GROUP BY clause, and it can return both columns of the underlying row and aggregated columns in the same row.

 

2. Window-opening function syntax

The syntax of the windowing function is:

over(partition by column name 1 order by column name 2), the two keywords partition by and order by in parentheses can only appear one.over() is preceded by a function.

 

3. Classification of window-opening functions

Window-opening functions in Hive are divided into three main categories by function:

  1. Cumulative Statistics
  2. Sort within group
  3. Hierarchical Query

 

4. Cumulative Statistics Window Opening Function

This paper mainly introduces the first kind of window-opening functions, namely, sum, avg, min, max, which are used for cumulative statistics.

Data preparation:

Create a new test.txt file and enter the following three columns of data separated by spaces.The first column is the month, the second column represents the name of the store, and the third column represents the turnover of the store in that month (RMB 10,000).

[root@hadoop ~]# vim test.txt 

2019-01 a 10
2019-02 a 20
2019-03 a 30
2019-01 b 10
2019-02 b 20
2019-03 b 30

Create a new table temp_test9 in hive, insert data from the test file, and view the data.

CREATE TABLE temp_test9 (
      month STRING comment 'Month'
      ,shop STRING comment 'Shop Name'
      ,money STRING comment 'Turnover (RMB 10,000)'
      ) row format delimited fields terminated BY ' ';
 
 
load data local inpath '/root/test.txt' into table temp_test9;
 
select * from temp_test9;
 
temp_test9.month	temp_test9.shop	temp_test9.money
2019-01	a	10
2019-02	a	20
2019-03	a	30
2019-01	b	10
2019-02	b	20
2019-03	b	30

4.1 Cumulative sum(xx) over

Ask store a to accumulate monthly turnover from January to that month, that is:

  • January data is January turnover
  • February is January+February turnover
  • March is January+February+March turnover
SELECT month
      ,SUM(MONEY) OVER (
            ORDER BY month -- Sorts by month, then defaults to cumulative summation from the starting line to the current line
            ) AS money_leiji
FROM temp_test9
 WHERE shop ='a'; --windowing functions do not need to write group by

Result:
month	money_leiji
2019-01	10.0
2019-02	30.0
2019-03	60.0

Also calculate the total turnover of stores a and b from January to that month

SELECT shop
      ,month
      ,SUM(MONEY) OVER (
            PARTITION BY shop ORDER BY month --Follow first shop Grouped, then sorted by month within each group, and finally accumulated from the starting line to the current line by default
            ) AS money_leiji
FROM temp_test9;


//Result:
shop	month	money_leiji
a	2019-01	10.0
a	2019-02	30.0
a	2019-03	60.0
b	2019-01	10.0
b	2019-02	30.0
b	2019-03	60.0

Group summation alone will not cause errors, but that makes no sense. Instead, use group by directly.

SELECT shop
      ,month
      ,SUM(MONEY) OVER (PARTITION BY shop) AS money_leiji
FROM temp_test9;

shop	month	money_leiji
a	2019-03	60.0
a	2019-02	60.0
a	2019-01	60.0
b	2019-03	60.0
b	2019-02	60.0
b	2019-01	60.0


//If you don't write month, you will also find the sum in three groups
SELECT shop
      ,SUM(money) OVER (
            PARTITION BY shop
            ) AS money_leiji
FROM temp_test9;

shop	money_leiji
a	60.0
a	60.0
a	60.0
b	60.0
b	60.0
b	60.0

In addition to the cumulative sum methods commonly used above, Hive also allows you to customize the cumulative method.The default cumulative method is to accumulate from the starting line to the current line. There are four main ways to customize the cumulative method:

  1. Current Line + Forward Lines
  2. Current Line + Backward Lines
  3. Current Line + Forward Line + Backward Line
  4. Current Line + All Backward Lines

There are few scenarios where these customized cumulative methods can be applied and you can't master them. Here's just the syntax:

  • Current Line + Forward Lines
1 PRECEDING is one line ahead, where you can specify any number you want

OVER (PARTITION BY column name 1 ORDER BY column name 2 ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
  • Current Line + Backward Lines
1 FOLLOWING refers to the next line, which specifies any number you want

OVER (PARTITION BY column name 1 ORDER BY column name 2 ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING)
  • Current Line + Forward Line + Backward Line
1 PRECEDING is one line forward and 1 FOLLOWING is one line backward. Both values can specify any number you want.

OVER (PARTITION BY column name 1 ORDER BY column name 2 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
  • Current Line + All Backward Lines
OVER (PARTITION BY Column Name 1 ORDER BY Column Name 2 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)

 

4.2 Cumulative Average avg(xx) over

The other open window functions and the cumulative logic of summation are the same.

Give an example:

At the same time, the average turnover of stores a and b from January to that month is calculated.

SELECT shop
      ,month
      ,AVG(MONEY) OVER (
            PARTITION BY shop ORDER BY month --Follow first shop Grouping, then sorting by month within each group, and finally cumulative mean from start line to current line by default
            ) AS money_leiji
FROM temp_test9;


//Result:
shop	month	money_leiji
a	2019-01	10.0
a	2019-02	15.0
a	2019-03	20.0
b	2019-01	10.0
b	2019-02	15.0
b	2019-03	20.0

 

4.3 Cumulative max(xx) over

Give an example:

At the same time, the maximum turnover of stores a and b is calculated from January to that month.

SELECT shop
      ,month
      ,MAX(MONEY) OVER (
            PARTITION BY shop ORDER BY month --Follow first shop Grouped, then sorted by month within each group, and finally accumulated maximum from start line to current line by default
            ) AS money_leiji_max
FROM temp_test9;


//Result:
shop	month	money_leiji_max
a	2019-01	10
a	2019-02	20
a	2019-03	30
b	2019-01	10
b	2019-02	20
b	2019-03	30

 

4.4 Cumulative Minimum min(xx) over

Give an example:

At the same time, the minimum turnover of stores a and b is calculated from January to that month.

SELECT shop
      ,month
      ,MIN(MONEY) OVER (
            PARTITION BY shop ORDER BY month --Follow first shop Grouped, then sorted by month within each group, and finally accumulated minimum from start line to current line by default
            ) AS money_leiji_min
FROM temp_test9;


//Result:
shop	month	money_leiji_min
a	2019-01	10
a	2019-02	10
a	2019-03	10
b	2019-01	10
b	2019-02	10
b	2019-03	10

 

If you can see the students here, please give them a compliment in the upper right corner, 3Q~

Posted by AdamSnow on Mon, 09 Sep 2019 18:12:47 -0700