Summary of common HIVE functions

Keywords: hive SQL

Hive ships with a large number of built-in functions, and in the past I always looked them up piecemeal as needed. This post summarizes the commonly used ones for easy reference in the future.

This article mainly covers the following topics:

1. Hive function introduction and built-in function view

For more information, see the Hive official documentation:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

1) View the functions provided by the system

hive> show functions;

2) Displays the usage of the built-in function

hive> desc function upper;

3) Displays the usage of the built-in function in detail

hive> desc function extended upper;

2. Introduction to common functions

Relational operations

1. Equality comparison: =

Syntax: A = B
Operation type: all primitive types
Description: TRUE if expression A equals expression B; otherwise FALSE

hive> select 1 from tableName where 1=1;

2. Inequality comparison: <>

Syntax: A <> B
Operation type: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is not equal to B; otherwise FALSE

hive> select 1 from tableName where 1 <> 2;

3. Less-than comparison: <

Syntax: A < B
Operation type: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is less than B; otherwise FALSE

hive> select 1 from tableName where 1 < 2;

4. Less-than-or-equal comparison: <=

Syntax: A <= B
Operation type: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is less than or equal to B; otherwise FALSE

hive> select 1 from tableName where 1 <= 1;

5. Greater-than comparison: >

Syntax: A > B
Operation type: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is greater than B; otherwise FALSE

hive> select 1 from tableName where 2 > 1;

6. Greater-than-or-equal comparison: >=

Syntax: A >= B
Operation type: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is greater than or equal to B; otherwise FALSE

hive> select 1 from tableName where 1 >= 1;
1

Note: be careful when comparing strings; time values stored as strings should be converted with to_date before comparing

hive> select * from tableName;
2011111209 00:00:00     2011111209

hive> select a, b, a<b, a>b, a=b from tableName;
2011111209 00:00:00     2011111209      false   true    false

7. Null value judgment: IS NULL

Syntax: A IS NULL
Operation type: all types
Description: TRUE if the value of expression A is NULL; Otherwise, FALSE

hive> select 1 from tableName where null is null;

8. Non null judgment: IS NOT NULL

Syntax: A IS NOT NULL
Operation type: all types
Description: FALSE if the value of expression A is NULL; Otherwise, TRUE

hive> select 1 from tableName where 1 is not null;

9. LIKE comparison: LIKE

Syntax: A LIKE B
Operation type: strings
Description: NULL if string A or pattern B is NULL; TRUE if string A matches the SQL simple pattern B; otherwise FALSE. In B, the character '_' matches any single character, while the character '%' matches any number of characters.

hive> select 1 from tableName where 'football' like 'foot____';

Note: for negation, use NOT A LIKE B

hive> select 1 from tableName where NOT 'football' like 'fff%'; 

10. Java-regex LIKE operation: RLIKE

Syntax: A RLIKE B
Operation type: strings
Description: NULL if string A or string B is NULL; TRUE if string A matches the JAVA regular expression B; otherwise FALSE.

hive> select 1 from tableName where 'footbar' rlike '^f.*r$';

1

Note: judge whether a string is all numbers:

hive>select 1 from tableName where '123456' rlike '^\\d+$';
1

hive> select 1 from tableName where '123456aa' rlike '^\\d+$';
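The all-digits check relies on Java-style regex matching with explicit `^`/`$` anchors. The same semantics can be reproduced outside Hive; a minimal Python sketch (the `rlike` helper is illustrative, not part of Hive):

```python
import re

def rlike(s: str, pattern: str) -> bool:
    """Approximate Hive's RLIKE: true if the regex matches anywhere in s."""
    return re.search(pattern, s) is not None

print(rlike('123456', r'^\d+$'))    # all digits -> True
print(rlike('123456aa', r'^\d+$'))  # trailing letters -> False
```

Because RLIKE matches anywhere in the string, the `^` and `$` anchors are what make this a whole-string check.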

11. REGEXP operation: REGEXP

Syntax: A REGEXP B
Operation type: strings
Description: the function is the same as RLIKE

hive> select 1 from tableName where 'footbar' REGEXP '^f.*r$';

1

Mathematical operations:

1. Addition operation: +

Syntax: A + B
Operation type: all numeric types
Description: returns the result of adding A and B. The result type is the smallest common parent type of A's and B's types (see the data type inheritance hierarchy). For example, int + int generally yields int, while int + double generally yields double

hive> select 1 + 9 from tableName;
10

hive> create table addTable as select 1 + 1.2 from tableName;
hive> describe addTable;
_c0 double

2. Subtraction operation: -

Syntax: A - B
Operation type: all numeric types
Description: returns the result of subtracting B from A. The result type is the smallest common parent type of A's and B's types (see the data type inheritance hierarchy). For example, int - int generally yields int, while int - double generally yields double

hive> select 10 - 5 from tableName;
5
hive> create table subTable as select 5.6 - 4 from tableName;
hive> describe subTable;
_c0 double

3. Multiplication operation: *

Syntax: A * B
Operation type: all numeric types
Description: returns the result of multiplying A and B. The result type is the smallest common parent type of A's and B's types (see the data type inheritance hierarchy). Note that if the product exceeds the range of the default result type, you need to cast to a larger type

hive> select 40 * 5 from tableName;

200

4. Division operation: /

Syntax: A / B
Operation type: all numeric types
Description: returns the result of A divided by B. The numeric type of the result is double

hive> select 40 / 5 from tableName;

8.0

Note: the highest-precision numeric type in hive is double, which holds only about 15-16 significant digits; pay special attention when dividing

hive>select ceil(28.0/6.999999999999999999999) from tableName limit 1;
4

hive>select ceil(28.0/6.99999999999999) from tableName limit 1;
5
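This pitfall is plain IEEE-754 double behavior rather than anything Hive-specific, so the two results above can be reproduced in any language; a quick Python check (a sketch, not Hive itself):

```python
import math

# 6.999999999999999999999 has more digits than a double can hold,
# so it rounds to exactly 7.0 and the quotient is exactly 4.0.
print(math.ceil(28.0 / 6.999999999999999999999))  # 4

# 6.99999999999999 is representable as slightly less than 7,
# so the quotient lands just above 4.0 and ceil pushes it to 5.
print(math.ceil(28.0 / 6.99999999999999))         # 5
```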

5. Modulo operation: %

Syntax: A % B
Operation type: all numeric types
Description: returns the remainder of A divided by B. The result type is the smallest common parent type of A's and B's types (see the data type inheritance hierarchy).

hive> select 41 % 5 from tableName;
1

hive> select 8.4 % 4 from tableName;
0.40000000000000036

Note: precision is a common pitfall in hive. For operations like this, it is best to fix the precision with round

hive> select round(8.4 % 4 , 2) from tableName;
0.4
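The stray digits come from binary floating point, so the same behavior appears in any double arithmetic; a Python illustration of the two results above:

```python
# 8.4 cannot be represented exactly as a double, so the remainder
# carries a tiny error, just as in the Hive example above.
print(8.4 % 4)            # slightly above 0.4
print(round(8.4 % 4, 2))  # 0.4
```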

6. Bitwise AND operation: &

Syntax: A & B
Operation type: all numeric types
Description: returns the bitwise AND of A and B. The result type is the smallest common parent type of A's and B's types (see the data type inheritance hierarchy).

hive> select 4 & 8 from tableName;
0

hive> select 6 & 4 from tableName;
4

7. Bitwise OR operation: |

Syntax: A | B
Operation type: all numeric types
Description: returns the bitwise OR of A and B. The result type is the smallest common parent type of A's and B's types (see the data type inheritance hierarchy).

hive> select 4 | 8 from tableName;
12

hive> select 6 | 8 from tableName;
14

8. Bitwise XOR operation: ^

Syntax: A ^ B
Operation type: all numeric types
Description: returns the bitwise XOR of A and B. The result type is the smallest common parent type of A's and B's types (see the data type inheritance hierarchy).

hive> select 4 ^ 8 from tableName;
12

hive> select 6 ^ 4 from tableName;
2

9. Bitwise NOT operation: ~

Syntax: ~A
Operation type: all numeric types
Description: returns the bitwise inversion of A. The result type is the same as the type of A.

hive> select ~6 from tableName;
-7

hive> select ~4 from tableName;
-5

Logical operations:

1. Logic AND operation: AND

Syntax: A AND B
Operation type: boolean
Note: TRUE if both A and B are TRUE; Otherwise, it is FALSE. NULL if A is NULL or B is NULL

hive> select 1 from tableName where 1=1 and 2=2;
1

2. Logical OR operation: OR

Syntax: A OR B
Operation type: boolean
Note: TRUE if A is TRUE, B is TRUE, or both A and B are TRUE; Otherwise, FALSE

hive> select 1 from tableName where 1=2 or 2=2;

1

3. Logical NOT operation: NOT

Syntax: NOT A
Operation type: boolean
Description: TRUE if A is FALSE; FALSE if A is TRUE; NULL if A is NULL

hive> select 1 from tableName where not 1=2;

1

Numerical calculation

1. Rounding function: round***

Syntax: round(double a)
Return value: BIGINT
Description: returns the integer part of the double value, rounded half up

hive> select round(3.1415926) from tableName;
3

hive> select round(3.5) from tableName;
4

hive> create table roundTable as select round(9542.158) from tableName;
hive> describe roundTable;
_c0 bigint

2. Specify precision rounding function: round***

Syntax: round(double a, int d)
Return value: DOUBLE
Description: returns a rounded to d decimal places, as a double

hive> select round(3.1415926,4) from tableName;
3.1416

3. Rounding down function: floor***

Syntax: floor(double a)
Return value: BIGINT
Description: returns the maximum integer equal to or less than the double variable

hive> select floor(3.1415926) from tableName;
3

hive> select floor(25) from tableName;
25

4. Rounding up function: ceil***

Syntax: ceil(double a)
Return value: BIGINT
Description: returns the smallest integer equal to or greater than the double variable

hive> select ceil(3.1415926) from tableName;
4

hive> select ceil(46) from tableName;
46

5. Rounding up function: ceiling***

Syntax: ceiling(double a)
Return value: BIGINT
Note: it has the same function as ceil

hive> select ceiling(3.1415926) from tableName;
4

hive> select ceiling(46) from tableName;
46

6. Random number function: rand***

Syntax: rand(),rand(int seed)
Return value: double
Description: returns a random number in the range 0 to 1. If seed is specified, it produces a deterministic (repeatable) sequence of random numbers

hive> select rand() from tableName;
0.5577432776034763

hive> select rand() from tableName;
0.6638336467363424

hive> select rand(100) from tableName;
0.7220096548596434

hive> select rand(100) from tableName;
0.7220096548596434

7. Natural exponential function: exp

Syntax: exp(double a)
Return value: double
Description: returns e raised to the power of a

hive> select exp(2) from tableName;
7.38905609893065

Natural logarithm function: ln

Syntax: ln(double a)
Return value: double
Description: returns the natural logarithm of a

hive> select ln(7.38905609893065) from tableName;
2.0

8. Base 10 logarithm function: log10

Syntax: log10(double a)
Return value: double
Description: returns the logarithm of a based on 10

hive> select log10(100) from tableName;

2.0

9. Base 2 logarithmic function: log2

Syntax: log2(double a)
Return value: double
Description: returns the logarithm of a based on 2

hive> select log2(8) from tableName;

3.0

10. Log function: log

Syntax: log(double base, double a)
Return value: double
Description: returns the logarithm of a based on base

hive> select log(4,256) from tableName;

4.0

11. Power operation function: pow

Syntax: pow(double a, double p)
Return value: double
Description: returns the p-power of a

hive> select pow(2,4) from tableName;

16.0

12. power operation function: power

Syntax: power(double a, double p)
Return value: double
Description: returns the p-power of a, the same function as pow

hive> select power(2,4) from tableName;

16.0

13. Square function: sqrt

Syntax: sqrt(double a)
Return value: double
Description: returns the square root of a

hive> select sqrt(16) from tableName;

4.0

14. Binary function: bin

Syntax: bin(BIGINT a)
Return value: string
Description: returns the binary code representation of a

hive> select bin(7) from tableName;

111

15. Hex function: hex

Syntax: hex(BIGINT a)
Return value: string
Description: if the argument is of type int, returns the hexadecimal representation of a; if the argument is a string, returns the hexadecimal representation of that string

hive> select hex(17) from tableName;
11

hive> select hex('abc') from tableName;
616263

16. Reverse hex function: unhex

Syntax: unhex(string a)
Return value: string
Description: the inverse of hex; converts a hexadecimal string back to the original string

hive> select unhex('616263') from tableName;
abc

hive> select unhex('11') from tableName;
(an unprintable character, byte 0x11)

hive> select unhex(616263) from tableName;
abc

17. Binary conversion function: conv

Syntax: conv(BIGINT num, int from_base, int to_base)
Return value: string
Description: converts the number num from base from_base to base to_base

hive> select conv(17,10,16) from tableName;
11

hive> select conv(17,10,2) from tableName;
10001
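conv amounts to ordinary integer parsing and re-formatting; a minimal Python sketch of the same conversions (the `conv` helper is illustrative — bases 2-36, non-negative inputs only):

```python
def conv(num, from_base: int, to_base: int) -> str:
    """Rough stand-in for Hive's conv(): re-express num in another base."""
    digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    n = int(str(num), from_base)   # parse the value in the source base
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, to_base)  # peel off digits in the target base
        out.append(digits[r])
    return "".join(reversed(out))

print(conv(17, 10, 16))  # 11
print(conv(17, 10, 2))   # 10001
```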

18. Absolute value function: abs

Syntax: abs(double a) abs(int a)
Return value: double int
Description: returns the absolute value of the value a

hive> select abs(-3.9) from tableName;
3.9

hive> select abs(10.9) from tableName;
10.9

19. Positive remainder function: pmod

Syntax: pmod(int a, int b), pmod(double a, double b)
Return value: int double
Description: returns the positive remainder of a divided by b

hive> select pmod(9,4) from tableName;
1

hive> select pmod(-9,4) from tableName;
3
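pmod differs from the plain % operator for a negative dividend (Hive's %, like Java's, would give -9 % 4 = -1). A Python sketch of the positive-remainder rule (the `pmod` helper is illustrative; note that Python's own % already yields a non-negative result when b is positive):

```python
def pmod(a: int, b: int) -> int:
    """Positive remainder: ((a % b) + b) % b, mirroring Hive's pmod."""
    return ((a % b) + b) % b

print(pmod(9, 4))   # 1
print(pmod(-9, 4))  # 3
```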

20. Sine function: sin

Syntax: sin(double a)
Return value: double
Description: returns the sine of a

hive> select sin(0.8) from tableName;

0.7173560908995228

21. Inverse sine function: asin

Syntax: asin(double a)
Return value: double
Description: returns the arcsine value of a

hive> select asin(0.7173560908995228) from tableName;
0.8

22. Cosine function: cos

Syntax: cos(double a)
Return value: double
Description: returns the cosine value of a

hive> select cos(0.9) from tableName;
0.6216099682706644

23. Inverse cosine function: acos

Syntax: acos(double a)
Return value: double
Description: returns the inverse cosine of a

hive> select acos(0.6216099682706644) from tableName;

0.9

24. Positive function: positive

Syntax: positive(int a), positive(double a)
Return value: int double
Description: returns a

hive> select positive(-10) from tableName;
-10

hive> select positive(12) from tableName;
12

25. Negative function: negative

Syntax: negative(int a), negative(double a)
Return value: int double
Description: returns -a

hive> select negative(-5) from tableName;
5

hive> select negative(8) from tableName;
-8

Date function

1. UNIX timestamp to date function: from_unixtime ***

Syntax: from_unixtime(bigint unixtime[, string format])
Return value: string
Description: converts a UNIX timestamp (the number of seconds since 1970-01-01 00:00:00 UTC) to a time string in the current time zone, in the given format

hive> select from_unixtime(1323308943,'yyyyMMdd') from tableName;

20111208
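The same conversion can be checked with Python's datetime module. Hive formats in the session's local time zone; UTC is pinned here to keep the output deterministic (the article's 20111208 happens to match in UTC as well):

```python
from datetime import datetime, timezone

# 1323308943 seconds after the epoch, rendered as yyyyMMdd in UTC.
ts = datetime.fromtimestamp(1323308943, tz=timezone.utc)
print(ts.strftime('%Y%m%d'))  # 20111208
```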

2. Get current UNIX timestamp function: unix_timestamp ***

Syntax: unix_timestamp()
Return value: bigint
Description: get the UNIX timestamp of the current time zone

hive> select unix_timestamp() from tableName;

1323309615

3. Date to UNIX timestamp function: unix_timestamp ***

Syntax: unix_timestamp(string date)
Return value: bigint
Description: converts a date in "yyyy-MM-dd HH:mm:ss" format to a UNIX timestamp. If the conversion fails, 0 is returned.

hive> select unix_timestamp('2011-12-07 13:01:03') from tableName;

1323234063

4. Specified format date to UNIX timestamp function: unix_timestamp ***

Syntax: unix_timestamp(string date, string pattern)
Return value: bigint
Description: convert the date in pattern format to UNIX timestamp. If the conversion fails, 0 is returned.

hive> select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss') from tableName;

1323234063

5. Date time to date function: to_date ***

Syntax: to_date(string timestamp)
Return value: string
Description: returns the date part of the date time field.

hive> select to_date('2011-12-08 10:03:01') from tableName;

2011-12-08

6. Date to year function: year***

Syntax: year(string date)
Return value: int
Description: returns the year in the date.

hive> select year('2011-12-08 10:03:01') from tableName;
2011

hive> select year('2012-12-08') from tableName;
2012

7. Date to month function: month***

Syntax: month (string date)
Return value: int
Description: returns the month in the date.

hive> select month('2011-12-08 10:03:01') from tableName;

12

hive> select month('2011-08-08') from tableName;

8

8. Date to day function: day****

Syntax: day (string date)
Return value: int
Description: returns the days in the date.

hive> select day('2011-12-08 10:03:01') from tableName;
8

hive> select day('2011-12-24') from tableName;
24

9. Date to hour function: hour***

Syntax: hour (string date)
Return value: int
Description: returns the hour in the date.

hive> select hour('2011-12-08 10:03:01') from tableName;

10

10. Date to minute function: minute

Syntax: minute (string date)
Return value: int
Description: returns the minute in the date.

hive> select minute('2011-12-08 10:03:01') from tableName;

3

11. Date to second function: Second

Syntax: second (string date)
Return value: int
Description: returns the second in the date.

hive> select second('2011-12-08 10:03:01') from tableName;

1

12. Date to week function: weekofyear

Syntax: weekofyear (string date)
Return value: int
Description: returns the week of the year in which the date falls.

hive> select weekofyear('2011-12-08 10:03:01') from tableName;

49

13. Date comparison function: datediff***

Syntax: datediff(string enddate, string startdate)
Return value: int
Description: returns the number of days from the end date minus the start date.

hive> select datediff('2012-12-08','2012-05-09') from tableName;

213

14. Date addition function: date_add ***

Syntax: date_add(string startdate, int days)
Return value: string
Description: returns the date days days after startdate.

hive> select date_add('2012-12-08',10) from tableName;

2012-12-18

15. Date reduction function: date_sub ***

Syntax: date_sub (string startdate, int days)
Return value: string
Description: returns the date days days before startdate.

hive> select date_sub('2012-12-08',10) from tableName;

2012-11-28
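datediff, date_add, and date_sub map directly onto ordinary date arithmetic; a Python sketch of the three examples above:

```python
from datetime import date, timedelta

# datediff('2012-12-08', '2012-05-09')
print((date(2012, 12, 8) - date(2012, 5, 9)).days)  # 213

# date_add('2012-12-08', 10) and date_sub('2012-12-08', 10)
print(date(2012, 12, 8) + timedelta(days=10))       # 2012-12-18
print(date(2012, 12, 8) - timedelta(days=10))       # 2012-11-28
```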

Conditional function

1. If function: if***

Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
Return value: T
Description: returns valueTrue when testCondition is TRUE; otherwise returns valueFalseOrNull

hive> select if(1=2,100,200) from tableName;
200

hive> select if(1=1,100,200) from tableName;
100

2. Non-null lookup function: COALESCE

Syntax: COALESCE(T v1, T v2, ...)
Return value: T
Description: returns the first non-NULL value among the arguments; returns NULL if all arguments are NULL

hive> select COALESCE(null,'100','50') from tableName;

100
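COALESCE is simply "first non-NULL argument"; a one-function Python sketch (with None standing in for SQL NULL, and the `coalesce` helper purely illustrative):

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are."""
    return next((v for v in values if v is not None), None)

print(coalesce(None, '100', '50'))  # 100
print(coalesce(None, None))         # None
```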

3. Conditional judgment function: CASE***

Syntax: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
Return value: T
Note: if a equals b, c is returned; If a equals d, then e is returned; Otherwise, return f

hive> Select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
mary

hive> Select case 200 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
tim

4. Conditional judgment function: CASE****

Syntax: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
Return value: T
Note: if a is TRUE, b is returned; If c is TRUE, d is returned; Otherwise, return e

hive> select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
mary

hive> select case when 1=1 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
tom

String function

1. String length function: length

Syntax: length(string A)
Return value: int
Description: returns the length of string A

hive> select length('abcedfg') from tableName;
7

2. String inversion function: reverse

Syntax: reverse(string A)
Return value: string
Description: returns the inversion result of string A

hive> select reverse('abcedfg') from tableName;
gfdecba

3. String concatenation function: concat***

Syntax: concat(string A, string B...)
Return value: string
Description: concatenates the input strings; any number of input strings is supported

hive> select concat('abc','def','gh') from tableName;
abcdefgh

4. Delimited string concatenation function: concat_ws ***

Syntax: concat_ws(string SEP, string A, string B…)
Return value: string
Description: returns the result of the input string connection. SEP represents the separator between each string

hive> select concat_ws(',','abc','def','gh')from tableName;
abc,def,gh

5. String interception function: substr,substring****

Syntax: substr(string A, int start),substring(string A, int start)
Return value: string
Description: returns the string from the start position to the end of string A

hive> select substr('abcde',3) from tableName;
cde

hive> select substring('abcde',3) from tableName;
cde

hive> select substr('abcde',-1) from tableName; (a negative start counts from the end, same as ORACLE)
e

6. String interception function: substr,substring****

Syntax: substr(string A, int start, int len),substring(string A, int start, int len)
Return value: string
Description: returns the string A with the length of len starting from the start position

hive> select substr('abcde',3,2) from tableName;
cd

hive> select substring('abcde',3,2) from tableName;
cd

hive>select substring('abcde',-2,2) from tableName;
de

7. String to uppercase function: upper,ucase****

Syntax: upper(string A) ucase(string A)
Return value: string
Description: returns the uppercase format of string A

hive> select upper('abSEd') from tableName;
ABSED

hive> select ucase('abSEd') from tableName;
ABSED

8. String to lowercase function: lower,lcase***

Syntax: lower(string A) lcase(string A)
Return value: string
Description: returns the lowercase format of string A

hive> select lower('abSEd') from tableName;
absed

hive> select lcase('abSEd') from tableName;
absed

9. Whitespace-trimming function: trim***

Syntax: trim(string A)
Return value: string
Description: remove the spaces on both sides of the string

hive> select trim(' abc ') from tableName;
abc

10. Left trim function: ltrim

Syntax: ltrim(string A)
Return value: string
Description: remove the space to the left of the string

hive> select ltrim(' abc ') from tableName;
abc

11. Right trim function: rtrim

Syntax: rtrim(string A)
Return value: string
Description: remove the space on the right of the string

hive> select rtrim(' abc ') from tableName;
abc

12. Regular expression replacement function: regexp_replace

Syntax: regexp_replace(string A, string B, string C)
Return value: string
Description: replaces the parts of string A that match the java regular expression B with C. Note that escape characters are needed in some cases; this is similar to Oracle's regexp_replace function.

hive> select regexp_replace('foobar', 'oo|ar', '') from tableName;
fb

13. Regular expression parsing function: regexp_extract

Syntax: regexp_extract(string subject, string pattern, int index)
Return value: string
Description: matches the string subject against the pattern regular expression and returns the capture group specified by index (0 returns the entire match).

hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 1) from tableName;
the

hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 2) from tableName;
bar

hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 0) from tableName;
foothebar
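Since regexp_extract follows Java regex capture-group numbering, the three index values above correspond to groups 1, 2, and 0 in any engine with the same semantics; a Python illustration:

```python
import re

m = re.search(r'foo(.*?)(bar)', 'foothebar')
print(m.group(1))  # the        -> regexp_extract(..., 1)
print(m.group(2))  # bar        -> regexp_extract(..., 2)
print(m.group(0))  # foothebar  -> regexp_extract(..., 0)
```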

Note that escape characters are needed in some cases; in the example below the equals sign is escaped with a double backslash, following java regular expression rules.

select data_field,
regexp_extract(data_field,'.*?bgStart\\=([^&]+)',1) as aaa,
regexp_extract(data_field,'.*?contentLoaded_headStart\\=([^&]+)',1) as bbb,
regexp_extract(data_field,'.*?AppLoad2Req\\=([^&]+)',1) as ccc
from pt_nginx_loginlog_st
where pt = '2012-03-26' limit 2;

14. URL parsing function: parse_url ****

Syntax: parse_url(string urlString, string partToExtract [, string keyToExtract])
Return value: string
Description: returns the part specified in the URL. Valid values of partToExtract are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO

hive> select parse_url('https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') from tableName;
www.tableName.com

hive> select parse_url('https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') from tableName;
v1

15. JSON parsing function: get_json_object ****

Syntax: get_json_object(string json_string, string path)
Return value: string
Description: parses the json string json_string and returns the content specified by path. Returns NULL if the input json string is invalid.

hive> select get_json_object('{"store":{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} },"email":"amy@only_for_json_udf_test.net","owner":"amy"}','$.owner') from tableName;
amy
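The $.owner path selects the top-level owner field. The same lookup expressed with Python's json module (a simplified stand-in — get_json_object supports a richer path syntax than plain key access):

```python
import json

doc = ('{"store":{"fruit":[{"weight":8,"type":"apple"},'
       '{"weight":9,"type":"pear"}],'
       '"bicycle":{"price":19.95,"color":"red"}},'
       '"email":"amy@only_for_json_udf_test.net","owner":"amy"}')

data = json.loads(doc)
print(data['owner'])                      # amy    ($.owner)
print(data['store']['fruit'][0]['type'])  # apple  ($.store.fruit[0].type)
```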

16. space string function: space

Syntax: space(int n)
Return value: string
Description: returns a string of n spaces

hive> select space(10) from tableName;
hive> select length(space(10)) from tableName;
10

17. Repeat string function: repeat***

Syntax: repeat(string str, int n)
Return value: string
Description: returns the str string after n repetitions

hive> select repeat('abc',5) from tableName;
abcabcabcabcabc

18. First character function: ascii

Syntax: ascii(string str)
Return value: int
Description: returns the ascii code of the first character of string str

hive> select ascii('abcde') from tableName;
97

19. Left complement function: lpad

Syntax: lpad(string str, int len, string pad)
Return value: string
Description: left-pads str with pad to a length of len

hive> select lpad('abc',10,'td') from tableName;
tdtdtdtabc
Note: unlike GP and ORACLE, the pad argument cannot be omitted

20. Right complement function: rpad

Syntax: rpad(string str, int len, string pad)
Return value: string
Description: right-pads str with pad to a length of len

hive> select rpad('abc',10,'td') from tableName;
abctdtdtdt
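Both padding functions repeat pad as many times as needed and truncate to exactly len characters; a Python sketch (the helpers are illustrative — Python's own str.ljust/rjust only accept a single fill character, hence the manual version):

```python
def lpad(s: str, n: int, pad: str) -> str:
    """Left-pad s with repeated pad to exactly n characters."""
    fill = (pad * n)[:max(n - len(s), 0)]
    return (fill + s)[:n]

def rpad(s: str, n: int, pad: str) -> str:
    """Right-pad s with repeated pad to exactly n characters."""
    fill = (pad * n)[:max(n - len(s), 0)]
    return (s + fill)[:n]

print(lpad('abc', 10, 'td'))  # tdtdtdtabc
print(rpad('abc', 10, 'td'))  # abctdtdtdt
```

Note the truncation: when len is shorter than the input string, the string itself is cut down to len characters.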

21. Split string function: split****

Syntax: split(string str, string pat)
Return value: array
Description: splits str around matches of pat and returns the resulting array of strings

hive> select split('abtcdtef','t') from tableName;
["ab","cd","ef"]

22. Set lookup function: find_in_set

Syntax: find_in_set(string str, string strList)
Return value: int
Description: returns the position of the first occurrence of str in strList, a comma-separated string; returns 0 if str is not found

hive> select find_in_set('ab','ef,ab,de') from tableName;
2

hive> select find_in_set('at','ef,ab,de') from tableName;
0
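find_in_set's 1-based position (with 0 when absent) is easy to model; a Python sketch (the helper is illustrative, not part of Hive):

```python
def find_in_set(s: str, str_list: str) -> int:
    """1-based position of s in a comma-separated list, or 0 if absent."""
    items = str_list.split(',')
    return items.index(s) + 1 if s in items else 0

print(find_in_set('ab', 'ef,ab,de'))  # 2
print(find_in_set('at', 'ef,ab,de'))  # 0
```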

Aggregate statistical functions

1. Number statistics function: count***

Syntax: count(*), count(expr), count(DISTINCT expr[, expr...])
Return value: int
Description: count(*) counts the number of rows retrieved, including NULL rows; count(expr) returns the number of non-NULL values of the given expression; count(DISTINCT expr[, expr...]) returns the number of distinct non-NULL values of the given expression

hive> select count(*) from tableName;
20

hive> select count(distinct t) from tableName;
10

2. sum statistics function: sum***

Syntax: sum(col), sum(DISTINCT col)
Return value: double
Description: sum(col) sums col over the result set; sum(DISTINCT col) sums the distinct values of col

hive> select sum(t) from tableName;
100

hive> select sum(distinct t) from tableName;
70

3. Average statistical function: avg***

Syntax: avg(col), avg(DISTINCT col)
Return value: double
Description: avg(col) is the average of col over the result set; avg(DISTINCT col) is the average of the distinct values of col

hive> select avg(t) from tableName;
50

hive> select avg (distinct t) from tableName;
30

4. Minimum statistical function: min***

Syntax: min(col)
Return value: double
Description: the minimum value of col field in the statistics result set

hive> select min(t) from tableName;

20

5. Maximum statistical function: max***

Syntax: max(col)
Return value: double
Description: the maximum value of col field in the statistics result set

hive> select max(t) from tableName;

120

6. Population variance function: var_pop

Syntax: var_pop(col)
Return value: double
Description: the population variance of the non-NULL values of col in the result set (NULLs are ignored)

7. Sample variance function: var_samp

Syntax: var_samp(col)
Return value: double
Description: the sample variance of the non-NULL values of col in the result set (NULLs are ignored)

8. Population standard deviation function: stddev_pop

Syntax: stddev_pop(col)
Return value: double
Description: computes the population standard deviation; the return value is the square root of the var_pop function

9. Sample standard deviation function: stddev_samp

Syntax: stddev_samp (col)
Return value: double
Note: this function calculates the sample standard deviation

10. Percentile function: percentile

Syntax: percentile(BIGINT col, p)
Return value: double
Description: returns the exact pth percentile; p must be between 0 and 1. The col field currently only supports integers, not floating-point numbers

11. Percentile function: percentile

Syntax: percentile(BIGINT col, array(p1 [, p2]...))
Return value: array
Description: same as above, but accepts multiple percentiles and returns an array of the corresponding percentile values.

hive> select percentile(score, array(0.2, 0.4)) from tableName; -- returns the values at the 0.2 and 0.4 percentiles

12. Approximate median function: percentile_approx

Syntax: percentile_approx(DOUBLE col, p [, B])
Return value: double
Description: returns an approximate pth percentile; p must be between 0 and 1. The return type is double, and the col field supports floating-point types. The parameter B controls the memory cost of the approximation: the larger B, the more accurate the result; the default is 10000. When the number of distinct values in col is smaller than B, the result is the exact percentile

13. Approximate median function: percentile_approx

Syntax: percentile_approx(DOUBLE col, array(p1 [, p2]…) [, B])
Return value: array
Description: same as above, but accepts multiple percentiles and returns an array of the corresponding percentile values.

14. Histogram: histogram_numeric

Syntax: histogram_numeric(col, b)
Return value: array<struct<x:double, y:double>>
Description: computes a histogram of col using b bins.

hive> select histogram_numeric(100,5) from tableName;

[{"x":100.0,"y":1.0}]

Composite type build operation

1. map type build: map****

Syntax: map (key1, value1, key2, value2,...)
Description: build the map type according to the entered key and value pairs

hive> create table mapTable as select map('100','tom','200','mary') as t from tableName;

hive> describe mapTable;
t map<string,string>

hive> select t from mapTable;
{"100":"tom","200":"mary"}

2. Struct type construction: struct

Syntax: struct(val1, val2, val3,...)
Description: construct the struct type of the structure according to the input parameters

hive> create table struct_table as select struct('tom','mary','tim') as t from tableName;

hive> describe struct_table;
t struct<col1:string ,col2:string,col3:string>

hive> select t from struct_table;
{"col1":"tom","col2":"mary","col3":"tim"}

3. Array type build: array

Syntax: array(val1, val2,...)
Description: build the array type according to the input parameters

hive> create table arr_table as select array("tom","mary","tim") as t from tableName;

hive> describe arr_table;
t array<string>

hive> select t from arr_table;
["tom","mary","tim"]

Complex type access operation

1. array type access: A[n]

Syntax: A[n]
Operation type: A is array type and n is int type
Description: returns the nth element of array A. The starting index of the array is 0. For example, if A is an array with values ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'

hive> create table arr_table2 as select array("tom","mary","tim") as t

from tableName;

hive> select t[0],t[1] from arr_table2;

tom mary

2. map type access: M[key]

Syntax: M[key]
Operation type: M is the map type, and key is the key value in the map
Description: returns the value for the specified key in map M. For example, if M is a map with values {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'

hive> Create table map_table2 as select map('100','tom','200','mary') as t from tableName;

hive> select t['200'],t['100'] from map_table2;

mary tom

3. struct type access: S.x

Syntax: S.x
Operation type: S is struct type
Description: returns the x field in structure S. For example, for struct foobar {int foo, int bar}, foobar.foo returns the foo field in the struct

hive> create table str_table2 as select struct('tom','mary','tim') as t from tableName;

hive> describe str_table2;
t struct<col1:string ,col2:string,col3:string>
hive> select t.col1,t.col3 from str_table2;

tom tim

Complex type length statistical function

1. Map type length function: size(Map<K,V>)

Syntax: size(Map<K,V>)
Return value: int
Description: returns the length of the map type

hive> select size(t) from map_table2;
2

2.array type length function: size(Array)

Syntax: size(Array)
Return value: int
Description: returns the length of array type

hive> select size(t) from arr_table2;
3

3. Type conversion function

Type conversion function: cast
Syntax: cast(expr as <type>)
Return value: the target type <type>
Description: returns the converted data type

hive> select cast('1' as bigint) from tableName;
1

3. lateral view, explode, reflect and window functions in hive

1. Use the explode function to split Map and Array field data in a hive table

lateral view is used together with split, explode and other UDTFs. It can split one row of data into multiple rows and then aggregate the split data. lateral view first calls the UDTF for each row of the original table; the UDTF splits that row into one or more rows, and lateral view joins the results into a virtual table that supports aliases.

explode can split a complex array or map column in a hive table into multiple rows

Requirements: the data format is as follows

zhangsan child1,child2,child3,child4 k1:v1,k2:v2

lisi child5,child6,child7,child8 k3:v3,k4:v4

Fields are separated by \t, and all children need to be split into a single column

+----------+--+
| mychild |
+----------+--+
| child1 |
| child2 |
| child3 |
| child4 |
| child5 |
| child6 |
| child7 |
| child8 |
+----------+--+

The key and value of the map are also disassembled to obtain the following results

+-----------+-------------+--+
| mymapkey | mymapvalue |
+-----------+-------------+--+
| k1 | v1 |
| k2 | v2 |
| k3 | v3 |
| k4 | v4 |
+-----------+-------------+--+

Step 1: create hive database

Create hive database

hive (default)> create database hive_explode;

hive (default)> use hive_explode;

Step 2: create the hive table, and then use explode to split the map and array

hive (hive_explode)> create table t3(name string,children array<string>,address Map<string,string>)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':' stored as textFile;

Step 3: load data

Execute the following commands on node03 to create the table data file

mkdir -p /export/servers/hivedatas/

cd /export/servers/hivedatas/

vim maparray

zhangsan child1,child2,child3,child4 k1:v1,k2:v2

lisi child5,child6,child7,child8 k3:v3,k4:v4

Load data in hive table

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/maparray' into table t3;

Step 4: use explode to split the data in hive

Separate the data in the array

hive (hive_explode)> SELECT explode(children) AS myChild FROM t3;

Separate the data in the map

hive (hive_explode)> SELECT explode(address) AS (myMapKey, myMapValue) FROM t3;

2. Split json strings using explode

Requirements: some data formats are as follows:

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

The separator between fields is |

We need to parse out all the monthSales values as the following column

4900

2090

6987

Step 1: create hive table

hive (hive_explode)> create table explode_lateral_view
(`area` string,
`goods_id` string,
`sale_info` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS textfile;

Step 2: prepare data and load data

The prepared data are as follows

cd /export/servers/hivedatas

vim explode_json

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

Load the data into the hive table

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/explode_json' overwrite into table explode_lateral_view;

Step 3: split the Array with explode

hive (hive_explode)> select explode(split(goods_id,',')) as goods_id from explode_lateral_view;

Step 4: use explode to disassemble the Map

hive (hive_explode)> select explode(split(area,',')) as area from explode_lateral_view;

Step 5: disassemble the json field

hive (hive_explode)> select explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) as sale_info from explode_lateral_view;

Then we want to use get_json_object to obtain the data whose key is monthSales:

hive (hive_explode)> select get_json_object(explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')),'$.monthSales') as sale_info from explode_lateral_view;

This fails with: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions

A UDTF such as explode cannot be nested inside another function

If you try to select a second field alongside the UDTF, like this:

 select explode(split(area,',')) as area,goods_id from explode_lateral_view;

FAILED: SemanticException 1:40 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'goods_id'

A UDTF supports only a single expression in the SELECT clause; lateral view is needed to select additional fields

3. Using lateral view

Query multiple fields with lateral view

hive (hive_explode)> select goods_id2,sale_info from explode_lateral_view LATERAL VIEW explode(split(goods_id,','))goods as goods_id2;

lateral view explode(split(goods_id, ',')) goods acts as a virtual table that is joined (Cartesian product) with the original table explode_lateral_view.

It can also be used multiple times

hive (hive_explode)> select goods_id2,sale_info,area2
from explode_lateral_view
LATERAL VIEW explode(split(goods_id,','))goods as goods_id2
LATERAL VIEW explode(split(area,','))area as area2;

The result is the Cartesian product of the three tables.

Finally, we can completely convert a row of data in json format into a two-dimensional table with the following statement

hive (hive_explode)> select get_json_object(concat('{',sale_info_1,'}'),'$.source') as source,
get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales,
get_json_object(concat('{',sale_info_1,'}'),'$.userCount') as userCount,
get_json_object(concat('{',sale_info_1,'}'),'$.score') as score from explode_lateral_view
LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{'))sale_info as sale_info_1;

Summary:

lateral view usually appears together with a UDTF, to work around the restriction that a UDTF cannot be combined with other expressions in the SELECT list.
Multiple lateral views can be chained, producing something similar to a Cartesian product.
The OUTER keyword emits NULL for rows where the UDTF produces no output, preventing those rows from being lost.
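The OUTER behavior can be sketched with the empty-array example from the Hive manual, against the t3 table created above; without OUTER, the UDTF emits nothing and the whole row is dropped:

```sql
-- explode(array()) produces no rows, so OUTER keeps each source row with c = NULL
select name, c from t3 lateral view outer explode(array()) tmp as c;
```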

4. Rows to columns

1. Description of related functions

CONCAT(string A/col, string B/col...): returns the concatenation of the input strings; any number of input strings is supported;

CONCAT_WS(separator, str1, str2,...): a special form of CONCAT(). The first parameter is the separator placed between the remaining parameters, and it can be any string. If the separator is NULL, the result is also NULL. The function skips any NULL arguments after the separator;

COLLECT_SET(col): accepts only primitive types; it deduplicates and aggregates the values of a field into an array-typed field.
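A minimal illustration of CONCAT and CONCAT_WS with string literals (sample results shown in comments):

```sql
-- Simple concatenation
select concat('abc', 'def') from tableName;            -- abcdef
-- Separator-aware concatenation; NULL arguments after the separator are skipped
select concat_ws('|', 'a', null, 'c') from tableName;  -- a|c
```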

2. Data preparation

Table 6-6 data preparation

name	constellation	blood_type
 Sun WuKong	Aries	A
 Lao Wang	sagittarius	A
 Song song	Aries	B
 Zhu Bajie	Aries	A
 Miss Luo Yu feng	sagittarius	A

3. Demand

Classify people with the same constellation and blood type. The results are as follows:

sagittarius,A Lao Wang|Miss Luo Yu feng

Aries,A Sun WuKong|Zhu Bajie

Aries,B Song song

4. Create local constellation.txt and import data

Execute the following command on the node03 server to create the file. Note that fields are separated with \t

cd /export/servers/hivedatas

vim constellation.txt
Sun WuKong	Aries	A
 Lao Wang	sagittarius	A
 Song song	Aries	B       
Zhu Bajie	Aries	A
 Miss Luo Yu feng	sagittarius	A

5. Create hive table and import data

Create hive table and load data

hive (hive_explode)> create table person_info(

name string,

constellation string,

blood_type string)

row format delimited fields terminated by "\t";

Load data

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/constellation.txt' into table person_info;

6. Query data as required

hive (hive_explode)> select
t1.base,
concat_ws('|', collect_set(t1.name)) name
from
(select
name,
concat(constellation, "," , blood_type) base
from
person_info) t1
group by
t1.base;

5. Columns to rows

1. Function description

explode(col): splits a complex array or map column in a hive table into multiple rows.

LATERAL VIEW

Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias

Explanation: used with split, explode and other UDTFs. It can split one column of data into multiple rows and then aggregate the split data.

2. Data preparation

cd /export/servers/hivedatas
vim movie.txt

Use \t to split the data fields

<Suspect tracking> suspense,action,science fiction,plot
<Lie to me> suspense,gangster,action,psychology,plot
<War wolf 2> war,action,disaster

3. Demand

Expand the array data in the movie category. The results are as follows:

<Suspect tracking> suspense
<Suspect tracking> action
<Suspect tracking> science fiction
<Suspect tracking> plot
<Lie to me> suspense
<Lie to me> gangster
<Lie to me> action
<Lie to me> psychology
<Lie to me> plot
<War wolf 2> war
<War wolf 2> action
<War wolf 2> disaster

4. Create hive table and import data

Create hive table

create table movie_info(
movie string,
category array<string>)
row format delimited fields terminated by "\t"
collection items terminated by ",";

Load data

load data local inpath "/export/servers/hivedatas/movie.txt" into table movie_info;

5. Query data as required

select
movie,
category_name
from
movie_info lateral view explode(category) table_tmp as category_name;

6. reflect function

The reflect function can call Java methods directly from SQL, covering many cases that would otherwise require a custom UDF.

Use Max in java.lang.Math to find the maximum value in the two columns

Create hive table

create table test_udf(col1 int,col2 int) row format delimited fields terminated by ',';

Prepare data and load data

cd /export/servers/hivedatas

vim test_udf

1,2

4,3

6,4

7,5

5,6

Load data

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/test_udf' overwrite into table test_udf;

Use Max in java.lang.Math to find the maximum value in the two columns

hive (hive_explode)> select reflect("java.lang.Math","max",col1,col2) from test_udf;

Different records execute different java built-in functions

Create hive table

hive (hive_explode)> create table test_udf2(class_name string,method_name string,col1 int , col2 int) row format delimited fields terminated by ',';

Prepare data

cd /export/servers/hivedatas

vim test_udf2

java.lang.Math,min,1,2

java.lang.Math,max,2,3

Load data

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/test_udf2' overwrite into table test_udf2;

Execute query

hive (hive_explode)> select reflect(class_name,method_name,col1,col2) from test_udf2;

Judge whether it is a number

This uses functions from Apache Commons; the commons jars are already on Hadoop's classpath, so they can be used directly.

The usage is as follows:

select reflect("org.apache.commons.lang.math.NumberUtils","isNumber","123");

7. Window function and analysis function

hive also has many window functions and analysis functions, which are mainly used in the following scenarios

(1) Used for partition sorting
(2) Dynamic Group By
(3)Top N
(4) Cumulative calculation
(5) Hierarchical query

1. Create hive table and load data

Create table

hive (hive_explode)> create table order_detail(

user_id string,device_id string,user_type string,price double,sales int

)row format delimited fields terminated by ',';

Load data

cd /export/servers/hivedatas

vim order_detail

zhangsan,1,new,67.1,2

lisi,2,old,43.32,1

wagner,3,new,88.88,3

liliu,4,new,66.0,1

qiuba,5,new,54.32,1

wangshi,6,old,77.77,2

liwei,7,old,88.44,3

wutong,8,new,56.55,6

lilisi,9,new,88.88,5

qishili,10,new,66.66,5

Load data

hive (hive_explode)> load data local inpath '/export/servers/hivedatas/order_detail' into table order_detail;

2. Window function

FIRST_VALUE: the first value after sorting in the group up to the current line

LAST_VALUE: the last value until the current row after sorting in the group

LEAD(col,n,DEFAULT): returns the value of the nth following row in the window. The first parameter is the column name, the second is the offset n (optional, default 1), and the third is the default value (used when there is no such row; NULL if not specified)

LAG(col,n,DEFAULT): the opposite of LEAD; returns the value of the nth preceding row in the window. The first parameter is the column name, the second is the offset n (optional, default 1), and the third is the default value (used when there is no such row; NULL if not specified)
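A sketch of LEAD and LAG against the order_detail table created above:

```sql
select
  user_id,
  user_type,
  sales,
  -- sales of the next row within the same user_type, 0 when there is none
  lead(sales, 1, 0) over(partition by user_type order by sales) as next_sales,
  -- sales of the previous row within the same user_type, 0 when there is none
  lag(sales, 1, 0) over(partition by user_type order by sales) as prev_sales
from order_detail;
```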

3. OVER clause

1. Use the standard aggregate functions COUNT, SUM, MIN, MAX, AVG

2. Use the PARTITION BY statement to use one or more columns of the original data type

3. Use the PARTITION BY and ORDER BY statements to use one or more data types for partitioning or sorting

4. Use the window specification, which supports the following formats:

(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING


When ORDER BY is present but the window clause is missing, the window specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

When both ORDER BY and the window clause are missing, the window specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
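As an illustration, the two queries below are therefore equivalent (a sketch using the order_detail table above):

```sql
-- ORDER BY present, window clause omitted: the frame defaults to
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
select sum(sales) over(partition by user_type order by sales) as s from order_detail;

-- The same query with the default frame written out explicitly
select sum(sales) over(partition by user_type order by sales
  range between unbounded preceding and current row) as s from order_detail;
```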

The OVER clause supports the following functions, but a window specification cannot be used with them.

Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank

Lead and Lag functions

Use the window function to calculate the sales volume

Use the window function sum over to count sales

hive (hive_explode)> select
user_id,
user_type,
sales,
--All rows in the group
sum(sales) over(partition by user_type) AS sales_1 ,
sum(sales) over(order by user_type) AS sales_2 ,
--Default frame: from the starting point to the current row; equal sales values accumulate to the same result
sum(sales) over(partition by user_type order by sales asc) AS sales_3,
--From the starting point to the current row; unlike sales_3, ROWS accumulates by physical row order, so ties may differ
sum(sales) over(partition by user_type order by sales asc rows between unbounded preceding and current row) AS sales_4,
--Current row + 3 preceding rows
sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and current row) AS sales_5,
--Current row + 3 preceding rows + 1 following row
sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and 1 following) AS sales_6,
--Current row + all following rows
sum(sales) over(partition by user_type order by sales asc rows between current row and unbounded following) AS sales_7
from
order_detail
order by
user_type,
sales,
user_id;

After statistics, the results are as follows:

+-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+
| user_id | user_type | sales | sales_1 | sales_2 | sales_3 | sales_4 | sales_5 | sales_6 | sales_7 |
+-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+
| liliu | new | 1 | 23 | 23 | 2 | 2 | 2 | 4 | 22 |
| qiuba | new | 1 | 23 | 23 | 2 | 1 | 1 | 2 | 23 |
| zhangsan | new | 2 | 23 | 23 | 4 | 4 | 4 | 7 | 21 |
| wagner | new | 3 | 23 | 23 | 7 | 7 | 7 | 12 | 19 |
| lilisi | new | 5 | 23 | 23 | 17 | 17 | 15 | 21 | 11 |
| qishili | new | 5 | 23 | 23 | 17 | 12 | 11 | 16 | 16 |
| wutong | new | 6 | 23 | 23 | 23 | 23 | 19 | 19 | 6 |
| lisi | old | 1 | 6 | 29 | 1 | 1 | 1 | 3 | 6 |
| wangshi | old | 2 | 6 | 29 | 3 | 3 | 3 | 6 | 5 |
| liwei | old | 3 | 6 | 29 | 6 | 6 | 6 | 6 | 3 |
+-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+

Note:

The result is related to ORDER BY, which is ascending by default

If ROWS BETWEEN is not specified, the default is from the starting point to the current row;

If ORDER BY is not specified, all values in the group will be accumulated;

The key is to understand the meaning of ROWS BETWEEN, also known as the WINDOW clause:

PRECEDING: before the current row

FOLLOWING: after the current row

CURRENT ROW: the current row

UNBOUNDED: unbounded (the start or end of the partition)

UNBOUNDED PRECEDING: from the first row of the partition

UNBOUNDED FOLLOWING: to the last row of the partition

The usage of other COUNT, AVG, MIN, MAX and SUM is the same.

Find the first and last values after grouping first_value and last_value

Use first_value and last_value to find the first and last values after grouping

select
user_id,
user_type,
ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,
first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from
order_detail;
+-----------+------------+----------+-----------------+-----------------+---------------------+---------------------+--+
| user_id | user_type | row_num | max_sales_user | min_sales_user | curr_last_min_user | curr_last_max_user |
+-----------+------------+----------+-----------------+-----------------+---------------------+---------------------+--+
| wutong | new | 7 | wutong | qiuba | wutong | wutong |
| lilisi | new | 6 | wutong | qiuba | qishili | lilisi |
| qishili | new | 5 | wutong | qiuba | qishili | lilisi |
| wagner | new | 4 | wutong | qiuba | wagner | wagner |
| zhangsan | new | 3 | wutong | qiuba | zhangsan | zhangsan |
| liliu | new | 2 | wutong | qiuba | qiuba | liliu |
| qiuba | new | 1 | wutong | qiuba | qiuba | liliu |
| liwei | old | 3 | liwei | lisi | liwei | liwei |
| wangshi | old | 2 | liwei | lisi | wangshi | wangshi |
| lisi | old | 1 | liwei | lisi | lisi | lisi |
+-----------+------------+----------+-----------------+-----------------+---------------------+---------------------+--+

4. Analysis function

1.ROW_NUMBER():

Generates a sequential number for the rows in a group, starting from 1, for example numbering each day's pv within the group in descending order. ROW_NUMBER() has many application scenarios, such as fetching the record that ranks first in a group, or the first referer in a session.

2.RANK() :

Generate the ranking of data items in the group. Equal ranking will leave a vacancy in the ranking

3.DENSE_RANK() :

Generate the ranking of data items in the group. If the ranking is equal, there will be no vacancy in the ranking

4.CUME_DIST :

Rows less than or equal to the current value / total rows in the group. For example, count the proportion of the number of people whose salary is less than or equal to the current salary in the total number of people

5.PERCENT_RANK :

(RANK of the current row in the group - 1) / (total number of rows in the group - 1)

6.NTILE(n) :

Divides the rows in a group into n buckets in order and returns the bucket number of the current row. If the buckets are uneven, earlier buckets get one extra row by default. NTILE does not support ROWS BETWEEN, e.g. NTILE(2) OVER(PARTITION BY cookie ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is invalid.
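CUME_DIST and PERCENT_RANK have no example above; a sketch against the order_detail table:

```sql
select
  user_id,
  user_type,
  sales,
  -- fraction of rows in the group with sales <= the current row's sales
  cume_dist() over(partition by user_type order by sales) as cd,
  -- (rank - 1) / (rows in group - 1)
  percent_rank() over(partition by user_type order by sales) as pr
from order_detail;
```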

Using RANK, ROW_NUMBER and DENSE_RANK with OVER

Using these functions, you can find topN by grouping

Demand: sort by user type to get the top N data with the largest sales volume

select
user_id,user_type,sales,
RANK() over (partition by user_type order by sales desc) as r,
ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
DENSE_RANK() over (partition by user_type order by sales desc) as dr
from
order_detail;


+-----------+------------+--------+----+-----+-----+--+
| user_id | user_type | sales | r | rn | dr |
+-----------+------------+--------+----+-----+-----+--+
| wutong | new | 6 | 1 | 1 | 1 |
| qishili | new | 5 | 2 | 2 | 2 |
| lilisi | new | 5 | 2 | 3 | 2 |
| wagner | new | 3 | 4 | 4 | 3 |
| zhangsan | new | 2 | 5 | 5 | 4 |
| qiuba | new | 1 | 6 | 6 | 5 |
| liliu | new | 1 | 6 | 7 | 5 |
| liwei | old | 3 | 1 | 1 | 1 |
| wangshi | old | 2 | 2 | 2 | 2 |
| lisi | old | 1 | 3 | 3 | 3 |
+-----------+------------+--------+----+-----+-----+--+

Use NTILE to calculate the percentage

We can use NTILE to divide our data into several parts, and then calculate the percentage

Slice data using NTILE

select
user_type,sales,
--Divide the data into 2 pieces in the group
NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
--Divide the data into 3 pieces in the group
NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
--Divide the data into 4 pieces within the group
NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
--Divide all data into 4 pieces
NTILE(4) OVER(ORDER BY sales) AS all_nt4
from
order_detail
order by
user_type,
sales;

The results are as follows:

+------------+--------+------+------+------+----------+--+
| user_type | sales | nt2 | nt3 | nt4 | all_nt4 |
+------------+--------+------+------+------+----------+--+
| new | 1 | 1 | 1 | 1 | 1 |
| new | 1 | 1 | 1 | 1 | 1 |
| new | 2 | 1 | 1 | 2 | 2 |
| new | 3 | 1 | 2 | 2 | 3 |
| new | 5 | 2 | 2 | 3 | 4 |
| new | 5 | 2 | 3 | 3 | 3 |
| new | 6 | 2 | 3 | 4 | 4 |
| old | 1 | 1 | 1 | 1 | 1 |
| old | 2 | 1 | 2 | 2 | 2 |
| old | 3 | 2 | 3 | 3 | 2 |
+------------+--------+------+------+------+----------+--+

Use NTILE to get the user id of the top 20% of sales

select
user_id
from
(select user_id, NTILE(5) OVER(ORDER BY sales desc) AS nt
from order_detail
)A
where nt=1;

+----------+--+
| user_id |
+----------+--+
| wutong |
| qishili |

5. Enhanced aggregation Cube and Grouping and Rollup

These functions are usually used in OLAP, where results cannot simply be accumulated; statistics must be computed for drill-up and drill-down along different dimensions, such as hourly, daily and monthly UVs.

GROUPING SETS

In a GROUP BY query, aggregating on different combinations of dimensions is equivalent to a UNION ALL of the GROUP BY result sets of each dimension.

The GROUPING__ID column indicates which grouping set a result row belongs to.

Requirements: group by user_type and by sales separately to obtain statistics

0: jdbc:hive2://node03:10000>select
user_type,
sales,
count(user_id) as pv,
GROUPING__ID
from
order_detail
group by
user_type,sales
GROUPING SETS(user_type,sales)
ORDER BY
GROUPING__ID;

The results are as follows:

+------------+--------+-----+---------------+--+
| user_type | sales | pv | grouping__id |
+------------+--------+-----+---------------+--+
| old | NULL | 3 | 1 |
| new | NULL | 7 | 1 |
| NULL | 6 | 1 | 2 |
| NULL | 5 | 2 | 2 |
| NULL | 3 | 2 | 2 |
| NULL | 2 | 2 | 2 |
| NULL | 1 | 3 | 2 |
+------------+--------+-----+---------------+--+

Requirements: group by user_type, by sales, and by user_type + sales separately to obtain statistics

0: jdbc:hive2://node03:10000>select
user_type,
sales,
count(user_id) as pv,
GROUPING__ID
from
order_detail
group by
user_type,sales
GROUPING SETS(user_type,sales,(user_type,sales))
ORDER BY
GROUPING__ID;

The results are as follows:

+------------+--------+-----+---------------+--+
| user_type | sales | pv | grouping__id |
+------------+--------+-----+---------------+--+
| old | NULL | 3 | 1 |
| new | NULL | 7 | 1 |
| NULL | 1 | 3 | 2 |
| NULL | 6 | 1 | 2 |
| NULL | 5 | 2 | 2 |
| NULL | 3 | 2 | 2 |
| NULL | 2 | 2 | 2 |
| old | 3 | 1 | 3 |
| old | 2 | 1 | 3 |
| old | 1 | 1 | 3 |
| new | 6 | 1 | 3 |
| new | 5 | 2 | 3 |
| new | 3 | 1 | 3 |
| new | 1 | 2 | 3 |
| new | 2 | 1 | 3 |
+------------+--------+-----+---------------+--+

6. Use cube and ROLLUP to aggregate according to all combinations of dimensions of GROUP BY.

Aggregating with CUBE

Requirement: obtain statistics with no grouping, and grouped by user_type, by sales, and by user_type + sales

0: jdbc:hive2://node03:10000>select
user_type,
sales,
count(user_id) as pv,
GROUPING__ID
from
order_detail
group by
user_type,sales
WITH CUBE
ORDER BY
GROUPING__ID;



+------------+--------+-----+---------------+--+
| user_type | sales | pv | grouping__id |
+------------+--------+-----+---------------+--+
| NULL | NULL | 10 | 0 |
| new | NULL | 7 | 1 |
| old | NULL | 3 | 1 |
| NULL | 6 | 1 | 2 |
| NULL | 5 | 2 | 2 |
| NULL | 3 | 2 | 2 |
| NULL | 2 | 2 | 2 |
| NULL | 1 | 3 | 2 |
| old | 3 | 1 | 3 |
| old | 2 | 1 | 3 |
| old | 1 | 1 | 3 |
| new | 6 | 1 | 3 |
| new | 5 | 2 | 3 |
| new | 3 | 1 | 3 |
| new | 2 | 1 | 3 |
| new | 1 | 2 | 3 |
+------------+--------+-----+---------------+--+

ROLLUP for aggregation

ROLLUP is a subset of CUBE: it performs hierarchical aggregation starting from the leftmost dimension of the GROUP BY.

select
user_type,
sales,
count(user_id) as pv,
GROUPING__ID
from
order_detail
group by
user_type,sales
WITH ROLLUP
ORDER BY
GROUPING__ID;


+------------+--------+-----+---------------+--+
| user_type | sales | pv | grouping__id |
+------------+--------+-----+---------------+--+
| NULL | NULL | 10 | 0 |
| old | NULL | 3 | 1 |
| new | NULL | 7 | 1 |
| old | 3 | 1 | 3 |
| old | 2 | 1 | 3 |
| old | 1 | 1 | 3 |
| new | 6 | 1 | 3 |
| new | 5 | 2 | 3 |
| new | 3 | 1 | 3 |
| new | 2 | 1 | 3 |
| new | 1 | 2 | 3 |
+------------+--------+-----+---------------+--+

4. Custom functions for hive

1. Hive custom function

1) Hive comes with some functions, such as max/min, but the number is limited. You can easily expand it by customizing UDF.

2) When the built-in function provided by Hive cannot meet your business processing needs, you can consider using user-defined function (UDF).

3) There are three types of user-defined functions:

(1)UDF(User-Defined-Function)

One in and one out

(2)UDAF(User-Defined Aggregation Function)

Aggregate function: many in, one out

Similar to: count/max/min

(3)UDTF(User-Defined Table-Generating Functions)

One in and many out

Such as lateral view explode()

4) Official document address

https://cwiki.apache.org/confluence/display/Hive/HivePlugins

5) Programming steps:

(1) Extend org.apache.hadoop.hive.ql.exec.UDF

(2) You need to implement the evaluate function; The evaluate function supports overloading;

6) Precautions

(1) UDF must have a return type, which can return null, but the return type cannot be void;

(2) Text/LongWritable and other types are commonly used in UDF, and java type is not recommended;
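Once such a UDF is compiled into a jar, registering and calling it looks like this; the jar path and class name below are hypothetical placeholders:

```sql
-- Hypothetical jar path and class name; substitute your own build
add jar /export/servers/hivedatas/my_udf.jar;
create temporary function my_lower as 'com.example.hive.MyLowerUDF';
select my_lower('ABC') from tableName;
```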

Posted by jamesm87 on Thu, 07 Oct 2021 00:35:22 -0700