Analysis of 5 typical Hive topics

Keywords: Big Data hive

1, Continuous problem

The following data are the low carbon emissions received by users in ant forest

id dt lowcarbon
1001  2021-12-12 123
1002  2021-12-12 45
1001  2021-12-13 43
1001  2021-12-13 45
1001  2021-12-13 23
1002  2021-12-14 45
1001  2021-12-14 230
1002  2021-12-15 45
1001  2021-12-15 23
... ...

Find out the users who have received low carbon emission of more than 100 for 3 consecutive days or more

Resolution:
Suppose you create a table named test1:
(1) Calculate the low carbon emission received by each user in a single day by grouping according to the user ID and time fields → t1 table

select
    id,
    dt,
    sum(lowcarbon) lowcarbon
from test1
group by id,dt
having lowcarbon>100;
id	dt	lowcarbon
1001	2021-12-12	123
1001	2021-12-13	111
1001	2021-12-14	230

Arithmetic sequence method: if two arithmetic sequences have the same arithmetic difference, the result of subtracting the data at the same position will be the same

(2) Group according to users and sort according to time, and calculate the Rank value of each data → t2 table

select
    id,
    dt,
    lowcarbon,
    rank() over(partition by id order by dt) rk
from t1;
id	dt	lowcarbon	rk
1001	2021-12-12	123	1
1001	2021-12-13	111	2
1001	2021-12-14	230	3

(3) Subtract the Rank value from the date in each row → t3 table

select
    id,
    dt,
    lowcarbon,
    date_sub(dt,rk) flag
from t2;
id	dt	lowcarbon	flag
1001	2021-12-12	123	2021-12-11
1001	2021-12-13	111	2021-12-11
1001	2021-12-14	230	2021-12-11

(4) Group according to users and flags, find out how many pieces of data there are in each group, and find data greater than or equal to 3 pieces

select
    id,
    flag,
    count(*) ct
from t3
group by id,flag
having ct>=3;

The final result is:

id	flag	ct
1001	2021-12-11	3

Overall execution statement:

select
    id,
    flag,
    count(*) ct
from 
    (select
        id,
        dt,
        lowcarbon,
        date_sub(dt,rk) flag
    from 
        (select
            id,
            dt,
            lowcarbon,
            rank() over(partition by id order by dt) rk
        from 
            (select
                id,
                dt,
                sum(lowcarbon) lowcarbon
            from test1
            group by id,dt
            having lowcarbon>100)t1)t2)t3
group by id,flag
having ct>=3;

2, Grouping problem

Group the following data according to the results shown in

id		ts
1001	17523641234
1001	17523641256
1002	17523641278
1001	17523641334
1002	17523641434
1001	17523641534
1001	17523641544
1002	17523641634
1001	17523641638
1001	17523641654

For the same id, if the time interval is less than 60 seconds, they are divided into the same group

1001	17523641234		1
1001	17523641256		1
1001	17523641334		2
1001	17523641534		3
1001	17523641544		3
1001	17523641638		4
1001	17523641654		4
1002	17523641278		1
1002	17523641434		2
1002	17523641634		3

Resolution:
Suppose you create a table named test2:
(1) Move the time data of the previous row down → t1 table
lead: Leader
lag: delay

select
    id,
    ts,
    lag(ts,1,0) over(partition by id order by ts) lagts
from
    test2;
id	ts	lagts
1001	17523641234	0
1001	17523641256	17523641234
1001	17523641334	17523641256
1001	17523641534	17523641334
1001	17523641544	17523641534
1001	17523641638	17523641544
1001	17523641654	17523641638
1002	17523641278	0
1002	17523641434	17523641278
1002	17523641634	17523641434

(2) Subtract the time data of the previous row from the time data of the current row → t2 table

select
    id,
    ts,
    ts-lagts tsdiff
from
    t1;
id	ts	tsdiff
1001	17523641234	17523641234
1001	17523641256	22
1001	17523641334	78
1001	17523641534	200
1001	17523641544	10
1001	17523641638	94
1001	17523641654	16
1002	17523641278	17523641278
1002	17523641434	156
1002	17523641634	200

(3) Calculate the total number (group number) of tsdiff greater than or equal to 60 from the first row to the current row within each user range

select
    id,
    ts,
    sum(if(tsdiff>=60,1,0)) over(partition by id order by ts) groupid
from
    t2;

The final result is:

id	ts	groupid
1001	17523641234	1
1001	17523641256	1
1001	17523641334	2
1001	17523641534	3
1001	17523641544	3
1001	17523641638	4
1001	17523641654	4
1002	17523641278	1
1002	17523641434	2
1002	17523641634	3

Overall execution statement:

select
    id,
    ts,
    sum(if(tsdiff>=60,1,0)) over(partition by id order by ts) groupid
from
    (select
        id,
        ts,
        ts-lagts tsdiff
    from
        (select
            id,
            ts,
            lag(ts,1,0) over(partition by id order by ts) lagts
        from
            test2)t1)t2;

3, Interval continuity problem

Daily login data of users recorded by a game company

id dt
1001  2021-12-12
1002  2021-12-12
1001  2021-12-13
1001  2021-12-14
1001  2021-12-16
1002  2021-12-16
1001  2021-12-19
1002  2021-12-17
1001  2021-12-20

Calculate the maximum number of consecutive login days for each user, with an interval of one day. Explanation: if a user logs in to the game on 1, 3, 5 and 6, it is regarded as logging in for 6 consecutive days

Resolution:
Suppose you create a table named test3:
(1) Move the time data of the previous row down → t1 table

select
    id,
    dt,
    lag(dt,1,'1970-01-01') over(partition by id order by dt) lagdt
from
    test3;
id	dt	lagdt
1001	2021-12-12	1970-01-01
1001	2021-12-13	2021-12-12
1001	2021-12-14	2021-12-13
1001	2021-12-16	2021-12-14
1001	2021-12-19	2021-12-16
1001	2021-12-20	2021-12-19
1002	2021-12-12	1970-01-01
1002	2021-12-16	2021-12-12
1002	2021-12-17	2021-12-16

(2) Subtract the current row time from the previous row time data (datediff(dt1,dt2)) → t2 table

select
    id,
    dt,
    datediff(dt,lagdt) flag
from
    t1;
id	dt	flag
1001	2021-12-12	18973
1001	2021-12-13	1
1001	2021-12-14	1
1001	2021-12-16	2
1001	2021-12-19	3
1001	2021-12-20	1
1002	2021-12-12	18973
1002	2021-12-16	4
1002	2021-12-17	1

(3) Group according to users and sort according to time, and calculate the total number of data (sum (if (flag > 2,1,0)) greater than 2 from the first row to the current row. If it is less than 2, it means that it meets the "continuous" → t3 table required in the title

select
    id,
    dt,
    sum(if(flag>2,1,0)) over(partition by id order by dt) flag
from
    t2
id	dt	flag
1001	2021-12-12	1
1001	2021-12-13	1
1001	2021-12-14	1
1001	2021-12-16	1
1001	2021-12-19	2
1001	2021-12-20	2
1002	2021-12-12	1
1002	2021-12-16	2
1002	2021-12-17	2

(4) Group according to the user and flag, calculate the maximum time minus the minimum time → t4 table

select
    id,
    flag,
    datediff(max(dt),min(dt)) days
from
    t3
group by id,flag;

(5) Take the maximum number of consecutive login days and + 1

select
    id,
    max(days)+1
from
    t4
group by id;

Final result:

id	_c1
1001	5
1002	2

Overall execution statement:

select
    id,
    max(days)+1
from
    (select
        id,
        flag,
        datediff(max(dt),min(dt)) days
    from
        (select
            id,
            dt,
            sum(if(flag>2,1,0)) over(partition by id order by dt) flag
        from
            (select
                id,
                dt,
                datediff(dt,lagdt) flag
            from
                (select
                    id,
                    dt,
                    lag(dt,1,'1970-01-01') over(partition by id order by dt) lagdt
                from
                    test3)t1)t2)t3
    group by id,flag)t4
group by id;

4, Discount date cross problem

The following is the platform commodity promotion data: the fields are brand, discount start date and discount end date

brand  sdt edt
oppo  2021-06-05 2021-06-09
oppo  2021-06-11 2021-06-21
vivo  2021-06-05 2021-06-15
vivo  2021-06-09 2021-06-21
redmi  2021-06-05 2021-06-21
redmi  2021-06-09 2021-06-15
redmi  2021-06-17 2021-06-26
huawei 2021-06-05 2021-06-26
huawei 2021-06-09 2021-06-15
huawei 2021-06-17 2021-06-21

Calculate the total discount days of each brand and pay attention to the cross dates. For example, for vivo brand, the first activity time is from June 5, 2021 to June 15, 2021, and the second activity time is from June 09, 2021 to June 21, 2021. The 9th to 15th are the repetition days. Only once is counted, that is, the total discount days of vivo are 17 days from June 05, 2021 to June 21, 2021

Resolution:
Suppose you create a table named test4:
(1) Place the largest edt in the previous data of the current row into the current row → t1 table

select
    brand,
    sdt,
    edt,
    max(edt) over(partition by brand order by sdt rows between UNBOUNDED PRECEDING and 1 PRECEDING) maxEdt
from test4;

(2) Compare the start time with the moved data. If the start time is large, there is no need to operate. On the contrary, the moved data needs to be added with one to replace the start time of the current row. If it is the first row of data and maxEDT is null, there is no need to operate → t2 table

select
    brand,
    if(maxEdt is null,sdt,if(sdt>maxEdt,sdt,date_add(maxEdt,1))) sdt,
    edt
from t1;
brand	sdt	edt
huawei	2021-06-05	2021-06-26
huawei	2021-06-27	2021-06-15
huawei	2021-06-27	2021-06-21
oppo	2021-06-05	2021-06-09
oppo	2021-06-11	2021-06-21
redmi	2021-06-05	2021-06-21
redmi	2021-06-22	2021-06-15
redmi	2021-06-22	2021-06-26
vivo	2021-06-05	2021-06-15
vivo	2021-06-16	2021-06-21

(3) Subtract the start date from the end date in each row of data → t3 table

select
    brand,
    datediff(edt,sdt) days
from
    t2
brand	days
huawei	21
huawei	-12
huawei	-6
oppo	4
oppo	10
redmi	16
redmi	-7
redmi	4
vivo	10
vivo	5

(4) Group by brand and calculate the sum of each data plus one

select
    brand,
    sum(if(days>=0,days+1,0)) days
from
    t3
group by id;

Final execution result:

brand	days
huawei	22
oppo	16
redmi	22
vivo	17

Overall execution statement:

select
    brand,
    sum(if(days>=0,days+1,0)) days
from
    (select
        brand,
        datediff(edt,sdt) days
    from
        (select
            brand,
            if(maxEdt is null,sdt,if(sdt>maxEdt,sdt,date_add(maxEdt,1))) sdt,
            edt
        from 
            (select
                brand,
                sdt,
                edt,
                max(edt) over(partition by brand order by sdt rows between UNBOUNDED PRECEDING and 1 PRECEDING) maxEdt
            from test4)t1)t2)t3
group by brand;

5, Simultaneous online problems

The following is the opening and closing time of the anchor of a live broadcast platform. According to the data, the number of anchors online at the peak of the platform is calculated.

id stt edt
1001  2021-06-14 12:12:12 2021-06-14 18:12:12
1003  2021-06-14 13:12:12 2021-06-14 16:12:12
1004  2021-06-14 13:15:12 2021-06-14 20:12:12
1002  2021-06-14 15:12:12 2021-06-14 16:12:12
1005  2021-06-14 15:18:12 2021-06-14 20:12:12
1001  2021-06-14 20:12:12 2021-06-14 23:12:12
1006  2021-06-14 21:12:12 2021-06-14 23:15:12
1007  2021-06-14 22:12:12 2021-06-14 23:10:12

Resolution:
Suppose you create a table named test5:
(1) For data classification, add a positive 1 after the start data to indicate that the anchor is online. At the same time, add - 1 after the close data to indicate that the anchor is offline → t1 table

select id,sdt dt,1 p from lives
union
select id,edt dt,-1 p from lives;

(2) Calculate the cumulative number of people in chronological order → t2 table

select
    id,
    dt,
    sum(p) over(order by dt) sum_p
from
    t1;
id	dt	sum_p
1001	2021-06-14 12:12:12	1
1003	2021-06-14 13:12:12	2
1004	2021-06-14 13:15:12	3
1002	2021-06-14 15:12:12	4
1005	2021-06-14 15:18:12	5
1002	2021-06-14 16:12:12	3
1003	2021-06-14 16:12:12	3
1001	2021-06-14 18:12:12	2
1001	2021-06-14 20:12:12	1
1004	2021-06-14 20:12:12	1
1005	2021-06-14 20:12:12	1
1006	2021-06-14 21:12:12	2
1007	2021-06-14 22:12:12	3
1007	2021-06-14 23:10:12	2
1001	2021-06-14 23:12:12	1
1006	2021-06-14 23:15:12	0

(3) Find out the maximum number of people online at the same time

select
    max(sum_p)
from
    t2;

Final result:

5

Overall execution statement:

select
    max(sum_p)
from
    (select
        id,
        dt,
        sum(p) over(order by dt) sum_p
    from
        (select id,sdt dt,1 p from test5
        union
        select id,edt dt,-1 p from test5)t1)t2;

Reference video: [Shangsi Valley] 2021 latest Hive advanced course - 33 ~ 42

Posted by dixondwayne on Wed, 08 Sep 2021 12:09:56 -0700