Hive Homework - Return the name of the student with the highest score for each course

Keywords: SQL Python hive Apache

Assignment: Return the name of the student who scored the highest mark for each course.
t_score_data.txt

zhangsan        math:90,english:60
lisi    chinese:80,math:66,english:77
wangwu  chinese:66,math:55,english:80

Return the names of the students for each course and for the highest score

chinese lisi
math zhangsan
english wangwu 

 

Solution 1: Store in map type
1. create table

create TABLE t_score(
    name STRING,
    scores map<STRING,INT>
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':';

Separation of each field by [\t] - ------- FIELDS TERMINATED BY't'
The second field is in Array form, where elements are separated from each other by [,] - ------- COLLECTION ITEMS TERMINATED BY', '
The data is in the form of K-V. Each group of K-V pairs is internally divided by [:] and each group of K-V pairs is separated by [:] - ------- MAP KEYS TERMINATED BY': '

After importing the data into the table, the table data is as follows:

zhangsan        {"math":90,"english":60}
lisi    {"chinese":80,"math":66,"english":77}
wangwu  {"chinese":66,"math":55,"english":80}

2.sql

select name,subject,score from(
    select name,subject,score,
        rank() over(partition by subject order by score desc) as rank
    from t_score lateral view explode(scores) subject_score as subject,score
)T1 where T1.rank = 1;

Ordered by subject partition and numbered - ------- row_number() over(partition by subject order by score desc) as rank
Split the scores field and generate a result set of data after splitting a single row into multiple rows - t_score lateral view explode(scores) my_score as subject,score. https://blog.csdn.net/guodong2k/article/details/79459282
 

sql results:

lisi    chinese 80
wangwu  english 80
zhangsan        math    90

Scheme 2: Non-map type
1. build tables

create TABLE t_score2(
    name STRING,
    scores STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';

Import the data into the table. The table data are as follows:

zhangsan        math:90,english:60
lisi    chinese:80,math:66,english:77
wangwu  chinese:66,math:55,english:80

2. sql

select T2.name,T2.subject,T2.score from
(
    select T1.name,T1.subject,T1.score,rank() over(partition by T1.subject order by T1.score desc) as rank from
    (
        select name,split(subject_score,':')[0] as subject,split(subject_score,':')[1] as score from t_score2 lateral view explode(split(scores,',')) t as subject_score
    ) T1
) T2
where T2.rank = 1;

sql results:

lisi    chinese 80
wangwu  english 80
zhangsan        math    90

Solution 3: Using nested python scripts to process
1. build tables

create TABLE t_score2(
    name STRING,
    scores STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';

2. sql
mapper.py

# -*- coding: utf-8 -*-
import sys
for line in sys.stdin:
    name_sub_sc = line.strip().split('\t')
    name = name_sub_sc[0]
    sub_scs = name_sub_sc[1].strip().split(',')
    for sub_sc in sub_scs:
        subject,score = sub_sc.split(':')
        print '\t'.join([name,subject,score])

Load python scripts into distributed caches before executing sql

hive> add file /usr/local/src/apache-hive-1.2.2-bin/warehouse/mapper.py;
select T2.subject, T2.name, T2.score from
(
    select name, subject, score, rank() over (partition by T1.subject order by T1.score desc) as rank from 
    (
        select transform(line) using 'python mapper.py' as (name STRING, subject STRING, score INT) from t_score3
    ) T1
) T2
where T2.rank = 1;

sql results:

lisi    chinese 80
wangwu  english 80
zhangsan        math    90

 

Posted by nephish on Sun, 13 Oct 2019 10:23:26 -0700