Assignment: Return the name of the student who scored the highest mark for each course.
t_score_data.txt
zhangsan math:90,english:60 lisi chinese:80,math:66,english:77 wangwu chinese:66,math:55,english:80
Return the names of the students for each course and for the highest score
chinese lisi math zhangsan english wangwu
Solution 1: Store in map type
1. create table
create TABLE t_score( name STRING, scores map<STRING,INT> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY ',' MAP KEYS TERMINATED BY ':';
Separation of each field by [\t] - ------- FIELDS TERMINATED BY't'
The second field is in Array form, where elements are separated from each other by [,] - ------- COLLECTION ITEMS TERMINATED BY', '
The data is in the form of K-V. Each group of K-V pairs is internally divided by [:] and each group of K-V pairs is separated by [:] - ------- MAP KEYS TERMINATED BY': '
After importing the data into the table, the table data is as follows:
zhangsan {"math":90,"english":60} lisi {"chinese":80,"math":66,"english":77} wangwu {"chinese":66,"math":55,"english":80}
2.sql
select name,subject,score from( select name,subject,score, rank() over(partition by subject order by score desc) as rank from t_score lateral view explode(scores) subject_score as subject,score )T1 where T1.rank = 1;
Ordered by subject partition and numbered - ------- row_number() over(partition by subject order by score desc) as rank
Split the scores field and generate a result set of data after splitting a single row into multiple rows - t_score lateral view explode(scores) my_score as subject,score. https://blog.csdn.net/guodong2k/article/details/79459282
sql results:
lisi chinese 80 wangwu english 80 zhangsan math 90
Scheme 2: Non-map type
1. build tables
create TABLE t_score2( name STRING, scores STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
Import the data into the table. The table data are as follows:
zhangsan math:90,english:60 lisi chinese:80,math:66,english:77 wangwu chinese:66,math:55,english:80
2. sql
select T2.name,T2.subject,T2.score from ( select T1.name,T1.subject,T1.score,rank() over(partition by T1.subject order by T1.score desc) as rank from ( select name,split(subject_score,':')[0] as subject,split(subject_score,':')[1] as score from t_score2 lateral view explode(split(scores,',')) t as subject_score ) T1 ) T2 where T2.rank = 1;
sql results:
lisi chinese 80 wangwu english 80 zhangsan math 90
Solution 3: Using nested python scripts to process
1. build tables
create TABLE t_score2( name STRING, scores STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
2. sql
mapper.py
# -*- coding: utf-8 -*- import sys for line in sys.stdin: name_sub_sc = line.strip().split('\t') name = name_sub_sc[0] sub_scs = name_sub_sc[1].strip().split(',') for sub_sc in sub_scs: subject,score = sub_sc.split(':') print '\t'.join([name,subject,score])
Load python scripts into distributed caches before executing sql
hive> add file /usr/local/src/apache-hive-1.2.2-bin/warehouse/mapper.py;
select T2.subject, T2.name, T2.score from ( select name, subject, score, rank() over (partition by T1.subject order by T1.score desc) as rank from ( select transform(line) using 'python mapper.py' as (name STRING, subject STRING, score INT) from t_score3 ) T1 ) T2 where T2.rank = 1;
sql results:
lisi chinese 80 wangwu english 80 zhangsan math 90