SQL one trick a day: how to parse JSON through HQL

Keywords: hive SQL

Many data in Hive will be stored in JSON format. For example, when developers embed the page on the APP, they will store multiple fields in a JSON array. Therefore, when the data platform calls the data, it is necessary to analyze the embedded data. Next, let's talk about how JSON data is parsed in Hive.

Hive's own json parsing function

1. get_json_object

  • Syntax: get_json_object(json_string, '$.key')

  • Description: parse json string_ String to return the content specified by path. NULL is returned if the json string entered is invalid. This function can only return one data item at a time.

  • Example:

select 
get_json_object('{"name":"zhangsan","age":18}','$.name'); 
  • result:

name
zhangsan

If you want to resolve both the name field and the age field, you can write this:

select 
get_json_object('{"name":"zhangsan","age":18}','$.name'),
get_json_object('{"name":"zhangsan","age":18}','$.age');

However, if there are many fields to be parsed, it will be too troublesome to write in this way, so there are   json_tuple   This function.

2. json_tuple

  • Syntax: json_tuple(json_string, k1, k2 ...)

  • Description: parse json string_ String, which can specify the key in multiple json data and return the corresponding value. NULL is returned if the json string entered is invalid.

  • Example:

select 
b.name
,b.age
from tableName a lateral view
json_tuple('{"name":"zhangsan","age":18}','name','age') b as name,age;
  • result:

nameage
zhangsan18

Note: JSON above_ There is no $. In the tuple function

If you are using JSON_ If $. Is added to the tuple function, the parsing fails:

select 
b.name
,b.age
from tableName a lateral view
json_tuple('{"name":"zhangsan","age":18}','$.name','$.age') b as name,age;

result:

nameage
NULLNULL

The fields are all NULL, so JSON_ The tuple function does not need to add $. Otherwise it will not be parsed.

Summary: json_tuple is equivalent to get_ json_ The advantage of object is that it can parse multiple JSON fields at a time. But if we have a JSON array, neither of these functions can handle it.

Hive parse json array

1, Nested subquery parsing json array

If there is a hive table, the JSON in the table_ The contents of STR field are as follows:

json_str
[{"website":"baidu.com","name": "Baidu"}, {"website":"google.com","name": "Google"}]

We want to parse this field to form the following structure:

websitename
baidu.comBaidu
google.comGoogle

To parse this json array, you can't parse it with only the two functions described above. You also need the following functions:

Expand function

  • Syntax: explode(Array OR Map)

  • Description: the expand () function receives data of an array or map type as input, and then outputs the elements in the array or map in the form of each row, that is, the complex array or map structure in the hive column is split into multiple rows for display, which is also called column to row function.

  • Example:

-- analysis array
hive> select explode(array('A','B','C'));
OK
A
B
C
-- analysis map
hive> select explode(map('A',10,'B',20,'C',30));
OK
A       10
B       20
C       30

regexp_replace function

  • Syntax: regexp_replace(string A, string B, string C)

  • Description: replace the part of string A that conforms to java regular expression B with C. Note that in some cases, escape characters are used, similar to regexp in oracle_ Replace function.

  • Example:

hive> select regexp_replace('foobar', 'oo|ar', ''); 
OK
fb

The above example replaces oo or ar in the string with ''.

With the above functions, let's parse JSON_ Contents of STR field:

    1. First, parse out the elements in the json array and convert them into each line of display:

hive> SELECT explode(split(regexp_replace(regexp_replace('[{"website":"baidu.com","name":"Baidu"},{"website":"google.com","name":"Google"}]', '\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;'));
OK
{"website":"baidu.com","name":"Baidu"}
{"website":"google.com","name":"Google"}

Briefly describe the above sql:

SELECT explode(split(
    regexp_replace(
        regexp_replace(
            '[
                {"website":"baidu.com","name":"Baidu"},
                {"website":"google.com","name":"Google"}
            ]', 
            '\\[|\\]' , ''), take json Remove the brackets around the array
            
              '\\}\\,\\{' , '\\}\\;\\{'), take json Replace commas between array elements with semicolons
                
                 '\\;') Semicolon as separator(split Functions are separated by semicolons)
          );  

Why replace commas between json array elements with semicolons?
Because the separation within the elements is also a comma, if the commas between the elements are not replaced, the data in the elements will also be separated when separated by the split function, which is not the result we want.

    1. In the previous step, a json array has been converted into multiple json strings. Next, combine the json array with the json string_ Tuple function to parse the fields in json:

select 
json_tuple(explode(split(
regexp_replace(regexp_replace('[{"website":"baidu.com","name":"Baidu"},{"website":"google.com","name":"Google"}]', '\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;')) 
, 'website', 'name') ;

When the above statement is executed, an error is reported:
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions

This means that UDTF functions cannot be written in other functions, that is, the expand function here cannot be written in JSON_ Inside the tuple.

Since the expand function cannot be written in another JSON_ In tuple, we can use sub query, as shown below:

select json_tuple(json, 'website', 'name') 
from (
select explode(split(regexp_replace(regexp_replace('[{"website":"baidu.com","name":"Baidu"},{"website":"google.com","name":"Google"}]', '\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;')) 
as json) t;

Execute the above statement without error, and the execution result is as follows:

www.baidu.com   Baidu
google.com      Google

Second, use lateral view to parse json arrays

Goods in hive table_ ID and JSON_ The contents of STR field are as follows:

goods_idjson_str
1,2,3[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

Objective: to bring goods_id field and JSON_ The monthSales in the str field is resolved.

Let's start parsing:

Split goods_id field and convert json array into multiple json strings:

select 
explode(split(goods_id,',')) as good_id,
explode(split(regexp_replace(regexp_replace(json_str , '\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;')) 
as sale_info 
from tableName;

When the above statement is executed, an error is reported:
FAILED: SemanticException 3:0 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'sale_info'

This means that when using UDTF, select only supports one field. There are two fields in the above statement select, so an error is reported.

To solve this problem, we have to introduce another hive syntax:

lateral view

lateral view is used with split, expand and other udtfs. It can split a row of data into multiple rows of data. On this basis, it can aggregate the split data. lateral view first calls UDTF for each row of the original table. UDTF will split a row into one or more rows. lateral view combines the results to generate a virtual table supporting alias tables.

  • Example:

Suppose we have a list of users' interests and hobbies_table, which has two columns of data. The first column is the name and the second column is the id of the user's interest_ List is an array that stores the id value of interest:

nameid_list
zhangsan[1,2,3]
lisi[3,4,5]

We want to count the number of occurrences of all interest IDS in all users:

  1. Resolve the interest id:

SELECT name, hobby_id 
FROM hobbies_table 
LATERAL VIEW explode(id_list) tmp_table AS hobby_id;

The above sql execution results:

namehobby_id
zhangsan1
zhangsan2
zhangsan3
lisi3
lisi4
lisi5

2. According to hobby_id can be grouped and aggregated:

SELECT hobby_id ,count(name) client_num
FROM hobbies_table 
LATERAL VIEW explode(id_list) tmp_table AS hobby_id
GROUP BY hobby_id;

result:

hobby_idclient_num
11
21
32
41
51

End of introduction   lateral view   Later, we will solve the problem that when using UDTF, SELECT only supports one field:

select good_id,get_json_object(sale_json,'$.monthSales') as monthSales
from tableName 
LATERAL VIEW explode(split(goods_id,','))goods as good_id 
LATERAL VIEW explode(split(regexp_replace(regexp_replace(json_str , '\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;')) sales as sale_json;

Note: the above statement is the result of the Cartesian product of three tables, so this method is applicable to the case where the amount of data is not large.

The execution results of the above statements are as follows:

goods_idmonthSales
14900
12090
16987
24900
22090
26987
34900
32090
36987

If there are other fields in the table, we can filter the data matching the results according to other fields.

Summary: lateral view usually appears together with UDTF. In order to solve the problem that UDTF does not allow multiple fields in select.

Posted by cypher235 on Thu, 02 Sep 2021 17:13:25 -0700