A lot of data in Hive is stored in JSON format. For example, when developers add tracking events to a page in the app, they often pack multiple fields into one JSON array, so when the data platform uses that data it first has to parse the tracking payload. Let's look at how JSON data is parsed in Hive.
Hive's built-in JSON parsing functions
1. get_json_object
- Syntax: get_json_object(json_string, '$.key')
- Description: parses the JSON string json_string and returns the content pointed to by the path. Returns NULL if the input JSON string is invalid. This function can only return one value per call; the path can also point to nested keys or array elements, as sketched after the example below.
- Example:
select get_json_object('{"name":"zhangsan","age":18}','$.name');
- Result:
name |
---|
zhangsan |
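A small sketch of the nested-path and array-index forms (the sample JSON below is made up for illustration):
-- nested key
select get_json_object('{"name":"zhangsan","info":{"city":"beijing"}}', '$.info.city');   -- beijing
-- array index
select get_json_object('{"name":"zhangsan","skills":["hive","spark"]}', '$.skills[0]');   -- hive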
If you want to parse both the name field and the age field, you can write it like this:
select get_json_object('{"name":"zhangsan","age":18}','$.name'), get_json_object('{"name":"zhangsan","age":18}','$.age');
However, if there are many fields to parse, writing it this way gets tedious, which is where the json_tuple function comes in.
2. json_tuple
- Syntax: json_tuple(json_string, k1, k2 ...)
- Description: parses the JSON string json_string; multiple keys can be specified in one call and the corresponding values are returned. Returns NULL if the input JSON string is invalid.
- Example:
select b.name, b.age from tableName a lateral view json_tuple('{"name":"zhangsan","age":18}','name','age') b as name, age;
- Result:
name | age |
---|---|
zhangsan | 18 |
Note: there is no $. in the json_tuple call above.
If you add $. to the keys passed to json_tuple, parsing fails:
select b.name, b.age from tableName a lateral view json_tuple('{"name":"zhangsan","age":18}','$.name','$.age') b as name, age;
result:
name | age |
---|---|
NULL | NULL |
Both fields come back NULL, so do not add $. to the keys in json_tuple, otherwise they will not be parsed.
Summary: compared with get_json_object, the advantage of json_tuple is that it can parse multiple JSON fields in one call. But if the data is a JSON array, neither of these functions can handle it.
Parsing JSON arrays in Hive
1. Parsing a JSON array with a nested subquery
Suppose a Hive table has a json_str field with the following contents:
json_str |
---|
[{"website":"baidu.com","name": "Baidu"}, {"website":"google.com","name": "Google"}] |
We want to parse this field to form the following structure:
website | name |
---|---|
baidu.com | Baidu |
google.com | Google |
The two functions described above are not enough to parse this JSON array; the following functions are also needed:
explode function
- Syntax: explode(ARRAY or MAP)
- Description: the explode() function takes an array or a map as input and outputs each element as its own row, i.e. it splits a complex array or map column of a Hive table into multiple rows. It is also known as a column-to-rows (table-generating) function.
- Example:
-- explode an array
hive> select explode(array('A','B','C'));
OK
A
B
C
-- explode a map
hive> select explode(map('A',10,'B',20,'C',30));
OK
A 10
B 20
C 30
regexp_replace function
- Syntax: regexp_replace(string A, string B, string C)
- Description: replaces the parts of string A that match the Java regular expression B with C. Note that escape characters are needed in some cases. It is similar to Oracle's regexp_replace function.
- Example:
hive> select regexp_replace('foobar', 'oo|ar', '');
OK
fb
The example above replaces every oo or ar in the string with an empty string.
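As a small illustration of the escape characters mentioned above, square brackets are regex metacharacters and have to be escaped (a sketch; this is the same pattern used below to strip the brackets of a JSON array):
hive> select regexp_replace('[abc]', '\\[|\\]', '');
OK
abc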
With these functions in hand, let's parse the contents of the json_str field:
First, parse out the elements of the JSON array and turn each one into its own row:
hive> SELECT explode(split(
          regexp_replace(
              regexp_replace(
                  '[{"website":"baidu.com","name":"Baidu"},{"website":"google.com","name":"Google"}]',
                  '\\[|\\]', ''),
              '\\}\\,\\{', '\\}\\;\\{'),
          '\\;'));
OK
{"website":"baidu.com","name":"Baidu"}
{"website":"google.com","name":"Google"}
A brief walkthrough of the SQL above:
SELECT explode(split(
    regexp_replace(
        regexp_replace(
            '[{"website":"baidu.com","name":"Baidu"},{"website":"google.com","name":"Google"}]',
            '\\[|\\]', ''),        -- remove the square brackets around the JSON array
        '\\}\\,\\{', '\\}\\;\\{'), -- replace the commas between array elements with semicolons
    '\\;'));                       -- split on the semicolons (split uses the semicolon as the separator)
Why replace the commas between the JSON array elements with semicolons?
Because the fields inside each element are also separated by commas. If the commas between elements were not replaced, split would break up the data inside each element as well, which is not the result we want.
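For comparison, here is a sketch of what splitting directly on the comma would produce, with each element torn in half:
hive> SELECT explode(split(
          regexp_replace(
              '[{"website":"baidu.com","name":"Baidu"},{"website":"google.com","name":"Google"}]',
              '\\[|\\]', ''),
          '\\,'));
OK
{"website":"baidu.com"
"name":"Baidu"}
{"website":"google.com"
"name":"Google"}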
In the previous step we turned the JSON array into multiple JSON strings. Next, combine that with the json_tuple function to parse the fields out of each JSON string:
select json_tuple(
    explode(split(
        regexp_replace(regexp_replace('[{"website":"baidu.com","name":"Baidu"},{"website":"google.com","name":"Google"}]', '\\[|\\]', ''), '\\}\\,\\{', '\\}\\;\\{'),
        '\\;')),
    'website', 'name');
When the above statement is executed, an error is reported:
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
This means that a UDTF cannot be nested inside another function, i.e. the explode here cannot be written inside json_tuple.
Since explode cannot be nested inside json_tuple, we can use a subquery instead, as shown below:
select json_tuple(json, 'website', 'name')
from (
    select explode(split(
        regexp_replace(regexp_replace('[{"website":"baidu.com","name":"Baidu"},{"website":"google.com","name":"Google"}]', '\\[|\\]', ''), '\\}\\,\\{', '\\}\\;\\{'),
        '\\;')) as json
) t;
This statement executes without error, and the result is as follows:
baidu.com Baidu
google.com Google
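To run the same query against the table instead of a string literal, the inner query simply references the json_str column (a sketch, reusing the tableName and json_str names from above):
select json_tuple(json, 'website', 'name')
from (
    select explode(split(
        regexp_replace(regexp_replace(json_str, '\\[|\\]', ''), '\\}\\,\\{', '\\}\\;\\{'),
        '\\;')) as json
    from tableName
) t;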
2. Using lateral view to parse a JSON array
Suppose a Hive table has goods_id and json_str fields with the following contents:
goods_id | json_str |
---|---|
1,2,3 | [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}] |
Goal: parse out the goods_id field together with the monthSales values from the json_str field.
Let's start parsing:
First, split the goods_id field and convert the JSON array into multiple JSON strings:
select explode(split(goods_id, ',')) as good_id,
       explode(split(regexp_replace(regexp_replace(json_str, '\\[|\\]', ''), '\\}\\,\\{', '\\}\\;\\{'), '\\;')) as sale_info
from tableName;
When the above statement is executed, an error is reported:
FAILED: SemanticException 3:0 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'sale_info'
This means that when using a UDTF, the SELECT clause only supports a single expression. The statement above has two expressions in its SELECT, hence the error.
To solve this problem, we have to introduce another hive syntax:
lateral view
lateral view is used together with UDTFs such as explode. It can split one row of data into multiple rows, and the split rows can then be aggregated. lateral view first applies the UDTF to each row of the base table; the UDTF turns that row into one or more rows, and lateral view then joins the results into a virtual table with an alias.
Example:
Suppose we have a table of users' hobbies, hobbies_table, with two columns: the first is the user's name, and the second, id_list, is an array holding the ids of the user's interests:
name | id_list |
---|---|
zhangsan | [1,2,3] |
lisi | [3,4,5] |
We want to count how many times each interest id occurs across all users:
1. Explode the interest ids into rows:
SELECT name, hobby_id FROM hobbies_table LATERAL VIEW explode(id_list) tmp_table AS hobby_id;
The result of this SQL:
name | hobby_id |
---|---|
zhangsan | 1 |
zhangsan | 2 |
zhangsan | 3 |
lisi | 3 |
lisi | 4 |
lisi | 5 |
2. Group and aggregate by hobby_id:
SELECT hobby_id, count(name) client_num FROM hobbies_table LATERAL VIEW explode(id_list) tmp_table AS hobby_id GROUP BY hobby_id;
result:
hobby_id | client_num |
---|---|
1 | 1 |
2 | 1 |
3 | 2 |
4 | 1 |
5 | 1 |
Now that lateral view has been introduced, let's go back and solve the earlier problem that SELECT only supports a single expression when a UDTF is used:
select good_id, get_json_object(sale_json, '$.monthSales') as monthSales
from tableName
lateral view explode(split(goods_id, ',')) goods as good_id
lateral view explode(split(regexp_replace(regexp_replace(json_str, '\\[|\\]', ''), '\\}\\,\\{', '\\}\\;\\{'), '\\;')) sales as sale_json;
Note: the statement above is the Cartesian product of three tables (the base table and the two lateral views), so this approach only suits cases where the data volume is not large.
The execution results of the above statements are as follows:
good_id | monthSales |
---|---|
1 | 4900 |
1 | 2090 |
1 | 6987 |
2 | 4900 |
2 | 2090 |
2 | 6987 |
3 | 4900 |
3 | 2090 |
3 | 6987 |
If the table has other fields, we can use them to filter the combined rows down to the ones that actually belong together.
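For example, if the i-th value in goods_id is known to correspond to the i-th element of the JSON array, posexplode can be used instead of explode: it returns each element together with its position, so only the rows whose positions match are kept (a sketch under that assumption):
select good_id, get_json_object(sale_json, '$.monthSales') as monthSales
from tableName
lateral view posexplode(split(goods_id, ',')) goods as g_pos, good_id
lateral view posexplode(split(regexp_replace(regexp_replace(json_str, '\\[|\\]', ''), '\\}\\,\\{', '\\}\\;\\{'), '\\;')) sales as s_pos, sale_json
where g_pos = s_pos;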
Summary: lateral view usually appears together with a UDTF, in order to work around the restriction that a UDTF does not allow additional expressions in the SELECT clause.