[hard hive] hive advanced (15): optimize (15) Explain view execution plan

Keywords: Big Data hive

Welcome to the blog home page: Wechat search: Import_ Bigdata, hard core original author in big data field_ import_bigdata_ CSDN blog
Welcome to like, collect, leave messages, and exchange messages!
This article was originally written by [Wang Zhiwu] and started on CSDN blog!
This article is the first CSDN forum. It is strictly prohibited to reprint without the permission of the official and myself!

This article is right [hard big data learning route] learning guide for experts from zero to big data (fully upgraded version) The Hive section of.

0 Introduction

Hive SQL's execution plan describes the overall outline of the actual execution of SQL. Through the execution plan, you can understand the execution logic of the SQL program when it is converted into the corresponding computing engine. If you master the execution logic, you can better grasp the bottleneck of the program, so as to achieve more targeted optimization. In addition, it can help developers identify that seemingly equivalent SQL is actually unequal, and seemingly unequal SQL is actually equivalent SQL. It can be said that the execution plan is a key to open the door of SQL optimization.

To learn the SQL execution plan, you need to learn the command to view the execution plan: explain. Adding the keyword explain in front of the SQL of the query statement is the basic method to view the execution plan.

Learning to explain can bring great convenience to the use of hive in our work!

View SQL execution plan

The execution plan provided by Hive can view the following information:

  • explain: view the basic information of the execution plan;

  • explain dependency: the use of dependency in an explain statement produces additional information about the inputs in the plan. It shows the various properties of the input;

  • explain authorization: view information about permissions related to SQL operations;

  • explain vectorization: view the vectorization description information of SQL to show why Map and Reduce are not vectorized. Support from Hive 2.3.0;

  • explain analyze: annotate the plan with the actual number of lines. Support from Hive 2.2.0;

  • explain cbo: output the plan generated by the compute optimizer. CBO is supported from hive version 4.0.0;

  • explain locks: This is useful for understanding which LOCKS the system will acquire to run the specified query. LOCKS is supported from Hive 3.2.0;

  • explain ast: output the abstract syntax tree of the query. AST was deleted in hive version 2.1.0. There is a bug. Dumping AST may lead to OOM errors, which will be repaired in version 4.0.0;

  • explain extended: add extended to output additional information about the plan. This is usually physical information, such as file names, which is of little use to us;

1.  explain 

one point one   explain usage

Hive provides the explain command to show the execution plan of a query. This execution plan is very helpful for us to understand the underlying principle, hive tuning, troubleshooting data skew, etc.

The syntax used is as follows:

explain query;

In hive cli, enter the following command (hive 2.3.7):

explain select sum(id) from test1;

Results obtained:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test1
            Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int)
              outputColumnNames: id
              Statistics: Num rows: 6 Data size: 75 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(id)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

How do you feel after reading the above contents? Don't worry. Each parameter will be explained in detail below. I believe you will be able to read the explain query results after learning the following contents.

A HIVE query is transformed into a sequence of one or more stages (directed acyclic graph DAG). These stages can be MapReduce stage s, stages responsible for metadata storage, or stages responsible for file system operations (such as moving and renaming).

Let's split the above results into two parts, starting from the outermost layer:

  1. stage dependencies: dependencies between stages
  2. stage plan: the execution plan of each stage

Let's first look at the stage dependencies in the first part, which contains two stages. Stage-1 is the root stage, indicating that this is the starting stage. Stage-0 depends on stage-1. After stage-1 is executed, stage-0 is executed.

Let's look at the second part of the stage plan. There is a Map Reduce, and an MR implementation plan is divided into two parts:

  1. Map Operator Tree: the execution plan tree on the MAP side
  2. Reduce Operator Tree: the execution plan tree on the reduce side

The two execution plan trees contain the operator s of this sql statement:

  1. TableScan: table scanning operation. The first operation on the map side must be to load a table, so it is a table scanning operation. Common attributes:
    • alias: table name
    • Statistics: table statistics, including the number and size of data in the table
  2. Select Operator: select operation. Common properties:
    • expressions: required field name and field type
    • outputColumnNames: the name of the output column
    • Statistics: table statistics, including the number and size of data in the table
  3. Group By Operator: group aggregation operation. Common attributes:
    • aggregations: displays aggregate function information
    • Mode: aggregation mode. The values are hash: random aggregation, which is hash partition; partial: local aggregation; Final: final aggregation
    • keys: grouped field. If there is no grouping, there is no such field
    • outputColumnNames: output column names after aggregation
    • Statistics: table statistics, including the number and size of data after grouping and aggregation
  4. Reduce Output Operator: output to reduce operation. Common attributes:
    • sort order: do not sort if the value is empty; The value is + positive sort, and the value is - reverse sort; Value is+-   The sorted columns are two columns. The first column is in positive order and the second column is in reverse order
  5. Filter Operator: filter operation, common attributes:
    • Predict: filter condition. If where id > = 1 in the sql statement, it will be displayed here (ID > = 1)
  6. Map Join Operator: join operation, common attributes:
    • condition map: join mode, such as inner join 0 to 1 left outer join 0 to 2
    • Keys: condition field of join
    • outputColumnNames: the fields output after the join is completed
    • Statistics: the number and size of data pieces generated after the join is completed
  7. File Output Operator: file output operation, common attributes
    • compressed: whether to compress
    • Table: table information, including input / output file format, serialization, etc
  8. Fetch Operator client obtains data. Common attributes:
    • Limit. A value of - 1 means no limit on the number of entries. Other values are the number of entries

1.2 use scenario of explain

This section introduces the convenience and confusion that explain can bring to our production practice

https://www.cnblogs.com/qiu-hua/p/14472190.html

2. explain dependency usage

explain dependency is used to describe the data source required for a piece of SQL. The output is a json format data, which contains the following two parts:

  • input_partitions: describes the partition of the data source table on which a segment of SQL depends. The partition name list is stored in it. If all tables contained in the whole segment of SQL are non partitioned tables, it will be empty.

  • input_tables: describes a SQL dependent data source table, which stores a list of Hive table names.

Use explain dependency to view the non partitioned ordinary table of SQL query, and enter the following command in hive cli:

explain dependency select s_age,count(1) num from student_orc;

Results obtained:

{"input_partitions":[],"input_tables":[{"tablename":"default@student_tb _orc","tabletype":"MANAGED_TABLE"}]}

Use explain dependency to view the SQL query partition table, and enter the following command in hive cli:

explain dependency select s_age,count(1) num from student_orc_partition;

Results obtained:

{"input_partitions":[{"partitionName":"default@student_orc_partition@ part=0"}, 
{"partitionName":"default@student_orc_partition@part=1"}, 
{"partitionName":"default@student_orc_partition@part=2"}, 
{"partitionName":"default@student_orc_partition@part=3"},
{"partitionName":"default@student_orc_partition@part=4"}, 
{"partitionName":"default@student_orc_partition@part=5"},
{"partitionName":"default@student_orc_partition@part=6"},
{"partitionName":"default@student_orc_partition@part=7"},
{"partitionName":"default@student_orc_partition@part=8"},
{"partitionName":"default@student_orc_partition@part=9"}], 
"input_tables":[{"tablename":"default@student_orc_partition", "tabletype":"MANAGED_TABLE"}]

There are two usage scenarios for explain dependency:

  • Scenario 1: quick exclusion. Quickly eliminate the exception of task data output caused by failure to read the data of the corresponding partition. For example, in a task divided by days, the upstream task is abnormal or runs away due to uncontrollable factors in the production process, resulting in exceptions caused by the downstream task. In this way, you can quickly check whether there are exceptions in the partition read by SQL.

  • Scenario 2: sort out the table input to help understand the operation of the program, especially the dependent input with multiple subqueries and multi table connections.

Here are two cases to see the practical application of explain dependency:

Case 1: identify seemingly equivalent codes

For programmers who have just come into contact with SQL, it is easy to

select * from a inner join b on a.no=b.no and a.f>1 and a.f<3;

Equivalent to

select * from a inner join b on a.no=b.no where a.f>1 and a.f<3;

We can see their differences through cases:

Code 1:

select 
a.s_no 
from student_orc_partition a 
inner join 
student_orc_partition_only b 
on a.s_no=b.s_no and a.part=b.part and a.part>=1 and a.part<=2;

Code 2:

select 
a.s_no 
from student_orc_partition a 
inner join 
student_orc_partition_only b 
on a.s_no=b.s_no and a.part=b.part 
where a.part>=1 and a.part<=2;

Let's take a look at the output of the above two code explain dependency:

explain dependency result for code 1

{"input_partitions": 
[{"partitionName":"default@student_orc_partition@part=0"}, 
{"partitionName":"default@student_orc_partition@part=1"}, 
{"partitionName":"default@student_orc_partition@part=2"},
{"partitionName":"default@student_orc_partition_only@part=1"}, 
{"partitionName":"default@student_orc_partition_only@part=2"}], 
"input_tables": [{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}, {"tablename":"default@student_orc_partition_only","tabletype":"MANAGED_TABLE"}]}

explain dependency result of code 2:

{"input_partitions": 
[{"partitionName":"default@student_orc_partition@part=1"}, 
{"partitionName" : "default@student_orc_partition@part=2"},
{"partitionName" :"default@student_orc_partition_only@part=1"},
{"partitionName":"default@student_orc_partition_only@part=2"}], 
"input_tables": [{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}, {"tablename":"default@student_orc_partition_only","tabletype":"MANAGED_TABLE"}]}

It can be seen from the above output results that the above two SQL statements are not equivalent. After adding non equivalent filter conditions to the connection condition (on) in the inner join, code 1 does not filter the left and right tables of the inner join according to the filter conditions, and the inner join will read more partition data with part=0 during execution. In code 2, unqualified partitions are filtered out.

Case 2: identify the difference in the data range read by SQL

Code 1:

explain dependency
select
a.s_no 
from student_orc_partition a 
left join 
student_orc_partition_only b 
on a.s_no=b.s_no and a.part=b.part and b.part>=1 and b.part<=2;

Code 2:

explain dependency 
select 
a.s_no 
from student_orc_partition a 
left join 
student_orc_partition_only b 
on a.s_no=b.s_no and a.part=b.part and a.part>=1 and a.part<=2;

Is the data reading range of the above two codes the same? The answer is different. Let's look at it through explain dependency:

explain dependency result of code 1:

{"input_partitions": 
[{"partitionName": "default@student_orc_partition@part=0"}, 
{"partitionName":"default@student_orc_partition@part=1"}, ...Omit 7 partitions in the middle
{"partitionName":"default@student_orc_partition@part=9"}, 
{"partitionName":"default@student_orc_partition_only@part=1"}, 
{"partitionName":"default@student_orc_partition_only@part=2"}], 
"input_tables": [{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}, {"tablename":"default@student_orc_partition_only","tabletype":"MANAGED_TABLE"}]}

explain dependency result of code 2:

{"input_partitions": 
[{"partitionName":"default@student_orc_partition@part=0"}, 
{"partitionName":"default@student_orc_partition@part=1"}, ...Omit 7 partitions in the middle 
{"partitionName":"default@student_orc_partition@part=9"}, 
{"partitionName":"default@student_orc_partition_only@part=0"}, 
{"partitionName":"default@student_orc_partition_only@part=1"}, ...Omit 7 partitions in the middle 
{"partitionName":"default@student_orc_partition_only@part=9"}],
"input_tables": [{"tablename":"default@student_orc_partition","tabletype":"MANAGED_TABLE"}, {"tablename":"default@student_orc_partition_only","tabletype":"MANAGED_TABLE"}]}

It can be seen that non equivalent filtering conditions are added to the connection conditions for the left external connection. If the filtering conditions are applied to the right table (table b), the right table can only scan two partitions, but the left table (table a) will scan the whole table. If the filtering condition is for the left table, it does not play the role of filtering at all, and the two tables will be scanned for the whole table. In this case, as with all external connections, it is necessary to scan the two data tables.

In the process of use, it is easy to think that code fragment 2 can filter data like code fragment 1. By viewing the output result of explain dependency, you can know that this is not the case.

3. explain authorization

Through explain authorization, you can know the data sources (INPUTS) and data OUTPUTS of the current SQL access, as well as the CURRENT_USER and OPERATION of the current Hive.

In hive cli, enter the following command:

explain authorization 
select variance(s_score) from student_tb_orc;

The results are as follows:

INPUTS: 
  default@student_tb_orc 
OUTPUTS: 
  hdfs://node01:8020/tmp/hive/hdfs/cbf182a5-8258-4157-9194- 90f1475a3ed5/-mr-10000 
CURRENT_USER: 
  hdfs 
OPERATION: 
  QUERY 
AUTHORIZATION_FAILURES: 
  No privilege 'Select' found for inputs { database:default, table:student_ tb_orc, columnName:s_score}

From the above information:

The data source of the above case is the student in the defalut database_ tb_ Orc table;

The output path of the data is hdfs://node01:8020/tmp/hive/hdfs/cbf182a5-8258-4157-9194-90f1475a3ed5/-mr-10000;

The current operation user is hdfs, and the operation is query;

Looking at the above information, we will also see authorization_ The failures message indicates that you do not have query permission for the current input, but it can run normally if you run the above SQL. Why does this happen? Hive does not perform permission verification when permission management is not configured by default. All users in hive are super administrators. Even if specific users are not empowered, they can query normally.

last

Through the above introduction to explain, we can find that there are many contents worthy of our study in explain. Reading the explain execution plan will help us optimize Hive SQL and improve our control over SQL.

 

Posted by Tobeon on Fri, 10 Sep 2021 17:10:49 -0700