Two Temporary Tables in MySQL

Keywords: JSON MySQL SQL Session

Change from: http://mysql.taobao.org/monthly/2016/06/07/

External Temporary Table

Through CREATE TEMPORARY TABLE
Create temporary tables, which are called external temporary tables. This temporary table is only visible to the current user and closes automatically at the end of the current session. This temporary table can be named with the same name as the non-temporary table (after the same name, the non-temporary table will not be visible to the current session until the temporary table is deleted).

Internal Temporary Table

Internal temporary table is a special lightweight temporary table for performance optimization. This temporary table is automatically created by MySQL and used to store intermediate results of certain operations. These operations may be included in the optimization phase or in the execution phase. This internal table is invisible to users, but through EXPLAIN or SHOW STATUS, you can see whether MYSQL uses internal temporary tables to help complete an operation. Internal temporary tables play a very important role in the optimization of SQL statements. Many operations in MySQL depend on internal temporary tables for optimization. However, the use of internal temporary tables requires the creation of tables and the access cost of intermediate data, so users should try to avoid using temporary tables when writing SQL statements.

There are two types of internal temporary tables: one is HEAP temporary tables. All data of such temporary tables will be stored in memory, and no IO operation is required for the operation of such tables. Another is the OnDisk temporary table, which, as its name implies, stores data on disk. OnDisk temporary tables are used to handle operations with larger intermediate results. If HEAP temporary tables store more data than MAX_HEAP_TABLE_SIZE (see the System Variables section of MySQL Manual for details), HEAP temporary tables will be automatically converted to OnDisk temporary tables. OnDisk Temporary Table in 5.7 can choose to use MyISAM engine or InnoDB engine through INTERNAL_TMP_DISK_STORAGE_ENGINE system variable.

This article focuses on what operations might be used for internal temporary tables. If the user can use interior temporary table as little as possible to optimize the query when writing the SQL statement, it will effectively improve the efficiency of query execution.

First, we define a table t1.

CREATE TABLE t1( a int, b int); INSERT INTO t1 VALUES(1,2),(3,4);

All of the following operations are exemplified based on table t1.

Use SQL_BUFFER_RESULT hint in SQL statements
SQL_BUFFER_RESULT is mainly used to make MySQL release the locks on tables as soon as possible. Because if the amount of data is very large, it will take a long time to send the data to the client. By buffering the data into the temporary table, the occupied time of reading lock on the table can be effectively reduced.
For example:

    mysql> explain format=json select SQL_BUFFER_RESULT * from t1;
    EXPLAIN
    {
      "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "2.00"
        },
        "buffer_result": {
          "using_temporary_table": true,
          "table": {
            "table_name": "t1",
            "access_type": "ALL",
        ...

If DERIVED_TABLE is included in the SQL statement.
In 5.7, because of the new optimization approach, we need to use set optimizer_switch='derived_merge=off'to prohibit derived table s from being merged into the outer Query.
For example:

    mysql> explain format=json select * from (select * from t1) as tt;
    EXPLAIN
    {
      "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "2.40"
        },
        "table": {
          "table_name": "tt",
          "access_type": "ALL",
          ...
          "materialized_from_subquery": {
            "using_temporary_table": true,
        ...

If we query the system table, the data of the system table will be stored in the interior temporary table.
We can't use EXPLAIN to see if we need to use internal temporary tables to read system table data, but we can use SHOW STATUS to see if we have used internal temporary tables.
For example:

    mysql> select * from information_schema.character_sets;
    mysql> show status like 'CREATE%';

If the DISTINCT statement is not optimized, that is, the DISTINCT statement is optimized to be converted to GROUP BY operation or UNIQUE INDEX is used to eliminate DISTINCT, the internal temporary table will be used.

    mysql> explain format=json select distinct a from t1;
    EXPLAIN
    {
    {
      "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "1.60"
        },
        "duplicates_removal": {
          "using_temporary_table": true,
        ...

If the query has an ORDER BY statement, it cannot be optimized. The following scenarios use internal temporary tables to cache intermediate data, and then sort the intermediate data.
1) If the join table uses BNL (Batched Nestloop)/BKA(Batched Key Access)
For example:

1) BNL is open by default

mysql> explain format=json select * from t1, t1 as t2 order by t1.a;
EXPLAIN
{
  "query_block": {
  "select_id": 1,
  "cost_info": {
    "query_cost": "22.00"
  },
  "ordering_operation": {
    "using_temporary_table": true,
  ...

2) When BNL is turned off, ORDER BY will use filesort directly.

mysql> set optimizer_switch='block_nested_loop=off';
Query OK, 0 rows affected (0.00 sec)
mysql> explain format=json select * from t1, t1 as t2 order by t1.a;
EXPLAIN
{
   "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "25.00"
    },
    "ordering_operation": {
      "using_filesort": true,
    ...

2) The column of ORDER BY does not belong to the column of the first join table in the execution plan.
For example:

mysql> explain format=json select * from t as t1, t as t2 order by t2.a;
EXPLAIN
{
   "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "25.00"
    },
    "ordering_operation": {
      "using_temporary_table": true,
    ...

3) If the ORDER BY expression is a complex expression.

So what kind of ORDER BY expression does MySQL think is a complex expression?

1) If the sorting expression is SP or UDF.
For example:

drop function if exists func1;
delimiter |
create function func1(x int)
returns int deterministic
begin
declare z1, z2 int;
set z1 = x;
set z2 = z1+2;
return z2;
end|
delimiter ;
explain format=json select * from t1 order by func1(a);
{
    "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "2.20"
    },
    "ordering_operation": {
      "using_temporary_table": true,
    ..

2) The column of ORDER BY contains aggregation function

To simplify the execution plan, we use INDEX to optimize GROUP BY statements.
For example:

  create index idx1 on t1(a);
  explain format=json SELECt a FROM t1 group by a order by sum(a);
  | {
       "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "1.20"
        },
        "ordering_operation": {
          "using_temporary_table": true,
          "using_filesort": true,
          "grouping_operation": {
            "using_filesort": false,
        ...
  drop index idx1 on t1;

3) The column of ORDER BY contains SCALAR SUBQUERY, which of course has not been optimized.
For example:

explain format=json select (select rand() from t1 limit 1) as a from t1 order by a;     
| {
      "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "1.20"
        },
        "ordering_operation": {
          "using_temporary_table": true,
          "using_filesort": true,
            ...

4) If the query contains both ORDER BY and GROUP BY statements, but the columns used by the two statements are different.

Note: If it is 5.7, we need to set sql_mode to non-only_full_group_by mode, otherwise we will report an error.

Similarly, in order to simplify the execution plan, we use INDEX to optimize GROUP BY statements.
For example:

set sql_mode='';
create index idx1 on t1(b);
explain format=json select t1.a from t1 group by t1.b order by 1;
| {
     "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "1.40"
        },
    "ordering_operation": {
          "using_temporary_table": true,
          "using_filesort": true,
          "grouping_operation": {
            "using_filesort": false,
    ...
drop index idx1 on t1;

If the query has a GROUP BY statement, it cannot be optimized. The following scenarios use internal temporary tables to cache intermediate data, and then GROUP BY the intermediate data.
1) If the join table uses BNL (Batched Nestloop)/BKA(Batched Key Access).
For example:

    explain format=json select t2.a from t1, t1 as t2 group by t1.a;
    | {
        "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "8.20"
        },
        "grouping_operation": {
          "using_temporary_table": true,
          "using_filesort": true,
          "cost_info": {
            "sort_cost": "4.00"
        ...

2) If the column of GROUP BY does not belong to the first join table in the execution plan.
For example:

    explain format=json select t2.a from t1, t1 as t2 group by t2.a;
    | {
        "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "8.20"
        },
        "grouping_operation": {
          "using_temporary_table": true,
          "using_filesort": true,
          "nested_loop": [
        ...

3) If the columns used in GROUP BY statement are different from those used in ORDER BY statement.
For example:

    set sql_mode='';
    explain format=json select t1.a from t1 group by t1.b order by t1.a;
    | {
       "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "1.40"
        },
        "ordering_operation": {
          "using_filesort": true,
          "grouping_operation": {
            "using_temporary_table": true,
            "using_filesort": false,
        ...

4) If GROUP BY has ROLLUP and is based on multi-off-table connections.
For example:

    explain format=json select sum(t1.a) from t1 left join t1 as t2 on true group by t1.a with rollup;
    | {
        "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "7.20"
        },
        "grouping_operation": {
          "using_temporary_table": true,
          "using_filesort": true,
          "cost_info": {
            "sort_cost": "4.00"
          },
        ...

5) If the column used in GROUP BY statement comes from SCALAR SUBQUERY and is not optimized.
For example:

    explain format=json select (select avg(a) from t1) as a from t1 group by a;
    | {
        "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "3.40"
        },
        "grouping_operation": {
          "using_temporary_table": true,
          "using_filesort": true,
          "cost_info": {
            "sort_cost": "2.00"
          },
        ...

Conversion of IN expression to semi-join for optimization
1) If the semi-join execution mode is Materialization
For example:

set optimizer_switch='firstmatch=off,duplicateweedout=off';
explain format=json select * from t1 where a in (select b from t1);
| {
    "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "5.60"
    },
    "nested_loop": [
      {
         "rows_examined_per_scan": 1,
              "materialized_from_subquery": {
                "using_temporary_table": true,
                "query_block": {
                  "table": {
                    "table_name": "t1",
                    "access_type": "ALL",

    ... 

2) If the semi-join execution mode is Duplicate Weedout
For example:

    set optimizer_switch='firstmatch=off';
    explain format=json select * from t1 where a in (select b from t1);
     | {
        "query_block": {
        "select_id": 1,
        "cost_info": {
          "query_cost": "4.80"
        },
        "duplicates_removal": {
          "using_temporary_table": true,
          "nested_loop": [
                {
        ...

If the query statement has UNION, MySQL will use internal temporary tables to help UNION operations eliminate duplication.
For example:

    explain format=json select * from t1 union select * from t1;
    | {
        "query_block": {
        "union_result": {
          "using_temporary_table": true,
          "table_name": "<union1,2>",
        ...

If the query statement is updated using multiple tables.
Explain can't see the internal temporary table being used here, so you need to check status.
For example:

update t1, t1 as t2 set t1.a=3;
show status like 'CREATE%';

If the aggregation function contains the following functions, the internal temporary table will also be used.
1) count(distinct *)
For example:

    explain format=json select count(distinct a) from t1;

2) group_concat
For example:

    explain format=json select group_concat(b) from t1;

In summary, MySQL will use internal temporary tables to cache intermediate results. If the amount of data is large, internal temporary tables will store data on disk, which obviously has an impact on performance. In order to minimize performance losses, we need to avoid these situations as much as possible.

Posted by lucidpc on Thu, 23 May 2019 12:37:24 -0700