Notes on MySQL multi table connection

Keywords: Database MySQL SQL

Introduction: in the project, I met the need to merge MySQL database tables. In the process of merging, I should skip the existing records, because the primary key was not set in the initial stage of table design. Finally, I realized the requirements through MySQL's not exists statement. By the way, I explored the multi table connection in MySQL. The records are as follows.

Note: only for personal notes, please be careful

1. Create test cases

Create a player table and a guild table and insert data into them. Because we don't investigate the problems related to indexes, we don't set primary keys and indexes.

use test;
drop table if exists guild;
create table guild (id int, guild_name varchar(20) );
insert into guild (id, guild_name) values (1001,"g1"),(1002,"g2"),(1004,"g4");
drop table if exists player;
create table player (id int, player_name varchar(20), guild_id int);
insert into player (id, player_name, guild_id) values (1000001,"p1",1001),(1000002,"p2",1001),(1000003,"p3",1002),(1000004,"p4",1003);

The data in the table are as follows:

mysql> select * from player;
+---------+-------------+----------+
| id      | player_name | guild_id |
+---------+-------------+----------+
| 1000001 | p1          |     1001 |
| 1000002 | p2          |     1001 |
| 1000003 | p3          |     1002 |
| 1000004 | p4          |     1003 |
+---------+-------------+----------+
4 rows in set (0.00 sec)

mysql> select * from guild;
+------+------------+
| id   | guild_name |
+------+------------+
| 1001 | g1         |
| 1002 | g2         |
| 1004 | g4         |
+------+------------+
3 rows in set (0.00 sec)

We deliberately staggered the trade union table by 1003 for the following inquiry.

2. Multiple forms of MySQL multi table query:

Table join, using join Keywords: inner join, left join, right join.

Subquery, where followed by query condition: where clause with in, any, exists and not exists.

3. Investigation join (Block Nest Join and hash join)

If we need to query the union name of the player, we will use the table connection to connect the player table and the guild table:

select player.id, player.player_name, guild.id, guild.guild_name 
from player 
left join guild on guild.id=player.guild_id;

The current MySQL version is 8.0.23 MySQL. Let's use explain to view the execution plan:

mysql> explain select player.id, player.player_name, guild.id, guild.guild_name from player left join guild on guild.id=player.guild_id;
+----+-------------+--------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
| id | select_type | table  | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                      |
+----+-------------+--------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
|  1 | SIMPLE      | player | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    4 |   100.00 | NULL                                       |
|  1 | SIMPLE      | guild  | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |   100.00 | Using where; Using join buffer (hash join) |
+----+-------------+--------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

Because the index is not set, the fields related to the relational index are not used. We are concerned about extra: using where; Using the join buffer (hash join), you can see that the join buffer (hash join) is used for the guild table in the execution plan.

If MySQL version 5.7.34 is used for query, it is found that Block Nested Loop is used.

mysql> explain select player.id, player.player_name, guild.id, guild.guild_name from player left join guild on guild.id=player.guild_id;
+----+-------------+--------+------------+------+---------------+------+---------+------+------+----------+----------------------------------------------------+
| id | select_type | table  | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                              |
+----+-------------+--------+------------+------+---------------+------+---------+------+------+----------+----------------------------------------------------+
|  1 | SIMPLE      | player | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    4 |   100.00 | NULL                                               |
|  1 | SIMPLE      | guild  | NULL       | ALL  | PRIMARY       | NULL | NULL    | NULL |    3 |   100.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+--------+------------+------+---------------+------+---------+------+------+----------+----------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

MySQL's hash join is a feature added in version 8.0.18. For the differences between hash join and Block Nested Loop, please refer to MySQL 8.0 release, are you familiar with and unfamiliar with Hash Join- Nuggets (juejin.cn)

In MySQL version 8.0.18, the format format=tree is added to explain. You can view the specific process of executing the plan:

mysql> explain format=tree select player.id, player.player_name, guild.id, guild.guild_name from player left join guild on guild.id=player.guild_id\G;
*************************** 1. row ***************************
EXPLAIN: -> Left hash join (guild.id = player.guild_id)  (cost=1.59 rows=12)
    -> Table scan on player  (cost=0.65 rows=4)
    -> Hash
        -> Table scan on guild  (cost=0.14 rows=3)

1 row in set (0.00 sec)

The general implementation process is as follows:

Select guild as the hash candidate table, scan and create the hash table and put it into memory.
Scan the player table, compare it with the hash table in memory, and execute the left hash join.

We can also view the optimization process of MySQL optimizer on statements through Optimizer Trace:

-- Remember to set as after execution off
SET optimizer_trace="enabled=on"; 
-- To execute select sentence...
SELECT * FROM information_schema.optimizer_trace;

The result has four fields. We only care about the TRACE field, which is a large json. The general structure is as follows:

{
	trace:{
		steps:[
			{join_preparation:{}}, 					/*SQL Preparation stage*/
			{join_optimization:{}},					/*SQL Optimization stage*/
			{join_execution:{}}						/*SQL Execution phase*/
		]
	}
}

Skip here first (because I can't see anything...), and then use it to see the optimization of in and exists statements.

4. semijoin

Now, we need to find the players whose union id is in the table guild:

select * from player where player.guild_id in (select id from guild);

View explain:

mysql> explain select * from player where player.guild_id in (select id from guild);
+----+-------------+--------+------------+------+---------------+------+---------+------+------+----------+-----------------------------------------------------------+
| id | select_type | table  | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                                     |
+----+-------------+--------+------------+------+---------------+------+---------+------+------+----------+-----------------------------------------------------------+
|  1 | SIMPLE      | guild  | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |   100.00 | Start temporary                                           |
|  1 | SIMPLE      | player | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    4 |    25.00 | Using where; End temporary; Using join buffer (hash join) |
+----+-------------+--------+------------+------+---------------+------+---------+------+------+----------+-----------------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

Not surprisingly, hash join is also used.

View explain format=tree:

mysql> explain format=tree select * from player where player.guild_id in (select id from guild)\G;
*************************** 1. row ***************************
EXPLAIN: -> Remove duplicate player rows using temporary table (weedout)  (cost=2.00 rows=3)
    -> Inner hash join (player.guild_id = guild.id)  (cost=2.00 rows=3)
        -> Table scan on player  (cost=0.35 rows=4)
        -> Hash
            -> Table scan on guild  (cost=0.55 rows=3)

1 row in set (0.00 sec)

You can see that compared with the previous join example, there is an additional process of de duplication through the temporary table.

Check Optimizer Trace. Because the subquery is nested, MySQL will optimize each statement, so the result is larger.

In the join preparation phase, we can see:

{
  "transformation": {
    "select#": 2,
    "from": "IN (SELECT)",
    "to": "semijoin",
    "chosen": true,
    "transformation_to_semi_join": {
      "subquery_predicate": "`player`.`guild_id` in (/* select#2 */ select `guild`.`id` from `guild`)",
      "embedded in": "WHERE",
      "semi-join condition": "(`player`.`guild_id` = `guild`.`id`)",
      "decorrelated_predicates": [
        {
          "outer": "`player`.`guild_id`",
          "inner": "`guild`.`id`"
        }
      ]
    }
  }
}

"select#": 2 refers to the second select statement, that is, our sub query statement. As you can see, the optimizer converts IN to semijoin (semi join query).

For details of semi join query, please refer to the article of Tencent cloud database team: MySQL optimizer and SemiJoin optimization - cloud + community - Tencent cloud (tencent.com)

In terms of strategy selection, we can also see the process of selecting semijoin strategy by comparing the number of scanning lines and consumption:

{
  "semijoin_strategy_choice": [
    {
      "strategy": "MaterializeScan",
      "recalculate_access_paths_and_cost": {
        "tables": [
          {
            "table": "`player`",
            "best_access_path": {
              "considered_access_paths": [
                {
                  "rows_to_scan": 4,
                  "filtering_effect": [
                  ],
                  "final_filtering_effect": 1,
                  "access_type": "scan",
                  "using_join_cache": true,
                  "buffers_needed": 1,
                  "resulting_rows": 4,
                  "cost": 1.45,
                  "chosen": true
                }
              ]
            }
          }
        ]
      },
      "cost": 3.6,
      "rows": 1,
      "duplicate_tables_left": true,
      "chosen": true
    },
    {
      "strategy": "DuplicatesWeedout",
      "cost": 3.4,
      "rows": 1,
      "duplicate_tables_left": false,
      "chosen": true
    }
  ]
}

Also, the semijoin strategy selected last is duplicate weekly

{
  "final_semijoin_strategy": "DuplicateWeedout"
}

5. not exists (anti join)

Let's execute to find players whose union is not in the table ` guild:

select * from player where not exists (select * from guild where guild.id=player.guild_id);

View explain format=tree,

mysql> explain format=tree select * from player where not exists (select * from guild where guild.id=player.guild_id)\G;
*************************** 1. row ***************************
EXPLAIN: -> Hash antijoin (guild.id = player.guild_id)  (cost=1.86 rows=12)
    -> Table scan on player  (cost=0.65 rows=4)
    -> Hash
        -> Table scan on guild  (cost=0.41 rows=3)

1 row in set, 1 warning (0.00 sec)

You can see that anti join is used

Viewing Optimizer Trace, you can also see the process of converting to antijoin:

{
  "transformation": {
    "select#": 2,
    "from": "IN (SELECT)",
    "to": "antijoin",
    "chosen": true,
    "transformation_to_semi_join": {
      "subquery_predicate": "exists(/* select#2 */ select 1 from `guild` where (`guild`.`id` = `player`.`guild_id`)) is false",
      "embedded in": "WHERE",
      "semi-join condition": "(`player`.`guild_id` = `guild`.`id`)",
      "decorrelated_predicates": [
        {
          "outer": "`player`.`guild_id`",
          "inner": "`guild`.`id`"
        }
      ]
    }
  }
}

According to the official MySQL statement, anti join can remove some sub queries to improve efficiency:

"The optimizer now transforms a WHERE condition having NOT IN (subquery), NOT EXISTS (subquery), IN (subquery) IS NOT TRUE, or EXISTS (subquery) IS NOT TRUE internally into an antijoin, thus removing the subquery."

For details, please refer to Antijoin in MySQL 8 | MySQL Server Blog (mysqlserverteam.com)

6. Summary

In mysql, whether you display subqueries using the join keyword or where in/exists different tables, the table join operation will be triggered. Before MySQL 8.0.18, Block Nest Join will be triggered, and then it will be optimized as hash join. Specific situations need specific analysis. When table connection is required, master several principles:

Use small tables to drive large tables.
If it is before 8.0.18, select the index correctly to convert BNL(Block Nest Join) to NLJ (index nested loop join).
Make good use of analysis tools to analyze bottlenecks.

The energy level is limited, and many details are not explored in depth. The article only makes personal notes. Please be careful for reference

Posted by pipe_girl on Fri, 17 Sep 2021 06:32:06 -0700

Programmer Group