Technology sharing | MySQL's join | buffer | size application on internal connection

Keywords: Database MySQL Session SQL JSON

Author: Yang Taotao

This article introduces the use of MySQL parameter join ﹣ BUFFER ﹣ size in the INNER JOIN scenario, which is not included in OUTER JOIN. Before discussing the BUFFER, let's first understand the INNER JOIN classification of MySQL.

If it is subdivided according to the retrieval performance, whether it is two or more inner joins, it can be roughly divided into the following categories:

1.JOIN KEY has index and primary key
2.JOIN KEY has index, secondary index
3.JOIN KEY has no index

Today, we mainly analyze the third scenario, which is the scenario of full table scanning.

Looking back, what is join buffer size?

JOIN BUFFER is a BUFFER memory block used by MySQL to cache the above two types of JOIN retrieval. It is generally recommended to set a small GLOBAL value, and make a proper adjustment based on SESSION or QUERY. For example, the default value is 512K. If you want to temporarily adjust it to 1G, then

mysql>set session join_buffer_size = 1024 * 1024 * 1024; 
mysql>select * from ...;    
mysql>set session join_buffer_size=default;    
//perhaps
mysql>select /*+  set_var(join_buffer_size=1G) */ * from ...;

Let's look at the usage of JOIN BUFFER in detail. There are several algorithms for INNER JOIN in MySQL,

1. Nested loop join is translated as nested loop join, or NLJ for short.

This is the simplest and easiest to understand table Association Algorithm in MySQL.

For example, take the statement select * from p1 join p2 using (r1),

First, take out a record ROW1 from table p1, and then traverse each record in table p2 with ROW1, and check whether the field r1 is the same for output; repeat the previous process until the comparison of the record numbers of the two tables is completed.

Look at the actual SQL execution plan

mysql> explain format=json select * from p1 inner join p2 as b using(r1)\G
*************************** 1. row ***************************
EXPLAIN: {
 "query_block": {
   "select_id": 1,
   "cost_info": {
     "query_cost": "1003179606.87"
   },
   "nested_loop": [
     {
       "table": {
         "table_name": "b",
         "access_type": "ALL",
         "rows_examined_per_scan": 1000,
         "rows_produced_per_join": 1000,
         "filtered": "100.00",
         "cost_info": {
           "read_cost": "1.00",
           "eval_cost": "100.00",
           "prefix_cost": "101.00",
           "data_read_per_join": "15K"
         },
         "used_columns": [
           "id",
           "r1",
           "r2"
         ]
       }
     },
     {
       "table": {
         "table_name": "p1",
         "access_type": "ALL",
         "rows_examined_per_scan": 9979810,
         "rows_produced_per_join": 997981014,
         "filtered": "10.00",
         "cost_info": {
           "read_cost": "5198505.87",
           "eval_cost": "99798101.49",
           "prefix_cost": "1003179606.87",
           "data_read_per_join": "14G"
         },
         "used_columns": [
           "id",
           "r1",
           "r2"
         ],
         "attached_condition": "(`ytt_new`.`p1`.`r1` = `ytt_new`.`b`.`r1`)"
       }
     }
   ]
 }
}
1 row in set, 1 warning (0.00 sec)        

From the above execution plan, table p2 is the first table (drive table or appearance), the second table is p1, so the number of records p2 needs to traverse is 1000, and the number of records p1 needs to traverse is about 1000W. To complete this SQL execution, you need to match table p1 (inner table) 1000 times, and the corresponding read cost is 5198505.87. How can we reduce the matching times of table p1? Then JOIN BUFFER will be useful

2. Block nested loop join (BNLJ for short)

Compared with NLJ, BNLJ has an extra BUFFER in the middle to cache the corresponding records of the surface, thus reducing the number of surface loops and the number of internal table matches. For the above example, if the join BUFFER size just can accommodate the corresponding JOIN KEY record of the appearance, the number of matches to table p2 will be reduced from 1000 to 1, and the performance will be directly improved by 1000 times.

Let's take a look at the implementation plan using BNLJ

mysql> explain format=json select * from p1 inner join p2 as b using(r1)\G
*************************** 1. row ***************************
EXPLAIN: {
 "query_block": {
   "select_id": 1,
   "cost_info": {
     "query_cost": "997986300.01"
   },
   "nested_loop": [
     {
       "table": {
         "table_name": "b",
         "access_type": "ALL",
         "rows_examined_per_scan": 1000,
         "rows_produced_per_join": 1000,
         "filtered": "100.00",
         "cost_info": {
           "read_cost": "1.00",
           "eval_cost": "100.00",
           "prefix_cost": "101.00",
           "data_read_per_join": "15K"
         },
         "used_columns": [
           "id",
           "r1",
           "r2"
         ]
       }
     },
     {
       "table": {
         "table_name": "p1",
         "access_type": "ALL",
         "rows_examined_per_scan": 9979810,
         "rows_produced_per_join": 997981014,
         "filtered": "10.00",
         "using_join_buffer": "Block Nested Loop",
         "cost_info": {
           "read_cost": "5199.01",
           "eval_cost": "99798101.49",
           "prefix_cost": "997986300.01",
           "data_read_per_join": "14G"
         },
         "used_columns": [
           "id",
           "r1",
           "r2"
         ],
         "attached_condition": "(`ytt_new`.`p1`.`r1` = `ytt_new`.`b`.`r1`)"
       }
     }
   ]
 }
}
1 row in set, 1 warning (0.00 sec)

The above implementation plan has two information,

First, there is an additional "using" join "buffer:" block nested loop "

Second, read_cost decreased from 5198505.87 to 5199.01

3. Recently, MySQL 8.0.18 released a new JOIN algorithm - HASH JOIN.

MySQL's hash join also uses the JOIN BUFFER to cache, but unlike BNLJ, it builds a hash table based on the appearance in the JOIN BUFFER, and the inner table matches the hash table through the hash algorithm, which further reduces the number of times the inner table matches. Of course, the official didn't give a detailed description of the algorithm. The above only represents personal imagination. That's still for the above SQL. Let's take a look at the execution plan.

mysql> explain format=tree select * from p1 inner join p2 as b using(r1)\G
*************************** 1. row ***************************
EXPLAIN: -> Inner hash join (p1.r1 = b.r1)  (cost=997986300.01 rows=997981015)
   -> Table scan on p1  (cost=105.00 rows=9979810)
   -> Hash
       -> Table scan on b  (cost=101.00 rows=1000)

1 row in set (0.00 sec)

From the above execution plan, we can see that a hash table is established for table p2, and then a hash match is made for table p1.

At present, it only supports to simply check whether HASH JOIN is used, and there is no other more information display.

In summary, this paper mainly discusses the inner table Association of MySQL in the inefficient scenario without any index. Other scenes will be opened separately.

Posted by deckrdx on Sat, 02 Nov 2019 08:03:18 -0700