Preface
- Interviewer: So, how would you query 10 million rows of data?
- Brother B: Just page through them directly, using LIMIT.
- Interviewer: Have you actually tried that in practice?
- Brother B: Of course I have.

Cue the song "Liang Liang" ("Cool Cool") for him; in other words, he's toast.
Perhaps some of you have never worked with a table holding tens of millions of rows, and don't know what happens when you query one.
Today, let's walk through it in practice. This test is based on MySQL 5.7.26.
Prepare data
What if you don't have 10 million rows of data?
Create them.
Generate 10 million rows with application code? No way, that's far too slow; it could genuinely take a whole day. A database script is much faster.
Create table
```sql
CREATE TABLE `user_operation_log` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `ip` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `op_data` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr1` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr2` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr3` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr4` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr5` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr6` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr7` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr8` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr9` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr10` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr11` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  `attr12` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;
```
Create data script
Batch insertion makes this much more efficient, committing every 1,000 rows; batching too many rows at once would itself slow the inserts down.
```sql
DELIMITER ;;
CREATE PROCEDURE batch_insert_log()
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE userId INT DEFAULT 10000000;
  set @execSql = 'INSERT INTO `test`.`user_operation_log`(`user_id`, `ip`, `op_data`, `attr1`, `attr2`, `attr3`, `attr4`, `attr5`, `attr6`, `attr7`, `attr8`, `attr9`, `attr10`, `attr11`, `attr12`) VALUES';
  set @execData = '';
  WHILE i <= 10000000 DO
    set @attr = "'Test long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long properties'";
    set @execData = concat(@execData, "(", userId + i, ", '10.0.69.175', 'User login operation'", ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ")");
    if i % 1000 = 0 then
      -- Every 1000 rows, build and execute one batched INSERT, then commit
      set @stmtSql = concat(@execSql, @execData, ";");
      prepare stmt from @stmtSql;
      execute stmt;
      DEALLOCATE prepare stmt;
      commit;
      set @execData = "";
    else
      set @execData = concat(@execData, ",");
    end if;
    SET i = i + 1;
  END WHILE;
END;;
DELIMITER ;
```
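Once the procedure is defined, it still has to be invoked; a minimal call, assuming the `test` schema used in the script above, looks like this:

```sql
-- Kick off the data generator; this is the long-running step
CALL batch_insert_log();
```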
Start test
My computer's configuration is fairly low-end: Windows 10, a bog-standard i5, and an SSD doing roughly 500 MB/s reads and writes.
Because of the weak hardware, only 3,148,000 rows were prepared for this test, occupying 5 GB of disk (without indexes) and taking 38 minutes to insert. Readers with better machines can insert more rows to test.
```sql
SELECT count(1) FROM `user_operation_log`
```
Return result: 3148000
The three query times are:
- 14060 ms
- 13755 ms
- 13447 ms
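A side note on methodology: this test runs on MySQL 5.7, where the query cache still exists (it is typically disabled by default). If it happens to be enabled on your server, repeated runs of the same statement may be answered from the cache and flatter the timings; one way to rule that out is the SQL_NO_CACHE hint:

```sql
-- Ask MySQL 5.7 not to serve this result from the query cache
SELECT SQL_NO_CACHE count(1) FROM `user_operation_log`;
```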
General paging query
MySQL supports the LIMIT clause for selecting a specified number of rows; Oracle uses ROWNUM for the same purpose.
The MySQL paging query syntax is as follows:
```sql
SELECT * FROM table LIMIT [offset,] rows | rows OFFSET offset
```
- The first parameter specifies the offset of the first row returned.
- The second parameter specifies the maximum number of rows to return (the two equivalent forms are shown below).
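For example, the two statements below are equivalent, both skipping the first 10000 rows and returning the next 10:

```sql
-- "offset, rows" form
SELECT * FROM `user_operation_log` LIMIT 10000, 10;
-- "rows OFFSET offset" form, same result as the statement above
SELECT * FROM `user_operation_log` LIMIT 10 OFFSET 10000;
```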
Let's start testing the query results:
```sql
SELECT * FROM `user_operation_log` LIMIT 10000, 10
```
The query times are as follows:
- 59 ms
- 49 ms
- 50 ms
The speed looks fine, but this is a local database, so of course it is fast.
Test from another angle
Same offset, different numbers of rows
```sql
SELECT * FROM `user_operation_log` LIMIT 10000, 10
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 1000
SELECT * FROM `user_operation_log` LIMIT 10000, 10000
SELECT * FROM `user_operation_log` LIMIT 10000, 100000
SELECT * FROM `user_operation_log` LIMIT 10000, 1000000
```
The query times are as follows:
Rows | First run | Second run | Third run |
---|---|---|---|
10 | 53ms | 52ms | 47ms |
100 | 50ms | 60ms | 55ms |
1000 | 61ms | 74ms | 60ms |
10000 | 164ms | 180ms | 217ms |
100000 | 1609ms | 1741ms | 1764ms |
1000000 | 16219ms | 16889ms | 17081ms |
From the above results we can conclude: the more rows returned, the longer the query takes.
Same number of rows, different offsets
```sql
SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT * FROM `user_operation_log` LIMIT 1000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 100000, 100
SELECT * FROM `user_operation_log` LIMIT 1000000, 100
```
Offset | First run | Second run | Third run |
---|---|---|---|
100 | 36ms | 40ms | 36ms |
1000 | 31ms | 38ms | 32ms |
10000 | 53ms | 48ms | 51ms |
100000 | 622ms | 576ms | 627ms |
1000000 | 4891ms | 5076ms | 4856ms |
From the above results we can conclude: the larger the offset, the longer the query takes.
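To see why, it can help to look at the execution plan (the exact output varies by server and version). For a query like the one below, the plan typically shows a full scan: MySQL has to read and throw away the first 1000000 rows before it can return the 100 you asked for.

```sql
-- Inspect how MySQL executes a large-offset page
EXPLAIN SELECT * FROM `user_operation_log` LIMIT 1000000, 100;
```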
As an aside, compare fetching every column with fetching only two columns at the same offset and row count:
```sql
SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT id, attr1 FROM `user_operation_log` LIMIT 100, 100
```
How to optimize
With the conclusions from all this experimentation in hand, let's start optimizing the two problems identified above: large offsets and large result sets.
Optimizing the large offset
Use a subquery
We can first locate the id at the offset position, then query the data from there:
```sql
SELECT * FROM `user_operation_log` LIMIT 1000000, 10
SELECT id FROM `user_operation_log` LIMIT 1000000, 1
SELECT * FROM `user_operation_log` WHERE id >= (SELECT id FROM `user_operation_log` LIMIT 1000000, 1) LIMIT 10
```
The query results are as follows:
SQL | Time taken |
---|---|
Statement 1 | 4818ms |
Statement 2 (without index) | 4329ms |
Statement 2 (with index) | 199ms |
Statement 3 (without index) | 4319ms |
Statement 3 (with index) | 201ms |
From the above results, we can conclude:

- The first statement takes the longest, and the third is only slightly better than the first.
- Subqueries are faster when they can use an index.
Disadvantage: it only applies when id is monotonically increasing.
When id is not incremental, you can use the following approach instead, but its drawback is that the whole paging query has to go inside the subquery.
Note: some MySQL versions do not support LIMIT inside an IN clause, which is why multiple nested SELECTs are used here:
```sql
SELECT * FROM `user_operation_log` WHERE id IN (SELECT t.id FROM (SELECT id FROM `user_operation_log` LIMIT 1000000, 10) AS t)
```
Use the id-limiting approach
This approach is more demanding: id must be continuously incremental, and you also have to work out the id range yourself, then use BETWEEN. The SQL is as follows:
```sql
SELECT * FROM `user_operation_log` WHERE id between 1000000 AND 1000100 LIMIT 100
SELECT * FROM `user_operation_log` WHERE id >= 1000000 LIMIT 100
```
The query results are as follows:
SQL | Time taken |
---|---|
Statement 1 | 22ms |
Statement 2 | 21ms |
As the results show, this approach is very fast.
Note: the LIMIT here restricts only the number of rows returned; it does not use an offset.
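When pages are only ever consumed in order, a closely related trick (a sketch, not one of the measured statements above) is to have the client remember the largest id of the previous page and start the next page just after it. Like the BETWEEN variant, it assumes id is incremental:

```sql
-- Keyset-style paging: 1000100 is a hypothetical "last id seen" from the previous page
SELECT * FROM `user_operation_log` WHERE id > 1000100 ORDER BY id LIMIT 100;
```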
Optimizing the large amount of data
The amount of data returned also directly affects query speed:
```sql
SELECT * FROM `user_operation_log` LIMIT 1, 1000000
SELECT id FROM `user_operation_log` LIMIT 1, 1000000
SELECT id, user_id, ip, op_data, attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8, attr9, attr10, attr11, attr12 FROM `user_operation_log` LIMIT 1, 1000000
```
The query results are as follows:
SQL | Time taken |
---|---|
Statement 1 | 15676ms |
Statement 2 | 7298ms |
Statement 3 | 15960ms |
The results show that trimming unnecessary columns significantly improves query efficiency.
The first and third queries are almost the same speed, at which point you will surely complain: then why did I bother writing out all those field names?
Note that my MySQL server and client are on the _same machine_, so the query timings differ less than they otherwise would. Readers who can should run the client and MySQL on separate machines for this test.
Isn't SELECT * good enough?
By the way, let me also add why SELECT * is banned. Isn't it simple and mindless? Isn't it great?
There are two main points:
- With `SELECT *`, the database has to resolve more objects, fields, permissions, attributes, and other related metadata. When SQL statements are complex and hard parsing is frequent, this puts a heavy burden on the database.
- It increases network overhead. `*` sometimes drags in useless large text fields such as log or IconMD5 by mistake, and the size of the transferred data grows dramatically. Especially when MySQL and the application are not on the same machine, this overhead is very noticeable (see the sketch below).
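If the application only needs a handful of columns, spelling them out keeps both parsing and transfer costs down. A minimal sketch against the test table above (the chosen columns are just an example of what a caller might actually need):

```sql
-- Fetch only the columns the caller actually uses, instead of SELECT *
SELECT id, user_id, ip, op_data
FROM `user_operation_log`
LIMIT 100;
```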