Is there a performance difference between utf8? General? Ci and UTF8? Unicode? CI?
#1 building
I want to know the performance difference between utf8 ﹣ general ﹣ Ci and UTF8 ﹣ Unicode ﹣ Ci, but I didn't find any benchmarks on the Internet, so I decided to create my own.
I created a very simple table with 500000 rows:
CREATE TABLE test( ID INT(11) DEFAULT NULL, Description VARCHAR(20) DEFAULT NULL ) ENGINE = INNODB CHARACTER SET utf8 COLLATE utf8_general_ci;
I then run this stored procedure to populate it with random data:
CREATE PROCEDURE randomizer() BEGIN DECLARE i INT DEFAULT 0; DECLARE random CHAR(20) ; theloop: loop SET random = CONV(FLOOR(RAND() * 99999999999999), 20, 36); INSERT INTO test VALUES (i+1, random); SET i=i+1; IF i = 500000 THEN LEAVE theloop; END IF; END LOOP theloop; END
Then, I created the following stored procedures to benchmark simple SELECT, SELECT and LIKE and sort (SELECT and ORDER BY):
CREATE PROCEDURE benchmark_simple_select() BEGIN DECLARE i INT DEFAULT 0; theloop: loop SELECT * FROM test WHERE Description = 'test' COLLATE utf8_general_ci; SET i = i + 1; IF i = 30 THEN LEAVE theloop; END IF; END LOOP theloop; END; CREATE PROCEDURE benchmark_select_like() BEGIN DECLARE i INT DEFAULT 0; theloop: loop SELECT * FROM test WHERE Description LIKE '%test' COLLATE utf8_general_ci; SET i = i + 1; IF i = 30 THEN LEAVE theloop; END IF; END LOOP theloop; END; CREATE PROCEDURE benchmark_order_by() BEGIN DECLARE i INT DEFAULT 0; theloop: loop SELECT * FROM test WHERE ID > FLOOR(1 + RAND() * (400000 - 1)) ORDER BY Description COLLATE utf8_general_ci LIMIT 1000; SET i = i + 1; IF i = 10 THEN LEAVE theloop; END IF; END LOOP theloop; END;
In the above stored procedure, utf8? General? CI classification is used, but of course, during the test, I used utf8? General? Ci and UTF8? Unicode? Ci at the same time.
For each collation, I call each stored procedure five times (five times for utf8 ﹣ general ﹣ Ci, five times for utf8 ﹣ Unicode ﹣ Ci, five times for utf8 ﹣ Unicode ﹣ CI) and calculate the average.
My results are:
benchmark_simple_select()
- And UTF8? General? CI: 9957 MS
- And UTF8 ﹣ Unicode ﹣ CI: 10271 MS
In this benchmark, using utf8? Unicode? CI is 3.2% slower than utf8? General? CI.
benchmark_select_like()
- Use utf8? General? CI: 11441 MS
- And UTF8 ﹣ Unicode ﹣ CI: 12811ms
In this benchmark, using utf8? Unicode? CI is 12% slower than utf8? General? CI.
benchmark_order_by()
- And UTF8? General? CI: 11944 MS
- And UTF8 ﹣ Unicode ﹣ CI: 12887 MS
In this benchmark, using utf8 ﹣ Unicode ﹣ CI is 7.9% slower than utf8 ﹣ general ﹣ CI.
#2 building
This article It is well described.
In short: utf8 ﹣ Unicode ﹣ CI uses the Unicode sorting algorithm defined in the Unicode standard, while utf8 ﹣ general ﹣ CI is a simpler sort order, resulting in "less accurate" sorting results.
#3 building
In short:
If you need a better sort order, use utf8? Unicode? CI (this is the preferred method),
But if you're completely interested in performance, use utf8? General? Ci, but be aware that it's a bit out of date.
There is little difference in performance.
#4 building
Some details (PL)
Just as we Ad locum As you can see (Peter Gulutzan), sorting / comparing Polish letters "Ł" (L-html esc with stroke: &Ł) (lowercase: "ł" - html esc: &ł) is different - we have the following assumptions:
utf8_polish_ci Ł greater than L and less than M utf8_unicode_ci Ł greater than L and less than M utf8_unicode_520_ci Ł equal to L utf8_general_ci Ł greater than Z
In Polish, the letter precedes the letters L and M. This kind of coding is neither good nor bad, which depends on your needs.
#5 building
According to this article, when utf8mb4 ﹣ general ﹣ CI is used instead of utf8mb4 ﹣ Unicode ﹣ Ci, there are considerable performance advantages in MySQL 5.7: https ://www.percona.com/blog/2019/02/27/charset-and-collation-settings-impact -About MySQL performance/