What is the difference between utf8? General? Ci and UTF8? Unicode? Ci

Keywords: Stored Procedure less MySQL

Is there a performance difference between utf8? General? Ci and UTF8? Unicode? CI?

#1 building

I want to know the performance difference between utf8 ﹣ general ﹣ Ci and UTF8 ﹣ Unicode ﹣ Ci, but I didn't find any benchmarks on the Internet, so I decided to create my own.

I created a very simple table with 500000 rows:

CREATE TABLE test(
  ID INT(11) DEFAULT NULL,
  Description VARCHAR(20) DEFAULT NULL
)
ENGINE = INNODB
CHARACTER SET utf8
COLLATE utf8_general_ci;

I then run this stored procedure to populate it with random data:

CREATE PROCEDURE randomizer()
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE random CHAR(20) ;
  theloop: loop
    SET random = CONV(FLOOR(RAND() * 99999999999999), 20, 36);
    INSERT INTO test VALUES (i+1, random);
    SET i=i+1;
    IF i = 500000 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END

Then, I created the following stored procedures to benchmark simple SELECT, SELECT and LIKE and sort (SELECT and ORDER BY):

CREATE PROCEDURE benchmark_simple_select()
BEGIN
  DECLARE i INT DEFAULT 0;
  theloop: loop
    SELECT *
    FROM test
    WHERE Description = 'test' COLLATE utf8_general_ci;
    SET i = i + 1;
    IF i = 30 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END;

CREATE PROCEDURE benchmark_select_like()
BEGIN
  DECLARE i INT DEFAULT 0;
  theloop: loop
    SELECT *
    FROM test
    WHERE Description LIKE '%test' COLLATE utf8_general_ci;
    SET i = i + 1;
    IF i = 30 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END;

CREATE PROCEDURE benchmark_order_by()
BEGIN
  DECLARE i INT DEFAULT 0;
  theloop: loop
    SELECT *
    FROM test
    WHERE ID > FLOOR(1 + RAND() * (400000 - 1))
    ORDER BY Description COLLATE utf8_general_ci LIMIT 1000;
    SET i = i + 1;
    IF i = 10 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END;

In the above stored procedure, utf8? General? CI classification is used, but of course, during the test, I used utf8? General? Ci and UTF8? Unicode? Ci at the same time.

For each collation, I call each stored procedure five times (five times for utf8 ﹣ general ﹣ Ci, five times for utf8 ﹣ Unicode ﹣ Ci, five times for utf8 ﹣ Unicode ﹣ CI) and calculate the average.

My results are:

benchmark_simple_select()

  • And UTF8? General? CI: 9957 MS
  • And UTF8 ﹣ Unicode ﹣ CI: 10271 MS

In this benchmark, using utf8? Unicode? CI is 3.2% slower than utf8? General? CI.

benchmark_select_like()

  • Use utf8? General? CI: 11441 MS
  • And UTF8 ﹣ Unicode ﹣ CI: 12811ms

In this benchmark, using utf8? Unicode? CI is 12% slower than utf8? General? CI.

benchmark_order_by()

  • And UTF8? General? CI: 11944 MS
  • And UTF8 ﹣ Unicode ﹣ CI: 12887 MS

In this benchmark, using utf8 ﹣ Unicode ﹣ CI is 7.9% slower than utf8 ﹣ general ﹣ CI.

#2 building

This article It is well described.

In short: utf8 ﹣ Unicode ﹣ CI uses the Unicode sorting algorithm defined in the Unicode standard, while utf8 ﹣ general ﹣ CI is a simpler sort order, resulting in "less accurate" sorting results.

#3 building

In short:

If you need a better sort order, use utf8? Unicode? CI (this is the preferred method),

But if you're completely interested in performance, use utf8? General? Ci, but be aware that it's a bit out of date.

There is little difference in performance.

#4 building

Some details (PL)

Just as we Ad locum As you can see (Peter Gulutzan), sorting / comparing Polish letters "Ł" (L-html esc with stroke: &Ł) (lowercase: "ł" - html esc: &ł) is different - we have the following assumptions:

utf8_polish_ci      Ł greater than L and less than M
utf8_unicode_ci     Ł greater than L and less than M
utf8_unicode_520_ci Ł equal to L
utf8_general_ci     Ł greater than Z

In Polish, the letter precedes the letters L and M. This kind of coding is neither good nor bad, which depends on your needs.

#5 building

According to this article, when utf8mb4 ﹣ general ﹣ CI is used instead of utf8mb4 ﹣ Unicode ﹣ Ci, there are considerable performance advantages in MySQL 5.7: https ://www.percona.com/blog/2019/02/27/charset-and-collat​​ion-settings-impact -About MySQL performance/

Posted by delphipgmr on Fri, 13 Dec 2019 04:00:15 -0800