mysql Character Set and Proofreading Rules - Case Sensitive

Keywords: MySQL Database encoding ascii

    In relational databases, oracle/sql_server/db2 has case-sensitive data, but mysql is a bit strange. It is case-insensitive to data. See below for details.

    MySQL default rules for database name, table name, column name, alias case in Linux are as follows:

        1. Database names and table names are case-sensitive.

        2. Aliases of tables are case-sensitive.

        3. Column names and aliases are case-insensitive in all cases.

        4. Field content (i.e. data) is case-insensitive by default.

        5. Variable names (functions and stored procedures) are also case-sensitive.

    So we can not help but want to modify, and what is the control of these? It is the character set and character set proofreading rules that limit.


What are the character sets and proofreading rules?

    Character set is our character parsing encoding table. At the bottom of the computer, any character is just code that can't be parsed directly. These encoding tables are needed to parse exactly what words are. The most basic is ASCII code table, but this table character is too few to represent English and some everyday punctuation symbols. But it is obviously not enough to represent so many words in the world. In order to parse all kinds of characters, such as gb2312 in simplified Chinese, big5 in traditional Chinese, utf8 in the most famous Universal Code, and utf8mb4 in support of emoji expression.

    We often say that access scrambling is caused by asymmetric character encoding. It may be between you and the server, between the server and the database, or between the internal code without conversion, and so on.

    Character set is universal and exists in all kinds of environments in the computer world. Database is only one of them. The proofreading rules are for mysql, and the rules are fixed. Some people may feel confused. With character set, why proofreading rules? It's easy to understand that there are differences between characters, but what can we rely on to reflect the differences (such as sorting and grouping operations)? That's that. For example, if you don't distinguish between case and case, A and a have the same meaning, but if you have a strict proofreading rule, you can distinguish between case and case, A and a are different.


character set

    We can use the following commands to see which character sets and proofreading rules are supported.

#See which character sets are supported, excerpts
mysql> show character set;
+----------+---------------------------------+---------------------+--------+
| Charset  | Description                     | Default collation   | Maxlen |
+----------+---------------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese        | big5_chinese_ci     |      2 |
| latin1   | cp1252 West European            | latin1_swedish_ci   |      1 |
| ascii    | US ASCII                        | ascii_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese       | gb2312_chinese_ci   |      2 |
| gbk      | GBK Simplified Chinese          | gbk_chinese_ci      |      2 |
| utf8     | UTF-8 Unicode                   | utf8_general_ci     |      3 |
| utf8mb4  | UTF-8 Unicode                   | utf8mb4_general_ci  |      4 |
| utf16    | UTF-16 Unicode                  | utf16_general_ci    |      4 |
| utf32    | UTF-32 Unicode                  | utf32_general_ci    |      4 |
| binary   | Binary pseudo charset           | binary              |      1 |
| gb18030  | China National Standard GB18030 | gb18030_chinese_ci  |      4 |
+----------+---------------------------------+---------------------+--------+
41 rows in set (0.01 sec)

    Generally speaking, mysql still supports more character sets by default, but in most cases we still use utf8 more. The reason is that in the last column of Maxlen, it represents the maximum number of bytes occupied by a character after using this character set. Big5 (Chinese Traditional Style) and GB2312 (Chinese Simplified Style) occupy less than 2 bytes, but their versatility is not good. Although utf16 is very powerful, it occupies a large number of bytes. The byte count is slightly more, occupying 4 bytes, and not necessarily so much. In compromise, utf8 is better, one character occupies 3 bytes.

    Changing the default character set of mysql can be added to the configuration file my.cnf

#To add the subitem [mysqld]
[mysqld]
#Global default character set type, set as required
character-set-server = utf8
#Restart after the change, go to mysql and have a look. (If you don't restart, change slowly one by one)
mysql> show variables like 'character%';
+--------------------------+---------------------------------------------------------------+
| Variable_name            | Value                                                         |
+--------------------------+---------------------------------------------------------------+
| character_set_client     | utf8                                                          |
| character_set_connection | utf8                                                          |
| character_set_database   | utf8                                                          |
| character_set_filesystem | binary                                                        |
| character_set_results    | utf8                                                          |
| character_set_server     | utf8                                                          |
| character_set_system     | utf8                                                          |
| character_sets_dir       | /usr/local/mysql-5.7.18-linux-glibc2.5-x86_64/share/charsets/ |
+--------------------------+---------------------------------------------------------------+
8 rows in set (0.00 sec)


Character proofreading rules

    Having said the character set, let's look at the rules of character proofreading.

#Show what proofreading rules are in utf8 character set, excerpts
mysql> SHOW COLLATION like 'utf8\_%';
+--------------------------+---------+-----+---------+----------+---------+
| Collation                | Charset | Id  | Default | Compiled | Sortlen |
+--------------------------+---------+-----+---------+----------+---------+
| utf8_general_ci          | utf8    |  33 | Yes     | Yes      |       1 |
| utf8_bin                 | utf8    |  83 |         | Yes      |       1 |
| utf8_unicode_ci          | utf8    | 192 |         | Yes      |       8 |
| utf8_icelandic_ci        | utf8    | 193 |         | Yes      |       8 |
| utf8_unicode_520_ci      | utf8    | 214 |         | Yes      |       8 |
| utf8_vietnamese_ci       | utf8    | 215 |         | Yes      |       8 |
| utf8_general_mysql500_ci | utf8    | 223 |         | Yes      |       1 |
+--------------------------+---------+-----+---------+----------+---------+
27 rows in set (0.00 sec)

    There are many proofreading rules, but most of them are not used. As you can see, many other rules are sortlen 8, which is more common than the first two. It will cost more resources.

    Each character set has a default proofreading rule. For example, the default proofreading rule of utf8 is utf8_general_ci. There is a naming convention for the proofreading rule: they start with the relevant character set name, usually including a language name, and end with _ci (case insensitive), _cs (case sensitive) or _bin (binary/case sensitive).
    So mysql's utf8 character set defaults to being insensitive to the case of data.

    Changing the default character proofreading rules for mysql can also be added to the configuration file my.cnf

#To add the subitem [mysqld]
[mysqld]
#Global default character proofreading rules, set as required
collation_server = utf8_bin
#Restart after the change, go to mysql and have a look. (If you don't restart, change slowly one by one)
mysql> show variables like 'collation_%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_bin        |
| collation_server     | utf8_bin        |
+----------------------+-----------------+
3 rows in set (0.00 sec)
#Of course, you can only change the library level or link level according to the actual situation, which is allowed to be different, but you need to know that the risk is there.
collation_database = utf8_bin
#In particular, it should be noted that the Connection-level character proofreading rules are determined by the client and are not valid by changing the configuration file.
#So if you want to change it, you can only execute it by command.
mysql> set collation_connection = utf8_bin


Be careful

    It should be noted that these changes are only valid for subsequently created database tables and users. Existing databases are not affected. If you want to change the definition of existing tables, you must change the table structure or rebuild the database.

    The way to change is alter. If you create a method by default, you don't need to do anything. If you don't default, you have to specify it artificially.

#Changing Character Sets and Character Correction Rules of Tables
ALTER TABLE Table name MODIFY COLUMN Field name varchar(50) CHARACTER SET utf8 COLLATE utf8_bin;

    It should be noted that changing character proofreading rules for tables in existing databases does not affect the use of tables, as long as there is no conflict in the data of tables, but some queries and sorting/grouping operations will result in gaps, but if the character set is changed in disorder, it will directly cause data scrambling, so we should be very cautious, sometimes it is better to rebuild the database.

    If only the table name of the database is case sensitive, my.cnf has a special configuration option. When the direct configuration is completed, the restart ignores the case rule of the table name, lower_case_table_names.

#To add the subitem [mysqld]
[mysqld]
#Case-sensitive option for table names, case-sensitive for 0 and case-insensitive for 1
lower_case_table_names = 1
#Restart takes effect


Posted by Rai_de on Sat, 29 Jun 2019 16:27:40 -0700