Fault analysis of dual master master replication Err 1677

Keywords: Database MySQL Session SQL

2020-03-29 20:00:27
 

1, Error message

Recently, the project implementation colleagues upgraded the system and changed a field of test.test tab. The SQL statement is as follows:

ALTER TABLE TEST.TEST_TAB_T1 MODIFY BXXX VARCHAR(200);

After the system upgrade of the project, the MySQL master-slave synchronization error is reported as follows:

mysql>show slave status\G
      Master_Log_File: binlog.000233
  Read_Master_Log_Pos: 274415020
       Relay_Log_File: relay-bin.000253
        Relay_Log_Pos: 175535154
Relay_Master_Log_File: binlog.000233
     Slave_IO_Running: Yes
    Slave_SQL_Running: No
    .................: 
           Last_Errno: 1677
           Last_Error: Column 28 of table 'test.test_tab_t1' cannot be converted from type 'varchar(30)(bytes))' to type 'varchar(400(bytes) gbk)'
         Skip_Counter: 0
  Exec_Master_Log_Pos: 175536357
      Relay_Log_Space: 274410464
 

Error information of MySQL alarm log:

2020-03-24T16:53:16.051244Z 11686 [ERROR] Slave SQL for channel '': Column 28 of table 'test.test_tab_t1' cannot be converted from type 'varchar(30(bytes))' to type 'varchar(400(bytes) gbk)', Error_code: 1677
2020-03-24T16:53:16.051269Z 11686 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000233' position 175536357.
 

2, Environmental information

At present, mysql community version 5.7.24 is used in this project. The character set is gbk, binlog is ROW format, and the architecture of maintained + master is adopted.

mysql>select @@version,@@character_set_server,@@binlog_format;
+-----------+------------------------+-----------------+
| @@version | @@character_set_server | @@binlog_format |
+-----------+------------------------+-----------------+
| 5.7.24   | gbk             | ROW             |
+-----------+------------------------+-----------------+
 

3, Diagnostic process

1. According to the MySQL replication breaks with error 1677: column.. of table '...' cannot be converted (DOC ID 2037712.1) document to see the error message err1677, analyze it to see a column (Column 28) in the test.test tab table, and there is a character set related conversion error.

First, compare the character set setting information of (master1-master2):

Then compare (master1-master2) the character set information of Library level, table level and column level:

Character set information of MASTER1 library level, table level and column level:

mysql>select * from information_schema.schemata where schema_name='test';
+--------------+-------------+----------------------------+------------------------+----------+--------------------+
| CATALOG_NAME | SCHEMA_NAME | DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME | SQL_PATH | DEFAULT_ENCRYPTION |
+--------------+-------------+----------------------------+------------------------+----------+--------------------+
| def    | test    | gbk             | gbk_general_ci    |   NULL | NO           |
+--------------+-------------+----------------------------+------------------------+----------+--------------------+

mysql>select table_schema,table_name,table_type,table_collation from information_schema.tables  where table_name='test_tab_t1';
+--------------+------------+------------+-----------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_TYPE | TABLE_COLLATION |
+--------------+------------+------------+-----------------+
| test    | test_tab_t1| BASE TABLE | gbk_chinese_ci |
+--------------+------------+------------+-----------------+

mysql>select table_name,column_name,character_maximum_length,character_octet_length,character_set_name,collation_name 
from information_schema.columns where table_name='test_tab_t1' and table_schema='test' and ordinal_position=29;
+------------+-------------+--------------------------+------------------------+--------------------+----------------+
| TABLE_NAME | COLUMN_NAME | CHARACTER_MAXIMUM_LENGTH | CHARACTER_OCTET_LENGTH | CHARACTER_SET_NAME | COLLATION_NAME |
+------------+-------------+--------------------------+------------------------+--------------------+----------------+
| test_tab_t1| BXXX     |               200 |              400  | gbk            | gbk_chinese_ci |
+------------+-------------+--------------------------+------------------------+--------------------+----------------+
 

Character set information of MASTER2 library level, table level and column level:

mysql>select * from information_schema.schemata where schema_name='test';
+--------------+-------------+----------------------------+------------------------+----------+--------------------+
| CATALOG_NAME | SCHEMA_NAME | DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME | SQL_PATH | DEFAULT_ENCRYPTION |
+--------------+-------------+----------------------------+------------------------+----------+--------------------+
| def       | test     | gbk                | gbk_general_ci      |   NULL | NO           |
+--------------+-------------+----------------------------+------------------------+----------+--------------------+

mysql>select table_schema,table_name,table_type,table_collation from information_schema.tables  where table_name='test_tab_t1';
+--------------+------------+------------+-----------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_TYPE | TABLE_COLLATION |
+--------------+------------+------------+-----------------+
| test      | test_tab_t1| BASE TABLE | gbk_chinese_ci |
+--------------+------------+------------+-----------------+

mysql>select table_name,column_name,character_maximum_length,character_octet_length,character_set_name,collation_name 
from information_schema.columns where table_name='test_tab_t1' and table_schema='test' and ordinal_position=29;
+------------+-------------+--------------------------+------------------------+--------------------+----------------+
| TABLE_NAME | COLUMN_NAME | CHARACTER_MAXIMUM_LENGTH | CHARACTER_OCTET_LENGTH | CHARACTER_SET_NAME | COLLATION_NAME |
+------------+-------------+--------------------------+------------------------+--------------------+----------------+
| test_tab_t1| BXXX   |           200 |         400 | gbk        | gbk_chinese_ci |
+------------+-------------+--------------------------+------------------------+--------------------+----------------+
 

According to the document ((Doc ID 2037712.1), there is a problem with the column corresponding to the system upgrade change table field (alter table test. Test? Tab? T1 modify bxxx varchar (200)).

Error: last_error: column 28 of table 'test. Test tab_t1' cannot be converted from type 'varchar (30 (bytes))' to type 'varchar (400 (bytes) GBK)

That is to say, there is a character set problem in the field bxxx (ordinal position = 29) changed in the test.test tab table during the copying process. The byte size of gbk is 2. Before the change, the bxxx is indeed varchar(15).

mysql> select * from information_schema.character_sets where character_set_name='gbk';
+--------------------+----------------------+------------------------+--------+
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION        | MAXLEN |
+--------------------+----------------------+------------------------+--------+
| gbk           | gbk_chinese_ci     | GBK Simplified Chinese |    2 |
+--------------------+----------------------+------------------------+--------+
 

But compared with the (master1-master2) library level, table level, column level character set information is consistent, there is no problem of character set inconsistency. At this time, I feel that the problem is not so simple, and then I will analyze it.

At the beginning, I thought about whether our implementers didn't operate according to the specification, because it was a dual primary environment, whether the statement was executed in two instances at the same time. During the troubleshooting process, I found that in the client of / etc/my.cnf, character ﹐ set ﹐ client, utf8 on master1 and gbk on master2. So let's look at the information in binlog at the time of failure. The error is reported when binlog.000233 and position is 175536357.

mysqlbinlog --no-defaults --start-position=175536357 --database=test  /opt/mysql/log/binlog/binlog.000233 --verbose
# at 175536357
#200325  0:53:16 server id 1  end_log_pos 175536422 CRC32 0xddd5d37 Anonymous_GTID  last_committed=271185   sequence_number=27186  rbr_only=yes
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
SET @@SESSION.GTID_NEXT= 'ANONYMOUS'/*!*/;
# at 175536422
#200325  0:53:16 server id 1  end_log_pos 175536505 CRC32 0x3799f3b Query   thread_id=14154792  exec_time=0   error_code=0
SET TIMESTAMP=1585068796/*!*/;
SET @@session.pseudo_thread_id=14154792/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1075838976/*!*/;
SET @@session.auto_increment_increment=2, @@session.auto_increment_offset=2/*!*/;
/*!\C gbk *//*!*/;
SET @@session.character_set_client=28,@@session.collation_connection=28,@@session.collation_server=28/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 175536505
#200325  0:53:16 server id 1  end_log_pos 175536685 CRC32 0xe3db6b6b    Table_map: `test`.`test_tab_t1` mapped to number 23434
# at 175536685
#200325  0:53:16 server id 1  end_log_pos 175537481 CRC32 0xf1343123    Update_rows: table id 2334 flags: STMT_END_F
### UPDATE `test`.`test_tab_t1`
### WHERE
###   @1='........'
###   ..........
###   @29='........'
###   ..........
###   @40='...'
### SET
###   @1='........'
###   ..........
###   @29='........'
###   ..........
###   @40='...'
 

You can see that the character set in binlog is indeed GBK

mysql> select * from information_schema.collations where id=28;
+----------------+--------------------+----+------------+-------------+---------+---------------+
| COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | PAD_ATTRIBUTE |
+----------------+--------------------+----+------------+-------------+---------+---------------+
| gbk_chinese_ci | gbk       | 28 | Yes     | Yes      |     1 | PAD SPACE   |
+----------------+--------------------+----+------------+-------------+---------+---------------+
1 row in set (0.01 sec)
 

 

The above is certainly not caused by a character set inconsistency. Therefore, it is invalid to use the officially recommended method to set the parameter slave ﹣ type ﹣ conversions = all ﹣ lossy / all ﹣ non ﹣ lossy to solve this problem, and the method also has the risk of data conversion loss.

In addition, according to https://bugs.mysql.com/bug.php?id=83461, it also describes the problem of err1677 error reported by the master-slave copy with inconsistent character set of a field. At the same time, it also mentions another situation, when binlog_format=ROW, because of the parsing problem when parsing the relay log, it can be considered to use binlog_format=MIXED format, and try to pull up the slave.

After analyzing this, I feel that I don't have a way of thinking. Go on searching for information. Sometimes it's just like this. The mountain is poor and the water is full of doubts, and there is another village with a bright future. Finally, we found a representative reference: https://bugs.mysql.com/bug.php?id=88595, which is the bug on 5.6.37, row based master replication broken by add column or drop column. But the environment of the project is 5.7.24, and the architecture is binlog_format=ROW.

 

By comparing the binlog information posted by me with the above information, it is true that there is an update test.test tab tab T1 table at the time of interruption. Generally speaking, although the versions are inconsistent, they are almost close according to the fault phenomenon.

In the case of algorithm = replace (default):

master1 executed: alter table test.test tab T1 modify bxxx varchar (200); press to record in binlog,

Master 2 synchronizes the statement of changing field information by pulling binlog. In the running process, master 2 also has the situation of update test.test tab T1 table (completed before the changed field statement)

When master 1 synchronizes the statement of update test.test tab T1 table, because the field has been changed and the table structure has changed, and the update statement is still the information of the original structure, the error of err1677 appears. Zhengzhou infertility hospital: http://jbk.39.net/yiyunfengcai/tsyl_zztjyy/3030/

 

Basically speaking, there are two questions. One is that this business is not a configured dual write, only one instance is written. The only thing that can be explained is that the implementation of colleagues is a DDL action on a non write instance (master1). The other one is the version problem. Later, I consulted my colleagues in the project implementation. The database was upgraded from 5.6. Hard to get to? Is there a legacy problem after upgrading? At present, the capacity is limited. Record it first and then analyze it further.

 

4, Solutions

The solution here has no reference for my situation. After the above analysis and according to the business situation of the table, I decided to skip the error first and then check the data consistency.

stop slave; 
-- Indicates skip step error, followed by variable numbers, 1 indicates skip step 1
set global sql_slave_skip_counter = 1; 
start slave;
# PT table checksum can be used to check the data consistency of the test tab T1 table in the master 1-master 2 database test. If there is any inconsistency, use this tool to repair it.
pt-table-checksum --nocheck-replication-filters --databases=test --replicate=test.test_tab_t1 --create-replicate-table --host=xxx --port 3306 -uroot -pxxx
 

 

5, Fault summary

 

Through the above fault analysis and troubleshooting process, I found that this BUG is very similar to OGG-01161 and OGG-01163 that oracle goldengate only copies DML and the source table has DDL changes. How much is it to make test tube baby in Zhengzhou: http://www.changhong120.com/

Many articles on OGG-01161 and OGG-01163 change the source table structure and cause the replicat e process abend. They all mentioned that comparing the table structure of the source and the target, restarting the R process or reporting an error.

Although the length of the field on the source and target side has been modified, and the def file has also been modified, the meta information in the generated trail file will not be updated. The replicat e process operates according to the meta information in the trail file by default. So it's still a mistake. The OVERRIDE option needs to be added so that the new def content can overwrite the meta information in the trail.

So this fault case is also similar to the update statement of master2. Before the DDL change, the table meta information synchronized to master1 is also before the change, and master1 is already the later structure of the table meta information, so this error will appear.

It's true that all things are interlinked, and database is no exception.

Posted by PigsHidePies on Wed, 22 Apr 2020 02:33:56 -0700