Automatic Construction and Automated Fault Repair of MySQL MGR Cluster Single-master Mode

Keywords: MySQL SQL

/*
the waiting game:
Although life is so difficult, don't give up; don't compromise; don't lose hope.
*/

 

With the upgrade of MySQL MGR version and the maturity of technology, MGR has become the preferred solution for high availability of MySQL after MHA has been pulled down from the altar.
The construction of MGR is not very complicated, but there are a series of manual operation steps. In order to simplify the construction and fault diagnosis of MGR, an automated script is completed here to realize the automated construction, automated fault diagnosis and repair of MGR.

 

Automation of MGR
For the sake of simplicity, the test is carried out in the mode of single machine and multiple instances.
Firstly, three MySQL instances are installed. The port number is 7001,7002,7003, of which 7001 is the write node, and the other two nodes are the read section. 8000 nodes are another test node of the author. Please ignore it.
When the master and slave nodes are specified, the following test demo is used to build the MGR cluster for the mgr_tool.py key

 

MGR Fault Simulation 1

The MGR node fault automatic monitoring and self-healing are realized. The following is the MGR cluster built after completion. At present, the cluster is in a completely normal state.

Subjectively causing binlog loss between master and slave nodes

For data operations lost from the primary node, GTID cannot find the corresponding data, and group replication goes off immediately.

Error at non-write node

Look at errorlog

If it's done manually, it's still GTID that skips the wrong routine, GTID information on master

Try to skip the latest thing ID and reconnect to the group. You can connect to the group normally, and the other node is still in error state.

stop group_replication;
SET GTID_NEXT='6c81c118-e67c-4416-9cb0-2d573d178c1d:13';
BEGIN; COMMIT;
set gtid_next='automatic'; 

The other node is similar and solved in turn.

MGR Fault Simulation 2

Departure from Group from Node

This is a relatively simple case. Just start group_replication again.

 

Automatic Detection and Repair of MGR Faults

For the two cases mentioned above,
1. If something loses its primary node from the slave node, try to skip GTID from the slave node and start replication again.
2. If the slave node does not lose the master node thing, try to restart group replication from the slave node.

The implementation code is as follows

def auto_fix_mgr_error(conn_master_dict,conn_slave_dict):
    group_replication_status = get_group_replication_status(conn_slave_dict)
    if(group_replication_status[0]["MEMBER_STATE"]=="ERROR" or group_replication_status[0]["MEMBER_STATE"] == "OFFLINE"):
        print(conn_slave_dict["host"]+str(conn_slave_dict["port"])+'------>'+group_replication_status[0]["MEMBER_STATE"])
        print("auto fixing......")
        while 1 > 0:
            master_gtid_list = get_gtid(conn_master_dict)
            slave_gtid_list = get_gtid(conn_slave_dict)
            master_executed_gtid_value = int((master_gtid_list[-1]["Executed_Gtid_Set"]).split("-")[-1])
            slave_executed_gtid_value = int(slave_gtid_list[-1]["Executed_Gtid_Set"].split("-")[-1])
            slave_executed_gtid_prefix = slave_gtid_list[-1]["Executed_Gtid_Set"].split(":")[0]
            slave_executed_skiped_gtid = slave_executed_gtid_value + 1
            if (master_executed_gtid_value > slave_executed_gtid_value):
                print("skip gtid and restart group replication,skiped gtid is "
                      + slave_gtid_list[-1]["Executed_Gtid_Set"].split(":")[-1].split("-")[0]
                      + ":"+str(slave_executed_skiped_gtid))
                slave_executed_skiped_gtid = slave_executed_gtid_prefix+":"+str(slave_executed_skiped_gtid)
                skip_gtid_on_slave(conn_slave_dict,slave_executed_skiped_gtid)
                time.sleep(10)
                start_group_replication(conn_slave_dict)
                if(get_group_replication_status(conn_slave_dict)[0]["MEMBER_STATE"]=="ONLINE"):
                    print("mgr cluster fixed,back to normal")
                    break
            else:
                start_group_replication(conn_slave_dict)
                if(get_group_replication_status(conn_slave_dict)[0]["MEMBER_STATE"]=="ONLINE"):
                    print("mgr cluster fixed,back to normal")
                break
    elif (group_replication_status[0]['MEMBER_STATE'] == 'ONLINE'):
        print("mgr cluster is normal,nothing to do")
        check_replication_group_members(conn_slave_dict)

 

Automated Repair of GTID Inconsistence for Fault Type 1

 

Automated Repair of Fault Type 2 Slave Node offline

 

Complete implementation code

This process requires that MySQL instances must satisfy the basic requirements of MGR. If the environment itself cannot satisfy MGR, nothing can be talked about. Therefore, we should be very clear about the basic requirements of MGR environment.

The completed implementation code is as follows. It took one afternoon to write. At present, there are the following shortcomings.
1. When creating a replicate user, no specific slave machine was specified. Currently, the direct specified%: create user repl @'%'identified by repl
2. For the repair of slave, at present, it can not be repaired as a whole, only one repair can be done one by one. In fact, it is a process of judging slave by a circular machine.
3. At present, reset master (master and slave, mainly cleaning up the possible residual GTID) will be used before building, so it is only suitable for the new environment.
4. Currently, only offline and gtid conflict type repair is supported, but other MGR error type repair is not supported.
5. The development environment is single-machine and multi-instance mode testing, which is not fully tested in multi-machine and single-instance mode.
All these will be gradually improved and strengthened.

# -*- coding: utf-8 -*-

import pymysql
import logging
import time
import decimal


def execute_query(conn_dict,sql):
    conn = pymysql.connect(host=conn_dict['host'],
                           port=conn_dict['port'],
                           user=conn_dict['user'],
                           passwd=conn_dict['password'],
                           db=conn_dict['db'])
    cursor = conn.cursor(pymysql.cursors.DictCursor)
    cursor.execute(sql)
    list = cursor.fetchall()
    cursor.close()
    conn.close()
    return list


def execute_noquery(conn_dict,sql):
    conn = pymysql.connect(host=conn_dict['host'],
                           port=conn_dict['port'],
                           user=conn_dict['user'],
                           passwd=conn_dict['password'],
                           db=conn_dict['db'])
    cursor = conn.cursor()
    cursor.execute(sql)
    conn.commit()
    cursor.close()
    conn.close()
    return list


def get_gtid(conn_dict):
    sql = "show master status;"
    list = execute_query(conn_dict,sql)
    return list


def skip_gtid_on_slave(conn_dict,gtid):
    sql_1 = 'stop group_replication;'
    sql_2 = '''set gtid_next='{0}';'''.format(gtid)
    sql_3 = 'begin;'
    sql_4 = 'commit;'
    sql_5 = '''set gtid_next='automatic';'''

    try:
        execute_noquery(conn_dict, sql_1)
        execute_noquery(conn_dict, sql_2)
        execute_noquery(conn_dict, sql_3)
        execute_noquery(conn_dict, sql_4)
        execute_noquery(conn_dict, sql_5)
    except:
        raise


def get_group_replication_status(conn_dict):
    sql = '''select MEMBER_STATE from performance_schema.replication_group_members 
            where (MEMBER_HOST = '{0}' or ifnull(MEMBER_HOST,'') = '')   
            AND (MEMBER_PORT={1} or ifnull(MEMBER_PORT,'') ='') ; '''.format(conn_dict["host"], conn_dict["port"])
    result = execute_query(conn_dict,sql)
    if result:
        return result
    else:
        return None


def check_replication_group_members(conn_dict):
    print('-------------------------------------------------------')
    result = execute_query(conn_dict, " select * from performance_schema.replication_group_members; ")
    if result:
        column = result[0].keys()
        current_row = ''
        for key in column:
            current_row += str(key) + "    "
        print(current_row)

        for row in result:
            current_row = ''
            for key in row.values():
                current_row += str(key) + "    "
            print(current_row)
    print('-------------------------------------------------------')


def auto_fix_mgr_error(conn_master_dict,conn_slave_dict):
    group_replication_status = get_group_replication_status(conn_slave_dict)
    if(group_replication_status[0]["MEMBER_STATE"]=="ERROR" or group_replication_status[0]["MEMBER_STATE"] == "OFFLINE"):
        print(conn_slave_dict["host"]+str(conn_slave_dict["port"])+'------>'+group_replication_status[0]["MEMBER_STATE"])
        print("auto fixing......")
        while 1 > 0:
            master_gtid_list = get_gtid(conn_master_dict)
            slave_gtid_list = get_gtid(conn_slave_dict)
            master_executed_gtid_value = int((master_gtid_list[-1]["Executed_Gtid_Set"]).split("-")[-1])
            slave_executed_gtid_value = int(slave_gtid_list[-1]["Executed_Gtid_Set"].split("-")[-1])
            slave_executed_gtid_prefix = slave_gtid_list[-1]["Executed_Gtid_Set"].split(":")[0]
            slave_executed_skiped_gtid = slave_executed_gtid_value + 1
            if (master_executed_gtid_value > slave_executed_gtid_value):
                print("skip gtid and restart group replication,skiped gtid is "
                      + slave_gtid_list[-1]["Executed_Gtid_Set"].split(":")[-1].split("-")[0]
                      + ":"+str(slave_executed_skiped_gtid))
                slave_executed_skiped_gtid = slave_executed_gtid_prefix+":"+str(slave_executed_skiped_gtid)
                skip_gtid_on_slave(conn_slave_dict,slave_executed_skiped_gtid)
                time.sleep(10)
                start_group_replication(conn_slave_dict)
                if(get_group_replication_status(conn_slave_dict)[0]["MEMBER_STATE"]=="ONLINE"):
                    print("mgr cluster fixed,back to normal")
                    break
            else:
                start_group_replication(conn_slave_dict)
                if(get_group_replication_status(conn_slave_dict)[0]["MEMBER_STATE"]=="ONLINE"):
                    print("mgr cluster fixed,back to normal")
                break
    elif (group_replication_status[0]['MEMBER_STATE'] == 'ONLINE'):
        print("mgr cluster is normal,nothing to do")
        check_replication_group_members(conn_slave_dict)

'''
reset master
'''
def reset_master(conn_dict):
    try:
        execute_noquery(conn_dict, "reset master;")
    except:
        raise


def install_group_replication_plugin(conn_dict):
    get_plugin_sql = "SELECT name,dl FROM mysql.plugin WHERE name = 'group_replication';"
    install_plugin_sql = '''install plugin group_replication soname 'group_replication.so'; '''
    try:
        result = execute_query(conn_dict, get_plugin_sql)
        if not result:
            execute_noquery(conn_dict, install_plugin_sql)
    except:
        raise


def create_mgr_repl_user(conn_master_dict,user,password):
    try:
        reset_master(conn_master_dict)
        sql_exists_user = '''select user from mysql.user where user = '{0}'; '''.format(user)
        user_list = execute_query(conn_master_dict,sql_exists_user)
        if not user_list:
            create_user_sql = '''create user {0}@'%' identified by '{1}'; '''.format(user,password)
            grant_privilege_sql = '''grant replication slave on *.* to {0}@'%';'''.format(user)
            execute_noquery(conn_master_dict,create_user_sql)
            execute_noquery(conn_master_dict, grant_privilege_sql)
            execute_noquery(conn_master_dict, "flush privileges;")
    except:
        raise


def set_super_read_only_off(conn_dict):
    super_read_only_off = '''set global super_read_only = 0;'''
    execute_noquery(conn_dict, super_read_only_off)


def open_group_replication_bootstrap_group(conn_dict):
    sql = '''select variable_name,variable_value from performance_schema.global_variables where variable_name = 'group_replication_bootstrap_group';'''
    result = execute_query(conn_dict, sql)
    open_bootstrap_group_sql = '''set @@global.group_replication_bootstrap_group=on;'''
    if result and result[0]['variable_value']=="OFF":
        execute_noquery(conn_dict, open_bootstrap_group_sql)


def close_group_replication_bootstrap_group(conn_dict):
    sql = '''select variable_name,variable_value from performance_schema.global_variables where variable_name = 'group_replication_bootstrap_group';'''
    result = execute_query(conn_dict, sql)
    close_bootstrap_group_sql = '''set @@global.group_replication_bootstrap_group=off;'''
    if result and result[0]['variable_value'] == "ON":
        execute_noquery(conn_dict, close_bootstrap_group_sql)


def start_group_replication(conn_dict):
    start_group_replication = '''start group_replication;'''
    group_replication_status = get_group_replication_status(conn_dict)
    if not (group_replication_status[0]['MEMBER_STATE'] == 'ONLINE'):
        execute_noquery(conn_dict, start_group_replication)


def connect_to_group(conn_dict,repl_user,repl_password):
    connect_to_group_sql = '''change master to
                                    master_user='{0}',
                                    master_password='{1}'
                                    for channel 'group_replication_recovery'; '''.format(repl_user,repl_password)
    try:
        execute_noquery(conn_dict, connect_to_group_sql)
    except:
        raise


def start_mgr_on_master(conn_master_dict,repl_user,repl_password):
    try:
        set_super_read_only_off(conn_master_dict)
        reset_master(conn_master_dict)
        create_mgr_repl_user(conn_master_dict,repl_user,repl_password)
        connect_to_group(conn_master_dict,repl_user,repl_password)

        open_group_replication_bootstrap_group(conn_master_dict)
        start_group_replication(conn_master_dict)
        close_group_replication_bootstrap_group(conn_master_dict)

        group_replication_status = get_group_replication_status(conn_master_dict)
        if (group_replication_status[0]['MEMBER_STATE'] == 'ONLINE'):
            print("master added in mgr and run successfully")
            return True
    except:
        raise
        print("############start master mgr error################")
        exit(1)


def start_mgr_on_slave(conn_slave_dict,repl_user,repl_password):
    try:
        set_super_read_only_off(conn_slave_dict)
        reset_master(conn_slave_dict)
        connect_to_group(conn_slave_dict,repl_user,repl_password)
        start_group_replication(conn_slave_dict)
        # wait for 10
        time.sleep(10)
        # then check mgr status
        group_replication_status = get_group_replication_status(conn_slave_dict)
        if (group_replication_status[0]['MEMBER_STATE'] == 'ONLINE'):
            print("slave added in mgr and run successfully")
        if (group_replication_status[0]['MEMBER_STATE'] == 'RECOVERING'):
            print("slave is recovering")
    except:
        print("############start slave mgr error################")
        exit(1)


def auto_mgr(conn_master,conn_slave_1,conn_slave_2,repl_user,repl_password):
    install_group_replication_plugin(conn_master)
    master_replication_status = get_group_replication_status(conn_master)

    if not (master_replication_status[0]['MEMBER_STATE'] == 'ONLINE'):
        start_mgr_on_master(conn_master,repl_user,repl_password)

    slave1_replication_status = get_group_replication_status(conn_slave_1)
    if not (slave1_replication_status[0]['MEMBER_STATE'] == 'ONLINE'):
        install_group_replication_plugin(conn_slave_1)
        start_mgr_on_slave(conn_slave_1, repl_user, repl_user)

    slave2_replication_status = get_group_replication_status(conn_slave_2)
    if not (slave2_replication_status[0]['MEMBER_STATE'] == 'ONLINE'):
        install_group_replication_plugin(conn_slave_2)
        start_mgr_on_slave(conn_slave_2, repl_user, repl_user)

    check_replication_group_members(conn_master)

if __name__ == '__main__':
    conn_master  = {'host': '127.0.0.1', 'port': 7001, 'user': 'root', 'password': 'root', 'db': 'mysql', 'charset': 'utf8mb4'}
    conn_slave_1 = {'host': '127.0.0.1', 'port': 7002, 'user': 'root', 'password': 'root', 'db': 'mysql', 'charset': 'utf8mb4'}
    conn_slave_2 = {'host': '127.0.0.1', 'port': 7003, 'user': 'root', 'password': 'root', 'db': 'mysql', 'charset': 'utf8mb4'}
    repl_user = "repl"
    repl_password = "repl"
    #auto_mgr(conn_master,conn_slave_1,conn_slave_2,repl_user,repl_password)

    auto_fix_mgr_error(conn_master,conn_slave_1)
    check_replication_group_members(conn_master)

Posted by jeffwhite on Sat, 24 Aug 2019 03:46:46 -0700