Split brain
In a high availability (HA) system, when the "heartbeat line" connecting two nodes is disconnected, the HA system, which normally acts as a single coordinated whole, splits into two independent nodes. Because they have lost contact with each other, each assumes that the other has failed. The HA software on the two nodes then behaves like a person with a "split brain": both nodes compete for the "shared resources" and the "application services", with serious consequences. Either the shared resources are divided up and the "services" on both sides fail to start, or the "services" on both sides start and read and write the "shared storage" at the same time, corrupting data (a common example is errors in the database's online logs, which are written in rotation).
Causes of split brain
Generally speaking, split brain occurs for the following reasons:
- The heartbeat link between the pair of highly available servers fails, so the nodes cannot communicate normally
  - The heartbeat cable is faulty (broken or aged)
  - The network card or its driver has failed, or there is an IP configuration or conflict problem (when the network cards are directly connected)
  - A device connecting the heartbeat cables has failed (network card or switch)
  - The arbitration machine has a problem (when an arbitration scheme is used)
- The iptables firewall is enabled on the highly available servers and blocks the heartbeat messages
- The heartbeat network card address or other settings on the highly available servers are configured incorrectly, so heartbeats cannot be sent
- Other reasons such as improper service configuration, for example different heartbeat modes on the two nodes, heartbeat broadcast conflicts, or software bugs
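The firewall cause above can be checked with a small script. keepalived sends its heartbeat as VRRP advertisements (IP protocol 112), so a rule that drops that protocol silences the heartbeat. The following is only a rough sketch: the function name `vrrp_explicitly_blocked` is made up for illustration, and it only spots rules that name VRRP explicitly. It will not notice a default DROP policy or a broader rule that happens to cover VRRP.

```shell
#!/bin/sh
# Hypothetical helper: reads `iptables -S` output on stdin and succeeds
# (exit status 0) if any DROP or REJECT rule explicitly matches VRRP,
# written either as the protocol name "vrrp" or as protocol number 112.
vrrp_explicitly_blocked() {
    # First grep keeps only rules that match the VRRP protocol;
    # second grep checks whether any of those rules drops or rejects.
    grep -E -- '-p (vrrp|112)( |$)' | grep -Eq -- '-j (DROP|REJECT)( |$)'
}
```

On a node it could be used as `iptables -S | vrrp_explicitly_blocked && echo "VRRP is blocked by an explicit rule"` (run as root). A clean result here does not prove that heartbeats get through; it only rules out the most obvious mistake.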
Common solutions for split brain
In a production environment, split brain can be prevented in the following ways:
- Connect a serial cable and an Ethernet cable at the same time, so that two heartbeat lines are in use. If one line breaks, the other still carries the heartbeat messages
- When a split brain is detected, forcibly shut down one of the nodes (this requires special equipment such as STONITH or fence devices). In effect, when the standby node stops receiving the heartbeat, it sends a shutdown command over a separate line to power off the primary node
- Monitor split brain and raise alarms (by e-mail, SMS, or an on-duty operator), so that someone can step in and arbitrate as soon as the problem occurs and limit the loss. For example, Baidu's alarm SMS works in both directions (uplink and downlink): the alarm message is sent to the administrator's mobile phone, and the administrator replies with a number or a simple string, which is returned to the server. The server then handles the fault automatically according to that instruction, which shortens the time needed to resolve it
Of course, when implementing a high availability scheme, decide whether such losses can be tolerated according to the actual business needs. For routine website business, this loss is tolerable.
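For the monitoring approach, the condition to alarm on can be stated very simply: a split brain shows up as both nodes holding the virtual IP at the same time. Here is a minimal sketch of that decision, assuming each node reports `1` when it holds the VIP and `0` when it does not. The function name `classify_vip_state` and the result strings are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: classify the cluster state from two flags.
#   $1 = 1 if the master holds the VIP, otherwise 0
#   $2 = 1 if the backup holds the VIP, otherwise 0
classify_vip_state() {
    case "$1$2" in
        11)    echo "split-brain" ;;   # both nodes claim the VIP
        10|01) echo "ok" ;;            # exactly one node holds the VIP
        00)    echo "no-owner" ;;      # the VIP is not served at all
        *)     echo "unknown"; return 1 ;;
    esac
}
```

For example, `classify_vip_state 1 1` prints `split-brain`. A monitoring system would alarm on `split-brain` and probably on `no-owner` as well.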
Monitoring split brain
Environment
Host name | IP | Service |
---|---|---|
zabbix | 192.168.172.150 | zabbix |
master | 192.168.172.143 | Master keepalived |
slave | 192.168.172.142 | Backup keepalived, zabbix_agentd |
zabbix server installation configuration
Installing zabbix_agentd on the client

    [root@slave src]# ls
    debug  kernels  zabbix-5.4.4.tar.gz
    [root@slave src]# tar xf zabbix-5.4.4.tar.gz

    # Create the zabbix user
    [root@slave src]# cd zabbix-5.4.4
    [root@slave zabbix-5.4.4]# useradd -r -M -s /sbin/nologin zabbix

    # Install the dependent packages
    [root@slave zabbix-5.4.4]# yum -y install vim wget gcc gcc-c++ make pcre-devel openssl openssl-devel

    # Compile and install
    [root@slave zabbix-5.4.4]# ./configure --enable-agent
    [root@slave zabbix-5.4.4]# make install

    # Edit the configuration file
    [root@slave ~]# vim /usr/local/etc/zabbix_agentd.conf
    ...
    Server=192.168.172.150
    ServerActive=192.168.172.150
    Hostname=111222
    ...

    # Start the agent and confirm that it listens on port 10050
    [root@slave ~]# zabbix_agentd
    [root@slave ~]# ss -antl
    State    Recv-Q   Send-Q   Local Address:Port    Peer Address:Port
    LISTEN   0        128      0.0.0.0:80            0.0.0.0:*
    LISTEN   0        128      0.0.0.0:22            0.0.0.0:*
    LISTEN   0        128      0.0.0.0:10050         0.0.0.0:*
    LISTEN   0        128      [::]:22               [::]:*
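Port 10050 is the default listening port of the Zabbix agent, so checking for it is a quick way to confirm that the agent started. The check can be scripted instead of read by eye. This is only a sketch: the function name `port_is_listening` is made up, and it assumes the column layout of `ss -antl` shown in this article (state in the first column, local address in the fourth).

```shell
#!/bin/sh
# Hypothetical helper: reads `ss -antl` output on stdin and succeeds
# (exit status 0) if some socket is in LISTEN state on the given port.
#   $1 = port number, e.g. 10050
port_is_listening() {
    awk -v port="$1" '
        # Column 1 is the state, column 4 is "local address:port".
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}
```

A possible use is `ss -antl | port_is_listening 10050 || echo "zabbix_agentd is not listening"`.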
keepalived
Refer to the previous article for configuration
View the VIP

    [root@master ~]# ip a
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
        link/ether 00:0c:29:db:3c:14 brd ff:ff:ff:ff:ff:ff
        inet 192.168.172.143/24 brd 192.168.172.255 scope global noprefixroute ens160
           valid_lft forever preferred_lft forever
        inet 192.168.172.250/32 scope global ens160
           valid_lft forever preferred_lft forever

    [root@slave ~]# ip a
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
        link/ether 00:0c:29:f9:d6:50 brd ff:ff:ff:ff:ff:ff
        inet 192.168.172.142/24 brd 192.168.172.255 scope global noprefixroute ens160
           valid_lft forever preferred_lft forever
Write the split-brain monitoring script

    [root@slave scripts]# cat check_keepalived.sh
    #!/bin/bash
    # Print 1 if the VIP (192.168.172.250) is bound on this backup node, otherwise 0
    if [ `ip a show ens160 | grep 192.168.172.250 | wc -l` -ne 0 ]
    then
        echo "1"
    else
        echo "0"
    fi
    [root@slave scripts]# chmod +x check_keepalived.sh
    [root@slave scripts]# chown -R zabbix:zabbix check_keepalived.sh

    # Register the monitoring script in the agent configuration file
    [root@slave ~]# vim /usr/local/etc/zabbix_agentd.conf
    ...
    UnsafeUserParameters=1
    UserParameter=check_keepalived.sh,/scripts/check_keepalived.sh

    # Restart the agent
    [root@slave ~]# pkill zabbix_agentd
    [root@slave ~]# zabbix_agentd
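One weakness of grepping `ip a` for an address is that a plain `grep` also matches longer addresses that merely contain the search string. A slightly stricter variant might look like the sketch below. The function name `has_vip` is made up for illustration. The interface `ens160` and the VIP `192.168.172.250` are taken from the `ip a` output in this article, and the one-line-per-address layout of `ip -o -4 addr show` is assumed.

```shell
#!/bin/sh
# Hypothetical helper: reads `ip -o -4 addr show` output on stdin and succeeds
# (exit status 0) only if exactly the given IPv4 address is bound.
#   $1 = address to look for, without the prefix length
has_vip() {
    # Fixed-string match on " inet <address>/": the leading " inet " and the
    # trailing "/" anchor both ends, so 192.168.172.25 does not match
    # 192.168.172.250, and dots are not treated as regex wildcards.
    grep -qF -- " inet $1/"
}

# Example check body for the agent (prints 1 if this node holds the VIP):
# if ip -o -4 addr show dev ens160 | has_vip 192.168.172.250; then echo 1; else echo 0; fi
```

Once the agent has been restarted, the item can be queried from the Zabbix server with `zabbix_get -s 192.168.172.142 -k check_keepalived.sh` (if `zabbix_get` is installed there) to confirm that it returns `0` or `1` before building the trigger.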
**zabbix web**

Create host

![Please add a picture description](https://img-blog.csdnimg.cn/41ed408794a24c88afcc1cc32cabafad.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)

![Please add a picture description](https://img-blog.csdnimg.cn/4cbfaf81385f4519821b829e2373daf9.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)

Create monitoring item

![Please add a picture description](https://img-blog.csdnimg.cn/df8fff1950c74d0d912388404486f17d.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)

![Please add a picture description](https://img-blog.csdnimg.cn/caa2780783a34e5b9319f3d6405a4648.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)

![Please add a picture description](https://img-blog.csdnimg.cn/a06ac291096e4a84aee7ccca41a6827e.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)

![Please add a picture description](https://img-blog.csdnimg.cn/71aabf543c6c4d64bc80afee3ab7b363.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)

![Please add a picture description](https://img-blog.csdnimg.cn/37958aaedd444da08af833fc32e690bd.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)

Create trigger

![Please add a picture description](https://img-blog.csdnimg.cn/2c70b9ed36ca47c9bb53619216a39d5b.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)

![Please add a picture description](https://img-blog.csdnimg.cn/16cb76a554994ca287afc9cf00a4b19d.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)

Test

```shell
# Simulate a failure of the master server by stopping keepalived
[root@master ~]# systemctl stop keepalived

# The master no longer holds the VIP
[root@master ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:db:3c:14 brd ff:ff:ff:ff:ff:ff
    inet 192.168.172.143/24 brd 192.168.172.255 scope global noprefixroute ens160
       valid_lft forever preferred_lft forever

# The slave has taken over the VIP (192.168.172.250)
[root@slave ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:f9:d6:50 brd ff:ff:ff:ff:ff:ff
    inet 192.168.172.142/24 brd 192.168.172.255 scope global noprefixroute ens160
       valid_lft forever preferred_lft forever
    inet 192.168.172.250/32 scope global ens160
       valid_lft forever preferred_lft forever