Monitoring split-brain

Keywords: network


Split-brain
In a high-availability (HA) system, when the heartbeat link connecting the two nodes is lost, a system that used to act as one coordinated whole splits into two independent nodes. Having lost contact, each node assumes the other one has failed. The HA software on the two nodes then behaves like a "split brain": both sides compete for the shared resources and the application services, with serious consequences. Either the shared resources are carved up and the service comes up on neither side, or the service comes up on both sides and the shared storage is read and written by both at the same time, corrupting data (a commonly seen symptom is errors in the database's online logs).
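
As a rough illustration of the condition described above, the following sketch classifies a two-node pair from whether each node currently holds the shared resource (for example, a virtual IP). The function name `classify_pair` and its 0/1 inputs are assumptions made for this example and are not part of any HA software.

```shell
#!/bin/bash
# Hypothetical helper: classify the state of a two-node HA pair.
#   $1 = 1 if the primary holds the shared resource (e.g. the VIP), else 0
#   $2 = 1 if the standby holds it, else 0
classify_pair() {
    local primary="$1" standby="$2"
    case "${primary}${standby}" in
        10) echo "normal" ;;       # only the primary owns the resource
        01) echo "failover" ;;     # only the standby owns the resource
        11) echo "split-brain" ;;  # both own it at the same time
        00) echo "down" ;;         # neither node owns it
        *)  echo "unknown"; return 1 ;;
    esac
}
```

For instance, `classify_pair 1 1` prints `split-brain`, which is the case where both sides have started the service at once.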

Causes of split-brain

Generally speaking, split-brain occurs for the following reasons:

  • The heartbeat link between the pair of HA servers fails, so the two nodes can no longer communicate normally
  • The heartbeat cable itself is faulty (broken or aged)
  • The network card or its driver fails, or there are IP configuration or IP conflict problems (when the network cards are directly connected)
  • A device on the heartbeat path fails (network card or switch)
  • The arbitration machine has a problem (when an arbitration scheme is used)
  • The iptables firewall is enabled on an HA server and blocks the heartbeat messages
  • The heartbeat network card address or related settings on an HA server are configured incorrectly, so heartbeats cannot be sent
  • Other causes such as improper service configuration: mismatched heartbeat modes, heartbeat broadcast conflicts, software bugs, and so on
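
For the firewall cause in particular, one quick check is whether any rule explicitly lets the heartbeat through. keepalived, which is used later in this article, sends its heartbeats as VRRP advertisements (IP protocol 112). The sketch below reads `iptables-save` style rules from standard input and reports whether one of them accepts VRRP. The function name `has_vrrp_accept` is made up for this example, and a missing rule does not prove that traffic is blocked, since the chain's default policy may still accept it.

```shell
#!/bin/bash
# Hypothetical helper: read iptables-save style rules on stdin and
# exit 0 if some rule explicitly accepts VRRP (IP protocol 112).
has_vrrp_accept() {
    # Matches "-p vrrp" or "-p 112", optionally followed by other
    # match options, ending in the ACCEPT target.
    grep -Eq -- '-p (vrrp|112)( .*)? -j ACCEPT'
}
```

A possible use on an HA node is `iptables-save | has_vrrp_accept || echo "no explicit VRRP accept rule"`.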

Common solutions for split-brain

In a real production environment, split-brain can be prevented or mitigated in the following ways:

  • Use two heartbeat links at the same time, for example a serial cable and an Ethernet cable. If one link breaks, the other still works and heartbeat messages can still be delivered
  • When split-brain is detected, forcibly shut down one of the nodes (this requires special equipment such as STONITH or fencing devices). In effect, when the standby node stops receiving heartbeats, it sends a shutdown command over a separate line to power off the primary node
  • Monitor for split-brain and raise alarms (e-mail, SMS, or an on-duty operator) so that someone can step in and arbitrate as soon as the problem occurs, limiting the loss. For example, Baidu's monitoring SMS alerts work in both directions (uplink and downlink): the alarm is sent to the administrator's phone, and the administrator can reply with a number or a short string that is returned to the server. The server then handles the fault automatically according to that instruction, which shortens the time to resolution

    Of course, when implementing an HA scheme, you need to decide from the actual business requirements whether such losses can be tolerated. For the routine business of an ordinary website, this loss is tolerable

Monitoring split-brain

Environment

host name   ip                service
zabbix      192.168.172.150   zabbix
master      192.168.172.143   master keepalived
slave       192.168.172.142   backup keepalived, zabbix_agent

zabbix server installation configuration
Install zabbix_agentd on the client

[root@slave src]# ls
debug  kernels zabbix-5.4.4.tar.gz
[root@slave src]# tar xf zabbix-5.4.4.tar.gz 

//Create user
[root@slave src]# cd zabbix-5.4.4
[root@slave zabbix-5.4.4]# useradd -r -M -s /sbin/nologin zabbix

//Install dependent packages
[root@slave zabbix-5.4.4]# yum -y install vim wget gcc gcc-c++ make pcre-devel openssl openssl-devel

//Compile and install
[root@slave zabbix-5.4.4]# ./configure --enable-agent
[root@slave zabbix-5.4.4]# make install

//Edit the configuration file
[root@slave ~]# vim /usr/local/etc/zabbix_agentd.conf
...
Server=192.168.172.150			
ServerActive=192.168.172.150
Hostname=111222
...

//Start the agent and confirm that it is listening on port 10050
[root@slave ~]# zabbix_agentd
[root@slave ~]# ss -antl
State  Recv-Q Send-Q Local Address:Port   Peer Address:Port 
LISTEN 0      128          0.0.0.0:80          0.0.0.0:*    
LISTEN 0      128          0.0.0.0:22          0.0.0.0:*    
LISTEN 0      128          0.0.0.0:10050       0.0.0.0:*    
LISTEN 0      128             [::]:22             [::]:*    

keepalived
Refer to the previous article for the keepalived configuration
Check the VIP

[root@master ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:db:3c:14 brd ff:ff:ff:ff:ff:ff
    inet 192.168.172.143/24 brd 192.168.172.255 scope global noprefixroute ens160
       valid_lft forever preferred_lft forever
    inet 192.168.172.250/32 scope global ens160
       valid_lft forever preferred_lft forever


[root@slave ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:f9:d6:50 brd ff:ff:ff:ff:ff:ff
    inet 192.168.172.142/24 brd 192.168.172.255 scope global noprefixroute ens160
       valid_lft forever preferred_lft forever

Write the split-brain monitoring script

[root@slave scripts]# cat check_keepalived.sh 
#!/bin/bash
# Print 1 if the VIP (192.168.172.250) is bound to ens160 on this backup node, otherwise 0
if [ "$(ip a show ens160 | grep -cF 'inet 192.168.172.250/')" -ne 0 ]
then
        echo "1"
else
        echo "0"
fi
[root@slave scripts]# chmod +x check_keepalived.sh
[root@slave scripts]# chown -R zabbix:zabbix check_keepalived.sh
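
The same check can also be written so that the VIP is a parameter instead of being hard-coded. The following is only a sketch: the function name `vip_present` is an assumption for this example, and it reads `ip` output from standard input so that it can be tried against saved output as well as the live interface.

```shell
#!/bin/bash
# Hypothetical helper: print 1 if the given VIP appears in `ip addr`
# output read from stdin, otherwise print 0.
#   $1 = the VIP to look for, e.g. 192.168.172.250
vip_present() {
    local vip="$1"
    # -F treats the pattern as a fixed string, and the trailing "/"
    # keeps 192.168.172.25 from matching 192.168.172.250.
    if grep -qF "inet ${vip}/"; then
        echo 1
    else
        echo 0
    fi
}
```

On the backup node it could be called as `ip a show ens160 | vip_present 192.168.172.250`.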


//Add the split-brain monitoring script to the agent configuration file
[root@slave ~]# vim /usr/local/etc/zabbix_agentd.conf
...
UnsafeUserParameters=1                    
UserParameter=check_keepalived.sh,/scripts/check_keepalived.sh

//Restart agent
[root@slave ~]# pkill zabbix_agentd
[root@slave ~]# zabbix_agentd
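
Before creating the item in the web interface, it can help to confirm that the agent really returns a value for the new key. `zabbix_agentd -t <key>` tests a single item locally. The helper below is a sketch that extracts the value from that output, assuming it has the usual `key [t|value]` form; the function name `agent_test_value` is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: extract the value from `zabbix_agentd -t` output,
# which is assumed to look like:  check_keepalived.sh    [t|1]
agent_test_value() {
    # Keep only what sits between "[<type>|" and the closing "]".
    sed -n 's/.*\[[a-z]|\(.*\)]$/\1/p'
}
```

It would be used as `zabbix_agentd -t check_keepalived.sh | agent_test_value`, which should print 0 or 1 if the UserParameter is wired up correctly.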

**zabbix web**
Create host
![Please add a picture description](https://img-blog.csdnimg.cn/41ed408794a24c88afcc1cc32cabafad.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)
![Please add a picture description](https://img-blog.csdnimg.cn/4cbfaf81385f4519821b829e2373daf9.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)
Create item
![Please add a picture description](https://img-blog.csdnimg.cn/df8fff1950c74d0d912388404486f17d.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)
![Please add a picture description](https://img-blog.csdnimg.cn/caa2780783a34e5b9319f3d6405a4648.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)
![Please add a picture description](https://img-blog.csdnimg.cn/a06ac291096e4a84aee7ccca41a6827e.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)
![Please add a picture description](https://img-blog.csdnimg.cn/71aabf543c6c4d64bc80afee3ab7b363.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)
![Please add a picture description](https://img-blog.csdnimg.cn/37958aaedd444da08af833fc32e690bd.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)
Create trigger
![Please add a picture description](https://img-blog.csdnimg.cn/2c70b9ed36ca47c9bb53619216a39d5b.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)
![Please add a picture description](https://img-blog.csdnimg.cn/16cb76a554994ca287afc9cf00a4b19d.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBAZUhWNGFXNW4=,size_20,color_FFFFFF,t_70,g_se,x_16)

//Test
```shell
# Simulate a failure of the master server
[root@master ~]# systemctl stop keepalived
[root@master ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:db:3c:14 brd ff:ff:ff:ff:ff:ff
    inet 192.168.172.143/24 brd 192.168.172.255 scope global noprefixroute ens160
       valid_lft forever preferred_lft forever


[root@slave ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:0c:29:f9:d6:50 brd ff:ff:ff:ff:ff:ff
    inet 192.168.172.142/24 brd 192.168.172.255 scope global noprefixroute ens160
       valid_lft forever preferred_lft forever
    inet 192.168.172.250/32 scope global ens160
       valid_lft forever preferred_lft forever
```

Posted by drayfuss on Sun, 24 Oct 2021 22:31:49 -0700