Summary:
Pacemaker is the resource manager that was split out of Heartbeat in version 3, so Pacemaker itself does not provide a heartbeat (messaging) layer; our cluster needs Corosync for that. Pacemaker acts as the control center of the whole HA stack: the administrator configures and manages the entire cluster through it. There is also the crm shell, which generates the configuration for us and synchronizes it to every node; it is a powerful tool when building a cluster.
1. Installing Cluster Software
Install pacemaker and corosync directly through yum:
yum install pacemaker corosync -y
crmsh-1.2.6-0.rc2.2.1.x86_64.rpm
pssh-2.3.1-2.1.x86_64.rpm
Install the above two rpm packages; crmsh depends on pssh.
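A minimal sketch of installing the two local packages, assuming they sit in the current directory (yum localinstall resolves the remaining dependencies):
[root@ha1 ~]# yum localinstall -y crmsh-1.2.6-0.rc2.2.1.x86_64.rpm pssh-2.3.1-2.1.x86_64.rpm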
2. Configuring clusters through crm
Run crm (the cluster resource manager shell) to enter its interactive interface:
[root@ha1 ~]# crm
crm(live)#
Press the Tab key (or type ?) to see the available management items:
crm(live)# ?
cib  exit  node  ra  status  bye  configure  help  options  resource  up  cd  end  history  quit  site
We now need to configure the cluster, which is all done from the configure sub-menu.
At this point the error below appears. It is caused by the corosync service not running: we have not even brought up the heartbeat (messaging) layer, let alone the cluster manager on top of it, so let's configure corosync first.
ERROR: running cibadmin -Ql: Could not establish cib_rw connection: Connection refused (111)
Signon to CIB failed: Transport endpoint is not connected
Init failed, could not perform requested operations
Use rpm to find the location of corosync's configuration file:
[root@ha1 ~]# rpm -ql corosync
/etc/corosync
/etc/corosync/corosync.conf.example
Copy the example file, dropping the .example suffix, and modify the configuration so it matches the listing further below.
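A sketch of that copy step (paths taken from the rpm output above):
[root@ha1 ~]# cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf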
# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
        version: 2
        secauth: off
        threads: 0
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.5.0    # network segment used for cluster management traffic
                mcastaddr: 226.94.1.1       # multicast address
                mcastport: 5405             # multicast port
                ttl: 1                      # ttl of 1 keeps multicast messages local and prevents loops
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/cluster/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}

service {               # let corosync load pacemaker
        name: pacemaker
        ver: 0          # with ver 1 the plugin does not start pacemaker; with ver 0 pacemaker starts automatically
}
Next, start corosync on both nodes. If it starts without errors and the log is clean, it worked.
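A sketch of starting the service and checking the log on RHEL 6 (log path as set in the config above; repeat on ha2):
[root@ha1 ~]# /etc/init.d/corosync start
[root@ha1 ~]# grep -i error /var/log/cluster/corosync.log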
Now crm should be working properly.
crm(live)# configure
crm(live)configure# show
node ha1.mo.com
node ha2.mo.com
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-14.el6-368c726" \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes="2"
The same command can be entered directly from bash; it works too, just without tab completion:
[root@ha1 cluster]# crm configure show
node ha1.mo.com
node ha2.mo.com
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-14.el6-368c726" \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes="2"
Now let's add services to the cluster.
First comes the simpler one, the virtual IP resource.
The command below looks long, but almost all of it can be tab-completed; once you understand what you are doing there is nothing to memorize. Here ocf refers to OCF resource agents (cluster service scripts), and lsb refers to standard Linux init scripts, i.e. the scripts under /etc/init.d.
crm(live)configure# primitive vip ocf:heartbeat:IPaddr2 params ip=192.168.5.100 cidr_netmask=24 op monitor interval=30s
Changes made here are not saved to the machine-readable XML immediately; you have to commit them.
Committing produced the error below. It is a STONITH problem: STONITH is enabled but no STONITH resource has been configured. Since we are only adding the IP resource for now, we ignore it and confirm the commit. Note that the change only takes effect once the commit is confirmed.
crm(live)configure# commit
error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
Do you still want to commit?
Use crm's status view to check whether the resource is running normally.
Go back up with cd, then enter resource to look at the resource state. Strangely, it has not started; starting it by hand also fails, which points to a configuration problem, so let's check the log.
crm(live)configure# cd
crm(live)# resource
crm(live)resource# show
 vip    (ocf::heartbeat:IPaddr2):       Stopped
crm(live)resource# start vip
crm(live)resource# show
 vip    (ocf::heartbeat:IPaddr2):       Stopped
GINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 27 07:14:09 ha1 pengine[6053]: error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Feb 27 07:14:09 ha1 pengine[6053]: error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Feb 27 07:14:09 ha1 pengine[6053]: error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Only STONITH errors show up in the log, so let's turn STONITH off.
crm(live)configure# property stonith-enabled=false
crm(live)resource# show
 vip    (ocf::heartbeat:IPaddr2):       Started
Now the resource runs normally, so those errors did have to be cleared first. After going through this you can feel how convenient pacemaker is: a configuration change made on one node is applied to every node, with no extra distribution step.
Now test the failover behaviour: shut down ha1's network and see whether the cluster notices.
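A sketch of one way to do that on ha1 (the interface name eth0 is an assumption):
[root@ha1 ~]# ifdown eth0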
[root@ha2 ~]# crm_mon
Last updated: Mon Feb 27 07:30:23 2017
Last change: Mon Feb 27 07:16:50 2017 via cibadmin on ha1.mo.com
Stack: classic openais (with plugin)
Current DC: ha2.mo.com - partition WITHOUT quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
1 Resources configured

Online: [ ha2.mo.com ]
OFFLINE: [ ha1.mo.com ]
STONITH is usually a hardware device; since our nodes are virtual machines, we need a virtual fence device instead.
[root@ha1 ~]# stonith_admin -I
 fence_pcmk
 fence_legacy
2 devices found
Looking at the installed fence agents, the fence_xvm agent we need is not there, so let's see what yum can offer.
The search turns up a package that meets our needs; install it and check again.
fence-virt.x86_64 : A pluggable fencing framework for virtual machines
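A sketch of the install step (package name taken from the search result above):
[root@ha1 ~]# yum install -y fence-virt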
Now we have the fence_xvm agent we need:
[root@ha1 ~]# stonith_admin -I
 fence_xvm
 fence_virt
 fence_pcmk
 fence_legacy
4 devices found
Query the agent's metadata with stonith_admin to confirm it is usable:
[root@ha1 ~]# stonith_admin -M -a fence_xvm
Enter crm to add the fence configuration.
In the command below, pcmk_host_map maps each node's host name to the corresponding virtual-machine (domain) name:
crm(live)configure# primitive vmfence stonith:fence_xvm params pcmk_host_map="ha1.mo.com:ha1;ha2.mo.com:ha2" op monitor interval=20s
Now check where the fence resource is running:
vmfence (stonith:fence_xvm): Started ha2.mo.com
Now add an httpd service as a test.
crm(live)configure# primitive apache lsb:httpd op monitor interval=30s
Check the running status.
Now, as with the RHCS suite covered a few days ago, the ip and httpd resources need to start in a fixed order, so let's define the startup order.
crm(live)configure# group website vip apache
This puts vip and apache into one group; resources in a group start in the listed order, so vip starts before httpd. Now look at the resource status:
crm(live)resource# show
 vmfence        (stonith:fence_xvm):    Started
 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):       Started
     apache     (lsb:httpd):    Started
Now that a basic service prototype is in place, let's test whether fencing kicks in. Stop the httpd service on ha1.
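A sketch of that failure injection on ha1:
[root@ha1 ~]# /etc/init.d/httpd stop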
Failed actions:
    apache_monitor_30000 on ha1.mo.com 'not running' (7): call=27, status=complete, last-rc-change='Mon Feb 27 22:32:36 2017', queued=0ms, exec=0ms
Watching the cluster from ha2, it detects that httpd on ha1 has stopped, but instead of fencing the node it simply restarts httpd on ha1.
Now take ha1's network interface down.
2 Nodes configured, 2 expected votes
3 Resources configured

Node ha1.mo.com: UNCLEAN (offline)
Online: [ ha2.mo.com ]

 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):       Started ha1.mo.com
     apache     (lsb:httpd):    Started ha1.mo.com
Something odd happens: the resources stay on ha1 and never fail over. The cause is a quorum option we have not set: by default, when fewer than half of the nodes are left (here, fewer than two), the cluster considers itself broken and stops managing resources. In practice this is a protection strategy, but it does not suit a two-node cluster, so we tell pacemaker to ignore the loss of quorum.
crm(live)configure# property no-quorum-policy=ignore
Commit this setting and continue testing. The resources are currently on ha2; now shut down ha2's network interface.
Last change: Mon Feb 27 22:46:35 2017 via cibadmin on ha2.mo.com
Stack: classic openais (with plugin)
Current DC: ha1.mo.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
3 Resources configured

Online: [ ha1.mo.com ha2.mo.com ]

 vmfence        (stonith:fence_xvm):    Started ha1.mo.com
 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):       Started ha1.mo.com
     apache     (lsb:httpd):    Started ha1.mo.com
You can see that the resources have switched to ha1, and ha2 was fenced off.
Now add the ldirectord service so the cluster can manage LVS. Configuring ldirectord itself was covered in the previous post; here we use a virtual IP of 172.25.3.100 and two real servers handling the load at 172.25.3.3 and 172.25.3.4.
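For reference, a minimal ldirectord.cf sketch matching those addresses (port 80, direct routing and the round-robin scheduler are assumptions; follow the earlier post for the real settings):
# /etc/ha.d/ldirectord.cf -- sketch only
virtual=172.25.3.100:80
        real=172.25.3.3:80 gate       # gate = direct routing (assumed)
        real=172.25.3.4:80 gate
        scheduler=rr
        protocol=tcp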
Now add ldirectord to the cluster configuration:
crm(live)configure# primitive lvs lsb:ldirectord op monitor interval=30s
Next we will add storage for this website. Before that, here are the commands for taking a node offline (standby) and bringing it back online, sketched below.
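A sketch of the standby/online commands in crmsh (node names as used in this cluster):
crm(live)# node standby ha1.mo.com
crm(live)# node online ha1.mo.com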
The resources are currently running on ha1; put ha1 into standby and watch the result:
Last updated: Tue Feb 28 22:35:00 2017
Last change: Tue Feb 28 22:34:04 2017 via cibadmin on ha1.mo.com
Stack: classic openais (with plugin)
Current DC: ha1.mo.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
3 Resources configured

Node ha1.mo.com: standby
Online: [ ha2.mo.com ]

 vmfence        (stonith:fence_xvm):    Started ha2.mo.com
 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):       Started ha2.mo.com
     apache     (lsb:httpd):    Started ha2.mo.com
Now both nodes have been put into standby; let's bring ha1 back online.
Last updated: Tue Feb 28 22:37:21 2017
Last change: Tue Feb 28 22:37:21 2017 via crm_attribute on ha2.mo.com
Stack: classic openais (with plugin)
Current DC: ha1.mo.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
3 Resources configured

Node ha1.mo.com: standby
Node ha2.mo.com: standby

After bringing ha1 online:
Node ha2.mo.com: standby
Online: [ ha1.mo.com ]

 vmfence        (stonith:fence_xvm):    Started ha1.mo.com
 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):       Started ha1.mo.com
     apache     (lsb:httpd):    Started ha1.mo.com
ha1 takes the resources over.
If the configuration has no errors but a resource still refuses to run (for example, after starting the cluster I had forgotten to start fence_virtd on the physical host, so the vmfence resource for the virtual machines could not start), try the command below. cleanup refreshes the resource's state and clears its failure history.
crm(live)resource# cleanup vmfence
Cleaning up vmfence on ha1.mo.com
Cleaning up vmfence on ha2.mo.com
Waiting for 1 replies from the CRMd. OK
Now let's look at the information each resource agent publishes about itself.
This is the description of the apache (lsb:httpd) agent:
start and stop Apache HTTP Server (lsb:httpd)

The Apache HTTP Server is an efficient and extensible \
server implementing the current HTTP standards.

Operations' defaults (advisory minimum):

    start         timeout=15
    stop          timeout=15
    status        timeout=15
    restart       timeout=15
    force-reload  timeout=15
    monitor       timeout=15 interval=15
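This kind of description can be pulled up from crm's ra sub-menu, roughly like this (a sketch):
crm(live)# ra
crm(live)ra# info lsb:httpd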
Next, add DRBD shared storage and a mysql service to the cluster.
First, add a 4G disk to each of ha1 and ha2. Building DRBD from the source package into rpm packages was covered in an earlier post.
[root@ha1 x86_64]# ls
drbd-8.4.2-2.el6.x86_64.rpm
drbd-heartbeat-8.4.2-2.el6.x86_64.rpm
drbd-pacemaker-8.4.2-2.el6.x86_64.rpm
drbd-xen-8.4.2-2.el6.x86_64.rpm
drbd-bash-completion-8.4.2-2.el6.x86_64.rpm
drbd-km-2.6.32_431.el6.x86_64-8.4.2-2.el6.x86_64.rpm
drbd-udev-8.4.2-2.el6.x86_64.rpm
drbd-debuginfo-8.4.2-2.el6.x86_64.rpm
drbd-km-debuginfo-8.4.2-2.el6.x86_64.rpm
drbd-utils-8.4.2-2.el6.x86_64.rpm
These are the resulting rpm packages. Install them, then install mysql and put its data files on DRBD's shared storage.
Create the DRBD metadata, start the service, and force one node to primary. Two caveats from mistakes I made myself: the backing device underneath DRBD must not be formatted beforehand, otherwise forcing primary will never succeed; and the DRBD device is mounted on /var/lib/mysql (mysql's data directory) so that the MySQL data lives on the DRBD device. Remember to stop MySQL before switching the DRBD roles to the backup node, and do not leave mysql's socket file on the DRBD storage.
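A sketch of those steps on ha1, assuming the DRBD resource is named mo (as used in the cluster configuration later) and the device is /dev/drbd1:
[root@ha1 ~]# drbdadm create-md mo          # also run on ha2
[root@ha1 ~]# /etc/init.d/drbd start        # also run on ha2
[root@ha1 ~]# drbdadm primary --force mo    # force ha1 to become primary
[root@ha1 ~]# mkfs.ext4 /dev/drbd1          # format the DRBD device, never the backing disk
[root@ha1 ~]# mount /dev/drbd1 /var/lib/mysql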
Now stop the drbd service and let the pacemaker cluster take it over.
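A sketch of handing things back on ha1 (unmount and demote first, then stop drbd on both nodes):
[root@ha1 ~]# umount /var/lib/mysql
[root@ha1 ~]# drbdadm secondary mo
[root@ha1 ~]# /etc/init.d/drbd stop         # also run on ha2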
First add the drbd resource.
The agent used this time is linbit's OCF script, and the important parameter is drbd_resource:
crm(live)configure# primitive drbddata ocf:linbit:drbd params drbd_resource=mo op monitor interval=120s
Define the drbd master/slave resource:
crm(live)configure# ms drbdclone drbddata meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
crm(live)configure# primitive sqlfs ocf:heartbeat:Filesystem params device=/dev/drbd1 directory=/var/lib/mysql fstype=ext4
Tie sqlfs to drbd with a colocation constraint, which also makes it easy to define the startup order afterwards.
The filesystem may only run where drbd is primary:
crm(live)configure# colocation sqlfs-with-drbd inf: sqlfs drbdclone:Master
crm(live)configure# order sqlfs-after-drbd inf: drbdclone:promote sqlfs:start
Now commit and see whether it works. Any warnings printed at this point can be ignored for the time being.
You can see that the resources are all running properly.
crm(live)resource# show
 vmfence        (stonith:fence_xvm):    Started
 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):       Started
     apache     (lsb:httpd):    Started
 sqlfs  (ocf::heartbeat:Filesystem):    Started
 Master/Slave Set: drbdclone [drbddata]
     Masters: [ ha1.mo.com ]
Finally, add the mysql service to the configuration.
crm(live)configure# primitive mysql lsb:mysqld op monitor interval=60s
Remove the previous website group (one way is sketched after the command below), regroup the resources, and then check whether everything still runs properly.
crm(live)configure# group mydb vip sqlfs mysql
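A possible way to drop the old group, using crmsh's configure delete command (a sketch; any constraints referring to the group may need adjusting too):
crm(live)configure# delete website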
crm(live)resource# show
 vmfence        (stonith:fence_xvm):    Started
 Master/Slave Set: drbdclone [drbddata]
     Masters: [ ha2.mo.com ]
     Stopped: [ ha1.mo.com ]
 apache (lsb:httpd):    Started
 Resource Group: mydb
     vip        (ocf::heartbeat:IPaddr2):       Started
     sqlfs      (ocf::heartbeat:Filesystem):    Started
     mysql      (lsb:mysqld):   Started