Summary:
Pacemaker is the resource manager that was split out of Heartbeat in version 3, so Pacemaker itself does not provide a heartbeat (messaging) layer; our cluster needs Corosync for that. Pacemaker acts as the control center of the whole HA stack: the administrator configures and manages the entire cluster through it. There is also the crm shell, which generates the configuration for us and synchronizes it to every node; it is a powerful tool when building a cluster.
1. Installing Cluster Software
Install pacemaker and corosync directly through yum:
yum install pacemaker corosync -y
crmsh-1.2.6-0.rc2.2.1.x86_64.rpm
pssh-2.3.1-2.1.x86_64.rpm
Install the above two rpm packages; crmsh depends on pssh.
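A minimal sketch of installing the two local packages, assuming they sit in the current directory (yum localinstall resolves the remaining dependencies):
[root@ha1 ~]# yum localinstall -y crmsh-1.2.6-0.rc2.2.1.x86_64.rpm pssh-2.3.1-2.1.x86_64.rpm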
2. Configuring clusters through crm
Run crm (the cluster resource manager shell) to enter its interactive interface:
[root@ha1 ~]# crm
crm(live)#
Press the Tab key (or type ?) to see the available management items:
crm(live)# ?
cib  exit  node  ra  status  bye  configure  help  options  resource  up  cd  end  history  quit  site
We now need to configure the cluster, which is all done from the configure sub-menu.
At this point the error below appears. It is caused by the corosync service not running: we have not even brought up the heartbeat (messaging) layer, let alone the cluster manager on top of it, so let's configure corosync first.
ERROR: running cibadmin -Ql: Could not establish cib_rw connection: Connection refused (111)
Signon to CIB failed: Transport endpoint is not connected
Init failed, could not perform requested operations
Use rpm to find the location of corosync's configuration file:
[root@ha1 ~]# rpm -ql corosync
/etc/corosync
/etc/corosync/corosync.conf.example
Copy the example file, dropping the .example suffix, and modify the configuration so it matches the listing further below.
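A sketch of that copy step (paths taken from the rpm output above):
[root@ha1 ~]# cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf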
# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
        version: 2
        secauth: off
        threads: 0
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.5.0    # network segment used for cluster management traffic
                mcastaddr: 226.94.1.1       # multicast address
                mcastport: 5405             # multicast port
                ttl: 1                      # ttl of 1 keeps multicast messages local and prevents loops
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/cluster/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}

service {               # let corosync load pacemaker
        name: pacemaker
        ver: 0          # with ver 1 the plugin does not start pacemaker; with ver 0 pacemaker starts automatically
}
Next, start corosync on both nodes. If it starts without errors and the log is clean, it worked.
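A sketch of starting the service and checking the log on RHEL 6 (log path as set in the config above; repeat on ha2):
[root@ha1 ~]# /etc/init.d/corosync start
[root@ha1 ~]# grep -i error /var/log/cluster/corosync.log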
Now crm should be working properly.
crm(live)# configure
crm(live)configure# show
node ha1.mo.com
node ha2.mo.com
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-14.el6-368c726" \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes="2"
The same command can be entered directly from bash; it works too, just without tab completion:
[root@ha1 cluster]# crm configure show
node ha1.mo.com
node ha2.mo.com
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-14.el6-368c726" \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes="2"
Now let's add services to the cluster.
First comes the simpler one, the virtual IP resource.
The command below looks long, but almost all of it can be tab-completed; once you understand what you are doing there is nothing to memorize. Here ocf refers to OCF resource agents (cluster service scripts), and lsb refers to standard Linux init scripts, i.e. the scripts under /etc/init.d.
crm(live)configure# primitive vip ocf:heartbeat:IPaddr2 params ip=192.168.5.100 cidr_netmask=24 op monitor interval=30s
Changes made here are not saved to the machine-readable XML immediately; you have to commit them.
Committing produced the error below. It is a STONITH problem: STONITH is enabled but no STONITH resource has been configured. Since we are only adding the IP resource for now, we ignore it and confirm the commit. Note that the change only takes effect once the commit is confirmed.
crm(live)configure# commit
error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
Do you still want to commit?
Use crm's status view to check whether the resource is running normally.
Go back up with cd, then enter resource to look at the resource state. Strangely, it has not started; starting it by hand also fails, which points to a configuration problem, so let's check the log.
crm(live)configure# cd
crm(live)# resource
crm(live)resource# show
 vip    (ocf::heartbeat:IPaddr2):       Stopped
crm(live)resource# start vip
crm(live)resource# show
 vip    (ocf::heartbeat:IPaddr2):       Stopped
GINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 27 07:14:09 ha1 pengine[6053]: error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
Feb 27 07:14:09 ha1 pengine[6053]: error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
Feb 27 07:14:09 ha1 pengine[6053]: error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Only STONITH errors show up in the log, so let's turn STONITH off.
crm(live)configure# property stonith-enabled=false
crm(live)resource# show
 vip    (ocf::heartbeat:IPaddr2):       Started
Now the resource runs normally, so those errors did have to be cleared first. After going through this you can feel how convenient pacemaker is: a configuration change made on one node is applied to every node, with no extra distribution step.
Now test the failover behaviour: shut down ha1's network and see whether the cluster notices.
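A sketch of one way to do that on ha1 (the interface name eth0 is an assumption):
[root@ha1 ~]# ifdown eth0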
[root@ha2 ~]# crm_mon
Last updated: Mon Feb 27 07:30:23 2017
Last change: Mon Feb 27 07:16:50 2017 via cibadmin on ha1.mo.com
Stack: classic openais (with plugin)
Current DC: ha2.mo.com - partition WITHOUT quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
1 Resources configured

Online: [ ha2.mo.com ]
OFFLINE: [ ha1.mo.com ]
STONITH is usually a hardware device; since our nodes are virtual machines, we need a virtual fence device instead.
[root@ha1 ~]# stonith_admin -I
 fence_pcmk
 fence_legacy
2 devices found
Looking at the installed fence agents, the fence_xvm agent we need is not there, so let's see what yum can offer.
The search turns up a package that meets our needs; install it and check again.
fence-virt.x86_64 : A pluggable fencing framework for virtual machines
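A sketch of the install step (package name taken from the search result above):
[root@ha1 ~]# yum install -y fence-virt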
Now we have the fence_xvm agent we need:
[root@ha1 ~]# stonith_admin -I
 fence_xvm
 fence_virt
 fence_pcmk
 fence_legacy
4 devices found
Query the agent's metadata with stonith_admin to confirm it is usable:
[root@ha1 ~]# stonith_admin -M -a fence_xvm
Enter crm to add the fence configuration.
In the command below, pcmk_host_map maps each node's host name to the corresponding virtual-machine (domain) name:
crm(live)configure# primitive vmfence stonith:fence_xvm params pcmk_host_map="ha1.mo.com:ha1;ha2.mo.com:ha2" op monitor interval=20s
Now check where the fence resource is running:
vmfence (stonith:fence_xvm): Started ha2.mo.com
Now add an httpd service as a test.
crm(live)configure# primitive apache lsb:httpd op monitor interval=30s
Check the running status.
Now, as with the RHCS suite covered a few days ago, the ip and httpd resources need to start in a fixed order, so let's define the startup order.
crm(live)configure# group website vip apache
This puts vip and apache into one group; resources in a group start in the listed order, so vip starts before httpd. Now look at the resource status:
crm(live)resource# show
 vmfence        (stonith:fence_xvm):    Started
 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):       Started
     apache     (lsb:httpd):    Started
Now that a basic service prototype is in place, let's test whether fencing kicks in. Stop the httpd service on ha1.
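A sketch of that failure injection on ha1:
[root@ha1 ~]# /etc/init.d/httpd stop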
Failed actions:
    apache_monitor_30000 on ha1.mo.com 'not running' (7): call=27, status=complete, last-rc-change='Mon Feb 27 22:32:36 2017', queued=0ms, exec=0ms
Watching the cluster from ha2, it detects that httpd on ha1 has stopped, but instead of fencing the node it simply restarts httpd on ha1.
Now take ha1's network interface down.
2 Nodes configured, 2 expected votes
3 Resources configured

Node ha1.mo.com: UNCLEAN (offline)
Online: [ ha2.mo.com ]

 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):       Started ha1.mo.com
     apache     (lsb:httpd):    Started ha1.mo.com
Something odd happens: the resources stay on ha1 and never fail over. The cause is a quorum option we have not set: by default, when fewer than half of the nodes are left (here, fewer than two), the cluster considers itself broken and stops managing resources. In practice this is a protection strategy, but it does not suit a two-node cluster, so we tell pacemaker to ignore the loss of quorum.
crm(live)configure# property no-quorum-policy=ignore
Commit this setting and continue testing. The resources are currently on ha2; now shut down ha2's network interface.
Last change: Mon Feb 27 22:46:35 2017 via cibadmin on ha2.mo.com
Stack: classic openais (with plugin)
Current DC: ha1.mo.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
3 Resources configured

Online: [ ha1.mo.com ha2.mo.com ]

 vmfence        (stonith:fence_xvm):    Started ha1.mo.com
 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):       Started ha1.mo.com
     apache     (lsb:httpd):    Started ha1.mo.com
You can see that the resources have switched to ha1, and ha2 was fenced off.
Now add the ldirectord service so the cluster can manage LVS. Configuring ldirectord itself was covered in the previous post; here we use a virtual IP of 172.25.3.100 and two real servers handling the load at 172.25.3.3 and 172.25.3.4.
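For reference, a minimal ldirectord.cf sketch matching those addresses (port 80, direct routing and the round-robin scheduler are assumptions; follow the earlier post for the real settings):
# /etc/ha.d/ldirectord.cf -- sketch only
virtual=172.25.3.100:80
        real=172.25.3.3:80 gate       # gate = direct routing (assumed)
        real=172.25.3.4:80 gate
        scheduler=rr
        protocol=tcp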
Now add ldirectord to the cluster configuration:
crm(live)configure# primitive lvs lsb:ldirectord op monitor interval=30s
Next we will add storage for this website. Before that, here are the commands for taking a node offline (standby) and bringing it back online, sketched below.
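A sketch of the standby/online commands in crmsh (node names as used in this cluster):
crm(live)# node standby ha1.mo.com
crm(live)# node online ha1.mo.com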
The resources are currently running on ha1; put ha1 into standby and watch the result:
Last updated: Tue Feb 28 22:35:00 2017
Last change: Tue Feb 28 22:34:04 2017 via cibadmin on ha1.mo.com
Stack: classic openais (with plugin)
Current DC: ha1.mo.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
3 Resources configured

Node ha1.mo.com: standby
Online: [ ha2.mo.com ]

 vmfence        (stonith:fence_xvm):    Started ha2.mo.com
 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):       Started ha2.mo.com
     apache     (lsb:httpd):    Started ha2.mo.com
Now both nodes have been put into standby; let's bring ha1 back online.
Last updated: Tue Feb 28 22:37:21 2017
Last change: Tue Feb 28 22:37:21 2017 via crm_attribute on ha2.mo.com
Stack: classic openais (with plugin)
Current DC: ha1.mo.com - partition with quorum
Version: 1.1.10-14.el6-368c726
2 Nodes configured, 2 expected votes
3 Resources configured

Node ha1.mo.com: standby
Node ha2.mo.com: standby

After bringing ha1 online:
Node ha2.mo.com: standby
Online: [ ha1.mo.com ]

 vmfence        (stonith:fence_xvm):    Started ha1.mo.com
 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):       Started ha1.mo.com
     apache     (lsb:httpd):    Started ha1.mo.com
ha1 takes the resources over.
If the configuration has no errors but a resource still refuses to run (for example, after starting the cluster I had forgotten to start fence_virtd on the physical host, so the vmfence resource for the virtual machines could not start), try the command below. cleanup refreshes the resource's state and clears its failure history.
crm(live)resource# cleanup vmfence
Cleaning up vmfence on ha1.mo.com
Cleaning up vmfence on ha2.mo.com
Waiting for 1 replies from the CRMd. OK
Now let's look at the information each resource agent publishes about itself.
This is the description of the apache (lsb:httpd) agent:
start and stop Apache HTTP Server (lsb:httpd)

The Apache HTTP Server is an efficient and extensible \
server implementing the current HTTP standards.

Operations' defaults (advisory minimum):

    start         timeout=15
    stop          timeout=15
    status        timeout=15
    restart       timeout=15
    force-reload  timeout=15
    monitor       timeout=15 interval=15
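This kind of description can be pulled up from crm's ra sub-menu, roughly like this (a sketch):
crm(live)# ra
crm(live)ra# info lsb:httpd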
Next, add DRBD shared storage and a mysql service to the cluster.
First, add a 4G disk to each of ha1 and ha2. Building DRBD from the source package into rpm packages was covered in an earlier post.
[root@ha1 x86_64]# ls
drbd-8.4.2-2.el6.x86_64.rpm
drbd-heartbeat-8.4.2-2.el6.x86_64.rpm
drbd-pacemaker-8.4.2-2.el6.x86_64.rpm
drbd-xen-8.4.2-2.el6.x86_64.rpm
drbd-bash-completion-8.4.2-2.el6.x86_64.rpm
drbd-km-2.6.32_431.el6.x86_64-8.4.2-2.el6.x86_64.rpm
drbd-udev-8.4.2-2.el6.x86_64.rpm
drbd-debuginfo-8.4.2-2.el6.x86_64.rpm
drbd-km-debuginfo-8.4.2-2.el6.x86_64.rpm
drbd-utils-8.4.2-2.el6.x86_64.rpm
These are the resulting rpm packages. Install them, then install mysql and put its data files on DRBD's shared storage.
Create the DRBD metadata, start the service, and force one node to primary. Two caveats from mistakes I made myself: the backing device underneath DRBD must not be formatted beforehand, otherwise forcing primary will never succeed; and the DRBD device is mounted on /var/lib/mysql (mysql's data directory) so that the MySQL data lives on the DRBD device. Remember to stop MySQL before switching the DRBD roles to the backup node, and do not leave mysql's socket file on the DRBD storage.
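A sketch of those steps on ha1, assuming the DRBD resource is named mo (as used in the cluster configuration later) and the device is /dev/drbd1:
[root@ha1 ~]# drbdadm create-md mo          # also run on ha2
[root@ha1 ~]# /etc/init.d/drbd start        # also run on ha2
[root@ha1 ~]# drbdadm primary --force mo    # force ha1 to become primary
[root@ha1 ~]# mkfs.ext4 /dev/drbd1          # format the DRBD device, never the backing disk
[root@ha1 ~]# mount /dev/drbd1 /var/lib/mysql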
Now stop the drbd service and let the pacemaker cluster take it over.
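A sketch of handing things back on ha1 (unmount and demote first, then stop drbd on both nodes):
[root@ha1 ~]# umount /var/lib/mysql
[root@ha1 ~]# drbdadm secondary mo
[root@ha1 ~]# /etc/init.d/drbd stop         # also run on ha2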
First add the drbd resource.
The agent used this time is linbit's OCF script, and the important parameter is drbd_resource:
crm(live)configure# primitive drbddata ocf:linbit:drbd params drbd_resource=mo op monitor interval=120s
Define the drbd master/slave resource:
crm(live)configure# ms drbdclone drbddata meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
crm(live)configure# primitive sqlfs ocf:heartbeat:Filesystem params device=/dev/drbd1 directory=/var/lib/mysql fstype=ext4
Tie sqlfs to drbd with a colocation constraint, which also makes it easy to define the startup order afterwards.
The filesystem may only run where drbd is primary:
crm(live)configure# colocation sqlfs-with-drbd inf: sqlfs drbdclone:Master
crm(live)configure# order sqlfs-after-drbd inf: drbdclone:promote sqlfs:start
Now commit and see whether it works. Any warnings printed at this point can be ignored for the time being.
You can see that the resources are all running properly.
crm(live)resource# show
 vmfence        (stonith:fence_xvm):    Started
 Resource Group: website
     vip        (ocf::heartbeat:IPaddr2):       Started
     apache     (lsb:httpd):    Started
 sqlfs  (ocf::heartbeat:Filesystem):    Started
 Master/Slave Set: drbdclone [drbddata]
     Masters: [ ha1.mo.com ]
Finally, add the mysql service to the configuration.
crm(live)configure# primitive mysql lsb:mysqld op monitor interval=60s
Remove the previous website group (one way is sketched after the command below), regroup the resources, and then check whether everything still runs properly.
crm(live)configure# group mydb vip sqlfs mysql
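A possible way to drop the old group, using crmsh's configure delete command (a sketch; any constraints referring to the group may need adjusting too):
crm(live)configure# delete website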
crm(live)resource# show
 vmfence        (stonith:fence_xvm):    Started
 Master/Slave Set: drbdclone [drbddata]
     Masters: [ ha2.mo.com ]
     Stopped: [ ha1.mo.com ]
 apache (lsb:httpd):    Started
 Resource Group: mydb
     vip        (ocf::heartbeat:IPaddr2):       Started
     sqlfs      (ocf::heartbeat:Filesystem):    Started
     mysql      (lsb:mysqld):   Started