etcd backup and recovery in K8S cluster

Keywords: Kubernetes Container

etcd is a critical component of a kubernetes cluster. It stores all of the cluster's state, such as Namespace, Pod, Service, and routing information. If the etcd cluster suffers a disaster or loses its data, the k8s cluster's data is lost with it. Backing up etcd data is therefore essential for disaster recovery of a kubernetes cluster.


1, etcd cluster backup

The etcdctl commands differ slightly between etcd versions but are broadly the same. This article uses snapshot save to take a snapshot backup.
Points to note:
  • The backup operation can be performed on one of the nodes of the etcd cluster.
  • The etcd v3 API is used here. Since k8s 1.13, k8s no longer supports etcd v2, so all k8s cluster data is stored through the v3 API. Consequently, the backup covers only data written through the v3 API; data written through the v2 API is not backed up.
  • This case uses a binary deployment of k8s v1.18.6 with Calico. Because etcdctl 3.4 uses the v3 API by default, "ETCDCTL_API=3 etcdctl" in the commands below is equivalent to plain "etcdctl".
1) Before starting the backup, check the etcd data
etcd data directory:
    [root@k8s-master01 ~]# cat /opt/k8s/bin/ |grep "ETCD_DATA_DIR="
    export ETCD_DATA_DIR="/data/k8s/etcd/data"

etcd WAL directory:
    [root@k8s-master01 ~]# cat /opt/k8s/bin/ |grep "ETCD_WAL_DIR="
    export ETCD_WAL_DIR="/data/k8s/etcd/wal"

    [root@k8s-master01 ~]# ls /data/k8s/etcd/data/
    member
    [root@k8s-master01 ~]# ls /data/k8s/etcd/data/member/
    snap
    [root@k8s-master01 ~]# ls /data/k8s/etcd/wal/
    0000000000000000-0000000000000000.wal  0.tmp


2) Perform etcd cluster data backup
Perform the backup operation on one node of the etcd cluster, and then copy the backup files to other nodes.
First, create a backup directory on each node of the etcd cluster:
    # mkdir -p /data/etcd_backup_dir

Perform a backup on one of the etcd cluster nodes (here at k8s-master01):

    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --endpoints= snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db
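Before copying the snapshot anywhere, it may be worth checking that the file is usable. The following is a minimal sketch, assuming etcdctl 3.4 is on the PATH; the helper names snapshot_file_ok and snapshot_status are illustrative and not part of the original setup. The etcdctl snapshot status subcommand prints the hash, revision, key count, and size of a snapshot file.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a snapshot file before distributing it.
# Both function names are hypothetical helpers, not part of etcd.

# Fail early if the snapshot file is missing or empty.
snapshot_file_ok() {
    local snapshot="$1"
    if [ ! -s "$snapshot" ]; then
        echo "snapshot missing or empty: $snapshot" >&2
        return 1
    fi
    echo "snapshot present: $snapshot"
}

# Print hash, revision, total keys and size of the snapshot.
# Requires etcdctl on the PATH.
snapshot_status() {
    local snapshot="$1"
    snapshot_file_ok "$snapshot" >/dev/null || return 1
    ETCDCTL_API=3 etcdctl snapshot status "$snapshot" --write-out=table
}

# Example (not executed here):
#   snapshot_status /data/etcd_backup_dir/etcd-snapshot-20200820.db
```

The file check runs first so that a failed or truncated backup is caught even on a host where etcdctl is not installed.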

Copy the backup file to the other etcd nodes:
    [root@k8s-master01 ~]# rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master02:/data/etcd_backup_dir/
    [root@k8s-master01 ~]# rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master03:/data/etcd_backup_dir/


You can put the etcd backup command for the k8s-master01 node above into a script and schedule it with crontab:

    [root@k8s-master01 ~]# cat /data/etcd_backup_dir/
    #!/usr/bin/bash

    date;
    CACERT="/etc/kubernetes/cert/ca.pem"
    CERT="/etc/etcd/cert/etcd.pem"
    KEY="/etc/etcd/cert/etcd-key.pem"
    ENDPOINTS=""

    ETCDCTL_API=3 /opt/k8s/bin/etcdctl \
    --cacert="${CACERT}" --cert="${CERT}" --key="${KEY}" \
    --endpoints=${ENDPOINTS} \
    snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db

    # Keep backups for 30 days
    find /data/etcd_backup_dir/ -name "*.db" -mtime +30 -exec rm -f {} \;

    # Synchronize to the other two etcd nodes
    /bin/rsync -e "ssh -p5522" -avpgolr --delete /data/etcd_backup_dir/ root@k8s-master02:/data/etcd_backup_dir/
    /bin/rsync -e "ssh -p5522" -avpgolr --delete /data/etcd_backup_dir/ root@k8s-master03:/data/etcd_backup_dir/
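One thing to watch in a script like this is that the cleanup and the rsync --delete still run when the snapshot itself fails, which over time could remove every good backup. The sketch below shows one possible guard. The helper name backup_then_prune is hypothetical, and GNU find is assumed.

```shell
#!/usr/bin/env bash
# Sketch: prune old snapshots only after today's snapshot succeeded.
# backup_then_prune is a hypothetical helper, not part of the original script.
#
# Usage: backup_then_prune DIR DAYS SNAPSHOT_COMMAND...
# The target file path is appended as the last argument of the command.
backup_then_prune() {
    local dir="$1" days="$2"
    shift 2
    local target
    target="$dir/etcd-snapshot-$(date +%Y%m%d).db"

    # Run the snapshot command; stop here if it fails.
    if ! "$@" "$target"; then
        echo "snapshot command failed, keeping old backups" >&2
        return 1
    fi
    # An empty file is treated as a failed backup as well.
    if [ ! -s "$target" ]; then
        echo "snapshot is empty, keeping old backups: $target" >&2
        return 1
    fi

    # Same retention rule as the script above: delete *.db older than DAYS days.
    find "$dir" -name '*.db' -mtime +"$days" -exec rm -f {} \;
    echo "$target"
}

# Example (not executed here). ENDPOINTS must be set to your cluster's
# client URLs, which are not shown in this article:
#   backup_then_prune /data/etcd_backup_dir 30 \
#       env ETCDCTL_API=3 /opt/k8s/bin/etcdctl \
#       --cacert=/etc/kubernetes/cert/ca.pem \
#       --cert=/etc/etcd/cert/etcd.pem \
#       --key=/etc/etcd/cert/etcd-key.pem \
#       --endpoints="$ENDPOINTS" \
#       snapshot save
```

The rsync step could be guarded the same way, by running it only when the function returns success.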


Set up a crontab task to run the backup at 5 a.m. every day:
    [root@k8s-master01 ~]# chmod 755 /data/etcd_backup_dir/
    [root@k8s-master01 ~]# crontab -l
    #etcd cluster data backup
    0 5 * * * /bin/bash -x /data/etcd_backup_dir/ > /dev/null 2>&1

2, etcd cluster recovery

An etcd cluster backup only needs to be taken on one etcd node and then copied to the other nodes.
etcd cluster recovery, however, must be performed on every etcd node!

1) Simulating etcd cluster data loss
Delete the data on all three etcd cluster nodes (or delete the data directory itself):

    # rm -rf /data/k8s/etcd/data/*


Check the k8s cluster status:

    [root@k8s-master01 ~]# kubectl get cs
    NAME                 STATUS      MESSAGE                                    ERROR
    etcd-2               Unhealthy   Get dial tcp connect: connection refused
    etcd-1               Unhealthy   Get dial tcp connect: connection refused
    etcd-0               Unhealthy   Get dial tcp connect: connection refused
    scheduler            Healthy     ok
    controller-manager   Healthy     ok


Because the etcd service is still running on all three nodes, the cluster status returns to normal when checked again a little later:
    [root@k8s-master01 ~]# kubectl get cs
    NAME                 STATUS    MESSAGE             ERROR
    controller-manager   Healthy   ok
    scheduler            Healthy   ok
    etcd-0               Healthy   {"health":"true"}
    etcd-2               Healthy   {"health":"true"}
    etcd-1               Healthy   {"health":"true"}

    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints=",," --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem endpoint health
    is healthy: successfully committed proposal: took = 9.918673ms
    is healthy: successfully committed proposal: took = 10.985279ms
    is healthy: successfully committed proposal: took = 13.422545ms

    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints=",," --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem member list --write-out=table
    +------------------+---------+------------+----------------------------+----------------------------+------------+
    |        ID        | STATUS  |    NAME    |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
    +------------------+---------+------------+----------------------------+----------------------------+------------+
    | 1d1d7edbba38c293 | started | k8s-etcd03 | | |      false |
    | 4c0cfad24e92e45f | started | k8s-etcd02 | | |      false |
    | 79cf4f0a8c3da54b | started | k8s-etcd01 | | |      false |
    +------------------+---------+------------+----------------------------+----------------------------+------------+


In this state, the IS LEADER value of all three etcd nodes is false, that is, no leader has been elected. Restart the etcd service on all three nodes:
    # systemctl restart etcd


After the restart, check again: the etcd cluster has elected a leader and the cluster status is normal!
    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl -w table --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --endpoints=",," endpoint status
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    |          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    | | 79cf4f0a8c3da54b |   3.4.9 |  1.6 MB |      true |      false |         5 |      24658 |              24658 |        |
    | | 4c0cfad24e92e45f |   3.4.9 |  1.6 MB |     false |      false |         5 |      24658 |              24658 |        |
    | | 1d1d7edbba38c293 |   3.4.9 |  1.7 MB |     false |      false |         5 |      24658 |              24658 |        |
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
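Reading the IS LEADER column by eye works for three nodes, but the check can also be scripted. The sketch below counts the rows whose IS LEADER column is true in the table output of endpoint status. The helper name count_leaders is hypothetical, and it assumes the column layout of etcd 3.4.9 shown above, where IS LEADER is the fifth column.

```shell
#!/usr/bin/env bash
# Sketch: count leaders in "etcdctl endpoint status -w table" output read from stdin.
# count_leaders is a hypothetical helper; it assumes the etcd 3.4 column order.
count_leaders() {
    awk -F'|' '
        # Data and header rows split into 11 or more fields; field 6 is IS LEADER.
        NF >= 11 {
            value = $6
            gsub(/[ \t]/, "", value)
            if (value == "true") n++
        }
        END { print n + 0 }
    '
}

# Example (not executed here):
#   ETCDCTL_API=3 etcdctl -w table ... endpoint status | count_leaders
```

A healthy cluster should report exactly 1. A result of 0 corresponds to the state before the restart, in which no leader had been elected.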


However, the k8s cluster data has actually been lost: the namespaces no longer contain any resources such as pods. The data now has to be restored from the etcd cluster backup, that is, from the etcd snapshot file taken above.
    [root@k8s-master01 ~]# kubectl get ns
    NAME              STATUS   AGE
    default           Active   9m47s
    kube-node-lease   Active   9m39s
    kube-public       Active   9m39s
    kube-system       Active   9m47s
    [root@k8s-master01 ~]# kubectl get pods -n kube-system
    No resources found in kube-system namespace.
    [root@k8s-master01 ~]# kubectl get pods --all-namespaces
    No resources found


2) etcd cluster data recovery, that is, kubernetes cluster data recovery
Before restoring etcd data, stop the kube-apiserver service on all master nodes and the etcd service on all etcd nodes:
    # systemctl stop kube-apiserver
    # systemctl stop etcd


Special note: before restoring etcd cluster data, you must delete the old data and WAL directories on all etcd nodes, namely /data/k8s/etcd/data and /data/k8s/etcd/wal. Otherwise the restore may fail (the restore command reports that the data directory already exists).
    # rm -rf /data/k8s/etcd/data && rm -rf /data/k8s/etcd/wal
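If disk space allows, a more cautious variant is to rename the old directories instead of deleting them, so they can still be inspected if the restore goes wrong. This is only a sketch: the helper name move_aside and the .bak suffix are assumptions, not part of the original procedure. The restore only requires that the target directories no longer exist.

```shell
#!/usr/bin/env bash
# Sketch: move a directory out of the way instead of deleting it.
# move_aside and the ".bak-<timestamp>" suffix are hypothetical choices.
move_aside() {
    local dir="$1"
    local stamp
    stamp="$(date +%Y%m%d%H%M%S)"
    # Nothing to do if the directory does not exist.
    if [ -e "$dir" ]; then
        mv "$dir" "$dir.bak-$stamp" || return 1
        # Print the new location so it can be logged or cleaned up later.
        echo "$dir.bak-$stamp"
    fi
}

# Example (not executed here), run on every etcd node after stopping etcd:
#   move_aside /data/k8s/etcd/data
#   move_aside /data/k8s/etcd/wal
```

Once the restored cluster has been verified, the renamed directories can be removed.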


Perform the restore operation on each etcd node:

node (etcd member k8s-etcd01)
-------------------------------------------------------
    ETCDCTL_API=3 etcdctl \
    --name=k8s-etcd01 \
    --endpoints="" \
    --cert=/etc/etcd/cert/etcd.pem \
    --key=/etc/etcd/cert/etcd-key.pem \
    --cacert=/etc/kubernetes/cert/ca.pem \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-advertise-peer-urls= \
    --initial-cluster=k8s-etcd01=,k8s-etcd02=,k8s-etcd03= \
    --data-dir=/data/k8s/etcd/data \
    --wal-dir=/data/k8s/etcd/wal \
    snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db

node (etcd member k8s-etcd02)
-------------------------------------------------------
    ETCDCTL_API=3 etcdctl \
    --name=k8s-etcd02 \
    --endpoints="" \
    --cert=/etc/etcd/cert/etcd.pem \
    --key=/etc/etcd/cert/etcd-key.pem \
    --cacert=/etc/kubernetes/cert/ca.pem \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-advertise-peer-urls= \
    --initial-cluster=k8s-etcd01=,k8s-etcd02=,k8s-etcd03= \
    --data-dir=/data/k8s/etcd/data \
    --wal-dir=/data/k8s/etcd/wal \
    snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db

node (etcd member k8s-etcd03)
-------------------------------------------------------
    ETCDCTL_API=3 etcdctl \
    --name=k8s-etcd03 \
    --endpoints="" \
    --cert=/etc/etcd/cert/etcd.pem \
    --key=/etc/etcd/cert/etcd-key.pem \
    --cacert=/etc/kubernetes/cert/ca.pem \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-advertise-peer-urls= \
    --initial-cluster=k8s-etcd01=,k8s-etcd02=,k8s-etcd03= \
    --data-dir=/data/k8s/etcd/data \
    --wal-dir=/data/k8s/etcd/wal \
    snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db
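The three restore commands differ only in --name and --initial-advertise-peer-urls, so they can be generated from one template to avoid copy-and-paste mistakes. The sketch below prints the command rather than running it. The helper name restore_command is hypothetical, and the peer URLs (https://10.0.0.x:2380) are placeholders, since the real addresses are not shown in this article. The TLS and endpoint flags are left out on the assumption that snapshot restore only reads the local snapshot file and does not contact the cluster.

```shell
#!/usr/bin/env bash
# Sketch: print the restore command for one etcd member.
# The peer URLs are hypothetical placeholders; replace them with your own.
INITIAL_CLUSTER="k8s-etcd01=https://10.0.0.1:2380,k8s-etcd02=https://10.0.0.2:2380,k8s-etcd03=https://10.0.0.3:2380"

# Usage: restore_command MEMBER_NAME PEER_URL SNAPSHOT_FILE
restore_command() {
    local name="$1" peer_url="$2" snapshot="$3"
    # "\\" inside the here-document prints a single trailing backslash.
    cat <<EOF
ETCDCTL_API=3 etcdctl \\
  --name=${name} \\
  --initial-cluster-token=etcd-cluster-0 \\
  --initial-advertise-peer-urls=${peer_url} \\
  --initial-cluster=${INITIAL_CLUSTER} \\
  --data-dir=/data/k8s/etcd/data \\
  --wal-dir=/data/k8s/etcd/wal \\
  snapshot restore ${snapshot}
EOF
}

# Example (not executed here):
#   restore_command k8s-etcd01 https://10.0.0.1:2380 \
#       /data/etcd_backup_dir/etcd-snapshot-20200820.db
```

Printing the command first makes it easy to review the values for each node before running anything against the data directory.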


Start the etcd service on all etcd nodes, one after another:
    # systemctl start etcd
    # systemctl status etcd


Check the status of the etcd cluster (as shown below, the etcd cluster has successfully elected a leader):
    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl --endpoints=",," --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --cacert=/etc/kubernetes/cert/ca.pem endpoint health
    is healthy: successfully committed proposal: took = 12.837393ms
    is healthy: successfully committed proposal: took = 13.306671ms
    is healthy: successfully committed proposal: took = 13.602805ms

    [root@k8s-master01 ~]# ETCDCTL_API=3 etcdctl -w table --cacert=/etc/kubernetes/cert/ca.pem --cert=/etc/etcd/cert/etcd.pem --key=/etc/etcd/cert/etcd-key.pem --endpoints=",," endpoint status
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    |          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    | | 79cf4f0a8c3da54b |   3.4.9 |  9.0 MB |     false |      false |         2 |         13 |                 13 |        |
    | | 4c0cfad24e92e45f |   3.4.9 |  9.0 MB |      true |      false |         2 |         13 |                 13 |        |
    | | 5f70664d346a6ebd |   3.4.9 |  9.0 MB |     false |      false |         2 |         13 |                 13 |        |
    +----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+


Then start the kube-apiserver service on all master nodes, one after another:
    # systemctl start kube-apiserver
    # systemctl status kube-apiserver


Check the kubernetes cluster status:
    [root@k8s-master01 ~]# kubectl get cs
    NAME                 STATUS      MESSAGE                                  ERROR
    controller-manager   Healthy     ok
    scheduler            Healthy     ok
    etcd-2               Unhealthy   HTTP probe failed with statuscode: 503
    etcd-1               Unhealthy   HTTP probe failed with statuscode: 503
    etcd-0               Unhealthy   HTTP probe failed with statuscode: 503

Because the etcd service has just been restarted, you may need to run the command several times before the status returns to normal:
    [root@k8s-master01 ~]# kubectl get cs
    NAME                 STATUS    MESSAGE             ERROR
    controller-manager   Healthy   ok
    scheduler            Healthy   ok
    etcd-2               Healthy   {"health":"true"}
    etcd-0               Healthy   {"health":"true"}
    etcd-1               Healthy   {"health":"true"}
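Instead of re-running kubectl get cs by hand, the repeated check can be wrapped in a small retry loop. The sketch below is one way to do it. The helper names retry_until_ok and cluster_healthy are hypothetical, as are the attempt count and delay in the example.

```shell
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or the attempts run out.
# Usage: retry_until_ok ATTEMPTS DELAY_SECONDS COMMAND...
retry_until_ok() {
    local attempts="$1" delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Hypothetical health check: fails if kubectl itself fails or if any
# component is reported as Unhealthy.
cluster_healthy() {
    local out
    out="$(kubectl get cs 2>/dev/null)" || return 1
    if printf '%s\n' "$out" | grep -q Unhealthy; then
        return 1
    fi
    return 0
}

# Example (not executed here): up to 10 attempts, 5 seconds apart.
#   retry_until_ok 10 5 cluster_healthy && echo "cluster is healthy"
```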


Check the kubernetes resources:
    [root@k8s-master01 ~]# kubectl get ns
    NAME              STATUS   AGE
    default           Active   7d4h
    kevin             Active   5d18h
    kube-node-lease   Active   7d4h
    kube-public       Active   7d4h
    kube-system       Active   7d4h

    [root@k8s-master01 ~]# kubectl get pods --all-namespaces
    NAMESPACE     NAME                                       READY   STATUS              RESTARTS   AGE
    default       dnsutils-ds-22q87                          0/1     ContainerCreating   171        7d3h
    default       dnsutils-ds-bp8tm                          0/1     ContainerCreating   138        5d18h
    default       dnsutils-ds-bzzqg                          0/1     ContainerCreating   138        5d18h
    default       dnsutils-ds-jcvng                          1/1     Running             171        7d3h
    default       dnsutils-ds-xrl2x                          0/1     ContainerCreating   138        5d18h
    default       dnsutils-ds-zjg5l                          1/1     Running             0          7d3h
    default       kevin-t-84cdd49d65-ck47f                   0/1     ContainerCreating   0          2d2h
    default       nginx-ds-98rm2                             1/1     Running             2          7d3h
    default       nginx-ds-bbx68                             1/1     Running             0          7d3h
    default       nginx-ds-kfctv                             0/1     ContainerCreating   1          5d18h
    default       nginx-ds-mdcd9                             0/1     ContainerCreating   1          5d18h
    default       nginx-ds-ngqcm                             1/1     Running             0          7d3h
    default       nginx-ds-tpcxs                             0/1     ContainerCreating   1          5d18h
    kevin         nginx-ingress-controller-797ffb479-vrq6w   0/1     ContainerCreating   0          5d18h
    kevin         test-nginx-7d4f96b486-qd4fl                0/1     ContainerCreating   0          2d1h
    kevin         test-nginx-7d4f96b486-qfddd                0/1     Running             0          2d1h
    kube-system   calico-kube-controllers-578894d4cd-9rp4c   1/1     Running             1          7d3h
    kube-system   calico-node-d7wq8                          0/1     PodInitializing     1          7d3h

After the etcd cluster data is restored, the pods gradually return to the Running state. At this point, the entire kubernetes cluster has been restored from the etcd backup.

3, Final summary

A Kubernetes cluster backup is mainly a backup of the etcd cluster. During recovery, the main concern is the order of the steps:
Stop kube-apiserver --> stop etcd --> restore data --> start etcd --> start kube-apiserver
Special attention:
  • When backing up an etcd cluster, only one etcd node's data needs to be backed up and then synchronized to the other nodes.
  • When restoring etcd data, the backup taken on any one of the nodes can be used for the restore on all nodes.
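The order of the steps can be written down as a dry-run plan that only prints the commands for review. This is a sketch under assumptions: the helper names are hypothetical, ssh as root is assumed to work, and etcd and kube-apiserver are assumed to run on the same three hosts (k8s-master01 to k8s-master03), which may not match your layout.

```shell
#!/usr/bin/env bash
# Sketch: print the recovery steps in order. Nothing is executed.
# Assumes etcd and kube-apiserver share the same three hosts.
NODES="${NODES:-k8s-master01 k8s-master02 k8s-master03}"

# Print one ssh command per node for the given remote command.
on_all_nodes() {
    local node
    for node in $NODES; do
        echo "ssh root@${node} '$*'"
    done
}

recovery_plan() {
    on_all_nodes systemctl stop kube-apiserver
    on_all_nodes systemctl stop etcd
    on_all_nodes rm -rf /data/k8s/etcd/data /data/k8s/etcd/wal
    echo "# on each node: etcdctl snapshot restore with that node's own --name and peer URL"
    on_all_nodes systemctl start etcd
    on_all_nodes systemctl start kube-apiserver
}

# Example (not executed here):
#   recovery_plan
```

Keeping the plan as printed text makes it possible to check the ordering before any service is stopped.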

Reposted from:

K8S cluster disaster recovery environment deployment - scattered flashiness - blog Park

Posted by whitchman on Wed, 01 Dec 2021 10:13:38 -0800