
Keywords: Database ansible MySQL Session curl

An article to play with TiDB disaster recovery

1, Background

High availability is another major feature of TiDB. All three components (TiDB, TiKV, and PD) can tolerate partial instance failure without affecting the availability of the whole cluster. The following describes the availability of each component, the consequences of a single instance failure, and how to recover.

TiDB
TiDB is stateless. It is recommended to deploy at least two instances, with a load balancing component in front providing the service. When a single instance fails, the sessions on that instance are affected; from the application's perspective, a single request fails, and service resumes after reconnecting. After a single instance fails, you can restart it or deploy a new instance.
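To check whether a single TiDB instance is still alive, its status port can be probed; a minimal sketch, assuming the default status port 10080 on one of the TiDB nodes used later in this article (192.168.1.2):

curl http://192.168.1.2:10080/status
# returns a small JSON document with the current connection count, version, and git_hash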

PD
PD is a cluster that maintains data consistency through the Raft protocol. When a single instance fails, if it is not the Raft leader, the service is not affected at all; if it is the Raft leader, a new leader is elected and the service recovers automatically. PD cannot provide external services during the election, which takes about 3 seconds. It is recommended to deploy at least three PD instances. After a single instance fails, restart it or add a new instance.
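To see which PD instance currently holds Raft leadership, pd-ctl's member subcommand can be queried; a minimal sketch using the pd-ctl binary and PD address that appear later in this article:

/data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -u http://192.168.1.1:2379 -d member leader show
# prints the name and URLs of the current PD leader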

TiKV
TiKV is a cluster that maintains data consistency through the Raft protocol (the number of replicas is configurable; three replicas are kept by default), with PD handling load balancing and scheduling. When a single node fails, all Regions stored on that node are affected. For Regions whose Leader was on the node, service is interrupted until a new Leader is elected; for Regions that only had a Follower there, service is not affected. When a TiKV node fails and cannot be recovered within a period of time (30 minutes by default), PD migrates the data on it to other TiKV nodes.
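The 30-minute window mentioned above corresponds to PD's max-store-down-time scheduling parameter; as a sketch, it can be inspected and tuned through pd-ctl (the value shown is just the default):

/data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -u http://192.168.1.1:2379 -d config show | grep max-store-down-time
# how long PD waits before it starts re-replicating data away from a Down store
/data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -u http://192.168.1.1:2379 -d config set max-store-down-time 30m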

2, Architecture

wtidb28.add.shbt.qihoo.net  192.168.1.1  TiDB/PD/pump/prometheus/grafana/CCS
wtidb27.add.shbt.qihoo.net  192.168.1.2  TiDB
wtidb26.add.shbt.qihoo.net  192.168.1.3  TiDB
wtidb22.add.shbt.qihoo.net  192.168.1.4  TiKV
wtidb21.add.shbt.qihoo.net  192.168.1.5  TiKV
wtidb20.add.shbt.qihoo.net  192.168.1.6  TiKV
wtidb19.add.shbt.qihoo.net  192.168.1.7  TiKV
wtidb18.add.shbt.qihoo.net  192.168.1.8  TiKV
wtidb17.add.shbt.qihoo.net  192.168.1.9  TiFlash
wtidb16.add.shbt.qihoo.net  192.168.1.10  TiFlash

The cluster uses an architecture of 3 TiDB nodes, 5 TiKV nodes, and 2 TiFlash nodes to test disaster recovery. TiFlash was deployed after the cluster itself was deployed. The version is 3.1.0 GA.
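As a side note, whether any table actually has TiFlash replicas built can be checked from information_schema; a minimal sketch through the MySQL client (connection details as used later in this article), not something the rest of the test depends on:

/usr/local/mysql56/bin/mysql -u mydba -h 10.217.36.146 -P 43000 -p -e "select * from information_schema.tiflash_replica;"
# rows appear only for tables configured with ALTER TABLE ... SET TIFLASH REPLICA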

3, Two-node downtime test

The cluster keeps three replicas by default. If any two of the five TiKV machines go down, in theory there are three situations: two of a Region's three replicas happen to be on the two failed machines; only one of a Region's three replicas is on the two failed machines; or none of a Region's replicas are on the failed machines. This time we test with two TiKV nodes, wtidb21 and wtidb22, going down.
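As a rough way to tell which of the three situations each Region is in, the same pd-ctl --jq trick used below can count how many of a Region's replicas sit on the failed stores (store IDs 1 and 4, as determined further down):

/data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -u http://192.168.1.1:2379 -d region --jq='.regions[] | {id: .id, lost: ([.peers[].store_id] | map(select(. == 1 or . == 4)) | length)} | select(.lost > 0)'
# lost == 1: the Region still has a Raft majority and recovers on its own; lost >= 2: the majority is gone and unsafe recovery is needed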

Let's take a look at the test table before the outage:

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (0.91 sec)

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily force index (idx_day_ver_ch);
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (0.98 sec)

After the two nodes go down at the same time:

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;               
ERROR 9002 (HY000): TiKV server timeout 
mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily force index (idx_day_ver_ch);
ERROR 9005 (HY000): Region is unavailable

Take a look at the store IDs of the two down nodes:

/data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -i -u http://192.168.1.1:2379
» store

From the output we can see they are stores 1 and 4.
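If the store list is long, a --jq filter makes the mapping between store ID, address, and state easier to read; a small sketch based on the JSON layout of the store command shown later in this article:

/data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -u http://192.168.1.1:2379 -d store --jq='.stores[].store | {id, address, state_name}'
# the entries whose state_name is "Down" are the failed stores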

Check the Regions that have more than half of their replicas on the failed nodes:

[tidb@wtidb28 bin]$ /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl  -u http://192.168.1.1:2379  -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,4) then . else empty end) | length>=$total-length)}'
{"id":18,"peer_stores":[4,6,1]}
{"id":405,"peer_stores":[7,4,1]}
{"id":120,"peer_stores":[4,1,6]}
{"id":337,"peer_stores":[4,5,1]}
{"id":128,"peer_stores":[4,1,6]}
{"id":112,"peer_stores":[1,4,6]}
{"id":22,"peer_stores":[4,6,1]}
{"id":222,"peer_stores":[7,4,1]}
{"id":571,"peer_stores":[4,6,1]}

Execute the stop script on the remaining healthy TiKV nodes:

ps -ef|grep tikv
sh /data1/tidb/deploy/scripts/stop_tikv.sh
ps -ef|grep tikv

Change the owner of tikv-ctl to the tidb user and copy it to the tidb directory:
chown -R tidb. /home/helei/tikv-ctl

In this case, if there are too many affected Regions, repairing them Region by Region is slow and cumbersome, so using --all-regions is convenient. However, it may also strip peers from Regions that did not lose a majority; that is, a Region may have had only one replica on the failed machines, yet it can still end up keeping only a single replica in the cluster. A sketch of the per-Region form is shown below.
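For reference, a sketch of the per-Region form of the same command (Region IDs taken from the list above; whether this tikv-ctl build accepts -r with a comma-separated list is an assumption):

./tikv-ctl --db /data1/tidb/deploy/data/db unsafe-recover remove-fail-stores -s 1,4 -r 18,405,120
# only the listed Regions have stores 1 and 4 removed from their Raft configuration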
The following operation must be performed on all surviving nodes with TiKV stopped, and then execute:

[tidb@wtidb20 tidb]$ ./tikv-ctl --db /data1/tidb/deploy/data/db unsafe-recover remove-fail-stores -s 1,4 --all-regions                                                                    
removing stores [1, 4] from configrations...
success

Restart the PD nodes:

ansible-playbook stop.yml --tags=pd
If PD is stopped at this point, you cannot log in to the database:
[helei@db-admin01 ~]$ /usr/local/mysql56/bin/mysql -u mydba -h 10.217.36.146 -P 43000 -p4c41b96c7687f66d
...
...
...
ansible-playbook start.yml --tags=pd

Restart the surviving TiKV nodes:
sh /data1/tidb/deploy/scripts/start_tikv.sh

Check for Regions that have no Leader:
[tidb@wtidb28 bin]$ /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -u http://192.168.1.1:2379 -d region --jq '.regions[]|select(has("leader")|not)|{id: .id,peer_stores: [.peers[].store_id]}'
No leaderless Regions are found here. They only appear when, with 3 replicas, three or more machines go down at the same time and all replicas of some Regions happen to be on those machines. The previous step ran unsafe-recover with --all-regions, and pd-ctl shows no Regions without a Leader; empty Regions only need to be created when the data behind a table query has actually been lost. In my case, as long as at least one replica survives, it does not matter whether unsafe-recover is run with --all-regions or with specific Region IDs, and the following steps are not needed.

After the cluster starts normally, pd-ctl can be used to inspect the Regions listed earlier. In theory, after unsafe-recover --all-regions, the single surviving replica of each affected Region becomes the Leader, and the remaining TiKV nodes then replicate two new Followers to other stores through the Raft protocol.
For example, take Region 18 in this case:
{"id":18,"peer_stores":[4,6,1]}
Through pd-ctl we can see that two of its peers were on the damaged stores 1 and 4. After executing unsafe-recover remove-fail-stores --all-regions, stores 1 and 4 are removed from its configuration, the remaining peer on store 6 becomes the Leader, and the Raft protocol then replicates new Followers onto stores 5 and 7, restoring three replicas so that the cluster can start successfully.

» region 18
{
  "id": 18,
  "start_key": "7480000000000000FF0700000000000000F8",
  "end_key": "7480000000000000FF0900000000000000F8",
  "epoch": {
    "conf_ver": 60,
    "version": 4
  },
  "peers": [
    {
      "id": 717,
      "store_id": 6
    },
    {
      "id": 59803,
      "store_id": 7
    },
    {
      "id": 62001,
      "store_id": 5
    }
  ],
  "leader": {
    "id": 717,
    "store_id": 6
  },
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 1,
  "approximate_keys": 0
}

If only two machines go down at the same time, the recovery ends here. If only one machine goes down, no manual handling is needed at all.

Check the data first. It is now fine; the previous steps have recovered it successfully:

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (0.86 sec)

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily force index (idx_day_ver_ch);
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (0.98 sec)

Here is a side note: when I later brought the failed stores 1 and 4 back, no new data had been written to the cluster in the meantime. Originally store 6 acted as the Leader with the newly created peers on stores 5 and 7 as Followers; after the recovery, the peers on 5 and 7 were removed and 1 and 4 became Followers again, so Region 18 once more has store_ids 1, 4, and 6.

4, Three-node downtime test

If three or more machines go down at the same time, the leaderless-Region check above will return results.
This time we take the following three nodes down:

wtidb22.add.shbt.qihoo.net  192.168.1.4  TiKV
wtidb21.add.shbt.qihoo.net  192.168.1.5  TiKV
wtidb20.add.shbt.qihoo.net  192.168.1.6  TiKV

First, stop TiKV on all the healthy nodes; in this case, wtidb19 and wtidb18.

Take a look at the store IDs of the three down nodes:

/data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -i -u http://192.168.1.1:2379
» store

From the output we can see they are stores 1, 4, and 5.

Check the Regions that have more than half of their replicas on the failed nodes:

[tidb@wtidb28 bin]$  /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl  -u http://192.168.1.1:2379  -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,4,5) then . else empty end) | length>=$total-length)}'
{"id":156,"peer_stores":[1,4,6]}
{"id":14,"peer_stores":[6,1,4]}
{"id":89,"peer_stores":[5,4,1]}
{"id":144,"peer_stores":[1,4,6]}
{"id":148,"peer_stores":[6,1,4]}
{"id":152,"peer_stores":[7,1,4]}
{"id":260,"peer_stores":[6,1,4]}
{"id":480,"peer_stores":[7,1,4]}
{"id":132,"peer_stores":[5,4,6]}
{"id":22,"peer_stores":[6,1,4]}
{"id":27,"peer_stores":[4,1,6]}
{"id":37,"peer_stores":[1,4,6]}
{"id":42,"peer_stores":[5,4,6]}
{"id":77,"peer_stores":[5,4,6]}
{"id":116,"peer_stores":[5,4,6]}
{"id":222,"peer_stores":[6,1,4]}
{"id":69,"peer_stores":[5,4,6]}
{"id":73,"peer_stores":[7,4,1]}
{"id":81,"peer_stores":[5,4,1]}
{"id":128,"peer_stores":[6,1,4]}
{"id":2,"peer_stores":[5,6,4]}
{"id":10,"peer_stores":[7,4,1]}
{"id":18,"peer_stores":[6,1,4]}
{"id":571,"peer_stores":[6,5,4]}
{"id":618,"peer_stores":[7,1,4]}
{"id":218,"peer_stores":[6,5,1]}
{"id":47,"peer_stores":[1,4,6]}
{"id":52,"peer_stores":[6,1,4]}
{"id":57,"peer_stores":[4,7,1]}
{"id":120,"peer_stores":[6,1,4]}
{"id":179,"peer_stores":[5,1,4]}
{"id":460,"peer_stores":[5,7,1]}
{"id":93,"peer_stores":[6,1,4]}
{"id":112,"peer_stores":[6,1,4]}
{"id":337,"peer_stores":[5,6,4]}
{"id":400,"peer_stores":[5,7,1]}

Now only two TiKV nodes remain:

wtidb19.add.shbt.qihoo.net  192.168.1.7  TiKV
wtidb18.add.shbt.qihoo.net  192.168.1.8  TiKV

The following operation must be performed on all surviving nodes (wtidb19 and wtidb18 in this case) with TiKV stopped, and then execute:

[tidb@wtidb19 tidb]$ ./tikv-ctl --db /data1/tidb/deploy/data/db unsafe-recover remove-fail-stores -s 1,4,5 --all-regions                                                                 
removing stores [1, 4, 5] from configrations...
success

Restart the PD nodes:

ansible-playbook stop.yml --tags=pd
ansible-playbook start.yml --tags=pd

Restart the surviving TiKV nodes:
sh /data1/tidb/deploy/scripts/start_tikv.sh

Check the Regions that have no Leader. As you can see here, all replicas of these Regions were on the three damaged machines (stores 1, 4, and 5), so these Regions can only be discarded and their data cannot be recovered:

[tidb@wtidb28 tidb-ansible-3.1.0]$ /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl  -u http://192.168.1.1:2379 -d region --jq '.regions[]|select(has("leader")|not)|{id: .id,peer_stores: [.peers[].store_id]}'
{"id":179,"peer_stores":[5,1,4]}
{"id":81,"peer_stores":[5,4,1]}
{"id":89,"peer_stores":[5,4,1]}

Confirm which table a Region belongs to by its Region ID:

[tidb@wtidb28 tidb-ansible-3.1.0]$ curl http://192.168.1.1:10080/regions/179
{
 "region_id": 179,
 "start_key": "dIAAAAAAAAA7X2mAAAAAAAAAAwOAAAAAATQXJwE4LjQuMAAAAPwBYWxsAAAAAAD6A4AAAAAAAqs0",
 "end_key": "dIAAAAAAAAA7X3KAAAAAAAODBA==",
 "frames": [
  {
   "db_name": "hl",
   "table_name": "rpt_qdas_show_shoujizhushou_channelver_mix_daily(p201910)",
   "table_id": 59,
   "is_record": false,
   "index_name": "key2",
   "index_id": 3,
   "index_values": [
    "20191015",
    "8.4.0",
    "all",
    "174900"
   ]
  },
  {
   "db_name": "hl",
   "table_name": "rpt_qdas_show_shoujizhushou_channelver_mix_daily(p201910)",
   "table_id": 59,
   "is_record": true,
   "record_id": 230148
  }
 ]
}
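Going the other way, from a table to its Regions, TiDB's HTTP API can list a table's Regions; a sketch assuming the same status port 10080 and that the API handles this partitioned table in this form:

curl http://192.168.1.1:10080/tables/hl/rpt_qdas_show_shoujizhushou_channelver_mix_daily/regions
# lists the record and index Regions of the table, which can be cross-checked against the lost Region IDs above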

If you want to view the cluster status at this point:

» store
{
  "count": 5,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "192.168.1.4:20160",
        "version": "3.1.0",
        "state_name": "Down"
      },
      "status": {
        "leader_weight": 1,
        "region_count": 3,
        "region_weight": 1,
        "start_ts": "1970-01-01T08:00:00+08:00"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "192.168.1.5:20160",
        "version": "3.1.0",
        "state_name": "Down"
      },
      "status": {
        "leader_weight": 1,
        "region_count": 3,
        "region_weight": 1,
        "start_ts": "1970-01-01T08:00:00+08:00"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "192.168.1.6:20160",
        "version": "3.1.0",
        "state_name": "Down"

Monitoring shows no data.

Queries against the database still hang:

mysql> use hl
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;

Create empty Regions to resolve the unavailable state. This command requires PD and TiKV to be stopped.
Here, -r must be given one Region ID at a time, otherwise it reports an error:

[tidb@wtidb19 tidb]$ ./tikv-ctl --db /data1/tidb/deploy/data/db recreate-region -p 192.168.1.1:2379 -r 89,179,81
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParseIntError { kind: InvalidDigit }', src/libcore/result.rs:1188:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
Aborted

./tikv-ctl --db /data1/tidb/deploy/data/db recreate-region -p '192.168.1.1:2379' -r 89
./tikv-ctl --db /data1/tidb/deploy/data/db recreate-region -p '192.168.1.1:2379' -r 81
./tikv-ctl --db /data1/tidb/deploy/data/db recreate-region -p '192.168.1.1:2379' -r 179
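If many Regions have to be recreated, a small loop avoids running the command by hand for each one; a sketch under the same assumptions (PD and TiKV stopped, same Region IDs):

for r in 89 81 179; do
  ./tikv-ctl --db /data1/tidb/deploy/data/db recreate-region -p '192.168.1.1:2379' -r $r
done
# one Region per invocation, working around the InvalidDigit parse error shown above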

After starting PD and TiKV, run the check again:
[tidb@wtidb28 tidb-ansible-3.1.0]$ /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -u http://192.168.1.1:2379 -d region --jq '.regions[]|select(has("leader")|not)|{id: .id,peer_stores: [.peers[].store_id]}'
No results are returned, which is as expected.

Querying again, you can see that data has been lost, because we discarded several Regions (81, 89, and 179):

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;
+----------+
| count(*) |
+----------+
|  1262523 |
+----------+
1 row in set (0.92 sec)

You can see that the index data was not in the lost Regions (81, 89, and 179), so counting through the index still returns the same result as before:

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily force index (idx_day_ver_ch);
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (1.01 sec)
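Since the row data and the secondary index now disagree, a table consistency check should flag the problem; a sketch using TiDB's ADMIN CHECK TABLE through the MySQL client (connection details as in the earlier login example):

/usr/local/mysql56/bin/mysql -u mydba -h 10.217.36.146 -P 43000 -p -e "use hl; admin check table rpt_qdas_show_shoujizhushou_channelver_mix_daily;"
# expected to return an error here, since the index still contains entries whose rows were lost with the discarded Regions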

At this point, the test is complete

5, Summary

After reading this article, I'm sure you will no longer be at a loss when it comes to recovering data after a multi-node TiDB outage. Under normal circumstances, it is rare for multiple machines in a cluster to go down at the same time. If only one machine goes down, it does not affect the operation of the cluster, which handles it automatically: when a TiKV node fails and cannot be recovered within a period of time (30 minutes by default), PD migrates the data on it to other TiKV nodes. But if two, three, or even more nodes go down at the same time, then having read this article, I believe you will not be confused! ~
