How to recover lost data after PostgreSQL failover

Keywords: SQL Database PostgreSQL

1. Background

PostgreSQL HA solutions are generally built on its native streaming replication, which supports both synchronous and asynchronous replication modes.
Although synchronous replication offers the strongest protection against data loss, it usually requires at least three machines so that at least two standby nodes are available.
Therefore, many primary/standby HA clusters use asynchronous replication.
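For reference, the synchronous mode mentioned above would typically be configured on the primary along the lines of the following minimal sketch (the standby names standby1 and standby2 are illustrative):

# commit waits until at least one of the two named standbys confirms receipt of the WAL
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
synchronous_commit = on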

Under asynchronous replication, when the primary goes down and a standby is promoted to become the new primary, a small amount of recently committed data may be lost.
If that lost data matters to the business, can it be retrieved from the database?

This article describes how to retrieve it.

2. Principle

Basic process

  1. When the standby is promoted to the new primary, a new timeline is generated. The starting point of this new timeline is called the fork point.
  2. After the old primary is repaired, parse the WAL files on it starting from the fork point, and convert all data changes made by committed transactions into SQL.

    The premise is that the old primary's disk is not damaged and the instance can start normally. The most common failure in production is a host-level crash, from which the instance can usually be recovered after the machine restarts.
  3. The business side reviews these SQL statements manually and then applies them to backfill the lost data.

To parse complete SQL statements from WAL records, it is best to set the WAL level to logical and to have a primary key on the table.
In that case, for the DML statements we care about, the WAL records contain enough information to reconstruct the data changes as SQL.
The details are as follows:

  • INSERT
    The WAL record contains the complete tuple data. Combined with the table definition in the system catalogs, the SQL can be reconstructed.
  • UPDATE
    The WAL record contains the complete post-update tuple data. Whether it also contains the pre-update tuple depends on the following cases.
    - The table has the replica identity full property
      The WAL record contains the complete pre-update tuple data.
    - The table has a replica identity key (or primary key) and the value of that key has changed
      The WAL record contains the replica identity key (or primary key) field values of the pre-update tuple.
    - Other
      The WAL record does not contain the pre-update tuple data.
  • DELETE
    The WAL record may contain information about the deleted tuple, depending on the following.
    - The table has the replica identity full property
      The WAL record contains the complete deleted tuple data.
    - The table has a replica identity key (or primary key)
      The WAL record contains the replica identity key (or primary key) field values of the deleted tuple.
    - Other
      The WAL record does not contain the deleted tuple data.

If wal_level is not logical or the table has no primary key, the pre-change tuple can still be recovered from historical full-page images (FPI) in the WAL.
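As an illustration, the following statements (a minimal sketch; the table name tb1 matches the demonstration later, and changing wal_level requires a server restart) show how to raise the WAL level and set a table's replica identity:

-- write logical information into the WAL (takes effect after a restart)
ALTER SYSTEM SET wal_level = 'logical';

-- log the complete old tuple image for UPDATE/DELETE on this table
ALTER TABLE tb1 REPLICA IDENTITY FULL;

-- check the current setting: 'd' = default (primary key), 'f' = full
SELECT relname, relreplident FROM pg_class WHERE relname = 'tb1';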

Therefore, in principle, parsing SQL back out of the WAL is feasible, and there are already open source tools that support this.

3. Tools

Use a modified version of the walminer tool to parse the WAL files.

walminer is a useful tool for extracting both the original SQL and the corresponding undo SQL from WAL files.
However, the current upstream walminer still has some problems supporting this scenario, and its WAL parsing speed is very slow.

The modified walminer branch adds parsing from a specified LSN position, fixes several bugs, and speeds up WAL parsing by roughly 10 times.
Some of these modifications are expected to be merged into the walminer main branch later.

4. Prerequisites

  1. The WAL files after the fork point have not been removed

    Under normal circumstances this is already the case. You can also set a reasonably large wal_keep_segments to retain more WAL segments in the pg_wal directory, for example:
    wal_keep_segments=100
    If WAL archiving is configured, the WAL files in the archive directory can also be used. (See the example query after this list for checking that the fork-point WAL is still available.)
  2. The WAL level is set to logical

    wal_level=logical
  3. The table has a primary key, a replica identity key, or replica identity full
  4. The table definition has not changed since the fork point

Note: Conditions 2 and 3 can be relaxed, but in that case all WAL from the last checkpoint before the fork point onward must be retained and parsed.
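For condition 1, a quick way to check whether the WAL segment containing the fork point is still available (a sketch; the LSN 0/EF76440 is the fork point from the demonstration below):

-- map the fork-point LSN to its WAL segment file name,
-- then verify that this file still exists in pg_wal or in the WAL archive
SELECT pg_walfile_name('0/EF76440');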

5. Usage demonstration

5.1 Environment preparation

Build a primary/standby HA cluster with asynchronous replication.

Machines:

  • Node1 (primary)
  • Node2 (standby)

Software:

  • PostgreSQL 10

Parameters:

wal_level=logical

5.2 Install the walminer plugin

Download the source code of the modified walminer plugin from the following location

Install walminer on both the primary and standby nodes

cd walminer
make && make install

Create the walminer extension in the primary database

create extension walminer;
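To confirm that the extension is installed, you can query the system catalog (a simple sanity check, not part of the original procedure):

select extname, extversion from pg_extension where extname = 'walminer';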

5.3 Create a test table

create table tb1(id int primary key, c1 text);
insert into tb1 select id,'xxx' from generate_series(1,10000) id;

5.4 Simulate business load

Prepare test script

test.sql

\set id1 random(1,10000)
\set id2 random(1,10000)

insert into tb1 values(:id1,'yyy') on conflict (id)
  do update set c1=excluded.c1;

delete from tb1 where id=:id2;

Run the test script against the primary database to simulate business load

pgbench -c 8 -j 8 -T 1000 -f test.sql

5.5 Simulate primary database downtime

Kill the PostgreSQL processes on the primary

killall -9 postgres

5.6 Promote the standby to the new primary

Perform the promotion on the standby

pg_ctl promote

View the timeline fork point recorded during the switchover. The second field in the history file (0/EF76440 here) is the fork-point LSN, which will be passed to wal2sql() later.

[postgres@host2 ~]$tail -1 /pgsql/data10/pg_wal/00000002.history
1    0/EF76440    no recovery target specified

5.7 Retrieve the lost data from the old primary

After the old primary can be started again, call the wal2sql() function on it to extract all SQL from transactions that were committed on the old primary after the fork point.

postgres=# select xid,timestamptz,op_text from wal2sql(NULL,'0/EF76440') ;
NOTICE:  Get data dictionary from current database.
NOTICE:  Wal file "/pgsql/data10/pg_wal/00000001000000000000000F" is not match with datadictionary.
NOTICE:  Change Wal Segment To:/pgsql/data10/pg_wal/00000001000000000000000C
NOTICE:  Change Wal Segment To:/pgsql/data10/pg_wal/00000001000000000000000D
NOTICE:  Change Wal Segment To:/pgsql/data10/pg_wal/00000001000000000000000E
  xid   |          timestamptz          |                           op_text                           
--------+-------------------------------+-------------------------------------------------------------
 938883 | 2020-03-31 17:12:10.331487+08 | DELETE FROM "public"."tb1" WHERE "id"=7630;
 938884 | 2020-03-31 17:12:10.33149+08  | INSERT INTO "public"."tb1"("id", "c1") VALUES(5783, 'yyy');
 938885 | 2020-03-31 17:12:10.331521+08 | DELETE FROM "public"."tb1" WHERE "id"=3559;
 938886 | 2020-03-31 17:12:10.331586+08 | UPDATE "public"."tb1" SET "c1" = 'yyy' WHERE "id"=7585;
 938887 | 2020-03-31 17:12:10.331615+08 | UPDATE "public"."tb1" SET "c1" = 'yyy' WHERE "id"=973;
 938888 | 2020-03-31 17:12:10.331718+08 | INSERT INTO "public"."tb1"("id", "c1") VALUES(7930, 'yyy');
 938889 | 2020-03-31 17:12:10.33173+08  | UPDATE "public"."tb1" SET "c1" = 'yyy' WHERE "id"=1065;
 938890 | 2020-03-31 17:12:10.331741+08 | INSERT INTO "public"."tb1"("id", "c1") VALUES(2627, 'yyy');
 938891 | 2020-03-31 17:12:10.331766+08 | UPDATE "public"."tb1" SET "c1" = 'yyy' WHERE "id"=1012;
 938892 | 2020-03-31 17:12:10.33178+08  | INSERT INTO "public"."tb1"("id", "c1") VALUES(4740, 'yyy');
 938893 | 2020-03-31 17:12:10.331814+08 | DELETE FROM "public"."tb1" WHERE "id"=4275;
 938894 | 2020-03-31 17:12:10.331892+08 | UPDATE "public"."tb1" SET "c1" = 'yyy' WHERE "id"=8651;
 938895 | 2020-03-31 17:12:10.33194+08  | UPDATE "public"."tb1" SET "c1" = 'yyy' WHERE "id"=9313;
 938896 | 2020-03-31 17:12:10.331967+08 | DELETE FROM "public"."tb1" WHERE "id"=3251;
 938897 | 2020-03-31 17:12:10.332001+08 | DELETE FROM "public"."tb1" WHERE "id"=2968;
 938898 | 2020-03-31 17:12:10.332025+08 | INSERT INTO "public"."tb1"("id", "c1") VALUES(5331, 'yyy');
 938899 | 2020-03-31 17:12:10.332042+08 | UPDATE "public"."tb1" SET "c1" = 'yyy' WHERE "id"=3772;
 938900 | 2020-03-31 17:12:10.332048+08 | INSERT INTO "public"."tb1"("id", "c1") VALUES(94, 'yyy');
(18 rows)

Time: 2043.380 ms (00:02.043)

The output of wal2sql() above is ordered by the transactions' commit order in the WAL. These SQL statements can be written to a file and handed to the business side for review and data repair.
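For example, in psql the statements can be written to a file with the \o meta-command (a sketch; the output path is illustrative):

\t on
\o /tmp/lost_sql.txt
select op_text from wal2sql(NULL,'0/EF76440');
\o
\t off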

5.8 Restore the old primary

You can use pg_rewind to quickly roll back the diverged data on the old primary, and then rebuild replication so that it becomes a standby of the new primary, restoring HA.
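A rough sketch of this step (the connection string is illustrative, the data directory matches the one shown earlier; note that pg_rewind requires wal_log_hints=on or data checksums to have been enabled on the old primary):

# stop the old primary if it is running, then rewind it against the new primary
pg_rewind --target-pgdata=/pgsql/data10 \
          --source-server='host=node2 port=5432 user=postgres dbname=postgres'
# afterwards, configure recovery (recovery.conf in PostgreSQL 10) to follow the new primary and start the instance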

6. Summary

With the help of the modified walminer, you can easily and quickly retrieve data lost during a PostgreSQL failover.

walminer can generate not only the forward SQL but also the reverse undo SQL, which provides a flashback capability.
Refer to the open source project's documentation for how to generate undo SQL and its usage restrictions.

However, when used for flashback, walminer still needs further improvement; the most obvious issue is parsing speed.
Fully reconstructing undo SQL from WAL records requires replica identity full, and many systems do not enable it for every table.
Without replica identity full, the generation of undo SQL has to rely on historical FPIs.

Although the modified walminer is many times faster, parsing dozens of GB of WAL files and collecting all historical FPIs still consumes a significant amount of time and resources.
