Synchronization Principles of MongoDB Replica Sets

Keywords: Database, MongoDB, Replica Set, Shard

The official documentation covers MongoDB's synchronization mechanism only briefly, and there is not much material online either. The notes below were put together from the official documentation, online material, and test logs.
Because each MongoDB shard is itself a replica set, understanding replica set synchronization is enough to cover sharded clusters as well.

I. Initial Sync

Generally speaking, MongoDB replica set synchronization consists of two steps:

1. Initial Sync: full data synchronization
2. Replication: continuous oplog synchronization

First, initial sync copies the full data set; replication then continuously replays the oplog from the Primary to apply incremental changes. When full synchronization completes, the member's state changes from STARTUP2 to SECONDARY.
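This state transition can be watched from the mongo shell with the standard rs.status() output; a minimal example:

// Print each member's replication state; a node running initial sync
// reports STARTUP2 and switches to SECONDARY once the full sync is done.
rs.status().members.forEach(function (m) {
    print(m.name + " -> " + m.stateStr);
});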

1.1 The Initial Sync Process

1) When full synchronization starts, record the latest timestamp t1 on the sync source
2) Copy all collection data and build the indexes (time-consuming)
3) Record the latest timestamp t2 on the sync source
4) Replay all oplog entries between t1 and t2
5) Full synchronization ends

Simply put, the node traverses all collections of every database on the Primary, copies the data to itself, and then reads and replays the oplog written between the start and end of the full synchronization.
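For illustration, the timestamps t1 and t2 in steps 1 and 3 are simply the ts field of the newest entry in the source's oplog; a mongo shell sketch of reading it (this only shows the idea, not the server's internal code):

// Newest entry in the source's oplog; its ts field is what the
// initial sync records as t1 (and, after the copy, as t2).
db.getSiblingDB("local").oplog.rs.find({}, { ts: 1, op: 1, ns: 1 })
    .sort({ $natural: -1 })
    .limit(1)
    .forEach(printjson);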

After the initial sync finishes, the Secondary establishes a tailable cursor on local.oplog.rs on the Primary, continuously retrieves newly written oplog entries from the Primary, and applies them to itself.
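A minimal mongo shell sketch of this tailable-cursor pattern (illustrative only: a real Secondary applies each entry instead of printing it, and the DBQuery.Option flags shown are a legacy-shell feature):

// Start from the newest entry, then tail the oplog as it grows.
var local = db.getSiblingDB("local");
var last = local.oplog.rs.find().sort({ $natural: -1 }).limit(1).next();
var cursor = local.oplog.rs.find({ ts: { $gt: last.ts } })
    .addOption(DBQuery.Option.tailable)
    .addOption(DBQuery.Option.awaitData);
while (cursor.hasNext()) {
    printjson(cursor.next());   // a Secondary would apply the entry here
}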

1.2 When Initial Sync Is Triggered

A Secondary node must perform a full synchronization first when any of the following conditions occurs:

1) The oplog is empty
2) The _initialSyncFlag field of the local.replset.minvalid collection is set to true (used for handling init sync failures)
3) The in-memory flag initialSyncRequested is set to true (used by the resync command; resync applies only to the master/slave architecture and is not available for replica sets)

These three conditions correspond to the following scenarios (scenarios 2 and 3 are not described in the official documentation; they come from Zhang Youdong's blog):

1) A newly added node has no oplog at all, so it must run initial sync first
2) When initial sync starts, it sets the _initialSyncFlag field to true, and resets it to false after normal completion; if a node restarts and finds _initialSyncFlag still true, the previous full synchronization failed and initial sync must be run again
3) When a user issues the resync command, initialSyncRequested is set to true, forcing initial sync to restart
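For scenario 2, the minvalid document can be inspected from the mongo shell (the exact fields, including _initialSyncFlag, are internal and vary by version):

// Internal bookkeeping document used during initial sync.
db.getSiblingDB("local").replset.minvalid.findOne()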

1.3 Common Questions

1.3.1 During full synchronization, can the oplog on the sync source roll over (be overwritten), causing the full synchronization to fail?

In versions 3.4 and later, no.
[Chart from Zhang Youdong's blog illustrating the 3.4 improvements to full synchronization]

The official documentation states:

Initial sync builds all collection indexes as the documents are copied for each collection (in MongoDB versions before 3.4, only the _id index was built at this stage).
While initial sync copies the data, it also pulls newly written oplog records and stores them locally (added in 3.4), so a rollover of the source's oplog no longer breaks the copy phase.

II. Replication

2.1 The oplog Sync Process

After full synchronization, the Secondary establishes a tailable cursor starting from the end timestamp, continuously pulls oplog entries from the sync source, and replays them on itself. This is not done by a single thread: to improve synchronization efficiency, mongodb splits pulling the oplog and replaying it across different threads.
The specific threads and their roles are as follows (not found in the official documentation for now; from Zhang Youdong's blog):

  • producer thread: continuously pulls oplog entries from the sync source and puts them into a BlockQueue for buffering. The BlockQueue holds up to 240MB of oplog data; once this threshold is exceeded, the thread waits until replBatcher consumes some entries before pulling more.
  • replBatcher thread: takes oplog entries from the producer's queue one by one and puts them into its own queue. This queue allows at most 5,000 elements with a total size of no more than 512MB; when it is full, the thread waits for the oplog applier to consume entries.
  • oplog applier: takes all the elements currently in the replBatcher's queue and distributes them to different replWriter threads by docId (or by collection name if the storage engine does not support document-level locks). The replWriter threads apply the oplog entries; once every entry in the batch has been applied, the applier writes all of the entries sequentially into the local.oplog.rs collection.

[A diagram illustrating this pipeline accompanies the original article.]

Statistics for the producer's buffer and the apply threads can be queried via db.serverStatus().metrics.repl.
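For example (field names as reported by serverStatus; the exact set of fields varies by version):

// Producer buffer stats: sizeBytes constantly pressing against
// maxSizeBytes means the appliers are not keeping up with the producer.
var repl = db.serverStatus().metrics.repl;
printjson(repl.buffer);   // { count, maxSizeBytes, sizeBytes }
printjson(repl.apply);    // batches and ops applied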

2.2 Questions About the Process

2.2.1 Why does oplog replay need so many threads?

As with MySQL, one thread does one thing: pulling the oplog is single-threaded, and other threads handle the replay; multiple replay threads speed up the apply.

2.2.2 Why is the replBatcher thread needed in between?

Order must be preserved when the oplog is replayed. When DDL commands such as create and drop are encountered, they cannot be batched together with ordinary CRUD operations; these constraints are enforced by replBatcher.
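A schematic sketch of this batch-boundary rule in plain JavaScript (not the actual server code; in the oplog, command entries are marked op: "c"):

// Build one replay batch: CRUD entries may share a batch, but a
// command entry (create, drop, ...) is always applied alone, in order.
function nextBatch(queue, maxOps) {
    var batch = [];
    while (queue.length > 0 && batch.length < maxOps) {
        var op = queue[0];
        if (op.op === "c") {
            if (batch.length === 0) batch.push(queue.shift()); // command goes alone
            break;                                             // close the batch either way
        }
        batch.push(queue.shift());
    }
    return batch;
}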

2.2.3 What can be done when a Secondary's oplog replay cannot keep up with the Primary?

Method 1: Increase the number of replay threads

  * Specify on the mongod command line: mongod --setParameter replWriterThreadCount=32
  * Or specify in the configuration file:
setParameter:
  replWriterThreadCount: 32
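The current value can be checked with the standard getParameter admin command (assuming this parameter is exposed in your version):

db.adminCommand({ getParameter: 1, replWriterThreadCount: 1 })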

Method 2: Increase the size of the oplog
Method 3: Spread the writeOpsToOplog step across multiple replWriter threads so it runs concurrently; the official developer changelog indicates this has been implemented (in version 3.4.0-rc2)

2.3 Notes
  • Initial sync copies data single-threaded, which is relatively inefficient; production environments should avoid triggering initial sync where possible, which requires sizing the oplog appropriately.
  • When adding a new node, initial sync can be avoided by physical replication: copy the dbpath from the Primary to the new node and start it directly.
  • When Secondary lag is caused by heavy concurrent writes on the Primary, and the sizeBytes value of db.serverStatus().metrics.repl.buffer keeps approaching maxSizeBytes, increasing the number of replWriter threads on the Secondary can help (a small helper for spotting lag is sketched after this list).
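A small mongo shell helper for spotting replay lag, built on the standard rs.status() output (a sketch; optimeDate has one-second granularity):

// Compare each SECONDARY's optime against the PRIMARY's.
var s = rs.status();
var p = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
s.members.forEach(function (m) {
    if (m.stateStr === "SECONDARY") {
        print(m.name + " lag: " + (p.optimeDate - m.optimeDate) / 1000 + "s");
    }
});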

III. Log Analysis

3.1 Initial Sync Logs

Set the log verbosity level to 1, then filter the logs:
cat mg36000.log |egrep "clone|index|oplog" >b.log
A selection of the filtered log entries is shown below.

3.4.21: Logs from Adding a New Node

There are too many log lines to post them all; below is an excerpt covering one collection in the db01 database.
The logs show that the collection's indexes are created first, then the collection data and index entries are cloned, which completes the clone of that collection; the process then moves on to the next collection.
2019-08-21T16:50:10.880+0800 D STORAGE  [InitialSyncInserters-db01.test20] create uri: table:db01/index-27-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "num" : 1 }, "name" : "num_1", "ns" : "db01.test2" }),
2019-08-21T16:50:10.882+0800 I INDEX    [InitialSyncInserters-db01.test20] build index on: db01.test2 properties: { v: 2, key: { num: 1.0 }, name: "num_1", ns: "db01.test2" }
2019-08-21T16:50:10.882+0800 I INDEX    [InitialSyncInserters-db01.test20]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-21T16:50:10.882+0800 D STORAGE  [InitialSyncInserters-db01.test20] create uri: table:db01/index-28-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.test2" }),
2019-08-21T16:50:10.886+0800 I INDEX    [InitialSyncInserters-db01.test20] build index on: db01.test2 properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "db01.test2" }
2019-08-21T16:50:10.886+0800 I INDEX    [InitialSyncInserters-db01.test20]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-21T16:50:10.901+0800 D INDEX    [InitialSyncInserters-db01.test20]      bulk commit starting for index: num_1
2019-08-21T16:50:10.906+0800 D INDEX    [InitialSyncInserters-db01.test20]      bulk commit starting for index: _id_
2019-08-21T16:50:10.913+0800 D REPL     [repl writer worker 11] collection clone finished: db01.test2
2019-08-21T16:50:10.913+0800 D REPL     [repl writer worker 11]     collection: db01.test2, stats: { ns: "db01.test2", documentsToCopy: 2000, documentsCopied: 2000, indexes: 2, fetchedBatches: 1, start: new Date(1566377410875), end: new Date(1566377410913), elapsedMillis: 38 }
2019-08-21T16:50:10.920+0800 D STORAGE  [InitialSyncInserters-db01.collection10] create uri: table:db01/index-30-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection1" }),

3.6.12: Logs from Adding a New Node

The difference from 3.4 is that in 3.6 the logs make it explicit that repl writer worker threads perform the clone and replay (the 3.4 documentation already described this).
The use of cursors is also explicit.
Otherwise it is no different from 3.4: the indexes are created first, then the data is cloned.
2019-08-22T13:59:39.444+0800 D STORAGE  [repl writer worker 9] create uri: table:db01/index-32-3334250984770678501 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection1" }),log=(enabled=true)
2019-08-22T13:59:39.446+0800 I INDEX    [repl writer worker 9] build index on: db01.collection1 properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "db01.collection1" }
2019-08-22T13:59:39.446+0800 I INDEX    [repl writer worker 9]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T13:59:39.447+0800 D REPL     [replication-1] Collection cloner running with 1 cursors established.
2019-08-22T13:59:39.681+0800 D INDEX    [repl writer worker 7]      bulk commit starting for index: _id_
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7] collection clone finished: db01.collection1
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7]     database: db01, stats: { dbname: "db01", collections: 1, clonedCollections: 1, start: new Date(1566453579439), end: new Date(1566453579725), elapsedMillis: 286 }
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7]     collection: db01.collection1, stats: { ns: "db01.collection1", documentsToCopy: 50000, documentsCopied: 50000, indexes: 1, fetchedBatches: 1, start: new Date(1566453579440), end: new Date(1566453579725), elapsedMillis: 285 }
2019-08-22T13:59:39.731+0800 D STORAGE  [repl writer worker 8] create uri: table:test/index-34-3334250984770678501 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "test.user1" }),log=(enabled=true)

4.0.11: Logs from Adding a New Node

Cursors are used; essentially the same as 3.6.
2019-08-22T15:02:13.806+0800 D STORAGE  [repl writer worker 15] create uri: table:db01/index-30--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "num" : 1 }, "name" : "num_1", "ns" : "db01.collection1" }),log=(enabled=false)
2019-08-22T15:02:13.816+0800 I INDEX    [repl writer worker 15] build index on: db01.collection1 properties: { v: 2, key: { num: 1.0 }, name: "num_1", ns: "db01.collection1" }
2019-08-22T15:02:13.816+0800 I INDEX    [repl writer worker 15]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T15:02:13.816+0800 D STORAGE  [repl writer worker 15] create uri: table:db01/index-31--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection1" }),log=(enabled=false)
2019-08-22T15:02:13.819+0800 I INDEX    [repl writer worker 15] build index on: db01.collection1 properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "db01.collection1" }
2019-08-22T15:02:13.819+0800 I INDEX    [repl writer worker 15]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T15:02:13.820+0800 D REPL     [replication-0] Collection cloner running with 1 cursors established.

3.2 Replication Log

2019-08-22T15:15:17.566+0800 D STORAGE  [repl writer worker 2] create collection db01.collection2 { uuid: UUID("8e61a14e-280c-4da7-ad8c-f6fd086d9481") }
2019-08-22T15:15:17.567+0800 I STORAGE  [repl writer worker 2] createCollection: db01.collection2 with provided UUID: 8e61a14e-280c-4da7-ad8c-f6fd086d9481
2019-08-22T15:15:17.567+0800 D STORAGE  [repl writer worker 2] stored meta data for db01.collection2 @ RecordId(22)
2019-08-22T15:15:17.580+0800 D STORAGE  [repl writer worker 2] db01.collection2: clearing plan cache - collection info cache reset
2019-08-22T15:15:17.580+0800 D STORAGE  [repl writer worker 2] create uri: table:db01/index-43--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection2" }),log=(enabled=false)

References:
https://docs.mongodb.com/v4.0/core/replica-set-sync/
https://docs.mongodb.com/v4.0/tutorial/resync-replica-set-member/#replica-set-auto-resync-stale-member
http://www.mongoing.com/archives/2369


Author: hs2021
