Principle and workflow analysis of the rsync algorithm

Keywords: rsync, algorithm, incremental transfer

This article analyzes the principle of the rsync algorithm and rsync's workflow through examples; it is meant as a companion to rsync's official technical report and the articles recommended on the official site. It does not describe how to use the rsync command, but explains in detail how rsync achieves efficient incremental transfer.
The following is the rsync series:
1. Rsync (I): rsync basic usage
2. Rsync (II): rsync cold backup
3. Rsync (III): detailed description of inotify+rsync and sersync
4. Rsync algorithm principle and workflow analysis

Before analyzing the principle of the algorithm, let's briefly describe rsync's incremental transfer feature.
Assume the file to be transferred is file A. If file A does not exist in the target path, rsync transfers it directly; if file A already exists in the target path, the sender decides whether to transfer it at all. By default rsync uses the "quick check" algorithm: it compares the size and modification time (mtime) of the source file and the target file (if one exists). If either the size or the mtime differs between the two ends, the sender transfers the file; otherwise the file is ignored.
If the "quick check" algorithm decides that file A must be transferred, rsync does not transfer the whole file; it transfers only the parts in which the source file A and the target file A' differ. This is the real incremental transfer.
In other words, rsync's incremental transfer shows up at two levels: file-level incremental transfer and data block-level incremental transfer. File-level incremental transfer means that a file which exists on the source host but not on the target host is transferred whole. Data block-level incremental transfer means that only the parts of the two files that differ are transferred. In essence, file-level incremental transfer is just a special case of block-level transfer, which will become obvious after reading this article.
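The file-level "quick check" decision described above can be sketched in a few lines of Python (a simplified illustration only; the function name is made up for this example, and real rsync also honors options such as --checksum and --modify-window):

```python
import os

def quick_check_needs_transfer(src_path, dst_path):
    """Sketch of rsync's default "quick check": transfer when the target
    copy is missing, or when its size or mtime differs from the source."""
    if not os.path.exists(dst_path):
        return True                      # file-level incremental transfer
    src, dst = os.stat(src_path), os.stat(dst_path)
    return src.st_size != dst.st_size or int(src.st_mtime) != int(dst.st_mtime)
```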

1.1 The problem to be solved

Suppose there is a file A on host A and a file A' on host B (in reality these are two files with the same name; they are called A and A' here only to tell them apart). The goal is to keep A' synchronized with A.
The simplest way is to copy file A directly to host B. However, if file A is large and A' is very similar to A (meaning the actual contents of the two files differ only slightly), copying the whole of file A may take a long time. If only the small part that differs from A' had to be copied, the transfer would be fast. The rsync incremental transfer algorithm makes full use of the similarity between the files and solves exactly this problem of remote incremental copying.
Suppose the content of file A is "123xxabc def" and the content of file A' is "123abcdefg". Compared with A', file A shares the parts 123, abc and def; the extra content in A is "xx" and a space, while A' has an extra "g" that A does not. The ultimate goal is to make the content of A' identical to that of A.
With the rsync incremental transfer algorithm, host A transmits only the "xx" and the space of file A to host B; for the identical parts 123, abc and def, host B copies them directly from file A'. From these two data sources host B assembles a copy of file A, and finally the copy is renamed over file A', completing the synchronization.
Although the process sounds simple, there are many details to work out. For example, how does host A know which parts of file A differ from file A'? How does host B receive the differing parts of A and A' sent by host A, and how does it assemble the copy of file A?

1.2 Principle of the rsync incremental transfer algorithm

Suppose the rsync command is run to push file A to host B so that A' and A stay synchronized; that is, host A is the source host and the sender of the data, and host B is the target host and the receiver of the data. Keeping A' synchronized with A involves roughly the following six steps:
(1) Host A tells host B that file A is to be transferred.

(2) After receiving this information, host B divides file A' into a series of fixed-size data blocks (the recommended size is between 500 and 1000 bytes), numbers them by chunk number, and records each block's starting offset and length. Obviously the last block may be smaller than the others.

For the content "123abcdefg" of file A ', assuming that the size of the divided data block is 3 bytes, it is divided into the following data blocks according to the number of characters:

count=4, n=3, rem=1, meaning four blocks are produced, the block size is 3 bytes, and the remaining 1 byte forms the last block.

chunk[0]: offset=0 len=3, corresponding content: 123
chunk[1]: offset=3 len=3, corresponding content: abc
chunk[2]: offset=6 len=3, corresponding content: def
chunk[3]: offset=9 len=1, corresponding content: g

Of course, the information actually recorded does not include the contents of the blocks themselves.
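As a concrete illustration, the block division of step (2) can be sketched in Python (a toy sketch of the idea; the function name and the dictionary layout are made up for this example and are not rsync's actual data structures):

```python
def split_into_chunks(data, block_size=3):
    """Divide the basis file (here file A', "123abcdefg") into fixed-size
    blocks and record each block's chunk number, offset and length."""
    chunks = []
    for i, offset in enumerate(range(0, len(data), block_size)):
        block = data[offset:offset + block_size]
        chunks.append({"chunk": i, "offset": offset, "len": len(block)})
    return chunks

print(split_into_chunks(b"123abcdefg"))
# [{'chunk': 0, 'offset': 0, 'len': 3}, {'chunk': 1, 'offset': 3, 'len': 3},
#  {'chunk': 2, 'offset': 6, 'len': 3}, {'chunk': 3, 'offset': 9, 'len': 1}]
```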

(3) For each data block of file A', host B computes two checksums from its content: a 32-bit weak rolling checksum and a 128-bit MD4 strong checksum (current versions of rsync already use a 128-bit MD5 strong checksum instead). All the rolling checksums and strong checksums computed from file A', attached to their corresponding chunk numbers chunk[N], form the checksum set, which is then sent to host A.

In other words, the checksum set looks roughly as follows, where sum1 is the rolling checksum and sum2 is the strong checksum:

chunk[0] sum1=3ef2c827 sum2=3efa923f8f2e7
chunk[1] sum1=57ac2aaf sum2=aef2dedba2314
chunk[2] sum1=92d7edb4 sum2=a6sd6a9d67a12
chunk[3] sum1=afe74939 sum2=90a12dfe7485c

Note that data blocks with different contents may occasionally produce the same rolling checksum, but the probability is very small.
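The two per-block checksums can be sketched as follows (Python, following the description of the weak checksum in the rsync technical report; the exact constants and code in rsync itself differ, and build_checksum_set with its dictionary layout is an invention of this example):

```python
import hashlib

M = 1 << 16

def weak_checksum(block):
    """32-bit weak rolling checksum in the style of the rsync paper:
    a sums the bytes, b weights each byte by its distance from the
    end of the block; both are taken mod 2^16."""
    a = b = 0
    n = len(block)
    for i, byte in enumerate(block):
        a = (a + byte) % M
        b = (b + (n - i) * byte) % M
    return (b << 16) | a

def strong_checksum(block):
    """Strong per-block checksum; current rsync uses MD5 (older versions MD4)."""
    return hashlib.md5(block).hexdigest()

def build_checksum_set(data, block_size=3):
    """The checksum set host B sends to host A: one (chunk, sum1, sum2) entry per block."""
    return [{"chunk": i,
             "sum1": weak_checksum(data[off:off + block_size]),
             "sum2": strong_checksum(data[off:off + block_size])}
            for i, off in enumerate(range(0, len(data), block_size))]
```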

(4) After host A receives the checksum set of file A', it computes a 16-bit hash value for each rolling checksum in the set and stores them in a hash table with 2^16 slots, ordered by hash value. Each hash table entry points to the chunk number whose rolling checksum produced that hash. The checksum set is then sorted by hash value, so that the order of the sorted checksum set corresponds to the order of the hash table.

The correspondence between the hash table and the sorted checksum set therefore looks roughly as follows (the hash values in the hash table are assumed to be sorted by their first character in the order [0-9a-f]):

[Figure: hash table entries pointing to the chunk numbers of the sorted checksum set]

Note also that different rolling checksums may produce the same hash value. The probability is fairly small, but such collisions are more likely than rolling checksum collisions.
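Building the hash table can be sketched as follows (a toy illustration reusing the checksum-set layout from the previous sketch; taking the low 16 bits of the rolling checksum as the hash is an assumption of this example, not rsync's actual hash function):

```python
from collections import defaultdict

def hash16(sum1):
    """16-bit hash of a rolling checksum; here simply its low 16 bits
    (an assumption made for this illustration)."""
    return sum1 & 0xFFFF

def build_hash_table(checksum_set):
    """Map each 16-bit hash value to the chunk numbers whose rolling
    checksum hashes to it, as host A does in step (4)."""
    table = defaultdict(list)
    for entry in checksum_set:
        table[hash16(entry["sum1"])].append(entry["chunk"])
    return table
```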

(5) Host A then processes file A. Starting from the first byte, it takes a block of the same fixed size and computes its checksums to match them against the checksum set. If the block can be matched against an entry in the checksum set, the block is identical to some block in file A' and does not need to be transferred, so host A jumps directly to the end of this block and continues taking blocks from that offset. If no entry in the checksum set matches, the block is a non-matching block whose data will need to be transferred to host B, so host A advances by only one byte and takes the next block starting from that byte. Note that a successful match skips the whole matched block, while a failed match advances by only one byte. This will become clearer with the example in the next section.

The block matching just described is only an outline; the actual behavior is more refined. The rsync algorithm splits the matching into a three-level search.

First, host A computes the rolling checksum of the block it has just taken, and then computes the hash value of that rolling checksum.

Then it looks the hash value up in the hash table. **This is the first-level search, which compares hash values.** If a matching entry is found in the hash table, the blocks may be identical, so the search proceeds to the second level.

**The second-level search compares rolling checksums.** Because a hash value matched at the first level, the rolling checksums in the checksum set that correspond to this hash value are examined. Since the checksum set is sorted by hash value, its order is consistent with the hash table, so the scan simply starts at the entry the hash points to and moves downwards. If during the scan the rolling checksum of the block from file A matches an entry, the blocks may be identical, so the scan stops and the third-level search makes the final decision. If no match is found before an entry with a different hash value is reached, the block is a non-matching block and the scan also stops; in that case the rolling checksums differ and the hash match was merely a low-probability collision.

**The third-level search compares strong checksums.** At this point host A computes the strong checksum of the block from file A (before the third level, only the rolling checksum and its hash value have been computed for this block) and compares it with the corresponding strong checksum in the checksum set. If they match, the blocks are identical; if not, the blocks differ, and host A moves on to take the next block.

The reason for computing hash values and building a hash table is that comparing rolling checksums directly performs worse than comparing hash values, and lookup through a hash table is very fast. Because hash collisions are rare enough, most blocks with different contents are ruled out directly by the first-level hash comparison. Even when a hash collision does occur, the search quickly narrows down to a rolling checksum comparison, whose collision probability is smaller still. Even though different contents can occasionally produce the same rolling checksum, that probability is lower than for hash values, so almost all differing blocks are ruled out by these two levels. In the rare case that blocks with different contents also share a rolling checksum, the third level compares the strong checksums, which use MD4 (now MD5). This algorithm has an "avalanche effect": the slightest difference in input produces a completely different result, so in practice it can be treated as the final arbiter.
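The whole sender-side search can be sketched as a single loop (Python, reusing weak_checksum, strong_checksum, hash16 and build_hash_table from the sketches above; the ("match", ...) / ("data", ...) token format is an invention of this example, not rsync's wire format, and for clarity the weak checksum is recomputed at every position instead of being rolled in O(1) per byte as real rsync does):

```python
def generate_delta(new_data, checksum_set, block_size=3):
    """Three-level search on host A: hash -> rolling checksum -> strong checksum.
    Emits ("match", chunk_no) for matched blocks and ("data", bytes) for literal data."""
    by_chunk = {e["chunk"]: e for e in checksum_set}
    table = build_hash_table(checksum_set)

    delta, literal, pos = [], bytearray(), 0
    while pos < len(new_data):
        block = new_data[pos:pos + block_size]
        sum1 = weak_checksum(block)
        matched = None
        for chunk_no in table.get(hash16(sum1), []):        # level 1: hash value
            entry = by_chunk[chunk_no]
            if entry["sum1"] != sum1:                       # level 2: rolling checksum
                continue
            if entry["sum2"] == strong_checksum(block):     # level 3: strong checksum
                matched = chunk_no
                break
        if matched is not None:
            if literal:                    # flush literal bytes collected so far
                delta.append(("data", bytes(literal)))
                literal = bytearray()
            delta.append(("match", matched))
            pos += len(block)              # match: skip the whole block
        else:
            literal.append(new_data[pos])  # no match: emit one byte, advance one byte
            pos += 1
    if literal:
        delta.append(("data", bytes(literal)))
    return delta
```

With the example files, generate_delta(b"123xxabc def", build_checksum_set(b"123abcdefg")) yields [("match", 0), ("data", b"xx"), ("match", 1), ("data", b" "), ("match", 2)], which corresponds to the five instructions analyzed in section 1.3.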

The block size affects the performance of the rsync algorithm. If it is too small, there are too many blocks, too many block checksums to compute and match, and performance suffers; the chance of hash value and rolling checksum collisions also increases. If it is too large, many blocks may fail to match and have to be transferred, reducing the benefit of incremental transfer. Choosing an appropriate block size therefore matters. By default, rsync determines the block size automatically from the file size, but the "-B" (or "--block-size") option of the rsync command allows it to be specified manually; if specified manually, the officially recommended size is between 500 and 1000 bytes.

(6) When host A finds a matching block, it sends only the match information for that block to host B. If there is unmatched data between two matching blocks, that unmatched data is sent as well. As host B receives this stream of data, it creates a temporary file and reassembles it from the received data so that its content is identical to file A. Once the temporary file has been reassembled, its attributes (permissions, owner, mtime, etc.) are set, and the temporary file is renamed over file A', at which point A' is synchronized with A.

1.3 Analyzing the rsync algorithm through an example

After so much theory, things may still be hazy. In this section the incremental transfer algorithm of the previous section is analyzed in detail using the example of files A and A'. Since steps (1)-(4) were already illustrated in the previous section, the analysis continues with steps (5) and (6).

First, look again at the sorted checksum set and the hash table of file A' (content "123abcdefg").

[Figure: the sorted checksum set and hash table of file A' ("123abcdefg")]

When host A starts processing file A (content "123xxabc def"), it takes a block of the fixed size starting from the first byte, so the first block is "123", which is exactly the same as chunk[0] of file A'. The rolling checksum host A computes for this block is therefore necessarily "3ef2c827", and its hash value is "c827". Host A looks this hash value up in the hash table and finds that the entry pointing to chunk[0] matches, so it enters the second-level comparison of rolling checksums, scanning downwards from the entry of chunk[0] that the hash value points to. The very first entry scanned (the one for chunk[0]) already matches the rolling checksum, so the scan stops and the third-level search begins. Host A now computes a strong checksum for the block "123" and compares it with the strong checksum of chunk[0] in the checksum set; they match, so the block "123" of file A is determined to be a matching block and does not need to be transferred to host B.

Although a matching block does not need to be transferred, the information about the match must be sent to host B immediately, otherwise host B would not know how to reassemble the copy of file A. The information sent for this matching block includes: chunk[0] of file A' was matched, the matching block starts at the first byte (offset 0) of file A, and its length is 3 bytes.

After the match information for block "123" has been sent, host A takes the second block. Normally this block would start at the second byte, but because all three bytes of block "123" were matched, the whole block can be skipped and the next block is taken starting at the fourth byte; the second block obtained by host A is therefore "xxa". Again its rolling checksum and hash value are computed and the hash value is looked up in the hash table; no hash entry matches, so the block is immediately determined to be a non-matching block.

Host A then continues with the third block of file A. Because the second block did not match, only one byte is skipped, i.e. the third block starts at the fifth byte, and its content is "xab". After the same computation and matching, this block also turns out to be a non-matching block, so host A again advances by one byte and takes the fourth block starting at the sixth byte; its content is "abc". This block is a matching block, so it is handled the same way as the first one. The only difference is that two non-matching blocks lie between the first matching block and this fourth block, so once the fourth block is confirmed as matching, the unmatched content in between (the "xx" in the middle of 123xxabc) is sent byte by byte to host B.

(As mentioned earlier, hash values and rolling checksums can collide with a small probability. How does matching work when a collision occurs? See the end of this section.)

All blocks of file A are processed in this way. In the end there are 3 matching blocks, chunk[0], chunk[1] and chunk[2], and 2 segments of unmatched data, "xx" and " ". Host B thus receives both the match information for the matching blocks and the byte-by-byte unmatched literal data; these are the key pieces of information with which host B reassembles the copy of file A. Their rough content is as follows:

chunk[0] of size 3 at 0 offset=0
data receive 2 at 3
chunk[1] of size 3 at 3 offset=5
data receive 1 at 8
chunk[2] of size 3 at 6 offset=9

To explain this information, first look at the contents of file A and file A' with their offsets marked.

[Figure: contents of file A ("123xxabc def") and file A' ("123abcdefg") with their offsets marked]

For "chunk[0] of size 3 at 0 offset=0", this section indicates that this is A matching data block, the matching is chunk[0] in file A ', the data block size is 3 bytes, the keyword at indicates that the starting offset address of this matching block in file A' is 0, and the keyword offset indicates that the starting offset address of this matching block in file A is also 0, It can also be considered an offset in the reorganization temporary file. That is, when host B reorganizes the file, the data block corresponding to chunk[0] with A length of 3 bytes will be copied from the "at 0" offset of file B, and the content of this data block will be written to the offset=0 offset in the temporary file, so that there is the first section of data "123" in the temporary file.

For "data receive 2 at 3", this paragraph indicates that this is the received pure data information, not the matching data block. 2 indicates the number of data bytes received, "at 3" indicates that the two bytes of data are written at the starting offset 3 of the temporary file. In this way, the temporary file has "123xx" containing data.

For "chunk[1] of size 3 at 3 offset=5", this section indicates that it is A matching data block, that is, copy the data block corresponding to chunk[1] with A length of 3 bytes from the starting offset address at=3 of file A ', and write the content of this data block at the starting offset offset=5 of the temporary file, so that the temporary file has "123xxabc".

For "data receive 1 at 8", this description receives pure data information, which means that the received data of 1 byte is written to the starting offset address 8 of the temporary file, so there is "123xxabc" in the temporary file.

The last paragraph "chunk[2] of size 3 at 6 offset=9" means that the data block corresponding to chunk[2] with a length of 3 bytes is copied from the starting offset address at=6 of file a ', and the content of this data block is written to the starting offset offset=9 of the temporary file, so that the temporary file contains "123xxabc def". So far, the reorganization of the temporary file is completed, and its content is completely consistent with that of file a on host A. then, just modify the properties of the temporary file, rename and replace file a ', so as to synchronize file B and file a.

The whole process is as follows:

[Figure: the whole matching and reassembly process]

Note that host A does not wait until all blocks have been searched before sending data to host B. Instead, every time a matching block is found, the information about that match and the unmatched data between it and the previous matching block are immediately sent to host B, and host A moves on to the next block; likewise, every time host B receives a piece of data, it is immediately incorporated into the temporary file. Neither host A nor host B wastes any resources waiting.

1.3.1 Matching when hash values or rolling checksums collide

The example above did not involve hash value collisions or rolling checksum collisions, but both can occur. The matching process after a collision is the same, but it may be less easy to follow.
Look again at the sorted checksum set of file A'.
[Figure: the sorted checksum set of file A']

When file A is being processed, suppose host A is at the second block, which is a non-matching block. The rolling checksum and hash value are computed for this block. Suppose the hash value happens to match an entry in the hash table, for example the value "4939" in the figure above; then the rolling checksum of this second block is compared with the rolling checksum of chunk[3], which the hash entry "4939" points to. The hash values collided, but the probability that the rolling checksums also collide is almost zero, so the block is determined to be a non-matching block.

Now consider a more extreme case. If file A' is large and divided into many blocks, the rolling checksums of blocks within file A' itself may collide, and so may their hash values.

For example, suppose the rolling checksums of chunk[0] and chunk[3] are different, but the hash values computed from them are the same. The correspondence between the hash table and the checksum set then looks roughly like this:
[Figure: hash table and checksum set when chunk[0] and chunk[3] share the same hash value]

If the hash value of a block from file A matches "c827", the rolling checksum comparison begins: the checksum set is scanned downwards starting from chunk[0], the entry that the hash value "c827" points to. If during the scan the block's rolling checksum exactly matches one of the entries, for example the rolling checksum of chunk[0] or of chunk[3], the scan stops and the third-level search begins. If no matching rolling checksum has been found by the time chunk[2] is reached, the scan stops, because the hash value of chunk[2] differs from the hash value of the block; the block is then determined to be a non-matching block, and host A continues with the next block.
[Figure: scanning the checksum set downwards from the colliding hash entry]

If the rolling checksums of blocks within file A' also collide (which would require extraordinary luck), the match can only be decided by the strong checksums.

1.4 rsync workflow analysis

The core of rsync's incremental transfer has now been analyzed. The rest of this section looks at how the incremental transfer algorithm is implemented within the whole rsync transfer process. Before that, the related concepts involved in an rsync transfer need to be explained: client/server, sender, receiver and generator.

1.4.1 Processes and terms

rsync works in three modes: local transfer, transfer over a remote shell connection, and transfer to an rsync daemon over a network socket.
When a remote shell such as ssh is used, after the rsync command is typed locally, a remote shell connection (e.g. an ssh connection) to the remote host is requested. Once the connection is established, a remote shell process is forked to invoke the rsync program on the remote host, and the options rsync needs are passed to the remote rsync over the remote shell command such as ssh. rsync is thus started on both ends, and from then on the two sides communicate through a pipe (even though they are in a local/remote relationship).
When connecting to an rsync daemon over a network socket, once the connection with the remotely running rsync is established, the rsync daemon creates a child process to respond to the connection and handle all subsequent communication on it. Again, rsync is started on both ends, and they then communicate over the network socket.

Local transfer is really a special working mode: when the rsync command is executed there is first one rsync process, which then forks another rsync process as the other end of the connection; after the connection is established, all subsequent communication goes through a pipe.

Whatever the connection method, the end that initiates the connection, i.e. the end on which the rsync command is executed, is called the client, and the other end of the connection is called the server. Note that the server end does not necessarily mean the rsync daemon end: in rsync, "server" is a general term relative to "client", and anything that is not the client is the server, be it the local end, the remote-shell peer, or a remote rsync daemon. This differs from the "server" of most daemon-based services.

rsync's concepts of client and server have a short lifetime. Once the client and server have started their rsync processes and established the rsync connection (pipe or network socket), the terms sender and receiver are used instead: the sender end is the end that sends the files, and the receiver end is the end that receives them.

After the rsync connection between the two ends is established, the rsync process on the sender side is called the sender process and is responsible for all work on the sender side. The rsync process on the receiver side is called the receiver process and is responsible for receiving the data sent by the sender and reassembling the files. The receiver side also has another core process, the generator process, which executes the "--delete" action on the receiver side, compares file sizes and mtimes to decide whether a file can be skipped, divides each file into data blocks, computes the checksums and builds the checksum set, and then sends the checksum set to the sender side.
The whole rsync transfer is carried out by these three processes, and it is highly pipelined: the output of the generator process is the input of the sender, and the output of the sender is the input of the receiver, i.e.:

generator process --> sender process --> receiver process

Although the three processes form a pipeline, this does not mean they wait on one another; they work completely independently and in parallel. As soon as the generator has computed the checksum set of one file and sent it to the sender, it starts on the checksum set of the next file. As soon as the sender process receives a checksum set from the generator, it starts processing the corresponding file, sending the relevant data to the receiver as it goes and then moving on to the next data block. As soon as the receiver process receives data from the sender, it starts reassembling. In other words, once the three processes have been created, none of them waits for the others.

Moreover, pipelining does not mean there is no other communication between the processes; it means that the main workflow of the transfer is pipelined. For example, after the receiver process receives the file list, it hands the file list to the generator process.

1.4.2 The complete rsync workflow

Suppose the rsync command is executed on host A to push a large number of files to host B.

1. First, the client and the server establish the connection used for rsync communication: a pipe in the remote shell mode, or a network socket when connecting to an rsync daemon.

2. After the rsync connection is established, the sender process on the sender side collects the files to be synchronized according to the source paths given on the rsync command line, puts them into the file list, and transmits the list to host B. Several points about building the file list are worth noting:

  • While building the file list, entries are first sorted by directory, and the files in the sorted list are then numbered; from then on files are referred to directly by their file number.

  • The file list also contains some attributes of each file, including the permission mode, the file size (len), the owner and group (uid/gid), the latest modification time (mtime), and so on. Some of this information is only attached when the corresponding options are specified; for example, uid/gid are not included unless the "-o" and "-g" options are given, and a file-level checksum is included only when "--checksum" is specified.

  • The file list is not sent all at once after collection is complete; instead it is collected and sent directory by directory, in order. The receiver likewise receives it directory by directory, and the received file list is already sorted.

  • If exclude or hide rules are specified on the rsync command line, the files filtered out by these rules are marked as hide in the file list (the exclude rule is in essence also a hide rule). Files with the hide flag are invisible to the receiver, so the receiver side believes that the sender does not have these files.

3. As soon as the receiver starts receiving the file list, the generator process is forked from the receiver process. The generator scans the local directory tree according to the file list; if a file in the list already exists under the target path, that file becomes the basis file.

The work of the generator is generally divided into three steps:

  • If the "--delete" option is specified on the rsync command line, the delete action is executed first on host B, removing files that exist under the target path but not under the source path.
  • Then, following the order of the file list, the size and mtime of each corresponding local file are compared with the entry in the list. If both the size and the mtime of the local file are the same as in the file list, the file does not need to be transferred and is skipped.
  • If the size or the mtime of the local file differs, the file needs to be transferred. The generator immediately divides this file into blocks and numbers them, computes the weak rolling checksum and the strong checksum of every block, combines these checksums with the block numbers into a checksum set, and sends the file number and the checksum set to the sender side. It then starts processing the next file in the file list.

Note that for a file that exists on host A but not on host B, the generator sets that file's checksum set to empty and sends it to the sender. If the "--whole-file" option is specified, the generator sets the checksum sets of all files to empty, which forces rsync to perform a full transfer instead of an incremental one.
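The generator's per-file decision just described can be sketched as follows (Python, reusing build_checksum_set from the sketch in section 1.2; the list_entry dictionary layout and the function name are inventions of this example, and the default block size of 700 bytes simply mirrors the blength=700 seen in the logs later):

```python
import os

def generator_handle_file(list_entry, target_dir, block_size=700):
    """Returns None when the file can be skipped, otherwise (file_number,
    checksum_set); an empty checksum set means there is no basis file,
    so the whole file will be sent."""
    local_path = os.path.join(target_dir, list_entry["name"])
    if not os.path.exists(local_path):
        return list_entry["number"], []          # no basis file: empty checksum set
    st = os.stat(local_path)
    if st.st_size == list_entry["size"] and int(st.st_mtime) == list_entry["mtime"]:
        return None                              # size and mtime both match: skip
    with open(local_path, "rb") as f:            # basis file differs: build checksum set
        return list_entry["number"], build_checksum_set(f.read(), block_size)
```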

The steps from step 4 onwards were explained in great detail in the analysis of the rsync algorithm, so they are only summarized here; if anything is unclear, refer back to the relevant sections above.

4. When the sender process receives the data sent by the generator, it reads the file number and the checksum set. It then computes a hash value from each weak rolling checksum in the set, puts the hash values into a hash table, and sorts the checksum set by hash value so that its order matches the hash table exactly.

5. After sorting the checksum set, the sender process processes the local file identified by the file number. The purpose is to find the matching blocks (blocks whose content is exactly the same) and the unmatched data. Whenever a matching block is found, the match information is sent to the receiver process immediately. After all the data of the file has been sent, the sender also generates a whole-file checksum for the file and sends it to the receiver.

6. When the receiver process receives the instructions and data sent by the sender, it immediately creates a temporary file under the target path and reassembles it from the received data and instructions so that it becomes identical to the file on host A. During reassembly, matched blocks are copied from the basis file into the temporary file, while unmatched data is taken from what the sender transmitted.

7. After the temporary file has been reassembled, a file-level checksum is generated for it and compared with the whole-file checksum sent by the sender. If they match, the temporary file is identical to the source file and the reassembly succeeded. If they do not match, something may have gone wrong during reassembly, and the source file is processed again completely from scratch.

8. Once the temporary file has been reassembled successfully, the receiver process sets its attributes (permissions, owner, group, mtime, etc.), then renames it to overwrite the existing file under the target path (the basis file). At this point the file is fully synchronized.

1.5 Analyzing how rsync works from its execution output

To get a more concrete feel for the rsync algorithm and workflow explained above, two examples of rsync's execution output are analyzed below: one full transfer and one incremental transfer.
To see what rsync does as it executes, add the "-vvvv" option to the rsync command line.

1.5.1 Full transfer execution process analysis

Command to execute:

rsync -a -vvvv /etc/cron.d/ /var/log/puppet /etc/issue ci-user@10.176.6.24:/home/ci-user/tmp

The purpose is to transfer the /etc/cron.d directory, the /var/log/puppet directory and the /etc/issue file into the /home/ci-user/tmp directory on host 10.176.6.24. Since these files do not yet exist under the target directory, the whole process is a full transfer, but in essence it still uses the incremental transfer algorithm; the checksum sets sent by the generator are simply all empty.
The following is the directory hierarchy of /etc/cron.d and /var/log/puppet:

[root@clickhouse-backup-test01 ~]# tree -C /etc/cron.d/  /var/log/puppet
/etc/cron.d/
|-- 0hourly
|-- hardening_audit
|-- infosec_ntpstat
|-- puppet-cron
|-- puppet-reboot
`-- sysstat
/var/log/puppet
`-- puppet-agent.log

0 directories, 7 files

The following is the execution output:

[root@dev8 ~]# rsync -a -vvvv /etc/cron.d/ /var/log/puppet /etc/issue ci-user@10.176.6.24:/home/ci-user/tmp
# Use ssh (ssh is the default remote shell) to execute the remote rsync command to establish a connection 
cmd=<NULL> machine=10.176.6.24 user=ci-user path=/home/ci-user/tmp
cmd[0]=ssh cmd[1]=-l cmd[2]=ci-user cmd[3]=10.176.6.24 cmd[4]=rsync cmd[5]=--server cmd[6]=-vvvvlogDtpre.iLsfxC cmd[7]=. cmd[8]=/home/ci-user/tmp
opening connection using: ssh -l ci-user 10.176.6.24 rsync --server -vvvvlogDtpre.iLsfxC . /home/ci-user/tmp  (9 args)
msg checking charset: ANSI_X3.4-1968
*******************************************************************************
                             NOTICE:

This system/network is for the use of authorized users for legitimate
business purposes only. The unauthorized access, use or modification of
this system or of the data contained therein or in transit to/from it
is a criminal violation of federal and state laws.

All individuals using this computer system are subject to having their
activities on this system monitored and recorded by systems personnel.
Anyone using this system expressly consents to such monitoring. Any
evidence of suspected criminal activity revealed by such monitoring may
be provided to law enforcement officials by systems personnel.
*******************************************************************************

Password:
# Both ends send their protocol version to each other and negotiate down to the lower one; here rsync 3.1.2 is used, protocol version 31
(Server) Protocol versions: remote=31, negotiated=31
(Client) Protocol versions: remote=31, negotiated=31

# The sender side generates a file list and sends it to the receiver side
sending incremental file list
[sender] make_file(.,*,0)      # The first entry to transfer: the contents of the source path /etc/cron.d/; note the difference between /etc/cron.d/ and /var/log/puppet
[sender] pushing local filters for /etc/cron.d/
[sender] make_file(0hourly,*,2)    # Second file directory to transfer: 0hourly 
[sender] make_file(sysstat,*,2)    # The third file directory to transfer: sysstat
[sender] make_file(puppet,*,0)     # The fourth entry to transfer: puppet; note that puppet itself is the item being transferred, not just an implied parent directory
[sender] make_file(issue,*,0)      # The fifth file directory to transfer: issue file

# Indicate that it starts from item 1 of the file list, and confirm that there are 5 items to be transmitted to the receiver this time
[sender] flist start=1, used=5, low=0, high=4
# Generate list information for these five items, including the file id, directory, permission mode, length, uid/gid, and finally the flags
[sender] i=1 /etc/cron.d ./ mode=040500 len=36 uid=0 gid=0 flags=1005
[sender] i=2 /etc/cron.d 0hourly mode=0100644 len=128 uid=0 gid=0 flags=0
[sender] i=3 /etc issue mode=0100644 len=805 uid=0 gid=0 flags=1005
[sender] i=4 /etc/cron.d sysstat mode=0100600 len=233 uid=0 gid=0 flags=1000
[sender] i=5 /var/log puppet/ mode=040755 len=30 uid=0 gid=0 flags=1005
send_file_list done
file list sent
# The only thing to note is the directory shown for each file, e.g. /var/log puppet/, whereas /var/log/puppet was specified on the command line.
# The space between log and puppet here is important: the part to the left of the space is the implied directory (see the "-R" option in man rsync),
# and the part to the right is the whole file or directory to be transferred. By default a puppet/ directory is created on the receiver side, but the implied directory on the left is not.
# However, with special options (such as "-R") rsync can also create the implied directory on the receiver side, and thus reproduce the whole directory hierarchy.
# For example, if the /a directory of host A has subdirectories b and c, and directory b contains a file d, and you only want to transfer /a/b/d while keeping the /a/b hierarchy,
# you can change the implied directory to "/a/" through these special options. For concrete usage, see the examples of the rsync "-R" option.
 
############ The sender side sends file attribute information #####################
# Since two entries in the previous file list are directories, attribute information should be generated for each file in the directory and sent to the receiver
send_files starting
[sender] pushing local filters for /var/log/puppet/
[sender] make_file(puppet/puppet-agent.log,*,2)
[sender] flist start=7, used=1, low=0, high=0
[sender] i=7 /var/log puppet/puppet-agent.log mode=0100644 len=891 uid=52 gid=52 flags=1000
[sender] flist_eof=1

############## Activity on the server side ################
# First, start the rsync process on the server side
server_recv(2) starting pid=24070
process has 2 gids:  991 1010
gid 0() maps to 0
recv_file_name(.)
recv_file_name(0hourly)
recv_file_name(sysstat)
recv_file_name(puppet)
recv_file_name(issue)
received 5 names
[Receiver] flist start=1, used=5, low=0, high=4
[Receiver] i=1 0 ./ mode=040500 len=36 gid=(0) flags=1405
[Receiver] i=2 1 0hourly mode=0100644 len=128 gid=(0) flags=400
[Receiver] i=3 1 issue mode=0100644 len=805 gid=(0) flags=1400
[Receiver] i=4 1 sysstat mode=0100600 len=233 gid=(0) flags=1400
[Receiver] i=5 1 puppet/ mode=040755 len=30 gid=(0) flags=1405
recv_file_list done
# First data reception completed
# Start the generator process on the receiver side
get_local_name count=5 /home/ci-user/tmp  # Get local pathname
generator starting pid=24070  # Start the generator process
delta-transmission enabled    # Enable incremental transfer algorithm
# The generator process has been set up

# A total of 5 files were received
recv_generator(.,0)
set modtime of . to (1567013864) Wed Aug 28 10:37:44 2019
recv_generator(.,1) 
recv_generator(0hourly,2)
recv_generator(issue,3)
recv_generator(sysstat,4)
recv_generator(puppet,5)
recv_files(5) starting
send_files(0, /etc/cron.d/.)   # The sender process receives the file id notified by the generator process
./
send_files(2, /etc/cron.d/0hourly)
count=0 n=0 rem=0     #   This is the data block information of the 0hourly file as divided on the target host: count is the number of blocks, n is the fixed block size, rem (remain) is the remaining data length, i.e. the size of the last block
                      #   Here they are all 0 because there is no 0hourly file on the target side
send_files mapped /etc/cron.d/0hourly of size 128  # The sender side maps the /etc/cron.d/0hourly file so that it can access the file's contents
calling match_sums /etc/cron.d/0hourly   # The sender side calls the check code matching function
0hourly
sending file_sum                         # After matching, send the file level checksum to the receiver
false_alarms=0 hash_hits=0 matches=0     # Output statistics related to data block matching
sender finished /etc/cron.d/0hourly  
# The file / etc/cron.d/0hourly has been sent. Because there is no 0hourly file on the target, the whole process is very simple. All data in 0hourly can be directly transmitted

# Send file / etc/issue
send_files(3, /etc/issue)
count=0 n=0 rem=0
send_files mapped /etc/issue of size 805
calling match_sums /etc/issue
issue
sending file_sum
false_alarms=0 hash_hits=0 matches=0
sender finished /etc/issue     

# Send file /etc/cron.d/sysstat
send_files(4, /etc/cron.d/sysstat)
count=0 n=0 rem=0
send_files mapped /etc/cron.d/sysstat of size 233
calling match_sums /etc/cron.d/sysstat
sysstat
sending file_sum
false_alarms=0 hash_hits=0 matches=0
sender finished /etc/cron.d/sysstat

# Receive file puppet/puppet-agent.log
[receiver] receiving flist for dir 1
uid 52(puppet) maps to 52
gid 52(puppet) maps to 52
recv_file_name(puppet/puppet-agent.log)
received 1 names
[receiver] flist start=7, used=1, low=0, high=0
[receiver] i=7 2 puppet/puppet-agent.log mode=0100644 len=891 gid=(52) flags=1400
recv_file_list done
[receiver] flist_eof=1
recv_files(.)
recv_files(0hourly)
data recv 128 at 0
got file_sum
set modtime of .0hourly.vGJ3Rk to (1560075748) Sun Jun  9 03:22:28 2019
renaming .0hourly.vGJ3Rk to 0hourly
recv_files(issue)
data recv 805 at 0
got file_sum
set modtime of .issue.RTRDTB to (1566935887) Tue Aug 27 12:58:07 2019
renaming .issue.RTRDTB to issue
recv_files(sysstat)
data recv 233 at 0
got file_sum
set modtime of .sysstat.pEYfVS to (1567013864) Wed Aug 28 10:37:44 2019
renaming .sysstat.pEYfVS to sysstat

# The generator starts processing the file
[generator] receiving flist for dir 1
uid 52(puppet) maps to 52
gid 52(puppet) maps to 52
recv_file_name(puppet/puppet-agent.log)
received 1 names
[generator] flist start=7, used=1, low=0, high=0
[generator] i=7 2 puppet/puppet-agent.log mode=0100644 len=891 gid=(52) flags=1400
recv_file_list done
recv_generator(puppet,6)
set modtime of puppet to (1568192553) Wed Sep 11 02:02:33 2019
recv_generator(puppet/puppet-agent.log,7)
touch_up_dirs: . (0)
set modtime of . to (1567013864) Wed Aug 28 10:37:44 2019
[generator] flist_eof=1
generate_files phase=1
send_files(6, /var/log/puppet)
puppet/
send_files(7, /var/log/puppet/puppet-agent.log)
count=0 n=0 rem=0
send_files mapped /var/log/puppet/puppet-agent.log of size 891
calling match_sums /var/log/puppet/puppet-agent.log
puppet/puppet-agent.log
sending file_sum
false_alarms=0 hash_hits=0 matches=0
sender finished /var/log/puppet/puppet-agent.log
recv_files(puppet)
recv_files(puppet/puppet-agent.log)
data recv 891 at 0
got file_sum
set modtime of puppet/.puppet-agent.log.JN0aX9 to (1568192556) Wed Sep 11 02:02:36 2019
renaming puppet/.puppet-agent.log.JN0aX9 to puppet/puppet-agent.log
touch_up_dirs: puppet (1)
set modtime of puppet to (1568192553) Wed Sep 11 02:02:33 2019


send_files phase=1
recv_files phase=1
generate_files phase=2
send_files phase=2
send files finished # The sender process finishes and outputs the matching statistics and the total amount of literal data transmitted
total: matches=0  hash_hits=0  false_alarms=0 data=2057
recv_files phase=2
recv_files finished
generate_files phase=3
generate_files finished
client_run waiting on 24985

sent 2,440 bytes  received 3,055 bytes  845.38 bytes/sec
total size is 2,057  speedup is 0.37
[sender] _exit_cleanup(code=0, file=main.c, line=1179): entered
[sender] _exit_cleanup(code=0, file=main.c, line=1179): about to call exit(0)
[root@dev8 ~]#

1.5.2 Incremental transfer execution process analysis

The command to execute is:

[yuatang@dev8 a]$ rsync -a -vvvv /mnt/disks/2/a/ ci-user@10.176.6.24:/data01/dev8-ck-backup/a
# Use ssh(ssh is the default remote shell) to execute the remote rsync command to establish a connection
cmd=<NULL> machine=10.176.6.24 user=ci-user path=/data01/dev8-ck-backup/a
cmd[0]=ssh cmd[1]=-l cmd[2]=ci-user cmd[3]=10.176.6.24 cmd[4]=rsync cmd[5]=--server cmd[6]=-vvvvlogDtpre.iLsfxC cmd[7]=. cmd[8]=/data01/dev8-ck-backup/a
opening connection using: ssh -l ci-user 10.176.6.24 rsync --server -vvvvlogDtpre.iLsfxC . /data01/dev8-ck-backup/a  (9 args)
msg checking charset: ANSI_X3.4-1968
The authenticity of host '10.176.6.24 (10.176.6.24)' can't be established.
ECDSA key fingerprint is SHA256:8gCByzMuPJc6E3bieuGbsqXT3TeGnd2M5+yRwaW8oEs.
ECDSA key fingerprint is MD5:05:a1:11:0e:1b:cb:6b:64:fe:ec:7f:56:4b:3e:87:2c.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.176.6.24' (ECDSA) to the list of known hosts.
*******************************************************************************
                             NOTICE:

This system/network is for the use of authorized users for legitimate
business purposes only. The unauthorized access, use or modification of
this system or of the data contained therein or in transit to/from it
is a criminal violation of federal and state laws.

All individuals using this computer system are subject to having their
activities on this system monitored and recorded by systems personnel.
Anyone using this system expressly consents to such monitoring. Any
evidence of suspected criminal activity revealed by such monitoring may
be provided to law enforcement officials by systems personnel.
*******************************************************************************

Password:
# Both ends send their protocol version to each other and negotiate down to the lower one
(Server) Protocol versions: remote=31, negotiated=31
(Client) Protocol versions: remote=31, negotiated=31
sending incremental file list
# Five incremental files need to be sent
[sender] make_file(.,*,0)
[sender] pushing local filters for /mnt/disks/2/a/
[sender] make_file(paypal,*,2)
[sender] make_file(c.txt,*,2)
[sender] make_file(b,*,2)
[sender] make_file(guangzhou.txt,*,2)
# Indicates that the list starts from item 1 and confirms that there are 5 items to transmit to the receiver this time. (What does the 4 mean?)
[sender] flist start=1, used=5, low=0, high=4
[sender] i=1 /mnt/disks/2/a ./ mode=040755 len=4,096 uid=999 gid=999 flags=1005
[sender] i=2 /mnt/disks/2/a c.txt mode=0100644 len=0 uid=999 gid=999 flags=1000
[sender] i=3 /mnt/disks/2/a guangzhou.txt mode=0100644 len=84 uid=999 gid=999 flags=1000
[sender] i=4 /mnt/disks/2/a b/ mode=040755 len=4,096 uid=999 gid=999 flags=1004
[sender] i=5 /mnt/disks/2/a paypal/ mode=040755 len=4,096 uid=999 gid=999 flags=1004
send_file_list done
# At this point the file names to be transferred have been put into the file list and the list has been sent
file list sent

# Start sending file content related information
send_files starting
[sender] pushing local filters for /mnt/disks/2/a/b/
[sender] make_file(b/c,*,2)
[sender] make_file(b/ccc.txt,*,2)
[sender] flist start=7, used=2, low=0, high=1
[sender] i=7 /mnt/disks/2/a b/ccc.txt mode=0100644 len=0 uid=999 gid=999 flags=1000
[sender] i=8 /mnt/disks/2/a b/c/ mode=040755 len=4,096 uid=999 gid=999 flags=1004
[sender] pushing local filters for /mnt/disks/2/a/b/c/
[sender] make_file(b/c/e,*,2)
[sender] flist start=10, used=1, low=0, high=0
[sender] i=10 /mnt/disks/2/a b/c/e/ mode=040755 len=4,096 uid=999 gid=999 flags=1004
[sender] pushing local filters for /mnt/disks/2/a/b/c/e/
[sender] make_file(b/c/e/f,*,2)
[sender] flist start=12, used=1, low=0, high=0
[sender] i=12 /mnt/disks/2/a b/c/e/f/ mode=040755 len=4,096 uid=999 gid=999 flags=1004
[sender] pushing local filters for /mnt/disks/2/a/b/c/e/f/
[sender] make_file(b/c/e/f/shanghai.txt,*,2)
[sender] flist start=14, used=1, low=0, high=0
[sender] i=14 /mnt/disks/2/a b/c/e/f/shanghai.txt mode=0100644 len=22 uid=999 gid=999 flags=1000
[sender] pushing local filters for /mnt/disks/2/a/paypal/
[sender] make_file(paypal/ceshi.txt,*,2)
[sender] make_file(paypal/a,*,2)
[sender] flist start=16, used=2, low=0, high=1
[sender] i=16 /mnt/disks/2/a paypal/ceshi.txt mode=0100644 len=227 uid=999 gid=999 flags=1000
[sender] i=17 /mnt/disks/2/a paypal/a/ mode=040755 len=4,096 uid=999 gid=999 flags=1004
[sender] pushing local filters for /mnt/disks/2/a/paypal/a/
[sender] make_file(paypal/a/a.txt,*,2)
[sender] flist start=19, used=1, low=0, high=0
[sender] i=19 /mnt/disks/2/a paypal/a/a.txt mode=0100644 len=0 uid=999 gid=999 flags=1000
[sender] flist_eof=1

# First, start the rsync process on the server side
server_recv(2) starting pid=5337
uid 999(polkitd) maps to 999
process has 2 gids:  991 1010
gid 999(input) maps to 999
recv_file_name(.)
recv_file_name(paypal)
recv_file_name(c.txt)
recv_file_name(b)
recv_file_name(guangzhou.txt)
received 5 names
[Receiver] flist start=1, used=5, low=0, high=4
[Receiver] i=1 0 ./ mode=040755 len=4,096 gid=(999) flags=1405
[Receiver] i=2 1 c.txt mode=0100644 len=0 gid=(999) flags=1400
[Receiver] i=3 1 guangzhou.txt mode=0100644 len=84 gid=(999) flags=1400
[Receiver] i=4 1 b/ mode=040755 len=4,096 gid=(999) flags=1404
[Receiver] i=5 1 paypal/ mode=040755 len=4,096 gid=(999) flags=1404
recv_file_list done
# First data reception completed

get_local_name count=5 /data01/dev8-ck-backup/a # Get local pathname
# The generator process starts
generator starting pid=5337
# Enable incremental transfer algorithm
delta-transmission enabled
# The above generator process has been set

# First process the received ordinary files
recv_generator(.,0)
set modtime of . to (1631167647) Wed Sep  8 23:07:27 2021
# The generator receives the file directory with file id=1 notified by the receiver process
recv_generator(.,1)
recv_generator(c.txt,2)
c.txt is uptodate  # c.txt file is already up to date
recv_generator(guangzhou.txt,3)
gen mapped guangzhou.txt of size 75
generating and sending sums for 3
count=1 rem=75 blength=700 s2length=2 flength=75
chunk[0] offset=0 len=75 sum1=e6d21c1e
recv_generator(b,4)
recv_generator(paypal,5)
send_files(0, /mnt/disks/2/a/.)
./
send_files(2, /mnt/disks/2/a/c.txt)
send_files(3, /mnt/disks/2/a/guangzhou.txt)
# This item is the information of the data blocks divided by the file / mnt/disks/2/a/guangzhou.txt on the target host. count represents the quantity, n represents the fixed size (700) of the data block, and rem represents the remaining data length, that is, the size of the last data block
count=1 n=700 rem=75
# chunk[0] is the first data block, with a length of 75; offset=0 is its starting offset in the file
chunk[0] len=75 offset=0 sum1=e6d21c1e
send_files mapped /mnt/disks/2/a/guangzhou.txt of size 84
calling match_sums /mnt/disks/2/a/guangzhou.txt  # Call check code matching function
guangzhou.txt
built hash table   # Build hash table
hash search b=700 len=84
sum=ca581e7a k=84
hash search s->blength=700 len=84 count=1
done hash search
sending file_sum  # After the data block matching is completed, send the file level checksum to the receiver
false_alarms=0 hash_hits=0 matches=0  # Output statistics during matching
sender finished /mnt/disks/2/a/guangzhou.txt  # guangzhou.txt file transfer completed

recv_files(5) starting
[receiver] receiving flist for dir 1
recv_file_name(b/c)
recv_file_name(b/ccc.txt)
received 2 names
[receiver] flist start=7, used=2, low=0, high=1
[receiver] i=7 2 b/ccc.txt mode=0100644 len=0 gid=(999) flags=1400
[receiver] i=8 2 b/c/ mode=040755 len=4,096 gid=(999) flags=1404
recv_file_list done
# The file information in directory b has been received

[receiver] receiving flist for dir 3
recv_file_name(b/c/e)
received 1 names
[receiver] flist start=10, used=1, low=0, high=0
[receiver] i=10 3 b/c/e/ mode=040755 len=4,096 gid=(999) flags=1404
recv_file_list done
# b/c/e/ directory file list received

[receiver] receiving flist for dir 4
recv_file_name(b/c/e/f)
received 1 names
[receiver] flist start=12, used=1, low=0, high=0
[receiver] i=12 4 b/c/e/f/ mode=040755 len=4,096 gid=(999) flags=1404
recv_file_list done
# b/c/e/f file directory information received 

[receiver] receiving flist for dir 5
recv_file_name(b/c/e/f/shanghai.txt)
received 1 names
[receiver] flist start=14, used=1, low=0, high=0
[receiver] i=14 5 b/c/e/f/shanghai.txt mode=0100644 len=22 gid=(999) flags=1400
recv_file_list done
# b/c/e/f/shanghai.txt file information received

[receiver] receiving flist for dir 2
recv_file_name(paypal/ceshi.txt)
recv_file_name(paypal/a)
received 2 names
[receiver] flist start=16, used=2, low=0, high=1
[receiver] i=16 2 paypal/ceshi.txt mode=0100644 len=227 gid=(999) flags=1400
[receiver] i=17 2 paypal/a/ mode=040755 len=4,096 gid=(999) flags=1404
recv_file_list done
[receiver] receiving flist for dir 6
recv_file_name(paypal/a/a.txt)
received 1 names
[receiver] flist start=19, used=1, low=0, high=0
[receiver] i=19 3 paypal/a/a.txt mode=0100644 len=0 gid=(999) flags=1400
recv_file_list done
[receiver] flist_eof=1

# The generator process starts processing b/c and b/ccc.txt and comparing their checksums
[generator] receiving flist for dir 1
recv_file_name(b/c)
recv_file_name(b/ccc.txt)
received 2 names
[generator] flist start=7, used=2, low=0, high=1
[generator] i=7 2 b/ccc.txt mode=0100644 len=0 gid=(999) flags=1400
[generator] i=8 2 b/c/ mode=040755 len=4,096 gid=(999) flags=1404
recv_file_list done
recv_generator(b,6)
recv_generator(b/ccc.txt,7)
b/ccc.txt is uptodate
recv_generator(b/c,8)
[generator] receiving flist for dir 3
recv_file_name(b/c/e)
received 1 names
[generator] flist start=10, used=1, low=0, high=0
[generator] i=10 3 b/c/e/ mode=040755 len=4,096 gid=(999) flags=1404
recv_file_list done
recv_generator(b/c,9)
recv_generator(b/c/e,10)
[generator] receiving flist for dir 4
recv_file_name(b/c/e/f)
received 1 names
[generator] flist start=12, used=1, low=0, high=0
[generator] i=12 4 b/c/e/f/ mode=040755 len=4,096 gid=(999) flags=1404
recv_file_list done
recv_generator(b/c/e,11)
recv_generator(b/c/e/f,12)
[generator] receiving flist for dir 5
recv_file_name(b/c/e/f/shanghai.txt)
received 1 names
[generator] flist start=14, used=1, low=0, high=0
[generator] i=14 5 b/c/e/f/shanghai.txt mode=0100644 len=22 gid=(999) flags=1400
recv_file_list done
recv_generator(b/c/e/f,13)
recv_generator(b/c/e/f/shanghai.txt,14)
b/c/e/f/shanghai.txt is uptodate
# b/c/e/f/shanghai.txt is unchanged, so it is already up to date and is not transferred

# The generator process starts processing the file list entries paypal/ceshi.txt and paypal/a
[generator] receiving flist for dir 2
recv_file_name(paypal/ceshi.txt)
recv_file_name(paypal/a)
received 2 names
[generator] flist start=16, used=2, low=0, high=1
[generator] i=16 2 paypal/ceshi.txt mode=0100644 len=227 gid=(999) flags=1400
[generator] i=17 2 paypal/a/ mode=040755 len=4,096 gid=(999) flags=1404
recv_file_list done
recv_generator(paypal,15)
set modtime of paypal to (1631167681) Wed Sep  8 23:08:01 2021
recv_generator(paypal/ceshi.txt,16)
gen mapped paypal/ceshi.txt of size 118
generating and sending sums for 16
count=1 rem=118 blength=700 s2length=2 flength=118
chunk[0] offset=0 len=118 sum1=30e828ec
recv_generator(paypal/a,17)
[generator] receiving flist for dir 6
recv_file_name(paypal/a/a.txt)
received 1 names
[generator] flist start=19, used=1, low=0, high=0
[generator] i=19 3 paypal/a/a.txt mode=0100644 len=0 gid=(999) flags=1400
recv_file_list done
recv_generator(paypal/a,18)
recv_generator(paypal/a/a.txt,19)
paypal/a/a.txt is uptodate
# paypal/a/a.txt is up-to-date because there is no change
[generator] flist_eof=1
generate_files phase=1
send_files(6, /mnt/disks/2/a/b)
send_files(7, /mnt/disks/2/a/b/ccc.txt)
send_files(9, /mnt/disks/2/a/b/c)
send_files(11, /mnt/disks/2/a/b/c/e)
send_files(13, /mnt/disks/2/a/b/c/e/f)
send_files(14, /mnt/disks/2/a/b/c/e/f/shanghai.txt)
send_files(15, /mnt/disks/2/a/paypal)
paypal/
send_files(16, /mnt/disks/2/a/paypal/ceshi.txt)
count=1 n=700 rem=118
chunk[0] len=118 offset=0 sum1=30e828ec
send_files mapped /mnt/disks/2/a/paypal/ceshi.txt of size 227
calling match_sums /mnt/disks/2/a/paypal/ceshi.txt
paypal/ceshi.txt
built hash table
hash search b=700 len=227
sum=f23d3e41 k=227
hash search s->blength=700 len=227 count=1
done hash search
sending file_sum
false_alarms=0 hash_hits=0 matches=0
sender finished /mnt/disks/2/a/paypal/ceshi.txt
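The "built hash table" / "hash search" / "false_alarms=0 hash_hits=0 matches=0" lines correspond to the sender's matching loop: it indexes the receiver's block checksums in a hash table, slides a window over its own copy of the file, and only accepts a block when the hash bucket, the weak checksum, and the strong checksum all agree. In this trace nothing matches at block granularity, so the whole file is sent as literal data. A simplified sketch of that loop (Python; the names and token format are ours, not rsync's):

```python
import hashlib
from collections import defaultdict

M = 1 << 16

def weak_checksum(block: bytes) -> int:
    # Same weak checksum as in the earlier sketch.
    s1 = sum(block) % M
    s2 = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return (s2 << 16) | s1

def strong_sum(block: bytes, s2length: int = 2) -> bytes:
    # Truncated strong checksum; rsync uses MD4/MD5, MD5 here is illustrative.
    return hashlib.md5(block).digest()[:s2length]

def match_blocks(local: bytes, remote_sums, blength: int, s2length: int = 2):
    """Yield ('match', block_index) or ('literal', bytes) tokens.

    remote_sums is the list of (weak, strong) pairs the generator sent for
    the receiver's blocks. Real rsync rolls the weak checksum in O(1) per
    byte and batches literal bytes into runs; this sketch recomputes it and
    emits literals one byte at a time for clarity.
    """
    table = defaultdict(list)                          # "built hash table"
    for idx, (weak, strong) in enumerate(remote_sums):
        table[weak & 0xFFFF].append((idx, weak, strong))

    i = 0
    while i < len(local):
        window = local[i:i + blength]
        weak = weak_checksum(window)
        matched = False
        for idx, w, s in table.get(weak & 0xFFFF, ()):     # hash_hits
            if w != weak:
                continue
            if strong_sum(window, s2length) != s:          # false_alarms
                continue
            yield ('match', idx)                           # matches
            i += len(window)
            matched = True
            break
        if not matched:
            yield ('literal', local[i:i + 1])
            i += 1
    # With no matches at all, every byte ends up as literal data,
    # which is what matches=0 reflects in the statistics above.
```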
send_files(18, /mnt/disks/2/a/paypal/a)
send_files(19, /mnt/disks/2/a/paypal/a/a.txt)
recv_files(.)
recv_files(c.txt)
recv_files(guangzhou.txt)
recv mapped guangzhou.txt of size 75
data recv 84 at 0
got file_sum
set modtime of .guangzhou.txt.2XJrfW to (1631167646) Wed Sep  8 23:07:26 2021
renaming .guangzhou.txt.2XJrfW to guangzhou.txt
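The recv_files lines above show the other half of the protocol: the receiver maps its basis file ("recv mapped guangzhou.txt of size 75"), writes the incoming literal data ("data recv 84 at 0") plus any matched blocks into a hidden temporary file, checks the whole-file checksum ("got file_sum"), restores the modification time, and renames the temporary file over the original. A hypothetical sketch of that reconstruction step (the helper and token format are ours, not rsync's code; file_sum verification is omitted for brevity):

```python
import os
import tempfile

def rebuild(basis_path: str, target_path: str, tokens, blength: int, mtime: float):
    """Rebuild target_path from a stream of ('match', block_index) and
    ('literal', bytes) tokens, copying matched blocks out of the local
    basis file. Mirrors the ".guangzhou.txt.2XJrfW" -> "guangzhou.txt"
    temp-file-and-rename pattern seen in the log."""
    dir_name = os.path.dirname(target_path) or "."
    with open(basis_path, "rb") as basis, \
         tempfile.NamedTemporaryFile(dir=dir_name, delete=False) as tmp:
        for kind, value in tokens:
            if kind == 'match':
                basis.seek(value * blength)   # copy a block we already have
                tmp.write(basis.read(blength))
            else:
                tmp.write(value)              # literal bytes from the sender
        tmp_path = tmp.name
    os.utime(tmp_path, (mtime, mtime))        # "set modtime of ..."
    os.replace(tmp_path, target_path)         # "renaming ... to ..."
```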
touch_up_dirs: . (0)
set modtime of . to (1631167647) Wed Sep  8 23:07:27 2021
touch_up_dirs: b (1)
touch_up_dirs: b/c (3)
touch_up_dirs: b/c/e (4)
touch_up_dirs: b/c/e/f (5)
recv_files(b)
recv_files(b/ccc.txt)
recv_files(b/c)
recv_files(b/c/e)
recv_files(b/c/e/f)
recv_files(b/c/e/f/shanghai.txt)
recv_files(paypal)
recv_files(paypal/ceshi.txt)
recv mapped paypal/ceshi.txt of size 118
data recv 227 at 0
got file_sum
set modtime of paypal/.ceshi.txt.Nk2NXb to (1631167681) Wed Sep  8 23:08:01 2021
renaming paypal/.ceshi.txt.Nk2NXb to paypal/ceshi.txt
touch_up_dirs: paypal (2)
set modtime of paypal to (1631167681) Wed Sep  8 23:08:01 2021
touch_up_dirs: paypal/a (6)
recv_files(paypal/a)
recv_files(paypal/a/a.txt)
send_files phase=1
recv_files phase=1
generate_files phase=2
send_files phase=2
send files finished
total: matches=0  hash_hits=0  false_alarms=0 data=311
recv_files phase=2
recv_files finished
generate_files phase=3
generate_files finished
client_run waiting on 21757

sent 821 bytes  received 6,800 bytes  461.88 bytes/sec
total size is 333  speedup is 0.04
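# speedup = total size / (bytes sent + bytes received) = 333 / (821 + 6,800) ≈ 0.04; a value below 1 simply means the protocol overhead (file lists, checksums) outweighed the tiny 333-byte payload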
[sender] _exit_cleanup(code=0, file=main.c, line=1179): entered
[sender] _exit_cleanup(code=0, file=main.c, line=1179): about to call exit(0)

1.6 analyzing rsync's usage scenarios from its working principle

(1) What resources are consumed by rsync?
As the preceding analysis shows, the sender side of rsync consumes a lot of CPU because it has to compute and compare many checksums, and the receiver side consumes a lot of I/O because it has to copy data out of the basis file when reconstructing the target file. This is only the case for incremental transmission, however. For a full transmission (for example the first synchronization, or when the full-transfer option "--whole-file" is used explicitly), the sender does not need to compute or compare checksums and the receiver does not need to copy from a basis file, so the resource consumption is comparable to that of scp.
(2) rsync is not suitable for real-time synchronization of database files.
Large files such as database files are accessed and modified frequently. If rsync were used to synchronize them in real time, the sender would have to compute and compare a large number of block checksums, keeping its CPU busy for long stretches and hurting the performance of the database service. The receiver, in turn, would have to copy most of the (identical) data blocks out of a huge basis file (database files are commonly tens of gigabytes or more) and reassemble a new file each time, which is almost equivalent to a plain cp of the whole file; no machine can sustain that kind of I/O pressure for long.
Therefore, for a single large file that changes frequently, rsync is only suitable for occasional synchronization, that is, as a backup tool; it is not suitable for real-time synchronization. For files such as database files, real-time synchronization should rely on the database's own replication mechanism.

(3) rsync can be used to synchronize a large number of small files in real time
Because rsync synchronizes incrementally, the sender does not send files that already exist on the receiver and are identical to the sender's copies (same file size and last modification time), so both sides only have to process the small number of files that actually changed. And because the files themselves are small, neither the sender's CPU nor the receiver's I/O becomes a bottleneck.
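A minimal sketch of that file-level decision, assuming a plain size-and-mtime comparison (illustrative only; real rsync works from its transmitted file lists, and options such as --checksum or --size-only change the rule):

```python
import os

def needs_transfer(src: str, dst: str) -> bool:
    """True if dst is missing or differs from src in size or mtime."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)
```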
Note, however, that rsync's real-time synchronization is achieved with the help of external tools such as inotify+rsync and sersync. These tools must be configured sensibly, otherwise real-time synchronization will still be inefficient; in that case the inefficiency comes from the tool configuration, not from rsync itself.
