SeafileServer Source Analysis Version Management System: Submit Differences and Merge

2021SC@SDUSC

Submit Differences

To facilitate user review and prepare for merge, you need to design an algorithm to count the differences between two submissions (branches). Before thinking about the specific flow of the algorithm, you need to summarize the types of differences.

Difference Type

If there are two commit commits, Parent, let's assume we consider the differences between Commits and Parents. That is, Commit is my submission and Parent is someone else's submission. Now you need to consider how my submission differs from others'submissions.

File Differences

difference	Explain	Symbol
Add to	Commit added the file relative to Parent.	A
delete	Commit deleted the file relative to Parent.	D
modify	Commit added the file relative to Parent.	M
rename	Commit renamed the file relative to Parent and the contents have not been modified.	R

Catalog differences

difference	Explain	Symbol
Add to	Commit added the directory relative to Parent.	B
delete	Commit deleted the directory relative to Parent.	C
rename	Commit renamed the directory relative to Parent, and the link has not been modified.	E

Unmerged differences

Special cases in the three-way difference.

difference	Explain
STATUS_UNMERGED_NONE	Not merged
STATUS_UNMERGED_BOTH_CHANGED	Both are modified: the same file is modified to a different result
STATUS_UNMERGED_BOTH_ADDED	Both are added: both merges add the same file
STATUS_UNMERGED_I_REMOVED	One removes the file, the other modifies it
STATUS_UNMERGED_OTHERS_REMOVED	One modifies the file, the other removes the file
STATUS_UNMERGED_DFC_I_ADDED_FILE	One replaces the directory with a file, the other modifies the file in the directory
STATUS_UNMERGED_DFC_OTHERS_ADDED_FILE	One modifies the files in the directory and the other replaces the directory with a file

(The purpose in actual use is unknown because no relevant implementation code has been found)

algorithm

Technological process

Homogeneous Recursion

The goal is to recursively traverse the parts of each file tree that have the same location. The pseudocode is as follows:

File tree isomorphic recursion ( Tree[s][n]: n A directory with the same location on the road file tree:
	Dirents[n]: n Catalog items with the same location on the road file tree
	While (1): 
		Get isomorphic catalog items to Dirents
		If n Road Dirents Exactly the same:
	    	Return
		File Difference Processing ( Dirents)
		Catalog Difference Processing ( Dirents)

File Difference Processing ( Dirents[n]: n Files with the same location on the road file tree:
	Get the corresponding Files，Then call the difference callback function

Catalog Difference Processing ( Dirents[n]: n Files with the same location on the road file tree:
	Get the corresponding Dir，Then call the difference callback function
	File tree isomorphic recursion ( Dirs)

Homogeneous catalog items

It is important for isomorphic recursion to be able to achieve multichannel location isomorphism that you need to obtain isomorphic catalog items, which require some benchmarks. Returning to the nature of location isomorphism, it essentially requires that nodes have exactly the same path to the root. Consider searching for binary/multifork trees, because there are weights, so it is easy to find paths where nodes in multiple trees have exactly the same root weights. So what can you use as weights in a file tree? File/directory name.

We think that directory items with the same name are also positionally isomorphic in multi-location isomorphic directories. Based on this assumption, we can quickly devise an efficient and complex way to traverse isomorphisms. Consider sorting directory items by name (which is also the default when getting Seafdir), then we can use a similar merge sort method to get multiple identical directory items based on multiple pointers.

Of course, knowing the isomorphism alone is still not enough, because our purpose is to know the difference of isomorphism, so we need to use the difference algorithm further to get the difference.

Two way difference and three way difference

There are two different algorithms for generating submissions:

Two-way difference: Applies to normal submissions. Determine the difference between Commit and Parent. Non-differentiated content can be identified directly, while content needs to be judged for modification.
Three-way difference: Applies to merged submissions. Determine the difference between Commit and ParentA and ParentB. You can distinguish the addition or deletion of Commits relative to both. However, the modification still needs to be judged by the content.

Neither the two-way nor the three-way differences tell whether the user renamed the file or directory, because in name-based location isomorphism, renaming is equivalent to adding after deletion. This will be handled in post-processing.

Postprocessing

Approximate judgment of rename

It is known that in name-based homogeneity, if a file is renamed, it may be judged both deleted and increased. So how do you restore the real state?

The answer is that you only need to find the same content, and if a delete-add pair exists under the same content, the directory or file is considered renamed.

Note that this is an approximation, and it is possible that the user really deletes and increases, but since we do not monitor the user's behavior, it is considered renamed.
Redundant empty directory

Another point where a bug occurs is in the empty directory. An empty directory is a special existence where all empty directories point to the same empty directory object because their contents are the same, so SHA1, the object name, is the same. Therefore, there are two cases where an empty directory can be misjudged:
1. After adding files to an empty directory, it is determined that the empty directory has been deleted.
2. When a directory is empty, it is determined that the empty directory is increased.
So how do you handle both cases? In fact, it is very simple to traverse through all directories with added or deleted states, then make a special judgment and correct the difference type.

Realization

Differential Objects

typedef struct DiffEntry { 	// Differential Objects
    char type; 				// Difference Type
    char status; 			// Difference status
    int unmerge_state; 		// Unmerged state
    unsigned char sha1[20]; // Used to resolve rename issues
    char *name; 			// Name
    char *new_name;         // New name, used only for rename scenarios
    gint64 size; 			// Size
    gint64 origin_size;     // Original size, used only to modify the situation
} DiffEntry;

Difference Comparison Options

typedef struct DiffOptions { // Comparison Options
    char store_id[37]; 		// Storage id
    int version; 			// seafile version
    // Two callbacks
    DiffFileCB file_cb; 	// File Difference Processing Callback
    DiffDirCB dir_cb;		// Directory Difference Processing Callback
    void *data; 			// User parameters
} DiffOptions;

The directory and file callback functions need to be set when the structure is passed in to the difference comparison operation in use. Callback is the difference algorithm.

Difference comparison algorithm

Homogeneous Location Recursion

static int // File Tree Homogeneous Recursion
diff_trees_recursive (int n, SeafDir *trees[],
                      const char *basedir, DiffOptions *opt);
int // File Tree Differential Processing, Encapsulation
diff_trees (int n, const char *roots[], DiffOptions *opt)
static int // Catalog Difference Processing
diff_directories (int n, SeafDirent *dents[], const char *basedir, DiffOptions *opt)
static int // File Difference Processing
diff_files (int n, SeafDirent *dents[], const char *basedir, DiffOptions *opt)

Difference algorithm

static int // Two-way File Difference Processing
twoway_diff_files (int n, const char *basedir, SeafDirent *files[], void *vdata)
static int // Two-way directory difference handling
twoway_diff_dirs (int n, const char *basedir, SeafDirent *dirs[], void *vdata,
                  gboolean *recurse)
static int // Three-way File Differential Processing
threeway_diff_files (int n, const char *basedir, SeafDirent *files[], void *vdata)
static int // Three-way directory difference handling (default no operation)
threeway_diff_dirs (int n, const char *basedir, SeafDirent *dirs[], void *vdata,
                    gboolean *recurse)

One of the little things worth mentioning is how to judge if the content is the same. This again reflects the benefits of the SHA1 Abstract naming: we only need to compare the IDs of two file system objects to directly determine if their contents are the same.

Postprocessing

void // Resolve renaming issues
diff_resolve_renames (GList **diff_entries)
void // Solve the problem of redundant empty directories
diff_resolve_empty_dirs (GList **diff_entries)

encapsulation

General Submission

int // Compare differences between two submissions
diff_commits (SeafCommit *commit1, SeafCommit *commit2, GList **results, // Results are recorded in the results list
              gboolean fold_dir_diff);
int // Compare the differences between the two submissions; Given Root Directory
diff_commit_roots (const char *store_id, int version,
                   const char *root1, const char *root2, GList **results,
                   gboolean fold_dir_diff);

Submit after merge

int // Compare differences before and after merging (compared with two parent submissions)
diff_merge (SeafCommit *merge, GList **results, gboolean fold_dir_diff);
int // Compare the differences before and after merging; Given Root Directory
diff_merge_roots (const char *store_id, int version,
                  const char *merged_root, const char *p1_root, const char *p2_root,
                  GList **results, gboolean fold_dir_diff);

Branch Merge

Merger, Conflict and Resolution

Merging is an important means of combining multiple submissions (branches). A merged object is a merge of two submissions, but is generally considered a merge of branches because branches are required as a submission pointer. The nature of branch merging in the graph is that the degree equals two, that is, the generation of a submission is generated by two parent submissions and can carry incremental content.

Conflict resolution is the most important issue when merging two submissions (branches). Conflict judgments are actually similar to difference judgments, all made with the help of location homogeneity, and are not described here. The key issue here is how to determine and resolve conflicts. Conflicts are sensitive to users, who need to choose which content to put in a new version by comparing it. Conflicts can also be user-defined, and all non-identical content can be conflicts.

There are two solutions: the first is two-way merger, the second is three-way merge.

Two-way merge is very simple, only the same content is retained, deletion, addition and modification are all left to the user's choice. Because it is entirely impossible to decide whether to keep or discard the non-identical contents of the two paths. This is easy to do, but cumbersome for users, but it's very safe because the conflicting choices are entirely left to the users and can be completely user-defined.

The three-way merge has some autonomy. It introduces Base, the common ancestor of the two branches, and then uses Base to decide whether to keep the content or leave it to the user. This improves efficiency, but is not necessarily insured, because conflicts between algorithm definitions may not be user-defined.

Three-way merger

The merged participants consist of three branches, Base, Head, Remote. Our goal is to merge Head and Remote, and Base is their common ancestor.

After the introduction of Base, all the differences have criteria to judge. The algorithm considers that changes (including additions and deletions) that do not cross between the two branches relative to their ancestors are preserved, while cross-changes are conflicting. Such a conflict definition actually fits most situations and further narrows the potential scope of conflict, so to speak, a balance has been found between efficiency and user needs.

We use tables to represent the possible occurrence of three branches in a merge. Assuming that the content of the isomorphic portion of each branch is capitalized, the following merge results can be obtained:

Base	Head	Remote	Result	Explain
A	A	A	A	Same Content
A	A	B	B	If only one of the parties has changed, then choose the one that changed
A	B	A	B	Ditto
A	B	B	B	If both parties have the same changes, choose the modified
A	B	C	conflict	If both sides have changed and are different, a conflict is reported and needs to be resolved by the user

From: https://blog.csdn.net/longintchar/article/details/83049840

From the table above, we can directly show how the merging of isomorphic locations is handled. So the three-way merge algorithm also appears naturally: apply the previous isomorphic recursion, and then implement the judgment in the table.

Realization

Function recursion and invocation flow as shown in Fig.

The content is to implement the three-way merge algorithm (the two-way merge is not implemented, only callback functions will be called).

Finally, you need to clarify the storage of the merged results. For merged results, there are two options:If do_merge=true, writes directly to the hard disk and returns a root directory id to merged_tree_root; Conversely, only callback functions are called.

No more detailed source code will be pasted here as follows: merge-new.c.

Posted by Chotor on Wed, 17 Nov 2021 09:44:04 -0800

Programmer Group

SeafileServer Source Analysis Version Management System: Submit Differences and Merge

Submit Differences

Difference Type

File Differences

Catalog differences

Unmerged differences

algorithm

Technological process

Two way difference and three way difference

Postprocessing

Realization

Differential Objects

Difference Comparison Options

Difference comparison algorithm

encapsulation

Branch Merge

Merger, Conflict and Resolution

Three-way merger

Realization

Hot Keywords