SeafileServer Source Analysis Version Management System: Submit Differences and Merge

Keywords: git Algorithm

2021SC@SDUSC

Submit Differences

To facilitate user review and prepare for merge, you need to design an algorithm to count the differences between two submissions (branches). Before thinking about the specific flow of the algorithm, you need to summarize the types of differences.

Difference Type

If there are two commit commits, Parent, let's assume we consider the differences between Commits and Parents. That is, Commit is my submission and Parent is someone else's submission. Now you need to consider how my submission differs from others'submissions.

File Differences

differenceExplainSymbol
Add toCommit added the file relative to Parent.A
deleteCommit deleted the file relative to Parent.D
modifyCommit added the file relative to Parent.M
renameCommit renamed the file relative to Parent and the contents have not been modified.R

Catalog differences

differenceExplainSymbol
Add toCommit added the directory relative to Parent.B
deleteCommit deleted the directory relative to Parent.C
renameCommit renamed the directory relative to Parent, and the link has not been modified.E

Unmerged differences

Special cases in the three-way difference.

differenceExplain
STATUS_UNMERGED_NONENot merged
STATUS_UNMERGED_BOTH_CHANGEDBoth are modified: the same file is modified to a different result
STATUS_UNMERGED_BOTH_ADDEDBoth are added: both merges add the same file
STATUS_UNMERGED_I_REMOVEDOne removes the file, the other modifies it
STATUS_UNMERGED_OTHERS_REMOVEDOne modifies the file, the other removes the file
STATUS_UNMERGED_DFC_I_ADDED_FILEOne replaces the directory with a file, the other modifies the file in the directory
STATUS_UNMERGED_DFC_OTHERS_ADDED_FILEOne modifies the files in the directory and the other replaces the directory with a file

(The purpose in actual use is unknown because no relevant implementation code has been found)

algorithm

Technological process

  • Homogeneous Recursion

    The goal is to recursively traverse the parts of each file tree that have the same location. The pseudocode is as follows:

    File tree isomorphic recursion ( Tree[s][n]: n A directory with the same location on the road file tree:
    	Dirents[n]: n Catalog items with the same location on the road file tree
    	While (1): 
    		Get isomorphic catalog items to Dirents
    		If n Road Dirents Exactly the same:
    	    	Return
    		File Difference Processing ( Dirents)
    		Catalog Difference Processing ( Dirents)
    
    File Difference Processing ( Dirents[n]: n Files with the same location on the road file tree:
    	Get the corresponding Files,Then call the difference callback function
    
    Catalog Difference Processing ( Dirents[n]: n Files with the same location on the road file tree:
    	Get the corresponding Dir,Then call the difference callback function
    	File tree isomorphic recursion ( Dirs)
    
  • Homogeneous catalog items

    It is important for isomorphic recursion to be able to achieve multichannel location isomorphism that you need to obtain isomorphic catalog items, which require some benchmarks. Returning to the nature of location isomorphism, it essentially requires that nodes have exactly the same path to the root. Consider searching for binary/multifork trees, because there are weights, so it is easy to find paths where nodes in multiple trees have exactly the same root weights. So what can you use as weights in a file tree? File/directory name.

    We think that directory items with the same name are also positionally isomorphic in multi-location isomorphic directories. Based on this assumption, we can quickly devise an efficient and complex way to traverse isomorphisms. Consider sorting directory items by name (which is also the default when getting Seafdir), then we can use a similar merge sort method to get multiple identical directory items based on multiple pointers.

Of course, knowing the isomorphism alone is still not enough, because our purpose is to know the difference of isomorphism, so we need to use the difference algorithm further to get the difference.

Two way difference and three way difference

There are two different algorithms for generating submissions:

  1. Two-way difference: Applies to normal submissions. Determine the difference between Commit and Parent. Non-differentiated content can be identified directly, while content needs to be judged for modification.

  2. Three-way difference: Applies to merged submissions. Determine the difference between Commit and ParentA and ParentB. You can distinguish the addition or deletion of Commits relative to both. However, the modification still needs to be judged by the content.


Neither the two-way nor the three-way differences tell whether the user renamed the file or directory, because in name-based location isomorphism, renaming is equivalent to adding after deletion. This will be handled in post-processing.

Postprocessing

  • Approximate judgment of rename

    It is known that in name-based homogeneity, if a file is renamed, it may be judged both deleted and increased. So how do you restore the real state?

    The answer is that you only need to find the same content, and if a delete-add pair exists under the same content, the directory or file is considered renamed.

    Note that this is an approximation, and it is possible that the user really deletes and increases, but since we do not monitor the user's behavior, it is considered renamed.

  • Redundant empty directory

    Another point where a bug occurs is in the empty directory. An empty directory is a special existence where all empty directories point to the same empty directory object because their contents are the same, so SHA1, the object name, is the same. Therefore, there are two cases where an empty directory can be misjudged:

    1. After adding files to an empty directory, it is determined that the empty directory has been deleted.
    2. When a directory is empty, it is determined that the empty directory is increased.

    So how do you handle both cases? In fact, it is very simple to traverse through all directories with added or deleted states, then make a special judgment and correct the difference type.

Realization

  • Differential Objects

    typedef struct DiffEntry { 	// Differential Objects
        char type; 				// Difference Type
        char status; 			// Difference status
        int unmerge_state; 		// Unmerged state
        unsigned char sha1[20]; // Used to resolve rename issues
        char *name; 			// Name
        char *new_name;         // New name, used only for rename scenarios
        gint64 size; 			// Size
        gint64 origin_size;     // Original size, used only to modify the situation
    } DiffEntry;
    
  • Difference Comparison Options

    typedef struct DiffOptions { // Comparison Options
        char store_id[37]; 		// Storage id
        int version; 			// seafile version
        // Two callbacks
        DiffFileCB file_cb; 	// File Difference Processing Callback
        DiffDirCB dir_cb;		// Directory Difference Processing Callback
        void *data; 			// User parameters
    } DiffOptions;
    

    The directory and file callback functions need to be set when the structure is passed in to the difference comparison operation in use. Callback is the difference algorithm.

  • Difference comparison algorithm

    • Homogeneous Location Recursion
    static int // File Tree Homogeneous Recursion
    diff_trees_recursive (int n, SeafDir *trees[],
                          const char *basedir, DiffOptions *opt);
    int // File Tree Differential Processing, Encapsulation
    diff_trees (int n, const char *roots[], DiffOptions *opt)
    static int // Catalog Difference Processing
    diff_directories (int n, SeafDirent *dents[], const char *basedir, DiffOptions *opt)
    static int // File Difference Processing
    diff_files (int n, SeafDirent *dents[], const char *basedir, DiffOptions *opt)
    
    • Difference algorithm
    static int // Two-way File Difference Processing
    twoway_diff_files (int n, const char *basedir, SeafDirent *files[], void *vdata)
    static int // Two-way directory difference handling
    twoway_diff_dirs (int n, const char *basedir, SeafDirent *dirs[], void *vdata,
                      gboolean *recurse)
    static int // Three-way File Differential Processing
    threeway_diff_files (int n, const char *basedir, SeafDirent *files[], void *vdata)
    static int // Three-way directory difference handling (default no operation)
    threeway_diff_dirs (int n, const char *basedir, SeafDirent *dirs[], void *vdata,
                        gboolean *recurse)
    

    One of the little things worth mentioning is how to judge if the content is the same. This again reflects the benefits of the SHA1 Abstract naming: we only need to compare the IDs of two file system objects to directly determine if their contents are the same.

    • Postprocessing
    void // Resolve renaming issues
    diff_resolve_renames (GList **diff_entries)
    void // Solve the problem of redundant empty directories
    diff_resolve_empty_dirs (GList **diff_entries)
    
  • encapsulation

    • General Submission
    int // Compare differences between two submissions
    diff_commits (SeafCommit *commit1, SeafCommit *commit2, GList **results, // Results are recorded in the results list
                  gboolean fold_dir_diff);
    int // Compare the differences between the two submissions; Given Root Directory
    diff_commit_roots (const char *store_id, int version,
                       const char *root1, const char *root2, GList **results,
                       gboolean fold_dir_diff);
    
    • Submit after merge
    int // Compare differences before and after merging (compared with two parent submissions)
    diff_merge (SeafCommit *merge, GList **results, gboolean fold_dir_diff);
    int // Compare the differences before and after merging; Given Root Directory
    diff_merge_roots (const char *store_id, int version,
                      const char *merged_root, const char *p1_root, const char *p2_root,
                      GList **results, gboolean fold_dir_diff);
    

Branch Merge

Merger, Conflict and Resolution

Merging is an important means of combining multiple submissions (branches). A merged object is a merge of two submissions, but is generally considered a merge of branches because branches are required as a submission pointer. The nature of branch merging in the graph is that the degree equals two, that is, the generation of a submission is generated by two parent submissions and can carry incremental content.

Conflict resolution is the most important issue when merging two submissions (branches). Conflict judgments are actually similar to difference judgments, all made with the help of location homogeneity, and are not described here. The key issue here is how to determine and resolve conflicts. Conflicts are sensitive to users, who need to choose which content to put in a new version by comparing it. Conflicts can also be user-defined, and all non-identical content can be conflicts.

There are two solutions: the first is two-way merger, the second is three-way merge.

Two-way merge is very simple, only the same content is retained, deletion, addition and modification are all left to the user's choice. Because it is entirely impossible to decide whether to keep or discard the non-identical contents of the two paths. This is easy to do, but cumbersome for users, but it's very safe because the conflicting choices are entirely left to the users and can be completely user-defined.

The three-way merge has some autonomy. It introduces Base, the common ancestor of the two branches, and then uses Base to decide whether to keep the content or leave it to the user. This improves efficiency, but is not necessarily insured, because conflicts between algorithm definitions may not be user-defined.

Three-way merger

The merged participants consist of three branches, Base, Head, Remote. Our goal is to merge Head and Remote, and Base is their common ancestor.

After the introduction of Base, all the differences have criteria to judge. The algorithm considers that changes (including additions and deletions) that do not cross between the two branches relative to their ancestors are preserved, while cross-changes are conflicting. Such a conflict definition actually fits most situations and further narrows the potential scope of conflict, so to speak, a balance has been found between efficiency and user needs.

We use tables to represent the possible occurrence of three branches in a merge. Assuming that the content of the isomorphic portion of each branch is capitalized, the following merge results can be obtained:

BaseHeadRemoteResultExplain
AAAASame Content
AABBIf only one of the parties has changed, then choose the one that changed
ABABDitto
ABBBIf both parties have the same changes, choose the modified
ABCconflictIf both sides have changed and are different, a conflict is reported and needs to be resolved by the user

From: https://blog.csdn.net/longintchar/article/details/83049840

From the table above, we can directly show how the merging of isomorphic locations is handled. So the three-way merge algorithm also appears naturally: apply the previous isomorphic recursion, and then implement the judgment in the table.

Realization

Function recursion and invocation flow as shown in Fig.

The content is to implement the three-way merge algorithm (the two-way merge is not implemented, only callback functions will be called).

Finally, you need to clarify the storage of the merged results. For merged results, there are two options:If do_merge=true, writes directly to the hard disk and returns a root directory id to merged_tree_root; Conversely, only callback functions are called.

No more detailed source code will be pasted here as follows: merge-new.c.

Posted by Chotor on Wed, 17 Nov 2021 09:44:04 -0800