Python 3 standard library: difflib difference calculation tool

Keywords: Python Unix Lambda

1. difflib difference calculation tool

This module provides classes and functions for comparing sequences. For example, it can be used to compare files and produce different information in various formats, including HTML and context, as well as differences in uniform formats. For a comparison of directories and files, see the filecmp module.

class difflib.SequenceMatcher(None,s1,s2)

This is a flexible class that can be used to compare any type of sequence pair, as long as the sequence element is a hashable object. Its basic algorithm is earlier than that published by Ratcliffe and Obershelp in the late 1980s and named after the exaggerated name of "Gestalt pattern matching", and it is more interesting. The idea is to find the longest continuous matching subsequence without "junk" element, which means that it has no value in some sense, such as blank line or blank character. (dealing with garbage elements is an extension of the Ratcliff and Obershelp algorithms.) Then the same idea will be applied recursively to the left and right sequence segments of matching sequence. This doesn't produce the smallest editing sequence, but it does produce a match that people think is "right.".

1.1 comparative text

The difference class is used to process the sequence of text lines and generate human readable differences (deltas) or change differences in instruction lines. The default output generated by differ is similar to the diff command-line tool under unix, including the original input values of the table (including common values) and the tag data indicating what changes have been made.

Rows with - prefixes are in the first sequence, not the second.

Rows with a + prefix are in the second sequence, not the first.

If there is an incremental difference between versions of a line, a plus prefix is used to highlight changes in the new version.

If a row does not change, the output is printed, and there is an extra space in the left column to align it with its different output.

Before passing the text into compare(), it is decomposed into a sequence of single lines of text, which can generate more readable output than the incoming string.

import difflib

text1 = """Lorem ipsum dolor sit amet, consectetuer adipiscing
elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
pulvinar porttitor tellus. Aliquam venenatis. Donec facilisis
pharetra tortor.  In nec mauris eget magna consequat
convalis. Nam sed sem vitae odio pellentesque interdum. Sed
consequat viverra nisl. Suspendisse arcu metus, blandit quis,
rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
tristique vel, mauris. Curabitur vel lorem id nisl porta
adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate
tristique enim. Donec quis lectus a justo imperdiet tempus."""

text1_lines = text1.splitlines()

text2 = """Lorem ipsum dolor sit amet, consectetuer adipiscing
elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
pulvinar, porttitor tellus. Aliquam venenatis. Donec facilisis
pharetra tortor. In nec mauris eget magna consequat
convalis. Nam cras vitae mi vitae odio pellentesque interdum. Sed
consequat viverra nisl. Suspendisse arcu metus, blandit quis,
rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
tristique vel, mauris. Curabitur vel lorem id nisl porta
adipiscing. Duis vulputate tristique enim. Donec quis lectus a
justo imperdiet tempus.  Suspendisse eu lectus. In nunc."""

text2_lines = text2.splitlines()

d = difflib.Differ()
diff = d.compare(text1_lines, text2_lines)
print('\n'.join(diff))

Result:

The beginning of the two text segments in the sample data is the same, so the first line prints directly without any extra annotation.

The third line of the data changes and the modified text contains a comma. The data lines of both versions will be printed, and the extra information on the fifth line will show which column in the text has been modified. The characters are added here.

The next few lines of the output show that an extra space has been removed.

Then there's a more complex change that replaces multiple words in a phrase.

The last sentence in the paragraph changes the most, so the old version is completely deleted and the new version is added.

 

The output generated by the ndiff() function is basically the same, processing the text data through special "processing" and deleting the "noise" in the input.

The difference () class displays all the input lines, while the unified diff class only contains the modified text lines and some contexts. The unified_diff() function generates this output.

import difflib

text1 = """Lorem ipsum dolor sit amet, consectetuer adipiscing
elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
pulvinar porttitor tellus. Aliquam venenatis. Donec facilisis
pharetra tortor.  In nec mauris eget magna consequat
convalis. Nam sed sem vitae odio pellentesque interdum. Sed
consequat viverra nisl. Suspendisse arcu metus, blandit quis,
rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
tristique vel, mauris. Curabitur vel lorem id nisl porta
adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate
tristique enim. Donec quis lectus a justo imperdiet tempus."""

text1_lines = text1.splitlines()

text2 = """Lorem ipsum dolor sit amet, consectetuer adipiscing
elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
pulvinar, porttitor tellus. Aliquam venenatis. Donec facilisis
pharetra tortor. In nec mauris eget magna consequat
convalis. Nam cras vitae mi vitae odio pellentesque interdum. Sed
consequat viverra nisl. Suspendisse arcu metus, blandit quis,
rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
tristique vel, mauris. Curabitur vel lorem id nisl porta
adipiscing. Duis vulputate tristique enim. Donec quis lectus a
justo imperdiet tempus.  Suspendisse eu lectus. In nunc."""

text2_lines = text2.splitlines()

diff = difflib.unified_diff(
    text1_lines,
    text2_lines,
    lineterm='',
)
print('\n'.join(diff))

The lineterm parameter is used to tell unified ˊ diff() that it does not have to append line breaks to the control lines it returns, because the input lines do not include them. Line breaks are added to all lines when printing. For many users of common version control tools, the output should look familiar.

Using context_diff() produces similar continuous output.  

1.2 useless data

All functions that generate a sequence of differences take parameters to indicate which lines should be ignored and which characters in a line should be ignored. For example, these parameters can be used to skip tag or whitespace changes in both versions of the file.

from difflib import SequenceMatcher

def show_results(match):
    print('  a    = {}'.format(match.a))
    print('  b    = {}'.format(match.b))
    print('  size = {}'.format(match.size))
    i, j, k = match
    print('  A[a:a+size] = {!r}'.format(A[i:i + k]))
    print('  B[b:b+size] = {!r}'.format(B[j:j + k]))

A = " abcd"
B = "abcd abcd"

print('A = {!r}'.format(A))
print('B = {!r}'.format(B))

print('\nWithout junk detection:')
s1 = SequenceMatcher(None, A, B)
match1 = s1.find_longest_match(0, len(A), 0, len(B))
show_results(match1)

print('\nTreat spaces as junk:')
s2 = SequenceMatcher(lambda x: x == " ", A, B)
match2 = s2.find_longest_match(0, len(A), 0, len(B))
show_results(match2)

The default buffer does not explicitly ignore any lines or characters, but relies on the ability of the SequenceMatcher to detect noise. The default behavior of ndiff() is to ignore spaces and tabs.

1.3 compare any type

The SequenceMatcher class can compare two sequences of any type as long as their values are hashable. This class uses an algorithm to identify the longest consecutive matching block in the sequence and delete useless values that have no contribution to the actual data.

The get ou opcodes() function returns a list of instructions to modify the first sequence to match the second. These instructions are encoded as 5-element tuples, including a string instruction ("opcode") and two pairs of start and end indexes (represented by i1, i2, j1, and j2) of the sequence.

value

Significance

'replace'

a[i1:i2] should be replaced by b[j1:j2].

'delete'

a[i1:i2] should be deleted. Note that in this case, j 1 = = J 2.

'insert'

B [J 1: J 2] should be inserted into a [I 1: I 1]. Note that in this case i1 = = i2.

'equal'

a[i1:i2] = = b[j1:j2] (the two subsequences are the same).

import difflib

s1 = [1, 2, 3, 5, 6, 4]
s2 = [2, 3, 5, 4, 6, 1]

print('Initial data:')
print('s1 =', s1)
print('s2 =', s2)
print('s1 == s2:', s1 == s2)
print()

matcher = difflib.SequenceMatcher(None, s1, s2)
for tag, i1, i2, j1, j2 in reversed(matcher.get_opcodes()):

    if tag == 'delete':
        print('Remove {} from positions [{}:{}]'.format(
            s1[i1:i2], i1, i2))
        print('  before =', s1)
        del s1[i1:i2]

    elif tag == 'equal':
        print('s1[{}:{}] and s2[{}:{}] are the same'.format(
            i1, i2, j1, j2))

    elif tag == 'insert':
        print('Insert {} from s2[{}:{}] into s1 at {}'.format(
            s2[j1:j2], j1, j2, i1))
        print('  before =', s1)
        s1[i1:i2] = s2[j1:j2]

    elif tag == 'replace':
        print(('Replace {} from s1[{}:{}] '
               'with {} from s2[{}:{}]').format(
                   s1[i1:i2], i1, i2, s2[j1:j2], j1, j2))
        print('  before =', s1)
        s1[i1:i2] = s2[j1:j2]

    print('   after =', s1, '\n')

print('s1 == s2:', s1 == s2)

This example compares two lists of integers and uses get Β opcodes() to get the instructions to convert the original list to a new list. Here the changes are applied in reverse order so that the list index is still correct after adding and removing elements.

SequenceMatcher is used to handle custom classes as well as built-in types, provided they are hashable.

Posted by meltingpoint on Mon, 17 Feb 2020 19:51:07 -0800