Short notes on using Python regular expressions

Keywords: Python regex

While doing a recent Python exercise, I found that regular expressions and Python's standard-library module for them (re) have quite a few points worth thinking about. So I decided to write these short notes, which I can also review later.

Terminology

Because regular expressions are described differently in different literature, here is a list of the terms used in these notes and the other names I know for them:

In these notes       Other names
Pattern              Expression / pattern
Sub-pattern          Subexpression / subgroup / subpattern
Greedy mode          greedy
Non-greedy mode      non-greedy / lazy
Non-capture group    non-capturing group
Forward search       look-ahead
Backward search      look-behind
Character group      character class

Sub-pattern Extension Syntax

  • look-behind syntax problem

    This section focuses on two pieces of sub-pattern extension syntax: (?<=[pattern]) and (?<![pattern]).

    import re
    s = 'Dr.David Jone,Ophthalmology,x2441 \
    Ms.Cindy Harriman,Registry,x6231 \
    Mr.Chester Addams,Mortuary,x6231 \
    Dr.Hawkeye Pierce,Surgery,x0986'
    # Raises re.error because the look-behind contains the variable-width \s*
    pattern = re.compile(r'(?<=\s*)([A-Za-z]*)(?=,)')
    

    Here I wanted to find each person's surname in the string, but as soon as I carelessly wrote \s* inside the look-behind and ran the code, I got this error:

    re.error: look-behind requires fixed-width pattern

    I didn't understand it at first, and searching with domestic search engines turned up no general explanation. After calming down, I noticed the words "requires fixed-width pattern": the pattern (expression) inside a look-behind must have a known, fixed matching length. Then it suddenly clicked:

    pattern = re.compile(r'(?<=\s)([A-Za-z]*)(?=,)')
    

    Now it works, and all the surnames are matched:

    print(pattern.findall(s))
    # ['Jone', 'Harriman', 'Addams', 'Pierce']
    

    The so-called look-behind is the (?<=[pattern]) form of the sub-pattern extension syntax.

    • Note the difference: (?<=[pattern]) goes before the expression to be matched, while (?=[pattern]) goes after it.

    • Both require [pattern] to match at that position, but what [pattern] matched is not included in the result.

    • To illustrate with a table, where the goal is to return what [pattern2] matched:

      Pattern as written                                             Correct?
      (?<=[pattern1])[pattern2](?=[pattern3])                        ✓
      (?=[pattern1])[pattern2](?<=[pattern3])                        ×
      [pattern4](?<=[pattern1])[pattern2](?=[pattern3])              ✓
      [pattern4](?<=[pattern1])[pattern2](?=[pattern3])[pattern5]    ✓
      (?<=[pattern1])[pattern2]                                      ✓
      [pattern2](?=[pattern1])                                       ✓

    Take the pattern (expression) above for example:

    (?<=\s)([A-Za-z]*)(?=,)
    

    In terms of what has to be present in the text, the pattern (expression) is equivalent to:

    \s([A-Za-z]*),
    

    But when the pattern (expression) matches, the returned part does not contain what (?<=\s) and (?=,) matched, only:

    [A-Za-z]*
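To check this on real data, here is a minimal sketch that compares the consuming form with the look-around form. It uses only the first record of the sample string, and the variable names (record, consuming, look_around) are my own choices for illustration.

```python
import re

# First record of the sample data only (shortened for illustration)
record = 'Dr.David Jone,Ophthalmology,x2441'

# The space and the comma are consumed, so they show up in group(0)
consuming = re.search(r'\s([A-Za-z]*),', record)
# The space and the comma are only asserted, so group(0) excludes them
look_around = re.search(r'(?<=\s)([A-Za-z]*)(?=,)', record)

print(repr(consuming.group(0)), consuming.span())      # ' Jone,' (8, 14)
print(repr(look_around.group(0)), look_around.span())  # 'Jone' (9, 13)
print(consuming.group(1) == look_around.group(1))      # True
```

Both versions capture the same surname in group(1); they differ only in how much of the surrounding text belongs to the overall match.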
    

    Ahem, I digress. Let's keep going. The expression to be matched comes after (?<=[pattern]), so from the current position the engine has to look backward to check [pattern]. That is why (?<=[pattern]) is called look-behind.

    Taken together, the "look-behind requires fixed-width pattern" error means that the width of the sub-pattern inside (?<=[pattern]) must be fixed!

    My earlier attempt, (?<=\s*), used the metacharacter *, which lets the preceding \s match zero or more times, so the width is indeterminate and an error is raised.

    (?<![pattern]) is also a look-behind sub-pattern (the negative one), so the same restriction applies to it.

    • Again, note the difference: (?<![pattern]) goes before the expression to be matched, while (?![pattern]) goes after it.

    • Both succeed only if [pattern] does not match at that position, and again nothing from this sub-pattern is included in the result.
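The negative forms are easier to see with a small example. The sentence and the two patterns below are made up for illustration (they are not from the original experiment): one keeps numbers that are not preceded by a dollar sign, the other keeps numbers that are not followed by the word "items".

```python
import re

# Hypothetical sample text, chosen only to exercise both negative assertions
text = 'cost $30 for 45 items'

# Negative look-behind: the number must NOT be preceded by '$'
not_after_dollar = re.findall(r'(?<!\$)\b\d+\b', text)
# Negative look-ahead: the number must NOT be followed by 'items'
not_before_items = re.findall(r'\b\d+\b(?!\s*items)', text)

print(not_after_dollar)  # ['45']
print(not_before_items)  # ['30']
```

Note that the look-ahead may contain \s* (variable width), while the look-behind may not.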

    One-sentence summary: inside the [pattern] of (?<=[pattern]) and (?<![pattern]), do not use ?, * or +. These metacharacters make the width indeterminate.

    Metacharacter   Function
    ?               Match the preceding sub-pattern 0 or 1 times, or make the preceding quantifier non-greedy
    *               Match the preceding sub-pattern 0 or more times
    +               Match the preceding sub-pattern 1 or more times

    Keep these in mind ~
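The fixed-width rule can be checked without reading tracebacks. This is a small sketch: compiles() is a helper name I made up, and it simply reports whether re.compile accepts a pattern. The {1,2} case is my own addition to show that ranged repetition is rejected in the same way.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if re.compile accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

candidates = [
    r'(?<=\s)x',       # fixed width 1
    r'(?<=\s{2})x',    # fixed width 2
    r'(?<=\s*)x',      # variable width: 0 or more
    r'(?<=\s+)x',      # variable width: 1 or more
    r'(?<=\s?)x',      # variable width: 0 or 1
    r'(?<=\s{1,2})x',  # variable width: 1 to 2
    r'(?<!\s*)x',      # negative look-behind has the same restriction
]
for candidate in candidates:
    print(candidate, compiles(candidate))
# Only the first two patterns compile
```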

  • Differences between non-capture groups and look-ahead / look-behind

    In the sub-pattern extension syntax, a non-capture group is written (?:[pattern]). Look-ahead searches forward and look-behind searches backward. Here they are in a table:

    Term in these notes        English term            Pattern
    Positive backward search   positive look-behind    (?<=)
    Positive forward search    positive look-ahead     (?=)
    Negative backward search   negative look-behind    (?<!)
    Negative forward search    negative look-ahead     (?!)

    "Positive" means the assertion succeeds when [pattern] appears; "negative" means it succeeds when [pattern] does not appear.

    We've talked about look-ahead and look-behind in the previous section, so now let's turn to the non-capture group.

    A non-capture group (?:[pattern]) matches [pattern], but the group is not recorded. Here is a complete example:

    import re
    s = 'Cake is better than potato'
    pattern = re.compile(r'(?:is\s)better(\sthan)')
    print(pattern.search(s).group(0))
    # is better than
    print(pattern.search(s).group(1))
    #  than (note the leading space matched by \s)
    

    The group(num/name) method of the Match object returns the content of the corresponding group, with sub-patterns numbered from 1. group(0) returns what the entire pattern matched ('is better than'), while group(1) returns what the first sub-pattern matched (' than').

    Here you can see that the first sub-pattern is (\sthan) rather than (?:is\s), which means the group (?:is\s) is not captured (not recorded).

    So here is the question: positive look-ahead (?=[pattern]) and positive look-behind (?<=[pattern]) also match when [pattern] appears without returning what the sub-pattern matched. How do they differ from (?:[pattern])?

    Compare the results produced by these two patterns:

    import re
    s = 'Cake is better than potato'
    pattern = re.compile(r'(?:is\s)better(\sthan)')
    pattern2 = re.compile(r'(?<=is\s)better(\sthan)')
    
    Sub-pattern extension syntax   search(s).group(0)   search(s).group(1)
    (?:[subpattern])               'is better than'     ' than'
    (?<=[subpattern])              'better than'        ' than'

    To summarize from the above results:

    1. (?<=[pattern]) and (?=[pattern]) check for [pattern], but what they match is neither returned nor captured, so 'is ' does not appear in the whole-pattern match in the example above.

    2. (?:[pattern]) matches [pattern] and includes it in the whole match, but the sub-pattern is not recorded as a group, so 'is ' does appear in the whole-pattern match in the example above.

    3. What (?:[pattern]), (?<=[pattern]) and (?=[pattern]) have in common is that their [pattern] is not recorded as a sub-pattern (group), so the first group returned by group(1) in the example above is (\sthan), which matched ' than'.
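The three points above can be verified in one go. This is a minimal sketch using the same sentence and the same two patterns; the variable names non_capturing and look_behind are mine.

```python
import re

s = 'Cake is better than potato'

non_capturing = re.search(r'(?:is\s)better(\sthan)', s)
look_behind = re.search(r'(?<=is\s)better(\sthan)', s)

# The non-capture group is part of the overall match ...
print(repr(non_capturing.group(0)), non_capturing.span())  # 'is better than' (5, 19)
# ... while the look-behind is only asserted, not consumed
print(repr(look_behind.group(0)), look_behind.span())      # 'better than' (8, 19)
# Neither construct is recorded as a group, so both have exactly one group
print(non_capturing.groups(), look_behind.groups())        # (' than',) (' than',)
```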

Basic Syntax

  • Non-greedy mode

    To find each person's surname and extension number in the string, I ended up writing the following:

    import re
    s = 'Dr.David Jone,Ophthalmology,x2441 \
    Ms.Cindy Harriman,Registry,x6231 \
    Mr.Chester Addams,Mortuary,x6231 \
    Dr.Hawkeye Pierce,Surgery,x0986'
    pattern = re.compile(r'(?<=\s)([A-Za-z]*)(?=,).*?(?<=x)(\d{4})')
    print(pattern.findall(s))
    # [('Jone', '2441'), ('Harriman', '6231'), ('Addams', '6231'), ('Pierce', '0986')]
    

    The idea is that the first part of the pattern uses the space and the comma to locate the surname, and the last part uses the leading x and \d{4} to locate the four-digit telephone extension.

    The first thing I wrote between the two parts was .*, where the metacharacter * repeats the match of . zero or more times, and the result was [('Jone', '0986')] (everything swallowed in one gulp!).

    After looking at the metacharacter table so many times, did it not occur to me to switch off greedy matching with ?? It did, but I remembered that non-greedy means matching as short a string as possible, and * already allows zero repetitions. So, I reasoned, the shortest match would be nothing at all, and that question mark could not be added after .*!

    Then I tried the following:

    (?<=\s)([A-Za-z]*)(?=,).*(?<=x)(\d{4})?
    (?<=\s)([A-Za-z]*)(?=,).*(?<=x)?(\d{4})?
    (?<=\s)([A-Za-z]*)(?=,).*(?<=x)?(\d{4})
    (?<=\s)([A-Za-z]*)(?=,).*(?<=x)(\d{4})\s
    (?<=\s)([A-Za-z]*)(?=,).*(?<=x)(\d{4})?\s
    

    Of course, none of these patterns matched the way I expected. Out of options, I changed the middle part to .*? and that did it!

    (?<=\s)([A-Za-z]*)(?=,).*?(?<=x)(\d{4})
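To see the two behaviours side by side, here is a small sketch that runs the greedy and the non-greedy version on the same data. The string is written with implicit concatenation instead of backslash continuations (my choice, so that indentation cannot leak into the data), and the names staff, greedy and lazy are mine.

```python
import re

staff = ('Dr.David Jone,Ophthalmology,x2441 '
         'Ms.Cindy Harriman,Registry,x6231 '
         'Mr.Chester Addams,Mortuary,x6231 '
         'Dr.Hawkeye Pierce,Surgery,x0986')

# Greedy .* runs to the end of the string and backtracks to the LAST x + 4 digits
greedy = re.findall(r'(?<=\s)([A-Za-z]*)(?=,).*(?<=x)(\d{4})', staff)
# Non-greedy .*? stops at the FIRST x + 4 digits after each surname
lazy = re.findall(r'(?<=\s)([A-Za-z]*)(?=,).*?(?<=x)(\d{4})', staff)

print(greedy)  # [('Jone', '0986')]
print(lazy)    # [('Jone', '2441'), ('Harriman', '6231'), ('Addams', '6231'), ('Pierce', '0986')]
```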
    

    Thinking it over, "matching as short a string as possible" should not be understood from the point of view of a single metacharacter in isolation.

    Take the string 2between1and3:

    • If I write .*? on its own, it matches nothing (the empty string).

    • But if I constrain it on both sides, as in \d+.*?\d+, then .*? has to match whatever is sandwiched between the digits.

    • In greedy mode, .* would match between1and; because .*? is non-greedy, it matches the shorter between.

    In summary, non-greedy means making the match as short as possible while the pattern as a whole still matches.

    When using the non-greedy ? symbol, keep the surrounding context in mind.
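The 2between1and3 example can be reproduced directly. In this sketch I added a capture group around the middle part so that what .* and .*? matched can be printed; the group and the variable names are my additions.

```python
import re

s = '2between1and3'

alone = re.search(r'.*?', s)           # nothing forces it to grow
greedy = re.search(r'\d+(.*)\d+', s)   # as long as possible between digits
lazy = re.search(r'\d+(.*?)\d+', s)    # as short as possible between digits

print(repr(alone.group(0)))  # ''
print(greedy.group(1))       # between1and
print(lazy.group(1))         # between
print(lazy.group(0))         # 2between1
```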

  • Metacharacters in Brackets

    I wrote this section because my Python teacher said that metacharacters inside brackets [] are treated only as ordinary characters. However, when I experimented, I found that this is not entirely true.

    Look at this regular expression, which (roughly) matches a single Python identifier:

    ^\D[\w]*
    # Python identifiers cannot begin with a number
    

    This pattern successfully matches strings such as hello_world2 and _hey_there. Wait a moment. Doesn't that mean metacharacters like \w can be used inside []!

    Let's try these again:

    ^\D[z\wza]* # Identifiers still match, so \w really works
    ^\D[z\dza]* # Can match hz2333a, so \d also plays its role
    ^\D[z\nza]* # Matches hz\naaa containing a line break, so \n also works
    

    It is easy to see that escapes such as \w, \s, \n, \v, \t, \r all keep working inside brackets []. Others, such as \b, lose their usual meaning there (inside brackets \b stands for the backspace character rather than a word boundary), but Python does not raise an error.

    Then try these again:

    ^\D[\w+]* # Matches hello+world
    ^\D[\w+*]* # Matches hello+world*2
    ^\D[\w+*?]* # Matches hello+wo?rld*2
    ^\D[(\w+*)]* # Matches hello+(world)*2
    ^\D[(\w{1,3}+*)]* # Matches hello+(world)*2,{1,3}
    ^\D[\w$]* # Matches hello$world
    ^\D[\(\w\*\?\\)\$]* # Matches hello$wor\ld*?
    

    At this point I realized that what my teacher said is only partly true: the metacharacters treated as ordinary characters inside [] are mainly *, ?, +, {}, () and $.

    As the examples above show, inside brackets these metacharacters are equivalent to \*, \?, \+, \{\}, \(\) and \$.

    Brackets [] have two main metacharacters of their own: ^ (negation) and - (range specifier), for example:

    [^a-z]
    

    This matches any single character that is not a lowercase letter from a to z.

    To summarize:

    1. \w, \s, \n, \v, \t, \r, ... and their opposite-meaning counterparts (e.g. \W for \w) are fully usable inside []. \b is accepted too, but it does not mean a word boundary there.

    2. *, ?, +, {}, (), $ and other symbolic metacharacters can also be used inside [], where they are all treated as ordinary characters.

    3. Python will not raise an error when the above metacharacters are used inside brackets, so rest assured.
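The summary can be checked with a few one-liners. This is only a sketch: the test strings are taken from the examples in this section where possible, and the variable names are mine. The backspace check relies on Python's documented behaviour that \b inside a character class means the backspace character.

```python
import re

# Quantifier symbols are ordinary characters inside a class
literal_symbols = re.fullmatch(r'\D[\w+*?]*', 'hello+wo?rld*2')
# Parentheses and braces are ordinary characters too
literal_braces = re.fullmatch(r'\D[(\w{1,3}+*)]*', 'hello+(world)*2,{1,3}')
# Escapes such as \d keep their meaning inside a class
digits = re.findall(r'[\d]', 'a1b2')
# ^ negates the class and - denotes a range
not_lowercase = re.findall(r'[^a-z]', 'abC1')
# Inside a class, \b is the backspace character, not a word boundary
backspace = re.fullmatch(r'[\b]', '\x08')

print(literal_symbols is not None)  # True
print(literal_braces is not None)   # True
print(digits)                       # ['1', '2']
print(not_lowercase)                # ['C', '1']
print(backspace is not None)        # True
```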

  • Referencing sub-patterns with \num

    The textbook mentions \num in its table of sub-pattern features, but only in passing:

    The num here refers to a positive integer that represents the sub-pattern number. For example, '(.)\1' matches two consecutive identical characters

    At first I didn't really understand what this meant. I thought it simply repeated the earlier sub-pattern:

    (\d)[A-Za-z_]+\1
    

    I tried this pattern on the string 12hello3, and it matched nothing...

    What the heck? Isn't \1 here just a repeat of (\d) that matches another digit?

    Then I changed the string being matched, and the results were:

    String to match   search() result
    12hello3          None
    12hello1          None
    12hello2          2hello2

    (In 12hello1 the digit captured right before the letters is 2, so the trailing 1 cannot satisfy \1.)

    Oh! It turns out that \num doesn't refer to the sub-pattern itself, but to what that sub-pattern has already matched.

    In the example above, (\d) is the first sub-pattern. If it matched 2, then the following \1 must also be 2 for the match to succeed. A few more examples:

    (\d)(\d)[A-Za-z_]+\2\1 # Matches 34hello43
    (\d)(\d)[A-Za-z_]+\1world\2 # Matches 34hello3world4
    (\d)(\d)[A-Za-z_]+\1*world\2 # Matches 34hello33333world4
    

    Summary:

    1. \num refers to what the corresponding sub-pattern matched. Note that only the sub-pattern's sequence number can be used here.

    2. Sub-pattern numbering starts from 1.

    3. If you want to reference a sub-pattern by name, use the extension syntax (?P<name>...) to define it and (?P=name) to reference it, for example:

      import re
      s = '34hello33333world4'
      pattern = re.compile(r'(?P<f>\d)(\d)[A-Za-z_]+(?P=f)*world\2')
      print(pattern.match(s).group(0))
      # 34hello33333world4
      
    4. \num does not work as a reference inside brackets [] (see the previous section).
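Here is a small sketch that reruns the \1 experiment and then repeats it with a named group. first_match() is a helper I made up, and the strings 1hello1 and 9lives9 are extra test cases of my own.

```python
import re

numbered = re.compile(r'(\d)[A-Za-z_]+\1')

def first_match(text):
    """Return the text matched by the numbered pattern, or None if there is no match."""
    found = numbered.search(text)
    return found.group(0) if found else None

for text in ('12hello3', '12hello1', '12hello2', '1hello1'):
    print(text, '->', first_match(text))
# 12hello3 -> None
# 12hello1 -> None
# 12hello2 -> 2hello2
# 1hello1 -> 1hello1

# The same idea with a named group: (?P<digit>...) defines it, (?P=digit) references it
named = re.search(r'(?P<digit>\d)[A-Za-z_]+(?P=digit)', '9lives9')
print(named.group(0), named.group('digit'))  # 9lives9 9
```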

re module flags (modifiers)

  • How to use multiple flags at the same time

    Functions like re.compile, re.search, re.match, re.findall all allow modifier flags as parameters. Let's take re.compile for example:

    import re
    s='''Hello line1
    hello line2
    hello line3
    '''
    pattern=re.compile('^hElLo',re.I)
    print(pattern.findall(s))
    

    Not very exciting yet! What if I want multiline matching and case-insensitive matching at the same time?

    Like this!

    pattern=re.compile('^hElLo',re.I | re.M)
    

    Here | can be called the pipe character (that seems to be its name in the shell); in Python it is the bitwise OR operator. Whatever the name, with this symbol we can combine more than one flag! (although we usually don't need more than two)

    "I'm stubborn. I refuse to use the | sign. Hmph!"

    OK, no problem! Let's borrow something from the sub-pattern extension syntax!

    Python also has a piece of sub-pattern extension syntax that applies one or more flags to the entire pattern, written (?flags):

    pattern=re.compile('(?im)^hElLo') # i -> ignore case, m -> multiline matching
    pattern=re.compile('(?sm)^hElLo') # s -> dot also matches line breaks, m -> multiline matching
    

    It is worth noting that this extension syntax should be placed at the very start of the pattern. Otherwise Python complains: older Python 3 versions emit "DeprecationWarning: Flags not at the start of the expression", and since Python 3.11 it is an error.
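The two ways of passing flags can be compared directly. A minimal sketch, with the sample string written on one line using \n; the variable names are mine.

```python
import re

s = 'Hello line1\nhello line2\nhello line3\n'

# Flags passed as an argument, combined with bitwise OR
with_argument = re.findall('^hElLo', s, re.I | re.M)
# The same flags written inline at the very start of the pattern
with_inline = re.findall('(?im)^hElLo', s)

print(with_argument)                 # ['Hello', 'hello', 'hello']
print(with_argument == with_inline)  # True
```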

  • Several common modifiers

    Flag    Function
    re.S    Let the metacharacter . also match the line break \n
    re.M    Multiline matching; affects the metacharacters ^ and $
    re.I    Ignore case when matching
    re.X    Allow whitespace and multiple lines in the pattern for easier reading

    Note: in Python 3, re.U is redundant (it is kept only for backward compatibility), because str patterns match Unicode by default.

    Before giving examples, here are some memory aids:

    • re.S relates to the metacharacter . so remember it as ".S", or expand it to "DOT SEARCH": this flag has to do with the dot character (its long name is re.DOTALL).

    • re.I ignores case: literally IGNORECASE.

    • re.M is multiline matching: literally MULTILINE.

    • re.X... if no association comes to mind, just memorize it by rote (its long name is re.VERBOSE).
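If the one-letter names are hard to remember, the long names can be used instead. This short sketch only confirms that each short name is an alias of the long one in the re module, and uses the long names once; the sample string is made up.

```python
import re

# Each one-letter flag is an alias of a descriptive long name
print(re.S == re.DOTALL)      # True
print(re.I == re.IGNORECASE)  # True
print(re.M == re.MULTILINE)   # True
print(re.X == re.VERBOSE)     # True

# The long names combine with | in exactly the same way
flags = re.IGNORECASE | re.MULTILINE
print(re.findall('^hello', 'Hello\nHELLO', flags))  # ['Hello', 'HELLO']
```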

    Starting with re.I: this flag simply lets the pattern ignore case when matching:

    import re
    s='''Hello line1
    hello line2
    hello line3
    '''
    pattern=re.compile('hElLo')
    print(pattern.findall(s)) # []
    pattern2=re.compile('hElLo',re.I)
    print(pattern2.findall(s)) # ['Hello', 'hello', 'hello']  
    

    re.M mainly affects two metacharacters: ^ (match at the beginning) and $ (match at the end).

    Normally, ^ matches only at the beginning of the entire string, while $ matches at the end of the string (or just before a line break that ends the string).

    With re.M, ^ matches at the beginning of the string and also at the beginning of each line, and $ matches at the end of each line as well as at the end of the string. Here are a few examples:

    import re
    s='''Hello line1
    hello line2
    hello line3
    '''
    # Raw strings (r'...') keep \s and \d from being read as string escapes
    print( re.findall(r'^hElLo\slINe\d',s,re.I) )
    # ['Hello line1']
    print( re.findall(r'hElLo\slINe\d$',s,re.I) )
    # ['hello line3']
    print( re.findall(r'^hElLo\slINe\d$',s,re.I) )
    # []
    
    print( re.findall(r'^hElLo\slINe\d',s,re.I | re.M) )
    # ['Hello line1', 'hello line2', 'hello line3']
    print( re.findall(r'hElLo\slINe\d$',s,re.I | re.M) )
    # ['Hello line1', 'hello line2', 'hello line3']
    print( re.findall(r'^hElLo\slINe\d$',s,re.I | re.M) )
    # ['Hello line1', 'hello line2', 'hello line3']
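One detail in the results above is easy to miss: without re.M, 'hello line3' still matches with $ even though the string ends with a line break. The sketch below looks at match positions to show that $ can match just before a trailing line break; the variable names are mine.

```python
import re

s = 'Hello line1\nhello line2\nhello line3\n'

# Without re.M, $ matches before the final '\n' of the string
last_line = re.search(r'line3$', s)
print(last_line.span(), len(s))  # (30, 35) 36

# ... but not before a '\n' in the middle of the string
print(re.search(r'line1$', s))   # None

# With re.M, $ also matches at the end of every line
first_line = re.search(r'line1$', s, re.M)
print(first_line.span())         # (6, 11)
```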
    

    re.S makes the metacharacter . match every character, including the line break \n!

    By default, the metacharacter . matches any character except the line break \n.

    Example:

    import re
    s='''Hello line1
    hello line2
    hello line3
    '''
    print( re.findall('line(.*)hello',s) )
    # []
    print( re.findall('line(.*)hello',s,re.S) )
    # ['1\nhello line2\n']
    print( re.findall('line(.*?)hello',s,re.S) )
    # ['1\n', '2\n']
    

    re.X is a flag that improves the readability of regular expressions, making patterns more elegant to write ~

    Let's start with an example:

    import re
    s = 'Dr.David Jone,Ophthalmology,x2441 \
    Ms.Cindy Harriman,Registry,x6231 \
    Mr.Chester Addams,Mortuary,x6231 \
    Dr.Hawkeye Pierce,Surgery,x0986'
    pattern = re.compile(r'(?<=\s)([A-Za-z]*)(?=,).*?(?<=x)(\d{4})')
    print(pattern.findall(s))
    

    The more complex the pattern, the less readable it is on a single line. That won't do. We want elegance! So you can write it as follows:

    pattern = re.compile(r'''
    (?<=\s) # Use the space to locate roughly where the surname starts
    ([A-Za-z]*) # The surname consists of English letters
    (?=,) # There is a comma after the surname
    .*? # Whatever lies between the surname and the extension number
    (?<=x) # Find the common prefix x of the telephone extension
    (\d{4}) # Extension numbers are always 4 digits
    ''', re.X)
    

    All that's missing is a glass of red wine 🍷 Much more elegant, isn't it? Readability is greatly improved.

    As you can see from the example above, with re.X the whitespace, line breaks and # comments inside the multiline pattern are ignored.
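To confirm that the verbose form really is the same pattern, this sketch compiles both versions and compares their results. The data string is written with implicit concatenation (my choice), and the names staff, one_line and verbose are mine.

```python
import re

staff = ('Dr.David Jone,Ophthalmology,x2441 '
         'Ms.Cindy Harriman,Registry,x6231 '
         'Mr.Chester Addams,Mortuary,x6231 '
         'Dr.Hawkeye Pierce,Surgery,x0986')

one_line = re.compile(r'(?<=\s)([A-Za-z]*)(?=,).*?(?<=x)(\d{4})')
verbose = re.compile(r'''
    (?<=\s)      # a whitespace character before the surname
    ([A-Za-z]*)  # group 1: the surname
    (?=,)        # a comma right after the surname
    .*?          # as little as possible up to the extension
    (?<=x)       # the extension is prefixed with x
    (\d{4})      # group 2: four digits
''', re.X)

print(one_line.findall(staff) == verbose.findall(staff))  # True
print(verbose.findall(staff)[0])                          # ('Jone', '2441')
```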

    Here's an official document describing re.X:

    Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

    That is to say, there are exceptions to the rule that whitespace is ignored:

    1. A space is not ignored when it is inside a character class, i.e. inside brackets.

      import re
      s = '''Dr.David Jone,Ophthalmology,x2441 
      Ms.Cindy Harriman,Registry,x6231 
      Mr.Chester Addams,Mortuary,x6231 
      Dr.Hawkeye Pierce,Surgery,x0986'''
      # We use the property of not ignoring spaces within brackets to match names in the above string, such as Dr.David Jone
      
      print(re.findall(r'''
      ^[a-zA-Z.]*?
      [\w]* # No spaces in brackets
      (?=,) 
      ''', s, re.X | re.M))
      # None of them match
      
      print(re.findall(r'''
      ^[a-zA-Z.]*?
      [ \w]* # Space in brackets
      (?=,) 
      ''', s, re.X | re.M))
      # Able to match: ['Dr.David Jone','Ms.Cindy Harriman','Mr.Chester Addams','Dr.Hawkeye Pierce']
      
    2. When a space in the pattern is preceded by an escaping backslash \, the space is not ignored.

      import re
      s = '''Dr.David Jone,Ophthalmology,x2441 
      Ms.Cindy Harriman,Registry,x6231 
      Mr.Chester Addams,Mortuary,x6231 
      Dr.Hawkeye Pierce,Surgery,x0986'''
      # We use the property that an escaped space is not ignored to match names in the above string, such as Dr.David Jone
      print(re.findall(r'''
      ^[a-zA-Z.]*?
      # There's only a plain space here
      [\w]* 
      (?=,) 
      ''', s, re.X | re.M))
      # None of them match
      
      print(re.findall(r'''
      ^[a-zA-Z.]*?
      \ # There is a space escaped here
      [\w]* 
      (?=,) 
      ''', s, re.X | re.M))
      # Matched: ['Dr.David Jone','Ms.Cindy Harriman','Mr.Chester Addams','Dr.Hawkeye Pierce']
      
    3. Whitespace within tokens such as *?, (?: and (?P<...> is not ignored. The official documentation is not very detailed here. In my tests, the following forms keep their space:

      \ *?
      (?:\ )
      (?P<...>\ )
      

      Clearly, though, these are really just escaped spaces again (case 2), so I may be misinterpreting this. Another reading of the documentation is that such multi-character tokens must not be split by whitespace, e.g. * ? is not the same as *?.

    In any case, matching spaces this way is rarely done in practice: a space tucked away in some corner of the pattern is very easy to miss when someone reads it.

    Things are much simpler for the comment marker #. There are only two cases in which a # in the pattern is not treated as the start of a comment:

    1. The # is inside a character class, i.e. enclosed in brackets [].

    2. The # is escaped with a backslash \.
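Both sets of exceptions (for spaces and for #) can be exercised in one verbose pattern. The sample sentence and the patterns are made up for illustration; the second pattern simply swaps which character is escaped and which is put in a character class.

```python
import re

text = 'see issue #42 and pr #7'

bracket_space = re.compile(r'''
    [A-Za-z]+   # a word
    [ ]         # a literal space, kept because it is inside a character class
    \#          # a literal hash sign, kept because it is escaped
    \d+         # the number
''', re.X)

escaped_space = re.compile(r'''
    [A-Za-z]+   # a word
    \           # a literal space, kept because it is escaped
    [#]         # a literal hash sign, kept because it is inside a character class
    \d+         # the number
''', re.X)

print(bracket_space.findall(text))  # ['issue #42', 'pr #7']
print(bracket_space.findall(text) == escaped_space.findall(text))  # True
```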

Afterword

Regular expressions are not always the best tool, especially when they are less efficient than ordinary string processing.

However, when string processing becomes very cumbersome to write, regular expressions do save a lot of time and improve productivity.

In my view, regular expressions and SQL statements have something in common: both are universal within their own domains. Regular expressions can be used in almost every programming language, and SQL statements can be used in any standards-compliant relational database management system.

My writing is not great and may be a bit rough. I hope these short notes help you get more comfortable with regular expressions, and thank you for your patience.

If I come across new points worth recording while learning Python regular expressions, I will keep updating this article.

To be continued...

Posted by e1seix on Wed, 03 Nov 2021 09:50:44 -0700