Python regular expressions (the re module, greedy and non-greedy matching)

Keywords: Python Linux JSON Attribute

We work on a Linux host to practice regular expressions.

1. Python3 Regular Expressions

Regular expressions are a special sequence of characters that help you easily check if a string matches a pattern.

Python has included the re module since version 1.5; it provides Perl-style regular expression patterns.

The re module gives the Python language full regular expression functionality.

The compile function generates a regular expression object from a pattern string and an optional flags argument. The object has a series of methods for regular expression matching and substitution.

The re module also provides module-level functions that behave the same as these methods and take a pattern string as their first argument.

This chapter focuses on the regular expression functions commonly used in Python. If you are new to regular expressions, see our Regular Expression Tutorial.

1. re.split

The split function splits a string at each occurrence of the pattern and returns the pieces as a list. Its syntax is:

re.split(pattern, string[, maxsplit=0, flags=0])

Parameters:

Parameter   Description
pattern     The regular expression to match.
string      The string to split.
maxsplit    Maximum number of splits; maxsplit=1 splits once. The default, 0, means no limit.
flags       Flag bits that control how the regular expression is matched, such as case-insensitive or multi-line matching. See: Regular Expression Modifiers - Optional Flags

Example

import re

# Split at every ':' or '.' and drop any whitespace that follows it
data = 'Last login: Tue Mar 31 17:56:11 2020 from 192.168.1.80'
new_data = re.split(r'[:.]\s*', data)
print(new_data)

# str.split can only split on one fixed separator
print(data.split(': '))

The output from the above example is as follows:
['Last login', 'Tue Mar 31 17', '56', '11 2020 from 192', '168', '1', '80']
['Last login', 'Tue Mar 31 17:56:11 2020 from 192.168.1.80']
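The effect of the maxsplit parameter can be sketched with the same sample string. The variable names first_split and all_parts are illustrative only and are not part of the original example.

```python
import re

data = 'Last login: Tue Mar 31 17:56:11 2020 from 192.168.1.80'

# maxsplit=1: split only at the first ':' or '.', leave the rest intact
first_split = re.split(r'[:.]\s*', data, maxsplit=1)
print(first_split)

# maxsplit defaults to 0, which means "split at every match"
all_parts = re.split(r'[:.]\s*', data)
print(len(all_parts))
```

Passing maxsplit by keyword keeps the call readable, since the third positional argument is easy to confuse with flags.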

The following is the basic syntax for regular expressions:

Pattern     Description
^           Matches the beginning of the string.
$           Matches the end of the string.
.           Matches any character except a line break. With the re.DOTALL flag it matches any character, including line breaks.
[...]       A set of characters: [amk] matches 'a', 'm' or 'k'.
[^...]      Any character not in the set: [^abc] matches any character other than a, b or c.
re*         Matches 0 or more occurrences of the preceding expression.
re+         Matches 1 or more occurrences of the preceding expression.
re?         Matches 0 or 1 occurrence of the preceding expression (greedy). Placed after another quantifier, ? makes that quantifier non-greedy.
re{n}       Matches exactly n occurrences of the preceding expression. For example, 'o{2}' cannot match the 'o' in 'Bob', but it matches the two 'o' in 'food'.
re{n,}      Matches n or more occurrences of the preceding expression. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".
re{n,m}     Matches n to m occurrences of the preceding expression, greedy.
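The quantifiers in the table can be tried out with re.findall. This is a minimal sketch; the sample strings and variable names are chosen for illustration and extend the 'Bob' / 'food' examples from the table.

```python
import re

# {n}: exactly two 'o' characters
exact_two = re.findall(r'o{2}', 'Bob food')

# {n,}: two or more 'o' characters, as many as possible
two_or_more = re.findall(r'o{2,}', 'Bob foood')

# {n,m}: one to two 'o' characters; greedy, so it takes two first
one_to_two = re.findall(r'o{1,2}', 'foood')

# ?: the preceding 'u' is optional
optional_u = re.findall(r'colou?r', 'color colour')

print(exact_two, two_or_more, one_to_two, optional_u)
```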

2. Special Character Classes

Pattern     Description
.           Matches any single character except '\n'. To match any character including '\n', use the re.DOTALL flag or a pattern like '[\s\S]'.
\d          Matches a digit. Equivalent to [0-9].
\D          Matches a non-digit character. Equivalent to [^0-9].
\s          Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S          Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\w          Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W          Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.

Note: these equivalences hold for ASCII text or with the re.ASCII flag; by default, Python 3 str patterns also match the corresponding Unicode characters.

# ' ?[a-zA-Z]+ ?'
# Matches a word with an optional space before and after it; [a-zA-Z]+ stands for one or more English letters

# Match an IP address such as 192.168.1.80
# [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}
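As a sketch of how these patterns behave, the snippet below applies them to the login line used earlier. The variable names are illustrative. Note that the IP pattern only checks the shape of the address (four groups of one to three digits); it does not verify that each number is in the range 0 to 255.

```python
import re

line = 'Last login: Tue Mar 31 17:56:11 2020 from 192.168.1.80'

# Four dot-separated groups of 1-3 digits
ip_pattern = re.compile(r'[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}')
ips = ip_pattern.findall(line)

# Runs of English letters
words = re.findall(r'[a-zA-Z]+', line)

# Runs of digits, using the \d shorthand
digits = re.findall(r'\d+', line)

print(ips)
print(words)
print(digits)
```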

3. findall function

Finds all the substrings in the string that match the regular expression and returns them as a list; returns an empty list if no match is found.

Note: match and search return a single match, whereas findall returns all matches.

The syntax, for a compiled pattern object, is:

pattern.findall(string[, pos[, endpos]])

The module-level equivalent is re.findall(pattern, string, flags=0), which does not take pos or endpos.

Parameters:

  • string: The string to be searched.
  • pos: Optional; the position in the string at which to start searching. Defaults to 0.
  • endpos: Optional; the position in the string at which to stop searching. Defaults to the length of the string.

Find all the numbers in the string:

import re

pattern = re.compile(r'\d+')   # Find runs of digits
result1 = pattern.findall('runoob 123 google 456')
result2 = pattern.findall('run88oob123google456', 0, 10)   # Search only characters 0 to 9

print(result1)
print(result2)

The output from the above example is as follows:
['123', '456']
['88', '12']
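One detail worth knowing is how findall behaves when the pattern contains capture groups: with one group it returns the text of that group, and with several groups it returns a list of tuples. A small sketch, using the sample string from the example (the variable names are illustrative):

```python
import re

text = 'runoob 123 google 456'

# No group: the whole match is returned
whole = re.findall(r'[a-z]+ \d+', text)

# One group: only the group's text is returned
names = re.findall(r'([a-z]+) \d+', text)

# Two groups: one tuple per match
pairs = re.findall(r'([a-z]+) (\d+)', text)

print(whole)
print(names)
print(pairs)
```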

4. compile function

The compile function compiles a regular expression and returns a Pattern object, whose match(), search() and other methods can then be used.

The syntax is:

re.compile(pattern[, flags])

Parameters:

  • pattern: A regular expression in the form of a string
  • flags: Optional; controls the matching mode, such as ignoring case or multi-line mode. The available flags are:
    • re.I: Ignore case
    • re.L: Make \w, \W, \b, \B, \s and \S depend on the current locale (bytes patterns only in Python 3)
    • re.M: Multi-line mode
    • re.S: Make '.' match any character, including line breaks (without it, '.' excludes line breaks)
    • re.U: Make \w, \W, \b, \B, \d, \D, \s and \S depend on the Unicode character properties database (the default for str patterns in Python 3)
    • re.X: Ignore whitespace and comments after '#' in the pattern, for readability
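The flags listed above can be compared side by side. The following is only a sketch; the two-line sample text and the variable names are made up for illustration.

```python
import re

text = 'First line\nsecond LINE'

# re.I: 'line' also matches 'LINE'
case_insensitive = re.findall(r'line', text, flags=re.I)

# re.M: '^' matches at the start of every line, not just the string
line_starts = re.findall(r'^\w+', text, flags=re.M)

# Without re.S, '.' stops at the line break, so there is no match
without_dotall = re.search(r'First.*LINE', text)

# With re.S, '.' also matches '\n'
with_dotall = re.search(r'First.*LINE', text, flags=re.S)

# re.X: whitespace and '#' comments inside the pattern are ignored
decimal = re.compile(r"""
    \d+   # integer part
    \.    # literal dot
    \d+   # fractional part
""", re.X)
decimals = decimal.findall('pi is 3.14, e is 2.71')

print(case_insensitive, line_starts, without_dotall, with_dotall.group(), decimals)
```

Flags can be combined with the | operator, for example re.I | re.M.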

Example 1

>>> import re
>>> pattern = re.compile(r'\d+')                    # Matches one or more digits
>>> m = pattern.match('one12twothree34four')        # Try to match at the start: no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 2, 10) # Start matching at the 'e': no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 3, 10) # Start matching at the '1': match
>>> print(m)                                        # Returns a Match object
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0)   # The 0 can be omitted
'12'
>>> m.start(0)   # The 0 can be omitted
3
>>> m.end(0)     # The 0 can be omitted
5
>>> m.span(0)    # The 0 can be omitted
(3, 5)

When the match succeeds, a Match object is returned, where:

  • group([group1, ...]) returns one or more matched groups as strings; use group() or group(0) to get the entire matched substring.
  • start([group]) returns the start position of the group's match within the whole string (the index of the first character of the substring); the argument defaults to 0.
  • end([group]) returns the end position of the group's match within the whole string (the index of the last character of the substring, plus 1); the argument defaults to 0.
  • span([group]) returns (start(group), end(group)).
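The four methods can be seen together on a pattern with two groups. This is a minimal sketch with a made-up sample string; the names m and text are illustrative.

```python
import re

text = 'range 10-25 units'
m = re.search(r'(\d+)-(\d+)', text)

whole = m.group()          # same as m.group(0): the entire match
both = m.group(1, 2)       # several group numbers return a tuple
second_start = m.start(2)  # index of the first character of group 2
second_end = m.end(2)      # index after the last character of group 2
first_span = m.span(1)     # (start(1), end(1))

print(whole, both, second_start, second_end, first_span)

# The span can be used to slice the original string
assert text[second_start:second_end] == m.group(2)
```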

Example 2

import re

# flags=re.IGNORECASE: ignore case
data = 'Linux System built-in Python 2.7.5,We installed Python 3.8.1. '
print(re.findall(r'python [0-9]\.[0-9]\.[0-9]', data, flags=re.IGNORECASE))

# The same search with a precompiled pattern object
re_obj = re.compile(r'python [0-9]\.[0-9]\.[0-9]', flags=re.IGNORECASE)
print(re_obj.findall(data))

The output from the above example is as follows:
['Python 2.7.5', 'Python 3.8.1']
['Python 2.7.5', 'Python 3.8.1']

5. Test the speed of findall with and without compile

(1) Generate a file of numbers on Linux

[root@python ~]# seq 10000 > data.txt

(2) In PyCharm, create two scripts, findall.py and compile.py, that read data.txt

findall

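import os
import re

def main():
    pattern = "[0-9]+"
    # open() does not expand '~', so expand it explicitly
    with open(os.path.expanduser('~/data.txt')) as f:
        for line in f:
            re.findall(pattern, line)

if __name__ == '__main__':   # must be '__main__', otherwise main() never runs
    main()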

compile

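import os
import re

def main():
    pattern = "[0-9]+"
    re_obj = re.compile(pattern)   # compile once, reuse for every line
    # open() does not expand '~', so expand it explicitly
    with open(os.path.expanduser("~/data.txt")) as f:
        for line in f:
            re_obj.findall(line)

if __name__ == "__main__":   # must be "__main__", otherwise main() never runs
    main()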

(3) Upload the files to Linux

A message at the bottom of the window confirms that the upload succeeded.

(4) Test the running speed on Linux

Enter the upload directory under /opt:

[root@python ~]# cd /opt/
[root@python opt]# cd exercise
[root@python Practice]# ls
001.py  findall.py  compile.py

Run the tests:

[root@python Practice]# time python3 findall.py
real    0m0.058s
user    0m0.005s
sys 0m0.029s

[root@python Practice]# time python3 compile.py 

real    0m0.018s
user    0m0.014s
sys 0m0.004s

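In this test, the script that uses a precompiled pattern finished faster. Because the re module caches recently compiled patterns, the difference is usually small, and a single run of time can vary noticeably.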
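The shell's time command also measures interpreter start-up and file reading, so a more focused comparison can be sketched with the standard timeit module. This is not the measurement used above: the in-memory list lines stands in for data.txt, and the function names and the repeat count of 5 are arbitrary choices for illustration.

```python
import re
import timeit

# Stand-in for the file produced by `seq 10000`
lines = [str(n) for n in range(1, 10001)]
pattern = '[0-9]+'
compiled = re.compile(pattern)

def with_module_function():
    # re.findall looks the pattern up in the module's cache on every call
    for line in lines:
        re.findall(pattern, line)

def with_compiled_pattern():
    # The Pattern object is reused directly
    for line in lines:
        compiled.findall(line)

t_module = timeit.timeit(with_module_function, number=5)
t_compiled = timeit.timeit(with_compiled_pattern, number=5)

print(f're.findall:       {t_module:.4f} s')
print(f'compiled.findall: {t_compiled:.4f} s')
```

The absolute numbers depend on the machine and Python version, so they are best compared only against each other within one run.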

2. Common re functions

import re

data = 'What is the difference between python 2.7.5 and Python 3.8.1 ?'
print(re.findall(r'[0-9]\.[0-9]\.[0-9]', data))
print(re.findall(r'python [0-9]\.[0-9]\.[0-9]', data))
print(re.findall(r'Python [0-9]\.[0-9]\.[0-9]', data))
print(re.findall(r'ython [0-9]\.[0-9]\.[0-9]', data))

print(data.startswith('What'))
print(data.endswith('?'))
print(re.match('What', data))

word = "123 is one hender and twentyu-there"
print(re.match(r'\d+', word))
r = re.match(r'\d+', word)
print(r)

print(r.start())
print(r.end())
print(r.re)
print(r.group())
print(r.string)

rr = re.finditer(r'[0-9]\.[0-9]\.[0-9]', data)
print(rr)
# print([r for r in rr])
for it in rr:
    print(it.group(0))

The above example outputs the results:

# Version numbers of the form 'x.x.x'
['2.7.5', '3.8.1']
# Matches of the form 'python x.x.x'
['python 2.7.5']
# Matches of the form 'Python x.x.x'
['Python 3.8.1']
# Matches of the form 'ython x.x.x'
['ython 2.7.5', 'ython 3.8.1']
# Does data start with 'What'?
True
# Does data end with '?'?
True
# Match 'What' at the beginning of data
<re.Match object; span=(0, 4), match='What'>
# Match one or more digits at the beginning of word
<re.Match object; span=(0, 3), match='123'>
# The same Match object, printed through the variable r
<re.Match object; span=(0, 3), match='123'>
# The start position of the matched substring in the entire string
0
# The end position of the matched substring in the entire string
3
# The compiled pattern object that produced the match
re.compile('\\d+')
# The matched substring
123
# The string that was passed to match()
123 is one hender and twentyu-there
# The iterator returned by finditer
<callable_iterator object at 0x000001B92D1613D0>
# Each 'x.x.x' match, one per line
2.7.5
3.8.1

(1) Matching functions

1. re.match function

re.match attempts to match a pattern at the beginning of the string and returns None if the match is not successful.

Function syntax:

re.match(pattern, string, flags=0)

Function parameter description:

Parameter   Description
pattern     The regular expression to match.
string      The string to match.
flags       Flag bits that control how the regular expression is matched, such as case-insensitive or multi-line matching. See: Regular Expression Modifiers - Optional Flags

On success, re.match returns a Match object; otherwise it returns None.

We can use the Match object's group(num) or groups() method to get the matched text.

Match object method   Description
group(num=0)          The string matched by the entire expression. group() can take several group numbers at once, in which case it returns a tuple containing the corresponding values for those groups.
groups()              Returns a tuple containing the strings of all groups, from group 1 to the last group.

Example

import re
print(re.match('www', 'www.runoob.com').span())  # Matches at the start position
print(re.match('com', 'www.runoob.com'))         # Does not match at the start

The above example outputs the results:
(0, 3)
None

Example

import re

line = "Cats are smarter than dogs"
# .* matches any run of characters other than line breaks; .*? is its non-greedy form
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")

The above example outputs the results:
matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter

2. compile function

Example

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)   # re.I means ignoring case
>>> m = pattern.match('Hello World Wide Web')
>>> print(m)                              # Match succeeded, returning a Match object
<re.Match object; span=(0, 11), match='Hello World'>
>>> m.group(0)                            # The entire matched substring
'Hello World'
>>> m.span(0)                             # The span of the entire matched substring
(0, 11)
>>> m.group(1)                            # The substring matched by the first group
'Hello'
>>> m.span(1)                             # The span of the first group
(0, 5)
>>> m.group(2)                            # The substring matched by the second group
'World'
>>> m.span(2)                             # The span of the second group
(6, 11)
>>> m.groups()                            # Equivalent to (m.group(1), m.group(2), ...)
('Hello', 'World')
>>> m.group(3)                            # There is no third group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

3. re.search method

re.search scans the entire string and returns the first successful match.

Function syntax:

re.search(pattern, string, flags=0)

Function parameter description:

Parameter   Description
pattern     The regular expression to match.
string      The string to search.
flags       Flag bits that control how the regular expression is matched, such as case-insensitive or multi-line matching. See: Regular Expression Modifiers - Optional Flags

On success, re.search returns a Match object; otherwise it returns None.

We can use the Match object's group(num) or groups() method to get the matched text.

Match object method   Description
group(num=0)          The string matched by the entire expression. group() can take several group numbers at once, in which case it returns a tuple containing the corresponding values for those groups.
groups()              Returns a tuple containing the strings of all groups, from group 1 to the last group.

Example

import re

print(re.search('www', 'www.runoob.com').span())  # Match at the start position
print(re.search('com', 'www.runoob.com').span())  # Match found later in the string

The above example outputs the results:
(0, 3)
(11, 14)

4. Differences between re.match and re.search

re.match matches only at the beginning of the string: if the beginning of the string does not match the regular expression, the match fails and the function returns None. re.search scans the entire string until it finds a match.

import re

line = "Cats are smarter than dogs"

matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

matchObj = re.search(r'dogs', line, re.M | re.I)
if matchObj:
    print("search --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

The above example outputs the results:
No match!!
search --> matchObj.group() :  dogs
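The difference can also be checked without printing, by looking at the return values directly. In this sketch the variable names are illustrative; the last call shows that anchoring a search pattern with '^' makes it behave like match for a single-line string.

```python
import re

line = 'Cats are smarter than dogs'

at_start = re.match(r'dogs', line)      # only tries position 0
anywhere = re.search(r'dogs', line)     # scans forward until a match is found
anchored = re.search(r'^dogs', line)    # '^' restricts search to the start

print(at_start)
print(anywhere.span())
print(anchored)
```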

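5. re.finditer function

Like findall, re.finditer finds all the substrings in the string that match the regular expression, but it returns them as an iterator of Match objects instead of a list.

re.finditer(pattern, string, flags=0)

Parameters:

Parameter   Description
pattern     The regular expression to match.
string      The string to search.
flags       Flag bits that control how the regular expression is matched, such as case-insensitive or multi-line matching. See: Regular Expression Modifiers - Optional Flags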

Example

import re

it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())

Output results:
12
32
43
3
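Because finditer yields Match objects rather than plain strings, the position of each match is also available. A short sketch on the same input (the name positions is illustrative):

```python
import re

# Collect (text, start, end) for every run of digits
positions = [
    (m.group(), m.start(), m.end())
    for m in re.finditer(r'\d+', '12a32bc43jf3')
]

for text, start, end in positions:
    print(text, start, end)
```

This is the main reason to prefer finditer over findall when the location of a match matters, or when the input is large and the matches should be processed one at a time.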

(2) Modifying functions

1. Retrieval and Replacement

Python's re module provides re.sub to replace matches in a string.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Parameters:

  • pattern: The regular expression pattern string.
  • repl: The replacement string, or a function.
  • string: The original string to be searched and replaced.
  • count: Maximum number of replacements; the default, 0, means replace all matches.
  • flags: The matching mode used at compile time, given as flag constants (integers).

The first three parameters are required, and the last two are optional.

import re

phone = "2004-959-559 # This is a telephone number."

# Delete the comment
num = re.sub(r'#.*$', "", phone)
print("Phone number:", num)

# Remove everything other than digits
num = re.sub(r'\D', "", phone)
print("Phone number:", num)

Output results:
Phone number: 2004-959-559 
Phone number: 2004959559

When the repl parameter is a function

In the following example, each matching number in the string is multiplied by 2:

import re

# Multiply the matching number by 2
def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)

s = 'A23G4HFD567'
print(re.sub(r'(?P<value>\d+)', double, s))

Output results:
A46G8HFD1134
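Two further features of re.sub are worth a sketch: backreferences such as \1 in a replacement string, and the count parameter. The related function re.subn also reports how many replacements were made. The date string and variable names below are made up for illustration.

```python
import re

# Backreferences: reorder the captured year, month and day
date = '2020-04-09'
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', date)

# count=1: replace only the first match
first_only = re.sub(r'\d+', 'N', 'A23G4HFD567', count=1)

# re.subn returns (new_string, number_of_replacements)
replaced, how_many = re.subn(r'\d+', 'N', 'A23G4HFD567')

print(swapped)
print(first_only)
print(replaced, how_many)
```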

2. re.split

The split function splits a string at each occurrence of the pattern and returns the pieces as a list. Its syntax is:

re.split(pattern, string[, maxsplit=0, flags=0])

Parameter   Description
pattern     The regular expression to match.
string      The string to split.
maxsplit    Maximum number of splits; maxsplit=1 splits once. The default, 0, means no limit.
flags       Flag bits that control how the regular expression is matched, such as case-insensitive or multi-line matching. See: Regular Expression Modifiers - Optional Flags

>>> import re
>>> re.split(r'\W+', 'runoob, runoob, runoob.')
['runoob', 'runoob', 'runoob', '']
>>> re.split(r'(\W+)', ' runoob, runoob, runoob.')   # A capturing group keeps the separators in the result
['', ' ', 'runoob', ', ', 'runoob', ', ', 'runoob', '.', '']
>>> re.split(r'\W+', ' runoob, runoob, runoob.', 1)
['', 'runoob, runoob, runoob.']

>>> re.split('x+', 'hello world')   # No match is found, so the string is returned unsplit
['hello world']

(3) Greedy and non-greedy modes

1. Concepts

Let's start with an example:

import re

example = "abbbbbbc"
pattern = re.compile("ab+")

Greedy mode: the regular expression matches as much as possible, which is why it is called greedy matching. If the pattern above is used to match the string example, the result is "abbbbbb".

Non-greedy mode: the regular expression matches as little as possible, provided the entire expression still matches. If the non-greedy form of the pattern, "ab+?", is used to match the string example, the result is just "ab".

2. Usage

In Python, greedy mode is the default. To switch a quantifier to non-greedy mode, add a question mark "?" directly after it.
The five quantifiers *, +, ?, {n,} and {n,m} become non-greedy as *?, +?, ??, {n,}? and {n,m}?.

3. Principle analysis

Greedy is the default in regular expressions. In the example above, the whole expression already matches once "ab" has been matched, but because the quantifier is greedy, the engine keeps trying to extend the match with further "b" characters. After the last "b" has been matched, the next character cannot extend the match, so matching ends and the result "abbbbbb" is returned.
So we can think of greedy mode as matching as much as possible, given that the entire expression matches successfully.

Non-greedy mode changes the regular expression "ab+" to "ab+?" in our example. As soon as "ab" has been matched, the whole expression has succeeded, so matching ends immediately and the string "ab" is returned without trying to extend it.
So we can think of non-greedy mode as matching as little as possible, given that the entire expression matches successfully.
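The behaviour described above can be checked directly with the "abbbbbbc" example. The third pattern is an addition for illustration: it shows that a non-greedy quantifier still consumes more characters when that is the only way for the whole expression to match.

```python
import re

example = 'abbbbbbc'

greedy = re.match(r'ab+', example).group()     # takes every 'b' it can
lazy = re.match(r'ab+?', example).group()      # stops after the first 'b'

# 'c' must follow, so the lazy quantifier is forced to take all the b's
lazy_forced = re.match(r'ab+?c', example).group()

print(greedy)
print(lazy)
print(lazy_forced)
```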

4. Instances

import re

text = 'Beautiful is better than ugly. Explicit is better than implicit.'
print(re.findall(r'Beautiful is.*\.', text))    # greedy: runs to the last '.'
print(re.findall(r'Beautiful is.*?\.', text))   # non-greedy: stops at the first '.'

Output results:
['Beautiful is better than ugly. Explicit is better than implicit.']
['Beautiful is better than ugly.']

5. Summary

1. Greedy and non-greedy from an application perspective

Greedy and non-greedy modes affect the matching behavior of quantifier-modified subexpressions. Greedy modes match as many as possible if the whole expression matches successfully, while non-greedy modes match as few as possible if the whole expression matches successfully.

2. Greedy and non-greedy from the point of view of matching principle

When a greedy pattern and a non-greedy pattern achieve the same matching result, the greedy one is usually more efficient. Any non-greedy pattern can be converted to a greedy one by modifying the subexpression that the quantifier applies to. Greedy mode can be combined with atomic grouping to improve matching efficiency, but non-greedy mode cannot.
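One common way to convert a non-greedy pattern into a greedy one is to replace '.' with a negated character class that excludes the delimiter. The following sketch uses a made-up HTML-like string; it illustrates the idea and is not a general HTML parser.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Non-greedy: stop at the first '>' after each '<'
lazy = re.findall(r'<.+?>', html)

# Greedy equivalent: [^>]+ cannot run past a '>', so no '?' is needed
negated = re.findall(r'<[^>]+>', html)

# Plain greedy '.+' runs from the first '<' to the last '>'
greedy = re.findall(r'<.+>', html)

print(lazy)
print(negated)
print(greedy)
```

The negated class gives the same result as the non-greedy form here because the tag text itself never contains '>'.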

(4) Python3 replace() method

Description

The replace() method replaces old (the old substring) with new (the new substring) in the string; if the third parameter, max, is given, at most max replacements are made.

Syntax

replace() method syntax:

str.replace(old, new[, max])

Parameters

  • old -- The substring to be replaced.
  • new -- The new string that replaces the old substring.
  • max -- Optional integer; at most max replacements are made.

Return value

Returns a new string in which old has been replaced with new, at most max times if the third parameter, max, is specified.
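A minimal sketch of the optional third argument, using a made-up string. Note that replace() returns a new string and leaves the original unchanged.

```python
text = 'one one one'

replace_all = text.replace('one', 'two')       # no limit
replace_two = text.replace('one', 'two', 2)    # at most two replacements

print(replace_all)
print(replace_two)
print(text)   # the original string is not modified
```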

Example

The following example shows how to use the replace() method alongside re.sub and re.split:

import re

data = 'What is the difference between python 2.7.5 and Python 3.8.1 ?'
print(data)

# str.replace handles one fixed substring per call
r_data = data.replace('2.7.5', 'x.x.x')
r_data2 = r_data.replace('3.8.1', 'x.x.x')
print(r_data2)

# re.sub replaces every version number in one call
print(re.sub(r'[0-9]\.[0-9]\.[0-9]', 'x.x.x', data))

print(data.split())
print(re.split('[ .]+', data))

Output results:
What is the difference between python 2.7.5 and Python 3.8.1 ?
What is the difference between python x.x.x and Python x.x.x ?
What is the difference between python x.x.x and Python x.x.x ?
['What', 'is', 'the', 'difference', 'between', 'python', '2.7.5', 'and', 'Python', '3.8.1', '?']
['What', 'is', 'the', 'difference', 'between', 'python', '2', '7', '5', 'and', 'Python', '3', '8', '1', '?']

(5) Draw a simple epidemic map

from pyecharts.charts import Map
from pyecharts import options as opt
import requests
import json

# Get the data
data = requests.get('https://gwpre.sina.cn/interface/fymap2020_data.json').content
data = json.loads(data)
print(data)

# Filter the data: keep (region name, value) pairs
sub_data = list()
for i in data['data']['list']:
    sub_data.append((i['name'], i['value']))
print(sub_data)

# Create the map of China
map_info = Map()

# Set up basic information for the map
map_info.set_global_opts(
    title_opts=opt.TitleOpts(title='Real-time epidemic map-' + data['data']['times'],
                             subtitle='data sources',
                             subtitle_link='https://news.sina.cn/zt_d/yiqing0121?vt=4&pos=222'),
    visualmap_opts=opt.VisualMapOpts(max_=1500, is_piecewise=True))
map_info.add('Diagnosis', sub_data, maptype='china')

# Generate the web page file
map_info.render('20200403.html')

After the script runs, it generates an HTML file; open it in a browser to view the map.

(6) Using regular expressions to extract all http or https links within a page

import re
import requests

r = requests.get('https://www.lagou.com/beijing')
# print(r)
# Non-greedy .*? stops at the first closing quote after each URL
result = re.findall(r'"(https?://.*?)"', r.content.decode('utf-8'))
print(result)

Output results:
['https://www.lagou.com/beijing/', 'https://www.lagou.com/', 'https://www.lagou.com/about.html', 'http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11010802024043', 'https://www.lagou.com/upload/oss.js?v=1010']
-----

Posted by zrocker on Thu, 09 Apr 2020 10:48:30 -0700