1. Introduction
A regular expression is itself a small, highly specialized programming language. In Python it is built in through the re module, so programmers can call it directly to perform regular expression matching. Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C.
2. Character Meaning in Regular Expressions
2.1 Ordinary characters and 11 metacharacters:
Here we need to emphasize the role of the backslash:
A backslash followed by a metacharacter removes its special function (it turns the special character into an ordinary one);
A backslash followed by an ordinary character gives it a special function (these are the predefined characters);
A backslash followed by a number refers back to the string matched by the group with that number.
>>> import re
>>> print(re.search(r'(tina)(fei)haha\2','tinafeihahafei tinafeihahatina').group())
tinafeihahafei
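The three roles of the backslash described above can be seen side by side in a minimal sketch. The sample strings and variable names are made up for illustration.

```python
import re

# Backslash + metacharacter: r'\.' matches a literal dot, not "any character".
literal_dot = re.findall(r'a\.c', 'a.c abc')

# Backslash + ordinary character: r'\d' is the predefined digit class.
digits = re.findall(r'\d', 'a1b22')

# Backslash + number: r'\1' must repeat whatever group 1 matched.
repeated = re.search(r'(\w+) \1', 'say hello hello world').group()

print(literal_dot)  # ['a.c']
print(digits)       # ['1', '2', '2']
print(repeated)     # hello hello
```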
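2.2 Predefined Character Sets (which can be written inside a character set [...])
Pay particular attention to the word boundary \b. Without the r prefix, Python reads '\b' as a backspace character, which is why the first call finds nothing: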
>>> print(re.findall('\btina','tian tinaaaa'))
[]
>>> print(re.findall(r'\btina','tian tinaaaa'))
['tina']
>>> print(re.findall(r'\btina','tian#tinaaaa'))
['tina']
>>> print(re.findall(r'\btina\b','tian#tina@aaa'))
['tina']
2.3 Special Grouping Usage
3. Common Functions in re Module
3.1 compile()
Compiles a regular expression pattern and returns a pattern object. (Regular expressions that are used often can be compiled into pattern objects, which is slightly more efficient.)
Format:
re.compile(pattern,flags=0)
pattern: The expression string used at compile time.
flags: compilation flags that modify how the regular expression matches, such as case-insensitive matching or multi-line matching. Commonly used flags include re.I (ignore case), re.M (multi-line) and re.S (dot matches newline).
>>> import re
>>> tt = "Tina is a good girl, she is cool, clever, and so on..."
>>> rr = re.compile(r'\w*oo\w*')
>>> print(rr.findall(tt))
['good', 'cool']
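As a hedged illustration of how flags change matching, the sketch below applies re.I, re.M and re.S to a made-up two-line string. The sample text and variable names are assumptions for illustration only.

```python
import re

text = "Tina\ntina is COOL"

# re.I: ignore case when matching letters.
ignore_case = re.findall(r'tina', text, flags=re.I)

# re.M: '^' also matches right after each newline, not only at the start.
first_words = re.findall(r'^\w+', text, flags=re.M)

# re.S: '.' also matches the newline character.
across_lines = re.search(r'Tina.tina', text, flags=re.S)

# Flags are combined with '|' and can be fixed once at compile time.
combined = re.compile(r'^tina', re.I | re.M).findall(text)

print(ignore_case)                # ['Tina', 'tina']
print(first_words)                # ['Tina', 'tina']
print(across_lines is not None)   # True
print(combined)                   # ['Tina', 'tina']
```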
3.2 match()
Determines whether the RE matches at the beginning of the string. Note: this is not a full match. If the string still has characters left after the pattern ends, the match is still considered successful. To require a full match, add the boundary matcher '$' at the end of the expression.
Format:
re.match(pattern, string, flags=0)
>>> import re
>>> print(re.match('com','comwww.runcomoob').group())
com
>>> print(re.match('com','Comwww.runcomoob',re.I).group())
Com
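The difference between a prefix match and a full match can be sketched as follows. The sample strings are invented; re.fullmatch (available since Python 3.4) is shown as an alternative to appending '$', not as something the original text prescribes.

```python
import re

# match() succeeds because the pattern matches at the start,
# even though 'www' is left over.
prefix = re.match(r'com', 'comwww')

# Adding '$' requires the pattern to reach the end of the string.
anchored = re.match(r'com$', 'comwww')

# re.fullmatch() requires the whole string to match the pattern.
full = re.fullmatch(r'com', 'com')

print(prefix.group())  # com
print(anchored)        # None
print(full.group())    # com
```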
3.3 search()
Format: re.search(pattern, string, flags=0)
re.search scans the string for the pattern and returns as soon as the first match is found; if nothing in the string matches, it returns None.
print(re.search(r'\dcom','www.4comrunoob.5com').group())
The results are as follows:
4com
* Note: when match() or search() succeeds, it returns a match object, which has the following methods:
group() returns a string matched by RE
start() returns the starting position of the match
end() returns the location of the end of the match
span() returns a tuple containing the location of the match (start, end)
group() returns the string matched by the RE as a whole; it can also take several group numbers at once, in which case it returns the strings matched by those groups.
a. group() returns the string matched by the whole RE.
b. group(n, m) returns a tuple of the strings matched by groups n and m; if a group number does not exist, an IndexError is raised.
c. groups() returns a tuple containing the strings of all groups in the regular expression, from group 1 up to the last group. It is usually called without arguments, and each element of the tuple corresponds to a group defined in the regular expression.
import re
a = "123abc456"
print(re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(0))  # 123abc456, the whole match
print(re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(1))  # 123
print(re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(2))  # abc
print(re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(3))  # 456
group(1) returns the part matched by the first pair of parentheses, group(2) the part matched by the second pair, and group(3) the part matched by the third pair.
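The match object methods listed above can be tried together in one short sketch. The input string 'xx123abc456yy' is made up so that the match does not start at position 0.

```python
import re

m = re.search(r'([0-9]+)([a-z]+)([0-9]+)', 'xx123abc456yy')

print(m.group())      # 123abc456
print(m.group(1, 3))  # ('123', '456')
print(m.groups())     # ('123', 'abc', '456')
print(m.start())      # 2
print(m.end())        # 11
print(m.span())       # (2, 11)

# Asking for a group number that does not exist raises IndexError.
try:
    m.group(4)
    raised = False
except IndexError:
    raised = True
print(raised)         # True
```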
3.4 findall()
re.findall scans the string and returns all non-overlapping matches as a list.
Format:
re.findall(pattern, string, flags=0)
import re
p = re.compile(r'\d+')
print(p.findall('o1n2m3k4'))
The results are as follows:
['1', '2', '3', '4']
import re
tt = "Tina is a good girl, she is cool, clever, and so on..."
rr = re.compile(r'\w*oo\w*')
print(rr.findall(tt))
print(re.findall(r'(\w)*oo(\w)',tt))  # () marks subexpressions (groups)
The results are as follows:
['good', 'cool']
[('g', 'd'), ('c', 'l')]
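What findall returns depends on how many capturing groups the pattern has. The sketch below, which reuses the sentence from the example above, compares zero, one and two groups. The non-capturing form (?:...) is added here for contrast and is an assumption about what is useful to show, not something taken from the original example.

```python
import re

tt = "Tina is a good girl, she is cool, clever, and so on..."

# No groups: each item is the whole match.
no_groups = re.findall(r'\w*oo\w*', tt)

# One group: each item is just that group's text.
one_group = re.findall(r'(\w)oo\w', tt)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\w)oo(\w)', tt)

# Non-capturing groups do not change the result: whole matches again.
non_capturing = re.findall(r'(?:\w)oo(?:\w)', tt)

print(no_groups)      # ['good', 'cool']
print(one_group)      # ['g', 'c']
print(two_groups)     # [('g', 'd'), ('c', 'l')]
print(non_capturing)  # ['good', 'cool']
```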
3.5 finditer()
Searches the string and returns an iterator that yields each match (a Match object) in order.
Format: re.finditer(pattern, string, flags=0)
import re
matches = re.finditer(r'\d+','12 drumm44ers drumming, 11 ... 10 ...')
for i in matches:
    print(i)
    print(i.group())
    print(i.span())
The results are as follows (Python 3.7+; older versions print <_sre.SRE_Match object; ...> instead of <re.Match object; ...>):
<re.Match object; span=(0, 2), match='12'>
12
(0, 2)
<re.Match object; span=(8, 10), match='44'>
44
(8, 10)
<re.Match object; span=(24, 26), match='11'>
11
(24, 26)
<re.Match object; span=(31, 33), match='10'>
10
(31, 33)
3.6 split()
Splits the string at every match of the pattern and returns the pieces as a list. For example, re.split(r'\s+', text) splits a string into a list of words on whitespace.
Format:
re.split(pattern, string, maxsplit=0, flags=0)
maxsplit specifies the maximum number of splits; if it is omitted, the string is split at every match.
>>> print(re.split(r'\d+','one1two2three3four4five5'))
['one', 'two', 'three', 'four', 'five', '']
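The effect of maxsplit can be sketched with a shortened version of the string above. The last call, which wraps the pattern in a capturing group so the separators are kept in the result, is an extra illustration and not part of the original example.

```python
import re

s = 'one1two2three3four'

# Default: split at every match.
all_parts = re.split(r'\d+', s)

# maxsplit=2: stop after two splits; the rest stays in one piece.
limited = re.split(r'\d+', s, maxsplit=2)

# A capturing group in the pattern keeps the separators in the list.
with_separators = re.split(r'(\d+)', s)

print(all_parts)        # ['one', 'two', 'three', 'four']
print(limited)          # ['one', 'two', 'three3four']
print(with_separators)  # ['one', '1', 'two', '2', 'three', '3', 'four']
```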
3.7 sub()
Replaces every substring of string that matches the pattern with repl and returns the resulting string.
Format:
re.sub(pattern, repl, string, count=0, flags=0)
>>> import re
>>> text = "JGood is a handsome boy, he is cool, clever, and so on..."
>>> print(re.sub(r'\s+', '-', text))
JGood-is-a-handsome-boy,-he-is-cool,-clever,-and-so-on...
The second parameter is the replacement string; in this case, '-'.
The fourth parameter is the maximum number of substitutions. The default is 0, meaning that every match is replaced.
re.sub also allows complex processing of matching substitutions using functions.
For example, re.sub(r'\s', lambda m: '[' + m.group(0) + ']', text, count=0) wraps every space in the string in brackets, giving '[ ]'.
>>> import re
>>> text = "JGood is a handsome boy, he is cool, clever, and so on..."
>>> print(re.sub(r'\s+', lambda m: '[' + m.group(0) + ']', text, count=0))
JGood[ ]is[ ]a[ ]handsome[ ]boy,[ ]he[ ]is[ ]cool,[ ]clever,[ ]and[ ]so[ ]on...
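A few more variations on re.sub, as a minimal sketch: limiting the number of replacements with count, referring to groups inside the replacement string, and using a function as the replacement. The sample sentence and variable names are invented for illustration.

```python
import re

text = "he is cool, clever, and so on"

# count=2: only the first two runs of whitespace are replaced.
first_two = re.sub(r'\s+', '-', text, count=2)

# \1 and \2 in the replacement refer to the groups of the match.
swapped = re.sub(r'(\w+), (\w+)', r'\2, \1', text, count=1)

# A function receives each match object and returns the replacement text.
capitalized = re.sub(r'\b\w', lambda m: m.group(0).upper(), 'so on')

print(first_two)    # he-is-cool, clever, and so on
print(swapped)      # he is clever, cool, and so on
print(capitalized)  # So On
```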
3.8 subn()
Works like sub(), but returns a tuple of the new string and the number of substitutions made.
Format:
re.subn(pattern, repl, string, count=0, flags=0)
>>> print(re.subn('[1-2]','A','123456abcdef'))
('AA3456abcdef', 2)
>>> print(re.sub("g.t","have",'I get A, I got B ,I gut C'))
I have A, I have B ,I have C
>>> print(re.subn("g.t","have",'I get A, I got B ,I gut C'))
('I have A, I have B ,I have C', 3)
4. Some Points for Attention
4.1 The difference between re.match and re.search and re.findall:
re.match only matches at the beginning of the string; if the beginning does not match the regular expression, the match fails and the function returns None. re.search scans the entire string until it finds the first match. re.findall returns every match in the string as a list.
import re
a = re.search(r'[\d]', "abc33").group()
print(a)
p = re.match(r'[\d]', "abc33")
print(p)
b = re.findall(r'[\d]', "abc33")
print(b)
The results are as follows:
3
None
['3', '3']
4.2 Greedy Matching and Non-Greedy Matching
* The quantifiers introduced earlier (*, +, and so on) are greedy: they match as much as possible. Appending a ? makes them lazy (non-greedy), so that they match as little as possible.
print(re.findall(r"a(\d+?)",'a23b'))
print(re.findall(r"a(\d+)",'a23b'))
The results are as follows:
['2']
['23']
print(re.match('<(.*)>','<H1>title<H1>').group())
print(re.match('<(.*?)>','<H1>title<H1>').group())
The results are as follows:
<H1>title<H1>
<H1>
print(re.findall(r"a(\d+)b",'a3333b'))
print(re.findall(r"a(\d+?)b",'a3333b'))
The results are as follows:
['3333']
['3333']
Note that when the quantified part is bounded by required text on both sides, as in the last example, greedy and non-greedy matching give the same result: the non-greedy modifier has no effect.
4.3 A Pitfall with flags
print(re.split('a','1A1a2A3',re.I)) does not ignore case in its output.
This is because the signature is re.split(pattern, string, maxsplit=0, flags=0). When only three positional arguments are passed, re.I is taken as the third parameter, maxsplit, so it has no effect as a flag. For re.I to take effect here, write flags=re.I.
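The pitfall and its fix can be sketched as follows. To avoid the positional call itself (recent Python versions warn about passing maxsplit or flags positionally), the sketch passes maxsplit=2 explicitly, which is what re.I amounts to when it lands in the third position.

```python
import re

# re.I is an integer-valued flag; its value is 2.
flag_value = int(re.I)

# So re.I in the third position is read as maxsplit=2 and case is
# NOT ignored: only the lowercase 'a' is used as a separator.
as_maxsplit = re.split('a', '1A1a2A3', maxsplit=2)

# Passing the flag by keyword makes the split case-insensitive.
fixed = re.split('a', '1A1a2A3', flags=re.I)

print(flag_value)   # 2
print(as_maxsplit)  # ['1A1', '2A3']
print(fixed)        # ['1', '1', '2', '3']
```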
5. Short Regular Expression Exercises
5.1 Matching Telephone Number
>>> print(re.compile(r'\d{3}-\d{6}').findall('010-628888'))
['010-628888']
5.2 Matching IP
>>> re.search(r"(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}([01]?\d?\d|2[0-4]\d|25[0-5])","192.168.1.1")
<re.Match object; span=(0, 11), match='192.168.1.1'>
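Because re.search accepts a match anywhere in the string, it can report a hit inside text that is not a valid address. A minimal sketch of a stricter check is shown below. The names OCTET, IPV4 and is_ipv4 are hypothetical helpers introduced here, and the use of fullmatch is an assumption about how one might validate a whole string, not the original author's method.

```python
import re

# One octet: 250-255, 200-249, or 0-199 (leading zeros allowed).
OCTET = r'(?:25[0-5]|2[0-4]\d|[01]?\d?\d)'

# Three "octet + dot" parts followed by a final octet.
IPV4 = re.compile(r'(?:' + OCTET + r'\.){3}' + OCTET)


def is_ipv4(s):
    """Return True only if the whole string is a dotted IPv4 address."""
    return IPV4.fullmatch(s) is not None


print(is_ipv4('192.168.1.1'))   # True
print(is_ipv4('256.1.1.1'))     # False
print(is_ipv4('1.2.3'))         # False

# search() still finds a valid-looking address inside the invalid one.
loose = IPV4.search('256.1.1.1').group()
print(loose)                    # 56.1.1.1
```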