The re module: simple and practical character matching

re.match() matches only at the start of the string and returns the first match

re.search() scans the string and returns the first match found
re.findall() finds all matches
It returns a list. If nothing matches, the list is empty.

re.findall(r'\d', 'chuan1zhi2')    The result is ['1', '2']
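
As a rough illustration of how the three functions differ, the sketch below runs them on the same sample string. The variable names are only for illustration and do not come from the original post.

```python
import re

text = 'chuan1zhi2'

# re.match() only succeeds if the pattern matches at position 0.
m = re.match(r'\d', text)        # None: the string starts with 'c'
m2 = re.match(r'[a-z]+', text)   # matches 'chuan'

# re.search() scans forward and returns the first match found anywhere.
s = re.search(r'\d', text)       # matches '1'

# re.findall() returns every non-overlapping match as a list.
all_digits = re.findall(r'\d', text)
no_match = re.findall(r'\d', 'chuanzhi')   # empty list when nothing matches

print(m, m2.group(), s.group(), all_digits, no_match)
```

Note that match() and search() return a match object (or None), so the matched text has to be read with group(), while findall() returns the strings directly.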

re.sub() replaces matches

re.sub(r'\d', '_', 'chuan1zhi2')   The result is 'chuan_zhi_' (a string, not a list)
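
A small sketch of re.sub() beyond the basic call, assuming the same sample string. The count argument and a function used as the replacement are both part of the standard re.sub() signature; the variable names are illustrative.

```python
import re

text = 'chuan1zhi2'

# Replace every digit with an underscore.
replaced = re.sub(r'\d', '_', text)

# Replace only the first digit by limiting the number of substitutions.
first_only = re.sub(r'\d', '_', text, count=1)

# The replacement may also be a function that receives each match object.
doubled = re.sub(r'\d', lambda m: str(int(m.group()) * 2), text)

print(replaced, first_only, doubled)
```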

re.compile()
Returns a compiled pattern object p that has the same methods as the re module, but takes different arguments.
The pattern is passed to compile(), so the methods of p take only the string to search.
By default, '.' matches any character except a newline, so a match built on '.' cannot cross line boundaries. With the re.S flag (also spelled re.DOTALL), '.' matches '\n' like any ordinary character, so the pattern is applied to the string as a whole.

p = re.compile(r'\d', re.S)
p.findall('chuan1zhi2')    The result is ['1', '2']
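
The effect of re.S only shows up when the pattern contains '.' and the text contains a newline. The following is a minimal sketch with a made-up two-line string:

```python
import re

text = 'line one\nline two'

# Without re.S, '.' stops at the newline, so '.+' matches each line separately.
per_line = re.compile(r'.+').findall(text)

# With re.S (re.DOTALL), '.' also matches '\n', so '.+' spans the whole string.
whole = re.compile(r'.+', re.S).findall(text)

# A compiled pattern takes the string to search, not the pattern, as its argument.
digits = re.compile(r'\d')
found = digits.findall('chuan1zhi2')

print(per_line, whole, found)
```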

Match Chinese
In some cases we want to match the Chinese characters in a text. Note that the Unicode range mainly used for Chinese is [\u4e00-\u9fa5]. It is "mainly" because this range is not complete: for example, full-width (Chinese) punctuation is not included. In most cases, however, it is sufficient.
Suppose you want to extract the Chinese from the string title = '你好，hello，世界'. You can do this:

import re
title = '你好，hello，世界'
# \u4e00-\u9fa5 covers the common Chinese characters, but not full-width punctuation
pattern = re.compile('[\u4e00-\u9fa5]+')
result = pattern.findall(title)
print(result)    # ['你好', '世界']
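
To make the limitation concrete, here is a hedged sketch using the commonly cited range \u4e00-\u9fa5 on an assumed sample string. The extended pattern is not from the original post: it is one possible way to also capture punctuation, by adding the CJK Symbols and Punctuation block (U+3000 to U+303F) and the Halfwidth and Fullwidth Forms block (U+FF00 to U+FFEF).

```python
import re

# Assumed sample text with a full-width comma (U+FF0C) and full stop (U+3002).
text = '你好，世界。hi'

# Characters only: the punctuation splits the text into two matches.
han = re.compile(r'[\u4e00-\u9fa5]+')
words = han.findall(text)

# Hypothetical extension: also accept full-width and CJK punctuation.
han_with_punct = re.compile(r'[\u4e00-\u9fa5\u3000-\u303f\uff00-\uffef]+')
with_punct = han_with_punct.findall(text)

print(words, with_punct)
```

Which extra ranges to include depends on the text being processed, so the second pattern should be treated as a starting point rather than a complete definition of "Chinese text".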

Using findall()

import re
s = 'abcasdc'
r = re.findall('ab.*?c', s)
print(r)

The result is: ['abc']

Without the question mark, the result is
['abcasdc']

With the question mark, the match is non-greedy


import re
s = 'abcasdc'
r = re.findall('ab(.*)c', s)
print(r)

The result is ['casd']

With parentheses added, only the content inside the parentheses is returned.
If there are multiple pairs of parentheses, each result is a tuple, and the tuples are wrapped in an outer [].


Because this is findall(), the result is always a list.
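
The three cases (no parentheses, one pair, several pairs) can be compared side by side. This is a small sketch on the same string; the second group in the last pattern is added purely for illustration.

```python
import re

s = 'abcasdc'

# No group: each result is the whole match.
no_group = re.findall(r'ab.*c', s)

# One group: each result is only what the group captured.
one_group = re.findall(r'ab(.*)c', s)

# Two groups: each result is a tuple with one entry per group.
two_groups = re.findall(r'(ab)(.*)c', s)

print(no_group, one_group, two_groups)
```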

Matching rules

Rule	Explanation
\d	Matches any decimal digit; equivalent to the class [0-9].
\D	Matches any non-digit character; equivalent to the class [^0-9].
\s	Matches any whitespace character; equivalent to the class [ \t\n\r\f\v].
\S	Matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v].
\w	Matches any alphanumeric character or underscore; equivalent to the class [a-zA-Z0-9_].
\W	Matches any character that is not alphanumeric or underscore; equivalent to the class [^a-zA-Z0-9_].
*	Matches the previous subexpression zero or more times; equivalent to {0,}
+	Matches the previous subexpression one or more times; equivalent to {1,}
?	Matches the previous subexpression zero or one time; equivalent to {0,1}

These class equivalences hold for ASCII matching (the re.ASCII flag). By default, Python 3 str patterns also match Unicode digits, whitespace, and word characters.
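
A quick way to check the rules in the table is to run them against a short made-up string. The sample strings below are assumptions chosen for illustration.

```python
import re

sample = 'Tom_7 has 2 cats\t!'

digits = re.findall(r'\d', sample)      # decimal digits
words = re.findall(r'\w+', sample)      # runs of letters, digits, underscore
spaces = re.findall(r'\s', sample)      # spaces and the tab
non_words = re.findall(r'\W', sample)   # everything that \w rejects

# Quantifiers applied to the single character 'b' after 'a'.
star = re.findall(r'ab*', 'a ab abb')       # zero or more 'b'
plus = re.findall(r'ab+', 'a ab abb')       # one or more 'b'
optional = re.findall(r'ab?', 'a ab abb')   # zero or one 'b'

print(digits, words, spaces, non_words, star, plus, optional)
```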

What are greedy and non-greedy matching in regular expressions?

**For example:**

s = "abcaxc"

p = "ab.*c"

Greedy matching: a regular expression tends to match the longest possible string, which is called greedy matching. Matching the string s above with the pattern p gives: abcaxc (pattern ab.*c)

Non-greedy matching: the match stops as soon as it succeeds, consuming as few characters as possible. Matching the string s above with the non-greedy form of p gives: abc (pattern ab.*?c)

How to distinguish the two modes in programming

The default is greedy mode. Adding a question mark (?) directly after a quantifier makes it non-greedy.

Quantifiers:

{m,n}: m to n times

*: any number of times (zero or more)

+: one or more times

?: zero or one time
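
The rule above (append ? to a quantifier to make it non-greedy) applies to every quantifier in the list. A minimal sketch, reusing the string abcaxc from the example and a made-up digit string for {m,n}:

```python
import re

s = 'abcaxc'

# '*' versus '*?'
star_greedy = re.search(r'ab.*c', s).group()
star_lazy = re.search(r'ab.*?c', s).group()

# '+' versus '+?'
plus_greedy = re.search(r'a.+c', s).group()
plus_lazy = re.search(r'a.+?c', s).group()

# '{m,n}' versus '{m,n}?'
range_greedy = re.search(r'\d{2,4}', '123456').group()
range_lazy = re.search(r'\d{2,4}?', '123456').group()

# '?' versus '??'
opt_greedy = re.search(r'ab?', 'ab').group()
opt_lazy = re.search(r'ab??', 'ab').group()

print(star_greedy, star_lazy, plus_greedy, plus_lazy)
print(range_greedy, range_lazy, opt_greedy, opt_lazy)
```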

Watch out for parentheses

import re

string = "abcdefg  acbdgef  abcdgfe  cadbgfe"

# The difference between patterns with and without parentheses
# Two pairs of parentheses: each result is a tuple (outer group, inner group)
regex = re.compile(r"((\w+)\s+\w+)")
print(regex.findall(string))
# Output: [('abcdefg  acbdgef', 'abcdefg'), ('abcdgfe  cadbgfe', 'abcdgfe')]

# One pair of parentheses: only the group's content is returned
regex1 = re.compile(r"(\w+)\s+\w+")
print(regex1.findall(string))
# Output: ['abcdefg', 'abcdgfe']

# No parentheses: the whole match is returned
regex2 = re.compile(r"\w+\s+\w+")
print(regex2.findall(string))
# Output: ['abcdefg  acbdgef', 'abcdgfe  cadbgfe']

The first regex contains two pairs of parentheses, and its output is a list containing two tuples.

The second regex contains one pair of parentheses, and its output is the content matched by that pair, not the result matched by the whole expression.

The third regex does not contain parentheses, and its output is what the entire expression matches.

Conclusion: findall() returns what the parentheses matched (as with regex1). With multiple pairs of parentheses, it returns what each pair matched, as tuples (as with regex). With no parentheses, it returns what the whole pattern matched (as with regex2). Watch out for this pitfall when extracting data.

In fact, grouping is not unique to Python; it is a feature of regular expressions themselves. In a regular expression, "()" means grouping, and each pair of parentheses defines one group. When a pattern contains parentheses, findall() returns only the content of the groups. When it contains none, findall() behaves as if one pair of parentheses were wrapped around the whole pattern. How groups are returned in other languages depends on each language's own regex API.
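
If parentheses are needed only for grouping and the whole match is still wanted, two standard options in Python's re module can help. The sketch below is one possible approach, not something the original post prescribes: a non-capturing group (?:...), and finditer(), which yields match objects.

```python
import re

string = 'abcdefg  acbdgef  abcdgfe  cadbgfe'

# (?:...) groups without capturing, so findall() returns whole matches again.
non_capturing = re.findall(r'(?:\w+)\s+\w+', string)

# finditer() yields match objects: group(0) is always the whole match,
# and group(1) is the first capturing group.
pairs = [(m.group(0), m.group(1))
         for m in re.finditer(r'(\w+)\s+\w+', string)]

print(non_capturing)
print(pairs)
```

With finditer(), both the full match and the individual groups stay available, which avoids having to restructure the pattern just to change what findall() returns.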

Posted by davemwohio on Mon, 01 Nov 2021 20:44:39 -0700

Programmer Group
