`re.match()`: matches once, starting from the beginning of the string
`re.search()`: finds the first match anywhere in the string
`re.findall()`: finds all matches
Returns a list. If nothing matches, the list is empty.
`re.findall(r'\d', 'chuan1zhi2')` The result is `['1', '2']`
`re.sub()`: replaces matches
`re.sub(r'\d', '_', 'chuan1zhi2')` The result is the string `'chuan_zhi_'`
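The four functions above can be compared side by side. This is a minimal sketch that reuses the sample string `chuan1zhi2` from the text; the variable names are only illustrative.

```python
import re

text = 'chuan1zhi2'

# re.match() only tries the pattern at position 0 of the string.
at_start = re.match(r'\d', text)      # None: the string starts with 'c'
prefix = re.match(r'[a-z]+', text)    # matches 'chuan' at the start

# re.search() scans forward and stops at the first match.
first = re.search(r'\d', text)        # matches '1'

# re.findall() collects every match into a list (empty if there is none).
digits = re.findall(r'\d', text)      # ['1', '2']
nothing = re.findall(r'\d', 'chuan')  # []

# re.sub() returns a new string, not a list.
replaced = re.sub(r'\d', '_', text)   # 'chuan_zhi_'

print(at_start, prefix.group(), first.group(), digits, nothing, replaced)
```

Note that `re.match()` and `re.search()` return a match object (or `None`), so the matched text is read with `.group()`.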
re.compile()
Returns a compiled pattern object p that has the same methods as the `re` module (match, search, findall, sub), but the arguments differ:
the pattern is passed to `compile()`, so the methods of p no longer take the pattern as an argument.
Without the `re.S` flag, `.` does not match the newline character `'\n'`, so a match cannot cross a line break: matching effectively happens within each line and starts again on the next line. With `re.S` (also spelled `re.DOTALL`), `.` matches any character including `'\n'`, so the string is treated as a whole and a match can span several lines.

    p = re.compile(r'\d', re.S)
    p.findall('chuan1zhi2')
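The effect of `re.S` only shows up when the pattern contains `.` and the string contains a newline. The following sketch uses a made-up two-line string to show the difference.

```python
import re

text = 'line one\nline two'  # hypothetical sample containing one newline

# Without re.S, '.' stops at the newline, so each match stays within one line.
no_flag = re.findall(r'line.*', text)
print(no_flag)      # ['line one', 'line two']

# With re.S, '.' also matches '\n', so a single match spans both lines.
with_flag = re.findall(r'line.*', text, re.S)
print(with_flag)    # ['line one\nline two']

# A compiled pattern carries the flag; findall() then only needs the string.
p = re.compile(r'one.*two', re.S)
across = p.findall(text)
print(across)       # ['one\nline two']
```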
Matching Chinese
In some cases we want to match the Chinese characters in a text. Note that the Unicode range commonly used for Chinese characters is `[\u4e00-\u9fa5]`. We say "commonly" because this range is not complete; for example, full-width (Chinese) punctuation is not included. In most cases, however, it is sufficient.
Suppose you want to extract the Chinese from the string `title = '你好,hello,世界'`. You can do this:

    import re

    title = '你好,hello,世界'
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    result = pattern.findall(title)
    print(result)  # ['你好', '世界']
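The remark that full-width punctuation lies outside the range can be checked directly. This sketch assumes the commonly quoted range `\u4e00-\u9fa5` for Chinese characters, and assumes the Unicode blocks U+3000 to U+303F and U+FF00 to U+FFEF as one possible way to catch full-width punctuation; the sample string is made up.

```python
import re

# Hypothetical sample: '\uff0c' is the full-width comma, '\uff01' the full-width '!'.
title = '你好\uff0chello\uff0c世界\uff01'

# Chinese characters only; the punctuation is skipped.
han = re.compile(r'[\u4e00-\u9fa5]+')
chinese = han.findall(title)
print(chinese)   # ['你好', '世界']

# Assumed ranges for CJK and full-width punctuation, matched separately.
fullwidth = re.compile(r'[\u3000-\u303f\uff00-\uffef]')
marks = fullwidth.findall(title)
print(marks)     # the two full-width commas and the full-width '!'
```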
Using findall

    import re

    s = 'abcasdc'
    r = re.findall(r'ab.*?c', s)
    r

The result is `['abc']`. Without the question mark, the result is `['abcasdc']`. The question mark makes the match non-greedy.

    import re

    s = 'abcasdc'
    r = re.findall(r'ab(.*)c', s)
    r

The result is `['casd']`. When parentheses are added, only the content matched inside the parentheses is returned. If there are multiple pairs of parentheses, each result is a tuple. The outer `[]` is there because findall always returns a list.
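The case with several pairs of parentheses is described above but not shown. A small sketch on the same string `abcasdc`, where the two-group pattern is an illustrative assumption:

```python
import re

s = 'abcasdc'

# No group: each list item is the whole match.
no_group = re.findall(r'ab.*c', s)
print(no_group)     # ['abcasdc']

# One group: each list item is what the group matched.
one_group = re.findall(r'ab(.*)c', s)
print(one_group)    # ['casd']

# Two groups: each list item is a tuple with one entry per group.
two_groups = re.findall(r'(ab)(.*)c', s)
print(two_groups)   # [('ab', 'casd')]
```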
Matching rules
Rule | Explanation |
---|---|
`\d` | Matches any decimal digit; equivalent to the class `[0-9]` |
`\D` | Matches any non-digit character; equivalent to the class `[^0-9]` |
`\s` | Matches any whitespace character; equivalent to the class `[ \t\n\r\f\v]` |
`\S` | Matches any non-whitespace character; equivalent to the class `[^ \t\n\r\f\v]` |
`\w` | Matches any alphanumeric character or underscore; equivalent to the class `[a-zA-Z0-9_]` |
`\W` | Matches any character that is not alphanumeric or an underscore; equivalent to the class `[^a-zA-Z0-9_]` |
`*` | Matches the previous subexpression zero or more times; equivalent to `{0,}` |
`+` | Matches the previous subexpression one or more times; equivalent to `{1,}` |
`?` | Matches the previous subexpression zero or one time; equivalent to `{0,1}` |
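The rules in the table can be tried out on a short string. The sample strings below are made up for illustration. One caveat: in Python 3, with ordinary `str` patterns, `\d`, `\s` and `\w` also match non-ASCII digits, whitespace and letters unless the `re.ASCII` flag is passed, so the bracket classes in the table are exact only for ASCII text.

```python
import re

text = 'a1 b_2\tC3'  # hypothetical sample: letters, digits, a space, a tab

digits = re.findall(r'\d', text)        # ['1', '2', '3']
non_digits = re.findall(r'\D+', text)   # ['a', ' b_', '\tC']
spaces = re.findall(r'\s', text)        # [' ', '\t']
non_spaces = re.findall(r'\S+', text)   # ['a1', 'b_2', 'C3']
words = re.findall(r'\w+', text)        # ['a1', 'b_2', 'C3']
non_words = re.findall(r'\W', text)     # [' ', '\t']

# Quantifiers applied to the single character 'b'.
sample = 'a ab abb'
star = re.findall(r'ab*', sample)       # ['a', 'ab', 'abb']
plus = re.findall(r'ab+', sample)       # ['ab', 'abb']
optional = re.findall(r'ab?', sample)   # ['a', 'ab', 'ab']

print(digits, non_digits, spaces, non_spaces, words, non_words)
print(star, plus, optional)
```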
1. What are greedy and non-greedy matching in regular expressions
**For example:**
`String str = "abcaxc"; Pattern p = "ab.*c";`
Greedy matching: a regular expression tends to match as much as possible; this is called greedy matching. Matching the string str above with the pattern p gives: abcaxc (pattern `ab.*c`)
Non-greedy matching: match as few characters as possible while still succeeding. Matching the string str above with the non-greedy form of the pattern gives: abc (pattern `ab.*?c`)
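The two results described above can be reproduced in Python. In this sketch the string is named `text` rather than `str` to avoid shadowing the built-in; the second, tag-like sample is a made-up illustration of why the difference matters in practice.

```python
import re

text = 'abcaxc'

# Greedy: '.*' takes as much as it can while still allowing the final 'c'.
greedy = re.search(r'ab.*c', text).group()
print(greedy)   # abcaxc

# Non-greedy: '.*?' takes as little as it can.
lazy = re.search(r'ab.*?c', text).group()
print(lazy)     # abc

# Hypothetical sample with two bracketed items.
tags = '<a><b>'
greedy_tags = re.findall(r'<.*>', tags)    # ['<a><b>']
lazy_tags = re.findall(r'<.*?>', tags)     # ['<a>', '<b>']
print(greedy_tags, lazy_tags)
```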
2. How to tell the two modes apart in code
**The default is greedy mode; adding a question mark directly after a quantifier makes it non-greedy.**
Quantifiers:
`{m,n}`: m to n times
`*`: any number of times (zero or more)
`+`: one or more times
`?`: zero or one time
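Each quantifier listed above has a non-greedy form made by appending `?`. The sketch below runs both forms against the made-up string `aaaa` and shows what `re.search()` returns for each one.

```python
import re

text = 'aaaa'  # hypothetical sample

# Each pair: the greedy quantifier, then the same quantifier followed by '?'.
patterns = [r'a{2,3}', r'a{2,3}?', r'a*', r'a*?', r'a+', r'a+?', r'a?', r'a??']

# Every pattern matches at position 0, possibly with an empty match.
results = {p: re.search(p, text).group() for p in patterns}

for p, matched in results.items():
    print(f'{p!r:12} -> {matched!r}')
```

The greedy forms return `'aaa'`, `'aaaa'`, `'aaaa'` and `'a'`; the non-greedy forms return `'aa'`, `''`, `'a'` and `''`, since `*?` and `??` are allowed to match nothing.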
Pay attention to the parentheses

    import re

    string = "abcdefg acbdgef abcdgfe cadbgfe"

    # The difference between patterns with and without parentheses

    # Two pairs of parentheses (one nested inside the other)
    regex = re.compile(r"((\w+)\s+\w+)")
    print(regex.findall(string))
    # Output: [('abcdefg acbdgef', 'abcdefg'), ('abcdgfe cadbgfe', 'abcdgfe')]

    # One pair of parentheses
    regex1 = re.compile(r"(\w+)\s+\w+")
    print(regex1.findall(string))
    # Output: ['abcdefg', 'abcdgfe']

    # No parentheses
    regex2 = re.compile(r"\w+\s+\w+")
    print(regex2.findall(string))
    # Output: ['abcdefg acbdgef', 'abcdgfe cadbgfe']
The first regex contains two pairs of parentheses. Its output is a list of two tuples; each tuple holds what the outer group matched and what the inner group matched.
The second regex contains one pair of parentheses, and its output is the content matched by that group, not what the whole expression matched.
The third regex does not contain parentheses, and its output is what the entire expression matches.
Conclusion: findall() returns what the parentheses matched (as with regex1). With multiple pairs of parentheses, it returns a tuple holding what each pair matched (as with regex). With no parentheses, it returns what the whole pattern matched (as with regex2). Watch out for this pitfall when extracting data.
In fact, grouping is not unique to Python; it is part of regular expressions themselves, and regular expressions in other high-level languages support it too. In a regular expression, "()" means grouping, and each pair of parentheses is one group. When the pattern has parentheses, findall returns only the content matched inside "()"; when it has none, the pattern behaves as if one pair of parentheses were wrapped around the whole expression.
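The "outermost parentheses" equivalence can be checked on the same sample string. The sketch also shows two standard ways to keep the whole match when parentheses are needed only for grouping: a non-capturing group `(?:...)` and `re.finditer()`. Both are part of Python's `re` module, although the text above does not mention them, so treat them as an optional addition.

```python
import re

string = "abcdefg acbdgef abcdgfe cadbgfe"

# No parentheses versus one pair wrapped around the whole pattern: same result.
whole = re.findall(r"\w+\s+\w+", string)
wrapped = re.findall(r"(\w+\s+\w+)", string)
print(whole == wrapped)   # True

# (?:...) groups without capturing, so findall() still returns the whole match.
non_capturing = re.findall(r"(?:\w+)\s+\w+", string)
print(non_capturing)      # ['abcdefg acbdgef', 'abcdgfe cadbgfe']

# finditer() yields match objects: group(0) is the whole match, group(1) the first group.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r"(\w+)\s+\w+", string)]
print(pairs)
```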