Regular expressions
What is a regular expression? A regular expression is a pattern for string operations: a "pattern string" built from a set of predefined special characters and their combinations. This pattern string expresses a filtering rule for strings.
Regular expressions are not unique to any one programming language, and they are used differently in different environments, for example the awk command in shell scripts or the # and % split characters. Whichever form is used, regular expressions are a great help in solving text-matching problems.
In a crawler, once we have the HTML of the target web page, we often need to match target strings in batches. These strings may be URLs or key information on the page. Copying and pasting them by hand is unrealistic, so we use regular expressions to match and output them.
All of the following examples are Python scripts.
Escape sequences, quantifiers, and special characters
Pattern | Description |
---|---|
\w | Match a letter, digit, or underscore |
\W | Match any character that is not a letter, digit, or underscore |
\s | Match any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v] |
\S | Match any non-whitespace character |
\d | Match any digit; for ASCII text, equivalent to [0-9] |
\D | Match any non-digit |
\A | Match the start of the string |
\Z | Match the end of the string. In Python's re module it matches only at the very end; in Perl-style engines it also matches just before a final line break. |
\z | Match only at the very end of the string (Perl-style engines; older versions of Python's re module do not support it) |
\G | Match where the last match ended (Perl-style engines; not supported by Python's re module) |
\n | Match a line break |
\t | Match a tab |
^ | Match the beginning of the string |
$ | Match the end of the string, or just before a line break at the end of the string |
. | Match any character except a line break. When the re.DOTALL flag is specified, any character including a line break is matched. |
[...] | Match any one character from a set: [amk] matches "a", "m", or "k" |
[^...] | Match any one character not in the set: [^abc] matches any character other than a, b, or c |
* | Match 0 or more repetitions of the preceding expression |
+ | Match 1 or more repetitions of the preceding expression |
? | Match 0 or 1 repetitions of the preceding expression; placed after another quantifier (as in *? or +?), it makes that quantifier non-greedy |
{n} | Match exactly n repetitions of the preceding expression |
{n,m} | Match n to m repetitions of the preceding expression, greedily |
a\|b | Match a or b |
() | Match the expression in parentheses and treat it as a group |
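To make a few rows of the table concrete, here is a small sketch. The sample string is made up for illustration; only the patterns come from the table.

```python
import re

# Made-up sample text; note the tab character between "yuan" and "on"
sample = "User_01 paid 250 yuan\ton 2020-03-01"

# \w+ matches runs of letters, digits, and underscores
words = re.findall(r"\w+", sample)

# \d{4} matches exactly four digits in a row
four_digits = re.findall(r"\d{4}", sample)

# [^\w\s] matches one character that is neither a word character nor whitespace
punctuation = re.findall(r"[^\w\s]", sample)

# a|b chooses between alternatives; the parentheses group them
currency = re.search(r"(yuan|dollar)", sample).group(1)

print(words)
print(four_digits)
print(punctuation)
print(currency)
```

re.findall, used here for brevity, returns every match in the string as a list.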
re.match
re.match always starts matching at the first character of the string.

    import re

    content = "Hello 123 4567 World_This is a Regex Demo"
    # See the table above: '^' matches the beginning and '$' matches the end
    # re.match takes three arguments: the pattern, the target string, and optional flags
    result = re.match(r"^Hello\s\d{3}\s\d{4}\s\w{10}.*Demo$", content)
    print(len(content))
    print(result)          # the result is a match object
    print(result.group())  # the matched text; pass an index to get a specific group
    print(result.span())   # the (start, end) positions of the match
output
    41
    <re.Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
    Hello 123 4567 World_This is a Regex Demo
    (0, 41)
Universal matching
    # Same match as above: '.*' matches everything between 'Hello' and 'Demo'
    result = re.match(r"^Hello.*Demo$", content)
Target matching
    import re

    content = "Hello 1234567 World_This is a Regex Demo"
    # The digits are delimited by \s on the left and \s on the right
    result = re.match(r"^Hello\s(\d+)\s\w.*Demo$", content)
    print(result.group())
    # If the pattern contains parentheses, group(n) extracts the content of the nth pair
    # Parentheses also let a sub-pattern be treated as a single unit
    print(result.group(1))
output
    Hello 1234567 World_This is a Regex Demo
    1234567
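When a pattern has more than one pair of parentheses, each pair gets its own index. The sketch below extends the pattern above with a second group (an addition for illustration, not part of the original example) and uses groups() to fetch all of them at once.

```python
import re

content = "Hello 1234567 World_This is a Regex Demo"

# Two groups: the digits, then the word that follows them
result = re.match(r"^Hello\s(\d+)\s(\w+)\s.*Demo$", content)

print(result.group(1))  # first group: the digits
print(result.group(2))  # second group: the word after the digits
print(result.groups())  # all groups as a tuple
```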
Greedy and non-greedy matching

    import re

    content = "Hello 1234567 World_This is a Regex Demo"
    # Greedy: '.*' matches as many characters as possible, so (\d+) is left with only one digit
    result = re.match(r"^He.*(\d+).*Demo$", content)
    print(result.group())
    print(result.group(1))
    # Non-greedy: '?' after '.*' makes it match as few characters as possible (possibly none)
    result1 = re.match(r"^He.*?(\d+).*Demo$", content)
    print(result1.group())
    print(result1.group(1))
output
    Hello 1234567 World_This is a Regex Demo
    7
    Hello 1234567 World_This is a Regex Demo
    1234567
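The difference matters most when extracting values from HTML, which is the crawler use case mentioned at the start. A minimal sketch, using a made-up HTML fragment:

```python
import re

# Made-up fragment with two links on one line
html = '<a href="/first">One</a><a href="/second">Two</a>'

# Greedy: '.*' runs to the last double quote in the string
greedy = re.findall(r'href="(.*)"', html)

# Non-greedy: '.*?' stops at the first closing double quote
lazy = re.findall(r'href="(.*?)"', html)

print(greedy)
print(lazy)
```

The greedy version swallows both links into a single result, while the non-greedy version returns each href value separately.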
Matching modes

    import re

    # content contains a line break after "World_This "
    content = "Hello 1234567 World_This \nis a Regex Demo"
    result = re.match(r"^He.*?(\d+).*?Demo$", content)
    result1 = re.match(r"^He.*?(\d+).*?Demo$", content, re.S)
    # By default '.' does not match a line break, so the first call returns None
    # With the re.S (re.DOTALL) flag, '.' matches any character, including line breaks
    print(result)
    print(result1)
output
    None
    <re.Match object; span=(0, 41), match='Hello 1234567 World_This \nis a Regex Demo'>
Escaping

    import re

    content = "price is $5.00"
    # If the target string contains special characters from the table above,
    # escape them in the pattern with a backslash '\'
    result = re.match(r"price is \$5\.00", content)
    print(result.group())
output
price is $5.00
One limitation of re.match is that it only matches from the first character. If the pattern does not match at the very start of the target string, the result is None.
For example:

    import re

    content = "price is $5.00"
    # The pattern starts with 'r', but the string starts with 'p'
    result = re.match(r"rice is \$5\.00", content)
    print(result)
output
None
re.search
re.search solves this problem.
It scans the entire string and returns the first successful match.

    import re

    content = "Extra stings Hello 1234567 World_This is a Regex Demo Extra stings"
    # re.match fails because the string does not start with 'Hello', so it returns None
    result = re.match(r"Hello.*?(\d+).*?Demo", content)
    result1 = re.search(r"Hello.*?(\d+).*?Demo", content)
    print(result)
    print(result1)
output
    None
    <re.Match object; span=(13, 53), match='Hello 1234567 World_This is a Regex Demo'>
re.findall
re.findall works in a similar way: it scans the whole string and returns a list of all non-overlapping matches, which can then be accessed by index.
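A short sketch of re.findall, using made-up sample strings. Note that when the pattern contains groups, each result is a tuple of the group contents rather than the whole match.

```python
import re

# Made-up sample string with several numbers in it
content = "Extra stings Hello 1234567 World_This is a Regex Demo 89 Extra 10"

# No groups: each result is the full matched text
numbers = re.findall(r"\d+", content)
print(numbers)
print(numbers[0])  # results can be accessed by index

# Two groups: each result is a (name, value) tuple
pairs = re.findall(r"(\w+)=(\d+)", "a=1, b=22, c=333")
print(pairs)
```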
re.sub
re.sub replaces every matching substring in the string and returns the resulting string.

    import re

    content = "Extra stings Hello 1234567 World_This is a Regex Demo Extra stings"
    # Replace each run of digits with 'Re'
    result = re.sub(r"\d+", "Re", content)
    print(result)
output
Extra stings Hello Re World_This is a Regex Demo Extra stings
You can also append new text after the matched string:

    import re

    content = "Extra stings Hello 1234567 World_This is a Regex Demo Extra stings"
    # '\1' refers to the text captured by the first group; the r prefix makes it a raw string
    # Writing "\\1 8910" instead of r"\1 8910" has the same effect
    result = re.sub(r"(\d+)", r"\1 8910", content)
    print(result)
output
Extra stings Hello 1234567 8910 World_This is a Regex Demo Extra stings
re.compile
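re.compile compiles a pattern string into a regular expression object so that it can be reused.

    import re

    content = '''Hello 1234567 World_This is a Regex Demo'''
    # Flags such as re.S are given once, at compile time
    pattern = re.compile("Hello.*Demo", re.S)
    # pattern.match(content) is equivalent
    result = re.match(pattern, content)
    print(result.group())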
output
Hello 1234567 World_This is a Regex Demo
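The benefit of compiling shows up when the same pattern is applied many times. The sketch below assumes a made-up list of log-like lines and a date pattern chosen only for illustration; it calls the compiled object's own findall and search methods.

```python
import re

# Compile once, then reuse for every line
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# Made-up sample lines
lines = [
    "created 2019-11-10",
    "no date here",
    "updated 2019-11-16 and 2019-11-17",
]

# The compiled object has the same methods as the re module functions
found = [pattern.findall(line) for line in lines]
first = pattern.search(lines[0]).group()

print(found)
print(first)
```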
Example demonstration
    import re
    import requests

    # Fetching the page over the network can take a while
    content = requests.get("https://book.douban.com/").text
    pattern = re.compile(r'<div class="title">.*?href="(.*?)".*?title="(.*?)"', re.S)
    results = re.findall(pattern, content)
    for result in results:
        url, name = result
        print(url, name)
output

    https://book.douban.com/subject/34778578/?icn=index-latestbook-subject wife fearing family
    https://book.douban.com/subject/30432492/?icn=index-latestbook-subject travel
    https://book.douban.com/subject/34454619/?icn=index-latestbook-subject forgotten, Interpol
    https://book.douban.com/subject/30264052/?icn=index-latestbook-subject general history of the West: from ancient sources to the 20th century
    https://book.douban.com/subject/33420970/?icn=index-latestbook-subject circled the sun
    https://book.douban.com/subject/30420913/?icn=index-latestbook-subject American bureaucracy
    https://book.douban.com/subject/30435811/?icn=index-latestbook-subject apartment without men
    https://book.douban.com/subject/34442426/?icn=index-latestbook-subject Nazi Hunter
The script prints the URL and title of each book on the page.
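Because the live page may change or be slow to load, the same pattern can be tried offline first. The sketch below runs it against a hypothetical HTML snippet that only imitates the structure the pattern expects; the example.com URLs and book titles are placeholders, not real Douban data.

```python
import re

# Hypothetical snippet imitating the structure the pattern expects
html = '''
<div class="title">
  <a href="https://example.com/book/1" title="First Book">First Book</a>
</div>
<div class="title">
  <a href="https://example.com/book/2" title="Second Book">Second Book</a>
</div>
'''

# re.S lets '.' cross the line breaks between the div and the link
pattern = re.compile(r'<div class="title">.*?href="(.*?)".*?title="(.*?)"', re.S)
results = pattern.findall(html)

for url, name in results:
    print(url, name)
```

Each result is a (url, name) tuple because the pattern contains two groups.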
School has been busy lately, so updates will not be very frequent. Also, go iG and FPX!