Python web crawler notes: regular expressions with the re module

Keywords: Programming shell Python

regular expression

What is a regular expression? A regular expression is a pattern for string operations: a "rule string" built from predefined special characters and combinations of them. This rule string expresses filtering logic that is applied to text.

Regular expressions are therefore not unique to any one programming language, and they are used differently in different environments, for example in the awk command of a shell script, or in the # and % splitting operators, which use a related pattern syntax. Whatever the environment, they are a great help in solving text-matching problems.

In a crawler, once we have fetched the HTML of the target page, we often need to extract many target strings at once. These may be URLs or key information on the page. Copying and pasting them by hand is unrealistic, so we use regular expressions to match and output them.

The following examples are all Python scripts.

Escape characters, quantifiers, and special characters

\w Matches a letter, digit, or underscore
\W Matches any character that is not a letter, digit, or underscore
\s Matches any whitespace character, equivalent to [ \t\n\r\f\v] for ASCII text
\S Matches any non-whitespace character
\d Matches any digit, equivalent to [0-9] for ASCII text
\D Matches any non-digit
\A Matches only at the start of the string
\Z Matches only at the end of the string (in some other regex flavors it also matches before a final line break; in Python it does not)
\z End of the string in other regex flavors; Python's re accepts it only from version 3.14, so use \Z
\G Matches where the last match ended in other regex flavors; not supported by Python's re
\n Matches a line break
\t Matches a tab
^ Matches the start of the string (and the start of each line with re.MULTILINE)
$ Matches the end of the string or just before a line break at the end of the string (and the end of each line with re.MULTILINE)
. Matches any character except a line break; with the re.DOTALL flag it matches any character, including a line break
[...] A set of characters: [amk] matches "a", "m", or "k"
[^...] Any character not in the set: [^abc] matches any character other than a, b, or c
* Matches 0 or more repetitions of the preceding expression (greedy)
+ Matches 1 or more repetitions of the preceding expression (greedy)
? Matches 0 or 1 repetitions of the preceding expression; placed after another quantifier (*?, +?, ??) it makes that quantifier non-greedy
{n} Matches exactly n repetitions of the preceding expression
{n,m} Matches from n to m repetitions of the preceding expression (greedy)
a|b Matches a or b
() Groups the enclosed expression and captures what it matches
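
Several entries in the table can be checked directly with re.findall. The following is a minimal sketch; the sample strings and variable names are made up for illustration and are not taken from the examples below.

```python
import re

# Hypothetical sample text, used only to exercise entries from the table.
sample = "id_42 costs 3 USD\nnext line"

words = re.findall(r"\w+", sample)                # runs of letters, digits, underscores
numbers = re.findall(r"\d+", sample)              # runs of digits
vowels = re.findall(r"[aeiou]", "regex")          # one character from the set
non_vowels = re.findall(r"[^aeiou]", "regex")     # one character not in the set
two_digits = re.findall(r"\d{2}", "7 42 123")     # exactly two digits in a row
either = re.findall(r"cat|dog", "cat, dog, cow")  # alternation

print(words)       # ['id_42', 'costs', '3', 'USD', 'next', 'line']
print(numbers)     # ['42', '3']
print(vowels)      # ['e', 'e']
print(non_vowels)  # ['r', 'g', 'x']
print(two_digits)  # ['42', '12']
print(either)      # ['cat', 'dog']
```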

re.match

re.match always tries to match from the start of the string.

import re
content = "Hello 123 4567 World_This is a Regex Demo"
# re.match takes three arguments: the pattern, the target string, and optional flags.
# Per the table above, '^' anchors the start and '$' anchors the end.
# The r prefix makes a raw string, so backslashes reach the regex engine unchanged.
result = re.match(r"^Hello\s\d{3}\s\d{4}\s\w{10}.*Demo$", content)
print(len(content))
print(result)          # the Match object
print(result.group())  # the matched text; pass an index to get a specific group
print(result.span())   # (start, end) positions of the match

output

41
<re.Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
Hello 123 4567 World_This is a Regex Demo
(0, 41) 

Generic matching

result = re.match(r"^Hello.*Demo$", content)
# Matches the same string as above.
# '.*' matches everything between 'Hello' and 'Demo'.

Target matching

import re
content = "Hello 1234567 World_This is a Regex Demo"
# The digits we want are delimited by \s on the left and on the right.
result = re.match(r"^Hello\s(\d+)\s\w.*Demo$", content)
print(result.group())
print(result.group(1))
# When the pattern contains parentheses, pass n to group() to extract what the nth pair matched.
# Parentheses also group a sub-expression so that it can be treated as a unit.

output

Hello 1234567 World_This is a Regex Demo
1234567
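
When a pattern has more than one pair of parentheses, the Match object exposes all of them. The sketch below reuses the sample string from above; the three-group pattern is an illustrative assumption, not part of the original example.

```python
import re

content = "Hello 1234567 World_This is a Regex Demo"
# Illustrative pattern with three capturing groups.
result = re.match(r"^(\w+)\s(\d+)\s(\w+)", content)

if result:  # re.match returns None when nothing matches
    whole = result.group(0)     # the whole match
    first_two = result.group(1, 2)  # several groups at once, as a tuple
    all_groups = result.groups()    # every captured group
    digits_span = result.span(2)    # (start, end) of the second group
    print(whole)       # Hello 1234567 World_This
    print(first_two)   # ('Hello', '1234567')
    print(all_groups)  # ('Hello', '1234567', 'World_This')
    print(digits_span) # (6, 13)
```

Checking the result before calling group() avoids an AttributeError when the pattern does not match.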

Greedy and non-greedy matching

import re
content = "Hello 1234567 World_This is a Regex Demo"
# Greedy: '.*' matches as many characters as possible, so (\d+) is left with only the last digit.
result = re.match(r"^He.*(\d+).*Demo$", content)
print(result.group())
print(result.group(1))
# Non-greedy: the '?' after '.*' makes it match as few characters as possible, so (\d+) gets all the digits.
result1 = re.match(r"^He.*?(\d+).*Demo$", content)
print(result1.group())
print(result1.group(1))

output

Hello 1234567 World_This is a Regex Demo
7
Hello 1234567 World_This is a Regex Demo
1234567
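
One consequence of non-greedy matching is worth checking by hand: a non-greedy quantifier at the very end of a pattern matches as little as it can, which may be nothing at all. A small sketch on the same sample string (the patterns are illustrative assumptions):

```python
import re

content = "Hello 1234567 World_This is a Regex Demo"

# '.*?' at the end of the pattern is free to match the empty string.
lazy_tail = re.match(r"^Hello\s(\d+)\s(.*?)", content)
# '.*' at the end consumes the rest of the line.
greedy_tail = re.match(r"^Hello\s(\d+)\s(.*)", content)
# '\d+?' stops after the first digit when nothing follows it in the pattern.
lazy_digits = re.match(r"^Hello\s(\d+?)", content)

print(repr(lazy_tail.group(2)))    # ''
print(repr(greedy_tail.group(2)))  # 'World_This is a Regex Demo'
print(lazy_digits.group(1))        # 1
```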

Matching mode

import re
content = "Hello 1234567 World_This \nis a Regex Demo"
# By default '.' does not match a line break, so the first call returns None.
result = re.match(r"^He.*?(\d+).*?Demo$", content)
# With the re.S (re.DOTALL) flag, '.' matches any character, including line breaks.
result1 = re.match(r"^He.*?(\d+).*?Demo$", content, re.S)
print(result)
print(result1)

output

None
<re.Match object; span=(0, 41), match='Hello 1234567 World_This \nis a Regex Demo'>
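
re.S is only one of the flags the re module provides. As a rough sketch, two other common ones are re.I (ignore case) and re.M (let ^ and $ match at every line), and flags can be combined with |. The sample text here is made up for illustration.

```python
import re

text = "first line\nSecond Line"  # hypothetical two-line sample

start_default = re.findall(r"^\w+", text)            # '^' matches only at the start of the string
start_multiline = re.findall(r"^\w+", text, re.M)    # '^' matches at the start of every line
ends = re.findall(r"line$", text, re.I | re.M)       # case-insensitive, '$' at every line end

print(start_default)    # ['first']
print(start_multiline)  # ['first', 'Second']
print(ends)             # ['line', 'Line']
```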

Escaping

import re
content = "price is $5.00"
# If the target string contains any of the special characters from the table above,
# escape them in the pattern with a backslash '\'.
result = re.match(r"price is \$5\.00", content)
print(result.group())

output

price is $5.00
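
When the literal text comes from a variable, escaping it by hand is error-prone. The standard library's re.escape does it automatically. A minimal sketch, where the variable names are illustrative assumptions:

```python
import re

content = "price is $5.00"
literal = "$5.00"  # text that should be matched literally

# re.escape puts a backslash in front of the characters that are special in a pattern.
pattern = "price is " + re.escape(literal)
print(pattern)  # price is \$5\.00

result = re.match(pattern, content)
print(result.group())  # price is $5.00
```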

One limitation of re.match is that it only matches at the start of the string. If the pattern does not match starting at the first character of the target string, the result is None.

For example:

import re
content = "price is $5.00"
# The pattern starts with 'r', but the string starts with 'p'.
result = re.match(r"rice is \$5\.00", content)
print(result)

output

None

re.search

re.search solves this problem.

It scans the entire string and returns the first successful match.

import re
content = "Extra stings Hello 1234567 World_This is a Regex Demo Extra stings"
result = re.match(r"Hello.*?(\d+).*?Demo", content)    # the string does not start with 'Hello', so this returns None
result1 = re.search(r"Hello.*?(\d+).*?Demo", content)  # scans ahead and finds the match
print(result)
print(result1)

output

None
<re.Match object; span=(13, 53), match='Hello 1234567 World_This is a Regex Demo'>

re.findall

re.findall works on a similar principle: it scans the whole string and returns a list of all non-overlapping matches, which can then be accessed by index.
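
A short sketch of re.findall, reusing the sample string from the re.match section. The patterns are illustrative assumptions. Note that the shape of the result depends on whether the pattern contains groups.

```python
import re

content = "Hello 123 4567 World_This is a Regex Demo"

# No groups in the pattern: a list of the matched strings.
numbers = re.findall(r"\d+", content)
print(numbers)     # ['123', '4567']
print(numbers[0])  # 123

# Two groups in the pattern: a list of tuples, one tuple per match.
split_numbers = re.findall(r"(\d)(\d+)", content)
print(split_numbers)  # [('1', '23'), ('4', '567')]

# No match at all: an empty list rather than None.
print(re.findall(r"\d+", "no digits"))  # []
```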

re.sub

Replaces every matching substring in the string and returns the resulting string.

import re
content = "Extra stings Hello 1234567 World_This is a Regex Demo Extra stings"
result = re.sub(r"\d+", "Re", content)  # replace each run of digits with 'Re'
print(result)

output

Extra stings Hello Re World_This is a Regex Demo Extra stings

You can also append text after the matched substring:

import re
content = "Extra stings Hello 1234567 World_This is a Regex Demo Extra stings"
result = re.sub(r"(\d+)", r"\1 8910", content)
# '\1' refers to the text captured by the first group; the r prefix makes the string raw.
# Writing the replacement as "\\1 8910" without the r prefix has the same effect.
print(result)

output

Extra stings Hello 1234567 8910 World_This is a Regex Demo Extra stings
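
re.sub also accepts a function as the replacement and a count that limits how many substitutions are made. The sketch below is an illustration under those assumptions; the helper name double is hypothetical.

```python
import re

content = "Extra stings Hello 1234567 World_This is a Regex Demo Extra stings"

def double(match):
    # Called once per match; the return value becomes the replacement text.
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, content)
print(doubled)  # Extra stings Hello 2469134 World_This is a Regex Demo Extra stings

# count=1 replaces only the first occurrence.
first_only = re.sub(r"stings", "strings", content, count=1)
print(first_only)  # Extra strings Hello 1234567 World_This is a Regex Demo Extra stings
```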

re.compile

Compiles a pattern string into a regular expression object so that it can be reused.

import re
content = '''Hello 1234567 World_This
is a Regex Demo'''
pattern = re.compile(r"Hello.*Demo", re.S)
result = pattern.match(content)  # equivalent to re.match(pattern, content)
print(result.group())

output

Hello 1234567 World_This
is a Regex Demo
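
The benefit of compiling shows up when one pattern is applied many times. A compiled pattern object has its own search, findall, and sub methods. A minimal sketch with made-up input lines:

```python
import re

number = re.compile(r"\d+")  # compiled once, reused below

lines = ["Hello 123", "no digits here", "4567 World"]  # hypothetical inputs
found = []
for line in lines:
    m = number.search(line)
    found.append(m.group() if m else None)  # search returns None when nothing matches

print(found)                       # ['123', None, '4567']
print(number.findall("1 22 333"))  # ['1', '22', '333']
print(number.sub("#", "a1b22"))    # a#b#
```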

Example demonstration

import re
import requests
content = requests.get("https://book.douban.com/").text  # the request can take a while
pattern = re.compile(r'<div class="title">.*?href="(.*?)".*?title="(.*?)"', re.S)
results = re.findall(pattern, content)
for result in results:
    url, name = result  # each result is a (url, title) tuple
    print(url, name)

output

https://book.douban.com/subject/34778578/?icn=index-latestbook-subject wife fearing family
https://book.douban.com/subject/30432492/?icn=index-latestbook-subject travel
https://book.douban.com/subject/34454619/?icn=index-latestbook-subject forgotten, Interpol
https://book.douban.com/subject/30264052/?icn=index-latestbook-subject general history of the West: from ancient sources to the 20th century
https://book.douban.com/subject/33420970/?icn=index-latestbook-subject circled the sun
https://book.douban.com/subject/30420913/?icn=index-latestbook-subject American bureaucracy
https://book.douban.com/subject/30435811/?icn=index-latestbook-subject apartment without men
https://book.douban.com/subject/34442426/?icn=index-latestbook-subject Nazi Hunter

The script prints the Douban page URL and the title of each book.
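
The live page may change or reject the request, so the same pattern can be tried offline first. The HTML fragment below is a hypothetical stand-in shaped like what the pattern expects; the example.com URLs and book titles are invented, and the real page markup may differ.

```python
import re

# Hypothetical HTML fragment; not copied from the real site.
html = '''
<div class="title">
  <a href="https://example.com/subject/1/" title="First Book">First Book</a>
</div>
<div class="title">
  <a href="https://example.com/subject/2/" title="Second Book">Second Book</a>
</div>
'''

pattern = re.compile(r'<div class="title">.*?href="(.*?)".*?title="(.*?)"', re.S)
results = pattern.findall(html)  # list of (url, title) tuples
for url, name in results:
    print(url, name)
```

Testing against a saved fragment like this makes it easier to tell a broken pattern apart from a failed request.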

I have a lot going on at school lately, so updates will not be very frequent. Also, go iG and FPX!

Posted by southeastweb on Sat, 26 Oct 2019 20:26:11 -0700