7. Using regular expressions in Python
7.1 regular expression
When working with strings, we often need to find substrings that satisfy some complex rule. A regular expression is a tool for describing such rules: it is a piece of code that records a text-matching rule.
7.1.1 line locator
A line locator describes the boundary of a string. ^ indicates the beginning of a line; $ indicates the end of a line.
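As a small illustration of the two locators, the sketch below checks where a word sits in a string. The sample text and variable names are made up for this example.

```python
import re

text = 'python is fun'  # sample text, chosen only for illustration

# ^ anchors the pattern at the start, $ anchors it at the end.
starts_with_python = re.search(r'^python', text) is not None
ends_with_python = re.search(r'python$', text) is not None
ends_with_fun = re.search(r'fun$', text) is not None

print(starts_with_python)  # True
print(ends_with_python)    # False
print(ends_with_fun)       # True
```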
7.1.2 metacharacters
\bmr\w*\b: matches words that begin with the letters mr. The pattern starts at a word boundary (\b), matches the letters mr, then any number of letters, digits, or underscores (\w*), and ends at a word boundary (\b).
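The following sketch applies this pattern with re.findall(). The sample text is an assumption made for the example; note that "amr" is skipped because there is no word boundary before its "mr".

```python
import re

text = 'mrsoft mr amr mr_shop'  # sample text for illustration

# \b requires a word boundary on both sides of the match.
words = re.findall(r'\bmr\w*\b', text)
print(words)  # ['mrsoft', 'mr', 'mr_shop']
```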
7.1.3 repetition
Repetition qualifiers match a specific number of characters. For example, an 8-digit QQ number is matched by ^\d{8}$.
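A minimal sketch of the 8-digit check follows. The candidate strings are invented for the example.

```python
import re

qq_pattern = r'^\d{8}$'  # exactly eight digits, nothing before or after

candidates = ['12345678', '1234567', '123456789', '1234567a']
results = {c: re.match(qq_pattern, c) is not None for c in candidates}

for candidate, ok in results.items():
    print(candidate, ok)
# 12345678 True
# 1234567 False
# 123456789 False
# 1234567a False
```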
7.1.4 character class
To match a set of characters for which no predefined metacharacter exists, simply list them inside square brackets. For example, [aeiou] matches any one English vowel.
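For instance, the character class can be used to collect the vowels in a phrase. The phrase is an arbitrary example.

```python
import re

# Each match is a single character taken from the set a, e, i, o, u.
vowels = re.findall(r'[aeiou]', 'regular expression')
print(vowels)  # ['e', 'u', 'a', 'e', 'e', 'i', 'o']
```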
7.1.5 exclusion characters
To match a character that is not one of the specified characters, put "^" at the start of the brackets. For example, [^a-zA-Z] matches any character that is not a letter.
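The sketch below lists every non-letter character in a sample string (the string is made up for the example).

```python
import re

# ^ as the first character inside [] negates the class.
non_letters = re.findall(r'[^a-zA-Z]', 'mr_shop 2024!')
print(non_letters)  # ['_', ' ', '2', '0', '2', '4', '!']
```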
7.1.6 select character
When a pattern contains conditional logic, it is expressed with the selection character (|), which can be read as "or". For example, an ID card number can be matched with (^\d{15}$)|(^\d{18}$)|(^\d{17}(\d|X|x)$), which matches 15 digits, or 18 digits, or 17 digits followed by one final character. That final character can be a digit, X, or x.
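A rough sketch of this check is shown below. The helper name looks_like_id is hypothetical, and the pattern only tests the shape of the number, not whether it is a real, valid ID.

```python
import re

# 15 digits, or 18 digits, or 17 digits plus a final digit/X/x.
id_pattern = r'(^\d{15}$)|(^\d{18}$)|(^\d{17}(\d|X|x)$)'

def looks_like_id(number):
    """Return True if the string has the shape of an ID card number."""
    return re.match(id_pattern, number) is not None

print(looks_like_id('1' * 15))        # True
print(looks_like_id('1' * 17 + 'X'))  # True
print(looks_like_id('1' * 16))        # False
print(looks_like_id('1' * 17 + 'Y'))  # False
```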
7.1.7 escape character
The escape character in a regular expression is the backslash (\), which turns a special character (such as . ? \) into an ordinary character. For example, a pattern for an IP address such as 127.0.0.1 is [1-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}. Without the escape character, the . would match any character.
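The effect of escaping the dot can be seen in a short sketch. The test strings are invented; re.escape() is the standard-library helper that adds the backslashes for you.

```python
import re

unescaped = r'127.0.0.1'    # each . matches any character
escaped = r'127\.0\.0\.1'   # each \. matches a literal dot only

loose_hit = re.fullmatch(unescaped, '127a0b0c1') is not None
strict_miss = re.fullmatch(escaped, '127a0b0c1') is not None
strict_hit = re.fullmatch(escaped, '127.0.0.1') is not None

print(loose_hit)    # True
print(strict_miss)  # False
print(strict_hit)   # True

# re.escape() escapes special characters in a literal string.
auto_escaped = re.escape('127.0.0.1')
print(auto_escaped)  # 127\.0\.0\.1
```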
7.1.8 grouping
The first function of parentheses is to change the scope of a qualifier or of the selection character. For example, in (this|four)th the alternation applies only inside the parentheses, so the pattern matches "thisth" or "fourth"; without the parentheses, this|fourth would match "this" or "fourth".
The second function of parentheses is grouping, that is, forming subexpressions. For example, (\.[0-9]{1,3}){3} repeats the group (\.[0-9]{1,3}) three times.
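Both uses of parentheses can be tried out in a short sketch. The variable names and test strings are assumptions made for this example.

```python
import re

# Parentheses limit the alternation: "th" must follow either branch.
scoped = r'(this|four)th'
matches_fourth = re.fullmatch(scoped, 'fourth') is not None
matches_this = re.fullmatch(scoped, 'this') is not None
# Without parentheses the alternation spans the whole pattern.
unscoped_matches_this = re.fullmatch(r'this|fourth', 'this') is not None

print(matches_fourth)         # True
print(matches_this)           # False
print(unscoped_matches_this)  # True

# A repeated group: the dot-plus-digits part must occur three times.
ip_match = re.fullmatch(r'[1-9]{1,3}(\.[0-9]{1,3}){3}', '192.168.1.66')
whole = ip_match.group()      # the full match
last_rep = ip_match.group(1)  # a repeated group keeps its last repetition
print(whole)     # 192.168.1.66
print(last_rep)  # .66
```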
7.2 regular expression operation with re module
import re
7.2.1 matching string
- Use the match() method. It matches from the beginning of the string: if the start of the string matches the pattern, a Match object is returned; otherwise None is returned.

    import re

    pattern = r'mr_\w+'                        # pattern string
    string = 'MR_SHOP mr_shop'                 # string to match
    match = re.match(pattern, string, re.I)    # re.I makes the match case-insensitive
    print(match)                               # match succeeds
    print('Start of match:', match.start())
    print('End of match:', match.end())
    print('Tuple of matching position:', match.span())
    print('String to match:', match.string)
    print('Match data:', match.group())

    string = 'entry name MR_SHOP mr_shop'
    match = re.match(pattern, string, re.I)
    print(match)                               # match fails, so None is printed

    # Output (Python 3.7+; older versions show <_sre.SRE_Match object; ...>):
    # <re.Match object; span=(0, 7), match='MR_SHOP'>
    # Start of match: 0
    # End of match: 7
    # Tuple of matching position: (0, 7)
    # String to match: MR_SHOP mr_shop
    # Match data: MR_SHOP
    # None
Define a pattern string for validating a mobile phone number, then use it to check a number. Both alternatives end with $ so that extra trailing digits are rejected:

    pattern = r'(13[4-9]\d{8})$|(15[01289]\d{8})$'
    mobile = '13634222222'
    match = re.match(pattern, mobile)
    if match is None:
        print(mobile, 'Not a valid phone number')
    else:
        print(mobile, 'Is a valid phone number')
    # 13634222222 Is a valid phone number
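The same rule can also be wrapped in a small helper and applied to several numbers. This is only a sketch: the function name is_valid_mobile and the second test number are assumptions, and re.fullmatch() is used here in place of re.match() with $ so the whole string must match.

```python
import re

# Same prefixes as above: 134-139 or 150/151/152/158/159, then 8 digits.
mobile_pattern = r'(13[4-9]\d{8})|(15[01289]\d{8})'

def is_valid_mobile(number):
    """Return True if the entire string matches the mobile pattern."""
    return re.fullmatch(mobile_pattern, number) is not None

for mobile in ['13634222222', '13144222221']:
    verdict = 'Is a valid phone number' if is_valid_mobile(mobile) else 'Not a valid phone number'
    print(mobile, verdict)
# 13634222222 Is a valid phone number
# 13144222221 Not a valid phone number
```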
- Use the search() method. Unlike match(), search() scans the whole string and returns the first match, wherever it occurs.

    pattern = r'mr_\w+'
    string = 'MR_SHOP mr_shop'
    match = re.search(pattern, string, re.I)
    print(match)
    # <re.Match object; span=(0, 7), match='MR_SHOP'>

    string = 'entry name MR_SHOP mr_shop'
    match = re.search(pattern, string, re.I)
    print(match)
    # <re.Match object; span=(11, 18), match='MR_SHOP'>
- Use the findall() method. It searches the entire string for all substrings that match the regular expression and returns them as a list.

    pattern = r'mr_\w+'
    string = 'MR_SHOP mr_shop'
    match = re.findall(pattern, string, re.I)    # case-insensitive
    print(match)
    # ['MR_SHOP', 'mr_shop']

    string = 'entry name MR_SHOP mr_shop'
    match = re.findall(pattern, string)          # case-sensitive
    print(match)
    # ['mr_shop']
If the pattern string contains a group, findall() returns a list of the text matched by the group instead of the whole match.

    pattern = r'[1-9]{1,3}(\.[0-9]{1,3}){3}'
    str1 = '127.0.0.1 192.168.1.66'
    match = re.findall(pattern, str1)
    print(match)
    # ['.1', '.66']
The result above does not contain the full IP addresses. Because the pattern string contains a group, findall() returns what the group matched, that is, the last text matched by (\.[0-9]{1,3}). Wrapping the whole pattern in an outer group makes findall() return tuples whose first element is the full address:

    pattern = r'([1-9]{1,3}(\.[0-9]{1,3}){3})'
    str1 = '127.0.0.1 192.168.1.66'
    match = re.findall(pattern, str1)
    for it in match:
        print(it[0])    # element 0 is the outer group, i.e. the whole address
    # Output:
    # 127.0.0.1
    # 192.168.1.66
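An alternative that avoids the extra tuple indexing is a non-capturing group, written (?:...). The sketch below is one possible variant, not the only way to do it: with no capturing groups in the pattern, findall() returns the whole matches.

```python
import re

# (?:...) groups for repetition without creating a captured group.
pattern = r'[1-9]{1,3}(?:\.[0-9]{1,3}){3}'
ips = re.findall(pattern, '127.0.0.1 192.168.1.66')
print(ips)  # ['127.0.0.1', '192.168.1.66']
```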
7.2.2 replace string
Use the sub() method to replace every substring that matches the pattern.

    pattern = r'1[34578]\d{9}'    # 11-digit mobile number
    string = 'Winning number: 84978981 contact number: 13632453167'
    result = re.sub(pattern, '1XXXXXXXXXX', string)
    print(result)
    # Winning number: 84978981 contact number: 1XXXXXXXXXX

    # Replace sensitive words (the match is case-sensitive)
    pattern = r'(hacker)|(packet capture)|(monitor)|(Trojan)'
    about = "I'm a programmer. I like reading books about hackers. I want to study the Trojan."
    sub = re.sub(pattern, '@_@', about)
    print(sub)
    about = 'I am a programmer. I like reading books about computer network and developing websites.'
    sub = re.sub(pattern, '@_@', about)
    print(sub)
    # Output:
    # I'm a programmer. I like reading books about @_@s. I want to study the @_@.
    # I am a programmer. I like reading books about computer network and developing websites.
7.2.3 splitting strings with regular expressions
Use the split() method to split a string at every place the pattern matches; the result is returned as a list.

    pattern = r'[?&]'    # split at ? or &; inside [] a | would be a literal character
    url = 'https://www.mingrisoft.com/login.jsp?username="mr"&pwd="mrsoft"'
    result = re.split(pattern, url)
    print(result)
    # ['https://www.mingrisoft.com/login.jsp', 'username="mr"', 'pwd="mrsoft"']

    str1 = '@Tomorrow Technology @Mark Zuckerberg @Gates'
    pattern = r'\s*@'                 # split at @, with or without leading spaces
    list1 = re.split(pattern, str1)
    print('Your @ friends are:')
    for it in list1:
        if it != "":                  # skip empty elements
            print(it)                 # output each friend name
    # Output:
    # Your @ friends are:
    # Tomorrow Technology
    # Mark Zuckerberg
    # Gates