BS4 (Beautiful Soup 4) is a Python library for working with markup languages such as HTML and XML. It takes an object-oriented approach: the nodes, tags, attributes and text content of a document are exposed as attributes of Python objects. Queries are made by calling the appropriate methods, which match and retrieve the data directly, so the library is simple and flexible to use.
BS4 parses an HTML document into a document tree made up of four types of object:
- Tag: a tag (element) object
- NavigableString: the text content inside a tag
- BeautifulSoup: the document object as a whole
- Comment: a special type of NavigableString that holds an HTML comment
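The four types can be seen side by side in a minimal sketch. The markup below is a made-up string (not westos.html), chosen so that each type appears exactly once.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup: one tag, one piece of text, one comment.
html = "<p id='intro'>Hello<!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p                    # the <p> element, a Tag
text, comment = p.contents    # its two children: the text and the comment

print(type(soup).__name__)
print(type(p).__name__)
print(type(text).__name__)
print(type(comment).__name__)

# Comment is a subclass of NavigableString, so test for Comment first
# whenever the two need to be told apart.
is_comment_a_string = isinstance(comment, NavigableString)
print(is_comment_a_string)
```

Because a comment also passes an isinstance check against NavigableString, code that collects text usually has to filter comments out explicitly.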
Get tag content
from bs4 import BeautifulSoup

# Build the soup object from a local HTML file opened in binary mode
soup = BeautifulSoup(open('westos.html', 'rb'), 'html.parser')

# Accessing a tag by name returns only the first matching tag
print(soup.title)
print(type(soup.title))
print(soup.p)
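The "first match only" behaviour can be checked with a small sketch. It uses a made-up inline HTML string instead of westos.html, so the tag names and text are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <p> tags and no <table>.
html = ("<html><head><title>Demo</title></head>"
        "<body><p>first</p><p>second</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p          # attribute access: only the first <p>
same_p = soup.find("p")   # the explicit form of the same lookup
missing = soup.table      # there is no <table>, so the result is None

print(first_p)
print(first_p is same_p)
print(missing)
```

Since a missing tag yields None rather than an exception, it is worth checking the result before chaining further attribute access onto it.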
If reading the file fails with an error such as UnicodeDecodeError: 'gbk' codec can't decode byte 0xa2 in position 218: illegal multibyte sequence, the file was opened in text mode and Python tried to decode its bytes into a Unicode string with the GBK codec (typically the system default encoding on a Chinese-locale Windows machine). "Illegal multibyte sequence" means that the bytes at that position do not form a valid GBK character, so they cannot be decoded.
Solution: open the file in binary mode by adding 'b' to the mode (for example 'rb'). Python then returns the raw bytes without decoding them, so no error is raised, and BeautifulSoup works out the encoding itself when it parses the data.
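The sketch below reproduces the problem without a file, under the assumption that the page is UTF-8 while the decoder expects GBK. The markup is invented for the example, and from_encoding is used to state the encoding explicitly rather than rely on detection.

```python
from bs4 import BeautifulSoup

# Hypothetical page: UTF-8 bytes containing one non-ASCII character (U+4E2D).
raw = "<html><head><title>\u4e2d</title></head></html>".encode("utf-8")

# Decoding these UTF-8 bytes as GBK fails, much like a text-mode open()
# would on a system whose default encoding is GBK.
try:
    raw.decode("gbk")
    gbk_error = None
except UnicodeDecodeError as exc:
    gbk_error = exc
print(gbk_error)

# Handing the raw bytes to BeautifulSoup leaves the decoding to the library.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
title = soup.title.string
print(title == "\u4e2d")
```

If the encoding of a file is known in advance, another option is to keep text mode and pass encoding='utf-8' (or whatever applies) to open().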
Get the attributes of the tag
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('westos.html', 'rb'), 'html.parser')

# All attributes of the first <p> tag, as a dictionary
print(soup.p.attrs)

# The value of one specific attribute
print(soup.p['id'])
print(soup.p['class'])
print(soup.p['style'])
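Indexing a tag with an attribute it does not have raises KeyError, so a lookup that tolerates missing attributes can be useful. A minimal sketch, using an invented tag that has no style attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two classes and no style attribute.
html = '<p id="test1" class="class1 note">text</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

attrs = p.attrs             # dictionary of every attribute on the tag
classes = p["class"]        # class is multi-valued, so this is a list
style = p.get("style")      # None here, where p["style"] would raise KeyError
has_id = p.has_attr("id")   # True when the attribute is present

print(attrs)
print(classes)
print(style)
print(has_id)
```

Note that class comes back as a list of names rather than a single string, because an HTML element may carry several classes.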
Modify tag attributes
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('westos.html', 'rb'), 'html.parser')

# Print the tag before and after changing its id attribute
print(soup.p)
soup.p['id'] = 'modifyid'
print(soup.p)
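Attributes behave like dictionary entries, so they can also be added and deleted. A minimal sketch on an invented tag; the attribute names and values are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical tag to modify.
html = '<p id="old" class="class1">text</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

p["id"] = "modifyid"    # change an existing attribute
p["title"] = "added"    # add a new attribute
del p["class"]          # remove an attribute

result = str(p)
print(result)
```

These changes affect only the in-memory tree; the source file stays as it was unless the modified document is written back out.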
Get the text content of the tag
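from bs4 import BeautifulSoup

soup = BeautifulSoup(open('westos.html', 'rb'), 'html.parser')

# List the attributes and methods available on a tag
print(dir(soup.title))

# Text content, the single string child, and the tag name
print(soup.title.text)
print(soup.title.string)
print(soup.title.name)

# Tags can be chained to walk down the tree
print(soup.head.title.string)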
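For a title tag, .text and .string usually give the same result, but they differ once a tag has more than one child. A minimal sketch with invented markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> has mixed content, the second has one string.
html = "<div><p>Hello <b>world</b></p><p>single</p></div>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

mixed_string = first.string    # None: the tag has two children, not one string
mixed_text = first.text        # text of all descendants joined together
single_string = second.string  # the only child is a string, so it is returned
cleaned = first.get_text(" ", strip=True)  # strip each piece, join with a space

print(mixed_string)
print(mixed_text)
print(single_string)
print(cleaned)
```

In short, .string is convenient for leaf tags, while .text or get_text() is the safer choice when a tag may contain nested markup.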
Working with child nodes
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('westos.html', 'rb'), 'html.parser')

# .contents returns the tag's direct children as a list
print(soup.head.contents)

# .children returns an iterator over the same direct children
print(soup.head.children)
for el in soup.head.children:
    print(el)
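One thing that often surprises when iterating over children is that whitespace between tags counts as a child too. A minimal sketch with an invented list, showing how to keep only the Tag children:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup with newlines between the tags.
html = "<ul>\n<li>a</li>\n<li>b</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

child_count = len(ul.contents)   # the newlines are NavigableString children
tag_children = [c for c in ul.children if isinstance(c, Tag)]
tag_names = [c.name for c in tag_children]
descendant_count = len(list(ul.descendants))  # children plus their own children

print(child_count)
print(tag_names)
print(descendant_count)
```

.descendants walks the whole subtree, so it also yields the text inside each li, unlike .contents and .children, which stop at the direct children.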
Object-oriented matching
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('westos.html', 'rb'), 'html.parser')

# Find every tag with the given name
res1 = soup.find_all('p')
print(res1)

# Find tags by regular expression: every tag whose name contains the letter 'd'
res2 = soup.find_all(re.compile(r'd+'))
print(res2)
# A regular expression can be compiled once and reused, which speeds up repeated searches:
# pattern = r'd.+'
# pattern = re.compile(pattern)
# print(re.findall(pattern, 'dog hello d'))
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('westos.html', 'rb'), 'html.parser')

# Narrow the search by attribute value (class_ is used because class is a Python keyword)
print(soup.find_all('p', id='test1'))
print(soup.find_all('p', id=re.compile(r'test\d{1}')))
print(soup.find_all('p', class_="class1"))
print(soup.find_all('p', class_=re.compile(r'class\d{1}')))

# Find several kinds of tag at once
print(soup.find_all(['p', 'div']))
print(soup.find_all([re.compile('^d'), re.compile('p')]))

# Match on text content (the argument is named string since version 4.4.0; older versions call it text)
print(soup.find_all(string='Article title'))
print(soup.find_all(string=re.compile('Title')))
print(soup.find_all(string=[re.compile('Title'), 'Title']))
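The filters above can be tried out on a small inline document. The sketch below uses invented markup and also shows find(), which returns a single tag or None, and the limit argument of find_all().

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the ids and classes used above.
html = ('<p id="test1" class="class1">one</p>'
        '<p id="test2" class="class2 extra">two</p>'
        '<div id="d1">three</div>')
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")        # first match only
nothing = soup.find("span")   # no match, so None instead of an empty list
by_regex = soup.find_all("p", id=re.compile(r"^test\d$"))
by_class = soup.find_all(class_="class2")           # matches despite the extra class
limited = soup.find_all(["p", "div"], limit=2)      # stop after two results

print(first["id"])
print(nothing)
print([t["id"] for t in by_regex])
print([t["id"] for t in by_class])
print(len(limited))
```

A class_ filter matches a tag if any one of its classes equals the given value, which is why the tag with class "class2 extra" is found.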
CSS matching
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('westos.html', 'rb'), 'html.parser')

# Common CSS selectors: tag selector (div), class selector (.class1),
# id selector (#idname), attribute selector (p[type="text"])

# Tag selector (p)
res1 = soup.select("p")
print(res1)

# Class selector (.class2)
res2 = soup.select(".class2")
print(res2)

# id selector (#test1)
res3 = soup.select("#test1")
print(res3)

# Attribute selector: match on an attribute value, or on the attribute merely being present
print(soup.select("p[id='test1']"))
print(soup.select("p[class]"))
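A short sketch of the same selectors on an invented inline document, with two additions: the child combinator (div > p) and select_one(), which returns the first match or None. CSS selection assumes a Beautiful Soup version that ships with selector support (4.7 or later).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <p> tags inside a <div>, one outside it.
html = ('<div id="box"><p id="test1" class="class1">one</p>'
        '<p class="class2">two</p></div><p>three</p>')
soup = BeautifulSoup(html, "html.parser")

all_p = soup.select("p")               # every <p> in the document
with_class = soup.select("p[class]")   # only <p> tags that have a class attribute
inside_div = soup.select("div > p")    # <p> tags that are direct children of a <div>
one = soup.select_one("#test1")        # first match, as a single tag
missing = soup.select_one("#missing")  # no match, so None

print(len(all_p))
print([t.text for t in with_class])
print([t.text for t in inside_div])
print(one.text)
print(missing)
```

select() always returns a list, even for an id selector, so select_one() is the more convenient call when at most one result is expected.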