Python bs4 (beautifulsoup4)

Keywords: Programming Attribute Python codec encoding

BS4 (Beautiful Soup 4) is a Python library for working with markup documents such as HTML. It wraps the nodes, tags, attributes, and text of a document in Python objects, so you navigate the document through object attributes. To query it, you call methods that match and retrieve the data directly, which keeps the code simple and flexible.
BS4 converts an HTML document into a tree made up of four types of objects:

  • Tag: a tag (element) object
  • NavigableString: the text content inside a tag
  • BeautifulSoup: the document object as a whole
  • Comment: a special type of NavigableString
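
As a rough illustration of the four types listed above, the sketch below parses a small inline snippet and prints the class of each object. The markup is a made-up sample (an assumption for illustration, not the contents of westos.html).

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup, not taken from westos.html
html = '<p id="intro">Hello<!-- a note --></p>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.p                 # the first <p> element
text = tag.contents[0]       # the text node inside <p>
comment = tag.contents[1]    # the HTML comment inside <p>

# Print the class name of each object in the tree
for obj in (soup, tag, text, comment):
    print(type(obj).__name__)
```

Because Comment is a subclass of NavigableString, an isinstance check against NavigableString is also true for comments.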

Get tag content

from bs4 import BeautifulSoup

# Build the soup object; 'rb' passes the raw bytes of the file to the parser
soup = BeautifulSoup(open('westos.html','rb'), 'html.parser')
# Access a tag by name; this returns only the first matching tag
print(soup.title)
print(type(soup.title))
print(soup.p)

If you see an error such as UnicodeDecodeError: 'gbk' codec can't decode byte 0xa2 in position 218: illegal multibyte sequence, the file was opened in text mode and Python tried to decode its bytes to Unicode with the GBK codec. The file contains a byte sequence that is not valid GBK (an "illegal multibyte sequence"), so decoding fails.
Solution: open the file in binary mode by adding 'b' to the mode ('rb'). The raw bytes are then passed to BeautifulSoup, which handles the decoding itself, and the error no longer occurs.
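
A minimal sketch of the two ways around the decoding problem, assuming a UTF-8 file that declares its charset in a meta tag. The file name sample.html and its contents are hypothetical, and the file is created in a temporary directory so the example is self-contained.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Hypothetical sample document; the charset is declared in a <meta> tag
html = ('<html><head><meta charset="utf-8">'
        '<title>Caf\u00e9 menu</title></head><body></body></html>')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)

    # Option 1: binary mode, so BeautifulSoup decodes the bytes itself
    with open(path, 'rb') as f:
        soup_bytes = BeautifulSoup(f, 'html.parser')

    # Option 2: text mode with an explicit encoding instead of the platform default
    with open(path, encoding='utf-8') as f:
        soup_text = BeautifulSoup(f, 'html.parser')

print(soup_bytes.title.string)
print(soup_text.title.string)
```

Option 2 only works when you already know the file's encoding; binary mode leaves the detection to the library.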

Get the attributes of the tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('westos.html','rb'),'html.parser')
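# All attributes of the first <p> tag, as a dict
print(soup.p.attrs)
# Value of a specific attribute; raises KeyError if the attribute is absent
print(soup.p['id'])
print(soup.p['class'])  # class is multi-valued, so this is a list
print(soup.p['style'])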
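
Indexing a tag with a missing attribute raises KeyError, so it can help to see the safer alternatives side by side. The following is a small sketch on made-up markup (the attribute values are assumptions for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with no style attribute
html = '<p id="test1" class="class1 highlight">Some text</p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

print(p.attrs)       # all attributes as a dict
print(p['id'])       # a single-valued attribute is a string
print(p['class'])    # class is returned as a list of class names

# .get() returns None instead of raising KeyError for a missing attribute
style = p.get('style')
has_style = p.has_attr('style')
print(style, has_style)
```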

Modify tag attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('westos.html','rb'),'html.parser')
print(soup.p)
soup.p['id'] = 'modifyid'
print(soup.p)
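
Beyond changing an existing value, attributes can also be added and removed with dict-style operations. A short sketch on hypothetical markup (the title attribute is invented for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
soup = BeautifulSoup('<p id="test1" class="class1">Some text</p>', 'html.parser')
p = soup.p

p['id'] = 'modifyid'    # change an existing attribute
p['title'] = 'added'    # add a new attribute
del p['class']          # remove an attribute

print(p)
```

The changes apply to the in-memory tree only; the source file is not modified unless you write the result back yourself.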

Get the text content of the tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('westos.html','rb'),'html.parser')
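print(dir(soup.title))         # list the attributes and methods available on a Tag
print(soup.title.text)         # all text inside the tag, concatenated
print(soup.title.string)       # the tag's single string child, or None
print(soup.title.name)         # the tag name, here 'title'
print(soup.head.title.string)  # tag names can be chained to walk down the tree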
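
For a title tag, .text and .string usually print the same thing, but they differ once a tag has more than one child. The sketch below, on made-up markup, shows the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: <h1> has one text child, <p> has text plus a nested tag
html = '<div><h1>Only text</h1><p>Hello <b>world</b></p></div>'
soup = BeautifulSoup(html, 'html.parser')

# A tag with a single string child: .string and .text agree
print(soup.h1.string)
print(soup.h1.text)

# A tag with several children: .string is None, .text joins all nested text
print(soup.p.string)
print(soup.p.text)
```

When in doubt, .text (or the equivalent get_text() method) is the safer choice for extracting readable text.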

Working with child nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('westos.html','rb'),'html.parser')
print(soup.head.contents) # .contents returns the tag's direct child nodes as a list
print(soup.head.children) # .children returns an iterator over the same child nodes
for el in soup.head.children:
    print(el)
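
To make the difference between direct children and deeper nodes concrete, here is a small sketch on hypothetical markup. It also uses .descendants, which is not covered above and walks the whole subtree, including text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, written without whitespace between tags
soup = BeautifulSoup('<ul><li>a</li><li>b</li></ul>', 'html.parser')
ul = soup.ul

child_list = ul.contents                      # direct children, as a list
child_names = [c.name for c in ul.children]   # same nodes, via the iterator
descendants = list(ul.descendants)            # every nested node, text included

print(len(child_list), child_names, len(descendants))
```

In real documents the whitespace between tags shows up as extra text nodes in .contents and .children.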

Object-oriented matching

# Find all tags with the given name
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('westos.html','rb'),'html.parser')
res1 = soup.find_all('p')
print(res1)
# Find tags whose name matches a regular expression
# (the pattern is applied with search(), so r'd+' matches any tag name containing "d")
res2 = soup.find_all(re.compile(r'd+'))
print(res2)
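
Since a regular expression is matched against tag names with search() rather than a full match, an unanchored pattern can match more tags than expected. A sketch on made-up markup:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document
html = '<html><head><title>T</title></head><body><div><p>x</p></div></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Unanchored: any tag whose name contains "d"
contains_d = [tag.name for tag in soup.find_all(re.compile('d'))]
# Anchored: only tags whose name starts with "d"
starts_with_d = [tag.name for tag in soup.find_all(re.compile('^d'))]

print(contains_d)
print(starts_with_d)
```

Anchoring the pattern with ^ and $ is the usual way to keep the match from spreading to tags such as head and body.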

# Compiling a regular expression once lets the pattern object be reused across searches:
# pattern = r'd.+'
# pattern = re.compile(pattern)
# print(re.findall(pattern, 'dog hello d'))

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('westos.html','rb'),'html.parser')
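# Filter tags by attribute value (class_ is used because class is a Python keyword)
print(soup.find_all('p', id='test1'))
print(soup.find_all('p', id=re.compile(r'test\d{1}')))
print(soup.find_all('p', class_="class1"))
print(soup.find_all('p', class_=re.compile(r'class\d{1}')))
# Find several kinds of tags at once by passing a list
print(soup.find_all(['p', 'div']))
# Patterns in a list work too: names starting with "d", or names containing "p"
print(soup.find_all([re.compile('^d'), re.compile('p')]))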


# Match on text content (string= is the current name of this argument; older bs4 versions call it text=)
print(soup.find_all(string='Article title'))
print(soup.find_all(string=re.compile('Title')))
print(soup.find_all(string=[re.compile('Title'), 'Title']))
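
The filters above can be tried without westos.html. The sketch below runs them against a small made-up document (ids, classes, and text are assumptions for illustration) and uses the string= argument, which assumes Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document
html = '''
<div id="main">
  <p id="test1" class="class1">Article title</p>
  <p id="test2" class="class2">Second Title here</p>
  <p>plain</p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

by_id = soup.find_all('p', id='test1')                       # exact attribute value
by_id_regex = soup.find_all('p', id=re.compile(r'test\d'))   # attribute matched by regex
by_class = soup.find_all('p', class_='class2')               # class_ avoids the keyword
several = soup.find_all(['p', 'div'])                        # any of several tag names

# String filters return the matching text nodes, not the enclosing tags
exact = soup.find_all(string='Article title')
partial = soup.find_all(string=re.compile('Title'))

print(len(by_id), len(by_id_regex), len(by_class), len(several))
print(exact, partial)
```

The regular expression is case-sensitive, so 'Title' does not match the lowercase "title" in the first paragraph.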

CSS matching

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('westos.html','rb'),'html.parser')
# Common CSS selectors: tag (div), class (.class1), id (#idname), attribute (p[type="text"])
# Tag selector (div)
res1 = soup.select("p")
print(res1)
# Class selector (.class1)
res2 = soup.select(".class2")
print(res2)
# id selector (#idname)
res3 = soup.select("#test1")
print(res3)
# Attribute selector (p[type="text"])
print(soup.select("p[id='test1']"))
# Attribute-presence selector: the attribute name is not quoted
print(soup.select("p[class]"))
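
The same selectors can be checked against a small made-up document (the ids and classes are assumptions for illustration). The sketch also uses select_one(), which is not covered above and returns only the first match instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document
html = '''
<div id="main">
  <p id="test1" class="class1">first</p>
  <p id="test2" class="class2">second</p>
  <p>third</p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

by_tag = soup.select('p')                  # tag selector
by_class = soup.select('.class2')          # class selector
by_id = soup.select('#test1')              # id selector
by_attr = soup.select('p[id="test1"]')     # attribute-value selector
with_class = soup.select('p[class]')       # attribute-presence selector
first_child = soup.select_one('div > p')   # first <p> directly inside <div>

print(len(by_tag), len(by_class), len(by_id), len(by_attr), len(with_class))
print(first_child)
```

select() always returns a list, even for an id selector, so index into it or use select_one() when a single tag is expected.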

Posted by FFEMTcJ on Wed, 30 Jan 2019 13:21:15 -0800