Parsing HTML with Beautiful Soup in Python

Keywords: Python, encoding, attributes, XML

Abstract

Beautiful Soup is a Python library for extracting data from HTML or XML files. It parses the HTML or XML markup into a tree of Python objects that can be easily searched and manipulated from Python code.

Document environment

  • Test environment for code in this document

Instructions for using Beautiful Soup

The basic function of Beautiful Soup is to find and edit HTML tags.

Basic Concepts - Object Types

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. Beautiful Soup defines four types of objects: Tag, NavigableString, BeautifulSoup, and Comment.

Object type       Description
BeautifulSoup     The entire content of the document
Tag               An HTML or XML tag
NavigableString   The text contained within a tag
Comment           A special type of NavigableString representing commented-out content inside a tag
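
A quick illustration of the four object types (a minimal sketch using a throwaway markup string and the built-in html.parser):

from bs4 import BeautifulSoup, Comment

markup = '<b class="boldest">Extremely bold<!--a hidden comment--></b>'
soup = BeautifulSoup(markup, 'html.parser')

print(type(soup))              # <class 'bs4.BeautifulSoup'>

tag = soup.b
print(type(tag))               # <class 'bs4.element.Tag'>

print(type(tag.contents[0]))   # <class 'bs4.element.NavigableString'>
print(type(tag.contents[1]))   # <class 'bs4.element.Comment'>
print(isinstance(tag.contents[1], Comment))  # True (Comment is a NavigableString subclass)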

Installation and Reference

# Beautiful Soup
pip install beautifulsoup4

# Parser
pip install lxml
pip install html5lib
# Initialization
from bs4 import BeautifulSoup

# Method one: parse an open file object directly
soup = BeautifulSoup(open("index.html"), 'lxml')

# Method two: parse a string of markup
resp = "<html>data</html>"
soup = BeautifulSoup(resp, 'lxml')

# soup is a BeautifulSoup object
print(type(soup))

Tag Search and Filtering

Basic methods

There are two basic search methods: find_all() and find(). find_all() returns a list of all tags that match the given criteria, while find() returns only the first match.

import re

soup = BeautifulSoup(resp, 'lxml')

# Return the first <a> tag
soup.find("a")

# Return a list of all <a> tags
soup.find_all("a")

## find_all() can be abbreviated by calling the object directly
soup("a")

# Find all tags whose names start with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# Find all tags whose names appear in the list
soup.find_all(["a", "p"])

# Find <p> tags whose class attribute is "title"
soup.find_all("p", "title")

# Find tags whose id attribute is "link2"
soup.find_all(id="link2")

# Find tags that have an id attribute
soup.find_all(id=True)

# Combine filters: href matches "elsie" and id is "link1"
soup.find_all(href=re.compile("elsie"), id='link1')

# Search by attributes that cannot be used as keyword arguments (such as data-* attributes) via the attrs dictionary
soup.find_all(attrs={"data-foo": "value"})

# Find a string whose text contains "sisters"
soup.find(string=re.compile("sisters"))

# Get a specified number of results
soup.find_all("a", limit=2)

# Custom Matching Method
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)

# A custom matching function can also be applied to a single attribute
def not_lacie(href):
    return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)

# find_all() searches all descendants of the current tag; to search only its direct children, pass recursive=False

soup.find_all("title", recursive=False)

Extension Methods

These methods take the same arguments as find_all() and find(), but search other parts of the document tree; a short example follows.

Method                      Description
find_parents()              All ancestor nodes
find_parent()               The nearest ancestor node
find_next_siblings()        All following sibling nodes
find_next_sibling()         The first following sibling node
find_previous_siblings()    All preceding sibling nodes
find_previous_sibling()     The first preceding sibling node
find_all_next()             All following elements
find_next()                 The first following element
find_all_previous()         All preceding elements
find_previous()             The first preceding element
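
For example, applying a few of these methods to a small document (a minimal sketch; the markup, tag names and IDs below are only illustrative):

from bs4 import BeautifulSoup

doc = """
<p class="story">
  <a href="http://example.com/elsie" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" id="link2">Lacie</a> and
  <a href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(doc, 'html.parser')
link2 = soup.find(id="link2")

print(link2.find_parent("p")["class"])              # ['story']
print(link2.find_previous_sibling("a")["id"])       # link1
print(link2.find_next_sibling("a")["id"])           # link3
print([a["id"] for a in link2.find_all_next("a")])  # ['link3']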

CSS Selector

Beautiful Soup supports most CSS selectors (http://www.w3.org/TR/CSS2/selector.html). A tag can be found with CSS selector syntax by passing a selector string to the .select() method of a Tag or BeautifulSoup object.

html_doc = """
<html>
<head>
  <title>The Dormouse's story</title>
</head>
<body>
  <p class="title"><b>The Dormouse's story</b></p>

  <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
  </p>

  <p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# All <a> tags
soup.select("a")

# Search layer by layer
soup.select("body a")
soup.select("html head title")

# Direct child tags under a tag
soup.select("head > title")
soup.select("p > #link1")

# All sibling tags after the matching tag
soup.select("#link1 ~ .sister")

# The first sibling tag after the matching tag
soup.select("#link1 + .sister")

# Find by CSS class name
soup.select(".sister")
soup.select("[class~=sister]")

# Find by ID
soup.select("#link1")
soup.select("a#link1")

# Find by multiple IDs
soup.select("#link1,#link2")

# Find by Attribute
soup.select('a[href]')

# Find by attribute value
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

# Limit the number of matches (still returns a list)
soup.select(".sister", limit=1)

# Return only the first matching tag
soup.select_one(".sister")

Tag Object Methods

Tag Attributes

soup = BeautifulSoup('<p class="body strikeout" id="1">Extremely bold</p><p class="body strikeout" id="2">Extremely bold2</p>', 'lxml')
# Get all <p> Tag objects
tags = soup.find_all("p")
# Get the first <p> Tag object
tag = soup.p
# Type of the tag object
type(tag)
# Tag name
tag.name
# Tag attributes, as a dict
tag.attrs
# Value of the class attribute
tag['class']
# Text contained in the tag, as a NavigableString object
tag.string

# Iterate over all strings within the tag
for string in tag.strings:
    print(repr(string))

# Iterate over all strings within the tag, skipping whitespace-only strings
for string in tag.stripped_strings:
    print(repr(string))

# Get all the text contained in the tag, including its descendant tags, as a single Unicode string
tag.get_text()
## Join the text pieces with '|'
tag.get_text("|")
## Join with '|' and strip whitespace from each piece
tag.get_text("|", strip=True)

Get Child Nodes

tag.contents  # Returns a list of first-level child nodes
tag.children  # Returns an iterator over the first-level child nodes
for child in tag.children:
    print(child)

tag.descendants # Returns a generator over all descendant nodes, recursively
for child in tag.descendants:
    print(child)

Get Parent Node

tag.parent # The immediate parent node of the tag
tag.parents # A generator over all ancestors of the element

for parent in tag.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

Get Sibling Nodes

# The next sibling element
tag.next_sibling 

# All sibling elements after the current tag
tag.next_siblings
for sibling in tag.next_siblings:
    print(repr(sibling))

# The previous sibling element
tag.previous_sibling

# All sibling elements before the current tag
tag.previous_siblings
for sibling in tag.previous_siblings:
    print(repr(sibling))

Traversal of elements

Beautiful Soup treats every tag and string in the document as an element, arranged in the order in which they appear in the HTML; the following properties traverse them one at a time.

# The next element after the current tag
tag.next_element

# All elements after the current tag
for element in tag.next_elements:
    print(repr(element))

# The previous element before the current tag
tag.previous_element
# All elements before the current tag
for element in tag.previous_elements:
    print(repr(element))

Modify Tag Attributes

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b

tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1

tag.string = "New link text."
print(tag)

Modify Tag Content (NavigableString)

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
tag.string = "New link text."

Add tag content (NavigableString)

soup = BeautifulSoup("<a>Foo</a>")
tag = soup.a
tag.append("Bar")
tag.contents

# or

from bs4 import NavigableString
new_string = NavigableString("Bar")
tag.append(new_string)
print(tag)

Add Comment

A comment is a special NavigableString object, so it can also be added through the append() method.

from bs4 import Comment
soup = BeautifulSoup("<a>Foo</a>")
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)
print(tag)

Add tags (Tag)

There are two ways to add tags: appending inside a given tag (the append() method), or inserting at a specific position (the insert(), insert_before() and insert_after() methods).

  • The append() method adds an element to the end of a tag's contents
    soup = BeautifulSoup("<b></b>")
    tag = soup.b
    new_tag = soup.new_tag("a", href="http://www.example.com")
    new_tag.string = "Link text."
    tag.append(new_tag)
    print(tag)
  • The insert() method inserts an object (a Tag or NavigableString) at a specified position in the current tag's list of child nodes
    html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'
    soup = BeautifulSoup(html)
    tag = soup.a
    tag.contents
    tag.insert(1, "but did not endorse ")
    tag.contents
  • The insert_before() and insert_after() methods insert an element as a sibling node before or after the current tag
    html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'
    soup = BeautifulSoup(html)
    tag = soup.new_tag("i")
    tag.string = "Don't"
    soup.b.insert_before(tag)
    soup.b
  • The wrap() and unwrap() methods wrap an element in a specified tag, or remove a wrapping tag, and return the result

# Wrapping
soup = BeautifulSoup("<p>I wish I was bold.</p>")
soup.p.string.wrap(soup.new_tag("b"))
# Output: <b>I wish I was bold.</b>

soup.p.wrap(soup.new_tag("div"))
# Output: <div><p><b>I wish I was bold.</b></p></div>

# Unwrapping
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

a_tag.i.unwrap()
a_tag
# Output: <a href="http://example.com/">I linked to example.com</a>

Delete a Tag

html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html)
# Clear all child nodes of the current tag
soup.b.clear()

# Remove the current tag and all its children from the soup, returning the removed tag
soup = BeautifulSoup(html)  # re-parse so each example starts from the full document
b_tag = soup.b.extract()
b_tag
soup

# Remove the current tag and all its children from the soup without returning anything
soup = BeautifulSoup(html)
soup.b.decompose()

# Replace the current tag with the specified element
soup = BeautifulSoup(html)
tag = soup.i
new_tag = soup.new_tag("p")
new_tag.string = "Don't"
tag.replace_with(new_tag)

Other methods

Output

# Format Output
tag.prettify()
tag.prettify("latin-1")
  • After parsing, Beautiful Soup converts the document to Unicode, and HTML entities for special characters are converted to the corresponding Unicode characters. If the document is then converted to a bytestring, those Unicode characters are encoded as UTF-8; the original HTML entities are not restored (see the sketch below).
  • When producing Unicode output, Beautiful Soup also intelligently converts quotation marks into HTML or XML special characters.
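
A small sketch of the behaviour described above (html.parser assumed; the markup and its entities are only an example):

from bs4 import BeautifulSoup

# "&ldquo;" and "&rdquo;" are HTML entities for curly quotation marks
soup = BeautifulSoup("<p>Il a dit &ldquo;bonjour&rdquo;</p>", 'html.parser')

# After parsing, the entities have become Unicode characters
print(soup.p.string)       # Il a dit “bonjour”

# prettify() returns a re-indented Unicode string
print(soup.prettify())

# Converting the soup to a bytestring encodes those characters as UTF-8
print(soup.encode("utf-8"))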

Document encoding

After parsing with Beautiful Soup, the document is converted to Unicode. Beautiful Soup uses its encoding auto-detection sub-library (Unicode, Dammit) to identify the document's current encoding and convert it to Unicode.

soup = BeautifulSoup(html)
soup.original_encoding

# You can also specify the encoding of the document manually 
soup = BeautifulSoup(html, from_encoding="iso-8859-8")
soup.original_encoding

# To speed up automatic encoding detection, some encodings can be excluded in advance
soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
  • When you output a document through Beautiful Soup, the default output encoding is UTF-8 regardless of how the input document was encoded (illustrated below)
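
A brief sketch of the encoding round trip (the ISO-8859-1 input below is an assumed example):

from bs4 import BeautifulSoup

# A document encoded as ISO-8859-1 (Latin-1) bytes
markup = "<p>caf\u00e9</p>".encode("iso-8859-1")

soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-1")
print(soup.original_encoding)   # iso-8859-1
print(soup.p.string)            # café (Unicode inside the tree)

# Encoded output defaults to UTF-8, regardless of the input encoding...
print(soup.encode())            # b'<p>caf\xc3\xa9</p>'
# ...unless another encoding is requested explicitly
print(soup.prettify("latin-1"))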

Document Parser

Beautiful Soup currently supports the "lxml", "html5lib", and "html.parser" parsers. Different parsers can build different trees from the same (especially invalid) markup, as the examples below show.

soup=BeautifulSoup("<a><b /></a>")
soup
#Output: <html><body><a><b></b></b></a></body></html>
soup=BeautifulSoup("<a></p>", "lxml")
soup
#Output: <html><body><a></a></body></html>
soup=BeautifulSoup("<a></p>", "html5lib")
soup
#Output: <html><head></head><body><a><p></p></a></body></html>
soup=BeautifulSoup("<a></p>", "html.parser")
soup
#Output: <a></a>

Reference Documents

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh

Posted by peter.t on Tue, 14 Jan 2020 09:40:48 -0800