Abstract
Beautiful Soup is a Python library that extracts data from HTML or XML files by parsing the markup into a tree of Python objects, which can then be searched and modified with ordinary Python code.
Document environment
- Test environment for code in this document
Instructions for Using Beautiful Soup
The basic function of Beautiful Soup is to find and edit HTML tags.
Basic Concepts - Object Types
Beautiful Soup converts a complex HTML document into a tree structure in which every node becomes a Python object. Beautiful Soup defines four object types: BeautifulSoup, Tag, NavigableString, and Comment.
Object type | Description |
---|---|
BeautifulSoup | The entire parsed document |
Tag | An HTML or XML tag |
NavigableString | The text contained in a tag |
Comment | A special NavigableString subclass holding the text of a comment inside a tag |
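A minimal sketch of the four object types (the standard library's html.parser backend is assumed here; any parser works, and the sample markup is illustrative):

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

soup = BeautifulSoup("<b class='boldest'>Extremely bold</b>", "html.parser")

print(type(soup))        # the whole document: BeautifulSoup
tag = soup.b
print(type(tag))         # an HTML tag: Tag
print(type(tag.string))  # the text inside a tag: NavigableString

# A comment inside a tag comes back as a Comment object
comment_soup = BeautifulSoup("<b><!--Hey, buddy.--></b>", "html.parser")
comment = comment_soup.b.string
print(type(comment))     # Comment (a NavigableString subclass)
```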
Installation and Import
```shell
# Beautiful Soup
pip install bs4
# Parsers
pip install lxml
pip install html5lib
```
```python
# Initialization
from bs4 import BeautifulSoup

# Method 1: open a file directly
soup = BeautifulSoup(open("index.html"), "lxml")

# Method 2: pass the markup as a string
resp = "<html>data</html>"
soup = BeautifulSoup(resp, "lxml")

# soup is a BeautifulSoup object
print(type(soup))
```
Label Search and Filtering
Basic methods
Tag searching has two basic methods: find_all() and find(). find_all() returns a list of all tags that match the given criteria, while find() returns only the first match.
```python
import re

soup = BeautifulSoup(resp, 'lxml')

# Return the first Tag named "a"
soup.find("a")
# Return a list of all "a" tags
soup.find_all("a")
# find_all can be abbreviated by calling the object directly
soup("a")
# Find all tags whose names start with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# Find all tags whose names are in the list
soup.find_all(["a", "p"])
# Find tags named "p" whose class attribute is "title"
soup.find_all("p", "title")
# Find tags whose id attribute is "link2"
soup.find_all(id="link2")
# Find tags that have an id attribute at all
soup.find_all(id=True)
# Combine several attribute filters
soup.find_all(href=re.compile("elsie"), id='link1')
# Filter on attributes whose names are not valid keyword arguments
soup.find_all(attrs={"data-foo": "value"})
# Find tag text containing "sisters"
soup.find(string=re.compile("sisters"))
# Limit the number of results
soup.find_all("a", limit=2)

# Custom matching function for tags
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)

# Custom matching function for a single attribute
def not_lacie(href):
    return href and not re.compile("lacie").search(href)

soup.find_all(href=not_lacie)

# find_all() searches all descendants of the current tag;
# pass recursive=False to search only the direct children
soup.find_all("title", recursive=False)
```
Extension Methods
Method | Description |
---|---|
find_parents() | All ancestor nodes |
find_parent() | The first (closest) ancestor node |
find_next_siblings() | All following sibling nodes |
find_next_sibling() | The first following sibling node |
find_previous_siblings() | All preceding sibling nodes |
find_previous_sibling() | The first preceding sibling node |
find_all_next() | All following elements |
find_next() | The first following element |
find_all_previous() | All preceding elements |
find_previous() | The first preceding element |
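A minimal sketch exercising a few of these methods, assuming a small document in the style of the three-sisters example used later in this article (the ids link1–link3 are illustrative):

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>
<a id="link2" class="sister" href="http://example.com/lacie">Lacie</a>
<a id="link3" class="sister" href="http://example.com/tillie">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")
link2 = soup.find(id="link2")

# Closest ancestor <p> tag
print(link2.find_parent("p")["class"])         # ['story']
# First <a> sibling after / before link2
print(link2.find_next_sibling("a")["id"])      # link3
print(link2.find_previous_sibling("a")["id"])  # link1
# All <a> elements anywhere after link2 in document order
print([t["id"] for t in link2.find_all_next("a")])  # ['link3']
```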
CSS Selector
Beautiful Soup supports most CSS selectors (http://www.w3.org/TR/CSS2/selector.html). Passing a string to the .select() method of a Tag or BeautifulSoup object finds tags using CSS selector syntax.
```python
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "lxml")

# All "a" tags
soup.select("a")
# Search layer by layer
soup.select("body a")
soup.select("html head title")
# Direct children of a tag
soup.select("head > title")
soup.select("p > #link1")
# All following siblings of the matched tag
soup.select("#link1 ~ .sister")
# The first following sibling of the matched tag
soup.select("#link1 + .sister")
# Find by class name
soup.select(".sister")
soup.select("[class~=sister]")
# Find by id
soup.select("#link1")
soup.select("a#link1")
# Find by multiple ids
soup.select("#link1,#link2")
# Find by the presence of an attribute
soup.select('a[href]')
# Find by attribute value
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')
# Limit to one match (still returns a list)
soup.select(".sister", limit=1)
# Return a single match directly
soup.select_one(".sister")
```
Tag Object Methods
Tag Attributes
```python
soup = BeautifulSoup(
    '<p class="body strikeout" id="1">Extremely bold</p>'
    '<p class="body strikeout" id="2">Extremely bold2</p>',
    "lxml",
)
# All p Tag objects
tags = soup.find_all("p")
# The first p Tag object
tag = soup.p
# Type of the tag object
type(tag)
# Tag name
tag.name
# Tag attributes
tag.attrs
# Value of the class attribute
tag['class']
# The text the tag contains, as a NavigableString object
tag.string
# All text inside the tag
for string in tag.strings:
    print(repr(string))
# All text inside the tag, with blank lines removed
for string in tag.stripped_strings:
    print(repr(string))
# All NavigableString content of the tag and its descendants,
# joined into a single Unicode string
tag.get_text()
# Separated by '|'
tag.get_text("|")
# Separated by '|', stripping whitespace
tag.get_text("|", strip=True)
```
Get Child Nodes
```python
# List of direct children
tag.contents
# Iterator over the direct children
tag.children
for child in tag.children:
    print(child)
# All descendants, recursively
tag.descendants
for child in tag.descendants:
    print(child)
```
Get Parent Node
```python
# The immediate parent tag
tag.parent
# All ancestors of the element, recursively
tag.parents
for parent in tag.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
```
Get Sibling Nodes
```python
# The next sibling element
tag.next_sibling
# All siblings after the current tag
tag.next_siblings
for sibling in tag.next_siblings:
    print(repr(sibling))
# The previous sibling element
tag.previous_sibling
# All siblings before the current tag
tag.previous_siblings
for sibling in tag.previous_siblings:
    print(repr(sibling))
```
Traversal of elements
Beautiful Soup treats every tag and string in the document as an element, ordered from top to bottom as it appears in the HTML; the properties below visit the elements one at a time.
```python
# The element after the current tag
tag.next_element
# All elements after the current tag
for element in tag.next_elements:
    print(repr(element))
# The element before the current tag
tag.previous_element
# All elements before the current tag
for element in tag.previous_elements:
    print(repr(element))
```
Modify Tag Attributes
```python
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")
tag = soup.b
tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
tag.string = "New link text."
print(tag)
```
Modify Tag Content (NavigableString)
```python
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")
tag = soup.b
tag.string = "New link text."
```
Add Tag Content (NavigableString)
```python
from bs4 import NavigableString

soup = BeautifulSoup("<a>Foo</a>", "lxml")
tag = soup.a
tag.append("Bar")
tag.contents
# Or build the NavigableString explicitly
new_string = NavigableString("Bar")
tag.append(new_string)
print(tag)
```
Add Comment
A comment is a special NavigableString object, so it can also be added through the append() method.
```python
from bs4 import Comment

soup = BeautifulSoup("<a>Foo</a>", "lxml")
tag = soup.a
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)
print(tag)
```
Add tags (Tag)
There are two ways to add a tag: inside a specified tag (the append() method), or at a specified position (the insert(), insert_before(), and insert_after() methods).
- The append() method adds an element to the end of the current tag's children.
```python
soup = BeautifulSoup("<b></b>", "lxml")
tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Link text."
tag.append(new_tag)
print(tag)
```
- The insert() method inserts an object (a Tag or NavigableString) at the specified index in the current tag's list of children.

```python
html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html, "lxml")
tag = soup.a
tag.contents
tag.insert(1, "but did not endorse ")
tag.contents
```
- The insert_before() and insert_after() methods add an element as a sibling immediately before or after the current tag.
```python
html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html, "lxml")
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.insert_before(tag)
soup.b
```
- The wrap() and unwrap() methods wrap a specified tag element in another tag, or strip a tag away, and return the result.

```python
# Wrapping
soup = BeautifulSoup("<p>I wish I was bold.</p>", "lxml")
soup.p.string.wrap(soup.new_tag("b"))
# Output: <b>I wish I was bold.</b>
soup.p.wrap(soup.new_tag("div"))
# Output: <div><p><b>I wish I was bold.</b></p></div>

# Unwrapping
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "lxml")
a_tag = soup.a
a_tag.i.unwrap()
a_tag
# Output: <a href="http://example.com/">I linked to example.com</a>
```
Delete a Tag
```python
html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'

# Remove all children of the current tag
soup = BeautifulSoup(html, "lxml")
soup.b.clear()

# Remove the tag and its children from the soup, returning the removed tag
soup = BeautifulSoup(html, "lxml")
b_tag = soup.b.extract()
b_tag
soup

# Remove the tag and its children from the soup without returning it
soup = BeautifulSoup(html, "lxml")
soup.b.decompose()

# Replace the current tag with the given element
soup = BeautifulSoup(html, "lxml")
tag = soup.i
new_tag = soup.new_tag("p")
new_tag.string = "Don't"
tag.replace_with(new_tag)
```
Other methods
Output
```python
# Pretty-print the tag
tag.prettify()
tag.prettify("latin-1")
```
- After parsing, Beautiful Soup converts the document to Unicode, decoding HTML entities into the corresponding Unicode characters. Encoding the document back to bytes produces UTF-8 by default; the original HTML entities are not restored.
- When generating output, Beautiful Soup escapes only the characters that would break the HTML or XML markup, substituting entities where needed.
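A small illustration of both points (a sketch, assuming the standard library's html.parser backend):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>&ldquo;Hello&rdquo; &amp; goodbye</p>", "html.parser")

# Entities are decoded into Unicode characters at parse time
print(soup.p.string)   # “Hello” & goodbye

# On output, only characters that would break the markup are re-escaped;
# the curly quotes stay as Unicode, but "&" becomes &amp; again
print(str(soup.p))     # <p>“Hello” &amp; goodbye</p>
```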
Document encoding
After parsing, Beautiful Soup converts the document to Unicode, using its encoding auto-detection sublibrary (Unicode, Dammit) to identify the document's current encoding and convert it.
```python
soup = BeautifulSoup(html)
soup.original_encoding
# The document's encoding can also be specified manually
soup = BeautifulSoup(html, from_encoding="iso-8859-8")
soup.original_encoding
# To speed up auto-detection, some encodings can be excluded in advance
soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
```
- When a document is encoded for output by Beautiful Soup, the default encoding is UTF-8, regardless of how the input document was encoded.
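For example, the encode() method produces UTF-8 bytes unless another encoding is requested (a sketch using html.parser and an illustrative snippet):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")

# encode() defaults to UTF-8, whatever the input encoding was
print(soup.p.encode())           # b'<p>caf\xc3\xa9</p>'
# Other encodings can be requested explicitly
print(soup.p.encode("latin-1"))  # b'<p>caf\xe9</p>'
```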
Document Parser
Beautiful Soup currently supports the "lxml", "html5lib", and "html.parser" (Python standard library) parsers; each handles invalid markup differently.
```python
soup = BeautifulSoup("<a><b /></a>")
soup
# Output: <html><body><a><b></b></a></body></html>

soup = BeautifulSoup("<a></p>", "lxml")
soup
# Output: <html><body><a></a></body></html>

soup = BeautifulSoup("<a></p>", "html5lib")
soup
# Output: <html><head></head><body><a><p></p></a></body></html>

soup = BeautifulSoup("<a></p>", "html.parser")
soup
# Output: <a></a>
```