Python web crawler and information extraction, note 10: HTML content search methods based on the bs4 library

As before, we start from the demo.html page used for testing in the earlier notes. First, save the page source in the demo variable:

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> r.status_code
200
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>>

The BeautifulSoup library provides one very commonly used search method:

<>.find_all(name,attrs,recursive,string,**kwargs)

We will go through the parameters one by one:

name: the search string matched against tag names

find_all() returns the matches as a list (a ResultSet, which behaves like a list). An example is as follows:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>>

As shown, passing a tag name to find_all() returns every tag with that name. To search for several tag names at once, put them together in one list. To match every tag, pass True as the name; the loop below prints the name of each tag found:

>>> for tag in soup.find_all(True):
	print(tag.name)

	
html
head
title
body
p
b
p
a
a
>>>

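Suppose there is another requirement, such as showing only the tags whose names contain b, here b and body. For this we import re, the regular expression module from the Python standard library. bs4 tests a compiled pattern with its search() method, so re.compile('b') matches any tag name containing b; to match only names that begin with b, anchor the pattern as re.compile('^b'). Let's look at the example first and discuss the re module later: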

>>> import re
>>> for tag in soup.find_all(re.compile('b')):
	print(tag.name)

	
body
b
>>>
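
As a rough sketch of how anchoring changes the match, the script below compares an unanchored pattern with an anchored one. The demo string is a shortened offline copy of demo.html (an assumption, so that the sketch runs without network access), and the names contains_t and starts_with_t are made up for this illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened offline copy of demo.html (an assumption, not the exact page source).
demo = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(demo, "html.parser")

# Unanchored pattern: matches every tag name that contains the letter t.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

# Anchored pattern: matches only tag names that begin with t.
starts_with_t = [tag.name for tag in soup.find_all(re.compile("^t"))]

print(contains_t)     # ['html', 'title']
print(starts_with_t)  # ['title']
```

On this page the pattern 'b' and the pattern '^b' happen to give the same result, because body and b are the only tag names containing b.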

attrs: the search string matched against tag attribute values; a specific attribute can also be named as a keyword argument

To find the p tags whose class attribute is course, pass the class name as the second argument:

>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>>
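
The class filter can be written in more than one way. The sketch below, which again uses a shortened offline copy of demo.html as an assumption, shows three spellings that should select the same p tag; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened offline copy of demo.html (an assumption, not the exact page source).
demo = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(demo, "html.parser")

# A string in the second position is treated as a CSS class name.
by_position = soup.find_all("p", "course")

# class is a reserved word in Python, so the keyword form is class_.
by_keyword = soup.find_all("p", class_="course")

# The attrs dictionary form works for any attribute name.
by_dict = soup.find_all("p", attrs={"class": "course"})

print(len(by_position))                      # 1
print(by_position[0]["class"])               # ['course']
print(by_position == by_keyword == by_dict)  # True
```

The dictionary form is the most general of the three, since it also covers attribute names that cannot be written as Python keyword arguments.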

To find the tag whose id attribute equals link1:

>>> soup.find_all(id = 'link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>>

Now let's search for an id value of link, which no tag in the page has:

>>> soup.find_all(id = 'link')
[]
>>>

An empty list is returned. The return value is a list, as in the previous example, and a plain string must match the attribute value exactly. To match values that only contain link, we again need a regular expression:

>>> import re
>>> soup.find_all(id = re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>>

This works much like a search engine query when we don't know the full name of what we are looking for: the regular expression matches every id value that contains link, and the matching tags are returned.

recursive: whether to search all descendants; True by default

By default, the search covers all descendants of the tag it is called on. To search only the direct children, set recursive to False:

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive = False)
[]
>>>

With recursive set to False, an empty list is returned: no a tag is a direct child of the root soup object. The a tags sit deeper in the tree, among its descendants.

string: the search string matched against the text content inside a tag
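
To see where the direct children actually are, the following sketch calls find_all() with recursive=False at three levels of the tree. It uses a shortened offline copy of demo.html (an assumption), and the variable names are chosen only for this example.

```python
from bs4 import BeautifulSoup

# Shortened offline copy of demo.html (an assumption, not the exact page source).
demo = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(demo, "html.parser")

# Direct children of the root: only the html tag.
top_level = [t.name for t in soup.find_all(True, recursive=False)]

# Direct children of body: the two p tags.
body_children = [t.name for t in soup.body.find_all(True, recursive=False)]

# The a tags are direct children of the second p tag,
# so a non-recursive search started there does find them.
course = soup.find_all("p", "course")[0]
links = [a["id"] for a in course.find_all("a", recursive=False)]

print(top_level)      # ['html']
print(body_children)  # ['p', 'p']
print(links)          # ['link1', 'link2']
```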

>>> soup.find_all(string = "Basic Python")
['Basic Python']
>>>

A plain string must match the text exactly, here Basic Python. To match only part of the text, we can use a regular expression:

>>> import re
>>> soup.find_all(string = re.compile("python"))
['This is a python demo page', 'The demo python introduces several python courses.']
>>>

This retrieves every string containing python, and the return value is again a list. The match is case-sensitive, so strings that contain only the capitalized Python are not returned.

find_all() is the most commonly used search method in the BeautifulSoup library. In addition, there are seven common related methods:
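
Two variations on the string search may be useful; both are sketched below against a shortened offline copy of demo.html (an assumption). The first ignores case, and the second combines a tag name with string so that tags are returned rather than bare strings. The variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Shortened offline copy of demo.html (an assumption, not the exact page source).
demo = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(demo, "html.parser")

# Case-sensitive: only strings containing lowercase "python".
lower_only = soup.find_all(string=re.compile("python"))

# Case-insensitive: strings containing "python" or "Python".
any_case = soup.find_all(string=re.compile("python", re.IGNORECASE))

# Tag name plus string: returns the a tags whose text matches.
link_tags = soup.find_all("a", string=re.compile("Python"))

print(len(lower_only))                     # 2
print(len(any_case))                       # 5
print([a.get_text() for a in link_tags])   # ['Basic Python', 'Advanced Python']
```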

Method	Description
<>.find()	Searches and returns only the first result (a single tag or string node, or None if nothing matches); same parameters as find_all()
<>.find_parents()	Searches the ancestor nodes and returns a list; same parameters as find_all()
<>.find_parent()	Returns a single result from the ancestor nodes; same parameters as find()
<>.find_next_siblings()	Searches the following sibling nodes and returns a list; same parameters as find_all()
<>.find_next_sibling()	Returns a single result from the following sibling nodes; same parameters as find()
<>.find_previous_siblings()	Searches the preceding sibling nodes and returns a list; same parameters as find_all()
<>.find_previous_sibling()	Returns a single result from the preceding sibling nodes; same parameters as find()

These seven methods differ in where they search and in how many results they return; otherwise they are used just like find_all(). With a little practice they become easy to use.
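
A minimal sketch of these methods in use follows, starting from the link1 tag. It assumes a shortened offline copy of demo.html, and the variable names are chosen for this example only.

```python
from bs4 import BeautifulSoup

# Shortened offline copy of demo.html (an assumption, not the exact page source).
demo = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(demo, "html.parser")

# find() returns a single Tag object, or None when nothing matches.
link1 = soup.find(id="link1")
missing = soup.find("table")

# Ancestors: the nearest match first.
parent_class = link1.find_parent("p")["class"]
ancestors = [t.name for t in link1.find_parents(["p", "body"])]

# Siblings: nodes at the same level, after or before the starting tag.
next_id = link1.find_next_sibling("a")["id"]
later = [a["id"] for a in link1.find_next_siblings("a")]
earlier = link1.find_previous_siblings("a")
link2 = soup.find(id="link2")
previous_id = link2.find_previous_sibling("a")["id"]

print(type(link1).__name__, missing)  # Tag None
print(parent_class, ancestors)        # ['course'] ['p', 'body']
print(next_id, later, earlier)        # link2 ['link2'] []
print(previous_id)                    # link1
```

Because find() and the other single-result methods return None when nothing matches, it is usually worth checking the result before indexing into it.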

Dream hacker

Posted by sfmnetsys on Thu, 06 Feb 2020 01:20:50 -0800