Python 3 Web Crawler in Action - 28: Using the Parsing Library XPath

Keywords: Python, XML, XPath, lxml

Previous article: Python 3 Web Crawler in Action - 27, Scraping the Maoyan Movie Rankings with Requests and Regular Expressions

In the last section, we implemented a basic crawler, but we used regular expressions to extract the page information. Constructing a regular expression is tedious, and if any part of it is wrong the match may fail, so regular expressions are not a convenient way to extract page information.

A node in a web page can define an id, class, or other attributes, and there are hierarchical relationships between nodes, so one or more nodes can be located with XPath or CSS selector expressions. When parsing a page, we can use an XPath or CSS selector to locate a node, then call the appropriate method to get its text content or attributes. In this way, we can extract any information we want.

How do we do this in Python? Don't worry, there are already plenty of parsing libraries for this. Among the more powerful are lxml, Beautiful Soup, and pyquery. This chapter introduces the use of these three parsing libraries. With them, we no longer need to worry about regular expressions, and parsing efficiency is greatly improved. They are essential tools for crawlers.

Use of XPath

XPath, whose full name is XML Path Language, is a language for finding information in XML documents. XPath was originally designed for searching XML documents, but it also works for HTML documents.

So when crawling, we can use XPath to extract information. In this section, we will introduce the basic usage of XPath.

1. XPath overview

XPath has powerful selection capabilities. It provides very concise path selection expressions, as well as over 100 built-in functions for string, numeric, and date/time matching and for node and sequence processing. Almost any node we want to locate can be selected with XPath.

XPath became a W3C standard on November 16, 1999. It was designed to be used by XSLT, XPointer, and other XML processing software. More documentation is available on its official website: https://www.w3.org/TR/xpath/.

2. Common XPath rules

The table below lists several common rules:

Expression    Description
nodename      Selects all child nodes of this node
/             Selects direct child nodes from the current node
//            Selects descendant nodes from the current node
.             Selects the current node
..            Selects the parent of the current node
@             Selects attributes

The common XPath matching rules are listed here. For example, / selects direct child nodes, // selects all descendant nodes, . selects the current node, .. selects the parent of the current node, and @ selects nodes with a specific attribute.

For example:

//title[@lang='eng']

This is an XPath rule that selects all nodes named title whose lang attribute has the value eng.

Later, we will describe the detailed usage of XPath through Python's lxml library, which we will use to parse HTML.

3. Preparations

Before using it, we first need to make sure that the lxml library is installed. If it is not, you can refer to the installation instructions in Chapter 1.
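If it is not installed yet, it can usually be installed with pip; a minimal example (the exact command depends on your environment):

pip3 install lxml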

4. Instance introduction

Now let's take a look at the process of parsing a web page using XPath with the following code:

from lxml import etree
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))

Here we first import the etree module of the lxml library, then declare a piece of HTML text and call the HTML class to initialize it. In this way, we have successfully constructed an XPath parsing object. Note that the last li node in the HTML text is not closed, but the etree module can automatically correct the HTML text.

Here we call the tostring() method to output the corrected HTML code, but the result is of type bytes, so we use the decode() method to convert it to type str. The result is as follows:

<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </li></ul>
 </div>
</body></html>

We can see that after processing, the li node's tag is completed, and body and html nodes are added automatically.

In addition, we can read the text file directly for parsing as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

The content of test.html is the HTML code in the example above, as follows:

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>

This time the output is slightly different, with an additional DOCTYPE declaration, but it has no effect on parsing. The results are as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </li></ul>
 </div></body></html>

5. All nodes

We usually use an XPath rule beginning with // to select all nodes that meet the requirements. Taking the HTML text above as an example, if we want to select all nodes, we can do this:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)

Run result:

[<Element html at 0x10510d9c8>, <Element body at 0x10510da08>, <Element div at 0x10510da48>, <Element ul at 0x10510da88>, <Element li at 0x10510dac8>, <Element a at 0x10510db48>, <Element li at 0x10510db88>, <Element a at 0x10510dbc8>, <Element li at 0x10510dc08>, <Element a at 0x10510db08>, <Element li at 0x10510dc48>, <Element a at 0x10510dc88>, <Element li at 0x10510dcc8>, <Element a at 0x10510dd08>]

Here, we use * to match all nodes, so every node in the entire HTML document is retrieved. You can see that the return value is a list in which each item is of type Element, followed by the name of the node, such as html, body, div, ul, li, a, and so on. All nodes are included in the list.

Of course, the match can also specify a node name. If we want to get all li nodes, an example is as follows:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)
print(result[0])

Here, to select all li nodes, we use // followed directly by the node name, and then call the xpath() method to extract them.

Run result:

[<Element li at 0x105849208>, <Element li at 0x105849248>, <Element li at 0x105849288>, <Element li at 0x1058492c8>, <Element li at 0x105849308>]
<Element li at 0x105849208>

Here we can see that the extraction result is a list in which each element is an Element object. If you want to take out one of the objects, you can index it directly with square brackets, for example result[0].

6. Child Nodes

We can find the child or descendant nodes of an element with / or //. For example, if we now want to select all the direct a child nodes of the li nodes, we can do this:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)

Here we append /a to //li: //li selects all li nodes, and /a selects the direct a child nodes of the selected li nodes. Combined, we get all the direct a child nodes of all li nodes.

Run result:

[<Element a at 0x106ee8688>, <Element a at 0x106ee86c8>, <Element a at 0x106ee8708>, <Element a at 0x106ee8748>, <Element a at 0x106ee8788>]

But here / selects only direct child nodes; if we want to get all descendant nodes, we should use // instead. For example, if we want to get all descendant a nodes under the ul node, we can do this:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//ul//a')
print(result)

The results are the same.

But if we use //ul/a here, we get no results, because / selects direct child nodes and the ul node has no direct a child nodes, only li nodes, so no match is found. The code is as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//ul/a')
print(result)

Run result:

[]

So note the difference between / and // here: / gets direct child nodes, while // gets descendant nodes.

7. Parent Node

We know that child or descendant nodes can be found with / or //, but what if we know a child node and want to find its parent node? Here we can use .. to get the parent node.

For example, we first select the a node whose href is link4.html, then get its parent node, and then get its class attribute. The code is as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)

Run result:

['item-1']

We can check that this is exactly the class attribute of the target li node, so getting the parent node succeeded.

We can also get the parent node through parent:: with the following code:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)
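The run result is the same as before:

['item-1']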

8. Attribute Matching

We can also filter by attribute with the @ symbol when selecting. For example, if we want to select the li nodes whose class is item-0, we can do this:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)

Here we restrict the class attribute of the node to item-0 by adding [@class="item-0"]. There are two li nodes in the HTML text that meet this condition, so two matching elements should be returned, as follows:

[<Element li at 0x10a399288>, <Element li at 0x10a3992c8>]

We can see that there are exactly two matching results. We will verify later whether these two results are correct.

9. Text Acquisition

We can get the text in the node using the text() method in XPath. Next, let's try to get the text in the li node above, coded as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/text()')
print(result)

The results are as follows:

['\n     ']

It's strange that we don't get any text, only a line break. Why? Because text() here is preceded by /, and / selects direct child nodes. The direct child nodes of li are all a nodes, and the text is inside the a nodes, so what is matched here is only the line break inside the corrected li node, which comes from the closing tag of the automatically corrected li node being placed on a new line.

That is, the two nodes are selected:

<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li>

In one of these nodes, the closing li tag added by automatic correction appears on a new line, so the only text extracted is the line break between the closing tag of the a node and the closing tag of the li node.

So, if we want to get the text inside the li nodes, there are two ways: one is to select the a nodes first and then get the text, the other is to use //. Let's look at the difference.

First we pick the a node and get the text. The code is as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)

Run result:

['first item', 'fifth item']

You can see that two values are returned here, and their content is the text of the a nodes inside the li nodes whose class attribute is item-0. This also confirms that the attribute-matching result above is correct.

Here we select the li nodes, then use / to select their direct child a nodes, and then select their text. The result is exactly the two values we expected.

Now let's look at the result of selecting with // instead. The code is as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]//text()')
print(result)

Run result:

['first item', 'fifth item', '\n     ']

Unexpectedly, three results are returned here. As you might guess, this selects the text of all descendant nodes: the first two results are the text inside the a child nodes of the li nodes, and the third is the text directly inside the last li node, namely the line break.

So, if we want to get all the text inside descendant nodes, we can get it directly with // followed by text(), which guarantees the most complete text information but may include special characters such as line breaks. If you want to get the text under a particular descendant node, you can select that specific descendant node and then call text() to get its internal text, which ensures that the result is clean.
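If the whitespace-only entries get in the way, a little post-processing in Python can strip them out. A minimal sketch, using the same test.html as above:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# Collect all descendant text and drop whitespace-only entries
texts = [t.strip() for t in html.xpath('//li[@class="item-0"]//text()') if t.strip()]
print(texts)  # ['first item', 'fifth item']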

10. Attribute acquisition

We know that text() can be used to get the text inside a node, so how do we get node attributes? We can use the @ symbol for this as well. For example, if we want to get the href attribute of all a nodes under all li nodes, the code is as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)

Here we can get the href attribute of the a nodes with @href. Note that this is different from attribute matching: attribute matching restricts an attribute with brackets plus the attribute name and value, such as [@href="link1.html"], whereas @href here extracts an attribute of a node. The two need to be distinguished.

Run result:

['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

You can see that we have successfully obtained the href attribute of a node under all li nodes and returned it as a list.
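To make the distinction concrete, here is a small sketch (again using test.html) contrasting the two uses of @:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# Attribute matching: [@...] filters which nodes are selected
print(html.xpath('//a[@href="link1.html"]'))
# Attribute acquisition: /@... extracts the attribute's value
print(html.xpath('//a[@href="link1.html"]/@href'))  # ['link1.html']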

11. Attribute Multi-Value Matching

Sometimes an attribute of some nodes may have multiple values, such as the following example:

from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result)

Here, the class attribute of the li node in the HTML text has two values, li and li-first. If we try to match it with the attribute-matching method used earlier, we get no match. The run result is:

[]

If an attribute has multiple values, we need the contains() function, and the code can be rewritten as follows:

from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

Here we use the contains() function: the first parameter is the attribute name and the second parameter is the value to look for. Matching succeeds as long as the attribute contains the value passed in.

Run result:

['first item']

This choice is often used when there are multiple values for an attribute of a node, such as the class attribute of a node.

12. Multiple Attribute Matching

In addition, we may need to identify a node based on several attributes, which requires matching multiple attributes at the same time. Here we can use the and operator to connect the conditions, for example:

from lxml import etree
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

Here we add a name attribute to the li node in the HTML text, and we then select it based on both the class and name attributes. We connect the two conditions with the and operator, and both conditions are enclosed in the same square brackets. The result is as follows:

['first item']

Here, and is actually an operator in XPath. There are many other operators, such as or, mod, and so on, which are summarized in the table below:

Operator    Description                Example                      Return value
or          or                         price=9.80 or price=9.70     true if price is 9.80; false if price is 9.50
and         and                        price>9.00 and price<9.90    true if price is 9.80; false if price is 8.50
mod         Remainder of a division    5 mod 2                      1
|           Union of two node sets     //book | //cd                A node set containing all book and cd elements
+           Addition                   6 + 4                        10
-           Subtraction                6 - 4                        2
*           Multiplication             6 * 4                        24
div         Division                   8 div 4                      2
=           Equal to                   price=9.80                   true if price is 9.80; false if price is 9.90
!=          Not equal to               price!=9.80                  true if price is 9.90; false if price is 9.80
<           Less than                  price<9.80                   true if price is 9.00; false if price is 9.90
<=          Less than or equal to      price<=9.80                  true if price is 9.00; false if price is 9.90
>           Greater than               price>9.80                   true if price is 9.90; false if price is 9.80
>=          Greater than or equal to   price>=9.80                  true if price is 9.90; false if price is 9.70

This table is referenced from: http://www.w3school.com.cn/xp....
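As a quick illustration, here is a minimal sketch (using the same test.html as above) of the or operator and the | union operator:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# or combines conditions inside one predicate
print(html.xpath('//li[@class="item-0" or @class="item-1"]/a/text()'))
# ['first item', 'second item', 'fourth item', 'fifth item']
# | takes the union of two node-set expressions (results come back in document order)
print(html.xpath('//li[@class="item-inactive"]/a/text() | //li[@class="item-0"]/a/text()'))
# ['first item', 'third item', 'fifth item']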

13. Select in order

Sometimes when we select, some attributes may match multiple nodes at the same time, but we only want one of them, such as the second node, or the last node. What should we do?

In this case we can pass an index inside square brackets to get a node at a particular position, as shown in the following example:

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)

The first time, we selected the first li node by passing the number 1 inside the square brackets. Note that this is different from indexing in code: the sequence number here starts at 1, not 0.

The second time, we selected the last li node by passing last() inside the square brackets, which returns the last li node.

The third time, we selected the li nodes whose position is less than 3, that is, the nodes at positions 1 and 2. The result is the first two li nodes.

The fourth time, we selected the third-to-last li node by passing last()-2 inside the square brackets; since last() is the last one, last()-2 is the third to last.

The results are as follows:

['first item']
['fifth item']
['first item', 'second item']
['third item']

Here we used functions such as last() and position(). XPath provides more than 100 functions, including functions for access, numeric, string, logical, node, and sequence processing. A full reference is available at: http://www.w3school.com.cn/xp....
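As a small illustration of a few of these built-in functions (count(), starts-with(), and normalize-space() are standard XPath 1.0 functions supported by lxml), again using test.html:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# count() returns the number of matching nodes (as a float in lxml)
print(html.xpath('count(//li)'))  # 5.0
# starts-with() matches a string prefix
print(html.xpath('//li[starts-with(@class, "item-in")]/a/text()'))  # ['third item']
# normalize-space() trims and collapses whitespace
print(html.xpath('normalize-space(//li[1]/a/text())'))  # first item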

14. Node Axis Selection

XPath provides a number of node-axis selection methods, known in English as XPath Axes, for getting child elements, sibling elements, parent elements, ancestor elements, and so on. In certain circumstances they make node selection very convenient. Let's look at an example:

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/ancestor::*')
print(result)
result = html.xpath('//li[1]/ancestor::div')
print(result)
result = html.xpath('//li[1]/attribute::*')
print(result)
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
result = html.xpath('//li[1]/descendant::span')
print(result)
result = html.xpath('//li[1]/following::*[2]')
print(result)
result = html.xpath('//li[1]/following-sibling::*')
print(result)

Run result:

[<Element html at 0x107941808>, <Element body at 0x1079418c8>, <Element div at 0x107941908>, <Element ul at 0x107941948>]
[<Element div at 0x107941908>]
['item-0']
[<Element a at 0x1079418c8>]
[<Element span at 0x107941948>]
[<Element a at 0x1079418c8>]
[<Element li at 0x107941948>, <Element li at 0x107941988>, <Element li at 0x1079419c8>, <Element li at 0x107941a08>]

In the first selection, we call the ancestor axis to get all ancestor nodes. It is followed by two colons and then a node selector; here we use * directly to match all nodes, so the result is all ancestor nodes of the first li node: html, body, div, and ul.

In the second selection, we add a restriction: this time we put div after the double colon, so the only result is the div ancestor node.

In the third selection, we call the attribute axis to get all attribute values. The selector that follows is again *, which means all attributes of the node, so the return value is all attribute values of the li node.

In the fourth selection, we call the child axis to get all direct child nodes, and here we add a restriction to select the a node whose href attribute is link1.html.

In the fifth selection, we call the descendant axis to get all descendant nodes, and here we add a restriction to get span nodes, so only the span node is returned, not the a node.

In the sixth selection, we call the following axis to get all nodes after the current node. Here we use * to match, but we add an index selection, so we only get the second subsequent node.

In the seventh selection, we call the following-sibling axis to get all sibling nodes after the current node. Here we use * to match, so we get all subsequent sibling nodes.

The above is a simple use of XPath axes. More axes are documented at: http://www.w3school.com.cn/xp....

15. Conclusion

So far, we have covered essentially all of the XPath selectors you are likely to use. XPath is very powerful and has many built-in functions; once you are familiar with it, it can greatly improve the efficiency of extracting HTML information.
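As a final recap, here is a minimal sketch (assuming the same test.html as above) that combines node selection, attribute matching, text acquisition, and attribute acquisition:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# Pair each a node's text with its href attribute
texts = html.xpath('//li/a/text()')
hrefs = html.xpath('//li/a/@href')
print(list(zip(texts, hrefs)))
# [('first item', 'link1.html'), ..., ('fifth item', 'link5.html')]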

For more on XPath usage, see: http://www.w3school.com.cn/xp....

For more on Python's lxml library, see: http://lxml.de/.
