Crawler: XPath, lxml module

Keywords: Python xml encoding pip Firefox

1. XPath

1.1 What is XPath

XPath (XML Path Language) is a query language for locating information in XML and HTML documents. It can be used to traverse the elements and attributes of a document.

1.2 XPath Development Tools

1.2.1 Chrome Plugin XPath Helper

https://jingyan.baidu.com/article/1e5468f94694ac484861b77d.html

1.2.2 Firefox Plugin XPath Checker

https://blog.csdn.net/menofgod/article/details/75646443

1.3 XPath Syntax

The syntax is covered in my article on Selenium basics:

https://www.cnblogs.com/liuhui0308/p/11937139.html
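
As a quick refresher, here is a minimal sketch of a few common XPath expressions, run with lxml (introduced in the next section). The sample markup and variable names are made up for illustration and are not taken from the linked article.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration
doc = etree.HTML('''
<div id="menu">
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
    </ul>
</div>
''')

# //tag selects matching elements anywhere in the document
all_li = doc.xpath('//li')
# [@attr="value"] filters on an attribute value
item1 = doc.xpath('//li[@class="item-1"]')
# /@attr selects attribute values, /text() selects text nodes
hrefs = doc.xpath('//li/a/@href')
texts = doc.xpath('//li/a/text()')
# [n] selects by position, counting from 1
first_href = doc.xpath('//li[1]/a/@href')

print(len(all_li), len(item1))
print(hrefs)
print(texts)
print(first_href)
```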

2. lxml module

lxml is a parsing library for HTML/XML. Its main job is to parse HTML/XML documents and extract data from them.

Like the regular expression module re, lxml is implemented in C. It is a high-performance HTML/XML parser for Python, and it can quickly locate specific elements and node information using the XPath syntax covered above.

Installable via pip:

pip install lxml -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com

2.1 Basic Use

We can use lxml to parse HTML code. If the HTML is malformed, for example because closing tags are missing, the parser repairs it automatically.

from lxml import etree

htmlText = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li> 
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li> 
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li> 
    </ul>
</div>
'''

# Use etree.HTML to parse the string into an HTML document
html = etree.HTML(htmlText)

# Serialize the HTML document back to a string
result = etree.tostring(html, encoding='utf-8', pretty_print=True).decode('utf-8')

print(result)
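
To make the automatic repair easier to see, here is a small sketch that feeds etree.HTML() a deliberately broken fragment. The fragment and the variable names are assumptions made for this example.

```python
from lxml import etree

# Deliberately broken fragment: the second <li> is never closed,
# and there is no <html> or <body> wrapper
broken = '<ul><li class="item-0">first item</li><li class="item-1">second item</ul>'

root = etree.HTML(broken)

# The parser adds the missing <html>/<body> wrapper and closes the open <li>
fixed = etree.tostring(root, encoding='unicode')
print(root.tag)
print(fixed)
```

The root element of the result is html, and both li elements can be selected normally even though the input was incomplete.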

2.2 Reading HTML from a file

In addition to parsing strings directly, lxml also supports reading from files.

HTML code (saved as hello.html):

<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="utf-8">
    <title></title>

</head>
<body>
    <div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
</body>
</html>

Then use the etree.parse() method to read the file.

from lxml import etree

html = etree.parse('./hello.html')
result = etree.tostring(html, encoding='utf-8', pretty_print=True).decode('utf-8')
print(result)

Result: the call raises lxml.etree.XMLSyntaxError.

Why does this fail? etree.parse() uses an XML parser by default, so it raises an error when the HTML is not well-formed XML, for example when a tag has no closing tag (such as the <meta> tag in the file above). In this case, create an HTML parser with etree.HTMLParser() and pass it to etree.parse() through the parser argument.

from lxml import etree

htmlParser = etree.HTMLParser(encoding='utf-8')
html = etree.parse('./hello.html', parser=htmlParser)
result = etree.tostring(html, encoding='utf-8', pretty_print=True).decode('utf-8')
print(result)
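
The difference between the two parsers can also be shown in one script. The sketch below writes a small temporary HTML file, tries the default XML parser, and falls back to the HTML parser when XMLSyntaxError is raised. The helper parse_file and the file content are hypothetical and exist only for this example.

```python
import os
import tempfile
from lxml import etree

# Small HTML file with an unclosed <meta> tag (valid HTML, not well-formed XML)
content = ('<html><head><meta charset="utf-8"><title>t</title></head>'
           '<body><p>hello</p></body></html>')
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(content)
    path = f.name

def parse_file(path):
    """Hypothetical helper: try the XML parser first, then the HTML parser."""
    try:
        return etree.parse(path), 'xml'
    except etree.XMLSyntaxError:
        parser = etree.HTMLParser(encoding='utf-8')
        return etree.parse(path, parser=parser), 'html'

tree, used = parse_file(path)
os.remove(path)

print(used)
print(tree.xpath('//p/text()'))
```

When the input is known to be HTML, passing etree.HTMLParser() directly, as above, is simpler. The fallback is only here to illustrate the error.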

2.3 Using XPath syntax in lxml

To run an XPath query, call the xpath() method on an element or on a parsed tree.

For expressions that select nodes, attributes, or text, xpath() returns a list (an empty list if nothing matches).
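
Note that the result type follows the XPath expression: node, attribute, and text selections come back as a list, while XPath functions such as count() or string() return a single value. A minimal sketch, using made-up markup:

```python
from lxml import etree

# Hypothetical one-item list, used only for illustration
doc = etree.HTML('<ul><li><a href="link1.html">first item</a></li></ul>')

nodes = doc.xpath('//li')          # list of elements
attrs = doc.xpath('//a/@href')     # list of strings
missing = doc.xpath('//table')     # empty list when nothing matches
total = doc.xpath('count(//li)')   # a float, not a list
label = doc.xpath('string(//a)')   # a string, not a list

print(type(nodes).__name__, type(attrs).__name__)
print(missing, total, label)
```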

Let's first match all li tags, and then the href attribute of every a tag:

from lxml import etree

htmlParser = etree.HTMLParser(encoding='utf-8')
html = etree.parse('./hello.html', parser=htmlParser)

lis = html.xpath('//li')
for li in lis:
    print(etree.tostring(li, encoding='utf-8', pretty_print=True).decode('utf-8'), end='')

aList = html.xpath('//a/@href')
for a in aList:
    print(a)

 

Get the href attribute and the text of the a tag under each li tag:

from lxml import etree

htmlParser = etree.HTMLParser(encoding='utf-8')
html = etree.parse('./hello.html', parser=htmlParser)

lis = html.xpath('//li')
for li in lis:
    # The leading . makes the match relative to the current li element
    href = li.xpath('.//a/@href')[0]   # href attribute of the a tag
    txt = li.xpath('.//a/text()')[0]   # text of the a tag
    print(href, txt)
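
Indexing with [0] assumes that every li contains an a tag. On real pages that is not always the case, and an empty result list would raise IndexError. A minimal sketch of a more defensive loop follows; the markup, where the second li has no link, is an assumption made for the example.

```python
from lxml import etree

# Hypothetical markup: the second <li> contains no <a> tag
doc = etree.HTML('''
<ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1">no link here</li>
</ul>
''')

rows = []
for li in doc.xpath('//li'):
    hrefs = li.xpath('./a/@href')
    texts = li.xpath('./a/text()')
    # Fall back to None instead of raising IndexError on an empty list
    href = hrefs[0] if hrefs else None
    txt = texts[0] if texts else None
    rows.append((href, txt))

print(rows)
```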

Posted by Marsha on Mon, 16 Dec 2019 21:30:53 -0800