1. XPath
1.1 What is XPath
XPath (XML Path Language) is a query language for locating information in XML and HTML documents. It can be used to traverse the elements and attributes of a document.
1.2 XPath Development Tools
1.2.1 Chrome Plugin XPath Helper
https://jingyan.baidu.com/article/1e5468f94694ac484861b77d.html
1.2.2 Firefox Plugin XPath Checker
https://blog.csdn.net/menofgod/article/details/75646443
1.3 XPath Syntax
The XPath syntax is covered in my article on Selenium basics:
https://www.cnblogs.com/liuhui0308/p/11937139.html
2. lxml module
lxml is a parsing library for HTML/XML. Its main purpose is to parse HTML/XML data and extract information from it.
Like the regular expression module, lxml is implemented in C. It is a high-performance Python HTML/XML parser, and with the XPath syntax learned earlier it can quickly locate specific elements and node information.
Install it with pip:
pip install lxml -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
2.1 Basic Use
We can use it to parse HTML code. If the HTML is not well-formed (for example, a closing tag is missing), lxml repairs it automatically while parsing.
from lxml import etree

htmlText = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''

# Parse the string into an HTML document with etree.HTML
html = etree.HTML(htmlText)

# Serialize the HTML document back to a string
result = etree.tostring(html, encoding='utf-8', pretty_print=True).decode('utf-8')
print(result)
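The HTML in the example above happens to be well-formed, so the automatic repair is not visible there. As a minimal sketch, the fragment below (made up for illustration) leaves the second li and the ul unclosed; serializing the parsed tree should show the closing tags, plus the html and body wrappers, added by the parser.

```python
from lxml import etree

# Hypothetical broken fragment: the second <li> and the <ul> are never closed.
broken = (
    '<div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a>'
    '</div>'
)

# etree.HTML repairs the markup and returns the root <html> element
root = etree.HTML(broken)

# Serialize to a str to inspect the repaired markup
repaired = etree.tostring(root, encoding='unicode')
print(repaired)
```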
2.2 Reading HTML from a file
In addition to parsing strings directly, lxml also supports reading from files.
HTML code (saved as hello.html):
<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="utf-8">
    <title></title>
</head>
<body>
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</body>
</html>
The etree.parse() method is then used to read the file.
from lxml import etree

# etree.parse() uses the XML parser by default
html = etree.parse('./hello.html')
result = etree.tostring(html, encoding='utf-8', pretty_print=True).decode('utf-8')
print(result)
Result: the script fails with an error. Why?
etree.parse() raises lxml.etree.XMLSyntaxError because it uses an XML parser by default, and the XML parser rejects HTML that is not well-formed XML, for example a tag with no closing tag (such as the meta tag in this file). To fix this, create an HTML parser with etree.HTMLParser() and pass it to etree.parse() through the parser argument.
from lxml import etree

# Create an HTML parser and pass it to etree.parse()
htmlParser = etree.HTMLParser(encoding='utf-8')
html = etree.parse('./hello.html', parser=htmlParser)
result = etree.tostring(html, encoding='utf-8', pretty_print=True).decode('utf-8')
print(result)
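To see both behaviours side by side without depending on hello.html, here is a self-contained sketch. It writes a small sample page (made up for illustration) to a temporary file, tries the default XML parser first, and then parses the same file with etree.HTMLParser().

```python
import os
import tempfile

from lxml import etree

# Hypothetical sample page: <meta> has no closing tag,
# which is fine in HTML but not well-formed XML.
sample = (
    '<html><head><meta charset="utf-8"><title>t</title></head>'
    '<body><p>hello</p></body></html>'
)

with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(sample)
    path = f.name

try:
    # Default XML parser: expected to reject the unclosed <meta> tag
    try:
        etree.parse(path)
        xml_failed = False
    except etree.XMLSyntaxError as exc:
        xml_failed = True
        print('XML parser failed:', exc)

    # HTML parser: tolerates the same file
    tree = etree.parse(path, parser=etree.HTMLParser(encoding='utf-8'))
    texts = tree.xpath('//p/text()')
    print(texts)
finally:
    os.remove(path)
```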
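2.3 Using XPath syntax in lxml
To run an XPath query, call the xpath() method on an element or on a parsed element tree.
For expressions that select elements, attributes, or text, xpath() returns a list, which is empty if nothing matches. Expressions that compute a value, such as count(), return that value directly instead.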
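A quick sketch of what xpath() returns for different kinds of expressions. The markup here is a made-up two-item list, not the hello.html file used elsewhere in this article.

```python
from lxml import etree

# Hypothetical sample markup for illustration
doc = etree.HTML('<ul><li class="item-0">a</li><li class="item-1">b</li></ul>')

elements = doc.xpath('//li')         # list of elements
classes = doc.xpath('//li/@class')   # list of attribute values (strings)
missing = doc.xpath('//table')       # no match: empty list
total = doc.xpath('count(//li)')     # XPath number: a Python float, not a list

print(type(elements), len(elements))
print(classes)
print(missing)
print(total)
```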
First, match all li tags, then the href attribute of every a tag:
from lxml import etree

htmlParser = etree.HTMLParser(encoding='utf-8')
html = etree.parse('./hello.html', parser=htmlParser)

# Select every li element and print its markup
lis = html.xpath('//li')
for li in lis:
    print(etree.tostring(li, encoding='utf-8', pretty_print=True).decode('utf-8'), end='')

# Select the href attribute of every a element
aList = html.xpath('//a/@href')
for a in aList:
    print(a)
Get the href attribute and the text of the a tag under each li tag:
from lxml import etree

htmlParser = etree.HTMLParser(encoding='utf-8')
html = etree.parse('./hello.html', parser=htmlParser)

lis = html.xpath('//li')
for li in lis:
    # The leading . makes the search relative to the current li element
    href = li.xpath('.//a/@href')[0]   # href attribute of the a tag
    txt = li.xpath('.//a/text()')[0]   # text of the a tag
    print(href, txt)
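Indexing with [0] works here because every li contains an a tag. On real pages some items may have no link, and [0] on an empty list raises IndexError. One possible guard is sketched below; the first() helper and the sample markup are hypothetical and not part of lxml.

```python
from lxml import etree

# Hypothetical sample: the second li has no a tag
html = etree.HTML('''
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1">no link here</li>
</ul>
''')


def first(results, default=None):
    """Return the first XPath result, or default if the list is empty."""
    return results[0] if results else default


rows = []
for li in html.xpath('//li'):
    href = first(li.xpath('.//a/@href'))   # None when the li has no a tag
    txt = first(li.xpath('.//a/text()'))
    rows.append((href, txt))

print(rows)
```

Whether to skip such items or keep them with a default value depends on what the extracted data is used for.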