1. lxml Library
lxml is an HTML/XML parser whose main function is to parse and extract HTML/XML data.
Like Python's regular expression engine, lxml is implemented in C. It is a high-performance Python HTML/XML parser, and we can use the XPath syntax learned earlier to quickly locate specific elements and node information.
Official lxml python documentation: http://lxml.de/index.html
lxml is built on C libraries (libxml2 and libxslt) and must be installed before use. It can be installed with pip: pip3 install lxml
2. Basic Use:
We can use it to parse HTML code. If the HTML is not well-formed, lxml automatically completes it during parsing.
When etree is used, PyCharm may underline it in red. Because lxml is implemented in C, the IDE cannot offer code completion for it, but the code runs without errors. The sample code is as follows:
    # Use the etree module from lxml
    from lxml import etree

    # Note: the closing </li> tag of the last list item is deliberately missing
    text = '''
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
    '''

    # Parse the string into an HTML document with etree.HTML
    html = etree.HTML(text)

    # Serialize the HTML document to bytes, then decode it for printing
    result = etree.tostring(html)
    print(result.decode('utf-8'))
The output is as follows:
    <html><body>
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    </body></html>
As you can see, lxml automatically repairs the HTML code. In this example it not only closes the li tag but also adds the html and body tags.
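The repair can also be checked programmatically instead of by reading the serialized output. The following is a minimal sketch, assuming a shortened version of the sample above; the variable names (broken, root_tag, li_count, serialized) are illustrative and not part of the original example.

```python
from lxml import etree

# A shortened fragment; the last <li> is deliberately left unclosed
broken = (
    '<div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a>'
    '</ul></div>'
)

# etree.HTML returns the root element of the repaired tree
root = etree.HTML(broken)
root_tag = root.tag

# Both <li> elements are found even though one was never closed
li_count = len(root.xpath('//li'))

# tostring returns bytes by default; decode it to get a str
serialized = etree.tostring(root).decode('utf-8')

print(root_tag, li_count)
print(serialized)
```

Counting elements and checking the root tag is usually more robust than comparing serialized strings, because whitespace in the output depends on the input.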
3. Read html code from file:
In addition to parsing strings directly, lxml also supports reading from files. Let's create a new hello.html file:
    <!-- hello.html -->
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
The etree.parse() method is then used to read the file. The sample code is as follows:
    from lxml import etree

    # Read the external file hello.html (the XML parser is used by default)
    html = etree.parse('hello.html')
    result = etree.tostring(html, pretty_print=True)
    print(result.decode('utf-8'))

    # If the HTML is not well-formed, the default XML parser raises an error.
    # In that case, pass an HTML parser explicitly with the parser argument:
    parser = etree.HTMLParser(encoding='utf-8')
    html = etree.parse('hello.html', parser=parser)
The output is the content of hello.html. If the HTML parser is used instead, the html and body tags are added, as in the previous example.
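The difference between the default XML parser and the HTML parser is easiest to see on malformed input. The following is a minimal sketch that parses an in-memory byte string instead of a file so that it is self-contained; the malformed fragment and the variable names are assumptions made for illustration.

```python
from lxml import etree

# Hypothetical malformed input: the <li> element is never closed
malformed = b'<div><ul><li>first item</ul></div>'

# The default XML parser is strict and rejects the mismatched tags
try:
    etree.fromstring(malformed)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser is lenient and repairs the markup instead
parser = etree.HTMLParser(encoding='utf-8')
root = etree.fromstring(malformed, parser=parser)
items = root.xpath('//li/text()')

print(xml_ok)
print(items)
```

The same parser object can be passed to etree.parse() when reading from a file.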
4. Use XPath syntax in lxml:
- Get all li tags:
    from lxml import etree

    html = etree.parse('hello.html')
    print(type(html))  # Show the type returned by etree.parse()

    result = html.xpath('//li')
    print(result)  # Print the list of <li> elements
- Get the class attribute values of all li elements:
    from lxml import etree

    html = etree.parse('hello.html')
    result = html.xpath('//li/@class')
    print(result)
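- Get the a tags under the li tags whose href is www.baidu.com:
    from lxml import etree

    html = etree.parse('hello.html')
    # hello.html contains no such link, so this prints an empty list;
    # a value such as "link1.html" would return a match
    result = html.xpath('//li/a[@href="www.baidu.com"]')
    print(result)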
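An attribute predicate returns an empty list when no element matches, which is easy to mistake for an error. The following is a small sketch using an in-memory fragment modeled on hello.html; the snippet and variable names are illustrative assumptions.

```python
from lxml import etree

# Two list items modeled on hello.html
snippet = (
    '<ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '</ul>'
)
root = etree.fromstring(snippet)

# No <a> has this href, so the result is an empty list rather than an exception
missing = root.xpath('//li/a[@href="www.baidu.com"]')

# An href that does exist returns the matching element
found = root.xpath('//li/a[@href="link1.html"]')

print(missing)
print(found[0].text)
```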
- Get all span tags under the li tags:
    from lxml import etree

    html = etree.parse('hello.html')

    # Note that the following is incorrect: / selects only direct children,
    # and <span> is not a direct child of <li>, so a double slash is needed
    # result = html.xpath('//li/span')
    result = html.xpath('//li//span')
    print(result)
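The difference between / and // can be verified directly. The following is a minimal sketch using a single list item copied from hello.html as an in-memory string; the variable names are illustrative.

```python
from lxml import etree

# One list item from hello.html: <span> is nested inside <a>, not directly inside <li>
snippet = (
    '<ul><li class="item-inactive">'
    '<a href="link3.html"><span class="bold">third item</span></a>'
    '</li></ul>'
)
root = etree.fromstring(snippet)

# Single slash: <span> must be a direct child of <li>
direct = root.xpath('//li/span')

# Double slash: <span> may appear at any depth below <li>
nested = root.xpath('//li//span')

print(len(direct), len(nested))
print(nested[0].text)
```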
- Get all class attributes inside the a tags under the li tags:
    from lxml import etree

    html = etree.parse('hello.html')
    result = html.xpath('//li/a//@class')
    print(result)
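It can help to compare this expression with //li/@class, since the two select different attributes. The following is a small sketch on an in-memory fragment taken from hello.html; the variable names are illustrative.

```python
from lxml import etree

# Two list items from hello.html; only the second <a> contains a <span> with a class
snippet = (
    '<ul>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-inactive"><a href="link3.html">'
    '<span class="bold">third item</span></a></li>'
    '</ul>'
)
root = etree.fromstring(snippet)

# class attributes on the <li> elements themselves
li_classes = root.xpath('//li/@class')

# class attributes on <a> and on anything nested inside <a>
a_classes = root.xpath('//li/a//@class')

print(li_classes)
print(a_classes)
```

Here the a tags carry no class attribute of their own, so the second query returns only the class of the nested span.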
- Get the href attribute value of the a tag in the last li:
    from lxml import etree

    html = etree.parse('hello.html')
    # The predicate [last()] selects the last element
    result = html.xpath('//li[last()]/a/@href')
    print(result)
- Get the content of the second-to-last li element:
    from lxml import etree

    html = etree.parse('hello.html')
    result = html.xpath('//li[last()-1]/a')
    # The text attribute holds the element's text content
    print(result[0].text)
- A second way to get the content of the second-to-last li element:
    from lxml import etree

    html = etree.parse('hello.html')
    result = html.xpath('//li[last()-1]/a/text()')
    print(result)
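The two approaches return different kinds of objects: the first yields elements whose text is read through an attribute, while the second yields strings directly. The following is a minimal sketch on an in-memory list of three items; the fragment and variable names are assumptions made for illustration.

```python
from lxml import etree

# A short list modeled on hello.html
items = (
    '<ul>'
    '<li><a href="link1.html">first item</a></li>'
    '<li><a href="link2.html">second item</a></li>'
    '<li><a href="link3.html">third item</a></li>'
    '</ul>'
)
root = etree.fromstring(items)

# Selecting the element: the result is a list of elements
elements = root.xpath('//li[last()-1]/a')
text_from_element = elements[0].text

# Selecting text(): the result is a list of strings
strings = root.xpath('//li[last()-1]/a/text()')

# [last()] on its own selects the final <li>
last_href = root.xpath('//li[last()]/a/@href')

print(text_from_element)
print(strings)
print(last_href)
```

Selecting the element is convenient when other attributes of the same node are needed as well; text() is shorter when only the string matters.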