Python Crawler lxml Library

Keywords: xml Python pip Pycharm

1. lxml Library

lxml is an HTML/XML parser whose main function is to parse and extract HTML/XML data.

Like Python's regular expression module (re), lxml is implemented in C. It is a high-performance HTML/XML parser for Python that lets us quickly locate specific elements and nodes using the XPath syntax covered earlier.

Official lxml python documentation: http://lxml.de/index.html

Because lxml is a C extension rather than part of the standard library, it must be installed separately. This can be done with pip: pip3 install lxml
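
To confirm that the installation worked, a quick check along the following lines can help. This is only a sketch: LXML_VERSION is a version tuple exposed by lxml.etree, and the numbers printed depend on the release that pip installed.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers; its values depend on the installed release
print("lxml version:", ".".join(str(n) for n in etree.LXML_VERSION))

# Parse a tiny fragment to make sure the C extension loads and works
root = etree.HTML("<p>hello</p>")
print(root.tag)  # the HTML parser wraps the fragment, so the root is <html>
```

If the import fails, the package was most likely installed for a different Python interpreter than the one running the script.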

2. Basic Use:

We can use it to parse HTML code. If the HTML is not well-formed, lxml automatically repairs it while parsing. The sample code is as follows:

When etree is imported, PyCharm may underline it in red. Because lxml is implemented in C, the IDE cannot inspect it and offers no code completion, but the code runs without errors.

# Import the etree module from lxml
from lxml import etree 

# Note: the closing </li> tag of the last item is deliberately missing
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

# Use etree.HTML to parse the string into an HTML document
html = etree.HTML(text)

# Serialize the HTML document to a byte string
result = etree.tostring(html)

# tostring() returns bytes, so decode before printing
print(result.decode('utf-8'))

The output is as follows:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>

As you can see, lxml automatically repairs the HTML code. In this example it not only completes the li tag but also adds the html and body tags.
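
The repair can also be checked programmatically rather than by eye. The following sketch assumes a shortened version of the sample above (the variable names broken and as_text are only illustrative). It counts the li elements after parsing and serializes the tree to a string instead of bytes.

```python
from lxml import etree

# Shortened sample: the second <li> is left unclosed on purpose
broken = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a>
    </ul>
</div>
'''

html = etree.HTML(broken)

# The parser wraps the fragment in <html><body> and closes the open <li>
print(html.tag)                  # name of the root element
print(len(html.xpath('//li')))   # number of <li> elements found

# tostring() returns bytes by default; encoding='unicode' returns a str instead
as_text = etree.tostring(html, encoding='unicode')
print(as_text)
```

Passing encoding='unicode' is one way to avoid the b'...' prefix when printing; calling .decode('utf-8') on the bytes result works as well.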

3. Read html code from file:

In addition to parsing strings directly, lxml also supports reading from files. Let's create a new hello.html file:

<!-- hello.html -->
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

The etree.parse() method is then used to read the file. The sample code is as follows:

from lxml import etree

# Read the external file hello.html
html = etree.parse('hello.html')
result = etree.tostring(html, pretty_print=True)

print(result.decode('utf-8'))  # tostring() returns bytes, so decode before printing


# etree.parse() uses the XML parser by default. If the HTML is not well-formed
# and parsing fails, pass an HTML parser explicitly through the parser argument
parser = etree.HTMLParser(encoding='utf-8')
html = etree.parse('hello.html', parser=parser)

The output contains the same list as before.
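
To make the difference between the two parsers concrete, here is a minimal sketch. It assumes a hypothetical one-line page with an unclosed li tag, written to a temporary file so that it does not depend on hello.html. The names page, path, xml_ok and hrefs are illustrative only.

```python
import os
import tempfile
from lxml import etree

# Hypothetical page that is not well-formed: the <li> is never closed
page = '<div><ul><li class="item-0"><a href="link1.html">first item</a></ul></div>'

# Write the page to a temporary file so the example runs on its own
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(page)
    path = f.name

try:
    # The default XML parser rejects markup that is not well-formed
    try:
        etree.parse(path)
        xml_ok = True
    except etree.XMLSyntaxError as exc:
        xml_ok = False
        print('XML parser failed:', exc)

    # The HTML parser accepts the same file and repairs it
    parser = etree.HTMLParser(encoding='utf-8')
    tree = etree.parse(path, parser=parser)
    hrefs = tree.xpath('//li/a/@href')
    print(hrefs)
finally:
    os.remove(path)
```

For pages fetched by a crawler, which are rarely well-formed, passing an HTMLParser is usually the safer default.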

4. Use XPath syntax in lxml:

  1. Get all li tags:

     from lxml import etree
    
     html = etree.parse('hello.html')
     print(type(html))  # Show the type returned by etree.parse()
    
     result = html.xpath('//li')
    
     print(result)  # Print the list of <li> elements
    
  2. Get the class attribute values of all li elements:

     from lxml import etree
    
     html = etree.parse('hello.html')
     result = html.xpath('//li/@class')
    
     print(result)
    
  3. Get the a tags under the li tags whose href is www.baidu.com:

     from lxml import etree
    
     html = etree.parse('hello.html')
     result = html.xpath('//li/a[@href="www.baidu.com"]')
    
     print(result)  # hello.html contains no such link, so this prints []
    
  4. Get all span tags under the li tags:

     from lxml import etree
    
     html = etree.parse('hello.html')
    
     # result = html.xpath('//li/span')
     # Note that the line above is incorrect:
     # / selects only direct children, and <span> is not a direct child of <li>, so use a double slash
    
     result = html.xpath('//li//span')
    
     print(result)
    
  5. Get all class attributes inside the a tags under the li tags:

     from lxml import etree
    
     html = etree.parse('hello.html')
     result = html.xpath('//li/a//@class')
    
     print(result)
    
  6. Get the href attribute value of the a tag in the last li:

     from lxml import etree
    
     html = etree.parse('hello.html')
    
     result = html.xpath('//li[last()]/a/@href')
     # The last element can be found by the predicate [last()]
    
     print(result)
    
  7. Get the content of the second-to-last li element:

     from lxml import etree
    
     html = etree.parse('hello.html')
     result = html.xpath('//li[last()-1]/a')
    
     # The text attribute returns the element's text content
     print(result[0].text)
    
  8. A second way to get the content of the second-to-last li element:

     from lxml import etree
    
     html = etree.parse('hello.html')
     result = html.xpath('//li[last()-1]/a/text()')
    
     print(result)
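
The queries above can be gathered into one script that runs on its own. The sketch below keeps the markup of hello.html inline and parses it with etree.HTML instead of reading the file, which is an assumption made for convenience. The variable names are illustrative. It also shows a case the list does not cover: the text attribute is None when an element's text sits inside a nested tag, and the XPath string() function is one way around that.

```python
from lxml import etree

# Same markup as hello.html, kept inline so the example runs on its own
text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
html = etree.HTML(text)

# Attribute values come back as a list of strings
classes = html.xpath('//li/@class')
print(classes)

# / matches direct children only, // matches descendants at any depth
child_spans = html.xpath('//li/span')
all_spans = html.xpath('//li//span')
print(len(child_spans), len(all_spans))  # 0 1

# Predicates with last() select by position from the end
last_href = html.xpath('//li[last()]/a/@href')
second_last_text = html.xpath('//li[last()-1]/a/text()')
print(last_href, second_last_text)  # ['link5.html'] ['fourth item']

# The third <a> has no text of its own because the text lives in the nested <span>
third_link = html.xpath('//li[3]/a')[0]
print(third_link.text)                # None
print(third_link.xpath('string(.)'))  # third item
```

Note that xpath() returns a list even when only one node matches, so a single result still has to be read with an index such as [0].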

 

 

Posted by vicky57t on Tue, 13 Aug 2019 18:41:13 -0700