Web Crawler Notes 7: Extracting Text from XML and HTML with XPath

Keywords: Big Data xml Attribute encoding Firefox

Extensible Markup Language (XML) is a markup language designed to transport and store data. Details: http://www.w3school.com.cn/xml
HyperText Markup Language (HTML) is the main language for writing web pages on the WWW. Details: http://www.w3school.com.cn/html

Both XML and HTML are markup languages: they describe data with tags, and those tags can be used to find and locate the data.

Here is an example of an XML document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
  • Parent: every element and attribute has exactly one parent. In the example above, the book element is the parent node of the title and price elements.
  • Children: an element node can have zero, one or more children. In the example above, the title and price elements are child nodes of the book element.
  • Sibling: nodes that share the same parent. In the example above, title and price are sibling nodes.
  • Ancestor: a node's parent, its parent's parent, and so on. In the example above, the ancestors of title and price are book and bookstore.
  • Descendant: a node's children, its children's children, and so on. In the example above, the descendants of bookstore are book, title and price.
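The relationships above can be checked in code. The following is a minimal sketch that assumes lxml (the library used later in these notes) is installed. The variable names are illustrative only, and the navigation methods (getparent, getnext, iterancestors, iterdescendants) belong to lxml, not to XPath itself.

```python
from lxml import etree

# The bookstore document from above, as bytes because it carries an encoding declaration
xml = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>"""

root = etree.fromstring(xml)   # the bookstore element
title = root[0][0]             # title of the first book

parent = title.getparent().tag                          # 'book'
children = [child.tag for child in root[0]]             # ['title', 'price']
sibling = title.getnext().tag                           # 'price'
ancestors = [a.tag for a in title.iterancestors()]      # ['book', 'bookstore']
descendants = [d.tag for d in root.iterdescendants()]   # every book, title and price

print(parent, children, sibling, ancestors, descendants)
```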

XPath syntax

XPath (XML Path Language) is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document. Details can be found in the W3School documentation: http://www.w3school.com.cn/xpath/index.asp

XPath uses path expressions to select nodes or node sets in an XML document. Nodes are selected by following paths or steps.

XPath development tools include:

  • Open-source XPath expression editor: XMLQuire (works on XML files)
  • Chrome plug-in: XPath Helper
  • Firefox plug-in: XPath Checker
# XPath Common Path Expression Learning
'''
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
'''

# The most commonly used path expressions are listed below.
'bookstore'        # Select all bookstore elements that are children of the current node (a relative path)
'/bookstore'       # Select the root element bookstore (a path that starts with / is always an absolute path)
'bookstore/book'   # Select all book elements that are children of bookstore
'//book'           # Select all book elements, wherever they are in the document
'bookstore//book'  # Select all book elements that are descendants of bookstore, at any depth
'//@lang'          # Select all attributes named lang
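As a quick check of the path expressions above, here is a small sketch that evaluates them with lxml's xpath() method against the same bookstore document. The variable names are illustrative. Note that XPath names are case-sensitive, so '//Book' matches nothing in this document.

```python
from lxml import etree

doc = etree.fromstring("""
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
""")  # doc is the bookstore element, which is also the context node below

root_elems = doc.xpath('/bookstore')      # absolute path: the root element itself
relative = doc.xpath('bookstore')         # relative path: bookstore has no bookstore child
child_books = doc.xpath('/bookstore/book')
all_books = doc.xpath('//book')
wrong_case = doc.xpath('//Book')          # element names are case-sensitive
langs = doc.xpath('//@lang')              # attribute values come back as strings

print(len(root_elems), len(relative), len(child_books), len(all_books), len(wrong_case), langs)
```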

# Predicates
'/bookstore/book[1]'    # Select the first book element that is a child of bookstore
'/bookstore/book[last()]'    # Select the last book element that is a child of bookstore
'/bookstore/book[last()-1]'  # Select the second-to-last book element that is a child of bookstore
'/bookstore/book[position() < 3]'    # Select the first two book elements that are children of bookstore
'//title[@lang]'  # Select all title elements that have an attribute named lang
'//title[@lang="eng"]'  # Select all title elements whose lang attribute has the value eng
'/bookstore/book[price > 35.00]'   # Select all book elements under bookstore whose price child is greater than 35.00
'/bookstore/book[price > 35.00]/title'  # Select the title children of the book elements under bookstore whose price is greater than 35.00
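The predicates can be tried out in the same way. This is a sketch with illustrative variable names, again assuming lxml. It also shows why position() needs its parentheses: without them, position is read as a child element name and the predicate matches nothing.

```python
from lxml import etree

doc = etree.fromstring("""
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
""")

first = doc.xpath('/bookstore/book[1]/title/text()')             # XPath positions start at 1
last = doc.xpath('/bookstore/book[last()]/title/text()')
first_two = doc.xpath('/bookstore/book[position() < 3]')
no_parens = doc.xpath('/bookstore/book[position < 3]')           # "position" is taken as an element name
eng_titles = doc.xpath('//title[@lang="eng"]')
expensive = doc.xpath('/bookstore/book[price > 35.00]/title/text()')

print(first, last, len(first_two), len(no_parens), len(eng_titles), expensive)
```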

# XPath wildcards
'/bookstore/*'  # Select all child elements of the bookstore element
'//*'  # Select all elements in the document
'//title[@*]'  # Select all title elements that have at least one attribute

# The | operator selects several paths at once
'//book/title | //book/price'  # Select all title and price elements of book elements
'//title | //price'  # Select all title and price elements in the document
'/bookstore/book/title | //price'  # Select all title elements under /bookstore/book, and all price elements in the document
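A short sketch of wildcards and the | operator, once more assuming lxml and using illustrative variable names:

```python
from lxml import etree

doc = etree.fromstring("""
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
""")

child_tags = [e.tag for e in doc.xpath('/bookstore/*')]   # direct children of bookstore
all_tags = [e.tag for e in doc.xpath('//*')]              # every element, including the root
with_attrs = doc.xpath('//title[@*]')                     # titles carrying any attribute
union_tags = [e.tag for e in doc.xpath('//book/title | //book/price')]

print(child_tags)
print(all_tags)
print(len(with_attrs), union_tags)
```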

lxml Library

lxml is a high-performance HTML/XML parser for Python whose main job is to parse HTML/XML and extract data from it. Like Python's regular expression module (re), lxml is implemented in C. With it, we can quickly locate specific elements and node information using the XPath syntax covered above.

# Examples of using lxml to parse xml documents

from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''

# Using etree.HTML to parse strings into HTML documents
html = etree.HTML(text)
# Serialize the HTML document back to a string (tostring returns bytes)
result = etree.tostring(html)
print(result.decode('utf-8'))

# Note: lxml automatically adds the missing </li> closing tag, as well as the html and body tags

Output:

<html><body><div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </li></ul>
</div>
</body></html>

In addition to parsing strings directly, lxml can also read from files:

# Example of reading an XML document from a file with lxml
from lxml import etree

# Read the external file axmldoc.xml
html = etree.parse('./axmldoc.xml')
result = etree.tostring(html, pretty_print=True)

print(result.decode('utf-8'))
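One caveat: by default, etree.parse uses the strict XML parser, so an HTML file with unclosed tags will typically fail to parse. A minimal sketch of the difference follows. The file name sample.html and its contents are made up for illustration. Passing etree.HTMLParser() as the second argument makes lxml repair the markup the same way etree.HTML does for strings.

```python
import os
import tempfile
from lxml import etree

# Hypothetical HTML with a missing </li>, written to a temporary file for the demo
html_text = '<div><ul><li class="item-0"><a href="link1.html">first item</a></ul></div>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html_text)

    # The default XML parser rejects markup that is not well-formed
    try:
        etree.parse(path)
        strict_ok = True
    except etree.XMLSyntaxError:
        strict_ok = False

    # The HTML parser tolerates and repairs the missing closing tag
    tree = etree.parse(path, etree.HTMLParser())
    hrefs = tree.xpath('//li/a/@href')

print(strict_ok, hrefs)
```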

XPath Selection Information Practice

  1. Get all li tags
  2. Get the class attribute of every li tag
  3. Get the a tag whose href is link1.html under the li tags
  4. Get all span tags under the li tags
  5. Get all class attributes inside the a tags under the li tags
  6. Get the href of the a tag in the last li
  7. Get the text of the a tag in the second-to-last li
  8. Get the tag name of the element whose class value is bold
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
html = etree.HTML(text)

# Get all li tags
result = html.xpath('//li')
print(result)
print('---'*10)

# Get the class attribute of every li tag
result = html.xpath('//li/@class')
print(result)
print('---'*10)

# Get the a tag whose href is link1.html under the li tags
result = html.xpath('//li/a[@href="link1.html"]')
print(result)
print('---'*10)

# Get all span tags under the li tags (at any depth)
result = html.xpath('//li//span')
print(result)
print('---'*10)

# Get all class attributes inside the a tags under the li tags
result = html.xpath('//li/a//@class')
print(result)
print('---'*10)

# Get the href of the a tag in the last li
result = html.xpath('//li[last()]/a/@href')
print(result)
print('---'*10)

# Get the text of the a tag in the second-to-last li
result = html.xpath('//li[last()-1]/a/text()')
print(result)
print('---'*10)

# Get the tag name of the element whose class value is bold
result = html.xpath('//*[@class="bold"]')
print(result[0].tag)

Output:

[<Element li at 0x1ef3eea6608>, <Element li at 0x1ef3eebd1c8>, <Element li at 0x1ef3eebd208>, <Element li at 0x1ef3eebd248>, <Element li at 0x1ef3eebd288>]
------------------------------
['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
------------------------------
[<Element a at 0x1ef3eebd308>]
------------------------------
[<Element span at 0x1ef3ee0e088>]
------------------------------
['bold']
------------------------------
['link5.html']
------------------------------
['fourth item']
------------------------------
span
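In a real crawl, the usual goal is to pair each link with its text. The sketch below is one possible way to do that on the same sample HTML, and the variable names are illustrative. It uses itertext() rather than text() because text() only returns text nodes that sit directly inside a, so "third item", which is wrapped in a span, would be skipped.

```python
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
html = etree.HTML(text)

# text() misses text nested in child elements such as <span>
direct_text = html.xpath('//li/a/text()')

# itertext() walks all text inside each a element, nested or not
links = [(a.get('href'), ''.join(a.itertext())) for a in html.xpath('//li/a')]

print(direct_text)
for href, label in links:
    print(href, label)
```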

Posted by Guardian2006 on Sat, 26 Jan 2019 09:42:14 -0800