Information Markup and Extraction for Python Web Crawlers (XML, JSON, YAML)

Keywords: Python Attribute xml JSON

Information Markup

Marking up information gives it an organizational structure and adds a dimension to it.
The structure of the markup is as valuable as the information itself.
Marked-up information can be used for communication, storage, or display.
Marked-up information is easier for programs to understand and use.

HTML organizes different types of information using predefined <>…</> tags.

Types of Information Markup

XML
JSON
YAML

XML

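As a minimal sketch of what XML markup looks like and how a program reads it, the snippet below parses a small record with the standard-library xml.etree.ElementTree module. The record itself (the person, the tag names, the attribute names) is invented for illustration and does not come from the original article.

```python
import xml.etree.ElementTree as ET

# Hypothetical record, invented for illustration.
# XML marks information with tags: <name attribute="value"> ... </name>
XML_TEXT = """
<person>
  <firstName>Li</firstName>
  <lastName>Wei</lastName>
  <address city="Beijing" zipcode="100081">Example Street 5</address>
  <prof>Computer System</prof>
  <prof>Security</prof>
</person>
"""

root = ET.fromstring(XML_TEXT.strip())

first_name = root.findtext("firstName")          # text between the tags
city = root.find("address").get("city")          # attribute stored on the tag
profs = [p.text for p in root.findall("prof")]   # repeated tags form a list

print(first_name, city, profs)
```

Note how both the tag content and the tag attributes carry information, which is part of what makes XML extensible but verbose.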

JSON

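A comparable sketch for JSON, using the standard-library json module. The record is the same kind of invented example (its keys and values are assumptions, not taken from the original article); the point is that JSON key-value pairs carry types that map directly onto program data types.

```python
import json

# Hypothetical record, invented for illustration.
# JSON marks information with typed "key": value pairs.
JSON_TEXT = """
{
  "firstName": "Li",
  "lastName": "Wei",
  "address": {"city": "Beijing", "zipcode": "100081"},
  "prof": ["Computer System", "Security"],
  "age": 30,
  "active": true
}
"""

record = json.loads(JSON_TEXT)

# Nested objects become dicts, arrays become lists,
# and numbers and booleans keep their types.
print(record["address"]["city"])
print(record["prof"])
print(type(record["age"]).__name__, type(record["active"]).__name__)
```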

YAML

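A sketch of the same kind of record in YAML. Python has no YAML parser in its standard library, so this example assumes the third-party PyYAML package (imported as yaml) and falls back to printing the text when it is not installed. The record is invented for illustration.

```python
# Hypothetical record, invented for illustration.
# YAML marks information with untagged "key: value" pairs.
YAML_TEXT = """\
firstName: Li
lastName: Wei
address:            # nesting is expressed by indentation
  city: Beijing
  zipcode: "100081"
prof:               # a leading "-" marks each item of a list
  - Computer System
  - Security
"""

try:
    import yaml  # PyYAML, a third-party package (assumed: pip install pyyaml)
except ImportError:
    yaml = None

if yaml is not None:
    record = yaml.safe_load(YAML_TEXT)
    print(record["address"]["city"], record["prof"])
else:
    print("PyYAML is not installed; showing the raw YAML text instead:")
    print(YAML_TEXT)
```

The "#" comments inside the YAML text are part of the format itself, which is one reason YAML is popular for configuration files.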

Comparison of the Three Markup Types

XML: the earliest general-purpose information markup language; extensible but verbose.
JSON: information is typed and well suited to processing by programs (especially JavaScript); more concise than XML.
YAML: information is written without type markers, the proportion of actual text content is the highest, and readability is good.

XML: information exchange and transfer on the Internet.
JSON: communication between the cloud and nodes in mobile applications; does not support comments.
YAML: configuration files for all kinds of systems; supports comments and is easy to read.

Information Extraction

Information extraction means pulling the content of interest out of marked-up information.

Method 1: Fully parse the markup form of the information, then extract the key information.
Applies to XML, JSON, and YAML.
Requires a markup parser, for example tag-tree traversal with the bs4 library.
Advantage: the information is parsed accurately.
Disadvantage: the extraction process is tedious and slow.
Method 2: Ignore the markup form and search for the key information directly.
Uses search only.
Requires a text lookup function applied to the information.
Advantage: the extraction process is simple and fast.
Disadvantage: the accuracy of the result depends on the information content.
Combined method: combine format parsing with searching to extract the key information.
Applies to XML, JSON, and YAML, together with search.
Requires both a markup parser and a text lookup function.
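
The trade-off between Method 1 and Method 2 can be sketched on a small JSON text using only the standard library. The data, the function names extract_by_parsing and extract_by_search, and the regular expression are all illustrative assumptions rather than part of the original article.

```python
import json
import re

# Hypothetical data: "name" appears both for people and for cities.
TEXT = """
{
  "people": [
    {"name": "Li Wei", "city": {"name": "Beijing"}},
    {"name": "Zhang Min", "city": {"name": "Shanghai"}}
  ]
}
"""


def extract_by_parsing(text):
    """Method 1: parse the whole markup, then walk the structure."""
    data = json.loads(text)
    return [person["name"] for person in data["people"]]


def extract_by_search(text):
    """Method 2: ignore the markup structure and search the raw text."""
    return re.findall(r'"name"\s*:\s*"([^"]*)"', text)


print(extract_by_parsing(TEXT))  # only the people's names
print(extract_by_search(TEXT))   # every "name" value, cities included
```

The search version is shorter and needs no parser, but because it cannot tell a person's name from a city's name, its accuracy depends on what the text happens to contain.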

Example

Extract all URL links in HTML

Approach:

1. Search for all <a> tags.
2. Parse the <a> tag format and extract the link that follows href.
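
The two steps can be sketched with bs4 as follows. The HTML is the demo page printed later in this article, inlined here so that no network request is needed; the names DEMO_HTML and links are illustrative, and the built-in "html.parser" backend is an assumption about which parser to use.

```python
from bs4 import BeautifulSoup

# The demo page used in this article, inlined so no network access is needed.
DEMO_HTML = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(DEMO_HTML, "html.parser")

# Step 1: search for every <a> tag.
# Step 2: read the href attribute of each tag found.
links = [a.get("href") for a in soup.find_all("a")]

for link in links:
    print(link)
```

This is the combined method in miniature: find_all does the searching, and reading the href attribute relies on the parsed tag format.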

Extracting HTML Information with bs4

<>.find_all(name, attrs, recursive, string, **kwargs)
name: search string for the tag name
Returns a list that holds the search results.
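
A short sketch of the forms the name argument can take. The HTML fragment is a shortened, made-up stand-in for the demo page, and the variable names are illustrative.

```python
import re

from bs4 import BeautifulSoup

# Shortened stand-in for the demo page (hypothetical fragment).
html = (
    '<body><p class="title"><b>Demo</b></p>'
    '<p class="course"><a id="link1" href="#1">Basic</a> and '
    '<a id="link2" href="#2">Advanced</a></p></body>'
)
soup = BeautifulSoup(html, "html.parser")

one_name = soup.find_all("a")                  # a single tag name
two_names = soup.find_all(["a", "b"])          # a list of tag names
all_tags = [t.name for t in soup.find_all(True)]             # True matches every tag
b_like = [t.name for t in soup.find_all(re.compile("^b"))]   # names starting with "b"

print(len(one_name), [t.name for t in two_names])
print(all_tags)
print(b_like)
```

Results come back in document order, so a regular expression such as "^b" matches both <body> and <b>.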

attrs: search string for tag attribute values; a bare string is matched against the class attribute, and a specific attribute can be named as a keyword argument (for example id='link1')

>>> soup.find_all('p', 'course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all('p','title')
[<p class="title"><b>The demo python introduces several python courses.</b></p>]
>>> soup.find_all(id = 'link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]

>>> import re
>>> soup.find_all(id = re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

recursive: whether to search all descendants (default True); when False, only direct children are searched

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]
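
The empty list above follows from how recursive=False works: it only inspects the direct children of the object it is called on, and the only direct child of soup is <html>. A minimal sketch on a shortened, made-up fragment of the demo page (variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page (hypothetical fragment).
html = (
    '<html><body><p class="course">'
    '<a id="link1" href="#1">Basic</a> and '
    '<a id="link2" href="#2">Advanced</a>'
    '</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# From the root, the <a> tags are several levels down, so nothing is found.
from_root = soup.find_all("a", recursive=False)

# From the enclosing <p>, the <a> tags are direct children, so both are found.
course = soup.find("p", "course")
from_parent = course.find_all("a", recursive=False)

print(len(from_root), len(from_parent))
```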

string: search string for the text between <>…</>

>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string='Basic Python')
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']

<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)
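
A quick way to check this shorthand, sketched on a made-up fragment (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, for illustration only.
html = (
    '<p class="course"><a id="link1" href="#1">Basic</a> and '
    '<a id="link2" href="#2">Advanced</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object directly behaves like soup.find_all(..).
short_form = soup("a")
long_form = soup.find_all("a")

# The same holds for any tag object, here the first <p>.
tag_short = soup.p("a", id="link2")
tag_long = soup.p.find_all("a", id="link2")

print(short_form == long_form, tag_short == tag_long)
```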

Posted by xyn on Wed, 12 Dec 2018 02:03:05 -0800
