python3-cookbook notes: Chapter VI Data Encoding and Processing


Each section of python3-cookbook presents the best Python 3 solution to a given problem in three parts (problem, solution, and discussion), showing how Python 3's own data structures, functions, classes, and so on can be put to better use. The book is very helpful for understanding Python 3 and improving your Python programming skills, especially when it comes to program performance. It is strongly recommended if you have the time.
These are study notes. They cover only the parts of the book that I picked out according to my own work needs and everyday use. Most of the sample code is copied directly from the original text, and most of it has been verified under Python 3.6. What matters in programming varies from field to field, so read the full book if you are interested.
python3-cookbook: https://python3-cookbook.readthedocs.io/zh_CN/latest/index.html

 

6.1 Read and write CSV data

For CSV files, unless special processing is required, always use the csv module to read and write them, so as to minimize surprises. Here are a few simple examples of reading and writing CSV files:

The CSV file stocks.csv contains the following:

Symbol,Price,Date,Time,Change,Volume
"AA",39.48,"6/11/2007","9:36am",-0.18,181800
"AIG",71.38,"6/11/2007","9:36am",-0.15,195500
"AXP",62.58,"6/11/2007","9:36am",-0.46,935000
"BA",98.31,"6/11/2007","9:36am",+0.12,104800
"C",53.08,"6/11/2007","9:36am",-0.25,360900
"CAT",78.29,"6/11/2007","9:36am",-0.23,225400

import csv

# Read each row as a list
with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    # headers and each row are lists
    print(headers)
    for row in f_csv:
        print(row)

import csv

# Read each row as a dictionary
with open('stocks.csv') as f:
    f_csv = csv.DictReader(f)
    # Each row is an OrderedDict (a plain dict from Python 3.8 onward)
    for row in f_csv:
        # The first row printed is:
        # OrderedDict([('Symbol', 'AA'), ('Price', '39.48'), ('Date', '6/11/2007'), ('Time', '9:36am'), ('Change', '-0.18'), ('Volume', '181800')])
        print(row)

import csv

headers = ['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume']
rows = [('AA', 39.48, '6/11/2007', '9:36am', -0.18, 181800),
        ('AIG', 71.38, '6/11/2007', '9:36am', -0.15, 195500),
        ('AXP', 62.58, '6/11/2007', '9:36am', -0.46, 935000),
       ]

# Write the data from lists
# newline='' leaves line endings to the csv module (avoids blank rows on Windows)
with open('stocks.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    # Write a single row
    f_csv.writerow(headers)
    # Write multiple rows
    f_csv.writerows(rows)

headers = ['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume']
rows = [{'Symbol':'AA', 'Price':39.48, 'Date':'6/11/2007',
        'Time':'9:36am', 'Change':-0.18, 'Volume':181800},
        {'Symbol':'AIG', 'Price': 71.38, 'Date':'6/11/2007',
        'Time':'9:36am', 'Change':-0.15, 'Volume': 195500},
        {'Symbol':'AXP', 'Price': 62.58, 'Date':'6/11/2007',
        'Time':'9:36am', 'Change':-0.46, 'Volume': 935000},
        ]

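# Write the data from dictionaries
# newline='' leaves line endings to the csv module (avoids blank rows on Windows)
with open('stocks.csv', 'w', newline='') as f: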
    f_csv = csv.DictWriter(f, headers)
    f_csv.writeheader()
    f_csv.writerows(rows)
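
One point the examples above do not show is that the csv module returns every field as a string. The following is a minimal sketch of converting fields while reading, assuming the column layout of stocks.csv. The names CSV_TEXT, col_types, and read_typed_rows are made up for this illustration, and the data is read from an in-memory string rather than a file so that the sketch is self-contained.

```python
import csv
import io

# In-memory stand-in for the first rows of stocks.csv (assumption for this sketch)
CSV_TEXT = '''Symbol,Price,Date,Time,Change,Volume
"AA",39.48,"6/11/2007","9:36am",-0.18,181800
"AIG",71.38,"6/11/2007","9:36am",-0.15,195500
'''

# One converter per column, in header order
col_types = [str, float, str, str, float, int]


def read_typed_rows(f):
    """Yield each data row as a tuple with its fields converted."""
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        yield tuple(convert(value) for convert, value in zip(col_types, row))


rows = list(read_typed_rows(io.StringIO(CSV_TEXT)))
for row in rows:
    print(row)
```

Real data is often messier than this (missing values, unexpected formats), so in practice the conversions may need error handling.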

 

 

6.3 Parsing simple XML data

As the title says, this section covers only simple XML parsing. For smaller, less complex XML files, the built-in xml.etree.ElementTree is enough. For complex XML documents, consider the third-party library lxml, which is more powerful and faster. The sample code below works with lxml as well: just replace the ElementTree import with from lxml.etree import parse.

from urllib.request import urlopen
from xml.etree.ElementTree import parse

# Download the XML file and parse it
u = urlopen('http://planet.python.org/rss20.xml')
doc = parse(u)

# Find the title node under the channel node
e = doc.find('channel/title')
# Print the node's tag name: title
print(e.tag)
# Print the node's text: Planet Python
print(e.text)
# Print the value of one of the node's attributes; this node has no attributes, so get('xxx') returns None
print(e.get('xxx'))

# Iterate over the item nodes under channel
for item in doc.iterfind('channel/item'):
    # Look up the text of the corresponding child nodes within item
    title = item.findtext('title')
    date = item.findtext('pubDate')
    link = item.findtext('link')

    print(title)
    print(date)
    print(link)
    print()

Output:

title
Planet Python
None
Codementor: Automating Everything With Python: Reading Time: 3 Mins
Sat, 22 Feb 2020 09:01:58 +0000
https://www.codementor.io/maxongzb/automating-everything-with-python-reading-time-3-mins-13v57qt7y6

Quansight Labs Blog: My Unexpected Dive into Open-Source Python
Fri, 21 Feb 2020 18:38:07 +0000
https://labs.quansight.org/blog/2020/02/my-unexpected-dive-into-open-source-python/

...
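
The same lookups can be tried without a network connection. The sketch below parses a small hand-written feed with fromstring(). The feed text, its example.com links, and the variable names are invented for illustration and are not taken from the real Planet Python feed.

```python
from xml.etree.ElementTree import fromstring

# Hypothetical miniature feed, shaped like the RSS document above
RSS_TEXT = """<rss version="2.0">
  <channel>
    <title>Planet Python</title>
    <item>
      <title>First post</title>
      <pubDate>Sat, 22 Feb 2020 09:01:58 +0000</pubDate>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <pubDate>Fri, 21 Feb 2020 18:38:07 +0000</pubDate>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""

# fromstring() returns the root element (<rss>) directly
root = fromstring(RSS_TEXT)

feed_title = root.findtext('channel/title')
items = [
    (item.findtext('title'), item.findtext('pubDate'), item.findtext('link'))
    for item in root.iterfind('channel/item')
]

# findtext() returns None for a missing child unless a default is given
missing = root.findtext('channel/nosuchtag')
with_default = root.findtext('channel/nosuchtag', default='n/a')

print(feed_title)
for title, date, link in items:
    print(title, date, link)
```

Unlike parse(), which returns an ElementTree wrapper, fromstring() hands back the root element itself, but find(), findtext(), and iterfind() take the same relative paths in both cases.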

 

 

6.4 Incremental parsing of large XML files

If the XML file you need to parse is too large, consider parsing it incrementally with iterparse from xml.etree.ElementTree. Note that, of the two versions of the example below, the one that loads the entire XML document into memory runs faster than the incremental one, but it also consumes far more memory.

A fragment of the file to be parsed, potholes.xml, is shown below. The task is to count the potholes per ZIP code, that is, to tally the values of the zip node inside each row node:

<response>
    <row>
        <row ...>
            <creation_date>2012-11-18T00:00:00</creation_date>
            <status>Completed</status>
            <completion_date>2012-11-18T00:00:00</completion_date>
            <service_request_number>12-01906549</service_request_number>
            <type_of_service_request>Pot Hole in Street</type_of_service_request>
            <current_activity>Final Outcome</current_activity>
            <most_recent_action>CDOT Street Cut ... Outcome</most_recent_action>
            <street_address>4714 S TALMAN AVE</street_address>
            <zip>60632</zip>
            <x_coordinate>1159494.68618856</x_coordinate>
            <y_coordinate>1873313.83503384</y_coordinate>
            <ward>14</ward>
            <police_district>9</police_district>
            <community_area>58</community_area>
            <latitude>41.808090232127896</latitude>
            <longitude>-87.69053684711305</longitude>
            <location latitude="41.808090232127896"
            longitude="-87.69053684711305" />
        </row>
        <row ...>
            <creation_date>2012-11-18T00:00:00</creation_date>
            <status>Completed</status>
            <completion_date>2012-11-18T00:00:00</completion_date>
            <service_request_number>12-01906695</service_request_number>
            <type_of_service_request>Pot Hole in Street</type_of_service_request>
            <current_activity>Final Outcome</current_activity>
            <most_recent_action>CDOT Street Cut ... Outcome</most_recent_action>
            <street_address>3510 W NORTH AVE</street_address>
            <zip>60647</zip>
            <x_coordinate>1152732.14127696</x_coordinate>
            <y_coordinate>1910409.38979075</y_coordinate>
            <ward>26</ward>
            <police_district>14</police_district>
            <community_area>23</community_area>
            <latitude>41.91002084292946</latitude>
            <longitude>-87.71435952353961</longitude>
            <location latitude="41.91002084292946"
            longitude="-87.71435952353961" />
        </row>
    </row>
</response>

Loading the whole document into memory:

from xml.etree.ElementTree import parse
from collections import Counter

potholes_by_zip = Counter()

doc = parse('potholes.xml')
for pothole in doc.iterfind('row/row'):
    potholes_by_zip[pothole.findtext('zip')] += 1
for zipcode, num in potholes_by_zip.most_common():
    print(zipcode, num)

Incremental parsing:

from xml.etree.ElementTree import iterparse
from collections import Counter


def parse_and_remove(filename, path):
    path_parts = path.split('/')
    # 'start' event: fired when an element is first opened, before its children are parsed
    # 'end' event: fired when an element has been completely parsed
    doc = iterparse(filename, ('start', 'end'))
    # Skip the root node
    next(doc)

    tag_stack = []
    elem_stack = []
    for event, elem in doc:
        if event == 'start':
            tag_stack.append(elem.tag)
            elem_stack.append(elem)
        elif event == 'end':
            if tag_stack == path_parts:
                yield elem
                # This is the key statement for reducing memory use: the yielded element is removed from its parent node
                elem_stack[-2].remove(elem)
            try:
                tag_stack.pop()
                elem_stack.pop()
            except IndexError:
                pass


potholes_by_zip = Counter()

data = parse_and_remove('potholes.xml', 'row/row')
for pothole in data:
    potholes_by_zip[pothole.findtext('zip')] += 1
for zipcode, num in potholes_by_zip.most_common():
    print(zipcode, num)
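
For comparison, here is a simpler variant of incremental parsing. It is not the parse_and_remove() approach above: it listens only for 'end' events and calls clear() on each processed element. The sample data, the count_zips name, and the use of io.BytesIO in place of a real file are assumptions made to keep the sketch self-contained.

```python
import io
from collections import Counter
from xml.etree.ElementTree import iterparse

# Tiny in-memory stand-in for potholes.xml (assumption for this sketch)
SAMPLE = b"""<response>
  <row>
    <row><zip>60632</zip></row>
    <row><zip>60647</zip></row>
    <row><zip>60632</zip></row>
  </row>
</response>"""


def count_zips(source):
    """Count zip values, clearing each record once it has been read."""
    counts = Counter()
    for event, elem in iterparse(source, events=('end',)):
        # Only the inner <row> elements have a direct <zip> child
        if elem.tag == 'row' and elem.find('zip') is not None:
            counts[elem.findtext('zip')] += 1
            # Drop the element's children and text to release memory
            elem.clear()
    return counts


counts = count_zips(io.BytesIO(SAMPLE))
for zipcode, num in counts.most_common():
    print(zipcode, num)
```

With clear(), the emptied row elements stay attached to their parent, so a little memory is still held per record. Removing the element from its parent, as parse_and_remove() does, avoids even that.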

 

 

6.5 Convert Dictionary to XML

Element from xml.etree.ElementTree can be used to build XML, but note that element text and attribute values must be strings, so values of other types have to be converted with str() first.

from xml.etree.ElementTree import Element, tostring


def dict_to_xml(tag, d):
    """Create an XML element from a dictionary."""
    elem = Element(tag)
    for key, val in d.items():
        child = Element(key)
        # text must be a str
        child.text = str(val)
        elem.append(child)
    return elem


s = {'name': 'GOOG', 'shares': 100, 'price': 490.1}
e = dict_to_xml('stock', s)
# Set an attribute on the node
e.set('_id', '1234')
print(e)
print(tostring(e))

Output:

<Element 'stock' at 0x000001761DB01B88>
b'<stock _id="1234"><name>GOOG</name><shares>100</shares><price>490.1</price></stock>'
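
One reason to build XML with Element rather than by concatenating strings is that special characters in the text are escaped on serialization. The sketch below illustrates this with a variant of dict_to_xml() that uses SubElement. The sample dictionary is made up, and encoding='unicode' is passed only so that tostring() returns a str instead of bytes.

```python
from xml.etree.ElementTree import Element, SubElement, fromstring, tostring


def dict_to_xml(tag, d):
    """Create an XML element from a dictionary (SubElement variant)."""
    elem = Element(tag)
    for key, val in d.items():
        # SubElement creates the child and appends it to elem in one step
        SubElement(elem, key).text = str(val)
    return elem


# Hypothetical values containing characters that are special in XML
e = dict_to_xml('item', {'name': '<spam>', 'note': 'a & b'})
xml_text = tostring(e, encoding='unicode')
print(xml_text)

# Parsing the result back recovers the original text
round_trip = fromstring(xml_text).findtext('name')
print(round_trip)
```

The serialized text contains &lt;spam&gt; and a &amp; b, which hand-built strings would need html.escape() or xml.sax.saxutils.escape() to produce.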

 

 

6.6 Parsing and modifying XML

When modifying the XML in the example, note that every modification is made through the parent node, which can be treated like a list.

  • Delete Node: Use the remove() method of the parent node.
  • Add Node: Use the insert() and append() methods of the parent node.
  • Indexing and slicing: Child nodes can be accessed with element[i] or element[i:j].
  • Create a new node: use the Element class.

The sample file pred.xml:

<?xml version="1.0"?>
<stop>
    <id>14791</id>
    <nm>Clark &amp; Balmoral</nm>
    <sri>
        <rt>22</rt>
        <d>North Bound</d>
        <dd>North Bound</dd>
    </sri>
    <cr>22</cr>
    <pre>
        <pt>5 MIN</pt>
        <fd>Howard</fd>
        <v>1378</v>
        <rn>22</rn>
    </pre>
    <pre>
        <pt>15 MIN</pt>
        <fd>Howard</fd>
        <v>1867</v>
        <rn>22</rn>
    </pre>
</stop>
>>> from xml.etree.ElementTree import parse, Element
>>> doc = parse('pred.xml')
>>> root = doc.getroot()
>>> root
<Element 'stop' at 0x100770cb0>
>>> root.remove(root.find('sri'))
>>> root.remove(root.find('cr'))
>>> # getchildren() was removed in Python 3.9; list(root) works on all versions
>>> list(root).index(root.find('nm'))
1
>>> e = Element('spam')
>>> e.text = 'This is a test'
>>> root.insert(2, e)
>>> doc.write('newpred.xml', xml_declaration=True)
>>>
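
The same edits can be tried as a script without any files. The sketch below works on a shortened in-memory version of pred.xml. The trimmed document and the names XML_TEXT, pos, and tags are assumptions made for this illustration.

```python
from xml.etree.ElementTree import Element, fromstring, tostring

# Shortened stand-in for pred.xml (assumption for this sketch)
XML_TEXT = """<stop>
    <id>14791</id>
    <nm>Clark &amp; Balmoral</nm>
    <sri><rt>22</rt></sri>
    <cr>22</cr>
    <pre><pt>5 MIN</pt></pre>
</stop>"""

root = fromstring(XML_TEXT)

# Delete nodes through their parent
root.remove(root.find('sri'))
root.remove(root.find('cr'))

# Position of <nm> among the root's children
pos = list(root).index(root.find('nm'))

# Create a new node and insert it right after <nm>
e = Element('spam')
e.text = 'This is a test'
root.insert(pos + 1, e)

tags = [child.tag for child in root]
print(tags)

# Indexing and slicing behave as with a list
print(root[0].text)
print([child.tag for child in root[1:3]])

print(tostring(root, encoding='unicode'))
```

Computing the insert position from the index of nm, rather than hard-coding 2, keeps the script correct if the number of nodes before nm changes.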

Posted by Slashscape on Sat, 22 Feb 2020 08:48:46 -0800