Python Learning - Structured Text Files

Keywords: xml JSON Python Javascript

Structured text files

Structured text comes in many formats, which can be distinguished by how they mark structure: separators such as tab ('\t'), comma (',') or vertical bar ('|'), as in comma-separated values (CSV); '<' and '>' tags, as in XML and HTML; punctuation, as in JavaScript Object Notation (JSON); indentation, as in YAML (short for YAML Ain't Markup Language); and miscellaneous formats, such as various configuration files.

CSV

Delimited files are commonly used as a data exchange format, for example between spreadsheets and databases.

>>> import csv
>>> villains = [
    ['Doctor','No'],
    ['Rosa','Klebb'],
    ['Mister','Big'],
    ['Auric','Goldfinger'],
    ['Ernst','Blofeld'],]
>>> with open('villains','wt',newline='') as fout:  # a context manager; the csv module expects newline=''
    csvout = csv.writer(fout)
    csvout.writerows(villains)

>>> villains
[['Doctor', 'No'], ['Rosa', 'Klebb'], ['Mister', 'Big'], ['Auric', 'Goldfinger'], ['Ernst', 'Blofeld']]
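
Reading the data back is the mirror image of writing it. The following is a minimal sketch rather than a definitive recipe: it writes the same villains list to a temporary directory (an assumption made here so the example cleans up after itself) and reads it back with csv.reader and csv.DictReader. The column names first and last are hypothetical labels supplied for illustration, since the file has no header row.

```python
import csv
import os
import tempfile

villains = [
    ['Doctor', 'No'],
    ['Rosa', 'Klebb'],
    ['Mister', 'Big'],
    ['Auric', 'Goldfinger'],
    ['Ernst', 'Blofeld'],
]

# The temporary directory is removed when the with block ends.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'villains')

    with open(path, 'wt', newline='') as fout:
        csv.writer(fout).writerows(villains)

    # csv.reader yields each row as a list of strings.
    with open(path, 'rt', newline='') as fin:
        rows = [row for row in csv.reader(fin)]

    # csv.DictReader yields each row as a mapping; 'first' and 'last' are
    # hypothetical column names, passed in because the file has no header.
    with open(path, 'rt', newline='') as fin:
        records = list(csv.DictReader(fin, fieldnames=['first', 'last']))

print(rows)
print(records[1]['last'])
```

Everything read from a CSV file comes back as a string, so numeric columns need an explicit conversion.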

XML

A delimited file conveys only two dimensions of data: rows and columns. To exchange nested data structures between programs, you need a way to encode hierarchies as text. XML is the most prominent markup format for this purpose; it uses tags to delimit data.

XML is often used for data feeds and messages, and has subformats such as RSS and Atom. Many industries have their own specialized XML formats, for example in finance (http://www.service-architecture.com/articles/xml/finance_xml.html).

Sample file: menu.xml

<?xml version="1.0"?>
<menu>
    <breakfast hours="7-11">
        <item price="$6.00">breakfast burritos</item>
        <item price="$4.00">pancakes</item>
    </breakfast>
    <lunch hours="11-3">
        <item price="$5.00">hamburger</item>
    </lunch>
    <dinner hours="3-10">
        <item price="8.00">spaghetti</item>
    </dinner>
</menu>

The simplest way to parse XML in Python is to use ElementTree. The following code parses the menu.xml file and outputs some tags and attributes:

>>> import xml.etree.ElementTree as et
>>> tree = et.ElementTree(file = 'menu.xml')
>>> root = tree.getroot()
>>> root.tag
'menu'
>>> for child in root:
    print('tag:',child.tag,'attributes:',child.attrib)
    for grandchild in child:
        print('\ttag:',grandchild.tag,'attributes:',grandchild.attrib)  # '\t' indents the nested items

tag: breakfast attributes: {'hours': '7-11'}
    tag: item attributes: {'price': '$6.00'}
    tag: item attributes: {'price': '$4.00'}
tag: lunch attributes: {'hours': '11-3'}
    tag: item attributes: {'price': '$5.00'}
tag: dinner attributes: {'hours': '3-10'}
    tag: item attributes: {'price': '8.00'}
>>> len(root)  #Number of menu choices
3
>>> len(root[0])  #Number of Breakfast Items
2

For each element in the nested lists, tag is the tag string and attrib is a dictionary of its attributes. ElementTree has many other ways of searching XML-derived data, modifying it, and even writing XML files, which are described in detail in its documentation (https://docs.python.org/3.3/library/xml.etree.elementtree.html).
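
As a rough illustration of searching and modifying, the sketch below parses the menu from an inline string (an assumption made so the example runs without menu.xml on disk), collects item names with iter(), looks up prices with find() and findall(), and then changes one attribute before serializing the tree again. The variable names are chosen here for illustration.

```python
import xml.etree.ElementTree as et

MENU_XML = """<?xml version="1.0"?>
<menu>
    <breakfast hours="7-11">
        <item price="$6.00">breakfast burritos</item>
        <item price="$4.00">pancakes</item>
    </breakfast>
    <lunch hours="11-3">
        <item price="$5.00">hamburger</item>
    </lunch>
    <dinner hours="3-10">
        <item price="8.00">spaghetti</item>
    </dinner>
</menu>
"""

root = et.fromstring(MENU_XML)

# iter() walks the whole tree; text holds the content between the tags.
item_names = [item.text for item in root.iter('item')]

# find() returns the first matching child; findall() returns all of them.
breakfast = root.find('breakfast')
breakfast_prices = [item.get('price') for item in breakfast.findall('item')]

# Modify an attribute in place, then serialize the tree back to XML text.
root.find('dinner/item').set('price', '$8.00')
updated_xml = et.tostring(root, encoding='unicode')

print(item_names)
print(breakfast_prices)
print(updated_xml)
```

To write the modified tree to a file instead of a string, wrap the root in et.ElementTree(root) and call its write() method.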

Other standard Python XML libraries are as follows:

xml.dom: The Document Object Model (DOM), familiar to JavaScript developers, represents a Web document as a hierarchical structure. This module loads the entire XML file into memory and lets you access all of its parts.

xml.sax: The Simple API for XML (SAX) parses XML on the fly, without loading everything into memory at once, so it is a good choice for handling very large streams of XML.
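
The difference between the two models can be sketched as follows. This is only an illustrative example using a shortened inline menu; the handler class PriceCollector is a hypothetical name introduced here. The DOM version builds the full tree and then queries it, while the SAX version receives a callback for each opening tag as the parser streams through the input.

```python
import xml.dom.minidom
import xml.sax

MENU_XML = """<menu>
    <breakfast hours="7-11">
        <item price="$6.00">breakfast burritos</item>
        <item price="$4.00">pancakes</item>
    </breakfast>
    <lunch hours="11-3">
        <item price="$5.00">hamburger</item>
    </lunch>
</menu>
"""

# DOM: the whole document is parsed into an in-memory tree first.
dom = xml.dom.minidom.parseString(MENU_XML)
dom_items = [node.firstChild.data for node in dom.getElementsByTagName('item')]


# SAX: the parser calls back into a handler as each tag streams past.
class PriceCollector(xml.sax.ContentHandler):  # hypothetical handler name
    def __init__(self):
        super().__init__()
        self.prices = []

    def startElement(self, name, attrs):
        # Called once for every opening tag; only <item> is of interest here.
        if name == 'item':
            self.prices.append(attrs.get('price'))


collector = PriceCollector()
xml.sax.parseString(MENU_XML.encode('utf-8'), collector)

print(dom_items)
print(collector.prices)
```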

HTML

HTML is used more for formatting output than for exchanging data.

JSON

JavaScript Object Notation (JSON, http://www.json.org) is a popular data exchange format that originated in JavaScript. It is a subset of the JavaScript language and is often valid Python syntax as well. Python has one main JSON module, named json.

Example: build a data structure containing the data from the earlier XML example, encode it as a JSON string, and then decode the JSON string back into data.

>>> menu = \
    {
    "breakfast":{
            "hours":"7-11",
            "items":{
                    "breakfast burritos":"$6.00",
                    "pancakes":"$4.00"
                    }
             },
    "lunch":{
            "hours":"11-3",
            "items":{
                    "hamburger":"$5.00"
                    }
            },
    "dinner":{
            "hours":"3-10",
            "items":{
                   "spaghetti":"$8.00"
                   }
            }
     }
>>> 

Encode menu into a JSON string using dumps():

>>> import json
>>> menu_json = json.dumps(menu)
>>> menu_json
'{"breakfast": {"hours": "7-11", "items": {"breakfast burritos": "$6.00", "pancakes": "$4.00"}}, "lunch": {"hours": "11-3", "items": {"hamburger": "$5.00"}}, "dinner": {"hours": "3-10", "items": {"spaghetti": "$8.00"}}}'
>>> 

Use loads() to parse the JSON string menu_json into a Python data structure:

>>> menu2 = json.loads(menu_json)
>>> menu2
{'breakfast': {'hours': '7-11', 'items': {'breakfast burritos': '$6.00', 'pancakes': '$4.00'}}, 'lunch': {'hours': '11-3', 'items': {'hamburger': '$5.00'}}, 'dinner': {'hours': '3-10', 'items': {'spaghetti': '$8.00'}}}
>>> 

menu and menu2 are dictionaries with the same keys and values.
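
A few related behaviours are worth seeing in one place. The sketch below, using a shortened menu and a hypothetical stamp dictionary, checks that a round trip yields an equal but separate object, shows the indent and sort_keys options of dumps(), and shows that types outside JSON (here a datetime) raise TypeError unless a converter is passed through default=. Using str as that converter is just one simple choice, not the only one.

```python
import datetime
import json

menu = {
    "breakfast": {"hours": "7-11", "items": {"pancakes": "$4.00"}},
    "lunch": {"hours": "11-3", "items": {"hamburger": "$5.00"}},
}

# A round trip through JSON gives back an equal (but distinct) dictionary.
menu2 = json.loads(json.dumps(menu))
same_content = (menu == menu2)
same_object = (menu is menu2)

# indent and sort_keys make the encoded text easier to read and compare.
pretty = json.dumps(menu, indent=2, sort_keys=True)

# datetime objects are not JSON types, so dumps() raises TypeError
# unless a fallback converter is supplied through default=.
stamp = {"updated": datetime.datetime(2017, 3, 25, 3, 16, 49)}  # hypothetical data
try:
    json.dumps(stamp)
    raised = False
except TypeError:
    raised = True
stamp_json = json.dumps(stamp, default=str)

print(same_content, same_object)
print(pretty)
print(stamp_json)
```

To work with files rather than strings, json.dump() and json.load() take an open file object.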

YAML

Like JSON, YAML has keys and values, but it handles more data types, such as dates and times. The Python standard library does not include a YAML module, so you need to install a third-party library such as PyYAML (imported as yaml) to work with it. load() converts a YAML string into Python data structures, while dump() does the opposite.
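
The following is a minimal sketch, assuming the third-party PyYAML package is installed; the import is guarded so the script still runs without it. The document text and its field names are hypothetical. It uses safe_load() and safe_dump(), the variants PyYAML provides for data you do not fully trust, since plain load() can construct arbitrary Python objects.

```python
try:
    import yaml  # third-party: provided by the PyYAML package
except ImportError:
    yaml = None

# Hypothetical YAML document; the field names are illustrative only.
DOC = """\
name: spaghetti
price: 8.00
added: 2017-03-25
tags:
  - dinner
  - pasta
"""

if yaml is not None:
    # Parse the YAML text into ordinary Python objects.
    data = yaml.safe_load(DOC)
    print(data['tags'])
    print(type(data['added']))  # the date is parsed into a date object

    # Dump back to YAML text and confirm the round trip preserves the data.
    text = yaml.safe_dump(data)
    roundtrip = yaml.safe_load(text)
    print(roundtrip == data)
else:
    print('PyYAML is not installed; skipping the YAML example')
```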

Configuration files

The standard configparser module handles Windows-style .ini initialization files. These files have sections containing key = value definitions. Other operations, such as modifying settings, are also possible; see the configparser documentation (https://docs.python.org/3.3/library/configparser.html). If you need more than two levels of nesting, use YAML or JSON.

Here is a simple example configuration file, settings.cfg:

[english]
greeting = Hello

[french]
greeting = Bonjour

[files]
home = /usr/local
# simple interpolation:
bin = %(home)s/bin

>>> 
>>> import configparser
>>> cfg = configparser.ConfigParser()
>>> cfg.read('settings.cfg')
['settings.cfg']
>>> cfg
<configparser.ConfigParser object at 0x014112D0>
>>> cfg['french']
<Section: french>
>>> cfg['french']['greeting']
'Bonjour'
>>> cfg['files']['bin']
'/usr/local/bin'
>>>
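
Modifying and writing settings can be sketched as follows. To stay self-contained, this example loads the same settings from a string with read_string() and writes to an in-memory buffer instead of a file; the spanish section and the farewell option are hypothetical additions used only for illustration.

```python
import configparser
import io

SETTINGS = """\
[english]
greeting = Hello

[french]
greeting = Bonjour

[files]
home = /usr/local
# simple interpolation:
bin = %(home)s/bin
"""

cfg = configparser.ConfigParser()
cfg.read_string(SETTINGS)

section_names = cfg.sections()

# get() with fallback= returns a default instead of raising for a missing option.
missing = cfg.get('english', 'farewell', fallback='(none)')

# Changing home also changes bin, because interpolation happens on lookup.
cfg['files']['home'] = '/opt'
new_bin = cfg['files']['bin']

# Add a section (hypothetical, for illustration) and write the result out.
cfg['spanish'] = {'greeting': 'Hola'}
buffer = io.StringIO()
cfg.write(buffer)
written = buffer.getvalue()

print(section_names)
print(missing)
print(new_bin)
print(written)
```

Note that write() emits the raw, uninterpolated values and does not preserve comments from the original text.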

Other exchange formats

The following binary data exchange formats are usually more compact and faster than XML or JSON:

MsgPack(http://msgpack.org)

Protocol Buffers(https://code.google.com/p/protobuf/)

Avro(http://avro.apache.org/docs/current/)

Thrift(http://thrift.apache.org/)

Using pickle serialization

Storing data structures in a file is also called serializing. Python provides the pickle module to save and restore data objects in a special binary format. The function dump() serializes an object to a file, while load() deserializes it from a file. The variants dumps() and loads(), used in the session below, work with bytes in memory instead.

>>> 
>>> import pickle
>>> import datetime
>>> 
>>> now1 = datetime.datetime.utcnow()
>>> pickled = pickle.dumps(now1)
>>> now2 = pickle.loads(pickled)
>>> now1
datetime.datetime(2017, 3, 25, 3, 16, 49, 282762)
>>> now2
datetime.datetime(2017, 3, 25, 3, 16, 49, 282762)
>>>
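
The file-based functions work the same way. Below is a minimal sketch using dump() and load() with a temporary file; the record dictionary is hypothetical and deliberately mixes types (a datetime, a set, a tuple) that JSON cannot represent directly. Because unpickling can execute arbitrary code, only load pickle data from sources you trust.

```python
import datetime
import os
import pickle
import tempfile

# Hypothetical record mixing types that JSON cannot represent directly.
record = {
    'name': 'Blofeld',
    'seen': datetime.datetime(2017, 3, 25, 3, 16, 49),
    'aliases': {'Number 1', 'Ernst'},
    'position': (1, 2),
}

# The temporary directory is removed when the with block ends.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'record.pickle')

    # Pickle files must be opened in binary mode.
    with open(path, 'wb') as fout:
        pickle.dump(record, fout)

    with open(path, 'rb') as fin:
        restored = pickle.load(fin)

print(restored == record)  # same contents
print(restored is record)  # but a separate object
```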

Posted by Twysted on Sun, 14 Jul 2019 11:11:10 -0700