Data Storage for Python Network Crawling

Keywords: JSON Python encoding ascii

Introduction to JSON

What is JSON?

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is based on a subset of ECMAScript (the JavaScript specification standardized by Ecma International) and uses a text format that is completely independent of any programming language to store and represent data. Its concise, clear hierarchical structure makes JSON an ideal data-exchange language: it is easy for people to read and write, easy for machines to parse and generate, and compact enough to transmit efficiently over a network. For more background, see the JSON entry in Baidu Encyclopedia.

Note: JSON is a data format, not a data type. In Python, JSON text is held in an ordinary str.

Data Types Supported by JSON

  • Objects (dictionaries), written with curly braces {}
  • Arrays (lists), written with square brackets []
  • Numbers (integers and floating point)
  • Strings (must use double quotation marks, not single quotation marks)
  • Booleans (true and false) and null

Multiple items are separated by commas.

Note: JSON itself is essentially a string.
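
As a minimal sketch of the types listed above, the snippet below parses a small JSON text and prints the Python type that each value becomes. The sample record and its keys are made up for illustration. The last part shows that text using single quotation marks is rejected.

```python
import json

# Hypothetical sample record, written as JSON text (note the double quotes).
text = (
    '{"title": "Vanke A", "tags": ["stock", "XSHE"], "price": 30.73, '
    '"volume": 55743855, "listed": true, "note": null}'
)

data = json.loads(text)
for key, value in data.items():
    # Print each key with the name of the Python type it was parsed into.
    print(key, type(value).__name__)

# Single quotation marks are not valid JSON string delimiters.
try:
    json.loads("{'title': 'Vanke A'}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
print(single_quotes_ok)
```

The JSON values true and null come back as Python's True and None.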

Online tools for validating and formatting JSON:
https://www.json.cn/
http://www.bejson.com/

3. Converting dictionaries and lists to JSON (Python objects to JSON strings):

# Set the working directory (adjust the path for your own machine)
import json
import os

os.chdir("G:\\python\\Python Study\\json")

books = [
    {
        'title': 'How steel is made',
        'price': 98
    },
    {
        'title': 'The Dream of Red Mansion',
        'price': 99
    }
]
# dumps returns the JSON text as a str
json_str = json.dumps(books, ensure_ascii=False)
print(type(json_str))
print(json_str)

Result:
<class 'str'>
[{"title": "How steel is made", "price": 98}, {"title": "The Dream of Red Mansion", "price": 99}]

We can see that the Python objects have been converted to JSON and that the single quotation marks have become double quotation marks.

By default, dumps escapes every non-ASCII character, so Chinese characters are written as \uXXXX escape sequences. Passing ensure_ascii=False turns this behavior off and keeps the characters as they are.
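
A small sketch of the difference, assuming a record whose title is the Chinese name 红楼梦 (The Dream of Red Mansion). The variable names are illustrative only.

```python
import json

book = {"title": "红楼梦", "price": 99}

escaped = json.dumps(book)                       # default: ensure_ascii=True
readable = json.dumps(book, ensure_ascii=False)  # keep the characters as-is

print(escaped)
print(readable)

# Both strings decode back to the same Python object.
print(json.loads(escaped) == json.loads(readable))
```

The escaped form is longer but safe for any ASCII-only channel; the readable form needs a file or connection that is opened with a suitable encoding such as UTF-8.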

In Python, only basic types can be converted directly into JSON strings: int, float, str, bool, None, list, dict, and tuple (a tuple is written as a JSON array). Other types must first be converted into these basic types.
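
One way to do that conversion is the default parameter of json.dumps, which is called for any object the encoder cannot handle. The sketch below is only an illustration: the record, its date, and the helper name to_basic are assumptions, not part of the original example.

```python
import json
from datetime import date

# Hypothetical record containing two types json cannot encode directly.
record = {
    "title": "How steel is made",
    "published": date(2019, 5, 16),   # made-up date for illustration
    "tags": {"novel"},                # a set
}

def to_basic(obj):
    """Convert unsupported objects into basic types (hypothetical helper)."""
    if isinstance(obj, date):
        return obj.isoformat()        # date -> 'YYYY-MM-DD' string
    if isinstance(obj, set):
        return sorted(obj)            # set -> sorted list
    raise TypeError(f"Cannot convert {type(obj).__name__} to JSON")

json_str = json.dumps(record, default=to_basic)
print(json_str)
```

Without default, the same call raises a TypeError because date and set are not basic types.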

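Store JSON data in a file:

# json_str may contain non-ASCII characters (ensure_ascii=False), so write it as UTF-8
with open('books.json','w',encoding='utf-8') as fp:
    fp.write(json_str)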

4. Dumping Python data directly into a file

Besides dumps, the json module also provides dump, which takes a file object and writes the JSON text directly into that file.

import json
books = [
    {
     'title':'How steel is made',
     'price':98        
    },
    {
        'title':'The Dream of Red Mansion',
        'price':99
    }
]
with open('a.json','w') as fp:
    json.dump(books,fp)
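
As a quick check of how the two functions relate, the sketch below writes to an in-memory io.StringIO buffer (used here only as a stand-in for an open text file) and compares the result with what dumps returns.

```python
import io
import json

books = [
    {"title": "How steel is made", "price": 98},
    {"title": "The Dream of Red Mansion", "price": 99},
]

buffer = io.StringIO()    # in-memory stand-in for an open text file
json.dump(books, buffer)  # write the JSON text into the buffer

# dump writes the same text that dumps returns for the same arguments.
same_text = buffer.getvalue() == json.dumps(books)
print(same_text)
```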

5. Encoding issues

With dump, the conversion to JSON and the saving to disk happen in a single step. In the original example the two titles are Chinese (钢铁是怎样炼成的 and 红楼梦); opening 'a.json' shows that they were written as ASCII escape sequences:

[{"title": "\u94a2\u94c1\u662f\u600e\u6837\u7ec3\u6210\u7684", "price": 98}, {"title": "\u7ea2\u697c\u68a6", "price": 99}]

When using the dump function, you can set the ensure_ascii parameter to False:

with open('b.json','w') as fp:
    json.dump(books,fp,ensure_ascii = False)

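Opening 'b.json' now shows garbled text. The file was written in the platform's default encoding rather than UTF-8, so a viewer that expects a different encoding misreads the bytes: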

[{"title": "ΈΦΜϊΚΗΤυΡωΑ·³Ι΅Δ", "price": 98}, {"title": "ΊμΒΓΞ", "price": 99}]

We can specify UTF-8 as the file encoding when opening the file:

with open('c.json','w',encoding='utf-8') as fp:
    json.dump(books,fp,ensure_ascii = False)

'c.json' now opens correctly:

[{"title": "How steel is made", "price": 98}, {"title": "The Dream of Red Mansion", "price": 99}]
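
The whole round trip can be sketched as follows. To avoid touching real files, this version writes into a temporary directory, which is an assumption of the sketch and not part of the original example.

```python
import json
import os
import tempfile

books = [{"title": "红楼梦", "price": 99}]

# Work in a temporary directory so no real file is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "c.json")

    # Write readable (unescaped) JSON, explicitly encoded as UTF-8.
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(books, fp, ensure_ascii=False)

    # The raw bytes hold the title as UTF-8, not as \u escapes.
    with open(path, "rb") as fp:
        raw = fp.read()

    # Read it back with the same encoding.
    with open(path, "r", encoding="utf-8") as fp:
        loaded = json.load(fp)

print(raw)
print(loaded == books)
```

Using the same explicit encoding for writing and reading keeps the result independent of the platform default.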

6. Converting json into python objects

Convert json to list:

json_str = '[{"title": "How steel is made", "price": 98}, {"title": "The Dream of Red Mansion", "price": 99}]'
# json.loads takes a str directly; its old encoding argument was removed in Python 3.9
books = json.loads(json_str)
print(type(books))
print(books)

Result:
<class 'list'>
[{'title': 'How steel is made', 'price': 98}, {'title': 'The Dream of Red Mansion', 'price': 99}]

Look at the elements of the list:

for book in books:
    print(type(book))
    print(book)

Result:
<class 'dict'>
{'title': 'How steel is made', 'price': 98}
<class 'dict'>
{'title': 'The Dream of Red Mansion', 'price': 99}

Read JSON directly from a file:

import os 
os.chdir("G:\\python\\Python Study\\json")
import json
with open('a.json','r',encoding='utf-8') as fp:
    # json.load parses the file and returns Python objects (here, a list of dicts)
    books = json.load(fp)
    print(books)

Result:
[{'title': 'How steel is made', 'price': 98}, {'title': 'The Dream of Red Mansion', 'price': 99}]

7. CSV files

Description of CSV files (from https://zh.wikipedia.org/wiki/comma-separated values):

The term "CSV" generally refers to any document that has the following characteristics:

  • It is plain text in some character set, such as ASCII, Unicode, EBCDIC or GB2312 (Simplified Chinese environments);
  • It consists of records (typically one record per line);
  • Each record is divided into fields by a separator (typically a comma, semicolon or tab; sometimes the separator may include optional spaces);
  • Each record has the same sequence of fields.
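
Since the separator is not always a comma, csv.reader accepts a delimiter argument. The following is a minimal sketch with made-up semicolon-separated data; io.StringIO stands in for an open file.

```python
import csv
import io

# Hypothetical semicolon-separated data: one record per line, same fields in each.
text = "name;close\nPing An Bank;13\nVanke A;30.73\n"

reader = csv.reader(io.StringIO(text, newline=""), delimiter=";")
rows = list(reader)
print(rows)
```

Every field comes back as a string, whatever separator is used.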

Read the csv file:

import os
os.chdir('G:\\python\\Python Study\\csv')
import csv
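# newline='' lets the csv module handle line endings itself
with open('stock.csv','r',newline='') as fp:
    # reader is an iterator over the rows
    reader = csv.reader(fp)
    # Read the header row so the loop below only sees data rows
    titles = next(reader)
    for x in reader:
        print(x)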

Result:

    ['0', '000001.XSHE', '1', 'Ping An Bank', 'XSHE', '2017/12/1', '13.38', '13.4', '13.48', '12.96', '13', '178493315']
    ['1', '000002.XSHE', '2', 'Vanke A', 'XSHE', '2017/12/1', '31.22', '30.5', '32.03', '30.5', '30.73', '55743855']
    ['2', '000004.XSHE', '4', 'National agriculture technology', 'XSHE', '2017/12/1', '25.56', '25.41', '26.4', '25.18', '26.2', '2211059']
    ['3', '000005.XSHE', '5', 'Century Star source', 'XSHE', '2017/12/1', '4.2', '4.2', '4.24', '4.2', '4.22', '2365348']
    ['4', '000006.XSHE', '6', 'Shen Zhen Ye A', 'XSHE', '2017/12/1', '9.85', '0', '0', '0', '9.85', '0']
    ['5', '000007.XSHE', '7', 'Brand new', 'XSHE', '2017/12/1', '16.66', '0', '0', '0', '16.66', '0']
    ['6', '000008.XSHE', '8', 'Shenzhou high speed railway', 'XSHE', '2017/12/1', '8.48', '8.48', '8.74', '8.41', '8.59', '5689054']
    ['7', '000009.XSHE', '9', 'Baoan, China', 'XSHE', '2017/12/1', '7.6', '7.61', '7.63', '7.53', '7.58', '9149395']
    ['8', '000010.XSHE', '10', 'Beautiful ecology', 'XSHE', '2017/12/1', '5.13', '5.13', '5.23', '5.11', '5.21', '6765580']
    ['9', '000011.XSHE', '11', 'Deep property A', 'XSHE', '2017/12/1', '17.18', '17.08', '17.28', '17', '17.11', '2474700']
    ['10', '000012.XSHE', '12', 'CsG A', 'XSHE', '2017/12/1', '9.19', '9.1', '9.28', '9.02', '9.11', '35308183']
   ......
    ['2102', '300716.XSHE', '300716', 'National Science and technology', 'XSHE', '2017/12/1', '26.5', '25.8', '27.4', '25.8', '26.85', '5483801']
    ['2103', '300717.XSHE', '300717', 'Huaxin new material', 'XSHE', '2017/12/1', '39.48', '38.82', '39.49', '38.06', '38.65', '1969054']
    ['2104', '300718.XSHE', '300718', 'Changsheng bearing', 'XSHE', '2017/12/1', '40.13', '40.01', '40.95', '39.5', '39.96', '3050717']
    ['2105', '300719.XSHE', '300719', 'Anderville', 'XSHE', '2017/12/1', '27.13', '27', '27.58', '26.39', '27.31', '6095688']
    ['2106', '300720.XSHE', '300720', 'Hai Chuan intelligence', 'XSHE', '2017/12/1', '33.43', '33', '33.39', '31.89', '32.81', '3307540']

Each row is a list, so individual fields can be read by index:

import csv
with open('stock.csv','r') as fp:
    # reader is an iterator
    reader = csv.reader(fp)
    # Read the header row so the loop below only sees data rows
    titles= next(reader)
    for x in reader:
        name = x[3]
        volume = x[-1]
        print({"name":name,"volume":volume})

Result:

{'name': 'Ping An Bank', 'volume': '178493315'}
{'name': 'Vanke A', 'volume': '55743855'}
{'name': 'National agriculture technology', 'volume': '2211059'}
{'name': 'Century Star source', 'volume': '2365348'}
{'name': 'Shen Zhen Ye A', 'volume': '0'}
{'name': 'Brand new', 'volume': '0'}
{'name': 'Shenzhou high speed railway', 'volume': '5689054'}
{'name': 'Baoan, China', 'volume': '9149395'}
{'name': 'Beautiful ecology', 'volume': '6765580'}
{'name': 'Deep property A', 'volume': '2474700'}
{'name': 'CsG A', 'volume': '35308183'}
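
Note that every field in the result is a string. If the numbers are needed for arithmetic, they have to be converted explicitly. The sketch below assumes a made-up three-column sample; the column name closePrice is hypothetical, and the real stock.csv has more columns.

```python
import csv
import io

# Hypothetical three-column sample standing in for stock.csv.
sample = (
    "name,closePrice,turnoverVol\n"
    "Ping An Bank,13,178493315\n"
    "Vanke A,30.73,55743855\n"
)

reader = csv.reader(io.StringIO(sample, newline=""))
next(reader)  # skip the header row

records = []
for row in reader:
    records.append({
        "name": row[0],
        "close": float(row[1]),   # string -> float
        "volume": int(row[-1]),   # string -> int
    })

total_volume = sum(r["volume"] for r in records)
print(records)
print(total_volume)
```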

If you want to access fields by column name instead, you can use DictReader.

import csv

with open('stock.csv','r') as fp:
    # DictReader uses the header row as keys, so the header is not returned as data.
    # Iterating over the reader yields one dictionary per row.
    reader= csv.DictReader(fp)
    for x in reader:
        print(x['turnoverVol'])

Result:

178493315
55743855
2211059
2365348
0
0
5689054
9149395
6765580
2474700
35308183
1236110
29434715
1562976
5792996
0
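
To see which column names are available before indexing by them, DictReader exposes a fieldnames attribute. A small sketch with made-up data follows; turnoverVol comes from the example above, while secShortName is an assumed column name.

```python
import csv
import io

# Hypothetical two-column sample; secShortName is an assumed header name.
sample = (
    "secShortName,turnoverVol\n"
    "Ping An Bank,178493315\n"
    "Vanke A,55743855\n"
)

reader = csv.DictReader(io.StringIO(sample, newline=""))
print(reader.fieldnames)  # column names taken from the header row

# Build a name -> volume mapping, converting the volume to int.
volumes = {row["secShortName"]: int(row["turnoverVol"]) for row in reader}
print(volumes)
```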

8. Write data to csv file

Writing data to a CSV file requires creating a writer object, which has two main methods: writerow writes a single row, and writerows writes multiple rows at once:

import os 
os.chdir('G:\\python\\Python Study\\csv')

import csv

headers = ['name','age','classroom']
values = [
    ('zhilaio',18,'111'),
    ('wwj',20,'222'),
    ('sshs',21,'222')
]
# 'w' opens the file for writing and encoding='utf-8' sets the file encoding.
# newline='' stops Python from translating line endings; without it, the csv
# module's own '\r\n' terminators can produce a blank line after every row on Windows.
with open('text.csv','w',encoding = 'utf-8', newline = '') as fp:
    writer = csv.writer(fp)
    #Write a row of data, the header
    writer.writerow(headers)
    # Write multiline data
    writer.writerows(values)
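
To inspect exactly what the writer produces, the sketch below writes into an in-memory io.StringIO buffer instead of a file (an assumption made only so the output can be examined). It also includes a field containing a comma, taken from the stock names above.

```python
import csv
import io

buffer = io.StringIO(newline="")   # in-memory stand-in for the opened file
writer = csv.writer(buffer)

writer.writerow(["name", "close"])
writer.writerows([
    ("Ping An Bank", 13),
    ("Baoan, China", 7.58),        # contains the separator character
])

output = buffer.getvalue()
print(repr(output))
```

The writer ends each row with '\r\n' by default and wraps the field that contains a comma in double quotation marks, so it still reads back as a single field.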

You can also write rows as dictionaries; for that, use DictWriter.

import csv

headers = ['name','age','classroom']
values = [
    {"name":'zhilaio',"age":18,"classroom":'111'},
    {"name":'abd',"age":28,"classroom":'1244'}
   ]
with open('text1.csv','w',newline = '') as fp:
    writer = csv.DictWriter(fp,headers)
    writer.writeheader()
    writer.writerow({"name":'lxp',"age":24,"classroom":'139'})
    writer.writerows(values)
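
As a rough sketch of a round trip, the snippet below writes dictionaries with DictWriter into an in-memory buffer and reads them back with DictReader. The buffer and the restval value "unknown" are choices made for this illustration; restval is the DictWriter parameter that fills in keys missing from a row.

```python
import csv
import io

headers = ["name", "age", "classroom"]
buffer = io.StringIO(newline="")   # in-memory stand-in for the opened file

writer = csv.DictWriter(buffer, headers, restval="unknown")
writer.writeheader()
writer.writerow({"name": "lxp", "age": 24, "classroom": "139"})
writer.writerow({"name": "abd", "age": 28})   # no classroom key, restval is used

# Rewind the buffer and read the rows back as dictionaries.
buffer.seek(0)
rows = [dict(row) for row in csv.DictReader(buffer)]
print(rows)
```

The numbers written as int come back as strings, since CSV itself carries no type information.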

Posted by smordue on Thu, 16 May 2019 20:53:26 -0700