Python -- Processing JSON that contains Chinese

Keywords: JSON codec ascii encoding

First, some background on why I wrote this post.

1. The Chinese fields I need to process look like this:

01 hanging air conditioner
02 ordinary chair
02 ordinary curtain
03 desk - computer desk - office desk
04 Microwave oven-oven-dishwasher-sterilizer
05 Electric Heater-Humidifier-Small Sun-Fan-Air Purifier

2. The code looks like this:

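# coding: utf-8
# Python 2 script: read one entry per line and split it into a numeric id and a label.

import json
import re


ijson = {"objects": []}
with open("position") as fp:
    for line in fp:
        label = line.strip()  # strip() already removes the trailing newline
        print label
        iobject = {}
        # Keep only the digits as the id.
        iobject["id"] = re.sub("\D", "", label)
        # Drop letters, digits, punctuation and spaces; what is left is the label.
        iobject["label"] = re.sub("[A-Za-z0-9\!\%\[\]\,\. ]", "", label)
        ijson["objects"].append(iobject)

print ijson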

3. But the result looks like this:

{'objects': [{'id': '01', 'label': '\xe6\x8c\xe5\xbc\x8f\xe7\xa9\xba\xe8\xb0\x83'}, {'id': '02', 'label': '\xe6\x99\xae\xe9\x9a\xe6\xa4\x85\xe5\xad\x90'}, {'id': '02', 'label': '\xe6\x99\xae\xe9\x9a\xe7\xaa\x97\xe5\xb8\x98'}, {'id': '03', 'label': '\xe4\xb9\xa6\xe6\xa1\x8c-\xe7\x94\xb5\xe8\x84\x91\xe6\xa1\x8c-\xe5\x8a\x9e\xe5\x85\xac\xe6\xa1\x8c'}, {'id': '04', 'label': '\xe5\xbe\xae\xe6\xb3\xa2\xe7\x89-\xe7\x83\xa4\xe7\xae\xb1-\xe6\xb4\x97\xe7\xa2\x97\xe6\x9c\xba-\xe6\xb6\x88\xe6\xaf\x92\xe6\x9f\x9c'}, {'id': '05', 'label': '\xe7\x94\xb5\xe6\x9a\x96\xe6\xb0\x94-\xe5\x8a\xa0\xe6\xb9\xbf\xe5\x99\xa8-\xe5\xb0\x8f\xe5\xa4\xaa\xe9\x98\xb3-\xe7\x94\xb5\xe9\xa3\x8e\xe6\x89\x87-\xe7\xa9\xba\xe6\xb0\x94\xe5\x87\xe5\x8c\x96\xe5\x99\xa8'}]}

4. Since this is JSON, let's output it with json.dumps. That fails:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte
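The \x escapes in item 3 are simply how Python 2 prints a byte string inside a dictionary: each label is a run of UTF-8 bytes. As a minimal sketch (written for Python 3, with a hypothetical helper named try_decode), decoding two labels copied from that output suggests that one is valid UTF-8 while the other is not, which is consistent with the error in item 4:

```python
# Labels copied from the printed dictionary above (ids 03 and 01).
label_03 = (b'\xe4\xb9\xa6\xe6\xa1\x8c-\xe7\x94\xb5\xe8\x84\x91\xe6\xa1\x8c-'
            b'\xe5\x8a\x9e\xe5\x85\xac\xe6\xa1\x8c')
label_01 = b'\xe6\x8c\xe5\xbc\x8f\xe7\xa9\xba\xe8\xb0\x83'


def try_decode(raw):
    """Return the decoded text, or the error reason if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


print(try_decode(label_03))  # decodes cleanly
print(try_decode(label_01))  # reports why decoding fails
```

The second label fails because a byte is missing from its first character, so the problem is already present in the dictionary before json.dumps is called.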

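Getting to the point

As the background shows, a dictionary containing these Chinese labels cannot be passed to json.dumps as-is: it raises UnicodeDecodeError. By default, json.dumps uses ensure_ascii=True and escapes every non-ASCII character as \uXXXX; in Python 2 it must first decode each byte string as UTF-8 to do that, and the decode fails on these labels. To output the Chinese characters themselves, specify ensure_ascii=False: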

print json.dumps(ijson, ensure_ascii=False)

{"objects": [{"id": "01", "label": "?Air conditioning"}, {"id": "02", "label": "universal?Chair"}, {"id": "02", "label": "universal?Window curtains"}, {"id": "03", "label": "Desk-The computer table-A desk?"}, {"id": "04", "label": "microwave?-Oven-Dishwasher-Sterilizer"}, {"id": "05", "label": "Electric heating-Humidifier-Little sun-Electric fan-atmosphere?Chemical device"}]}
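To see what the flag does on its own, here is a minimal Python 3 sketch on well-formed text. The sample label is an assumption about what the first input line contained; the variable names are illustrative only.

```python
import json

# Assumed sample: one well-formed entry with a Chinese label.
data = {"objects": [{"id": "01", "label": "挂式空调"}]}

escaped = json.dumps(data)                       # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)  # keep non-ASCII characters as-is

print(escaped)   # label appears as \uXXXX escapes
print(readable)  # label appears as Chinese characters
```

Both strings parse back to the same data with json.loads; the flag only changes how the characters are written out.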

Then a nastier problem shows up: some of the Chinese characters have turned into "?". The cause is a snippet I found online for separating the digits from the Chinese text:

iobject["id"] = re.sub("\D", "", label) 

iobject["label"] = re.sub("[A-Za-z0-9\!\%\[\]\,\. ]", "", label)

This is the culprit behind my garbled Chinese! I don't know exactly why it happens; if you do, please leave a comment.
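One plausible mechanism, offered as an assumption rather than a confirmed diagnosis: a Python 2 str holds raw bytes, so a character class is applied byte by byte. If the class contains any non-ASCII character, its UTF-8 bytes are removed wherever they occur, including from the middle of a Chinese character. The Python 3 sketch below imitates this with a bytes pattern. The full-width full stop "。" (bytes E3 80 82) in the class is hypothetical, since the pattern as printed above contains only ASCII, and the input line is an assumed sample.

```python
import re

line = "01 挂式空调"  # assumed input; "挂" encodes to the bytes E6 8C 82

# Byte-level class (like a Python 2 str pattern): "。" contributes the
# three separate bytes E3, 80 and 82, so a lone 0x82 byte is matched too.
byte_class = re.compile(rb"[A-Za-z0-9!%\[\],. " + "。".encode("utf-8") + b"]")
as_bytes = byte_class.sub(b"", line.encode("utf-8"))

# Character-level class: "。" is a single character, so "挂" is untouched.
text_class = re.compile(r"[A-Za-z0-9!%\[\],. 。]")
as_text = text_class.sub("", line)

print(repr(as_bytes))  # bytes with one byte of "挂" stripped out
print(as_text)         # intact Chinese label
```

Under these assumptions, the byte-level result matches the first label in the printed dictionary, which is why matching on decoded text rather than on bytes avoids the corruption.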

 

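Change how the Chinese text is extracted. Decode the line to unicode first, then match the Chinese characters directly:

text = label.decode("utf-8")  # work on characters, not bytes
# Runs of Chinese characters, plus "-" so hyphenated labels stay in one piece.
regex = re.compile(u"[\u4e00-\u9fa5-]+")
iobject["label"] = regex.findall(text)[0]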

The resulting JSON is nicely formatted and displays the Chinese correctly.
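For readers on Python 3, where str is already Unicode, the whole flow can be sketched as follows. LABEL_RE and parse_line are illustrative names that do not appear in the original script, and the input lines are assumed samples in the "id, space, label" layout described in the background.

```python
import json
import re

# Runs of characters in the CJK range used in this post, plus "-" between parts.
LABEL_RE = re.compile(r"[\u4e00-\u9fa5-]+")


def parse_line(line):
    """Split a line such as '03 书桌-电脑桌' into a numeric id and a Chinese label."""
    line = line.strip()
    match = LABEL_RE.search(line)
    return {
        "id": re.sub(r"\D", "", line),              # keep digits only
        "label": match.group(0) if match else "",   # empty if no Chinese found
    }


lines = ["01 挂式空调", "03 书桌-电脑桌-办公桌"]  # assumed sample input
result = {"objects": [parse_line(line) for line in lines]}

# ensure_ascii=False keeps the Chinese readable; indent=2 pretty-prints.
output = json.dumps(result, ensure_ascii=False, indent=2)
print(output)
```

Returning an empty label when nothing matches is a design choice of this sketch; the original findall(...)[0] raises IndexError in that case.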

 

 

Posted by misseether on Fri, 04 Oct 2019 23:15:31 -0700