First, some background on why I wrote this post.
1. The Chinese data I am working with looks like this (a numeric id followed by a Chinese label, one record per line):
01 挂式空调
02 普通椅子
02 普通窗帘
03 书桌-电脑桌-办公桌
04 微波炉-烤箱-洗碗机-消毒柜
05 电暖气-加湿器-小太阳-电风扇-空气净化器
2. The code looks like this:
# coding: utf-8
import os
import sys
import json
import string
import re

ijson = {"objects": []}
with open("position") as fp:
    for line in fp:
        label = line.strip().replace("\n", "")
        print label
        iobject = {}
        iobject["id"] = re.sub("\D", "", label)
        iobject["label"] = re.sub("[A-Za-z0-9\!\%\[\]\,\. ]", "", label)
        ijson["objects"].append(iobject)
print ijson
3. But the result looks like this:
{'objects': [{'id': '01', 'label': '\xe6\x8c\xe5\xbc\x8f\xe7\xa9\xba\xe8\xb0\x83'}, {'id': '02', 'label': '\xe6\x99\xae\xe9\x9a\xe6\xa4\x85\xe5\xad\x90'}, {'id': '02', 'label': '\xe6\x99\xae\xe9\x9a\xe7\xaa\x97\xe5\xb8\x98'}, {'id': '03', 'label': '\xe4\xb9\xa6\xe6\xa1\x8c-\xe7\x94\xb5\xe8\x84\x91\xe6\xa1\x8c-\xe5\x8a\x9e\xe5\x85\xac\xe6\xa1\x8c'}, {'id': '04', 'label': '\xe5\xbe\xae\xe6\xb3\xa2\xe7\x89-\xe7\x83\xa4\xe7\xae\xb1-\xe6\xb4\x97\xe7\xa2\x97\xe6\x9c\xba-\xe6\xb6\x88\xe6\xaf\x92\xe6\x9f\x9c'}, {'id': '05', 'label': '\xe7\x94\xb5\xe6\x9a\x96\xe6\xb0\x94-\xe5\x8a\xa0\xe6\xb9\xbf\xe5\x99\xa8-\xe5\xb0\x8f\xe5\xa4\xaa\xe9\x98\xb3-\xe7\x94\xb5\xe9\xa3\x8e\xe6\x89\x87-\xe7\xa9\xba\xe6\xb0\x94\xe5\x87\xe5\x8c\x96\xe5\x99\xa8'}]}
4. Since this is JSON, let's output it with json.dumps. That fails:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte
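The error can be reproduced from the first label in the printed result above. The sketch below assumes Python 3, where byte strings are an explicit type; the variable names are illustrative only. Decoding fails because 0xE6 announces a three-byte UTF-8 character, but the third byte that follows (0xE5) is not a continuation byte.

```python
# First label from the printed result above, as raw bytes.
label = b"\xe6\x8c\xe5\xbc\x8f\xe7\xa9\xba\xe8\xb0\x83"

try:
    label.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # The bytes are not valid UTF-8, so decoding raises.
    error = exc

print(error)
```

In other words, the label bytes were already damaged before json.dumps ever saw them.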
Now to the point
As the background shows, calling json.dumps on this data raises UnicodeDecodeError. By default (ensure_ascii=True), json.dumps escapes every non-ASCII character as a \uXXXX sequence, and in Python 2 it first has to decode each byte string as UTF-8, which is where the error is raised. To output the Chinese characters directly, specify ensure_ascii=False:
print json.dumps(ijson, ensure_ascii=False)
{"objects": [{"id": "01", "label": "?式空调"}, {"id": "02", "label": "普?椅子"}, {"id": "02", "label": "普?窗帘"}, {"id": "03", "label": "书桌-电脑桌-办公桌?"}, {"id": "04", "label": "微波?-烤箱-洗碗机-消毒柜"}, {"id": "05", "label": "电暖气-加湿器-小太阳-电风扇-空气?化器"}]}
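To make the effect of ensure_ascii concrete, here is a minimal sketch. It assumes Python 3 and uses a single intact label as an illustrative sample value; it is not the author's script.

```python
import json

# Illustrative sample record with an intact Chinese label.
data = {"objects": [{"id": "01", "label": "挂式空调"}]}

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(data)

# ensure_ascii=False: the Chinese characters are written as-is.
readable = json.dumps(data, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings parse back to the same data; the flag only changes how the characters are written out.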
Then an even nastier pitfall shows up: some of the Chinese characters have turned into ?. The cause is a snippet I found online for separating the digits from the Chinese:
iobject["id"] = re.sub("\D", "", label)
iobject["label"] = re.sub("[A-Za-z0-9\!\%\[\]\,\. ]", "", label)
This is the culprit behind my garbled Chinese!!! I don't know why; if you do, please leave a comment.
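One possible explanation, offered as an assumption rather than a confirmed diagnosis: in Python 2, label is a byte string, so re.sub removes individual bytes, not characters. Each damaged label in the printed result is missing a single byte (0x80 or 0x82) from a three-byte UTF-8 character. That is what would happen if the character class in the original snippet had also contained a non-ASCII character such as the full-width full stop 。 (bytes E3 80 82). That character does not appear in the pattern as shown here, so it is purely hypothetical. The sketch below, written for Python 3 where bytes are an explicit type, reproduces the damage to the first label under that assumption:

```python
import re

# Hypothetical: assume the character class also contained the full-width
# full stop, whose UTF-8 encoding is the three bytes E3 80 82.
pattern = b"[A-Za-z0-9 " + "。".encode("utf-8") + b"]"

# First input line as UTF-8 bytes; the first Chinese character is E6 8C 82.
label = "01 挂式空调".encode("utf-8")

# On bytes, every byte of the pattern is a separate class member, so the
# lone 0x82 byte is stripped out of the first character.
stripped = re.sub(pattern, b"", label)

print(repr(stripped))
```

If this is what happened, the remedy is to decode the bytes to text before applying the regex, so that it matches whole characters.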
So I changed the way the Chinese is extracted:
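# Decode the UTF-8 bytes to unicode so the regex works on characters, not bytes.
text = label.decode("utf-8")
# Match runs of CJK characters, allowing the "-" that joins them.
pattern = u"[\u4e00-\u9fa5-]+"
regex = re.compile(pattern)
iobject["label"] = regex.findall(text)[0]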
This produces nicely formatted JSON with the Chinese displayed correctly.
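For comparison, here is a minimal end-to-end sketch in Python 3. It is an alternative, not the code used above: it splits each line on the first run of whitespace instead of using a regex. The parse_line helper and the inline sample lines are hypothetical, with the lines reconstructed from the bytes in the printed result earlier in the post.

```python
import json

# Sample input lines, reconstructed from the bytes in the printed result.
lines = [
    "01 挂式空调",
    "03 书桌-电脑桌-办公桌",
]


def parse_line(line):
    """Split an 'NN label' line into its numeric id and its label."""
    ident, label = line.strip().split(None, 1)
    return {"id": ident, "label": label}


ijson = {"objects": [parse_line(line) for line in lines]}

# ensure_ascii=False keeps the Chinese readable; indent pretty-prints.
text = json.dumps(ijson, ensure_ascii=False, indent=2)
print(text)
```

Because Python 3 strings are Unicode text, nothing here operates on raw bytes, so a multi-byte character cannot be cut in half.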