Python: Converting Text Encoding

Keywords: Python, encoding, codec

Recently, while preparing weekly reports, I needed to extract data from CSV files and build tables to produce charts.

To read the CSV content, I usually open the file with open(filename, encoding='utf-8') as f:. In practice, however, some CSV files turned out not to be UTF-8 encoded, so the program raised errors, and each time I had to manually convert the file's encoding to UTF-8 and rerun it. So I thought: why not detect and convert the text encoding directly in the program?

The basic idea: first check whether the text is UTF-8 encoded; if not, convert it to UTF-8, then process it.

Python has the chardet library for inspecting the encoding of text:

The detect() function takes a single non-Unicode (bytes) argument and returns a dictionary, e.g. {'encoding': 'utf-8', 'confidence': 0.99}, containing the detected encoding and the confidence of the guess.

import chardet

def get_encode_info(file):
    with open(file, 'rb') as f:
        return chardet.detect(f.read())['encoding']
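
A quick usage check of the function above; the file path here is just a hypothetical placeholder:

print(get_encode_info('sample.csv'))   # hypothetical file; prints e.g. 'utf-8' or 'GB2312'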

 

However, this performs well only on small files; for slightly larger text it becomes very slow. My local CSV file is nearly 200 KB, and the slowdown is already clearly noticeable. Fortunately, the chardet library provides the UniversalDetector class for this case: create a UniversalDetector object, then call its feed() method repeatedly with each block of text. When the detector reaches its minimum confidence threshold, it sets detector.done to True. Once you have run out of source text, call detector.close(), which performs some final calculations in case the detector never reached its minimum confidence threshold. The result is a dictionary containing the auto-detected character encoding and the confidence (the same as what chardet.detect returns).

from chardet.universaldetector import UniversalDetector

def get_encode_info(file):
    with open(file, 'rb') as f:
        detector = UniversalDetector()
        # Feed the detector line by line and stop as soon as it is confident.
        for line in f:
            detector.feed(line)
            if detector.done:
                break
        detector.close()
    return detector.result['encoding']
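
For files without regular line breaks, the same early-exit idea also works with fixed-size chunks; a minimal sketch (the 4096-byte chunk size is an arbitrary choice):

def get_encode_info_chunked(file, chunk_size=4096):
    detector = UniversalDetector()
    with open(file, 'rb') as f:
        # Feed fixed-size chunks until the detector is confident or the file ends.
        while not detector.done:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            detector.feed(chunk)
    detector.close()
    return detector.result['encoding']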

 

Problem encountered during the encoding conversion: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 178365: character maps to <undefined>

def read_file(file):
    with open(file, 'rb') as f:
        return f.read()

def write_file(content, file):
    with open(file, 'wb') as f:
        f.write(content)

def convert_encode2utf8(file, original_encode, des_encode):
    file_content = read_file(file)
    file_decode = file_content.decode(original_encode)   # --> the problem is here
    file_encode = file_decode.encode(des_encode)
    write_file(file_encode, file)

 

This happens because some bytes cannot be decoded with the detected encoding; the fix is to pass the additional errors parameter. The official documentation says:

bytearray.decode(encoding="utf-8", errors="strict")

Return a string decoded from the given bytes. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.

In other words, the byte sequence is decoded into a string with the given encoding, and the errors parameter selects the error-handling scheme. The default is 'strict', which raises a UnicodeError on invalid bytes; it can be changed to 'ignore' or 'replace'.
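
A quick illustration of the three schemes, using 'café' encoded in Latin-1 as sample bytes (0xe9 is invalid in UTF-8):

data = b'caf\xe9'                        # 'café' in Latin-1
# data.decode('utf-8')                   # 'strict' (default): raises UnicodeDecodeError
print(data.decode('utf-8', 'ignore'))    # 'caf'  -> the bad byte is silently dropped
print(data.decode('utf-8', 'replace'))   # 'caf' + the U+FFFD replacement character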

So change the line file_decode = file_content.decode(original_encode) to file_decode = file_content.decode(original_encode, 'ignore').

Complete code:

from chardet.universaldetector import UniversalDetector

def get_encode_info(file):
    with open(file, 'rb') as f:
        detector = UniversalDetector()
        for line in f:
            detector.feed(line)
            if detector.done:
                break
        detector.close()
    return detector.result['encoding']

def read_file(file):
    with open(file, 'rb') as f:
        return f.read()

def write_file(content, file):
    with open(file, 'wb') as f:
        f.write(content)

def convert_encode2utf8(file, original_encode, des_encode):
    file_content = read_file(file)
    # 'ignore' drops any bytes that cannot be decoded with the detected encoding.
    file_decode = file_content.decode(original_encode, 'ignore')
    file_encode = file_decode.encode(des_encode)
    write_file(file_encode, file)

if __name__ == "__main__":
    filename = r'C:\Users\danvy\Desktop\Automation\testdata\test.csv'
    encode_info = get_encode_info(filename)
    if encode_info != 'utf-8':
        convert_encode2utf8(filename, encode_info, 'utf-8')
    encode_info = get_encode_info(filename)
    print(encode_info)
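
One caveat worth noting: detection can fail (the result's 'encoding' may be None), and chardet reports pure-ASCII files as 'ascii', which is already valid UTF-8. A slightly more defensive check, as a sketch:

encode_info = get_encode_info(filename)
# None means chardet could not decide; 'ascii' files need no conversion.
if encode_info and encode_info.lower() not in ('utf-8', 'ascii'):
    convert_encode2utf8(filename, encode_info, 'utf-8')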


Posted by ununium on Mon, 26 Aug 2019 22:47:07 -0700