Python text strings and byte strings

Keywords: Python

Strings in Python have always been a headache for me, and I suspect even experienced developers have felt the fear of being dominated by encodings. Don't worry: after reading this article, Python strings should finally make sense.
Code link: https://github.com/princewen/professional-python3
1, String type
python3:
Python has two different kinds of strings: one for storing text and one for storing raw bytes.
A text string is stored internally as Unicode, while a byte string stores raw bytes and displays them as ASCII characters where possible.

In Python 3, the text string type is named str and the byte string type is named bytes.
Normally, a string literal creates a str instance. To get a bytes instance, prefix the literal with the character b.

text_str = 'The quick brown fox jumped over the lazy dogs'
print (type(text_str)) #output : <class 'str'>
 
 
byte_str = b'The quick brown fox jumped over the lazy dogs'
print (type(byte_str)) #output : <class 'bytes'>
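
To make the difference concrete, the short sketch below (Python 3; the sample string and variable names are chosen only for illustration) compares how the two types treat a non-ASCII character. A str counts characters, while bytes counts and exposes raw byte values.

```python
# Python 3: str holds Unicode code points, bytes holds raw byte values.
text = '\u03b1 is for alpha'       # illustrative sample; \u03b1 is Greek alpha
data = text.encode('utf-8')        # the same text as UTF-8 bytes

char_count = len(text)             # number of characters
byte_count = len(data)             # alpha takes two bytes in UTF-8

first_char = text[0]               # indexing a str yields a 1-character str
first_byte = data[0]               # indexing bytes yields an int (0-255)

print(char_count, byte_count)          # 14 15
print(ascii(first_char), first_byte)   # '\u03b1' 206
print(data)                            # b'\xce\xb1 is for alpha'
```

The one-character difference in length comes entirely from the alpha, which UTF-8 stores as the two bytes 0xce 0xb1.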

python2:

Python 2 also has two kinds of strings. However, the str class of Python 3 is named unicode in Python 2, and the bytes class of Python 3 is named str in Python 2.
This means that in Python 3 the str class is a text string, while in Python 2 the str class is a byte string.
In Python 2, a string literal without a prefix creates a str instance (a byte string!). To get a text string, prefix the literal with the character u.

byte_str = 'The quick brown fox jumped over the lazy dogs'
#output : <type 'str'>
print type(byte_str)
 
text_str = u'The quick brown fox jumped over the lazy dogs'
#output : <type 'unicode'>
print type(text_str)

2, String conversion
python3:

Conversion between str and bytes is possible. The str class has an encode method that converts the text to bytes using a given encoding. Similarly, the bytes class has a decode method that takes an encoding (UTF-8 if omitted) and returns a str. Note also that Python 3 never attempts to convert implicitly between str and bytes. You must call str.encode or bytes.decode explicitly.

#output :  b'The quick brown fox jumped over the lazy dogs'
print (text_str.encode('utf-8'))
 
#output : The quick brown fox jumped over the lazy dogs
print (byte_str.decode('utf-8'))
 
#output : False
print ('foo' == b'foo')
 
#Output : KeyError: b'foo'
d={'foo':'bar'}
print (d[b'foo'])
 
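#Output : TypeError: can only concatenate str (not "bytes") to str
# (older Python 3 versions word this differently, e.g. Can't convert 'bytes' object to str implicitly)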
print ('foo'+b'bar')
 
#Output : TypeError: %b requires bytes, or an object that implements __bytes__, not 'str'
print (b'foo %s' % 'bar')
 
#output : bar b'foo'
print ('bar %s' % b'foo')
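
The explicit conversions described above can be sketched as a round trip. The sample text is an assumption for illustration. The errors argument is a standard parameter of bytes.decode that controls what happens when the bytes are not valid in the chosen encoding.

```python
text = 'caf\u00e9'                      # illustrative sample; \u00e9 is e-acute
utf8_bytes = text.encode('utf-8')       # b'caf\xc3\xa9'
latin1_bytes = text.encode('latin-1')   # b'caf\xe9'

# Decoding with the encoding that produced the bytes restores the text.
round_trip = utf8_bytes.decode('utf-8')

# Decoding with the wrong encoding can fail outright.
try:
    latin1_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# errors='replace' substitutes U+FFFD for undecodable bytes instead of raising.
lenient = latin1_bytes.decode('utf-8', errors='replace')

print(round_trip == text)   # True
print(decode_failed)        # True
print(ascii(lenient))       # 'caf\ufffd'
```

Using errors='replace' hides the problem rather than fixing it, so it is usually better to find out which encoding the bytes really use.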

python2:

Unlike Python 3, Python 2 attempts to convert implicitly between text strings and byte strings.
When the interpreter encounters an operation that mixes the two kinds of string, it first converts the byte string into a text string and then performs the operation on text strings.
This implicit decoding uses the default encoding, which in Python 2 is almost always ASCII.
The sys.getdefaultencoding() function returns the default encoding.

import sys

#output :  foobar
print 'foo'+u'bar'
 
#output : ascii
print sys.getdefaultencoding()
 
# The byte string is implicitly decoded before the comparison
#output : True
print 'foo'==u'foo'
 
#Output : bar
d = {u'foo':'bar'}
print d['foo']

In Python 2, both kinds of string have encode and decode methods: encode turns either kind into a byte string, and decode turns either kind into a text string.
In practice this easily causes confusion and bugs. Consider the example below, which raises an error. After the first encode, the string is already a UTF-8 byte string. Calling encode a second time forces Python 2 to first decode that byte string back into a text string implicitly,
using the default encoding returned by sys.getdefaultencoding(), which here is ASCII. The failing statement is therefore equivalent to:

text_str.encode('utf-8').decode('ascii').encode('utf-8')

text_str = u'\u03b1 is for alpha'
 
# Traceback (most recent call last):
#   File "/Users/shixiaowen/python3/python advanced programming/string and unicode/python2 string.py", line 48, in <module>
#     print text_str.encode('utf-8').encode('utf-8')
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)
 
print text_str.encode('utf-8').encode('utf-8')
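
For contrast, here is a sketch of the same mistake in Python 3 (this is not part of the original example code). Because bytes has no encode method, the second call fails immediately with an AttributeError instead of triggering a hidden ASCII decode.

```python
text_str = '\u03b1 is for alpha'
encoded = text_str.encode('utf-8')       # a bytes object

# bytes objects only offer decode(), so encoding twice is impossible.
has_encode = hasattr(encoded, 'encode')

try:
    encoded.encode('utf-8')
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

print(has_encode)   # False
print(error_name)   # AttributeError
```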

3, Read file
python3:

Files always store bytes, so text read from a file must first be decoded into a text string.
In Python 3, a file opened in text mode is decoded automatically, so reading it returns a text string.
The default encoding depends on the system. On macOS and most Linux systems the preferred encoding is UTF-8, but on Windows it often is not.
You can call locale.getpreferredencoding() to find the system's default encoding.

# <class 'str'>
# There are two kinds of string data in Python, text string and byte string, which can be converted to each other
# This chapter will learn the differences between text strings and byte strings, and the differences between these two types of strings in Python 2 and python 3.
with open('String and unicode','r') as f:
    text_str=f.read()
    print (type(text_str))
    print (text_str)
 
import locale
#output : UTF-8
print (locale.getpreferredencoding())
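
As a rough illustration, a text-mode file object records the encoding it chose in its encoding attribute. The sketch below writes a throwaway temporary file (the path is generated by tempfile and is not part of the original example) and reports which encoding open picked when none was given. The printed names depend on the platform, so no expected value is shown for them.

```python
import locale
import os
import tempfile

# Create a throwaway file so the example does not depend on existing files.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

try:
    with open(path, 'w') as f:        # no encoding given: platform default
        default_used = f.encoding     # name of the encoding open() selected
        f.write('plain ASCII text')
    with open(path, 'r') as f:
        content = f.read()
finally:
    os.remove(path)

print(default_used)
print(locale.getpreferredencoding(False))
print(content)   # plain ASCII text
```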

When reading a file, you can declare its encoding explicitly with the encoding keyword argument of open.

# <class 'str'>
# There are two kinds of string data in Python, text string and byte string, which can be converted to each other
# This chapter will learn the differences between text strings and byte strings, and the differences between these two types of strings in Python 2 and python 3.
with open('String and unicode','r',encoding='utf-8') as f:
    text_str = f.read()
    print(type(text_str))
    print(text_str)
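
The sketch below shows why declaring the encoding matters. It uses a temporary file and sample text that are assumptions made for illustration. Reading with the encoding the file was written in restores the text, while reading with a different one silently produces wrong characters.

```python
import os
import tempfile

text = '\u03b1 is for alpha'

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

try:
    # Write and read with the same explicit encoding: the text survives.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    with open(path, 'r', encoding='utf-8') as f:
        same = f.read()

    # Latin-1 maps every byte to some character, so the wrong encoding
    # raises no error here and simply yields different text.
    with open(path, 'r', encoding='latin-1') as f:
        garbled = f.read()
finally:
    os.remove(path)

print(same == text)     # True
print(ascii(garbled))   # '\xce\xb1 is for alpha'
```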
 
"""
If you want to read the file as a byte string, use the following method
"""
 
# <class 'bytes'>
# b'Python\xe4\xb8\xad\xe6\x9c\x89\xe4\xb8\xa4\xe7\xa7\x8d\xe4\xb8\x8d\xe5\x90\x8......
with open('String and unicode','rb') as f:
    byte_str=f.read()   # binary mode returns bytes, so name it accordingly
    print (type(byte_str))
    print (byte_str)
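
Bytes read in binary mode can still be turned into text afterwards with an explicit decode. The following sketch assumes the content is UTF-8 and uses a temporary file with made-up content instead of the article's sample file.

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

try:
    # Write known UTF-8 bytes in binary mode.
    with open(path, 'wb') as f:
        f.write('Python\u4e2d'.encode('utf-8'))

    # Binary mode returns bytes; decoding is a separate, explicit step.
    with open(path, 'rb') as f:
        raw = f.read()
finally:
    os.remove(path)

decoded = raw.decode('utf-8')

print(type(raw))        # <class 'bytes'>
print(raw)              # b'Python\xe4\xb8\xad'
print(ascii(decoded))   # 'Python\u4e2d'
```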

python2:

In Python 2, the read method of a file opened with the built-in open always returns a byte string, whatever mode is used.

# <type 'str'>
# There are two kinds of string data in Python, text string and byte string, which can be converted to each other
# This chapter will learn the differences between text strings and byte strings, and the differences between these two types of strings in Python 2 and python 3.
with open('String and unicode','r') as f:
    text_str=f.read()
    print (type(text_str))
    print text_str
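
If a text string is wanted in Python 2, one option is io.open, which has been available since Python 2.6 and accepts an encoding argument. In Python 3 it is the same function as the built-in open. The sketch below is written to run on either version. The temporary file and its contents are made up for illustration.

```python
import io
import os
import tempfile

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

try:
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(u'\u03b1 is for alpha')
    # io.open decodes on read: unicode in Python 2, str in Python 3.
    with io.open(path, 'r', encoding='utf-8') as f:
        text_str = f.read()
finally:
    os.remove(path)

print(type(text_str).__name__)   # unicode on Python 2, str on Python 3
print(len(text_str))             # 14
```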

Posted by Web For U on Fri, 03 Dec 2021 03:01:33 -0800