1. Pandas for text data processing
1) Pandas provides a set of string methods that make it easy to operate on each element of an array.
2) They are accessed through the .str accessor and automatically skip missing / NA values.
3) For example, the .str.contains() method for membership tests was mentioned at the end of the previous article.
Sample data generation
import numpy as np
import pandas as pd
s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),
                   'key2':['hee','fv','w','hija','123',np.nan]})
print(s)
print(df)
-> The output is:
0          A
1          b
2          C
3    bbhello
4        123
5        NaN
6         hj
dtype: object
  key1  key2
0    a   hee
1    b    fv
2    c     w
3    d  hija
4    e   123
5    f   NaN
4) A simple demo: counting occurrences, changing case, and so on.
print(s.str.count('b'))
print(df['key2'].str.upper())
df.columns = df.columns.str.upper()
print(df)
-> The output is: (the first output counts the occurrences of 'b' in each element, the second converts the strings to uppercase, and the third converts the column headers to uppercase)
0    0.0
1    1.0
2    0.0
3    2.0
4    0.0
5    NaN
6    0.0
dtype: float64
0     HEE
1      FV
2       W
3    HIJA
4     123
5     NaN
Name: key2, dtype: object
  KEY1  KEY2
0    a   hee
1    b    fv
2    c     w
3    d  hija
4    e   123
5    f   NaN
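Point 3) mentions .str.contains() for membership tests. The following is a minimal sketch of using it to filter the sample Series. The na=False argument is a choice made here and is not taken from the original article: it treats missing values as "no match" so that the result can be used as a boolean mask.

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'b', 'C', 'bbhello', '123', np.nan, 'hj'])

# na=False makes the missing entry count as "no match",
# so the result contains only True/False and can be used as a mask.
mask = s.str.contains('b', na=False)

print(mask)
print(s[mask])   # keeps only the elements that contain 'b'
```

Only the elements 'b' and 'bbhello' pass the filter; the NaN entry is dropped rather than raising an error.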
2. Common string methods
Some methods that work on ordinary Python strings can be applied to Series and DataFrame columns in the same way through the .str accessor.
2.1 String judgment functions
.lower(), .upper(), .len(), .startswith(), .endswith()
Sample data generation
s = pd.Series(['A','b','bbhello','123',np.nan])
print(s)
-> The output is:
0          A
1          b
2    bbhello
3        123
4        NaN
dtype: object
Applying the functions
print(s.str.lower())
print(s.str.upper())
print(s.str.len())
print(s.str.startswith('b'))
print(s.str.endswith('3'))
-> The output is: (note the data types: the third output is numeric, and the last two are object rather than bool because they contain NaN)
0          a
1          b
2    bbhello
3        123
4        NaN
dtype: object
0          A
1          B
2    BBHELLO
3        123
4        NaN
dtype: object
0    1.0
1    1.0
2    7.0
3    3.0
4    NaN
dtype: float64
0    False
1     True
2     True
3    False
4      NaN
dtype: object
0    False
1    False
2    False
3     True
4      NaN
dtype: object
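The float64 and object result types above come from the NaN in an object-dtype Series. As a hedged aside that is not part of the original article, the sketch below stores the same values with pandas' nullable "string" dtype, for which .str.len() and .str.startswith() return nullable integer and boolean results instead.

```python
import numpy as np
import pandas as pd

# Same values as above, but stored with the nullable "string" dtype
s = pd.Series(['A', 'b', 'bbhello', '123', np.nan], dtype="string")

lengths = s.str.len()            # nullable integer result instead of float64
starts = s.str.startswith('b')   # nullable boolean result instead of object

print(lengths.dtype, starts.dtype)
print(lengths)
```

The missing entry is shown as <NA> in both results, and the lengths stay integers.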
2.2 Functions for processing both ends of a string
.strip(), .lstrip(), and .rstrip() remove characters (whitespace by default) from both ends, the left end, and the right end of each string, respectively.
Sample data generation
s = pd.Series([' jack', 'jill ', ' jesse ', 'frank'])
df = pd.DataFrame(np.random.randn(3, 2),
                  columns=[' Column A ', ' Column B '],
                  index=range(3))
print(s)
print(df)
-> The output is: (note the spaces inside the data and the column headers)
0       jack
1      jill 
2     jesse 
3      frank
dtype: object
    Column A    Column B 
0    0.106196   -1.530680
1    0.076250    0.699981
2   -1.816159    0.067348
Remove spaces at the ends of the data
print(s.str.strip())
print(s.str.lstrip())
print(s.str.rstrip())
-> The output is: (compare the three outputs; the remaining spaces show up as differences in alignment)
0     jack
1     jill
2    jesse
3    frank
dtype: object
0      jack
1     jill 
2    jesse 
3     frank
dtype: object
0      jack
1      jill
2     jesse
3     frank
dtype: object
Remove spaces at both ends of the column headers
df.columns = df.columns.str.strip()
print(df)
-> The output is:
   Column A  Column B
0  0.106196 -1.530680
1  0.076250  0.699981
2 -1.816159  0.067348
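Stripping the headers matters in practice because a padded header does not match its clean name. The sketch below checks this directly; the zero-filled DataFrame and the variable names found_before and found_after are illustrative only.

```python
import numpy as np
import pandas as pd

# Zero-filled frame with the same padded headers as in the example above
df = pd.DataFrame(np.zeros((3, 2)), columns=[' Column A ', ' Column B '])

# With padded headers, the clean name is not a valid column label
found_before = 'Column A' in df.columns

df.columns = df.columns.str.strip()

# After stripping, the column can be found (and selected) by its clean name
found_after = 'Column A' in df.columns

print(found_before, found_after)
```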
2.3 Replacing data in strings
.replace(old, new, n=num): old is the content to be replaced, new is the replacement, and num is the maximum number of replacements per string (by default, all occurrences are replaced).
df = pd.DataFrame(np.random.randn(3, 2),
                  columns=[' Column A ', ' Column B '],
                  index=range(3))
df.columns = df.columns.str.replace(' ','-')
print(df)
df.columns = df.columns.str.replace('-','hehe',n=1)
print(df)
-> The output is:
   -Column-A-  -Column-B-
0    0.130360   -1.210729
1    0.681419   -0.063032
2    0.159494    1.012640
   heheColumn-A-  heheColumn-B-
0       0.130360      -1.210729
1       0.681419      -0.063032
2       0.159494       1.012640
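The effect of n is easier to see on a Series than on column headers. The following sketch uses made-up sample strings and passes the regex argument explicitly, since its default has differed between pandas versions; the regular-expression case is an addition for illustration and is not covered by the original text.

```python
import pandas as pd

s = pd.Series(['a-b-c', '2023-01-15'])

# n=1: only the first '-' in each string is replaced
first_only = s.str.replace('-', '_', n=1, regex=False)

# default n: every '-' is replaced
all_hits = s.str.replace('-', '_', regex=False)

# regex=True: the pattern is a regular expression (each run of digits here)
digits = s.str.replace(r'\d+', '#', regex=True)

print(first_only.tolist())
print(all_hits.tolist())
print(digits.tolist())
```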
2.4 Splitting strings
.split(), .rsplit()
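# The third element is a list, not a string, so .str methods return NaN for it
s = pd.Series(['a,b,c','1,2,3',['a,,,c'],np.nan])
print(s.str.split(','))
print(s.str.split(',')[0])   # [0] selects the row labeled 0 from the result
-> The output is: (each string is split into a list, and the resulting Series can be indexed as usual)
0    [a, b, c]
1    [1, 2, 3]
2          NaN
3          NaN
dtype: object
['a', 'b', 'c']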
★★★ You can use .str.get() or .str[] to access the elements of the split lists
print(s.str.split(',').str[0])
print(s.str.split(',').str.get(1))
-> The output is:
0      a
1      1
2    NaN
3    NaN
dtype: object
0      b
1      2
2    NaN
3    NaN
dtype: object
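Element access also accepts negative positions, and a position that does not exist in a particular list gives a missing value rather than an error. A small sketch with made-up data of uneven length:

```python
import numpy as np
import pandas as pd

s = pd.Series(['a,b,c', '1,2', np.nan])
parts = s.str.split(',')

last = parts.str[-1]       # negative positions count from the end of each list
third = parts.str.get(2)   # '1,2' has only two pieces, so the result is missing there

print(last)
print(third)
```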
★★★★ With expand=True, the split result is returned as a DataFrame with one column per piece (similar to splitting text into columns by a delimiter in Excel)
The n parameter limits the number of splits
rsplit is similar to split but works in reverse, from the end of the string toward the beginning
print(s.str.split(',', expand=True))
print(s.str.split(',', expand=True, n = 1))
print(s.str.rsplit(',', expand=True, n = 1))
-> The output is: (compare the last two outputs to see how rsplit differs from split)
     0    1    2
0    a    b    c
1    1    2    3
2  NaN  NaN  NaN
3  NaN  NaN  NaN
     0    1
0    a  b,c
1    1  2,3
2  NaN  NaN
3  NaN  NaN
     0    1
0  a,b    c
1  1,2    3
2  NaN  NaN
3  NaN  NaN
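A typical use of rsplit with n=1 is separating a trailing piece from the rest of the string. The sketch below splits hypothetical file names into a stem and an extension; the file names and the column names 'stem' and 'ext' are illustrative only.

```python
import pandas as pd

files = pd.Series(['archive.tar.gz', 'notes.txt', 'README'])

# Split once, starting from the right, and expand into two columns
parts = files.str.rsplit('.', n=1, expand=True)
parts.columns = ['stem', 'ext']

print(parts)
```

Because the split starts from the right, 'archive.tar.gz' keeps 'archive.tar' as its stem, and 'README', which has no dot, gets a missing value in the 'ext' column.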
2.5 String indexing
Indexing after .str works the same way as indexing the string itself
s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),
                   'key2':['hee','fv','w','hija','123',np.nan]})
print(s.str[0])   # Take the first character
print(s.str[:2])  # Take the first two characters
print(df['key2'].str[0])
-> The output is:
0      A
1      b
2      C
3      b
4      1
5    NaN
6      h
dtype: object
0      A
1      b
2      C
3     bb
4     12
5    NaN
6     hj
dtype: object
0      h
1      f
2      w
3      h
4      1
5    NaN
Name: key2, dtype: object
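Since .str indexing follows ordinary string indexing, negative positions and open-ended slices work as well. A brief sketch on a shortened version of the sample data; .str.slice() is shown as the method form of the bracket syntax.

```python
import numpy as np
import pandas as pd

s = pd.Series(['bbhello', '123', np.nan, 'hj'])

last_char = s.str[-1]           # last character of each string
last_two = s.str[-2:]           # last two characters
first_two = s.str.slice(0, 2)   # method form of s.str[:2]

print(last_char)
print(last_two)
print(first_two)
```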