[Python data analysis (11)] Text data processing with pandas (string tests, trimming both ends, replacement and splitting, basic indexing)


1. Text data processing with pandas

1) pandas provides a set of string methods that make it easy to operate on each element of a Series or Index;

2) These methods are accessed through the .str accessor and automatically skip missing / NA values;

3) For example, the .str.contains() method for membership tests was mentioned at the end of the previous article.

Generating sample data

import numpy as np
import pandas as pd

s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),
                  'key2':['hee','fv','w','hija','123',np.nan]})
print(s)
print(df)

-> The output is:

0          A
1          b
2          C
3    bbhello
4        123
5        NaN
6         hj
dtype: object

  key1  key2
0    a   hee
1    b    fv
2    c     w
3    d  hija
4    e   123
5    f   NaN

4) A simple demo: counting occurrences, changing case, and so on

print(s.str.count('b'))
print(df['key2'].str.upper())

df.columns = df.columns.str.upper()
print(df)

-> The output is: (the first output counts the occurrences of 'b' in each element, the second converts the values of key2 to uppercase, and the third shows the DataFrame after its column names have been converted to uppercase)

0    0.0
1    1.0
2    0.0
3    2.0
4    0.0
5    NaN
6    0.0
dtype: float64

0     HEE
1      FV
2       W
3    HIJA
4     123
5     NaN
Name: key2, dtype: object

  KEY1  KEY2
0    a   hee
1    b    fv
2    c     w
3    d  hija
4    e   123
5    f   NaN
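The membership test mentioned in point 3) can be combined with boolean indexing to filter rows. The following is a minimal sketch using the same sample Series; passing na=False is one way (an assumption about what is wanted here) to treat missing values as "no match" so that the mask contains only True and False.

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'b', 'C', 'bbhello', '123', np.nan, 'hj'])

# na=False treats the missing value as "no match", so the mask is purely boolean
mask = s.str.contains('b', na=False)

# Boolean indexing keeps only the rows whose string contains 'b'
matches = s[mask]
print(matches.tolist())   # ['b', 'bbhello']
```

Without na=False, the NaN row would stay NaN in the mask, and a mask containing NaN cannot be used directly for boolean indexing.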

2. Common string methods

Most methods available on ordinary Python strings can also be applied to Series and DataFrame columns through the .str accessor.

2.1 String case and test functions

.lower(),.upper(),.len(),.startswith(),.endswith()

Generating sample data

s = pd.Series(['A','b','bbhello','123',np.nan])
print(s)

-> The output is:

0          A
1          b
2    bbhello
3        123
4        NaN
dtype: object

Applying these functions

print(s.str.lower())
print(s.str.upper())
print(s.str.len())
print(s.str.startswith('b'))
print(s.str.endswith('3'))

-> The output is: (note the dtypes: the third output is numeric (float64), and the last two are object rather than bool because the results contain NaN)

0          a
1          b
2    bbhello
3        123
4        NaN
dtype: object

0          A
1          B
2    BBHELLO
3        123
4        NaN
dtype: object

0    1.0
1    1.0
2    7.0
3    3.0
4    NaN
dtype: float64

0    False
1     True
2     True
3    False
4      NaN
dtype: object

0    False
1    False
2    False
3     True
4      NaN
dtype: object
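To make the link between ordinary Python string methods and their .str counterparts concrete, the sketch below compares .str.upper() with a hand-written element-wise version. The lambda is only an illustrative equivalent, not how pandas implements the method internally.

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'b', 'bbhello', '123', np.nan])

# Vectorized version: missing values are skipped automatically
vectorized = s.str.upper()

# Hand-written equivalent: the type check must be explicit,
# because calling .upper() on a float NaN would raise AttributeError
manual = s.map(lambda x: x.upper() if isinstance(x, str) else x)

print(vectorized.tolist())
print(manual.tolist())
```

Both versions give the same uppercase strings and leave the missing value in place; the .str form is shorter and handles the NaN case without extra code.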

2.2 Trimming both ends of a string

.strip(), .lstrip() and .rstrip() remove whitespace from both ends, the left end, and the right end of each string, respectively

Generating sample data

s = pd.Series([' jack', 'jill ', ' jesse ', 'frank'])
df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],
                  index=range(3))
print(s)
print(df)

-> The output is: (note the leading and trailing spaces in the data)

0       jack
1      jill 
2     jesse 
3      frank
dtype: object

    Column A    Column B 
0    0.106196   -1.530680
1    0.076250    0.699981
2   -1.816159    0.067348

Removing spaces from both ends of the values

print(s.str.strip())  
print(s.str.lstrip())  
print(s.str.rstrip())  

-> The output is: (compare the three outputs; the differences show up in the alignment, because any remaining spaces shift the values)

0     jack
1     jill
2    jesse
3    frank
dtype: object

0      jack
1     jill 
2    jesse 
3     frank
dtype: object

0      jack
1      jill
2     jesse
3     frank
dtype: object

Removing spaces from both ends of the column names

df.columns = df.columns.str.strip()
print(df)

-> The output is:

   Column A  Column B
0  0.106196 -1.530680
1  0.076250  0.699981
2 -1.816159  0.067348
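One practical reason to trim column names is that a label with stray spaces cannot be selected by its clean name. The sketch below illustrates this with a small zero-filled DataFrame (chosen here only so the example is deterministic) and also shows the to_strip argument, which removes characters other than whitespace; the tags Series is made-up data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((2, 2)), columns=[' Column A ', ' Column B '])

# Before cleaning, the name without spaces is not a valid column label
print('Column A' in df.columns)   # False

df.columns = df.columns.str.strip()

# After cleaning, the column can be selected by its trimmed name
print('Column A' in df.columns)   # True
print(df['Column A'])

# to_strip removes the given characters instead of whitespace
tags = pd.Series(['**jack**', '*jill', 'jesse*'])
print(tags.str.strip('*').tolist())   # ['jack', 'jill', 'jesse']
```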

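2.3 Replacing text in strings

.replace(old, new, n=num): old is the text to be replaced, new is the text that replaces it, and n limits how many occurrences are replaced, counting from the left (in the pandas signature the first two parameters are named pat and repl)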

df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],
                  index=range(3))
df.columns = df.columns.str.replace(' ','-')
print(df)

df.columns = df.columns.str.replace('-','hehe',n=1)
print(df)

-> The output is:

   -Column-A-  -Column-B-
0    0.130360   -1.210729
1    0.681419   -0.063032
2    0.159494    1.012640

   heheColumn-A-  heheColumn-B-
0       0.130360      -1.210729
1       0.681419      -0.063032
2       0.159494       1.012640
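The effect of n is easier to see on a single replacement, and the same method also accepts regular expressions. The following sketch sets regex explicitly in both calls, since the default value of that parameter has differed between pandas versions; the second Series is made-up data for illustration.

```python
import pandas as pd

cols = pd.Index([' Column A ', ' Column B '])

# n=1 replaces only the first (leftmost) occurrence in each string
first_only = cols.str.replace(' ', '-', n=1, regex=False)
print(list(first_only))   # ['-Column A ', '-Column B ']

# regex=True treats the pattern as a regular expression;
# here each run of whitespace becomes a single underscore
s = pd.Series(['a  b', 'c \t d'])
collapsed = s.str.replace(r'\s+', '_', regex=True)
print(collapsed.tolist())   # ['a_b', 'c_d']
```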

2.4 Splitting strings

.split(),.rsplit()

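# The third element is a list, not a string, so .str methods return NaN for it
s = pd.Series(['a,b,c','1,2,3',['a,,,c'],np.nan])
print(s.str.split(','))
# [0] here is ordinary Series indexing: it selects the row labeled 0
print(s.str.split(',')[0])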

-> The output is: (each string is split into a list, and the resulting Series can be indexed as usual)

0    [a, b, c]
1    [1, 2, 3]
2          NaN
3          NaN
dtype: object

['a', 'b', 'c']

★★★ Use .str.get() or .str[] to access the elements of the split lists

print(s.str.split(',').str[0])
print(s.str.split(',').str.get(1))

-> The output is:

0      a
1      1
2    NaN
3    NaN
dtype: object

0      b
1      2
2    NaN
3    NaN
dtype: object
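When the split lists have different lengths, element access still works row by row. The sketch below uses a small made-up Series to show a negative position and a position that does not exist in every row.

```python
import pandas as pd

s = pd.Series(['a,b,c', '1,2'])
parts = s.str.split(',')

# Negative positions count from the end, as with ordinary Python lists
last = parts.str[-1]
print(last.tolist())   # ['c', '2']

# Position 2 does not exist in the second row, so that row becomes missing
third = parts.str.get(2)
print(third.isna().tolist())   # [False, True]
```

An out-of-range position therefore produces a missing value instead of raising IndexError, which is convenient for ragged data.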

★★★★ Pass expand=True to return the split pieces as a DataFrame, one column per piece (similar to splitting a column by a delimiter in Excel)

The n parameter limits the number of splits

rsplit is similar to split, except that it works in reverse, from the end of the string toward the beginning

print(s.str.split(',', expand=True))
print(s.str.split(',', expand=True, n = 1))
print(s.str.rsplit(',', expand=True, n = 1))

-> The output is: (compare the second and third outputs to see how rsplit differs from split)

     0    1    2
0    a    b    c
1    1    2    3
2  NaN  NaN  NaN
3  NaN  NaN  NaN

     0    1
0    a  b,c
1    1  2,3
2  NaN  NaN
3  NaN  NaN

     0    1
0  a,b    c
1  1,2    3
2  NaN  NaN
3  NaN  NaN
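A common use of expand=True with n=1 is to turn one text column into two named columns. The sketch below assumes a hypothetical email column (the addresses and column names are made up for illustration) and also uses rsplit to cut at the last separator only.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'email': ['jack@example.com', 'jill@example.org']})

# n=1 guarantees at most two pieces, so exactly two columns come back
df[['user', 'domain']] = df['email'].str.split('@', n=1, expand=True)

# rsplit with n=1 cuts at the last dot only; .str[-1] takes the final piece
df['tld'] = df['domain'].str.rsplit('.', n=1).str[-1]
print(df)
```

Limiting the number of splits keeps the column count predictable even if a value happens to contain the separator more than once.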

2.5 String indexing

Indexing and slicing through .str work the same way as on an ordinary Python string

s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),
                  'key2':['hee','fv','w','hija','123',np.nan]})

print(s.str[0])  # Take the first character of each string
print(s.str[:2])  # Take the first two characters of each string
print(df['key2'].str[0])  # The same works on a DataFrame column

-> The output is:

0      A
1      b
2      C
3      b
4      1
5    NaN
6      h
dtype: object

0      A
1      b
2      C
3     bb
4     12
5    NaN
6     hj
dtype: object

0      h
1      f
2      w
3      h
4      1
5    NaN
Name: key2, dtype: object
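Negative positions and the explicit .str.slice() method follow the same rules. The sketch below, on a shortened version of the sample Series, also shows what happens when a single position lies beyond the end of a string.

```python
import numpy as np
import pandas as pd

s = pd.Series(['bbhello', '123', np.nan, 'hj'])

# Last character of each string
last_char = s.str[-1]
print(last_char)

# .str.slice(0, 2) selects the same characters as s.str[:2]
first_two = s.str.slice(0, 2)
print(first_two)

# Position 5 exists only in 'bbhello'; the shorter strings give a missing value
sixth = s.str[5]
print(sixth)
```

As with Python strings, slicing past the end is harmless, while a single out-of-range position yields a missing value here instead of an IndexError.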

Posted by MilesStandish on Tue, 25 Feb 2020 22:41:25 -0800