The core topic of this article is data cleaning.
Data cleaning
A typical data workflow looks like this:
- Data acquisition
- Data cleaning
- Data analysis
- Data visualization and modeling
As promised at the end of the last post, this one covers an important step that precedes data analysis: data cleaning.
Keep in mind that data cleaning exists to serve the analysis that follows. Whether the data needs processing, and what kind of processing it needs, should be decided based on the requirements of the downstream analysis and mining work.
The work breaks down into several distinct steps.
import pandas as pd
import numpy as np
1, Handling missing data
Broadly speaking, missing data can either be filled in or dropped; which approach to take depends on the situation. In pandas, four methods cover most of this work:
- fillna(): fill in missing values
- dropna(): drop rows or columns that contain missing values
- isnull(): test whether each value is missing
- notnull(): test whether each value is not missing
These four functions, combined with ordinary pandas syntax and logic, are basically all you need for this step. The examples below should be self-explanatory.
df = pd.DataFrame([np.random.rand(3),np.random.rand(3),np.random.rand(3)],columns=['A','B','C'])
df
|   | A | B | C |
|---|---|---|---|
| 0 | 0.902598 | 0.598310 | 0.169824 |
| 1 | 0.425368 | 0.805950 | 0.677491 |
| 2 | 0.830366 | 0.305227 | 0.487216 |
df.loc[0, 'A'] = np.nan
print(df)
df.fillna(1)
A B C
0 NaN 0.598310 0.169824
1 0.425368 0.805950 0.677491
2 0.830366 0.305227 0.487216
|   | A | B | C |
|---|---|---|---|
| 0 | 1.000000 | 0.598310 | 0.169824 |
| 1 | 0.425368 | 0.805950 | 0.677491 |
| 2 | 0.830366 | 0.305227 | 0.487216 |
df.loc[0, 'A'] = np.nan
print(df)
df.dropna()
A B C
0 NaN 0.598310 0.169824
1 0.425368 0.805950 0.677491
2 0.830366 0.305227 0.487216
|   | A | B | C |
|---|---|---|---|
| 1 | 0.425368 | 0.805950 | 0.677491 |
| 2 | 0.830366 | 0.305227 | 0.487216 |
# The axis parameter controls whether rows or columns containing missing values are dropped
df.loc[0, 'A'] = np.nan
print(df)
df.dropna(axis=1)
A B C
0 NaN 0.598310 0.169824
1 0.425368 0.805950 0.677491
2 0.830366 0.305227 0.487216
|   | B | C |
|---|---|---|
| 0 | 0.598310 | 0.169824 |
| 1 | 0.805950 | 0.677491 |
| 2 | 0.305227 | 0.487216 |
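As a side note, dropna also accepts a how parameter ('any' by default, or 'all') and a thresh parameter giving the minimum number of non-missing values a row must have to be kept. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': [np.nan, 1.0, np.nan],
                    'B': [np.nan, 2.0, 3.0],
                    'C': [np.nan, 4.0, 5.0]})

# how='all' drops only rows where every value is missing (row 0 here)
print(df2.dropna(how='all'))

# thresh=3 keeps only rows with at least 3 non-missing values (row 1 here)
print(df2.dropna(thresh=3))
```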
# Get a DataFrame of booleans marking which values are missing
bool_df_t = df.isnull()
bool_df_t
|   | A | B | C |
|---|---|---|---|
| 0 | True | False | False |
| 1 | False | False | False |
| 2 | False | False | False |
# The opposite of the above: marks which values are present
bool_df = df.notnull()
bool_df
|   | A | B | C |
|---|---|---|---|
| 0 | False | True | True |
| 1 | True | True | True |
| 2 | True | True | True |
Of course, these boolean DataFrames can be used for indexing, just like boolean masks on a NumPy array, although in this case the result is not particularly useful.
print(df[bool_df_t])
print(df[bool_df])
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
A B C
0 NaN 0.598310 0.169824
1 0.425368 0.805950 0.677491
2 0.830366 0.305227 0.487216
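In practice, a common middle ground is to fill missing values with a column statistic rather than a constant. A minimal sketch, with invented column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'height': [1.7, np.nan, 1.8],
                   'weight': [60.0, 70.0, np.nan]})

# fillna accepts a Series aligned on the columns, so this fills each column with its own mean
print(df.fillna(df.mean()))

# Alternatively, forward-fill: carry the previous valid value down each column
print(df.ffill())
```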
2, Handling duplicate values
Because duplicate records may not meet the requirements of the analysis, or may cause problems when fed into a model, in some cases the duplicated values in the data need to be dealt with.
# Building data with duplicate values
df_obj = pd.DataFrame({'data1' : ['a'] * 4 + ['b'] * 4,
'data2' : np.random.randint(0, 4, 8)})
print(df_obj)
## df.duplicated(): flags duplicate rows. By default all columns are compared, but a subset of columns can be specified, and we can work with whatever result that gives us. The first occurrence of a row is marked False;
## any later occurrence of the same row is marked True, indicating a duplicate.
print(df_obj.duplicated())
data1 data2
0 a 1
1 a 2
2 a 2
3 a 0
4 b 1
5 b 1
6 b 1
7 b 1
0 False
1 False
2 True
3 False
4 False
5 True
6 True
7 True
dtype: bool
## We can also check for duplicates based on a single column.
print(df_obj.duplicated('data2'))
0 False
1 False
2 True
3 False
4 True
5 True
6 True
7 True
dtype: bool
After identifying them, you can delete duplicate rows with the drop_duplicates() method.
# Deletes duplicate rows, keeping the first occurrence of each.
df_obj.drop_duplicates()
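drop_duplicates also takes a subset argument (which columns to compare) and a keep argument ('first' by default, 'last', or False to drop every copy). A minimal sketch with fixed values so the result is reproducible:

```python
import pandas as pd

df_obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4,
                       'data2': [1, 2, 2, 0, 1, 1, 1, 1]})

# Deduplicate on data2 only, keeping the last occurrence of each value
print(df_obj.drop_duplicates(subset=['data2'], keep='last'))

# keep=False drops every row that is duplicated on the chosen columns
print(df_obj.drop_duplicates(subset=['data1', 'data2'], keep=False))
```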
3, Data format conversion
Because some data formats may not meet the standards required by the later analysis, or may not be suitable as model input, certain values need to be transformed accordingly. Two common approaches:
- Map values with a function.
- Replace specified values directly.
# Generate random data
ser_obj = pd.Series(np.random.randint(0,10,10))
ser_obj
0 0
1 0
2 7
3 3
4 0
5 8
6 6
7 7
8 8
9 9
dtype: int32
# Series.map works much like DataFrame.applymap in pandas and the built-in map in Python; here we pass it a lambda expression.
ser_obj.map(lambda x : x ** 2)
0 0
1 0
2 49
3 9
4 0
5 64
6 36
7 49
8 64
9 81
dtype: int64
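map also accepts a dict, which is convenient for recoding categorical values; keys missing from the dict become NaN. A minimal sketch with made-up labels:

```python
import pandas as pd

grades = pd.Series(['A', 'B', 'A', 'C', 'D'])

# Recode letter grades to numbers; 'D' is not in the dict, so it maps to NaN
print(grades.map({'A': 4, 'B': 3, 'C': 2}))
```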
# Replace the value with the replace method
# Generate a random series
data = pd.Series(np.random.randint(0,100,10))
print(data)
0 73
1 18
2 48
3 27
4 1
5 60
6 59
7 38
8 66
9 53
dtype: int32
# Single replacement
data = data.replace(73,100)
data
0 100
1 18
2 48
3 27
4 1
5 60
6 59
7 38
8 66
9 53
dtype: int32
# Replace multiple values at once
data = data.replace([100,18],[-1,-1])
data
0 -1
1 -1
2 48
3 27
4 1
5 60
6 59
7 38
8 66
9 53
dtype: int64
String handling largely follows Python's built-in string methods, which pandas exposes element-wise through the .str accessor, so we won't spend much time on it here beyond the short sketch below.
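For reference, a minimal sketch of the .str accessor, which applies string methods element-wise to a Series (the sample values are invented):

```python
import pandas as pd

s = pd.Series(['  Alice ', 'BOB', 'carol'])

# Strip whitespace and normalize case, element by element
print(s.str.strip().str.lower())

# Vectorized substring test (case-sensitive by default)
print(s.str.contains('o'))
```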
4, Data merging
Combine data sets according to different conditions. The main tool is pd.merge.
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)
- left: the DataFrame on the left side of the merge
- right: the DataFrame on the right side of the merge
- how: join type, one of 'inner' (default), 'outer', 'left', 'right'
- on: column name(s) to join on; they must exist in both DataFrames. If not given, the intersection of the column names of left and right is used as the join key.
- left_on: column(s) of the left DataFrame to use as join keys
- right_on: column(s) of the right DataFrame to use as join keys (a short sketch after the list below shows left_on / right_on in use)
The four join types are important: inner, outer, left, and right joins.
- Inner join: keep only the intersection of the specified keys.
- Outer join: keep the union of the specified keys.
- Left join: keep all keys from the left DataFrame.
- Right join: keep all keys from the right DataFrame.
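When the key columns have different names in the two DataFrames, left_on and right_on are used instead of on. A minimal sketch (the column names here are invented for illustration):

```python
import pandas as pd

employees = pd.DataFrame({'emp_id': [1, 2, 3], 'name': ['Ann', 'Bo', 'Cy']})
salaries = pd.DataFrame({'staff_id': [1, 2, 4], 'salary': [100, 200, 400]})

# Inner join on differently named key columns; only ids 1 and 2 appear on both sides
print(pd.merge(employees, salaries, left_on='emp_id', right_on='staff_id'))
```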
import pandas as pd
import numpy as np
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
left
|   | key | A | B |
|---|---|---|---|
| 0 | K0 | A0 | B0 |
| 1 | K1 | A1 | B1 |
| 2 | K2 | A2 | B2 |
| 3 | K3 | A3 | B3 |
right
|   | key | C | D |
|---|---|---|---|
| 0 | K0 | C0 | D0 |
| 1 | K1 | C1 | D1 |
| 2 | K2 | C2 | D2 |
| 3 | K3 | C3 | D3 |
pd.merge(left,right,on='key') # specify the join key
|   | key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | A1 | B1 | C1 | D1 |
| 2 | K2 | A2 | B2 | C2 | D2 |
| 3 | K3 | A3 | B3 | C3 | D3 |
## Both left and right have a 'key' column whose values are K0 through K3, so after merging on 'key' every row finds a match and no data is lost in the join.
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
left
|   | key1 | key2 | A | B |
|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 |
| 1 | K0 | K1 | A1 | B1 |
| 2 | K1 | K0 | A2 | B2 |
| 3 | K2 | K1 | A3 | B3 |
right
|   | key1 | key2 | C | D |
|---|---|---|---|---|
| 0 | K0 | K0 | C0 | D0 |
| 1 | K1 | K0 | C1 | D1 |
| 2 | K1 | K0 | C2 | D2 |
| 3 | K2 | K0 | C3 | D3 |
pd.merge(left,right,on=['key1','key2']) #Specify multiple keys to merge
|   | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |
As shown above, an inner join keeps the intersection of the keys. For the key pair (K1, K0) there are clearly two matching rows in right but only one in left; both right rows belong to the intersection, so the single left row is paired with each of them, giving two rows in the result. Now let's see what happens when left has two matching rows as well.
left = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
print(pd.merge(left,right,on=['key1','key2']))
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A1 B1 C1 D1
2 K1 K0 A1 B1 C2 D2
3 K1 K0 A2 B2 C1 D1
4 K1 K0 A2 B2 C2 D2
Let's think about the merge rule here. From the data above we end up with five rows. The keys that actually match between left and right are one (K0, K0) pair and the (K1, K0) pairs, of which there are two on each side; those two-by-two combinations expand into four rows. This cross-product behaviour on duplicate keys is not hard to understand, but it does take a moment of thought on first contact; it is exactly how intersections are computed in a relational database. The following example may make it easier to see.
left = pd.DataFrame(
{'key':['c','d','c'],
'A':['a','b','e'],
'B':['a','b','e']},
)
right = pd.DataFrame(
{'key':['c','d','c'],
'C':['a','b','c'],
'D':['a','b','c']},
)
left
right
pd.merge(left,right,on=['key'])
|   | key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | c | a | a | a | a |
| 1 | c | a | a | c | c |
| 2 | c | e | e | a | a |
| 3 | c | e | e | c | c |
| 4 | d | b | b | b | b |
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
print(pd.merge(left, right, how='left', on=['key1', 'key2']))
## Keep every key pair from the left DataFrame; where right has no match, its columns are filled with NaN.
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
print(pd.merge(left, right, how='right', on=['key1', 'key2']))
## Keep every key pair from the right DataFrame; where left has no match, its columns are filled with NaN.
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
Outer join: keep all key combinations from both sides (the union); missing values are filled with NaN.
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
print(pd.merge(left,right,how='outer',on=['key1','key2']))
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
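As a side note, merge also has an indicator parameter that adds a _merge column recording which side each row came from; this is handy when auditing an outer join. A minimal sketch using the same left and right data as above:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
merged = pd.merge(left, right, how='outer', on=['key1', 'key2'], indicator=True)

# Rows that found no match on the other side
print(merged[merged['_merge'] != 'both'])
```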
Handling duplicate column names
# Overlapping (non-key) column names can be disambiguated with the suffixes parameter
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
'data' : np.random.randint(0,10,3)})
df_obj1
|   | key | data |
|---|---|---|
| 0 | b | 4 |
| 1 | b | 5 |
| 2 | a | 3 |
| 3 | c | 0 |
| 4 | a | 4 |
| 5 | a | 5 |
| 6 | b | 4 |
df_obj2
print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))
key data_left data_right
0 b 4 7
1 b 5 7
2 b 4 7
3 a 3 8
4 a 4 8
5 a 5 8
# Merge against the index
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'data2' : np.random.randint(0,10,3)}, index=['a', 'b', 'd'])
df_obj1
|   | key | data1 |
|---|---|---|
| 0 | b | 4 |
| 1 | b | 0 |
| 2 | a | 6 |
| 3 | c | 1 |
| 4 | a | 4 |
| 5 | a | 2 |
| 6 | b | 1 |
df_obj2
# Merge the 'key' column of the left DataFrame against the index of the right DataFrame.
pd.merge(df_obj1, df_obj2, left_on='key', right_index=True)
|   | key | data1 | data2 |
|---|---|---|---|
| 0 | b | 4 | 8 |
| 1 | b | 0 | 8 |
| 6 | b | 1 | 8 |
| 2 | a | 6 | 1 |
| 4 | a | 4 | 1 |
| 5 | a | 2 | 1 |
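pandas also provides DataFrame.join, a convenience wrapper for index-based merging. A minimal sketch of the same kind of join as above (note that join defaults to a left join, so unmatched keys are kept with NaN):

```python
import numpy as np
import pandas as pd

df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'data2': np.random.randint(0, 10, 3)}, index=['a', 'b', 'd'])

# join() aligns df_obj2's index with the 'key' column of df_obj1
print(df_obj1.join(df_obj2, on='key'))
```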
pd.concat() method
Similar to np.concatenate for arrays, pd.concat concatenates pandas objects; here we concatenate DataFrames.
df1 = pd.DataFrame(np.arange(6).reshape(3,2),index=list('abc'),columns=['one','two'])
df2 = pd.DataFrame(np.arange(4).reshape(2,2)+5,index=list('ac'),columns=['three','four'])
df1
df2
## The default join is 'outer', i.e. the union of the indexes.
pd.concat([df1,df2])
|   | one | two | three | four |
|---|---|---|---|---|
| a | 0.0 | 1.0 | NaN | NaN |
| b | 2.0 | 3.0 | NaN | NaN |
| c | 4.0 | 5.0 | NaN | NaN |
| a | NaN | NaN | 5.0 | 6.0 |
| c | NaN | NaN | 7.0 | 8.0 |
## As with the earlier methods, you can specify the axis parameter; with axis=1 the concatenation is column-wise and aligns on the index, much like the merge() method.
pd.concat([df1,df2],axis=1)
|   | one | two | three | four |
|---|---|---|---|---|
| a | 0 | 1 | 5.0 | 6.0 |
| b | 2 | 3 | NaN | NaN |
| c | 4 | 5 | 7.0 | 8.0 |
## Similarly, you can specify the join type; for example, join='inner' keeps only the intersection of the indexes.
pd.concat([df1,df2],axis=1,join='inner')
|   | one | two | three | four |
|---|---|---|---|---|
| a | 0 | 1 | 5 | 6 |
| c | 4 | 5 | 7 | 8 |
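Two more pd.concat options worth knowing: keys labels where each piece came from (producing a hierarchical index), and ignore_index renumbers the rows. A minimal sketch reusing df1 and df2 from above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=list('abc'), columns=['one', 'two'])
df2 = pd.DataFrame(np.arange(4).reshape(2, 2) + 5, index=list('ac'), columns=['three', 'four'])

# keys= adds an outer index level recording which input each row came from
print(pd.concat([df1, df2], keys=['first', 'second']))

# ignore_index=True discards the original row labels and renumbers 0..n-1
print(pd.concat([df1, df2], ignore_index=True))
```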
Reshaping data
# 1. Pivot the column index into the (inner level of the) row index: the stack() method
df_obj = pd.DataFrame(np.random.randint(0,10, (5,2)), columns=['data1', 'data2'])
stacked = df_obj.stack()
df_obj
|   | data1 | data2 |
|---|---|---|
| 0 | 5 | 3 |
| 1 | 7 | 4 |
| 2 | 5 | 7 |
| 3 | 7 | 2 |
| 4 | 9 | 5 |
stacked
0 data1 5
data2 3
1 data1 7
data2 4
2 data1 5
data2 7
3 data1 7
data2 2
4 data1 9
data2 5
dtype: int32
As shown above, the index has become a hierarchical index (a MultiIndex).
print(type(df_obj))
# You can see that the data format has changed from dataframe to series
print(type(stacked))
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
# 2. Pivot a level of the hierarchical row index back into columns: the unstack() method
stacked.unstack()
|   | data1 | data2 |
|---|---|---|
| 0 | 5 | 3 |
| 1 | 7 | 4 |
| 2 | 5 | 7 |
| 3 | 7 | 2 |
| 4 | 9 | 5 |
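unstack (and stack) accept a level argument selecting which index level to pivot; by default it is the innermost one. A minimal sketch reusing the stacked Series from above:

```python
import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.random.randint(0, 10, (5, 2)), columns=['data1', 'data2'])
stacked = df_obj.stack()

# Pivot the outer level (the original row labels 0..4) into columns instead
print(stacked.unstack(level=0))
```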
To sum up:
Why clean data? To make the subsequent data analysis, modeling, and other work easier and more reliable.
Where does data cleaning sit in the data analysis workflow? Before the analysis itself, serving the analysis and the model.
What does the data cleaning process involve?
It falls into several categories:
1. Handling missing values
2. Handling duplicate values
3. Data format conversion
4. Data merging
OK, that's it for today. I don't plan to write a separate post on time-series structures; instead I'll weave them directly into the later hands-on posts. Before that practical work I still need to cover some statistics. Thanks for reading, and if you're learning Python and have questions, feel free to add me on WeChat and we can discuss directly.