pandas data analysis tutorial [Full Version]

Keywords: Python Data Analysis pandas

pandas data structures

Import pandas and NumPy

import numpy as np
import pandas as pd
from pandas import Series

1. Series

A Series is an object similar to a one-dimensional array. It consists of two parts:

  • values: the data (an ndarray)
  • index: the labels associated with the data

1) Creation of Series

There are two creation methods:

(1) Create from a list or NumPy array. The default index is an integer index from 0 to N-1

Create from a list

l = [1,2,3,4,5]
s = Series(l, index=list('abcde'))
s
a    1
b    2
c    3
d    4
e    5
dtype: int64
s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
s.values
array([1, 2, 3, 4, 5], dtype=int64)

Besides the index parameter, you can also set the index afterwards by assigning to the index attribute

s.index = [1,2,3,4,5]
s
1    1
2    2
3    3
4    4
5    5
dtype: int64
s[1] = 10
s
1    10
2     2
3     3
4     4
5     5
dtype: int64
n = np.arange(0,10)
s = Series(n.copy())
s
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32
s[0] = 8
s
0    8
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32

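Note that a Series created from an ndarray may share memory with that array instead of copying it, depending on the pandas version and its Copy-on-Write setting. In that case, changing a Series element also changes the original ndarray. The example above avoids this by passing n.copy(). (A Series created from a list never shares data with the list.)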
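
As a minimal sketch of how to make the Series independent of its source array regardless of version, the copy parameter of the Series constructor can be used. The array values here are made up for illustration.

```python
import numpy as np
import pandas as pd

n = np.arange(5)

# copy=True asks pandas to take its own copy of the array's data,
# so later writes to the Series cannot reach the original ndarray.
s = pd.Series(n, copy=True)
s.iloc[0] = 99

print(n[0])       # the source array still holds its original value
print(s.iloc[0])  # only the Series was changed
```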

(2) Created by dictionary

# The key of the dictionary corresponds to the index of the Series
s = Series({'a':1, 'b':2, 'c':3, 'd': 4})
s
a    1
b    2
c    3
d    4
dtype: int64

============================================

Exercise 1:

Create the following Series in several ways, named s1:
chinese                     150
mathematics                 150
English                     150
Comprehensive management    300

============================================

# From a dictionary
s1 = Series({'chinese': 150, 'mathematics': 150, 'English': 150, 'Comprehensive management': 300})
s1
data = [150, 150, 150, 300]
index = ['chinese', 'mathematics', 'English', 'Comprehensive management']
s1 = Series(data=data,index=index)
s1
chinese                     150
mathematics                 150
English                     150
Comprehensive management    300
dtype: int64

2) Indexing and slicing of Series

You can use brackets with a single label to get one element (the element itself is returned), or brackets with a list of labels to get several elements (a Series is returned). Indexing is either explicit or implicit:

(1) Explicit index:

  • Use the labels in index as the key
  • Use .loc[] (recommended)

Note that a label slice is a closed interval: both endpoints are included

s
# Not recommended
s['a']
1
#Recommended writing
s.loc['a']
1
s.loc['a': 'c']  # Note: the interval is fully closed
a    1
b    2
c    3
dtype: int64
s['a': 'c']
a    1
b    2
c    3
dtype: int64

(2) Implicit index:

  • Use integer positions as the key
  • Use .iloc[] (recommended)

Note that a position slice is a half-open interval: the right endpoint is excluded

s
a    1
b    2
c    3
d    4
dtype: int64
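s[0]  # falls back to position because the labels are strings; newer pandas versions deprecate this, so prefer s.iloc[0]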
1
s = Series(np.arange(10))
s
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32
s.index = np.arange(1,11)
s
1     0
2     1
3     2
4     3
5     4
6     5
7     6
8     7
9     8
10    9
dtype: int32
s[1]
0
# Recommended writing method of implicit index
s.iloc[0]
0
s.iloc[0:3] # The slice of an implicit index is left closed and right open
1    0
2    1
3    2
dtype: int32
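
The difference between the two slice conventions is easiest to see when labels and positions disagree. The following is a small sketch with made-up values, using integer labels 1 to 5.

```python
import pandas as pd

# Integer labels 1..5, so labels and positions deliberately disagree.
s = pd.Series([10, 20, 30, 40, 50], index=[1, 2, 3, 4, 5])

by_label = s.loc[1:3]      # labels 1, 2 and 3: both endpoints included
by_position = s.iloc[1:3]  # positions 1 and 2: right endpoint excluded

print(by_label.tolist())     # [10, 20, 30]
print(by_position.tolist())  # [20, 30]
```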

============================================

Exercise 2:

Index and slice the Series s1 created in Exercise 1 in several ways:

Index:
mathematics    150

Slice:
chinese        150
mathematics    150
English        150

============================================

s1
s1.loc['mathematics']
s1.iloc[1]
s1.loc[['mathematics']] # Wrapping the label in a list returns a Series instead of a single value
s1.loc[['mathematics', 'Comprehensive management']]
s1.iloc[[1, 3]]
s1.loc['chinese': 'English']
s1.iloc[0: 3]

3) Basic concepts of Series

A Series can be regarded as an ordered, fixed-length dictionary

You can inspect a Series through attributes such as shape, size, index, and values

s
1     0
2     1
3     2
4     3
5     4
6     5
7     6
8     7
9     8
10    9
dtype: int32
s.shape
s.size
s.index
s.values

You can quickly preview a Series with head() and tail()

s.head(3)
1    0
2    1
3    2
dtype: int32
s.tail(4)
7     6
8     7
9     8
10    9
dtype: int32

When an index label has no corresponding value, the missing data is displayed as NaN (not a number)

s.loc[0] = np.nan  # NaN is a float, so the Series is upcast to float64
s
1     0.0
2     1.0
3     2.0
4     3.0
5     4.0
6     5.0
7     6.0
8     7.0
9     8.0
10    9.0
0     NaN
dtype: float64

You can use pd.isnull() and pd.notnull(), or the Series methods isnull() and notnull(), to detect missing data

pd.isnull(s)
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
0      True
dtype: bool
pd.notnull(s)
s.isnull()
s.notnull()
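
Because the result of isnull() is a boolean Series, it can be summed to count missing values or used as a filter. A short sketch with made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

mask = s.isnull()            # boolean Series, True where a value is missing
n_missing = int(mask.sum())  # True counts as 1, so the sum is the number of NaN
present = s[s.notnull()]     # boolean filter keeps only the non-missing values

print(n_missing)
print(present.tolist())
```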

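Every Series instance has a name attribute. Assigning to Series.name on the class itself, as the second example below does, overrides the attribute for all Series objects and is best avoided.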

s.name = 'Series s'
s
1     0.0
2     1.0
3     2.0
4     3.0
5     4.0
6     5.0
7     6.0
8     7.0
9     8.0
10    9.0
0     NaN
Name: Series s, dtype: float64
Series.name = 'Name'
s
1     0.0
2     1.0
3     2.0
4     3.0
5     4.0
6     5.0
7     6.0
8     7.0
9     8.0
10    9.0
0     NaN
Name: Name, dtype: float64

4) Series operation

(1) The array operation applicable to numpy is also applicable to Series

s + 1 

(2) Operations between Series

  • Data is automatically aligned on index labels during the operation
  • Labels present in only one Series produce NaN
s
1     0.0
2     1.0
3     2.0
4     3.0
5     4.0
6     5.0
7     6.0
8     7.0
9     8.0
10    9.0
0     NaN
Name: Name, dtype: float64
s2 = Series(np.random.randint(0, 10, size=11), index=np.arange(3, 14))
s2
3     0
4     5
5     3
6     9
7     1
8     9
9     1
10    7
11    9
12    6
13    1
dtype: int32
s + s2
# Values with matching labels are combined; labels present on only one side give NaN
0      NaN
1      NaN
2      NaN
3      2.0
4      8.0
5      7.0
6     14.0
7      7.0
8     16.0
9      9.0
10    16.0
11     NaN
12     NaN
13     NaN
dtype: float64
  • Note: to keep the values for all labels, use the .add() method with fill_value
# The pandas operator methods can keep the values for labels that appear on only one side
s.add(s2, fill_value=0)
0      NaN
1      0.0
2      1.0
3      2.0
4      8.0
5      7.0
6     14.0
7      7.0
8     16.0
9      9.0
10    16.0
11     9.0
12     6.0
13     1.0
dtype: float64
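
The same behaviour can be reproduced with fixed values instead of random ones. This is a minimal sketch; the labels x, y, z and w are made up for illustration.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

plain = a + b                    # labels present on only one side become NaN
filled = a.add(b, fill_value=0)  # a value missing on one side is treated as 0

print(plain)
print(filled)
```

Note that fill_value only replaces a value that is missing on one side. If a label is missing (or NaN) on both sides, the result for it is still NaN, which is why label 0 stays NaN in the output above.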

============================================

Exercise 3:

  1. What is the difference between the rules for Series operations and ndarray operations?

  2. Create another Series s2 whose index contains 'Wen Zong', and perform various arithmetic operations with s1. Think about how to keep all the data.

============================================

An ndarray combines elements by position (with broadcasting when shapes differ), whereas a Series aligns on index labels and only combines values whose labels match
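
A small sketch of that difference, using two Series that carry the same labels in opposite order (values are made up):

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['c', 'b', 'a'])  # same labels, reversed order

by_label = left + right                   # pairs a with a, b with b, c with c
by_position = left.values + right.values  # plain ndarrays: first with first, and so on

print(by_label.tolist())     # [31, 22, 13]
print(by_position.tolist())  # [11, 22, 33]
```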

s2 = Series({'chinese': 108, 'mathematics': 149, 'English': 138, 'Wen Zong': 268})
s2
chinese        108
mathematics    149
English        138
Wen Zong       268
dtype: int64
s2.sum()
663
s1  + s2
mathematics                 299.0
Wen Zong                      NaN
Comprehensive management      NaN
English                     288.0
chinese                     258.0
dtype: float64
s1.add(s2, fill_value=0) / 2
mathematics                 149.5
Wen Zong                    134.0
Comprehensive management    150.0
English                     144.0
chinese                     129.0
dtype: float64

2. DataFrame

A DataFrame is a tabular data structure. It can be regarded as a dictionary of Series that share the same index, made up of several columns of data arranged in a fixed order. It was designed to extend Series from one dimension to two. A DataFrame has both a row index and a column index.

  • Row index: index
  • Column index: columns
  • Values: values (two-dimensional array of numpy)

1) Creation of DataFrame

The most common way is to pass a dictionary. The DataFrame uses each dictionary key as a column name and each dictionary value (an array) as that column's data.

In addition, the DataFrame automatically adds an index for the rows (like Series).

As with Series, if a requested column does not match any key of the dictionary, its values are NaN.
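
The last point can be sketched as follows. The scores are made up, and the columns argument deliberately asks for a 'python' column that the dictionary does not contain.

```python
import pandas as pd

scores = {'chinese': [90, 80], 'mathematics': [70, 60]}

# 'python' is not a key of the dictionary, so that column is created with NaN only.
df = pd.DataFrame(scores, index=['Zhang San', 'Li Si'],
                  columns=['chinese', 'mathematics', 'python'])

print(df)
```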

from pandas import DataFrame

Create from a two-dimensional array

data = np.random.randint(0,150, size=(4,4))
index = ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu']
columns = ['chinese', 'mathematics', 'English','python']
df = DataFrame(data=data, index=index, columns=columns)
df
           chinese  mathematics  English  python
Zhang San      104           26       38      52
Li Si           11          104       88      41
Wang Wu         38          119       36     139
Zhao Liu        67           93       21     130

Create using dictionary

df = DataFrame({'chinese': np.random.randint(0,150, size=4), 'mathematics': np.random.randint(0,150, size=4), 'English': np.random.randint(0,150, size=4), 'python': np.random.randint(0,150, size=4)},)
df.index = ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu']
df
           chinese  mathematics  English  python
Zhang San       30          110      143      89
Li Si          116           85       53     138
Wang Wu         28           32       29     141
Zhao Liu        91           23       50      50

DataFrame properties: values, columns, index, shape

df.values
df.columns
df.index
df.shape

============================================

Exercise 4:

Create a DataFrame named ddd from the following test scores:

                          Zhang San  Li Si
chinese                         150      0
mathematics                     150      0
English                         150      0
Comprehensive management        300      0

============================================

# From a dictionary
ddd = DataFrame({'Zhang San': [150]*  3 + [300], 'Li Si': [0]* 4})
ddd.index = ['chinese', 'mathematics', 'English', 'Comprehensive management']
ddd
data = [[150,0]] * 3 + [[300, 0]]
index = ['chinese', 'mathematics', 'English', 'Comprehensive management']
columns = ['Zhang San', 'Li Si']
ddd = DataFrame(data=data, index=index, columns=columns)
ddd

2) Index of DataFrame

(1) Index columns

  • Dictionary-style access
  • Attribute access

A column of a DataFrame can be retrieved as a Series. The returned Series has the same index as the original DataFrame, and its name attribute is set to the column name.

df
           chinese  mathematics  English  python
Zhang San      118           12       53      12
Li Si           48           34       81      54
Wang Wu         32           58       80     133
Zhao Liu       107           25       25      42
df['chinese']
Zhang San    118
Li Si         48
Wang Wu       32
Zhao Liu     107
Name: chinese, dtype: int32
df.chinese
Zhang San    118
Li Si         48
Wang Wu       32
Zhao Liu     107
Name: chinese, dtype: int32
# Add a new column
df['computer'] = np.random.randint(0,150, size=4)
df
           chinese  mathematics  English  python  computer
Zhang San      118           12       53      12       110
Li Si           48           34       81      54       131
Wang Wu         32           58       80     133       132
Zhao Liu       107           25       25      42       129
# A new column cannot be created with attribute syntax: this only sets a Python attribute on the object
df.Comprehensive_management = np.random.randint(0,150, size=4)
C:\Users\ADMINI~1\AppData\Local\Temp/ipykernel_7776/1072798280.py:2: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  df.Comprehensive_management = np.random.randint(0,150, size=4)
df
           chinese  mathematics  English  python  computer
Zhang San      118           12       53      12       110
Li Si           48           34       81      54       131
Wang Wu         32           58       80     133       132
Zhao Liu       107           25       25      42       129

(2) Index rows

  • Use .loc[] with a row label to index rows
  • Use .iloc[] with an integer position to index rows

As before, a Series is returned, and its index is the original column labels.

# Explicit writing
df.loc['Zhang San']
chinese        118
mathematics     12
English         53
python          12
computer       110
Name: Zhang San, dtype: int32
df.iloc[1]

(3) Method of indexing elements

  • Use the column index
  • Use the row index (for example iloc[3, 1])
  • Use the values attribute (a two-dimensional NumPy array)
# Column first, then row
# Do not assign through chained indexing like this
df['English']['Li Si'] = 88
df
# Row first, then column
df.loc['Li Si'].loc['English']
# Recommended
df.loc['Li Si', 'English']
df.iloc[1, 2]

Chained indexing is not recommended: an assignment may be applied to a temporary copy instead of the DataFrame itself
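
A minimal sketch of the recommended form, with a made-up two-row frame. A single .loc call addresses the row and the column together, so the assignment is applied to the DataFrame itself.

```python
import pandas as pd

df = pd.DataFrame({'chinese': [90, 80], 'English': [70, 60]},
                  index=['Zhang San', 'Li Si'])

# One indexing call for both axes: the write lands in df, not in a temporary object.
df.loc['Li Si', 'English'] = 88

value = df.loc['Li Si', 'English']  # label-based read
same = df.iloc[1, 1]                # position-based read of the same cell

print(value, same)
```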

df.loc[['Li Si']]
       chinese  mathematics  English  python  computer
Li Si       48           34       81      54       131

3) DataFrame slice

(1) Column slice

df['mathematics': 'python'] # This slices rows, not columns, so nothing is selected
chinese  mathematics  English  python
df.loc[:, 'mathematics': 'python']
           mathematics  English  python
Zhang San           12       53      12
Li Si               34       81      54
Wang Wu             58       80     133
Zhao Liu            25       25      42
df.iloc[:, 1:3]
           mathematics  English
Zhang San          110      143
Li Si               85       53
Wang Wu             32       29
Zhao Liu            23       50
df[['mathematics', 'English', 'python']]
           mathematics  English  python
Zhang San          110      143      89
Li Si               85       53     138
Wang Wu             32       29     141
Zhao Liu            23       50      50
df.iloc[:, 0:3]
           chinese  mathematics  English
Zhang San       30          110      143
Li Si          116           85       53
Wang Wu         28           32       29
Zhao Liu        91           23       50

(2) Row slice

df['Li Si': 'Wang Wu'] # Closed interval: both endpoints are included
         chinese  mathematics  English  python
Li Si        116           85       53     138
Wang Wu       28           32       29     141

DataFrame index summary

1. Index rows with .loc[]; index columns with square brackets

2. To index an element, index the row first and then the column: df.loc[index, columns]

3. If you want a DataFrame to be returned, use two layers of brackets

Note:

  • When using square brackets directly
    • A single label selects a column
    • A slice selects rows
  • Do not use chained indexing
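
The rules in this summary can be sketched on a small frame. The values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'chinese': [1, 2, 3], 'mathematics': [4, 5, 6], 'English': [7, 8, 9]},
                  index=['Zhang San', 'Li Si', 'Wang Wu'])

col = df['chinese']                        # single label: one column, as a Series
frame = df[['chinese']]                    # list of labels: still a DataFrame
rows = df['Zhang San':'Li Si']             # slice in plain brackets: rows, endpoints included
cols = df.loc[:, 'chinese':'mathematics']  # a column slice needs .loc
cell = df.loc['Li Si', 'mathematics']      # row first, then column

print(type(col).__name__, type(frame).__name__, rows.shape, cols.shape, cell)
```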

============================================

Exercise 5:

Index and slice ddd using many methods, and compare the differences

============================================

ddd

1. Index Zhang San's English score (the result must be a DataFrame)

2. Slice the scores from chinese to mathematics

3. Assign 108 to Li Si's English score

ddd.loc[['English'], ['Zhang San']]
ddd['chinese': 'mathematics']
ddd.loc['chinese': 'mathematics']
ddd.iloc[0:2]
ddd.loc['English', 'Li Si'] = 108

4) Operation of DataFrame

(1) Operations between DataFrames

Same as Series:

  • Data is automatically aligned on the row and column labels during the operation
  • Labels that do not match produce NaN

Operation between a DataFrame and a single number

df 
           chinese  mathematics  English  python
Zhang San       30          110      143      89
Li Si          116           85       53     138
Wang Wu         28           32       29     141
Zhao Liu        91           23       50      50
df + 1
           chinese  mathematics  English  python
Zhang San       31          111      144      90
Li Si          117           86       54     139
Wang Wu         29           33       30     142
Zhao Liu        92           24       51      51

Operation between two DataFrames

# Values are combined only where both the row and column labels match; everything else is filled with NaN
data = np.random.randint(0,150, size=(4,4))
index = ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu']
columns = ['chinese', 'mathematics', 'English','python']
df2 = DataFrame(data=data, index=index, columns=columns)
df2
chinesemathematicsEnglishpython
Zhang San1423eighty-four72
Li Si77384643
Wang Wu33784127
Zhao Liu142482480
df
           chinese  mathematics  English  python
Zhang San       30          110      143      89
Li Si          116           85       53     138
Wang Wu         28           32       29     141
Zhao Liu        91           23       50      50
df2.loc['pseudo-ginseng'] = np.random.randint(0,150, size=4)
df2
                chinese  mathematics  English  python
Zhang San            75           91       81      88
Li Si               133            1      128      81
Wang Wu              13           86       92      50
Zhao Liu             51           64       23     113
pseudo-ginseng      146           58       20      42
df + df2 
                chinese  mathematics  English  python
Zhang San         105.0        201.0    224.0   177.0
Li Si             249.0         86.0    181.0   219.0
Wang Wu            41.0        118.0    121.0   191.0
pseudo-ginseng      NaN          NaN      NaN     NaN
Zhao Liu          142.0         87.0     73.0   163.0
df.add(df2, fill_value=0)
                chinese  mathematics  English  python
Zhang San         105.0        201.0    224.0   177.0
Li Si             249.0         86.0    181.0   219.0
Wang Wu            41.0        118.0    121.0   191.0
pseudo-ginseng    146.0         58.0     20.0    42.0
Zhao Liu          142.0         87.0     73.0   163.0

The following table lists the Python operators and the corresponding pandas methods:

Python operator    pandas method(s)
+                  add()
-                  sub(), subtract()
*                  mul(), multiply()
/                  truediv(), div(), divide()
//                 floordiv()
%                  mod()
**                 pow()
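
As a quick check of this correspondence, each operator and its method should give identical results on the same data. A minimal sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Each comparison pairs a Python operator with the matching pandas method.
same_sub = (df - 1).equals(df.sub(1))
same_floordiv = (df // 2).equals(df.floordiv(2))
same_pow = (df ** 2).equals(df.pow(2))

print(same_sub, same_floordiv, same_pow)
```

The methods are mainly useful when extra arguments such as fill_value or axis are needed, which the operators cannot take.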

(2) Operation between Series and DataFrame

[Important]

  • Using Python operators: the Series index is matched against the DataFrame's column labels, and the operation is applied to every row.
  • Using the pandas operator methods:
    • axis=0: the Series index is matched against the DataFrame's row labels, and the operation is applied to every column.
    • axis=1: the Series index is matched against the DataFrame's column labels, and the operation is applied to every row.
df
           chinese  mathematics  English  python
Zhang San       30          110      143      89
Li Si          116           85       53     138
Wang Wu         28           32       29     141
Zhao Liu        91           23       50      50
s = Series(data=np.random.randint(0,150,size=5), index=['chinese', 'mathematics', 'English', 'python', 'computer'])
s
chinese        140
mathematics    114
English         90
python          82
computer         9
dtype: int32
df + s # The Series index is matched against the DataFrame's column labels; matching columns are combined
           python  mathematics  English  computer  chinese
Zhang San     171          224      233       NaN      170
Li Si         220          199      143       NaN      256
Wang Wu       223          146      119       NaN      168
Zhao Liu      132          137      140       NaN      231
s = Series(data=np.random.randint(0,150,size=4), index=['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu'])
s
Zhang San    122
Li Si        109
Wang Wu       84
Zhao Liu      45
dtype: int32
df + s   # By default the operator matches the Series index against the DataFrame's column labels, so nothing lines up here
           python  Zhang San  mathematics  Li Si  Wang Wu  English  chinese  Zhao Liu
Zhang San     NaN        NaN          NaN    NaN      NaN      NaN      NaN       NaN
Li Si         NaN        NaN          NaN    NaN      NaN      NaN      NaN       NaN
Wang Wu       NaN        NaN          NaN    NaN      NaN      NaN      NaN       NaN
Zhao Liu      NaN        NaN          NaN    NaN      NaN      NaN      NaN       NaN
df.add(s, axis=0) # Use axis to change the direction of the operation
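
Since the data above is random, here is a small deterministic sketch of the same three cases. The names per_column and per_row and all values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'chinese': [10, 20], 'mathematics': [30, 40]},
                  index=['Zhang San', 'Li Si'])

per_column = pd.Series({'chinese': 1, 'mathematics': 2})  # labels match df.columns
per_row = pd.Series({'Zhang San': 100, 'Li Si': 200})     # labels match df.index

a = df + per_column          # operator: Series labels are matched to the columns
b = df.add(per_row, axis=0)  # axis=0: Series labels are matched to the row index
c = df + per_row             # operator with row labels: nothing matches, all NaN

print(a)
print(b)
print(c)
```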

Summary

1. DataFrame with a single number: the operation is applied to each element separately

2. DataFrame with DataFrame: values with matching row and column labels are combined, and everything else is filled with NaN

3. DataFrame with Series: when an operator is used, the Series index is matched against the DataFrame's column labels by default

4. To keep the unmatched data or to change the direction of the operation, use the pandas operator methods

============================================

Exercise 6:

  1. If ddd holds the midterm results and ddd2 the final results, create ddd2 freely, add it to ddd, and compute the average of the midterm and final results.

  2. Suppose Zhang San is found cheating in mathematics in the midterm exam, so his score should be recorded as 0. How can this be done?

  3. Li Si is rewarded for reporting Zhang San's cheating with 100 extra points in every midterm subject. How can this be done?

  4. Later, the teacher finds that one question was wrong and, to calm the students down, gives every student 10 extra points in every subject. How can this be done?

============================================

df
           chinese  mathematics  English  python
Zhang San       30          110      143      89
Li Si          116           85       53     138
Wang Wu         28           32       29     141
Zhao Liu        91           23       50      50
df2 = df.copy()
# values cannot be assigned directly; the next line would raise AttributeError
# df2.values = np.random.randint(0, 150, size=df.shape)
df2.values
df2 = DataFrame(data=np.random.randint(0, 150, size=df.shape), index=df.index, columns=df.columns)
df2
df
(df + df2) / 2
df.loc['Zhang San', 'mathematics'] = 0
df
df.loc['Li Si'] += 100
df
df + 10

Handling missing data

import numpy as np
import pandas  as  pd

There are two kinds of missing data:

  • None
  • np.nan (NaN)

1. None

None is built into Python, and its type is a Python object (NoneType). Therefore, None cannot take part in any arithmetic.

type(None)
NoneType
n = np.array([1,1,2,3, None])
n
array([1, 1, 2, 3, None], dtype=object)
n.sum()
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-4-20b8964b5fcc> in <module>
----> 1 n.sum()


d:\1903\.venv\lib\site-packages\numpy\core\_methods.py in _sum(a, axis, dtype, out, keepdims, initial, where)
     36 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
     37          initial=_NoValue, where=True):
---> 38     return umr_sum(a, axis, dtype, out, keepdims, initial, where)
     39 
     40 def _prod(a, axis=None, dtype=None, out=None, keepdims=False,


TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

Operations on the object dtype are much slower than on int dtypes
Time the sum for different dtypes:
%timeit np.arange(1e5, dtype=xxx).sum()

%timeit  np.arange(1e5, dtype=np.int32).sum()
253 µs ± 30.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit  np.arange(1e5, dtype=np.float64).sum()
270 µs ± 24.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit  np.arange(1e5, dtype=object).sum()  # the np.object alias was removed from newer NumPy versions; the builtin object is equivalent
10.4 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

2. np.nan(NaN)

np.nan is a floating-point value and can take part in arithmetic, but the result is always NaN.

type(np.nan)
float
n = np.array([1,1,2,3, np.nan])
n.sum()
nan

However, the np.nan*() family of functions (such as np.nansum) skips NaN values. For np.nansum, this is the same as treating NaN as 0.

np.nansum(n)
7.0
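
A short sketch of a few members of that family on the same array. It also shows that "treated as 0" only describes the sum: np.nanmean leaves NaN out of the count instead.

```python
import numpy as np

n = np.array([1, 1, 2, 3, np.nan])

total = np.nansum(n)    # NaN contributes nothing: 1 + 1 + 2 + 3
mean = np.nanmean(n)    # averaged over the 4 non-NaN values only
largest = np.nanmax(n)  # NaN is skipped when looking for the maximum

print(total, mean, largest)  # 7.0 1.75 3.0
```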

3. None and NaN in pandas

1) None and np.nan in pandas are regarded as np.nan

from pandas import Series, DataFrame
data = np.random.randint(0,150, size=(4,4))
index = ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu']
columns = ['chinese', 'mathematics', 'English','python']
df = DataFrame(data=data, index=index, columns=columns)
df
chinesemathematicsEnglishpython
Zhang San1051497032
Li Si1184411724
Wang Wu901111208
Zhao Liu8253522
df.loc['Zhang San', 'chinese'] = None
df
chinesemathematicsEnglishpython
Zhang SanNaN1497032
Li Si118.04411724
Wang Wu90.01111208
Zhao Liu82.053522

Modify DataFrame data using DataFrame row index and column index

df.loc['Li Si', 'chinese'] = np.nan
df
chinesemathematicsEnglishpython
Zhang SanNaN1497032
Li SiNaN4411724
Wang Wu90.01111208
Zhao Liu82.053522
df.loc['Zhang San', 'python'] = np.nan
df
           chinese  mathematics  English  python
Zhang San      NaN          131       94     NaN
Li Si          NaN            5      108   124.0
Wang Wu       54.0           74       36   144.0
Zhao Liu      33.0          117       34    37.0

2) Operation of None and np.nan in pandas

  • isnull(): check which values are NaN
  • notnull(): check which values are not NaN
  • dropna(): filter out missing data
  • fillna(): fill in missing data

(1) Detection functions

  • isnull()
  • notnull()
pd.isnull(df)
           chinese  mathematics  English  python
Zhang San     True        False    False    True
Li Si         True        False    False   False
Wang Wu      False        False    False   False
Zhao Liu     False        False    False   False
pd.notnull(df)
           chinese  mathematics  English  python
Zhang San    False         True     True   False
Li Si        False         True     True    True
Wang Wu       True         True     True    True
Zhao Liu      True         True     True    True
df.isnull()
           chinese  mathematics  English  python
Zhang San     True        False    False    True
Li Si         True        False    False   False
Wang Wu      False        False    False   False
Zhao Liu     False        False    False   False
# Combine with any() to check whether rows or columns contain NaN
# The default is axis=0, which checks each column for NaN
df.isnull().any(axis=0)
chinese         True
mathematics    False
English        False
python          True
dtype: bool
# Check whether each row contains NaN
df.isnull().any(axis=1)
Zhang San     True
Li Si         True
Wang Wu      False
Zhao Liu     False
dtype: bool

(2) Filter function

  • dropna()
df
chinesemathematicsEnglishpython
Zhang SanNaN1497032
Li SiNaN4411724
Wang Wu90.01111208
Zhao Liu82.053522
# axis=0 drops rows, axis=1 drops columns
# how='all' drops a row only when all of its values are NaN
df.dropna(axis=0, how='all')
chinesemathematicsEnglishpython
Zhang SanNaN1497032
Li SiNaN4411724
Wang Wu90.01111208
Zhao Liu82.053522
df.dropna(axis=0, how='any', inplace=True)
df
          chinese  mathematics  English  python
Wang Wu      64.0           47       97   148.0
Zhao Liu    125.0           75      113    97.0
df
           chinese  mathematics  English  python
Zhang San      NaN          131       94     NaN
Li Si          NaN            5      108   124.0
Wang Wu       54.0           74       36   144.0
Zhao Liu      33.0          117       34    37.0
df.dropna(axis=0, how='any',subset=['chinese', 'mathematics', 'English'])
          chinese  mathematics  English  python
Wang Wu      54.0           74       36   144.0
Zhao Liu     33.0          117       34    37.0

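You can choose whether to drop rows or columns (rows by default)

You can also choose the filtering rule with how: 'any' (the default) drops a row or column containing any NaN, while 'all' drops it only when every value is NaN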
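
The options can be compared side by side on a small fixed frame. This is a sketch with made-up values; none of the calls modifies df because inplace is not used.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'chinese': [np.nan, np.nan, 54.0],
                   'mathematics': [131, 5, 74],
                   'python': [np.nan, 124.0, 144.0]},
                  index=['Zhang San', 'Li Si', 'Wang Wu'])

any_rows = df.dropna(how='any')           # drop rows with at least one NaN
all_rows = df.dropna(how='all')           # drop rows only if every value is NaN
no_nan_cols = df.dropna(axis=1)           # drop columns containing any NaN
by_subset = df.dropna(subset=['python'])  # only the 'python' column is inspected

print(any_rows.index.tolist())
print(no_nan_cols.columns.tolist())
print(by_subset.index.tolist())
```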

(3) Fill function (Series/DataFrame)

  • fillna()
df
           chinese  mathematics  English  python
Zhang San      NaN          131       94     NaN
Li Si          NaN            5      108   124.0
Wang Wu       54.0           74       36   144.0
Zhao Liu      33.0          117       34    37.0
# Fill with the specified value
df.fillna(value=100)
           chinese  mathematics  English  python
Zhang San    100.0          131       94   100.0
Li Si        100.0            5      108   124.0
Wang Wu       54.0           74       36   144.0
Zhao Liu      33.0          117       34    37.0
df
           chinese  mathematics  English  python
Zhang San      NaN          131       94     NaN
Li Si          NaN            5      108   124.0
Wang Wu       54.0           74       36   144.0
Zhao Liu      33.0          117       34    37.0
# Fill with existing data
# method can be 'backfill', 'bfill', 'pad' or 'ffill'
df.fillna(axis=0, method='bfill', limit=1)
           chinese  mathematics  English  python
Zhang San      NaN          131       94   124.0
Li Si         54.0            5      108   124.0
Wang Wu       54.0           74       36   144.0
Zhao Liu      33.0          117       34    37.0
df.fillna(axis=1, method='bfill', inplace=True)
---------------------------------------------------------------------------

NotImplementedError                       Traceback (most recent call last)

C:\Users\ADMINI~1\AppData\Local\Temp/ipykernel_7776/2566829753.py in <module>
----> 1 df.fillna(axis=1, method='bfill', inplace=True)


c:\users\administrator\appdata\local\programs\python\python39\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper


c:\users\administrator\appdata\local\programs\python\python39\lib\site-packages\pandas\core\frame.py in fillna(self, value, method, axis, inplace, limit, downcast)
   5174         downcast=None,
   5175     ) -> DataFrame | None:
-> 5176         return super().fillna(
   5177             value=value,
   5178             method=method,


c:\users\administrator\appdata\local\programs\python\python39\lib\site-packages\pandas\core\generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6324             if not self._mgr.is_single_block and axis == 1:
   6325                 if inplace:
-> 6326                     raise NotImplementedError()
   6327                 result = self.T.fillna(method=method, limit=limit).T
   6328 


NotImplementedError: 

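You can choose whether to fill forward (ffill/pad) or backward (bfill/backfill). Recent pandas versions deprecate the method parameter of fillna() in favor of the ffill() and bfill() methods.

For a DataFrame, also choose the axis along which to fill. Remember, for a DataFrame:

  • axis=0: index / rows
  • axis=1: columns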
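
The effect of the axis on a directional fill can be sketched with the ffill() and bfill() methods, which take the same axis argument. The column names a and b and the row labels are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [139.0, 32.0],
                   'b': [np.nan, 119.0]},
                  index=['r1', 'r2'])

from_left = df.ffill(axis=1)   # along each row: take the value from the column to the left
from_below = df.bfill(axis=0)  # along each column: take the value from the row below

print(from_left.loc['r1', 'b'])   # 139.0
print(from_below.loc['r1', 'b'])  # 119.0
```

Both calls return a new DataFrame and leave df unchanged, which also sidesteps the NotImplementedError shown above for inplace filling along axis=1.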

============================================

Exercise 7:

  1. Briefly describe the difference between None and NaN

  2. Suppose Zhang San and Li Si take a mock exam, but Zhang San skips the English exam because he suddenly wants to ponder the meaning of life, so his score is recorded as None. Create a DataFrame based on this and name it ddd3

  3. The teacher decides to fill Zhang San's English score with his mathematics score. How can this be done?
    How can Zhang San's English score be filled with Li Si's English score instead?

============================================

1. None is a Python object and cannot take part in arithmetic. np.nan is a float and can take part in arithmetic, but the result is always NaN; use the np.nan*() functions to compute a result that skips NaN.
In pandas, None and np.nan are both treated as np.nan.
df
chinesemathematicsEnglishpython
Zhang San1491394674
Li Si3332119116
Wang Wu3311314855
Zhao Liu1311182127
df.loc['Zhang San', 'English'] = np.nan
df
chinesemathematicsEnglishpython
Zhang San149139NaN74
Li Si3332119.0116
Wang Wu33113148.055
Zhao Liu1311182.0127
df.fillna(axis=1, method='pad')
chinesemathematicsEnglishpython
Zhang San149.0139.0139.074.0
Li Si33.032.0119.0116.0
Wang Wu33.0113.0148.055.0
Zhao Liu131.011.082.0127.0
df.fillna(axis=0, method='bfill', inplace=True)
df
chinesemathematicsEnglishpython
Zhang San149139119.074
Li Si3332119.0116
Wang Wu33113148.055
Zhao Liu1311182.0127

Posted by poisedforflight on Wed, 13 Oct 2021 08:51:25 -0700