pandas data analysis to awesome tutorial [Full Version]

Data structure of Pandas

Import pandas and NumPy

import numpy as np
import pandas as pd
from pandas import Series

1,Series

A Series is an object similar to a one-dimensional array. It consists of two parts:

values: the data (an ndarray)
index: the index labels associated with the data

1) Creation of Series

There are two creation methods:

(1) From a list or NumPy array. If no index is given, the default is an integer index from 0 to N-1

Created from a list

l = [1,2,3,4,5]
s = Series(l, index=list('abcde'))
s

a    1
b    2
c    3
d    4
e    5
dtype: int64

s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

s.values

array([1, 2, 3, 4, 5], dtype=int64)

You can set the index with the index parameter, as above, or reassign it later through the index attribute

s.index = [1,2,3,4,5]
s

1    1
2    2
3    3
4    4
5    5
dtype: int64

s[1] = 10
s

1    10
2     2
3     3
4     4
5     5
dtype: int64

n = np.arange(0,10)
s = Series(n.copy())
s

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32

s[0] = 8
s

0    8
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32

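Note that a Series built directly from an ndarray can reference the array's data instead of copying it. In that case, changing a Series element also changes the original ndarray, which is why the example above passes n.copy(). Whether the data is shared depends on the pandas version and its Copy-on-Write setting. A Series built from a list always gets its own data.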
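The sharing behavior can be checked rather than assumed. The following is a minimal sketch; the names s_copy, s_view and shares are illustrative only, and the value of shares is deliberately not stated because it depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

n = np.arange(5)

# An explicit copy is never linked to the source array.
s_copy = pd.Series(n.copy())
s_copy.iloc[0] = 99
print(n[0])  # n is untouched by the change to s_copy

# Whether Series(n) shares memory with n depends on the pandas version
# and its Copy-on-Write setting, so test it instead of relying on it.
s_view = pd.Series(n)
shares = np.shares_memory(n, s_view.to_numpy())
print(shares)
```

If the original array must stay unchanged, passing a copy as shown is the safe choice in any version.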

(2) Created by dictionary

# The key of the dictionary corresponds to the index of the Series
s = Series({'a':1, 'b':2, 'c':3, 'd': 4})
s

a    1
b    2
c    3
d    4
dtype: int64
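When a dictionary is combined with an explicit index, pandas keeps only the requested labels and marks labels that are not dictionary keys as missing. A small sketch, with made-up data and the illustrative name s_sub:

```python
import pandas as pd

scores = {'a': 1, 'b': 2, 'c': 3}

# 'b' is left out because it is not requested; 'z' is not a key, so it becomes NaN.
s_sub = pd.Series(scores, index=['a', 'c', 'z'])
print(s_sub)
```

Because NaN is a float, the resulting Series has dtype float64 even though the dictionary values are integers.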

============================================

Exercise 1:

Create the following Series, named s1, in several ways:
chinese 150
mathematics 150
English 150
Comprehensive management 300

============================================

# From a dictionary
s1 = Series({'chinese': 150, 'mathematics': 150, 'English': 150, 'Comprehensive management': 300})
s1

data = [150, 150, 150, 300]
index = ['chinese', 'mathematics', 'English', 'Comprehensive management']
s1 = Series(data=data,index=index)
s1

chinese                     150
mathematics                 150
English                     150
Comprehensive management    300
dtype: int64

2) Indexing and slicing of Series

Use brackets with a single label to get one element (the element's own type is returned), or with a list of labels to get several (a Series is returned). Indexes are either explicit or implicit:

(1) Explicit index:

Use the labels in index as the index value
Use .loc[] (recommended)

Note that label slices are closed intervals: both endpoints are included

# Not recommended
s['a']

# Recommended
s.loc['a']

s.loc['a': 'c']  # Note: the interval is fully closed

a    1
b    2
c    3
dtype: int64

s['a': 'c']

a    1
b    2
c    3
dtype: int64

(2) Implicit index:

Use integer positions as the index value
Use .iloc[] (recommended)

Note that positional slices are half-open intervals: the right endpoint is excluded

a    1
b    2
c    3
d    4
dtype: int64

s[0]

s = Series(np.arange(10))
s

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32

s.index = np.arange(1,11)
s

1     0
2     1
3     2
4     3
5     4
6     5
7     6
8     7
9     8
10    9
dtype: int32

s[1]

# Recommended way to use the implicit index
s.iloc[0]

s.iloc[0:3] # The slice of an implicit index is left closed and right open

1    0
2    1
3    2
dtype: int32
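The difference between the two kinds of slice is easiest to see when the labels are themselves integers. A short sketch using the same 1 to 10 labels as above; the names s_demo, by_label and by_position are illustrative:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series(np.arange(10), index=np.arange(1, 11))

# .loc slices by label and includes both endpoints: labels 1, 2, 3.
by_label = s_demo.loc[1:3]

# .iloc slices by position and excludes the right endpoint: positions 1, 2.
by_position = s_demo.iloc[1:3]

print(by_label.tolist())
print(by_position.tolist())
```

The same 1:3 slice therefore returns three elements with .loc and two with .iloc.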

============================================

Exercise 2:

Index and slice the Series s1 created in exercise 1 in several ways:

Index:
mathematics 150

Slice:
chinese 150
mathematics 150
English 150

============================================

s1

s1.loc['mathematics']

s1.iloc[1]

s1.loc[['mathematics']] # Wrap the label in a list to get a Series back instead of a scalar

s1.loc[['mathematics', 'Comprehensive management']]

s1.iloc[[1, 3]]

s1.loc['chinese': 'English']

s1.iloc[0: 3]

3) Basic concepts of Series

A Series can be regarded as an ordered, fixed-length dictionary

You can inspect a Series through attributes such as shape, size, index, and values

1     0
2     1
3     2
4     3
5     4
6     5
7     6
8     7
9     8
10    9
dtype: int32

s.shape

s.size

s.index

s.values

You can quickly preview a Series with head() and tail()

s.head(3)

1    0
2    1
3    2
dtype: int32

s.tail(4)

7     6
8     7
9     8
10    9
dtype: int32

When an index label has no corresponding value, the missing data is displayed as NaN (not a number)

s.loc[0] = np.nan
s

1     0.0
2     1.0
3     2.0
4     3.0
5     4.0
6     5.0
7     6.0
8     7.0
9     8.0
10    9.0
0     NaN
dtype: float64

You can detect missing data with the functions pd.isnull() and pd.notnull(), or with the Series methods isnull() and notnull()

pd.isnull(s)

1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
0      True
dtype: bool

pd.notnull(s)

s.isnull()

s.notnull()
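The boolean results of these checks are usually combined with other operations. A minimal sketch, with made-up data and the illustrative names n_missing and present, that counts the missing values and keeps only the present ones:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([0.0, 1.0, np.nan, 3.0], index=list('abcd'))

# True counts as 1, so summing the mask gives the number of missing values.
n_missing = s_demo.isnull().sum()

# A boolean mask inside brackets keeps only the rows where the mask is True.
present = s_demo[s_demo.notnull()]

print(n_missing)
print(present)
```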

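Each Series instance has a name attribute. Set it on the instance, as in the first example below. Assigning to the class, as in Series.name = 'Name' further down, replaces the attribute for every Series and is best avoided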

s.name = 'Series s'
s

1     0.0
2     1.0
3     2.0
4     3.0
5     4.0
6     5.0
7     6.0
8     7.0
9     8.0
10    9.0
0     NaN
Name: Series s, dtype: float64

Series.name = 'Name'
s

1     0.0
2     1.0
3     2.0
4     3.0
5     4.0
6     5.0
7     6.0
8     7.0
9     8.0
10    9.0
0     NaN
Name: Name, dtype: float64

4) Series operation

(1) NumPy array operations also apply to Series

s + 1

(2) Operations between Series

Data is automatically aligned on the index during the operation
Where an index label exists in only one operand, the result is NaN

1     0.0
2     1.0
3     2.0
4     3.0
5     4.0
6     5.0
7     6.0
8     7.0
9     8.0
10    9.0
0     NaN
Name: Name, dtype: float64

s2 = Series(np.random.randint(0, 10, size=11), index=np.arange(3, 14))
s2

3     0
4     5
5     3
6     9
7     1
8     9
9     1
10    7
11    9
12    6
13    1
dtype: int32

s + s2
# Values with matching labels are added; labels present in only one Series give NaN

0      NaN
1      NaN
2      NaN
3      2.0
4      8.0
5      7.0
6     14.0
7      7.0
8     16.0
9      9.0
10    16.0
11     NaN
12     NaN
13     NaN
dtype: float64

Note: to keep the values for all index labels, use the .add() method with fill_value

# Use the arithmetic method provided by pandas to keep the values for all index labels
s.add(s2, fill_value=0)

0      NaN
1      0.0
2      1.0
3      2.0
4      8.0
5      7.0
6     14.0
7      7.0
8     16.0
9      9.0
10    16.0
11     9.0
12     6.0
13     1.0
dtype: float64
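The effect of fill_value is easier to follow on two very small Series. A sketch with made-up data; left, right, plain and filled are illustrative names:

```python
import pandas as pd

left = pd.Series([1, 2], index=['a', 'b'])
right = pd.Series([10, 20], index=['b', 'c'])

# The operator aligns on labels; 'a' and 'c' exist on one side only, so they give NaN.
plain = left + right

# add() with fill_value=0 substitutes 0 for the side that lacks the label.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```

fill_value only replaces a value that is missing on one side. If the value is missing on both sides, as with label 0 in the output above, the result is still NaN.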

============================================

Exercise 3:

How do the rules for Series operations differ from those for ndarray operations?
Create another Series s2 whose index contains 'Wen Zong', and perform various arithmetic operations with s1. Think about how to keep all the data.

============================================

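ndarray operations pair elements by position and use broadcasting to reconcile different shapes. Series operations broadcast scalars but otherwise align elements by index label: values with the same label are combined, and labels present in only one Series give NaN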
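The contrast shows up when two Series hold the same labels in a different order. A small sketch with made-up values; a, b, by_label and by_position are illustrative names:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'x'])

# Series arithmetic matches labels: x pairs with x, whatever its position.
by_label = a + b

# The underlying ndarrays are added position by position.
by_position = a.to_numpy() + b.to_numpy()

print(by_label)
print(by_position)
```

Here label x gives 1 + 30 with Series alignment, while the first position gives 1 + 10 with plain ndarray addition.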

s2 = Series({'chinese': 108, 'mathematics': 149, 'English': 138, 'Wen Zong': 268})
s2

chinese        108
mathematics    149
English        138
Wen Zong       268
dtype: int64

s2.sum()

s1  + s2

mathematics                 299.0
Wen Zong                      NaN
Comprehensive management      NaN
English                     288.0
chinese                     258.0
dtype: float64

s1.add(s2, fill_value=0) / 2

mathematics                 149.5
Wen Zong                    134.0
Comprehensive management    150.0
English                     144.0
chinese                     129.0
dtype: float64

2,DataFrame

A DataFrame is a tabular data structure. It can be regarded as a dictionary of Series that share the same index, made up of multiple columns of data arranged in a fixed order. It was designed to extend Series from one dimension to two. A DataFrame has both a row index and a column index.

Row index: index
Column index: columns
Values: values (a two-dimensional NumPy array)

1) Creation of DataFrame

The most common way to create a DataFrame is to pass a dictionary. The DataFrame takes each dictionary key as a column name and each dictionary value (an array) as that column's data.

The DataFrame also adds a row index automatically, as Series does.

As with Series, if a requested column does not match any key of the dictionary, its values are NaN.
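The last point can be sketched as follows. The data is made up, and df_demo is an illustrative name; the columns argument asks for a 'python' column that the dictionary does not provide.

```python
import pandas as pd

data = {'chinese': [90, 80], 'mathematics': [70, 60]}

# 'mathematics' is not requested, so it is left out.
# 'python' is requested but is not a dictionary key, so the column is all NaN.
df_demo = pd.DataFrame(
    data,
    index=['Zhang San', 'Li Si'],
    columns=['chinese', 'python'],
)
print(df_demo)
```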

from pandas import DataFrame

Create from a two-dimensional array

data = np.random.randint(0,150, size=(4,4))
index = ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu']
columns = ['chinese', 'mathematics', 'English','python']
df = DataFrame(data=data, index=index, columns=columns)
df

	chinese	mathematics	English	python
Zhang San	104	26	38	52
Li Si	11	104	88	41
Wang Wu	38	119	36	139
Zhao Liu	67	93	21	130

Create using dictionary

df = DataFrame({'chinese': np.random.randint(0,150, size=4), 'mathematics': np.random.randint(0,150, size=4), 'English': np.random.randint(0,150, size=4), 'python': np.random.randint(0,150, size=4)},)
df.index = ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu']
df

	chinese	mathematics	English	python
Zhang San	30	110	143	89
Li Si	116	85	53	138
Wang Wu	28	32	29	141
Zhao Liu	91	23	50	50

DataFrame properties: values, columns, index, shape

df.values

df.columns

df.index

df.shape

============================================

Exercise 4:

Create a DataFrame named ddd from the following test scores:

	Zhang San	Li Si
chinese	150	0
mathematics	150	0
English	150	0
Comprehensive management	300	0

============================================

# From a dictionary
ddd = DataFrame({'Zhang San': [150] * 3 + [300], 'Li Si': [0] * 4})
ddd.index = ['chinese', 'mathematics', 'English', 'Comprehensive management']
ddd

data = [[150,0]] * 3 + [[300, 0]]
index = ['chinese', 'mathematics', 'English', 'Comprehensive management']
columns = ['Zhang San', 'Li Si']
ddd = DataFrame(data=data, index=index, columns=columns)
ddd

2) Index of DataFrame

(1) Index columns

Dictionary-style, with brackets
Attribute-style, with a dot

A DataFrame column is returned as a Series. The returned Series has the same index as the original DataFrame, and its name attribute is set to the column name.

df

	chinese	mathematics	English	python
Zhang San	118	12	53	12
Li Si	48	34	81	54
Wang Wu	32	58	80	133
Zhao Liu	107	25	25	42

df['chinese']

Zhang San    118
Li Si         48
Wang Wu       32
Zhao Liu     107
Name: chinese, dtype: int32

df.chinese

Zhang San    118
Li Si         48
Wang Wu       32
Zhao Liu     107
Name: chinese, dtype: int32

# Add a new column
df['computer'] = np.random.randint(0,150, size=4)
df

	chinese	mathematics	English	python	computer
Zhang San	118	12	53	12	110
Li Si	48	34	81	54	131
Wang Wu	32	58	80	133	132
Zhao Liu	107	25	25	42	129

# Attribute-style assignment does not create a new column
df.comprehensive_management = np.random.randint(0,150, size=4)

C:\Users\ADMINI~1\AppData\Local\Temp/ipykernel_7776/1072798280.py:2: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  df.comprehensive_management = np.random.randint(0,150, size=4)

df

	chinese	mathematics	English	python	computer
Zhang San	118	12	53	12	110
Li Si	48	34	81	54	131
Wang Wu	32	58	80	133	132
Zhao Liu	107	25	25	42	129
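The unchanged table above shows that the attribute assignment did not add a column. A self-contained sketch of the same behavior, with made-up data; total and total_ok are hypothetical names chosen for illustration:

```python
import warnings
import pandas as pd

df_demo = pd.DataFrame({'chinese': [90, 80]}, index=['Zhang San', 'Li Si'])

# Attribute assignment only sets a plain Python attribute on the object.
# pandas warns about it, so the warning is captured here to keep the output clean.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    df_demo.total = [1, 2]

# Bracket assignment is the way to create a column.
df_demo['total_ok'] = [1, 2]

print(list(df_demo.columns))
```

Attribute access is convenient for reading an existing column, but new columns need the bracket form.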

(2) Index rows

Use .loc[] with index labels to index rows
Use .iloc[] with integer positions to index rows

As with columns, a Series is returned, and its index is the original columns.

# Explicit index
df.loc['Zhang San']

chinese        118
mathematics     12
English         53
python          12
computer       110
Name: Zhang San, dtype: int32

df.iloc[1]

(3) Method of indexing elements

Use column indexing, then row indexing
Use row indexing (for example, iloc[3, 1])
Use the values attribute (a two-dimensional NumPy array)

# Column first, then row
# Avoid chained-index assignment like this: it may modify a temporary copy instead of df
df['English']['Li Si'] = 88

df

# Row first, then column
df.loc['Li Si'].loc['English']

# Recommended
df.loc['Li Si', 'English']

df.iloc[1, 2]

Chained indexing is not recommended
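A single .loc or .iloc call that names both the row and the column reads or writes the element in one step, which avoids the ambiguity of chained assignment. A minimal sketch with made-up scores; df_demo and value are illustrative names:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'chinese': [90, 80], 'English': [70, 60]},
    index=['Zhang San', 'Li Si'],
)

# One indexing call: row label first, then column label.
df_demo.loc['Li Si', 'English'] = 88

# The positional equivalent: row position first, then column position.
df_demo.iloc[0, 0] = 95

# .at is a scalar-only accessor for reading or writing a single cell.
value = df_demo.at['Li Si', 'English']
print(value)
```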

df.loc[['Li Si']]

	chinese	mathematics	English	python	computer
Li Si	48	34	81	54	131

3) DataFrame slice

(1) Column slice

df['mathematics': 'python'] # Not a column slice: a slice in plain brackets selects rows

	chinese	mathematics	English	python

df.loc[:, 'mathematics': 'python']

	mathematics	English	python
Zhang San	12	53	12
Li Si	34	81	54
Wang Wu	58	80	133
Zhao Liu	25	25	42

df.iloc[:, 1:3]

	mathematics	English
Zhang San	110	143
Li Si	85	53
Wang Wu	32	29
Zhao Liu	23	50

df[['mathematics', 'English', 'python']]

	mathematics	English	python
Zhang San	110	143	89
Li Si	85	53	138
Wang Wu	32	29	141
Zhao Liu	23	50	50

df.iloc[:, 0:3]

	chinese	mathematics	English
Zhang San	30	110	143
Li Si	116	85	53
Wang Wu	28	32	29
Zhao Liu	91	23	50

(2) Row slice

df['Li Si': 'Wang Wu'] # Row slice by label; both endpoints are included

	chinese	mathematics	English	python
Li Si	116	85	53	138
Wang Wu	28	32	29	141

DataFrame index summary

1. Index rows with .loc; index columns with plain brackets

2. To index an element, give the row first and then the column: df.loc[index, columns]

3. To get a DataFrame back instead of a Series, pass a list (two layers of brackets)

Note:

When using square brackets directly
- A single label is a column index
- A slice is a row slice
Do not use chained indexing
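The rules in this summary can be checked on a small frame. A sketch with made-up scores; df_demo, col, frame, rows and same_rows are illustrative names:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'chinese': [30, 116, 28, 91], 'mathematics': [110, 85, 32, 23]},
    index=['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu'],
)

col = df_demo['chinese']            # single label: a column, returned as a Series
frame = df_demo[['chinese']]        # list of labels: a DataFrame with one column
rows = df_demo['Li Si':'Wang Wu']   # slice: rows, both endpoints included

# The explicit .loc form of the same row slice.
same_rows = df_demo.loc['Li Si':'Wang Wu']

print(type(col).__name__, type(frame).__name__)
print(rows)
```

Writing the row slice with .loc makes the intent clearer to a reader than relying on the plain-bracket rule.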

============================================

Exercise 5:

Index and slice ddd using many methods, and compare the differences

============================================

ddd

1. Index Zhang San's English score (the result must be a DataFrame)

2. Slice the rows from chinese to mathematics

3. Assign 108 to Li Si's English score

ddd.loc[['English'], ['Zhang San']]

ddd['chinese': 'mathematics']

ddd.loc['chinese': 'mathematics']

ddd.iloc[0:2]

ddd.loc['English', 'Li Si'] = 108

4) Operation of DataFrame

(1) Operations between DataFrames

As with Series:

Data is automatically aligned on both the row and column indexes
Where a label exists in only one operand, the result is NaN

Operations between a DataFrame and a scalar

df

	chinese	mathematics	English	python
Zhang San	30	110	143	89
Li Si	116	85	53	138
Wang Wu	28	32	29	141
Zhao Liu	91	23	50	50

df + 1

	chinese	mathematics	English	python
Zhang San	31	111	144	90
Li Si	117	86	54	139
Wang Wu	29	33	30	142
Zhao Liu	92	24	51	51

Operations between two DataFrames

# Values are combined only where both the row and column labels match; everywhere else the result is NaN
data = np.random.randint(0,150, size=(4,4))
index = ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu']
columns = ['chinese', 'mathematics', 'English','python']
df2 = DataFrame(data=data, index=index, columns=columns)
df2

	chinese	mathematics	English	python
Zhang San	142	3	84	72
Li Si	77	38	46	43
Wang Wu	3	37	84	127
Zhao Liu	142	48	24	80

df

	chinese	mathematics	English	python
Zhang San	30	110	143	89
Li Si	116	85	53	138
Wang Wu	28	32	29	141
Zhao Liu	91	23	50	50

df2.loc['pseudo-ginseng'] = np.random.randint(0,150, size=4)
df2

	chinese	mathematics	English	python
Zhang San	75	91	81	88
Li Si	133	1	128	81
Wang Wu	13	86	92	50
Zhao Liu	51	64	23	113
pseudo-ginseng	146	58	20	42

df + df2

	chinese	mathematics	English	python
Zhang San	105.0	201.0	224.0	177.0
Li Si	249.0	86.0	181.0	219.0
Wang Wu	41.0	118.0	121.0	191.0
pseudo-ginseng	NaN	NaN	NaN	NaN
Zhao Liu	142.0	87.0	73.0	163.0

df.add(df2, fill_value=0)

	chinese	mathematics	English	python
Zhang San	105.0	201.0	224.0	177.0
Li Si	249.0	86.0	181.0	219.0
Wang Wu	41.0	118.0	121.0	191.0
pseudo-ginseng	146.0	58.0	20.0	42.0
Zhao Liu	142.0	87.0	73.0	163.0

The following table maps Python operators to the corresponding pandas methods:

Python Operator	Pandas Method(s)
+	add()
-	sub(), subtract()
*	mul(), multiply()
/	truediv(), div(), divide()
//	floordiv()
%	mod()
**	pow()

(2) Operation between Series and DataFrame

[important]

With Python operators, the Series index is matched against the DataFrame's column labels, and the operation is applied to every row.
With the pandas operator methods:
- axis=0: the Series index is matched against the row index, and the operation is applied to every column.
- axis=1: the Series index is matched against the column labels, and the operation is applied to every row (this is the default).

df

	chinese	mathematics	English	python
Zhang San	30	110	143	89
Li Si	116	85	53	138
Wang Wu	28	32	29	141
Zhao Liu	91	23	50	50

s = Series(data=np.random.randint(0,150,size=5), index=['chinese', 'mathematics', 'English', 'python', 'computer'])
s

chinese        140
mathematics    114
English         90
python          82
computer         9
dtype: int32

df + s # The Series index is matched against the DataFrame's columns; matching labels are added, and the rest become NaN

	python	mathematics	English	computer	chinese
Zhang San	171	224	233	NaN	170
Li Si	220	199	143	NaN	256
Wang Wu	223	146	119	NaN	168
Zhao Liu	132	137	140	NaN	231

s = Series(data=np.random.randint(0,150,size=4), index=['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu'])
s

Zhang San    122
Li Si        109
Wang Wu       84
Zhao Liu      45
dtype: int32

df + s   # By default, operators match the DataFrame's columns against the Series index, so nothing matches here

	python	Zhang San	mathematics	Li Si	Wang Wu	English	chinese	Zhao Liu
Zhang San	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Li Si	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Wang Wu	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Zhao Liu	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

df.add(s, axis=0) # Use axis to change the direction of the operation
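The three cases above can be reproduced with fixed numbers, which makes the results easy to verify by hand. A sketch with made-up data; per_column, per_row and the result names are illustrative:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'chinese': [10, 20], 'mathematics': [30, 40]},
    index=['Zhang San', 'Li Si'],
)
per_column = pd.Series({'chinese': 1, 'mathematics': 2})
per_row = pd.Series({'Zhang San': 100, 'Li Si': 200})

# Operator: Series labels are matched against the columns.
by_columns = df_demo + per_column

# Method with axis=0: Series labels are matched against the row index.
by_rows = df_demo.add(per_row, axis=0)

# Operator with row labels: nothing matches the columns, so every cell is NaN.
mismatch = df_demo + per_row

print(by_columns)
print(by_rows)
print(mismatch)
```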

Summary

1. DataFrame and scalar: the operation is applied to each element separately

2. DataFrame and DataFrame: values with matching row and column labels are combined, and the rest are filled with NaN

3. DataFrame and Series: with operators, the DataFrame's column index is matched against the Series index by default

4. To keep the original data where labels do not match, or to change the direction of the operation, use the pandas methods (add(), sub(), and so on)

============================================

Exercise 6:

If ddd holds the mid-term results and ddd2 the final results, create ddd2 freely and combine it with ddd to find the average of the mid-term and final scores.
Suppose Zhang San is found cheating in math in the mid-term exam, so his score should be recorded as 0. How can this be done?
Li Si is rewarded for reporting Zhang San's cheating with 100 extra points in every mid-term subject. How can this be done?
Later, the teacher finds that one question was wrong and, to calm the students down, gives each student 10 extra points in every subject. How can this be done?

============================================

df

	chinese	mathematics	English	python
Zhang San	30	110	143	89
Li Si	116	85	53	138
Wang Wu	28	32	29	141
Zhao Liu	91	23	50	50

df2 = df.copy()

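# values cannot be assigned directly; this raises AttributeError
df2.values = np.random.randint(0, 150, size=df.shape)

df2.values

# Build a new DataFrame with the same shape, index and columns instead
df2 = DataFrame(data=np.random.randint(0, 150, size=df.shape), index=df.index, columns=df.columns)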
df2

df

(df + df2) / 2

df.loc['Zhang San', 'mathematics'] = 0

df

df.loc['Li Si'] += 100

df

df + 10

Handling missing data

import numpy as np
import pandas as pd

There are two kinds of missing data:

None
np.nan (NaN)

1. None

None is built into Python and is an ordinary Python object (of type NoneType), so it cannot take part in arithmetic.

type(None)

NoneType

n = np.array([1,1,2,3, None])
n

array([1, 1, 2, 3, None], dtype=object)

n.sum()

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-4-20b8964b5fcc> in <module>
----> 1 n.sum()


d:\1903\.venv\lib\site-packages\numpy\core\_methods.py in _sum(a, axis, dtype, out, keepdims, initial, where)
     36 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
     37          initial=_NoValue, where=True):
---> 38     return umr_sum(a, axis, dtype, out, keepdims, initial, where)
     39 
     40 def _prod(a, axis=None, dtype=None, out=None, keepdims=False,


TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

Operations on object arrays are much slower than on int arrays.
Compare the summation time for different data types:
%timeit np.arange(1e5, dtype=xxx).sum()

%timeit  np.arange(1e5, dtype=np.int32).sum()

253 µs ± 30.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit  np.arange(1e5, dtype=np.float64).sum()

270 µs ± 24.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.arange(1e5, dtype=object).sum()  # the np.object alias was removed in NumPy 1.24; use object

10.4 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

2. np.nan(NaN)

np.nan is a float, so it can take part in arithmetic, but the result is always NaN.

type(np.nan)

float

n = np.array([1,1,2,3, np.nan])
n.sum()

nan

However, the np.nan*() family of functions (np.nansum(), np.nanmean(), and so on) ignores NaN. For np.nansum(), this is equivalent to treating NaN as 0.

np.nansum(n)

7.0
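Other members of the same family behave the same way. A short sketch on the array used above; the result names are illustrative:

```python
import numpy as np

n = np.array([1, 1, 2, 3, np.nan])

total = np.nansum(n)     # sum of the four real values
average = np.nanmean(n)  # mean over the four real values, not five
largest = np.nanmax(n)   # NaN is skipped when searching for the maximum

print(total, average, largest)
```

Note that np.nanmean() divides by the number of non-NaN values, so it is not the same as replacing NaN with 0 and then taking the mean.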

3. None and NaN in pandas

1) pandas treats both None and np.nan as missing values; in a numeric column, None is stored as NaN

from pandas import Series, DataFrame

data = np.random.randint(0,150, size=(4,4))
index = ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu']
columns = ['chinese', 'mathematics', 'English','python']
df = DataFrame(data=data, index=index, columns=columns)
df

	chinese	mathematics	English	python
Zhang San	105	149	70	32
Li Si	118	44	117	24
Wang Wu	90	111	120	8
Zhao Liu	82	53	5	22

df.loc['Zhang San', 'chinese'] = None

df

	chinese	mathematics	English	python
Zhang San	NaN	149	70	32
Li Si	118.0	44	117	24
Wang Wu	90.0	111	120	8
Zhao Liu	82.0	53	5	22

Modify DataFrame data using DataFrame row index and column index

df.loc['Li Si', 'chinese'] = np.nan

df

	chinese	mathematics	English	python
Zhang San	NaN	149	70	32
Li Si	NaN	44	117	24
Wang Wu	90.0	111	120	8
Zhao Liu	82.0	53	5	22

df.loc['Zhang San', 'python'] = np.nan

df

	chinese	mathematics	English	python
Zhang San	NaN	131	94	NaN
Li Si	NaN	5	108	124.0
Wang Wu	54.0	74	36	144.0
Zhao Liu	33.0	117	34	37.0
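The same conversion can be seen on a small numeric Series. A sketch with made-up values; s_num is an illustrative name:

```python
import numpy as np
import pandas as pd

# None and np.nan both end up as NaN, and the dtype becomes float64.
s_num = pd.Series([1, None, np.nan])

print(s_num.dtype)
print(s_num.isnull().tolist())

# Unlike a NumPy object array containing None, the Series can still be summed:
# pandas skips missing values by default.
print(s_num.sum())
```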

2) Operation of None and np.nan in pandas

isnull(): test whether each value is missing
notnull(): test whether each value is present
dropna(): drop missing data
fillna(): fill in missing data

(1) Judgment function

isnull()
notnull()

pd.isnull(df)

	chinese	mathematics	English	python
Zhang San	True	False	False	True
Li Si	True	False	False	False
Wang Wu	False	False	False	False
Zhao Liu	False	False	False	False

pd.notnull(df)

	chinese	mathematics	English	python
Zhang San	False	True	True	False
Li Si	False	True	True	True
Wang Wu	True	True	True	True
Zhao Liu	True	True	True	True

df.isnull()

	chinese	mathematics	English	python
Zhang San	True	False	False	True
Li Si	True	False	False	False
Wang Wu	False	False	False	False
Zhao Liu	False	False	False	False

# Combine with any() to check whether a row or column contains NaN
# The default axis=0 checks each column
df.isnull().any(axis=0)

chinese         True
mathematics    False
English        False
python          True
dtype: bool

# Check whether each row contains NaN
df.isnull().any(axis=1)

Zhang San     True
Li Si         True
Wang Wu      False
Zhao Liu     False
dtype: bool

(2) Filter function

dropna()

df

	chinese	mathematics	English	python
Zhang San	NaN	149	70	32
Li Si	NaN	44	117	24
Wang Wu	90.0	111	120	8
Zhao Liu	82.0	53	5	22

# axis=0 drops rows; axis=1 drops columns
# how='all' drops a row only when every value in it is NaN
df.dropna(axis=0, how='all')

	chinese	mathematics	English	python
Zhang San	NaN	149	70	32
Li Si	NaN	44	117	24
Wang Wu	90.0	111	120	8
Zhao Liu	82.0	53	5	22

df.dropna(axis=0, how='any', inplace=True)

df

	chinese	mathematics	English	python
Wang Wu	64.0	47	97	148.0
Zhao Liu	125.0	75	113	97.0

df

	chinese	mathematics	English	python
Zhang San	NaN	131	94	NaN
Li Si	NaN	5	108	124.0
Wang Wu	54.0	74	36	144.0
Zhao Liu	33.0	117	34	37.0

df.dropna(axis=0, how='any',subset=['chinese', 'mathematics', 'English'])

	chinese	mathematics	English	python
Wang Wu	54.0	74	36	144.0
Zhao Liu	33.0	117	34	37.0

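dropna() filters rows by default (axis=0); pass axis=1 to filter columns

You can also choose the rule: how='any' (the default) drops a row or column that contains any NaN, while how='all' drops it only when every value is NaN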
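The options can be compared side by side on a frame where the outcome is easy to predict. A sketch with made-up data, in which Zhang San has no scores at all and Li Si is missing only chinese; all variable names are illustrative:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {
        'chinese': [np.nan, np.nan, 90.0],
        'mathematics': [np.nan, 44.0, 111.0],
        'English': [np.nan, 117.0, 120.0],
    },
    index=['Zhang San', 'Li Si', 'Wang Wu'],
)

any_rows = df_demo.dropna()              # default how='any': only Wang Wu is complete
all_rows = df_demo.dropna(how='all')     # only the fully empty row is dropped
any_cols = df_demo.dropna(axis=1)        # every column has a NaN, so none remain
subset_rows = df_demo.dropna(subset=['mathematics', 'English'])  # check these columns only

print(any_rows)
print(all_rows)
print(any_cols.shape)
print(subset_rows)
```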

(3) Fill function (Series and DataFrame)

fillna()

df

	chinese	mathematics	English	python
Zhang San	NaN	131	94	NaN
Li Si	NaN	5	108	124.0
Wang Wu	54.0	74	36	144.0
Zhao Liu	33.0	117	34	37.0

# Fill with the specified value
df.fillna(value=100)

	chinese	mathematics	English	python
Zhang San	100.0	131	94	100.0
Li Si	100.0	5	108	124.0
Wang Wu	54.0	74	36	144.0
Zhao Liu	33.0	117	34	37.0

df

	chinese	mathematics	English	python
Zhang San	NaN	131	94	NaN
Li Si	NaN	5	108	124.0
Wang Wu	54.0	74	36	144.0
Zhao Liu	33.0	117	34	37.0

# Fill with neighboring data
# method accepts 'backfill', 'bfill', 'pad', 'ffill'; it is deprecated since pandas 2.1 in favor of df.bfill() and df.ffill()
df.fillna(axis=0, method='bfill', limit=1)

	chinese	mathematics	English	python
Zhang San	NaN	131	94	124.0
Li Si	54.0	5	108	124.0
Wang Wu	54.0	74	36	144.0
Zhao Liu	33.0	117	34	37.0

df.fillna(axis=1, method='bfill', inplace=True)

---------------------------------------------------------------------------

NotImplementedError                       Traceback (most recent call last)

C:\Users\ADMINI~1\AppData\Local\Temp/ipykernel_7776/2566829753.py in <module>
----> 1 df.fillna(axis=1, method='bfill', inplace=True)


c:\users\administrator\appdata\local\programs\python\python39\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper


c:\users\administrator\appdata\local\programs\python\python39\lib\site-packages\pandas\core\frame.py in fillna(self, value, method, axis, inplace, limit, downcast)
   5174         downcast=None,
   5175     ) -> DataFrame | None:
-> 5176         return super().fillna(
   5177             value=value,
   5178             method=method,


c:\users\administrator\appdata\local\programs\python\python39\lib\site-packages\pandas\core\generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
   6324             if not self._mgr.is_single_block and axis == 1:
   6325                 if inplace:
-> 6326                     raise NotImplementedError()
   6327                 result = self.T.fillna(method=method, limit=limit).T
   6328 


NotImplementedError:

You can choose whether to fill forward or backward

For a DataFrame, also choose the axis of the fill. Remember, for a DataFrame:

axis=0: fill along the index, that is, down each column
axis=1: fill along the columns, that is, across each row
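The two axes can be compared on a frame with a single missing cell. This sketch uses the bfill() method directly instead of fillna(method='bfill'); the data is made up, and down and across are illustrative names:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {'chinese': [np.nan, 54.0], 'mathematics': [131.0, 74.0]},
    index=['Zhang San', 'Wang Wu'],
)

# axis=0: look down the column, so the gap takes the value from the next row.
down = df_demo.bfill(axis=0)

# axis=1: look across the row, so the gap takes the value from the next column.
across = df_demo.bfill(axis=1)

print(down)
print(across)
```

Zhang San's chinese score becomes 54.0 in the first case and 131.0 in the second. ffill() works the same way in the opposite direction.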

============================================

Exercise 7:

Briefly describe the difference between None and NaN
Suppose Zhang San and Li Si take a mock exam, but Zhang San gives up the English exam because he suddenly wants to understand life, so his score is recorded as None. Create a DataFrame for this and name it ddd3
The teacher decides to fill Zhang San's English score with his math score. How can this be done?
How can Zhang San's English score be filled with Li Si's English score?

============================================

1. None is a Python object and cannot take part in calculations. np.nan is a float and can take part in calculations, but the result is always NaN; use the np.nan*() functions to compute a value that ignores NaN.
In pandas, both None and np.nan are handled as np.nan.

df

	chinese	mathematics	English	python
Zhang San	149	139	46	74
Li Si	33	32	119	116
Wang Wu	33	113	148	55
Zhao Liu	131	11	82	127

df.loc['Zhang San', 'English'] = np.nan

df

	chinese	mathematics	English	python
Zhang San	149	139	NaN	74
Li Si	33	32	119.0	116
Wang Wu	33	113	148.0	55
Zhao Liu	131	11	82.0	127

df.fillna(axis=1, method='pad')

	chinese	mathematics	English	python
Zhang San	149.0	139.0	139.0	74.0
Li Si	33.0	32.0	119.0	116.0
Wang Wu	33.0	113.0	148.0	55.0
Zhao Liu	131.0	11.0	82.0	127.0

df.fillna(axis=0, method='bfill', inplace=True)

df

	chinese	mathematics	English	python
Zhang San	149	139	119.0	74
Li Si	33	32	119.0	116
Wang Wu	33	113	148.0	55
Zhao Liu	131	11	82.0	127

Posted by poisedforflight on Wed, 13 Oct 2021 08:51:25 -0700
