Data structure of Pandas
Import pandas
import numpy as np
import pandas as pd
from pandas import Series
1. Series
A Series is an object similar to a one-dimensional array. It consists of two parts:
- values: the data (an ndarray)
- index: the index labels associated with the data
1) Creation of Series
There are two creation methods:
(1) Created from a list or NumPy array. If no index is passed, the default is an integer index from 0 to N-1
Created from a list
l = [1, 2, 3, 4, 5]
s = Series(l, index=list('abcde'))
s
a 1 b 2 c 3 d 4 e 5 dtype: int64
s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
s.values
array([1, 2, 3, 4, 5], dtype=int64)
You can also replace the index afterwards by assigning to the index attribute
s.index = [1, 2, 3, 4, 5]
s
1 1 2 2 3 3 4 4 5 5 dtype: int64
s[1] = 10
s
1 10 2 2 3 3 4 4 5 5 dtype: int64
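n = np.arange(0, 10)
s = Series(n.copy())
s
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 dtype: int32
s[0] = 8
s
0 8 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 dtype: int32
Note in particular that a Series created from an ndarray may reference the array's data rather than copy it. In that case, changing an element of the Series also changes the original ndarray; recent pandas versions with copy-on-write enabled copy the data instead. Passing n.copy(), as above, avoids the problem either way. A Series created from a list never affects the list.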
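The difference between list and ndarray sources can be checked directly. The following is a minimal sketch, assuming NumPy and pandas are installed; the variable names are illustrative. Because memory sharing depends on the pandas version and its copy-on-write setting, the sketch inspects it with `np.shares_memory` instead of assuming a result.

```python
import numpy as np
import pandas as pd

# A Series built from a list always gets its own data.
source_list = [1, 2, 3]
s_list = pd.Series(source_list)
s_list.iloc[0] = 99
print(source_list)  # the list itself is unchanged

# A Series built from an ndarray may share the array's memory.
source_arr = np.arange(3)
s_arr = pd.Series(source_arr)
print("shares memory:", np.shares_memory(source_arr, s_arr.to_numpy()))

# Passing an explicit copy guarantees the array is left alone.
s_safe = pd.Series(source_arr.copy())
s_safe.iloc[0] = 99
print(source_arr)  # unchanged
```

Copying up front costs memory, so it is mainly worth doing when the original array is still needed afterwards.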
(2) Created by dictionary
# The dictionary keys become the index of the Series
s = Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})
s
a 1 b 2 c 3 d 4 dtype: int64
============================================
Exercise 1:
Create the following Series in several ways and name it s1:
chinese 150
mathematics 150
English 150
Comprehensive management 300
============================================
# From a dictionary
s1 = Series({'chinese': 150, 'mathematics': 150, 'English': 150, 'Comprehensive management': 300})
s1
# From a list of values and a list of index labels
data = [150, 150, 150, 300]
index = ['chinese', 'mathematics', 'English', 'Comprehensive management']
s1 = Series(data=data, index=index)
s1
chinese 150 mathematics 150 English 150 Comprehensive management 300 dtype: int64
2) Indexing and slicing of Series
You can use square brackets with a single label to get one element (the element's own type is returned), or with a list of labels to get several elements (a Series is returned). Indexing is divided into explicit indexing and implicit indexing:
(1) Explicit index:
- Use the labels in index as the index value
- Use .loc[] (recommended)
Note that slicing by label is a closed interval: both endpoints are included
s
# Works, but not recommended
s['a']
1
# Recommended
s.loc['a']
1
s.loc['a': 'c'] # Note: the interval is fully closed
a 1 b 2 c 3 dtype: int64
s['a': 'c']
a 1 b 2 c 3 dtype: int64
(2) Implicit index:
- Use integer positions as the index value
- Use .iloc[] (recommended)
Note that slicing by position is a half-open interval: the end position is excluded
s
a 1 b 2 c 3 d 4 dtype: int64
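s[0] # Not recommended: on a labelled Series an integer key falls back to position, which newer pandas versions deprecate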
1
s = Series(np.arange(10))
s
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 dtype: int32
s.index = np.arange(1, 11)
s
1 0 2 1 3 2 4 3 5 4 6 5 7 6 8 7 9 8 10 9 dtype: int32
s[1] # With an integer index, plain brackets look up the label 1, not position 1
0
# Recommended way to index by position
s.iloc[0]
0
s.iloc[0:3] # The slice of an implicit index is left closed and right open
1 0 2 1 3 2 dtype: int32
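The two slicing conventions are easy to mix up, so here is a small self-contained sketch that puts them side by side. The Series contents are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=list("abcd"))

by_label = s.loc["a":"c"]      # label slice: both endpoints are included
by_position = s.iloc[0:2]      # position slice: the end is excluded

print(by_label.tolist())       # [10, 20, 30]
print(by_position.tolist())    # [10, 20]

# With an integer index that does not start at 0, labels and positions differ.
t = pd.Series([10, 20, 30], index=[1, 2, 3])
print(t.loc[1])                # 10, the element labelled 1
print(t.iloc[1])               # 20, the element at position 1
```

Writing .loc or .iloc explicitly removes any doubt about which of the two lookups is meant.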
============================================
Exercise 2:
Index and slice the Series s1 created in exercise 1 in several ways:
Index:
mathematics 150
Slice:
chinese 150
mathematics 150
English 150
============================================
s1
s1.loc['mathematics']
s1.iloc[1]
s1.loc[['mathematics']] # Wrap the label in another pair of brackets to get a Series back instead of a scalar
s1.loc[['mathematics', 'Comprehensive management']]
s1.iloc[[1, 3]]
s1.loc['chinese': 'English']
s1.iloc[0: 3]
3) Basic concepts of Series
A Series can be regarded as a fixed-length, ordered dictionary
You can inspect a Series through its attributes shape, size, index, values, and so on
s
1 0 2 1 3 2 4 3 5 4 6 5 7 6 8 7 9 8 10 9 dtype: int32
s.shape
s.size
s.index
s.values
You can quickly preview a Series with head() and tail()
s.head(3)
1 0 2 1 3 2 dtype: int32
s.tail(4)
7 6 8 7 9 8 10 9 dtype: int32
When an index label has no corresponding value, the missing data is displayed as NaN (not a number). Assigning NaN also converts an integer Series to float64, because NaN is a floating-point value.
s.loc[0] = np.nan
s
1 0.0 2 1.0 3 2.0 4 3.0 5 4.0 6 5.0 7 6.0 8 7.0 9 8.0 10 9.0 0 NaN dtype: float64
You can use the functions pd.isnull() and pd.notnull(), or the Series methods isnull() and notnull(), to detect missing data
pd.isnull(s)
1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 0 True dtype: bool
pd.notnull(s)
s.isnull()
s.notnull()
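The boolean result of these checks is most useful as a mask. A minimal sketch with made-up values, showing how to count the missing entries and keep only the present ones:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=["x", "y", "z"])

mask = s.isnull()               # True where the value is missing
missing_count = int(mask.sum())
present = s[s.notnull()]        # boolean indexing keeps the non-missing values

print(mask.tolist())            # [False, True, False]
print(missing_count)            # 1
print(present.tolist())         # [1.0, 3.0]
```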
A Series has a name attribute
s.name = 'Series s'
s
1 0.0 2 1.0 3 2.0 4 3.0 5 4.0 6 5.0 7 6.0 8 7.0 9 8.0 10 9.0 0 NaN Name: Series s, dtype: float64
# Not recommended: assigning to the class replaces the name attribute for every Series
Series.name = 'Name'
s
1 0.0 2 1.0 3 2.0 4 3.0 5 4.0 6 5.0 7 6.0 8 7.0 9 8.0 10 9.0 0 NaN Name: Name, dtype: float64
4) Series operation
(1) Array operations that work on NumPy arrays also work on a Series
s + 1
(2) Operations between Series
- Data is automatically aligned by index label during the operation
- Labels that do not match are filled with NaN
s
1 0.0 2 1.0 3 2.0 4 3.0 5 4.0 6 5.0 7 6.0 8 7.0 9 8.0 10 9.0 0 NaN Name: Name, dtype: float64
s2 = Series(np.random.randint(0, 10, size=11), index=np.arange(3, 14))
s2
3 0 4 5 5 3 6 9 7 1 8 9 9 1 10 7 11 9 12 6 13 1 dtype: int32
s + s2 # Values with matching labels are added; labels present in only one Series give NaN
0 NaN 1 NaN 2 NaN 3 2.0 4 8.0 5 7.0 6 14.0 7 7.0 8 16.0 9 9.0 10 16.0 11 NaN 12 NaN 13 NaN dtype: float64
- Note: to keep the values for all index labels, use the .add() method with fill_value
# Use the arithmetic method provided by pandas to keep the value for every index label
s.add(s2, fill_value=0)
0 NaN 1 0.0 2 1.0 3 2.0 4 8.0 5 7.0 6 14.0 7 7.0 8 16.0 9 9.0 10 16.0 11 9.0 12 6.0 13 1.0 dtype: float64
============================================
Exercise 3:
- Think about how the rules for Series operations differ from those for ndarray operations.
- Create another Series s2 whose index contains 'Wen Zong', and perform various arithmetic operations with s1. Think about how to keep all the data.
============================================
An ndarray combines elements by position and broadcasts shapes. A Series does not broadcast against another Series: it aligns on index labels and only combines values whose labels match
s2 = Series({'chinese': 108, 'mathematics': 149, 'English': 138, 'Wen Zong': 268})
s2
chinese 108 mathematics 149 English 138 Wen Zong 268 dtype: int64
s2.sum()
663
s1 + s2
mathematics 299.0 Wen Zong NaN Comprehensive management NaN English 288.0 chinese 258.0 dtype: float64
s1.add(s2, fill_value=0) / 2
mathematics 149.5 Wen Zong 134.0 Comprehensive management 150.0 English 144.0 chinese 129.0 dtype: float64
2. DataFrame
A DataFrame is a tabular data structure. It can be regarded as a dictionary of Series that share the same index, made up of multiple columns of data arranged in a fixed order. It was designed to extend Series from one dimension to two. A DataFrame has both a row index and a column index.
- Row index: index
- Column index: columns
- Values: values (a two-dimensional NumPy array)
1) Creation of DataFrame
The most common way to create a DataFrame is to pass a dictionary. The DataFrame takes each dictionary key as a column name and the corresponding dictionary value (an array) as that column's values.
In addition, the DataFrame automatically adds an index for the rows (as a Series does).
As with Series, if a requested column does not match any key of the dictionary, its values are NaN.
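The NaN behaviour for unmatched columns can be sketched as follows. The score values and the extra column name 'physics' are made up for illustration.

```python
import pandas as pd

scores = {"chinese": [90, 80], "mathematics": [70, 60]}

# 'physics' is requested as a column but is not a key of the dictionary,
# so that column is created and filled with NaN.
df = pd.DataFrame(
    scores,
    index=["Zhang San", "Li Si"],
    columns=["chinese", "mathematics", "physics"],
)
print(df)
```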
from pandas import DataFrame
Create from a two-dimensional array
data = np.random.randint(0, 150, size=(4, 4))
index = ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu']
columns = ['chinese', 'mathematics', 'English', 'python']
df = DataFrame(data=data, index=index, columns=columns)
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 104 | 26 | 38 | 52 |
Li Si | 11 | 104 | 88 | 41 |
Wang Wu | 38 | 119 | 36 | 139 |
Zhao Liu | 67 | 93 | 21 | 130 |
Create from a dictionary
df = DataFrame({'chinese': np.random.randint(0, 150, size=4),
                'mathematics': np.random.randint(0, 150, size=4),
                'English': np.random.randint(0, 150, size=4),
                'python': np.random.randint(0, 150, size=4)})
df.index = ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu']
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 30 | 110 | 143 | 89 |
Li Si | 116 | 85 | 53 | 138 |
Wang Wu | 28 | 32 | 29 | 141 |
Zhao Liu | 91 | 23 | 50 | 50 |
DataFrame properties: values, columns, index, shape
df.values
df.columns
df.index
df.shape
============================================
Exercise 4:
Create a DataFrame named ddd from the following test scores:
 | Zhang San | Li Si |
---|---|---|
chinese | 150 | 0 |
mathematics | 150 | 0 |
English | 150 | 0 |
Comprehensive management | 300 | 0 |
============================================
# From a dictionary
ddd = DataFrame({'Zhang San': [150] * 3 + [300], 'Li Si': [0] * 4})
ddd.index = ['chinese', 'mathematics', 'English', 'Comprehensive management']
ddd
# From a two-dimensional list
data = [[150, 0]] * 3 + [[300, 0]]
index = ['chinese', 'mathematics', 'English', 'Comprehensive management']
columns = ['Zhang San', 'Li Si']
ddd = DataFrame(data=data, index=index, columns=columns)
ddd
2) Index of DataFrame
(1) Index columns
- Dictionary style: df['column']
- Attribute style: df.column
A column of a DataFrame is returned as a Series. The returned Series has the same index as the original DataFrame, and its name attribute is set to the column name.
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 118 | 12 | 53 | 12 |
Li Si | 48 | 34 | 81 | 54 |
Wang Wu | 32 | 58 | 80 | 133 |
Zhao Liu | 107 | 25 | 25 | 42 |
df['chinese']
Zhang San 118 Li Si 48 Wang Wu 32 Zhao Liu 107 Name: chinese, dtype: int32
df.chinese
Zhang San 118 Li Si 48 Wang Wu 32 Zhao Liu 107 Name: chinese, dtype: int32
# Add a new column
df['computer'] = np.random.randint(0, 150, size=4)
df
chinese | mathematics | English | python | computer | |
---|---|---|---|---|---|
Zhang San | 118 | 12 | 53 | 12 | 110 |
Li Si | 48 | 34 | 81 | 54 | 131 |
Wang Wu | 32 | 58 | 80 | 133 | 132 |
Zhao Liu | 107 | 25 | 25 | 42 | 129 |
# A new column cannot be added with attribute syntax; pandas only warns and the DataFrame is unchanged
df.science = np.random.randint(0, 150, size=4)
C:\Users\ADMINI~1\AppData\Local\Temp/ipykernel_7776/1072798280.py:2: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
df.science = np.random.randint(0, 150, size=4)
df
chinese | mathematics | English | python | computer | |
---|---|---|---|---|---|
Zhang San | 118 | 12 | 53 | 12 | 110 |
Li Si | 48 | 34 | 81 | 54 | 131 |
Wang Wu | 32 | 58 | 80 | 133 | 132 |
Zhao Liu | 107 | 25 | 25 | 42 | 129 |
(2) Index rows
- Use .loc[] with an index label to select a row
- Use .iloc[] with an integer position to select a row
As with columns, a Series is returned, and its index is the original columns.
# Explicit index
df.loc['Zhang San']
chinese 118 mathematics 12 English 53 python 12 computer 110 Name: Zhang San, dtype: int32
df.iloc[1]
(3) Method of indexing elements
- Use the column index
- Use the row index (for example iloc[3, 1])
- Use the values attribute (a two-dimensional NumPy array)
# Column first, then row
# Do not assign through chained indexing like this
df['English']['Li Si'] = 88
df
# Row first, then column
df.loc['Li Si'].loc['English']
# Recommended
df.loc['Li Si', 'English']
df.iloc[1, 2]
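Chained indexing is not recommended: an assignment made through it may act on a temporary copy and leave the original DataFrame unchanged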
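A single .loc or .iloc call with both the row and the column avoids the problem. A minimal sketch on a made-up two-row table:

```python
import pandas as pd

df = pd.DataFrame(
    {"English": [53, 81], "python": [12, 54]},
    index=["Zhang San", "Li Si"],
)

# One indexing call addresses the cell directly, so the assignment
# is applied to df itself.
df.loc["Li Si", "English"] = 88

by_label = df.loc["Li Si", "English"]
by_position = df.iloc[1, 0]   # row position 1, column position 0
print(by_label, by_position)
```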
df.loc[['Li Si']]
chinese | mathematics | English | python | computer | |
---|---|---|---|---|---|
Li Si | 48 | 34 | 81 | 54 | 131 |
3) DataFrame slice
(1) Column slice
df['mathematics': 'python'] # Not a column slice: a slice inside [] is applied to the rows
chinese | mathematics | English | python |
---|
df.loc[:, 'mathematics': 'python']
mathematics | English | python | |
---|---|---|---|
Zhang San | 12 | 53 | 12 |
Li Si | 34 | 81 | 54 |
Wang Wu | 58 | 80 | 133 |
Zhao Liu | 25 | 25 | 42 |
df.iloc[:, 1:3]
mathematics | English | |
---|---|---|
Zhang San | 110 | 143 |
Li Si | 85 | 53 |
Wang Wu | 32 | 29 |
Zhao Liu | 23 | 50 |
df[['mathematics', 'English', 'python']]
mathematics | English | python | |
---|---|---|---|
Zhang San | 110 | 143 | 89 |
Li Si | 85 | 53 | 138 |
Wang Wu | 32 | 29 | 141 |
Zhao Liu | 23 | 50 | 50 |
df.iloc[:, 0:3]
chinese | mathematics | English | |
---|---|---|---|
Zhang San | 30 | 110 | 143 |
Li Si | 116 | 85 | 53 |
Wang Wu | 28 | 32 | 29 |
Zhao Liu | 91 | 23 | 50 |
(2) Row slice
df['Li Si': 'Wang Wu'] # Row slice by label; both endpoints are included
chinese | mathematics | English | python | |
---|---|---|---|---|
Li Si | 116 | 85 | 53 | 138 |
Wang Wu | 28 | 32 | 29 | 141 |
DataFrame index summary
1. Index rows with .loc[]; index columns with square brackets
2. To index an element, index the row first and then the column: df.loc[index, columns]
3. To get a DataFrame back rather than a Series or a scalar, use two layers of brackets
Note:
- When using square brackets directly
  - A single label selects a column
  - A slice selects rows
- Do not use chained indexing
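The bracket rules in this summary can be sketched on a small made-up table:

```python
import pandas as pd

df = pd.DataFrame(
    {"chinese": [1, 2, 3], "mathematics": [4, 5, 6]},
    index=["Zhang San", "Li Si", "Wang Wu"],
)

column = df["chinese"]           # single label: a column, returned as a Series
rows = df["Li Si":"Wang Wu"]     # slice: rows, both endpoints included
frame = df[["chinese"]]          # list of labels: still a DataFrame
cell = df.loc["Li Si", "mathematics"]   # row first, then column

print(type(column).__name__, type(frame).__name__)
print(rows.index.tolist())
print(cell)
```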
============================================
Exercise 5:
Index and slice ddd using many methods, and compare the differences
============================================
ddd
1. Index Zhang San's English score (the result must be a DataFrame)
2. Slice the scores from chinese to mathematics
3. Assign the value 108 to Li Si's English score
ddd.loc[['chinese'], ['Zhang San']]
ddd['chinese': 'mathematics']
ddd.loc['chinese': 'mathematics']
ddd.iloc[0:2]
ddd.loc['English', 'Li Si'] = 108
4) Operation of DataFrame
(1) Operations between DataFrames
Same as Series:
- Data is automatically aligned by index label during the operation
- Labels that do not match are filled with NaN
Operation between a DataFrame and a single number
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 30 | 110 | 143 | 89 |
Li Si | 116 | 85 | 53 | 138 |
Wang Wu | 28 | 32 | 29 | 141 |
Zhao Liu | 91 | 23 | 50 | 50 |
df + 1
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 31 | 111 | 144 | 90 |
Li Si | 117 | 86 | 54 | 139 |
Wang Wu | 29 | 33 | 30 | 142 |
Zhao Liu | 92 | 24 | 51 | 51 |
Operation between a DataFrame and a DataFrame
# Values are combined only where both the row and column labels match; elsewhere the result is NaN
data = np.random.randint(0, 150, size=(4, 4))
index = ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu']
columns = ['chinese', 'mathematics', 'English', 'python']
df2 = DataFrame(data=data, index=index, columns=columns)
df2
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 142 | 3 | 84 | 72 |
Li Si | 77 | 38 | 46 | 43 |
Wang Wu | 3 | 37 | 84 | 127 |
Zhao Liu | 142 | 48 | 24 | 80 |
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 30 | 110 | 143 | 89 |
Li Si | 116 | 85 | 53 | 138 |
Wang Wu | 28 | 32 | 29 | 141 |
Zhao Liu | 91 | 23 | 50 | 50 |
df2.loc['pseudo-ginseng'] = np.random.randint(0, 150, size=4)
df2
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 75 | 91 | 81 | 88 |
Li Si | 133 | 1 | 128 | 81 |
Wang Wu | 13 | 86 | 92 | 50 |
Zhao Liu | 51 | 64 | 23 | 113 |
pseudo-ginseng | 146 | 58 | 20 | 42 |
df + df2
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 105.0 | 201.0 | 224.0 | 177.0 |
Li Si | 249.0 | 86.0 | 181.0 | 219.0 |
Wang Wu | 41.0 | 118.0 | 121.0 | 191.0 |
pseudo-ginseng | NaN | NaN | NaN | NaN |
Zhao Liu | 142.0 | 87.0 | 73.0 | 163.0 |
df.add(df2, fill_value=0)
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 105.0 | 201.0 | 224.0 | 177.0 |
Li Si | 249.0 | 86.0 | 181.0 | 219.0 |
Wang Wu | 41.0 | 118.0 | 121.0 | 191.0 |
pseudo-ginseng | 146.0 | 58.0 | 20.0 | 42.0 |
Zhao Liu | 142.0 | 87.0 | 73.0 | 163.0 |
The following table maps Python operators to the corresponding pandas methods:
Python Operator | Pandas Method(s) |
---|---|
+ | add() |
- | sub(), subtract() |
* | mul(), multiply() |
/ | truediv(), div(), divide() |
// | floordiv() |
% | mod() |
** | pow() |
(2) Operation between Series and DataFrame
[Important]
- Using Python operators: the Series index is matched against the DataFrame's columns, so the Series is treated as a row and applied to every row.
- Using the pandas arithmetic methods:
  - axis=0: the Series index is matched against the row index, so the Series is treated as a column and applied to every column.
  - axis=1: the Series index is matched against the columns, so the Series is treated as a row and applied to every row (this is the default).
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 30 | 110 | 143 | 89 |
Li Si | 116 | 85 | 53 | 138 |
Wang Wu | 28 | 32 | 29 | 141 |
Zhao Liu | 91 | 23 | 50 | 50 |
s = Series(data=np.random.randint(0, 150, size=5), index=['chinese', 'mathematics', 'English', 'python', 'computer'])
s
chinese 140 mathematics 114 English 90 python 82 computer 9 dtype: int32
df + s # With an operator, the column index of the DataFrame is matched against the index of the Series; matching labels are combined
python | mathematics | English | computer | chinese | |
---|---|---|---|---|---|
Zhang San | 171 | 224 | 233 | NaN | 170 |
Li Si | 220 | 199 | 143 | NaN | 256 |
Wang Wu | 223 | 146 | 119 | NaN | 168 |
Zhao Liu | 132 | 137 | 140 | NaN | 231 |
s = Series(data=np.random.randint(0, 150, size=4), index=['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu'])
s
Zhang San 122 Li Si 109 Wang Wu 84 Zhao Liu 45 dtype: int32
df + s # By default an operator matches the column index of the DataFrame against the index of the Series, so nothing matches here
python | Zhang San | mathematics | Li Si | Wang Wu | English | chinese | Zhao Liu | |
---|---|---|---|---|---|---|---|---|
Zhang San | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Li Si | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Wang Wu | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Zhao Liu | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
df.add(s, axis=0) # Use axis to change the direction of the operation
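The effect of axis can be sketched on a small made-up table, with one Series labelled by column names and another labelled by row names:

```python
import pandas as pd

df = pd.DataFrame(
    {"chinese": [1, 2], "mathematics": [3, 4]},
    index=["Zhang San", "Li Si"],
)
per_column = pd.Series({"chinese": 10, "mathematics": 20})
per_row = pd.Series({"Zhang San": 100, "Li Si": 200})

# Operator: the Series index is matched against the columns.
by_columns = df + per_column

# Method with axis=0: the Series index is matched against the row index.
by_rows = df.add(per_row, axis=0)

print(by_columns)
print(by_rows)
```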
Summary
1. DataFrame and a single number: the operation is applied to each element separately
2. DataFrame and DataFrame: values with the same row and column labels are combined, and labels that do not match give NaN
3. DataFrame and Series: with operators, the column index of the DataFrame is matched against the index of the Series by default
4. To keep unmatched data (fill_value) or change the direction of the operation (axis), use the pandas arithmetic methods
============================================
Exercise 6:
- If ddd holds the midterm results and ddd2 the final results, create ddd2 freely and add it to ddd to find the average of the midterm and final results.
- Suppose Zhang San is found cheating in mathematics in the midterm exam and his score should be recorded as 0. How can this be done?
- Li Si is rewarded for reporting Zhang San's cheating with 100 extra points in every midterm subject. How can this be done?
- Later, the teacher finds that one question was wrong and, to calm the students down, gives every student 10 extra points in every subject. How can this be done?
============================================
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 30 | 110 | 143 | 89 |
Li Si | 116 | 85 | 53 | 138 |
Wang Wu | 28 | 32 | 29 | 141 |
Zhao Liu | 91 | 23 | 50 | 50 |
df2 = df.copy()
# values cannot be assigned directly
df2.values = np.random.randint(0, 150, size=df2.shape)
df2.values
# Build a new DataFrame with the same shape and labels instead
df2 = DataFrame(data=np.random.randint(0, 150, size=df.shape), index=df.index, columns=df.columns)
df2
df
(df + df2) / 2
df.loc['Zhang San', 'mathematics'] = 0
df
df.loc['Li Si'] += 100
df
df + 10
Handling missing data
import numpy as np
import pandas as pd
There are two kinds of missing data:
- None
- np.nan (NaN)
1. None
None is Python's built-in null object; its type is NoneType. In a NumPy array it forces dtype=object, and it cannot take part in arithmetic.
type(None)
NoneType
n = np.array([1, 1, 2, 3, None])
n
array([1, 1, 2, 3, None], dtype=object)
n.sum()
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-4-20b8964b5fcc> in <module> ----> 1 n.sum() d:\1903\.venv\lib\site-packages\numpy\core\_methods.py in _sum(a, axis, dtype, out, keepdims, initial, where) 36 def _sum(a, axis=None, dtype=None, out=None, keepdims=False, 37 initial=_NoValue, where=True): ---> 38 return umr_sum(a, axis, dtype, out, keepdims, initial, where) 39 40 def _prod(a, axis=None, dtype=None, out=None, keepdims=False, TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
Operations on the object dtype are much slower than on the int dtype
Compare the summation time for different dtypes, using the pattern below with xxx replaced by the dtype under test
%timeit np.arange(1e5, dtype=xxx).sum()
%timeit np.arange(1e5, dtype=np.int32).sum()
253 µs ± 30.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.arange(1e5, dtype=np.float64).sum()
270 µs ± 24.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.arange(1e5, dtype=object).sum()  # np.object was removed in NumPy 1.24; use the built-in object
10.4 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
2. np.nan(NaN)
np.nan is a floating-point value and can take part in calculations, but the result of any arithmetic involving it is NaN.
type(np.nan)
float
n = np.array([1, 1, 2, 3, np.nan])
n.sum()
nan
However, you can use the np.nan*() family of functions (np.nansum, np.nanmean, and so on), which ignore NaN values. For np.nansum this is equivalent to treating NaN as 0.
np.nansum(n)
7.0
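The following short sketch contrasts the ordinary reductions with their NaN-aware counterparts on the same array. It also shows that np.nanmean skips the NaN entry entirely instead of counting it as 0.

```python
import numpy as np

n = np.array([1, 1, 2, 3, np.nan])

total = np.sum(n)            # NaN propagates through an ordinary sum
nan_total = np.nansum(n)     # NaN is ignored: 1 + 1 + 2 + 3
nan_mean = np.nanmean(n)     # mean of the four remaining values

print(total)      # nan
print(nan_total)  # 7.0
print(nan_mean)   # 1.75
```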
3. None and NaN in pandas
1) pandas treats both None and np.nan as missing values; in a numeric column, None is converted to NaN
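A minimal sketch of this conversion on a numeric Series with made-up values. A column holding strings or other Python objects may keep None as it is, but isna() still reports it as missing.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, None, np.nan])

print(s.dtype)            # float64: None has been converted to NaN
print(s.isna().tolist())  # [False, True, True]
print(s.sum())            # 1.0, because pandas reductions skip NaN by default
```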
from pandas import Series, DataFrame
data = np.random.randint(0, 150, size=(4, 4))
index = ['Zhang San', 'Li Si', 'Wang Wu', 'Zhao Liu']
columns = ['chinese', 'mathematics', 'English', 'python']
df = DataFrame(data=data, index=index, columns=columns)
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 105 | 149 | 70 | 32 |
Li Si | 118 | 44 | 117 | 24 |
Wang Wu | 90 | 111 | 120 | 8 |
Zhao Liu | 82 | 53 | 5 | 22 |
df.loc['Zhang San', 'chinese'] = None
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | NaN | 149 | 70 | 32 |
Li Si | 118.0 | 44 | 117 | 24 |
Wang Wu | 90.0 | 111 | 120 | 8 |
Zhao Liu | 82.0 | 53 | 5 | 22 |
Modify DataFrame data using DataFrame row index and column index
df.loc['Li Si', 'chinese'] = np.nan
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | NaN | 149 | 70 | 32 |
Li Si | NaN | 44 | 117 | 24 |
Wang Wu | 90.0 | 111 | 120 | 8 |
Zhao Liu | 82.0 | 53 | 5 | 22 |
df.loc['Zhang San', 'python'] = np.nan
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | NaN | 131 | 94 | NaN |
Li Si | NaN | 5 | 108 | 124.0 |
Wang Wu | 54.0 | 74 | 36 | 144.0 |
Zhao Liu | 33.0 | 117 | 34 | 37.0 |
2) Working with None and np.nan in pandas
- isnull(): test, element by element, whether a value is missing
- notnull(): test, element by element, whether a value is present
- dropna(): filter out missing data
- fillna(): fill in missing data
(1) Judgment function
- isnull()
- notnull()
pd.isnull(df)
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | True | False | False | True |
Li Si | True | False | False | False |
Wang Wu | False | False | False | False |
Zhao Liu | False | False | False | False |
pd.notnull(df)
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | False | True | True | False |
Li Si | False | True | True | True |
Wang Wu | True | True | True | True |
Zhao Liu | True | True | True | True |
df.isnull()
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | True | False | False | True |
Li Si | True | False | False | False |
Wang Wu | False | False | False | False |
Zhao Liu | False | False | False | False |
# Combine with any() to find out whether a row or column contains NaN
# The default axis is 0, which checks whether each column contains NaN
df.isnull().any(axis=0)
chinese True mathematics False English False python True dtype: bool
# Check whether each row contains NaN
df.isnull().any(axis=1)
Zhang San True Li Si True Wang Wu False Zhao Liu False dtype: bool
(2) Filter function
- dropna()
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | NaN | 149 | 70 | 32 |
Li Si | NaN | 44 | 117 | 24 |
Wang Wu | 90.0 | 111 | 120 | 8 |
Zhao Liu | 82.0 | 53 | 5 | 22 |
# axis=0 drops rows; axis=1 drops columns
# how='all' drops a row only when all of its values are NaN
df.dropna(axis=0, how='all')
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | NaN | 149 | 70 | 32 |
Li Si | NaN | 44 | 117 | 24 |
Wang Wu | 90.0 | 111 | 120 | 8 |
Zhao Liu | 82.0 | 53 | 5 | 22 |
df.dropna(axis=0, how='any', inplace=True)
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Wang Wu | 64.0 | 47 | 97 | 148.0 |
Zhao Liu | 125.0 | 75 | 113 | 97.0 |
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | NaN | 131 | 94 | NaN |
Li Si | NaN | 5 | 108 | 124.0 |
Wang Wu | 54.0 | 74 | 36 | 144.0 |
Zhao Liu | 33.0 | 117 | 34 | 37.0 |
df.dropna(axis=0, how='any',subset=['chinese', 'mathematics', 'English'])
chinese | mathematics | English | python | |
---|---|---|---|---|
Wang Wu | 54.0 | 74 | 36 | 144.0 |
Zhao Liu | 33.0 | 117 | 34 | 37.0 |
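You can choose whether to filter rows or columns (rows by default)
You can also choose the filtering rule: how='any' (the default) drops a row or column that contains any NaN, while how='all' drops it only when every value is NaN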
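The combinations of axis and how can be compared on a small made-up table in which one row is entirely missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"chinese": [np.nan, np.nan, 90.0], "mathematics": [149.0, np.nan, 111.0]},
    index=["Zhang San", "Li Si", "Wang Wu"],
)

any_rows = df.dropna(axis=0, how="any")   # drop rows containing any NaN
all_rows = df.dropna(axis=0, how="all")   # drop rows that are entirely NaN
any_cols = df.dropna(axis=1, how="any")   # drop columns containing any NaN

print(any_rows.index.tolist())   # ['Wang Wu']
print(all_rows.index.tolist())   # ['Zhang San', 'Wang Wu']
print(any_cols.shape)            # (3, 0): both columns contain a NaN
```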
(3) Fill function Series/DataFrame
- fillna()
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | NaN | 131 | 94 | NaN |
Li Si | NaN | 5 | 108 | 124.0 |
Wang Wu | 54.0 | 74 | 36 | 144.0 |
Zhao Liu | 33.0 | 117 | 34 | 37.0 |
# Fill with the specified value
df.fillna(value=100)
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 100.0 | 131 | 94 | 100.0 |
Li Si | 100.0 | 5 | 108 | 124.0 |
Wang Wu | 54.0 | 74 | 36 | 144.0 |
Zhao Liu | 33.0 | 117 | 34 | 37.0 |
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | NaN | 131 | 94 | NaN |
Li Si | NaN | 5 | 108 | 124.0 |
Wang Wu | 54.0 | 74 | 36 | 144.0 |
Zhao Liu | 33.0 | 117 | 34 | 37.0 |
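# Fill with existing data
# method can be 'backfill', 'bfill', 'pad' or 'ffill'
# (pandas 2.1 and later deprecate the method parameter in favor of df.ffill() and df.bfill())
df.fillna(axis=0, method='bfill', limit=1)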
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | NaN | 131 | 94 | 124.0 |
Li Si | 54.0 | 5 | 108 | 124.0 |
Wang Wu | 54.0 | 74 | 36 | 144.0 |
Zhao Liu | 33.0 | 117 | 34 | 37.0 |
df.fillna(axis=1, method='bfill', inplace=True)
--------------------------------------------------------------------------- NotImplementedError Traceback (most recent call last) C:\Users\ADMINI~1\AppData\Local\Temp/ipykernel_7776/2566829753.py in <module> ----> 1 df.fillna(axis=1, method='bfill', inplace=True) c:\users\administrator\appdata\local\programs\python\python39\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs) 309 stacklevel=stacklevel, 310 ) --> 311 return func(*args, **kwargs) 312 313 return wrapper c:\users\administrator\appdata\local\programs\python\python39\lib\site-packages\pandas\core\frame.py in fillna(self, value, method, axis, inplace, limit, downcast) 5174 downcast=None, 5175 ) -> DataFrame | None: -> 5176 return super().fillna( 5177 value=value, 5178 method=method, c:\users\administrator\appdata\local\programs\python\python39\lib\site-packages\pandas\core\generic.py in fillna(self, value, method, axis, inplace, limit, downcast) 6324 if not self._mgr.is_single_block and axis == 1: 6325 if inplace: -> 6326 raise NotImplementedError() 6327 result = self.T.fillna(method=method, limit=limit).T 6328 NotImplementedError:
You can choose whether to fill forward or backward
For a DataFrame, also choose the axis along which to fill. Remember, for a DataFrame:
- axis=0: index / rows (values are filled down or up within a column)
- axis=1: columns (values are filled left or right within a row)
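The two fill directions can be sketched with the dedicated ffill() and bfill() methods, which newer pandas versions recommend over fillna(method=...). The table below is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"mathematics": [139.0, 32.0], "English": [np.nan, 119.0]},
    index=["Zhang San", "Li Si"],
)

# axis=1: fill along each row, taking the value from the column to the left.
from_left = df.ffill(axis=1)

# axis=0: fill along each column, taking the value from the row below.
from_below = df.bfill(axis=0)

print(from_left.loc["Zhang San", "English"])    # 139.0
print(from_below.loc["Zhang San", "English"])   # 119.0
```

Both calls return a new DataFrame and leave df untouched, which sidesteps the inplace restriction shown in the traceback above.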
============================================
Exercise 7:
- Briefly describe the difference between None and NaN
- Suppose Zhang San and Li Si take a mock exam, but Zhang San skips the English exam because he suddenly wants to ponder the meaning of life, so his score is recorded as None. Create a DataFrame based on this and name it ddd3
- The teacher decides to fill Zhang San's English score with his mathematics score. How can this be done? And how can it be filled with Li Si's English score instead?
============================================
1. None is a Python object and cannot take part in calculations. np.nan is a float and can take part in calculations, but the result is always NaN; use the np.nan*() functions to compute a result that ignores NaN. In pandas, None and np.nan are both treated as np.nan.
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 149 | 139 | 46 | 74 |
Li Si | 33 | 32 | 119 | 116 |
Wang Wu | 33 | 113 | 148 | 55 |
Zhao Liu | 131 | 11 | 82 | 127 |
df.loc['Zhang San', 'English'] = np.nan
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 149 | 139 | NaN | 74 |
Li Si | 33 | 32 | 119.0 | 116 |
Wang Wu | 33 | 113 | 148.0 | 55 |
Zhao Liu | 131 | 11 | 82.0 | 127 |
df.fillna(axis=1, method='pad')
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 149.0 | 139.0 | 139.0 | 74.0 |
Li Si | 33.0 | 32.0 | 119.0 | 116.0 |
Wang Wu | 33.0 | 113.0 | 148.0 | 55.0 |
Zhao Liu | 131.0 | 11.0 | 82.0 | 127.0 |
df.fillna(axis=0, method='bfill', inplace=True)
df
chinese | mathematics | English | python | |
---|---|---|---|---|
Zhang San | 149 | 139 | 119.0 | 74 |
Li Si | 33 | 32 | 119.0 | 116 |
Wang Wu | 33 | 113 | 148.0 | 55 |
Zhao Liu | 131 | 11 | 82.0 | 127 |