pandas notes 004
4. The Index object and basic Index operations
import pandas as pd
import numpy as np
1. The Index object
1.1 Series and DataFrame
Indexes in Series and DataFrame are Index objects.
Series:
#Create a Series from a range and specify the index labels
pd1 = pd.Series(range(5),index = ['A','B','C','D','E'])
print(pd1)
print("="*20)
print(type(pd1.index)) #The index of a Series is an Index object
A    0
B    1
C    2
D    3
E    4
dtype: int64
====================
<class 'pandas.core.indexes.base.Index'>
DataFrame:
#Create a DataFrame from a two-dimensional array and specify the row and column labels
pd2 = pd.DataFrame(np.arange(9).reshape(3,3),index=['A','B','C'],columns=['M','N','Q'])
print(pd2)
print("="*20)
print(type(pd2.index)) #The row index of a DataFrame is an Index object
   M  N  Q
A  0  1  2
B  3  4  5
C  6  7  8
====================
<class 'pandas.core.indexes.base.Index'>
1.2 Index objects are immutable
pd1.index[1] = 2 #Raises TypeError: a single label cannot be reassigned
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-1226982f94cb> in <module>
----> 1 pd1.index[1] = 2

F:\Anaconda_all\Anaconda\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   4275     @final
   4276     def __setitem__(self, key, value):
-> 4277         raise TypeError("Index does not support mutable operations")
   4278
   4279     def __getitem__(self, key):

TypeError: Index does not support mutable operations
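Since single labels cannot be reassigned, labels are usually changed by building a renamed copy or by replacing the whole index at once. A minimal sketch of both options; the names `s`, `renamed` and `mutated` are illustrative and not part of the notes above:

```python
import pandas as pd

s = pd.Series(range(5), index=['A', 'B', 'C', 'D', 'E'])

# Assigning to one position of the Index raises TypeError
try:
    s.index[1] = 'Z'
    mutated = True
except TypeError:
    mutated = False

# rename() returns a new Series with the relabelled index; s is untouched
renamed = s.rename(index={'B': 'Z'})

# Assigning a complete list of labels replaces the Index object as a whole
s.index = ['a', 'b', 'c', 'd', 'e']

print(renamed.index.tolist())
print(s.index.tolist())
```

The Index object itself stays immutable in both cases; what changes is which Index the Series points to.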
1.3 common Index types
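- Index, the generic index type
- Int64Index, integer index (older pandas only; since pandas 2.0 integer labels are stored in a plain Index with an integer dtype)
- MultiIndex, hierarchical index
- DatetimeIndex, timestamp index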
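The types listed above can be inspected directly. The sketch below builds one index of each kind and prints its class and dtype; the variable names are illustrative. The class reported for the integer case depends on the installed pandas version (Int64Index before 2.0, plain Index afterwards), so the sketch only relies on its dtype.

```python
import pandas as pd

labels = pd.Index(['A', 'B', 'C'])                       # generic Index
ints = pd.Index([1, 2, 3])                               # integer labels
multi = pd.MultiIndex.from_tuples([('x', 1), ('x', 2), ('y', 1)])  # hierarchical
dates = pd.date_range('2021-01-01', periods=3, freq='D')  # DatetimeIndex

for idx in (labels, ints, multi, dates):
    print(type(idx).__name__, idx.dtype)
```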
2. Basic operations on an index
- reindex
- add
- delete
- modify
- check (query)
- advanced indexing
2.1 reindex
2.1.1 Series index
ps1 = pd.Series(range(5),index = ['A','B','C','D','E'])
ps1
A    0
B    1
C    2
D    3
E    4
dtype: int64
ps2 = ps1.reindex(['b','A','C','d','E','F']) #Rebuild row index
print(ps1) #The original Series is unchanged
print("="*30)
print(ps2) #Labels missing from the original index get NaN; existing labels keep their original values, whatever the order
A    0
B    1
C    2
D    3
E    4
dtype: int64
==============================
b    NaN
A    0.0
C    2.0
d    NaN
E    4.0
F    NaN
dtype: float64
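The NaN values above also force the result to float64. If a default value makes sense for the missing labels, `reindex` accepts a `fill_value` argument. A small sketch reusing the same labels (the name `filled` is illustrative):

```python
import pandas as pd

ps1 = pd.Series(range(5), index=['A', 'B', 'C', 'D', 'E'])

# Labels that do not exist in ps1 receive fill_value instead of NaN,
# so the values stay integers
filled = ps1.reindex(['b', 'A', 'C', 'd', 'E', 'F'], fill_value=0)
print(filled)
```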
2.1.2 DataFrame index
ps3 = pd.DataFrame(np.arange(12).reshape(3,4),index=['A','B','C'],columns=['a','b','c','d'])
ps3
   a  b   c   d
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
Rebuild row index:
#Rebuild row index
ps4 = ps3.reindex(['e','B','A'])
print(ps3) #The original DataFrame is unchanged
print("="*20)
print(ps4)
   a  b   c   d
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
====================
     a    b    c    d
e  NaN  NaN  NaN  NaN
B  4.0  5.0  6.0  7.0
A  0.0  1.0  2.0  3.0
Rebuild column index:
#Rebuild column index
ps5 = ps3.reindex(columns = ['b','c','q','v'])
print(ps3) #The original DataFrame is unchanged
print("="*20)
print(ps5)
   a  b   c   d
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
====================
   b   c   q   v
A  1   2 NaN NaN
B  5   6 NaN NaN
C  9  10 NaN NaN
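Rows and columns can also be rebuilt in a single call by passing both `index=` and `columns=`. The sketch below is an illustration on the same data; the labels 'X' and 'z' and the fill value -1 are made up for the example.

```python
import numpy as np
import pandas as pd

ps3 = pd.DataFrame(np.arange(12).reshape(3, 4),
                   index=['A', 'B', 'C'], columns=['a', 'b', 'c', 'd'])

# Reorder and extend both axes at once; cells with no source value get -1
both = ps3.reindex(index=['C', 'A', 'X'], columns=['d', 'a', 'z'], fill_value=-1)
print(both)
```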
2.2 add
2.2.1 Series index
p1 = pd.Series(range(5),index = ['A','B','C','D','E'])
p1
A    0
B    1
C    2
D    3
E    4
dtype: int64
Modify the original Series in place:
#Assigning to a new label adds an element to the original Series
p1['F'] = 9
p1
A    0
B    1
C    2
D    3
E    4
F    9
dtype: int64
Leave the original Series unchanged:
#Build a new Series without changing the original one
#(Series.append, used for this in older pandas, was removed in pandas 2.0; pd.concat gives the same result)
s1 = pd.Series({'g':666})
p2 = pd.concat([p1, s1])
print(p1) #Original Series unchanged
print("="*20)
print(p2)
A    0
B    1
C    2
D    3
E    4
F    9
dtype: int64
====================
A      0
B      1
C      2
D      3
E      4
F      9
g    666
dtype: int64
2.2.2 DataFrame index
Add column
#DataFrame index
q = pd.DataFrame(np.arange(12).reshape(3,4),index=['A','B','C'],columns=['a','b','c','d'])
q
   a  b   c   d
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
Plain bracket assignment works on columns: assigning to a new column label appends the column at the far right and modifies the original DataFrame.
q['t'] = 9 #Every value in the new column t is 9
print(q)
print("="*20)
q['y'] = [10,12,14] #Specify one value per row for the new column
print(q)
print("="*20)
q['m'] = ['19','32','24'] #The values of the new column can also be strings
print(q)
   a  b   c   d  t
A  0  1   2   3  9
B  4  5   6   7  9
C  8  9  10  11  9
====================
   a  b   c   d  t   y
A  0  1   2   3  9  10
B  4  5   6   7  9  12
C  8  9  10  11  9  14
====================
   a  b   c   d  t   y   m
A  0  1   2   3  9  10  19
B  4  5   6   7  9  12  32
C  8  9  10  11  9  14  24
Add a new column to the specified location (insert)
#Add a new column at a specified location
u = pd.DataFrame(np.arange(12).reshape(3,4),index=['A','B','C'],columns=['a','b','c','d'])
u
   a  b   c   d
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
insert modifies the original DataFrame.
u.insert(0,'t',2) #Insert column t at position 0; every value is 2
print(u)
print("="*20)
u.insert(1,'r',[6,66,666]) #Insert column r at position 1
print(u)
print("="*20)
u.insert(2,'s',['7','77','777']) #Insert column s at position 2
print(u)
   t  a  b   c   d
A  2  0  1   2   3
B  2  4  5   6   7
C  2  8  9  10  11
====================
   t    r  a  b   c   d
A  2    6  0  1   2   3
B  2   66  4  5   6   7
C  2  666  8  9  10  11
====================
   t    r    s  a  b   c   d
A  2    6    7  0  1   2   3
B  2   66   77  4  5   6   7
C  2  666  777  8  9  10  11
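Both bracket assignment and `insert` change the DataFrame they are called on. When the original should stay as it is, `DataFrame.assign` is one option: it returns a copy with the new columns appended on the right. A minimal sketch (the name `u2` is illustrative):

```python
import numpy as np
import pandas as pd

u = pd.DataFrame(np.arange(12).reshape(3, 4),
                 index=['A', 'B', 'C'], columns=['a', 'b', 'c', 'd'])

# assign() leaves u untouched and returns a new DataFrame
u2 = u.assign(t=2, r=[6, 66, 666])

print(u.columns.tolist())
print(u2.columns.tolist())
```

Unlike `insert`, `assign` cannot choose the position of the new column; it always appends.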
Add row
#Add row
qt = pd.DataFrame(np.arange(12).reshape(3,4),index=['A','B','C'],columns=['a','b','c','d'])
qt
   a  b   c   d
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
Use label index loc:
#Assigning through the label index loc modifies the original DataFrame
qt.loc['D'] = [1,11,111,1111] #Add row D
qt
   a   b    c     d
A  0   1    2     3
B  4   5    6     7
C  8   9   10    11
D  1  11  111  1111
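Append a row without changing the original (pd.concat; DataFrame.append in older pandas)
row = {'a':6,'b':6,'c':6,'d':6}
#DataFrame.append was removed in pandas 2.0; with the older qt.append(row, ignore_index=True), leaving out ignore_index raised an error for a dict row
qt1 = pd.concat([qt, pd.DataFrame([row])], ignore_index=True) #ignore_index=True discards the original row labels and renumbers the rows from 0
print(qt) #Original DataFrame unchanged
print("="*20)
print(qt1)
   a   b    c     d
A  0   1    2     3
B  4   5    6     7
C  8   9   10    11
D  1  11  111  1111
====================
   a   b    c     d
0  0   1    2     3
1  4   5    6     7
2  8   9   10    11
3  1  11  111  1111
4  6   6    6     6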
2.3 delete
2.3.1 del
del modifies the original object.
Series
k1 = pd.Series(range(5),index = ['A','B','C','D','E'])
k1
A    0
B    1
C    2
D    3
E    4
dtype: int64
del k1['A'] #Delete the element labelled A
k1
B    1
C    2
D    3
E    4
dtype: int64
DataFrame
k2 = pd.DataFrame(np.arange(12).reshape(3,4),index=['A','B','C'],columns=['a','b','c','d'])
k2
   a  b   c   d
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11
del k2['b'] #Delete column b
k2
   a   c   d
A  0   2   3
B  4   6   7
C  8  10  11
2.3.2 drop
drop leaves the original object unchanged and returns a new object with the given labels removed.
Series
kt1 = pd.Series(range(4),index = ['A','B','C','D'])
kt1
A    0
B    1
C    2
D    3
dtype: int64
Delete one element:
#Delete one element along the axis
kt2 = kt1.drop('A')
print(kt1) #The original object is unchanged
print("="*20)
print(kt2)
A    0
B    1
C    2
D    3
dtype: int64
====================
B    1
C    2
D    3
dtype: int64
Delete several elements:
#Delete several elements by passing a list of labels
kt3 = kt1.drop(['A','C'])
print(kt1) #The original object is unchanged
print("="*20)
print(kt3)
A    0
B    1
C    2
D    3
dtype: int64
====================
B    1
D    3
dtype: int64
DataFrame
tj1 = pd.DataFrame(np.arange(16).reshape(4,4),index=['A','B','C','D'],columns=['m','n','o','p'])
tj1
    m   n   o   p
A   0   1   2   3
B   4   5   6   7
C   8   9  10  11
D  12  13  14  15
Delete rows by default (axis=0)
#Delete rows by default (axis=0)
tj2 = tj1.drop('B') #Delete one row
print(tj1) #The original object is unchanged
print("="*20)
print(tj2)
print("="*20)
tj3 = tj1.drop(['A','C']) #Delete several rows
print(tj1) #The original object is unchanged
print("="*20)
print(tj3)
    m   n   o   p
A   0   1   2   3
B   4   5   6   7
C   8   9  10  11
D  12  13  14  15
====================
    m   n   o   p
A   0   1   2   3
C   8   9  10  11
D  12  13  14  15
====================
    m   n   o   p
A   0   1   2   3
B   4   5   6   7
C   8   9  10  11
D  12  13  14  15
====================
    m   n   o   p
B   4   5   6   7
D  12  13  14  15
Delete columns (axis=1 or axis = 'columns')
#Delete columns (axis=1 or axis='columns')
tj4 = tj1.drop('m',axis=1) #Delete one column
print(tj1)
print("="*20)
print(tj4)
print("="*20)
tj5 = tj1.drop(['m','o'],axis='columns') #Delete several columns
print(tj1)
print("="*20)
print(tj5)
    m   n   o   p
A   0   1   2   3
B   4   5   6   7
C   8   9  10  11
D  12  13  14  15
====================
    n   o   p
A   1   2   3
B   5   6   7
C   9  10  11
D  13  14  15
====================
    m   n   o   p
A   0   1   2   3
B   4   5   6   7
C   8   9  10  11
D  12  13  14  15
====================
    n   p
A   1   3
B   5   7
C   9  11
D  13  15
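As an alternative to the `axis` argument, `drop` also accepts `index=` and `columns=` keywords, which lets rows and columns be removed in one call. The sketch below also shows `errors='ignore'`, which skips labels that do not exist; the label 'zzz' is made up for the example.

```python
import numpy as np
import pandas as pd

tj1 = pd.DataFrame(np.arange(16).reshape(4, 4),
                   index=['A', 'B', 'C', 'D'], columns=['m', 'n', 'o', 'p'])

# Name the axis through the keyword instead of axis=0 / axis=1
trimmed = tj1.drop(index=['A'], columns=['m', 'o'])

# A missing label normally raises KeyError; errors='ignore' skips it
safe = tj1.drop(columns=['m', 'zzz'], errors='ignore')

print(trimmed)
print(safe.columns.tolist())
```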
The inplace parameter of drop()
With inplace=True the deletion is done on the original object and no new object is returned (the call returns None).
#inplace=True deletes on the original object and does not return a new object
bt = pd.Series(range(4),index = ['A','B','C','D'])
bt
A    0
B    1
C    2
D    3
dtype: int64
bt.drop('A',inplace=True)
bt
B    1
C    2
D    3
dtype: int64
2.4 modification
2.4.1 Series index
bpr = pd.Series(range(4),index = ['A','B','C','D'])
bpr
A    0
B    1
C    2
D    3
dtype: int64
Label index
bpr['A'] = 666 #Label index
bpr
A    666
B      1
C      2
D      3
dtype: int64
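Position index
bpr[1] = 777 #Position index; treating an integer in plain brackets as a position is deprecated in recent pandas versions, where bpr.iloc[1] = 777 is the explicit form
bpr
A    666
B    777
C      2
D      3
dtype: int64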
2.4.2 DataFrame index
tu1 = pd.DataFrame(np.arange(16).reshape(4,4),index=['A','B','C','D'],columns=['m','n','o','p'])
tu1
    m   n   o   p
A   0   1   2   3
B   4   5   6   7
C   8   9  10  11
D  12  13  14  15
Plain brackets modify a column by default
Object['column'] with a single value
tu1['p'] = 4 #Set every value in column p to 4
tu1
    m   n   o  p
A   0   1   2  4
B   4   5   6  4
C   8   9  10  4
D  12  13  14  4
Object['column'] with a list of values
tu1['n'] = ['2','22','222','2222'] #One value per row, here given as strings
tu1
    m     n   o  p
A   0     2   2  4
B   4    22   6  4
C   8   222  10  4
D  12  2222  14  4
Object.column
#Object.column: same effect as Object['column'] above
tu1.m = [1,2,3,4]
tu1
   m     n   o  p
A  1     2   2  4
B  2    22   6  4
C  3   222  10  4
D  4  2222  14  4
Modify rows using label index loc
#Use label index loc
td1 = pd.DataFrame(np.arange(16).reshape(4,4),index=['A','B','C','D'],columns=['m','n','o','p'])
td1
    m   n   o   p
A   0   1   2   3
B   4   5   6   7
C   8   9  10  11
D  12  13  14  15
loc['row label']
td1.loc['A'] = 666 #Modify row A; every value becomes 666
td1
     m    n    o    p
A  666  666  666  666
B    4    5    6    7
C    8    9   10   11
D   12   13   14   15
Modify a single value
#Modify a single value
td1.loc['B','p'] = 100 #Set the value at row B, column p to 100
td1
     m    n    o    p
A  666  666  666  666
B    4    5    6  100
C    8    9   10   11
D   12   13   14   15
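The row part of `loc` does not have to be a single label: a Boolean condition or a list of labels also works for assignment. The following sketch starts again from a fresh copy of the same 4x4 frame; the conditions and values are chosen only for illustration.

```python
import numpy as np
import pandas as pd

td1 = pd.DataFrame(np.arange(16).reshape(4, 4),
                   index=['A', 'B', 'C', 'D'], columns=['m', 'n', 'o', 'p'])

# Rows where column m is greater than 4 (rows C and D): set column p to 0
td1.loc[td1['m'] > 4, 'p'] = 0

# Several cells at once: rows A and B, columns n and o
td1.loc[['A', 'B'], ['n', 'o']] = -1

print(td1)
```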
2.5 check (query)
2.5.1 Series index
cc = pd.Series(range(4),index = ['A','B','C','D'])
cc
A    0
B    1
C    2
D    3
dtype: int64
Row index
cc['A'] #Label index
0
cc[0] #Position index (deprecated in recent pandas versions; cc.iloc[0] is the explicit form)
0
Slice index
#Position slice
cc[1:4] #Left end included, right end excluded
B    1
C    2
D    3
dtype: int64
#Label slice
cc['B':'D'] #Both ends included
B    1
C    2
D    3
dtype: int64
Non-contiguous index (double brackets)
cc[['A','B']] #Non-contiguous selection by label
A    0
B    1
dtype: int64
cc[[0,1]] #Non-contiguous selection by position (deprecated in recent pandas versions; cc.iloc[[0,1]] is the explicit form)
A    0
B    1
dtype: int64
Boolean index
#The comparison returns True where the condition is met and False otherwise
cc > 2
A    False
B    False
C    False
D     True
dtype: bool
Select the values whose condition is True
cc[cc>2] #Keeps only the elements for which the condition is True
D    3
dtype: int64
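Several conditions can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A short sketch on the same Series, with illustrative variable names:

```python
import pandas as pd

cc = pd.Series(range(4), index=['A', 'B', 'C', 'D'])

both = cc[(cc > 0) & (cc < 3)]      # values strictly between 0 and 3
either = cc[(cc == 0) | (cc > 2)]   # value 0 or values above 2
negated = cc[~(cc > 2)]             # everything that is NOT above 2

print(both)
print(either)
print(negated)
```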
2.5.2 DataFrame index
red = pd.DataFrame(np.arange(16).reshape(4,4),index=['A','B','C','D'],columns=['m','n','o','p'])
red
    m   n   o   p
A   0   1   2   3
B   4   5   6   7
C   8   9  10  11
D  12  13  14  15
Column index
Note:
1. With plain brackets a single label selects a column; passing a row label raises an error.
2. The column must be given by name, not by position (red[0] fails here).
#Column index: plain brackets look the label up among the columns
red['n'] #Selected by column name, not by position
A     1
B     5
C     9
D    13
Name: n, dtype: int32
Take several columns (non-contiguous)
#Take several columns by passing a list of names
red[['m','p']]
    m   p
A   0   3
B   4   7
C   8  11
D  12  15
Take a single value
#Take a single value
red['m']['B'] #The first bracket selects the column, the second the row
4
Slice
#Slice
red[1:3] #A slice in plain brackets selects rows; slicing columns requires loc
   m  n   o   p
B  4  5   6   7
C  8  9  10  11
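The different meanings of plain brackets on a DataFrame are easy to mix up: a label selects a column, while a slice selects rows. The sketch below puts row slices and a column slice side by side; the column slice goes through `loc`, as the comment above suggests. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

red = pd.DataFrame(np.arange(16).reshape(4, 4),
                   index=['A', 'B', 'C', 'D'], columns=['m', 'n', 'o', 'p'])

rows = red[1:3]                # position slice on rows: B and C
rows_by_label = red['B':'C']   # label slice on rows, end label included
cols = red.loc[:, 'n':'o']     # column slice: all rows, columns n to o

print(rows)
print(cols)
```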
2.6 advanced index
- loc label index
- iloc position index
- ix hybrid label and position index (removed in pandas 1.0; use loc or iloc instead)
2.6.1 loc label index
Selects by the index labels (label index)
Series
ts = pd.Series(range(4),index = ['A','B','C','D'])
ts
A    0
B    1
C    2
D    3
dtype: int64
ts.loc['A':'C'] #For a Series, a loc label slice behaves like ts['A':'C']: both ends are included
A    0
B    1
C    2
dtype: int64
DataFrame
green = pd.DataFrame(np.arange(16).reshape(4,4),index=['A','B','C','D'],columns=['m','n','o','p'])
green
    m   n   o   p
A   0   1   2   3
B   4   5   6   7
C   8   9  10  11
D  12  13  14  15
green.loc['A','m'] #Value at row A, column m
0
green.loc['A':'C','m':'n'] #The first argument selects rows (a range or a single row), the second selects columns (a range or a single column)
   m  n
A  0  1
B  4  5
C  8  9
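Besides single labels and label slices, `loc` also takes lists of labels and Boolean conditions for reading data. A small sketch on the same frame; the chosen labels and the condition are only examples.

```python
import numpy as np
import pandas as pd

green = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['A', 'B', 'C', 'D'], columns=['m', 'n', 'o', 'p'])

# Non-contiguous rows and columns by passing lists of labels
corners = green.loc[['A', 'D'], ['m', 'p']]

# Rows where column o is greater than 5, restricted to columns m through n
filtered = green.loc[green['o'] > 5, 'm':'n']

print(corners)
print(filtered)
```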
2.6.2 iloc position index
Works like loc, but selects by integer position (starting at 0) instead of by label
Series
lol = pd.Series(range(4),index = ['A','B','C','D'])
lol
A    0
B    1
C    2
D    3
dtype: int64
lol.iloc[1]
1
lol.iloc[1:3] #Left end included, right end excluded
B    1
C    2
dtype: int64
DataFrame
gto = pd.DataFrame(np.arange(16).reshape(4,4),index=['A','B','C','D'],columns=['m','n','o','p'])
gto
    m   n   o   p
A   0   1   2   3
B   4   5   6   7
C   8   9  10  11
D  12  13  14  15
gto.iloc[0,1] #The first argument is the row position, the second the column position; this takes the value in the first row, second column
1
Position slices include the left end and exclude the right end
gto.iloc[1:3,0:3] #Rows at positions 1 to 2, columns at positions 0 to 2
   m  n   o
B  4  5   6
C  8  9  10
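The list in 2.6 also names ix, which mixed labels and positions. Since ix is no longer available in current pandas, a mixed lookup has to be translated to one of the two remaining indexers. The sketch below shows two ways this might be done; it is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

gto = pd.DataFrame(np.arange(16).reshape(4, 4),
                   index=['A', 'B', 'C', 'D'], columns=['m', 'n', 'o', 'p'])

# Row by label, column by position: turn the position into a label for loc
v1 = gto.loc['B', gto.columns[1]]

# Same cell: turn the row label into a position for iloc
v2 = gto.iloc[gto.index.get_loc('B'), 1]

# Label slice on rows combined with a position slice on columns
sub = gto.loc['A':'B', gto.columns[0:2]]

print(v1, v2)
print(sub)
```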