pandas Index objects and reindexing

Keywords: Python

1, Index

An Index object in pandas stores the axis labels and other metadata, such as the axis name. An Index is immutable: its elements cannot be reassigned by the user. The examples below assume import pandas as pd and import numpy as np.

In [73]: obj = pd.Series(range(3),index = ['a','b','c'])
In [74]: index = obj.index
In [75]: index
Out[75]: Index(['a', 'b', 'c'], dtype='object')
In [76]: index[1:]
Out[76]: Index(['b', 'c'], dtype='object')
In [77]: index[1] = 'f'  # TypeError
In [8]: index.size
Out[8]: 3
In [9]: index.shape
Out[9]: (3,)
In [10]: index.ndim
Out[10]: 1
In [11]: index.dtype
Out[11]: dtype('O')
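
As a small sketch of what immutability means in practice (the variable names here are only illustrative): item assignment raises a TypeError, while operations that appear to change an Index return a new object and leave the original untouched.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

# Item assignment is rejected because an Index is immutable.
try:
    index[1] = 'f'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Methods that "change" labels build a new Index instead.
new_index = index.insert(1, 'f')
# rename returns a new Series with a new index; obj is not modified.
renamed = obj.rename(index={'b': 'f'})

print(assignment_failed)
print(list(index))
print(list(new_index))
print(list(renamed.index))
```

If the labels of an existing object really need to change, assigning a whole new index (obj.index = [...]) replaces the Index object rather than editing it.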

Because Index objects are immutable, they can be shared safely among data structures:

In [78]: labels = pd.Index(np.arange(3))
In [79]: labels
Out[79]: Int64Index([0, 1, 2], dtype='int64')
In [80]: obj2 = pd.Series([2,3.5,0], index=labels)
In [81]: obj2
Out[81]:
0    2.0
1    3.5
2    0.0
dtype: float64
In [82]: obj2.index is labels
Out[82]: True

An Index behaves like a container, so you can test membership with Python's in operator:

In [84]: f2
Out[84]:
key    year     state  pop  debt
order
a      2000   beijing  1.5   NaN
b      2001   beijing  1.7   NaN
c      2002   beijing  3.6   1.0
d      2001  shanghai  2.4   2.0
e      2002  shanghai  2.9   NaN
f      2003  shanghai  3.2   3.0
In [86]: 'c' in f2.index
Out[86]: True
In [88]: 'pop' in f2.columns
Out[88]: True

Importantly, a pandas Index can contain duplicate labels:

In [89]: dup_labels = pd.Index(['foo','foo','bar','bar'])
In [90]: dup_labels
Out[90]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

So, can a DataFrame have duplicate column labels or duplicate index labels?

Yes, it can, but avoid it where possible:

In [91]: f2.index = ['a']*6
In [92]: f2
Out[92]:
key  year     state  pop  debt
a    2000   beijing  1.5   NaN
a    2001   beijing  1.7   NaN
a    2002   beijing  3.6   1.0
a    2001  shanghai  2.4   2.0
a    2002  shanghai  2.9   NaN
a    2003  shanghai  3.2   3.0
In [93]: f2.loc['a']
Out[93]:
key  year     state  pop  debt
a    2000   beijing  1.5   NaN
a    2001   beijing  1.7   NaN
a    2002   beijing  3.6   1.0
a    2001  shanghai  2.4   2.0
a    2002  shanghai  2.9   NaN
a    2003  shanghai  3.2   3.0
In [94]: f2.columns = ['year']*4
In [95]: f2
Out[95]:
   year      year  year  year
a  2000   beijing   1.5   NaN
a  2001   beijing   1.7   NaN
a  2002   beijing   3.6   1.0
a  2001  shanghai   2.4   2.0
a  2002  shanghai   2.9   NaN
a  2003  shanghai   3.2   3.0
In [96]: f2.index.is_unique  # You can use this property to determine whether the index is unique
Out[96]: False
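
One reason to avoid duplicate labels is that label-based selection no longer has a predictable return type. The following is a minimal sketch on a made-up Series (the data and names are illustrative, not from the session above). It also shows one possible way to drop repeated labels with Index.duplicated.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['foo', 'foo', 'bar', 'baz'])

# A unique label returns a scalar; a duplicated label returns a Series.
single = s['baz']
several = s['foo']

# Index.duplicated(keep='first') marks every repeat after the first
# occurrence, so negating the mask keeps one row per label.
deduped = s[~s.index.duplicated(keep='first')]

print(s.index.is_unique)
print(type(several).__name__)
print(deduped.index.tolist())
```

Which occurrence to keep (first, last, or none) depends on the data, so treat keep='first' here as an assumption rather than a recommendation.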

Index objects also support set operations (intersection, union, difference, and symmetric difference, i.e. XOR), similar to Python's built-in set type.
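
A short sketch of these set operations on two illustrative Index objects follows. The method forms are used rather than the &, | and ^ operators, whose meaning on an Index has changed across pandas versions.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

common = left.intersection(right)               # labels in both
either = left.union(right)                      # labels in at least one
only_left = left.difference(right)              # labels in left but not right
exactly_one = left.symmetric_difference(right)  # labels in exactly one (XOR)

print(common.tolist())
print(either.tolist())
print(only_left.tolist())
print(exactly_one.tolist())
```

Each call returns a new Index; neither left nor right is modified.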

2, Reindexing

The reindex method creates a new pandas object whose data conforms to a new index. It does not modify the original in place; it rearranges the existing data to match the labels you pass in.

In [96]: obj=pd.Series([4.5,7.2,-5.3,3.6],index = ['d','b','a','c'])
In [97]: obj
Out[97]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

reindex arranges the values according to the new index. Labels that are not present in the original index introduce missing values (NaN):

In [99]: obj2 = obj.reindex(list('abcde'))
In [100]: obj2
Out[100]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
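
If NaN is not the placeholder you want, reindex also accepts a fill_value argument. The sketch below reuses the Series from above; the choice of 0 as the fill value is only an example.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# Labels missing from obj ('e' here) receive fill_value instead of NaN.
filled = obj.reindex(list('abcde'), fill_value=0)

print(filled)
# The original Series keeps its own order and labels.
print(obj.index.tolist())
```

Note that fill_value only applies to labels introduced by the reindex; NaN values already present in the data are left alone.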

You can also fill the missing values while reindexing by passing the method parameter. For example, ffill fills forward (from the previous label) and bfill fills backward (from the next label):

In [101]: obj3 = pd.Series(['blue','purple','yellow'],index = [0,2,4])
In [102]: obj3
Out[102]:
0      blue
2    purple
4    yellow
dtype: object

In [103]: obj3.reindex(range(6),method='ffill')
Out[103]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
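
For comparison, here is a sketch of the same reindex with method='bfill'. Each new label takes the value of the next existing label, so a label past the end of the original index has nothing to borrow from and stays missing. According to the pandas documentation, the method parameter only applies when the original index is monotonically increasing or decreasing.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Backward fill: labels 1 and 3 take the value of the next label (2 and 4).
backward = obj3.reindex(range(6), method='bfill')

print(backward)
# Label 5 comes after the last original label, so it remains missing.
print(pd.isna(backward.loc[5]))
```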

For a two-dimensional object such as a DataFrame, passing a single list to reindex changes the row index by default. Use the columns keyword argument to reindex the columns instead:

In [104]: f = pd.DataFrame(np.arange(9).reshape((3,3)),index=list('acd'),columns=['beijing','shanghai','guangzhou'])
In [105]: f
Out[105]:
   beijing  shanghai  guangzhou
a        0         1          2
c        3         4          5
d        6         7          8
In [106]: f2 = f.reindex(list('abcd'))
In [107]: f2
Out[107]:
   beijing  shanghai  guangzhou
a      0.0       1.0        2.0
b      NaN       NaN        NaN
c      3.0       4.0        5.0
d      6.0       7.0        8.0
In [112]: f3 = f.reindex(columns=['beijing','shanghai','xian','guangzhou'])
In [113]: f3
Out[113]:
   beijing  shanghai  xian  guangzhou
a        0         1   NaN          2
c        3         4   NaN          5
d        6         7   NaN          8
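
Rows and columns can also be reindexed in one call by passing both the index and columns keywords. This is a minimal sketch using the same frame f; the specific label lists are chosen only for illustration.

```python
import numpy as np
import pandas as pd

f = pd.DataFrame(np.arange(9).reshape((3, 3)),
                 index=list('acd'),
                 columns=['beijing', 'shanghai', 'guangzhou'])

# New row 'b' and new column 'xian' are filled with NaN;
# 'shanghai' is dropped because it is not in the requested columns.
both = f.reindex(index=list('abcd'),
                 columns=['beijing', 'xian', 'guangzhou'])

print(both)
```

As with a Series, f itself is unchanged; reindex returns a new DataFrame.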

Posted by Tyree on Sun, 31 May 2020 19:17:29 -0700