Pandas Foundation: Hierarchical Index

Hierarchical index

A MultiIndex is the pandas object that represents a hierarchical (multi-level) index.

import pandas as pd

tup = [('beijing',2000),('beijing',2019),
       ('shanghai',2000),('shanghai',2019),
       ('guangzhou',2000),('guangzhou',2019)]

values = [10000,100000,6000,60000,4000,40000]

index = pd.MultiIndex.from_tuples(tup) # Build a MultiIndex from a list of tuples

sss = pd.Series(values, index=index) # Use the MultiIndex as the Series index
sss
>>>
beijing    2000     10000
           2019    100000
shanghai   2000      6000
           2019     60000
guangzhou  2000      4000
           2019     40000

More ways to create MultiIndex include:

From arrays: pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])
From tuples: pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])
Cartesian product: pd.MultiIndex.from_product([['a','b'],[1,2]])
Direct construction: pd.MultiIndex(levels=[['a','b'],[1,2]], codes=[[0,0,1,1],[0,1,0,1]]) (the codes parameter was called labels before pandas 0.24)
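
As a quick check, the four constructors can be compared directly. This is a minimal sketch: the variable names are only illustrative, and the direct constructor uses codes, the name this parameter has had since pandas 0.24 (older versions called it labels).

```python
import pandas as pd

# Four ways to build the same two-level index
arrays_idx = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
tuples_idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
product_idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
direct_idx = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

# equals() compares the index entries, not how the index was built
all_equal = all(arrays_idx.equals(other)
                for other in (tuples_idx, product_idx, direct_idx))
print(all_equal)
```

In the direct form, each list in codes holds positions into the matching list in levels, so [0,0,1,1] with ['a','b'] spells out a, a, b, b.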

#=========================

Hierarchical indexing plays an important role in reshaping data and in building pivot tables. For example, the unstack method of a Series with a two-level index moves the inner level into the columns and returns a DataFrame:

s.unstack()

          1         2         3
a  0.283490  0.295529  0.277676
b  0.487573       NaN  0.091161
c  0.285157 -0.806851       NaN
d       NaN -0.287969 -0.696511
#--------------------------------------------------------------------------------------
s.unstack().stack()  # stack is the inverse of unstack

a  1    0.283490
   2    0.295529
   3    0.277676
b  1    0.487573
   3    0.091161
c  1    0.285157
   2   -0.806851
d  2   -0.287969
   3   -0.696511
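
The Series s used above is never defined in the text. The sketch below builds a comparable one so the round trip can be run. The index layout and the random values are assumptions chosen to resemble the output shown. The explicit dropna() is also an assumption of this sketch: depending on the pandas version, stack may keep the NaN entries that unstack introduced.

```python
import numpy as np
import pandas as pd

# Hypothetical data: some (letter, number) combinations are missing,
# as in the output shown above.
index = pd.MultiIndex.from_arrays(
    [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
     [1, 2, 3, 1, 3, 1, 2, 2, 3]])
s = pd.Series(np.random.randn(9), index=index)

wide = s.unstack()                 # inner level becomes columns, gaps become NaN
restored = wide.stack().dropna()   # back to a two-level Series, gaps removed

print(wide.shape)
print(int(wide.isna().sum().sum()))   # number of missing combinations

# Compare after sorting so the check does not depend on row order
round_trip_ok = restored.sort_index().equals(s.sort_index())
print(round_trip_ok)
```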

#==================

For DataFrame objects, either axis can carry a hierarchical index: pass a list of arrays (one array per level) as index, as columns, or as both.

Advanced Hierarchical Index

DataFrame object
sort_index(level=1) sorts the rows by the second index level (levels are numbered from 0).
swaplevel(0, 1) exchanges row index levels 0 and 1; the data itself is not changed.

Original:
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
#--------------------------------------
df.swaplevel(0, 1).sort_index(level=0)
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
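
A runnable sketch of the same two calls follows. The level names (key1, key2, state, color) are taken from the output above. The starting row order (a1, a2, b1, b2) is an assumption and differs from the "Original" listing, which does not affect the sorted result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
df.index.names = ['key1', 'key2']
df.columns.names = ['state', 'color']

swapped = df.swaplevel(0, 1)            # key2 becomes the outer row level
by_key2 = swapped.sort_index(level=0)   # sort rows by the new outer level

print(list(swapped.index))   # order of rows is unchanged, only the levels swap
print(list(by_key2.index))   # rows grouped by key2 after sorting
print(by_key2)
```

swaplevel only relabels the rows, so a sort_index call usually follows it to bring rows with the same outer label together.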

Using DataFrame columns as the index

df= pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                'd': [0, 1, 2, 0, 1, 2, 3]})
df
>>>
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3

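set_index(['c','d']) turns columns c and d into a hierarchical row index and, by default, removes them from the columns.
drop=False keeps the original columns in the DataFrame as well.
reset_index is the reverse operation of set_index: the index levels are moved back into columns.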

df2 = df.set_index(['c','d'])

df2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
#--------------------------------------
df.set_index(['c','d'],drop=False)

       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
#------------------------------------
df2.reset_index()

     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1

DataFrame Index Slice

If a MultiIndex is not sorted, most label-based slicing operations fail with an error. In that case, sort the index first with the sort_index method described earlier, then slice.
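
A minimal sketch of the failure and the fix, using a made-up Series whose outer level is deliberately out of order. The exact exception class may vary by pandas version; catching KeyError covers UnsortedIndexError, which is a subclass of it.

```python
import pandas as pd

# Hypothetical Series: the outer level ('a', 'c', 'b') is not in sorted order
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
s = pd.Series(range(6), index=index)

try:
    s.loc['a':'b']             # label slice on an unsorted MultiIndex
    slice_failed = False
except KeyError as exc:        # UnsortedIndexError subclasses KeyError
    slice_failed = True
    print(type(exc).__name__, exc)

s_sorted = s.sort_index()      # sort first ...
print(s_sorted.loc['a':'b'])   # ... then the same slice works
```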

The session below assumes numpy and pandas have been imported as np and pd.

    In [19]: df = pd.DataFrame(np.arange(12).reshape((4, 3)),
        ...:                   index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
        ...:                   columns=[['Ohio', 'Ohio', 'Colorado'],
        ...:                            ['Green', 'Red', 'Green']])
        ...:
    
    In [20]: df
    Out[20]:
         Ohio     Colorado
        Green Red    Green
    a 1     0   1        2
      2     3   4        5
    b 1     6   7        8
      2     9  10       11
    
    In [23]: df['Ohio','Colorado']  # Fails: the pair is read as one key (level 0 'Ohio', level 1 'Colorado'), and no such column exists
    KeyError                                  Traceback (most recent call last)
    ---------------------------------------------------------------------------
    In [24]: df[['Ohio','Colorado']]  # Pass a list to select several top-level columns
    Out[24]:
         Ohio     Colorado
        Green Red    Green
    a 1     0   1        2
      2     3   4        5
    b 1     6   7        8
      2     9  10       11
  #----------------------------------------
    In [25]: df['Ohio','Green']  # One label per level selects a single column
    Out[25]:
    a  1    0
       2    3
    b  1    6
       2    9
    Name: (Ohio, Green), dtype: int32
      #----------------------------------------
    In [26]: df.iloc[:2,:2]  # Position-based indexing: first two rows, first two columns
    Out[26]:
         Ohio
        Green Red
    a 1     0   1
      2     3   4
      #----------------------------------------
    In [28]: df.loc[:,('Ohio','Red')]  # With loc: all rows, and the column given as a (level 0, level 1) tuple
    Out[28]:
    a  1     1
       2     4
    b  1     7
       2    10
    Name: (Ohio, Red), dtype: int32

Posted by lordzardeck on Thu, 10 Oct 2019 01:08:12 -0700