Query method of Dataframe in pandas ([], loc, iloc, at, iat, ix)

Keywords: Python IPython

pandas provides us with a variety of slicing methods, which are often confusing if we don't understand them well. The following examples illustrate these slicing methods.

Data presentation

First, a set of data is generated randomly:

In [5]: rnd_1 = [random.randrange(1,20) for x in xrange(1000)]
   ...: rnd_2 = [random.randrange(1,20) for x in xrange(1000)]
   ...: rnd_3 = [random.randrange(1,20) for x in xrange(1000)]
   ...: fecha = pd.date_range('2012-4-10', '2015-1-4')
   ...: 
   ...: data = pd.DataFrame({'fecha':fecha, 'rnd_1': rnd_1, 'rnd_2': rnd_2, 'rnd_3': rnd_3})

In [6]: data.describe()
Out[6]: 
             rnd_1        rnd_2        rnd_3
count  1000.000000  1000.000000  1000.000000
mean      9.946000     9.825000     9.894000
std       5.553911     5.559432     5.423484
min       1.000000     1.000000     1.000000
25%       5.000000     5.000000     5.000000
50%      10.000000    10.000000    10.000000
75%      15.000000    15.000000    14.000000
max      19.000000    19.000000    19.000000

[] Section method

Using square brackets can slice the DataFrame, a bit like python's list slice. According to the index, the selection can be implemented or column selection or block selection.

# Row selection
In [7]: data[1:5]
Out[7]: 
       fecha  rnd_1  rnd_2  rnd_3
1 2012-04-11      1     16      3
2 2012-04-12      7      6      1
3 2012-04-13      2     16      7
4 2012-04-14      4     17      7

# Column selection
In [10]: data[['rnd_1', 'rnd_3']]
Out[10]: 
     rnd_1  rnd_3
0        8     12
1        1      3
2        7      1
3        2      7
4        4      7
5       12      8
6        2     12
7        9      8
8       13     17
9        4      7
10      14     14
11      19     16
12       2     12
13      15     18
14      13     18
15      13     11
16      17      7
17      14     10
18       9      6
19      11     15
20      16     13
21      18      9
22       1     18
23       4      3
24       6     11
25       2     13
26       7     17
27      11      8
28       3     12
29       4      2
..     ...    ...
970      8     14
971     19      5
972     13      2
973      8     10
974      8     17
975      6     16
976      3      2
977     12      6
978     12     10
979     15     13
980      8      4
981     17      3
982      1     17
983     11      5
984      7      7
985     13     14
986      6     19
987     13      9
988      3     15
989     19      6
990      7     11
991     11      7
992     19     12
993      2     15
994     10      4
995     14     13
996     12     11
997     11     15
998     17     14
999      3      8

[1000 rows x 2 columns]

# Block selection
In [11]: data[:7][['rnd_1', 'rnd_2']]
Out[11]: 
   rnd_1  rnd_2
0      8     17
1      1     16
2      7      6
3      2     16
4      4     17
5     12     19
6      2      7

However, for multi-column selection, it is not possible to use the 1:5 method as for row selection.

In [12]: data[['rnd_1':'rnd_3']]
  File "<ipython-input-13-6291b6a83eb0>", line 1
    data[['rnd_1':'rnd_3']]
                 ^
SyntaxError: invalid syntax

loc

loc lets you select rows and columns by index.

In [13]: data.loc[1:5]
Out[13]: 
       fecha  rnd_1  rnd_2  rnd_3
1 2012-04-11      1     16      3
2 2012-04-12      7      6      1
3 2012-04-13      2     16      7
4 2012-04-14      4     17      7
5 2012-04-15     12     19      8

It should be noted here that loc differs from the first method in that it chooses line 5, while the first method only chooses line 4. I don't feel much different elsewhere.

data.loc[2:4, ['rnd_2', 'fecha']]
Out[14]: 
   rnd_2      fecha
2      6 2012-04-12
3     16 2012-04-13
4     17 2012-04-14

loc can select data between two specific dates. It is important to note that both dates must be in the index.

In [15]: data_fecha = data.set_index('fecha')
    ...: data_fecha.head()
Out[15]: 
            rnd_1  rnd_2  rnd_3
fecha                          
2012-04-10      8     17     12
2012-04-11      1     16      3
2012-04-12      7      6      1
2012-04-13      2     16      7
2012-04-14      4     17      7

In [16]: # Generate two specific dates
    ...: fecha_1 = dt.datetime(2013, 4, 14)
    ...: fecha_2 = dt.datetime(2013, 4, 18)
    ...: 
    ...: # Generating slice data
    ...: data_fecha.loc[fecha_1: fecha_2]
Out[16]: 
            rnd_1  rnd_2  rnd_3
fecha                          
2013-04-14     17     10      5
2013-04-15     14      4      9
2013-04-16      1      2     18
2013-04-17      9     15      1
2013-04-18     16      7     17

iloc

If loc is selected according to the value of index, iloc is selected according to the location of index. Iloc does not care about the specific value of the index, but only about the location, so when using iloc, only numerical values can be used in square brackets.

# Row selection
In [17]: data_fecha[10: 15]
Out[17]: 
            rnd_1  rnd_2  rnd_3
fecha                          
2012-04-20     14      6     14
2012-04-21     19     14     16
2012-04-22      2      6     12
2012-04-23     15      8     18
2012-04-24     13      8     18

# Column selection
In [18]: data_fecha.iloc[:,[1,2]].head()
Out[18]: 
            rnd_2  rnd_3
fecha                   
2012-04-10     17     12
2012-04-11     16      3
2012-04-12      6      1
2012-04-13     16      7
2012-04-14     17      7

# Slice selection
In [19]: data_fecha.iloc[[1,12,34],[0,2]]
Out[19]: 
            rnd_1  rnd_3
fecha                   
2012-04-11      1      3
2012-04-22      2     12
2012-05-14     17     10

at

The use of at is similar to loc, but it has faster access to data than loc, and can only access a single element, not multiple elements.

In [20]: timeit data_fecha.at[fecha_1,'rnd_1']
The slowest run took 3783.11 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 11.3 µs per loop

In [21]: timeit data_fecha.loc[fecha_1,'rnd_1']
The slowest run took 121.24 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 192 µs per loop

In [22]: data_fecha.at[fecha_1,'rnd_1']
Out[22]: 17

iat

iat's relationship with iloc is like at's relationship with loc's. It's a faster index-based location selection method, and like at, it can only access a single element.

In [23]: data_fecha.iat[1,0]
Out[23]: 1

In [24]: timeit data_fecha.iat[1,0]
The slowest run took 6.23 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.77 µs per loop

In [25]: timeit data_fecha.iloc[1,0]
10000 loops, best of 3: 158 µs per loop

ix

Several of the methods mentioned above require that the rank of the query be in the index, or that the location not exceed the length range, while ix allows you to get data that is not in the DataFrame index.

In [28]: date_1 = dt.datetime(2013, 1, 10, 8, 30)
    ...: date_2 = dt.datetime(2013, 1, 13, 4, 20)
    ...: 
    ...: # Generating slice data
    ...: data_fecha.ix[date_1: date_2]
Out[28]: 
            rnd_1  rnd_2  rnd_3
fecha                          
2013-01-11     19     17     19
2013-01-12     10      9     17
2013-01-13     15      3     10

As shown in the example above, January 10, 2013 was not selected because the time point was regarded as 0:0, earlier than 8:30.

Posted by aks.it on Tue, 08 Jan 2019 19:06:10 -0800