pandas provides us with a variety of slicing methods, which are often confusing if we don't understand them well. The following examples illustrate these slicing methods.
Data presentation
First, a set of data is generated randomly:
In [5]: rnd_1 = [random.randrange(1,20) for x in xrange(1000)]
...: rnd_2 = [random.randrange(1,20) for x in xrange(1000)]
...: rnd_3 = [random.randrange(1,20) for x in xrange(1000)]
...: fecha = pd.date_range('2012-4-10', '2015-1-4')
...:
...: data = pd.DataFrame({'fecha':fecha, 'rnd_1': rnd_1, 'rnd_2': rnd_2, 'rnd_3': rnd_3})
In [6]: data.describe()
Out[6]:
rnd_1 rnd_2 rnd_3
count 1000.000000 1000.000000 1000.000000
mean 9.946000 9.825000 9.894000
std 5.553911 5.559432 5.423484
min 1.000000 1.000000 1.000000
25% 5.000000 5.000000 5.000000
50% 10.000000 10.000000 10.000000
75% 15.000000 15.000000 14.000000
max 19.000000 19.000000 19.000000
[] Section method
Using square brackets can slice the DataFrame, a bit like python's list slice. According to the index, the selection can be implemented or column selection or block selection.
# Row selection
In [7]: data[1:5]
Out[7]:
fecha rnd_1 rnd_2 rnd_3
1 2012-04-11 1 16 3
2 2012-04-12 7 6 1
3 2012-04-13 2 16 7
4 2012-04-14 4 17 7
# Column selection
In [10]: data[['rnd_1', 'rnd_3']]
Out[10]:
rnd_1 rnd_3
0 8 12
1 1 3
2 7 1
3 2 7
4 4 7
5 12 8
6 2 12
7 9 8
8 13 17
9 4 7
10 14 14
11 19 16
12 2 12
13 15 18
14 13 18
15 13 11
16 17 7
17 14 10
18 9 6
19 11 15
20 16 13
21 18 9
22 1 18
23 4 3
24 6 11
25 2 13
26 7 17
27 11 8
28 3 12
29 4 2
.. ... ...
970 8 14
971 19 5
972 13 2
973 8 10
974 8 17
975 6 16
976 3 2
977 12 6
978 12 10
979 15 13
980 8 4
981 17 3
982 1 17
983 11 5
984 7 7
985 13 14
986 6 19
987 13 9
988 3 15
989 19 6
990 7 11
991 11 7
992 19 12
993 2 15
994 10 4
995 14 13
996 12 11
997 11 15
998 17 14
999 3 8
[1000 rows x 2 columns]
# Block selection
In [11]: data[:7][['rnd_1', 'rnd_2']]
Out[11]:
rnd_1 rnd_2
0 8 17
1 1 16
2 7 6
3 2 16
4 4 17
5 12 19
6 2 7
However, for multi-column selection, it is not possible to use the 1:5 method as for row selection.
In [12]: data[['rnd_1':'rnd_3']]
File "<ipython-input-13-6291b6a83eb0>", line 1
data[['rnd_1':'rnd_3']]
^
SyntaxError: invalid syntax
loc
loc lets you select rows and columns by index.
In [13]: data.loc[1:5]
Out[13]:
fecha rnd_1 rnd_2 rnd_3
1 2012-04-11 1 16 3
2 2012-04-12 7 6 1
3 2012-04-13 2 16 7
4 2012-04-14 4 17 7
5 2012-04-15 12 19 8
It should be noted here that loc differs from the first method in that it chooses line 5, while the first method only chooses line 4. I don't feel much different elsewhere.
data.loc[2:4, ['rnd_2', 'fecha']]
Out[14]:
rnd_2 fecha
2 6 2012-04-12
3 16 2012-04-13
4 17 2012-04-14
loc can select data between two specific dates. It is important to note that both dates must be in the index.
In [15]: data_fecha = data.set_index('fecha')
...: data_fecha.head()
Out[15]:
rnd_1 rnd_2 rnd_3
fecha
2012-04-10 8 17 12
2012-04-11 1 16 3
2012-04-12 7 6 1
2012-04-13 2 16 7
2012-04-14 4 17 7
In [16]: # Generate two specific dates
...: fecha_1 = dt.datetime(2013, 4, 14)
...: fecha_2 = dt.datetime(2013, 4, 18)
...:
...: # Generating slice data
...: data_fecha.loc[fecha_1: fecha_2]
Out[16]:
rnd_1 rnd_2 rnd_3
fecha
2013-04-14 17 10 5
2013-04-15 14 4 9
2013-04-16 1 2 18
2013-04-17 9 15 1
2013-04-18 16 7 17
iloc
If loc is selected according to the value of index, iloc is selected according to the location of index. Iloc does not care about the specific value of the index, but only about the location, so when using iloc, only numerical values can be used in square brackets.
# Row selection
In [17]: data_fecha[10: 15]
Out[17]:
rnd_1 rnd_2 rnd_3
fecha
2012-04-20 14 6 14
2012-04-21 19 14 16
2012-04-22 2 6 12
2012-04-23 15 8 18
2012-04-24 13 8 18
# Column selection
In [18]: data_fecha.iloc[:,[1,2]].head()
Out[18]:
rnd_2 rnd_3
fecha
2012-04-10 17 12
2012-04-11 16 3
2012-04-12 6 1
2012-04-13 16 7
2012-04-14 17 7
# Slice selection
In [19]: data_fecha.iloc[[1,12,34],[0,2]]
Out[19]:
rnd_1 rnd_3
fecha
2012-04-11 1 3
2012-04-22 2 12
2012-05-14 17 10
at
The use of at is similar to loc, but it has faster access to data than loc, and can only access a single element, not multiple elements.
In [20]: timeit data_fecha.at[fecha_1,'rnd_1']
The slowest run took 3783.11 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 11.3 µs per loop
In [21]: timeit data_fecha.loc[fecha_1,'rnd_1']
The slowest run took 121.24 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 192 µs per loop
In [22]: data_fecha.at[fecha_1,'rnd_1']
Out[22]: 17
iat
iat's relationship with iloc is like at's relationship with loc's. It's a faster index-based location selection method, and like at, it can only access a single element.
In [23]: data_fecha.iat[1,0]
Out[23]: 1
In [24]: timeit data_fecha.iat[1,0]
The slowest run took 6.23 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.77 µs per loop
In [25]: timeit data_fecha.iloc[1,0]
10000 loops, best of 3: 158 µs per loop
ix
Several of the methods mentioned above require that the rank of the query be in the index, or that the location not exceed the length range, while ix allows you to get data that is not in the DataFrame index.
In [28]: date_1 = dt.datetime(2013, 1, 10, 8, 30)
...: date_2 = dt.datetime(2013, 1, 13, 4, 20)
...:
...: # Generating slice data
...: data_fecha.ix[date_1: date_2]
Out[28]:
rnd_1 rnd_2 rnd_3
fecha
2013-01-11 19 17 19
2013-01-12 10 9 17
2013-01-13 15 3 10
As shown in the example above, January 10, 2013 was not selected because the time point was regarded as 0:0, earlier than 8:30.