Grouping with pandas groupby()

Keywords: Big Data Lambda

time                         data
2018-05-01 00:00:00.650   57
2018-05-01 00:00:01.990   54
2018-05-01 00:00:09.487   73
2018-05-01 00:00:14.607   95
2018-05-01 00:00:16.350   77
2018-05-01 00:00:16.397   28
2018-05-01 00:00:16.563   54
2018-05-01 00:00:25.457   19
2018-05-01 00:00:31.140   19
2018-05-01 00:00:54.427   18
2018-05-01 00:00:55.387   39
2018-05-01 00:01:02.193   97
2018-05-01 00:01:07.447   39
2018-05-01 00:01:09.020   41
2018-05-01 00:01:11.033   93
2018-05-01 00:01:25.693   42
2018-05-01 00:02:03.900   42
2018-05-01 00:02:04.190   84
2018-05-01 00:02:05.727   39
2018-05-01 00:02:10.910   40

Suppose df is a DataFrame holding the data shown above.

We want to group by date and hour, so first make the time column the index.

df.set_index("time", inplace=True)

The inplace=True parameter means the change is made on the original DataFrame rather than on a copy, so the time column of df becomes its index. This can be verified with df.index. I won't go into details here.

The reason the time column should be the index is that a function passed to groupby is called on each index label, which makes it easy to pull the day and hour out of each timestamp. For this to work, the index must hold datetime values rather than strings. It will be used next.
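As a minimal sketch of this step, the snippet below builds a three-row df from the sample table and sets the index. It assumes the time column was read in as strings, so it is parsed with pd.to_datetime first; if your column is already datetime, that line is not needed.

```python
import pandas as pd

# Three rows copied from the sample table above
df = pd.DataFrame({
    "time": [
        "2018-05-01 00:00:00.650",
        "2018-05-01 00:00:01.990",
        "2018-05-01 00:00:09.487",
    ],
    "data": [57, 54, 73],
})

# Assumption: time arrived as text, so convert it to datetime values
df["time"] = pd.to_datetime(df["time"])

# Move the time column into the index, modifying df itself
df.set_index("time", inplace=True)

print(type(df.index).__name__)  # DatetimeIndex
print(df.columns.tolist())      # ['data']
```

After this, time is no longer a regular column, so df["time"] would raise a KeyError; the values are reached through df.index instead.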

The groupby function of pandas does the grouping:

dp = df.groupby([lambda x:x.day, lambda x:x.hour, "data"])

At this point, dp is a GroupBy object. Nothing has been calculated yet, but it holds all the information needed to run an operation on each group. An aggregation such as sum(), mean(), size(), or agg() can be applied to it later.

For the next step, we append a call to size() to the line above.
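To illustrate the point that grouping and calculating are separate steps, here is a small sketch using four rows from the sample table. The variable names grouped, sizes, and hourly_mean are only for illustration and do not appear in the original code.

```python
import pandas as pd

# Four rows from the sample table, with the timestamps already as the index
times = pd.to_datetime([
    "2018-05-01 00:00:00.650",
    "2018-05-01 00:00:01.990",
    "2018-05-01 00:00:09.487",
    "2018-05-01 00:00:16.563",
])
df = pd.DataFrame({"data": [57, 54, 73, 54]}, index=times)

# Building the GroupBy object does not aggregate anything yet
grouped = df.groupby([lambda x: x.day, lambda x: x.hour, "data"])
print(type(grouped).__name__)  # DataFrameGroupBy
print(grouped.ngroups)         # 3 distinct (day, hour, data) combinations

# The calculation happens only when an aggregation is called
sizes = grouped.size()
print(sizes)

# Grouping by day and hour only, then averaging the data column
hourly_mean = df.groupby([lambda x: x.day, lambda x: x.hour])["data"].mean()
print(hourly_mean)
```

Note that "data" is left out of the keys in the last call: a column used as a grouping key is no longer available as a value to aggregate.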

dp = df.groupby([lambda x:x.day, lambda x:x.hour, "data"]).size()

Now dp has been calculated and can be displayed, as follows:

      data
1  0  18      1
      19      2
      28      1
      39      3
      40      1
      41      1
      42      2
      54      2
      57      1
      73      1
      77      1
      84      1
      93      1
      95      1
      97      1
dtype: int64

The last column holds the number of occurrences in each group. To save it as a named column, reset the index:

new_dp = dp.reset_index(name="times")

new_dp is a newly generated DataFrame:

    level_0  level_1  data  times
0         1        0    18      1
1         1        0    19      2
2         1        0    28      1
3         1        0    39      3
4         1        0    40      1
5         1        0    41      1
6         1        0    42      2
7         1        0    54      2
8         1        0    57      1
9         1        0    73      1
10        1        0    77      1
11        1        0    84      1
12        1        0    93      1
13        1        0    95      1
14        1        0    97      1

The level_0 column is the day of the month, here 1 (1 May).

The level_1 column is the hour. Every timestamp in the sample falls in the first hour of the day, so it is 0.

The data column is the original data column of df, now used as a grouping key, so each value appears once per day and hour.

The times column is the number of times each data value occurs within that day and hour.
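The names level_0 and level_1 appear because the two lambda keys have no names of their own. One possible way to get readable column names, sketched below on four rows of the sample data, is to name the index levels with rename_axis before resetting the index. The names "day" and "hour" are my own choice, not part of the original code.

```python
import pandas as pd

# Four rows from the sample table, with the timestamps as the index
times = pd.to_datetime([
    "2018-05-01 00:00:00.650",
    "2018-05-01 00:00:01.990",
    "2018-05-01 00:00:09.487",
    "2018-05-01 00:00:16.563",
])
df = pd.DataFrame({"data": [57, 54, 73, 54]}, index=times)

# Same grouping as above: count rows per (day, hour, data value)
counts = df.groupby([lambda x: x.day, lambda x: x.hour, "data"]).size()

# Name the three index levels, then turn them into ordinary columns
new_dp = counts.rename_axis(["day", "hour", "data"]).reset_index(name="times")

print(new_dp.columns.tolist())  # ['day', 'hour', 'data', 'times']
print(new_dp)
```

With named columns, the result can be filtered directly, for example new_dp[new_dp["times"] > 1] to keep only the values that occur more than once in an hour.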

Posted by DrJonesAC2 on Fri, 25 Jan 2019 20:39:14 -0800