Grouping operations for Python data analysis

Keywords: lambda, MySQL, Python, R

A common step in data analysis is to split a data set into groups and then apply a function to each group; this is known as a grouping operation. Hadley Wickham coined the term split-apply-combine for this pattern. So when we talk about grouping operations, what are we actually talking about?

  • Splitting: dividing the data into groups according to some criterion
  • Applying: applying a function to each group independently
  • Combining: merging the results into a new data structure
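As a concrete illustration of the three steps, here is a minimal sketch; the `scores` frame and its column names are made up for this example:

```python
import pandas as pd

# Hypothetical data, invented for illustration
scores = pd.DataFrame({'team': ['a', 'b', 'a', 'b'],
                       'points': [1, 2, 3, 4]})

# Split by 'team', apply mean() to each group's 'points',
# combine the per-group results into one Series
result = scores.groupby('team')['points'].mean()
print(result)  # a -> 2.0, b -> 3.0
```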

Grouping operations expect data stored in "long format", so we first introduce the conversion between "long format" and "wide format", and then cover the grouping operations themselves.

"Long Format" vs. "Wide Format"

There are two common ways to store data: long format and wide format.

  • wide format
gene   meristem  root  FLOWER
gene1  582       91    495
gene2  305       3505  33

In wide format, each level of the categorical variable gets its own column. For example, the different plant tissues above are spread across three columns, and the cells hold the expression levels. This layout is intuitive and matches how we usually record data by hand.

  • long format
gene   tissue    expression
gene1  meristem  582
gene2  meristem  305
gene1  root      91
gene2  root      3505
gene1  FLOWER    495
gene2  FLOWER    33

Long-format data has a dedicated column for the categorical variable. Relational databases (such as MySQL) typically store data in long format, because under a fixed schema new categories (such as new tissues) can be added or removed simply by inserting or deleting rows.

Since the grouping operations that follow prefer long-format data, we first show how pandas converts between long and wide formats.

First, create some long-format data for testing:

import numpy as np
import pandas as pd
import pandas.util.testing as tm; tm.N = 3

def unpivot(frame):
    N, K = frame.shape
    data = {'value': frame.values.ravel('F'),
            'variable': np.asarray(frame.columns).repeat(N),
            'date': np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=['date', 'variable', 'value'])

ldata = unpivot(tm.makeTimeDataFrame())

Converting long format to wide format can be thought of as "rotating" rows of the data into columns.

# The first method: pivot
# ldata.pivot(index, columns, values): the index column, the category column, and the value column
pivoted = ldata.pivot('date', 'variable', 'value')
# The second method: unstack
# Build a hierarchical index with set_index first
unstacked = ldata.set_index(['date', 'variable']).unstack('variable')
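To see that the two methods agree, here is a small self-contained check on a hand-made long-format frame; the values are invented for illustration:

```python
import pandas as pd

# A tiny hand-made long-format frame (values invented for illustration)
ldata = pd.DataFrame({'date': ['d1', 'd1', 'd2', 'd2'],
                      'variable': ['A', 'B', 'A', 'B'],
                      'value': [1.0, 2.0, 3.0, 4.0]})

# Method 1: pivot
pivoted = ldata.pivot(index='date', columns='variable', values='value')

# Method 2: hierarchical index + unstack (selecting 'value' first,
# so the result has the same single-level columns as pivot)
unstacked = ldata.set_index(['date', 'variable'])['value'].unstack('variable')

assert pivoted.equals(unstacked)
```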

Converting wide format back to long format can likewise be thought of as "rotating" columns of the data into rows.

# First restore the unstacked data to an ordinary DataFrame
wdata = unstacked.reset_index()
wdata.columns = ['date', 'A', 'B', 'C', 'D']
# The first method: melt
pd.melt(wdata, id_vars=['date'])
# The second method: stack
# If the data has a plain index, rebuild the index with set_index first
wdata.set_index('date').stack()
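A small self-contained sketch of the wide-to-long direction, with invented values:

```python
import pandas as pd

# A tiny wide-format frame (values invented for illustration)
wdata = pd.DataFrame({'date': ['d1', 'd2'],
                      'A': [1.0, 3.0],
                      'B': [2.0, 4.0]})

# Method 1: melt keeps 'date' as an identifier column
long1 = pd.melt(wdata, id_vars=['date'])

# Method 2: stack produces a Series with a (date, column) MultiIndex
long2 = wdata.set_index('date').stack()

# Both hold the same four observations
assert len(long1) == 4 and len(long2) == 4
```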

GroupBy

The first step is to split the data set into groups according to some criteria, i.e. grouping keys. pandas provides the groupby method, which accepts grouping keys in the following forms:

  • A NumPy array whose length matches the axis being grouped
  • The name of a column in the DataFrame
  • A dict or Series providing a mapping from values on the grouped axis to group names
  • A level of a hierarchical (multi-level) axis index
  • A Python function to be called on each label of the axis index

Examples of each form follow.
Form 1: grouping by the values of a column, usually a categorical one.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
# Split by column A
df.groupby('A')
# Split by columns A and B
df.groupby(['A', 'B'])

This produces a pandas.core.groupby.DataFrameGroupBy object, which performs no actual computation and only records intermediate grouping information. You can inspect the size of each group after grouping:

df.groupby(['A','B']).size()
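Because the keys in columns A and B are fixed, the group sizes are deterministic even though C and D are random. A self-contained sketch:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame so the snippet runs on its own
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

sizes = df.groupby(['A', 'B']).size()
# ('foo', 'one') and ('foo', 'two') each contain two rows;
# every other (A, B) pair appears once
print(sizes)
```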

Form 2: grouping by a function. Below we define a function that splits the column names according to whether they are vowels or consonants.

def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'

grouped = df.groupby(get_letter_type, axis=1)

Form 3: grouping by a hierarchical index. Unlike R's data.frame, which has only a single set of row and column names, pandas's DataFrame allows multiple hierarchical levels of row or column names.

# Data for building multi-level index
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index)
# Group by different levels of the index
level1 = s.groupby(level='first') # the first level
level2 = s.groupby(level=1) # the second level
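Grouping by a level and then aggregating collapses the other level. A sketch with fixed values, invented for illustration:

```python
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)  # fixed values for illustration

# Summing within the 'first' level collapses the 'second' level
sums = s.groupby(level='first').sum()
print(sums['bar'])  # 1.0 + 2.0 = 3.0
```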

Form 4: grouping by a dict or Series
The mapping from labels to group names can be defined by a dictionary.

mapping = {'A':'str','B':'str','C':'values','D':'values','E':'values'}
df.groupby(mapping, axis=1)
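A version-robust sketch of the same idea: transposing first lets the dict mapping apply to the index, which has the same effect as grouping with axis=1 above (the column names and values here are invented):

```python
import pandas as pd

# Invented numeric columns for illustration
df = pd.DataFrame({'C': [1.0, 2.0], 'D': [3.0, 4.0], 'E': [5.0, 6.0]})

# Columns that map to the same label are combined into one group
mapping = {'C': 'left', 'D': 'right', 'E': 'right'}

# Transpose, group the (former column) index by the dict, sum, transpose back;
# this mirrors df.groupby(mapping, axis=1).sum() without relying on axis=1
combined = df.T.groupby(mapping).sum().T
print(combined)  # 'right' holds the row-wise sums of D and E
```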

Form 5: grouping by a row index level and a column value at the same time

df.index = index
df.groupby([pd.Grouper(level='first'),'A']).mean()

Group object iteration: the GroupBy objects produced above can be iterated, yielding a sequence of (group name, group data) 2-tuples.

# For a single key
for name, group in df.groupby('A'):
    print(name)
    print(group)
# For multiple keys
for (k1, k2), group in df.groupby(['A', 'B']):
    print(k1, k2)
    print(group)

Select Groups: We can extract a specific group with get_group() after the grouping is finished.

df.groupby('A').get_group('foo')
# Here are the results
                A      B         C         D         E
first second                                          
bar   one     foo    one  0.709603 -1.023657 -0.321730
baz   one     foo    two  1.495328 -0.279928 -0.297983
foo   one     foo    two  0.495288 -0.563845 -1.774453
qux   one     foo    one  1.649135 -0.274208  1.194536
      two     foo  three -2.194127  3.440418 -0.023144
# Multiple groups
df.groupby(['A', 'B']).get_group(('bar', 'one'))

Apply

The splitting step is fairly mechanical; the apply step is where most of the variety lies. In this step we may do any of the following:

  • Aggregation: compute summary statistics for each group, such as the mean or the group size.
  • Transformation: perform a group-specific computation, such as standardization or filling in missing values.
  • Filtration: discard some groups based on a group-wise statistic.

Data aggregation: Aggregation

Aggregation is the process of turning an array into a scalar value. We can use general-purpose methods of the GroupBy object such as mean, count, min, and sum, or the aggregate method and its alias agg.

grouped = df.groupby('A')
grouped.mean()
# Equivalent to
grouped.agg(np.mean)
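A self-contained check that mean and agg(np.mean) agree, using fixed values invented so the group means are easy to verify by hand:

```python
import numpy as np
import pandas as pd

# Invented values for illustration
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0],
                   'D': [10.0, 20.0, 30.0, 40.0]})

grouped = df.groupby('A')
m1 = grouped.mean()
m2 = grouped.agg(np.mean)

# Both produce one row per group holding the column means
assert m1.equals(m2)
print(m1.loc['foo', 'C'])  # (1.0 + 3.0) / 2 = 2.0
```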

Grouping by multiple keys produces a result with a hierarchical index. If you want the index levels to be ordinary columns instead, use the as_index=False option of groupby, or call reset_index on the result at the end.

# as_index
grouped = df.groupby(['A','B'], as_index=False)
grouped.agg(np.mean)
# reset_index
df.groupby(['A','B']).sum().reset_index()

aggregate (or its alias agg) also accepts a list of statistical functions to apply to each column:

df.groupby('A').agg([np.sum, np.mean, np.std])
# Syntactic sugar: df.groupby('A').C is equivalent to
# grouped = df.groupby(['A'])
# grouped['C']
df.groupby('A').C.agg([np.sum, np.mean, np.std])

Different columns can also receive different functions, for example the standard deviation for column D and the mean for column C.

grouped=df.groupby(['A'])
grouped.agg({'D': np.std , 'C': np.mean})

Note: the sum, mean, std, and sem methods are currently Cython-optimized.

Conversion: Transformation

The object returned by transform has the same size as the original grouped data. The function used for the transformation must satisfy the following conditions:

  • It returns a result of the same size as the group chunk
  • It operates column-by-column on the chunk
  • It does not modify the chunk in place. The chunk should be treated as immutable, and any modification may lead to unexpected results. For example, when using fillna, write grouped.transform(lambda x: x.fillna(inplace=False))
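As an example of a group-wise transformation other than standardization, missing values can be filled with each group's own mean rather than the global mean; the series below is invented for illustration:

```python
import numpy as np
import pandas as pd

# Invented series: groups 'a' and 'b' each contain a missing value
s = pd.Series([1.0, np.nan, 3.0, 10.0, np.nan],
              index=['a', 'a', 'a', 'b', 'b'])

# Fill each NaN with the mean of its own group
filled = s.groupby(level=0).transform(lambda x: x.fillna(x.mean()))
print(filled)  # the NaN in 'a' becomes 2.0, the NaN in 'b' becomes 10.0
```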

For example, let's standardize each group of a time series. First we need some data: generate a random daily series starting in October 1999 (about three years of values), then compute the mean over a sliding window of 100 days.

# date_range generates date index
index = pd.date_range('10/1/1999', periods=1100)
ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
ts = ts.rolling(window=100,min_periods=100).mean().dropna()

Then group the data by year and standardize it within each group, i.e. compute the z-score.

# Two anonymous functions for grouping and calculating zscore
key = lambda x : x.year
zscore = lambda x : (x - x.mean())/x.std()
transformed = ts.groupby(key).transform(zscore)
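The claim that each yearly group is standardized can be verified directly: after the transform, every group has mean 0 and standard deviation 1. A self-contained sketch with a shorter series:

```python
import numpy as np
import pandas as pd

# A shorter random daily series spanning parts of 1999 and 2000
index = pd.date_range('10/1/1999', periods=400)
ts = pd.Series(np.random.normal(0.5, 2, 400), index=index)

key = lambda x: x.year
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)

# Every yearly group is now standardized
assert np.allclose(transformed.groupby(key).mean(), 0.0)
assert np.allclose(transformed.groupby(key).std(), 1.0)
```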

Broadcasting is used when zscore is applied to each group. Let's visualize the shape of the data before and after the transformation:

compare = pd.DataFrame({'Original': ts, 'Transformed': transformed})
compare.plot()

Before and after conversion

Filtering: Filtration

Filtering returns a subset of the original data. For example, for the time series above, we can drop the years whose mean does not exceed 0.55.

ts.groupby(key).filter(lambda x : x.mean() > 0.55)
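The behaviour of filter is easiest to see on fixed data: groups whose mean fails the predicate disappear entirely; the series below is invented for illustration:

```python
import pandas as pd

# Invented series: group 'x' has mean 0.25, group 'y' has mean 1.0
s = pd.Series([0.2, 0.3, 0.9, 1.1], index=['x', 'x', 'y', 'y'])

# Only group 'y' passes the predicate, so group 'x' is dropped
kept = s.groupby(level=0).filter(lambda x: x.mean() > 0.55)
print(list(kept))  # [0.9, 1.1]
```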

Apply: More Flexible Split-Apply-Combine

The aggregate and transform methods above have limitations: the function passed in can only produce either a scalar value that can be broadcast or an array of the same size. apply is a more general method that can accomplish all of the tasks described above, and more.

  • Return descriptive statistics for each group
df.groupby('A').apply(lambda x: x.describe())
# The effect is equivalent to df.groupby('A').describe()

  • Return a new DataFrame from each group
def f(group):
    return pd.DataFrame({'original': group,
                         'demeaned': group - group.mean()})
df.groupby('A')['C'].apply(f)
              demeaned  original
first second                    
bar   one     0.278558  0.709603
      two     0.280707 -0.140075
baz   one     1.064283  1.495328
      two    -0.848109 -1.268891
foo   one     0.064243  0.495288
      two     0.567402  0.146620
qux   one     1.218089  1.649135
      two    -2.625172 -2.194127

For more useful tricks, such as getting the nth row of each group, see the official documentation.

Posted by Mordecai on Tue, 04 Jun 2019 16:19:32 -0700