Originally I planned to jump straight into a practical project, but then I decided it would be better to finish the necessary groundwork first; otherwise many things in the practical examples would be confusing and tedious to explain after the fact.
This article is the last of the pandas explanations. The three posts together should be enough to handle the vast majority of situations; in fact, they cover almost all of the main content of pandas. Let's give it a good ending.
Let's start by explaining the concept of grouping in plain language.
Grouping and aggregation
Grouping means exactly what the word suggests.
For example, take the following data (imported with code):
```python
import pandas as pd

file_path = './split_test.xlsx'
df = pd.read_excel(file_path, sheet_name='Sheet1')
df
```
 | part | id | name | score
---|---|---|---|---
0 | a | 1 | Zhang San | 10
1 | a | 2 | Li Si | 5
2 | b | 3 | Wang Wu | 10
3 | b | 4 | smith | 6
4 | c | 5 | kear | 7
5 | c | 6 | clack | 8
If you put part a in one block, part b in another, and part c in another, that is grouping. In SQL this operation is done with GROUP BY. pandas uses the same method, as follows:
```python
df1 = df.groupby('part')
# output
df1
```

```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000024543471490>
```
You can see that it returns a DataFrameGroupBy object.
```python
# Try reading its data. No error is raised here, which shows it is iterable.
for line in df1:
    print(line)
```

```
('a',   part  id       name  score
0    a   1  Zhang San     10
1    a   2      Li Si      5)
('b',   part  id     name  score
2    b   3  Wang Wu     10
3    b   4    smith      6)
('c',   part  id   name  score
4    c   5   kear      7
5    c   6  clack      8)
```
```python
# print([line for line in df1])
print(list(df1)[0])
```

```
('a',   part  id       name  score
0    a   1  Zhang San     10
1    a   2      Li Si      5)
```
The first traversal shows that each element is a two-element tuple, so let's traverse again with two loop variables.
```python
for part, ele in df1:
    print(part, ele)
```

```
a   part  id       name  score
0    a   1  Zhang San     10
1    a   2      Li Si      5
b   part  id     name  score
2    b   3  Wang Wu     10
3    b   4    smith      6
c   part  id   name  score
4    c   5   kear      7
5    c   6  clack      8
```
```python
# View the type of one group
type(ele)
```

```
pandas.core.frame.DataFrame
```
This shows that each group element (ele) is a DataFrame object, so it naturally inherits all DataFrame methods, and you can just as naturally take out any of its elements.
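Besides iterating, a single group can be pulled out directly with `get_group`, a standard method of the groupby object. A minimal sketch, using a hand-built DataFrame that mirrors the article's table:

```python
import pandas as pd

# Hand-built copy of the article's data (names romanized)
df = pd.DataFrame({
    'part': ['a', 'a', 'b', 'b', 'c', 'c'],
    'id': [1, 2, 3, 4, 5, 6],
    'name': ['Zhang San', 'Li Si', 'Wang Wu', 'smith', 'kear', 'clack'],
    'score': [10, 5, 10, 6, 7, 8],
})

grouped = df.groupby('part')

# get_group returns one group's DataFrame directly, no loop needed
group_a = grouped.get_group('a')
print(group_a)
```

This is often more convenient than looping when you only care about one group.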
That's the general impression. Next, let's go through the methods concretely.

```python
# View the data
df
```
 | part | id | name | score
---|---|---|---|---
0 | a | 1 | Zhang San | 10
1 | a | 2 | Li Si | 5
2 | b | 3 | Wang Wu | 10
3 | b | 4 | smith | 6
4 | c | 5 | kear | 7
5 | c | 6 | clack | 8
```python
# Group the DataFrame as a whole on one key
df_group_obj = df.groupby('part')
# Printing shows it is an object. Note that this object only holds
# intermediate grouping data.
print(df_group_obj)
# The size method reports the number of elements in each group
print(df_group_obj.size())
# The object can also call some DataFrame methods, such as:
# mean of each column per group
print(df_group_obj.mean())
# sum of each column per group
print(df_group_obj.sum())
# maximum of each column per group
print(df_group_obj.max())
# minimum of each column per group
print(df_group_obj.min())
```
```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000245494B2520>
part
a    2
b    2
c    2
dtype: int64
       id  score
part
a     1.5    7.5
b     3.5    8.0
c     5.5    7.5
      id  score
part
a      3     15
b      7     16
c     11     15
      id       name  score
part
a      2      Li Si     10
b      4    Wang Wu     10
c      6       kear      8
      id       name  score
part
a      1  Zhang San      5
b      3      smith      6
c      5      clack      7
```
Also note that the operations above all act per group, so with three groups each result has three records (that is, three rows).
The operations above group the whole df object; you can also take out a single column and group it. That is, the same operation works on a Series object.
```python
# Note that the argument to groupby must be taken from df, otherwise an
# error is raised. This is not hard to see: df['score'] is itself a Series,
# and pairing the two Series is effectively a DataFrame, so the grouping
# method works here.
ser_group_by = df['score'].groupby(df['part'])
# The rest works as above, so it needs no further explanation.
print(ser_group_by.size())
print(ser_group_by.max())
print(ser_group_by.min())
```
```
part
a    2
b    2
c    2
Name: score, dtype: int64
part
a    10
b    10
c     8
Name: score, dtype: int64
part
a    5
b    6
c    7
Name: score, dtype: int64
```
Similarly, we can define a custom grouping: we only need a sequence with the same number of elements as there are records (rows).
```python
import random

# Randomly generate a grouping sequence
group = [random.randint(1, 4) for i in range(6)]
# group by it
df_group = df.groupby(group)
# view the object
print(df_group)
```

```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002454927ABB0>
```

```python
print(df_group.size())
```

```
1    3
3    2
4    1
dtype: int64
```

```python
# As a check, print the sequence
print(group)
```

```
[1, 4, 1, 1, 3, 3]
```
Grouping by multiple keys produces a hierarchical index.
```python
df_group = df.groupby(['part', 'name'])
print(df_group.size())
```

```
part  name
a     Zhang San    1
      Li Si        1
b     smith        1
      Wang Wu      1
c     clack        1
      kear         1
dtype: int64
```
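The size() result of a multi-key grouping is a Series with a MultiIndex, so it can be reshaped into a table with `unstack`. A small sketch on a hand-built copy of the data:

```python
import pandas as pd

df = pd.DataFrame({
    'part': ['a', 'a', 'b', 'b', 'c', 'c'],
    'name': ['Zhang San', 'Li Si', 'Wang Wu', 'smith', 'kear', 'clack'],
    'score': [10, 5, 10, 6, 7, 8],
})

sizes = df.groupby(['part', 'name']).size()
# Pivot the inner index level into columns: parts as rows, names as columns
table = sizes.unstack(fill_value=0)
print(table)
```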
In addition, let's consider the nature of the data produced by grouping, based on the output above:

- It is a dedicated object holding intermediate data; the groups it contains support the operations of Series and DataFrame objects.
- Its index is defined by us. For example, when we group by two keys, as shown above, it carries two levels of index.
- The object supports some pandas methods, but not all of them, so when it cannot handle something we can convert it into an ordinary data structure such as a dict or a list.
- The object supports iteration.
```python
# Convert to a list
list(df_group)
```

```
[(('a', 'Zhang San'), part id name score 0 a 1 Zhang San 10),
 (('a', 'Li Si'), part id name score 1 a 2 Li Si 5),
 (('b', 'smith'), part id name score 3 b 4 smith 6),
 (('b', 'Wang Wu'), part id name score 2 b 3 Wang Wu 10),
 (('c', 'clack'), part id name score 5 c 6 clack 8),
 (('c', 'kear'), part id name score 4 c 5 kear 7)]
```

```python
# Convert to a dictionary
dict(list(df_group))
```

```
{('a', 'Zhang San'): part id name score 0 a 1 Zhang San 10,
 ('a', 'Li Si'): part id name score 1 a 2 Li Si 5,
 ('b', 'smith'): part id name score 3 b 4 smith 6,
 ('b', 'Wang Wu'): part id name score 2 b 3 Wang Wu 10,
 ('c', 'clack'): part id name score 5 c 6 clack 8,
 ('c', 'kear'): part id name score 4 c 5 kear 7}
```
Aggregation operation
In fact, the content covered so far already includes aggregation. Put bluntly, aggregation means computing over the records of each group; the max, min, mean, sum and size methods all belong to aggregation operations.
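As a sketch of that equivalence, the same built-in reductions can also be requested through `agg` by name, shown here on a hand-built copy of the data:

```python
import pandas as pd

df = pd.DataFrame({
    'part': ['a', 'a', 'b', 'b', 'c', 'c'],
    'score': [10, 5, 10, 6, 7, 8],
})

# Passing the method names as strings is equivalent to calling
# .max(), .min(), .mean(), .sum() on the grouped column
result = df.groupby('part')['score'].agg(['max', 'min', 'mean', 'sum'])
print(result)
```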
There is one method not mentioned earlier: describe. It came up when discussing DataFrame, and as the previous posts noted, DataFrame methods are basically applicable to grouped objects, so it was not singled out.
```python
df_group.describe()
```
(The first eight statistics columns describe id, the second eight describe score.)

part | name | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
a | Zhang San | 1.0 | 1.0 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 10.0 | NaN | 10.0 | 10.0 | 10.0 | 10.0 | 10.0
 | Li Si | 1.0 | 2.0 | NaN | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 5.0 | NaN | 5.0 | 5.0 | 5.0 | 5.0 | 5.0
b | smith | 1.0 | 4.0 | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 1.0 | 6.0 | NaN | 6.0 | 6.0 | 6.0 | 6.0 | 6.0
 | Wang Wu | 1.0 | 3.0 | NaN | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.0 | 10.0 | NaN | 10.0 | 10.0 | 10.0 | 10.0 | 10.0
c | clack | 1.0 | 6.0 | NaN | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 1.0 | 8.0 | NaN | 8.0 | 8.0 | 8.0 | 8.0 | 8.0
 | kear | 1.0 | 5.0 | NaN | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 1.0 | 7.0 | NaN | 7.0 | 7.0 | 7.0 | 7.0 | 7.0
Key point of the method: user-defined aggregation functions.

```python
def my_func(df):
    """Return the range (max minus min) of each column."""
    return df.max() - df.min()

# Pass the function object itself. Writing my_func() with parentheses would
# pass the function's return value rather than the function, so the
# parentheses must be left off here.
df_group_obj.agg(my_func)
```
part | id | score
---|---|---
a | 1 | 5 |
b | 1 | 4 |
c | 1 | 1 |
```python
type(df_group_obj.agg(my_func))
```

```
pandas.core.frame.DataFrame
```
```python
# The type shows that the aggregation also returns a DataFrame.
# agg can take multiple function names at once; the results are as follows:
df_group_obj.agg(['sum', my_func])
```
part | id (sum) | id (my_func) | score (sum) | score (my_func)
---|---|---|---|---
a | 3 | 1 | 15 | 5 |
b | 7 | 1 | 16 | 4 |
c | 11 | 1 | 15 | 1 |
```python
# agg is very flexible; for example, passing a (name, function) tuple
# renames the result column directly.
df_group_obj.agg(['sum', ('New column name', my_func)])
```
part | id (sum) | id (New column name) | score (sum) | score (New column name)
---|---|---|---|---
a | 3 | 1 | 15 | 5 |
b | 7 | 1 | 16 | 4 |
c | 11 | 1 | 15 | 1 |
Let's stop here. The above is the basic content of grouping and aggregation; if you have thoroughly understood it, you will basically be fine.
Next, let's look at some calculations that are more advanced than the basics, plus one last simulated example.
With the methods above, including the previous one or two posts, we can clearly already reproduce many Excel features, but the strength of pandas goes far beyond that. Whether it can really show a trend of replacing Excel depends on top Python data analysts in the industry, and on all of us, pushing in that direction. In terms of learning cost, Excel is obviously cheaper than Python; that is well known and no slight against either tool. By design, Excel is a mature application, while pandas, as a programming-language toolkit, sits at a lower level, and lower-level tools usually take more effort to achieve the same result. But once you are proficient, what you can do goes far beyond Excel alone.
```python
import pandas as pd
import random as rd
import string

# Randomly generate a batch of finished data and write it to Excel.
data = {
    'id': [i for i in range(30)],
    # names may repeat
    'name': [''.join([rd.choice(string.ascii_letters) for i in range(5)])
             for i in range(30)],
    'part': [rd.choice(['a', 'b', 'c', 'd']) for i in range(30)],
    'score': [rd.randint(60, 100) for i in range(30)],
}
data = pd.DataFrame(data)
data
data.to_excel('fin_exam.xlsx', sheet_name='data_score')
```
 | id | name | part | score
---|---|---|---|---
0 | 0 | DrPYm | a | 69 |
1 | 1 | JFOUg | d | 89 |
2 | 2 | fDGRR | a | 94 |
3 | 3 | QEaJz | a | 66 |
4 | 4 | IQwCs | a | 89 |
5 | 5 | QglNK | a | 98 |
6 | 6 | YEHEx | b | 77 |
7 | 7 | CThQU | a | 61 |
8 | 8 | ypFKY | a | 67 |
9 | 9 | YkzKn | b | 72 |
10 | 10 | BgoYQ | b | 75 |
11 | 11 | WLSAQ | a | 91 |
12 | 12 | wvGtX | c | 87 |
13 | 13 | wZigu | a | 64 |
14 | 14 | CgNab | d | 69 |
15 | 15 | OLwjx | c | 98 |
16 | 16 | OhMaH | d | 75 |
17 | 17 | wZXNp | b | 82 |
18 | 18 | unlzR | a | 93 |
19 | 19 | mHbyE | b | 69 |
20 | 20 | VjgMR | a | 63 |
21 | 21 | JkrpE | c | 89 |
22 | 22 | QLetK | a | 84 |
23 | 23 | MIcMu | d | 63 |
24 | 24 | DYdgN | c | 63 |
25 | 25 | rrjMV | d | 98 |
26 | 26 | mzSvn | c | 75 |
27 | 27 | BjWdT | d | 76 |
28 | 28 | dEPBF | d | 88 |
29 | 29 | lrsGq | c | 62 |
Task: extract the scores of the top three people in each group.
```python
# group
data_group = data.groupby('part')
# check the group sizes
print(data_group.size())

# custom operation
def top_3(df):
    # ascending=False means descending sort
    return df.sort_values(by='score', ascending=False)[:3]

# result
data_group.apply(top_3)
```

```
part
a    12
b     5
c     6
d     7
dtype: int64
```
part | | id | name | part | score
---|---|---|---|---|---
a | 5 | 5 | QglNK | a | 98 |
2 | 2 | fDGRR | a | 94 | |
18 | 18 | unlzR | a | 93 | |
b | 17 | 17 | wZXNp | b | 82 |
6 | 6 | YEHEx | b | 77 | |
10 | 10 | BgoYQ | b | 75 | |
c | 15 | 15 | OLwjx | c | 98 |
21 | 21 | JkrpE | c | 89 | |
12 | 12 | wvGtX | c | 87 | |
d | 25 | 25 | rrjMV | d | 98 |
1 | 1 | JFOUg | d | 89 | |
28 | 28 | dEPBF | d | 88 |
Task: extract the scores of the bottom three in each group.
```python
def top_d3(df):
    # sort descending, then take the last three rows
    return df.sort_values(by='score', ascending=False)[-3:]

data_group.apply(top_d3)
```
part | | id | name | part | score
---|---|---|---|---|---
a | 13 | 13 | wZigu | a | 64 |
20 | 20 | VjgMR | a | 63 | |
7 | 7 | CThQU | a | 61 | |
b | 10 | 10 | BgoYQ | b | 75 |
9 | 9 | YkzKn | b | 72 | |
19 | 19 | mHbyE | b | 69 | |
c | 26 | 26 | mzSvn | c | 75 |
24 | 24 | DYdgN | c | 63 | |
29 | 29 | lrsGq | c | 62 | |
d | 16 | 16 | OhMaH | d | 75 |
14 | 14 | CgNab | d | 69 | |
23 | 23 | MIcMu | d | 63 |
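As an aside, the same top-n-per-group result can be reached without apply: sort once, then take the first rows of each group with `GroupBy.head`. A sketch on a small fixed sample (the article's data is randomly generated, so the values here are made up):

```python
import pandas as pd

# Small fixed sample standing in for the random exam data
data = pd.DataFrame({
    'id': range(6),
    'part': ['a', 'a', 'a', 'b', 'b', 'b'],
    'score': [69, 94, 66, 77, 72, 82],
})

# Sort the whole frame descending by score, then keep the first 2 rows
# of each part: the 2 highest scores per group
top2 = data.sort_values('score', ascending=False).groupby('part').head(2)
print(top2.sort_values(['part', 'score'], ascending=[True, False]))
```

Unlike apply, this keeps the original flat index and avoids the per-group Python function call.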
The examples above use the apply method, which lets us plug in a custom function.
There is also an alternative method:

- transform: it also takes a custom function, but unlike apply its result is aligned with the original rows (not expanded on here).
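For the record, a minimal transform sketch on hand-built data: because transform returns a result aligned with the original rows, it is handy for adding group-level statistics as a new column.

```python
import pandas as pd

df = pd.DataFrame({
    'part': ['a', 'a', 'b', 'b', 'c', 'c'],
    'score': [10, 5, 10, 6, 7, 8],
})

# transform broadcasts the group mean back onto every row of that group
df['part_mean'] = df.groupby('part')['score'].transform('mean')
df['diff_from_mean'] = df['score'] - df['part_mean']
print(df)
```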