Introduction to pandas (Conclusion) -- Mo Xiaotian

Keywords: Data Analysis pandas

Originally I planned to jump straight into hands-on practice. On reflection, I decided it was better to finish the remaining fundamentals first; otherwise many things in the practical examples would be confusing, and having to backtrack to explain them would get messy.

This article is the last of the pandas walkthroughs. The three articles together should be enough to handle the vast majority of situations, and between them they cover most of pandas' core content. Let's give the series a proper ending.

Let's first explain the concept of grouping informally.

Grouping and aggregation

Grouping means exactly what the word suggests.

For example, take the following data (loaded with the code below):

import pandas as pd
file_path = './split_test.xlsx'
df = pd.read_excel(file_path, sheet_name='Sheet1')
df
   part  id       name  score
0     a   1  Zhang San     10
1     a   2      Li Si      5
2     b   3    Wang Wu     10
3     b   4      smith      6
4     c   5       kear      7
5     c   6      clack      8

If you put the part-a rows in one block, the part-b rows in another, and the part-c rows in a third, that is a grouping. In SQL this operation is performed with groupby, and pandas uses the same method name, as follows:

df1 = df.groupby('part')

output data

df1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000024543471490>

You can see that it returns a DataFrameGroupBy object.
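As an aside not covered in the original post: besides iterating, a single group can be pulled out of the DataFrameGroupBy object with get_group. A minimal sketch, rebuilding the example table in code so it runs standalone:

```python
import pandas as pd

# Rebuild the post's example table in code so this sketch is self-contained.
df = pd.DataFrame({
    'part':  ['a', 'a', 'b', 'b', 'c', 'c'],
    'id':    [1, 2, 3, 4, 5, 6],
    'name':  ['Zhang San', 'Li Si', 'Wang Wu', 'smith', 'kear', 'clack'],
    'score': [10, 5, 10, 6, 7, 8],
})

grouped = df.groupby('part')
# get_group returns the sub-DataFrame for a single key, no loop needed.
group_a = grouped.get_group('a')
print(group_a)
```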

# Try to read its data. This step raises no error, which shows the object is iterable.
for line in df1:
    print(line)
('a',   part  id       name  score
0    a   1  Zhang San     10
1    a   2      Li Si      5)
('b',   part  id   name  score
2    b   3  Wang Wu     10
3    b   4    smith      6)
('c',   part  id   name  score
4    c   5   kear      7
5    c   6  clack      8)
# print([line for line in df1])
print(list(df1)[0])
('a',   part  id       name  score
0    a   1  Zhang San     10
1    a   2      Li Si      5)

Each iteration yields a two-element tuple, so let's unpack it into two variables while traversing.

for part, ele in df1:
    print(part, ele)
a   part  id       name  score
0    a   1  Zhang San     10
1    a   2      Li Si      5
b   part  id   name  score
2    b   3  Wang Wu     10
3    b   4    smith      6
c   part  id   name  score
4    c   5   kear      7
5    c   6  clack      8
# View the type of one block
type(ele)
pandas.core.frame.DataFrame

This shows that each ele, that is, each group, is a DataFrame object, so it naturally inherits all the DataFrame methods, and you can of course extract any of its elements.

Keep that picture in mind; next, let's go through the methods more concretely.

# View data
df
   part  id       name  score
0     a   1  Zhang San     10
1     a   2      Li Si      5
2     b   3    Wang Wu     10
3     b   4      smith      6
4     c   5       kear      7
5     c   6      clack      8
# Group the whole DataFrame by one key
df_group_obj = df.groupby('part')
# Viewing it shows that it is an object. Note that this object only holds intermediate grouping data
print(df_group_obj)
# Use the object's size() method to see the number of elements in each group
print(df_group_obj.size())
# This object can also call some DataFrame methods, such as:
# Average each column within each group
print(df_group_obj.mean())
# Sum each column within each group
print(df_group_obj.sum())
# Get the maximum of each column within each group
print(df_group_obj.max())
# Get the minimum of each column within each group
print(df_group_obj.min())
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000245494B2520>
part
a    2
b    2
c    2
dtype: int64
       id  score
part            
a     1.5    7.5
b     3.5    8.0
c     5.5    7.5
      id  score
part           
a      3     15
b      7     16
c     11     15
      id  name  score
part                 
a      2    Li Si     10
b      4    Wang Wu     10
c      6  kear      8
      id   name  score
part                  
a      1     Zhang San      5
b      3  smith      6
c      5  clack      7

Also note that all the operations above act per group, so if the data is divided into three groups, the result has three records (that is, three rows).

The above operation is to group the whole df object, and you can also take out each column for grouping.

That is, the operation can also be performed on a series object

# Note that the key passed to groupby here must be taken from df (i.e. df['part']), otherwise an error is raised. It is not hard to see why: df['part'] is itself a Series object, and splicing the two Series together
# gives a DataFrame, so the grouping method still applies here.
ser_group_by = df['score'].groupby(df['part'])
# The rest of the operations are the same as above, so they need no further explanation.
print(ser_group_by.size())
print(ser_group_by.max())
print(ser_group_by.min())
part
a    2
b    2
c    2
Name: score, dtype: int64
part
a    10
b    10
c     8
Name: score, dtype: int64
part
a    5
b    6
c    7
Name: score, dtype: int64
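A note beyond the post: the more common spelling selects the column after grouping, which yields the same per-group Series result. A minimal sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'part':  ['a', 'a', 'b', 'b', 'c', 'c'],
    'score': [10, 5, 10, 6, 7, 8],
})

# Selecting the column after groupby gives a SeriesGroupBy,
# equivalent to df['score'].groupby(df['part']).
by_col = df.groupby('part')['score'].max()
print(by_col)
```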

Similarly, we can define a grouping ourselves: we only need to supply a sequence with the same number of elements as there are records (rows).

# Randomly generate a sequence
import random
group = [random.randint(1,4) for i in range(6)]
# grouping
df_group = df.groupby(group)
# View data
print(df_group)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002454927ABB0>
print(df_group.size())
1    3
3    2
4    1
dtype: int64
## As a verification, you can print the sequence to see
print(group)
[1, 4, 1, 1, 3, 3]
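Beyond a random list, any sequence aligned with the rows can serve as the grouping key — for instance, a Boolean condition computed from a column. A small sketch (the threshold of 8 is an arbitrary choice for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'part':  ['a', 'a', 'b', 'b', 'c', 'c'],
    'score': [10, 5, 10, 6, 7, 8],
})

# Group by a condition derived from a column: pass/fail at a threshold of 8.
passed = df['score'] >= 8          # Boolean Series, same length as df
sizes = df.groupby(passed).size()
print(sizes)
```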

Grouping by multiple keys produces a hierarchical index.

df_group = df.groupby(['part', 'name'])
print(df_group.size())
part  name 
a     Zhang San       1
      Li Si       1
b     smith    1
      Wang Wu       1
c     clack    1
      kear     1
dtype: int64
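Since the two keys form a two-level MultiIndex, the grouped result can be pivoted with unstack, turning one index level into columns. A sketch with made-up mini data:

```python
import pandas as pd

df = pd.DataFrame({
    'part':  ['a', 'a', 'b'],
    'name':  ['x', 'y', 'x'],
    'score': [1, 2, 3],
})

sizes = df.groupby(['part', 'name']).size()
# The two grouping keys form a two-level MultiIndex;
# unstack pivots the inner level ('name') into columns.
table = sizes.unstack(fill_value=0)
print(table)
```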

In addition, let's consider the nature of the object produced by grouping, based on the output above:

  1. It is a special object holding only intermediate data, on which essentially any Series and DataFrame operation can be performed.
  2. Its index is determined by the grouping keys we supply. For example, when we group by two keys, as shown above, the result carries a two-level index.
  3. The object supports some pandas methods but not all of them, so when it becomes hard to work with, we can convert it to an ordinary data structure such as a dict or a list.
  4. The object supports iteration.
# Convert to list
list(df_group)
[(('a', 'Zhang San'),
    part  id name  score
  0    a   1   Zhang San     10),
 (('a', 'Li Si'),
    part  id name  score
  1    a   2   Li Si      5),
 (('b', 'smith'),
    part  id   name  score
  3    b   4  smith      6),
 (('b', 'Wang Wu'),
    part  id name  score
  2    b   3   Wang Wu     10),
 (('c', 'clack'),
    part  id   name  score
  5    c   6  clack      8),
 (('c', 'kear'),
    part  id  name  score
  4    c   5  kear      7)]
# Convert to dictionary
dict(list(df_group))
{('a',
  'Zhang San'):   part  id name  score
 0    a   1   Zhang San     10,
 ('a',
  'Li Si'):   part  id name  score
 1    a   2   Li Si      5,
 ('b',
  'smith'):   part  id   name  score
 3    b   4  smith      6,
 ('b',
  'Wang Wu'):   part  id name  score
 2    b   3   Wang Wu     10,
 ('c',
  'clack'):   part  id   name  score
 5    c   6  clack      8,
 ('c',
  'kear'):   part  id  name  score
 4    c   5  kear      7}

Aggregation operation

In fact, the content covered above already includes aggregation. Put simply, aggregation computes a single result from the several records of each group; the max, min, mean, sum and size methods all count as aggregation operations.

One method not mentioned above is describe. It came up when discussing DataFrame, and as the previous article noted, DataFrame methods basically also apply to grouped objects, so it wasn't singled out.

df_group.describe()
                  id                                         score                                     
               count mean  std  min  25%  50%  75%  max  count  mean  std   min   25%   50%   75%   max
part name                                                                                              
a    Zhang San   1.0  1.0  NaN  1.0  1.0  1.0  1.0  1.0    1.0  10.0  NaN  10.0  10.0  10.0  10.0  10.0
     Li Si       1.0  2.0  NaN  2.0  2.0  2.0  2.0  2.0    1.0   5.0  NaN   5.0   5.0   5.0   5.0   5.0
b    smith       1.0  4.0  NaN  4.0  4.0  4.0  4.0  4.0    1.0   6.0  NaN   6.0   6.0   6.0   6.0   6.0
     Wang Wu     1.0  3.0  NaN  3.0  3.0  3.0  3.0  3.0    1.0  10.0  NaN  10.0  10.0  10.0  10.0  10.0
c    clack       1.0  6.0  NaN  6.0  6.0  6.0  6.0  6.0    1.0   8.0  NaN   8.0   8.0   8.0   8.0   8.0
     kear        1.0  5.0  NaN  5.0  5.0  5.0  5.0  5.0    1.0   7.0  NaN   7.0   7.0   7.0   7.0   7.0

Key point: user-defined aggregation methods.

def my_func(df):
    """
    Return the range (max - min) of each column.
    """
    return df.max() - df.min()

df_group_obj.agg(my_func)

# Note: pass the function object itself, without (). Calling it with () would pass the function's return value rather than the function, so parentheses cannot be used here.
      id  score
part           
a      1      5
b      1      4
c      1      1
type(df_group_obj.agg(my_func))
pandas.core.frame.DataFrame
# As type shows, the aggregation also returns a DataFrame. We can pass several function names to agg at once, with the following results:
df_group_obj.agg(['sum',my_func])
       id          score        
      sum my_func    sum my_func
part                            
a       3       1     15       5
b       7       1     16       4
c      11       1     15       1
# It is very flexible; for example, we can rename a result column directly.
df_group_obj.agg(['sum',('New column name',my_func)],)
       id                 score                
      sum New column name   sum New column name
part                                           
a       3               1    15               5
b       7               1    16               4
c      11               1    15               1
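One caveat worth adding: in newer pandas versions the tuple-renaming form shown above was deprecated and later removed; the supported replacement is "named aggregation", where each keyword argument is a ('column', function) pair. A sketch on the same toy data (the result names score_sum and score_range are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    'part':  ['a', 'a', 'b', 'b', 'c', 'c'],
    'id':    [1, 2, 3, 4, 5, 6],
    'score': [10, 5, 10, 6, 7, 8],
})

# Named aggregation: result column name = ('source column', function).
result = df.groupby('part').agg(
    score_sum=('score', 'sum'),
    score_range=('score', lambda s: s.max() - s.min()),
)
print(result)
```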

Let's stop here. The above is the essential content of grouping and aggregation; if you have thoroughly understood it, you will basically be fine.

Next, let's look at some calculations that are advanced relative to the basics, plus a final worked example.

With the methods above, plus the previous one or two posts, it is clear we can already reproduce many Excel workflows, but pandas' strength goes far beyond that; it genuinely shows the potential to replace Excel for analysis work, a direction that top Python data analysts in the industry, and all of us learning it, are moving toward. In terms of learning cost Excel is obviously lower than Python, and no disrespect is meant: by design Excel is mature end-user software, while pandas, as a programming-language toolkit, sits at a lower level. Lower-level tools are usually harder to use for the same task, but once you are fluent, what you can do is far more powerful than with Excel alone.

import pandas as pd
import random as rd
import string
## Randomly generate some sample data and write it to Excel.

data = {
    'id':[i for i in range(30)],
    'name': [''.join([rd.choice(string.ascii_letters) for i in range(5)]) for i in range(30)], # names may repeat
    'part': [rd.choice(['a','b','c','d']) for i in range(30)],
    'score': [rd.randint(60,100) for i in range(30)]
}
data = pd.DataFrame(data)
data
data.to_excel('fin_exam.xlsx', sheet_name='data_score')
    id   name part  score
0    0  DrPYm    a     69
1    1  JFOUg    d     89
2    2  fDGRR    a     94
3    3  QEaJz    a     66
4    4  IQwCs    a     89
5    5  QglNK    a     98
6    6  YEHEx    b     77
7    7  CThQU    a     61
8    8  ypFKY    a     67
9    9  YkzKn    b     72
10  10  BgoYQ    b     75
11  11  WLSAQ    a     91
12  12  wvGtX    c     87
13  13  wZigu    a     64
14  14  CgNab    d     69
15  15  OLwjx    c     98
16  16  OhMaH    d     75
17  17  wZXNp    b     82
18  18  unlzR    a     93
19  19  mHbyE    b     69
20  20  VjgMR    a     63
21  21  JkrpE    c     89
22  22  QLetK    a     84
23  23  MIcMu    d     63
24  24  DYdgN    c     63
25  25  rrjMV    d     98
26  26  mzSvn    c     75
27  27  BjWdT    d     76
28  28  dEPBF    d     88
29  29  lrsGq    c     62

Task: take out the top three scores in each group

# grouping
data_group = data.groupby('part')
# View the number of records in each group
print(data_group.size())
# Custom operation
def top_3(df):
    # The ascending=False parameter indicates descending sorting
    return df.sort_values(by='score', ascending=False)[:3]
# result  
data_group.apply(top_3)
part
a    12
b     5
c     6
d     7
dtype: int64
         id   name part  score
part                          
a    5    5  QglNK    a     98
     2    2  fDGRR    a     94
     18  18  unlzR    a     93
b    17  17  wZXNp    b     82
     6    6  YEHEx    b     77
     10  10  BgoYQ    b     75
c    15  15  OLwjx    c     98
     21  21  JkrpE    c     89
     12  12  wvGtX    c     87
d    25  25  rrjMV    d     98
     1    1  JFOUg    d     89
     28  28  dEPBF    d     88
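For comparison (not in the original post), the same top-three task can be done without apply, by sorting once and taking head(3) per group. A sketch on a small made-up subset of the data:

```python
import pandas as pd

df = pd.DataFrame({
    'part':  ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'score': [98, 94, 93, 61, 82, 77, 75],
})

# Sort the whole frame descending once, then keep the first three rows of each group.
top3 = df.sort_values('score', ascending=False).groupby('part').head(3)
print(top3)
```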

Task: take out the lowest three scores in each group

def top_d3(df):
    # Sort in descending order as before, then slice the last three rows (the lowest scores)
    return df.sort_values(by='score', ascending=False)[-3:]
data_group.apply(top_d3)
         id   name part  score
part                          
a    13  13  wZigu    a     64
     20  20  VjgMR    a     63
     7    7  CThQU    a     61
b    10  10  BgoYQ    b     75
     9    9  YkzKn    b     72
     19  19  mHbyE    b     69
c    26  26  mzSvn    c     75
     24  24  DYdgN    c     63
     29  29  lrsGq    c     62
d    16  16  OhMaH    d     75
     14  14  CgNab    d     69
     23  23  MIcMu    d     63

The examples above use the apply method, which lets us plug in our own custom function.

There is also an alternative method:

  1. transform: it also takes a custom function, but unlike apply, its result is aligned with the original rows (one value per row) rather than one row per group.
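The way transform differs from apply is that its result is broadcast back to the original row index, which makes it handy for adding a per-group statistic as a new column. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'part':  ['a', 'a', 'b', 'b'],
    'score': [10, 5, 10, 6],
})

# transform returns one value per original row, not one per group,
# so the result can be assigned directly as a new column.
df['group_mean'] = df.groupby('part')['score'].transform('mean')
print(df)
```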

That's all for pandas. pandas is vast, and a blogger can't cover everything; programming is mostly about looking things up and learning on your own. What a tutorial can offer is to help you avoid some pitfalls, extend a few topics, and sketch out the knowledge structure; how you learn is ultimately up to you. Taken together, the three articles cover roughly the whole of the basics, and you can also summarize them for yourself. Later on, depending on how things go, I may post some practical examples, including machine learning, when I have time. If you have questions or requests, you can reach the blogger on WeChat.

Posted by Tonka1979 on Wed, 24 Nov 2021 14:35:49 -0800