Originally I planned to jump straight into a practical project, but then I decided it would be better to finish the necessary groundwork first; otherwise many things in the practical examples would be confusing and tedious to explain after the fact.
This article is the last of the pandas explanations. The three posts together should be enough to handle the vast majority of situations; in fact, they cover almost all of the main content of pandas. Let's give it a good ending.
Let's start by explaining the concept of grouping in plain language.
Grouping and aggregation
Grouping means exactly what the word suggests.
For example, take the following data (imported with code):
```python
import pandas as pd

file_path = './split_test.xlsx'
df = pd.read_excel(file_path, sheet_name='Sheet1')
df
```
 | part | id | name | score
---|---|---|---|---
0 | a | 1 | Zhang San | 10
1 | a | 2 | Li Si | 5
2 | b | 3 | Wang Wu | 10
3 | b | 4 | smith | 6
4 | c | 5 | kear | 7
5 | c | 6 | clack | 8
If you put part a in one block, part b in another, and part c in another, that is grouping. In SQL this operation is done with GROUP BY. pandas uses the same method, as follows:
```python
df1 = df.groupby('part')
# output
df1
```

```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000024543471490>
```
You can see that it returns a DataFrameGroupBy object.
```python
# Try reading its data. No error is raised here, which shows it is iterable.
for line in df1:
    print(line)
```

```
('a',   part  id       name  score
0    a   1  Zhang San     10
1    a   2      Li Si      5)
('b',   part  id     name  score
2    b   3  Wang Wu     10
3    b   4    smith      6)
('c',   part  id   name  score
4    c   5   kear      7
5    c   6  clack      8)
```
```python
# print([line for line in df1])
print(list(df1)[0])
```

```
('a',   part  id       name  score
0    a   1  Zhang San     10
1    a   2      Li Si      5)
```
The first traversal shows that each element is a two-element tuple, so let's traverse again with two loop variables.
```python
for part, ele in df1:
    print(part, ele)
```

```
a   part  id       name  score
0    a   1  Zhang San     10
1    a   2      Li Si      5
b   part  id     name  score
2    b   3  Wang Wu     10
3    b   4    smith      6
c   part  id   name  score
4    c   5   kear      7
5    c   6  clack      8
```
```python
# View the type of one group
type(ele)
```

```
pandas.core.frame.DataFrame
```
This shows that each group element (ele) is a DataFrame object, so it naturally inherits all DataFrame methods, and you can just as naturally take out any of its elements.
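Besides iterating, a single group can be pulled out directly with `get_group`, a standard method of the groupby object. A minimal sketch, using a hand-built DataFrame that mirrors the article's table:

```python
import pandas as pd

# Hand-built copy of the article's data (names romanized)
df = pd.DataFrame({
    'part': ['a', 'a', 'b', 'b', 'c', 'c'],
    'id': [1, 2, 3, 4, 5, 6],
    'name': ['Zhang San', 'Li Si', 'Wang Wu', 'smith', 'kear', 'clack'],
    'score': [10, 5, 10, 6, 7, 8],
})

grouped = df.groupby('part')

# get_group returns one group's DataFrame directly, no loop needed
group_a = grouped.get_group('a')
print(group_a)
```

This is often more convenient than looping when you only care about one group.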
That's the general impression. Next, let's go through the methods concretely.

```python
# View the data
df
```
 | part | id | name | score
---|---|---|---|---
0 | a | 1 | Zhang San | 10
1 | a | 2 | Li Si | 5
2 | b | 3 | Wang Wu | 10
3 | b | 4 | smith | 6
4 | c | 5 | kear | 7
5 | c | 6 | clack | 8
```python
# Group the DataFrame as a whole on one key
df_group_obj = df.groupby('part')
# Printing shows it is an object. Note that this object only holds
# intermediate grouping data.
print(df_group_obj)
# The size method reports the number of elements in each group
print(df_group_obj.size())
# The object can also call some DataFrame methods, such as:
# mean of each column per group
print(df_group_obj.mean())
# sum of each column per group
print(df_group_obj.sum())
# maximum of each column per group
print(df_group_obj.max())
# minimum of each column per group
print(df_group_obj.min())
```
```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000245494B2520>
part
a    2
b    2
c    2
dtype: int64
       id  score
part
a     1.5    7.5
b     3.5    8.0
c     5.5    7.5
      id  score
part
a      3     15
b      7     16
c     11     15
      id       name  score
part
a      2      Li Si     10
b      4    Wang Wu     10
c      6       kear      8
      id       name  score
part
a      1  Zhang San      5
b      3      smith      6
c      5      clack      7
```
Also note that the operations above all act per group, so with three groups each result has three records (that is, three rows).
The operations above group the whole df object; you can also take out a single column and group it. That is, the same operation works on a Series object.
```python
# Note that the argument to groupby must be taken from df, otherwise an
# error is raised. This is not hard to see: df['score'] is itself a Series,
# and pairing the two Series is effectively a DataFrame, so the grouping
# method works here.
ser_group_by = df['score'].groupby(df['part'])
# The rest works as above, so it needs no further explanation.
print(ser_group_by.size())
print(ser_group_by.max())
print(ser_group_by.min())
```
```
part
a    2
b    2
c    2
Name: score, dtype: int64
part
a    10
b    10
c     8
Name: score, dtype: int64
part
a    5
b    6
c    7
Name: score, dtype: int64
```
Similarly, we can define a custom grouping: we only need a sequence with the same number of elements as there are records (rows).
```python
import random

# Randomly generate a grouping sequence
group = [random.randint(1, 4) for i in range(6)]
# group by it
df_group = df.groupby(group)
# view the object
print(df_group)
```

```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002454927ABB0>
```

```python
print(df_group.size())
```

```
1    3
3    2
4    1
dtype: int64
```

```python
# As a check, print the sequence
print(group)
```

```
[1, 4, 1, 1, 3, 3]
```
Grouping by multiple keys produces a hierarchical index.
```python
df_group = df.groupby(['part', 'name'])
print(df_group.size())
```

```
part  name
a     Zhang San    1
      Li Si        1
b     smith        1
      Wang Wu      1
c     clack        1
      kear         1
dtype: int64
```
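The size() result of a multi-key grouping is a Series with a MultiIndex, so it can be reshaped into a table with `unstack`. A small sketch on a hand-built copy of the data:

```python
import pandas as pd

df = pd.DataFrame({
    'part': ['a', 'a', 'b', 'b', 'c', 'c'],
    'name': ['Zhang San', 'Li Si', 'Wang Wu', 'smith', 'kear', 'clack'],
    'score': [10, 5, 10, 6, 7, 8],
})

sizes = df.groupby(['part', 'name']).size()
# Pivot the inner index level into columns: parts as rows, names as columns
table = sizes.unstack(fill_value=0)
print(table)
```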
In addition, let's consider the nature of the data produced by grouping, based on the output above:

- It is a dedicated object holding intermediate data; the groups it contains support the operations of Series and DataFrame objects.
- Its index is defined by us. For example, when we group by two keys, as shown above, it carries two levels of index.
- The object supports some pandas methods, but not all of them, so when it cannot handle something we can convert it into an ordinary data structure such as a dict or a list.
- The object supports iteration.
```python
# Convert to a list
list(df_group)
```

```
[(('a', 'Zhang San'), part id name score 0 a 1 Zhang San 10),
 (('a', 'Li Si'), part id name score 1 a 2 Li Si 5),
 (('b', 'smith'), part id name score 3 b 4 smith 6),
 (('b', 'Wang Wu'), part id name score 2 b 3 Wang Wu 10),
 (('c', 'clack'), part id name score 5 c 6 clack 8),
 (('c', 'kear'), part id name score 4 c 5 kear 7)]
```

```python
# Convert to a dictionary
dict(list(df_group))
```

```
{('a', 'Zhang San'): part id name score 0 a 1 Zhang San 10,
 ('a', 'Li Si'): part id name score 1 a 2 Li Si 5,
 ('b', 'smith'): part id name score 3 b 4 smith 6,
 ('b', 'Wang Wu'): part id name score 2 b 3 Wang Wu 10,
 ('c', 'clack'): part id name score 5 c 6 clack 8,
 ('c', 'kear'): part id name score 4 c 5 kear 7}
```
Aggregation operation
In fact, the content covered so far already includes aggregation. Put bluntly, aggregation means computing over the records of each group; the max, min, mean, sum and size methods all belong to aggregation operations.
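As a sketch of that equivalence, the same built-in reductions can also be requested through `agg` by name, shown here on a hand-built copy of the data:

```python
import pandas as pd

df = pd.DataFrame({
    'part': ['a', 'a', 'b', 'b', 'c', 'c'],
    'score': [10, 5, 10, 6, 7, 8],
})

# Passing the method names as strings is equivalent to calling
# .max(), .min(), .mean(), .sum() on the grouped column
result = df.groupby('part')['score'].agg(['max', 'min', 'mean', 'sum'])
print(result)
```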
There is one method not mentioned earlier: describe. It came up when discussing DataFrame, and as the previous posts noted, DataFrame methods are basically applicable to grouped objects, so it was not singled out.
```python
df_group.describe()
```
(The first eight statistics columns describe id, the second eight describe score.)

part | name | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
a | Zhang San | 1.0 | 1.0 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 10.0 | NaN | 10.0 | 10.0 | 10.0 | 10.0 | 10.0
 | Li Si | 1.0 | 2.0 | NaN | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 5.0 | NaN | 5.0 | 5.0 | 5.0 | 5.0 | 5.0
b | smith | 1.0 | 4.0 | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 1.0 | 6.0 | NaN | 6.0 | 6.0 | 6.0 | 6.0 | 6.0
 | Wang Wu | 1.0 | 3.0 | NaN | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 1.0 | 10.0 | NaN | 10.0 | 10.0 | 10.0 | 10.0 | 10.0
c | clack | 1.0 | 6.0 | NaN | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 1.0 | 8.0 | NaN | 8.0 | 8.0 | 8.0 | 8.0 | 8.0
 | kear | 1.0 | 5.0 | NaN | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 1.0 | 7.0 | NaN | 7.0 | 7.0 | 7.0 | 7.0 | 7.0
Key point of the method: user-defined aggregation functions.

```python
def my_func(df):
    """Return the range (max minus min) of each column."""
    return df.max() - df.min()

# Pass the function object itself. Writing my_func() with parentheses would
# pass the function's return value rather than the function, so the
# parentheses must be left off here.
df_group_obj.agg(my_func)
```
part | id | score
---|---|---
a | 1 | 5 |
b | 1 | 4 |
c | 1 | 1 |
```python
type(df_group_obj.agg(my_func))
```

```
pandas.core.frame.DataFrame
```
```python
# The type shows that the aggregation also returns a DataFrame.
# agg can take multiple function names at once; the results are as follows:
df_group_obj.agg(['sum', my_func])
```
part | id (sum) | id (my_func) | score (sum) | score (my_func)
---|---|---|---|---
a | 3 | 1 | 15 | 5 |
b | 7 | 1 | 16 | 4 |
c | 11 | 1 | 15 | 1 |
```python
# agg is very flexible; for example, passing a (name, function) tuple
# renames the result column directly.
df_group_obj.agg(['sum', ('New column name', my_func)])
```
part | id (sum) | id (New column name) | score (sum) | score (New column name)
---|---|---|---|---
a | 3 | 1 | 15 | 5 |
b | 7 | 1 | 16 | 4 |
c | 11 | 1 | 15 | 1 |
Let's stop here. The above is the basic content of grouping and aggregation; if you have thoroughly understood it, you will basically be fine.
Next, let's look at some calculations that are more advanced than the basics, plus one last simulated example.
With the methods above, including the previous one or two posts, we can clearly already reproduce many Excel features, but the strength of pandas goes far beyond that. Whether it can really show a trend of replacing Excel depends on top Python data analysts in the industry, and on all of us, pushing in that direction. In terms of learning cost, Excel is obviously cheaper than Python; that is well known and no slight against either tool. By design, Excel is a mature application, while pandas, as a programming-language toolkit, sits at a lower level, and lower-level tools usually take more effort to achieve the same result. But once you are proficient, what you can do goes far beyond Excel alone.
```python
import pandas as pd
import random as rd
import string

# Randomly generate a batch of finished data and write it to Excel.
data = {
    'id': [i for i in range(30)],
    # names may repeat
    'name': [''.join([rd.choice(string.ascii_letters) for i in range(5)])
             for i in range(30)],
    'part': [rd.choice(['a', 'b', 'c', 'd']) for i in range(30)],
    'score': [rd.randint(60, 100) for i in range(30)],
}
data = pd.DataFrame(data)
data
data.to_excel('fin_exam.xlsx', sheet_name='data_score')
```
 | id | name | part | score
---|---|---|---|---
0 | 0 | DrPYm | a | 69 |
1 | 1 | JFOUg | d | 89 |
2 | 2 | fDGRR | a | 94 |
3 | 3 | QEaJz | a | 66 |
4 | 4 | IQwCs | a | 89 |
5 | 5 | QglNK | a | 98 |
6 | 6 | YEHEx | b | 77 |
7 | 7 | CThQU | a | 61 |
8 | 8 | ypFKY | a | 67 |
9 | 9 | YkzKn | b | 72 |
10 | 10 | BgoYQ | b | 75 |
11 | 11 | WLSAQ | a | 91 |
12 | 12 | wvGtX | c | 87 |
13 | 13 | wZigu | a | 64 |
14 | 14 | CgNab | d | 69 |
15 | 15 | OLwjx | c | 98 |
16 | 16 | OhMaH | d | 75 |
17 | 17 | wZXNp | b | 82 |
18 | 18 | unlzR | a | 93 |
19 | 19 | mHbyE | b | 69 |
20 | 20 | VjgMR | a | 63 |
21 | 21 | JkrpE | c | 89 |
22 | 22 | QLetK | a | 84 |
23 | 23 | MIcMu | d | 63 |
24 | 24 | DYdgN | c | 63 |
25 | 25 | rrjMV | d | 98 |
26 | 26 | mzSvn | c | 75 |
27 | 27 | BjWdT | d | 76 |
28 | 28 | dEPBF | d | 88 |
29 | 29 | lrsGq | c | 62 |
Task: extract the scores of the top three people in each group.
```python
# group
data_group = data.groupby('part')
# check the group sizes
print(data_group.size())

# custom operation
def top_3(df):
    # ascending=False means descending sort
    return df.sort_values(by='score', ascending=False)[:3]

# result
data_group.apply(top_3)
```

```
part
a    12
b     5
c     6
d     7
dtype: int64
```
part | | id | name | part | score
---|---|---|---|---|---
a | 5 | 5 | QglNK | a | 98 |
2 | 2 | fDGRR | a | 94 | |
18 | 18 | unlzR | a | 93 | |
b | 17 | 17 | wZXNp | b | 82 |
6 | 6 | YEHEx | b | 77 | |
10 | 10 | BgoYQ | b | 75 | |
c | 15 | 15 | OLwjx | c | 98 |
21 | 21 | JkrpE | c | 89 | |
12 | 12 | wvGtX | c | 87 | |
d | 25 | 25 | rrjMV | d | 98 |
1 | 1 | JFOUg | d | 89 | |
28 | 28 | dEPBF | d | 88 |
Task: extract the scores of the bottom three in each group.
```python
def top_d3(df):
    # sort descending, then take the last three rows
    return df.sort_values(by='score', ascending=False)[-3:]

data_group.apply(top_d3)
```
part | | id | name | part | score
---|---|---|---|---|---
a | 13 | 13 | wZigu | a | 64 |
20 | 20 | VjgMR | a | 63 | |
7 | 7 | CThQU | a | 61 | |
b | 10 | 10 | BgoYQ | b | 75 |
9 | 9 | YkzKn | b | 72 | |
19 | 19 | mHbyE | b | 69 | |
c | 26 | 26 | mzSvn | c | 75 |
24 | 24 | DYdgN | c | 63 | |
29 | 29 | lrsGq | c | 62 | |
d | 16 | 16 | OhMaH | d | 75 |
14 | 14 | CgNab | d | 69 | |
23 | 23 | MIcMu | d | 63 |
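As an aside, the same top-n-per-group result can be reached without apply: sort once, then take the first rows of each group with `GroupBy.head`. A sketch on a small fixed sample (the article's data is randomly generated, so the values here are made up):

```python
import pandas as pd

# Small fixed sample standing in for the random exam data
data = pd.DataFrame({
    'id': range(6),
    'part': ['a', 'a', 'a', 'b', 'b', 'b'],
    'score': [69, 94, 66, 77, 72, 82],
})

# Sort the whole frame descending by score, then keep the first 2 rows
# of each part: the 2 highest scores per group
top2 = data.sort_values('score', ascending=False).groupby('part').head(2)
print(top2.sort_values(['part', 'score'], ascending=[True, False]))
```

Unlike apply, this keeps the original flat index and avoids the per-group Python function call.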
The examples above use the apply method, which lets us plug in a custom function.
There is also an alternative method:

- transform: it also takes a custom function, but unlike apply its result is aligned with the original rows (not expanded on here).
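For the record, a minimal transform sketch on hand-built data: because transform returns a result aligned with the original rows, it is handy for adding group-level statistics as a new column.

```python
import pandas as pd

df = pd.DataFrame({
    'part': ['a', 'a', 'b', 'b', 'c', 'c'],
    'score': [10, 5, 10, 6, 7, 8],
})

# transform broadcasts the group mean back onto every row of that group
df['part_mean'] = df.groupby('part')['score'].transform('mean')
df['diff_from_mean'] = df['score'] - df['part_mean']
print(df)
```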