Article directory
Common functions of Pandas
Pandas grouping
In many cases, we divide the data into multiple sets and apply some functions to each subset. In the application function, the following operations can be performed.
Break data into groups
Grouping data
import pandas as pd import numpy as np df = pd.DataFrame(np.arange(18).reshape(6,3), index = list('abcdef'), columns = ['one','two','three']) print(df.groupby('one'))
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x10387dd68>
View grouping
import pandas as pd import numpy as np df = pd.DataFrame(np.arange(18).reshape(6,3), index = list('abcdef'), columns = ['one','two','three']) print(df.groupby(['one','two']).groups)
{(0, 1): Index(['a'], dtype='object'), (3, 4): Index(['b'], dtype='object'), (6, 7): Index(['c'], dtype='object'), (9, 10): Index(['d'], dtype='object'), (12, 13): Index(['e'], dtype='object'), (15, 16): Index(['f'], dtype='object')}
Iterative traversal grouping
import pandas as pd import numpy as np df = pd.DataFrame(np.arange(18).reshape(6,3), index = list('abcdef'), columns = ['one','two','three']) groupd = df.groupby('one') for name, group in groupd: # print(name) print(group)
one two three a 0 1 2 one two three b 3 4 5 one two three c 6 7 8 one two three d 9 10 11 one two three e 12 13 14 one two three f 15 16 17
Select a group
Using the get_group() method, select a group
import pandas as pd import numpy as np df = pd.DataFrame(np.arange(18).reshape(6,3), index = list('abcdef'), columns = ['one','two','three']) groupd = df.groupby('one') print(groupd.get_group(3))
one two three b 3 4 5
polymerization
The aggregation function returns a single aggregation value for each group.
When a group_by object is created, multiple aggregation operations can be performed on the grouped data.
Commonly used is aggregation by AGG method or equivalent agg method.
import pandas as pd import numpy as np df = pd.DataFrame(np.arange(18).reshape(6,3), index = list('abcdef'), columns = ['one','two','three']) print(df) grouped = df.groupby('one') print(grouped['three'].agg(np.mean)) print(grouped['three'].agg(np.size)) # View grouping size
one two three a 0 1 2 b 3 4 5 c 6 7 8 d 9 10 11 e 12 13 14 f 15 16 17 one 0 2 3 5 6 8 9 11 12 14 15 17 Name: three, dtype: int64 one 0 1 3 1 6 1 9 1 12 1 15 1 Name: three, dtype: int64
Use multiple aggregation functions at a time
A list or dictionary of transfer functions for aggregation
import pandas as pd import numpy as np df = pd.DataFrame(np.arange(18).reshape(6,3), index = list('abcdef'), columns = ['one','two','three']) print(df) grouped = df.groupby('one') print(grouped['three'].agg([np.sum, np.mean, np.std]))
sum mean std one 0 2 2 NaN 3 5 5 NaN 6 8 8 NaN 9 11 11 NaN 12 14 14 NaN 15 17 17 NaN
filter
import pandas as pd import numpy as np df = pd.DataFrame(np.arange(18).reshape(6,3), index = list('abcdef'), columns = ['one','two','three']) print(df) grouped = df.groupby('one') print(grouped.filter(lambda x : len(x) >= 2 ))
one two three a 0 1 2 b 3 4 5 c 6 7 8 d 9 10 11 e 12 13 14 f 15 16 17 Empty DataFrame Columns: [one, two, three] Index: []
Pandas Merge/Connect
Pandas provides a separate merge() function as an entry point for all standard database connection operations between DataFrame objects
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True)
parameter | describe |
---|---|
left | Another DataFrame object |
right | Another DataFrame object |
on | Column (name) connections must exist in left and right DataFrame objects. |
letf_on | Columns in the left DataFrame are used as keys, and can be column names or arrays equal to the length of the DataFrame. |
right_on | The column on the right side of the DataFrame serves as the key. It can be a column name or an array whose length equals the length of the DataFrame. |
left_index | If True, use the index (row label) in the left DataFrame as its join key |
fight_index | Ditto |
how | Default inner. Optional left, right, outher, inner |
sort | Sort the result DataFrame by connecting keys in dictionary order. By default, True, when set to False, greatly improves performance in many cases. |
Merge two DataFrame s on one key
import pandas as pd import numpy as np left = pd.DataFrame({ 'id':[1,2,3,4], 'Name':['hubo','vim','vi','kaka'], 'answer_id':['sub1','sub2','sub3','sub4'] }) right = pd.DataFrame({ 'id':[1,2,3,4], 'Name':['li','bo','wn','su'], 'answer_id':['sub2','sub3','sub5','sub6'] }) mid = pd.merge(left,right,on='id') print(mid)
id Name_x answer_id_x Name_y answer_id_y 0 1 hubo sub1 li sub2 1 2 vim sub2 bo sub3 2 3 vi sub3 wn sub5 3 4 kaka sub4 su sub6
Merge two DataFrame s on multiple keys
import pandas as pd import numpy as np left = pd.DataFrame({ 'id':[1,2,3,4], 'Name':['hubo','vim','vi','kaka'], 'answer_id':['sub1','sub2','sub3','sub4'] }) right = pd.DataFrame({ 'id':[1,2,3,4], 'Name':['li','bo','wn','su'], 'answer_id':['sub2','sub5','sub3','sub4'] }) mid = pd.merge(left,right,on=['id','answer_id']) print(mid)
id Name_x answer_id Name_y 0 3 vi sub3 wn 1 4 kaka sub4 su
how parameter
How to merge parameters specifies how to determine which keys will be included in the result table. If the key combination does not appear in the left or right table, the value in the join table will be NA.
- Left: Use the key of the left object
- Right: Use the key of the right object
- outher: union using keys
- inner: Intersection using keys
import pandas as pd import numpy as np left = pd.DataFrame({ 'id':[1,2,3,4], 'Name':['hubo','vim','vi','kaka'], 'answer_id':['sub1','sub2','sub3','sub4'] }) right = pd.DataFrame({ 'id':[1,2,3,4], 'Name':['li','bo','wn','su'], 'answer_id':['sub2','sub5','sub3','sub4'] }) mid = pd.merge(left,right,on=['answer_id'], how='left') print(mid)
id_x Name_x answer_id id_y Name_y 0 1 hubo sub1 NaN NaN 1 2 vim sub2 1.0 li 2 3 vi sub3 3.0 wn 3 4 kaka sub4 4.0 su
3. Pandas IO Tools
The main functions of reading files are read_csv() and read_table().
Intelligently converting tabular data to DataFrame objects using the same parsing
import pandas as pd df = pd.read_excel('catering_sale.xls') print(df.head())
Date sales 0 2015-03-01 51.0 1 2015-02-28 2618.2 2 2015-02-27 2608.4 3 2015-02-26 2651.9 4 2015-02-25 3442.1
Custom Index
Specify a column of the file to customize the index using index_col
import pandas as pd df = pd.read_excel('catering_sale.xls', index_col='date') print(df.head())
Sales volume date 2015-03-01 51.0 2015-02-28 2618.2 2015-02-27 2608.4 2015-02-26 2651.9 2015-02-25 3442.1
Converter
The column of dtype can be passed as a dictionary
Change int type to float type
import pandas as pd import numpy as np # df = pd.read_excel('catering_sale.xls', dtype= {sales': np.float64}) df = pd.read_excel('catering_sale.xls', dtype={'Sale':np.float64}) print(df.dtypes)
Date datetime64[ns] Sales float64 dtype: object
Specify the title name
Replace the title and display it from the second line
import pandas as pd import numpy as np df = pd.read_excel('catering_sale.xls', names=['a','b'], header=1) print(df.head())
a b 0 2015-02-28 2618.2 1 2015-02-27 2608.4 2 2015-02-26 2651.9 3 2015-02-25 3442.1 4 2015-02-24 3393.1