2 join and merge datasets
Data contained in pandas objects can be combined in many ways:
- pandas.merge Links lines based on one or more keys.
- pandas.concat To glue or stack objects axially
- combine_ The first instance method allows overlapping data to be spliced together to fill in missing values from one object with values from another
2.1 database style dataframe connection
df1=pd.DataFrame({'key':['b','b','a','c','a','a','b'],'data1':range(7)}) df2=pd.DataFrame({'key':['a','b','d'],'data2':range(3)}) df1 key data1 0 b 0 1 b 1 2 a 2 3 c 3 4 a 4 5 a 5 6 b 6 df2 key data2 0 a 0 1 b 1 2 d 2 pd.merge(df1,df2) # For the example of many to one link, the key column of df1 has multiple a BS, and each value of df2 has only one row. key data1 data2 0 b 0 1 1 b 1 1 2 b 6 1 # There is no specific column to link in here. The overlapping column names are automatically used as the key of the link 3 a 2 0 4 a 4 0 5 a 5 0 pd.merge(df1,df2,on='key') # This is the key displayed to specify the link df3=pd.DataFrame({'lkey':['b','b','a','c','a','a','b'],'data1':range(7)}) df4=pd.DataFrame({'rkey':['a','b','d'],'data2':range(3)}) pd.merge(df3,df4,left_on='lkey',right_on='rkey') # If the column names of each object are different, you can specify them separately lkey data1 rkey data2 0 b 0 b 1 1 b 1 b 1 2 b 6 b 1 3 a 2 a 0 4 a 4 a 0 5 a 5 a 0
As you can see, the results are missing the values of 'c' and 'd' and related data. By default, merge does inner join, and the key in the result is the intersection of two tables. Other options are left right outer. outer join is the union of keys,
pd.merge(df1,df2,how='outer') key data1 data2 0 b 0.0 1.0 1 b 1.0 1.0 2 b 6.0 1.0 3 a 2.0 0.0 4 a 4.0 0.0 5 a 5.0 0.0 6 c 3.0 NaN 7 d NaN 2.0
how parameters different connection types | behavior |
---|---|
inner | Join only the intersection of keys that both tables have. |
left | Union keys for all left tables |
right | Union keys of all right tables |
outer | Union the union of keys that both tables have. |
Many to many consolidation:
df1=pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)}) df2=pd.DataFrame({'key':['a','b','a','b','d'],'data2':range(5)}) df1 key data1 0 b 0 1 b 1 2 a 2 3 c 3 4 a 4 5 b 5 df2 key data2 0 a 0 1 b 1 2 a 2 3 b 3 4 d 4 pd.merge(df1,df2,on='key',how='left') key data1 data2 0 b 0 1.0 1 b 0 3.0 2 b 1 1.0 3 b 1 3.0 4 a 2 0.0 5 a 2 2.0 6 c 3 NaN 7 a 4 0.0 8 a 4 2.0 9 b 5 1.0 10 b 5 3.0
Many to many connection is the Cartesian product of A row, that is, A={a,b}, B={0,1,2}, then A × B={(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)}
The above connection is based on df1. In df1, b=[0,1,5] a=[2,4] c=3, in df2, b=[1,3] a=[0,2]. If we calculate by Cartesian product, there should be six connected b's, [(0,1), (0,3), (1,1), (1,3), (5,1), (5,3)] compared with the above, which is indeed arranged according to the index of df1. The connection method only affects the different key values displayed in the result
pd.merge(df1,df2,how='inner') key data1 data2 0 b 0 1 1 b 0 3 2 b 1 1 3 b 1 3 # The intersection of two keys. 4 b 5 1 5 b 5 3 6 a 2 0 7 a 2 2 8 a 4 0 9 a 4 2 left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'], 'key2': ['one', 'two', 'one'], 'lval': [1, 2, 3]}) right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'], 'key2': ['one', 'one', 'one', 'two'], 'rval': [4, 5, 6, 7]}) pd.merge(left, right, on=['key1', 'key2'], how='outer') key1 key2 lval rval 0 foo one 1.0 4.0 # Use multiple keys to merge, passing in a list of column names. 1 foo one 1.0 5.0 2 foo two 2.0 NaN 3 bar one 3.0 6.0 4 bar two NaN 7.0 # The final problem in the merge operation is how to deal with overlapping column names, although you can rename the axis labels by rname, # However, the suffixes option of merge is used to specify the string to be added after the overlapping column names of the left and right dataframe objects pd.merge(left,right,on='key1') key1 key2_x lval key2_y rval 0 foo one 1 one 4 1 foo one 1 one 5 2 foo two 2 one 4 3 foo two 2 one 5 4 bar one 3 one 6 5 bar one 3 two 7 pd.merge(left,right,on='key1',suffixes=('_left','_right')) key1 key2_left lval key2_right rval 0 foo one 1 one 4 1 foo one 1 one 5 2 foo two 2 one 4 3 foo two 2 one 5 4 bar one 3 one 6 5 bar one 3 two 7
2.2 merge by index
In some cases, the key used for the merge in the dataframe is its index. In this case, you can pass left_index=True or right_index=True to indicate that the index needs to be used as the key for the merge.
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)}) right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b']) left1 key value 0 a 0 1 b 1 2 a 2 3 a 3 4 b 4 5 c 5 right1 group_val a 3.5 b 7.0 pd.merge(left1, right1, left_on='key', right_index=True) # Is to use the index on the right as the key key value group_val 0 a 0 3.5 2 a 2 3.5 3 a 3 3.5 1 b 1 7.0 4 b 4 7.0 lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'key2': [2000, 2001, 2002, 2001, 2002], 'data': np.arange(5.)}) righth = pd.DataFrame(np.arange(12).reshape((6, 2)), index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'], [2001, 2000, 2000, 2000, 2001, 2002]], columns=['event1', 'event2']) lefth key1 key2 data 0 Ohio 2000 0.0 1 Ohio 2001 1.0 2 Ohio 2002 2.0 3 Nevada 2001 3.0 4 Nevada 2002 4.0 righth event1 event2 Nevada 2001 0 1 2000 2 3 Ohio 2000 4 5 2000 6 7 2001 8 9 2002 10 11 pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True) # In the case of multi-level index, the join on index is an implicit multi key merge key1 key2 data event1 event2 # In this case, you must name the columns you want to merge as a list 0 Ohio 2000 0.0 4 5 0 Ohio 2000 0.0 6 7 1 Ohio 2001 1.0 8 9 2 Ohio 2002 2.0 10 11 # You can read it patiently, 3 Nevada 2001 3.0 0 1 left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'], columns=['Ohio', 'Nevada']) right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]], index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama']) left2 Ohio Nevada a 1.0 2.0 c 3.0 4.0 e 5.0 6.0 right2 Missouri Alabama b 7.0 8.0 c 9.0 10.0 d 11.0 12.0 e 13.0 14.0 pd.merge(left2, right2, how='outer', left_index=True, right_index=True) # Use index on both sides Ohio Nevada Missouri Alabama a 1.0 2.0 NaN NaN b NaN NaN 7.0 8.0 c 3.0 4.0 9.0 10.0 d NaN NaN 11.0 12.0 e 5.0 6.0 13.0 14.0
Dataframe has a convenient join method for merging by index, which can also be used to merge multiple dataframe objects with the same index but without overlapping columns.
left2.join(right2, how='outer') # We can do the same for the one above. Ohio Nevada Missouri Alabama a 1.0 2.0 NaN NaN b NaN NaN 7.0 8.0 c 3.0 4.0 9.0 10.0 d NaN NaN 11.0 12.0 e 5.0 6.0 13.0 14.0 left1.join(right1, on='key') # The join method makes the left connection on the connection key, and completely retains the row index of the left dataframe. key value group_val 0 a 0 3.5 1 b 1 7.0 2 a 2 3.5 3 a 3 3.5 4 b 4 7.0 5 c 5 NaN another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]], index=['a', 'c', 'e', 'f'], columns=['New York', 'Oregon']) another New York Oregon a 7.0 8.0 c 9.0 10.0 e 11.0 12.0 f 16.0 17.0 # For some simple index index merge, you can pass a dataframe list to the join method, which can replace the more general concat function left2.join([right2,another],how='outer') Ohio Nevada Missouri Alabama New York Oregon # Just add some more combined data... a 1.0 2.0 NaN NaN 7.0 8.0 c 3.0 4.0 9.0 10.0 9.0 10.0 e 5.0 6.0 13.0 14.0 11.0 12.0 b NaN NaN 7.0 8.0 NaN NaN d NaN NaN 11.0 12.0 NaN NaN f NaN NaN NaN NaN 16.0 17.0
It's so hard. I feel like the whole person is stupid......