Data normalization: database style data frame connection (merge method), merging according to index (join method)

Keywords: Database

2 join and merge datasets

Data contained in pandas objects can be combined in many ways:

  • pandas.merge Links lines based on one or more keys.
  • pandas.concat To glue or stack objects axially
  • combine_ The first instance method allows overlapping data to be spliced together to fill in missing values from one object with values from another

2.1 database style dataframe connection

df1=pd.DataFrame({'key':['b','b','a','c','a','a','b'],'data1':range(7)})
df2=pd.DataFrame({'key':['a','b','d'],'data2':range(3)})
df1
  key	data1
0	b	0
1	b	1
2	a	2
3	c	3
4	a	4
5	a	5
6	b	6
df2
  key	data2
0	a	0
1	b	1
2	d	2

pd.merge(df1,df2)            # For the example of many to one link, the key column of df1 has multiple a BS, and each value of df2 has only one row.
  key	data1	data2
0	b	0	1
1	b	1	1
2	b	6	1                     # There is no specific column to link in here. The overlapping column names are automatically used as the key of the link
3	a	2	0
4	a	4	0
5	a	5	0

pd.merge(df1,df2,on='key')            # This is the key displayed to specify the link


df3=pd.DataFrame({'lkey':['b','b','a','c','a','a','b'],'data1':range(7)})
df4=pd.DataFrame({'rkey':['a','b','d'],'data2':range(3)})
pd.merge(df3,df4,left_on='lkey',right_on='rkey')        # If the column names of each object are different, you can specify them separately
	lkey	data1	rkey	data2
0	b	0	b	1
1	b	1	b	1
2	b	6	b	1
3	a	2	a	0
4	a	4	a	0
5	a	5	a	0

As you can see, the results are missing the values of 'c' and 'd' and related data. By default, merge does inner join, and the key in the result is the intersection of two tables. Other options are left right outer. outer join is the union of keys,

pd.merge(df1,df2,how='outer')
	key	data1	data2
0	b	0.0	1.0
1	b	1.0	1.0
2	b	6.0	1.0
3	a	2.0	0.0
4	a	4.0	0.0
5	a	5.0	0.0
6	c	3.0	NaN
7	d	NaN	2.0
how parameters different connection types behavior
inner Join only the intersection of keys that both tables have.
left Union keys for all left tables
right Union keys of all right tables
outer Union the union of keys that both tables have.

Many to many consolidation:

df1=pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})
df2=pd.DataFrame({'key':['a','b','a','b','d'],'data2':range(5)})
df1
  key	data1
0	b	0
1	b	1
2	a	2
3	c	3
4	a	4
5	b	5
df2
	key	data2
0	a	0
1	b	1
2	a	2
3	b	3
4	d	4

pd.merge(df1,df2,on='key',how='left')
	key	data1	data2
0	b	0	1.0
1	b	0	3.0
2	b	1	1.0
3	b	1	3.0
4	a	2	0.0
5	a	2	2.0
6	c	3	NaN
7	a	4	0.0
8	a	4	2.0
9	b	5	1.0
10	b	5	3.0

Many to many connection is the Cartesian product of A row, that is, A={a,b}, B={0,1,2}, then A × B={(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)}

The above connection is based on df1. In df1, b=[0,1,5] a=[2,4] c=3, in df2, b=[1,3] a=[0,2]. If we calculate by Cartesian product, there should be six connected b's, [(0,1), (0,3), (1,1), (1,3), (5,1), (5,3)] compared with the above, which is indeed arranged according to the index of df1. The connection method only affects the different key values displayed in the result

pd.merge(df1,df2,how='inner')

  key	data1	data2
0	b	0	1
1	b	0	3
2	b	1	1
3	b	1	3                  # The intersection of two keys.
4	b	5	1
5	b	5	3
6	a	2	0
7	a	2	2
8	a	4	0
9	a	4	2


left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
pd.merge(left, right, on=['key1', 'key2'], how='outer')
	key1	key2	lval	rval
0	foo	one	1.0	4.0                        # Use multiple keys to merge, passing in a list of column names.
1	foo	one	1.0	5.0
2	foo	two	2.0	NaN
3	bar	one	3.0	6.0
4	bar	two	NaN	7.0

# The final problem in the merge operation is how to deal with overlapping column names, although you can rename the axis labels by rname,
# However, the suffixes option of merge is used to specify the string to be added after the overlapping column names of the left and right dataframe objects
pd.merge(left,right,on='key1')
	key1	key2_x	lval	key2_y	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

pd.merge(left,right,on='key1',suffixes=('_left','_right'))
key1	key2_left	lval	key2_right	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7


2.2 merge by index

In some cases, the key used for the merge in the dataframe is its index. In this case, you can pass left_index=True or right_index=True to indicate that the index needs to be used as the key for the merge.

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
left1
	key	value
0	a	0
1	b	1
2	a	2
3	a	3
4	b	4
5	c	5
right1
group_val
a	3.5
b	7.0

pd.merge(left1, right1, left_on='key', right_index=True)        # Is to use the index on the right as the key
	key	value	group_val
0	a	0	3.5
2	a	2	3.5
3	a	3	3.5
1	b	1	7.0
4	b	4	7.0

lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio',
                               'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio',
                              'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])
lefth
    key1	key2	data
0	Ohio	2000	0.0
1	Ohio	2001	1.0
2	Ohio	2002	2.0
3	Nevada	2001	3.0
4	Nevada	2002	4.0
righth
		event1	event2
Nevada	2001	0	1
        2000	2	3
Ohio	2000	4	5
        2000	6	7
        2001	8	9
        2002	10	11
        
pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True)   # In the case of multi-level index, the join on index is an implicit multi key merge
    key1	key2	data event1	event2            # In this case, you must name the columns you want to merge as a list
0	Ohio	2000	0.0	4	5               
0	Ohio	2000	0.0	6	7
1	Ohio	2001	1.0	8	9             
2	Ohio	2002	2.0	10	11                    # You can read it patiently,
3	Nevada	2001	3.0	0	1

left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])
left2
    Ohio	Nevada
a	1.0	2.0
c	3.0	4.0
e	5.0	6.0
right2
 Missouri	Alabama
b	7.0	8.0
c	9.0	10.0
d	11.0	12.0
e	13.0	14.0
                       
pd.merge(left2, right2, how='outer', left_index=True, right_index=True)      # Use index on both sides
   Ohio	Nevada	Missouri	Alabama
a	1.0	2.0	NaN	NaN
b	NaN	NaN	7.0	8.0
c	3.0	4.0	9.0	10.0
d	NaN	NaN	11.0	12.0
e	5.0	6.0	13.0	14.0                

Dataframe has a convenient join method for merging by index, which can also be used to merge multiple dataframe objects with the same index but without overlapping columns.

left2.join(right2, how='outer')      # We can do the same for the one above.
	Ohio	Nevada	Missouri	Alabama
a	1.0	2.0	NaN	NaN
b	NaN	NaN	7.0	8.0
c	3.0	4.0	9.0	10.0
d	NaN	NaN	11.0	12.0
e	5.0	6.0	13.0	14.0

left1.join(right1, on='key')          # The join method makes the left connection on the connection key, and completely retains the row index of the left dataframe.
	key	value	group_val
0	a	0	3.5
1	b	1	7.0
2	a	2	3.5
3	a	3	3.5
4	b	4	7.0
5	c	5	NaN

another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=['a', 'c', 'e', 'f'],
                       columns=['New York', 'Oregon'])
another
  New York	Oregon
a	7.0	8.0
c	9.0	10.0
e	11.0	12.0
f	16.0	17.0     # For some simple index index merge, you can pass a dataframe list to the join method, which can replace the more general concat function

left2.join([right2,another],how='outer')
Ohio	Nevada	Missouri	Alabama	New York	Oregon        # Just add some more combined data...
a	1.0	2.0	NaN	NaN	7.0	8.0
c	3.0	4.0	9.0	10.0	9.0	10.0
e	5.0	6.0	13.0	14.0	11.0	12.0
b	NaN	NaN	7.0	8.0	NaN	NaN
d	NaN	NaN	11.0	12.0	NaN	NaN
f	NaN	NaN	NaN	NaN	16.0	17.0

It's so hard. I feel like the whole person is stupid......

Posted by Mesden on Sat, 13 Jun 2020 21:07:57 -0700