(c) 2016 - present. Enplus Advisors, Inc.
import numpy as np
import pandas as pd
from IPython.display import Image
pd.set_option('display.precision', 2)
Image(filename='assets/inner-join.png', retina=True)
TODO: Show with a markdown table
Image(filename='assets/left-join.png', retina=True)
Image(filename='assets/full-join.png', retina=True)
We're going to use stock data because I used to work as a quant.
df1 = pd.DataFrame({
'ticker': ['AAPL', 'MSFT', 'IBM', 'YHOO', 'GOOG'],
'open': [426.23, 42.30, 101.65, 35.53, 200.41]
})
df1
| | ticker | open |
|---|---|---|
| 0 | AAPL | 426.23 |
| 1 | MSFT | 42.30 |
| 2 | IBM | 101.65 |
| 3 | YHOO | 35.53 |
| 4 | GOOG | 200.41 |
df1 has ticker and open price (the price of the stock when the NYSE opens at 09:30).

df2 has tickers and close prices, with an additional ticker for NFLX.
df2 = pd.DataFrame({
'ticker': ['AAPL', 'GOOG', 'NFLX'],
'close': [427.53, 210.96, 91.86]
}, columns=['ticker', 'close'])
df2
| | ticker | close |
|---|---|---|
| 0 | AAPL | 427.53 |
| 1 | GOOG | 210.96 |
| 2 | NFLX | 91.86 |
An inner join gives us the intersection of the keys.
df1m2 = pd.merge(df1, df2, on='ticker')
df1m2
| | ticker | open | close |
|---|---|---|---|
| 0 | AAPL | 426.23 | 427.53 |
| 1 | GOOG | 200.41 | 210.96 |
We drop everything except tickers that are present in both data frames.
common_tickers = set(df1.ticker) & set(df2.ticker)
common_tickers
{'AAPL', 'GOOG'}
assert set(df1m2.ticker) == common_tickers
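If you want to see which side each row came from rather than computing the key sets by hand, `pd.merge` also accepts an `indicator` parameter; a small sketch using the same data (most useful with an outer join, where unmatched rows survive):

```python
import pandas as pd

df1 = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'IBM', 'YHOO', 'GOOG'],
    'open': [426.23, 42.30, 101.65, 35.53, 200.41]
})
df2 = pd.DataFrame({
    'ticker': ['AAPL', 'GOOG', 'NFLX'],
    'close': [427.53, 210.96, 91.86]
})

# indicator=True adds a `_merge` column marking each row as
# 'left_only', 'right_only', or 'both'
merged = pd.merge(df1, df2, on='ticker', how='outer', indicator=True)

# the rows marked 'both' are exactly the inner-join keys
matched = merged.loc[merged['_merge'] == 'both', 'ticker']
```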
- pd.merge is the most flexible way to join two data frames
- pd.concat is more general: useful to join a collection (e.g. a list) of data frames
- pd.DataFrame.join works in more specific circumstances

Include all keys from the left data frame.
df1m2_left = pd.merge(df1, df2, on='ticker', how='left')
df1m2_left
| | ticker | open | close |
|---|---|---|---|
| 0 | AAPL | 426.23 | 427.53 |
| 1 | MSFT | 42.30 | NaN |
| 2 | IBM | 101.65 | NaN |
| 3 | YHOO | 35.53 | NaN |
| 4 | GOOG | 200.41 | 210.96 |
assert set(df1.ticker) == set(df1m2_left.ticker)
Notice that pandas fills missing values from df2 with NaN. This is comparable to SQL, where the values would be NULL.
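If NaN is not acceptable downstream, one option (a sketch, not the only choice) is to fill the missing values after the left merge; here we hypothetically fall back to the open price when there is no close:

```python
import pandas as pd

df1 = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'IBM', 'YHOO', 'GOOG'],
    'open': [426.23, 42.30, 101.65, 35.53, 200.41]
})
df2 = pd.DataFrame({
    'ticker': ['AAPL', 'GOOG', 'NFLX'],
    'close': [427.53, 210.96, 91.86]
})

left = pd.merge(df1, df2, on='ticker', how='left')

# rows with no match in df2 have NaN in `close`
n_missing = left['close'].isna().sum()

# fill unmatched closes with the open price (an arbitrary
# fallback chosen just for illustration)
left['close'] = left['close'].fillna(left['open'])
```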
Include all keys from the right data frame.
pd.merge(df1, df2, on='ticker', how='right')
| | ticker | open | close |
|---|---|---|---|
| 0 | AAPL | 426.23 | 427.53 |
| 1 | GOOG | 200.41 | 210.96 |
| 2 | NFLX | NaN | 91.86 |
Same missingness handling as the left join.
df1m2_full = pd.merge(df1, df2, on='ticker', how='outer')
df1m2_full
| | ticker | open | close |
|---|---|---|---|
| 0 | AAPL | 426.23 | 427.53 |
| 1 | MSFT | 42.30 | NaN |
| 2 | IBM | 101.65 | NaN |
| 3 | YHOO | 35.53 | NaN |
| 4 | GOOG | 200.41 | 210.96 |
| 5 | NFLX | NaN | 91.86 |
assert set(df1.ticker) | set(df2.ticker) == set(df1m2_full.ticker)
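Rather than writing asserts over key sets, `pd.merge` can check key cardinality for you via the `validate` parameter (available in modern pandas). A sketch with hypothetical frames where the right side has a duplicate key:

```python
import pandas as pd

left_df = pd.DataFrame({'ticker': ['AAPL', 'MSFT', 'GOOG'],
                        'open': [426.23, 42.30, 200.41]})
# AAPL appears twice on the right, so the keys are not one-to-one
right_df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'GOOG'],
                         'close': [427.53, 428.00, 210.96]})

# validate='one_to_one' raises MergeError on the duplicated key
try:
    pd.merge(left_df, right_df, on='ticker', validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

# validate='one_to_many' only requires the left keys to be unique,
# so this merge succeeds (AAPL matches two rows, GOOG one)
m = pd.merge(left_df, right_df, on='ticker', validate='one_to_many')
```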
pd.concat binds a collection of Series or DataFrames.

We're going to make this data explicitly daily, so we'll add a date column. Take the first two records only so the data fits on the slide.
df3 = df1.assign(date=pd.Timestamp("2018-01-04")).iloc[:2]  # first 2 rows only
df3
| | ticker | open | date |
|---|---|---|---|
| 0 | AAPL | 426.23 | 2018-01-04 |
| 1 | MSFT | 42.30 | 2018-01-04 |
df4 = df3.assign(
date=pd.Timestamp("2018-01-05"),
open=lambda x: x.open + 10
)
df4
| | ticker | open | date |
|---|---|---|---|
| 0 | AAPL | 436.23 | 2018-01-05 |
| 1 | MSFT | 52.30 | 2018-01-05 |
pd.concat([df3, df4])
| | ticker | open | date |
|---|---|---|---|
| 0 | AAPL | 426.23 | 2018-01-04 |
| 1 | MSFT | 42.30 | 2018-01-04 |
| 0 | AAPL | 436.23 | 2018-01-05 |
| 1 | MSFT | 52.30 | 2018-01-05 |
Notice how the index values are repeated with the default pd.RangeIndex.

To check for duplicated index values:
try:
    pd.concat([df3, df4], verify_integrity=True)
except ValueError as e:
    print(e)
Indexes have overlapping values: Int64Index([0, 1], dtype='int64')
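verify_integrity checks at concat time; you can also interrogate the result's index directly with the standard Index attributes is_unique and duplicated(). A small sketch reproducing the setup above:

```python
import pandas as pd

df3 = pd.DataFrame({'ticker': ['AAPL', 'MSFT'], 'open': [426.23, 42.30]})
df4 = pd.DataFrame({'ticker': ['AAPL', 'MSFT'], 'open': [436.23, 52.30]})

out = pd.concat([df3, df4])

# the default RangeIndex values 0 and 1 now each appear twice
dup_free = out.index.is_unique      # False here
dup_mask = out.index.duplicated()   # marks second occurrences
```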
ignore_index discards the indexes from the bound data frames.
pd.concat([df3, df4], ignore_index=True)
| | ticker | open | date |
|---|---|---|---|
| 0 | AAPL | 426.23 | 2018-01-04 |
| 1 | MSFT | 42.30 | 2018-01-04 |
| 2 | AAPL | 436.23 | 2018-01-05 |
| 3 | MSFT | 52.30 | 2018-01-05 |
We usually don't need to validate the index when we pass ignore_index
because we're creating a new index!
concat does an outer join on both rows and columns.

df3a = df3.assign(close=lambda x: x.open + 9)
df3a
| | ticker | open | date | close |
|---|---|---|---|---|
| 0 | AAPL | 426.23 | 2018-01-04 | 435.23 |
| 1 | MSFT | 42.30 | 2018-01-04 | 51.30 |
pd.concat([df3a, df4], ignore_index=True)
| | ticker | open | date | close |
|---|---|---|---|---|
| 0 | AAPL | 426.23 | 2018-01-04 | 435.23 |
| 1 | MSFT | 42.30 | 2018-01-04 | 51.30 |
| 2 | AAPL | 436.23 | 2018-01-05 | NaN |
| 3 | MSFT | 52.30 | 2018-01-05 | NaN |
df5 = pd.DataFrame({'a': [1, 2]})
df6 = pd.DataFrame({'b': [3, 4]})
pd.concat([df5, df6], axis=1)
| | a | b |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 2 | 4 |
concat binds rows and columns.

df6a = df6.set_index(pd.Index([6, 7]))
pd.concat([df5, df6a])
| | a | b |
|---|---|---|
| 0 | 1.0 | NaN |
| 1 | 2.0 | NaN |
| 6 | NaN | 3.0 |
| 7 | NaN | 4.0 |
The join parameter only applies to the non-concatenation axis. Use join='inner' to get only the common columns.

pd.concat([df3a, df4], ignore_index=True, join='inner')
| | ticker | open | date |
|---|---|---|---|
| 0 | AAPL | 426.23 | 2018-01-04 |
| 1 | MSFT | 42.30 | 2018-01-04 |
| 2 | AAPL | 436.23 | 2018-01-05 |
| 3 | MSFT | 52.30 | 2018-01-05 |
Notice there is no close column because it is not present in both data frames.
keys

pd.concat([df3, df4], keys=['df3', 'df4'])
| | | ticker | open | date |
|---|---|---|---|---|
| df3 | 0 | AAPL | 426.23 | 2018-01-04 |
| | 1 | MSFT | 42.30 | 2018-01-04 |
| df4 | 0 | AAPL | 436.23 | 2018-01-05 |
| | 1 | MSFT | 52.30 | 2018-01-05 |
keys and names

pd.concat([df3, df4], keys=['df3', 'df4'], names=['source', 'row_num'])
| source | row_num | ticker | open | date |
|---|---|---|---|---|
| df3 | 0 | AAPL | 426.23 | 2018-01-04 |
| | 1 | MSFT | 42.30 | 2018-01-04 |
| df4 | 0 | AAPL | 436.23 | 2018-01-05 |
| | 1 | MSFT | 52.30 | 2018-01-05 |
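Once keys and names build a MultiIndex, a single source's rows can be pulled back out with .loc, or the source level can be turned back into an ordinary column with reset_index. A sketch repeating the construction above:

```python
import pandas as pd

df3 = pd.DataFrame({'ticker': ['AAPL', 'MSFT'], 'open': [426.23, 42.30]})
df4 = pd.DataFrame({'ticker': ['AAPL', 'MSFT'], 'open': [436.23, 52.30]})

both = pd.concat([df3, df4], keys=['df3', 'df4'],
                 names=['source', 'row_num'])

# select all rows that came from df4
from_df4 = both.loc['df4']

# or turn the `source` level back into an ordinary column
flat = both.reset_index(level='source')
```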