(c) 2016 - present. Enplus Advisors, Inc.
import numpy as np
import pandas as pd
from IPython.display import Image
pd.set_option('display.precision', 2)
Image(filename='assets/inner-join.png', retina=True)
TODO: Show with a markdown table
Image(filename='assets/left-join.png', retina=True)
Image(filename='assets/full-join.png', retina=True)
We're going to use stock data because I used to work as a quant.
df1 = pd.DataFrame({
'ticker': ['AAPL', 'MSFT', 'IBM', 'YHOO', 'GOOG'],
'open': [426.23, 42.30, 101.65, 35.53, 200.41]
})
df1
| | ticker | open |
|---|---|---|
| 0 | AAPL | 426.23 |
| 1 | MSFT | 42.30 |
| 2 | IBM | 101.65 |
| 3 | YHOO | 35.53 |
| 4 | GOOG | 200.41 |
df1 has ticker and open price (the price of the stock when the NYSE opens at 09:30).

df2 has tickers and close prices, with an additional ticker for NFLX.
df2 = pd.DataFrame({
'ticker': ['AAPL', 'GOOG', 'NFLX'],
'close': [427.53, 210.96, 91.86]
}, columns=['ticker', 'close'])
df2
| | ticker | close |
|---|---|---|
| 0 | AAPL | 427.53 |
| 1 | GOOG | 210.96 |
| 2 | NFLX | 91.86 |
An inner join gives us the intersection of the keys.
df1m2 = pd.merge(df1, df2, on='ticker')
df1m2
| | ticker | open | close |
|---|---|---|---|
| 0 | AAPL | 426.23 | 427.53 |
| 1 | GOOG | 200.41 | 210.96 |
We drop everything except tickers that are present in both data frames.
common_tickers = set(df1.ticker) & set(df2.ticker)
common_tickers
{'AAPL', 'GOOG'}
assert set(df1m2.ticker) == common_tickers
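If you want to see which side each row came from rather than computing the key sets by hand, `pd.merge` also accepts an `indicator` parameter; a small sketch using the same data (most useful with an outer join, where unmatched rows survive):

```python
import pandas as pd

df1 = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'IBM', 'YHOO', 'GOOG'],
    'open': [426.23, 42.30, 101.65, 35.53, 200.41]
})
df2 = pd.DataFrame({
    'ticker': ['AAPL', 'GOOG', 'NFLX'],
    'close': [427.53, 210.96, 91.86]
})

# indicator=True adds a `_merge` column marking each row as
# 'left_only', 'right_only', or 'both'
merged = pd.merge(df1, df2, on='ticker', how='outer', indicator=True)

# the rows marked 'both' are exactly the inner-join keys
matched = merged.loc[merged['_merge'] == 'both', 'ticker']
```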
- pd.merge is the most flexible way to join two data frames
- pd.concat is more general: useful to join a collection (e.g. a list) of data frames
- pd.DataFrame.join works in more specific circumstances

Include all keys from the left data frame.
df1m2_left = pd.merge(df1, df2, on='ticker', how='left')
df1m2_left
| | ticker | open | close |
|---|---|---|---|
| 0 | AAPL | 426.23 | 427.53 |
| 1 | MSFT | 42.30 | NaN |
| 2 | IBM | 101.65 | NaN |
| 3 | YHOO | 35.53 | NaN |
| 4 | GOOG | 200.41 | 210.96 |
assert set(df1.ticker) == set(df1m2_left.ticker)
Notice that pandas fills missing values from df2 with NaN. This is comparable to SQL, where the values would be NULL.
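If NaN is not acceptable downstream, one option (a sketch, not the only choice) is to fill the missing values after the left merge; here we hypothetically fall back to the open price when there is no close:

```python
import pandas as pd

df1 = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'IBM', 'YHOO', 'GOOG'],
    'open': [426.23, 42.30, 101.65, 35.53, 200.41]
})
df2 = pd.DataFrame({
    'ticker': ['AAPL', 'GOOG', 'NFLX'],
    'close': [427.53, 210.96, 91.86]
})

left = pd.merge(df1, df2, on='ticker', how='left')

# rows with no match in df2 have NaN in `close`
n_missing = left['close'].isna().sum()

# fill unmatched closes with the open price (an arbitrary
# fallback chosen just for illustration)
left['close'] = left['close'].fillna(left['open'])
```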
Include all keys from the right data frame.
pd.merge(df1, df2, on='ticker', how='right')
| | ticker | open | close |
|---|---|---|---|
| 0 | AAPL | 426.23 | 427.53 |
| 1 | GOOG | 200.41 | 210.96 |
| 2 | NFLX | NaN | 91.86 |
Same missingness handling as the left join.
df1m2_full = pd.merge(df1, df2, on='ticker', how='outer')
df1m2_full
| | ticker | open | close |
|---|---|---|---|
| 0 | AAPL | 426.23 | 427.53 |
| 1 | MSFT | 42.30 | NaN |
| 2 | IBM | 101.65 | NaN |
| 3 | YHOO | 35.53 | NaN |
| 4 | GOOG | 200.41 | 210.96 |
| 5 | NFLX | NaN | 91.86 |
assert set(df1.ticker) | set(df2.ticker) == set(df1m2_full.ticker)
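Rather than writing asserts over key sets, `pd.merge` can check key cardinality for you via the `validate` parameter (available in modern pandas). A sketch with hypothetical frames where the right side has a duplicate key:

```python
import pandas as pd

left_df = pd.DataFrame({'ticker': ['AAPL', 'MSFT', 'GOOG'],
                        'open': [426.23, 42.30, 200.41]})
# AAPL appears twice on the right, so the keys are not one-to-one
right_df = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'GOOG'],
                         'close': [427.53, 428.00, 210.96]})

# validate='one_to_one' raises MergeError on the duplicated key
try:
    pd.merge(left_df, right_df, on='ticker', validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

# validate='one_to_many' only requires the left keys to be unique,
# so this merge succeeds (AAPL matches two rows, GOOG one)
m = pd.merge(left_df, right_df, on='ticker', validate='one_to_many')
```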
pd.concat binds a collection of Series or DataFrames.

We're going to make this data explicitly daily, so we'll add a date column. Take the first two records only so the data fits on the slide.
df3 = df1.assign(date=pd.Timestamp("2018-01-04")).iloc[:2]  # first 2 rows only
df3
| | ticker | open | date |
|---|---|---|---|
| 0 | AAPL | 426.23 | 2018-01-04 |
| 1 | MSFT | 42.30 | 2018-01-04 |
df4 = df3.assign(
date=pd.Timestamp("2018-01-05"),
open=lambda x: x.open + 10
)
df4
| | ticker | open | date |
|---|---|---|---|
| 0 | AAPL | 436.23 | 2018-01-05 |
| 1 | MSFT | 52.30 | 2018-01-05 |
pd.concat([df3, df4])
| | ticker | open | date |
|---|---|---|---|
| 0 | AAPL | 426.23 | 2018-01-04 |
| 1 | MSFT | 42.30 | 2018-01-04 |
| 0 | AAPL | 436.23 | 2018-01-05 |
| 1 | MSFT | 52.30 | 2018-01-05 |
Notice how the index values are repeated with the default pd.RangeIndex.

To check for duplicated index values:
try:
    pd.concat([df3, df4], verify_integrity=True)
except ValueError as e:
    print(e)
Indexes have overlapping values: Int64Index([0, 1], dtype='int64')
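verify_integrity checks at concat time; you can also interrogate the result's index directly with the standard Index attributes is_unique and duplicated(). A small sketch reproducing the setup above:

```python
import pandas as pd

df3 = pd.DataFrame({'ticker': ['AAPL', 'MSFT'], 'open': [426.23, 42.30]})
df4 = pd.DataFrame({'ticker': ['AAPL', 'MSFT'], 'open': [436.23, 52.30]})

out = pd.concat([df3, df4])

# the default RangeIndex values 0 and 1 now each appear twice
dup_free = out.index.is_unique      # False here
dup_mask = out.index.duplicated()   # marks second occurrences
```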
ignore_index discards the indexes from the bound data frames.
pd.concat([df3, df4], ignore_index=True)
| | ticker | open | date |
|---|---|---|---|
| 0 | AAPL | 426.23 | 2018-01-04 |
| 1 | MSFT | 42.30 | 2018-01-04 |
| 2 | AAPL | 436.23 | 2018-01-05 |
| 3 | MSFT | 52.30 | 2018-01-05 |
We usually don't need to validate the index when we pass ignore_index
because we're creating a new index!
concat does an outer join on both rows and columns.

df3a = df3.assign(close=lambda x: x.open + 9)
df3a
| | ticker | open | date | close |
|---|---|---|---|---|
| 0 | AAPL | 426.23 | 2018-01-04 | 435.23 |
| 1 | MSFT | 42.30 | 2018-01-04 | 51.30 |
pd.concat([df3a, df4], ignore_index=True)
| | ticker | open | date | close |
|---|---|---|---|---|
| 0 | AAPL | 426.23 | 2018-01-04 | 435.23 |
| 1 | MSFT | 42.30 | 2018-01-04 | 51.30 |
| 2 | AAPL | 436.23 | 2018-01-05 | NaN |
| 3 | MSFT | 52.30 | 2018-01-05 | NaN |
df5 = pd.DataFrame({'a': [1, 2]})
df6 = pd.DataFrame({'b': [3, 4]})
pd.concat([df5, df6], axis=1)
| | a | b |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 2 | 4 |
concat binds rows and columns.

df6a = df6.set_index(pd.Index([6, 7]))
pd.concat([df5, df6a])
| | a | b |
|---|---|---|
| 0 | 1.0 | NaN |
| 1 | 2.0 | NaN |
| 6 | NaN | 3.0 |
| 7 | NaN | 4.0 |
The join parameter only applies to the non-concatenation axis. Use join='inner' to get only the common columns.

pd.concat([df3a, df4], ignore_index=True, join='inner')
| | ticker | open | date |
|---|---|---|---|
| 0 | AAPL | 426.23 | 2018-01-04 |
| 1 | MSFT | 42.30 | 2018-01-04 |
| 2 | AAPL | 436.23 | 2018-01-05 |
| 3 | MSFT | 52.30 | 2018-01-05 |
Notice there is no close column because it is not present in both data frames.
keys

pd.concat([df3, df4], keys=['df3', 'df4'])
| | | ticker | open | date |
|---|---|---|---|---|
| df3 | 0 | AAPL | 426.23 | 2018-01-04 |
| | 1 | MSFT | 42.30 | 2018-01-04 |
| df4 | 0 | AAPL | 436.23 | 2018-01-05 |
| | 1 | MSFT | 52.30 | 2018-01-05 |
keys and names

pd.concat([df3, df4], keys=['df3', 'df4'], names=['source', 'row_num'])
| source | row_num | ticker | open | date |
|---|---|---|---|---|
| df3 | 0 | AAPL | 426.23 | 2018-01-04 |
| | 1 | MSFT | 42.30 | 2018-01-04 |
| df4 | 0 | AAPL | 436.23 | 2018-01-05 |
| | 1 | MSFT | 52.30 | 2018-01-05 |
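Once keys and names build a MultiIndex, a single source's rows can be pulled back out with .loc, or the source level can be turned back into an ordinary column with reset_index. A sketch repeating the construction above:

```python
import pandas as pd

df3 = pd.DataFrame({'ticker': ['AAPL', 'MSFT'], 'open': [426.23, 42.30]})
df4 = pd.DataFrame({'ticker': ['AAPL', 'MSFT'], 'open': [436.23, 52.30]})

both = pd.concat([df3, df4], keys=['df3', 'df4'],
                 names=['source', 'row_num'])

# select all rows that came from df4
from_df4 = both.loc['df4']

# or turn the `source` level back into an ordinary column
flat = both.reset_index(level='source')
```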