This exploratory notebook is a good starting point to help you take your first steps in the challenge.
As a reminder, the aim of the 2021 QRT Challenge Data is to determine the link between two types of assets: liquid and illiquid. We provide the returns of 100 illiquid assets, and the aim is to predict, for the same day, the sign of the returns of 100 liquid assets.
In the following, we propose a very simple approach that determines, for each liquid asset, the illiquid asset with maximum correlation. We then measure the $\beta$ (see definition here) between these assets, which will be used for prediction.
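For reference, for two assets with returns $r_i$ and $r_j$, the standard definition of the $\beta$ of asset $i$ with respect to asset $j$ is

$$\beta_{ij} = \frac{\operatorname{Cov}(r_i, r_j)}{\operatorname{Var}(r_j)},$$

i.e. the slope of the linear regression of $r_i$ on $r_j$.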
This notebook is very straightforward, but if you have any questions or comments, please ask them in the forum.
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
X_train = pd.read_parquet('./X_train.parquet')
Y_train = pd.read_csv('./y_train.csv', index_col=0)
X_test = pd.read_parquet('./X_test.parquet')
X_train.head()
| ID | ID_DAY | RET_216 | RET_238 | RET_45 | RET_295 | RET_230 | RET_120 | RET_188 | RET_260 | RET_15 | ... | RET_122 | RET_194 | RET_72 | RET_293 | RET_281 | RET_193 | RET_95 | RET_162 | RET_297 | ID_TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3316 | 0.004024 | 0.009237 | 0.004967 | NaN | 0.01704 | 0.013885 | 0.041885 | 0.015207 | -0.003143 | ... | 0.007596 | 0.01501 | 0.014733 | -0.000476 | 0.006539 | -0.010233 | 0.001251 | -0.003102 | -0.094847 | 139 |
| 1 | 3316 | 0.004024 | 0.009237 | 0.004967 | NaN | 0.01704 | 0.013885 | 0.041885 | 0.015207 | -0.003143 | ... | 0.007596 | 0.01501 | 0.014733 | -0.000476 | 0.006539 | -0.010233 | 0.001251 | -0.003102 | -0.094847 | 129 |
| 2 | 3316 | 0.004024 | 0.009237 | 0.004967 | NaN | 0.01704 | 0.013885 | 0.041885 | 0.015207 | -0.003143 | ... | 0.007596 | 0.01501 | 0.014733 | -0.000476 | 0.006539 | -0.010233 | 0.001251 | -0.003102 | -0.094847 | 136 |
| 3 | 3316 | 0.004024 | 0.009237 | 0.004967 | NaN | 0.01704 | 0.013885 | 0.041885 | 0.015207 | -0.003143 | ... | 0.007596 | 0.01501 | 0.014733 | -0.000476 | 0.006539 | -0.010233 | 0.001251 | -0.003102 | -0.094847 | 161 |
| 4 | 3316 | 0.004024 | 0.009237 | 0.004967 | NaN | 0.01704 | 0.013885 | 0.041885 | 0.015207 | -0.003143 | ... | 0.007596 | 0.01501 | 0.014733 | -0.000476 | 0.006539 | -0.010233 | 0.001251 | -0.003102 | -0.094847 | 217 |

5 rows × 102 columns
We transform the data so that each row corresponds to a specific day.
idx_ret_features = np.where(X_train.columns.str.contains('RET'))[0]
init_ret_features = X_train.columns[idx_ret_features]
target_ret_features = 'RET_' + X_train['ID_TARGET'].unique()
returns = {}
for day in tqdm(X_train.ID_DAY.unique()):
    u = X_train.loc[X_train.ID_DAY == day]
    a = u.iloc[0, idx_ret_features]  # illiquid returns, identical on every row of the day
    b = Y_train[X_train.ID_DAY == day]['RET_TARGET']
    b.index = 'RET_' + u.ID_TARGET  # liquid targets become extra columns
    returns[day] = pd.concat([a, b])
returns = pd.DataFrame(returns).T.astype(float)
returns.head()
| ID_DAY | RET_0 | RET_1 | RET_102 | RET_105 | RET_106 | RET_108 | RET_109 | RET_110 | RET_114 | RET_115 | ... | RET_88 | RET_9 | RET_90 | RET_91 | RET_93 | RET_95 | RET_96 | RET_97 | RET_98 | RET_99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3316 | -0.016501 | 0.018693 | 0.005316 | NaN | 0.001352 | 0.027374 | -0.010890 | 0.057911 | 0.014155 | 0.021104 | ... | 0.027113 | 0.008602 | 0.018250 | 0.038581 | 0.027588 | 0.001251 | 0.037917 | -0.002689 | 0.002140 | 0.023026 |
| 3355 | 0.000674 | 0.005759 | 0.007938 | NaN | -0.005017 | -0.007413 | -0.005980 | 0.011900 | -0.011755 | 0.032401 | ... | 0.017933 | 0.001658 | 0.061274 | -0.018643 | 0.000372 | -0.010232 | -0.013496 | 0.013819 | -0.036501 | 0.017736 |
| 1662 | -0.000919 | 0.010430 | 0.007901 | NaN | 0.006959 | 0.005593 | -0.066666 | 0.049822 | 0.021599 | 0.019816 | ... | 0.016645 | -0.016289 | 0.027690 | -0.026479 | 0.006664 | 0.023721 | 0.013057 | -0.008237 | 0.014655 | 0.011614 |
| 3405 | -0.004989 | 0.004558 | 0.004325 | NaN | -0.007720 | 0.002788 | -0.038432 | -0.021308 | 0.000595 | 0.001613 | ... | -0.039065 | 0.021897 | -0.031456 | 0.029652 | 0.016171 | -0.024835 | -0.020391 | 0.006790 | 0.015796 | -0.015763 |
| 1602 | 0.002468 | -0.024584 | 0.008947 | NaN | -0.004735 | -0.031061 | 0.012366 | 0.045117 | -0.015403 | -0.015327 | ... | -0.028080 | 0.016530 | 0.019826 | 0.014627 | 0.010884 | -0.025205 | 0.006157 | -0.006526 | 0.002125 | 0.036186 |

5 rows × 200 columns
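The loop above can also be expressed without iterating over days, using a pivot. A minimal vectorized sketch on toy data (column names follow the challenge layout, but the values and target ids here are made up):

```python
import numpy as np
import pandas as pd

# toy stand-ins for X_train / Y_train: two days, one illiquid asset, two targets
X = pd.DataFrame({
    'ID_DAY':    [1, 1, 2, 2],
    'RET_216':   [0.01, 0.01, 0.02, 0.02],   # duplicated within a day
    'ID_TARGET': ['139', '129', '139', '129'],
})
Y = pd.DataFrame({'RET_TARGET': [0.1, -0.1, 0.2, -0.2]}, index=X.index)

# illiquid side: rows within a day are duplicates, so keep the first one
illiquid = X.drop(columns='ID_TARGET').groupby('ID_DAY').first()

# liquid side: pivot the stacked targets into one column per target id
liquid = (X[['ID_DAY', 'ID_TARGET']]
          .assign(RET_TARGET=Y['RET_TARGET'])
          .pivot(index='ID_DAY', columns='ID_TARGET', values='RET_TARGET'))
liquid.columns = 'RET_' + liquid.columns

# one row per day, illiquid and liquid returns side by side
returns = pd.concat([illiquid, liquid], axis=1)
```

This produces the same day-indexed layout as the loop, one row per `ID_DAY` with the illiquid columns followed by the `RET_`-prefixed target columns.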
We compute the $\beta$ between all pairs of assets. This matrix captures the linear link between the assets.
This step is not strictly necessary and could be merged into the next step, but it is a good way to introduce covariance-matrix shrinkage, which is widely used in finance when dealing with noisy data. See here for more information.
from sklearn.covariance import oas
features = returns.columns
cov = pd.DataFrame(oas(returns.fillna(0))[0], index=features, columns=features)
beta = cov / np.diag(cov)  # beta[i, j] = Cov(i, j) / Var(j), via broadcasting over columns
beta.head()
|   | RET_0 | RET_1 | RET_102 | RET_105 | RET_106 | RET_108 | RET_109 | RET_110 | RET_114 | RET_115 | ... | RET_88 | RET_9 | RET_90 | RET_91 | RET_93 | RET_95 | RET_96 | RET_97 | RET_98 | RET_99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RET_0 | 1.000000 | 0.126513 | 0.108238 | 0.128353 | 0.117609 | 0.070110 | 0.194425 | 0.080396 | 0.174791 | 0.118757 | ... | 0.146664 | -0.012388 | 0.155055 | 0.081016 | 0.181609 | 0.172702 | 0.063350 | 0.138673 | 0.104377 | 0.165404 |
| RET_1 | 0.184228 | 1.000000 | 0.122461 | 0.214639 | 0.131930 | 0.117362 | 0.196036 | 0.174030 | 0.256237 | 0.223324 | ... | 0.242727 | -0.067669 | 0.228235 | 0.132136 | 0.177939 | 0.253013 | 0.105388 | 0.160021 | 0.109545 | 0.206384 |
| RET_102 | 0.086988 | 0.067585 | 1.000000 | 0.149505 | 0.381845 | 0.146912 | 0.064788 | 0.142368 | 0.090649 | 0.104195 | ... | 0.101701 | 0.008833 | 0.111743 | 0.043866 | 0.090732 | 0.162930 | 0.033348 | 0.124352 | 0.450246 | 0.230993 |
| RET_105 | 0.141652 | 0.162670 | 0.205304 | 1.000000 | 0.195409 | 0.126792 | 0.134372 | 0.238152 | 0.197308 | 0.179270 | ... | 0.165933 | -0.003716 | 0.200567 | 0.221046 | 0.223728 | 0.219295 | 0.154519 | 0.205005 | 0.245215 | 0.320027 |
| RET_106 | 0.093577 | 0.072087 | 0.378043 | 0.140882 | 1.000000 | 0.099451 | 0.076619 | 0.123011 | 0.100518 | 0.105586 | ... | 0.096279 | -0.018555 | 0.104224 | 0.020172 | 0.121545 | 0.133660 | 0.011020 | 0.144167 | 0.393937 | 0.196698 |

5 rows × 200 columns
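As a sanity check on the broadcasting in `cov / np.diag(cov)`: each column $j$ is divided by $\operatorname{Var}(j)$, so `beta[i, j]` is the $\beta$ of asset $i$ with respect to asset $j$. A tiny example with made-up numbers:

```python
import numpy as np

# 2x2 covariance matrix: Var(0) = 4, Var(1) = 1, Cov(0, 1) = 2
cov = np.array([[4.0, 2.0],
                [2.0, 1.0]])

# np.diag(cov) is the 1-D vector of variances [4, 1];
# broadcasting divides column j of cov by Var(j)
beta = cov / np.diag(cov)
# beta == [[1.0, 2.0],
#          [0.5, 1.0]]
```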
For each target (liquid) asset, we determine the illiquid asset with maximum absolute correlation and save its id together with the associated $\beta$ coefficient.
proj_matrix = beta.T.loc[init_ret_features, target_ret_features]
corr = returns.corr().loc[init_ret_features, target_ret_features]
coeffs = {}
for id_target in tqdm(target_ret_features):
    x = proj_matrix[id_target]
    c = corr[id_target]
    id_init_max = c.abs().idxmax()  # illiquid asset with maximum absolute correlation
    j = id_target.replace('RET_', '')
    coeffs[j] = (id_init_max, x[id_init_max])  # (paired illiquid id, beta)
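Note that `c.abs().idxmax()` selects on absolute correlation, so a strong negative correlation is preferred over a weaker positive one; the sign is then carried by the $\beta$. A quick illustration with hypothetical correlations:

```python
import pandas as pd

# hypothetical correlations of one target with two illiquid assets
c = pd.Series({'RET_5': -0.8, 'RET_9': 0.3})

print(c.abs().idxmax())  # RET_5 (largest correlation in absolute value)
```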
We then simply make predictions on the test data set using the saved pairs and their $\beta$ coefficients.
If a value is missing, we replace it with the mean of the row's illiquid returns.
pred = {}
for idx, row in tqdm(X_test.iterrows()):
    j = row['ID_TARGET']
    i, val = coeffs[j]  # paired illiquid asset and its beta
    x = row[i]
    if np.isnan(x):
        x = row[init_ret_features].mean()  # fall back to the row's mean illiquid return
    p = x * val
    pred[idx] = p
pred = pd.Series(pred, name="RET_TARGET")
# The NaNs are filled by the mean of the prediction of that day
pred_mean_day = pred.groupby(X_test['ID_DAY']).transform('mean')
pred = pred.fillna(pred_mean_day)
pred = np.sign(pred)
pred.name = "RET_TARGET"
pred = pred.astype(int)
pred.to_csv('./benchmark.csv')
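For larger test sets, the per-row `iterrows` loop can be replaced by vectorized lookups. A minimal sketch on toy data (the names, values, and target ids below are made up; the day-mean NaN fill mirrors the notebook, and the per-row illiquid-mean fallback is omitted for brevity):

```python
import numpy as np
import pandas as pd

# toy stand-ins for X_test and for the (illiquid id, beta) pairs in `coeffs`
X_test = pd.DataFrame({
    'ID_DAY':    [1, 1, 2],
    'RET_216':   [0.01, np.nan, -0.02],
    'ID_TARGET': ['139', '129', '139'],
})
coeffs = {'139': ('RET_216', 0.5), '129': ('RET_216', -1.0)}

# per-row lookup of the paired illiquid column name and its beta
pair_id  = X_test['ID_TARGET'].map(lambda t: coeffs[t][0])
beta_val = X_test['ID_TARGET'].map(lambda t: coeffs[t][1])
x = pd.Series([X_test.at[i, col] for i, col in pair_id.items()],
              index=X_test.index, dtype=float)

pred = x * beta_val
# fill remaining NaNs with the day's mean prediction, then take the sign
pred = pred.fillna(pred.groupby(X_test['ID_DAY']).transform('mean'))
pred = np.sign(pred).astype(int)
```

On this toy input the first two rows (day 1) predict +1 (the NaN row inherits the day mean) and the last row predicts -1.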