Replicate "Dynamic Return Dependencies Across Industries: A Machine Learning Approach" by David Rapach, Jack Strauss, Jun Tu, and Guofu Zhou.

- Use the 30 industry portfolio returns from Ken French's data library.
- Forecast (for example) this month's Chemicals industry return using last month's returns from all 30 industries.
- Use LASSO for predictor subset selection over the entire 1960-2016 period, to determine that e.g. Beer is predicted by Food, Clothing, and Coal.
- Use the LASSO-selected predictors and simple linear regression to predict returns.
- Generate portfolios and run backtests.

Findings:

- Predictor selection: we find the same predictors as the paper for all but 2 industries, possibly because the paper uses AICc instead of AIC (I don't see an sklearn implementation that uses AICc; see the sketch below).
- Prediction by industry: R-squareds line up closely with the paper's.
- Portfolio performance: results are in a similar ballpark. Again, AICc vs. AIC may matter, and the paper standardizes predictors, which is pretty standard practice. Finally, for some reason the paper's mean returns don't line up with annualized geometric means; they seem to be calculating something different.
- Replicating exactly is hard, but the results replicate closely and the strategy performs well.
- Run various sklearn regressors to see which performs best, and study which metrics predict performance. MSE does not predict Sharpe; Kendall's tau, i.e. the rank correlation of predicted vs. actual returns, does better.
- Tune ElasticNet to get slightly better performance than LASSO/OLS.
- Run Keras neural networks; they don't improve on LASSO/OLS or ElasticNet.
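sklearn's LassoLarsIC only supports criterion='aic' or 'bic', so here is a minimal sketch of the small-sample correction the paper may be applying (my own helper, not code from the paper):
# hypothetical helper: small-sample corrected AIC
# AICc = AIC + 2k(k+1) / (n - k - 1), with n observations and k fitted parameters
def aicc(aic, n, k):
    return aic + 2.0 * k * (k + 1) / (n - k - 1)
# with n = 684 monthly observations and at most k = 30 predictors, the correction
# term is only 2*30*31/653 ~ 2.85, so AIC and AICc should usually select very
# similar subsets - consistent with matching the paper in all but 2 industries
print(aicc(100.0, 684, 30) - 100.0)  # ~2.85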
import os
import sys
import warnings
import numpy as np
import pandas as pd
import time
import copy
import random
from itertools import product
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' #Hide messy TensorFlow warnings
warnings.filterwarnings("ignore") #Hide messy numpy warnings
import sklearn
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import confusion_matrix, mean_squared_error, explained_variance_score, r2_score, accuracy_score
from sklearn.linear_model import LinearRegression, Lasso, lasso_path, lars_path, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.utils.testing import all_estimators
import xgboost
from scipy.stats import chisquare, kendalltau
import tensorflow as tf
tf.set_random_seed(1764)
print(tf.__version__)
# confirm GPU is in use
with tf.device('/gpu:0'):
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
with tf.Session() as sess:
print (sess.run(c))
import keras
from keras.layers import Dense, Activation
from keras.layers import Input
from keras.models import Model
from keras.layers import LSTM, GRU
from keras.regularizers import l1
from keras.models import Sequential
from keras.models import load_model
#import ffn
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# beakerx would make dataframes display as sortable widgets
# commented out because sortable tables are not viewable using nbviewer
# from beakerx import *
import plotly
# print (plotly.__version__) # requires version >= 1.9.0
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *
import plotly.figure_factory as ff
init_notebook_mode(connected=True)
random.seed(1764)
np.random.seed(1764)
1.8.0
[[22. 28.]
 [49. 64.]]
Using TensorFlow backend.
print("Loading data...")
data = pd.read_csv("30_Industry_Portfolios.csv")
data = data.set_index('yyyymm')
industries = list(data.columns)
# map industry names to col nums
ind_reverse_dict = dict([(industries[i], i) for i in range(len(industries))])
rfdata = pd.read_csv("F-F_Research_Data_Factors.csv")
rfdata = rfdata.set_index('yyyymm')
data['rf'] = rfdata['RF']
# subtract risk-free rate
# create a response variable led by 1 period to predict
for ind in industries:
data[ind] = data[ind] - data['rf']
#for ind in industries:
# data[ind+".3m"] = pd.rolling_mean(data[ind],3)
#for ind in industries:
# data[ind+".6m"] = pd.rolling_mean(data[ind],6)
#for ind in industries:
# data[ind+".12m"] = pd.rolling_mean(data[ind],12)
for ind in industries:
data[ind+".lead"] = data[ind].shift(-1)
data = data.loc[data.index[data.index > 195911]]
data = data.drop(columns=['rf'])
data = data.dropna(axis=0, how='any')
nresponses = len(industries)
npredictors = data.shape[1]-nresponses
predictors = list(data.columns[:npredictors])
predictor_reverse_dict = dict([(predictors[i], i) for i in range(len(predictors))])
responses = list(data.columns[-nresponses:])
response_reverse_dict = dict([(responses[i], i) for i in range(len(responses))])
print(data.shape)
data[['Food', 'Food.lead']]
Loading data...
(697, 60)
yyyymm | Food | Food.lead
---|---|---
195912 | 2.01 | -4.49 |
196001 | -4.49 | 3.35 |
196002 | 3.35 | -1.67 |
196003 | -1.67 | 1.17 |
196004 | 1.17 | 8.20 |
196005 | 8.20 | 5.39 |
196006 | 5.39 | -2.11 |
196007 | -2.11 | 4.57 |
196008 | 4.57 | -3.88 |
196009 | -3.88 | 1.02 |
196010 | 1.02 | 9.46 |
196011 | 9.46 | 4.51 |
196012 | 4.51 | 4.70 |
196101 | 4.70 | 4.21 |
196102 | 4.21 | 4.64 |
196103 | 4.64 | -1.39 |
196104 | -1.39 | 4.20 |
196105 | 4.20 | -2.17 |
196106 | -2.17 | 2.72 |
196107 | 2.72 | 4.92 |
196108 | 4.92 | -0.62 |
196109 | -0.62 | 3.73 |
196110 | 3.73 | 5.28 |
196111 | 5.28 | -3.69 |
196112 | -3.69 | -6.67 |
196201 | -6.67 | -0.25 |
196202 | -0.25 | 0.98 |
196203 | 0.98 | -4.59 |
196204 | -4.59 | -11.25 |
196205 | -11.25 | -8.75 |
... | ... | ... |
201507 | 4.03 | -4.37 |
201508 | -4.37 | -1.19 |
201509 | -1.19 | 5.81 |
201510 | 5.81 | 0.11 |
201511 | 0.11 | 1.96 |
201512 | 1.96 | -1.67 |
201601 | -1.67 | 0.95 |
201602 | 0.95 | 4.69 |
201603 | 4.69 | 0.63 |
201604 | 0.63 | 2.06 |
201605 | 2.06 | 4.75 |
201606 | 4.75 | -0.51 |
201607 | -0.51 | -0.52 |
201608 | -0.52 | -2.92 |
201609 | -2.92 | -0.33 |
201610 | -0.33 | -4.41 |
201611 | -4.41 | 4.43 |
201612 | 4.43 | 0.95 |
201701 | 0.95 | 1.71 |
201702 | 1.71 | 0.52 |
201703 | 0.52 | 0.76 |
201704 | 0.76 | 1.63 |
201705 | 1.63 | -2.65 |
201706 | -2.65 | 1.52 |
201707 | 1.52 | -2.77 |
201708 | -2.77 | 0.43 |
201709 | 0.43 | 0.71 |
201710 | 0.71 | 4.15 |
201711 | 4.15 | -0.10 |
201712 | -0.10 | 2.27 |
697 rows × 2 columns
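A quick sanity check of the construction (my own check, not in the paper): each .lead column should equal the same industry's excess return shifted back one month, as the table above shows.
# verify Food.lead in month t equals Food in month t+1
assert (data['Food.lead'].values[:-1] == data['Food'].values[1:]).all()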
# exclude 2017 and later to tie to paper
data = data.loc[data.index[data.index < 201701]]
data = data.loc[data.index[data.index > 195911]]
data
yyyymm | Food | Beer | Smoke | Games | Books | Hshld | Clths | Hlth | Chems | Txtls | ... | Telcm.lead | Servs.lead | BusEq.lead | Paper.lead | Trans.lead | Whlsl.lead | Rtail.lead | Meals.lead | Fin.lead | Other.lead
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
195912 | 2.01 | 0.35 | -3.02 | 1.64 | 7.29 | 0.67 | 1.87 | -1.97 | 3.08 | 0.74 | ... | 0.62 | -6.18 | -7.93 | -9.41 | -4.31 | -5.33 | -6.09 | -10.08 | -4.68 | -3.98 |
196001 | -4.49 | -5.71 | -2.05 | 1.21 | -5.47 | -7.84 | -8.53 | -6.68 | -10.03 | -4.77 | ... | 8.07 | 9.13 | 5.09 | 3.00 | -0.94 | 1.42 | 4.00 | 1.81 | -0.98 | 6.32 |
196002 | 3.35 | -2.14 | 2.27 | 4.23 | 2.39 | 9.31 | 1.44 | -0.02 | -0.74 | 0.32 | ... | -0.21 | -0.31 | 3.34 | -2.43 | -4.99 | -1.37 | -0.13 | -3.88 | 0.05 | -2.43 |
196003 | -1.67 | -2.94 | -0.18 | -0.65 | 2.18 | -0.56 | -2.59 | 1.26 | -2.75 | -6.79 | ... | -1.24 | 7.14 | 1.77 | 0.41 | -2.13 | 0.45 | -0.53 | 8.86 | -0.64 | 0.55 |
196004 | 1.17 | -2.16 | 1.35 | 6.46 | -1.17 | -1.27 | 0.21 | 1.49 | -5.53 | -1.10 | ... | 3.05 | -1.75 | 11.90 | 2.85 | 0.90 | 1.65 | 3.11 | 0.80 | -0.45 | 1.02 |
196005 | 8.20 | -0.52 | 2.44 | 7.28 | 11.67 | 7.74 | 1.74 | 13.50 | 3.40 | 2.10 | ... | -0.58 | -8.07 | 2.39 | 3.50 | 2.17 | 5.96 | 3.41 | 1.03 | 3.72 | 6.41 |
196006 | 5.39 | 0.47 | 4.73 | 2.24 | 0.02 | 6.38 | -1.59 | -0.40 | 0.45 | 4.04 | ... | -0.03 | 2.84 | -2.02 | -4.10 | -3.11 | -6.16 | -2.99 | -1.25 | 0.09 | -5.95 |
196007 | -2.11 | -0.79 | 4.60 | -4.72 | 0.23 | -0.60 | -1.10 | -3.99 | -6.80 | -3.14 | ... | 6.94 | 5.69 | 2.71 | 1.18 | 1.98 | 4.51 | 2.85 | 2.05 | 3.47 | 3.48 |
196008 | 4.57 | 3.24 | 5.20 | 7.16 | 3.63 | 5.09 | 3.34 | 2.29 | 1.17 | -0.84 | ... | -6.07 | -3.53 | -7.61 | -7.37 | -7.07 | -8.44 | -8.57 | -1.90 | -5.78 | -4.21 |
196009 | -3.88 | -5.00 | -2.09 | -2.33 | -6.20 | -9.18 | -4.23 | -8.87 | -6.70 | -5.25 | ... | -0.08 | 4.62 | -3.40 | -1.85 | -1.02 | -4.22 | 0.31 | -4.54 | -0.40 | 0.38 |
196010 | 1.02 | 0.54 | 3.87 | 0.11 | 2.38 | 6.48 | -3.50 | -3.71 | -1.59 | -3.06 | ... | 4.06 | 9.49 | 8.19 | 5.31 | 5.35 | 9.72 | 6.50 | 4.40 | 7.71 | 4.01 |
196011 | 9.46 | 6.57 | 5.44 | 13.91 | 10.11 | 9.13 | 3.15 | 3.91 | 4.25 | 2.04 | ... | 12.29 | 8.18 | 4.29 | 5.57 | 2.27 | 2.06 | 2.05 | 2.08 | 5.56 | 3.80 |
196012 | 4.51 | -0.31 | 3.54 | 7.77 | 7.41 | 1.76 | 3.28 | 6.06 | 2.85 | 0.52 | ... | 7.70 | 4.29 | 5.08 | 4.56 | 8.35 | 7.93 | 2.28 | 4.08 | 7.12 | 8.23 |
196101 | 4.70 | 5.23 | 8.77 | 0.56 | 9.47 | 4.36 | 5.94 | 5.86 | 6.46 | 11.21 | ... | 0.61 | 0.20 | 4.54 | 6.83 | 4.22 | 3.31 | 4.82 | 8.23 | 7.00 | 6.00 |
196102 | 4.21 | 8.16 | 5.41 | 22.33 | 2.15 | 5.90 | 7.84 | 5.05 | 2.13 | 6.81 | ... | 7.23 | -0.20 | 2.31 | -0.69 | 0.86 | 4.45 | 5.76 | 4.06 | 4.34 | 7.08 |
196103 | 4.64 | 2.55 | 5.60 | 7.18 | 4.77 | 6.34 | 3.08 | 3.60 | 0.92 | 5.92 | ... | 0.63 | -0.12 | 2.19 | -0.37 | -1.62 | 3.08 | 0.22 | 4.23 | 1.38 | -3.67 |
196104 | -1.39 | 1.40 | -0.23 | -2.21 | -6.37 | 2.66 | 2.60 | -0.47 | -1.47 | -5.31 | ... | -1.22 | -0.70 | 1.57 | 1.39 | 4.74 | -0.04 | 4.31 | -1.90 | 4.00 | 3.32 |
196105 | 4.20 | 5.38 | 3.39 | -3.91 | 2.71 | -0.02 | 6.80 | 2.10 | 5.50 | 5.47 | ... | -4.19 | 0.13 | -3.31 | -4.46 | -4.57 | -4.90 | 0.80 | -5.63 | -2.88 | 0.37 |
196106 | -2.17 | -3.12 | 3.97 | -5.87 | -3.85 | 3.43 | -5.50 | -3.58 | -1.32 | -3.36 | ... | 6.25 | -8.25 | 0.56 | -0.50 | -0.32 | -0.01 | 2.45 | 2.69 | 3.35 | 5.37 |
196107 | 2.72 | 0.88 | 5.95 | -1.21 | -2.55 | 1.97 | 2.03 | 3.27 | 2.95 | 1.53 | ... | 0.12 | 8.99 | 5.11 | 5.37 | 3.52 | 3.09 | 3.03 | 0.46 | 8.65 | 1.64 |
196108 | 4.92 | 3.20 | 7.74 | 0.89 | 0.89 | 10.45 | 5.21 | 3.70 | 2.35 | 5.77 | ... | -2.94 | -6.04 | 1.01 | -2.74 | -1.16 | -4.22 | 0.66 | -6.21 | -0.40 | 3.14 |
196109 | -0.62 | -1.48 | -0.07 | 1.24 | 0.75 | -3.05 | -1.14 | -1.48 | -4.45 | -4.25 | ... | 0.00 | 2.24 | 6.53 | 1.74 | 2.16 | 4.30 | 9.35 | 0.71 | 2.02 | 0.39 |
196110 | 3.73 | -0.84 | 7.05 | -5.26 | 0.99 | -0.67 | 8.28 | 3.33 | 0.05 | 3.11 | ... | 8.34 | 8.27 | 0.99 | 2.05 | 0.47 | 5.65 | 4.90 | 1.08 | 7.22 | 1.69 |
196111 | 5.28 | 4.47 | 8.03 | 0.25 | 3.75 | 4.51 | 5.30 | 3.12 | 2.49 | 7.37 | ... | 3.14 | -0.68 | -0.55 | -2.65 | -0.24 | 0.46 | -0.63 | -2.21 | -4.44 | -0.77 |
196112 | -3.69 | 1.41 | -6.12 | 1.97 | -3.66 | -3.78 | 0.32 | -2.21 | -0.16 | -1.17 | ... | -6.32 | -4.88 | -6.91 | -5.22 | 2.52 | -0.79 | -9.56 | -3.90 | -4.99 | -3.62 |
196201 | -6.67 | -3.45 | -4.28 | -13.23 | -3.44 | -7.37 | -5.89 | -4.86 | -4.76 | 0.57 | ... | 3.89 | -1.94 | -0.07 | 3.87 | 0.32 | -0.09 | 1.58 | -0.59 | 3.59 | 4.20 |
196202 | -0.25 | 0.28 | 0.68 | -2.02 | -0.52 | -0.90 | 2.01 | 3.56 | 3.30 | 1.93 | ... | -3.14 | -1.41 | -0.76 | 1.13 | -1.68 | -2.30 | 0.90 | -4.07 | -2.13 | -1.83 |
196203 | 0.98 | -0.34 | -6.67 | -5.34 | 0.41 | 4.31 | -1.18 | 0.34 | -2.72 | -0.74 | ... | -4.93 | -3.11 | -12.01 | -7.93 | -6.27 | -5.78 | -4.61 | -9.09 | -7.69 | -2.12 |
196204 | -4.59 | -3.59 | -12.99 | -11.04 | -8.74 | -7.03 | -8.01 | -11.23 | -6.23 | -7.53 | ... | -7.35 | -10.97 | -13.19 | -11.03 | -5.17 | -11.34 | -9.09 | -7.46 | -10.02 | -11.83 |
196205 | -11.25 | -9.05 | -14.14 | -11.39 | -14.87 | -10.19 | -10.01 | -11.14 | -8.25 | -7.50 | ... | -8.72 | -13.11 | -12.59 | -9.62 | -7.81 | -11.11 | -10.43 | -12.90 | -11.01 | -14.25 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
201407 | -5.83 | -2.92 | -3.48 | -3.70 | -1.60 | -2.62 | -0.62 | -0.33 | -3.26 | -5.70 | ... | 0.52 | 3.49 | 5.10 | 3.38 | 4.57 | 4.87 | 5.71 | 3.32 | 3.81 | 7.19 |
201408 | 6.38 | 5.46 | 5.08 | -1.80 | 2.64 | 5.87 | 3.53 | 5.44 | 4.39 | 10.14 | ... | -1.70 | -1.20 | -1.91 | -3.49 | 0.71 | -2.29 | -1.50 | -0.60 | -0.59 | 0.16 |
201409 | -0.53 | 0.83 | 2.15 | -4.38 | -7.29 | 0.84 | 4.44 | -0.08 | -2.14 | -2.38 | ... | 1.42 | 0.46 | 2.83 | 6.42 | 6.22 | 3.32 | 3.32 | 1.28 | 3.87 | 1.58 |
201410 | 1.70 | 3.28 | 6.13 | 0.58 | 3.28 | 3.71 | 1.57 | 5.77 | -1.09 | 2.08 | ... | 3.03 | 2.05 | 6.47 | 4.80 | 5.88 | 1.83 | 8.91 | 5.57 | 1.93 | 4.55 |
201411 | 5.68 | 4.35 | 0.76 | -0.40 | 1.11 | 3.23 | 8.89 | 2.67 | 2.21 | 7.53 | ... | -1.73 | -0.60 | -1.87 | 2.08 | 1.09 | 0.26 | 2.25 | -0.43 | 2.08 | 0.17 |
201412 | -2.48 | -4.30 | -3.11 | -4.60 | 1.83 | 0.02 | -0.82 | -0.89 | 0.83 | -0.31 | ... | -4.89 | -4.44 | -2.14 | -2.43 | -4.53 | -2.37 | -0.29 | 0.66 | -7.61 | -4.11 |
201501 | -1.64 | 0.90 | 2.95 | 0.25 | -2.18 | -4.92 | -3.75 | 1.56 | -2.01 | 2.02 | ... | 9.12 | 7.92 | 8.43 | 6.68 | 3.10 | 5.37 | 5.58 | 6.82 | 7.93 | 4.63 |
201502 | 4.44 | 4.40 | 5.47 | 5.87 | 9.00 | 3.83 | 5.43 | 4.31 | 8.10 | 13.34 | ... | -2.20 | -1.52 | -3.12 | -1.73 | -3.62 | 0.57 | 1.00 | -0.29 | 0.11 | -1.77 |
201503 | -0.72 | -2.07 | -8.82 | -2.60 | 0.56 | -1.98 | 1.23 | 0.86 | -3.55 | 3.33 | ... | 3.40 | 2.25 | 0.10 | -2.65 | -1.14 | -1.23 | -2.88 | 0.51 | 0.76 | -0.23 |
201504 | -0.17 | -0.52 | 5.94 | 3.75 | -4.11 | -2.41 | -1.53 | -1.39 | 1.11 | -5.95 | ... | 0.81 | -0.12 | 4.02 | 1.32 | -3.06 | 1.44 | 0.54 | 1.56 | 3.09 | 0.92 |
201505 | 2.03 | 2.00 | 1.28 | 0.71 | 1.93 | 0.39 | 0.04 | 4.89 | 1.24 | 4.37 | ... | -0.48 | -1.87 | -4.92 | -2.84 | -3.25 | -2.82 | -0.49 | 0.29 | 1.34 | -3.59 |
201506 | -1.95 | -1.71 | -2.57 | 1.30 | -0.84 | -0.11 | 4.17 | 0.07 | -2.72 | 4.09 | ... | 1.41 | 5.22 | -0.91 | -0.54 | 3.27 | -1.32 | 5.79 | 4.17 | 1.97 | 3.18 |
201507 | 4.03 | 3.51 | 9.59 | 6.09 | -2.90 | 0.71 | 5.96 | 3.66 | -4.90 | -0.72 | ... | -8.42 | -5.29 | -6.50 | -5.69 | -6.37 | -4.13 | -5.44 | -6.48 | -6.54 | -5.20 |
201508 | -4.37 | -3.12 | -4.06 | -7.35 | -8.61 | -6.94 | -3.86 | -8.37 | -7.15 | -3.11 | ... | -2.63 | -1.47 | -1.72 | -2.66 | -0.71 | -6.04 | -1.75 | 0.44 | -3.14 | -1.87 |
201509 | -1.19 | 2.58 | 2.37 | -9.94 | -5.32 | -0.53 | 1.18 | -7.28 | -8.38 | -5.92 | ... | 8.84 | 11.26 | 8.16 | 10.19 | 6.48 | 5.07 | 4.56 | 5.05 | 5.90 | 6.98 |
201510 | 5.81 | 8.06 | 10.90 | 14.61 | 12.21 | 5.81 | 0.98 | 7.74 | 16.62 | 7.96 | ... | -1.92 | 1.99 | 0.12 | -0.02 | -1.10 | 2.67 | 0.61 | -1.01 | 2.16 | 0.05 |
201511 | 0.11 | -0.71 | -3.00 | -0.41 | -1.17 | -1.10 | -1.08 | 0.71 | 1.68 | -2.59 | ... | -3.03 | -1.19 | -4.64 | -3.75 | -5.02 | -1.88 | 0.82 | -0.95 | -2.92 | 0.25 |
201512 | 1.96 | 0.30 | 1.59 | -1.70 | -6.18 | 1.86 | -4.38 | 0.39 | -4.82 | -2.65 | ... | -0.36 | -5.09 | -7.95 | -5.27 | -8.53 | -8.68 | -4.45 | -0.94 | -9.63 | -3.20 |
201601 | -1.67 | -0.23 | 4.28 | -8.15 | -5.28 | 0.16 | 1.52 | -9.43 | -11.10 | -5.33 | ... | 1.15 | -2.45 | 1.45 | 2.99 | 6.89 | 3.85 | -0.36 | 1.03 | -2.85 | 2.71 |
201602 | 0.95 | -2.34 | 0.93 | 4.25 | -0.96 | 0.34 | 0.81 | -1.09 | 6.79 | 0.63 | ... | 6.00 | 7.76 | 8.86 | 8.18 | 6.86 | 6.18 | 5.99 | 5.36 | 6.65 | 6.68 |
201603 | 4.69 | 5.61 | 5.04 | 8.61 | 6.88 | 4.65 | 2.30 | 2.90 | 8.33 | 3.58 | ... | 0.59 | -2.55 | -5.46 | 0.80 | -1.08 | 0.49 | -0.38 | -2.38 | 3.96 | 0.67 |
201604 | 0.63 | 0.31 | -0.25 | -6.30 | 1.82 | -0.42 | -2.27 | 3.55 | 3.77 | 1.55 | ... | 0.30 | 4.67 | 5.64 | 1.79 | -2.18 | 1.78 | 1.19 | -1.48 | 2.15 | -2.02 |
201605 | 2.06 | -0.91 | 0.83 | 5.42 | -0.53 | -0.09 | -4.96 | 2.46 | -1.42 | -1.70 | ... | 3.10 | -2.12 | -1.63 | 2.06 | -2.53 | 1.81 | 0.71 | 1.16 | -5.30 | 3.61 |
201606 | 4.75 | 5.31 | 6.87 | -4.43 | -0.34 | 3.16 | 1.63 | 0.11 | -1.14 | -4.67 | ... | 2.27 | 7.33 | 8.20 | 2.52 | 5.39 | 3.65 | 3.78 | 2.19 | 4.04 | -0.21 |
201607 | -0.51 | 1.82 | -2.79 | 6.15 | 7.38 | 2.58 | 1.62 | 6.00 | 4.29 | 8.39 | ... | -3.56 | 1.19 | 2.38 | 2.46 | 1.09 | -1.03 | -1.69 | -0.24 | 4.88 | 2.24 |
201608 | -0.52 | -0.90 | -1.22 | 0.94 | 0.29 | 1.24 | 1.37 | -3.24 | 2.48 | 1.03 | ... | 0.52 | 0.82 | 4.33 | -0.64 | 2.86 | -2.56 | -0.18 | -2.25 | -1.45 | -3.48 |
201609 | -2.92 | 1.63 | -2.78 | 4.62 | -3.95 | 0.00 | -6.92 | 0.35 | -1.76 | -4.87 | ... | -2.85 | -0.55 | -2.24 | -5.49 | -0.64 | -8.18 | -3.59 | -1.96 | 1.40 | -0.53 |
201610 | -0.33 | -1.65 | 4.59 | 5.59 | -10.28 | -2.96 | -5.76 | -7.45 | -1.95 | -4.17 | ... | 6.25 | -0.01 | 2.39 | 4.21 | 12.75 | 9.29 | 2.99 | 8.47 | 12.84 | 8.29 |
201611 | -4.41 | -5.76 | -5.12 | 3.87 | 8.15 | -4.18 | 1.80 | 1.37 | 7.55 | 1.58 | ... | 4.65 | -0.19 | 2.07 | 1.86 | 0.84 | 2.34 | -0.98 | 0.58 | 3.80 | 2.57 |
201612 | 4.43 | 3.00 | 5.39 | -3.36 | 1.98 | 1.43 | -0.44 | 0.82 | 0.32 | -1.27 | ... | 3.36 | 5.45 | 3.20 | 2.28 | 1.70 | 1.69 | 0.93 | 0.71 | 0.56 | -0.87 |
685 rows × 60 columns
data.to_csv("data.csv")
desc = data.describe()
desc
# min, max line up with Table 1
 | Food | Beer | Smoke | Games | Books | Hshld | Clths | Hlth | Chems | Txtls | ... | Telcm.lead | Servs.lead | BusEq.lead | Paper.lead | Trans.lead | Whlsl.lead | Rtail.lead | Meals.lead | Fin.lead | Other.lead
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 685.000000 | 685.000000 | 685.000000 | 685.000000 | 685.000000 | 685.000000 | 685.000000 | 685.000000 | 685.000000 | 685.000000 | ... | 685.000000 | 685.000000 | 685.000000 | 685.000000 | 685.000000 | 685.000000 | 685.000000 | 685.000000 | 685.000000 | 685.000000 |
mean | 0.690715 | 0.710613 | 0.982321 | 0.701708 | 0.528277 | 0.554190 | 0.669460 | 0.650905 | 0.519781 | 0.667416 | ... | 0.520847 | 0.694234 | 0.584175 | 0.511241 | 0.582088 | 0.625562 | 0.662219 | 0.702730 | 0.609810 | 0.385620 |
std | 4.339811 | 5.090215 | 6.061582 | 7.180918 | 5.809314 | 4.759874 | 6.386027 | 4.928072 | 5.518477 | 7.022552 | ... | 4.628520 | 6.527984 | 6.738979 | 5.055314 | 5.739306 | 5.605317 | 5.349341 | 6.104515 | 5.411766 | 5.815446 |
min | -18.150000 | -20.190000 | -25.320000 | -33.400000 | -26.560000 | -22.240000 | -31.500000 | -21.060000 | -28.600000 | -33.110000 | ... | -16.440000 | -28.670000 | -32.070000 | -27.740000 | -28.500000 | -29.250000 | -29.740000 | -31.890000 | -22.530000 | -28.090000 |
25% | -1.640000 | -2.100000 | -2.780000 | -3.490000 | -2.690000 | -2.110000 | -2.810000 | -2.240000 | -2.800000 | -3.200000 | ... | -2.110000 | -3.090000 | -3.290000 | -2.430000 | -2.780000 | -2.570000 | -2.430000 | -2.940000 | -2.420000 | -2.990000 |
50% | 0.740000 | 0.710000 | 1.280000 | 0.890000 | 0.510000 | 0.750000 | 0.690000 | 0.750000 | 0.670000 | 0.630000 | ... | 0.610000 | 0.970000 | 0.560000 | 0.690000 | 0.860000 | 0.940000 | 0.470000 | 1.030000 | 0.820000 | 0.470000 |
75% | 3.120000 | 3.660000 | 4.640000 | 5.310000 | 3.720000 | 3.550000 | 4.310000 | 3.560000 | 3.760000 | 4.490000 | ... | 3.360000 | 4.290000 | 4.590000 | 3.460000 | 4.060000 | 3.880000 | 4.000000 | 4.330000 | 4.000000 | 4.200000 |
max | 19.890000 | 25.510000 | 32.380000 | 34.520000 | 33.130000 | 18.220000 | 31.790000 | 29.010000 | 21.680000 | 59.030000 | ... | 21.220000 | 23.380000 | 24.660000 | 21.000000 | 18.500000 | 17.530000 | 26.490000 | 27.380000 | 20.590000 | 19.960000 |
8 rows × 60 columns
# annualized returns don't match Table 1, oddly
# geometric mean, annualized
pd.DataFrame((np.prod(data/100 + 1)**(12.0/len(data))-1)[:30], columns=['Mean Ann. Return'])
 | Mean Ann. Return
---|---
Food | 0.074020 |
Beer | 0.072005 |
Smoke | 0.100147 |
Games | 0.054031 |
Books | 0.043953 |
Hshld | 0.054098 |
Clths | 0.057170 |
Hlth | 0.065463 |
Chems | 0.044917 |
Txtls | 0.051888 |
Cnstr | 0.041836 |
Steel | 0.002802 |
FabPr | 0.045615 |
ElcEq | 0.062927 |
Autos | 0.027963 |
Carry | 0.063991 |
Mines | 0.032527 |
Coal | 0.026075 |
Oil | 0.062748 |
Util | 0.049564 |
Telcm | 0.050868 |
Servs | 0.057776 |
BusEq | 0.042774 |
Paper | 0.046776 |
Trans | 0.051138 |
Whlsl | 0.057056 |
Rtail | 0.064258 |
Meals | 0.063630 |
Fin | 0.056748 |
Other | 0.024894 |
# try the arithmetic monthly mean, then annualize (no compounding, so not really correct)
#print(pd.DataFrame(((desc.loc['mean']/100+1)**12-1)[:30]))
# nope, still doesn't match Table 1
# equivalent calculation:
pd.DataFrame(((1 + np.mean(data, axis=0)/100)**12 -1)[:30], columns=['Mean Ann. Return'])
 | Mean Ann. Return
---|---
Food | 0.086108 |
Beer | 0.088687 |
Smoke | 0.124460 |
Games | 0.087532 |
Books | 0.065268 |
Hshld | 0.068568 |
Clths | 0.083360 |
Hlth | 0.080966 |
Chems | 0.064188 |
Txtls | 0.083096 |
Cnstr | 0.064656 |
Steel | 0.035833 |
FabPr | 0.069639 |
ElcEq | 0.087857 |
Autos | 0.055859 |
Carry | 0.089740 |
Mines | 0.067862 |
Coal | 0.091926 |
Oil | 0.080976 |
Util | 0.059601 |
Telcm | 0.064437 |
Servs | 0.085164 |
BusEq | 0.071925 |
Paper | 0.062952 |
Trans | 0.072192 |
Whlsl | 0.077332 |
Rtail | 0.082704 |
Meals | 0.088043 |
Fin | 0.075605 |
Other | 0.046240 |
# annualized volatility
pd.DataFrame((desc.loc['std']*np.sqrt(12))[:30].round(2))
# lines up with Table 1
 | std
---|---
Food | 15.03 |
Beer | 17.63 |
Smoke | 21.00 |
Games | 24.88 |
Books | 20.12 |
Hshld | 16.49 |
Clths | 22.12 |
Hlth | 17.07 |
Chems | 19.12 |
Txtls | 24.33 |
Cnstr | 20.75 |
Steel | 25.27 |
FabPr | 21.18 |
ElcEq | 21.52 |
Autos | 23.17 |
Carry | 21.80 |
Mines | 25.82 |
Coal | 35.31 |
Oil | 18.53 |
Util | 13.83 |
Telcm | 16.04 |
Servs | 22.61 |
BusEq | 23.34 |
Paper | 17.51 |
Trans | 19.88 |
Whlsl | 19.42 |
Rtail | 18.53 |
Meals | 21.15 |
Fin | 18.75 |
Other | 20.17 |
# Run LASSO, then OLS on selected variables
# skip last row to better match published r-squared
# looks like they forecast actuals 1960-2016 using 1959m12 to 2016m11
# not exact matches to Table 2 R-squared but almost within rounding error
X = data.values[:-1,:npredictors]
Y = data.values[:-1,-nresponses:]
nrows = X.shape[0]
X.shape
(684, 30)
def subset_selection(X, Y, model_aic, verbose=False, responses=responses, predictors=predictors):
nrows, npreds = X.shape
    _, nresps = Y.shape  # number of response columns
coef_dict = []
for response_index in range(nresps):
y = Y[:,response_index]
model_aic.fit(X, y)
predcols = [i for i in range(npreds) if model_aic.coef_[i] !=0]
#y_response = model_aic.predict(X)
# print ("In-sample LASSO R-squared: %.6f" % r2_score(y, y_response))
if verbose and responses:
print("LASSO variables selected for %s: " % responses[response_index])
print([predictors[i] for i in predcols])
if not predcols:
if verbose and responses:
print("No coefs selected for " + responses[response_index] + ", using all")
print("---")
predcols = list(range(npreds))
# fit OLS vs. selected vars, better fit w/o LASSO penalties
# in-sample R-squared using LASSO coeffs
coef_dict.append(predcols)
if verbose and responses and predictors:
print("Running OLS for " + responses[response_index] + " against " + str([predictors[i] for i in predcols]))
# col nums of selected responses
model_ols = LinearRegression()
model_ols.fit(X[:, predcols], y)
y_pred = model_ols.predict(X[:, predcols])
print ("In-sample OLS R-squared: %.2f%%" % (100 * r2_score(y, y_pred)))
print("---")
return coef_dict
#coef_dict = subset_selection(X, Y, LassoLarsIC(criterion='aic'))
coef_dict = subset_selection(X, Y, LassoLarsIC(criterion='aic'), verbose=True, responses=responses, predictors=predictors)
print(coef_dict)
# These subsets line up closely with Table 2
# except Clths, Whlsl, we get different responses
LASSO variables selected for Food.lead: ['Clths', 'Coal', 'Util', 'Rtail']
Running OLS for Food.lead against ['Clths', 'Coal', 'Util', 'Rtail']
In-sample OLS R-squared: 2.24%
---
LASSO variables selected for Beer.lead: ['Food', 'Clths', 'Coal']
Running OLS for Beer.lead against ['Food', 'Clths', 'Coal']
In-sample OLS R-squared: 2.52%
---
LASSO variables selected for Smoke.lead: ['Txtls', 'Carry', 'Mines', 'Coal', 'Oil', 'Util', 'Telcm', 'Servs', 'Paper', 'Trans', 'Fin']
Running OLS for Smoke.lead against ['Txtls', 'Carry', 'Mines', 'Coal', 'Oil', 'Util', 'Telcm', 'Servs', 'Paper', 'Trans', 'Fin']
In-sample OLS R-squared: 6.55%
---
LASSO variables selected for Games.lead: ['Books', 'Clths', 'Coal', 'Fin']
Running OLS for Games.lead against ['Books', 'Clths', 'Coal', 'Fin']
In-sample OLS R-squared: 5.05%
---
LASSO variables selected for Books.lead: ['Games', 'Books', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Rtail', 'Fin']
Running OLS for Books.lead against ['Games', 'Books', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Rtail', 'Fin']
In-sample OLS R-squared: 6.30%
---
LASSO variables selected for Hshld.lead: ['Clths', 'Coal', 'Rtail']
Running OLS for Hshld.lead against ['Clths', 'Coal', 'Rtail']
In-sample OLS R-squared: 2.97%
---
LASSO variables selected for Clths.lead: ['Clths', 'Coal', 'Oil', 'Servs', 'Rtail']
Running OLS for Clths.lead against ['Clths', 'Coal', 'Oil', 'Servs', 'Rtail']
In-sample OLS R-squared: 4.73%
---
LASSO variables selected for Hlth.lead: ['Books', 'Mines', 'Coal', 'Util']
Running OLS for Hlth.lead against ['Books', 'Mines', 'Coal', 'Util']
In-sample OLS R-squared: 2.68%
---
LASSO variables selected for Chems.lead: ['Clths']
Running OLS for Chems.lead against ['Clths']
In-sample OLS R-squared: 0.78%
---
LASSO variables selected for Txtls.lead: ['Clths', 'Autos', 'Coal', 'Oil', 'Rtail', 'Fin']
Running OLS for Txtls.lead against ['Clths', 'Autos', 'Coal', 'Oil', 'Rtail', 'Fin']
In-sample OLS R-squared: 7.91%
---
LASSO variables selected for Cnstr.lead: ['Clths', 'Coal', 'Oil', 'Util', 'Trans', 'Rtail', 'Fin']
Running OLS for Cnstr.lead against ['Clths', 'Coal', 'Oil', 'Util', 'Trans', 'Rtail', 'Fin']
In-sample OLS R-squared: 5.15%
---
LASSO variables selected for Steel.lead: ['Fin']
Running OLS for Steel.lead against ['Fin']
In-sample OLS R-squared: 1.28%
---
LASSO variables selected for FabPr.lead: ['Trans', 'Fin']
Running OLS for FabPr.lead against ['Trans', 'Fin']
In-sample OLS R-squared: 1.57%
---
LASSO variables selected for ElcEq.lead: ['Fin']
Running OLS for ElcEq.lead against ['Fin']
In-sample OLS R-squared: 0.80%
---
LASSO variables selected for Autos.lead: ['Hshld', 'Clths', 'Coal', 'Oil', 'Util', 'BusEq', 'Rtail', 'Fin']
Running OLS for Autos.lead against ['Hshld', 'Clths', 'Coal', 'Oil', 'Util', 'BusEq', 'Rtail', 'Fin']
In-sample OLS R-squared: 6.13%
---
LASSO variables selected for Carry.lead: ['Trans']
Running OLS for Carry.lead against ['Trans']
In-sample OLS R-squared: 2.32%
---
LASSO variables selected for Mines.lead: []
No coefs selected for Mines.lead, using all
---
Running OLS for Mines.lead against ['Food', 'Beer', 'Smoke', 'Games', 'Books', 'Hshld', 'Clths', 'Hlth', 'Chems', 'Txtls', 'Cnstr', 'Steel', 'FabPr', 'ElcEq', 'Autos', 'Carry', 'Mines', 'Coal', 'Oil', 'Util', 'Telcm', 'Servs', 'BusEq', 'Paper', 'Trans', 'Whlsl', 'Rtail', 'Meals', 'Fin', 'Other']
In-sample OLS R-squared: 5.62%
---
LASSO variables selected for Coal.lead: ['Beer', 'Smoke', 'Books', 'Autos', 'Coal', 'Oil', 'Paper', 'Rtail']
Running OLS for Coal.lead against ['Beer', 'Smoke', 'Books', 'Autos', 'Coal', 'Oil', 'Paper', 'Rtail']
In-sample OLS R-squared: 2.84%
---
LASSO variables selected for Oil.lead: ['Beer', 'Hlth', 'Carry']
Running OLS for Oil.lead against ['Beer', 'Hlth', 'Carry']
In-sample OLS R-squared: 2.52%
---
LASSO variables selected for Util.lead: ['Food', 'Beer', 'Smoke', 'Hshld', 'Hlth', 'Cnstr', 'FabPr', 'Carry', 'Mines', 'Oil', 'Util', 'Telcm', 'BusEq', 'Whlsl', 'Fin', 'Other']
Running OLS for Util.lead against ['Food', 'Beer', 'Smoke', 'Hshld', 'Hlth', 'Cnstr', 'FabPr', 'Carry', 'Mines', 'Oil', 'Util', 'Telcm', 'BusEq', 'Whlsl', 'Fin', 'Other']
In-sample OLS R-squared: 7.86%
---
LASSO variables selected for Telcm.lead: ['Beer', 'Smoke', 'Books', 'Hshld', 'Cnstr', 'Autos', 'Carry', 'Mines', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Rtail', 'Meals', 'Fin']
Running OLS for Telcm.lead against ['Beer', 'Smoke', 'Books', 'Hshld', 'Cnstr', 'Autos', 'Carry', 'Mines', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Rtail', 'Meals', 'Fin']
In-sample OLS R-squared: 5.18%
---
LASSO variables selected for Servs.lead: ['Smoke', 'Books', 'Steel', 'Oil', 'Util', 'Fin']
Running OLS for Servs.lead against ['Smoke', 'Books', 'Steel', 'Oil', 'Util', 'Fin']
In-sample OLS R-squared: 2.87%
---
LASSO variables selected for BusEq.lead: ['Smoke', 'Books', 'Util']
Running OLS for BusEq.lead against ['Smoke', 'Books', 'Util']
In-sample OLS R-squared: 2.75%
---
LASSO variables selected for Paper.lead: ['Clths', 'Coal', 'Oil', 'Rtail', 'Fin']
Running OLS for Paper.lead against ['Clths', 'Coal', 'Oil', 'Rtail', 'Fin']
In-sample OLS R-squared: 3.24%
---
LASSO variables selected for Trans.lead: ['Fin']
Running OLS for Trans.lead against ['Fin']
In-sample OLS R-squared: 1.32%
---
LASSO variables selected for Whlsl.lead: ['Food', 'Smoke', 'Books', 'Carry', 'Coal', 'Oil', 'Util', 'Servs', 'Fin', 'Other']
Running OLS for Whlsl.lead against ['Food', 'Smoke', 'Books', 'Carry', 'Coal', 'Oil', 'Util', 'Servs', 'Fin', 'Other']
In-sample OLS R-squared: 6.79%
---
LASSO variables selected for Rtail.lead: ['Rtail']
Running OLS for Rtail.lead against ['Rtail']
In-sample OLS R-squared: 1.61%
---
LASSO variables selected for Meals.lead: ['Smoke', 'Books', 'Clths', 'Steel', 'Carry', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Meals', 'Fin']
Running OLS for Meals.lead against ['Smoke', 'Books', 'Clths', 'Steel', 'Carry', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Meals', 'Fin']
In-sample OLS R-squared: 7.90%
---
LASSO variables selected for Fin.lead: ['Fin']
Running OLS for Fin.lead against ['Fin']
In-sample OLS R-squared: 1.70%
---
LASSO variables selected for Other.lead: ['Clths', 'Fin']
Running OLS for Other.lead against ['Clths', 'Fin']
In-sample OLS R-squared: 2.69%
---
[[6, 17, 19, 26], [0, 6, 17], [9, 15, 16, 17, 18, 19, 20, 21, 23, 24, 28], [4, 6, 17, 28], [3, 4, 17, 18, 19, 21, 22, 26, 28], [6, 17, 26], [6, 17, 18, 21, 26], [4, 16, 17, 19], [6], [6, 14, 17, 18, 26, 28], [6, 17, 18, 19, 24, 26, 28], [28], [24, 28], [28], [5, 6, 17, 18, 19, 22, 26, 28], [24], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], [1, 2, 4, 14, 17, 18, 23, 26], [1, 7, 15], [0, 1, 2, 5, 7, 10, 12, 15, 16, 18, 19, 20, 22, 25, 28, 29], [1, 2, 4, 5, 10, 14, 15, 16, 17, 18, 19, 21, 22, 26, 27, 28], [2, 4, 11, 18, 19, 28], [2, 4, 19], [6, 17, 18, 26, 28], [28], [0, 2, 4, 15, 17, 18, 19, 21, 28, 29], [26], [2, 4, 6, 11, 15, 17, 18, 19, 21, 22, 27, 28], [28], [6, 28]]
# same predictors selected for all but 2 response vars
# use predictors from paper to match results
if True: # turn off/on
coef_dict_temp = {}
coef_dict_temp['Food.lead'] = ['Clths', 'Coal', 'Util', 'Rtail']
coef_dict_temp['Beer.lead'] = ['Food', 'Clths', 'Coal']
coef_dict_temp['Smoke.lead'] = ['Txtls', 'Carry', 'Mines', 'Coal', 'Oil', 'Util', 'Telcm', 'Servs', 'Paper', 'Trans', 'Fin']
coef_dict_temp['Games.lead'] = ['Books', 'Clths', 'Coal', 'Fin']
coef_dict_temp['Books.lead'] = ['Games', 'Books', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Rtail', 'Fin']
coef_dict_temp['Hshld.lead'] = ['Clths', 'Coal', 'Rtail']
coef_dict_temp['Clths.lead'] = ['Books', 'Clths', 'Chems', 'Steel', 'ElcEq', 'Carry', 'Coal', 'Oil', 'Util','Telcm', 'Servs', 'BusEq', 'Rtail']
# Running OLS for Clths against ['Clths', 'Coal', 'Oil', 'Servs', 'Rtail']
coef_dict_temp['Hlth.lead'] = ['Books', 'Mines', 'Coal', 'Util']
coef_dict_temp['Chems.lead'] = ['Clths']
coef_dict_temp['Txtls.lead'] = ['Clths', 'Autos', 'Coal', 'Oil', 'Rtail', 'Fin']
coef_dict_temp['Cnstr.lead'] = ['Clths', 'Coal', 'Oil', 'Util', 'Trans', 'Rtail', 'Fin']
coef_dict_temp['Steel.lead'] = ['Fin']
coef_dict_temp['FabPr.lead'] = ['Trans', 'Fin']
coef_dict_temp['ElcEq.lead'] = ['Fin']
coef_dict_temp['Autos.lead'] = ['Hshld', 'Clths', 'Coal', 'Oil', 'Util', 'BusEq', 'Rtail', 'Fin']
coef_dict_temp['Carry.lead'] = ['Trans']
coef_dict_temp['Mines.lead'] = []
coef_dict_temp['Coal.lead'] = ['Beer', 'Smoke', 'Books', 'Autos', 'Coal', 'Oil', 'Paper', 'Rtail']
coef_dict_temp['Oil.lead'] = ['Beer', 'Hlth', 'Carry']
coef_dict_temp['Util.lead'] = ['Food', 'Beer', 'Smoke', 'Hshld', 'Hlth', 'Cnstr', 'FabPr', 'Carry', 'Mines', 'Oil', 'Util', 'Telcm', 'BusEq', 'Whlsl', 'Fin', 'Other']
coef_dict_temp['Telcm.lead'] = ['Beer', 'Smoke', 'Books', 'Hshld', 'Cnstr', 'Autos', 'Carry', 'Mines', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Rtail', 'Meals', 'Fin']
coef_dict_temp['Servs.lead'] = ['Smoke', 'Books', 'Steel', 'Oil', 'Util', 'Fin']
coef_dict_temp['BusEq.lead'] = ['Smoke', 'Books', 'Util']
coef_dict_temp['Paper.lead'] = ['Clths', 'Coal', 'Oil', 'Rtail', 'Fin']
coef_dict_temp['Trans.lead'] = ['Fin']
coef_dict_temp['Whlsl.lead'] = ['Food', 'Beer', 'Smoke', 'Books', 'Hlth', 'Carry', 'Coal', 'Oil', 'Util', 'Telcm', 'Servs', 'BusEq', 'Fin', 'Other']
# Running OLS for Whlsl against ['Food', 'Smoke', 'Books', 'Carry', 'Coal', 'Oil', 'Util', 'Servs', 'Fin', 'Other']
coef_dict_temp['Rtail.lead'] = ['Rtail']
coef_dict_temp['Meals.lead'] = ['Smoke', 'Books', 'Clths', 'Steel', 'Carry', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Meals', 'Fin']
coef_dict_temp['Fin.lead'] = ['Fin']
coef_dict_temp['Other.lead'] = ['Clths', 'Fin']
coef_dict_paper = []
for response in responses:
print(response, " -> ", coef_dict_temp[response])
coef_dict_paper.append([predictor_reverse_dict[jstr] for jstr in coef_dict_temp[response]])
print(coef_dict_paper)
Food.lead -> ['Clths', 'Coal', 'Util', 'Rtail']
Beer.lead -> ['Food', 'Clths', 'Coal']
Smoke.lead -> ['Txtls', 'Carry', 'Mines', 'Coal', 'Oil', 'Util', 'Telcm', 'Servs', 'Paper', 'Trans', 'Fin']
Games.lead -> ['Books', 'Clths', 'Coal', 'Fin']
Books.lead -> ['Games', 'Books', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Rtail', 'Fin']
Hshld.lead -> ['Clths', 'Coal', 'Rtail']
Clths.lead -> ['Books', 'Clths', 'Chems', 'Steel', 'ElcEq', 'Carry', 'Coal', 'Oil', 'Util', 'Telcm', 'Servs', 'BusEq', 'Rtail']
Hlth.lead -> ['Books', 'Mines', 'Coal', 'Util']
Chems.lead -> ['Clths']
Txtls.lead -> ['Clths', 'Autos', 'Coal', 'Oil', 'Rtail', 'Fin']
Cnstr.lead -> ['Clths', 'Coal', 'Oil', 'Util', 'Trans', 'Rtail', 'Fin']
Steel.lead -> ['Fin']
FabPr.lead -> ['Trans', 'Fin']
ElcEq.lead -> ['Fin']
Autos.lead -> ['Hshld', 'Clths', 'Coal', 'Oil', 'Util', 'BusEq', 'Rtail', 'Fin']
Carry.lead -> ['Trans']
Mines.lead -> []
Coal.lead -> ['Beer', 'Smoke', 'Books', 'Autos', 'Coal', 'Oil', 'Paper', 'Rtail']
Oil.lead -> ['Beer', 'Hlth', 'Carry']
Util.lead -> ['Food', 'Beer', 'Smoke', 'Hshld', 'Hlth', 'Cnstr', 'FabPr', 'Carry', 'Mines', 'Oil', 'Util', 'Telcm', 'BusEq', 'Whlsl', 'Fin', 'Other']
Telcm.lead -> ['Beer', 'Smoke', 'Books', 'Hshld', 'Cnstr', 'Autos', 'Carry', 'Mines', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Rtail', 'Meals', 'Fin']
Servs.lead -> ['Smoke', 'Books', 'Steel', 'Oil', 'Util', 'Fin']
BusEq.lead -> ['Smoke', 'Books', 'Util']
Paper.lead -> ['Clths', 'Coal', 'Oil', 'Rtail', 'Fin']
Trans.lead -> ['Fin']
Whlsl.lead -> ['Food', 'Beer', 'Smoke', 'Books', 'Hlth', 'Carry', 'Coal', 'Oil', 'Util', 'Telcm', 'Servs', 'BusEq', 'Fin', 'Other']
Rtail.lead -> ['Rtail']
Meals.lead -> ['Smoke', 'Books', 'Clths', 'Steel', 'Carry', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Meals', 'Fin']
Fin.lead -> ['Fin']
Other.lead -> ['Clths', 'Fin']
[[6, 17, 19, 26], [0, 6, 17], [9, 15, 16, 17, 18, 19, 20, 21, 23, 24, 28], [4, 6, 17, 28], [3, 4, 17, 18, 19, 21, 22, 26, 28], [6, 17, 26], [4, 6, 8, 11, 13, 15, 17, 18, 19, 20, 21, 22, 26], [4, 16, 17, 19], [6], [6, 14, 17, 18, 26, 28], [6, 17, 18, 19, 24, 26, 28], [28], [24, 28], [28], [5, 6, 17, 18, 19, 22, 26, 28], [24], [], [1, 2, 4, 14, 17, 18, 23, 26], [1, 7, 15], [0, 1, 2, 5, 7, 10, 12, 15, 16, 18, 19, 20, 22, 25, 28, 29], [1, 2, 4, 5, 10, 14, 15, 16, 17, 18, 19, 21, 22, 26, 27, 28], [2, 4, 11, 18, 19, 28], [2, 4, 19], [6, 17, 18, 26, 28], [28], [0, 1, 2, 4, 7, 15, 17, 18, 19, 20, 21, 22, 28, 29], [26], [2, 4, 6, 11, 15, 17, 18, 19, 21, 22, 27, 28], [28], [6, 28]]
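A quick check (not part of the paper) of where our LASSO run differs from the paper's selections:
# compare our selected subsets (coef_dict) to the paper's (coef_dict_paper)
for resp, ours, theirs in zip(responses, coef_dict, coef_dict_paper):
    if set(ours) != set(theirs):
        print(resp, "->", sorted(set(ours) ^ set(theirs)))
# expect differences only for Clths.lead and Whlsl.lead
# (Mines.lead also shows up because our run fell back to all 30 predictors vs. the paper's none)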
def predict_with_subsets(X, Y, create_model, coef_dict, verbose=False):
"""evaluate subset selection, pass a model function and subsets, compute avg R-squared"""
global responses
nrows, ncols = Y.shape
model = create_model()
scores = []
for response_col in range(ncols):
y = Y[:,response_col]
# print("LASSO variables selected for %s: " % pred)
# print(coef_dict[pred])
if not coef_dict[response_col]:
if verbose:
print("No coefs selected for " + responses[response_col])
# print("---")
continue
# fit model vs. selected vars, better fit w/o LASSO penalties
# in-sample R-squared using LASSO coeffs
#print("Running model for " + pred + " against " + str(coef_dict[pred]))
# col nums of selected predictors
predcols = coef_dict[response_col]
model.fit(X[:, predcols], y)
y_pred = model.predict(X[:, predcols])
score = r2_score(y, y_pred)
scores.append(score)
if verbose:
print ("In-sample R-squared: %.2f%% for %s against %s" % (score*100, responses[response_col],
str([predictors[i] for i in coef_dict[response_col]])))
# print("---")
if verbose:
print("Mean R-squared: %.2f%%" % (100 * np.mean(np.array(scores))))
return np.mean(np.array(scores))
predict_with_subsets(X, Y, LinearRegression, coef_dict_paper, verbose=True)
In-sample R-squared: 2.24% for Food.lead against ['Clths', 'Coal', 'Util', 'Rtail']
In-sample R-squared: 2.52% for Beer.lead against ['Food', 'Clths', 'Coal']
In-sample R-squared: 6.55% for Smoke.lead against ['Txtls', 'Carry', 'Mines', 'Coal', 'Oil', 'Util', 'Telcm', 'Servs', 'Paper', 'Trans', 'Fin']
In-sample R-squared: 5.05% for Games.lead against ['Books', 'Clths', 'Coal', 'Fin']
In-sample R-squared: 6.30% for Books.lead against ['Games', 'Books', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Rtail', 'Fin']
In-sample R-squared: 2.97% for Hshld.lead against ['Clths', 'Coal', 'Rtail']
In-sample R-squared: 7.82% for Clths.lead against ['Books', 'Clths', 'Chems', 'Steel', 'ElcEq', 'Carry', 'Coal', 'Oil', 'Util', 'Telcm', 'Servs', 'BusEq', 'Rtail']
In-sample R-squared: 2.68% for Hlth.lead against ['Books', 'Mines', 'Coal', 'Util']
In-sample R-squared: 0.78% for Chems.lead against ['Clths']
In-sample R-squared: 7.91% for Txtls.lead against ['Clths', 'Autos', 'Coal', 'Oil', 'Rtail', 'Fin']
In-sample R-squared: 5.15% for Cnstr.lead against ['Clths', 'Coal', 'Oil', 'Util', 'Trans', 'Rtail', 'Fin']
In-sample R-squared: 1.28% for Steel.lead against ['Fin']
In-sample R-squared: 1.57% for FabPr.lead against ['Trans', 'Fin']
In-sample R-squared: 0.80% for ElcEq.lead against ['Fin']
In-sample R-squared: 6.13% for Autos.lead against ['Hshld', 'Clths', 'Coal', 'Oil', 'Util', 'BusEq', 'Rtail', 'Fin']
In-sample R-squared: 2.32% for Carry.lead against ['Trans']
No coefs selected for Mines.lead
In-sample R-squared: 2.84% for Coal.lead against ['Beer', 'Smoke', 'Books', 'Autos', 'Coal', 'Oil', 'Paper', 'Rtail']
In-sample R-squared: 2.52% for Oil.lead against ['Beer', 'Hlth', 'Carry']
In-sample R-squared: 7.86% for Util.lead against ['Food', 'Beer', 'Smoke', 'Hshld', 'Hlth', 'Cnstr', 'FabPr', 'Carry', 'Mines', 'Oil', 'Util', 'Telcm', 'BusEq', 'Whlsl', 'Fin', 'Other']
In-sample R-squared: 5.18% for Telcm.lead against ['Beer', 'Smoke', 'Books', 'Hshld', 'Cnstr', 'Autos', 'Carry', 'Mines', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Rtail', 'Meals', 'Fin']
In-sample R-squared: 2.87% for Servs.lead against ['Smoke', 'Books', 'Steel', 'Oil', 'Util', 'Fin']
In-sample R-squared: 2.75% for BusEq.lead against ['Smoke', 'Books', 'Util']
In-sample R-squared: 3.24% for Paper.lead against ['Clths', 'Coal', 'Oil', 'Rtail', 'Fin']
In-sample R-squared: 1.32% for Trans.lead against ['Fin']
In-sample R-squared: 7.43% for Whlsl.lead against ['Food', 'Beer', 'Smoke', 'Books', 'Hlth', 'Carry', 'Coal', 'Oil', 'Util', 'Telcm', 'Servs', 'BusEq', 'Fin', 'Other']
In-sample R-squared: 1.61% for Rtail.lead against ['Rtail']
In-sample R-squared: 7.90% for Meals.lead against ['Smoke', 'Books', 'Clths', 'Steel', 'Carry', 'Coal', 'Oil', 'Util', 'Servs', 'BusEq', 'Meals', 'Fin']
In-sample R-squared: 1.70% for Fin.lead against ['Fin']
In-sample R-squared: 2.69% for Other.lead against ['Clths', 'Fin']
Mean R-squared: 3.86%
0.038622786316912544
# use all predictors - higher in-sample R-squared (adding regressors can only raise in-sample fit, but risks overfitting)
coef_dict_all = []
for _ in responses:
coef_dict_all.append(range(len(predictors)))
predict_with_subsets(X, Y, LinearRegression, coef_dict_all, verbose=False)
0.06637486888237351
# first iteration trains on the first 121 months, 195912-196912 (rows 0:120)
# it then uses 197001 returns (row 121) to predict 197002
# the first prediction goes in P[121] (the 122nd row)
# so the first month of performance is 197002
FIRST_TRAIN_MONTHS = 121
FIRST_PREDICT_MONTH = FIRST_TRAIN_MONTHS # redundant, but keeps the indexing straight
print(X[FIRST_TRAIN_MONTHS])
print(data.iloc[FIRST_TRAIN_MONTHS][:30])
[ -3.34 -1.95 -7.59 -7.76 -12.05 -7.5 -5.69 -7.71 -7.37 -5.26 -9.84 -6.31 -7.15 -6.89 -9.35 -12.49 -2.34 -0.77 -12.16 -4.83 -3.16 -11.17 -9.73 -8.89 -8.17 -8.28 -6.31 -13.12 -9.78 -6.2 ]
Food    -3.34
Beer    -1.95
Smoke   -7.59
Games   -7.76
Books  -12.05
Hshld   -7.50
Clths   -5.69
Hlth    -7.71
Chems   -7.37
Txtls   -5.26
Cnstr   -9.84
Steel   -6.31
FabPr   -7.15
ElcEq   -6.89
Autos   -9.35
Carry  -12.49
Mines   -2.34
Coal    -0.77
Oil    -12.16
Util    -4.83
Telcm   -3.16
Servs  -11.17
BusEq   -9.73
Paper   -8.89
Trans   -8.17
Whlsl   -8.28
Rtail   -6.31
Meals  -13.12
Fin     -9.78
Other   -6.20
Name: 197001, dtype: float64
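A one-line check (mine) that the indexing lines up as the comments above claim:
assert data.index[FIRST_TRAIN_MONTHS] == 197001  # row 121 is January 1970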
class PredictWrapper():
"""Wrap an sklearn model e.g. LinearRegression to fit, predict all vars as a vector,
match the way our Keras model will do it"""
def __init__(self, create_model, coef_dict, fit_missing=None):
self.create_model = create_model
self.coef_dict = coef_dict
self.fit_missing = fit_missing
self.models = []
def fit(self, X_fit, Y_fit, verbose=False):
self.nrows, self.ycols = Y_fit.shape
self.models = []
# fit model for each column
for responsecol in range(self.ycols):
if self.coef_dict[responsecol]:
# column indexes to fit against each other
predcols = self.coef_dict[responsecol]
else: # no columns selected
if not self.fit_missing:
# default: don't fit if no predictors selected
self.models.append(None)
#print(self.models[-1])
continue
elif self.fit_missing == "mean":
# append a numeric value to use
self.models.append(np.mean(Y_fit[:,responsecol]))
#print(self.models[-1])
continue
elif self.fit_missing == "all":
# predcols = all columns
predcols = range(X_fit.shape[1])
model = self.create_model() # create 1 sklearn model for each column
model.fit(X_fit[:, predcols], Y_fit[:,responsecol])
if verbose:
print("fit on " + str(X_fit[:, predcols].shape) + str(predcols))
print(model.coef_)
self.models.append(model)
#debug
#print(responsecol)
#print(X_fit[:, predcols])
#print("=====")
#print(Y_fit[:,responsecol])
#print("=====")
#print(self.model.coef_)
#print(self.model.intercept_)
#print("=====")
def predict(self, X_predict, verbose=False):
predictions = []
nrows = X_predict.shape[0]
for responsecol in range(self.ycols):
#print (type(self.models[responsecol]))
if type(self.models[responsecol]) == np.float64:
pred = np.array([self.models[responsecol]] * nrows).reshape(nrows,1)
predictions.append(pred)
sys.stdout.write('\010#')
elif not self.models[responsecol]:
# don't predict
predictions.append(np.array([np.nan] * nrows ).reshape(nrows,1))
sys.stdout.write('\010N')
else:
predcols = self.coef_dict[responsecol]
y_pred = self.models[responsecol].predict(X_predict[:,predcols])
if verbose:
print("predict on" + str(X_predict[:, predcols].shape) + str(predcols))
print(y_pred)
predictions.append(y_pred.reshape(nrows,1))
return np.hstack(predictions)
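As a quick illustration (not part of the original flow), PredictWrapper can be exercised directly with the paper's subsets before wiring it into the backtest class below:
# fit one OLS model per industry on the first 121 months, then predict the next month
pw = PredictWrapper(LinearRegression, coef_dict_paper, fit_missing='mean')
pw.fit(X[:FIRST_TRAIN_MONTHS], Y[:FIRST_TRAIN_MONTHS])
print(pw.predict(X[FIRST_TRAIN_MONTHS:FIRST_TRAIN_MONTHS+1]).shape)  # (1, 30): one prediction per industry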
# typical pipeline
# backtestmodel = BacktestModel(X, # predictors
# Y, # responses
# create_model=LinearRegression, # create_model which returns a model (needed for 'timestep' which creates different model each timestep)
# # or model = someKerasModel, # initialized model that supports fit(X,Y), predict(X) , predicts an entire row,
# coef_dict_param=coef_dict_paper, # how to map predictors to responses ("all", "timestep", or a list of lists)
# startindex=FIRST_TRAIN_MONTHS, # initial training for backtest
# fit_missing='mean', # what to do when no predictors are selected in coef_dict_param - use all predictors, use historical mean, use np.nan
# scaler = None) # scaling function like MinMaxScaler
# backtestmodel.gen_predictions(verbose=False) # starting from startindex, step through X,Y month by month, fit up to current month, predict next month, store prediction in self.P
# backtestmodel.walkforward_xval(n_splits=5, verbose=True) # calls gen_predictions with a large step, fits and predicts one large fold at a time (useful to cross-validate quickly)
# backtestmodel.evaluate_predictions() # report metrics on prediction : MSE etc. #TODO: support custom calcs before/after/instead of default calcs
# backtestmodel.gen_returns(calc_returns, verbose=True) # takes a function that returns portfolio returns based on self.P, stores in self.R
# backtestmodel.report_returns(start_date=start_date_str, freq='M') # calc cumulative perf and report (TODO: allow it to take a reporting function to run before/after/in place of default report)
# backtestmodel.evaluate_quantiles(chart=True, verbose=True) # report quantile metrics # TODO: make this a custom calc passed into evaluate_predictions
class BacktestModel():
def __init__(self,
X, # predictors
Y, # responses
model=None, # model that supports fit(X,Y), predict(X) , predicts an entire row,
create_model=None, # or create_model which returns a model (needed for 'timestep' but slows down so pass model if dynamic not needed)
coef_dict_param="all", # mapping of predictors to responses ("all", "timestep", or a list of lists)
startindex=FIRST_TRAIN_MONTHS,
scaler=None,
fit_missing=None):
self.Xrows, self.Xcols = X.shape
self.Yrows, self.Ycols = Y.shape
if self.Xrows != self.Yrows:
            raise ValueError("Shapes differ: X %s, Y %s" % (str(X.shape), str(Y.shape)))
self.X = X
self.Y = Y
self.Xscale = X.copy()
self.Yscale = Y.copy()
if scaler:
print("scaler: %s " %str(scaler))
# by rows
# MinMaxScaler: each row (min->0, max->1)
# StandardScaler: each row (mean->0, SD->1)
# self.Xscale = scaler().fit_transform(self.Xscale.transpose()).transpose()
# self.Yscale = scaler().fit_transform(self.Yscale.transpose()).transpose()
# by cols
# MinMaxScaler: each col (min->0, max->1)
# StandardScaler: each col (mean->0, SD->1)
self.Xscale = scaler().fit_transform(self.Xscale)
self.Yscale = scaler().fit_transform(self.Yscale)
self.model = model
self.create_model = create_model
self.coef_dict_param = coef_dict_param
self.startindex = startindex
self.fit_missing = fit_missing
def fit_predict(self, ntrain, npredict=1, verbose=False):
"""for backtest, train model using Y v. X
train on first ntrain rows. if ntrain=121, fit 0:120
predict following npredict rows
if npredict=1, predict row 121
if npredict=12, predict rows 121-132
"""
# fit first ntrain rows
X_fit = self.Xscale[:ntrain] # e.g. 0:120
Y_fit = self.Yscale[:ntrain]
# predict npredict rows
X_predict = self.Xscale[ntrain:ntrain+npredict] # 121-122
X_predict = X_predict.reshape(npredict,self.Xcols)
# if no coef_dict select predictors into coef_dict
if self.coef_dict_param == "timestep":
msg = "Performing subset selection"
coef_dict = subset_selection(X_fit, Y_fit, LassoLarsIC(criterion='aic'))
# if coef_dict == "all" use all predictors for each response
elif self.coef_dict_param == 'all':
msg = "Using all predictors"
            coef_dict = [list(range(self.Xcols)) for _ in range(self.Ycols)]
else: # should check valid dict
msg = "Using coef_dict predictors"
coef_dict = self.coef_dict_param
if verbose:
print(msg)
if self.create_model:
self.model = PredictWrapper(self.create_model, coef_dict, fit_missing=self.fit_missing)
self.model.fit(X_fit, Y_fit, verbose=verbose)
return self.model.predict(X_predict, verbose=verbose)
# predict all months
# initial train_months = 120 -> train first model on 120 rows
# first prediction will be in P[120] (121st row)
# step = 6 -> predict following 6 rows, then step forward 6 months at a time
# initialize predictions matrix self.P
# use either step or folds
# step, do range(self.startindex, nrows, step)
# folds, at each fold train 0:startfold, predict startfold+1:endfold
# store only out-of-sample predictions in P, calc out-of-sample MSE
# using a step > 1 or folds is quicker, for quicker xval, or to speed up by not estimating model at each timestep
def gen_predictions(self,
step=1,
splits=None,
verbose=False):
self.P = np.zeros_like(self.Y)
progress_i = 0
        self.nrows, self.ycols = self.Y.shape
if splits:
month_indexes = splits[:-1] # last index is nrows
else:
# create list of steps
            month_indexes = list(range(self.startindex, self.nrows, step))
steps = [month_indexes[i+1]-month_indexes[i] for i in range(len(month_indexes)-1)]
# last step -> end
steps.append(self.nrows - month_indexes[-1])
if verbose:
print ("Steps: " + str(month_indexes))
for month_index, forecast_rows in zip(month_indexes, steps):
if verbose:
print("Training on first %d rows (%d:%d), putting predictions in rows %s" % (month_index,
0, month_index-1,
str(range(month_index,month_index+forecast_rows))))
predictions = self.fit_predict(month_index, forecast_rows, verbose=False)
first_pred_row = month_index
for row_index in range(forecast_rows):
self.P[first_pred_row+row_index] = predictions[row_index]
sys.stdout.write('.')
progress_i += 1
if progress_i % 80 == 0:
print("")
print("%s Still training step %d of %d" % (time.strftime("%H:%M:%S"), progress_i, len(month_indexes)))
sys.stdout.flush()
print("")
def evaluate_predictions(self):
# evaluate prediction (can move to separate function)
msetemp = (self.P[self.startindex:]-self.Yscale[self.startindex:])**2
#remove nans
msetemp = msetemp[~np.isnan(msetemp)]
self.mse = np.mean(msetemp)
print("OOS MSE across all predictions: %.4f" % self.mse)
self.model.fit(self.Xscale, self.Yscale)
Y_pred = self.model.predict(self.Xscale)
self.in_sample_mse = np.mean((Y_pred - self.Yscale) ** 2)
print("In-sample MSE: %.4f" % self.in_sample_mse)
# force unpredicted ys to be nans, then remove nans
vartemp = self.Yscale[self.startindex:] - self.P[self.startindex:] + self.P[self.startindex:]
vartemp = vartemp[~np.isnan(vartemp)]
        y_variance = np.var(vartemp)  # vartemp is already sliced from startindex and flattened
print("Variance: %.4f" % (y_variance))
print("R-squared: %.4f" % (1- self.mse/y_variance))
return(self.mse)
def evaluate_quantiles(self, chart=False, verbose=False):
self.P_quantiles = np.zeros_like(self.P)
self.Y_quantiles = np.zeros_like(self.Y)
self.kendalltaus = []
self.ktpvals = []
# compute score for predicted quantiles vs. actual (expected) quantiles
N_QUANTILES=5
for row in range(self.startindex, self.P_quantiles.shape[0]):
#print(self.P[row])
            self.P_quantiles[row] = pd.qcut(self.P[row], N_QUANTILES, labels=range(N_QUANTILES))
            self.Y_quantiles[row] = pd.qcut(self.Y[row], N_QUANTILES, labels=range(N_QUANTILES))
kt, p_val = kendalltau(self.P[row], self.Y[row])
self.kendalltaus.append(kt)
self.ktpvals.append(p_val)
self.kendalltau = np.mean(self.kendalltaus)
self.kendalltau_pvalue = np.mean(self.ktpvals)
print("Avg rank correlation (Kendall's tau): %.4f (Expected: 0)" % (self.kendalltau))
pred_quantiles = self.P_quantiles[self.startindex:]
true_quantiles = self.Y_quantiles[self.startindex:]
nrows, ncols = pred_quantiles.shape
pred_quantiles = pred_quantiles.reshape(nrows*ncols)
true_quantiles = true_quantiles.reshape(nrows*ncols)
self.quintile_accuracy = accuracy_score(pred_quantiles, true_quantiles)
print("5-quintile accuracy: %.4f (Expected: 0.2)" % (self.quintile_accuracy))
pred_direction = np.zeros(nrows*ncols)
true_direction = np.zeros(nrows*ncols)
for i in range(nrows*ncols):
if pred_quantiles[i] == 4:
pred_direction[i] = 1
elif pred_quantiles[i] == 0:
pred_direction[i] = -1
if true_quantiles[i] == 4:
true_direction[i] = 1
elif true_quantiles[i] == 0:
true_direction[i] = -1
self.directional_accuracy = accuracy_score(pred_direction, true_direction)
print("Long/short/flat accuracy: %.4f (Expected: 0.44)" % (self.directional_accuracy))
nrows = nrows * ncols
conf_mat_expected = np.array([[0.64, 0.16],[0.16, 0.04]])*nrows
myscores = []
for q in range(5):
temp_pred = pred_quantiles == q
temp_actual = true_quantiles == q
conf_mat5 = confusion_matrix(temp_pred, temp_actual)
diff_mat = conf_mat5 - conf_mat_expected
if verbose:
print("Confusion matrix for quantile %d" % q)
print(conf_mat5)
cstmp, cspvtmp = chisquare(conf_mat5.reshape(4), conf_mat_expected.reshape(4))
print("Chi-square: %.4f (p-value: %.8f)" % (cstmp, cspvtmp))
# probably no valid statistical interpretation but
# average of improvement in true positive % and true negative %
myscore = diff_mat[1][1]
myscores.append(myscore)
# sum of true positive for top and bottom quintiles
self.excess_tp = myscores[0] + myscores[4]
print("Excess true positive in quintiles 1 + 5: %f" % (self.excess_tp))
        conf_mat = confusion_matrix(true_quantiles, pred_quantiles)  # rows = actual, cols = predicted, matching the axis labels below
if chart:
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(conf_mat, annot=True, fmt='d')
plt.ylabel('Actual Quintile')
plt.xlabel('Predicted Quintile')
plt.show()
return None
    def walkforward_xval(self, n_splits=5, verbose=False):
        """quick and dirty walk-forward cross-validation: calls gen_predictions one large fold at a time"""
        # generate k-folds
        kf = KFold(n_splits=n_splits)
        kf.get_n_splits(self.X)
        last_indexes = []
        for train_index, test_index in kf.split(self.X):
# use test_index as last index to train
last_index = test_index[-1] + 1
last_indexes.append(last_index)
print("%s Generate splits %s" % (time.strftime("%H:%M:%S"), str([i for i in last_indexes])))
#override startindex
self.startindex = last_indexes[0]
return self.gen_predictions(splits=last_indexes, verbose=verbose)
def gen_returns(self, port_returns_func, verbose=False):
self.R = np.zeros(self.P.shape[0])
first_pred_month=self.startindex
indcount = [0] * self.ycols
longcount = [0] * self.ycols
shortcount = [0] * self.ycols
        for month_index in range(first_pred_month, self.nrows-1):
return_month = month_index + 1
port_return, long_indexes, short_indexes = port_returns_func(self.P[month_index],
self.X[return_month])
self.R[return_month] = port_return
for i in long_indexes:
indcount[i] += 1
longcount[i] += 1
for i in short_indexes:
indcount[i] += 1
shortcount[i] += 1
if verbose:
for i in range(len(responses)):
print("%s: long %d times, short %d times, total %d times" % (predictors[i],
longcount[i],
shortcount[i],
indcount[i]))
return self.R
def report_returns(self, start_date='01/01/1970', freq='M'):
first_pred_month=self.startindex
self.results = self.R[first_pred_month:]
nmonths = self.results.shape[0]
nyears = nmonths/12.0
index = pd.date_range(start_date,periods=nmonths, freq=freq)
perfdata = pd.DataFrame(self.results,index=index,columns=['Returns'])
perfdata['Equity'] = 100 * np.cumprod(1 + self.results / 100)
self.cumulative_return = perfdata['Equity']
self.mean_monthly_return_annualized = np.mean(1 + self.results/100) ** 12 - 1
        self.mean_return = (self.cumulative_return.iloc[-1]/100) ** (1.0/nyears) - 1
self.annualized_vol = np.std(self.results/100) * np.sqrt(12.0)
self.sharpe = self.mean_monthly_return_annualized/self.annualized_vol
print("Mean return: %.3f%%" % (self.mean_return * 100 ))
#print("Mean monthly annualized return: %.3f%%" % (self.mean_monthly_return_annualized * 100 ))
#print("Monthly annualized volatility: %.3f%%" % (self.annualized_vol * 100))
print("Monthly Sharpe ratio: %.3f" % (self.sharpe))
# return calculation passed to gen_returns
NUMSTOCKS = 6 # top quintile (and bottom)
def calc_returns(prediction_row, return_row, numstocks=NUMSTOCKS, verbose=False):
# ensure nan sorts to top for shorts
short_sort_array = [999999 if np.isnan(x) else x for x in prediction_row]
# pick bottom numstocks
select_array = np.argsort(short_sort_array)
short_indexes = select_array[:numstocks]
# ensure nan sorts to bottom for longs
long_sort_array = [-999999 if np.isnan(x) else x for x in prediction_row]
# pick top numstocks
select_array = np.argsort(long_sort_array)
long_indexes = select_array[-numstocks:]
if verbose:
print("Longs: %s" %(str([(i,prediction_row[i]) for i in long_indexes])))
print("Shorts: %s" %(str([(i,prediction_row[i]) for i in short_indexes])))
# compute equal weighted long/short return
return np.mean(return_row[long_indexes])/2 - np.mean(return_row[short_indexes])/2, long_indexes, short_indexes
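A small sanity check of calc_returns on made-up numbers (illustrative values only):
# two longs and two shorts out of six industries; NaN predictions are pushed out of both sides
toy_pred = np.array([0.5, -1.0, 2.0, np.nan, 1.5, -2.0])
toy_ret = np.array([1.0, -1.0, 3.0, 0.0, 2.0, -3.0])
ret, longs, shorts = calc_returns(toy_pred, toy_ret, numstocks=2, verbose=True)
print(ret)  # mean(top-2 returns)/2 - mean(bottom-2 returns)/2 = 2.25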
start_date_int = data.index[FIRST_TRAIN_MONTHS]
start_year, start_month = start_date_int // 100, start_date_int % 100
start_date_str = "%02d/%02d/%d" % (start_month, 1, start_year)
start_date_str
'01/01/1970'
# test fit_predict
backtestmodel = BacktestModel(X, Y, create_model=LinearRegression, coef_dict_param=coef_dict_all, startindex=FIRST_TRAIN_MONTHS)
print(backtestmodel.fit_predict(121, npredict=3, verbose=False))
[[ -4.70219919 -5.3073358 -3.72029191 -12.42448745 -6.62775756 -2.90316148 -6.79960146 -2.23172984 -5.75101181 -6.66276641 -5.75265368 -7.71979359 -6.57864566 -4.32124795 -5.0355927 -5.92230241 -4.76675177 -1.96315925 -3.43369206 -2.30343164 -3.72794445 -2.91917105 -5.27753906 -6.40474001 -6.85223288 -10.16508667 -4.69016977 -8.8042513 -5.4481542 -6.40122815]
 [ -1.09819574 0.33751522 1.47276575 -7.65291563 -0.40550754 -4.96855817 -1.82873168 -3.4663124 -2.05857985 -0.68158685 -1.54307573 -2.74859589 -2.04412571 -1.77673237 -3.39612781 -4.76047089 -0.65055072 -5.08513341 1.95082965 -2.60578312 -5.09137205 -1.72203771 -3.31591182 -1.94755286 -1.38046013 -2.75841651 -0.96353404 -6.81674253 1.34059086 -1.59466739]
 [ -1.96144894 -0.87682739 -3.02833064 0.58137733 -0.02056567 1.67143038 -0.44895783 -1.70395859 -0.63283007 -0.4731534 -0.71228211 -3.61106494 -1.93572984 -0.64937617 -1.09751999 0.24726352 -0.37303973 -1.04141371 2.1679461 -0.55736749 0.76975122 0.67561178 1.62692312 -1.12079845 -1.07373008 -0.27905898 1.55099516 -3.06722504 -0.32196075 0.5859896 ]]
# test the model using full-period LASSO-selected predictors (coef_dict_paper)
# NOTE: bad methodology, introduces look-ahead snooping:
# - LASSO variable selection used the entire timespan
# - the selected vars are then used for an OLS backtest over that same period
# included as an example of what not to do
backtestmodel = BacktestModel(X, Y,
                              create_model=LinearRegression,
                              coef_dict_param=coef_dict_paper,
                              startindex=FIRST_TRAIN_MONTHS,
                              fit_missing='mean',
                              scaler=None)
backtestmodel.gen_predictions(verbose=False)
backtestmodel.gen_returns(calc_returns, verbose=True)
backtestmodel.report_returns(start_date=start_date_str, freq='M')
backtestmodel.evaluate_predictions()
backtestmodel.evaluate_quantiles(chart=True, verbose=True)
(training progress output elided)
Food: long 93 times, short 37 times, total 130 times
Beer: long 122 times, short 94 times, total 216 times
Smoke: long 211 times, short 112 times, total 323 times
Games: long 147 times, short 125 times, total 272 times
Books: long 126 times, short 107 times, total 233 times
Hshld: long 81 times, short 72 times, total 153 times
Clths: long 195 times, short 172 times, total 367 times
Hlth: long 126 times, short 110 times, total 236 times
Chems: long 26 times, short 125 times, total 151 times
Txtls: long 160 times, short 128 times, total 288 times
Cnstr: long 92 times, short 124 times, total 216 times
Steel: long 4 times, short 180 times, total 184 times
FabPr: long 33 times, short 62 times, total 95 times
ElcEq: long 45 times, short 16 times, total 61 times
Autos: long 87 times, short 117 times, total 204 times
Carry: long 106 times, short 76 times, total 182 times
Mines: long 132 times, short 102 times, total 234 times
Coal: long 226 times, short 183 times, total 409 times
Oil: long 160 times, short 137 times, total 297 times
Util: long 161 times, short 182 times, total 343 times
Telcm: long 118 times, short 163 times, total 281 times
Servs: long 149 times, short 137 times, total 286 times
BusEq: long 106 times, short 123 times, total 229 times
Paper: long 50 times, short 107 times, total 157 times
Trans: long 39 times, short 56 times, total 95 times
Whlsl: long 168 times, short 147 times, total 315 times
Rtail: long 72 times, short 62 times, total 134 times
Meals: long 220 times, short 178 times, total 398 times
Fin: long 56 times, short 21 times, total 77 times
Other: long 61 times, short 117 times, total 178 times
Mean return: 6.667%
Monthly Sharpe ratio: 1.123
OOS MSE across all predictions: 39.4986
In-sample MSE: 35.8279
Variance: 39.4097
R-squared: -0.0023
Avg rank correlation (Kendall's tau): 0.0589 (Expected: 0)
5-quintile accuracy: 0.2287 (Expected: 0.2)
Long/short/flat accuracy: 0.4706 (Expected: 0.44)
Confusion matrix for quantile 0 [[10953 2559] [2555 823]]  Chi-square: 49.7107 (p-value: 0.00000000)
Confusion matrix for quantile 1 [[10884 2628] [2628 750]]  Chi-square: 12.8020 (p-value: 0.00508507)
Confusion matrix for quantile 2 [[10850 2662] [2656 722]]  Chi-square: 4.7384 (p-value: 0.19198761)
Confusion matrix for quantile 3 [[10851 2661] [2666 712]]  Chi-square: 3.2442 (p-value: 0.35547827)
Confusion matrix for quantile 4 [[10994 2518] [2523 855]]  Chi-square: 75.2761 (p-value: 0.00000000)
Excess true positive in quintiles 1 + 5: 326.800000
# chart performance
def mychart(args, names=None, title=""):
    x_coords = np.linspace(1970, 2016, args[0].shape[0])
    plotdata = []
    for i in range(len(args)):
        tracelabel = "Trace %d" % i
        if names:
            tracelabel = names[i]
        plotdata.append(Scatter(x=x_coords,
                                y=args[i].values.reshape(-1),
                                mode='lines',  # 'lines' is the valid plotly mode (was 'line')
                                name=tracelabel))
    layout = Layout(
        title=title,
        autosize=False,
        width=900,
        height=600,
        yaxis=dict(
            type='log',
            autorange=True
        )
    )
    fig = Figure(data=plotdata, layout=layout)
    return iplot(fig)
def myscatter(arg1, arg2, names=None, title=""):
    plotdata = []
    plotdata.append(Scatter(
        x=arg1,
        y=arg2,
        mode='markers'
    ))
    layout = dict(
        title=title,
        autosize=False,
        width=600,
        height=480,
        yaxis=dict(
            # type='log',
            autorange=True
        )
    )
    fig = Figure(data=plotdata, layout=layout)
    return iplot(fig)
def plot_matrix(lossframe, x_labels, y_labels, x_suffix="", y_suffix=""):
    """Plot a heat map of an 'mse' metric pivoted from a long-format DataFrame."""
    pivot = lossframe.pivot_table(index=[y_labels], columns=[x_labels], values=['mse'])
    # specify labels as strings, to force plotly to use a discrete axis
    if lossframe[x_labels].dtype == np.float64 or lossframe[x_labels].dtype == np.float32:
        xaxis = ["%f %s" % (i, x_suffix) for i in pivot.columns.levels[1].values]
    else:
        xaxis = ["%d %s" % (i, x_suffix) for i in pivot.columns.levels[1].values]
    if lossframe[y_labels].dtype == np.float64 or lossframe[y_labels].dtype == np.float32:
        yaxis = ["%f %s" % (i, y_suffix) for i in pivot.index.values]
    else:
        yaxis = ["%d %s" % (i, y_suffix) for i in pivot.index.values]
    chart_width = 640
    chart_height = 480
    layout = dict(
        title="%s v. %s" % (x_labels, y_labels),
        height=chart_height,
        width=chart_width,
        margin=dict(
            l=150,
            r=30,
            b=120,
            t=100,
        ),
        xaxis=dict(
            title=x_labels,
            tickfont=dict(
                family='Arial, sans-serif',
                size=10,
                color='black'
            ),
        ),
        yaxis=dict(
            title=y_labels,
            tickfont=dict(
                family='Arial, sans-serif',
                size=10,
                color='black'
            ),
        ),
    )
    data = [Heatmap(z=pivot.values,
                    x=xaxis,
                    y=yaxis,
                    colorscale=[[0, 'rgb(0,0,255)'], [1, 'rgb(255,0,0)']],
                    )
            ]
    fig = Figure(data=data, layout=layout)
    return iplot(fig, link_text="")
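plot_matrix expects a long-format frame with one row per parameter combination and an 'mse' column to pivot on. A hypothetical usage sketch (the column names and values here are made up for illustration):
# hypothetical usage: pivot a grid of (units, learning_rate) -> mse into a heatmap
demo_loss = pd.DataFrame([(units, lr, np.random.rand())
                          for units in (8, 16, 32)
                          for lr in (0.001, 0.01, 0.1)],
                         columns=['units', 'learning_rate', 'mse'])
plot_matrix(demo_loss, x_labels='learning_rate', y_labels='units')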
perf_post_LASSO = backtestmodel.cumulative_return
mychart([perf_post_LASSO],["Post-LASSO"], title="Post-LASSO")
# do LASSO subset selection at each timestep (walk-forward, no look-ahead)
backtestmodel = BacktestModel(X, Y,
                              create_model=LinearRegression,
                              coef_dict_param="timestep",
                              startindex=FIRST_TRAIN_MONTHS,
                              fit_missing='mean',
                              scaler=None)
backtestmodel.gen_predictions(verbose=False)
backtestmodel.gen_returns(calc_returns, verbose=False)
backtestmodel.report_returns(start_date=start_date_str, freq='M')
backtestmodel.evaluate_predictions()
backtestmodel.evaluate_quantiles(chart=True, verbose=True)
perf_LASSO_each_timestep = backtestmodel.cumulative_return
mychart([perf_LASSO_each_timestep],["LASSO each timestep"], title="LASSO each timestep")
(training progress output elided)
Mean return: 3.545%
Monthly Sharpe ratio: 0.674
OOS MSE across all predictions: 41.4734
In-sample MSE: 35.7675
Variance: 39.4097
R-squared: -0.0524
Avg rank correlation (Kendall's tau): 0.0331 (Expected: 0)
5-quintile accuracy: 0.2163 (Expected: 0.2)
Long/short/flat accuracy: 0.4584 (Expected: 0.44)
Confusion matrix for quantile 0 [[10910 2602] [2598 780]]  Chi-square: 24.8287 (p-value: 0.00001677)
Confusion matrix for quantile 1 [[10840 2672] [2672 706]]  Chi-square: 2.1374 (p-value: 0.54439176)
Confusion matrix for quantile 2 [[10831 2681] [2675 703]]  Chi-square: 1.6009 (p-value: 0.65918763)
Confusion matrix for quantile 3 [[10827 2685] [2690 688]]  Chi-square: 0.4245 (p-value: 0.93512982)
Confusion matrix for quantile 4 [[10916 2596] [2601 777]]  Chi-square: 24.2603 (p-value: 0.00002204)
Excess true positive in quintiles 1 + 5: 205.800000
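Since mychart accepts a list of traces, the snooped post-LASSO curve and the honest walk-forward curve can be overlaid for a direct comparison:
# overlay the two equity curves computed so far (both are Series on the same monthly index)
mychart([perf_post_LASSO, perf_LASSO_each_timestep],
        ["Post-LASSO", "LASSO each timestep"],
        title="Post-LASSO vs. LASSO each timestep")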
# use all predictors at each timestep
backtestmodel = BacktestModel(X, Y,
create_model=LinearRegression,
coef_dict_param="all",
startindex=FIRST_TRAIN_MONTHS,
fit_missing='mean',
scaler=None)
backtestmodel.gen_predictions(verbose=False)
backtestmodel.gen_returns(calc_returns, verbose=False)
backtestmodel.report_returns(start_date=start_date_str, freq='M')
backtestmodel.evaluate_predictions()
backtestmodel.evaluate_quantiles(chart=True, verbose=False)
perf_all_preds = backtestmodel.cumulative_return
mychart([perf_all_preds],["All preds"], title="OLS all predictors")
(training progress output elided)
Mean return: 2.781%
Monthly Sharpe ratio: 0.506
OOS MSE across all predictions: 43.8692
In-sample MSE: 34.7467
Variance: 39.4097
R-squared: -0.1132
Avg rank correlation (Kendall's tau): 0.0246 (Expected: 0)
5-quintile accuracy: 0.2161 (Expected: 0.2)
Long/short/flat accuracy: 0.4657 (Expected: 0.44)
Excess true positive in quintiles 1 + 5: 225.800000
# pure LASSO (not LASSO followed by OLS on the selected subset)
def create_model_lasso():
    return LassoLarsIC(criterion='aic')
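sklearn's LassoLarsIC supports only 'aic' and 'bic'; there is no small-sample corrected AICc option. If that gap matters, AICc can be derived from AIC with the standard correction term; a sketch (the function name and arguments here are mine, not part of sklearn):
# AICc = AIC + 2k(k+1)/(n-k-1), with k = number of estimated parameters
# and n = number of observations; valid for n > k + 1
def aicc_from_aic(aic, k, n):
    return aic + 2.0 * k * (k + 1) / (n - k - 1)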
# offer all predictors at each timestep; LassoLarsIC performs its own selection
backtestmodel = BacktestModel(X, Y,
create_model=create_model_lasso,
coef_dict_param="all",
startindex=FIRST_TRAIN_MONTHS,
fit_missing='mean',
scaler=None)
backtestmodel.gen_predictions(verbose=False)
backtestmodel.gen_returns(calc_returns, verbose=False)
backtestmodel.report_returns(start_date=start_date_str, freq='M')
backtestmodel.evaluate_predictions()
backtestmodel.evaluate_quantiles(chart=True, verbose=False)
perf_LASSO_only = backtestmodel.cumulative_return
mychart([perf_LASSO_only],["LASSO only"], title="LASSO only")
(training progress output elided)
Mean return: 3.077%
Monthly Sharpe ratio: 0.553
OOS MSE across all predictions: 40.0227
In-sample MSE: 36.1107
Variance: 39.4097
R-squared: -0.0156
Avg rank correlation (Kendall's tau): 0.0331 (Expected: 0)
5-quintile accuracy: 0.2187 (Expected: 0.2)
Long/short/flat accuracy: 0.4599 (Expected: 0.44)
Excess true positive in quintiles 1 + 5: 198.800000
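All four cumulative-return series are now available, so they can be overlaid on one log-scale equity chart (a convenience cell; each curve was already charted individually above):
# compare all four backtests on one chart
mychart([perf_post_LASSO, perf_LASSO_each_timestep, perf_all_preds, perf_LASSO_only],
        ["Post-LASSO", "LASSO each timestep", "All predictors", "LASSO only"],
        title="Cumulative returns: all backtests")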