In this case study, similar to Case Study 1 of this chapter, we will use a reinforcement learning model to come up with a policy for optimal portfolio allocation among a set of cryptocurrencies.
In the reinforcement learning-based framework defined for this problem, the algorithm determines the optimal portfolio allocation depending upon the current state of the portfolio of instruments.
The algorithm is trained using the deep Q-network (DQN) framework, and the components of the reinforcement learning setup are:
Agent: Portfolio manager, robo-advisor, or an individual investor.
Action: Assignment and rebalancing of the portfolio weights. The DQN model provides Q-values, which are converted into portfolio weights (see the sketch after this list).
Reward function: Sharpe ratio, which incorporates the standard deviation as the risk measure, is used as the reward function.
State: The state is the covariance matrix of the instrument returns over a specific time window. The covariance matrix is a suitable state variable for portfolio allocation, as it contains information about the relationships between the different instruments.
Environment: Cryptocurrency exchange.
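To make the action component concrete, below is a minimal sketch of how per-asset Q-values over the three actions (sit, buy, sell) can be mapped to long-only portfolio weights. The q_values array is hypothetical illustrative data; the actual conversion used in this case study is implemented later in the Agent class.
import numpy as np

#hypothetical Q-values for three assets over the actions sit, buy, sell
q_values = np.array([
    [0.1, 0.8, 0.2],   # asset 0: "buy" has the largest Q-value
    [0.5, 0.1, 0.3],   # asset 1: "sit" has the largest Q-value
    [0.2, 0.1, 0.9],   # asset 2: "sell" has the largest Q-value
])

best_action = q_values.argmax(axis=1)  # 0 = sit, 1 = buy, 2 = sell
raw_weights = np.where(best_action == 1, q_values[:, 1],         # buy  -> positive weight
              np.where(best_action == 2, -q_values[:, 2], 0.0))  # sell -> negative weight

#long-only portfolio: shift the weights to be non-negative and normalize to sum to one
raw_weights = raw_weights + np.abs(raw_weights.min())
weights = raw_weights / raw_weights.sum()
print(np.round(weights, 3))  # [0.654 0.346 0.   ]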
The cryptocurrency data used in this case study is obtained from the Kaggle platform and contains the daily prices of cryptocurrencies during 2018. The data includes some of the most liquid cryptocurrencies, such as Bitcoin, Ethereum, Ripple, Litecoin, and Dash.
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import datetime
import math
from numpy.random import choice
import random
from collections import deque

from keras.layers import Input, Dense, Flatten, Dropout
from keras.models import Model
from keras.regularizers import l2

#Disable the warnings
import warnings
warnings.filterwarnings('ignore')
#The data obtained from the Kaggle platform is imported.
dataset = read_csv('data/crypto_portfolio.csv',index_col=0)
# shape
dataset.shape
(375, 15)
# peek at data
set_option('display.width', 100)
dataset.head(5)
Date | ADA | BCH | BNB | BTC | DASH | EOS | ETH | IOT | LINK | LTC | TRX | USDT | XLM | XMR | XRP
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2018-01-01 | 0.7022 | 2319.120117 | 8.480 | 13444.879883 | 1019.419983 | 7.64 | 756.200012 | 3.90 | 0.7199 | 224.339996 | 0.05078 | 1.01 | 0.4840 | 338.170013 | 2.05
2018-01-02 | 0.7620 | 2555.489990 | 8.749 | 14754.129883 | 1162.469971 | 8.30 | 861.969971 | 3.98 | 0.6650 | 251.809998 | 0.07834 | 1.02 | 0.5560 | 364.440002 | 2.19
2018-01-03 | 1.1000 | 2557.520020 | 9.488 | 15156.620117 | 1129.890015 | 9.43 | 941.099976 | 4.13 | 0.6790 | 244.630005 | 0.09430 | 1.01 | 0.8848 | 385.820007 | 2.73
2018-01-04 | 1.1300 | 2355.780029 | 9.143 | 15180.080078 | 1120.119995 | 9.47 | 944.830017 | 4.10 | 0.9694 | 238.300003 | 0.21010 | 1.02 | 0.6950 | 372.230011 | 2.73
2018-01-05 | 1.0100 | 2390.040039 | 14.850 | 16954.779297 | 1080.880005 | 9.29 | 967.130005 | 3.76 | 0.9669 | 244.509995 | 0.22400 | 1.01 | 0.6400 | 357.299988 | 2.51
The data is the historical daily price data of several cryptocurrencies.
We introduce a simulation environment class, “CryptoEnvironment”, in which we create a working environment for cryptocurrencies. This class has the following key functions:
get_state: Returns the state, i.e., the covariance matrix of the instruments based on a lookback period. The function can also return the historical returns or the raw historical data as the state, depending on the is_cov_matrix or is_raw_time_series flag.
get_reward: Returns the reward, i.e., the Sharpe ratio of the portfolio, given the portfolio weights and the lookback period.
import numpy as np
import pandas as pd
from IPython.core.debugger import set_trace
#helper function: annualized return, volatility and Sharpe ratio of a portfolio
def portfolio(returns, weights):
    weights = np.array(weights)
    rets = returns.mean() * 252   # annualized mean returns
    covs = returns.cov() * 252    # annualized covariance matrix
    P_ret = np.sum(rets * weights)                              # portfolio return
    P_vol = np.sqrt(np.dot(weights.T, np.dot(covs, weights)))   # portfolio volatility
    P_sharpe = P_ret / P_vol                                    # portfolio Sharpe ratio
    return np.array([P_ret, P_vol, P_sharpe])
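#Illustrative check (not part of the original case study workflow): annualized
#return, volatility and Sharpe ratio of an equal-weighted portfolio over the
#full sample, using the `dataset` DataFrame loaded earlier.
daily_returns = dataset.pct_change().dropna()
equal_weights = np.ones(len(dataset.columns)) / len(dataset.columns)
p_ret, p_vol, p_sharpe = portfolio(daily_returns, equal_weights)
print('return: %.4f, volatility: %.4f, sharpe: %.4f' % (p_ret, p_vol, p_sharpe))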
class CryptoEnvironment:
def __init__(self, prices = './data/crypto_portfolio.csv', capital = 1e6):
self.prices = prices
self.capital = capital
self.data = self.load_data()
def load_data(self):
data = pd.read_csv(self.prices)
        try:
            data.index = data['Date']
            data = data.drop(columns = ['Date'])
        except KeyError:
            data.index = data['date']
            data = data.drop(columns = ['date'])
return data
def preprocess_state(self, state):
return state
def get_state(self, t, lookback, is_cov_matrix = True, is_raw_time_series = False):
assert lookback <= t
decision_making_state = self.data.iloc[t-lookback:t]
decision_making_state = decision_making_state.pct_change().dropna()
#set_trace()
if is_cov_matrix:
x = decision_making_state.cov()
return x
else:
if is_raw_time_series:
decision_making_state = self.data.iloc[t-lookback:t]
return self.preprocess_state(decision_making_state)
    def get_reward(self, action, action_t, reward_t, alpha = 0.01):

        def local_portfolio(returns, weights):
            weights = np.array(weights)
            rets = returns.mean() # * 252
            covs = returns.cov() # * 252
            P_ret = np.sum(rets * weights)
            P_vol = np.sqrt(np.dot(weights.T, np.dot(covs, weights)))
            P_sharpe = P_ret / P_vol
            return np.array([P_ret, P_vol, P_sharpe])

        #returns between the action date and the reward date
        data_period = self.data[action_t:reward_t]
        weights = action
        returns = data_period.pct_change().dropna()

        #Sharpe ratio of the portfolio over the period, broadcast to one value per asset
        sharpe = local_portfolio(returns, weights)[-1]
        sharpe = np.array([sharpe] * len(self.data.columns))

        #weighted daily returns of the portfolio and the Sharpe ratio reward
        return np.dot(returns, weights), sharpe
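As a quick, illustrative sanity check of the environment interface (the names env_demo and n_assets are introduced only for this example and are not used in the rest of the case study), we can query the state and the reward directly, along these lines:
env_demo = CryptoEnvironment()
n_assets = len(env_demo.data.columns)

#covariance-matrix state at t = 180 with a 60-day lookback
state = env_demo.get_state(180, 60)

#reward for holding an equal-weighted portfolio between t = 90 and t = 180
weighted_rets, reward = env_demo.get_reward(np.ones(n_assets) / n_assets, 90, 180)
print(state.shape, reward[0])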
In this section, we will train an agent that performs the reinforcement learning based on the deep Q-network. The Agent class below implements the following steps: building the Q-network, converting its predictions into portfolio weights, selecting actions with an epsilon-greedy policy, and replaying stored experience to update the network.
class Agent:
def __init__(
self,
portfolio_size,
is_eval = False,
allow_short = True,
):
self.portfolio_size = portfolio_size
self.allow_short = allow_short
self.input_shape = (portfolio_size, portfolio_size, )
self.action_size = 3 # sit, buy, sell
self.memory4replay = []
self.is_eval = is_eval
self.alpha = 0.5
self.gamma = 0.95
self.epsilon = 1
self.epsilon_min = 0.01
self.epsilon_decay = 0.99
self.model = self._model()
def _model(self):
inputs = Input(shape=self.input_shape)
x = Flatten()(inputs)
x = Dense(100, activation='elu')(x)
x = Dropout(0.5)(x)
x = Dense(50, activation='elu')(x)
x = Dropout(0.5)(x)
predictions = []
for i in range(self.portfolio_size):
asset_dense = Dense(self.action_size, activation='linear')(x)
predictions.append(asset_dense)
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer='adam', loss='mse')
return model
    def nn_pred_to_weights(self, pred, allow_short = False):
        weights = np.zeros(len(pred))
        raw_weights = np.argmax(pred, axis=-1)

        saved_min = None

        for e, r in enumerate(raw_weights):
            if r == 0: # sit
                weights[e] = 0
            elif r == 1: # buy
                weights[e] = np.abs(pred[e][0][r])
            else: # sell
                weights[e] = -np.abs(pred[e][0][r])

        if not allow_short:
            #long only: shift the weights to be non-negative, then normalize by their sum
            saved_min = np.abs(np.min(weights))
            weights += saved_min
            saved_sum = np.sum(weights)
        else:
            #short selling allowed: normalize by the sum of absolute values
            saved_sum = np.sum(np.abs(weights))

        weights /= saved_sum
        return weights, saved_min, saved_sum
    #return the action based on the state; uses the neural network unless exploring
    def act(self, state):

        if not self.is_eval and random.random() <= self.epsilon:
            #exploration: random portfolio weights
            w = np.random.normal(0, 1, size = (self.portfolio_size, ))

            saved_min = None

            if not self.allow_short:
                saved_min = np.abs(np.min(w))
                w += saved_min

            saved_sum = np.sum(w)
            w /= saved_sum
            return w, saved_min, saved_sum

        #exploitation: predict Q-values and convert them into weights
        pred = self.model.predict(np.expand_dims(state.values, 0))
        return self.nn_pred_to_weights(pred, self.allow_short)
def expReplay(self, batch_size):
def weights_to_nn_preds_with_reward(action_weights,
reward,
Q_star = np.zeros((self.portfolio_size, self.action_size))):
Q = np.zeros((self.portfolio_size, self.action_size))
for i in range(self.portfolio_size):
if action_weights[i] == 0:
Q[i][0] = reward[i] + self.gamma * np.max(Q_star[i][0])
elif action_weights[i] > 0:
Q[i][1] = reward[i] + self.gamma * np.max(Q_star[i][1])
else:
Q[i][2] = reward[i] + self.gamma * np.max(Q_star[i][2])
return Q
        def restore_Q_from_weights_and_stats(action):
            action_weights, action_min, action_sum = action[0], action[1], action[2]
            action_weights = action_weights * action_sum
            if action_min is not None:
                action_weights = action_weights - action_min
            return action_weights
        for (s, s_, action, reward, done) in self.memory4replay:

            action_weights = restore_Q_from_weights_and_stats(action)
            #in the terminal state the Q-target is just the immediate reward
            Q_learned_value = weights_to_nn_preds_with_reward(action_weights, reward)
s, s_ = s.values, s_.values
if not done:
# reward + gamma * Q^*(s_, a_)
Q_star = self.model.predict(np.expand_dims(s_, 0))
Q_learned_value = weights_to_nn_preds_with_reward(action_weights, reward, np.squeeze(Q_star))
Q_learned_value = [xi.reshape(1, -1) for xi in Q_learned_value]
Q_current_value = self.model.predict(np.expand_dims(s, 0))
Q = [np.add(a * (1-self.alpha), q * self.alpha) for a, q in zip(Q_current_value, Q_learned_value)]
# update current Q function with new optimal value
self.model.fit(np.expand_dims(s, 0), Q, epochs=1, verbose=0)
if self.epsilon > self.epsilon_min:
self.epsilon *= self.epsilon_decay
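Before training, a brief illustration of the untrained agent's behavior (the names demo_agent, demo_env, and demo_state are introduced only for this example): with epsilon still at 1, act() explores by returning random weights, which for a long-only portfolio are shifted to be non-negative and normalized to sum to one.
demo_agent = Agent(portfolio_size=15, allow_short=False)
demo_env = CryptoEnvironment()
demo_state = demo_env.get_state(180, 60)

#epsilon is 1, so the agent explores with random, normalized long-only weights
demo_weights, _, _ = demo_agent.act(demo_state)
print(np.round(demo_weights, 3), np.round(np.sum(demo_weights), 3))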
In this step, we train the algorithm. To do so, we first initialize the Agent and CryptoEnvironment classes and set the training hyperparameters.
N_ASSETS = 15 #53
agent = Agent(N_ASSETS)
env = CryptoEnvironment()
window_size = 180
episode_count = 50
batch_size = 32
rebalance_period = 90 #rebalance the portfolio weights every 90 days
data_length = len(env.data)
data_length
375
np.random.randint(window_size+1, data_length-window_size-1)
181
for e in range(episode_count):
agent.is_eval = False
data_length = len(env.data)
returns_history = []
returns_history_equal = []
rewards_history = []
equal_rewards = []
actions_to_show = []
print("Episode " + str(e) + "/" + str(episode_count), 'epsilon', agent.epsilon)
s = env.get_state(np.random.randint(window_size+1, data_length-window_size-1), window_size)
total_profit = 0
for t in range(window_size, data_length, rebalance_period):
        date1 = t-rebalance_period
        #state: covariance matrix over the lookback window ending at t
        s_ = env.get_state(t, window_size)
action = agent.act(s_)
actions_to_show.append(action[0])
weighted_returns, reward = env.get_reward(action[0], date1, t)
weighted_returns_equal, reward_equal = env.get_reward(
np.ones(agent.portfolio_size) / agent.portfolio_size, date1, t)
rewards_history.append(reward)
equal_rewards.append(reward_equal)
returns_history.extend(weighted_returns)
returns_history_equal.extend(weighted_returns_equal)
done = True if t == data_length else False
agent.memory4replay.append((s, s_, action, reward, done))
if len(agent.memory4replay) >= batch_size:
agent.expReplay(batch_size)
agent.memory4replay = []
s = s_
rl_result = np.array(returns_history).cumsum()
equal_result = np.array(returns_history_equal).cumsum()
plt.figure(figsize = (12, 2))
plt.plot(rl_result, color = 'black', ls = '-')
plt.plot(equal_result, color = 'grey', ls = '--')
plt.show()
plt.figure(figsize = (12, 2))
for a in actions_to_show:
plt.bar(np.arange(N_ASSETS), a, color = 'grey', alpha = 0.25)
plt.xticks(np.arange(N_ASSETS), env.data.columns, rotation='vertical')
plt.show()
Episode 0/50 epsilon 1
Episode 1/50 epsilon 1
Episode 2/50 epsilon 1
Episode 3/50 epsilon 1
Episode 4/50 epsilon 1
Episode 5/50 epsilon 1
Episode 6/50 epsilon 1
Episode 7/50 epsilon 1
Episode 8/50 epsilon 1
Episode 9/50 epsilon 1
Episode 10/50 epsilon 1
Episode 11/50 epsilon 0.99
Episode 12/50 epsilon 0.99
Episode 13/50 epsilon 0.99
Episode 14/50 epsilon 0.99
Episode 15/50 epsilon 0.99
Episode 16/50 epsilon 0.99
Episode 17/50 epsilon 0.99
Episode 18/50 epsilon 0.99
Episode 19/50 epsilon 0.99
Episode 20/50 epsilon 0.99
Episode 21/50 epsilon 0.99
Episode 22/50 epsilon 0.9801
Episode 23/50 epsilon 0.9801
Episode 24/50 epsilon 0.9801
Episode 25/50 epsilon 0.9801
Episode 26/50 epsilon 0.9801
Episode 27/50 epsilon 0.9801
Episode 28/50 epsilon 0.9801
Episode 29/50 epsilon 0.9801
Episode 30/50 epsilon 0.9801
Episode 31/50 epsilon 0.9801
Episode 32/50 epsilon 0.9702989999999999
Episode 33/50 epsilon 0.9702989999999999
Episode 34/50 epsilon 0.9702989999999999
Episode 35/50 epsilon 0.9702989999999999
Episode 36/50 epsilon 0.9702989999999999
Episode 37/50 epsilon 0.9702989999999999
Episode 38/50 epsilon 0.9702989999999999
Episode 39/50 epsilon 0.9702989999999999
Episode 40/50 epsilon 0.9702989999999999
Episode 41/50 epsilon 0.9702989999999999
Episode 42/50 epsilon 0.9702989999999999
Episode 43/50 epsilon 0.96059601
Episode 44/50 epsilon 0.96059601
Episode 45/50 epsilon 0.96059601
Episode 46/50 epsilon 0.96059601
Episode 47/50 epsilon 0.96059601
Episode 48/50 epsilon 0.96059601
Episode 49/50 epsilon 0.96059601
The charts above show the portfolio allocation and the cumulative returns for each of the episodes.
After training, the model is tested against the test dataset.
agent.is_eval = True
actions_equal, actions_rl = [], []
result_equal, result_rl = [], []
for t in range(window_size, len(env.data), rebalance_period):
date1 = t-rebalance_period
s_ = env.get_state(t, window_size)
action = agent.act(s_)
weighted_returns, reward = env.get_reward(action[0], date1, t)
weighted_returns_equal, reward_equal = env.get_reward(
np.ones(agent.portfolio_size) / agent.portfolio_size, date1, t)
result_equal.append(weighted_returns_equal.tolist())
actions_equal.append(np.ones(agent.portfolio_size) / agent.portfolio_size)
result_rl.append(weighted_returns.tolist())
actions_rl.append(action[0])
result_equal_vis = [item for sublist in result_equal for item in sublist]
result_rl_vis = [item for sublist in result_rl for item in sublist]
plt.figure()
plt.plot(np.array(result_equal_vis).cumsum(), label = 'Benchmark', color = 'grey',ls = '--')
plt.plot(np.array(result_rl_vis).cumsum(), label = 'Deep RL portfolio', color = 'black',ls = '-')
plt.show()
#Plotting the data
import matplotlib
current_cmap = matplotlib.cm.get_cmap()
current_cmap.set_bad(color='red')
N = len(np.array([item for sublist in result_equal for item in sublist]).cumsum())
for i in range(0, len(actions_rl)):
current_range = np.arange(0, N)
current_ts = np.zeros(N)
current_ts2 = np.zeros(N)
ts_benchmark = np.array([item for sublist in result_equal[:i+1] for item in sublist]).cumsum()
ts_target = np.array([item for sublist in result_rl[:i+1] for item in sublist]).cumsum()
t = len(ts_benchmark)
current_ts[:t] = ts_benchmark
current_ts2[:t] = ts_target
current_ts[current_ts == 0] = ts_benchmark[-1]
current_ts2[current_ts2 == 0] = ts_target[-1]
plt.figure(figsize = (12, 10))
plt.subplot(2, 1, 1)
plt.bar(np.arange(N_ASSETS), actions_rl[i], color = 'grey')
plt.xticks(np.arange(N_ASSETS), env.data.columns, rotation='vertical')
plt.subplot(2, 1, 2)
plt.colormaps = current_cmap
plt.plot(current_range[:t], current_ts[:t], color = 'black', label = 'Benchmark')
plt.plot(current_range[:t], current_ts2[:t], color = 'red', label = 'Deep RL portfolio')
plt.plot(current_range[t:], current_ts[t:], ls = '--', lw = .1, color = 'black')
plt.autoscale(False)
plt.ylim([-1, 1])
plt.legend()
import statsmodels.api as sm
from statsmodels import regression
def sharpe(R):
r = np.diff(R)
sr = r.mean()/r.std() * np.sqrt(252)
return sr
def print_stats(result, benchmark):
sharpe_ratio = sharpe(np.array(result).cumsum())
returns = np.mean(np.array(result))
volatility = np.std(np.array(result))
X = benchmark
y = result
x = sm.add_constant(X)
model = regression.linear_model.OLS(y, x).fit()
alpha = model.params[0]
beta = model.params[1]
return np.round(np.array([returns, volatility, sharpe_ratio, alpha, beta]), 4).tolist()
print('EQUAL', print_stats(result_equal_vis, result_equal_vis))
print('RL AGENT', print_stats(result_rl_vis, result_equal_vis))
EQUAL [-0.0013, 0.0468, -0.5016, 0.0, 1.0] RL AGENT [0.0004, 0.0231, 0.4445, 0.0002, -0.1202]
The RL portfolio has a higher return, higher Sharpe ratio, lower volatility, higher alpha, and a negative beta relative to the benchmark.
Conclusion
The idea in this case study was to go beyond the classical Markowitz efficient frontier and directly learn a policy for changing the portfolio weights dynamically in a continuously changing market.
We set up a standardized working environment (“gym”) for cryptocurrencies to facilitate the training. The model starts to learn over a period of time, discovers the strategy, and starts to exploit it. We used the test set to evaluate the model and found an overall profit in the test set.
Overall, the framework provided in this case study enables financial practitioners to perform portfolio allocation and rebalancing with a very flexible and automated approach, and it can prove to be immensely useful, specifically for robo-advisors.