# Project 5: Capstone Project¶

### Learning to Trade Using Q-Learning¶

#### Abstract¶

In this project, I will present an adaptive learning model to trade a single stock under the reinforcement learning framework. This area of machine learning consists in training an agent by reward and punishment without needing to specify the expected action. The agent learns from its experience and develops a strategy that maximizes its profits. The simulation results show initial success in bringing learning techniques to build algorithmic trading strategies.

## 1. Introduction¶

In this section, I will provide a high-level overview of the project, define the problem addressed and the metric used to measure the performance of the model created.

### 1.1. Project Overview¶

Udacity:

In this section, look to provide a high-level overview of the project in layman’s terms. Questions to ask yourself when writing this section:
- Has an overview of the project been provided, such as the problem domain, project origin, and related datasets or input data?
- Has enough background information been given so that an uninformed reader would understand the problem domain and following problem statement?

Nowadays, algo trading represents almost half of all cash equity trading in western Europe. In advanced markets, it already accounts for over 40%-50% of total volume. In Brazil its market share is not as large – currently about 10% – but is expected to rise in the years ahead as markets and players go digital.

As automated strategies are becoming increasingly popular, building an intelligent system that can trade many times a day and adapts itself to the market conditions and still consistently makes money is a subject of keen interest of any market participant.

Given that it is hard to produce such strategy, in this project I will try to build an algorithm that just does better than a random agent, but learns by itself how to trade. To do so, I will feed my agent with four days of information about every trade and change in the top of the order book in the PETR4 - one of the most liquidity assets in Brazilian Stock Market - in a Reinforcement Learning Framework. Later on, I will test what it has learned in a newest dataset.

The dataset used in this project is also known as level I order book data and includes all trades and changes in the prices and total quantities at best Bid (those who wants to buy the stock) and Offer side (those who intends to sell the stock).

### 1.2. Problem Statement¶

Udacity:

In this section, you will want to clearly define the problem that you are trying to solve, including the strategy (outline of tasks) you will use to achieve the desired solution. You should also thoroughly discuss what the intended solution will be for this problem. Questions to ask yourself when writing this section:
- Is the problem statement clearly defined? Will the reader understand what you are expecting to solve?
- Have you thoroughly discussed how you will attempt to solve the problem?
- Is an anticipated solution clearly defined? Will the reader understand what results you are looking for?

Algo trading strategies usually are programs that follow a predefined set of instructions to place its orders.

The primary challenge to this approach is building these rules in a way that it can consistently generate profit without being too sensitive to market conditions. Thus, the goal of this project is to develop an adaptive learning model that can learn by itself those rules and trade a particular asset using reinforcement learning framework under an environment that replays historical high-frequency data.

As \cite{chan2001electronic} described, reinforcement learning can be considered as a model-free approximation of dynamic programming. The knowledge of the underlying processes is not assumed but learned from experience. The agent can access some information about the environment state as the order flow imbalance, the sizes of the best bid and offer and so on. At each time step $t$, It should generate some valid action, as buy stocks or insert a limit order at the Ask side. The agent also should receive a reward or a penalty at each time step if it is already carrying a position from previous rounds or if it has made a trade (the cost of the operations are computed as a penalty). Based on the rewards and penalties it gets, the agent should learn an optimal policy for trade this particular stock, maximizing the profit it receives from its actions and resulting positions.

Udacity Reviewer:

This is really quite close! I'm marking as not meeting specifications because you should fully outline your solution here. You've outlined your strategy regarding reinforcement learning, but you should also address things like data preprocessing, choosing your state space etc. Basically, this section should serve as an outline for your entire solution. Just add a paragraph or two to fully outline your proposed methodology and you're good to go.

This project starts with an overview of the dataset and shows how the environment states will be represented in Section 2. The same section also dives in the reinforcement learning framework and defines the benchmark used at the end of the project. Section 3 discretizes the environment states by transforming its variables and clustering them into six groups. Also describes the implementation of the model and the environments, as well as and the process of improvement made upon the algorithm used. Section 4 presents the final model and compares statistically its performance to the benchmark chosen. Section 5 concludes the project with some closing remarks and possible improvements.

### 1.3. Metrics¶

Udacity:

In this section, you will need to clearly define the metrics or calculations you will use to measure performance of a model or result in your project. These calculations and metrics should be justified based on the characteristics of the problem and problem domain. Questions to ask yourself when writing this section:
- Are the metrics you’ve chosen to measure the performance of your models clearly discussed and defined?
- Have you provided reasonable justification for the metrics chosen based on the problem and solution?
Udacity Reviewer:

The section on metrics should address any statistics or metrics that you'll be using in your report. What you've written in your benchmark section is roughly what we're looking for for the metrics section and vice versa. I'd recommend changing the subtitles to clarify this. If it's more logical to introduce the benchmark before explaining your metrics, you could combine the 'Benchmark' and 'Metrics' subsections into a single 'Benchmark and Metrics' section.

Different metrics are used to support the decisions made throughout the project. We use the mean Silhouette Coefficient of all samples to justify the clustering method chosen to reduce the state space representation of the environment. As exposed in the scikit-learn documentation, this coefficient is composed by the mean intra-cluster distance ($a$) and the mean nearest-cluster distance ($b$) for each sample. The score for a single cluster is given by $s = \frac{b-a}{\max{a, \, b}}$.This scores are so average down to all samples and varying between $1$ (the best one) and $-1$ (the worst value).

Then, we use sharpe ratio to help us understanding the performance impact of different values to the model parameters. The Sharpe is measure upon the first difference ($\Delta r$) of the accumulated PnL curve of the model. So, the first difference is defined as $\Delta r = PnL_t - PnL_{t-1}$.

Finally, as we shall justify latter, the performance of my agent will be compared to the performance of a random agent. These performances will be measured primarily of Reais made (the Brazilian currency) by the agents. To compared the final PnL of both agents in the simulations, we will perform a one-sided Welch's unequal variances t-test for the null hypothesis that the learning agent has the expected PnL greater than the random agent. As the implementation of the t-test in the scipy assumes a two-sided t-test, to perform the one-sided test, we will divide the p-value by $2$ to compare to a critical value of $0.05$ and requires that the t-value is greater than zero. In the next section, I will detail the behavior of learning agent.

## 2. Analysis¶

In this section, I will explore the data set that will be used in the simulation, define and justify the inputs employed in the state representation of the algorithm, explain the reinforcement learning techniques used and provide a benchmark.

### 2.1. Data Exploration¶

Udacity:

In this section, you will be expected to analyze the data you are using for the problem. This data can either be in the form of a dataset (or datasets), input data (or input files), or even an environment. The type of data should be thoroughly described and, if possible, have basic statistics and information presented (such as discussion of input features or defining characteristics about the input or environment). Any abnormalities or interesting qualities about the data that may need to be addressed have been identified (such as features that need to be transformed or the possibility of outliers). Questions to ask yourself when writing this section:
- If a dataset is present for this problem, have you thoroughly discussed certain features about the dataset? Has a data sample been provided to the reader?
- If a dataset is present for this problem, are statistics about the dataset calculated and reported? Have any relevant results from this calculation been discussed?
- If a dataset is **not** present for this problem, has discussion been made about the input space or input data for your problem?
- Are there any abnormalities or characteristics about the input space or dataset that need to be addressed? (categorical variables, missing values, outliers, etc.)

The dataset used is composed by level I order book data from PETR4, a stock traded at BMFBovespa Stock Exchange. Includes 45 trading sessions from 07/25/2016 to 09/26/2016. I will use one day to create the scalers of the features used, that I shall explain. Then, I will use four days to train and test the model, and after each training session, I will validate the policy found in an unseen dataset from the subsequent day. The data was collected from Bloomberg.

In the figure below can be observed how the market behaved on the days that the out-of-sample will be performed. In this charts are plotted the number of cents that an investment of the same amount of money in PETR4 and BOVA11 would have varied on these market sessions. BOVA11 is an ETF that can be used as a proxy to the Bovespa Index, the Brazilian Stock Exchange Index. As can seem, PETR4 was relatively more volatile than the rest of the market.

In [5]:
import zipfile
s_fname = "data/data_0725_0926.zip"
s_fname2 = "data/bova11_2.zip"
archive = zipfile.ZipFile(s_fname, 'r')
archive2 = zipfile.ZipFile(s_fname2, 'r')
l_fnames = archive.infolist()

In [7]:
import qtrader.eda as eda; reload(eda);
df_last_pnl = eda.plot_cents_changed(archive, archive2)


Let's start by looking at the size of the files that can be used in the simulation:

In [1]:
def foo():
f_total = 0.
f_tot_rows = 0.
for i, x in enumerate(archive.infolist()):
f_total += x.file_size/ 1024.**2
for num_rows, row in enumerate(archive.open(x)):
f_tot_rows += 1
print "{}:\t{:,.0f} rows\t{:0.2f} MB".format(x.filename, num_rows + 1, x.file_size/ 1024.**2)
print '=' * 42
print "TOTAL\t\t{} files\t{:0.2f} MB".format(i+1,f_total)
print "\t\t{:0,.0f} rows".format(f_tot_rows)

%time foo()

20160725.csv:	110,756 rows	4.42 MB
20160726.csv:	100,109 rows	3.98 MB
20160727.csv:	123,175 rows	4.93 MB
20160728.csv:	109,655 rows	4.37 MB
20160729.csv:	135,111 rows	5.40 MB
20160801.csv:	109,710 rows	4.37 MB
20160802.csv:	108,053 rows	4.30 MB
20160803.csv:	137,039 rows	5.49 MB
20160804.csv:	139,118 rows	5.56 MB
20160805.csv:	112,852 rows	4.51 MB
20160808.csv:	89,730 rows	3.55 MB
20160809.csv:	83,826 rows	3.33 MB
20160810.csv:	105,758 rows	4.21 MB
20160811.csv:	144,728 rows	5.81 MB
20160812.csv:	147,086 rows	5.90 MB
20160815.csv:	108,633 rows	4.33 MB
20160816.csv:	108,795 rows	4.33 MB
20160817.csv:	118,980 rows	4.75 MB
20160818.csv:	84,489 rows	3.36 MB
20160819.csv:	98,329 rows	4.00 MB
20160822.csv:	98,594 rows	4.02 MB
20160823.csv:	90,752 rows	3.69 MB
20160824.csv:	87,930 rows	3.56 MB
20160825.csv:	95,929 rows	3.89 MB
20160826.csv:	152,547 rows	6.24 MB
20160829.csv:	98,630 rows	4.02 MB
20160830.csv:	122,067 rows	4.98 MB
20160831.csv:	155,391 rows	6.37 MB
20160901.csv:	150,122 rows	6.15 MB
20160902.csv:	147,257 rows	6.04 MB
20160905.csv:	70,243 rows	2.86 MB
20160906.csv:	109,355 rows	4.46 MB
20160908.csv:	140,519 rows	5.77 MB
20160909.csv:	142,940 rows	5.86 MB
20160912.csv:	171,462 rows	7.02 MB
20160913.csv:	224,427 rows	9.25 MB
20160914.csv:	172,215 rows	7.05 MB
20160915.csv:	139,648 rows	5.72 MB
20160916.csv:	119,952 rows	4.90 MB
20160919.csv:	126,815 rows	5.18 MB
20160920.csv:	149,962 rows	6.15 MB
20160921.csv:	163,128 rows	6.70 MB
20160922.csv:	163,957 rows	6.74 MB
20160923.csv:	159,513 rows	6.56 MB
20160926.csv:	101,986 rows	4.15 MB
==========================================
TOTAL		45 files	228.26 MB
5,631,273 rows
CPU times: user 18.7 s, sys: 161 ms, total: 18.9 s
Wall time: 18.9 s


There are 45 files, each one has 110,000 rows on average, resulting in 5,631,273 rows at total and almost 230 MB of information. Now, let's look at the structure of one of them:

In [4]:
import pandas as pd

Out[4]:
Date Type Price Size
0 2016-07-25 10:02:00 TRADE 11.98 5800
1 2016-07-25 10:02:00 BID 11.97 6100
2 2016-07-25 10:02:00 ASK 11.98 51800
3 2016-07-25 10:02:00 ASK 11.98 56800
4 2016-07-25 10:02:00 ASK 11.98 56900

Each file is composed of four different fields. The column $Date$ is the timestamp of the row and has a precision of seconds. $Type$ is the kind of information that the row encompasses. The type "TRADE" relates to an actual trade that has happened. "BID" is related to changes in the best Bid level and "ASK," to the best Offer level. $Price$ is the current best bid or ask and $Size$ is the cumulated quantity on that price and side.

All this data will be used to create the environment where my agent will operate. This environment is an order book, where the agent will be able to insert limit orders and execute trades at the best prices. The order book is represented by two binary trees, one for the Bid and other for the Ask side. As can be seen in the table below, the nodes of these trees are sorted by price (price level) in ascending order on the Bid side and descending order on the ask side. At each price level, there are other binary trees sorted by order of arrival. The first order to arrive is the first order filled when coming in a trade.

In [4]:
import qtrader.simulator as simulator
e = environment.Environment()
sim = simulator.Simulator(e)
%time sim.run(n_trials=1)

CPU times: user 13.8 s, sys: 45.4 ms, total: 13.9 s
Wall time: 13.9 s

In [8]:
sim.env.get_order_book()

Out[8]:
0 61,400 12.02 12.03 13,800
1 47,100 12.01 12.04 78,700
2 51,700 12.00 12.05 20,400
3 37,900 11.99 12.06 23,100
4 97,000 11.98 12.07 27,900

The environment will answer with the agent's current position and Profit and Loss (PnL) every time the agent executes a trade or has an order filled. The cost of the trade will be accounted as a penalty.

The agent also will be able to sense the state of the environment and include it in its own state representation. So, this intern state will be represented by a set of variables about the current situation os the market and the state of the agent, given by:

• $qOFI$ : integer. The net order flow at the bid and ask in the last 10 seconds
• $book\_ratio$ : float. The Bid size over the Ask size
• $position$: integer. The current position of my agent. The maximum is $100$
• $OrderBid$: boolean. If the agent has order at the bid side
• $OrderAsk$: boolean. If the agent has order at the ask side
Udacity:

Exploratory Visualization:

In this section, you will need to provide some form of visualization that summarizes or extracts a relevant characteristic or feature about the data. The visualization should adequately support the data being used. Discuss why this visualization was chosen and how it is relevant. Questions to ask yourself when writing this section:
- Have you visualized a relevant characteristic or feature about the dataset or input data?
- Is the visualization thoroughly analyzed and discussed?
- If a plot is provided, are the axes, title, and datum clearly defined?

Regarding the measure of the Order Flow Imbalance (OFI), there are many ways to measure it. \cite{cont2014price} argued the order flow imbalance is a measure of supply/demand imbalance and defines it as a sum of individual event contribution $e_n$ over time intervals $\left[ t_{k-1}, \; t_k \right]$, such that:

$$OFI_k = \sum^{N(t_k)}_{n=N(t_{k-1})+1} e_n$$

Where $N(t_k)$ and $N(t_{k-1}) + 1$ are index of the first and last event in the interval. The $e_n$ was defined by the authors as a measure of the contribution of the $n$-th event to the size of the bid and ask queues:

$$e_n = \mathbb{1}_{P_{n}^{B} \geq P_{n-1}^{B}} q^{B}_{n} - \mathbb{1}_{P_{n}^{B} \leq P_{n-1}^{B}} q^{B}_{n-1} - \mathbb{1}_{P_{n}^{A} \leq P_{n-1}^{A}} q^{A}_{n} + \mathbb{1}_{P_{n}^{A} \geq P_{n-1}^{A}} q^{A}_{n-1}$$

Where $q^{B}_{n}$ and $q^{A}_{n}$ are linked to the cumulated quantities at the best bid and ask in the time $n$. The subscript $n-1$ is related to the last observation. $\mathbb{1}$ is an indicator function. In the figure below is ploted the 10-second log-return of PETR4 against the contemporaneous OFI. Log-return is defined as $\ln{r_t} = \ln{\frac{P_t}{P_{t-1}}}$, where $P_t$ is the current price of PETR4 and $P_{t-1}$ is the previous one.

In [9]:
import qtrader.eda as eda; reload(eda);
s_fname = "data/petr4_0725_0818_2.zip"
%time eda.test_ofi_indicator(s_fname, f_min_time=20.)

CPU times: user 2.48 s, sys: 16 ms, total: 2.49 s
Wall time: 2.59 s

In [20]:
import pandas as pd
df.drop('TIME', axis=1, inplace=True)
df.dropna(inplace=True)
ax = sns.lmplot(x="OFI", y="LOG_RET", data=df, markers=["x"], palette="Set2", size=4, aspect=2.)
ax.ax.set_title(u'Relation between the Log-return and the $OFI$\n', fontsize=15);
ax.ax.set_ylim([-0.004, 0.005])
ax.ax.set_xlim([-400000, 400000])

Out[20]:
(-400000, 400000)

As described by \cite{cont2014price} in a similar test, the figure suggests that order flow imbalance is a stronger driver of high-frequency price changes and this variable will be used to describe the current state of the order book.

### 2.2. Algorithms and Techniques¶

Udacity:

In this section, you will need to discuss the algorithms and techniques you intend to use for solving the problem. You should justify the use of each one based on the characteristics of the problem and the problem domain. Questions to ask yourself when writing this section:
- Are the algorithms you will use, including any default variables/parameters in the project clearly defined?
- Are the techniques to be used thoroughly discussed and justified?
- Is it made clear how the input data or datasets will be handled by the algorithms and techniques chosen?

Based on \cite{cont2014price}, the algo trading might be conveniently modeled in the framework of reinforcement learning. As suggested by \cite{du1algorithm}, this framework adjusts the parameters of an agent to maximize the expected payoff or reward generated due to its actions. Therefore, the agent learns a policy that tells him the actions it must perform to achieve its best performance. This optimal policy is exactly what we hope to find when we are building an automated trading strategy.

According to \cite{chan2001electronic}, Markov decision processes (MDPs) are the most common model when implementing reinforcement learning. The MDP model of the environment consists, among other things, of a discrete set of states $S$ and a discrete set of actions taken from $A$. In this project, depending on the position of the learner(long or short), at each time step $t$ it will be allowed to choose an action $a_t$ from different subsets from the action space $A$ , that consists of six possibles actions:

$$a_{t} \in \left (None,\, buy,\, sell,\, best\_bid,\, best\_ask,\, best\_both \right)$$

Where $None$ indicates that the agent shouldn't have any order in the market. $Buy$ and $Sell$ means that the agent should execute a market order to buy or sell $100$ stocks (the size of an order will always be a hundred shares). This kind of action will be allowed based on a trailing stop of 4 cents. $best\_bid$ and $best\_ask$ indicate that the agent should keep order at best price just in the mentioned side and $best\_both$, it should have ordered at best price in both sides.

So, at each discrete time step $t$, the agent senses the current state $s_t$ and choose to take an action $a_t$. The environment responds by providing the agent a reward $r_t=r(s_t, a_t)$ and by producing the succeeding state $s_{t+1}=\delta(s_t, a_t)$. The functions $r$ and $\delta$ only depend on the current state and action (it is memoryless), are part of the environment and are not necessarily known to the agent.

The task of the agent is to learn a policy $\pi$ that maps each state to an action ($\pi: S \rightarrow A$), selecting its next action $a_t$ based solely on the current observed state $s_t$, that is $\pi(s_t)=a_t$. The optimal policy, or control strategy, is the one that produces the greatest possible cumulative reward over time. So, stating that:

$$V^{\pi}(s_t)= r_t + \gamma r_{t+1} + \gamma^2 r_{t+1} + ... = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i}$$

Where $V^{\pi}(s_t)$ is also called the discounted cumulative reward and it represents the cumulative value achieved by following an policy $\pi$ from an initial state $s_t$ and $\gamma \in [0, 1]$ is a constant that determines the relative value of delayed versus immediate rewards. It is one of the

If we set $\gamma=0$, only immediate rewards is considered. As $\gamma \rightarrow 1$, future rewards are given greater emphasis relative to immediate reward. The optimal policy $\pi^{*}$ that will maximizes $V^{\pi}(s_t)$ for all states $s$ can be written as:

$$\pi^{*} = \underset{\pi}{\arg \max} \, V^{\pi} (s)\,\,\,\,\,, \,\, \forall s$$

However, learning $\pi^{*}: S \rightarrow A$ directly is difficult because the available training data does not provide training examples of the form $(s, a)$. Instead, as \cite{Mitchell} explained, the only available information is the sequence of immediate rewards $r(s_i, a_i)$ for $i=1,\, 2,\, 3,\,...$

So, as we are trying to maximize the cumulative rewards $V^{*}(s_t)$ for all states $s$, the agent should prefer $s_1$ over $s_2$ wherever $V^{*}(s_1) > V^{*}(s_2)$. Given that the agent must choose among actions and not states, and it isn't able to perfectly predict the immediate reward and immediate successor for every possible state-action transition, we also must learn $V^{*}$ indirectly.

To solve that, we define a function $Q(s, \, a)$ such that its value is the maximum discounted cumulative reward that can be achieved starting from state $s$ and applying action $a$ as the first action. So, we can write:

$$Q(s, \, a) = r(s, a) + \gamma V^{*}(\delta(s, a))$$

As $\delta(s, a)$ is the state resulting from applying action $a$ to state $s$ (the successor) chosen by following the optimal policy, $V^{*}$ is the cumulative value of the immediate successor state discounted by a factor $\gamma$. Thus, what we are trying to achieve is

$$\pi^{*}(s) = \underset{a}{\arg \max} Q(s, \, a)$$

It implies that the optimal policy can be obtained even if the agent just uses the current action $a$ and state $s$ and chooses the action that maximizes $Q(s,\, a)$. Also, it is important to notice that the function above implies that the agent can select optimal actions even when it has no knowledge of the functions $r$ and $\delta$.

Lastly, according to \cite{Mitchell}, there are some conditions to ensure that the reinforcement learning converges toward an optimal policy. On a deterministic MDP, the agent must select actions in a way that it visits every possible state-action pair infinitely often. This requirement can be a problem in the environment that the agent will operate.

As the most inputs suggested in the last subsection was defined in an infinite space, in section 3 I will discretize those numbers before use them to train my agent, keeping the state space representation manageable, hopefully. We also will see how \cite{Mitchell} defined a reliable way to estimate training values for $Q$, given only a sequence of immediate rewards $r$.

### 2.3. Benchmark¶

Udacity:

In this section, you will need to provide a clearly defined benchmark result or threshold for comparing across performances obtained by your solution. The reasoning behind the benchmark (in the case where it is not an established result) should be discussed. Questions to ask yourself when writing this section:
- Has some result or value been provided that acts as a benchmark for measuring performance?
- Is it clear how this result or value was obtained (whether by data or by hypothesis)?

In 1988, the Wall Street Journal created a Dartboard Contest, where Journal staffers threw darts at a stock table to select their assets, while investment experts picked their own stocks. After six months, they compared the results of the two methods. After adjusting the results to risk level, they found out that the pros barely have beaten the random pickers.

Given that, the benchmark used to measure the performance of the learner will be the amount of money made, in Reais, by a random agent. So, my goal will be to outperform this agent, that should just produce some random action from a set of allowed actions taken from $A$ at each time step $t$.

Just like my learner, the set of action can change over time depending on the open position, that is limited to $100$ stocks at most, on any side. When it reaches its limit, it will be allowed just to perform actions that decrease its position. So, for instance, if it already long in $100$ shares, the possible moves would be $\left (None,\, sell,\, best\_ask \right)$. If it is short, it just can perform $\left (None,\, buy,\, best\_bid\right)$.

The performance will be measured primarily in the money made by the agents (that will be optimized by the learner). First, I will analyze if the learning agent was able to improve its performance on the same dataset after different trials. Later on, I will use the policy learned to simulate the learning agent behavior in a different dataset and then I will compare the final Profit and Loss of both agents. All data analyzed will be obtained by simulation.

As the last reference, in the final section, we will compare the total return of the learner to a strategy of buy-and-hold in BOVA11 and in the stock traded to check if we are consistently beating the market and not just being profitable, as the Udacity reviewer noticed.

## 3. Methodology¶

In this section, I will discretize the input space and implement an agent to learn the Q function.

### 3.1 Data Preprocessing¶

Udacity:

In this section, all of your preprocessing steps will need to be clearly documented, if any were necessary. From the previous section, any of the abnormalities or characteristics that you identified about the dataset will be addressed and corrected here. Questions to ask yourself when writing this section:
- If the algorithms chosen require preprocessing steps like feature selection or feature transformations, have they been properly documented?
- Based on the **Data Exploration** section, if there were abnormalities or characteristics that needed to be addressed, have they been properly corrected?
- If no preprocessing is needed, has it been made clear why?

As mentioned before, I will implement a Markov decision processes (MDP) that requires, among other things, of a discrete set of states $S$. Apart from the input variables $position$, $OrderBid$, $OrderAsk$, the other variables are defined in an infinite domain. I am going to discretize those inputs, so my learning agent can use them in the representation of their intern state. In the Figure bellow, we can see the distribution of those variables. The data was produced using the first day of the dataset.

In [5]:
import pandas as pd
df.drop(['TIME', 'DELTA_MID'], axis=1, inplace=True)
df.dropna(inplace=True)

In [6]:
# Produce a scatter matrix for each pair of features in the data
pd.scatter_matrix(df.ix[:, ['OFI', 'BOOK_RATIO']],
alpha = 0.3, figsize = (14,8), diagonal = 'kde');

Udacity Reviewer:

Please be sure to specify how you are doing this (I'd recommend giving the formula).

The scale of the variables is very different, and, in the case of the $BOOK\_RATIO$, it presents a logarithmic distribution. I will apply a logarithm transformation on this variable and rescale both to lie between a given minimum and maximum value of each feature using the function MinMaxScaler from scikit-learn. So, both variable will be scaled to lie between $0$ and $1$ by applying the formula $z_{i} =\frac{x_i - \min{X}}{\max{X} - \min{X}}$. Where $z$ is the transformed variable, $x_i$ is the variable to be transformed and $X$ is a vector with all $x$ that will be transformed. The result of the transformation can be seen in the figure below.

In [7]:
import sklearn.preprocessing as preprocessing
import numpy as np

In [8]:
scaler_ofi = preprocessing.MinMaxScaler().fit(pd.DataFrame(df.OFI))
scaler_bookratio = preprocessing.MinMaxScaler().fit(pd.DataFrame(np.log(df.BOOK_RATIO)))
d_transformed = {}
d_transformed['OFI'] = scaler_ofi.transform(pd.DataFrame(df.OFI)).T[0]
d_transformed['BOOK_RATIO'] = scaler_bookratio.transform(pd.DataFrame(np.log(df.BOOK_RATIO))).T[0]

In [9]:
df_transformed = pd.DataFrame(d_transformed)
pd.scatter_matrix(df_transformed.ix[:, ['OFI', 'BOOK_RATIO']],
alpha = 0.3, figsize = (14,8), diagonal = 'kde');


As mentioned before, in an MDP environment the agent must visit every possible state-action pair infinitely often. If I just bucketize the variables and combine them, I will end up with a huge number of states to explore. So, to reduce the state space, I am going to group those variables using K-Means and Gaussian Mixture Model (GMM) clustering algorithm. Then I will quantify the "goodness" of the clustering results by calculating each data point's silhouette coefficient. The silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar). In the figure below, I am going to calculate the mean silhouette coefficient to K-Means and GMM using a different number of clusters. Also, I will test different covariance structures to GMM.

In [10]:
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.mixture import GMM
import time

reduced_data = df_transformed.ix[:, ['OFI', 'BOOK_RATIO']]
reduced_data.columns = ['Dimension 1', 'Dimension 2']
range_n_clusters = [2, 3, 4, 5, 6, 8, 10]

f_st = time.time()
d_score = {}
d_model = {}
s_key = "Kmeans"
d_score[s_key] = {}
d_model[s_key] = {}
for n_clusters in range_n_clusters:
# TODO: Apply your clustering algorithm of choice to the reduced data
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
preds = clusterer.fit_predict(reduced_data)
d_model[s_key][n_clusters] = clusterer
d_score[s_key][n_clusters] = metrics.silhouette_score(reduced_data, preds)
print "K-Means took {:0.2f} seconds to run over all complexity space".format(time.time() - f_st)

f_avg = 0

for covar_type in ['spherical', 'diag', 'tied', 'full']:
f_st = time.time()
s_key = "GMM_{}".format(covar_type)
d_score[s_key] = {}
d_model[s_key] = {}
for n_clusters in range_n_clusters:

# TODO: Apply your clustering algorithm of choice to the reduced data
clusterer = GMM(n_components=n_clusters,
covariance_type=covar_type,
random_state=10)
clusterer.fit(reduced_data)
preds = clusterer.predict(reduced_data)
d_model[s_key][n_clusters] = clusterer
d_score[s_key][n_clusters] = metrics.silhouette_score(reduced_data, preds)
f_avg += time.time() - f_st

print "GMM took {:0.2f} seconds on average to run over all complexity space".format(f_avg / 4.)

K-Means took 4.59 seconds to run over all complexity space
GMM took 16.85 seconds on average to run over all complexity space

In [12]:
import pandas as pd
ax = pd.DataFrame(d_score).plot()
ax.set_xlabel("Number of Clusters")
ax.set_ylabel("Silhouette Score\n")
ax.set_title("Performance vs Complexity\n", fontsize = 16);


The maximum score has happened using 2 clusters.However, I believe that the market can't be simplified that much. So, I will use the K-means with six centroids to group the variables. In the figure below we can see how the algorithm classified the data. Also, in the following table, the centroid was put in their original scales.

In [13]:
# get centers
sample_preds = []
centers = d_model["Kmeans"][6].cluster_centers_
preds = d_model["Kmeans"][6].fit_predict(reduced_data)

In [14]:
# Display the results of the clustering from implementation
eda.cluster_results(reduced_data, preds, centers)

In [15]:
# recovering data
log_centers = centers.copy()
df_aux = pd.DataFrame([np.exp(scaler_bookratio.inverse_transform(log_centers.T[0].reshape(1, -1))[0]),
scaler_ofi.inverse_transform(log_centers.T[1].reshape(1, -1))[0]]).T
df_aux.columns = df_transformed.columns
df_aux.index.name = 'CLUSTER'
df_aux.columns = ['BOOK RATIO', 'OFI']
df_aux.round(2)

Out[15]:
BOOK RATIO OFI
CLUSTER
0 0.89 -173662.94
1 0.91 281563.51
2 0.76 116727.32
3 0.85 -16602.29
4 7.91 23334.13
5 0.09 34240.00

Curiously, the algorithm gave more emphasis on the $BOOK\_RATIO$ when its value was very large (the bid size almost eight times greater than the ask size) or tiny (when the bid size was one tenth of the ask size). The other cluster seems mostly dominated by the $OFI$. In the next subsection, I will discuss how I have implemented the Q-learning, how I intend to perform the simulations and make some tests. Lastly, let's serialize the objects used in clusterization to be used later.

In [51]:
import pickle
pickle.dump(d_model["Kmeans"][6] ,open('data/kmeans_2.dat', 'w'))
pickle.dump(scaler_ofi, open('data/scale_ofi_2.dat', 'w'))
pickle.dump(scaler_bookratio, open('data/scale_bookratio_2.dat', 'w'))
print 'Done !'

Done !


### 3.2. Implementation¶

Udacity:

In this section, the process for which metrics, algorithms, and techniques that you implemented for the given data will need to be clearly documented. It should be abundantly clear how the implementation was carried out, and discussion should be made regarding any complications that occurred during this process. Questions to ask yourself when writing this section:
- Is it made clear how the algorithms and techniques were implemented with the given datasets or input data?
- Were there any complications with the original metrics or techniques that required changing prior to acquiring a solution?
- Was there any part of the coding process (e.g., writing complicated functions) that should be documented?

As we have seen, learning the Q function corresponds to learning the optimal policy. According to \cite{Mohri_2012}, the optimal state-action value function $Q^{*}$ is defined for all $(s, \, a) \in S \times A$ as the expected return for taking the action $a \in A$ at the state $s \in S$, following the optimal policy. So, it can be written as \cite{Mitchell} suggested:

$$V^{*}(s) = \underset{a'}{\arg \max} \, Q(s, \, a')$$

Using this relationship, we can write a recursive definition of Q function, such that:

$$Q(s, \, a) = r(s, a) + \gamma \, \underset{a'}{\max} \, Q(\delta(s,\, a), \, a')$$

The recursive nature of the function above implies that our agent doesn't know the actual $Q$ function. It just can estimate $Q$, that we will refer as $\hat{Q}$. It will represents is hypothesis $\hat{Q}$ as a large table that attributes each pair $(s\, , \, a)$ to a value for $\hat{Q}(s,\, a)$ - the current hypothesis about the actual but unknown value $Q(s, \, a)$. I will initialize this table with zeros, but it could be filled with random numbers, according to \cite{Mitchell}. Still according to him, the agent repeatedly should observe its current state $s$ and do the following:

###### Algorithm 1: Update Q-table¶
• Observe the current state $s$ and the allowed actions $A^{*}$:
• Choose some action $a \, \in \, A^{*}$ and execute it
• Receive the immediate reward $r = r(s, a)$
• if there is no entry $(s, \, a)$
• initialize the table entry $\hat{Q}(s, \, a)$ to zero
• Observe the new state $s' = \delta(s, \,a)$.
• Updates the table entry for $\hat{Q}(s, \, a)$ following:
• $\hat{Q}(s, \, a) \leftarrow r + \gamma \underset{a'}{\max} \hat{Q}(s', \, a')$
• $s \leftarrow s'$

The main issue in the strategy presented in Algorithm 1 is that the agent could overcommit to actions that presented positive $\hat{Q}$ values early in the simulation, failing to explore other actions that could present even higher values. \cite{Mitchell} proposed to use a probabilistic approach to select actions, assigning higher probabilities to action with high $\hat{Q}$ values, but given to every action at least a nonzero probability. So, I will implement the following relation:

$$P(a_i\, | \,s ) = \frac{k ^{\hat{Q}(s, a_i)}}{\sum_j k^{\hat{Q}(s, a_j)}}$$

Where $P(a_i\, | \,s )$ is the probability of selecting the action $a_i$ given the state $s$. The constant $k$ is positive and determines how strongly the selection favors action with high $\hat{Q}$ values.

Ideally, to optimize the policy found, the agent should iterate over the same dataset repeatedly until it is not able to improve its PnL. Later on, the policy learned will be tested against the same dataset to check its consistency. Lastly, this policy will be tested on the subsequent day of the training session. So, before perform the out-of-sample test, we will use the following procedure:

###### Algorithm 2: Train-Test Q-Learning Trader¶
• for each trial in total iterations desired do:
• for each observation in the session do:
• Update the table $\hat{Q}$
• Save the table $\hat{Q}$ indexed by the trial ID
• for each trial in total iterations made do:
• Load the table $\hat{Q}$ related to the current trial
• for each observation in the session do:
• Observe the current state $s$ and the allowed actions $A^{*}$:
• if $s \, \notin \, \hat{Q}$: Close out open positions or do nothing
• else: Choose the optimal action $a \, \in \, A^{*}$ and execute it

Each training session will include data from the largest part of a trading session, starting at 10:30 and closing at 16:30. Also, the agent will be allowed to hold a position of just 100 shares at maximum (long or short). When the training session is over, all positions from the learner will be closed out so the agent always will start a new session without carrying positions.

The agent will be allowed to take action every $2$ seconds and, due to this delay, every time it decides to insert limit orders, it will place it 1 cent worst than the best price. So, if the best bid is $12.00$ and the best ask is $12.02$, if the agent chooses the action $BEST\_BOTH$, it should include a buy order at $11.99$ and a sell order at $12.03$. It will be allowed to cancel these orders after 2 seconds. However, if these orders are filled in the mean time, the environment will inform the agent so it can update its current position. Even though, it just will take new actions after passed those 2 seconds.

Udacity Reviewer:

Please be sure to note any complications that occurred during the coding process. Otherwise, this section is simply excellent

One of the biggest complication of the approach proposed in this project was to find out a reasonable representation of the environment state that wasn't too big to visit each state-action pair sufficiently often but was still useful in the learning process. In the next subsection, I will try different configurations of $k$ and $\gamma$ to try to improve the performance of the learning agent over the same trial.

### 3.3. Refinement¶

Udacity:

In this section, you will need to discuss the process of improvement you made upon the algorithms and techniques you used in your implementation. For example, adjusting parameters for certain models to acquire improved solutions would fall under the refinement category. Your initial and final solutions should be reported, as well as any significant intermediate results as necessary. Questions to ask yourself when writing this section:
- Has an initial solution been found and clearly reported?
- Is the process of improvement clearly documented, such as what techniques were used?
- Are intermediate and final solutions clearly reported as the process is improved?

As mentioned before, we should iterate over the same dataset and check the policy learned on the same observations until convergence. Given the time required to perform each train-test iteration, "until convergence" will be 10 repetitions. We are going to train the model on the dataset from 08/15/2016. After each iteration, we will check how the agent would perform using the policy it has just learned. The agent in the first training session will use $\gamma=0.7$ and $k=0.3$. In the figure below are the results of the first round of iterations:

In [21]:
# analyze the logs from the in-sample tests
s_fname = 'log/train_test/sim_Fri_Oct__7_002946_2016.log'  # 15 old
# s_fname = 'log/train_test/sim_Wed_Oct__5_110344_2016.log'  # 15
# s_fname = 'log/train_test/sim_Thu_Oct__6_165539_2016.log'  # 25
# s_fname = 'log/train_test/sim_Thu_Oct__6_175507_2016.log'  # 35
# s_fname = 'log/train_test/sim_Thu_Oct__6_183555_2016.log'  # 5
%time d_rtn_train_1 = eda.simple_counts(s_fname, 'LearningAgent_k')

CPU times: user 38.7 s, sys: 200 ms, total: 38.9 s
Wall time: 39.1 s

In [22]:
import qtrader.eda as eda; reload(eda);
eda.plot_train_test_sim(d_rtn_train_1)


The curve Train in the charts is the PnL obtained during the training session when the agent was allowed to explore new actions randomly. The test is the PnL obtained using strictly the policy learned.

Although the agent was able to profit at the end of every single round, "Convergence" is something that I can not claim. For instance, the PnL was worst in the first round than in the first one. I believe this stability of the results is difficult to obtain in day-trading. For example, even if the agent think that it should buy before the market goes up, it doesn't depending on its will if its order is filled.

We will target on improving the final PnL of the agent. However, less variability of the results is desired, especially at the beginning of the day, when the strategy didn't make any money yet. So, we also will look at the sharpe ratio of the first difference of the cumulated PnL produced by each configuration.

First, we are going to iterate through some values for $k$ and look at its performance in the training phase at the first hours of the training session. We also will use just 5 iterations here to speed up the tests.

In [43]:
# improving K
s_fname = 'log/train_test/sim_Thu_Oct__6_133518_2016.log'
%time d_rtn_k = eda.count_by_k_gamma(s_fname, 'LearningAgent_k', 'k')

CPU times: user 42 s, sys: 140 ms, total: 42.1 s
Wall time: 43.1 s

In [79]:
import pandas as pd
import matplotlib.pyplot as plt

f, na_ax = plt.subplots(1, 4, sharex=True, sharey=True)
for ax1, s_key in zip(na_ax.ravel(), ['0.3', '0.8', '1.3', '2.0']):
df_aux = pd.Series(d_rtn_k[s_key][5])
df_filter = pd.Series([x.hour for x in df_aux.index])
df_aux = df_aux[((df_filter < 15)).values]
df_aux.reset_index(drop=True, inplace=True)
df_aux.plot(legend=False, ax=ax1)
df_first_diff = df_aux - df_aux.shift()
df_first_diff = df_first_diff[df_first_diff != 0]
f_sharpe = df_first_diff.mean()/df_first_diff.std()
ax1.set_title('$k = {}$ | $sharpe = {:0.2f}$'.format(s_key, f_sharpe), fontsize=10)
ax1.xaxis.set_ticklabels([])
ax1.set_ylabel('PnL', fontsize=8)
ax1.set_xlabel('Time', fontsize=8)
f.tight_layout()
s_title = 'Cumulative PnL Changing K\n'
f.suptitle(s_title, fontsize=16, y=1.03);


When the agent was set to use $k=0.8$ and $k=2.0$, it achieved very similar results and Sharpe ratios. As the variable $k$ control the likelihood of the agent try new actions based on the Q value already observed, I will prefer the smallest value because it improves the chance of the agent to explore. Now, let's perform the same analysis varying only the $\gamma$:

In [17]:
# improving Gamma
s_fname = 'log/train_test/sim_Thu_Oct__6_154516_2016.log'
%time d_rtn_gammas = eda.count_by_k_gamma(s_fname, 'LearningAgent_k', 'gamma')

CPU times: user 41.4 s, sys: 140 ms, total: 41.5 s
Wall time: 42.8 s

In [76]:
import pandas as pd
import matplotlib.pyplot as plt

f, na_ax = plt.subplots(1, 4, sharex=True, sharey=True)
for ax1, s_key in zip(na_ax.ravel(), ['0.3', '0.5', '0.7', '0.9']):
df_aux = pd.Series(d_rtn_gammas[s_key][5])
df_filter = pd.Series([x.hour for x in df_aux.index])
df_aux = df_aux[((df_filter < 15)).values]
df_aux.reset_index(drop=True, inplace=True)
df_aux.plot(legend=False, ax=ax1)
df_first_diff = df_aux - df_aux.shift()
f_sharpe = df_first_diff.mean()/df_first_diff.std()
ax1.set_title('$\gamma = {}$ | $sharpe = {:0.2f}$'.format(s_key, f_sharpe), fontsize=10)
ax1.xaxis.set_ticklabels([])
ax1.set_ylabel('PnL', fontsize=8)
ax1.set_xlabel('Time Step', fontsize=8)
f.tight_layout()
s_title = 'Cumulative PnL Changing Gamma\n'
f.suptitle(s_title, fontsize=16, y=1.03);


As explained before, as $\gamma$ approaches one, future rewards are given greater emphasis about the immediate reward. When it is zero, only immediate rewards is considered. Despite the fact that the best parameter was $\gamma = 0.9$, I am not comfortable in giving so little attention to immediate rewards. It sounds dangerous when we talk about stock markets. So, I will choose to use $\gamma = 0.5$ arbitrarily in the next tests. In the figure below, the agent is trained using $\gamma=0.5$ and $k=0.8$. [the next chart is not used in the final version]

In [29]:
# analyze the logs from the in-sample tests
# s_fname = 'log/train_test/sim_Fri_Oct__7_002946_2016.log'  # 15 old
s_fname = 'log/train_test/sim_Wed_Oct__5_110344_2016.log'  # 15
# s_fname = 'log/train_test/sim_Thu_Oct__6_165539_2016.log'  # 25
# s_fname = 'log/train_test/sim_Thu_Oct__6_175507_2016.log'  # 35
# s_fname = 'log/train_test/sim_Thu_Oct__6_183555_2016.log'  # 5
%time d_rtn_train_2 = eda.simple_counts(s_fname, 'LearningAgent_k')

CPU times: user 40.8 s, sys: 331 ms, total: 41.1 s
Wall time: 41.6 s

In [163]:
import qtrader.eda as eda; reload(eda);
eda.plot_train_test_sim(d_rtn)

In [103]:
# analyze the logs from the out-of-sample tests
s_fname = 'log/train_test/sim_Fri_Oct__7_003943_2016.log'  # idx = 15 old
%time d_rtn_test_1 = eda.simple_counts(s_fname, 'LearningAgent_k')

CPU times: user 1.91 s, sys: 13.9 ms, total: 1.92 s
Wall time: 1.96 s

In [16]:
# analyze the logs from the out-of-sample tests
s_fname = 'log/train_test/sim_Wed_Oct__5_111812_2016.log'  # idx = 15
%time d_rtn_test_2 = eda.simple_counts(s_fname, 'LearningAgent_k')

CPU times: user 1.8 s, sys: 9.08 ms, total: 1.81 s
Wall time: 1.82 s

In [105]:
# compare the old with the data using the new configuration
import pandas as pd
df_plot = pd.DataFrame(d_rtn_test_1['pnl']['test']).mean(axis=1).fillna(method='ffill')
ax1 = df_plot.plot(legend=True, label='old')
df_plot = pd.DataFrame(d_rtn_test_2['pnl']['test']).mean(axis=1).fillna(method='ffill')
df_plot.plot(legend=True, label='new', ax=ax1)
ax1.set_title('Cumulative PnL Produced by New\nand Old Configurations')
ax1.set_xlabel('Time')
ax1.set_ylabel('PnL');