What is the optimal way of constructing a portfolio? One that (1) consistently generates wealth while minimizing potential losses and (2) is robust against large market fluctuations and economic downturns? I will explore these questions in some depth in a three-part **Hedgecraft** series. While the series is aimed at a technical audience, my intention is to break the technical concepts down into digestible, bite-sized pieces suitable for a general audience, institutional investors, and others alike. The approach documented in this notebook is (to my knowledge) novel. To this end, the ultimate goal of the project is an end-to-end FinTech/portfolio-management product.

Using insights from Network Science, we build a centrality-based risk model for generating portfolio asset weights. The model is trained with the daily prices of 31 stocks from 2006-2014 and validated in years 2015, 2016, and 2017. As a benchmark, we compare the model with a portfolio constructed with Modern Portfolio Theory (MPT). Our proposed asset allocation algorithm significantly outperformed both the DJIA and S&P 500 indexes in every validation year with an average annual return rate of 38.7%, an 18.85% annual volatility, a 1.95 Sharpe ratio, a -12.22% maximum drawdown, a return over maximum drawdown of 9.75, and a growth-risk-ratio of 4.32. In comparison, the MPT portfolio had a 9.64% average annual return rate, a 16.4% annual standard deviation, a Sharpe ratio of 0.47, a maximum drawdown of -20.32%, a return over maximum drawdown of 1.5, and a growth-risk-ratio of 0.69.

In this series we play the part of an Investment Data Scientist at Bridgewater Associates performing a go/no-go analysis on a new idea for risk-weighted asset allocation. Our aim is to develop a network-based model for generating asset weights such that the probability of losing money in any given year is minimized. We've heard through the grapevine that all go-decisions will be presented to Dalio's inner circle at the end of the week and will likely be subject to intense scrutiny. As such, we work with a few highly correlated assets with strict go/no-go criteria. We build the model using the daily prices of each stock (with a few replacements*) in the Dow Jones Industrial Average (DJIA). If our recommended portfolio either (1) loses money in **any** year, (2) does not outperform the market **every** year, or (3) does not outperform the MPT portfolio---the decision is no go.

- We replaced Visa (V), DowDuPont (DWDP), and Walgreens (WBA) with three alpha generators: Google (GOOGL), Amazon (AMZN), and Altaba (AABA); for the sake of model building, we also include one poor-performing stock: General Electric (GE). The dataset is found on Kaggle.

The building blocks of a portfolio are assets (resources with economic value expected to increase over time). Each asset belongs to one of seven primary asset classes: cash, equity, fixed income, commodities, real-estate, alternative assets, and more recently, digital (such as cryptocurrency and blockchain). Within each class are different asset types. For example: stocks, index funds, and equity mutual funds all belong to the equity class, while gold, oil, and corn belong to the commodities class. An emerging consensus in the financial sector is this: a portfolio containing assets of many classes and types hedges against potential losses by increasing the number of revenue streams. In general, the more diverse the portfolio, the less likely it is to lose money. Take stocks for example. A diversified stock portfolio contains positions in multiple sectors. We call this *asset diversification*, or more simply *diversification*. Below is a table summarizing the asset classes and some of their respective types.

| Cash | Equity | Fixed Income | Commodities | Real-Estate | Alternative Assets | Digital |
|---|---|---|---|---|---|---|
| US Dollar | US Stocks | US Bonds | Gold | REITs | Structured Credit | Cryptocurrencies |
| Japanese Yen | Foreign Stocks | Foreign Bonds | Oil | Commercial Properties | Liquidations | Security Tokens |
| Chinese Yuan | Index Funds | Deposits | Wheat | Land | Aviation Assets | Online Stores |
| UK Pound | Mutual Funds | Debentures | Corn | Industrial Properties | Collectables | Online Media |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |

**An investor solves the following (asset allocation) problem: given X dollars and N assets, find the best possible way of breaking X into N pieces.** By "best possible" we mean maximizing our returns subject to minimizing the risk to our initial investment. In other words, we aim to consistently grow X irrespective of the overall state of the market. In what follows, we explore provocative insights by Ray Dalio and others on portfolio construction.
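As a toy illustration (with made-up numbers, not part of the model below), the allocation problem amounts to choosing weights that sum to 1 and splitting X accordingly:

```python
import numpy as np

# Hypothetical example: split X = $10,000 across N = 4 assets.
X = 10_000
raw_scores = np.array([0.8, 0.5, 0.3, 0.4])  # made-up attractiveness scores

# Normalize the scores so the allocation weights sum to 1.
weights = raw_scores / raw_scores.sum()

# Dollar amount placed in each asset.
allocation = weights * X

print(weights.sum())   # 1.0
print(allocation)
```

Everything that follows is, in effect, a principled way of producing the `weights` vector.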

Source: Principles by Ray Dalio (Summary)

The above chart depicts the behaviour of a portfolio with increasing diversification. Along the x-axis is the number of asset types. Along the y-axis is how "spread out" the annual returns are. A lower annual standard deviation indicates smaller fluctuations in each revenue stream, and in turn a diminished risk exposure. The "Holy Grail", so to speak, is to (1) find the largest number of assets that are the **least** correlated and (2) allocate X dollars to those assets such that the probability of losing money in any given year is minimized. The underlying principle is this: the portfolio most robust against large market fluctuations and economic downturns is a portfolio with assets that are the **most independent** of each other.
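A minimal simulation illustrates why independence reduces risk. Assuming (unrealistically) that annual returns are independent and normally distributed, the volatility of an equal-weighted portfolio of $N$ independent return streams falls like $\sigma/\sqrt{N}$; the numbers below are made up for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.15  # assumed 15% annual volatility for every stream

vols = {}
for n in [1, 4, 16]:
    # 100,000 simulated "years" of n independent return streams
    returns = rng.normal(0.05, sigma, size=(100_000, n))
    # the equal-weighted portfolio return is the mean across streams
    vols[n] = returns.mean(axis=1).std()
    print(n, round(vols[n], 4))  # ≈ sigma / sqrt(n)
```

Correlated streams shrink this benefit, which is exactly why the model below hunts for the least-correlated assets.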

Before we dive into the meat of our asset allocation model, we first explore, clean, and preprocess our historical price data for time-series analyses. In this section we complete the following.

- Observe how many rows and columns are in our dataset and what they mean.
- Observe the datatypes of the columns and update them if needed.
- Take note of how the data is structured and what preprocessing will be necessary for time-series analyses.
- Deal with any missing data accordingly.
- Rename the stock tickers to the company names for readability.

In [1]:

```
#import data manipulation (pandas) and numerical manipulation (numpy) modules
import pandas as pd
import numpy as np
#silence warnings
import warnings
warnings.filterwarnings("ignore")
#reads the csv file into pandas DataFrame
df = pd.read_csv("all_stocks_2006-01-01_to_2018-01-01.csv")
#prints first 5 rows of the DataFrame
df.head()
```

Out[1]:

- `Date`: date (yyyy-mm-dd)
- `Open`: daily opening prices (USD)
- `High`: daily high prices (USD)
- `Low`: daily low prices (USD)
- `Close`: daily closing prices (USD)
- `Volume`: daily volume (number of shares traded)
- `Name`: ticker (abbreviates the company name)

In [2]:

```
#prints last 5 rows
df.tail()
```

Out[2]:

In [3]:

```
#prints information about the DataFrame
df.info()
```

Some observations:

- The dataset has 93,612 rows and 7 columns.
- The `Date` column is **not** a DateTime object (we need to change this).
- For time-series analyses we need to preprocess the data (we address this in the following section).
- There are missing values in the `Open`, `High`, and `Low` columns (we address this after preprocessing the data).
- We also want to map the tickers (e.g., MMM) to the company names (e.g., 3M).
- Finally, we need to set the index as the date.

In [4]:

```
#changes Date column to a DateTime object
df['Date'] = pd.to_datetime(df['Date'])
```

In [5]:

```
#prints unique tickers in the Name column
print(df['Name'].unique())
```

In [6]:

```
#dictionary of tickers and their respective company names
ticker_mapping = {'AABA': 'Altaba',
                  'AAPL': 'Apple',
                  'AMZN': 'Amazon',
                  'AXP': 'American Express',
                  'BA': 'Boeing',
                  'CAT': 'Caterpillar',
                  'MMM': '3M',
                  'CVX': 'Chevron',
                  'CSCO': 'Cisco Systems',
                  'KO': 'Coca-Cola',
                  'DIS': 'Walt Disney',
                  'XOM': 'Exxon Mobil',
                  'GE': 'General Electric',
                  'GS': 'Goldman Sachs',
                  'HD': 'Home Depot',
                  'IBM': 'IBM',
                  'INTC': 'Intel',
                  'JNJ': 'Johnson & Johnson',
                  'JPM': 'JPMorgan Chase',
                  'MCD': 'McDonald\'s',
                  'MRK': 'Merck',
                  'MSFT': 'Microsoft',
                  'NKE': 'Nike',
                  'PFE': 'Pfizer',
                  'PG': 'Procter & Gamble',
                  'TRV': 'Travelers',
                  'UTX': 'United Technologies',
                  'UNH': 'UnitedHealth',
                  'VZ': 'Verizon',
                  'WMT': 'Walmart',
                  'GOOGL': 'Google'}
#changes the tickers in df to the company names
df['Name'] = df['Name'].map(ticker_mapping)
```

In [7]:

```
#sets the Date column as the index
df.set_index('Date', inplace=True)
```

In this section we do the following.

- Break the data into two pieces: historical prices from 2006-2014 (`df_train`) and from 2015-2017 (`df_validate`). We build our model portfolio with the former and test it with the latter.
- Add a column to `df_train`, `Close_diff`, recording the difference between the daily closing and opening prices.
- Create a separate DataFrame for each of the `Open`, `High`, `Low`, `Close`, and `Close_diff` time series by pivoting the tickers in the `Name` column of `df_train` to the column names and setting the values as the daily prices.
- Transform each time series so that it is stationary, by detrending it with the `pd.Series.diff()` method.
- Finally, remove the missing data.
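Detrending with `diff()` works as follows on a toy series (the prices here are hypothetical):

```python
import pandas as pd

# Toy series with an upward trend (hypothetical prices).
prices = pd.Series([10.0, 11.0, 12.5, 12.0, 13.5])

# .diff() replaces each value with its change from the previous entry,
# removing the trend; the first entry becomes NaN and is dropped later.
detrended = prices.diff()
print(detrended.tolist())  # [nan, 1.0, 1.5, -0.5, 1.5]
```

The first row of every detrended DataFrame is NaN for exactly this reason, which is why the missing data is removed at the end.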

In [8]:

```
#training dataset
df_train = df.loc['2006-01-03':'2015-01-01']
df_train.tail()
```

Out[8]:

In [9]:

```
#validation dataset
df_validate = df.loc['2015-01-01':'2017-12-31']
df_validate.tail()
```

Out[9]:

It's always a good idea to check we didn't lose any data after the split.

In [10]:

```
#returns True if no data was lost after the split and False otherwise.
df_train.shape[0] + df_validate.shape[0] == df.shape[0]
```

Out[10]:

In [11]:

```
# sets each column as a stock and every row as a daily closing price
df_validate = df_validate.pivot(columns='Name', values='Close')
```

In [12]:

```
df_validate.head()
```

Out[12]:

In [13]:

```
#creates a new column with the difference between the closing and opening prices
df_train['Close_Diff'] = df_train.loc[:,'Close'] - df_train.loc[:,'Open']
```

In [14]:

```
df_train.head()
```

Out[14]:

In [15]:

```
#creates a DataFrame for each time-series (see In [11])
df_train_close = df_train.pivot(columns='Name', values='Close')
df_train_open = df_train.pivot(columns='Name', values='Open')
df_train_close_diff = df_train.pivot(columns='Name', values='Close_Diff')
df_train_high = df_train.pivot(columns='Name', values='High')
df_train_low = df_train.pivot(columns='Name', values='Low')
#makes a copy of the training dataset
df_train_close_copy = df_train_close.copy()
df_train_close.head()
```

Out[15]:

In [16]:

```
#creates a list of stocks
stocks = df_train_close.columns.tolist()
#list of training DataFrames containing each time-series
df_train_list = [df_train_close, df_train_open, df_train_close_diff, df_train_high, df_train_low]
#detrends each time-series for each DataFrame
for df in df_train_list:
    for s in stocks:
        df[s] = df[s].diff()
```

In [17]:

```
df_train_close.head()
```

Out[17]:

In [18]:

```
#counts the missing values in each column
df_train_close.isnull().sum()
```

Out[18]:

Since the number of missing values in every column is a tiny fraction of the total number of rows, we can safely drop them.

In [19]:

```
#drops all missing values in each DataFrame
for df in df_train_list:
    df.dropna(inplace=True)
```

Now that the data is preprocessed we can start thinking our way through the problem creatively. To refresh our memory, let's restate the problem.

**Given the $N$ assets in our portfolio, find a way of computing the allocation weights $w_{i}$, $\Big( \sum_{i=1}^{N}w_{i}=1\Big)$ such that assets more correlated with each other obtain lower weights while those less correlated with each other obtain higher weights.**

One way of tackling the above is to think of our portfolio as a weighted graph. Intuitively, a graph captures the relations between objects -- abstract or concrete. Mathematically, a weighted graph is an ordered tuple $G = (V, E, W)$ where $V$ is a set of *vertices* (or *nodes*), $E$ is the set of pairwise relationships between the vertices (the *edges*), and $W$ is a set of numerical values assigned to each edge.

A useful representation of $G$ is the *adjacency matrix* $A$, with entries

$$A_{ij} = \begin{cases} w_{ij} & \text{if vertices } i \text{ and } j \text{ are connected,} \\ 0 & \text{otherwise.} \end{cases}$$

Here the pairwise relations are expressed as the $ij$ entries of an $N \times N$ matrix where $N$ is the number of nodes. In what follows, the adjacency matrix becomes a critical instrument of our asset allocation algorithm. **Our strategy is to transform the historical pricing data into a graph with edges weighted by the correlations between each stock.** Once the time series data is in this form, we use graph centrality measures and graph algorithms to obtain the desired allocation weights. To construct the weighted graph we adopt the winner-take-all method presented by Tse, *et al*. (2010) with a few modifications. (See Stock Correlation Network for a summary.) Our workflow in this section is as follows.
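As a minimal illustration (with a hypothetical three-stock graph and made-up edge weights), the adjacency matrix can be built directly:

```python
import numpy as np

# Hypothetical 3-stock graph: edge weights stand in for pairwise correlations.
stocks = ["AAA", "BBB", "CCC"]
edges = {("AAA", "BBB"): 0.6, ("BBB", "CCC"): 0.4}

# Map each stock name to a row/column index.
idx = {s: i for i, s in enumerate(stocks)}
A = np.zeros((3, 3))
for (u, v), w in edges.items():
    A[idx[u], idx[v]] = w
    A[idx[v], idx[u]] = w  # undirected graph: A is symmetric

print(A)
```

NetworkX performs this bookkeeping for us in the other direction below, turning a matrix of correlations into a graph object.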

- We compute the *distance correlation matrix* $\rho_{D}(X_{i}, X_{j})$ for the `Open`, `High`, `Low`, `Close`, and `Close_diff` time series.
- We use the NetworkX module to transform each distance correlation matrix into a weighted graph.
- We adopt the winner-take-all method and remove edges with correlations below a threshold value of $\rho_{c} = 0.325$.*
- We inspect the distribution of edges (the so-called degree distribution) of each network. The degree of a node is simply the number of connections it has to other nodes. Algebraically, the degree of the $i$th vertex is given as $k_{i} = \sum_{j=1}^{N} A_{ij}$, where $A_{ij}$ are the entries of the adjacency matrix.
- Finally, we build a master network by averaging over the edge weights of the `Open`, `High`, `Low`, `Close`, and `Close_diff` networks and derive the asset weights from its structure.

*The threshold value of 0.325 is arbitrary. In practice, the threshold cannot be such that the graph is disconnected, as many centrality measures are undefined for nodes without any connections.
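The degree computation in the fourth step can be sketched with a toy unweighted adjacency matrix (the values are illustrative only):

```python
import numpy as np

# Hypothetical unweighted adjacency matrix for a 4-node graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]])

# Degree of node i: k_i = sum_j A_ij, i.e., the row sum.
k = A.sum(axis=1)
print(k)  # [2 3 2 1]
```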

Put simply, distance correlation generalizes Pearson's correlation insofar as it (1) detects both linear and non-linear associations in the data and (2) can be applied to time series of unequal dimension. Below is a comparison of the distance and Pearson correlations.

Distance correlation varies between 0 and 1. A distance correlation close to 0 indicates a pair of time series is **independent**, while values close to 1 indicate a high degree of **dependence**. This is in contrast to Pearson's correlation, which varies between -1 and 1 and can be 0 for time series that are dependent (see Szekely, *et al*. (2017)). What makes distance correlation particularly appealing is that it can be applied to time series of unequal dimension. Since our ultimate goal is to scale the asset allocation algorithm to the *entire* market (with the time series of many assets) and update it in real time, the algorithm must be able to handle time series of arbitrary dimension. The longer-term goal is to observe how an asset correlation network *representative* of the global market evolves in real time and to update the allocation weights in response.
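To see the contrast concretely, the following numpy-only sketch implements the (biased) sample distance correlation from its definition (the dcor library used below computes this for us) and applies it to a nonlinear relationship that Pearson's correlation misses entirely:

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D series (numpy-only sketch)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Pairwise distance matrices for each series.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-center each distance matrix.
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    # Squared distance covariance and variances (V-statistic estimators).
    dcov2 = (A * B).mean()
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

x = np.linspace(-1, 1, 201)
y = x ** 2  # perfectly dependent, but nonlinearly

pearson = np.corrcoef(x, y)[0, 1]
dcor_xy = distance_correlation(x, y)
print(pearson)   # ≈ 0: Pearson misses the dependence
print(dcor_xy)   # clearly > 0: distance correlation detects it
```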

In [20]:

```
#imports the dcor module to calculate distance correlation
import dcor
#function to compute the distance correlation (dcor) matrix from a DataFrame
#and output a DataFrame of dcor values
def df_distance_correlation(df_train):
    #initializes an empty DataFrame
    df_train_dcor = pd.DataFrame(index=stocks, columns=stocks)
    #initializes a counter at zero
    k = 0
    #iterates over the time series of each stock
    for i in stocks:
        #stores the ith time series as a vector
        v_i = df_train.loc[:, i].values
        #iterates over the time series of each stock subject to the counter k
        for j in stocks[k:]:
            #stores the jth time series as a vector
            v_j = df_train.loc[:, j].values
            #computes the dcor coefficient between the ith and jth vectors
            dcor_val = dcor.distance_correlation(v_i, v_j)
            #stores the dcor value at the ij entry of the DataFrame
            df_train_dcor.at[i, j] = dcor_val
            #stores the dcor value at the ji entry of the DataFrame
            df_train_dcor.at[j, i] = dcor_val
        #increments the counter by 1
        k += 1
    #returns a DataFrame of dcor values for every pair of stocks
    return df_train_dcor
```

In [21]:

```
df_train_dcor_list = [df_distance_correlation(df) for df in df_train_list]
```

In [22]:

```
df_train_dcor_list[4].head()
```

Out[22]:

In [23]:

```
#imports the NetworkX module
import networkx as nx
# takes in a preprocessed DataFrame and returns a time-series correlation
# network with pairwise distance correlation values as the edge weights
def build_corr_nx(df_train):
    # converts the distance correlation DataFrame to a numpy matrix with dtype float
    cor_matrix = df_train.values.astype('float')
    # Since dcor ranges between 0 and 1 (0 corresponding to independence and 1
    # corresponding to dependence), 1 - cor_matrix yields values closer to 0
    # for a higher degree of dependence and values closer to 1 for a lower
    # degree of dependence. This results in a network where nodes in close
    # proximity have similar time series and vice versa.
    sim_matrix = 1 - cor_matrix
    # transforms the similarity matrix into a graph with the dcor-derived
    # distances as the edge weights
    G = nx.from_numpy_matrix(sim_matrix)
    # extracts the indices (i.e., the stock names) from the DataFrame
    stock_names = df_train.index.values
    # relabels the nodes of the network with the stock names
    G = nx.relabel_nodes(G, lambda x: stock_names[x])
    # copies G; we need this to delete edges or otherwise modify G
    H = G.copy()
    # iterates over the edges of G (the u-v pairs) and their weights (wt)
    for (u, v, wt) in G.edges.data('weight'):
        # selects edges with dcor values less than or equal to 0.325
        if wt >= 1 - 0.325:
            # removes the edges
            H.remove_edge(u, v)
        # selects self-edges
        if u == v:
            # removes the self-edges
            H.remove_edge(u, v)
    # returns the final stock correlation network
    return H
```

In [24]:

```
#builds the distance correlation networks for the Close, Open, Close_diff, High, and Low time series
H_close = build_corr_nx(df_train_dcor_list[0])
H_open = build_corr_nx(df_train_dcor_list[1])
H_close_diff = build_corr_nx(df_train_dcor_list[2])
H_high = build_corr_nx(df_train_dcor_list[3])
H_low = build_corr_nx(df_train_dcor_list[4])
```

In [25]:

```
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

In [26]:

```
# function to display the network from the distance correlation matrix
def plt_corr_nx(H, title):
    # creates a set of tuples: the edges of H and their corresponding weights
    edges, weights = zip(*nx.get_edge_attributes(H, "weight").items())
    # This draws the network with the Kamada-Kawai path-length cost-function.
    # Nodes are positioned by treating the network as a physical ball-and-spring
    # system. The locations of the nodes are such that the total energy of the
    # system is minimized.
    pos = nx.kamada_kawai_layout(H)
    with sns.axes_style('whitegrid'):
        # figure size and style
        plt.figure(figsize=(12, 9))
        plt.title(title, size=16)
        # computes the degree (number of connections) of each node
        deg = H.degree
        # list of node names
        nodelist = []
        # list of node sizes
        node_sizes = []
        # iterates over deg and appends the node names and degrees
        for n, d in deg:
            nodelist.append(n)
            node_sizes.append(d)
        # draws the nodes
        nx.draw_networkx_nodes(
            H,
            pos,
            node_color="#DA70D6",
            nodelist=nodelist,
            node_size=np.power(node_sizes, 2.33),
            alpha=0.8,
        )
        # node label styles
        nx.draw_networkx_labels(H, pos, font_size=13, font_family="sans-serif", font_weight='bold')
        # color map
        cmap = sns.cubehelix_palette(3, as_cmap=True, reverse=True)
        # draws the edges
        nx.draw_networkx_edges(
            H,
            pos,
            edgelist=edges,
            style="solid",
            edge_color=weights,
            edge_cmap=cmap,
            edge_vmin=min(weights),
            edge_vmax=max(weights),
        )
        # builds a colorbar
        sm = plt.cm.ScalarMappable(
            cmap=cmap,
            norm=plt.Normalize(vmin=min(weights), vmax=max(weights))
        )
        sm._A = []
        plt.colorbar(sm)
        # displays the network without axes
        plt.axis("off")
```

The following visualizations are rendered with the Kamada-Kawai method, which treats each vertex of the graph as a mass and each edge as a spring. The graph is drawn by finding the vertex positions that minimize the total energy of the ball-spring system. The method treats the spring lengths as the weights of the graph, given by `1 - cor_matrix`, where `cor_matrix` is the distance correlation matrix. Nodes separated by large distances reflect smaller correlations between their time series, while nodes separated by small distances reflect larger correlations. In the minimum-energy configuration, vertices with few connections experience a net repulsive force while vertices with many connections feel a net attractive force. As such, nodes with a larger degree (more correlations) fall toward the center of the visualization, while nodes with a smaller degree (fewer correlations) are pushed outwards. For an overview of physics-based graph visualizations see the Force-directed graph drawing wiki.

In [27]:

```
# plots the distance correlation network of the daily closing prices from 2006-2014
plt_corr_nx(H_close, title='Distance Correlation Network of the Daily Closing Prices (2006-2014)')
```

In the above visualization, the sizes of the vertices are proportional to the number of connections they have. The colorbar to the right indicates the degree of dissimilarity (the distance) between the stocks: the larger the value (the lighter the color), the less similar the stocks are. Keeping this in mind, several stocks jump out. **Apple**, **Amazon**, **Altaba**, and **UnitedHealth** all lie on the periphery of the network with the fewest number of correlations above $\rho_{c} = 0.325$. On the other hand, **3M**, **American Express**, **United Technologies**, and **General Electric** sit in the core of the network with the greatest number of connections above $\rho_{c} = 0.325$. It is clear from the closing-prices network that our asset allocation algorithm needs to reward vertices on the periphery and punish those nearing the center. In the next code block we build a function to visualize how the edges of the distance correlation network are distributed.

In [28]:

```
# function to visualize the degree distribution
def hist_plot(network, title, bins, xticks):
    # extracts the degrees of each vertex and stores them as a list
    deg_list = list(dict(network.degree).values())
    # sets the local style
    with plt.style.context('fivethirtyeight'):
        # initializes a figure
        plt.figure(figsize=(9, 6))
        # plots a degree histogram with a kernel density estimate
        sns.distplot(
            deg_list,
            kde=True,
            bins=bins,
            color='darksalmon',
            hist_kws={'alpha': 0.7}
        )
        # turns the grid off
        plt.grid(False)
        # controls the number and spacing of the xticks and yticks
        plt.xticks(xticks, size=11)
        plt.yticks(size=11)
        # removes the figure spines
        sns.despine(left=True, right=True, bottom=True, top=True)
        # labels the y and x axes
        plt.ylabel("Probability", size=15)
        plt.xlabel("Number of Connections", size=15)
        # sets the title
        plt.title(title, size=20)
        # draws a vertical line at the mean degree
        plt.axvline(
            sum(deg_list) / len(deg_list),
            color='darkorchid',
            linewidth=3,
            linestyle='--',
            label='Mean = {:2.0f}'.format(sum(deg_list) / len(deg_list))
        )
        # turns the legend on
        plt.legend(loc=0, fontsize=12)
```

In [29]:

```
# plots the degree histogram of the closing prices network
hist_plot(
H_close,
'Degree Histogram of the Closing Prices Network',
bins=9,
xticks=range(13, 30, 2)
)
```

**Observations**

- The degree distribution is left-skewed.
- The average node is connected to 86.6% of the network.
- Very few nodes are connected to less than 66.6% of the network.
- The kernel density estimation is not a good fit.
- By eyeballing the plot, the degrees appear to follow an *inverse power-law* distribution. (This would be consistent with the findings of Tse, *et al*. (2010).)

In [30]:

```
plt_corr_nx(
H_close_diff,
title='Distance Correlation Network of the Daily Net Change in Price (2006-2014)'
)
```

**Observations**

- The above network has substantially fewer edges than the former.
- Apple, Amazon, Altaba, UnitedHealth, and Merck have the fewest number of correlations above $\rho_{c}$.
- 3M, General Electric, American Express, Walt Disney, and United Technologies have the greatest number of correlations above $\rho_{c}$.
- UnitedHealth is clearly an outlier with only two connections above $\rho_{c}$.

In [31]:

```
hist_plot(
H_close_diff,
'Degree Histogram of the Daily Net Change in Price Network',
bins=9,
xticks=range(2, 30, 2)
)
```

**Observations:**

- The distribution is left-skewed.
- The average node is connected to 73.3% of the network.
- Very few nodes are connected to less than 53.3% of the network.
- The kernel density estimation is a poor fit.
- The degree distribution appears to follow an inverse power-law.

In [32]:

```
plt_corr_nx(
H_high,
title='Distance Correlation Network of the Daily High Prices (2006-2014)'
)
```