In this case study, we use clustering methods to select pairs for a pairs trading strategy. Our goal is to perform clustering analysis on the stocks of the S&P 500 and come up with candidate pairs.
The S&P 500 stock data was obtained from Yahoo Finance using pandas_datareader and includes price data from 2018 onwards.
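For reference, a minimal sketch of how such a price file could be assembled with pandas_datareader is shown below; the ticker list, date range, and output file name are illustrative assumptions, and the Yahoo Finance backend has changed over time, so a substitute such as yfinance may be needed.
# Sketch only (not from the original notebook): download adjusted close prices for a few
# illustrative tickers with pandas_datareader and save them in the same CSV layout.
import pandas_datareader as dr
sp500_tickers = ['ABT', 'ABBV', 'ABMD', 'ACN']  # hypothetical subset; the full S&P 500 list is assumed
prices = dr.DataReader(sp500_tickers, data_source='yahoo', start='2018-01-01', end='2019-10-01')['Adj Close']
prices.to_csv('SP500Data.csv')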
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import datetime
import pandas_datareader as dr
#Import Model Packages
from sklearn.cluster import KMeans, AgglomerativeClustering,AffinityPropagation, DBSCAN
from scipy.cluster.hierarchy import fcluster
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
from scipy.spatial.distance import pdist
from sklearn.metrics import adjusted_mutual_info_score
from sklearn import cluster, covariance, manifold
#Other Helper Packages and functions
import matplotlib.ticker as ticker
from itertools import cycle
#The data already obtained from yahoo finance is imported.
dataset = read_csv('SP500Data.csv',index_col=0)
#Disable the warnings
import warnings
warnings.filterwarnings('ignore')
type(dataset)
pandas.core.frame.DataFrame
# shape
dataset.shape
(448, 502)
# peek at data
set_option('display.width', 100)
dataset.head(5)
| Date | ABT | ABBV | ABMD | ACN | ATVI | ADBE | AMD | AAP | AES | AMG | ... | WLTW | WYNN | XEL | XRX | XLNX | XYL | YUM | ZBH | ZION | ZTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-01-02 | 58.790001 | 98.410004 | 192.490005 | 153.839996 | 64.309998 | 177.699997 | 10.98 | 106.089996 | 10.88 | 203.039993 | ... | 146.990005 | 164.300003 | 47.810001 | 29.370001 | 67.879997 | 68.070000 | 81.599998 | 124.059998 | 50.700001 | 71.769997 |
| 2018-01-03 | 58.919998 | 99.949997 | 195.820007 | 154.550003 | 65.309998 | 181.039993 | 11.55 | 107.050003 | 10.87 | 202.119995 | ... | 149.740005 | 162.520004 | 47.490002 | 29.330000 | 69.239998 | 68.900002 | 81.529999 | 124.919998 | 50.639999 | 72.099998 |
| 2018-01-04 | 58.820000 | 99.379997 | 199.250000 | 156.380005 | 64.660004 | 183.220001 | 12.12 | 111.000000 | 10.83 | 198.539993 | ... | 151.259995 | 163.399994 | 47.119999 | 29.690001 | 70.489998 | 69.360001 | 82.360001 | 124.739998 | 50.849998 | 72.529999 |
| 2018-01-05 | 58.990002 | 101.110001 | 202.320007 | 157.669998 | 66.370003 | 185.339996 | 11.88 | 112.180000 | 10.87 | 199.470001 | ... | 152.229996 | 164.490005 | 46.790001 | 29.910000 | 74.150002 | 69.230003 | 82.839996 | 125.980003 | 50.869999 | 73.360001 |
| 2018-01-08 | 58.820000 | 99.489998 | 207.800003 | 158.929993 | 66.629997 | 185.039993 | 12.28 | 111.389999 | 10.87 | 200.529999 | ... | 151.410004 | 162.300003 | 47.139999 | 30.260000 | 74.639999 | 69.480003 | 82.980003 | 126.220001 | 50.619999 | 74.239998 |

5 rows × 502 columns
# describe data
set_option('display.precision', 3)
dataset.describe()
| | MMM | AXP | AAPL | BA | CAT | CVX | CSCO | KO | DIS | DWDP | ... | NKE | PFE | PG | TRV | UTX | UNH | VZ | V | WMT | WBA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 363.000 | ... | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 2741.000 | 4804.000 | 4804.000 |
| mean | 86.769 | 49.659 | 49.107 | 85.482 | 56.697 | 61.735 | 21.653 | 24.984 | 46.368 | 64.897 | ... | 23.724 | 20.737 | 49.960 | 55.961 | 62.209 | 64.418 | 27.193 | 53.323 | 50.767 | 41.697 |
| std | 53.942 | 22.564 | 55.020 | 79.085 | 34.663 | 31.714 | 10.074 | 10.611 | 32.733 | 5.768 | ... | 20.988 | 7.630 | 19.769 | 34.644 | 32.627 | 62.920 | 11.973 | 37.647 | 17.040 | 19.937 |
| min | 25.140 | 8.713 | 0.828 | 17.463 | 9.247 | 17.566 | 6.842 | 11.699 | 11.018 | 49.090 | ... | 2.595 | 8.041 | 16.204 | 13.287 | 14.521 | 5.175 | 11.210 | 9.846 | 30.748 | 17.317 |
| 25% | 51.192 | 34.079 | 3.900 | 37.407 | 26.335 | 31.820 | 14.910 | 15.420 | 22.044 | 62.250 | ... | 8.037 | 15.031 | 35.414 | 29.907 | 34.328 | 23.498 | 17.434 | 18.959 | 38.062 | 27.704 |
| 50% | 63.514 | 42.274 | 23.316 | 58.437 | 53.048 | 56.942 | 18.578 | 20.563 | 29.521 | 66.586 | ... | 14.147 | 18.643 | 46.735 | 39.824 | 55.715 | 42.924 | 21.556 | 45.207 | 42.782 | 32.706 |
| 75% | 122.906 | 66.816 | 84.007 | 112.996 | 76.488 | 91.688 | 24.650 | 34.927 | 75.833 | 69.143 | ... | 36.545 | 25.403 | 68.135 | 80.767 | 92.557 | 73.171 | 38.996 | 76.966 | 65.076 | 58.165 |
| max | 251.981 | 112.421 | 231.260 | 411.110 | 166.832 | 128.680 | 63.698 | 50.400 | 117.973 | 75.261 | ... | 85.300 | 45.841 | 98.030 | 146.564 | 141.280 | 286.330 | 60.016 | 150.525 | 107.010 | 90.188 |

8 rows × 30 columns
We will take a detailed look at visualization after performing the clustering.
Next we check for NAs in the data; we will drop the columns that are mostly empty and fill the remaining missing values.
#Checking for any null values and removing the null values
print('Null Values =',dataset.isnull().values.any())
Null Values = True
Getting rid of the columns with more than 30% missing values.
missing_fractions = dataset.isnull().mean().sort_values(ascending=False)
missing_fractions.head(10)
drop_list = sorted(list(missing_fractions[missing_fractions > 0.3].index))
dataset.drop(labels=drop_list, axis=1, inplace=True)
dataset.shape
(448, 498)
Since null values remain, we fill them rather than dropping rows: each missing value is replaced with the last available value in its column (forward fill).
# Fill the missing values with the last value available in the dataset.
dataset=dataset.fillna(method='ffill')
dataset.head(2)
| Date | ABT | ABBV | ABMD | ACN | ATVI | ADBE | AMD | AAP | AES | AMG | ... | WLTW | WYNN | XEL | XRX | XLNX | XYL | YUM | ZBH | ZION | ZTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-01-02 | 58.790001 | 98.410004 | 192.490005 | 153.839996 | 64.309998 | 177.699997 | 10.98 | 106.089996 | 10.88 | 203.039993 | ... | 146.990005 | 164.300003 | 47.810001 | 29.370001 | 67.879997 | 68.070000 | 81.599998 | 124.059998 | 50.700001 | 71.769997 |
| 2018-01-03 | 58.919998 | 99.949997 | 195.820007 | 154.550003 | 65.309998 | 181.039993 | 11.55 | 107.050003 | 10.87 | 202.119995 | ... | 149.740005 | 162.520004 | 47.490002 | 29.330000 | 69.239998 | 68.900002 | 81.529999 | 124.919998 | 50.639999 | 72.099998 |

2 rows × 498 columns
For the purpose of clustering, we will use annual returns and volatility as the variables, since they are indicators of a stock's performance and risk. Let us prepare the return and volatility variables from the data.
#Calculate average annual percentage return and volatilities over a theoretical one year period
returns = dataset.pct_change().mean() * 252
returns = pd.DataFrame(returns)
returns.columns = ['Returns']
returns['Volatility'] = dataset.pct_change().std() * np.sqrt(252)
data=returns
#format the data as a numpy array to feed into the K-Means algorithm
#data = np.asarray([np.asarray(returns['Returns']),np.asarray(returns['Volatility'])]).T
All the variables should be on the same scale before applying clustering, otherwise a feature with large values will dominate the result. We use StandardScaler in sklearn to standardize the dataset’s features onto unit scale (mean = 0 and variance = 1).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(data)
rescaledDataset = pd.DataFrame(scaler.fit_transform(data),columns = data.columns, index = data.index)
# summarize transformed data
rescaledDataset.head(2)
X=rescaledDataset
X.head(2)
| | Returns | Volatility |
|---|---|---|
| ABT | 0.794067 | -0.702741 |
| ABBV | -0.927603 | 0.794867 |
Each row (stock) is an item to be clustered, and the columns (returns and volatility) are the variables used for clustering, so the data is in the right format to be fed to the clustering algorithms.
We will look at the following models: k-means clustering, hierarchical (agglomerative) clustering, and affinity propagation.
To pick the number of clusters for k-means, we look at the following metrics: the within-cluster sum of squared errors (elbow method) and the silhouette score.
distortions = []
max_loop=20
for k in range(2, max_loop):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    distortions.append(kmeans.inertia_)
fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, max_loop), distortions)
plt.xticks([i for i in range(2, max_loop)], rotation=75)
plt.grid(True)
Inspecting the sum of squared errors (SSE) chart, the elbow "kink" appears to occur at 5 or 6 clusters for this data. Certainly, we can see that as the number of clusters increases past 6, the within-cluster sum of squared errors plateaus.
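As a cross-check on the visual reading, one simple heuristic is to pick the k where the SSE curve bends the most, i.e. where its discrete second difference is largest. A minimal sketch, assuming the distortions list computed above, follows.
# Heuristic sketch: locate the elbow as the point of maximum curvature of the SSE curve.
# Assumes `distortions` holds the inertia for k = 2, ..., max_loop - 1 (computed above).
import numpy as np
sse = np.array(distortions)
second_diff = sse[:-2] - 2 * sse[1:-1] + sse[2:]  # discrete second difference (curvature)
elbow_k = int(np.argmax(second_diff)) + 3         # sse[0] corresponds to k=2; curvature at index i refers to k=i+3
print('Heuristic elbow at k =', elbow_k)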
from sklearn import metrics
silhouette_score = []
for k in range(2, max_loop):
    kmeans = KMeans(n_clusters=k, random_state=10, n_init=10, n_jobs=-1)
    kmeans.fit(X)
    silhouette_score.append(metrics.silhouette_score(X, kmeans.labels_, random_state=10))
fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, max_loop), silhouette_score)
plt.xticks([i for i in range(2, max_loop)], rotation=75)
plt.grid(True)
From the silhouette score chart, we can see kinks at several points of the graph. Since there is not much of a difference in SSE beyond 6 clusters, we prefer 6 clusters in the k-means model.
Let us build the k-means model with six clusters and visualize the results.
nclust=6
#Fit with k-means
k_means = cluster.KMeans(n_clusters=nclust)
k_means.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=6, n_init=10, n_jobs=None, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)
#Extracting labels
target_labels = k_means.predict(X)
Visualizing how the clusters are formed is no easy task when the number of variables/dimensions in the dataset is very large. Here we have only two variables (return and volatility), so we can visualize the clusters directly in two-dimensional space with a scatter plot, coloring each stock by its cluster label.
centroids = k_means.cluster_centers_
fig = plt.figure(figsize=(16,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0],X.iloc[:,1], c = k_means.labels_, cmap ="rainbow", label = X.index)
ax.set_title('k-Means results')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)
plt.plot(centroids[:,0],centroids[:,1],'sg',markersize=11)
Let us check the elements of the clusters
# show number of stocks in each cluster
clustered_series = pd.Series(index=X.index, data=k_means.labels_.flatten())
# clustered stock with its cluster label
clustered_series_all = pd.Series(index=X.index, data=k_means.labels_.flatten())
clustered_series = clustered_series[clustered_series != -1]
plt.figure(figsize=(12,7))
plt.barh(
range(len(clustered_series.value_counts())), # cluster labels, y axis
clustered_series.value_counts()
)
plt.title('Cluster Member Counts')
plt.xlabel('Stocks in Cluster')
plt.ylabel('Cluster Number')
plt.show()
The number of stocks in a cluster ranges from around 40 to 120. Although the distribution is not equal, we have a significant number of stocks in each cluster. A quick way to drill into any one cluster is sketched below.
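This minimal sketch lists the members of a single k-means cluster; the cluster number 0 is just an illustrative choice.
# Sketch: list the tickers assigned to one k-means cluster.
# Assumes `clustered_series` (index = tickers, values = cluster labels) from above.
which_cluster = 0  # illustrative cluster number
members = clustered_series[clustered_series == which_cluster].index.tolist()
print('Cluster %d contains %d stocks:' % (which_cluster, len(members)))
print(members[:10])  # first few tickers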
In the first step of hierarchical clustering, we look at the hierarchy graph and check for the number of clusters.
The hierarchy class has a dendrogram method which takes the value returned by the linkage method of the same class. The linkage method takes the dataset and the method to minimize distances as parameters. We use 'ward' as the method, since it merges clusters so as to minimize the increase in within-cluster variance.
from scipy.cluster.hierarchy import dendrogram, linkage, ward
#Calculate linkage
Z= linkage(X, method='ward')
Z[0]
array([3.30000000e+01, 3.14000000e+02, 3.62580431e-03, 2.00000000e+00])
The best way to visualize an agglomerative clustering algorithm is through a dendrogram, which displays a cluster tree, the leaves being the individual stocks and the root being the final single cluster. The "distance" between clusters is shown on the y-axis; thus, the longer the branches are, the more dissimilar the merged clusters are.
#Plot Dendrogram
plt.figure(figsize=(10, 7))
plt.title("Stocks Dendrograms")
dendrogram(Z,labels = X.index)
plt.show()
Once one big cluster is formed, the longest vertical distance without any horizontal line passing through it is selected, and a horizontal line is drawn through it. The number of vertical lines this newly created horizontal line crosses equals the number of clusters. We then select the distance threshold at which to cut the dendrogram to obtain the chosen clustering level; the output is a cluster label for each row of data. As expected from the dendrogram, a cut at 13 gives us four clusters.
distance_threshold = 13
clusters = fcluster(Z, distance_threshold, criterion='distance')
chosen_clusters = pd.DataFrame(data=clusters, columns=['cluster'])
chosen_clusters['cluster'].unique()
array([1, 4, 3, 2], dtype=int64)
nclust = 4
hc = AgglomerativeClustering(n_clusters=nclust, affinity = 'euclidean', linkage = 'ward')
clust_labels1 = hc.fit_predict(X)
fig = plt.figure(figsize=(16,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0],X.iloc[:,1], c =clust_labels1, cmap ="rainbow")
ax.set_title('Hierarchical Clustering')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)
Similar to the plot of k-means clustering, we see that there are some distinct clusters separated by different colors.
Next, we build the clusters using affinity propagation, which determines the number of clusters itself rather than requiring it to be specified in advance.
ap = AffinityPropagation()
ap.fit(X)
clust_labels2 = ap.predict(X)
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0],X.iloc[:,1], c =clust_labels2, cmap ="rainbow")
ax.set_title('Affinity')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)
As with the k-means and hierarchical clustering plots, we see distinct clusters separated by different colors, although affinity propagation produces considerably more clusters.
cluster_centers_indices = ap.cluster_centers_indices_
labels = ap.labels_
no_clusters = len(cluster_centers_indices)
print('Estimated number of clusters: %d' % no_clusters)
# Plot exemplars
X_temp=np.asarray(X)
plt.close('all')
plt.figure(1)
plt.clf()
fig = plt.figure(figsize=(8,6))
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(no_clusters), colors):
    class_members = labels == k
    cluster_center = X_temp[cluster_centers_indices[k]]
    plt.plot(X_temp[class_members, 0], X_temp[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=14)
    for x in X_temp[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
plt.show()
Estimated number of clusters: 27
# show number of stocks in each cluster
clustered_series_ap = pd.Series(index=X.index, data=ap.labels_.flatten())
# clustered stock with its cluster label
clustered_series_all_ap = pd.Series(index=X.index, data=ap.labels_.flatten())
clustered_series_ap = clustered_series_ap[clustered_series_ap != -1]
plt.figure(figsize=(12,7))
plt.barh(
range(len(clustered_series_ap.value_counts())), # cluster labels, y axis
clustered_series_ap.value_counts()
)
plt.title('Cluster Member Counts')
plt.xlabel('Stocks in Cluster')
plt.ylabel('Cluster Number')
plt.show()
If the ground truth labels are not known, evaluation must be performed using the model itself. The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an example of such an evaluation, where a higher Silhouette Coefficient relates to a model with better-defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores: a, the mean distance between the sample and all other points in the same cluster, and b, the mean distance between the sample and all points in the next nearest cluster. The coefficient for a single sample is then (b − a) / max(a, b).
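To make the definition concrete, the sketch below (assuming X and the fitted k_means from above) computes the coefficient for a single sample by hand and compares it with sklearn's silhouette_samples; the choice of sample index 0 is arbitrary.
# Sketch: silhouette coefficient of one sample computed directly from its definition.
# Assumes X (standardized returns/volatility) and the fitted k_means from above.
import numpy as np
from sklearn.metrics import silhouette_samples
labels = k_means.labels_
Xv = X.values
i = 0                                                              # illustrative sample index
own = Xv[labels == labels[i]]
a = np.linalg.norm(own - Xv[i], axis=1).sum() / (len(own) - 1)     # mean distance to other points in own cluster
b = min(np.linalg.norm(Xv[labels == k] - Xv[i], axis=1).mean()
        for k in set(labels) if k != labels[i])                    # mean distance to the nearest other cluster
s_manual = (b - a) / max(a, b)
print(s_manual, silhouette_samples(Xv, labels)[i])                 # the two values should match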
from sklearn import metrics
print("km", metrics.silhouette_score(X, k_means.labels_, metric='euclidean'))
print("hc", metrics.silhouette_score(X, hc.fit_predict(X), metric='euclidean'))
print("ap", metrics.silhouette_score(X, ap.labels_, metric='euclidean'))
km 0.3350720873411941
hc 0.3432149515640865
ap 0.3450647315156527
Given that affinity propagation performs best on this measure, we go ahead with it and use the 27 clusters specified by this clustering method.
To understand the intuition behind the clustering, let us visualize the results of the clusters.
# all stock with its cluster label (including -1)
clustered_series = pd.Series(index=X.index, data=ap.fit_predict(X).flatten())
# clustered stock with its cluster label
clustered_series_all = pd.Series(index=X.index, data=ap.fit_predict(X).flatten())
clustered_series = clustered_series[clustered_series != -1]
# get the number of stocks in each cluster
counts = clustered_series_ap.value_counts()
# let's visualize some clusters
cluster_vis_list = list(counts[(counts<25) & (counts>1)].index)[::-1]
cluster_vis_list
[11, 25, 16, 20, 15, 2, 0, 5, 19, 17, 22, 21, 24, 10, 9, 13]
CLUSTER_SIZE_LIMIT = 9999
counts = clustered_series.value_counts()
ticker_count_reduced = counts[(counts>1) & (counts<=CLUSTER_SIZE_LIMIT)]
print ("Clusters formed: %d" % len(ticker_count_reduced))
print ("Pairs to evaluate: %d" % (ticker_count_reduced*(ticker_count_reduced-1)).sum())
Clusters formed: 26
Pairs to evaluate: 12166
# plot a handful of the smallest clusters
plt.figure(figsize=(12,7))
cluster_vis_list[0:min(len(cluster_vis_list), 4)]
[11, 25, 16, 20]
for clust in cluster_vis_list[0:min(len(cluster_vis_list), 4)]:
    tickers = list(clustered_series[clustered_series==clust].index)
    means = np.log(dataset.loc[:"2018-02-01", tickers].mean())
    data = np.log(dataset.loc[:"2018-02-01", tickers]).sub(means)
    data.plot(title='Stock Time Series for Cluster %d' % clust)
plt.show()
Looking at the charts above, across all the clusters with a small number of stocks, we see similar movement of the stocks within each cluster, which corroborates the effectiveness of the clustering technique. Within each cluster, we now search for cointegrated pairs using the Engle-Granger cointegration test (coint from statsmodels), following the approach of the Quantopian pairs-trading lecture referenced in the function below.
def find_cointegrated_pairs(data, significance=0.05):
    # This function is from https://www.quantopian.com/lectures/introduction-to-pairs-trading
    n = data.shape[1]
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    keys = data.keys()
    pairs = []
    for i in range(1):  # note: only the first column is tested against the others; use range(n) to test every pair
        for j in range(i+1, n):
            S1 = data[keys[i]]
            S2 = data[keys[j]]
            result = coint(S1, S2)
            score = result[0]
            pvalue = result[1]
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < significance:
                pairs.append((keys[i], keys[j]))
    return score_matrix, pvalue_matrix, pairs
from statsmodels.tsa.stattools import coint
cluster_dict = {}
for i, which_clust in enumerate(ticker_count_reduced.index):
    tickers = clustered_series[clustered_series == which_clust].index
    score_matrix, pvalue_matrix, pairs = find_cointegrated_pairs(
        dataset[tickers]
    )
    cluster_dict[which_clust] = {}
    cluster_dict[which_clust]['score_matrix'] = score_matrix
    cluster_dict[which_clust]['pvalue_matrix'] = pvalue_matrix
    cluster_dict[which_clust]['pairs'] = pairs
pairs = []
for clust in cluster_dict.keys():
    pairs.extend(cluster_dict[clust]['pairs'])
print ("Number of pairs found : %d" % len(pairs))
print ("In those pairs, there are %d unique tickers." % len(np.unique(pairs)))
Number of pairs found : 32
In those pairs, there are 47 unique tickers.
pairs
[('AOS', 'FITB'), ('AOS', 'ZION'), ('AIG', 'TEL'), ('ABBV', 'BWA'), ('AFL', 'ARE'), ('AFL', 'ED'), ('AFL', 'MMC'), ('AFL', 'WM'), ('ACN', 'EQIX'), ('A', 'WAT'), ('ADBE', 'ADI'), ('ADBE', 'CDNS'), ('ADBE', 'VFC'), ('ABT', 'AZO'), ('ABT', 'CHD'), ('ABT', 'IQV'), ('ABT', 'WELL'), ('ALL', 'GL'), ('MO', 'CCL'), ('ALB', 'CTL'), ('ALB', 'FANG'), ('ALB', 'EOG'), ('ALB', 'HP'), ('ALB', 'NOV'), ('ALB', 'PVH'), ('ALB', 'TPR'), ('ADSK', 'ULTA'), ('ADSK', 'XLNX'), ('AAL', 'FCX'), ('CMG', 'EW'), ('CMG', 'KEYS'), ('XEC', 'DXC')]
from sklearn.manifold import TSNE
import matplotlib.cm as cm
stocks = list(np.unique(pairs))
X_pairs = X.loc[stocks]
in_pairs_series = clustered_series.loc[stocks]
X_tsne = TSNE(learning_rate=50, perplexity=3, random_state=1337).fit_transform(X_pairs)
plt.figure(1, facecolor='white',figsize=(16,8))
plt.clf()
plt.axis('off')
for pair in pairs:
    ticker1 = pair[0]
    loc1 = X_pairs.index.get_loc(pair[0])
    x1, y1 = X_tsne[loc1, :]
    ticker2 = pair[1]
    loc2 = X_pairs.index.get_loc(pair[1])
    x2, y2 = X_tsne[loc2, :]
    plt.plot([x1, x2], [y1, y2], 'k-', alpha=0.3, c='gray')
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=220, alpha=0.9, c=in_pairs_series.values, cmap=cm.Paired)
plt.title('T-SNE Visualization of Validated Pairs')
# annotate each point with its ticker
for x, y, name in zip(X_tsne[:, 0], X_tsne[:, 1], X_pairs.index):
    plt.annotate(name,                       # the text to show
                 (x, y),                     # the point to label
                 textcoords="offset points", # position the text relative to the point
                 xytext=(0, 10),             # offset from the point
                 ha='center')                # horizontal alignment
plt.plot(centroids[:,0],centroids[:,1],'sg',markersize=11)  # note: these k-means centroids live in (return, volatility) space, not t-SNE coordinates
Conclusion
The clustering techniques do not directly help in stock trend prediction. However, they can be effectively used in portfolio construction for finding the right pairs, which eventually helps in risk mitigation and in achieving superior risk-adjusted returns.
We showed the approaches to finding the appropriate number of clusters in k-means and built a hierarchy graph in hierarchical clustering. A next step from this case study would be to explore and backtest various long/short trading strategies with pairs of stocks from the groupings of stocks.
Clustering can effectively be used for dividing stocks into groups with “similar characteristics” for many other kinds of trading strategies and can help in portfolio construction to ensure we choose a universe of stocks with sufficient diversification between them.
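As a pointer toward that next step, below is a minimal sketch of a spread z-score signal for one of the pairs identified above; the pair ('AOS', 'FITB'), the 20-day window, the OLS hedge ratio, and the ±2 entry thresholds are all illustrative assumptions rather than a tested strategy.
# Sketch of a simple pairs-trading signal (illustrative only, not a backtest).
# Assumes `dataset` (prices) and the pair ('AOS', 'FITB') found above.
import numpy as np
import pandas as pd
s1, s2 = dataset['AOS'], dataset['FITB']
hedge_ratio = np.polyfit(s2, s1, 1)[0]                       # illustrative OLS hedge ratio
spread = s1 - hedge_ratio * s2
zscore = (spread - spread.rolling(20).mean()) / spread.rolling(20).std()
signal = pd.Series(0, index=spread.index)
signal[zscore < -2] = 1                                      # long the spread when it is unusually low
signal[zscore > 2] = -1                                      # short the spread when it is unusually high
print(signal.value_counts())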