In this case study, we use clustering methods to select pairs for a pairs trading strategy. Our goal is to perform clustering analysis on the stocks of the S&P 500 and come up with candidate pairs.
The S&P 500 stock data was obtained from Yahoo Finance using pandas_datareader and includes price data from 2018 onwards.
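For reference, a minimal sketch of how such a price file could be assembled with pandas_datareader is shown below; the ticker list, date range, and output file name are illustrative assumptions, and the Yahoo Finance backend has changed over time, so a substitute such as yfinance may be needed.
# Sketch only (not from the original notebook): download adjusted close prices for a few
# illustrative tickers with pandas_datareader and save them in the same CSV layout.
import pandas_datareader as dr
sp500_tickers = ['ABT', 'ABBV', 'ABMD', 'ACN']  # hypothetical subset; the full S&P 500 list is assumed
prices = dr.DataReader(sp500_tickers, data_source='yahoo', start='2018-01-01', end='2019-10-01')['Adj Close']
prices.to_csv('SP500Data.csv')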
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import datetime
import pandas_datareader as dr
#Import Model Packages
from sklearn.cluster import KMeans, AgglomerativeClustering,AffinityPropagation, DBSCAN
from scipy.cluster.hierarchy import fcluster
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
from scipy.spatial.distance import pdist
from sklearn.metrics import adjusted_mutual_info_score
from sklearn import cluster, covariance, manifold
#Other Helper Packages and functions
import matplotlib.ticker as ticker
from itertools import cycle
#The data already obtained from yahoo finance is imported.
dataset = read_csv('SP500Data.csv',index_col=0)
#Disable the warnings
import warnings
warnings.filterwarnings('ignore')
type(dataset)
pandas.core.frame.DataFrame
# shape
dataset.shape
(448, 502)
# peek at data
set_option('display.width', 100)
dataset.head(5)
| Date | ABT | ABBV | ABMD | ACN | ATVI | ADBE | AMD | AAP | AES | AMG | ... | WLTW | WYNN | XEL | XRX | XLNX | XYL | YUM | ZBH | ZION | ZTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-01-02 | 58.790001 | 98.410004 | 192.490005 | 153.839996 | 64.309998 | 177.699997 | 10.98 | 106.089996 | 10.88 | 203.039993 | ... | 146.990005 | 164.300003 | 47.810001 | 29.370001 | 67.879997 | 68.070000 | 81.599998 | 124.059998 | 50.700001 | 71.769997 |
| 2018-01-03 | 58.919998 | 99.949997 | 195.820007 | 154.550003 | 65.309998 | 181.039993 | 11.55 | 107.050003 | 10.87 | 202.119995 | ... | 149.740005 | 162.520004 | 47.490002 | 29.330000 | 69.239998 | 68.900002 | 81.529999 | 124.919998 | 50.639999 | 72.099998 |
| 2018-01-04 | 58.820000 | 99.379997 | 199.250000 | 156.380005 | 64.660004 | 183.220001 | 12.12 | 111.000000 | 10.83 | 198.539993 | ... | 151.259995 | 163.399994 | 47.119999 | 29.690001 | 70.489998 | 69.360001 | 82.360001 | 124.739998 | 50.849998 | 72.529999 |
| 2018-01-05 | 58.990002 | 101.110001 | 202.320007 | 157.669998 | 66.370003 | 185.339996 | 11.88 | 112.180000 | 10.87 | 199.470001 | ... | 152.229996 | 164.490005 | 46.790001 | 29.910000 | 74.150002 | 69.230003 | 82.839996 | 125.980003 | 50.869999 | 73.360001 |
| 2018-01-08 | 58.820000 | 99.489998 | 207.800003 | 158.929993 | 66.629997 | 185.039993 | 12.28 | 111.389999 | 10.87 | 200.529999 | ... | 151.410004 | 162.300003 | 47.139999 | 30.260000 | 74.639999 | 69.480003 | 82.980003 | 126.220001 | 50.619999 | 74.239998 |

5 rows × 502 columns
# describe data
set_option('display.precision', 3)
dataset.describe()
| | MMM | AXP | AAPL | BA | CAT | CVX | CSCO | KO | DIS | DWDP | ... | NKE | PFE | PG | TRV | UTX | UNH | VZ | V | WMT | WBA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 363.000 | ... | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 4804.000 | 2741.000 | 4804.000 | 4804.000 |
| mean | 86.769 | 49.659 | 49.107 | 85.482 | 56.697 | 61.735 | 21.653 | 24.984 | 46.368 | 64.897 | ... | 23.724 | 20.737 | 49.960 | 55.961 | 62.209 | 64.418 | 27.193 | 53.323 | 50.767 | 41.697 |
| std | 53.942 | 22.564 | 55.020 | 79.085 | 34.663 | 31.714 | 10.074 | 10.611 | 32.733 | 5.768 | ... | 20.988 | 7.630 | 19.769 | 34.644 | 32.627 | 62.920 | 11.973 | 37.647 | 17.040 | 19.937 |
| min | 25.140 | 8.713 | 0.828 | 17.463 | 9.247 | 17.566 | 6.842 | 11.699 | 11.018 | 49.090 | ... | 2.595 | 8.041 | 16.204 | 13.287 | 14.521 | 5.175 | 11.210 | 9.846 | 30.748 | 17.317 |
| 25% | 51.192 | 34.079 | 3.900 | 37.407 | 26.335 | 31.820 | 14.910 | 15.420 | 22.044 | 62.250 | ... | 8.037 | 15.031 | 35.414 | 29.907 | 34.328 | 23.498 | 17.434 | 18.959 | 38.062 | 27.704 |
| 50% | 63.514 | 42.274 | 23.316 | 58.437 | 53.048 | 56.942 | 18.578 | 20.563 | 29.521 | 66.586 | ... | 14.147 | 18.643 | 46.735 | 39.824 | 55.715 | 42.924 | 21.556 | 45.207 | 42.782 | 32.706 |
| 75% | 122.906 | 66.816 | 84.007 | 112.996 | 76.488 | 91.688 | 24.650 | 34.927 | 75.833 | 69.143 | ... | 36.545 | 25.403 | 68.135 | 80.767 | 92.557 | 73.171 | 38.996 | 76.966 | 65.076 | 58.165 |
| max | 251.981 | 112.421 | 231.260 | 411.110 | 166.832 | 128.680 | 63.698 | 50.400 | 117.973 | 75.261 | ... | 85.300 | 45.841 | 98.030 | 146.564 | 141.280 | 286.330 | 60.016 | 150.525 | 107.010 | 90.188 |

8 rows × 30 columns
We will take a detailed look at visualization after performing the clustering.
Next we check for NAs in the data; we will drop the columns that are mostly empty and fill the remaining missing values.
#Checking for any null values and removing the null values
print('Null Values =',dataset.isnull().values.any())
Null Values = True
Getting rid of the columns with more than 30% missing values.
missing_fractions = dataset.isnull().mean().sort_values(ascending=False)
missing_fractions.head(10)
drop_list = sorted(list(missing_fractions[missing_fractions > 0.3].index))
dataset.drop(labels=drop_list, axis=1, inplace=True)
dataset.shape
(448, 498)
Since null values remain, we fill them rather than dropping rows: each missing value is replaced with the last available value in its column (forward fill).
# Fill the missing values with the last value available in the dataset.
dataset=dataset.fillna(method='ffill')
dataset.head(2)
| Date | ABT | ABBV | ABMD | ACN | ATVI | ADBE | AMD | AAP | AES | AMG | ... | WLTW | WYNN | XEL | XRX | XLNX | XYL | YUM | ZBH | ZION | ZTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-01-02 | 58.790001 | 98.410004 | 192.490005 | 153.839996 | 64.309998 | 177.699997 | 10.98 | 106.089996 | 10.88 | 203.039993 | ... | 146.990005 | 164.300003 | 47.810001 | 29.370001 | 67.879997 | 68.070000 | 81.599998 | 124.059998 | 50.700001 | 71.769997 |
| 2018-01-03 | 58.919998 | 99.949997 | 195.820007 | 154.550003 | 65.309998 | 181.039993 | 11.55 | 107.050003 | 10.87 | 202.119995 | ... | 149.740005 | 162.520004 | 47.490002 | 29.330000 | 69.239998 | 68.900002 | 81.529999 | 124.919998 | 50.639999 | 72.099998 |

2 rows × 498 columns
For the purpose of clustering, we will use annual returns and volatility as the variables, since they are indicators of a stock's performance and risk. Let us prepare the return and volatility variables from the data.
#Calculate average annual percentage return and volatilities over a theoretical one year period
returns = dataset.pct_change().mean() * 252
returns = pd.DataFrame(returns)
returns.columns = ['Returns']
returns['Volatility'] = dataset.pct_change().std() * np.sqrt(252)
data=returns
#format the data as a numpy array to feed into the K-Means algorithm
#data = np.asarray([np.asarray(returns['Returns']),np.asarray(returns['Volatility'])]).T
All the variables should be on the same scale before applying clustering, otherwise a feature with large values will dominate the result. We use StandardScaler in sklearn to standardize the dataset’s features onto unit scale (mean = 0 and variance = 1).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(data)
rescaledDataset = pd.DataFrame(scaler.fit_transform(data),columns = data.columns, index = data.index)
# summarize transformed data
rescaledDataset.head(2)
X=rescaledDataset
X.head(2)
| | Returns | Volatility |
|---|---|---|
| ABT | 0.794067 | -0.702741 |
| ABBV | -0.927603 | 0.794867 |
Each row (stock) is an item to be clustered, and the columns (returns and volatility) are the variables used for clustering, so the data is in the right format to be fed to the clustering algorithms.
We will look at the following models: k-means clustering, hierarchical (agglomerative) clustering, and affinity propagation.
To pick the number of clusters for k-means, we look at the following metrics: the within-cluster sum of squared errors (elbow method) and the silhouette score.
distortions = []
max_loop=20
for k in range(2, max_loop):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    distortions.append(kmeans.inertia_)
fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, max_loop), distortions)
plt.xticks([i for i in range(2, max_loop)], rotation=75)
plt.grid(True)
Inspecting the sum of squared errors (SSE) chart, the elbow "kink" appears to occur at 5 or 6 clusters for this data. Certainly, we can see that as the number of clusters increases past 6, the within-cluster sum of squared errors plateaus.
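As a cross-check on the visual reading, one simple heuristic is to pick the k where the SSE curve bends the most, i.e. where its discrete second difference is largest. A minimal sketch, assuming the distortions list computed above, follows.
# Heuristic sketch: locate the elbow as the point of maximum curvature of the SSE curve.
# Assumes `distortions` holds the inertia for k = 2, ..., max_loop - 1 (computed above).
import numpy as np
sse = np.array(distortions)
second_diff = sse[:-2] - 2 * sse[1:-1] + sse[2:]  # discrete second difference (curvature)
elbow_k = int(np.argmax(second_diff)) + 3         # sse[0] corresponds to k=2; curvature at index i refers to k=i+3
print('Heuristic elbow at k =', elbow_k)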
from sklearn import metrics
silhouette_score = []
for k in range(2, max_loop):
    kmeans = KMeans(n_clusters=k, random_state=10, n_init=10, n_jobs=-1)
    kmeans.fit(X)
    silhouette_score.append(metrics.silhouette_score(X, kmeans.labels_, random_state=10))
fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, max_loop), silhouette_score)
plt.xticks([i for i in range(2, max_loop)], rotation=75)
plt.grid(True)
From the silhouette score chart, we can see kinks at several points of the graph. Since there is not much of a difference in SSE beyond 6 clusters, we prefer 6 clusters in the k-means model.
Let us build the k-means model with six clusters and visualize the results.
nclust=6
#Fit with k-means
k_means = cluster.KMeans(n_clusters=nclust)
k_means.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=6, n_init=10, n_jobs=None, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)
#Extracting labels
target_labels = k_means.predict(X)
Visualizing how the clusters are formed is no easy task when the number of variables/dimensions in the dataset is very large. Here we have only two variables (return and volatility), so we can visualize the clusters directly in two-dimensional space with a scatter plot, coloring each stock by its cluster label.
centroids = k_means.cluster_centers_
fig = plt.figure(figsize=(16,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0],X.iloc[:,1], c = k_means.labels_, cmap ="rainbow", label = X.index)
ax.set_title('k-Means results')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)
plt.plot(centroids[:,0],centroids[:,1],'sg',markersize=11)
Let us check the elements of the clusters
# show number of stocks in each cluster
clustered_series = pd.Series(index=X.index, data=k_means.labels_.flatten())
# clustered stock with its cluster label
clustered_series_all = pd.Series(index=X.index, data=k_means.labels_.flatten())
clustered_series = clustered_series[clustered_series != -1]
plt.figure(figsize=(12,7))
plt.barh(
range(len(clustered_series.value_counts())), # cluster labels, y axis
clustered_series.value_counts()
)
plt.title('Cluster Member Counts')
plt.xlabel('Stocks in Cluster')
plt.ylabel('Cluster Number')
plt.show()
The number of stocks in a cluster ranges from around 40 to 120. Although the distribution is not equal, we have a significant number of stocks in each cluster. A quick way to drill into any one cluster is sketched below.
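This minimal sketch lists the members of a single k-means cluster; the cluster number 0 is just an illustrative choice.
# Sketch: list the tickers assigned to one k-means cluster.
# Assumes `clustered_series` (index = tickers, values = cluster labels) from above.
which_cluster = 0  # illustrative cluster number
members = clustered_series[clustered_series == which_cluster].index.tolist()
print('Cluster %d contains %d stocks:' % (which_cluster, len(members)))
print(members[:10])  # first few tickers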
In the first step of hierarchical clustering, we look at the hierarchy graph and check for the number of clusters.
The hierarchy class has a dendrogram method which takes the value returned by the linkage method of the same class. The linkage method takes the dataset and the method to minimize distances as parameters. We use 'ward' as the method, since it merges clusters so as to minimize the increase in within-cluster variance.
from scipy.cluster.hierarchy import dendrogram, linkage, ward
#Calculate linkage
Z= linkage(X, method='ward')
Z[0]
array([3.30000000e+01, 3.14000000e+02, 3.62580431e-03, 2.00000000e+00])
The best way to visualize an agglomerative clustering algorithm is through a dendrogram, which displays a cluster tree, the leaves being the individual stocks and the root being the final single cluster. The "distance" between clusters is shown on the y-axis; thus, the longer the branches are, the more dissimilar the merged clusters are.
#Plot Dendrogram
plt.figure(figsize=(10, 7))
plt.title("Stocks Dendrograms")
dendrogram(Z,labels = X.index)
plt.show()
Once one big cluster is formed, the longest vertical distance without any horizontal line passing through it is selected, and a horizontal line is drawn through it. The number of vertical lines this newly created horizontal line crosses equals the number of clusters. We then select the distance threshold at which to cut the dendrogram to obtain the chosen clustering level; the output is a cluster label for each row of data. As expected from the dendrogram, a cut at 13 gives us four clusters.
distance_threshold = 13
clusters = fcluster(Z, distance_threshold, criterion='distance')
chosen_clusters = pd.DataFrame(data=clusters, columns=['cluster'])
chosen_clusters['cluster'].unique()
array([1, 4, 3, 2], dtype=int64)
nclust = 4
hc = AgglomerativeClustering(n_clusters=nclust, affinity = 'euclidean', linkage = 'ward')
clust_labels1 = hc.fit_predict(X)
fig = plt.figure(figsize=(16,10))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0],X.iloc[:,1], c =clust_labels1, cmap ="rainbow")
ax.set_title('Hierarchical Clustering')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)
Similar to the plot of k-means clustering, we see that there are some distinct clusters separated by different colors.
Next, we build the clusters using affinity propagation, which determines the number of clusters itself rather than requiring it to be specified in advance.
ap = AffinityPropagation()
ap.fit(X)
clust_labels2 = ap.predict(X)
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
scatter = ax.scatter(X.iloc[:,0],X.iloc[:,1], c =clust_labels2, cmap ="rainbow")
ax.set_title('Affinity')
ax.set_xlabel('Mean Return')
ax.set_ylabel('Volatility')
plt.colorbar(scatter)
As with the k-means and hierarchical clustering plots, we see distinct clusters separated by different colors, although affinity propagation produces considerably more clusters.
cluster_centers_indices = ap.cluster_centers_indices_
labels = ap.labels_
no_clusters = len(cluster_centers_indices)
print('Estimated number of clusters: %d' % no_clusters)
# Plot exemplars
X_temp=np.asarray(X)
plt.close('all')
plt.figure(1)
plt.clf()
fig = plt.figure(figsize=(8,6))
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(no_clusters), colors):
    class_members = labels == k
    cluster_center = X_temp[cluster_centers_indices[k]]
    plt.plot(X_temp[class_members, 0], X_temp[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=14)
    for x in X_temp[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
plt.show()
Estimated number of clusters: 27
# show number of stocks in each cluster
clustered_series_ap = pd.Series(index=X.index, data=ap.labels_.flatten())
# clustered stock with its cluster label
clustered_series_all_ap = pd.Series(index=X.index, data=ap.labels_.flatten())
clustered_series_ap = clustered_series_ap[clustered_series_ap != -1]
plt.figure(figsize=(12,7))
plt.barh(
range(len(clustered_series_ap.value_counts())), # cluster labels, y axis
clustered_series_ap.value_counts()
)
plt.title('Cluster Member Counts')
plt.xlabel('Stocks in Cluster')
plt.ylabel('Cluster Number')
plt.show()
If the ground truth labels are not known, evaluation must be performed using the model itself. The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an example of such an evaluation, where a higher Silhouette Coefficient relates to a model with better-defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores: a, the mean distance between the sample and all other points in the same cluster, and b, the mean distance between the sample and all points in the next nearest cluster. The coefficient for a single sample is then (b − a) / max(a, b).
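To make the definition concrete, the sketch below (assuming X and the fitted k_means from above) computes the coefficient for a single sample by hand and compares it with sklearn's silhouette_samples; the choice of sample index 0 is arbitrary.
# Sketch: silhouette coefficient of one sample computed directly from its definition.
# Assumes X (standardized returns/volatility) and the fitted k_means from above.
import numpy as np
from sklearn.metrics import silhouette_samples
labels = k_means.labels_
Xv = X.values
i = 0                                                              # illustrative sample index
own = Xv[labels == labels[i]]
a = np.linalg.norm(own - Xv[i], axis=1).sum() / (len(own) - 1)     # mean distance to other points in own cluster
b = min(np.linalg.norm(Xv[labels == k] - Xv[i], axis=1).mean()
        for k in set(labels) if k != labels[i])                    # mean distance to the nearest other cluster
s_manual = (b - a) / max(a, b)
print(s_manual, silhouette_samples(Xv, labels)[i])                 # the two values should match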
from sklearn import metrics
print("km", metrics.silhouette_score(X, k_means.labels_, metric='euclidean'))
print("hc", metrics.silhouette_score(X, hc.fit_predict(X), metric='euclidean'))
print("ap", metrics.silhouette_score(X, ap.labels_, metric='euclidean'))
km 0.3350720873411941
hc 0.3432149515640865
ap 0.3450647315156527
Given that affinity propagation performs best on this measure, we go ahead with it and use the 27 clusters specified by this clustering method.
To understand the intuition behind the clustering, let us visualize the results of the clusters.
# all stock with its cluster label (including -1)
clustered_series = pd.Series(index=X.index, data=ap.fit_predict(X).flatten())
# clustered stock with its cluster label
clustered_series_all = pd.Series(index=X.index, data=ap.fit_predict(X).flatten())
clustered_series = clustered_series[clustered_series != -1]
# get the number of stocks in each cluster
counts = clustered_series_ap.value_counts()
# let's visualize some clusters
cluster_vis_list = list(counts[(counts<25) & (counts>1)].index)[::-1]
cluster_vis_list
[11, 25, 16, 20, 15, 2, 0, 5, 19, 17, 22, 21, 24, 10, 9, 13]
CLUSTER_SIZE_LIMIT = 9999
counts = clustered_series.value_counts()
ticker_count_reduced = counts[(counts>1) & (counts<=CLUSTER_SIZE_LIMIT)]
print ("Clusters formed: %d" % len(ticker_count_reduced))
print ("Pairs to evaluate: %d" % (ticker_count_reduced*(ticker_count_reduced-1)).sum())
Clusters formed: 26
Pairs to evaluate: 12166
# plot a handful of the smallest clusters
plt.figure(figsize=(12,7))
cluster_vis_list[0:min(len(cluster_vis_list), 4)]
[11, 25, 16, 20]
for clust in cluster_vis_list[0:min(len(cluster_vis_list), 4)]:
    tickers = list(clustered_series[clustered_series==clust].index)
    means = np.log(dataset.loc[:"2018-02-01", tickers].mean())
    data = np.log(dataset.loc[:"2018-02-01", tickers]).sub(means)
    data.plot(title='Stock Time Series for Cluster %d' % clust)
plt.show()
Looking at the charts above, across all the clusters with a small number of stocks, we see similar movement of the stocks within each cluster, which corroborates the effectiveness of the clustering technique. Within each cluster, we now search for cointegrated pairs using the Engle-Granger cointegration test (coint from statsmodels), following the approach of the Quantopian pairs-trading lecture referenced in the function below.
def find_cointegrated_pairs(data, significance=0.05):
    # This function is from https://www.quantopian.com/lectures/introduction-to-pairs-trading
    n = data.shape[1]
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    keys = data.keys()
    pairs = []
    for i in range(1):  # note: only the first column is tested against the others; use range(n) to test every pair
        for j in range(i+1, n):
            S1 = data[keys[i]]
            S2 = data[keys[j]]
            result = coint(S1, S2)
            score = result[0]
            pvalue = result[1]
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < significance:
                pairs.append((keys[i], keys[j]))
    return score_matrix, pvalue_matrix, pairs
from statsmodels.tsa.stattools import coint
cluster_dict = {}
for i, which_clust in enumerate(ticker_count_reduced.index):
    tickers = clustered_series[clustered_series == which_clust].index
    score_matrix, pvalue_matrix, pairs = find_cointegrated_pairs(
        dataset[tickers]
    )
    cluster_dict[which_clust] = {}
    cluster_dict[which_clust]['score_matrix'] = score_matrix
    cluster_dict[which_clust]['pvalue_matrix'] = pvalue_matrix
    cluster_dict[which_clust]['pairs'] = pairs
pairs = []
for clust in cluster_dict.keys():
    pairs.extend(cluster_dict[clust]['pairs'])
print ("Number of pairs found : %d" % len(pairs))
print ("In those pairs, there are %d unique tickers." % len(np.unique(pairs)))
Number of pairs found : 32
In those pairs, there are 47 unique tickers.
pairs
[('AOS', 'FITB'), ('AOS', 'ZION'), ('AIG', 'TEL'), ('ABBV', 'BWA'), ('AFL', 'ARE'), ('AFL', 'ED'), ('AFL', 'MMC'), ('AFL', 'WM'), ('ACN', 'EQIX'), ('A', 'WAT'), ('ADBE', 'ADI'), ('ADBE', 'CDNS'), ('ADBE', 'VFC'), ('ABT', 'AZO'), ('ABT', 'CHD'), ('ABT', 'IQV'), ('ABT', 'WELL'), ('ALL', 'GL'), ('MO', 'CCL'), ('ALB', 'CTL'), ('ALB', 'FANG'), ('ALB', 'EOG'), ('ALB', 'HP'), ('ALB', 'NOV'), ('ALB', 'PVH'), ('ALB', 'TPR'), ('ADSK', 'ULTA'), ('ADSK', 'XLNX'), ('AAL', 'FCX'), ('CMG', 'EW'), ('CMG', 'KEYS'), ('XEC', 'DXC')]
from sklearn.manifold import TSNE
import matplotlib.cm as cm
stocks = list(np.unique(pairs))
X_pairs = X.loc[stocks]
in_pairs_series = clustered_series.loc[stocks]
X_tsne = TSNE(learning_rate=50, perplexity=3, random_state=1337).fit_transform(X_pairs)
plt.figure(1, facecolor='white',figsize=(16,8))
plt.clf()
plt.axis('off')
for pair in pairs:
    ticker1 = pair[0]
    loc1 = X_pairs.index.get_loc(pair[0])
    x1, y1 = X_tsne[loc1, :]
    ticker2 = pair[1]
    loc2 = X_pairs.index.get_loc(pair[1])
    x2, y2 = X_tsne[loc2, :]
    plt.plot([x1, x2], [y1, y2], 'k-', alpha=0.3, c='gray')
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=220, alpha=0.9, c=in_pairs_series.values, cmap=cm.Paired)
plt.title('T-SNE Visualization of Validated Pairs')
# annotate each point with its ticker
for x, y, name in zip(X_tsne[:, 0], X_tsne[:, 1], X_pairs.index):
    plt.annotate(name,                       # the text to show
                 (x, y),                     # the point to label
                 textcoords="offset points", # position the text relative to the point
                 xytext=(0, 10),             # offset from the point
                 ha='center')                # horizontal alignment
plt.plot(centroids[:,0],centroids[:,1],'sg',markersize=11)  # note: these k-means centroids live in (return, volatility) space, not t-SNE coordinates
Conclusion
The clustering techniques do not directly help in stock trend prediction. However, they can be effectively used in portfolio construction for finding the right pairs, which eventually helps in risk mitigation and in achieving superior risk-adjusted returns.
We showed the approaches to finding the appropriate number of clusters in k-means and built a hierarchy graph in hierarchical clustering. A next step from this case study would be to explore and backtest various long/short trading strategies with pairs of stocks from the groupings of stocks.
Clustering can effectively be used for dividing stocks into groups with “similar characteristics” for many other kinds of trading strategies and can help in portfolio construction to ensure we choose a universe of stocks with sufficient diversification between them.
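As a pointer toward that next step, below is a minimal sketch of a spread z-score signal for one of the pairs identified above; the pair ('AOS', 'FITB'), the 20-day window, the OLS hedge ratio, and the ±2 entry thresholds are all illustrative assumptions rather than a tested strategy.
# Sketch of a simple pairs-trading signal (illustrative only, not a backtest).
# Assumes `dataset` (prices) and the pair ('AOS', 'FITB') found above.
import numpy as np
import pandas as pd
s1, s2 = dataset['AOS'], dataset['FITB']
hedge_ratio = np.polyfit(s2, s1, 1)[0]                       # illustrative OLS hedge ratio
spread = s1 - hedge_ratio * s2
zscore = (spread - spread.rolling(20).mean()) / spread.rolling(20).std()
signal = pd.Series(0, index=spread.index)
signal[zscore < -2] = 1                                      # long the spread when it is unusually low
signal[zscore > 2] = -1                                      # short the spread when it is unusually high
print(signal.value_counts())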