In this case study, we use dimensionality reduction, and in particular principal component analysis (PCA), to generate the typical movements of a yield curve. The data is obtained from Quandl, a premier source for financial, economic, and alternative datasets. We use 11 tenors (from 1 month to 30 years) of the US Treasury curve; the frequency of the data is daily, and it is available from 1960 onwards.
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import quandl
#Import Model Packages
from sklearn.decomposition import PCA
#The API key can be obtained from the Quandl website by registering.
quandl.ApiConfig.api_key = 'QUANDL_API_KEY'
treasury = ['FRED/DGS1MO',
'FRED/DGS3MO',
'FRED/DGS6MO',
'FRED/DGS1',
'FRED/DGS2',
'FRED/DGS3',
'FRED/DGS5',
'FRED/DGS7',
'FRED/DGS10',
'FRED/DGS20',
'FRED/DGS30']
treasury_df = quandl.get(treasury)
treasury_df.columns = ['TRESY1mo',
'TRESY3mo',
'TRESY6mo',
'TRESY1y',
'TRESY2y',
'TRESY3y',
'TRESY5y',
'TRESY7y',
'TRESY10y',
'TRESY20y',
'TRESY30y']
#Disable the warnings
import warnings
warnings.filterwarnings('ignore')
dataset = treasury_df
type(dataset)
pandas.core.frame.DataFrame
# shape
dataset.shape
(14420, 11)
# peek at data
set_option('display.width', 100)
dataset.tail(5)
Date | TRESY1mo | TRESY3mo | TRESY6mo | TRESY1y | TRESY2y | TRESY3y | TRESY5y | TRESY7y | TRESY10y | TRESY20y | TRESY30y
---|---|---|---|---|---|---|---|---|---|---|---
2019-09-20 | 1.95 | 1.91 | 1.91 | 1.84 | 1.69 | 1.63 | 1.61 | 1.68 | 1.74 | 1.99 | 2.17
2019-09-23 | 1.94 | 1.94 | 1.93 | 1.81 | 1.68 | 1.61 | 1.59 | 1.65 | 1.72 | 1.98 | 2.16
2019-09-24 | 1.90 | 1.92 | 1.91 | 1.78 | 1.60 | 1.53 | 1.52 | 1.58 | 1.64 | 1.91 | 2.09
2019-09-25 | 1.80 | 1.89 | 1.90 | 1.82 | 1.68 | 1.61 | 1.60 | 1.66 | 1.73 | 1.99 | 2.18
2019-09-26 | 1.91 | 1.83 | 1.88 | 1.79 | 1.66 | 1.61 | 1.59 | 1.65 | 1.70 | 1.96 | 2.15
# types
set_option('display.max_rows', 500)
dataset.dtypes
TRESY1mo    float64
TRESY3mo    float64
TRESY6mo    float64
TRESY1y     float64
TRESY2y     float64
TRESY3y     float64
TRESY5y     float64
TRESY7y     float64
TRESY10y    float64
TRESY20y    float64
TRESY30y    float64
dtype: object
# describe data
set_option('display.precision', 3)
dataset.describe()
Statistic | TRESY1mo | TRESY3mo | TRESY6mo | TRESY1y | TRESY2y | TRESY3y | TRESY5y | TRESY7y | TRESY10y | TRESY20y | TRESY30y
---|---|---|---|---|---|---|---|---|---|---|---
count | 4542.000 | 9437.000 | 9437.000 | 14420.000 | 10828.000 | 14420.000 | 14420.000 | 12550.000 | 14420.000 | 6502.000 | 9656.000
mean | 1.293 | 3.919 | 4.102 | 5.113 | 5.297 | 5.533 | 5.799 | 6.194 | 6.136 | 4.634 | 6.756
std | 1.496 | 3.121 | 3.222 | 3.397 | 3.757 | 3.260 | 3.113 | 3.157 | 2.891 | 1.588 | 3.036
min | 0.000 | 0.000 | 0.020 | 0.080 | 0.160 | 0.280 | 0.560 | 0.910 | 1.370 | 1.690 | 1.940
25% | 0.060 | 0.960 | 1.060 | 2.570 | 1.860 | 3.270 | 3.630 | 3.610 | 4.070 | 3.020 | 4.260
50% | 0.880 | 3.970 | 4.210 | 5.130 | 5.200 | 5.530 | 5.710 | 6.240 | 5.890 | 4.740 | 6.560
75% | 1.990 | 5.830 | 6.100 | 7.000 | 7.680 | 7.410 | 7.640 | 7.960 | 7.790 | 5.880 | 8.610
max | 5.270 | 15.490 | 15.670 | 17.310 | 16.950 | 16.590 | 16.270 | 16.050 | 15.840 | 8.300 | 15.210
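The count row shows that the tenors have different numbers of observations, because each series starts on a different date. As a small optional check, we can make this explicit:
# First available date for each tenor, explaining the differing counts above
print(dataset.apply(lambda col: col.first_valid_index()))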
Let us look at the movement of the yield curve.
dataset.plot(figsize=(10,5))
plt.ylabel("Rate")
plt.legend(bbox_to_anchor=(1.01, 0.9), loc=2)
plt.show()
In the next step, we look at the correlation between the tenors.
# correlation
correlation = dataset.corr()
plt.figure(figsize=(15,15))
plt.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='cubehelix')
As can be seen in the heatmap above, there is a significant positive correlation between the tenors.
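As a small optional check on this observation, we can print the weakest pairwise correlation in the matrix:
# Weakest pairwise correlation between any two tenors
print('Minimum pairwise correlation:', round(correlation.values.min(), 2))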
We check for NAs in the data and either drop them or fill them with the mean of the column; the steps are the same as those in the previous case studies.
#Checking for any null values
print('Null Values =',dataset.isnull().values.any())
Null Values = True
Given that there are null values, we first fill them with the last available value and then drop the rows that still contain NAs.
# Forward-fill missing values with the most recent available observation
dataset=dataset.fillna(method='ffill')
# Drop the rows containing NA
dataset= dataset.dropna(axis=0)
# Fill na with 0
#dataset.fillna('0')
dataset.head(2)
Date | TRESY1mo | TRESY3mo | TRESY6mo | TRESY1y | TRESY2y | TRESY3y | TRESY5y | TRESY7y | TRESY10y | TRESY20y | TRESY30y
---|---|---|---|---|---|---|---|---|---|---|---
2001-07-31 | 3.67 | 3.54 | 3.47 | 3.53 | 3.79 | 4.06 | 4.57 | 4.86 | 5.07 | 5.61 | 5.51
2001-08-01 | 3.65 | 3.53 | 3.47 | 3.56 | 3.83 | 4.09 | 4.62 | 4.90 | 5.11 | 5.63 | 5.53
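As a quick sanity check, we can confirm that no missing values remain after the forward fill and the row drop:
# Verify the cleaning: no nulls should remain in the dataset
print('Null Values =', dataset.isnull().values.any())
print('Shape after cleaning =', dataset.shape)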
All the variables should be on the same scale before applying PCA, otherwise a feature with large values will dominate the result. We use StandardScaler in sklearn to standardize the dataset’s features onto unit scale (mean = 0 and variance = 1).
scaler = StandardScaler()
rescaledDataset = pd.DataFrame(scaler.fit_transform(dataset), columns=dataset.columns, index=dataset.index)
# summarize transformed data
dataset.dropna(how='any', inplace=True)
rescaledDataset.dropna(how='any', inplace=True)
rescaledDataset.head(2)
Date | TRESY1mo | TRESY3mo | TRESY6mo | TRESY1y | TRESY2y | TRESY3y | TRESY5y | TRESY7y | TRESY10y | TRESY20y | TRESY30y
---|---|---|---|---|---|---|---|---|---|---|---
2001-07-31 | 1.590 | 1.443 | 1.297 | 1.294 | 1.380 | 1.474 | 1.647 | 1.705 | 1.697 | 1.612 | 1.405
2001-08-01 | 1.576 | 1.436 | 1.297 | 1.313 | 1.408 | 1.496 | 1.688 | 1.740 | 1.734 | 1.630 | 1.424
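As a quick check that the standardization behaved as described, each column of the rescaled dataset should have mean close to 0 and (population) standard deviation close to 1; StandardScaler normalizes with ddof=0, hence ddof=0 below:
# Verify mean ~0 and population standard deviation ~1 for each column
print(rescaledDataset.mean().round(6))
print(rescaledDataset.std(ddof=0).round(6))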
Visualizing the standardized dataset:
rescaledDataset.plot(figsize=(14,10))
plt.ylabel("Rate")
plt.legend(bbox_to_anchor=(1.01, 0.9), loc=2)
plt.show()
As the next step, we fit principal component analysis from sklearn and plot two charts: the variance explained by each of the top principal components, and the cumulative explained variance, which shows how many components are needed to reach a given variance threshold.
pca = PCA()
PrincipalComponent=pca.fit(rescaledDataset)
NumEigenvalues=5
fig, axes = plt.subplots(ncols=2, figsize=(14,4))
pd.Series(pca.explained_variance_ratio_[:NumEigenvalues]).sort_values().plot.barh(title='Explained Variance Ratio by Top Factors',ax=axes[0]);
pd.Series(pca.explained_variance_ratio_[:NumEigenvalues]).cumsum().plot(ylim=(0,1),ax=axes[1], title='Cumulative Explained Variance');
# explained_variance
pd.Series(np.cumsum(pca.explained_variance_ratio_)).to_frame('Explained Variance_Top 5').head(NumEigenvalues).style.format('{:,.2%}'.format)
 | Explained Variance_Top 5
---|---
0 | 84.36%
1 | 98.44%
2 | 99.53%
3 | 99.83%
4 | 99.94%
Indeed, the first principal component alone accounts for 84.36% of the variance; together with the second it accounts for 98.44%, and together with the third for 99.53%. The first three principal components thus account, cumulatively, for more than 99.5% of all movements in the data. Hence, in terms of dimensionality reduction, the first three principal components are representative of the data.
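As a minimal sketch (using the pca object fitted above), the number of components required to reach a given variance threshold can be read directly from the cumulative explained variance ratio:
# Number of components needed to explain at least 99.5% of the variance
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components_995 = int(np.argmax(cum_var >= 0.995)) + 1
print('Components needed for 99.5% of variance:', n_components_995)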
We first define a function to determine the weights of each principal component, and then visualize the principal components.
def PCWeights():
    '''
    Principal component (PC) weights for each of the 11 PCs
    '''
    weights = pd.DataFrame()
    for i in range(len(pca.components_)):
        weights["weights_{}".format(i)] = pca.components_[i] / sum(pca.components_[i])
    weights = weights.values.T
    return weights
weights = PCWeights()
NumComponents=3
topPortfolios = pd.DataFrame(weights[:NumComponents], columns=dataset.columns)
topPortfolios.index = [f'Principal Component {i}' for i in range(1, NumComponents+1)]
axes = topPortfolios.T.plot.bar(subplots=True, legend=False,figsize=(14,10))
plt.subplots_adjust(hspace=0.35)
axes[0].set_ylim(0, .2);
# Plot the loadings of the first three principal components across the 11 tenors
plt.plot(pca.components_[0:3].T)
plt.xlabel("Tenor")
plt.show()
Looking at the interpretation of the first three principal components, they correspond to the following (a quick sign check on the loadings is sketched after the list):
Principal Component 1: Directional movements in the yield curve. These are movements that shift the entire yield curve up or down.
Principal Component 2: Slope movements in the yield curve. These are movements that steepen or flatten (change the first derivative wrt maturity) the entire yield curve.
Principal Component 3: Curvature movements in the yield curve. These are movements that change the curvature (or the second derivative wrt maturity) of the entire yield curve.
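The following sketch (an addition, using the fitted pca object) counts the sign changes of each loading vector across the 11 tenors; roughly, zero sign changes corresponds to a level shift, one to a slope change, and two to a curvature change:
# Count sign changes across maturities for each of the first three loading vectors
for i in range(3):
    loadings = pca.components_[i]
    sign_changes = int(np.sum(np.diff(np.sign(loadings)) != 0))
    print(f'PC{i+1}: {sign_changes} sign change(s) across tenors')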
pca.transform(rescaledDataset)[:,:2]
array([[ 4.97514826, -0.48514999],
       [ 5.03634891, -0.52005102],
       [ 5.14497849, -0.58385444],
       ...,
       [-1.82544584,  2.82360062],
       [-1.69938513,  2.6936174 ],
       [-1.73186029,  2.73073137]])
Using simple matrix reconstruction, we can generate a close approximation of the initial data.
Mechanically, PCA is just a matrix multiplication:
Y = XW,
where Y is the matrix of principal components, X is the input data, and W is the matrix of coefficients (loadings).
The only trick is that the matrix of coefficients is special: because it is orthonormal, it can be used to recover the original matrix,
X = YW′,
where W′ denotes the transpose of W. Keeping only the first few components gives an approximate reconstruction.
nComp=3
reconst= pd.DataFrame(np.dot(pca.transform(rescaledDataset)[:,:nComp], pca.components_[:nComp,:]),columns=dataset.columns)
plt.figure(figsize=(10,8))
plt.plot(reconst)
plt.ylabel("Treasury Rate")
plt.title("Reconstructed Dataset")
plt.show()
The figure above shows the reconstructed Treasury rate chart.
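Note that the reconstruction above is in standardized units, since PCA was fitted on the rescaled data. As a short optional extension (reusing the scaler fitted earlier), we can map the reconstruction back to the original rate scale and measure the approximation error:
# Invert the standardization to express the 3-component reconstruction in rate units
reconst_rates = pd.DataFrame(scaler.inverse_transform(reconst),
                             columns=dataset.columns, index=dataset.index)
# Mean absolute error of the reconstruction, per tenor, in percentage points
mae = (reconst_rates - dataset).abs().mean()
print(mae.round(3))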
Conclusion
We demonstrated the efficiency of dimensionality reduction and principal component analysis in reducing the number of dimensions and coming up with new, intuitive features.
The first three principal components explain more than 99.5% of the variation and represent directional movements, slope movements, and curvature movements, respectively. Overall, by using principal component analysis, analyzing the eigenvectors, and understanding the intuition behind them, we demonstrated how dimensionality reduction leads to a small number of intuitive dimensions describing the yield curve.