A Summary of lecture "Time Series Analysis in Python", via datacamp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10, 5)
Google Trends allows users to see how often a term is searched for. We downloaded a file from Google Trends containing the frequency over time for the search word "diet". A first step when analyzing a time series is to visualize the data with a plot. You should be able to clearly see a gradual decrease in searches for "diet" throughout the calendar year, hitting a low around the December holidays, followed by a spike in searches around the new year as people make New Year's resolutions to lose weight.
Like many time series datasets you will be working with, the index of dates are strings and should be converted to a datetime index before plotting.
diet = pd.read_csv('./dataset/diet.csv', index_col=0)
diet.head()
diet | |
---|---|
Date | |
6/26/2011 | 70 |
7/3/2011 | 71 |
7/10/2011 | 73 |
7/17/2011 | 74 |
7/24/2011 | 72 |
# Convert the date index to datetime
diet.index = pd.to_datetime(diet.index)
# Plot the entire time series diet and show gridlines
diet.plot(grid=True);
plt.title('Seasonal trend of "Diet" keywords');
# Slice the dataset to keep only 2012
diet2012 = diet[diet.index.year == 2012]
# Plot 2012 data
diet2012.plot(grid=True);
plt.title('2012 trend of "Diet" keywords');
Stock and bond markets in the U.S. are closed on different days. For example, although the bond market is closed on Columbus Day (around Oct 12) and Veterans Day (around Nov 11), the stock market is open on those days. One way to see the dates that the stock market is open and the bond market is closed is to convert both indexes of dates into sets and take the difference in sets.
The pandas .join()
method is a convenient tool to merge the stock and bond DataFrames on dates when both markets are open.
Stock prices and 10-year US Government bond yields is downloaded from FRED
stocks = pd.read_csv('./dataset/stocks.csv', index_col=0)
bonds = pd.read_csv('./dataset/bonds.csv', index_col=0)
stocks.index = pd.to_datetime(stocks.index)
bonds.index = pd.to_datetime(bonds.index)
# Convert the stock index and bond index into sets
set_stock_dates = set(stocks.index)
set_bond_dates = set(bonds.index)
# Take the difference between the sets and print
print(set_stock_dates - set_bond_dates)
# Merge stocks and bonds DataFrame using join()
stocks_and_bonds = stocks.join(bonds, how='inner')
{Timestamp('2016-11-11 00:00:00'), Timestamp('2013-11-11 00:00:00'), Timestamp('2010-11-11 00:00:00'), Timestamp('2007-11-12 00:00:00'), Timestamp('2011-10-10 00:00:00'), Timestamp('2009-10-12 00:00:00'), Timestamp('2009-11-11 00:00:00'), Timestamp('2017-06-09 00:00:00'), Timestamp('2010-10-11 00:00:00'), Timestamp('2012-10-08 00:00:00'), Timestamp('2011-11-11 00:00:00'), Timestamp('2007-10-08 00:00:00'), Timestamp('2014-10-13 00:00:00'), Timestamp('2008-11-11 00:00:00'), Timestamp('2015-11-11 00:00:00'), Timestamp('2012-11-12 00:00:00'), Timestamp('2008-10-13 00:00:00'), Timestamp('2013-10-14 00:00:00'), Timestamp('2016-10-10 00:00:00'), Timestamp('2015-10-12 00:00:00'), Timestamp('2014-11-11 00:00:00')}
Investors are often interested in the correlation between the returns of two different assets for asset allocation and hedging purposes. In this exercise, you'll try to answer the question of whether stocks are positively or negatively correlated with bonds. Scatter plots are also useful for visualizing the correlation between the two variables.
Keep in mind that you should compute the correlations on the percentage changes rather than the levels.
# Compute percent change using pct_change()
returns = stocks_and_bonds.pct_change()
# Compute correlation using corr()
correlation = returns['SP500'].corr(returns['US10Y'])
print("Correlation of stocks and interest rates: ", correlation)
# Make scatter plot
plt.scatter(returns['SP500'], returns['US10Y']);
Correlation of stocks and interest rates: 0.4119448886249272
The positive correlation means that when interest rates go down, stock prices go down. For example, during crises like 9/11, investors sold stocks and moved their money to less risky bonds (this is sometimes referred to as a 'flight to quality'). During these periods, stocks drop and interest rates drop as well. Of course, there are times when the opposite relationship holds too.
Two trending series may show a strong correlation even if they are completely unrelated. This is referred to as "spurious correlation". That's why when you look at the correlation of say, two stocks, you should look at the correlation of their returns and not their levels.
To illustrate this point, calculate the correlation between the levels of the stock market and the annual sightings of UFOs. Both of those time series have trended up over the last several decades, and the correlation of their levels is very high. Then calculate the correlation of their percent changes. This will be close to zero, since there is no relationship between those two series.
UFO data was downloaded from www.nuforc.org.
DJI = pd.read_csv('./dataset/DJI.csv', index_col=0)
DJI.columns = ['DJI']
DJI.head()
DJI | |
---|---|
Date | |
1941 | 110.96 |
1942 | 119.40 |
1943 | 135.89 |
1944 | 152.32 |
1945 | 192.91 |
UFO = pd.read_csv('./dataset/UFO.csv', index_col=0)
UFO.columns = ['UFO']
UFO.head()
UFO | |
---|---|
Date | |
1941 | 1 |
1942 | 2 |
1943 | 9 |
1944 | 9 |
1945 | 9 |
UFO.index = pd.to_datetime(UFO.index, format="%Y")
DJI.index = pd.to_datetime(DJI.index, format="%Y")
levels = UFO.join(DJI, how='inner')
levels.head()
UFO | DJI | |
---|---|---|
Date | ||
1941-01-01 | 1 | 110.96 |
1942-01-01 | 2 | 119.40 |
1943-01-01 | 9 | 135.89 |
1944-01-01 | 9 | 152.32 |
1945-01-01 | 9 | 192.91 |
levels.plot(grid=True);
plt.xlabel('Year');
plt.ylabel('Dow Jones Average/Number of Sightings')
plt.savefig('../images/dji_ufo.png')
# Compute correlation o f levels
correlation1 = levels['DJI'].corr(levels['UFO'])
print("Correlation of levels: ", correlation1)
# Compute correlation fo percent changes
changes = levels.pct_change()
correlation2 = changes['DJI'].corr(changes['UFO'])
print("Correlation of changes: ", correlation2)
Correlation of levels: 0.9399762210726428 Correlation of changes: 0.06026935462405373
R-squared measures how closely the data fit the regression line, so the R-squared in a simple regression is related to the correlation between the two variables. In particular, the magnitude of the correlation is the square root of the R-squared and the sign of the correlation is the sign of the regression coefficient.
In this exercise, you will start using the statistical package statsmodels
, which performs much of the statistical modeling and testing that is found in R and software packages like SAS and MATLAB.
You will take two series, x
and y
, compute their correlation, and then regress y
on x
using the function OLS(y,x)
in the statsmodels.api
library (note that the dependent, or right-hand side variable y is the first argument). Most linear regressions contain a constant term which is the intercept (the $\alpha$ in the regression $y_t = \alpha + \beta x_t + \epsilon_t$). To include a constant using the function OLS()
, you need to add a column of 1's to the right hand side of the regression.
df_x = pd.read_csv('./dataset/x.csv', index_col=0, header=None)
df_y = pd.read_csv('./dataset/y.csv', index_col=0, header=None)
df_x.columns = ['x']
df_y.columns = ['y']
x = df_x.reset_index(drop=True)['x']
y = df_y.reset_index(drop=True)['y']
y
0 -0.835129 1 -0.061004 2 -0.194677 3 -2.461142 4 1.040073 ... 995 -1.017080 996 -0.430943 997 1.989779 998 -1.171907 999 -1.565902 Name: y, Length: 1000, dtype: float64
import statsmodels.api as sm
# Compute correlation of x and y
correlation = x.corr(y)
print("The correlation between x and y is %4.2f" % (correlation))
# Convert the Series x to a DataFrame and name the column x
dfx = pd.DataFrame(x, columns=['x'])
# Add a constant to the DataFrame dfx
dfx1 = sm.add_constant(dfx)
# Regress y on dfx1
result = sm.OLS(y, dfx1).fit()
# Print out the results and look at the relationship between R-squared and the correlation above
print(result.summary())
The correlation between x and y is -0.90 OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.818 Model: OLS Adj. R-squared: 0.817 Method: Least Squares F-statistic: 4471. Date: Sun, 07 Jun 2020 Prob (F-statistic): 0.00 Time: 20:39:54 Log-Likelihood: -560.10 No. Observations: 1000 AIC: 1124. Df Residuals: 998 BIC: 1134. Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const -0.0052 0.013 -0.391 0.696 -0.032 0.021 x -0.9080 0.014 -66.869 0.000 -0.935 -0.881 ============================================================================== Omnibus: 0.048 Durbin-Watson: 2.066 Prob(Omnibus): 0.976 Jarque-Bera (JB): 0.103 Skew: -0.003 Prob(JB): 0.950 Kurtosis: 2.951 Cond. No. 1.03 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
One puzzling anomaly with stocks is that investors tend to overreact to news. Following large jumps, either up or down, stock prices tend to reverse. This is described as mean reversion in stock prices: prices tend to bounce back, or revert, towards previous levels after large moves, which are observed over time horizons of about a week. A more mathematical way to describe mean reversion is to say that stock returns are negatively autocorrelated.
This simple idea is actually the basis for a popular hedge fund strategy. If you're curious to learn more about this hedge fund strategy (although it's not necessary reading for anything else later in the course), see here.
You'll look at the autocorrelation of weekly returns of MSFT stock from 2012 to 2017. You'll start with a DataFrame MSFT
of daily prices. You should use the .resample()
method to get weekly prices and then compute returns from prices. Use the pandas method .autocorr()
to get the autocorrelation and show that the autocorrelation is negative. Note that the .autocorr()
method only works on Series, not DataFrames (even DataFrames with one column), so you will have to select the column in the DataFrame.
MSFT = pd.read_csv('./dataset/MSFT.csv', index_col=0)
MSFT.index = pd.to_datetime(MSFT.index, format="%m/%d/%Y")
MSFT.head()
Adj Close | |
---|---|
Date | |
2012-08-06 | 26.107651 |
2012-08-07 | 26.377876 |
2012-08-08 | 26.438896 |
2012-08-09 | 26.587088 |
2012-08-10 | 26.517351 |
# Convert the daily data to weekly data
MSFT = MSFT.resample(rule='W').last()
# Compute the percentage change of prices
returns = MSFT.pct_change()
# Compute and print the autocorrelation of returns
autocorrelation = returns['Adj Close'].autocorr()
print('The auto correlation of weekly returns is %4.2f' % (autocorrelation))
The auto correlation of weekly returns is -0.16
When you look at daily changes in interest rates, the autocorrelation is close to zero. However, if you resample the data and look at annual changes, the autocorrelation is negative. This implies that while short term changes in interest rates may be uncorrelated, long term changes in interest rates are negatively autocorrelated. A daily move up or down in interest rates is unlikely to tell you anything about interest rates tomorrow, but a move in interest rates over a year can tell you something about where interest rates are going over the next year. And this makes some economic sense: over long horizons, when interest rates go up, the economy tends to slow down, which consequently causes interest rates to fall, and vice versa.
daily_rates = pd.read_csv('./dataset/daily_rates.csv', index_col=0, parse_dates=['DATE'])
daily_rates.head()
US10Y | |
---|---|
DATE | |
1962-01-02 | 4.06 |
1962-01-03 | 4.03 |
1962-01-04 | 3.99 |
1962-01-05 | 4.02 |
1962-01-08 | 4.03 |
# Compute the daily change in interest rates
daily_diff = daily_rates.diff()
# Compute and print the autocorrelation of daily changes
autocorrelation_daily = daily_diff['US10Y'].autocorr()
print("The autocorrelation of daily interest rate changes is %4.2f" %(autocorrelation_daily))
# Convert the daily data to annual data
yearly_rates = daily_rates.resample(rule='A').last()
# Repeat above for annual data
yearly_diff = yearly_rates.diff()
autocorrelation_yearly = yearly_diff['US10Y'].autocorr()
print("The autocorrelation of annual interest rate changes is %4.2f" %(autocorrelation_yearly))
The autocorrelation of daily interest rate changes is 0.07 The autocorrelation of annual interest rate changes is -0.22