#!/usr/bin/env python
# coding: utf-8

# # Correlation and Autocorrelation
# > A Summary of lecture "Time Series Analysis in Python", via datacamp
# 
# - toc: true 
# - badges: true
# - comments: true
# - author: Chanseok Kang
# - categories: [Python, Datacamp, Time_Series_Analysis]
# - image: images/dji_ufo.png

# In[1]:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (10, 5)


# ## Introduction to Course
# 

# ### A "Thin" Application of Time Series
# [Google Trends](https://trends.google.com/trends/) allows users to see how often a term is searched for. We downloaded a file from Google Trends containing the frequency over time for the search word "diet". A first step when analyzing a time series is to visualize the data with a plot. You should be able to clearly see a gradual decrease in searches for "diet" throughout the calendar year, hitting a low around the December holidays, followed by a spike in searches around the new year as people make New Year's resolutions to lose weight.
# 
# Like many time series datasets you will be working with, the index of dates are strings and should be converted to a datetime index before plotting.

# - Preprocess

# In[2]:


diet = pd.read_csv('./dataset/diet.csv', index_col=0)
diet.head()


# In[3]:


# Convert the date index to datetime
diet.index = pd.to_datetime(diet.index)

# Plot the entire time series diet and show gridlines
diet.plot(grid=True);
plt.title('Seasonal trend of "Diet" keywords');

# Slice the dataset to keep only 2012
diet2012 = diet[diet.index.year == 2012]

# Plot 2012 data
diet2012.plot(grid=True);
plt.title('2012 trend of "Diet" keywords');


# ### Merging Time Series With Different Dates
# Stock and bond markets in the U.S. are closed on different days. For example, although the bond market is closed on Columbus Day (around Oct 12) and Veterans Day (around Nov 11), the stock market is open on those days. One way to see the dates that the stock market is open and the bond market is closed is to convert both indexes of dates into sets and take the difference in sets.
# 
# The pandas ```.join()``` method is a convenient tool to merge the stock and bond DataFrames on dates when both markets are open.
# Stock prices and 10-year US Government bond yields is downloaded from [FRED](https://fred.stlouisfed.org/)
# 

# - Preprocess

# In[4]:


stocks = pd.read_csv('./dataset/stocks.csv', index_col=0)
bonds = pd.read_csv('./dataset/bonds.csv', index_col=0)


# In[5]:


stocks.index = pd.to_datetime(stocks.index)
bonds.index = pd.to_datetime(bonds.index)


# In[6]:


# Convert the stock index and bond index into sets
set_stock_dates = set(stocks.index)
set_bond_dates = set(bonds.index)

# Take the difference between the sets and print
print(set_stock_dates - set_bond_dates)

# Merge stocks and bonds DataFrame using join()
stocks_and_bonds = stocks.join(bonds, how='inner')


# ## Correlation of Two Time Series
# 

# ### Correlation of Stocks and Bonds
# Investors are often interested in the correlation between the returns of two different assets for asset allocation and hedging purposes. In this exercise, you'll try to answer the question of whether stocks are positively or negatively correlated with bonds. Scatter plots are also useful for visualizing the correlation between the two variables.
# 
# Keep in mind that you should compute the correlations on the percentage changes rather than the levels.

# In[7]:


# Compute percent change using pct_change()
returns = stocks_and_bonds.pct_change()

# Compute correlation using corr()
correlation = returns['SP500'].corr(returns['US10Y'])
print("Correlation of stocks and interest rates: ", correlation)

# Make scatter plot
plt.scatter(returns['SP500'], returns['US10Y']);


# The positive correlation means that when interest rates go down, stock prices go down. For example, during crises like 9/11, investors sold stocks and moved their money to less risky bonds (this is sometimes referred to as a 'flight to quality'). During these periods, stocks drop and interest rates drop as well. Of course, there are times when the opposite relationship holds too.

# ### Flying Saucers Aren't Correlated to Flying Markets
# Two trending series may show a strong correlation even if they are completely unrelated. This is referred to as "spurious correlation". That's why when you look at the correlation of say, two stocks, you should look at the correlation of their returns and not their levels.
# 
# To illustrate this point, calculate the correlation between the levels of the stock market and the annual sightings of UFOs. Both of those time series have trended up over the last several decades, and the correlation of their levels is very high. Then calculate the correlation of their percent changes. This will be close to zero, since there is no relationship between those two series.
# 
# UFO data was downloaded from [www.nuforc.org](www.nuforc.org).

# - Preprocess

# In[8]:


DJI = pd.read_csv('./dataset/DJI.csv', index_col=0)
DJI.columns = ['DJI']
DJI.head()


# In[9]:


UFO = pd.read_csv('./dataset/UFO.csv', index_col=0)
UFO.columns = ['UFO']
UFO.head()


# In[10]:


UFO.index = pd.to_datetime(UFO.index, format="%Y")
DJI.index = pd.to_datetime(DJI.index, format="%Y")


# In[11]:


levels = UFO.join(DJI, how='inner')
levels.head()


# In[13]:


levels.plot(grid=True);
plt.xlabel('Year');
plt.ylabel('Dow Jones Average/Number of Sightings')
plt.savefig('../images/dji_ufo.png')


# In[13]:


# Compute correlation o f levels
correlation1 = levels['DJI'].corr(levels['UFO'])
print("Correlation of levels: ", correlation1)

# Compute correlation fo percent changes
changes = levels.pct_change()
correlation2 = changes['DJI'].corr(changes['UFO'])
print("Correlation of changes: ", correlation2)


# ## Simple Linear Regression
# - What is a Regression?
#     - Simple linear regression: $y_t = \alpha + \beta x_t + \epsilon_t$
# - Relationship between R-Squared and Correlation
#     - $[corr(x, y)]^2 = R^2$
#     - $sign(corr) = sign(\text{regression slope})$

# ### Looking at a Regression's R-Squared
# R-squared measures how closely the data fit the regression line, so the R-squared in a simple regression is related to the correlation between the two variables. In particular, the magnitude of the correlation is the square root of the R-squared and the sign of the correlation is the sign of the regression coefficient.
# 
# In this exercise, you will start using the statistical package ```statsmodels```, which performs much of the statistical modeling and testing that is found in R and software packages like SAS and MATLAB.
# 
# You will take two series, ```x``` and ```y```, compute their correlation, and then regress ```y``` on ```x``` using the function ```OLS(y,x)``` in the ```statsmodels.api``` library (note that the dependent, or right-hand side variable y is the first argument). Most linear regressions contain a constant term which is the intercept (the $\alpha$ in the regression $y_t = \alpha + \beta x_t + \epsilon_t$). To include a constant using the function ```OLS()```, you need to add a column of 1's to the right hand side of the regression.

# - Preprocess

# In[61]:


df_x = pd.read_csv('./dataset/x.csv', index_col=0, header=None)
df_y = pd.read_csv('./dataset/y.csv', index_col=0, header=None)

df_x.columns = ['x']
df_y.columns = ['y']

x = df_x.reset_index(drop=True)['x']
y = df_y.reset_index(drop=True)['y']


# In[62]:


y


# In[63]:


import statsmodels.api as sm

# Compute correlation of x and y
correlation = x.corr(y)
print("The correlation between x and y is %4.2f" % (correlation))

# Convert the Series x to a DataFrame and name the column x
dfx = pd.DataFrame(x, columns=['x'])

# Add a constant to the DataFrame dfx
dfx1 = sm.add_constant(dfx)

# Regress y on dfx1
result = sm.OLS(y, dfx1).fit()

# Print out the results and look at the relationship between R-squared and the correlation above
print(result.summary())


# ## Autocorrelation
# - Correlation of a time series with a lagged copy of itself
# - Lag-one autocorrelation
# - Also called **serial correlation**

# ### A Popular Strategy Using Autocorrelation
# One puzzling anomaly with stocks is that investors tend to overreact to news. Following large jumps, either up or down, stock prices tend to reverse. This is described as mean reversion in stock prices: prices tend to bounce back, or revert, towards previous levels after large moves, which are observed over time horizons of about a week. A more mathematical way to describe mean reversion is to say that stock returns are negatively autocorrelated.
# 
# This simple idea is actually the basis for a popular hedge fund strategy. If you're curious to learn more about this hedge fund strategy (although it's not necessary reading for anything else later in the course), see [here](https://www.quantopian.com/posts/enhancing-short-term-mean-reversion-strategies-1).
# 
# You'll look at the autocorrelation of weekly returns of MSFT stock from 2012 to 2017. You'll start with a DataFrame ```MSFT``` of daily prices. You should use the ```.resample()``` method to get weekly prices and then compute returns from prices. Use the pandas method ```.autocorr()``` to get the autocorrelation and show that the autocorrelation is negative. Note that the ```.autocorr()``` method only works on Series, not DataFrames (even DataFrames with one column), so you will have to select the column in the DataFrame.

# - Preprocess

# In[14]:


MSFT = pd.read_csv('./dataset/MSFT.csv', index_col=0)
MSFT.index = pd.to_datetime(MSFT.index, format="%m/%d/%Y")
MSFT.head()


# In[15]:


# Convert the daily data to weekly data
MSFT = MSFT.resample(rule='W').last()

# Compute the percentage change of prices
returns = MSFT.pct_change()

# Compute and print the autocorrelation of returns
autocorrelation = returns['Adj Close'].autocorr()
print('The auto correlation of weekly returns is %4.2f' % (autocorrelation))


# ### Are Interest Rates Autocorrelated?
# When you look at daily changes in interest rates, the autocorrelation is close to zero. However, if you resample the data and look at annual changes, the autocorrelation is negative. This implies that while short term changes in interest rates may be uncorrelated, long term changes in interest rates are negatively autocorrelated. A daily move up or down in interest rates is unlikely to tell you anything about interest rates tomorrow, but a move in interest rates over a year can tell you something about where interest rates are going over the next year. And this makes some economic sense: over long horizons, when interest rates go up, the economy tends to slow down, which consequently causes interest rates to fall, and vice versa.

# - Preprocess

# In[72]:


daily_rates = pd.read_csv('./dataset/daily_rates.csv', index_col=0, parse_dates=['DATE'])
daily_rates.head()


# In[75]:


# Compute the daily change in interest rates
daily_diff = daily_rates.diff()

# Compute and print the autocorrelation of daily changes
autocorrelation_daily = daily_diff['US10Y'].autocorr()
print("The autocorrelation of daily interest rate changes is %4.2f" %(autocorrelation_daily))

# Convert the daily data to annual data
yearly_rates = daily_rates.resample(rule='A').last()

# Repeat above for annual data
yearly_diff = yearly_rates.diff()
autocorrelation_yearly = yearly_diff['US10Y'].autocorr()
print("The autocorrelation of annual interest rate changes is %4.2f" %(autocorrelation_yearly))