# # Spurious Regression
# The best way to understand **spurious regression** is to generate two random walk series and then regress one on the other. Let's simulate two asset prices $Y_t$ and $X_t$
# \begin{aligned}
# Y_{t} &= \alpha_1+\alpha_2 Y_{t-1}+u_{t} \\
# X_{t} &=\beta_1+\beta_2 X_{t-1}+v_{t}
# \end{aligned}
# In[38]:
def rw_dr(dr_param, slope_param, init_val, length):
    """Simulate a random walk with drift (an AR(1) path with the given slope)."""
    rw_array = [init_val]
    last = init_val
    for i in range(length):
        # drift + slope * previous value + a noisy shock
        current = dr_param + slope_param * last + 50 * np.random.randn()
        rw_array.append(current)
        last = current
    return rw_array
N = 5000
X = rw_dr(2, 1, 0, N)
Y = rw_dr(0, 1, 0, N)
dates = pd.date_range("19900101", periods=N + 1)
df = pd.DataFrame([X, Y]).T
df.columns = ["Asset_X", "Asset_Y"]
df.index = dates
df.plot(figsize=(12, 8), grid=True)
plt.show()
# In[39]:
model = smf.ols(formula="Asset_Y ~ Asset_X", data=df)
results = model.fit()
print(results.summary())
# Note that the $t$-statistic of $X$ is highly significant, and although $R^2$ is relatively low, the $F$-statistic is extremely significant. You will see different results when running the code on your computer, because I didn't set a random seed in Python.
# We know the data generating processes of $X$ and $Y$ are completely unrelated, yet the regression results say otherwise. This is the phenomenon of **spurious regression**. You can also see that the $dw$ (Durbin-Watson) statistic is almost $0$, which signals a severe autocorrelation problem.
#
# An $R^2 > dw$ is a good rule of thumb for suspecting that the estimated regression is spurious. Also note that all the usual statistical tests are invalid, because the $t$-statistics are not distributed as a $t$-distribution.
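# As a quick check, we can rerun the regression on the first differences of the two simulated series (a sketch using the ```df``` defined above): differencing a random walk yields a stationary series, so the spurious significance should largely disappear.
# In[ ]:
df_diff = df.diff().dropna()  # first differences of the two simulated prices
results_diff = smf.ols(formula="Asset_Y ~ Asset_X", data=df_diff).fit()
print(results_diff.summary())  # the slope is typically insignificant now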
# ## Tests Of Stationarity
# ### Autocorrelation Function And Correlogram
# The autocorrelation function (ACF) at lag $k$ is defined as
# \begin{aligned}
# \rho_{k} &=\frac{\gamma_{k}}{\gamma_{0}}=\frac{\text { covariance at lag } k}{\text { variance }}
# \end{aligned}
# In practice, we can only compute the **sample autocorrelation function**, $\hat{\rho}_k = \hat{\gamma}_k/\hat{\gamma}_0$, where
# \begin{aligned}
# &\hat{\gamma}_{k}=\frac{\sum\left(Y_{t}-\bar{Y}\right)\left(Y_{t+k}-\bar{Y}\right)}{n-k} \\
# &\hat{\gamma}_{0}=\frac{\sum\left(Y_{t}-\bar{Y}\right)^{2}}{n-1}
# \end{aligned}
# Note that $\hat{\rho}_0 = 1$.
#
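# The sample formulas above are straightforward to implement directly; here is a minimal sketch (the helper name ```sample_acf``` is ours, and note that statsmodels' built-in ```acf``` uses a slightly different denominator convention).
# In[ ]:
def sample_acf(y, k):
    # sample autocovariance at lag k divided by the sample variance, as defined above
    y = np.asarray(y, dtype=float)
    n, y_bar = len(y), np.mean(y)
    gamma_k = np.sum((y[: n - k] - y_bar) * (y[k:] - y_bar)) / (n - k)
    gamma_0 = np.sum((y - y_bar) ** 2) / (n - 1)
    return gamma_k / gamma_0
print([round(sample_acf(np.random.randn(500), k), 3) for k in range(1, 5)])  # near 0 for white noise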
# If we plot $\hat{\rho}_k$ against $k$, we obtain the **correlogram**. Here is the correlogram of a white noise series
# In[40]:
from statsmodels.graphics.tsaplots import plot_acf

X = np.random.randn(1000)  # a white-noise series
plot_acf(X, lags=100, color="red").set_size_inches(16, 6)
plt.show()
# In[41]:
plt.figure(figsize=(16, 6))
pd.plotting.autocorrelation_plot(X).set_xlim([0, 100])
plt.show()
# Obviously, the autocorrelations at various lags hover around $0$, which is a sign of stationarity.
#
# Here is the correlogram of the monthly CPI for urban consumers in the US. Note that the old tradition in econometrics is to compute the ACF up to $1/3$ of the length of the time series; we follow this rule below.
# In[42]:
df = pdr.data.DataReader(["CPIAUCSL"], "fred", start, end)
df.columns = ["CPI_Urban"]
plt.figure(figsize=(16, 6))
pd.plotting.autocorrelation_plot(df["CPI_Urban"]).set_xlim([0, np.round(len(df) / 3)])
plt.show()
# ### Ljung-Box Test
# Apart from the _Breusch-Godfrey_ and _Durbin-Watson_ tests that we discussed in the chapter on autocorrelation, we will introduce one more common autocorrelation test for time series, the **Ljung-Box test** (LB test).
# $$
# \mathrm{LB}=n(n+2) \sum_{k=1}^{m}\left(\frac{\hat{\rho}_{k}^{2}}{n-k}\right) \sim \chi^{2}_m
# $$
#
# The hypotheses are
# $$
# H_0: \text{No autocorrelation is present}\\
# H_1: \text{Autocorrelation is present for the specified lags}
# $$
#
# There are many controversial debates about when and how to use these statistics, but those debates belong to academia; we go lightly here. The LB test is a joint test that depends on the specified lags. If we want to check whether the first $10$ lags show any sign of autocorrelation, we use the code below
# In[43]:
sm.stats.acorr_ljungbox(df["CPI_Urban"], lags=[10], return_df=True, boxpierce=True)
# The test shows overwhelming evidence of autocorrelation up to lag $10$. Note that we also pass the ```boxpierce``` argument to the function; the Box-Pierce statistic is a variant of the LB test, usually printed for reference.
# ### Dickey–Fuller (DF) test
# If we suspect that $Y_t$ follows a unit root process, why don't we simply regress $Y_t$ onto $Y_{t-1}$, i.e.
# $$
# Y_{t}=\rho Y_{t-1}+u_{t} \quad-1 \leq \rho \leq 1
# $$
# The problem is that if $\rho=1$, the $t$-statistic on the estimated coefficient is severely biased. Let's simulate the process and examine whether this is the case.
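# Below is a minimal Monte Carlo sketch (the sample size and number of replications are arbitrary choices): we simulate pure random walks, regress $Y_t$ on $Y_{t-1}$, and collect the $t$-statistic for $H_0: \rho = 1$. If it followed a $t$-distribution it would be centered at $0$; instead, its distribution sits well to the left of $0$.
# In[ ]:
n_sims, n_obs = 1000, 250
t_stats = []
for _ in range(n_sims):
    y = np.cumsum(np.random.randn(n_obs))  # a pure random walk, i.e. rho = 1
    ols_fit = sm.OLS(y[1:], sm.add_constant(y[:-1])).fit()
    t_stats.append((ols_fit.params[1] - 1) / ols_fit.bse[1])  # t-stat for H0: rho = 1
plt.figure(figsize=(12, 6))
plt.hist(t_stats, bins=50)
plt.title("Simulated t-statistics under a unit root")
plt.show()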
# However, some manipulation can circumvent the issue
# \begin{aligned}
# Y_{t}-Y_{t-1} &=\rho Y_{t-1}-Y_{t-1}+u_{t} \\
# &=(\rho-1) Y_{t-1}+u_{t}\\
# \Delta Y_t &= \delta Y_{t-1}+u_{t}
# \end{aligned}
# where $\delta = \rho-1$. If $\delta =0$, i.e. $\rho=1$, then $\Delta Y_t = u_t$, and therefore $Y_t$ is nonstationary; if $\delta <0$, then $\rho <1$, and in that case $Y_t$ is stationary.
#
# The last equation $\Delta Y_t = \delta Y_{t-1}+u_{t}$ is the one to estimate and hypotheses are
# $$
# H_0: \delta = 0, \text{nonstationary (unit root)}\\
# H_1: \delta < 0, \text{stationary}
# $$
# It turns out that the $t$-statistic computed for $\hat{\delta}$ doesn't really follow a $t$-distribution; it follows the $\tau$-distribution, or **Dickey-Fuller distribution**, which is why we call this the _Dickey-Fuller test_.
# In[44]:
from statsmodels.tsa.stattools import adfuller
# In[45]:
results_dickeyfuller = adfuller(X)  # X is the white-noise series generated above
print("ADF Statistic: %f" % results_dickeyfuller[0])
print("p-value: %f" % results_dickeyfuller[1])
# ## Transforming Nonstationary Time Series
# In academia, unfortunately, the transformation of nonstationary series is still a controversial topic. James Hamilton argues that we should never use the HP filter due to its heavily layered differencing, i.e. the cyclical component of an HP-filtered series is a fourth difference of the corresponding trend component.
#
# Also, because most economic and financial time series exhibit random-walk features, the cyclical components are merely artifacts of applying the filter.
#
# However, let's not be perplexed by the sophisticated academic discussion. Here's the rule of thumb (a sketch follows below).
#
# **HP filter or F.O.D. (first-order difference)?**
# - If you want to do business cycle analysis with a structural macro model, use the HP filter.
# - If you want to do ordinary time series modeling, use the first-order difference.
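# Here is a minimal sketch of both transformations applied to the CPI series loaded above ($\lambda = 129600$ is a value commonly suggested for monthly data).
# In[ ]:
from statsmodels.tsa.filters.hp_filter import hpfilter

cpi_cycle, cpi_trend = hpfilter(df["CPI_Urban"], lamb=129600)  # HP decomposition
cpi_fod = df["CPI_Urban"].diff().dropna()  # first-order difference
pd.DataFrame({"HP cycle": cpi_cycle, "F.O.D.": cpi_fod}).plot(
    subplots=True, figsize=(16, 6), grid=True
)
plt.show()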
#
# ## Cointegration
# We have seen the invalid estimates that result from regressing one nonstationary time series onto another, which can easily produce a spurious regression.
#
# However, if both series share a common trend, the regression between them will not necessarily be spurious.
# In[46]:
start = dt.datetime(1980, 1, 1)
end = dt.datetime(2021, 10, 1)
df = pdr.data.DataReader(["PCEC96", "A229RX0"], "fred", start, end).dropna()
df.columns = ["real_PCE", "real_DPI"]
df.head()
# In[47]:
model = smf.ols(formula="np.log(real_PCE) ~ np.log(real_DPI)", data=df)
results = model.fit()
# In[48]:
print(results.summary())
# In[49]:
dickey_fuller = adfuller(results.resid)
print("ADF Statistic: %f" % dickey_fuller[0])
print("p-value: %f" % dickey_fuller[1])
# Note that we are able to reject the null hypothesis of nonstationarity at the $5\%$ significance level.
# In[50]:
plot_acf(results.resid, lags=100, color="red").set_size_inches(16, 6)
# The disturbance term can be seen as a linear combination of $PCE$ and $DPI$
# $$
# u_{t}=\ln{PCE}_{t}-\beta_{1}-\beta_{2} \ln{DPI}_{t}
# $$
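# As a quick sketch, we can reconstruct this linear combination from the estimated coefficients and confirm it matches ```results.resid``` (the parameter labels below assume the formula interface's default naming).
# In[ ]:
b1 = results.params["Intercept"]
b2 = results.params["np.log(real_DPI)"]
u_hat = np.log(df["real_PCE"]) - b1 - b2 * np.log(df["real_DPI"])
print(np.allclose(u_hat, results.resid))  # True: the residual is exactly this combination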
# If both variables are $I(1)$ but the residuals are $I(0)$, we say that the two variables are **cointegrated**, and the regression is known as a **cointegrating regression**. Economically speaking, two variables will be cointegrated if there is a long-term, or equilibrium, relationship between them.
# One of the key considerations prior to testing for cointegration is whether there is theoretical support for the cointegrating relationship. It is important to remember that cointegration occurs when separate time series share an underlying stochastic trend; the idea of a shared trend should be supported by economic theory.
# One thing to note here: in the cointegration context, we usually perform an **Engle–Granger** (EG) or **augmented Engle–Granger** (AEG) test. These are essentially DF/ADF tests applied to the residuals of the cointegrating regression, though with their own critical values. The hypotheses are
#
# $$
# H_0: \text{No cointegration}\\
# H_1: \text{Cointegration is present}
# $$
# In[51]:
from statsmodels.tsa.stattools import coint
coint_test1 = coint(df["real_PCE"], df["real_DPI"], trend="c", autolag="BIC")
coint_test2 = coint(df["real_PCE"], df["real_DPI"], trend="ct", autolag="BIC")
# In[52]:
def coint_output(res):
    output = pd.Series(
        [res[0], res[1], res[2][0], res[2][1], res[2][2]],
        index=[
            "Test Statistic",
            "p-value",
            "Critical Value (1%)",
            "Critical Value (5%)",
            "Critical Value (10%)",
        ],
    )
    print(output)
# Without linear trend.
# In[53]:
coint_output(coint_test1)
# With linear trend.
# In[54]:
coint_output(coint_test2)
# ## Error Correction Mechanism
# The **Error Correction Mechanism** (ECM) refers to the phenomenon that a variable which deviates from its equilibrium will gradually correct its own error.
# $$
# u_{t-1}=\ln{PCE}_{t-1}-\beta_{1}-\beta_{2} \ln{DPI}_{t-1}
# $$
# $$
# \Delta \ln{PCE}_{t}=\alpha_{0}+\alpha_{1} \Delta \ln{DPI}_{t}+\alpha_{2} \mathrm{u}_{t-1}+\varepsilon_{t}
# $$
# Suppose $u_{t-1}> 0$ and $\Delta \ln{DPI}_t =0$; this means $\ln{PCE}_{t-1}$ is above its equilibrium level. In order to correct this temporary error, $\alpha_2$ has to be negative, so that $\Delta \ln{PCE}_t<0$ and $PCE$ adjusts back toward the equilibrium.
# We can estimate $u_{t-1}$ by its fitted counterpart $\hat{u}_{t-1}$ from the cointegrating regression
# $$
# \hat{u}_{t-1}=\ln{PCE}_{t-1}-\hat{\beta}_{1}-\hat{\beta}_{2} \ln{DPI}_{t-1}
# $$
# In[55]:
df_new = {
    "Delta_ln_PCE": np.log(df["real_PCE"]).diff(1),
    "Delta_ln_DPI": np.log(df["real_DPI"]).diff(1),
    "u_tm1": results.resid.shift(1),  # lagged residual, i.e. u_{t-1}
}
df_new = pd.DataFrame(df_new).dropna()
df_new
# In[56]:
model_ecm = smf.ols(formula="Delta_ln_PCE ~ Delta_ln_DPI + u_tm1", data=df_new)
results_ecm = model_ecm.fit()
print(results_ecm.summary())