A regression model is a statistical model of the influence of one or more random variables on another random variable. If we assume that they follow a simple linear relationship of the form:
$$ Y = \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k, $$then this is a linear regression model. In this case, there are $k$ independent variables, alternatively called explanatory variables. Sometimes one of them is a constant, in which case it is called the constant term. The coefficients $\beta_i$ are the parameters of the model that have to be estimated from data.
The simplest case of a linear model is $Y = \alpha + \beta X$. We will subsequently work with some data from Yahoo! Finance.
For such a simple model, using a scatter plot can aid in visualization.
#!pip install yfinance
#!pip install yahoofinancials
#!pip install pandas
import pandas as pd
import yfinance as yf
import numpy as np
import scipy.stats
from yahoofinancials import YahooFinancials
import matplotlib.pyplot as plt
from scipy.stats import linregress
pd.options.mode.chained_assignment = None
startdate = '2016-11-10' #Only take data after this date
enddate = '2019-05-26' #Only take data before this date
# Download AMEX data, S&P 500 Data
amex_data = yf.download('AXP')
datestotakestart = amex_data.index>startdate
datestotakeend = amex_data.index<enddate
amex_data = amex_data[np.logical_and.reduce([datestotakestart,datestotakeend])]
# print(amex_data)
amex_data["Log Return"] = np.log(amex_data["Close"]/amex_data["Open"])
amex_log_returns = amex_data[["Log Return"]].to_numpy()
SnP_data = yf.download('^GSPC')
datestotakestart = SnP_data.index>startdate
datestotakeend = SnP_data.index<enddate
SnP_data = SnP_data[np.logical_and.reduce([datestotakestart,datestotakeend])]
SnP_data["Log Return"] = np.log(SnP_data["Close"]/SnP_data["Open"])
SnP_log_returns = SnP_data[["Log Return"]].to_numpy()
# print(SnP_log_returns.shape)
# Linear Regression
# print(np.transpose(SnP_log_returns)[0])
res = linregress(np.transpose(SnP_log_returns)[0], np.transpose(amex_log_returns)[0])
# Plotting Scatterplot
plt.figure(figsize=(20, 20))
plt.scatter(SnP_log_returns,amex_log_returns,s=2,label='Data Points')
plt.plot(SnP_log_returns, res.intercept + res.slope*SnP_log_returns, 'r', label='fitted line, slope = %s, intercept = %s, Standard Error Regression = %s'%(res.slope,res.intercept,res.stderr))
plt.legend()
plt.xlabel('S&P 500 Log Return')
plt.ylabel('AMEX Log Return')
plt.show()
Usually, the points will not lie exactly along the linear regression line, unless they are perfectly correlated. One can include an error term in the simple linear model
$$ Y_t = \alpha + \beta X_t + \epsilon_t. $$$\epsilon_t$ is called the error process. For low correlation between $X$ and $Y$, the error process will have a relatively high variance, and similarly, a high correlation will imply that the error process has a low variance.
We now need to introduce more notation. We adopt the hat notation $\hat{}$ to denote estimators. Then, the fitted line is denoted as
$$ \hat{Y} = \hat{\alpha} + \hat{\beta} X. $$The difference between $Y$ and $\hat{Y}$ is known as the residual, $e_t = Y_t - \hat{Y}_t$. Then, $Y_t = \hat{\alpha} + \hat{\beta}X_t + e_t$. This is the estimated model. We can regard the residuals as observations on the error process $\epsilon_t$. By testing the properties of the residuals, we can test assumptions about the behavior of the error process. The residuals will depend on the estimated parameters of the linear regression.
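As a concrete check, the residuals of the line fitted in the scatter-plot example above can be computed directly. A minimal sketch, reusing the res object and the log-return arrays already defined:
amex_residuals = amex_log_returns - (res.intercept + res.slope*SnP_log_returns)
print('Mean residual: %s'%(amex_residuals.mean())) #Numerically zero, since the fitted line includes an intercept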
It remains to choose a way to estimate the parameters that makes the residuals small. Just minimizing the sum of the residuals will not work, as positive and negative values will cancel each other out. The simplest approach, which has nice mathematical properties, is to minimize the variance of the residuals, or put another way, to minimize the sum of the squared residuals. This is called the ordinary least squares (OLS) optimization criterion. We denote the residual sum of squares by $RSS$:
$$ RSS = \sum_t e_t^2 = \sum_t (Y_t - (\alpha + \beta X_t))^2. $$Then the OLS estimators $\hat{\alpha}, \hat{\beta}$ are found by solving the optimization problem
$$ \min_{\alpha,\beta} RSS, $$which is known as the OLS criterion. If we differentiate $RSS$ with respect to $\alpha$ and $\beta$ and set the derivatives to zero, we get:
$$ \hat{\beta} = \frac{\sum_t (X_t - \bar{X})(Y_t - \bar{Y})}{\sum_t (X_t - \bar{X})^2}, \qquad \hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}, $$where $\bar{X}$ and $\bar{Y}$ denote the sample means of $X$ and $Y$.
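These closed-form estimators are easy to verify against the linregress fit from before. A minimal sketch, flattening the same arrays that were passed to linregress above:
x = np.transpose(SnP_log_returns)[0] #Flatten the (T,1) arrays to 1-D
y = np.transpose(amex_log_returns)[0]
beta_hat = np.sum((x - x.mean())*(y - y.mean()))/np.sum((x - x.mean())**2)
alpha_hat = y.mean() - beta_hat*x.mean()
print('beta_hat = %s vs linregress slope = %s'%(beta_hat, res.slope)) #Should agree
print('alpha_hat = %s vs linregress intercept = %s'%(alpha_hat, res.intercept))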
To generalize this, consider the matrix-vector formulation
$$ \sum_j X_{ij} \beta_j = Y_i \quad \Longleftrightarrow \quad \textbf{X} \vec{\beta} = \vec{Y}. $$Then,
$$ RSS(\vec{\beta}) = (\vec{Y} - \textbf{X} \vec{\beta})^T (\vec{Y} - \textbf{X} \vec{\beta}),\\ \hat{\vec{\beta}} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \vec{Y}. $$Going back to the simple linear regression model, we note that the estimate of beta is the ratio of the sample covariance to the sample variance:
$$ \hat{\beta} = \text{est.} Cov(X,Y) / \text{est.} V(X). $$Since the covariance is the product of the correlation and the standard deviations of $X$ and $Y$,
$$ \hat{\beta} = Corr(X,Y) \cdot \frac{\sigma_Y}{\sigma_X}. $$For now, let us assume that the errors are generated by an independent and identically distributed (i.i.d.) process, $\epsilon_t \sim \text{i.i.d.}(0,\sigma^2)$. We assume the expectation is $0$, since otherwise the errors would introduce a bias. The OLS estimate of $\sigma^2$ is
$$ s^2 = \frac{RSS}{T-2}, $$where $T$ is the number of data points. This gives the standard error of the regression, $s$. Obviously a small $s$ is good, but we need to quantify how small is small.
We introduce the total sum of squares $TSS = \sum_t (Y_t-\bar{Y})^2$, which is related to the sample variance of $Y$. This measures the amount of variation in $Y$ that we seek to explain by the regression model. Next, we can introduce the explained sum of squares $ESS$, the amount of variation in $Y$ that can be explained by the regression, obtained as $ESS = TSS - RSS$. This leads to the analysis of variance ($ANOVA$), which is the decomposition of the total variance of $Y$ into the variance that can be explained by the model and the residual variance. It can be summarized in a single statistic, the regression $R^2$, given by $R^2 = ESS/TSS$. A simple statistical test of the significance of $R^2$ can be done using the $F$ statistic:
$$ F = \frac{ESS (T-2)}{RSS} \sim F_{1,T-2}. $$One can also perform hypothesis tests on the true value of coefficients. Such tests are useful for determining whether the explanatory variable is significant enough to be included in the regression. We write the null and alternative hypotheses as:
$$ H_0 : \beta = 0 \text{ vs } H_1:\beta \neq 0. $$One can also test different hypotheses, like:
$$ H_0 : \alpha = 0 \text{ vs } H_1: \alpha > 0. $$The alternative hypothesis can be one-sided or two-sided. For a linear regression, the test statistics have a Student $t$ distribution, so one can use a $t$ test for the coefficient parameters:
$$ t = \frac{\hat{\alpha} - \alpha_0}{\text{est.s.e.}(\hat{\alpha})} \sim t_{T-2},\\ t = \frac{\hat{\beta}-\beta_0}{\text{est.s.e.}(\hat{\beta})} \sim t_{T-2},\\ \text{est.s.e.}(\hat{\beta}) = \frac{s}{s_X \sqrt{T-1}},\\ \text{est.s.e.}(\hat{\alpha}) = \text{est.s.e.}(\hat{\beta}) \left( T^{-1} \sum_t X_t^2 \right)^{1/2}. $$We illustrate this with the AMEX and S&P 500 data from the earlier example.
Let us consider two hypotheses:
$$ (a)\; H_0 : \beta = 1 \text{ vs } H_1 : \beta \neq 1,\\ (b)\; H_0 : \alpha = 0 \text{ vs } H_1 : \alpha > 0. $$
sx = np.std(SnP_log_returns, ddof=1) #Sample standard deviation of the X variable
est_beta = res.slope
est_alpha = res.intercept
rss = np.sum((amex_log_returns - (est_alpha + est_beta*SnP_log_returns))**2)
s = np.sqrt(rss/(len(SnP_log_returns)-2))
est_s_e_beta = s/(sx*(np.sqrt(len(SnP_log_returns)-1)))
est_s_e_alpha = est_s_e_beta*np.sqrt(np.sum(SnP_log_returns*SnP_log_returns)/len(SnP_log_returns))
#Now we compute the t ratios for each test
t_ratio_a = (est_beta-1)/est_s_e_beta
print('t ratio for a is %s'%(t_ratio_a))
t_ratio_b = (est_alpha)/est_s_e_alpha
print('t ratio for b is %s'%(t_ratio_b))
#One will compare these values to the critical values of the test statistics.
print('T test one tail significance levels')
print('10%% is %s'%(scipy.stats.t.ppf(1-0.10,len(SnP_log_returns)-2))) #Degrees of freedom T-2, as in the t statistics above
print('5%% is %s'%(scipy.stats.t.ppf(1-0.05,len(SnP_log_returns)-2)))
print('1%% is %s'%(scipy.stats.t.ppf(1-0.01,len(SnP_log_returns)-2)))
print('T test two tail significance levels')
print('10%% is %s'%(scipy.stats.t.ppf(1-0.10/2,len(SnP_log_returns)-2)))
print('5%% is %s'%(scipy.stats.t.ppf(1-0.05/2,len(SnP_log_returns)-2)))
print('1%% is %s'%(scipy.stats.t.ppf(1-0.01/2,len(SnP_log_returns)-2)))
t ratio for a is [0.09365462]
t ratio for b is [0.01020616]
T test one tail significance levels
10% is 1.2828840860370843
5% is 1.647253005258086
1% is 2.332225364901448
T test two tail significance levels
10% is 1.647253005258086
5% is 1.963700958511729
1% is 2.583581612989292
Since the t ratios are far below the critical values, we fail to reject the null hypothesis in both cases.
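To complement these t tests, the ANOVA quantities and the $F$ test for the significance of $R^2$ introduced earlier can be computed for the same regression. A minimal sketch, reusing the rss value computed above:
tss = np.sum((amex_log_returns - amex_log_returns.mean())**2) #Total sum of squares
ess = tss - rss #Explained sum of squares
r_squared = ess/tss
T = len(SnP_log_returns)
f_stat = (ess*(T-2))/rss #F = ESS(T-2)/RSS ~ F_{1,T-2}
print('R^2 is %s'%(r_squared))
print('F statistic is %s, p value is %s'%(f_stat, scipy.stats.f.sf(f_stat, 1, T-2)))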
We note that different data will yield different OLS estimates of any coefficient. The estimated coefficients (in this case, $\hat{\alpha}$ and $\hat{\beta}$) are random variables, and thus one can ask what the distribution of the estimators looks like. We want to emphasize that two different types of random variables may be present in the context of regression: first, the error process (which we previously assumed to be i.i.d.), and second, the coefficient estimators, which are random because different data samples will yield different values.
The properties of the estimator distributions will follow from the properties of the error distributions. For example, if the errors are normal and i.i.d., then the OLS estimators will also have normal distributions. One can now use classical statistical inference tests, like $t$ tests and $F$ tests, on the true values of the model parameters.
When we talk about estimator distributions, we usually hope that they are unbiased and efficient. Unbiased means that the expected value of the estimator equals the true value of the parameter, and efficient means that the variance is as small as possible. When we have access to many data points, a looser condition is to hope that the estimator is consistent, which means that it converges to the true value of the parameter as the sample size tends to infinity, even if it is biased and inefficient for small sample sizes.
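A small simulation can make these properties concrete. The following sketch, with made-up true parameters, draws repeated samples and re-estimates the slope; the mean of the estimates staying near the true value illustrates unbiasedness, and the shrinking spread illustrates consistency:
rng = np.random.default_rng(0) #Fixed seed for reproducibility
true_alpha, true_beta = 0.5, 2.0 #Hypothetical true parameter values
for n in [10, 100, 1000]:
    slopes = []
    for _ in range(500): #500 independent samples of size n
        X = rng.normal(size=n)
        Y = true_alpha + true_beta*X + rng.normal(size=n) #i.i.d. normal errors
        slopes.append(linregress(X, Y).slope)
    slopes = np.array(slopes)
    print('n = %4d: mean of slope estimates = %.4f, std = %.4f'%(n, slopes.mean(), slopes.std()))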
The Gauss-Markov theorem states that if the residuals are i.i.d., then the OLS estimators are the best linear unbiased estimators (BLUE). Best means that they are the most efficient among linear unbiased estimators; linear means that the OLS formula is linear in the data; unbiased means that their expectation is the true value of the coefficients. The theorem assumes, however, that the explanatory variables are not stochastic. Since explanatory variables usually are stochastic, the theorem is of limited use in practice. Even so, the OLS estimators will always be consistent provided the error process is stationary.
We have already mentioned a generalization of linear regression for multiple variables. Using matrix notation, this is:
$$ \vec{Y} = X \vec{\beta} + \vec{\epsilon}. $$As mentioned above, the OLS gives:
$$ RSS(\vec{\beta}) = (\vec{Y} - \textbf{X} \vec{\beta})^T (\vec{Y} - \textbf{X} \vec{\beta}),\\ \hat{\vec{\beta}} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \vec{Y}. $$Some other values:
$$ TSS = \vec{Y}^T \vec{Y} - T \bar{Y}^2,\\ s^2 = RSS/(T-k),\\ \text{est.} V(\hat{\vec{\beta}}) = s^2 (\textbf{X}^T \textbf{X})^{-1}. $$When doing hypothesis testing, the procedure is roughly similar, except that now the Student $t$ distribution has $T-k$ degrees of freedom, where $k$ is the number of explanatory variables.
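As an illustration of these matrix formulas, here is a minimal sketch on simulated data (the sample size, coefficients, and noise level are made up; only the numpy routines already imported above are used):
rng = np.random.default_rng(1)
T, k = 200, 3 #Hypothetical sample size and number of variables including the constant
X = np.column_stack([np.ones(T), rng.normal(size=(T, k-1))]) #First column is the constant term
true_beta = np.array([0.1, 1.0, -0.5])
Y = X @ true_beta + 0.3*rng.normal(size=T)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y) #Normal equations, avoiding an explicit inverse
rss_u = np.sum((Y - X @ beta_hat)**2) #Unrestricted residual sum of squares
s2 = rss_u/(T - k)
est_var_beta = s2*np.linalg.inv(X.T @ X) #Estimated covariance matrix of the estimators
print('beta_hat: %s'%(beta_hat))
print('Estimated standard errors: %s'%(np.sqrt(np.diag(est_var_beta))))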
Hypotheses involving more than two parameters can be formulated as a set of linear restrictions, written in matrix form as:
$$ R \vec{\beta} = \vec{q}, $$where $\vec{q}$ is a vector of length $p$, where $p$ is the number of restrictions on the parameters, and $R$ is a $p$ by $k$ matrix. As a simple example, if we are considering the null hypothesis $H_0 : \beta_1 = \beta_2, \beta_1 + \beta_3 = 0, \beta_2 + 2 \beta_3 = \beta_1 + 2$, then this is written as:
$$ \begin{pmatrix} 1 & -1 & 0 \\ 1 & 0 & 1 \\ -1 & 1 & 2 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 2 \end{pmatrix}. $$Such hypotheses can be tested using the statistic:
$$ \frac{p^{-1} (RSS_R - RSS_U)}{v^{-1} RSS_U} \sim F_{p,v}. $$$p$ is the number of restrictions, $v=T-k$ is the sample size minus the number of variables in the regression including the constant, $RSS_U$ is the residual sum of squares when the regression model is estimated with no restrictions, and $RSS_R$ is the residual sum of squares obtained after imposing the restrictions in the null hypothesis.
One special case is the hypothesis $H_0 : \beta_2 = \dots = \beta_k = 0$. This gives a restricted model $Y_t = \beta_1 + \epsilon_t$, and $RSS_R = TSS$.
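Continuing with the simulated data from the previous sketch, this special case can be tested directly, with $p = k - 1$ restrictions:
rss_r = np.sum((Y - Y.mean())**2) #Restricted model Y_t = beta_1 + eps_t, so RSS_R = TSS
p, v = k - 1, T - k
f_stat = ((rss_r - rss_u)/p)/(rss_u/v)
print('F = %s, p value = %s'%(f_stat, scipy.stats.f.sf(f_stat, p, v)))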
Other tests that can be used when the errors are not normal but the sample size is large include the Wald test, the Lagrange multiplier test, and the likelihood ratio test.
The $t$ statistics above also form the basis of (approximate) confidence intervals for the true values of the regression coefficients. We again remember that:
$$ \frac{\hat{\beta} - \beta}{\text{est.s.e.}(\hat{\beta})} \sim t_v. $$We denote by $t_v^{-1}(0.975)$ the 97.5% critical value of the standard $t$ distribution on $v$ degrees of freedom. Now, since
$$ P\left(\frac{\hat{\beta} - \beta}{\text{est.s.e.}(\hat{\beta})} < -t_v^{-1}(0.975)\right) = 0.025,\\ P\left(\frac{\hat{\beta} - \beta}{\text{est.s.e.}(\hat{\beta})} > t_v^{-1}(0.975)\right) = 0.025, $$we can now create a confidence interval:
$$ P\left(\hat{\beta} - t_v^{-1}(0.975) \times \text{est.s.e.}(\hat{\beta}) < \beta < \hat{\beta} + t_v^{-1}(0.975) \times \text{est.s.e.}(\hat{\beta})\right) = 0.95. $$Sometimes, the explanatory variables have a high degree of correlation among themselves. In such a case it is very difficult to determine their individual effects. This is referred to as multicollinearity. Perfect multicollinearity occurs when two or more explanatory variables are perfectly correlated; in this case, the OLS estimators break down. For example, if $X_1$ and $X_2$ are perfectly correlated, the linear regression model relating $Y$ to $X_1, X_2$ can be written in an infinite number of ways.
Usually, perfect multicollinearity only occurs when a mistake has been made, for example, when a linear transform of one or more of the explanatory variables has been included as an additional explanatory variable. In practice, real problems arise when there is a high degree of multicollinearity: not perfect, but high, correlation between the explanatory variables. The OLS estimators then still exist and are still the most efficient, but their variances and covariances can be large. The estimated covariance matrix of the OLS estimators will have some very large and some very small elements, and the estimated standard errors of the coefficients will be inflated.
Typically, if we run a regression model, and we find that a variable does not appear to be significant unless we add another variable, which itself is not necessarily significant, then we have encountered multicollinearity. This is not the only sign of multicollinearity.
There are no formal tests for multicollinearity, but there are a few common ways to overcome it:
- Use different data.
- Drop the least significant/important of the collinear variables, and continue until multicollinearity is no longer a problem.
- Use a ridge estimator.
- Apply principal component analysis to the set of explanatory variables and use the first few principal components as independent variables in the regression. This is typically called orthogonal regression (see the sketch after this list).
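To make this concrete, here is a minimal sketch on simulated, nearly collinear regressors (all parameters are made up). It shows the ill-conditioning of $\textbf{X}^T \textbf{X}$ under near-collinearity and the principal-component workaround from the last item above:
rng = np.random.default_rng(2) #Fixed seed for reproducibility
T = 500
x1 = rng.normal(size=T)
x2 = x1 + 0.01*rng.normal(size=T) #Nearly perfectly correlated with x1
X = np.column_stack([np.ones(T), x1, x2])
Y = 1.0 + x1 + x2 + rng.normal(size=T)
print('Condition number of X^T X: %s'%(np.linalg.cond(X.T @ X))) #Huge under near-collinearity
#Orthogonal regression: regress Y on the first principal component of the centered regressors
Z = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
pc1 = Z @ Vt[0] #Scores on the first principal component
print('Slope on first principal component: %s'%(linregress(pc1, Y).slope))
Since $x_1$ and $x_2$ are nearly identical, the first component is essentially their (scaled) sum, and its slope captures the combined effect that the individual OLS coefficients cannot reliably disentangle.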