# Side note of $t$-distribution
#
# Below is a figure to refresh the $t$-statistic; we set $\alpha=.05$. The first axes shows how the critical value of the $t$-statistic changes as the degrees of freedom rise, and the second shows that the $t$-distribution approaches the normal distribution as the degrees of freedom increase.
# In[25]:
degree_frd = np.arange(1, 50, 1)
t_array = sp.stats.t.ppf(0.975, df=degree_frd)
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(14, 12))
ax[0].plot(degree_frd, t_array)
ax[0].set_xlim([0, 20])
ax[0].set_xlabel("degree of freedom")
ax[0].set_ylabel("critical value of $t$-statistic")
ax[0].set_title("The Change of $t$-statistic As d.o.f. Increases")
x = np.linspace(-3, 3, len(degree_frd))
for i in range(len(degree_frd)):
    t_pdf = sp.stats.t.pdf(x, degree_frd[i], loc=0, scale=1)  # pdf(x, df, loc=0, scale=1)
    if i % 8 == 0:
        ax[1].plot(x, t_pdf, label="t-Distribution with d.o.f. = {}".format(degree_frd[i]))
    else:
        ax[1].plot(x, t_pdf)
ax[1].set_title(
    "The Shape of the $t$-Distribution Approaches the Normal Distribution As d.o.f. Increases"
)
ax[1].legend()
plt.show()
# # $p$-Values
# Personally, I prefer $p$-values. A $p$-value is more informative than a bare $t$-statistic: it gives the probability of obtaining the corresponding $t$-statistic (or a more extreme one) if the null hypothesis is true, which is exactly the probability of a **Type I error**.
#
# With proper use of ```sp.stats.t.cdf```, we can compute the $p$-value conveniently.
# In[26]:
1 - sp.stats.t.cdf(
    t_b2, df=n
)  # t_b2 is positive, so the upper-tail p-value is 1 - cdf; for a negative statistic use the cdf directly
# The $p$-value tells us that if the null hypothesis is true, the probability of obtaining $t_{b_2}=3.6349$ or higher is merely $0.0006$. In other words, it is very unlikely that the null hypothesis is true, and we can safely reject it with only a tiny probability of a Type I error.
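#
# With the one-sided tail in hand, the two-sided $p$-value is simply double that tail probability. A minimal sketch, reusing ```t_b2``` and ```n``` from the earlier cells (```sp.stats.t.sf``` is the survival function, i.e. $1-\text{cdf}$):
# In[ ]:
p_one_sided = sp.stats.t.sf(t_b2, df=n)  # upper-tail probability, same as 1 - cdf
p_two_sided = 2 * sp.stats.t.sf(np.abs(t_b2), df=n)  # double the tail for a two-sided test
print("one-sided p-value: {:.4f}, two-sided p-value: {:.4f}".format(p_one_sided, p_two_sided))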
#
# Medical research conventionally reports $p$-values, while econometrics tends to report estimates with standard errors in brackets underneath. These are just different ways of expressing the same ideas; pick whichever you prefer, unless you are writing an economics paper.
#
# Side Note on Type I and Type II Errors
# The blue shaded area is genuinely generated by the null distribution; however, these values are too distant from the mean (more than $2\sigma$ away from $0$ in this example), so they are mistakenly rejected. This is what we call a Type I error.
#
#
# The orange shaded area is actually generated by the alternative distribution; however, these values lie close to the mean of the null distribution, so we fail to reject the null, wrongly. This is called a Type II error.
#
#
# As you can see from the chart, if the null and alternative distributions are far apart, the probabilities of both types of error become trivial.
#
# In[27]:
from plot_material import type12_error
type12_error()
# # Confidence Interval
# Why bother with confidence intervals if we have hypothesis testing?
#
# If you have a theory that house price has a certain linear relationship with disposable income, you test the theory with a model; this is called _hypothesis testing_. But what if you don't have a theory yet and simply run the regression? You may wonder how well these estimates represent the true parameters; the range within which you feel confident is called a **confidence interval**.
#
# The two procedures complement each other, which is why we often see them reported together.
# Recall the rejection rules are
# $$
# \frac{\hat{\beta}-\beta}{\text { s.e. }\left(\hat{\beta}\right)}>t_{\text {crit }} \quad \text { or } \quad \frac{\hat{\beta}-\beta}{\text { s.e. }\left(\hat{\beta}\right)}<-t_{\text {crit }}
# $$
# If we slightly rearrange and join them, we get the confidence interval
# $$
# \hat{\beta}-\text { s.e. }\left(\hat{\beta}\right) \times t_{\text {crit }} \leq \beta\leq \hat{\beta}+\text { s.e. }\left(\hat{\beta}\right) \times t_{\text {crit }}
# $$
# The higher the confidence level, i.e. the smaller $\alpha$, the wider the confidence interval (because of a larger $t_{crit}$). For example, if the significance level is $\alpha=.05$, then the confidence level is $0.95$.
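#
# A quick sketch of that claim: holding the degrees of freedom fixed (reusing the sample size ```n``` from the earlier cells), the critical value grows as $\alpha$ shrinks, so the interval widens.
# In[ ]:
for alpha in [0.10, 0.05, 0.01]:
    # two-tailed critical value at significance level alpha
    print("alpha = {:.2f}, t_crit = {:.4f}".format(alpha, sp.stats.t.ppf(1 - alpha / 2, df=n - 1)))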
#
# Here's the confidence interval ($95\%$) of our house price example.
# In[28]:
t_crit = sp.stats.t.ppf(0.975, df=n - 1)
print("C.I. of b1: [{}, {}]".format(b1 - t_crit * std_err_b1, b1 + t_crit * std_err_b1))
print("C.I. of b2: [{}, {}]".format(b2 - t_crit * std_err_b2, b2 + t_crit * std_err_b2))
# There is a $95\%$ chance that a confidence interval constructed this way covers the true parameter $\beta$.
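#
# If you want to see that coverage claim in action, here is a small Monte Carlo sketch with arbitrary illustrative parameters (not the house price data): simulate many samples from a known model, build a $95\%$ interval for the slope each time, and count how often it covers the true slope.
# In[ ]:
np.random.seed(0)
true_b2, n_sim, n_obs = 0.5, 2000, 30
x_sim = np.linspace(1, 10, n_obs)
t_crit_sim = sp.stats.t.ppf(0.975, df=n_obs - 2)
covered = 0
for _ in range(n_sim):
    y_sim = 2.0 + true_b2 * x_sim + np.random.randn(n_obs)
    slope, intercept, r_val, p_val, se_slope = sp.stats.linregress(x_sim, y_sim)
    lower = slope - t_crit_sim * se_slope
    upper = slope + t_crit_sim * se_slope
    covered += (lower <= true_b2 <= upper)
print("empirical coverage: {:.3f}".format(covered / n_sim))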
# # One-Tailed vs Two-Tailed Test
# So far we have been discussing the two-tailed test, but there are scenarios where a one-tailed test makes more sense. In our house price example, some practitioners would prefer to test the common-sense claim that _disposable income does not have a negative effect on house price_: the null is that the effect is negative, and the alternative is that there is either no effect or a positive effect.
#
# Thus the one-tailed test hypotheses are
# $$
# H_0: \beta_2<0\\
# H_1: \beta_2\geq 0
# $$
# In a one-tailed test we no longer split $\alpha$, since there is only one side; this means the critical value is smaller and it is easier to reject the null hypothesis.
#
# Here is $t_{crit}$ for $\alpha=5\%$. These are only conventional rules; if you still prefer $2.5\%$ on one side, feel free to do so. Especially when you have a very significant $t$-statistic, such as $10$, one-tailed or two-tailed won't really matter.
# In[29]:
t_crit_oneside = sp.stats.t.ppf(0.95, df=n - 1)
t_crit_oneside
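# A side-by-side sketch of the two critical values at $\alpha=5\%$, with the same degrees of freedom as above:
# In[ ]:
print("one-tailed t_crit: {:.4f}".format(sp.stats.t.ppf(0.95, df=n - 1)))   # all 5% in one tail
print("two-tailed t_crit: {:.4f}".format(sp.stats.t.ppf(0.975, df=n - 1)))  # 2.5% in each tail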
# So here are the rules of thumb for a one-tailed test.
#
# Rules of Thumb for One-Tailed Test
# 1. Use a one-tailed test if theory or common sense supports one side, e.g. household consumption increases as disposable income increases.
# 2. If the two-tailed test fails to reject but the one-tailed test rejects, you may report the one-tailed result, provided the first rule is also satisfied.
#
# # $F$-test
# The $F$-test is based on the _Analysis of Variance_ (ANOVA); the goal is to test **multiple restrictions** on the regression model. In the simple linear regression model, the **joint hypothesis** is usually
# \begin{align}
# H_0&: \beta_1 = 0,\qquad\beta_2=0\\
# H_1&: \text{At least one restriction does not hold}
# \end{align}
#
# Once the ANOVA is done, the $F$-statistic is a natural byproduct.
# $$
# F=\frac{E S S /(k-1)}{R S S /(n-k)}
# $$
# where $k$ is the number of parameters in the regression model (here, in the simple regression model, $k=2$) and $n$ is the sample size.
#
# You might be wondering now: why aren't we using the same old $t$-tests, such as
# $$
# H_0: \hat{\beta}_1=0 \qquad H_0: \hat{\beta}_2=0 \qquad \\
# H_1: \hat{\beta}_1\neq0 \qquad H_1: \hat{\beta}_2\neq0\qquad\\
# $$
#
# The problem is that the number of $t$-tests grows as large as ${k \choose 2}$, where $k$ is the number of parameters. If there are $5$ parameters, we would have to test ${5 \choose 2}=10$ pairs. With a $95\%$ confidence level, $10$ $t$-tests would cut the joint confidence level down dramatically to $0.95^{10}=59.8\%$, which also means the probability of a _Type I_ error would be around $40\%$.
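#
# The arithmetic behind that claim, spelled out (assuming the $10$ tests are independent):
# In[ ]:
joint_confidence = 0.95 ** 10  # probability that all 10 independent 95%-level tests avoid a Type I error
print("joint confidence level: {:.3f}".format(joint_confidence))
print("family-wise Type I error: {:.3f}".format(1 - joint_confidence))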
#
# Since we already have the user-defined functions written in the OLS class, the $F$-statistic is
# In[30]:
f_stat = (ess / (2 - 1)) / (rss / (len(df["salary"]) - 2))
print("F-statistic is {:.4f}.".format(f_stat))
# In[31]:
p_value = 1 - sp.stats.f.cdf(
    f_stat, 1, len(df["salary"]) - 2
)  # sp.stats.f.cdf(x, df of numerator, df of denominator)
print("p-value is {:.4f}.".format(p_value))
# To explore further, we can even prove that in simple linear regression $F$ is just the square of $t$.
#
# $F$-statistic and $t$-statistic
# Here's the proof that $F$ and $t$ are connected
# $$
# F=\frac{R^{2}}{\left(1-R^{2}\right) /(n-2)}=\frac{\frac{\operatorname{Var}(\hat{Y})}{\operatorname{Var}(Y)}}{\left\{1-\frac{\operatorname{Var}(\hat{Y})}{\operatorname{Var}(Y)}\right\} /(n-2)}\\
# =\frac{\frac{\operatorname{Var}(\hat{Y})}{\operatorname{Var}(Y)}}{\left\{\frac{\operatorname{Var}(Y)-\operatorname{Var}(\hat{Y})}{\operatorname{Var}(Y)}\right\} /(n-2)}=\frac{\operatorname{Var}(\hat{Y})}{\operatorname{Var}(\varepsilon) /(n-2)}\\
# =\frac{\operatorname{Var}\left(\hat{\beta}_{1}+\hat{\beta}_{2} X\right)}{\left\{\frac{1}{n} \sum_{i=1}^{n} \varepsilon_{i}^{2}\right\} /(n-2)}=\frac{\hat{\beta}_{2}^{2} \operatorname{Var}(X)}{\frac{1}{n} s_{u}^{2}}=\frac{\hat{\beta}_{2}^{2}}{\frac{s_{u}^{2}}{n \operatorname{Var}(X)}}=\frac{\hat{\beta}_{2}^{2}}{\left[\operatorname{s.e.} \left(\hat{\beta}_{2}\right)\right]^{2}}=t^{2}
# $$
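# A quick numerical check of this identity, assuming ```f_stat``` and ```t_b2``` from the earlier cells are still in memory:
# In[ ]:
print("F = {:.4f}, t^2 = {:.4f}".format(f_stat, t_b2 ** 2))  # the two should coincide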
#
#
# $F$-statistic and $R^2$
# The $F$-statistic offers a different angle on evaluating goodness of fit; it can be shown that $F$ and $R^2$ are closely connected. Divide both the numerator and the denominator by $TSS$:
# $$
# F=\frac{(E S S / T S S) /(k-1)}{(R S S / T S S) /(n-k)}=\frac{R^{2} /(k-1)}{\left(1-R^{2}\right) /(n-k)}
# $$
# We prefer $F$ for hypothesis testing because the critical value of $F$ is straightforward to obtain, whereas the critical value of $R^2$ has to be calculated from $F_{crit}$:
# $$
# R_{\mathrm{crit}}^{2}=\frac{(k-1) F_{\mathrm{crit}}}{(k-1) F_{\text {crit }}+(n-k)}
# $$
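# Both formulas are easy to verify in code. The sketch below assumes ```ess```, ```rss```, ```f_stat``` and ```df``` from the earlier cells are still available, so that $R^2 = ESS/(ESS+RSS)$, with $k=2$.
# In[ ]:
k = 2
n_obs = len(df["salary"])
r_sq = ess / (ess + rss)
f_from_r2 = (r_sq / (k - 1)) / ((1 - r_sq) / (n_obs - k))  # F rebuilt from R^2
f_crit = sp.stats.f.ppf(0.95, k - 1, n_obs - k)
r_sq_crit = (k - 1) * f_crit / ((k - 1) * f_crit + (n_obs - k))  # critical R^2 from F_crit
print("F from R^2: {:.4f} (compare with f_stat = {:.4f})".format(f_from_r2, f_stat))
print("F_crit: {:.4f}, critical R^2: {:.4f}".format(f_crit, r_sq_crit))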
#
# # Regression vs Correlation
# Here is a message for all beginners who misinterpret a regression relationship as causation.
#
# Does Regression Imply Causation?
#
# It's tempting to interpret a regression result as causality, but it is not. Regression only implies a statistical relationship; the independent variables may or may not be the cause of the dependent variable. Sometimes we know thanks to theory, sometimes we don't.
#
# For instance, researchers have found that parents with higher education tend to have healthier children, but this is hardly causality. Perhaps more educated parents are in general wealthier and can afford decent medical packages, or they spend more time with their kids on sports and diet. We can form some hypotheses, but not a definite causal claim based on one regression.
#
# But regression does resemble correlation to some extent.
#
# Does Regression Imply Correlation?
# From the formula for $\hat{\beta}_2$
# $$
# \hat{\beta}_2 =\frac{\text{Cov}(X, Y)}{\text{Var}(X)}
# $$
# We can see that the regression coefficient indeed has a correlation-like component (the covariance in the formula), but it is normalized by the variance of $X$ (i.e. $\sigma_X\sigma_X$) rather than by $\sigma_X\sigma_Y$. Compare with the correlation coefficient of $X$ and $Y$:
# $$
# \rho_{XY}=\frac{\text{Cov}(X, Y)}{\sigma_X\sigma_Y}
# $$
# One important difference is that the regression coefficient does not treat the two variables symmetrically, whereas the correlation coefficient does. Joining the two formulae, we get a different view of the coefficient:
# $$
# \hat{\beta}_2=\frac{\rho_{XY}\sigma_X\sigma_Y}{\text{Var}(X)}= \frac{\rho_{XY}\sigma_X\sigma_Y}{\sigma_X\sigma_X}=\rho_{XY}\frac{\sigma_Y}{\sigma_X}
# $$
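# A numerical sketch of this identity, assuming ```df``` still holds the house price sheet with ```salary``` ($X$) and ```house_price``` ($Y$) columns:
# In[ ]:
x_arr, y_arr = df["salary"], df["house_price"]
rho = np.corrcoef(x_arr, y_arr)[0, 1]
print("rho * sigma_Y / sigma_X = {:.4f}".format(rho * y_arr.std() / x_arr.std()))
print("Cov(X, Y) / Var(X)      = {:.4f}".format(np.cov(x_arr, y_arr)[0, 1] / x_arr.var(ddof=1)))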
#
# Besides that, the purposes of the two techniques are different: regression is mainly about predicting the behavior of the dependent variable, while correlation mainly summarizes the direction and strength of the relationship between two or more variables.
#
# Maybe a chart can shed light on their relationship. All data are simulated with $u\sim N(0, 1)$. It is easy to notice that a smaller correlation implies a smaller slope coefficient in absolute value, and that a larger disturbance term also implies a lower correlation coefficient.
# In[32]:
from plot_material import reg_corr_plot
reg_corr_plot()
# # Joint Confidence Region
# The **joint confidence region** describes the joint distribution of the regression coefficients; theoretically it is an ellipse showing where the coefficient pair is likely to lie.
#
# Here is a Monte Carlo simulation: we set $\beta_1 = 3$, $\beta_2 = 4$ and $u\sim N(0, 10^2)$, run it $1000$ times, then plot the estimates.
# In[33]:
beta1, beta2 = 3, 4
beta1_array, beta2_array = [], []
for i in range(1000):
    u = 10 * np.random.randn(30)
    X2 = np.linspace(10, 100, 30)
    Y = beta1 + beta2 * X2 + u
    df = pd.DataFrame([Y, X2]).transpose()
    df.columns = ["Y", "X2"]
    X_inde = df["X2"]
    Y = df["Y"]
    X_inde = sm.add_constant(X_inde)
    model = sm.OLS(Y, X_inde).fit()
    beta1_array.append(model.params[0])
    beta2_array.append(model.params[1])
fig, ax = plt.subplots(figsize=(10, 10))
ax.grid()
for i in range(1000):
    ax.scatter(
        beta1_array[i], beta2_array[i]
    )  # no need for a loop, the loop just gives each point a different color
ax.set_xlabel(r"$\beta_1$")
ax.set_ylabel(r"$\beta_2$")
plt.show()
# But why does the joint distribution of the coefficients have an elliptic shape? If you look at any linear regression plot, it is not hard to notice that a higher slope coefficient $\beta_2$ forces a lower intercept coefficient $\beta_1$; this is a geometric feature of the linear regression model.
#
# And from the plot we can see that the range of $\beta_1$ is much larger than that of $\beta_2$ and even includes $0$. Especially when the data points are far away from $0$, $\beta_1$ can take erratic values, which is also why we usually don't attempt to interpret $\beta_1$.
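#
# The negative relationship between the two estimates can be checked directly from the simulated arrays above:
# In[ ]:
# a clearly negative sample correlation explains the downward-tilted ellipse
print("corr(beta1_hat, beta2_hat) = {:.4f}".format(np.corrcoef(beta1_array, beta2_array)[0, 1]))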
# # Stochastic Regressors
# One of the Gauss-Markov (G-M) conditions is $\text{Cov}(X_i, u_i)=0$, which assumes $X_i$ is non-stochastic. But we commonly encounter stochastic $X_i$: for instance, if you sample $10$ families' annual incomes, you have no idea in advance how much they earn, so treating the regressor as stochastic is more appropriate.
#
# If $X_i$ is stochastic and distributed independently of $u_i$, it guarantees that $\text{Cov}(X_i, u_i)=0$, but not vice versa.
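#
# A toy sketch of the 'not vice versa' part (this example is ours, not from the house price data): $X_i = u_i^2$ clearly depends on $u_i$, yet their covariance is approximately zero, because $E(u_i^3)=0$ for a symmetric disturbance.
# In[ ]:
np.random.seed(1)
u = np.random.randn(100000)
X_dep = u ** 2  # fully determined by u, hence not independent of it
print("sample Cov(X, u) = {:.4f}".format(np.cov(X_dep, u)[0, 1]))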
# # Maximum Likelihood Estimation
# At more advanced levels of econometric research, **maximum likelihood estimation** (MLE) is used more often than OLS. The reason is that MLE is more flexible about the distributional assumption on the disturbance term, i.e. it does not require normality. But it does require you to assume *some* distribution, which can be exponential, gamma, or whatever fits.
#
# We will provide two examples to illustrate the philosophy of MLE, one is to estimate simple linear regression with MLE and the other is to estimate a mean of an exponential distribution.
# ## MLE for Simple Linear Regression
# Once you have a dataset, you should understand that the data are just observations of random variables. For instance, suppose you are about to collect $100$ families' annual income data for the year 2021: each family's income is essentially a random variable, hence it follows some distribution, which could be a normal distribution or a skewed gamma distribution.
#
# Once the data are collected, the observations are fixed. Each data point, i.e. $Y_1, Y_2, ..., Y_n$, is just a single realization of its distribution, and the joint distribution of all the random variables is
# $$
# f\left(Y_{1}, Y_{2}, \ldots, Y_{n}\right)
# $$
# We assume a simple linear regression model $Y_{i}=\beta_{1}+\beta_{2} X_{i}+u_{i}$ to explain $Y$; then the joint distribution is conditional:
# $$
# f\left(Y_{1}, Y_{2}, \ldots, Y_{n} \mid \beta_{1}+\beta_{2} X_{i}, \sigma^{2}\right)
# $$
# We also assume each family's income is independent of the rest, so the joint distribution equals the product of the individual distributions.
# $$
# \begin{aligned}
# &f\left(Y_{1}, Y_{2}, \ldots, Y_{n} \mid \beta_{1}+\beta_{2} X_{i}, \sigma^{2}\right)=f\left(Y_{1} \mid \beta_{1}+\beta_{2} X_{1}, \sigma^{2}\right) f\left(Y_{2} \mid \beta_{1}+\beta_{2} X_{2}, \sigma^{2}\right) \cdots f\left(Y_{n} \mid \beta_{1}+\beta_{2} X_{n}, \sigma^{2}\right) =\prod_{i=1}^nf\left(Y_{i} \mid \beta_{1}+\beta_{2} X_{i}, \sigma^{2}\right)
# \end{aligned}
# $$
# Now we need to ask ourselves: what distribution does $Y$ follow? This is the same as asking what the distribution of $u$ is. It is reasonable to assume that $u$ follows a normal distribution, and therefore so do the $Y$'s.
# $$
# f(Y_i)= \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}\frac{[Y_i-E(Y_i)]^2}{\sigma^2}}=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}\frac{[Y_i-\beta_1-\beta_2X_i]^2}{\sigma^2}}
# $$
# Then the joint distribution is the product of these PDFs:
# $$
# \prod_{i=1}^nf\left(Y_{i} \mid \beta_{1}+\beta_{2} X_{i}, \sigma^{2}\right)=\frac{1}{\sigma^{n}(\sqrt{2 \pi})^{n}} e^{-\frac{1}{2} \sum \frac{\left(Y_{i}-\beta_{1}-\beta_{2} X_{i}\right)^{2}}{\sigma^{2}}}
# $$
# Once we have the joint distribution function, the frequentist view is that the observed data were generated by this distribution with certain parameters, in this case $\beta_1, \beta_2$ and $\sigma^2$. So the fundamental questions are: what are those parameters, and how do we find them?
#
# We give a name to the joint distribution function above - the **likelihood function** - which measures how likely a given set of parameters is to have generated the observed data. We denote the likelihood function as $LF(\beta_1, \beta_2, \sigma^2)$.
#
# MLE, as its name indicates, estimates the parameters in such a manner that the probability of generating the observed $Y$'s is highest.
#
# To derive an analytical solution, we usually work with the log-likelihood function
# $$
# \ln{LF}=\ln{\prod_{i=1}^nf\left(Y_{i} \mid \beta_{1}+\beta_{2} X_{i}, \sigma^{2}\right)} =-\frac{n}{2} \ln \sigma^{2}-\frac{n}{2} \ln (2 \pi)-\frac{1}{2} \sum \frac{\left(Y_{i}-\beta_{1}-\beta_{2} X_{i}\right)^{2}}{\sigma^{2}}
# $$
# Taking derivatives with respect to $\beta_1$, $\beta_2$ and $\sigma^2$ and setting them equal to $0$ yields the **maximum likelihood estimators** for simple linear regression under the assumption of a normally distributed disturbance term:
# \begin{align}
# \hat{\beta}_2 &= \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^n(X_i-\bar{X})^2}=\frac{\text{Cov}(X, Y)}{\text{Var}(X)}\\
# \hat{\beta}_1 &= \bar{Y}-\hat{\beta}_2\bar{X}\\
# s^2 &= \frac{1}{n} \sum_{i=1}^{n}\left(Y_{i}-\left(\hat{\beta}_{1}+\hat{\beta}_{2} X_{i}\right)\right)^{2} = \frac{1}{n}\sum_{i=1}^n e_i^2
# \end{align}
# The MLE $\hat{\beta}_1$ and $\hat{\beta}_2$ are exactly the same as the OLS estimators; only $s^2$ differs from the OLS estimator $s^2 = \frac{\sum e^2_i}{n-2}$. The MLE $s^2$ is biased in small samples, but _consistent_ as the sample size increases.
# We'll now experiment with MLE on the house price data: load the data, then estimate the coefficients by maximum likelihood and compare with OLS.
# The ```statsmodels``` library doesn't give us a direct API for this kind of custom maximum likelihood estimation, but we can construct it with ```scipy```'s ```minimize``` function, so we define a negative log-likelihood. We use the negative because ```scipy.optimize``` only provides ```minimize```, and minimizing $-\ln LF$ is equivalent to maximizing $\ln LF$.
# In[34]:
df = pd.read_excel("Basic_Econometrics_practice_data.xlsx", "CN_Cities_house_price")
df.head()
# In[35]:
y = df["house_price"]
X = df["salary"]
# In[36]:
def neg_LL(theta_array):
    # theta_array holds [intercept, slope, sigma]
    y_fitted = theta_array[0] + theta_array[1] * X
    sigma = theta_array[2]
    LL = (
        -(len(y_fitted) / 2) * np.log(sigma ** 2)
        - (len(y_fitted) / 2) * np.log(2 * np.pi)
        - 1 / (2 * sigma ** 2) * np.sum((y - y_fitted) ** 2)
    )
    return -LL  # negate so that minimizing this maximizes the log-likelihood
# In[37]:
from scipy.optimize import minimize  # assumed import; skip if already imported earlier

theta_start = np.array([1, 1, 1])
res = minimize(neg_LL, theta_start, method="Nelder-Mead", options={"disp": True})
# In[38]:
results = pd.DataFrame({"parameters": res["x"]})
results.index = ["Intercept", "Slope", "Sigma"]
results.head()
# As you can see, the intercept and slope coefficients are the same as OLS, but $\hat{\sigma}$ is underestimated, as the theory predicts.
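#
# Here is a small simulation sketch of that bias, with arbitrary illustrative parameters: for $n=20$ observations the MLE $\hat{\sigma}^2=\frac{1}{n}\sum e_i^2$ is on average below the true $\sigma^2$, while the OLS version $\frac{1}{n-2}\sum e_i^2$ is roughly unbiased.
# In[ ]:
np.random.seed(2)
n_obs, sigma_true, n_sim = 20, 3.0, 5000
x_sim = np.linspace(1, 10, n_obs)
mle_s2, ols_s2 = [], []
for _ in range(n_sim):
    y_sim = 1 + 2 * x_sim + sigma_true * np.random.randn(n_obs)
    slope, intercept = np.polyfit(x_sim, y_sim, 1)  # OLS fit of a degree-1 polynomial
    resid = y_sim - (intercept + slope * x_sim)
    mle_s2.append(np.sum(resid ** 2) / n_obs)        # MLE: divide by n
    ols_s2.append(np.sum(resid ** 2) / (n_obs - 2))  # OLS: divide by n - 2
print("true sigma^2: {:.2f}".format(sigma_true ** 2))
print("average MLE s^2: {:.2f}".format(np.mean(mle_s2)))
print("average OLS s^2: {:.2f}".format(np.mean(ols_s2)))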
# ## MLE for Exponential Distribution
# MLE is more general than OLS: it can estimate any parameters as long as you know or assume the distribution. In this example we give an explicit functional form, the exponential distribution.
#
# Consider
# $$
# f(Y_i; \theta) = \theta e^{-\theta Y_i}
# $$
# The joint distribution (likelihood function) is
# $$
# \prod_{i=1}^n f(Y_i; \theta) = \theta^ n e^{-\theta \sum Y_i}
# $$
# Log likelihood function is
# $$
# \ln \prod_{i=1}^n f(Y_i; \theta) = n\ln{\theta} - \theta \sum_{i=1}^n Y_i
# $$
# Taking the derivative with respect to $\theta$ and setting it to zero gives
# $$
# \frac{\partial \ln LF}{\partial \theta} = \frac{n}{\theta}-\sum_{i=1}^n Y_i = 0
# $$
# The result is
# $$
# \hat{\theta} = \frac{n}{n \bar{Y}} = \frac{1}{\bar{Y}}
# $$
# The estimated parameter $\hat{\theta}$ is the reciprocal of the mean of $Y_i$.
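#
# A simulation sketch of this result: draw exponential data with a known $\theta$ (an arbitrary value below) and compare $1/\bar{Y}$ with the truth. Note that ```scipy``` parameterizes the exponential by ```scale``` $=1/\theta$.
# In[ ]:
np.random.seed(3)
theta_true = 2.5
Y_sim = sp.stats.expon.rvs(scale=1 / theta_true, size=10000)
print("true theta: {:.2f}, MLE 1/Y_bar: {:.4f}".format(theta_true, 1 / np.mean(Y_sim)))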
# # Statsmodels Library
# Fortunately, we don't need to program our own toolbox for estimation; the ```statsmodels``` library will be our main tool from now on. The code is self-explanatory unless it requires further clarification.
#
# The estimation results are reported much as in Stata or EViews; there is a test run below.
# In[39]:
X = df["salary"]
Y = df["house_price"]
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
print_model = model.summary()
print(print_model)
# However, you might see some people using an $R$-style formula such as
# ```Y ~ X + Z```, which produces the same results.
# In[40]:
df = pd.read_excel(
    "Basic_Econometrics_practice_data.xlsx", sheet_name="CN_Cities_house_price"
)
df.head()
model = smf.ols(formula="house_price ~ salary", data=df)
results = model.fit()
print(results.summary())
# In[41]:
res = pd.DataFrame(results.params, columns=["OLS estimates"])
res
# In[42]:
results.params
# In[43]:
results.fittedvalues
# In[ ]: