import numpy as np
import scipy as sp
import scipy.stats
import matplotlib.pyplot as plt
import math as mt
import scipy.special
import seaborn as sns
plt.style.use("fivethirtyeight")
from statsmodels.graphics.tsaplots import plot_acf
import pandas as pd
Define some colors in this cell. $$ \require{color} \definecolor{red}{RGB}{240,5,5} \definecolor{blue}{RGB}{5,5,240} \definecolor{green}{RGB}{4,240,5} \definecolor{black}{RGB}{0,0,0} \definecolor{dsb}{RGB}{72, 61, 139} \definecolor{Maroon}{RGB}{128,0,0} $$
The common matrix form of linear regression is $$ y = X \beta+u $$ where $u \sim N( {0}, \sigma^2 I)$.
The covariance matrix of the disturbance term $u$ is $$ \operatorname{var}(u) \equiv\left[\begin{array}{cccc} \operatorname{var}\left(u_{1}\right) & \operatorname{cov}\left(u_{1}, u_{2}\right) & \ldots & \operatorname{cov}\left(u_{1}, u_{N}\right) \\ \operatorname{cov}\left(u_{2}, u_{1}\right) & \operatorname{var}\left(u_{2}\right) & \ldots & \operatorname{cov}\left(u_{2}, u_{N}\right) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{cov}\left(u_{N}, u_{1}\right) & \operatorname{cov}\left(u_{N}, u_{2}\right) & \ldots & \operatorname{var}\left(u_{N}\right) \end{array}\right]=\left[\begin{array}{cccc} h^{-1} & 0 & \ldots & 0 \\ 0 & h^{-1} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & h^{-1} \end{array}\right] $$
Derivations often make use of $h = 1/\sigma^2$, which is the precision. The diagonal form of the covariance matrix encodes two assumptions: no serial correlation and homoscedasticity.
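To have something concrete to compute with, here is a minimal simulation sketch of this setup using the numpy imported above; the sizes `N`, `k`, the precision `h`, and `beta_true` are illustrative choices, not values from the text.

```python
np.random.seed(0)

N, k = 200, 3                          # number of observations and regressors (illustrative)
h = 4.0                                # precision, h = 1 / sigma^2
beta_true = np.array([1.0, -2.0, 0.5])

# homoscedastic, serially uncorrelated disturbances: u ~ N(0, h^{-1} I)
X = np.column_stack([np.ones(N), np.random.randn(N, k - 1)])
u = np.random.normal(scale=np.sqrt(1 / h), size=N)
y = X @ beta_true + u
```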
With the additional assumption that $X$ is exogenous, we can construct the joint probability density $$ P(y, X \mid \beta, h)=P(y \mid X, \beta, h) P(X) $$
However, because $X$ does not depend on $\beta$ or $h$, we narrow our interest to $$ P(y \mid X, \beta, h) $$
Recall that the multivariate normal distribution takes the form $$ f(x)=(2 \pi)^{-\frac{N}{2}}|\Sigma|^{-\frac{1}{2}} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right) $$
where $|\Sigma|$ is the determinant of the covariance matrix and $\Sigma^{-1}$ is the inverse of the covariance matrix.
In the linear regression context, the determinant factor of the covariance matrix becomes $$ |\text{var}(u)|^{-1/2}=\left(\prod_{i=1}^Nh^{-1}\right)^{-1/2} = (h^{-N})^{-1/2}=h^{N/2}=|\Sigma|^{-1/2} $$
The inverse of the covariance matrix is $$ \text{var}(u)^{-1} = (h^{-1}I)^{-1}=hI=\Sigma^{-1} $$
Plugging these back into the likelihood function, $$ P(y \mid X, \beta, h)=(2 \pi)^{-\frac{N}{2}} h^{\frac{N}{2}} \exp \left[-\frac{h}{2}\left(({y}-{X} \hat{{\beta}})^{T}({y}-{X} \hat{{\beta}})+(\beta-\hat{\beta})^T {X}^T {X}(\beta-\hat{\beta})\right)\right] $$ where $\hat{\beta}=(X^TX)^{-1}X^Ty$ is the OLS estimate.
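A quick numerical sanity check of this decomposition (a sketch reusing the simulated `X`, `y` from above; the test value of `beta` is arbitrary): the sum of squares in the exponent splits into the residual sum of squares plus a quadratic form in $\beta-\hat{\beta}$.

```python
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # OLS estimate
u_hat = y - X @ beta_hat                         # OLS residuals

beta = np.array([0.8, -1.5, 0.2])                # arbitrary beta to test the identity
lhs = (y - X @ beta) @ (y - X @ beta)
rhs = u_hat @ u_hat + (beta - beta_hat) @ X.T @ X @ (beta - beta_hat)
print(np.isclose(lhs, rhs))                      # expected: True
```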
This form of the likelihood function suggests a natural conjugate prior, which has the same functional form as the likelihood and also yields a posterior within the same class of distributions.
Our prior should be a joint distribution $P(\beta, h)$; in order to arrive at the Normal-Gamma distribution, it proves convenient to write $$ P(\beta, h)=P(\beta \mid h) P(h) $$
This is exactly the Normal-Gamma distribution: the factors on the right-hand side are distributed as $$ \begin{gathered} \beta \mid h \sim N\left(\mu, (h V)^{-1}\right) \\ h \sim \operatorname{Gamma}(m, v) \end{gathered} $$ The advantage of this distribution is that $\beta$ lives on the real line while $h$ lives on the positive reals.
Recall the functional form of the Normal-Gamma distribution. $$ f(\beta, h)=(2 \pi)^{-\frac{k}{2}} h^{\frac{k}{2}}|V|^{1 / 2} \exp \left(-\frac{h}{2}(\beta-\mu)^T V(\beta-\mu)\right) \frac{1}{\left(\frac{2 m}{v}\right)^{v / 2}} \frac{1}{\Gamma\left(\frac{v}{2}\right)} h^{\frac{v-2}{2}} \exp \left[-\frac{h v}{2 m}\right] $$ where $k$ is the dimension of $\beta$.
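A small sketch of drawing from this prior, reading $\operatorname{Gamma}(m, v)$ in the mean/degrees-of-freedom parameterization implied by the density above (shape $v/2$, scale $2m/v$); the hyperparameters `mu`, `V`, `m`, `v` are illustrative.

```python
k = 3
mu = np.zeros(k)              # prior mean of beta
V = np.eye(k)                 # prior precision scale of beta (precision of beta | h is h*V)
m, v = 1.0, 5.0               # prior mean and degrees of freedom of h

h_draw = np.random.gamma(shape=v / 2, scale=2 * m / v)
beta_draw = np.random.multivariate_normal(mu, np.linalg.inv(h_draw * V))
print(h_draw, beta_draw)
```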
The posterior is formulated by Bayes' Theorem $$ P(\beta, h \mid y, X) \propto P(y \mid X, \beta, h)\, P(\beta \mid h)\, P(h) $$
Expand both quadratic forms; the cross terms $2 \hat{\beta}^T {X}^{T} {X} {\beta}$ and $2 {\mu}^{T} {V} {\beta}$ appear because each cross term is a scalar and therefore equals its own transpose. \begin{align} ({\beta}-\hat{\beta})^{T} {X}^{T} {X}({\beta}-\hat{\beta})+\left({\beta}-{\mu}\right)^{T} {V}\left({\beta}-{\mu}\right)&= {\beta}^{T} {X}^{T} {X} {\beta}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}-2 \hat{ {\beta}}^T {X}^{T} {X} {\beta}+ {\beta}^{T} V {\beta}+ {\mu}^{T} V {\mu}-2 {\mu}^{T} V {\beta}\\ &=\color{red}{\beta}^{T} {X}^{T} {X} {\beta}\color{black} + \color{red}{\beta}^{T} {V} {\beta} \color{black}-\color{blue}2 \hat{ {\beta}}^T {X}^{T} {X} {\beta}-\color{blue}2 {\mu}^{T} V {\beta}+\color{black} {\mu}^{T} V {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}\\ &=\color{red}\beta^T\underbrace{(X^TX+V)}_{W}\beta\color{black}-\color{blue}2\underbrace{(\hat{\beta}^TX^TX+\mu^TV)}_{w^T}\beta+\color{black} {\mu}^{T} {V} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}\\ &=\underbrace{\color{red}\beta^TW\beta\color{black}-\color{blue}2w^T\beta}_{\text{needs completing the square}}+\color{black} {\mu}^{T} {V} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}\\ \end{align}
where we introduce shorthand in preparation for completing the square $$ W = (X^TX+V)\\ w^T=\hat{\beta}^TX^TX+\mu^TV $$
Details of completing the square $$ \begin{align} \color{red}\beta^TW\beta\color{black}-\color{blue}2w^T\beta\color{black}+\overbrace{\color{green}w^TW^{-1}w-w^TW^{-1}w}^{0} &=\overbrace{\color{red}\beta^TW\beta\color{black}-\color{blue}w^T\beta\color{black}-\color{blue}\beta^Tw\color{black}+\color{green}w^TW^{-1}w}^{\text{complete the square}}-\color{green}w^TW^{-1}w\\ &=\color{Maroon}(\beta^TW-w^T)(\beta-W^{-1}w)\color{black}-\color{green}w^TW^{-1}w\\ &=\color{Maroon}(\beta^T-w^TW^{-1})W(\beta-W^{-1}w)\color{black}-\color{green}w^TW^{-1}w\\ &=\color{Maroon}(\beta-W^{-1}w)^TW(\beta-W^{-1}w)\color{black}-\color{green}w^TW^{-1}w \end{align} $$ We used the fact that $(W^{-1})^T=W^{-1}$, since $W$ is symmetric.
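A quick numerical check of the completed square with a random symmetric positive definite `W` (all values illustrative):

```python
rng = np.random.default_rng(1)
k = 3
A = rng.standard_normal((k, k))
W = A.T @ A + k * np.eye(k)              # symmetric positive definite W
w = rng.standard_normal(k)
beta = rng.standard_normal(k)

Winv_w = np.linalg.solve(W, w)           # W^{-1} w
lhs = beta @ W @ beta - 2 * w @ beta
rhs = (beta - Winv_w) @ W @ (beta - Winv_w) - w @ Winv_w
print(np.isclose(lhs, rhs))              # expected: True
```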
Define the mean of the posterior normal kernel; the $N$ subscript in $\mu_N^{\ }$ means it comes from the normal part $$ \begin{align} \mu_{N}^{\ } = W^{-1}w &= W^{-1}(X^TX\hat{\beta}+V^T\mu)\\ &=W^{-1}(X^TX(X^TX)^{-1}X^Ty+V^T\mu)\\ &=W^{-1}\underbrace{(X^Ty+V^T\mu)}_{w} \end{align} $$
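The two expressions for $\mu_N$ agree numerically because $X^TX\hat{\beta}=X^Ty$; a sketch reusing `X`, `y`, and `beta_hat` from the cells above, with illustrative prior hyperparameters `mu` and `V`:

```python
mu = np.zeros(X.shape[1])               # prior mean of beta
V = np.eye(X.shape[1])                  # prior precision scale of beta

W = X.T @ X + V
mu_N_a = np.linalg.solve(W, X.T @ X @ beta_hat + V @ mu)
mu_N_b = np.linalg.solve(W, X.T @ y + V @ mu)
print(np.allclose(mu_N_a, mu_N_b))      # expected: True
```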
Simplify each term on the right-hand side
Omit the terms that do not depend on $\beta$ or $h$ $$ \begin{align} P(\beta, h \mid {X}, y) \propto&(2 \pi)^{-\frac{N}{2}} h^{\frac{N}{2}} \exp{\left(-\frac{h}{2}(\beta-\mu_{N}^{\ })^TW(\beta-\mu_{N}^{\ })\right)}\\ &\times h^{\frac{k}{2}}|V|^{\frac{1}{2}}\exp \left(-\frac{h}{2}\left({y}^{T} {y} -\mu_N^T W \mu_{N}^{\ }+ {\mu}^{T} {V} {\mu}\right)\right)\\ &\times\frac{1}{\beta^{\alpha}}\frac{1}{\Gamma(\alpha)} h^{\alpha-1}\exp{\left(-\frac{h}{\beta}\right)} \end{align} $$
Then combine the powers of $h$ and merge the exponentials $$ \begin{align} P(\beta, h \mid {X}, y) \propto&(2 \pi)^{-\frac{N}{2}} h^{\frac{N}{2}} \exp{\left(-\frac{h}{2}(\beta-\mu_{N}^{\ })^TW(\beta-\mu_{N}^{\ })\right)}\\ &\times h^{\alpha+\frac{k}{2}-1}|V|^{\frac{1}{2}}\exp \left(-\frac{h}{\beta}-\frac{h}{2}\left({y}^{T} {y} -\mu_N^T W \mu_{N}^{\ }+ {\mu}^{T} {V} {\mu}\right)\right) \end{align} $$
First ignore constant factors such as the powers of $2\pi$ and the gamma-function normalizer, then collect the powers of $h$
Join the exponential terms $$ \begin{align} P(\beta, h \mid Y, \mathrm{X}) &\propto h^{\frac{k}{2}} \exp \left[-\frac{h}{2}\left(\beta-\widehat{\beta}\right)^T\mathrm{X}^T \mathrm{X}\left(\beta-\widehat{\beta}\right)-\frac{h}{2}(\beta-\mu)^{T} V(\beta-\mu)\right] h^{\frac{N+v-2}{2}} \exp \left[-\frac{h}{2}\hat{u}^T \hat{u}-\frac{h v}{2 m}\right]\\ &\propto\ h^{\frac{k}{2}} \exp \left[-\frac{h}{2}\left[\left(\beta-\widehat{\beta}\right)^T\mathrm{X}^{T} \mathrm{X}\left(\beta-\widehat{\beta}\right)+(\beta-\mu)^T V(\beta-\mu)\right]\right] h^{\frac{N+v-2}{2}} \exp \left[-\frac{h}{2}\left(\hat{u}^T \hat{u}+\frac{v}{m}\right)\right] \end{align} $$
Recall the multivariate normal density function $$ f(x)=(2 \pi)^{-\frac{N}{2}}|\Sigma|^{-\frac{1}{2}} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right) $$
We assume the likelihood ${y} \sim {\mathcal{N}}\left({X}\beta, \sigma^{2} {I}\right)$, which follows exactly the same form as the density function above
where $N$ is the number of observations in the data.
Use the determinant and inverse rules for a scaled identity matrix $$ \left|\sigma^{2} {I}\right|=\sigma^{2N}\\ \left(\sigma^{2} {I}\right)^{-1}=\sigma^{-2}I $$
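These rules can be checked by evaluating the $N$-dimensional normal log-density directly and via the simplified scalar form (a sketch reusing the simulated `X`, `y`, `beta_true`, and `h` from above):

```python
sigma2 = 1 / h
full = sp.stats.multivariate_normal(mean=X @ beta_true, cov=sigma2 * np.eye(N)).logpdf(y)

resid = y - X @ beta_true
scalar = -N / 2 * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)
print(np.isclose(full, scalar))          # expected: True
```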
The prior is usually decomposed as $p\left({\beta}, \sigma^{2}\right)=p\left({\beta} \mid \sigma^{2}\right) p\left(\sigma^{2}\right)$ where ${\beta} \mid \sigma^{2} \sim {N}\left({\mu}, \sigma^{2} {\Lambda}^{-1}\right)$ and $\sigma^{2} \sim \operatorname{InvGamma}\left(\alpha, \beta\right)$
where $k$ is the number of $\beta$ parameters in the model.
Use the determinant rules $$ |\sigma^2\Lambda^{-1}|^{-\frac{1}{2}}=(\sigma^{2k}|\Lambda^{-1}|)^{-\frac{1}{2}}=(\sigma^{2k}|\Lambda|^{-1})^{-\frac{1}{2}}=\sigma^{-k}|\Lambda|^{\frac{1}{2}}\\ $$ then $$ (2 \pi)^{-\frac{k}{2}}\left|\sigma^{2} {\Lambda}^{-1}\right|^{-\frac{1}{2}}=(2\pi\sigma^2)^{-\frac{k}{2}}|\Lambda|^\frac{1}{2} $$
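A one-off numerical check of this determinant rule, with an illustrative diagonal $\Lambda$ and $\sigma^2$:

```python
k = 3
Lam = np.diag([1.0, 2.0, 4.0])           # illustrative prior precision matrix Lambda
sigma2 = 0.5

lhs = np.linalg.det(sigma2 * np.linalg.inv(Lam)) ** (-0.5)
rhs = sigma2 ** (-k / 2) * np.linalg.det(Lam) ** 0.5
print(np.isclose(lhs, rhs))              # expected: True
```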
Recall the inverse gamma distribution is
$$ f(x ; \alpha, \beta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}(1 / x)^{\alpha+1} \exp (-\beta / x) $$ where $\alpha$ is the shape and $\beta$ is the scale parameter.
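This density matches `scipy.stats.invgamma` with shape `a = alpha` and `scale = beta`; a quick sketch with illustrative values:

```python
alpha_, beta_ = 3.0, 2.0
x = np.linspace(0.1, 5, 200)

manual = beta_ ** alpha_ / sp.special.gamma(alpha_) * x ** (-alpha_ - 1) * np.exp(-beta_ / x)
print(np.allclose(manual, sp.stats.invgamma.pdf(x, a=alpha_, scale=beta_)))   # expected: True
```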
To move forward, there will be a horrible amount of linear algebraic manipulation; the first step is the kernel decomposition of the likelihood function. The quadratic form $(\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta})$ is created so that it can be combined with the kernel of the prior $P\left( {\beta} \mid \sigma^{2}\right)$, that is $\left({\beta}-{\mu}\right)^{T} {\Lambda}\left({\beta}-{\mu}\right)$
Joining the second decomposed term of the likelihood with the prior $P(\beta\mid\sigma^2)$ ends up as an addition of the two quadratic forms, because exponents add when exponentials are multiplied $$ (\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta})+\left({\beta}-{\mu}\right)^{T} {\Lambda}\left({\beta}-{\mu}\right) $$
Expand both quadratic forms; again, the cross terms $2 \hat{\beta}^T {X}^{T} {X} {\beta}$ and $2 {\mu}^{T} {\Lambda} {\beta}$ appear because each cross term is a scalar and therefore equals its own transpose. \begin{align} (\hat{{\beta}}-{\beta})^{T} {X}^{T} {X}(\hat{{\beta}}-{\beta})+\left({\beta}-{\mu}\right)^{T} {\Lambda}\left({\beta}-{\mu}\right)&= {\beta}^{T} {X}^{T} {X} {\beta}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}-2 \hat{ {\beta}}^T {X}^{T} {X} {\beta}+ {\beta}^{T} {\Lambda} {\beta}+ {\mu}^{T} {\Lambda} {\mu}-2 {\mu}^{T} {\Lambda} {\beta}\\ &=\color{red}{\beta}^{T} {X}^{T} {X} {\beta}\color{black} + \color{red}{\beta}^{T} {\Lambda} {\beta} \color{black}-\color{blue}2 \hat{ {\beta}}^T {X}^{T} {X} {\beta}-\color{blue}2 {\mu}^{T} {\Lambda} {\beta}+\color{black} {\mu}^{T} {\Lambda} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}\\ &=\color{red}\beta^T\underbrace{(X^TX+\Lambda)}_{M}\beta\color{black}-\color{blue}2\underbrace{(\hat{\beta}^TX^TX+\mu^T\Lambda)}_{m^T}\beta+\color{black} {\mu}^{T} {\Lambda} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}\\ &=\underbrace{\color{red}\beta^TM\beta\color{black}-\color{blue}2m^T\beta}_{\text{needs completing the square}}+\color{black} {\mu}^{T} {\Lambda} {\mu}+\hat{ {\beta}}^T {X}^{T} {X} \hat{ {\beta}}\\ \end{align}
We have defined $$ M=X^TX+\Lambda\\ m^T = \hat{\beta}^TX^TX+\mu^T\Lambda $$
Continue by completing the square of the colored terms $$ \begin{align} \color{red}\beta^TM\beta\color{black}-\color{blue}2m^T\beta\color{black}+\overbrace{\color{green}m^TM^{-1}m-m^TM^{-1}m}^{0} &=\overbrace{\color{red}\beta^TM\beta\color{black}-\color{blue}m^T\beta\color{black}-\color{blue}\beta^Tm\color{black}+\color{green}m^TM^{-1}m}^{\text{complete the square}}-\color{green}m^TM^{-1}m\\ &=\color{Maroon}(\beta^TM-m^T)(\beta-M^{-1}m)\color{black}-\color{green}m^TM^{-1}m\\ &=\color{Maroon}(\beta^T-m^TM^{-1})M(\beta-M^{-1}m)\color{black}-\color{green}m^TM^{-1}m\\ &=\color{Maroon}(\beta-M^{-1}m)^TM(\beta-M^{-1}m)\color{black}-\color{green}m^TM^{-1}m \end{align} $$
The last step above made use of the fact that $(M^{-1})^T=M^{-1}$, which follows from the inverse matrix rule $(M^{-1})^T = (M^{T})^{-1}$ and the symmetry of $M$.
So far we have obtained the kernel of the normal part of the posterior; the rest will be shuffled into the inverse-gamma part.
Define the mean of the posterior normal kernel; the $N$ subscript in $\mu_N^{\ }$ shows it comes from the normal part $$ \begin{align} \mu_{N}^{\ } = M^{-1}m &= M^{-1}(X^TX\hat{\beta}+\Lambda^T\mu)\\ &=M^{-1}(X^TX(X^TX)^{-1}X^Ty+\Lambda^T\mu)\\ &=M^{-1}\underbrace{(X^Ty+\Lambda^T\mu)}_{m} \end{align} $$
We defined $m^T$ above; here we have derived its transpose, i.e. $m$.
Simplify each term on the right hand side
Join them
where $$ M=(X^TX+\Lambda)\\ \mu_{N}^{\ }=M^{-1}\underbrace{(X^Ty+\Lambda^T\mu)}_{m} $$
We have obtained the normal kernel of the posterior, so the remaining terms above can be combined with the inverse-gamma prior $P(\sigma^2)$ to obtain the posterior inverse-gamma kernel.
Because we are only working up to proportionality, we can safely ignore the normalizers in the last two lines $$ \frac{\beta ^{\alpha}}{\Gamma\left(\alpha \right)}\ \text{ and }\ (2\pi)^{-k/2} $$
which gives us the chance to combine terms
Define $$ \alpha_G^{\ } = \alpha+\frac{k}{2}\\ \beta_G^{\ } = \beta+ \frac{1}{2}(y^Ty + \mu^T\Lambda\mu -\mu^T_{N}M \mu_{N}^{\ }) $$
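Putting the definitions together in code (a sketch reusing the simulated `X`, `y` from earlier; `Lam`, `mu`, `alpha0`, `beta0` are illustrative prior hyperparameters standing in for $\Lambda$, $\mu$, $\alpha$, $\beta$):

```python
Lam = np.eye(X.shape[1])                 # prior precision scale Lambda
mu = np.zeros(X.shape[1])                # prior mean of beta
alpha0, beta0 = 2.0, 2.0                 # inverse-gamma hyperparameters

M = X.T @ X + Lam
mu_N = np.linalg.solve(M, X.T @ y + Lam @ mu)
alpha_G = alpha0 + X.shape[1] / 2                                  # alpha + k/2, as defined above
beta_G = beta0 + 0.5 * (y @ y + mu @ Lam @ mu - mu_N @ M @ mu_N)

print("posterior normal mean mu_N:", mu_N)
print("posterior inverse-gamma parameters:", alpha_G, beta_G)
```

Note that $\mu_N$ is a precision-weighted compromise between the OLS estimate $\hat{\beta}$ and the prior mean $\mu$.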