Sascha Spors, Professorship Signal Theory and Digital Signal Processing, Institute of Communications Engineering (INT), Faculty of Computer Science and Electrical Engineering (IEF), University of Rostock, Germany
Winter Semester 2023/24 (Master Course #24512)
Feel free to contact the lecturer: frank.schultz@uni-rostock.de
statsmodels
import matplotlib.pyplot as plt
import numpy as np
from numpy.linalg import inv, pinv
from scipy.stats import f, t
from scipy.linalg import svdvals
from statsmodels.api import OLS
from statsmodels.graphics.gofplots import qqplot
rng = np.random.default_rng(1234) # to reproduce results
def my_r2(X, theta_hat, y):
    """(Adjusted) coefficient of determination R^2.
    also known as empirical correlation coefficient between y and yhat
    """
    N = X.shape[0]  # number of samples / observations
    p = X.shape[1]  # number of model parameters (including intercept beta[0])
    yhat = X @ theta_hat  # do regression / prediction
    # sum of squares T.otal:
    SST = np.sum((y - np.mean(y)) ** 2)
    # sum of squares due to R.egression:
    SSR = np.sum((yhat - np.mean(y)) ** 2)
    # sum of squared E.rrors:
    SSE = np.sum((y - yhat) ** 2)
    # SST = SSR + SSE holds (numerical errors might corrupt this equality)
    # R2 = SSR / SST is the ratio between regression and total, 0<=R2<=1
    # rearranged R2 = (SST - SSE) / SST = 1 - SSE / SST
    # p.58 in [MT11], p.111 in [DB18], p.108 (1.34)-(1.36) in [FHT96],
    # p.125 in [FKLM21], Ch.2.4.6 in [Agresti15]
    R2 = 1 - SSE / SST
    # R2 should be adjusted by number of samples and model complexity
    # note that this equation holds for models that include an intercept term
    # p.58 in [MT11], p.163 in [FKLM21], Ch.2.4.6 in [Agresti15]
    # R2adj = 1 - (1 - R2) * (N - 1) / (N - p)
    # or rewritten to see the adjustments in a more convenient way
    R2adj = 1 - (SSE / (N - p)) / (SST / (N - 1))
    return (R2, R2adj)
def my_deviance_for_normal_pdf(X, beta_hat, y):
    """Scaled deviance for normally distributed data and errors.
    - is basically the sum of squared errors (SSE)
    - cf. p.89ff in [DB18], p.133 in [Agresti15], [MT11]
    - note that deviances are different for other distributions, but
      the same statistical concepts using deviances hold
    - so instead of thinking of this as the SSE we should get used to deviances
    """
    D = np.sum((y - X @ beta_hat) ** 2)  # SSE
    # SSE can also be calculated with the dot product:
    # tmp = y - X @ beta_hat
    # D = np.squeeze(tmp[:, None].T @ tmp[:, None])
    return D
alpha = 0.05 # standard alpha error for H0 rejection
N = 2**5 # number of observations / samples
noise = rng.normal(loc=0, scale=1.5, size=N)
print("noise: mean=", np.mean(noise), ", std=", np.std(noise, ddof=1))
x = np.linspace(1e-16, 200, N)  # start just above 0 so that log(x) is well defined
X = np.column_stack((x, 500 * np.abs(np.sin(x)), np.abs(np.log(x))))
print(X.shape)
beta = np.array([1e0, 1e-1, 1e-2, 1e-3]) # some nice numbers for true beta
# add bias column to design matrix
X = np.hstack((np.ones((X.shape[0], 1)), X))
hasconst = True
debug_flag = False
if debug_flag:  # works well for N=2**4
    X = np.column_stack((x**1, x**3, x**5))
    print(X.shape)
    beta = np.array([0, 1e0, 0, 1e-2])
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    noise = rng.normal(loc=0, scale=0.5, size=N)
# generate 'real world' data with design matrix, add noise
y = np.dot(X, beta) + noise
print(X.shape)
print(y.shape)
if debug_flag:
    print("debug_flag", debug_flag)
    # we might want to get another design matrix to debug F/p and t/p in detail
    X = np.column_stack((x**0.125, x**0.25, x**0.75))
    X = np.hstack((np.ones((X.shape[0], 1)), X))
print(svdvals(X))  # singular values indicate the conditioning of the design matrix
Choosing the model type here is a bit pointless, as we intentionally deal with OLS. However, we could ask ourselves whether weighted least squares, LS with L1 / L2 regularization, or other GLMs would do a good job here.
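As a hedged side note (my addition, not needed for the rest of the notebook): statsmodels can fit regularized variants of this linear model with OLS.fit_regularized(), where L1_wt=0 corresponds to an L2 (ridge) and L1_wt=1 to an L1 (lasso) penalty; the penalty strength of 0.1 below is an arbitrary example value, not a tuned one, and is unrelated to our significance level alpha.
# sketch only: regularized least squares variants mentioned above
results_ridge = OLS(y, X, hasconst=hasconst).fit_regularized(
    method="elastic_net", alpha=0.1, L1_wt=0.0
)  # L2 / ridge penalty
results_lasso = OLS(y, X, hasconst=hasconst).fit_regularized(
    method="elastic_net", alpha=0.1, L1_wt=1.0
)  # L1 / lasso penalty
print(results_ridge.params)
print(results_lasso.params)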
We assume that the outcome originates from the above data synthesis $$\mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{noise}$$ We could set up a design matrix $\mathbf{X}_f$ for the linear model $$\hat{\mathbf{y}} = \mathbf{X}_f \hat{\mathbf{\beta}}$$ such that $$||\mathbf{y} - \hat{\mathbf{y}}||_2^2$$ is minimized by choosing / calculating the 'right' coefficients $\hat{\mathbf{\beta}}$. Note the distinction between $\mathbf{X}$ (the real-world features) and $\mathbf{X}_f$ (the features that we think will explain the world and that we have chosen carefully in advance). Below, we will use $\mathbf{X}_f = \mathbf{X}$ as the starting point.
There are two major approaches to solving the problem $$\mathrm{min}_{\mathrm{wrt}\, \beta}||\mathbf{y} - \hat{\mathbf{y}}||_2^2$$
The first approach solves the problem by maximum likelihood estimation (MLE); for the second we can trust linear algebra and solve an over-determined set of linear equations, which is precisely what is known as OLS. It turns out that MLE yields the same results as OLS under normal distribution assumptions. This means that if our data approximately fulfill the normal distribution assumptions and we derive the solution via OLS, we are very close to the MLE. We should read up on this in the above-mentioned textbooks!
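As a minimal sketch of this equivalence (my addition): statsmodels can fit the same model as a Gaussian GLM, which is estimated by maximum likelihood (via IRLS); under the normality assumption the resulting estimate of $\mathbf{\beta}$ coincides with the OLS solution derived next.
# sketch only: MLE via a Gaussian GLM should reproduce the OLS solution
from statsmodels.api import GLM, families

glm_results = GLM(y, X, family=families.Gaussian()).fit()  # MLE via IRLS
print(glm_results.params)  # MLE of beta
print(pinv(X) @ y)  # OLS solution via the pseudoinverse, cf. below
print(np.allclose(glm_results.params, pinv(X) @ y))  # expected: True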
The solution (we assume full column rank) is given as $$\hat{\mathbf{\beta}} = (\mathbf{X}^\mathrm{T}\mathbf{X})^{-1}\mathbf{X}^\mathrm{T} \mathbf{y}$$
The matrix $(\mathbf{X}^\mathrm{T}\mathbf{X})^{-1}\mathbf{X}^\mathrm{T}$ is the left inverse of $\mathbf{X}$, i.e., multiplying $\mathbf{X}$ from the left by this matrix yields $$(\mathbf{X}^\mathrm{T}\mathbf{X})^{-1}\mathbf{X}^\mathrm{T} \mathbf{X} = \mathbf{I}$$
For the predictions $$\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{\beta}} = \mathbf{X}(\mathbf{X}^\mathrm{T}\mathbf{X})^{-1}\mathbf{X}^\mathrm{T} \mathbf{y}$$ we find that the hat matrix $\mathbf{X}(\mathbf{X}^\mathrm{T}\mathbf{X})^{-1}\mathbf{X}^\mathrm{T}$ is a projection matrix which projects $\mathbf{y}$ onto $\hat{\mathbf{y}}$, i.e. onto the column space of $\mathbf{X}$.
The projection matrix $\mathbf{I} - \mathbf{X}(\mathbf{X}^\mathrm{T}\mathbf{X})^{-1}\mathbf{X}^\mathrm{T}$ projects $\mathbf{y}$ to the orthogonal left null space of $\mathbf{X}$. This results in the residual vector $$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{X}(\mathbf{X}^\mathrm{T}\mathbf{X})^{-1}\mathbf{X}^\mathrm{T}) \mathbf{y}$$
By that, we know $\mathbf{e} \perp \hat{\mathbf{y}}$. Very often $||\mathbf{e}||_2^2$ is termed the sum of squared errors (SSE) in the context of OLS. We will see other sums of squares below.
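A small numerical check of these projection properties (my addition, just a sketch; the names H and e_tmp are chosen here for illustration only):
# sketch only: verify the hat matrix / projection properties numerically
XtX_inv = inv(X.T @ X)  # assumes full column rank of X
H = X @ XtX_inv @ X.T  # hat matrix
print(np.allclose(H @ H, H))  # idempotent -> H is a projection matrix
print(np.allclose(H @ y, X @ (XtX_inv @ X.T @ y)))  # H y equals X beta_hat = yhat
e_tmp = (np.eye(N) - H) @ y  # residuals via the complementary projector
print(e_tmp @ (H @ y))  # approx. 0 up to round-off: e is orthogonal to yhat
print(e_tmp @ e_tmp)  # SSE = ||e||_2^2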
Training/Fitting...TBD...this is probably a nice homework assignment
model = OLS(y, X, hasconst=hasconst)
results = model.fit() # this solves the OLS problem, we fit / train the model
# to estimate the unknown model parameters beta_hat
print(results.summary()) # get some useful information about this model
# number of observations
X.shape[0], results.nobs
# df of model parameters
# we count all beta coefficients except beta[0], which is considered as const
df_model = X.shape[1] - 1
df_model, results.df_model
# df of residuals
# with N=32 observations, we spend one df for the intercept (the mean of y)
# and three df for the beta coefficients 1,2,3; thus 28 df remain for the residuals
df_resid = X.shape[0] - X.shape[1]
df_resid, results.df_resid
# check coefficients
beta_hat = pinv(X) @ y
print(beta_hat) # cf. with 'coef' in the summary table above
np.allclose(beta_hat, results.params)
# check prediction
yhat = X @ beta_hat # predict outcomes on design matrix data
np.allclose(yhat, results.predict(X))
plt.plot(y, "C0o-", label="y")
plt.plot(yhat, "C1o:", ms=3, label="yhat")
plt.xlabel("sample index")
plt.legend()
plt.grid(True)
# check residuals
resid = y - yhat
np.allclose(resid, results.resid)
# check empirical correlation coefficient Rsquared between y and yhat
# also known as coefficient of determination
R2, R2adj = my_r2(X, beta_hat, y)
print("R-squared:", R2, "Adj. R-squared:", R2adj)
print("R-squared:", results.rsquared, "Adj. R-squared:", results.rsquared_adj)
np.allclose(R2, results.rsquared), np.allclose(R2adj, results.rsquared_adj)
# Overall-F-Test (Goodness of fit-Test)
# central F distribution is used for hypothesis test
# checks if there is a linear relationship between outcome and ANY of the
# regressors (i.e. the explanatory variables, features), i.e. H0 is
# beta_hat[1]=beta_hat[2]=beta_hat[...] = 0, thus under H0 we have the model
# beta_hat[0] = np.mean(y)
# H1: at least ONE regressor in X[:,1:] explains data well in terms of
# statistical evaluation
# see p.109 (1.38) in [FHT96], p.147 in [FKLM21]
# do regression, calc usual sum of squares
yhat = X @ beta_hat # do regression / prediction
# sum of squares Total
SST = np.sum((y - np.mean(y)) ** 2)
# sum of squares due to Regression model
# my_deviance_for_normal_pdf(X, beta_hat, np.mean(y))
SSR = np.sum((yhat - np.mean(y)) ** 2)
# sum of squared Errors
SSE = np.sum((y - yhat) ** 2) # my_deviance_for_normal_pdf(X, beta_hat, y)
# get the degrees of freedom
p = X.shape[1]
q = 1  # number of parameters of the restricted model under H0 (intercept only)
dfn, dfd = p - q, N - p  # dfn: numerator df (intercept not tested), dfd: residual df
# we should avoid this due to numerical precision issues:
Fval = R2 / (1 - R2) * dfd / dfn
# we'd better use the equivalent, numerically more stable
# and probably more intuitive:
Fval = SSR / SSE * dfd / dfn
# or even better, because F-values are signal-to-noise ratios, where the SS
# are normalized by their corresponding dfs:
Fval = (SSR / dfn) / (SSE / dfd)
probF = f.sf(Fval, dfn, dfd) # get the probability for this F value and the dfs
print("F", Fval, results.fvalue)
print("probability for F", probF, results.f_pvalue)
np.allclose(Fval, results.fvalue), np.allclose(probF, results.f_pvalue)
print("reject H0?", probF < alpha) # if True we can reject H0
# we could consider this Overall-F-Test (Goodness of fit-Test) as a very
# special case of deviance statistics, which actually are likelihood ratio tests
# cf. Ch.5.7.1 in [DB18], Ch.3.5 in [MT11]
D0 = SST  # deviance of the null model (intercept only), which predicts just mean(y)
D1 = my_deviance_for_normal_pdf(X, beta_hat, y) # regression model
Fval = ((D0 - D1) / (dfn)) / (D1 / (dfd))
probF = f.sf(Fval, dfn, dfd) # get the probability for this F value and the dfs
print("F", Fval, results.fvalue)
print("probability for F", probF, results.f_pvalue)
np.allclose(Fval, results.fvalue), np.allclose(probF, results.f_pvalue)
# log-likelihood
# cf. https://github.com/statsmodels/statsmodels/blob/main/statsmodels/regression/linear_model.py
# line 949ff
# cf. p.54 eq. (2.6) in [FHT96] and p.89 in [Agresti15] for the log-likelihood case
# here rather the profile log likelihood is given, as we don't know the true sigma
# cf. p.89 in [Agresti15] last equation on this page, where likelihood is given
# for estimated beta AND sigma
nobs = X.shape[0]
nobs2 = X.shape[0] / 2
ssr = np.sum((y - yhat) ** 2)
llf = -nobs2 * np.log(2 * np.pi) - nobs2 * np.log(ssr / nobs) - nobs2
print(llf)
np.allclose(llf, results.llf)
# Akaike information criterion
# cf. p.146 in [Agresti15], p.165 in [DB18]
# https://en.wikipedia.org/wiki/Akaike_information_criterion
AIC = -2 * (llf - X.shape[1])
print(AIC)
np.allclose(AIC, results.aic)
# Bayesian information criterion
# cf. p.165 in [DB18], p.165 in [FKLM21]
BIC = -2 * llf + X.shape[1] * np.log(X.shape[0])
print(BIC)
np.allclose(BIC, results.bic)
# standardized residuals
# cf. Ch.2.5. in [Agresti15], p.136ff in [FKLM21], Ch.6.2.6 in [DB18]
H = X @ pinv(X)  # hat matrix, i.e. the matrix that maps y to yhat
# by projecting y onto the column space of the design matrix X, resulting in yhat
e = y - yhat # residuals
p = X.shape[1]  # number of model parameters; the residual df is N - p
sg2_e = np.sum(e**2) / (N - p)
sg_e = np.sqrt(sg2_e)
std_res = e / (sg_e * np.sqrt(1 - np.diag(H))) # standardized residuals
fig, ax = plt.subplots(figsize=(4, 4))
qqplot(std_res, ax=ax, line="45")
ax.axis("equal")
ax.grid(True)
ax.set_title("standardized residuals vs. standard normal quantiles")
print(results.summary())
# test of significance for coefficient beta_hat[i]
# H0: beta_hat[i]=0 H1: beta_hat[i] not 0
e = y - yhat # residuals
p = X.shape[1]
sg2_e = np.sum(e**2) / (N - p) # unbiased estimator of residuals' variance
sg_e = np.sqrt(sg2_e) # unbiased estimator of residuals' standard deviation
# estimator for covariance matrix, cf. p.132 in [FKLM21]
Cov_beta_hat = sg2_e * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(Cov_beta_hat))
print("std_err", std_err)
tval = beta_hat / std_err
print("\nt values:", tval)
probt = t.sf(np.abs(tval), N - p) * 2
print("\nprop > |t|:", probt)
print("\nreject H0?", probt < alpha / 2) # if True we can reject H0
np.allclose(sg2_e, results.scale), np.allclose(
tval, results.tvalues
), np.allclose(probt, results.pvalues)
# TBD: CIs
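A possible sketch for this TBD item (my addition, hedged): two-sided (1 - alpha) confidence intervals for the coefficients based on the t distribution, $\hat{\beta}_i \pm t_{1-\alpha/2,\,N-p} \cdot \widehat{\mathrm{se}}(\hat{\beta}_i)$, compared against results.conf_int().
# sketch only: t-based confidence intervals for the coefficients
tcrit = t.ppf(1 - alpha / 2, N - p)  # critical value of the t distribution
CI = np.column_stack((beta_hat - tcrit * std_err, beta_hat + tcrit * std_err))
print(CI)
print(results.conf_int(alpha=alpha))  # should match the manual calculation
np.allclose(CI, results.conf_int(alpha=alpha))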