Notebook

Sascha Spors, Professorship Signal Theory and Digital Signal Processing, Institute of Communications Engineering (INT), Faculty of Computer Science and Electrical Engineering (IEF), University of Rostock, Germany

Data Driven Audio Signal Processing - A Tutorial with Computational Examples¶

Winter Semester 2023/24 (Master Course #24512)

Feel free to contact lecturer frank.schultz@uni-rostock.de

Linear Regression with Ordinary Least Squares (OLS)¶

toy example for linear regression using ordinary least squares (OLS)
we use package statsmodels
we check these results with manual calculations
very useful textbooks that might be referenced in the code below
- [FHT96] Ludwig Fahrmeir, Alfred Hamerle, Gerhard Tutz (1996): "Multivariate statistische Verfahren", 2nd ed., de Gruyter. https://doi.org/10.1515/9783110816020
- [FKLM21] Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian D. Marx (2021): "Regression", 2nd ed., Springer. https://doi.org/10.1007/978-3-662-63882-8
- [DB18] Annette J. Dobson, Adrian G. Barnett (2018): "An Introduction to Generalized Linear Models", 4th ed., CRC Press. https://doi.org/10.1201/9781315182780
- [MT11] Henrik Madsen, Poul Thyregod (2011): "Introduction to General and Generalized Linear Models", CRC Press. https://doi.org/10.1201/9781439891148
- [Agresti15] Alan Agresti (2015): "Foundations of Linear and Generalized Models", Wiley. https://www.wiley.com/en-us/Foundations+of+Linear+and+Generalized+Linear+Models-p-9781118730034

In [ ]:

import matplotlib.pyplot as plt
import numpy as np
from numpy.linalg import inv, pinv
from scipy.stats import f, t
from scipy.linalg import svdvals
from statsmodels.api import OLS
from statsmodels.graphics.gofplots import qqplot


rng = np.random.default_rng(1234)  # to reproduce results

In [ ]:

def my_r2(X, theta_hat, y):
    """(Adjusted) coefficient of determination R^2.

    also known as empirical correlation coefficient between y and yhat
    """
    N = X.shape[0]  # number of samples / observations
    p = X.shape[1]  # number of model parameters (including intercept beta[0])

    yhat = X @ theta_hat  # do regression / prediction

    # sum of squares T.otal:
    SST = np.sum((y - np.mean(y)) ** 2)
    # sum of squares due to R.egression:
    SSR = np.sum((yhat - np.mean(y)) ** 2)
    # sum of squared E.rrors:
    SSE = np.sum((y - yhat) ** 2)
    # SST = SSR + SSE holds (numerical errors might corrupt this equality)

    # R2 = SSR / SST is the ratio between regression and total, 0<=R2<=1
    # rearranged R2 = (SST - SSE) / SST = 1**2 - SSE / SST
    # p.58 in [MT11], p.111 in [DB18], p.108 (1.34)-(1.36) in [FHT96],
    # p.125 in [FKLM21], Ch.2.4.6 in [Agresti15]
    R2 = 1**2 - SSE / SST

    # R2 should be adjusted by number of samples and model complexity
    # note that this equation holds for models that include an intercept term
    # p.58 in [MT11], p.163 in [FKLM21], Ch.2.4.6 in [Agresti15]
    # R2adj = 1**2 - (1-R2)*(n-1)/(n-d)
    # or rewritten to see the adjustments in a more convenient way
    R2adj = 1**2 - (SSE / (N - p)) / (SST / (N - 1))
    return (R2, R2adj)


def my_deviance_for_normal_pdf(X, beta_hat, y):
    """Scaled deviance for normally distributed data and errors.

    - is basically the sum of squared errors (SSE)
    - cf. p.89ff in [DB18], p.133 in [Agresti15], [MT11]
    - note that deviances are different for other distributions, but
      the same statistical concepts using deviances hold
    - so instead of thinking this as SSE we should get used to deviances
    """
    D = np.sum((y - X @ beta_hat) ** 2)  # SSE
    # SSE can also be calculated with the dot product:
    # tmp = y - X @ beta_hat
    # D = np.squeeze(tmp[:, None].T @ tmp[:, None])
    return D


alpha = 0.05  # standard alpha error for H0 rejection

Create some data of the real world and add measurement noise¶

create design matrix X with constant intercept term, x, x^2, x^3. This yield the
- independent variables, aka
- exog
- right hand side
- regressors
- design
- explanatory variable
- features
define some true beta coefficients for the real world phenomenon
create outcome y as linear combination of columns in X with added noise. This yield the
- dependent variables, aka
- endog
- left hand side
- regressand
- outcome
- response variable

$$\mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{noise}$$

note that vector $\mathbf{noise}$ might partly live in the column space of $\mathbf{X}$ and if so it will influence the estimated solution $\hat{\mathbf{\beta}}$

In [ ]:

N = 2**5  # number of observations / samples

noise = rng.normal(loc=0, scale=1.5, size=N)
print("noise: mean=", np.mean(noise), ", std=", np.std(noise, ddof=1))

x = np.linspace(1e-16, 200, N)
X = np.column_stack((x, 500 * np.abs(np.sin(x)), np.abs(np.log(x))))
print(X.shape)
beta = np.array([1e0, 1e-1, 1e-2, 1e-3])  # some nice numbers for true beta
# add bias column to design matrix
X = np.hstack((np.ones((X.shape[0], 1)), X))
hasconst = True


debug_flag = False
if debug_flag:  # works well for N=2**4
    X = np.column_stack((x**1, x**3, x**5))
    print(X.shape)
    beta = np.array([0, 1e0, 0, 1e-2])
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    noise = rng.normal(loc=0, scale=0.5, size=N)


# generate 'real world' data with design matrix, add noise
y = np.dot(X, beta) + noise
print(X.shape)
print(y.shape)


if debug_flag:
    print("debug_flag", debug_flag)
    # we might want to get another design matrix to debug F/p and t/p in detail
    X = np.column_stack((x**0.125, x**0.25, x**0.75))
    X = np.hstack((np.ones((X.shape[0], 1)), X))

print(svdvals(X))

Choose a model type, define the model, train the model with above data¶

Choosing the model type here is a bit pointless, as we deal here intentionally with OLS. However, we could us ask whether weighted least squares, or LS with L1, L2 regularization types or other GLMs do a good job here.

We assume that the outcome originates from above data synthesis $$\mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{noise}$$ We could set up a design matrix $\mathbf{X}_f$ for the linear model $$\hat{\mathbf{y}} = \mathbf{X}_f \hat{\mathbf{\beta}}$$ such that $$||\mathbf{y} - \hat{\mathbf{y}}||_2^2$$ becomes minimized by choosing / calculating the 'right' coefficients $\hat{\mathbf{\beta}}$. Note, the distinction between $\mathbf{X}$ (the real world features) and $\mathbf{X}_f$ (the features that we think will explain the world and we have chosen carefully in advance). Below, we will use $\mathbf{X} = \mathbf{X}_f$ as starting point.

There are two major concepts of solving the problem $$\mathrm{min}_{\mathrm{wrt}\, \beta}||\mathbf{y} - \hat{\mathbf{y}}||_2^2$$

one with assuming that data, estimators, residuals and so on are normally distributed
one with not assuming this and hoping that our results provide good results

The first concepts solves the problem by maximum likelihood estimation (MLE), for the second we can trust linear algebra and solve an over-determined set of linear equations, precisely known as OLS. It turns out, that MLE yields same results as OLS under normal distribution assumptions. This means, that if our data approximately fulfills normal distribution assumptions and we derive the solution via OLS, we are very close to the MLE. We should read in the above mentioned text books!

The solution (we assume full column rank) is given as $$\hat{\mathbf{\beta}} = (\mathbf{X}^\mathrm{T}\mathbf{X})^{-1}\mathbf{X}^\mathrm{T} \mathbf{y}$$

The matrix $(\mathbf{X}^\mathrm{T}\mathbf{X})^{-1}\mathbf{X}^\mathrm{T}$ is the left inverse of $\mathbf{X}$, i.e. left multiplied $\mathbf{X}$ yields $$(\mathbf{X}^\mathrm{T}\mathbf{X})^{-1}\mathbf{X}^\mathrm{T} \mathbf{X} = \mathbf{I}$$

For the predictions $$\hat{\mathbf{y}} = \mathbf{X}\cdot\hat{\mathbf{\beta}} = \mathbf{X}\cdot(\mathbf{X}^\mathrm{T}\mathbf{X})^{-1}\mathbf{X}^\mathrm{T} \mathbf{y} = \hat{\mathbf{y}}$$ we find that the (hat)-matrix $\mathbf{X}(\mathbf{X}^\mathrm{T}\mathbf{X})^{-1}\mathbf{X}^\mathrm{T}$ is a projection matrix which projects $\mathbf{y}$ to $\hat{\mathbf{y}}$, i.e. to the column space of $\mathbf{X}$.

The projection matrix $\mathbf{I} - \mathbf{X}(\mathbf{X}^\mathrm{T}\mathbf{X})^{-1}\mathbf{X}^\mathrm{T}$ projects $\mathbf{y}$ to the orthogonal left null space of $\mathbf{X}$. This results in the residual vector $$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{X}(\mathbf{X}^\mathrm{T}\mathbf{X})^{-1}\mathbf{X}^\mathrm{T}) \mathbf{y}$$

By that, we know $\mathbf{e} \perp \hat{\mathbf{y}}$. Very often $||\mathbf{e}||_2^2$ is termed sum of squared errors (SSE) in the context of OLS. We see other sum of squares below.

Training/Fitting...TBD...this is probably a nice homework assignment

In [ ]:

model = OLS(y, X, hasconst=hasconst)
results = model.fit()  # this solves the OLS problem, we fit / train the model
# to estimate the unknown model parameters beta_hat
print(results.summary())  # get some useful information about this model

Comparison of manual calculation vs. statsmodels results¶

In [ ]:

# number of observations
X.shape[0], results.nobs

In [ ]:

# df of model parameters
# we count all beta coefficients except beta[0], which is considered as const
df_model = X.shape[1] - 1
df_model, results.df_model

In [ ]:

# df of residuals
# when we have 16 observations, we spend one df to calc the mean of y
# and three df for the beta coeff 1,2,3. Thus, df of 12 is remaining
df_resid = X.shape[0] - X.shape[1]
df_resid, results.df_resid

In [ ]:

# check coefficients
beta_hat = pinv(X) @ y
print(beta_hat)  # cf. with 'coef' in the summary table above
np.allclose(beta_hat, results.params)

In [ ]:

# check prediction
yhat = X @ beta_hat  # predict outcomes on design matrix data
np.allclose(yhat, results.predict(X))

In [ ]:

plt.plot(y, "C0o-", label="y")
plt.plot(yhat, "C1o:", ms=3, label="yhat")
plt.xlabel("sample index")
plt.legend()
plt.grid(True)

In [ ]:

# check residuals
resid = y - yhat
np.allclose(resid, results.resid)

In [ ]:

# check empirical correlation coefficient Rsquared between y and yhat
# also known as coefficient of determination
R2, R2adj = my_r2(X, beta_hat, y)
print("R-squared:", R2, "Adj. R-squared:", R2adj)
print("R-squared:", results.rsquared, "Adj. R-squared:", results.rsquared_adj)
np.allclose(R2, results.rsquared), np.allclose(R2adj, results.rsquared_adj)

In [ ]:

# Overall-F-Test (Goodness of fit-Test)
# central F distribution is used for hypothesis test
# checks if there is a linear relationship between outcome and ANY of the
# regressors (i.e. the explanatory variables, features), i.e. H0 is
# beta_hat[1]=beta_hat[2]=beta_hat[...] = 0, thus under H0 we have the model
# beta_hat[0] = np.mean(y)
# H1: at least ONE regressor in X[:,1:] explains data well in terms of
# statistical evaluation
# see p.109 (1.38) in [FHT96], p.147 in [FKLM21]

# do regression, calc usual sum of squares
yhat = X @ beta_hat  # do regression / prediction
# sum of squares Total
SST = np.sum((y - np.mean(y)) ** 2)
# sum of squares due to Regression model
# my_deviance_for_normal_pdf(X, beta_hat, np.mean(y))
SSR = np.sum((yhat - np.mean(y)) ** 2)
# sum of squared Errors
SSE = np.sum((y - yhat) ** 2)  # my_deviance_for_normal_pdf(X, beta_hat, y)

# get the degrees of freedom
p = X.shape[1]
q = 1
dfn, dfd = p - q, N - p  # dfn->1 for intercept, dfd for residual

# we should avoid this due to numerical precision issues:
Fval = R2 / (1 - R2) * dfd / dfn
# we better use the equivalent, numerical more stable
# and probably more intuitive:
Fval = SSR / SSE * dfd / dfn
# or even better, because F-values are signal-to-noise ratios, where the SS
# are normalized by their corresponding dfs:
Fval = (SSR / dfn) / (SSE / dfd)

probF = f.sf(Fval, dfn, dfd)  # get the probability for this F value and the dfs
print("F", Fval, results.fvalue)
print("probability for F", probF, results.f_pvalue)
np.allclose(Fval, results.fvalue), np.allclose(probF, results.f_pvalue)
print("reject H0?", probF < alpha)  # if True we can reject H0

In [ ]:

# we could consider this Overall-F-Test (Goodness of fit-Test) as a very
# special case of deviance statistics, which actually are likelihood ratio tests
# cf. Ch.5.7.1 in [DB18], Ch.3.5 in [MT11]
D0 = SST  # perfect model, overfitted as it 'models' exactly the outcome data
D1 = my_deviance_for_normal_pdf(X, beta_hat, y)  # regression model
Fval = ((D0 - D1) / (dfn)) / (D1 / (dfd))

probF = f.sf(Fval, dfn, dfd)  # get the probability for this F value and the dfs
print("F", Fval, results.fvalue)
print("probability for F", probF, results.f_pvalue)
np.allclose(Fval, results.fvalue), np.allclose(probF, results.f_pvalue)

In [ ]:

# log-likelihood
# cf. https://github.com/statsmodels/statsmodels/blob/main/statsmodels/regression/linear_model.py
# line 949ff
# cf. p.54 eq. (2.6) in [FHT96] and p.89 in [Agresti15] for the log-likelihood case

# here rather the profile log likelihood is given, as we don't know the true sigma
# cf. p.89 in [Agresti15] last equation on this page, where likelihood is given
# for estimated beta AND sigma
nobs = X.shape[0]
nobs2 = X.shape[0] / 2
ssr = np.sum((y - yhat) ** 2)
llf = -nobs2 * np.log(2 * np.pi) - nobs2 * np.log(ssr / nobs) - nobs2
print(llf)
np.allclose(llf, results.llf)

In [ ]:

# Akaike information criterion
# cf. p.146 in [Agresti15], p.165 in [DB18]
# https://en.wikipedia.org/wiki/Akaike_information_criterion
AIC = -2 * (llf - X.shape[1])
print(AIC)
np.allclose(AIC, results.aic)

In [ ]:

# Bayesian information criterion
# cf. p.165 in [DB18], p.165 in [FKLM21]
BIC = -2 * llf + X.shape[1] * np.log(X.shape[0])
print(BIC)
np.allclose(BIC, results.bic)

In [ ]:

# standardized residuals
# cf. Ch.2.5. in [Agresti15], p.136ff in [FKLM21], Ch.6.2.6 in [DB18]
H = X @ pinv(X)  # hat matrix, this is the matrix that maps y to yhat
# i.e. projecting the y into the column space of design matrix X resulting in yhat

e = y - yhat  # residuals
p = X.shape[1]  # df for residual vector
sg2_e = np.sum(e**2) / (N - p)
sg_e = np.sqrt(sg2_e)

std_res = e / (sg_e * np.sqrt(1 - np.diag(H)))  # standardized residuals
fig, ax = plt.subplots(figsize=(4, 4))
qqplot(std_res, ax=ax, line="45")
ax.axis("equal")
ax.grid(True)
ax.set_title("standardized residuals vs. standard normal PDF")

In [ ]:

print(results.summary())

In [ ]:

# test of significance for coefficient beta_hat[i]
# H0: beta_hat[i]=0   H1: beta_hat[i] not 0
e = y - yhat  # residuals
p = X.shape[1]
sg2_e = np.sum(e**2) / (N - p)  # unbiased estimator of residuals' variance
sg_e = np.sqrt(sg2_e)  # unbiased estimator of residuals' standard deviation

# estimator for covariance matrix, cf. p.132 in [FKLM21]
Cov_beta_hat = sg2_e * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(Cov_beta_hat))
print("std_err", std_err)
tval = beta_hat / std_err
print("\nt values:", tval)
probt = t.sf(np.abs(tval), N - p) * 2
print("\nprop > |t|:", probt)
print("\nreject H0?", probt < alpha / 2)  # if True we can reject H0
np.allclose(sg2_e, results.scale), np.allclose(
    tval, results.tvalues
), np.allclose(probt, results.pvalues)

In [ ]:

# TBD: CIs

Copyright¶

the notebooks are provided as Open Educational Resources
the text is licensed under Creative Commons Attribution 4.0
the code of the IPython examples is licensed under the MIT license
feel free to use the notebooks for your own purposes
please attribute the work as follows: Frank Schultz, Data Driven Audio Signal Processing - A Tutorial Featuring Computational Examples, University of Rostock ideally with relevant file(s), github URL https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise, commit number and/or version tag, year.