Sascha Spors, Professorship Signal Theory and Digital Signal Processing, Institute of Communications Engineering (INT), Faculty of Computer Science and Electrical Engineering (IEF), University of Rostock, Germany
Winter Semester 2023/24 (Master Course #24512)
Feel free to contact the lecturer: frank.schultz@uni-rostock.de
In this notebook we check over-/underfitting via bias$^2$/variance quantities and $R_{\text{adjusted}}^2$ for linear (and unregularized) models that are trained and evaluated on noisy data (note: for convenience, training data = test data here).
For this toy example we know the true (noise-free) data, because we know the linear model equation that creates it. Hence, we can be quite confident in our interpretation of the performance of the different models. In practice we typically deal with an unknown model equation and then need to properly check our model estimates for over-/underfitting.
A robust prediction model should have a reasonable trade-off between bias$^2$ and variance as well as a reasonably high $R_{\text{adjusted}}^2$ mean but a low $R_{\text{adjusted}}^2$ variance (see this notebook).
Furthermore, a robust prediction model should predict reasonable outcomes for unknown input data, i.e. it should generalize well to unseen data. This is discussed in the notebook bias_variance_ridge_regression.ipynb.
Useful chapters in textbooks on this fundamental aspect:
For a (mapping) function $f(x)$ and a random process (assumed stationary) for drawing values $x$, described by the probability density function (PDF) $p(x)$, the expectation of $f(x)$ is defined as
$$\mathbb{E}_x\{f(x)\} = \int f(x) p(x) \mathrm{d}x.$$
The so-called 1st order raw moment (also known as linear mean) follows from this basic expectation formula by setting the mapping function $f(x) = x^1$. The 1st raw moment is thus defined as
$$\mu_x = \mathbb{E}_x\{x^1\} = \int x^1 p(x) \mathrm{d}x.$$
The so-called 2nd order raw moment (also known as quadratic mean) follows by setting the mapping function $f(x) = x^2$. The 2nd raw moment is thus defined as
$$\mathbb{E}_x\{x^2\} = \int x^2 p(x) \mathrm{d}x.$$
The so-called 2nd order centralized moment plays a fundamental role in statistical signal processing. Its mapping function is $f(x) = (x^1 - \mathbb{E}_x\{x\})^2$ and hence
$$\mathbb{E}_x\{(x^1 - \mathbb{E}_x\{x\})^2\} = \int (x^1 - \mathbb{E}_x\{x\})^2 p(x) \mathrm{d}x.$$
With the linear mean $\mu_x = \mathbb{E}_x\{x\}$ from above we can rewrite this as
$$\sigma_x^2 = \mathbb{E}_x\{(x - \mu_x)^2\} = \int (x - \mu_x)^2 p(x) \mathrm{d}x.$$
The 2nd centralized moment $\sigma_x^2$ is known as the variance, its square root $\sqrt{\sigma_x^2} = \sigma_x$ as the standard deviation.
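As a quick numerical sanity check, here is a minimal sketch (assuming a Gaussian PDF with $\mu_x = 3$ and $\sigma_x = 0.5$ purely as an example) that approximates these integrals on a fine grid; note that the PDF is known here, whereas the sample-based case without a known PDF follows below:
import numpy as np

# assumed example: Gaussian PDF with mu_x = 3 and sigma_x = 0.5
mu, sigma = 3.0, 0.5
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 100001)
dx = x[1] - x[0]
p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

raw1 = np.sum(x * p) * dx                    # 1st raw moment -> mu_x
raw2 = np.sum(x**2 * p) * dx                 # 2nd raw moment -> mu_x^2 + sigma_x^2
central2 = np.sum((x - raw1) ** 2 * p) * dx  # 2nd centralized moment -> sigma_x^2
print(raw1, raw2, central2)                  # approx. 3.0, 9.25, 0.25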
The above formulations hold for continuous $x$, i.e. continuously given arguments (data). Most often, however, we deal with sampled data and then need to approximate the above moments. This is a theory in its own right, see our lecture / tutorial on digital signal processing.
In short, we approximate the integral by a sum, and the integral kernel by the signal values themselves together with an averaging factor (to compensate for the missing information from the PDF used above).
Then the basic expectation equation is $$\mathbb{E}\{f(x)\} = \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} f(x_m) \approx \frac{1}{M} \sum_{m=1}^{M} f(x_m),$$ denoting $x_m$ as the $m$-th sample of a data set with a finite number $M$ of samples.
The raw and centralized moments can now be approximated by the very same mappings as above.
The so-called 1st order raw moment (also known as linear mean) can be approximated from the basic expectation formula by setting the mapping function $f(x) = x^1$. The 1st raw moment is thus defined as $$\bar{x} = \mathbb{E}\{x\} = \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} x_m$$
The so-called 2nd order raw moment (also known as quadratic mean) can be approximated from the basic expectation formula by setting the mapping function $f(x) = x^2$. The 2nd raw moment is thus defined as $$\mathbb{E}\{x^2\} = \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} x^2_m$$
The so-called 2nd order centralized moment (also known as variance) can be approximated from the basic expectation formula by setting the mapping function $f(x) = (x - \bar{x})^2$. The 2nd centralized moment is thus defined as $$\mathbb{E}\{(x-\bar{x})^2\} = \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} (x_m - \bar{x})^2$$
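A minimal numerical sketch (again assuming an example normal distribution) that compares these sample-based moments with numpy's built-in estimators:
import numpy as np

rng = np.random.default_rng(12345)
M = 2**16                                   # large, but finite number of samples
x = rng.normal(loc=3.0, scale=0.5, size=M)  # assumed example data

raw1 = np.sum(x) / M                        # 1st raw moment (linear mean)
raw2 = np.sum(x**2) / M                     # 2nd raw moment (quadratic mean)
central2 = np.sum((x - raw1) ** 2) / M      # 2nd centralized moment (variance)

print(raw1, np.mean(x))                     # both estimate mu_x = 3
print(central2, np.var(x))                  # both estimate sigma_x^2 = 0.25
print(raw2 - raw1**2)                       # also approx. sigma_x^2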
We utilize our well-known least-squares error problem
$$\min_{\text{wrt }\beta} ||\mathbf{y} - \mathbf{X} \beta||_2^2$$
with the solution
$$\hat{\beta} = (\mathbf{X}^\mathrm{H} \mathbf{X})^{-1} \mathbf{X}^\mathrm{H} \mathbf{y},$$
as always assuming an $M \times N$ matrix $\mathbf{X}$ with full column rank $r=N$ and $M \gg N$ (i.e. more data samples than features).
In our toy example we set up the feature matrix from a given data column vector $\mathbf{x}$ and $N-1$ different (non-)linear mapping functions $f_i(\mathbf{x})$, such that the feature matrix becomes
$$ \mathbf{X} = \begin{bmatrix} 1 & f_1(\mathbf{x}_1) & f_2(\mathbf{x}_1) & \dots & f_{N-1}(\mathbf{x}_1)\\ 1 & f_1(\mathbf{x}_2) & f_2(\mathbf{x}_2) & \dots & f_{N-1}(\mathbf{x}_2)\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & f_1(\mathbf{x}_M) & f_2(\mathbf{x}_M) & \dots & f_{N-1}(\mathbf{x}_M)\\ \end{bmatrix} $$
with the first column filled with ones to realize the intercept of the linear model.
Note that the feature design $f(\mathbf{x})$ might be non-linear, but the model itself is still linear, because the outcome is the linear combination $\hat{\mathbf{y}} = \mathbf{X} \hat{\beta}$.
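A minimal sketch of such a feature matrix and the corresponding least-squares fit (the mapping functions, weights and data here are arbitrary examples, not yet the toy model below):
import numpy as np

M = 100
x = np.linspace(-1, 1, M)
# first column of ones (intercept), then example non-linear mappings f_1, f_2, f_3 of x
X = np.column_stack((np.ones(M), x, x**2, np.exp(x)))
beta_true = np.array([1.0, 2.0, -0.5, 0.25])  # assumed example weights
y = X @ beta_true                             # data generated by the linear model

beta_hat = np.linalg.pinv(X) @ y  # equals (X^H X)^{-1} X^H y for full column rank X
yhat = X @ beta_hat               # prediction is a linear combination of the features
print(np.allclose(beta_hat, beta_true), np.allclose(yhat, y))  # True True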
The true data is created with a column vector $\mathbf{x}$ containing $0$ to $2\pi$ equidistantly sampled with $M=2^8$ samples. The corresponding true feature matrix $\mathbf{X}_t$ is built with the features $\cos(\mathbf{x}), \sin(2\mathbf{x}), \cos(5\mathbf{x}), \cos(6\mathbf{x})$ plus the intercept column (cf. the code below) and has rank 5. The true model weights are $\beta_t = [3,2,1, \frac{1}{2}, \frac{1}{4}]^\mathrm{T}$. The true outcome vector (model output) is then $\mathbf{t} = \mathbf{X}_t \beta_t$, which serves as reference for the optimum prediction (cf. $h(\mathbf{x})$ in [Bishop 2006, Sec. 3.2]) in the discussion of the bias/variance trade-off.
We should realize that we do a Fourier series synthesis here.
Measured data will be obtained by
$$\mathbf{y} = \mathbf{t} + \mathbf{n},$$
where the column vector $\mathbf{n}$ (the noise) contains a random signal drawn from a normal distribution with zero mean and variance noise_scale**2.
Thus $\mathbf{y}$ potentially also has components in the left null space of $\mathbf{X}_t$, whereas we know by design that $\mathbf{t}$ lives purely in the column space of $\mathbf{X}_t$.
We create $L$ different $\mathbf{y}$ vectors, as we sample $L$ different $\mathbf{n}$ vectors.
We denote the $l$-th measurement data vector as $\mathbf{y}^{(l)}$ and the $m$-th entry in $\mathbf{y}^{(l)}$ as $\mathbf{y}_m^{(l)}$.
The indices $l=1,\dots,L$ and $m=1,\dots,M$ are used later on.
We specify a certain model with a dedicated feature matrix $\mathbf{X}$. We always use the column vector $\mathbf{x}$ introduced above, but different features and thereby different model complexity.
The $l$-th measurement leads to the $l$-th estimate of the model parameters
$$\hat{\beta}^{(l)} = (\mathbf{X}^\mathrm{H} \mathbf{X})^{-1} \mathbf{X}^\mathrm{H} \mathbf{y}^{(l)}$$
and thus the $l$-th prediction is given as
$$\hat{\mathbf{y}}^{(l)} = \mathbf{X} \hat{\beta}^{(l)} = \mathbf{X} (\mathbf{X}^\mathrm{H} \mathbf{X})^{-1} \mathbf{X}^\mathrm{H} \mathbf{y}^{(l)}.$$
We consistently denote the $m$-th entry in the predicted model output $\hat{\mathbf{y}}^{(l)}$ as $\hat{\mathbf{y}}_m^{(l)}$.
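In array form, the $L$ measurements can be stored as the columns of an $M \times L$ matrix and all $L$ estimates and predictions obtained at once with the left inverse. A minimal, self-contained sketch (with an assumed small example feature matrix, mirroring what the OLS fits further below do column by column):
import numpy as np

rng = np.random.default_rng(12345)
M, L = 2**8, 2**7
x = np.linspace(0, 2 * np.pi, M)
X = np.column_stack((np.ones(M), np.cos(x), np.sin(2 * x)))  # assumed example features
t = X @ np.array([3.0, 2.0, 1.0])                            # example true data t

# L noisy measurements y^(l) = t + n^(l) as columns: Yn[m, l] holds entry y_(m+1)^((l+1))
Yn = t[:, None] + rng.normal(loc=0, scale=1 / 3, size=(M, L))

Beta_hat = np.linalg.pinv(X) @ Yn  # column l holds the l-th estimate beta_hat^(l)
Yhat = X @ Beta_hat                # column l holds the l-th prediction yhat^(l)
print(Beta_hat.shape, Yhat.shape)  # (3, 128) (256, 128)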
According to [Bishop 2006, Sec. 3.2] (it is worth really thinking about these equations) we can calculate bias$^2$ and variance, assuming here that the optimum prediction is precisely our true data (which we fortunately know, as we designed it as $\mathbf{t}$).
Firstly, a 1st order raw moment is calculated as [Bishop 2006, (3.45)]
$$\tilde{\mathbf{y}} = \frac{1}{L} \sum_{l=1}^{L} \hat{\mathbf{y}}^{(l)},$$
so we get an $M \times 1$ column vector $\tilde{\mathbf{y}}$ (note the tilde above it), where the $m$-th entry is the average of all $m$-th entries of the $L$ model predictions. Doing that, and hoping that the prediction noise averages out, we get a fair idea of what the true data vector could be.
We could set up a 2nd order moment to quantify the error between $\tilde{\mathbf{y}}$ and the true data $\mathbf{t}$ in a squared sense. This time we sum over the $M$ samples contained in both vectors, so it is a sum-of-squared-deviations operation, which occurs very often in machine learning, cf. typical loss functions.
We then get the bias$^2$ [Bishop 2006, (3.46)]
$$\mathrm{bias}^2 = \frac{1}{M} \sum_{m=1}^{M} (\mathbf{t}_m - \tilde{\mathbf{y}}_m)^2$$
The variance is a little more complicated, cf. [Bishop 2006, (3.47)]. First we calculate a 2nd order centralized moment over the $L$ models, resulting in an $M \times 1$ column vector
$$ \mathbf{v} = \frac{1}{L} \sum_{l=1}^{L} (\hat{\mathbf{y}}^{(l)} - \tilde{\mathbf{y}})^2$$
and then a 1st order raw moment over the $M$ samples
$$\mathrm{variance} = \frac{1}{M} \sum_{m=1}^{M} \mathbf{v}_m$$
Let us also define a 1st raw moment of the measured data along its samples (note the bar above ${y}$; do not confuse it with the tilde vector from above!). We get the scalar
$$\bar{{y}} = \frac{1}{M} \sum_{m=1}^{M} \mathbf{y}_m$$
We can further define 2nd order moments, here rather called sums of squares:
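Explicitly, and matching the implementation in the my_r2adj() function below, these sums of squares are
$$\mathrm{SST} = \sum_{m=1}^{M} (\mathbf{y}_m - \bar{y})^2, \qquad \mathrm{SSR} = \sum_{m=1}^{M} (\hat{\mathbf{y}}_m - \bar{y})^2, \qquad \mathrm{SSE} = \sum_{m=1}^{M} (\mathbf{y}_m - \hat{\mathbf{y}}_m)^2,$$
i.e. the total sum of squares (SST), the sum of squares due to regression (SSR) and the sum of squared errors / residuals (SSE).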
SSE is actually the cost function of the optimization problem, because it can be rewritten as $||\mathbf{y} - \mathbf{X} \hat{\beta}||_2^2 = (\mathbf{y} - \mathbf{X} \hat{\beta})^\mathrm{T} (\mathbf{y} - \mathbf{X} \hat{\beta})$, which was used above.
Important note: in the literature, different abbreviations and phrases are used for these three quantities, so we should check them carefully. For example, the sum of squares of residuals is sometimes abbreviated SSR, while we use SSR for the sum of squares due to regression; these are totally different things! This can be confusing at the beginning, but the better we understand what these equations actually do, the less important the actual wording becomes.
It is known that
$$\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}$$We might be interested in the ratio
$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{\mathrm{SST}-\mathrm{SSE}}{\mathrm{SST}} = 1^2 - \frac{\mathrm{SSE}}{\mathrm{SST}}$$which is typically called empirical correlation coefficient or coefficient of determination and for which
$$0 \leq R^2 \leq 1$$holds. For SSE = 0, the measured data and the predicted data are identical, so $R^2=1$.
Typically, $R^2$ depends on the number of data samples $M$ and can essentially always be increased by adding more features $N$, i.e. by making the model more complex. But chasing a high $R^2$ this way is a rather trivial and naive approach to training models and easily brings us into the region of overfitting.
So, we rather need a robust measure which is somehow independent of $M$ and $N$, to draw reasonable conclusions about the model's performance. A well-known approach is to compensate for the so-called degrees of freedom of the model, which brings up the so-called adjusted $R^2$
$$R_\mathrm{adj}^2 = 1^2 - \frac{\frac{\mathrm{SSE}}{M-N}}{\frac{\mathrm{SST}}{M-1}}$$for models that use an intercept parameter $\hat{\beta}_0$ (which we do here).
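For illustration with assumed example numbers: for $M=256$ samples, $N=3$ model parameters and $\mathrm{SSE} = 0.1\,\mathrm{SST}$, the plain coefficient is $R^2 = 1 - 0.1 = 0.9$, whereas $R_\mathrm{adj}^2 = 1 - \frac{0.1\,\mathrm{SST}/(256-3)}{\mathrm{SST}/(256-1)} = 1 - 0.1 \cdot \frac{255}{253} \approx 0.899$; the penalty grows as $N$ approaches $M$.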
Let us get into some coding business for this toy example...
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.api import OLS
def my_r2adj(X, beta_hat, y):
    """(Adjusted) coefficient of determination R^2.
    also known as empirical correlation coefficient between y and yhat
    """
    M = X.shape[0]  # number of samples / observations
    N = X.shape[1]  # number of model parameters (including intercept beta[0])
    yhat = X @ beta_hat  # do regression / prediction
    # sum of squares T.otal:
    SST = np.sum((y - np.mean(y)) ** 2)
    # sum of squares due to R.egression:
    SSR = np.sum((yhat - np.mean(y)) ** 2)
    # sum of squared E.rrors:
    SSE = np.sum((y - yhat) ** 2)
    # SST = SSR + SSE holds (numerical errors might corrupt this equality)
    # R2 = SSR / SST is the ratio between regression and total, 0<=R2<=1
    # rearranged R2 = (SST - SSE) / SST = 1**2 - SSE / SST
    # p.58 in [MT11], p.111 in [DB18], p.108 (1.34)-(1.36) in [FHT96],
    # p.125 in [FKLM21], Ch.2.4.6 in [Agresti15]
    R2 = 1**2 - SSE / SST
    # R2 should be adjusted by number of samples and model complexity
    # note that this equation holds for models that include an intercept term
    # p.58 in [MT11], p.163 in [FKLM21], Ch.2.4.6 in [Agresti15]
    # R2adj = 1**2 - (1-R2)*(n-1)/(n-d)
    # or rewritten to see the adjustments in a more convenient way
    R2adj = 1**2 - (SSE / (M - N)) / (SST / (M - 1))
    return R2adj
# number of observations / samples:
M = 2**8
# true model with x as input variable
x = np.linspace(0, 2 * np.pi, M)
# to create 4 features in design/feature matrix X
X = np.column_stack((np.cos(x), np.sin(2 * x), np.cos(5 * x), np.cos(6 * x)))
# add a bias/intercept column to the design/feature matrix:
X = np.hstack((np.ones((X.shape[0], 1)), X))
hasconst = True
# some nice numbers for the true model parameters beta:
beta = np.array([3, 2, 1, 1 / 2, 1 / 4])
# generate 'true' data with the design matrix of 'true' model
y = X @ beta
plt.figure(figsize=(6, 3))
plt.plot(y, "k-")
plt.xlabel("independent features' input variable x")
plt.ylabel("dependent variable y, true data")
plt.title("true model data as linear model (x -> 4 features + intercept)")
plt.xlim(0, M)
plt.ylim(-2, 8)
plt.grid(True)
print(X.shape, y.shape)
plt.tight_layout()
plt.savefig('bias_variance_plots/true_data.png', dpi=300)
def bias_variance_of_model(
    X, noise_scale=1 / 3
):  # noise_scale=2 is nice as well
    # note: uses the globals y (true data), M and axs (figure axes) defined outside
    # add bias column to the design matrix
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    hasconst = True
    print(
        "\nshape of model/feature matrix X:",
        X.shape,
        "\nrank of matrix X / # of model parameters:",
        np.linalg.matrix_rank(X),
    )
    # init random number generator to reproduce results
    rng = np.random.default_rng(12345)
    # generate L data sets with added noise
    L = 2**7
    noise = rng.normal(loc=0, scale=noise_scale, size=(M, L))
    Yn = y[:, None] + noise
    # alloc memory for all predictions
    Yhat = np.zeros((M, L))
    rsquared_adj = np.zeros(L)
    # train and predict L models on these L data sets
    for i in range(L):
        model = OLS(Yn[:, i], X, hasconst=hasconst)  # OLS model
        results = model.fit()  # train the model
        Yhat[:, i] = results.predict(X)  # predict outcome
        rsquared_adj[i] = my_r2adj(X, results.params, Yn[:, i])
        # print(np.allclose(results.rsquared_adj, rsquared_adj[i]))
    # get average prediction, i.e. mean over the L models
    # which is a numerical eval of the expectation:
    ym = np.mean(Yhat, axis=1)  # (3.45) in [Bishop 2006]
    # get integrated squared bias (numerical eval of the expectation):
    # note: y is the true model data
    bias_squared = np.mean((ym - y) ** 2)  # (3.42), (3.46) in [Bishop 2006]
    # get integrated variance (numerical eval of the expectation):
    variance = np.mean(
        np.mean((Yhat - ym[:, None]) ** 2, axis=1), axis=0
    )  # (3.43), (3.47) in [Bishop 2006]
    for i in range(L):
        axs[0, 0].plot(Yn[:, i])
        axs[0, 1].plot(Yhat[:, i])
    axs[0, 1].plot(y, "k-", label="true model")
    axs[0, 1].plot(
        np.mean(Yhat, axis=1), ":", color="gold", label=r"$\mu(\hat{Y})$"
    )
    axs[0, 1].plot(
        np.mean(Yhat, axis=1) + np.std(Yhat, axis=1),
        "--",
        lw=0.75,
        color="gold",
        label=r"$\mu(\hat{Y}) \pm \sigma(\hat{Y})$",
    )
    axs[0, 1].plot(
        np.mean(Yhat, axis=1) - np.std(Yhat, axis=1),
        "-.",
        lw=0.75,
        color="gold",
    )
    axs[0, 1].set_title(
        r"bias$^2$="
        + "{:4.3f}".format(bias_squared)
        + ", var="
        + "{:4.3f}".format(variance)
        + r", bias$^2$+var="
        + "{:4.3f}".format(bias_squared + variance)
    )
    for i in range(2):
        axs[0, i].set_xlim(0, M)
        axs[0, i].set_ylim(-2, 8)
        axs[0, i].grid(True)
        axs[0, i].set_xlabel("independent features' input variable x")
    axs[0, 0].set_ylabel("dependent variable yn")
    axs[0, 1].set_ylabel("predicted variable yhat")
    axs[0, 1].legend()
    axs[1, 0].plot(rsquared_adj)
    axs[1, 0].set_title(
        r"$\hat{\mu}(R_{adj}^2)="
        + "{:4.3f}$".format(np.mean(rsquared_adj))
        + r", $\hat{\sigma}(R_{adj}^2)="
        + "{:4.3f}$".format(np.std(rsquared_adj))
    )
    axs[1, 0].set_xlim(0, L)
    axs[1, 0].set_ylim(0, 1)
    axs[1, 0].set_xlabel("model index")
    axs[1, 0].set_ylabel(r"$R_{adj}^2$")
    axs[1, 0].grid(True)
    axs[1, 1].set_xlabel("intentionally empty")
    plt.tight_layout()
    print("bias^2 + variance = ", bias_squared + variance)
Typically, for such a too simple model we get bias$^2 \gg$ variance,
which reads as: the model is far away from predicting the true data (high bias part), but this wrong prediction is very consistent (low variance part). In this example, the predicted data lies along straight lines (obviously not matching the true data shape), but these lines are very similarly aligned. This typically leads to poor predictions on unseen data, because we model the true world too simply. We should avoid such modeling.
# we take just a simple line equation model y = beta1 x + beta0 here
# note that intercept is only added in function bias_variance_of_model(X)
X = np.copy(x)[:, None]
fig, axs = plt.subplots(2, 2, figsize=(10, 5))
bias_variance_of_model(X)
axs[0, 0].set_title("underfit, too low model complexity, high bias, low var");
plt.tight_layout()
plt.savefig('bias_variance_plots/too_simple_model.png', dpi=300)
Typically, for such a too complex model we get bias$^2 \ll$ variance,
which reads as: the model is very (too) good at predicting each of the different noisy data sets (low bias part), but the individual predictions scatter strongly around their average, and hence around the true data (high variance part). This typically leads to non-robust predictions on noisy unknown data. We should avoid such modeling.
# we take a Fourier series expansion model here
X = np.column_stack((np.cos(x), np.sin(x))) # init with first two features
# add more features according to a Fourier series expansion
# <=M//2 makes sure we do not use more model parameters than signal samples
# in order to solve this as a least-squares problem, i.e. using left-inverse
for m in range(2, M // 2):
    X = np.column_stack((X, np.sin(m * x), np.cos(m * x)))
# note that intercept is only added in function bias_variance_of_model(X)
fig, axs = plt.subplots(2, 2, figsize=(10, 5))
bias_variance_of_model(X)
axs[0, 0].set_title("overfit, too high model complexity, low bias, high var");
plt.tight_layout()
plt.savefig('bias_variance_plots/too_complex_model.png', dpi=300)
A rare case (and obviously constructed here for didactic reasons): we know the exact model equation and thus can set up our feature matrix according to it. By concept this yields the lowest possible bias$^2$ + variance.
The remaining variance is due to the added (measurement) noise, which we do not want the model to explain, because then we would end up in the too-complex-model world just discussed above.
# we take all features of the true model here
# (we generally do not know this exactly in practice)
X = np.column_stack((np.cos(x), np.sin(2 * x), np.cos(5 * x), np.cos(6 * x)))
# note that intercept is only added in function bias_variance_of_model(X)
fig, axs = plt.subplots(2, 2, figsize=(10, 5))
bias_variance_of_model(X) # lowest possible bias^2+variance, because we
# know the true model (again: which in practice likely never will occur)
# the remaining variance is from the added noise
axs[0, 0].set_title("true model features, lowest bias, lowest var");
plt.tight_layout()
plt.savefig('bias_variance_plots/true_model.png', dpi=300)
If the exact model equation is not known, we need to find a model that is neither too simple nor too complex.
We could consider the following model as the most robust for this example.
A robust prediction model should have a reasonable trade-off between bias$^2$ and variance as well as a reasonably high $R_{\text{adjusted}}^2$ mean but a low $R_{\text{adjusted}}^2$ variance.
This is in good agreement with the model here.
# we take only the first two features of the true model
# as these oscillations explain much of the dependent variable y
X = np.column_stack((np.cos(x), np.sin(2 * x)))
# note that intercept is only added in function bias_variance_of_model(X)
fig, axs = plt.subplots(2, 2, figsize=(10, 5))
bias_variance_of_model(X)
axs[0, 0].set_title("reasonable bias/var trade-off if true model is unknown");
plt.tight_layout()
plt.savefig('bias_variance_plots/robust_model.png', dpi=300)
We might check other noise powers, for example bias_variance_of_model(X, noise_scale=2).
Although the Prediction Model == True Model has, by concept, the lowest bias and variance, the performance obviously degrades with increasing noise power. This is also indicated by the decreasing $R_{\text{adjusted}}^2$ values.
We should compare the performance of The Not Too Simple, Not Too Complex Model with the Prediction Model == True Model.
Recall that we almost never have access to the true model equation. That's why the last model should be considered as the most robust in this toy example.
We might check another model that includes the 3 most important features of the true model.
What do we expect in terms of the performance measures?
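As a sketch of how this check could be set up (assuming that "the 3 most important features" means those with the largest true weights, i.e. $\cos(x)$, $\sin(2x)$ and $\cos(5x)$; it reuses x and bias_variance_of_model() from above):
# sketch: model with the 3 largest-weight features of the true model
# (assumed interpretation of 'most important'; intercept is again added
# inside bias_variance_of_model(X))
X = np.column_stack((np.cos(x), np.sin(2 * x), np.cos(5 * x)))
fig, axs = plt.subplots(2, 2, figsize=(10, 5))
bias_variance_of_model(X)
axs[0, 0].set_title("3 most important features of the true model")
plt.tight_layout()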