Sascha Spors, Professorship Signal Theory and Digital Signal Processing, Institute of Communications Engineering (INT), Faculty of Computer Science and Electrical Engineering (IEF), University of Rostock, Germany
Winter Semester 2023/24 (Master Course #24512)
Feel free to contact the lecturer: frank.schultz@uni-rostock.de
In this notebook we check over-/underfitting via bias$^2$/variance quantities and $R_{\text{adjusted}}^2$ for linear (and unregularized) models that are trained and evaluated on noisy data (note: for convenience, training data = test data here).
For this toy example we know the true (noise-free) data, because we know the linear model equation that creates it. Hence, we can be quite confident in our interpretation of the performance of the different models. In practice we typically deal with an unknown model equation and then need to properly check our model estimates for over-/underfitting.
A robust prediction model should have a reasonable trade-off between bias$^2$ and variance as well as a reasonably high $R_{\text{adjusted}}^2$ mean but a low $R_{\text{adjusted}}^2$ variance (see this notebook).
Furthermore, a robust prediction model should predict reasonable outcomes for unknown input data, i.e. it should generalize well to unseen data. This is discussed in the notebook bias_variance_ridge_regression.ipynb.
Useful chapters in textbooks on this fundamental aspect:
For a (mapping) function $f(x)$ and a random process (assumed stationary) for drawing values $x$, described by the probability density function (PDF) $p(x)$, the expectation of $f(x)$ is defined as
$$\mathbb{E}_x\{f(x)\} = \int f(x) p(x) \mathrm{d}x.$$
The so-called 1st order raw moment (also known as linear mean) follows from this basic expectation formula by setting the mapping function $f(x) = x^1$. The 1st raw moment is thus defined as
$$\mu_x = \mathbb{E}_x\{x^1\} = \int x^1 p(x) \mathrm{d}x.$$
The so-called 2nd order raw moment (also known as quadratic mean) follows by setting the mapping function $f(x) = x^2$. The 2nd raw moment is thus defined as
$$\mathbb{E}_x\{x^2\} = \int x^2 p(x) \mathrm{d}x.$$
The so-called 2nd order centralized moment plays a fundamental role in statistical signal processing. Its mapping function is $f(x) = (x^1 - \mathbb{E}_x\{x\})^2$ and hence
$$\mathbb{E}_x\{(x^1 - \mathbb{E}_x\{x\})^2\} = \int (x^1 - \mathbb{E}_x\{x\})^2 p(x) \mathrm{d}x.$$
With the linear mean $\mu_x = \mathbb{E}_x\{x\}$ from above we can rewrite this as
$$\sigma_x^2 = \mathbb{E}_x\{(x - \mu_x)^2\} = \int (x - \mu_x)^2 p(x) \mathrm{d}x.$$
The 2nd centralized moment $\sigma_x^2$ is known as the variance, its square root $\sqrt{\sigma_x^2} = \sigma_x$ as the standard deviation.
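As a quick numerical sanity check, here is a minimal sketch (assuming a Gaussian PDF with $\mu_x = 3$ and $\sigma_x = 0.5$ purely as an example) that approximates these integrals on a fine grid; note that the PDF is known here, whereas the sample-based case without a known PDF follows below:
import numpy as np

# assumed example: Gaussian PDF with mu_x = 3 and sigma_x = 0.5
mu, sigma = 3.0, 0.5
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 100001)
dx = x[1] - x[0]
p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

raw1 = np.sum(x * p) * dx                    # 1st raw moment -> mu_x
raw2 = np.sum(x**2 * p) * dx                 # 2nd raw moment -> mu_x^2 + sigma_x^2
central2 = np.sum((x - raw1) ** 2 * p) * dx  # 2nd centralized moment -> sigma_x^2
print(raw1, raw2, central2)                  # approx. 3.0, 9.25, 0.25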
The above formulations hold for continuous $x$, i.e. continuously given arguments (data). Most often, however, we deal with sampled data and then need to approximate the above moments. This is a theory in its own right, see our lecture / tutorial on digital signal processing.
In short, we approximate the integral by a sum, and the integral kernel by the signal values themselves together with an averaging factor (to compensate for the missing information from the PDF used above).
Then the basic expectation equation is $$\mathbb{E}\{f(x)\} = \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} f(x_m) \approx \frac{1}{M} \sum_{m=1}^{M} f(x_m),$$ denoting $x_m$ as the $m$-th sample of a data set with a finite number $M$ of samples.
The raw and centralized moments can now be approximated by the very same mappings as above.
The so-called 1st order raw moment (also known as linear mean) can be approximated from the basic expectation formula by setting the mapping function $f(x) = x^1$. The 1st raw moment is thus defined as $$\bar{x} = \mathbb{E}\{x\} = \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} x_m$$
The so-called 2nd order raw moment (also known as quadratic mean) can be approximated from the basic expectation formula by setting the mapping function $f(x) = x^2$. The 2nd raw moment is thus defined as $$\mathbb{E}\{x^2\} = \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} x^2_m$$
The so-called 2nd order centralized moment (also known as variance) can be approximated from the basic expectation formula by setting the mapping function $f(x) = (x - \bar{x})^2$. The 2nd centralized moment is thus defined as $$\mathbb{E}\{(x-\bar{x})^2\} = \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} (x_m - \bar{x})^2$$
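A minimal numerical sketch (again assuming an example normal distribution) that compares these sample-based moments with numpy's built-in estimators:
import numpy as np

rng = np.random.default_rng(12345)
M = 2**16                                   # large, but finite number of samples
x = rng.normal(loc=3.0, scale=0.5, size=M)  # assumed example data

raw1 = np.sum(x) / M                        # 1st raw moment (linear mean)
raw2 = np.sum(x**2) / M                     # 2nd raw moment (quadratic mean)
central2 = np.sum((x - raw1) ** 2) / M      # 2nd centralized moment (variance)

print(raw1, np.mean(x))                     # both estimate mu_x = 3
print(central2, np.var(x))                  # both estimate sigma_x^2 = 0.25
print(raw2 - raw1**2)                       # also approx. sigma_x^2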
We utilize our well-known least-squares error problem
$$\min_{\text{wrt }\beta} ||\mathbf{y} - \mathbf{X} \beta||_2^2$$
with the solution
$$\hat{\beta} = (\mathbf{X}^\mathrm{H} \mathbf{X})^{-1} \mathbf{X}^\mathrm{H} \mathbf{y},$$
as always assuming an $M \times N$ matrix $\mathbf{X}$ with full column rank $r=N$ and $M \gg N$ (i.e. more data samples than features).
In our toy example we set up the feature matrix from a given data column vector $\mathbf{x}$ and $N-1$ different (non-)linear mapping functions $f_i(\mathbf{x})$, such that the feature matrix becomes
$$ \mathbf{X} = \begin{bmatrix} 1 & f_1(\mathbf{x}_1) & f_2(\mathbf{x}_1) & \dots & f_{N-1}(\mathbf{x}_1)\\ 1 & f_1(\mathbf{x}_2) & f_2(\mathbf{x}_2) & \dots & f_{N-1}(\mathbf{x}_2)\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & f_1(\mathbf{x}_M) & f_2(\mathbf{x}_M) & \dots & f_{N-1}(\mathbf{x}_M)\\ \end{bmatrix} $$
with the first column filled with ones to realize the intercept of the linear model.
Note that the feature design $f(\mathbf{x})$ might be non-linear, but the model itself is still linear, because the outcome is the linear combination $\hat{\mathbf{y}} = \mathbf{X} \hat{\beta}$.
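A minimal sketch of such a feature matrix and the corresponding least-squares fit (the mapping functions, weights and data here are arbitrary examples, not yet the toy model below):
import numpy as np

M = 100
x = np.linspace(-1, 1, M)
# first column of ones (intercept), then example non-linear mappings f_1, f_2, f_3 of x
X = np.column_stack((np.ones(M), x, x**2, np.exp(x)))
beta_true = np.array([1.0, 2.0, -0.5, 0.25])  # assumed example weights
y = X @ beta_true                             # data generated by the linear model

beta_hat = np.linalg.pinv(X) @ y  # equals (X^H X)^{-1} X^H y for full column rank X
yhat = X @ beta_hat               # prediction is a linear combination of the features
print(np.allclose(beta_hat, beta_true), np.allclose(yhat, y))  # True True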
The true data is created with a column vector $\mathbf{x}$ containing $0$ to $2\pi$ equidistantly sampled with $M=2^8$ samples. The corresponding true feature matrix $\mathbf{X}_t$ is built with the features $\cos(\mathbf{x}), \sin(2\mathbf{x}), \cos(5\mathbf{x}), \cos(6\mathbf{x})$ plus the intercept column (cf. the code below) and has rank 5. The true model weights are $\beta_t = [3,2,1, \frac{1}{2}, \frac{1}{4}]^\mathrm{T}$. The true outcome vector (model output) is then $\mathbf{t} = \mathbf{X}_t \beta_t$, which serves as reference for the optimum prediction (cf. $h(\mathbf{x})$ in [Bishop 2006, Sec. 3.2]) in the discussion of the bias/variance trade-off.
We should realize that we do a Fourier series synthesis here.
Measured data will be obtained by
$$\mathbf{y} = \mathbf{t} + \mathbf{n},$$
where the column vector $\mathbf{n}$ (the noise) contains a random signal drawn from a normal distribution with zero mean and variance noise_scale**2.
Thus $\mathbf{y}$ potentially also has components in the left null space of $\mathbf{X}_t$, whereas we know by design that $\mathbf{t}$ lives purely in the column space of $\mathbf{X}_t$.
We create $L$ different $\mathbf{y}$ vectors, as we sample $L$ different $\mathbf{n}$ vectors.
We denote the $l$-th measurement data vector as $\mathbf{y}^{(l)}$ and the $m$-th entry in $\mathbf{y}^{(l)}$ as $\mathbf{y}_m^{(l)}$.
The indices $l=1,\dots,L$ and $m=1,\dots,M$ are used later on.
We specify a certain model with a dedicated feature matrix $\mathbf{X}$. We always use the column vector $\mathbf{x}$ introduced above, but different features and thereby different model complexity.
The $l$-th measurement leads to the $l$-th estimate of the model parameters
$$\hat{\beta}^{(l)} = (\mathbf{X}^\mathrm{H} \mathbf{X})^{-1} \mathbf{X}^\mathrm{H} \mathbf{y}^{(l)}$$
and thus the $l$-th prediction is given as
$$\hat{\mathbf{y}}^{(l)} = \mathbf{X} \hat{\beta}^{(l)} = \mathbf{X} (\mathbf{X}^\mathrm{H} \mathbf{X})^{-1} \mathbf{X}^\mathrm{H} \mathbf{y}^{(l)}.$$
We consistently denote the $m$-th entry in the predicted model output $\hat{\mathbf{y}}^{(l)}$ as $\hat{\mathbf{y}}_m^{(l)}$.
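In array form, the $L$ measurements can be stored as the columns of an $M \times L$ matrix and all $L$ estimates and predictions obtained at once with the left inverse. A minimal, self-contained sketch (with an assumed small example feature matrix, mirroring what the OLS fits further below do column by column):
import numpy as np

rng = np.random.default_rng(12345)
M, L = 2**8, 2**7
x = np.linspace(0, 2 * np.pi, M)
X = np.column_stack((np.ones(M), np.cos(x), np.sin(2 * x)))  # assumed example features
t = X @ np.array([3.0, 2.0, 1.0])                            # example true data t

# L noisy measurements y^(l) = t + n^(l) as columns: Yn[m, l] holds entry y_(m+1)^((l+1))
Yn = t[:, None] + rng.normal(loc=0, scale=1 / 3, size=(M, L))

Beta_hat = np.linalg.pinv(X) @ Yn  # column l holds the l-th estimate beta_hat^(l)
Yhat = X @ Beta_hat                # column l holds the l-th prediction yhat^(l)
print(Beta_hat.shape, Yhat.shape)  # (3, 128) (256, 128)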
According to [Bishop 2006, Sec. 3.2] (it is worth really thinking about these equations) we can calculate bias$^2$ and variance, assuming here that the optimum prediction is precisely our true data (which we fortunately know, as we designed it as $\mathbf{t}$).
Firstly, a 1st order raw moment is calculated as [Bishop 2006, (3.45)]
$$\tilde{\mathbf{y}} = \frac{1}{L} \sum_{l=1}^{L} \hat{\mathbf{y}}^{(l)},$$
so we get an $M \times 1$ column vector $\tilde{\mathbf{y}}$ (note the tilde above it), where the $m$-th entry is the average of all $m$-th entries of the $L$ model predictions. Doing that, and hoping that the prediction noise averages out, we get a fair idea of what the true data vector could be.
We could set up a 2nd order moment to quantify the error between $\tilde{\mathbf{y}}$ and the true data $\mathbf{t}$ in a squared sense. This time we sum over the $M$ samples contained in both vectors, so it is a sum-of-squared-deviations operation, which occurs very often in machine learning, cf. typical loss functions.
We then get the bias$^2$ [Bishop 2006, (3.46)]
$$\mathrm{bias}^2 = \frac{1}{M} \sum_{m=1}^{M} (\mathbf{t}_m - \tilde{\mathbf{y}}_m)^2$$
The variance is a little more complicated, cf. [Bishop 2006, (3.47)]. First we calculate a 2nd order centralized moment over the $L$ models, resulting in an $M \times 1$ column vector
$$ \mathbf{v} = \frac{1}{L} \sum_{l=1}^{L} (\hat{\mathbf{y}}^{(l)} - \tilde{\mathbf{y}})^2$$
and then a 1st order raw moment over the $M$ samples
$$\mathrm{variance} = \frac{1}{M} \sum_{m=1}^{M} \mathbf{v}_m$$
Let us also define a 1st raw moment of the measured data along its samples (note the bar above ${y}$; do not confuse it with the tilde vector from above!). We get the scalar
$$\bar{{y}} = \frac{1}{M} \sum_{m=1}^{M} \mathbf{y}_m$$
We can further define 2nd order moments, here rather called sums of squares:
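Explicitly, and matching the implementation in the my_r2adj() function below, these sums of squares are
$$\mathrm{SST} = \sum_{m=1}^{M} (\mathbf{y}_m - \bar{y})^2, \qquad \mathrm{SSR} = \sum_{m=1}^{M} (\hat{\mathbf{y}}_m - \bar{y})^2, \qquad \mathrm{SSE} = \sum_{m=1}^{M} (\mathbf{y}_m - \hat{\mathbf{y}}_m)^2,$$
i.e. the total sum of squares (SST), the sum of squares due to regression (SSR) and the sum of squared errors / residuals (SSE).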
SSE is actually the cost function of the optimization problem, because it can be rewritten as $||\mathbf{y} - \mathbf{X} \hat{\beta}||_2^2 = (\mathbf{y} - \mathbf{X} \hat{\beta})^\mathrm{T} (\mathbf{y} - \mathbf{X} \hat{\beta})$, which was used above.
Important note: in the literature, different abbreviations and phrases are used for these three quantities, so we should check them carefully. For example, the sum of squares of residuals is sometimes abbreviated SSR, while we use SSR for the sum of squares due to regression; these are totally different things! This can be confusing at the beginning, but the better we understand what these equations actually do, the less important the actual wording becomes.
It is known that
$$\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}$$We might be interested in the ratio
$$R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{\mathrm{SST}-\mathrm{SSE}}{\mathrm{SST}} = 1^2 - \frac{\mathrm{SSE}}{\mathrm{SST}}$$which is typically called empirical correlation coefficient or coefficient of determination and for which
$$0 \leq R^2 \leq 1$$holds. For SSE = 0, the measured data and the predicted data are identical, so $R^2=1$.
Typically, $R^2$ depends on the number of data samples $M$ and can essentially always be increased by adding more features $N$, i.e. by making the model more complex. But chasing a high $R^2$ this way is a rather trivial and naive approach to training models and easily brings us into the region of overfitting.
So, we rather need a robust measure which is somehow independent of $M$ and $N$, to draw reasonable conclusions about the model's performance. A well-known approach is to compensate for the so-called degrees of freedom of the model, which brings up the so-called adjusted $R^2$
$$R_\mathrm{adj}^2 = 1^2 - \frac{\frac{\mathrm{SSE}}{M-N}}{\frac{\mathrm{SST}}{M-1}}$$for models that use an intercept parameter $\hat{\beta}_0$ (which we do here).
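For illustration with assumed example numbers: for $M=256$ samples, $N=3$ model parameters and $\mathrm{SSE} = 0.1\,\mathrm{SST}$, the plain coefficient is $R^2 = 1 - 0.1 = 0.9$, whereas $R_\mathrm{adj}^2 = 1 - \frac{0.1\,\mathrm{SST}/(256-3)}{\mathrm{SST}/(256-1)} = 1 - 0.1 \cdot \frac{255}{253} \approx 0.899$; the penalty grows as $N$ approaches $M$.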
Let us get into some coding business for this toy example...
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.api import OLS
def my_r2adj(X, beta_hat, y):
    """(Adjusted) coefficient of determination R^2.
    also known as empirical correlation coefficient between y and yhat
    """
    M = X.shape[0]  # number of samples / observations
    N = X.shape[1]  # number of model parameters (including intercept beta[0])
    yhat = X @ beta_hat  # do regression / prediction
    # sum of squares T.otal:
    SST = np.sum((y - np.mean(y)) ** 2)
    # sum of squares due to R.egression:
    SSR = np.sum((yhat - np.mean(y)) ** 2)
    # sum of squared E.rrors:
    SSE = np.sum((y - yhat) ** 2)
    # SST = SSR + SSE holds (numerical errors might corrupt this equality)
    # R2 = SSR / SST is the ratio between regression and total, 0<=R2<=1
    # rearranged R2 = (SST - SSE) / SST = 1**2 - SSE / SST
    # p.58 in [MT11], p.111 in [DB18], p.108 (1.34)-(1.36) in [FHT96],
    # p.125 in [FKLM21], Ch.2.4.6 in [Agresti15]
    R2 = 1**2 - SSE / SST
    # R2 should be adjusted by number of samples and model complexity
    # note that this equation holds for models that include an intercept term
    # p.58 in [MT11], p.163 in [FKLM21], Ch.2.4.6 in [Agresti15]
    # R2adj = 1**2 - (1-R2)*(n-1)/(n-d)
    # or rewritten to see the adjustments in a more convenient way
    R2adj = 1**2 - (SSE / (M - N)) / (SST / (M - 1))
    return R2adj
# number of observations / samples:
M = 2**8
# true model with x as input variable
x = np.linspace(0, 2 * np.pi, M)
# to create 4 features in design/feature matrix X
X = np.column_stack((np.cos(x), np.sin(2 * x), np.cos(5 * x), np.cos(6 * x)))
# add a bias/intercept column to the design/feature matrix:
X = np.hstack((np.ones((X.shape[0], 1)), X))
hasconst = True
# some nice numbers for the true model parameters beta:
beta = np.array([3, 2, 1, 1 / 2, 1 / 4])
# generate 'true' data with the design matrix of 'true' model
y = X @ beta
plt.figure(figsize=(6, 3))
plt.plot(y, "k-")
plt.xlabel("independent features' input variable x")
plt.ylabel("dependent variable y, true data")
plt.title("true model data as linear model (x -> 4 features + intercept)")
plt.xlim(0, M)
plt.ylim(-2, 8)
plt.grid(True)
print(X.shape, y.shape)
plt.tight_layout()
plt.savefig('bias_variance_plots/true_data.png', dpi=300)
def bias_variance_of_model(
    X, noise_scale=1 / 3
):  # noise_scale=2 is nice as well
    # note: uses the globals y (true data), M and axs (figure axes) defined outside
    # add bias column to the design matrix
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    hasconst = True
    print(
        "\nshape of model/feature matrix X:",
        X.shape,
        "\nrank of matrix X / # of model parameters:",
        np.linalg.matrix_rank(X),
    )
    # init random number generator to reproduce results
    rng = np.random.default_rng(12345)
    # generate L data sets with added noise
    L = 2**7
    noise = rng.normal(loc=0, scale=noise_scale, size=(M, L))
    Yn = y[:, None] + noise
    # alloc memory for all predictions
    Yhat = np.zeros((M, L))
    rsquared_adj = np.zeros(L)
    # train and predict L models on these L data sets
    for i in range(L):
        model = OLS(Yn[:, i], X, hasconst=hasconst)  # OLS model
        results = model.fit()  # train the model
        Yhat[:, i] = results.predict(X)  # predict outcome
        rsquared_adj[i] = my_r2adj(X, results.params, Yn[:, i])
        # print(np.allclose(results.rsquared_adj, rsquared_adj[i]))
    # get average prediction, i.e. mean over the L models
    # which is a numerical eval of the expectation:
    ym = np.mean(Yhat, axis=1)  # (3.45) in [Bishop 2006]
    # get integrated squared bias (numerical eval of the expectation):
    # note: y is the true model data
    bias_squared = np.mean((ym - y) ** 2)  # (3.42), (3.46) in [Bishop 2006]
    # get integrated variance (numerical eval of the expectation):
    variance = np.mean(
        np.mean((Yhat - ym[:, None]) ** 2, axis=1), axis=0
    )  # (3.43), (3.47) in [Bishop 2006]
    for i in range(L):
        axs[0, 0].plot(Yn[:, i])
        axs[0, 1].plot(Yhat[:, i])
    axs[0, 1].plot(y, "k-", label="true model")
    axs[0, 1].plot(
        np.mean(Yhat, axis=1), ":", color="gold", label=r"$\mu(\hat{Y})$"
    )
    axs[0, 1].plot(
        np.mean(Yhat, axis=1) + np.std(Yhat, axis=1),
        "--",
        lw=0.75,
        color="gold",
        label=r"$\mu(\hat{Y}) \pm \sigma(\hat{Y})$",
    )
    axs[0, 1].plot(
        np.mean(Yhat, axis=1) - np.std(Yhat, axis=1),
        "-.",
        lw=0.75,
        color="gold",
    )
    axs[0, 1].set_title(
        r"bias$^2$="
        + "{:4.3f}".format(bias_squared)
        + ", var="
        + "{:4.3f}".format(variance)
        + r", bias$^2$+var="
        + "{:4.3f}".format(bias_squared + variance)
    )
    for i in range(2):
        axs[0, i].set_xlim(0, M)
        axs[0, i].set_ylim(-2, 8)
        axs[0, i].grid(True)
        axs[0, i].set_xlabel("independent features' input variable x")
    axs[0, 0].set_ylabel("dependent variable yn")
    axs[0, 1].set_ylabel("predicted variable yhat")
    axs[0, 1].legend()
    axs[1, 0].plot(rsquared_adj)
    axs[1, 0].set_title(
        r"$\hat{\mu}(R_{adj}^2)="
        + "{:4.3f}$".format(np.mean(rsquared_adj))
        + r", $\hat{\sigma}(R_{adj}^2)="
        + "{:4.3f}$".format(np.std(rsquared_adj))
    )
    axs[1, 0].set_xlim(0, L)
    axs[1, 0].set_ylim(0, 1)
    axs[1, 0].set_xlabel("model index")
    axs[1, 0].set_ylabel(r"$R_{adj}^2$")
    axs[1, 0].grid(True)
    axs[1, 1].set_xlabel("intentionally empty")
    plt.tight_layout()
    print("bias^2 + variance = ", bias_squared + variance)
Typically, for such a too simple model we get bias$^2 \gg$ variance,
which reads as: the model is far away from predicting the true data (high bias part), but this wrong prediction is very consistent (low variance part). In this example, the predicted data lies along straight lines (obviously not matching the true data shape), but these lines are very similarly aligned. This typically leads to poor predictions on unseen data, because we model the true world too simply. We should avoid such modeling.
# we take just a simple line equation model y = beta1 x + beta0 here
# note that intercept is only added in function bias_variance_of_model(X)
X = np.copy(x)[:, None]
fig, axs = plt.subplots(2, 2, figsize=(10, 5))
bias_variance_of_model(X)
axs[0, 0].set_title("underfit, too low model complexity, high bias, low var");
plt.tight_layout()
plt.savefig('bias_variance_plots/too_simple_model.png', dpi=300)
Typically, for such a too complex model we get bias$^2 \ll$ variance,
which reads as: the model is very (too) good at predicting each of the different noisy data sets (low bias part), but the individual predictions scatter strongly around their average, and hence around the true data (high variance part). This typically leads to non-robust predictions on noisy unknown data. We should avoid such modeling.
# we take a Fourier series expansion model here
X = np.column_stack((np.cos(x), np.sin(x))) # init with first two features
# add more features according to a Fourier series expansion
# <=M//2 makes sure we do not use more model parameters than signal samples
# in order to solve this as a least-squares problem, i.e. using left-inverse
for m in range(2, M // 2):
    X = np.column_stack((X, np.sin(m * x), np.cos(m * x)))
# note that intercept is only added in function bias_variance_of_model(X)
fig, axs = plt.subplots(2, 2, figsize=(10, 5))
bias_variance_of_model(X)
axs[0, 0].set_title("overfit, too high model complexity, low bias, high var");
plt.tight_layout()
plt.savefig('bias_variance_plots/too_complex_model.png', dpi=300)
A rare case (and obviously constructed here for didactic reasons): we know the exact model equation and thus can set up our feature matrix according to it. By concept this yields the lowest possible bias$^2$ + variance.
The remaining variance is due to the added (measurement) noise, which we do not want the model to explain, because then we would end up in the too-complex-model world just discussed above.
# we take all features of the true model here
# (we generally do not know this exactly in practice)
X = np.column_stack((np.cos(x), np.sin(2 * x), np.cos(5 * x), np.cos(6 * x)))
# note that intercept is only added in function bias_variance_of_model(X)
fig, axs = plt.subplots(2, 2, figsize=(10, 5))
bias_variance_of_model(X) # lowest possible bias^2+variance, because we
# know the true model (again: which in practice likely never will occur)
# the remaining variance is from the added noise
axs[0, 0].set_title("true model features, lowest bias, lowest var");
plt.tight_layout()
plt.savefig('bias_variance_plots/true_model.png', dpi=300)
If the exact model equation is not known, we need to find a model that is neither too simple nor too complex.
We could consider the following model as the most robust for this example.
A robust prediction model should have a reasonable trade-off between bias$^2$ and variance as well as a reasonably high $R_{\text{adjusted}}^2$ mean but a low $R_{\text{adjusted}}^2$ variance.
This is in good agreement with the model here.
# we take only the first two features of the true model
# as these oscillations explain much of the dependent variable y
X = np.column_stack((np.cos(x), np.sin(2 * x)))
# note that intercept is only added in function bias_variance_of_model(X)
fig, axs = plt.subplots(2, 2, figsize=(10, 5))
bias_variance_of_model(X)
axs[0, 0].set_title("reasonable bias/var trade-off if true model is unknown");
plt.tight_layout()
plt.savefig('bias_variance_plots/robust_model.png', dpi=300)
We might check other noise powers, for example bias_variance_of_model(X, noise_scale=2).
Although the Prediction Model == True Model has, by concept, the lowest bias and variance, the performance obviously degrades with increasing noise power. This is also indicated by the decreasing $R_{\text{adjusted}}^2$ values.
We should compare the performance of The Not Too Simple, Not Too Complex Model with the Prediction Model == True Model.
Recall that we almost never have access to the true model equation. That's why the last model should be considered as the most robust in this toy example.
We might check another model that includes the 3 most important features of the true model.
What do we expect in terms of the performance measures?
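As a sketch of how this check could be set up (assuming that "the 3 most important features" means those with the largest true weights, i.e. $\cos(x)$, $\sin(2x)$ and $\cos(5x)$; it reuses x and bias_variance_of_model() from above):
# sketch: model with the 3 largest-weight features of the true model
# (assumed interpretation of 'most important'; intercept is again added
# inside bias_variance_of_model(X))
X = np.column_stack((np.cos(x), np.sin(2 * x), np.cos(5 * x)))
fig, axs = plt.subplots(2, 2, figsize=(10, 5))
bias_variance_of_model(X)
axs[0, 0].set_title("3 most important features of the true model")
plt.tight_layout()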