# Dynamic Models¶

### Example Problem¶

• We consider a one-dimensional cart position tracking problem, see Faragher 2012.

• The hidden states are the position $z_t$ and velocity $\dot z_t$. We can apply an external acceleration/breaking force $u_t$. (Noisy) observations are represented by $x_t$.

• The equations of motions are given by

\begin{align} \begin{bmatrix} z_t \\ \dot{z_t}\end{bmatrix} &= \begin{bmatrix} 1 & \Delta t \\ 0 & 1\end{bmatrix} \begin{bmatrix} z_{t-1} \\ \dot z_{t-1}\end{bmatrix} + \begin{bmatrix} (\Delta t)^2/2 \\ \Delta t\end{bmatrix} u_t + \mathcal{N}(0,\Sigma_z) \\ x_t &= \begin{bmatrix} z_t \\ \dot{z_t}\end{bmatrix} + \mathcal{N}(0,\Sigma_x) \end{align}
• Task: Infer the position $z_t$ after 10 time steps. (Solution later in this lesson).

### Dynamical Models¶

• In this lesson, we consider models where the sequence order of observations matters.

• Consider the ordered observation sequence $x^T \triangleq \left(x_1,x_2,\ldots,x_T\right)$.

• (For brevity, in this lesson we use the notation $x_t^T$ to denote $(x_t,x_{t+1},\ldots,x_T)$ and drop the subscript if $t=1$, so $x^T = x_1^T = \left(x_1,x_2,\ldots,x_T\right)$).
• We wish to develop a generative model $$p( x^T )$$ that 'explains' the time series $x^T$.
• We cannot use the IID assumption $p( x^T ) = \prod_t p(x_t )$. In general, we can use the chain rule (a.k.a. the general product rule)
\begin{align*} p(x^T) &= p(x_T|x^{T-1}) \,p(x^{T-1}) \\ &= p(x_T|x^{T-1}) \,p(x_{T-1}|x^{T-2}) \cdots p(x_2|x_1)\,p(x_1) \\ &= p(x_1)\prod_{t=2}^T p(x_t\,|\,x^{t-1}) \end{align*}
• Generally, we will want to limit the depth of dependencies on previous observations. For example, a $K$th-order linear Auto-Regressive (AR) model that is given by \begin{align*} p(x_t\,|\,x^{t-1}) = \mathcal{N}\left(x_t \,\middle|\, \sum_{k=1}^K a_k x_{t-k}\,,\sigma^2\,\right) \end{align*} limits the dependencies to the past $K$ samples.

### State-space Models¶

• A limitation of AR models is that they need a lot of parameters in order to create a flexible model. E.g., if $x_t$ is an $M$-dimensional discrete variable, then a $K$th-order AR model will have about $M^{K}$ parameters.
• Similar to our work on Gaussian Mixture models, we can create a flexible dynamic system by introducing latent (unobserved) variables $z^T \triangleq \left(z_1,z_2,\dots,z_T\right)$ (one $z_t$ for each observation $x_t$). In dynamic systems, $z_t$ are called state variables.
• A state space model is a particular latent variable dynamical model defined by \begin{align*} p(x^T,z^T) &= \underbrace{p(z_1)}_{\text{initial state}} \prod_{t=2}^T \underbrace{p(z_t\,|\,z_{t-1})}_{\text{state transitions}}\,\prod_{t=1}^T \underbrace{p(x_t\,|\,z_t)}_{\text{observations}} \end{align*}
• The condition $p(z_t\,|\,z^{t-1}) = p(z_t\,|\,z_{t-1})$ is called a $1$st-order Markov condition.
• The Forney-style factor graph for a state-space model:

### Hidden Markov Models and Linear Dynamical Systems¶

• A Hidden Markov Model (HMM) is a specific state-space model with discrete-valued state variables $z_t$.
• Typically, $z_t$ is a $K$-dimensional one-hot coded latent 'class indicator' with transition probabilities $a_{jk} \triangleq p(z_{tk}=1\,|\,z_{t-1,j}=1)$, or equivalently, $$p(z_t|z_{t-1}) = \prod_{k=1}^K \prod_{j=1}^K a_{jk}^{z_{t-1,j}\cdot z_{tk}}$$ which is usually accompanied by an initial state distribution $p(z_{1k}=1) = \pi_k$.
• The classical HMM has also discrete-valued observations but in pratice any (probabilistic) observation model $p(x_t|z_t)$ may be coupled to the hidden Markov chain.
• Another well-known state-space model with continuous-valued state variables $z_t$ is the (Linear) Gaussian Dynamical System (LGDS), which is defined as
\begin{align*} p(z_t\,|\,z_{t-1}) &= \mathcal{N}\left(\, A z_{t-1}\,,\,\Sigma_z\,\right) \\ p(x_t\,|\,z_t) &= \mathcal{N}\left(\, C z_t\,,\,\Sigma_x\,\right) \\ p(z_1) &= \mathcal{N}\left(\, \mu_1\,,\,\Sigma_1\,\right) \end{align*}
• Note that the joint distribution over all states and observations $\{(x_1,z_1),\ldots,(x_t,z_t)\}$ is a (large-dimensional) Gaussian distribution. This means that, in principle, every inference problem on the LGDS model also leads to a Gaussian distribution.
• HMM's and LGDS's (and variants thereof) are at the basis of a wide range of complex information processing systems, such as speech and language recognition, robotics and automatic car navigation, and even processing of DNA sequences.

### Common Signal Processing Tasks as Message Passing-based Inference¶

• As we have seen, inference tasks in linear Gaussian state space models can be analytically solved.

• However, these derivations quickly become cumbersome and prone to errors.

• Alternatively, we could specify the generative model in a (Forney-style) factor graph and use automated message passing to infer the posterior over the hidden variables. Here follows some examples.

• Filtering, a.k.a. state estimation: estimation of a state (at time step $t$), based on past and current (at $t$) observations.

• Smoothing: estimation of a state based on both past and future observations. Needs backward messages from the future.

• Prediction: estimation of future state or observation based only on observations of the past.

### Kalman Filtering¶

• Technically, a Kalman filter is the solution to the recursive estimation (inference) of the hidden state $z_t$ based on past observations in an LGDS, i.e., Kalman filtering solves the problem $p(z_t\,|\,x^t)$ based on the previous estimate $p(z_{t-1}\,|\,x^{t-1})$ and a new observation $x_t$ (in the context of the given model specification of course).
• Let's infer the Kalman filter for a scalar linear Gaussian dynamical system: \begin{align*} p(z_t\,|\,z_{t-1}) &= \mathcal{N}(z_t\,|\,a z_{t-1},\sigma_z^2) \tag{state transition} \\ p(x_t\,|\,z_t) &= \mathcal{N}(x_t\,|\,c z_t,\sigma_x^2) \tag{observation} \end{align*}
• Kalman filtering comprises inferring $p(z_t\,|\,x^t)$ from a given prior estimate $p(z_{t-1}\,|\,x^{t-1})$ (available after the previous time step) and a new observation $x_t$. Let us assume that \begin{align} p(z_{t-1}\,|\,x^{t-1}) = \mathcal{N}(z_{t-1} \,|\, \mu_{t-1}, \sigma_{t-1}^2) \tag{prior} \end{align}
• Note that everything is Gaussian, so it is in principle possible to execute inference problems analytically and the result will be a Gaussian posterior:
• (In the following derivation we make use of the renormalization equality $\mathcal{N}(x\,|\,cz,\sigma^2) = \frac{1}{c}\mathcal{N}\left(z \,\big|\,\frac{x}{c},\left(\frac{\sigma}{c}\right)^2\right)$).
\begin{align*} \underbrace{p(z_t\,|\,x^t)}_{\text{posterior}} &= p(z_t\,|\,x_t,x^{t-1}) \propto p(x_t,z_t\,|\,x^{t-1}) \\ &\propto p(x_t\,|\,z_t) \,p(z_t\,|\,x^{t-1}) \\ &= p(x_t\,|\,z_t) \, \sum_{z_{t-1}} p(z_t,z_{t-1}\,|\,x^{t-1}) \\ &= \underbrace{p(x_t\,|\,z_t)}_{\text{observation}} \, \sum_{z_{t-1}} \underbrace{p(z_t\,|\,z_{t-1})}_{\text{state transition}} \, \underbrace{p(z_{t-1}\,|\,x^{t-1})}_{\text{prior}} \\ &= \mathcal{N}(x_t\,|\,c z_t,\sigma_x^2) \sum_{z_{t-1}} \mathcal{N}(z_t\,|\,a z_{t-1},\sigma_z^2) \, \mathcal{N}(z_{t-1} \,|\, \mu_{t-1}, \sigma_{t-1}^2) \\ &= \frac{1}{c}\mathcal{N}\left(z_t\bigm| \frac{x_t}{c} ,\left(\frac{\sigma_x}{c}\right)^2\right) \sum_{z_{t-1}} \frac{1}{a}\underbrace{\mathcal{N}\left(z_{t-1}\bigm| \frac{z_t}{a},\left(\frac{\sigma_z}{a}\right)^2 \right) \, \mathcal{N}(z_{t-1} \,|\, \mu_{t-1}, \sigma_{t-1}^2)}_{\text{use Gaussian multiplication formula SRG-6}} \\ &\propto \underbrace{\mathcal{N}\left(z_t\,\bigm| \,\frac{x_t}{c} ,\left(\frac{\sigma_x}{c}\right)^2\right) \cdot \mathcal{N}\left(z_t\, \bigm|\,a \mu_{t-1},\sigma_z^2 + \left(a \sigma_{t-1}\right)^2 \right)}_{\text{use SRG-6 again}} \\ &\propto \mathcal{N}\left( z_t \,|\, \mu_t, \sigma_t^2\right) \end{align*}

with \begin{align*} \rho_t^2 &= a^2 \sigma_{t-1}^2 + \sigma_z^2 \tag{predicted variance}\\ K_t &= \frac{c \rho_t^2}{c^2 \rho_t^2 + \sigma_x^2} \tag{Kalman gain} \\ \mu_t &= \underbrace{a \mu_{t-1}}_{\text{prior prediction}} + K_t \cdot \underbrace{\left( x_t - c a \mu_{t-1}\right)}_{\text{prediction error}} \tag{posterior mean}\\ \sigma_t^2 &= \left( 1 - c\cdot K_t \right) \rho_t^2 \tag{posterior variance} \end{align*}

• Kalman filtering consists of computing/updating these last four equations for each new observation ($x_t$). This is a very efficient recursive algorithm to estimate the state $z_t$ from all observations (until $t$).

• It turns out that it's also possible to get an analytical result for $p(x_t|x^{t-1})$, which is the model evidence in a filtering context. See optional slides for details.

### Multi-dimensional Kalman Filtering¶

• The Kalman filter equations can also be derived for multidimensional state-space models. In particular, for the model \begin{align*} z_t &= A z_{t-1} + \mathcal{N}(0,\Gamma) \\ x_t &= C z_t + \mathcal{N}(0,\Sigma) \end{align*} the Kalman filter update equations for the posterior $p(z_t |x^t) = \mathcal{N}\left(z_t \bigm| \mu_t, V_t \right)$ are given by (see Bishop, pg.639) \begin{align*} P_t &= A V_{t-1} A^T + \Gamma \tag{predicted variance}\\ K_t &= P_t C^T \cdot \left(C P_t C^T + \Sigma \right)^{-1} \tag{Kalman gain} \\ \mu_t &= A \mu_{t-1} + K_t\cdot\left(x_t - C A \mu_{t-1} \right) \tag{posterior state mean}\\ V_t &= \left(I-K_t C \right) P_{t} \tag{posterior state variance} \end{align*}

### Code Example: Kalman Filtering and the Cart Position Tracking Example Revisited¶

• We can now solve the cart tracking problem of the introductory example by implementing the Kalman filter.
In [1]:
using Pkg;Pkg.activate("probprog/workspace/");Pkg.instantiate()
IJulia.clear_output();

In [2]:
using LinearAlgebra, PyPlot
include("scripts/cart_tracking_helpers.jl")

# Specify the model parameters
Δt = 1.0                     # assume the time steps to be equal in size
A = [1.0 Δt;
0.0 1.0]
b = [0.5*Δt^2; Δt]
Σz = convert(Matrix,Diagonal([0.2*Δt; 0.1*Δt])) # process noise covariance
Σx = convert(Matrix,Diagonal([1.0; 2.0]))     # observation noise covariance;

# Generate noisy observations
n = 10                # perform 10 timesteps
z_start = [10.0; 2.0] # initial state
u = 0.2 * ones(n)     # constant input u
noisy_x = generateNoisyMeasurements(z_start, u, A, b, Σz, Σx);

m_z = noisy_x[1]                                    # initial predictive mean
V_z = A * (1e8*Diagonal(I,2) * A') + Σz             # initial predictive covariance

for t = 2:n
global m_z, V_z, m_pred_z, V_pred_z
#predict
m_pred_z = A * m_z + b * u[t]                   # predictive mean
V_pred_z = A * V_z * A' + Σz                    # predictive covariance
#update
gain = V_pred_z * inv(V_pred_z + Σx)            # Kalman gain
m_z = m_pred_z + gain * (noisy_x[t] - m_pred_z) # posterior mean update
V_z = (Diagonal(I,2)-gain)*V_pred_z             # posterior covariance update
end
println("Prediction: ",ProbabilityDistribution(Multivariate,GaussianMeanVariance,m=m_pred_z,v=V_pred_z))
println("Measurement: ", ProbabilityDistribution(Multivariate,GaussianMeanVariance,m=noisy_x[n],v=Σx))
println("Posterior: ", ProbabilityDistribution(Multivariate,GaussianMeanVariance,m=m_z,v=V_z))
plotCartPrediction2(m_pred_z[1], V_pred_z[1], m_z[1], V_z[1], noisy_x[n][1], Σx[1][1]);

Prediction: 𝒩(m=[42.21, 4.51], v=[[1.30, 0.39][0.39, 0.34]])

Measurement: 𝒩(m=[40.88, 5.41], v=[[1.00, 0.00][0.00, 2.00]])

Posterior: 𝒩(m=[41.55, 4.42], v=[[0.55, 0.15][0.15, 0.24]])



### The Cart Tracking Problem Revisited: Inference by Message Passing¶

• Let's solve the cart tracking problem by sum-product message passing in a factor graph like the one depicted above. All we have to do is create factor nodes for the state-transition model $p(z_t|z_{t-1})$ and the observation model $p(x_t|z_t)$. Then we just build the factor graph and let ForneyLab execute the message passing schedule.
• Since the factor graph is just a concatination of $n$ identical "sections" (one for each time step), we only have to specify a single section. When running the message passing algorithm we will explictly use the posterior of the previous timestep as prior in the next one. Let's build a section of the factor graph:
In [3]:
fg = FactorGraph()
z_prev_m = Variable(id=:z_prev_m); placeholder(z_prev_m, :z_prev_m, dims=(2,))
z_prev_v = Variable(id=:z_prev_v); placeholder(z_prev_v, :z_prev_v, dims=(2,2))
bu = Variable(id=:bu); placeholder(bu, :bu, dims=(2,))

@RV z_prev ~ GaussianMeanVariance(z_prev_m, z_prev_v, id=:z_prev) # p(z_prev)
@RV noise_z ~ GaussianMeanVariance(constant(zeros(2), id=:noise_z_m), constant(Σz, id=:noise_z_v)) # process noise
@RV z = constant(A, id=:A) * z_prev + bu + noise_z; z.id = :z # p(z|z_prev) (state transition model)
@RV x ~ GaussianMeanVariance(z, constant(Σx, id=:Σx)) # p(x|z) (observation model)
placeholder(x, :x, dims=(2,));
ForneyLab.draw(fg)


Now that we've built the factor graph, we can perform Kalman filtering by inserting measurement data into the factor graph and performing message passing.

In [4]:
include("scripts/cart_tracking_helpers.jl")
algo = messagePassingAlgorithm(z)
source_code = algorithmSourceCode(algo)
eval(Meta.parse(source_code))
marginals = Dict()
messages = Array{Message}(undef,6)
z_prev_m_0 = noisy_x[1]
z_prev_v_0 = A * (1e8*Diagonal(I,2) * A') + Σz
for t=2:n
data = Dict(:x => noisy_x[t], :bu => b*u[t],:z_prev_m => z_prev_m_0, :z_prev_v => z_prev_v_0)
step!(data, marginals, messages) # perform msg passing (single timestep)
# Posterior of z becomes prior of z in the next timestep:
z_prev_m_0 = ForneyLab.unsafeMean(marginals[:z])
z_prev_v_0 = ForneyLab.unsafeCov(marginals[:z])
end
# Collect prediction p(z[n]|z[n-1]), measurement p(z[n]|x[n]), corrected prediction p(z[n]|z[n-1],x[n])
prediction      = messages[5].dist # the message index is found by manual inspection of the schedule
measurement     = messages[6].dist
corr_prediction = convert(ProbabilityDistribution{Multivariate, GaussianMeanVariance}, marginals[:z])
println("Prediction: ",prediction)
println("Measurement: ",measurement)
println("Posterior: ", corr_prediction)

# Make a fancy plot of the prediction, noisy measurement, and corrected prediction after n timesteps
plotCartPrediction(prediction, measurement, corr_prediction);

Prediction: 𝒩(m=[42.21, 4.51], v=[[1.30, 0.39][0.39, 0.34]])

Measurement: 𝒩(m=[40.88, 5.41], v=[[1.00, 0.00][0.00, 2.00]])

Posterior: 𝒩(m=[41.55, 4.42], v=[[0.55, 0.15][0.15, 0.24]])


• Note that both the analytical Kalman filtering solution and the message passing solution lead to the same results. The advantage of message passing-based inference with ForneyLab is that we did not need to derive any inference equations. ForneyLab took care of all that.

### Recap Dynamical Models¶

• Dynamical systems do not obey the sample-by-sample independence assumption, but still can be specified, and state and parameter estimation equations can be solved by similar tools as for static models.
• Two of the more famous and powerful models with latent states include the hidden Markov model (with discrete states) and the Linear Gaussian dynamical system (with continuous states).
• For the LGDS, the Kalman filter is a well-known recursive state estimation procedure. The Kalman filter can be derived through Bayesian update rules on Gaussian distributions.
• If anything changes in the model, e.g., the state noise is not Gaussian, then you have to re-derive the inference equations again from scratch and it may not lead to an analytically pleasing answer.
• $\Rightarrow$ Generally, we will want to automate the inference processes. As we discussed, message passing in a factor graph is a visually appealing method to automate inference processes. We showed how Kalman filtering emerged naturally by automated message passing.

## OPTIONAL SLIDES¶

### Proof of Kalman filtering equations including evidence updating¶

• Now let's proof the Kalman filtering equations including evidence updating by probabilistic calculus: \begin{align*} \overbrace{p(z_t\,|\,x^t)}^{\text{posterior}} &= p(z_t\,|\,x_t,x^{t-1}) \\ &= \frac{p(x_t,z_t\,|\,x^{t-1})}{p(x_t\,|\,x^{t-1})} \\ &= \frac{p(x_t\,|\,z_t) \,p(z_t\,|\,x^{t-1})}{p(x_t\,|\,x^{t-1})} \\ &= \frac{p(x_t\,|\,z_t) \, \sum_{z_{t-1}} p(z_t,z_{t-1}\,|\,x^{t-1}) }{p(x_t\,|\,x^{t-1})} \\ &= \frac{p(x_t\,|\,z_t) \, \sum_{z_{t-1}} p(z_t\,|\,z_{t-1}) \, p(z_{t-1}\,|\,x^{t-1})}{p(x_t\,|\,x^{t-1})} \\ &= \frac{\overbrace{\mathcal{N}(x_t|c z_t, \sigma_x^2)}^{\text{likelihood}} \, \sum_{z_{t-1}} \overbrace{\mathcal{N}(z_t\,|\,a z_{t-1},\sigma_z^2)}^{\text{state transition}} \, \overbrace{\mathcal{N}(z_{t-1} \,|\, \mu_{t-1}, \sigma_{t-1}^2) }^{\text{prior}} }{\underbrace{p(x_t\,|\,x^{t-1})}_{\text{evidence}}} \end{align*}

• The posterior $p(z_t\,|\,x^t)$ is proportional to the numerator, which by making use of the renormalization equality \begin{align} \mathcal{N}(x\,|\,cz,\sigma^2) = \frac{1}{c}\cdot \mathcal{N}\left(z \,\big|\,\frac{x}{c},\left(\frac{\sigma}{c}\right)^2\right) \tag{renormalization}\,, \end{align} can be evaluated with Gaussian multiplication rules:

\begin{align*} \mathcal{N}(x_t|c z_t, &\sigma_x^2) \, \sum_{z_{t-1}} \underbrace{\mathcal{N}(z_t\,|\,a z_{t-1},\sigma_z^2)}_{\text{use renormalization}} \mathcal{N}(z_{t-1} \,|\, \mu_{t-1}, \sigma_{t-1}^2) \\ &= \mathcal{N}(x_t|c z_t, \sigma_x^2) \, \sum_{z_{t-1}} \frac{1}{a}\underbrace{\mathcal{N}\left(z_{t-1}\bigm| \frac{z_t}{a},\left(\frac{\sigma_z}{a}\right)^2 \right) \mathcal{N}(z_{t-1} \,|\, \mu_{t-1}, \sigma_{t-1}^2)}_{\text{use Gaussian multiplication formula SRG-6}} \\ &= \frac{1}{a} \mathcal{N}(x_t|c z_t, \sigma_x^2) \, \sum_{z_{t-1}} \underbrace{\mathcal{N}\left(\mu_{t-1}\bigm| \frac{z_t}{a},\left(\frac{\sigma_z}{a}\right)^2 + \sigma_{t-1}^2 \right)}_{\text{not a function of }z_{t-1}} \underbrace{\mathcal{N}(z_{t-1} \,|\, \cdot, \cdot)}_{\text{sums to }1} \\ &= \frac{1}{a} \underbrace{\mathcal{N}(x_t|c z_t, \sigma_x^2)}_{\text{use renormalization rule}} \, \underbrace{\mathcal{N}\left(\mu_{t-1}\bigm| \frac{z_t}{a},\left(\frac{\sigma_z}{a}\right)^2 + \sigma_{t-1}^2 \right)}_{\text{use renormalization rule}} \\ &= \frac{1}{c} \underbrace{\mathcal{N}\left(z_t \bigm| \frac{x_t}{c}, \left( \frac{\sigma_x}{c}\right)^2 \right) \mathcal{N}\left(z_t\, \bigm|\,a \mu_{t-1},\sigma_z^2 + \left(a \sigma_{t-1}\right)^2 \right)}_{\text{use SRG-6 again}} \\ &= \underbrace{\frac{1}{c} \mathcal{N}\left( \frac{x_t}{c} \bigm| a \mu_{t-1}, \left( \frac{\sigma_x}{c}\right)^2+ \sigma_z^2 + \left(a \sigma_{t-1}\right)^2\right)}_{\text{use renormalization}} \, \mathcal{N}\left( z_t \,|\, \mu_t, \sigma_t^2\right)\\ &= \underbrace{\mathcal{N}\left(x_t \,|\, ca \mu_{t-1}, \sigma_x^2 + c^2(\sigma_z^2+a^2\sigma_{t-1}^2) \right)}_{\text{evidence } p(x_t|x^{t-1})} \underbrace{\mathcal{N}\left( z_t \,|\, \mu_t, \sigma_t^2\right)}_{\text{posterior }p(z_t|x^t) } \end{align*}

with \begin{align*} \rho_t^2 &= a^2 \sigma_{t-1}^2 + \sigma_z^2 \tag{predicted variance}\\ K_t &= \frac{c \rho_t^2}{c^2 \rho_t^2 + \sigma_x^2} \tag{Kalman gain} \\ \mu_t &= \underbrace{a \mu_{t-1}}_{\text{prior prediction}} + K_t \cdot \underbrace{\left( x_t - c a \mu_{t-1}\right)}_{\text{prediction error}} \tag{posterior mean}\\ \sigma_t^2 &= \left( 1 - c\cdot K_t \right) \rho_t^2 \tag{posterior variance} \end{align*}

### Extensions of Generative Gaussian Models¶

• Using the methods of the previous lessons, it is possible to create your own new models based on stacking Gaussian and categorical distributions in new ways:

In [5]:
open("../../styles/aipstyle.html") do f display("text/html", read(f, String)) end

In [ ]: