Year: 2022-2023
# Enter your name and student ID
name =
ID =
In this assignment, we will go through the Bayesian model design cycle:
In questions 1 and 2, you will build a simple model, fit it to data and evaluate its performance on future data. You will find that its performance is not great. In question 3, you will improve the model in multiple ways. Finally, in question 4, you will do model selection based on free energy.
The final questions will require knowledge from the last probabilistic programming session. But questions 1 and 2 can be done relatively early in the course.
using Pkg
Pkg.activate(".")
Pkg.instantiate();
using CSV
using DataFrames
using LinearAlgebra
using ProgressMeter
using RxInfer
using Plots
default(label="",
grid=false,
linewidth=3,
markersize=4,
guidefontsize=12,
margins=15Plots.pt)
Many Europeans suspect that the air quality in their city is declining. A recent study measured the air quality of a major city in northern Italy using an electronic nose. The measurements were taken in the city centre and reflect urban activity. We will inspect the measured chemical concentrations and build a model to accurately predict CO concentrations at future time points.
Photograph by Claudio Furlan/LaPresse/Zuma Press/Rex/Shutterstock
The data can be found here: https://archive.ics.uci.edu/ml/datasets/Air+Quality. I've done some pre-processing and selected the most important features. In this assignment we will infer parameters in a model of the data and predict air quality in the future. For that purpose, the data has been split into past and future.
# Load training data
past_data = DataFrame(CSV.File("data/airquality_past.csv"))
| Row | time (DateTime) | CO (Float64) |
|---|---|---|
| 1 | 2004-03-10T18:00:00 | 1360.0 |
| 2 | 2004-03-10T19:00:00 | 1292.0 |
| 3 | 2004-03-10T20:00:00 | 1402.0 |
| 4 | 2004-03-10T21:00:00 | 1376.0 |
| 5 | 2004-03-10T22:00:00 | 1272.0 |
| 6 | 2004-03-10T23:00:00 | 1197.0 |
| 7 | 2004-03-11T00:00:00 | 1185.0 |
| 8 | 2004-03-11T01:00:00 | 1136.0 |
| 9 | 2004-03-11T02:00:00 | 1094.0 |
| 10 | 2004-03-11T03:00:00 | 1010.0 |
| 11 | 2004-03-11T04:00:00 | 1011.0 |
| 12 | 2004-03-11T05:00:00 | 1066.0 |
| 13 | 2004-03-11T06:00:00 | 1052.0 |
| ⋮ | ⋮ | ⋮ |
| 189 | 2004-03-18T14:00:00 | 1379.0 |
| 190 | 2004-03-18T15:00:00 | 1322.0 |
| 191 | 2004-03-18T16:00:00 | 1496.0 |
| 192 | 2004-03-18T17:00:00 | 1409.0 |
| 193 | 2004-03-18T18:00:00 | 1513.0 |
| 194 | 2004-03-18T19:00:00 | 1667.0 |
| 195 | 2004-03-18T20:00:00 | 1667.0 |
| 196 | 2004-03-18T21:00:00 | 1430.0 |
| 197 | 2004-03-18T22:00:00 | 1333.0 |
| 198 | 2004-03-18T23:00:00 | 1262.0 |
| 199 | 2004-03-19T00:00:00 | 1287.0 |
| 200 | 2004-03-19T01:00:00 | 1134.0 |
Let's visualize the carbon monoxide measurements over time.
scatter(past_data[:,1],
past_data[:,2],
size=(900,300),
color="black",
xlabel="time",
ylabel="CO (ppm)",
ylims=[400,2000])
We suspect that there is a temporal dependence in this dataset. In other words, the data changes relatively slowly over time and neighbouring data points end up being highly correlated. To exploit this correlation, we will build an auto-regressive model of the form:
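Before committing to this model, we can check the suspicion directly. A minimal sketch (not part of the assignment; `Statistics` ships with Julia) computes the lag-1 sample autocorrelation of the CO series, where values near 1 indicate strong dependence between neighbouring hours:

using Statistics

# Correlation between the CO series and itself shifted by one hour
co = past_data[:,2]
println("lag-1 autocorrelation: ", round(cor(co[1:end-1], co[2:end]), digits=3))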
$$ y_k = \theta y_{k-1} + \epsilon_k \, , $$

where the noise $\epsilon_k$ is drawn from a zero-mean Gaussian with precision parameter $\tau$:

$$ \epsilon_k \sim \mathcal{N}(0, \tau^{-1}) \, . $$

Tasks:

- Specify this auto-regressive model in RxInfer, with a Gaussian prior over the coefficient $\theta$ and the noise precision fixed to $\tau = 1$.
- Infer a posterior distribution over $\theta$ and plot it.
### YOUR CODE HERE
# Number of data points
T = size(past_data,1);
# Prepare data
y_k = [past_data[i,2] for i in 2:T]
y_prev = [past_data[i,2] for i in 1:(T-1)];
# Specify parameters of prior distributions
prior_params = Dict(:θ => (1.0, 100.0))
@model function auto_regression1(prior_params; num_samples=1)
    x = datavar(Float64, num_samples)
    y = datavar(Float64, num_samples)
    # Prior coefficients
    θ ~ Normal(mean = prior_params[:θ][1], variance = prior_params[:θ][2])
    for i = 1:num_samples
        # Likelihood
        y[i] ~ Normal(mean = θ*x[i], precision = 1.0)
    end
    return y, x, θ
end
results = inference(
model = auto_regression1(prior_params, num_samples=T-1),
data = (x = y_prev, y = y_k,),
free_energy = true,
)
# Plot posterior
plot(0.0:1e-3:2.0, x -> pdf(results.posteriors[:θ], x), xlabel="θ", ylabel="p(θ|y)", size=(900,300))
###
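As a sanity check (not part of the assignment), this posterior is also available in closed form. With prior $p(\theta) = \mathcal{N}(m_0, v_0)$ and unit noise precision, the standard Bayesian linear regression result gives

$$ p(\theta \mid \mathcal{D}) = \mathcal{N}(\theta \mid m_N, v_N) \, , \qquad v_N = \Big( v_0^{-1} + \sum_{k=2}^{T} y_{k-1}^2 \Big)^{-1} \, , \qquad m_N = v_N \Big( v_0^{-1} m_0 + \sum_{k=2}^{T} y_{k-1} y_k \Big) \, , $$

which can be compared against `mean_var(results.posteriors[:θ])`.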
We want to evaluate the parameters inferred under the model. For now, we will do this by visually inspecting the 1-step ahead predictions on our data set. Later, we will use free energy as a metric.
The posterior predictive distribution for the next time step is:

$$ p(y_{t+1} \mid y_{t}, \mathcal{D}) = \int p(y_{t+1} \mid \theta, y_{t}) \, p(\theta \mid \mathcal{D}) \, \mathrm{d}\theta \, , $$

where $\mathcal{D}$ refers to the data used to infer the posterior distribution. To make 1-step ahead predictions, you will have to loop over the data (i.e., `for t in 1:T`), plug in the current data point and compute the parameters of the posterior predictive distribution for the next data point. You may start from $t=2$, using $y_1$ as the initial "previous observation".
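Since the likelihood is Gaussian with unit precision and the posterior $q(\theta) = \mathcal{N}(m_\theta, v_\theta)$ is Gaussian, this integral has a closed-form solution:

$$ p(y_{t+1} \mid y_t, \mathcal{D}) = \mathcal{N}\big(y_{t+1} \mid m_\theta \, y_t, \; y_t^2 \, v_\theta + 1 \big) \, . $$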
Tasks:

- Loop over the data and compute the parameters of the 1-step ahead posterior predictive distribution at each time point.
- Plot the predictive means with a standard-deviation ribbon (line plot with `ribbon=`) along with $y_{2:11}$ (scatter plot).

Note that if you failed to infer a posterior distribution in the previous question, you can still answer this question using a standard normal, $p(\theta) = \mathcal{N}(0,1)$.
### YOUR CODE HERE
# Preallocate predictions
sim_mean = zeros(T)
sim_vars = zeros(T)
# Initialize previous observation for the recursion
y_prev = past_data[1,2]
# Extract parameters of posterior
mθ, vθ = mean_var(results.posteriors[:θ])
for t = 2:T
    # Compute parameters of the posterior predictive for time t
    sim_mean[t] = mθ*y_prev
    sim_vars[t] = y_prev*vθ*y_prev + 1.0
    # Update previous observation
    y_prev = past_data[t,2]
end
scatter(past_data[2:11,1], past_data[2:11,2], color="black", label="data", xlabel="time", ylabel="CO (ppm)")
plot!(past_data[2:11,1], sim_mean[2:11], ribbon=sqrt.(sim_vars[2:11]), color="blue", size=(900,300), label="predictions")
###
From the results of the previous question, you may conclude that our initial model isn't great: it only considers extremely short-term changes, which are highly affected by noise.
The model can be improved in two ways:

1. Increase the model's memory, regressing on the $M$ previous observations instead of only the most recent one: $$ y_k = \sum_{m=1}^{M} \theta_m y_{k-m} + \epsilon_k \, , $$ where $M$ refers to model order.
2. Treat the noise precision $\tau$ as an unknown parameter with a prior distribution, instead of fixing it to $1$.
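In the model specification below, this is implemented in vector form: the $M$ previous observations are collected in a regressor $x_k = (y_{k-1}, \ldots, y_{k-M})^\top$, so the likelihood and priors read

$$ y_k \mid \theta, \tau \sim \mathcal{N}(\theta^\top x_k, \tau^{-1}) \, , \qquad \theta \sim \mathcal{N}(m_0, S_0) \, , \qquad \tau \sim \Gamma(\alpha_0, \beta_0) \, , $$

where the Gamma prior is parameterized by shape $\alpha_0$ and rate $\beta_0$.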
Tasks:

- Extend the model to order $M$, endowing the coefficient vector $\theta$ with a multivariate Gaussian prior and the noise precision $\tau$ with a Gamma prior.
- Infer posteriors for $\theta$ and $\tau$ with variational inference, using the mean-field factorization $q(\theta, \tau) = q(\theta)\, q(\tau)$.
- Visualize the posterior for $\tau$ and the 1-step ahead predictions with a standard-deviation ribbon.
# Number of iterations of variational inference
n_iters = 10;
# Model order
M = 3;
### YOUR CODE HERE
# Number of data points
T = size(past_data,1);
# Prepare data
y_k = [past_data[i,2] for i in (M+1):T]
y_prev = [past_data[i-1:-1:i-M,2] for i in (M+1):T]
# Specify parameters of prior distributions
prior_params = Dict(:θ => (exp.(-(1:M)), diagm(ones(M))),
:τ => (1.0, 1e3))
@model function auto_regression2(prior_params; model_order=2, num_samples=1)
    x = datavar(Vector{Float64}, num_samples)
    y = datavar(Float64, num_samples)
    # Prior coefficients
    θ ~ MvNormal(mean = prior_params[:θ][1], covariance = prior_params[:θ][2])
    # Prior noise precision
    τ ~ Gamma(shape = prior_params[:τ][1], rate = prior_params[:τ][2])
    for i = 1:num_samples
        # Likelihood
        y[i] ~ Normal(mean = dot(θ,x[i]), precision = τ)
    end
    return y, x, θ, τ
end
constraints = @constraints begin
    q(θ,τ) = q(θ)q(τ)
end
results = inference(
    model = auto_regression2(prior_params, model_order=M, num_samples=T-M),
    data = (x = y_prev, y = y_k,),
    constraints = constraints,
    # Distributions' Gamma is parameterized by (shape, scale), while the prior
    # above is (shape, rate), so invert the rate when initializing the marginal
    initmarginals = (θ = MvNormal(prior_params[:θ]...), τ = Gamma(prior_params[:τ][1], inv(prior_params[:τ][2]))),
    returnvars = (θ = KeepLast(), τ = KeepLast(),),
    iterations = n_iters,
    free_energy = true,
    showprogress = true,
)
# Posterior for noise precision
p1 = plot(1e-5:1e-6:5e-4, x -> exp(logpdf(results.posteriors[:τ], x)), xlabel="τ", ylabel="q(τ)")
# Preallocate predictions
sim_mean = zeros(T)
sim_vars = zeros(T)
# Initialize recursive previous
y_prev = past_data[M:-1:1,2]
# Extract parameters of posterior
mθ, Sθ = mean_cov(results.posteriors[:θ])
mτ = mode(results.posteriors[:τ])
for t = (M+1):T
    # Compute parameters of the posterior predictive for time t
    sim_mean[t] = mθ'*y_prev
    sim_vars[t] = y_prev'*Sθ*y_prev + inv(mτ)
    # Update recursion
    y_prev = past_data[t:-1:t-M+1,2]
end
p2 = scatter(past_data[(M+1):T,1], past_data[(M+1):T,2], color="black", label="data", xlabel="time", ylabel="CO (ppm)")
plot!(past_data[(M+1):T,1], sim_mean[(M+1):T], ribbon=sqrt.(sim_vars[(M+1):T]), color="blue", size=(900,300), label="predicted")
plot(p1, p2, layout=(2,1), size=(900,600))
###
We now essentially have a different model for each value of $M$. Which one is best? Since the free energy upper-bounds the negative log-evidence, $F[q] \geq -\log p(\mathcal{D} \mid M)$, the model order with the lowest final free energy is (approximately) the one best supported by the data.
Tasks:

- For each model order in `model_orders`, infer posteriors and track the free energy over the iterations of variational inference.
- Report the model order with the lowest final free energy and plot the free energy traces.
# Model order range
model_orders = [2,4,8,16,32];
##### YOUR CODE HERE
# Preallocate free energy
FE = zeros(length(model_orders), n_iters)
# Preallocate list of posteriors
posts = Vector(undef, length(model_orders))
# Loop over model orders
@showprogress for (j,m) in enumerate(model_orders)
    # Prepare data
    y_k = [past_data[i,2] for i in (m+1):T]
    y_prev = [past_data[i-1:-1:i-m,2] for i = (m+1):T]
    # Specify parameters of prior distributions
    prior_params = Dict(:θ => (exp.(-(1:m)), diagm(ones(m))),
                        :τ => (1.0, 1e3))
    results = inference(
        model = auto_regression2(prior_params, model_order=m, num_samples=T-m),
        data = (x = y_prev, y = y_k,),
        constraints = constraints,
        # Distributions' Gamma is (shape, scale); the prior is (shape, rate)
        initmarginals = (θ = MvNormal(prior_params[:θ]...), τ = Gamma(prior_params[:τ][1], inv(prior_params[:τ][2]))),
        returnvars = (θ = KeepLast(), τ = KeepLast(),),
        iterations = n_iters,
        free_energy = true,
        showprogress = true,
    )
    # Store the free energy trace for this model order
    FE[j,:] = results.free_energy
    # Save posteriors
    posts[j] = results.posteriors
end
# Extract best model and report
M_best = model_orders[argmin(FE[:,end])]
println("FE indicates M = "*string(M_best)*" is optimal.")
# Visualize free energies
plot(1:n_iters,
     FE',
     xlabel="Iterations of variational inference",
     ylabel="Free energy F[q]",
     labels=permutedims(["M = $m" for m in model_orders]),
     size=(900,300))
#####
FE indicates M = 32 is optimal.
# Load test data
future_data = DataFrame(CSV.File("data/airquality_future.csv"))
| Row | time (DateTime) | CO (Float64) |
|---|---|---|
| 1 | 2004-03-19T02:00:00 | 999.0 |
| 2 | 2004-03-19T03:00:00 | 961.0 |
| 3 | 2004-03-19T04:00:00 | 934.0 |
| 4 | 2004-03-19T05:00:00 | 913.0 |
| 5 | 2004-03-19T06:00:00 | 969.0 |
| 6 | 2004-03-19T07:00:00 | 1182.0 |
| 7 | 2004-03-19T08:00:00 | 1740.0 |
| 8 | 2004-03-19T09:00:00 | 1819.0 |
| 9 | 2004-03-19T10:00:00 | 1427.0 |
| 10 | 2004-03-19T11:00:00 | 1390.0 |
| 11 | 2004-03-19T12:00:00 | 1283.0 |
| 12 | 2004-03-19T13:00:00 | 1304.0 |
| 13 | 2004-03-19T14:00:00 | 1364.0 |
| ⋮ | ⋮ | ⋮ |
| 289 | 2004-03-31T02:00:00 | 972.0 |
| 290 | 2004-03-31T03:00:00 | 832.0 |
| 291 | 2004-03-31T04:00:00 | 851.0 |
| 292 | 2004-03-31T05:00:00 | 909.0 |
| 293 | 2004-03-31T06:00:00 | 1038.0 |
| 294 | 2004-03-31T07:00:00 | 1636.0 |
| 295 | 2004-03-31T08:00:00 | 1580.0 |
| 296 | 2004-03-31T09:00:00 | 1262.0 |
| 297 | 2004-03-31T10:00:00 | 1197.0 |
| 298 | 2004-03-31T11:00:00 | 1277.0 |
| 299 | 2004-03-31T12:00:00 | 1430.0 |
| 300 | 2004-03-31T13:00:00 | 1242.0 |
### YOUR CODE HERE
# Select posterior
best_post = posts[argmin(FE[:,end])]
# Extract parameters of selected posterior
mθ, Sθ = mean_cov(best_post[:θ])
mτ = mode(best_post[:τ])
T = size(future_data,1)
# Preallocate predictions
sim_mean = zeros(T)
sim_vars = zeros(T)
# Initialize previous observations for the recursion (most recent first)
y_prev = future_data[M_best:-1:1,2]
for t = (M_best+1):T
    # Compute parameters of the posterior predictive for time t
    sim_mean[t] = mθ'*y_prev
    sim_vars[t] = y_prev'*Sθ*y_prev + inv(mτ)
    # Update recursion
    y_prev = future_data[t:-1:t-M_best+1,2]
end
ix = (M_best+1):T
scatter(future_data[ix,1], future_data[ix,2], color="black", label="data", xlabel="time", ylabel="CO (ppm)")
plot!(future_data[ix,1], sim_mean[ix], ribbon=sqrt.(sim_vars[ix]), color="blue", size=(900,300), label="predictions")
###
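Beyond visual inspection, a quick numeric summary (a sketch, not part of the original tasks; it reuses `ix`, `sim_mean` and `future_data` from the cell above) is the root-mean-squared error of the 1-step ahead predictive means:

using Statistics

# RMSE of the predictive means against the observed future CO values
rmse = sqrt(mean((future_data[ix,2] .- sim_mean[ix]).^2))
println("1-step ahead RMSE on future data: ", round(rmse, digits=1))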