using Pkg; Pkg.activate("../../."); Pkg.instantiate();
using IJulia; try IJulia.clear_output(); catch _ end
In 2008, the credit crisis sparked a recession in the US, which spread to other countries in the ensuing years. It took most countries a couple of years to recover. Now, the year is 2011. The Turkish government is asking you to estimate whether Turkey is out of the recession. You decide to look at the data of the national stock exchange to see if there's a positive trend.
using CSV
using DataFrames
using LinearAlgebra
using Distributions
using StatsFuns
using RxInfer
using Plots
default(label="", margin=10Plots.pt)
We start by loading the data set. It contains daily measurements of the Istanbul Stock Exchange (ISE) from the 5th of January 2009 until the 22nd of February 2011. The data set comes from an online resource for machine learning data sets: the UCI ML Repository.
# Read CSV file
df = DataFrame(CSV.File("../datasets/stock_exchange.csv"))
Row | date | ISE |
---|---|---|
 | String15 | Float64
1 | 5-Jan-09 | 3.57537 |
2 | 6-Jan-09 | 2.54259 |
3 | 7-Jan-09 | -2.88617 |
4 | 8-Jan-09 | -6.22081 |
5 | 9-Jan-09 | 0.98599 |
6 | 12-Jan-09 | -2.9191 |
7 | 13-Jan-09 | 1.54453 |
8 | 14-Jan-09 | -4.11676 |
9 | 15-Jan-09 | 0.0661905 |
10 | 16-Jan-09 | 2.20373 |
11 | 19-Jan-09 | -2.26925 |
12 | 20-Jan-09 | -1.37087 |
13 | 21-Jan-09 | 0.0864697 |
⋮ | ⋮ | ⋮ |
240 | 17-Dec-09 | -1.69489 |
241 | 18-Dec-09 | 0.350124 |
242 | 21-Dec-09 | 2.25302 |
243 | 22-Dec-09 | 0.48947 |
244 | 23-Dec-09 | -0.721131 |
245 | 24-Dec-09 | 0.581665 |
246 | 25-Dec-09 | 0.389112 |
247 | 28-Dec-09 | -0.0811768 |
248 | 29-Dec-09 | 0.322285 |
249 | 30-Dec-09 | -0.227405 |
250 | 31-Dec-09 | 2.21384 |
251 | 4-Jan-10 | 1.02294 |
We can plot the evolution of the stock market values over time.
# Count number of samples
time_period = 1:251
num_samples = length(time_period)
# Extract columns
dates_num = 1:num_samples
dates_str = df[time_period,1]
stock_val = df[time_period,2]
# Set xticks
xtick_points = Int64.(round.(range(1, stop=num_samples, length=5)))
# Scatter exchange levels
scatter(dates_num,
        stock_val,
        color="black",
        label="",
        ylabel="Stock Market Levels",
        xlabel="time (days)",
        xticks=(xtick_points, [dates_str[i] for i in xtick_points]),
        size=(800,300))
We have a date $x_i \in \mathbb{R}$, referred to as a "covariate" or input variable, and the value of the stock exchange at that time point $y_i \in \mathbb{R}$, referred to as a "response" or output variable. A regression model has parameters $\theta$, used to predict $y = (y_1, \dots, y_N)$ from $x = (x_1, \dots, x_N)$. We are looking for a posterior distribution for the parameters $\theta$:
$$\underbrace{p(\theta \mid y, x)}_{\text{posterior}} \propto\ \underbrace{p(y \mid x, \theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}}$$

We assume each observation $y_i$ is generated via:

$$ y_i = f_\theta(x_i) + e_i \, ,$$

where $e_i$ is white noise, $e_i \sim \mathcal{N}(0, \sigma^2_y)$, and the regression function $f_\theta$ is linear: $f_\theta(x) = x \theta_1 + \theta_2$. The parameters consist of a slope coefficient $\theta_1$ and an intercept $\theta_2$, which are collected in the vector $\theta = \begin{bmatrix}\theta_1 \\ \theta_2 \end{bmatrix}$. In practice, we augment the data point $x$ with a 1, i.e., $\begin{bmatrix}x \\ 1 \end{bmatrix}$, so that we may define $f_\theta(x) = \theta^{\top}x$.

If we integrate out the noise $e$, then we obtain a Gaussian likelihood function centered on $f_\theta(x)$ with variance $\sigma^2_y$:

$$p(y_i \mid x_i, \theta) = \mathcal{N}(y_i \mid f_\theta(x_i), \sigma^2_y) \, .$$

But this is just a single sample, and we have an entire data set. The likelihood of all $(x,y)$ is:

$$\begin{aligned} p(y \mid x, \theta) &= \prod_{i=1}^{N} p(y_i \mid x_i, \theta) \\ &= \prod_{i=1}^{N} \mathcal{N}(y_i \mid f_{\theta}(x_i), \sigma^2_y) \, . \end{aligned}$$

We know that the weights are real numbers and that they can be negative. That motivates us to use a Gaussian prior:

$$ p(\theta) = \mathcal{N}(\theta \mid \mu_\theta, \Sigma_\theta) \, .$$

We can specify these equations almost directly in our PPL.
@model function linear_regression(y, X, μ_θ, Σ_θ, σ2, N)
    "Bayesian linear regression"
    # Prior distribution of coefficients
    θ ~ MvNormalMeanCovariance(μ_θ, Σ_θ)
    for i = 1:N
        # Likelihood of i-th sample
        y[i] ~ NormalMeanVariance(dot(θ, X[i]), σ2)
    end
end
# Prior parameters
μ_θ, Σ_θ = (zeros(2), diagm(ones(2)))
# Likelihood variance
σ2_y = var(stock_val)
3.2657337986971937
Now that we have our model, it is time to infer parameters.
results = infer(
    model = linear_regression(μ_θ=μ_θ, Σ_θ=Σ_θ, σ2=σ2_y, N=num_samples),
    data = (y = stock_val, X = [[dates_num[i], 1.0] for i in 1:num_samples]),
)
Inference results: Posteriors | available for (θ)
Let's visualize the resulting posterior.
# Extract posterior weights
post_θ = results.posteriors[:θ]
# Define ranges for plot
x1 = range(-1., length=501, stop=1.)
x2 = range(-1., length=501, stop=1.)
# Draw contour plots of distributions
prior_θ = MvNormal(μ_θ, Σ_θ)
p1a = contour(x1, x2, (x1,x2) -> pdf(prior_θ, [x1,x2]), levels=3, xlabel="θ1", ylabel="θ2", title="prior")
p1b = contour(x1, x2, (x1,x2) -> pdf(post_θ, [x1,x2]), levels=3, xlabel="θ1", title="posterior")
plot(p1a, p1b, size=(900,300))
The posterior has become quite sharply peaked in a small area of parameter space.
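Because the Gaussian prior and Gaussian likelihood form a conjugate pair, this posterior is also available in closed form, which gives a useful sanity check on the toolbox output. Below is a minimal sketch, assuming a design matrix whose rows are the augmented inputs $[x_i, 1]$ (the names `X_design`, `Σ_post` and `μ_post` are ours, purely for illustration):
# Closed-form conjugate posterior for Bayesian linear regression (sanity check)
X_design = hcat(dates_num, ones(num_samples))                       # N×2 design matrix with rows [xᵢ, 1]
Σ_post = inv(inv(Σ_θ) + X_design' * X_design ./ σ2_y)               # posterior covariance
μ_post = Σ_post * (inv(Σ_θ) * μ_θ + X_design' * stock_val ./ σ2_y)  # posterior mean
println("Closed-form posterior mean: ", μ_post)
println("RxInfer posterior mean:     ", mean(post_θ))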
We can extract the MAP point estimate to compute and visualize the most probable regression function $f_\theta$.
# Extract estimated weights
θ_MAP = mode(post_θ)
θ_Σ = cov(post_θ)

# Report results
println("Slope coefficient = "*string(θ_MAP[1]))
println("Intercept coefficient = "*string(θ_MAP[2]))

# Make predictions
regression_estimated = zeros(num_samples)
regression_uncertainty = zeros(num_samples)
for (i, date) in enumerate(dates_num)
    regression_estimated[i] = date * θ_MAP[1] + θ_MAP[2]
    regression_uncertainty[i] = [date, 1.]' * θ_Σ * [date, 1.] + σ2_y
end
Slope coefficient = 0.000941327449846718
Intercept coefficient = 0.15081258570809555
Let's visualize the estimated regression function.
# Visualize observations
scatter(dates_num, stock_val, color="black", xticks=(xtick_points, [dates_str[i] for i in xtick_points]), label="observations", legend=:topleft)
# Overlay regression function
plot!(dates_num, regression_estimated, ribbon=regression_uncertainty, color="blue", label="regression", size=(800,300))
The slope coefficient $\theta_1$ is nearly zero and the plot shows an essentially horizontal line. So the ISE did not experience a decline, but it also did not grow in 2009.
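We can make this statement more precise by computing the posterior probability that the slope is positive. A small sketch, using the fact that the marginal posterior of $\theta_1$ is Gaussian (its mean and variance were extracted above):
# Posterior probability that the slope coefficient θ1 is positive
slope_marginal = Normal(θ_MAP[1], sqrt(θ_Σ[1,1]))
println("p(θ1 > 0 | data) ≈ ", 1 - cdf(slope_marginal, 0.0))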
We've answered the Turkish government's question. But before they change their country's economic policy, they want to know what the future looks like if the stock market continues on this trend. In other words, they want us to make predictions for future outputs given our current regression coefficient estimates.
We can request the toolbox to make predictions for future outputs by providing missing data points.
future = DataFrame(CSV.File("../datasets/stock_futures.csv"))
Row | date | ISE |
---|---|---|
 | String15 | Missing
1 | 5-Jan-10 | missing |
2 | 6-Jan-10 | missing |
3 | 7-Jan-10 | missing |
4 | 8-Jan-10 | missing |
5 | 11-Jan-10 | missing |
6 | 12-Jan-10 | missing |
7 | 13-Jan-10 | missing |
8 | 14-Jan-10 | missing |
9 | 15-Jan-10 | missing |
10 | 18-Jan-10 | missing |
11 | 19-Jan-10 | missing |
12 | 20-Jan-10 | missing |
13 | 21-Jan-10 | missing |
⋮ | ⋮ | ⋮ |
240 | 21-Dec-10 | missing |
241 | 22-Dec-10 | missing |
242 | 23-Dec-10 | missing |
243 | 24-Dec-10 | missing |
244 | 27-Dec-10 | missing |
245 | 28-Dec-10 | missing |
246 | 29-Dec-10 | missing |
247 | 30-Dec-10 | missing |
248 | 31-Dec-10 | missing |
249 | 3-Jan-11 | missing |
250 | 4-Jan-11 | missing |
251 | 5-Jan-11 | missing |
# Combine the observed data with the future dates (missing responses)
all_N = num_samples + length(future[!,:date])
all_d = [df[!,:date]; future[!,:date]]
all_y = [stock_val; future[!,:ISE]]
all_X = [[i, 1.0] for i in 1:all_N]
results = infer(
    model = linear_regression(μ_θ=μ_θ, Σ_θ=Σ_θ, σ2=σ2_y, N=all_N),
    data = (y = all_y, X = all_X),
    predictvars = (y = KeepLast(),),
)
Inference results: Posteriors | available for (θ) Predictions | available for (y)
regression_estimated = mode.(results.predictions[:y])
regression_uncertainty = 2std.(results.predictions[:y]) # 2 standard deviations
final_prediction = regression_estimated[end]
final_uncertainty = regression_uncertainty[end]
println("Final predicted stock value = $final_prediction (±$final_uncertainty)")
Final predicted stock value = 0.6233589655311479 (±3.6142682793047856)
Although the prediction itself is positive, its uncertainty tells us that we should not trust this result too much. Let's see what this looks like in a plot.
# Visualize observations
plot(xlabel="time", xticks=(1:50:all_N, [all_d[i] for i in 1:50:all_N]), size=(800,300))
scatter!(dates_num, stock_val, color="black", label="observations")
vline!([251], color="green", linewidth=5, linestyle=:dash)
# Overlay regression function
plot!(1:all_N, regression_estimated, ribbon=regression_uncertainty, color="blue", label="regression")
The ribbon is relatively wide. This tells us that the future looks uncertain for the stock market.
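To quantify that impression, we can compute the probability that the final predicted value is positive under its Gaussian predictive marginal. A minimal sketch (the name `final_marginal` is ours, for illustration):
# Probability that the stock value at the final predicted date is positive
final_marginal = results.predictions[:y][end]
println("p(final value > 0) ≈ ", 1 - cdf(Normal(mean(final_marginal), std(final_marginal)), 0.0))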
A company is trying to develop a measurement device that tells a patient whether they suffer from Chronic Obstructive Pulmonary Disease (COPD). They believe they can detect certain compounds in a saliva sample that indicate the presence of COPD. To test the device, they collect data from both COPD patients and healthy controls. They train a classifier and make predictions on a test sample. If the diagnosis can be accurately predicted, then the new device is deemed an informative tool and will be brought to market.
The data set comes from the UCI ML Repository. It contains measurements of 79 participants, split into 40 samples for training and 39 for testing. The columns in the data file marked :x are biometric features (x1, x2 = measured signal, x3 = gender, x4 = age, x5 = smoking) and the final column marked :y is the diagnosis (healthy control = 0, COPD = 1).
# Read CSV file
train_data = DataFrame(CSV.File("../datasets/diagnosis_train.csv"))
# Split dataframe into features and labels
features_train = Matrix(train_data[:,1:5])
labels_train = Vector(train_data[:,6])
# Store number of features
num_features = size(features_train,2)
# Number of training samples
num_train = size(features_train,1);
Let's have a look at measurements from the device.
plot(xlabel="feature1", ylabel="feature2", size=(800,400))
scatter!(features_train[labels_train .== 0, 1], features_train[labels_train .== 0, 2], color="blue", label="Healthy Controls")
scatter!(features_train[labels_train .== 1, 1], features_train[labels_train .== 1, 2], color="red", label="COPD patients")
Mmhh, it doesn't look like the samples can easily be separated. Are these measurements really informative?
We have features $X$, labels $Y$ and parameters $\theta$. Same as with regression, we are looking for a posterior distribution of the classification parameters:
$$\underbrace{p(\theta \mid Y, X)}_{\text{posterior}} \propto\ \underbrace{p(Y \mid X, \theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}}$$

The likelihood in this case will be of a probit form:

$$ p(Y \mid X, \theta) = \prod_{i=1}^{N} \mathcal{B} \big(Y_i \mid \Phi(f_\theta(X_i)) \big) \, .$$

As you can see, it is a Bernoulli distribution with the cumulative normal distribution as transfer (a.k.a. link) function:

$$ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp \left(-\frac{t^2}{2} \right) \mathrm{d}t \, .$$

The transfer function maps the input $f_\theta(X_i)$ to the interval $(0,1)$, so that the result acts as a rate parameter for the Bernoulli distribution. Check Bert's lecture on discriminative classification for more information.
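To get a feel for the link function, here is a small illustration (using `normcdf` from StatsFuns, which is already loaded) of how $\Phi$ squashes real-valued inputs into $(0,1)$, where they can serve as Bernoulli rates:
# The probit link Φ maps any real-valued input to a rate in (0,1)
inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]
rates  = normcdf.(inputs)   # ≈ [0.001, 0.159, 0.5, 0.841, 0.999]
Bernoulli.(rates)           # each rate parameterizes a Bernoulli over the class label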
We will use a Gaussian prior distribution for the classification parameters $\theta$:
$$ p(\theta) = \mathcal{N}(\theta \mid \mu_\theta, \Sigma_\theta) \, .$$

You have probably noticed that this combination of likelihood and prior is not a conjugate pairing, so we do not have an exact posterior. The Laplace approximation is one option, but under the hood our toolbox actually performs a different procedure to obtain the posterior parameters: moment matching. With a cumulative normal link, the first (mean) and second (variance) moments of the product of prior and likelihood can be integrated by hand. The toolbox essentially looks up the analytically derived formulas, which is computationally cheaper than the iterative steps required for the Laplace approximation.
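To give a flavor of what moment matching computes, here is a minimal one-dimensional sketch (an illustration only, not the toolbox's internal code), using the standard analytic result for a Gaussian prior multiplied by a probit likelihood (see, e.g., Rasmussen & Williams, Gaussian Processes for Machine Learning, Ch. 3):
# Approximate N(f | m, v) * Φ(y*f) by the Gaussian with matching first and second moments
function probit_moment_match(m, v; y = 1)
    z = y * m / sqrt(1 + v)
    r = normpdf(z) / normcdf(z)              # ratio N(z; 0, 1) / Φ(z)
    m_new = m + y * v * r / sqrt(1 + v)      # matched mean
    v_new = v - v^2 * r * (z + r) / (1 + v)  # matched variance
    return m_new, v_new
end
# Example: standard normal prior on f, positive label observed ⇒ mean shifts up, variance shrinks
probit_moment_match(0.0, 1.0)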
# Parameters for priors
μ_θ = zeros(num_features+1,)
Σ_θ = diagm(ones(num_features+1));
@model function linear_classification(y, X, μ_θ, Σ_θ, N)
    "Bayesian classification model"
    # Weight prior distribution
    θ ~ MvNormalMeanCovariance(μ_θ, Σ_θ)
    # Binary likelihood
    for i = 1:N
        y[i] ~ Probit(dot(θ, X[i]))
    end
end
results = infer(
    model = linear_classification(μ_θ=μ_θ, Σ_θ=Σ_θ, N=num_train),
    data = (y = labels_train, X = [[features_train[i,:]; 1.0] for i in 1:num_train]),
    returnvars = (θ = KeepLast(),),
    iterations = 10,
)
Inference results: Posteriors | available for (θ)
Unfortunately, we cannot visualize a distribution of more than 2 dimensions. But we can visualize a pair of dimensions:
# Coefficients to visualize
cix = [1,2]
# Reduce posterior distribution to chosen dimensions
m_cix = mean(results.posteriors[:θ])[cix]
S_cix = cov( results.posteriors[:θ])[cix,cix]
post_θ = MvNormal(m_cix, S_cix)
# Reduce prior distribution to chosen dimensions
prior_θ = MvNormal(μ_θ[cix], Σ_θ[cix,cix])
# Define ranges for plot
x1 = range(-1., length=500, stop=1.)
x2 = range(-1., length=500, stop=1.)
# Draw contour plots of distributions
p1a = contour(x1, x2, (x1,x2) -> pdf(prior_θ, [x1,x2]), levels=3, xlabel="θ1", ylabel="θ2", title="prior", label="")
p1b = contour(x1, x2, (x1,x2) -> pdf(post_θ, [x1,x2]), levels=3, xlabel="θ1", title="posterior", label="")
plot(p1a, p1b, size=(900,300))
The device should make accurate predictions for future patients. We can evaluate this with a test data set.
# Read CSV file
test_data = DataFrame(CSV.File("../datasets/diagnosis_test.csv"))
# Split dataframe into features and labels
features_test = Matrix(test_data[:,1:5])
labels_test = Vector(test_data[:,6])
# Number of test samples
num_test = size(features_test,1);
We can generate the most probable class labels by extracting the most probable classifier weights, calculating the dot product with the test feature vectors, and applying the Probit node's link function. This produces the class label probability and by rounding we get hard assignments to $0$ or $1$. This can be checked against the true test labels.
# Extract MAP estimate of classification parameters
θ_MAP = mode(results.posteriors[:θ])
# Compute dot product between parameters and test data
fθ_pred = [features_test ones(num_test,)] * θ_MAP
# Predict labels through probit
labels_pred = round.(normcdf.(fθ_pred));
# Compute classification accuracy of test data
accuracy_test = mean(labels_test .== labels_pred)
# Report result
println("Test Accuracy = "*string(accuracy_test*100)*"%")
Test Accuracy = 92.3076923076923%
Those predictions are very accurate. So, is the device informative after all?
Re-run the classifier on just the saliva measurement features. Does the accuracy drop? And if so, should the device be used?
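A minimal sketch of how one might set this up, assuming (as described above) that columns 1 and 2 of the feature matrix hold the saliva measurements (variable names here are just for illustration):
# Re-train the classifier on only the two saliva measurement features plus a bias term
results_saliva = infer(
    model = linear_classification(μ_θ = zeros(3), Σ_θ = diagm(ones(3)), N = num_train),
    data = (y = labels_train, X = [[features_train[i, 1:2]; 1.0] for i in 1:num_train]),
    returnvars = (θ = KeepLast(),),
    iterations = 10,
)
# Evaluate on the test set with the same two features
θ_saliva = mode(results_saliva.posteriors[:θ])
labels_pred_saliva = round.(normcdf.([features_test[:, 1:2] ones(num_test)] * θ_saliva))
println("Test accuracy (saliva only) = ", 100 * mean(labels_test .== labels_pred_saliva), "%")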