using Pkg
Pkg.activate("../../../lessons/")
Pkg.instantiate();
IJulia.clear_output();
In 2008, the credit crisis sparked a recession in the US, which spread to other countries in the ensuing years. It took most countries a couple of years to recover. Now, the year is 2011. The Turkish government is asking you to estimate whether Turkey is out of the recession. You decide to look at the data of the national stock exchange to see if there's a positive trend.
using CSV
using DataFrames
using LinearAlgebra
using Distributions
using StatsFuns
using RxInfer
using Plots
default(label="", margin=10Plots.pt)
We are going to start by loading in a data set. We have daily measurements of the Istanbul Stock Exchange (ISE) from the 5th of January 2009 until the 22nd of February 2011. The dataset comes from an online resource for machine learning data sets: the UCI ML Repository.
# Read CSV file
df = DataFrame(CSV.File("../datasets/stock_exchange.csv"))
| Row | date | ISE |
|---|---|---|
| | String15 | Float64 |
| 1 | 5-Jan-09 | 0.0357537 |
| 2 | 6-Jan-09 | 0.0254259 |
| 3 | 7-Jan-09 | -0.0288617 |
| 4 | 8-Jan-09 | -0.0622081 |
| 5 | 9-Jan-09 | 0.00985991 |
| 6 | 12-Jan-09 | -0.029191 |
| 7 | 13-Jan-09 | 0.0154453 |
| 8 | 14-Jan-09 | -0.0411676 |
| 9 | 15-Jan-09 | 0.000661905 |
| 10 | 16-Jan-09 | 0.0220373 |
| 11 | 19-Jan-09 | -0.0226925 |
| 12 | 20-Jan-09 | -0.0137087 |
| 13 | 21-Jan-09 | 0.000864697 |
| ⋮ | ⋮ | ⋮ |
| 525 | 7-Feb-11 | -0.0061961 |
| 526 | 8-Feb-11 | 0.00535559 |
| 527 | 9-Feb-11 | 0.00482299 |
| 528 | 10-Feb-11 | -0.0176644 |
| 529 | 11-Feb-11 | 0.00478229 |
| 530 | 14-Feb-11 | -0.00249793 |
| 531 | 15-Feb-11 | 0.00360638 |
| 532 | 16-Feb-11 | 0.00859906 |
| 533 | 17-Feb-11 | 0.00931031 |
| 534 | 18-Feb-11 | 0.000190969 |
| 535 | 21-Feb-11 | -0.013069 |
| 536 | 22-Feb-11 | -0.00724632 |
We can plot the evolution of the stock market values over time.
# Count number of samples
time_period = 436:536
num_samples = length(time_period)
# Extract columns
dates_num = 1:num_samples
dates_str = df[time_period,1]
stock_val = df[time_period,2]
# Set xticks
xtick_points = Int64.(round.(range(1, stop=num_samples, length=5)))
# Scatter exchange levels
scatter(dates_num,
        stock_val,
        color="black",
        label="",
        ylabel="Stock Market Levels",
        xlabel="time (days)",
        xticks=(xtick_points, [dates_str[i] for i in xtick_points]),
        size=(800,300))
We have a date $x_i \in \mathbb{R}$, referred to as a "covariate" or input variable, and the value of the stock exchange at that time point $y_i \in \mathbb{R}$, referred to as a "response" or output variable. A regression model has parameters $\theta$, used to predict $y = (y_1, \dots, y_N)$ from $x = (x_1, \dots, x_N)$. We are looking for a posterior distribution for the parameters $\theta$:
$$\underbrace{p(\theta \mid y, x)}_{\text{posterior}} \propto\ \underbrace{p(y \mid x, \theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}} \, .$$

We assume each observation $y_i$ is generated via

$$ y_i = f_\theta(x_i) + e_i \, ,$$

where $e_i$ is white noise, $e_i \sim \mathcal{N}(0, \sigma^2_y)$, and the regression function $f_\theta$ is linear: $f_\theta(x) = x \theta_1 + \theta_2$. The parameters consist of a slope coefficient $\theta_1$ and an intercept $\theta_2$, which are summarized into the vector $\theta = \begin{bmatrix}\theta_1 \\ \theta_2 \end{bmatrix}$. In practice, we augment the data point $x$ with a 1, i.e., $\begin{bmatrix}x \\ 1 \end{bmatrix}$, so that we may define $f_\theta(x) = \theta^{\top}x$.
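As a small illustration of this augmentation (with made-up numbers, not estimates from the data), the regression function reduces to a single dot product:

# Hypothetical parameter vector: slope θ1 = 0.5, intercept θ2 = 1.0
θ_example = [0.5, 1.0]
# Augment a data point x = 2.0 with a 1
x_augmented = [2.0, 1.0]
# f_θ(x) = θ'x = 0.5*2.0 + 1.0*1.0 = 2.0 (uses `dot` from LinearAlgebra, loaded above)
f_example = dot(θ_example, x_augmented)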
If we integrate out the noise $e$, then we obtain a Gaussian likelihood function centered on $f_\theta(x)$ with variance $\sigma^2_y$:

$$p(y_i \mid x_i, \theta) = \mathcal{N}(y_i \mid f_\theta(x_i), \sigma^2_y) \, .$$

But this is just for a single sample and we have an entire data set. The likelihood of all $(x,y)$ is:

$$\begin{aligned} p(y \mid x, \theta) &= \prod_{i=1}^{N} p(y_i \mid x_i, \theta) \\ &= \prod_{i=1}^{N} \mathcal{N}(y_i \mid f_{\theta}(x_i), \sigma^2_y) \, . \end{aligned}$$

We know that the weights are real numbers and that they can be negative. That motivates us to use a Gaussian prior:

$$ p(\theta) = \mathcal{N}(\theta \mid \mu_\theta, \Sigma_\theta) \, .$$

We can specify these equations almost directly in our PPL.
@model function linear_regression(μ_θ, Σ_θ, σ2_y; N=1)
    "Bayesian linear regression"
    # Allocate data variables
    X = datavar(Vector{Float64}, N)
    y = datavar(Float64, N)
    # Prior distribution of coefficients
    θ ~ MvNormalMeanCovariance(μ_θ, Σ_θ)
    for i = 1:N
        # Likelihood of i-th sample
        y[i] ~ NormalMeanVariance(dot(θ, X[i]), σ2_y)
    end
    return y, X, θ
end
# Prior parameters
μ_θ, Σ_θ = (zeros(2), diagm(ones(2)))
# Likelihood variance
σ2_y = 1.0;
Now that we have our model, it is time to infer parameters.
# Call inference function
results = inference(
    model = linear_regression(μ_θ, Σ_θ, σ2_y, N=num_samples),
    data  = (y = stock_val, X = [[dates_num[i], 1.0] for i in 1:num_samples]),
    returnvars = (θ = KeepLast(),),
)
Inference results: Posteriors | available for (θ)
Let's visualize the resulting posterior.
# Extract posterior weights
post_θ = results.posteriors[:θ]
# Define ranges for plot
x1 = range(-1.5, length=500, stop=1.5)
x2 = range(-2.5, length=500, stop=2.5)
# Draw contour plots of distributions
prior_θ = MvNormal(μ_θ, Σ_θ)
p1a = contour(x1, x2, (x1,x2) -> pdf(prior_θ, [x1,x2]), xlabel="θ1", ylabel="θ2", title="prior", label="")
p1b = contour(x1, x2, (x1,x2) -> pdf(post_θ, [x1,x2]), xlabel="θ1", title="posterior", label="")
plot(p1a, p1b, size=(900,300))
Compared to the broad prior, the posterior has become sharply peaked in a small region of parameter space.
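We can also quantify this by comparing the prior covariance to the posterior covariance. A quick check, assuming the marginal supports `mean` and `cov` (as RxInfer's Gaussian marginals do):

# Compare prior and posterior uncertainty over the parameters
println("Prior covariance:\n", Σ_θ)
println("Posterior mean: ", mean(post_θ))
println("Posterior covariance:\n", cov(post_θ))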
We can extract the MAP point estimate to compute and visualize the most probable regression function $f_\theta$.
# Extract estimated weights
θ_MAP = mode(post_θ)
# Report results
println("Slope coefficient = "*string(θ_MAP[1]))
println("Intercept coefficient = "*string(θ_MAP[2]))
# Make predictions
regression_estimated = dates_num * θ_MAP[1] .+ θ_MAP[2];
Slope coefficient = -4.320468927288264e-5
Intercept coefficient = 0.0022453122200430586
Let's visualize it.
# Visualize observations
scatter(dates_num, stock_val, color="black", xticks=(xtick_points, [dates_str[i] for i in xtick_points]), label="observations", legend=:topleft)
# Overlay regression function
plot!(dates_num, regression_estimated, color="blue", label="regression", linewidth=2, size=(800,300))
The slope coefficient $\theta_1$ is negative and the plot shows a decreasing line. The ISE experienced a negative linear trend from October 2010 up to February 2011. Assuming the stock market is an indicator of economic growth, we may conclude that in early 2011 the Turkish economy is still in recession.
We will now look at a classification problem. Suppose you are a bank and that you have to decide whether you will grant credit, e.g. a mortgage or a small business loan, to a customer. You have a historic data set where your experts have assigned credit to hundreds of people. You have asked them to report on what aspects of the problem are important. You hope to automate this decision process by training a classifier on the data set.
The data set we are going to use also comes from the UCI ML Repository. It consists of past credit assignments, labeled as successful (=1) or unsuccessful (=0). Many of the features have been anonymized for privacy reasons.
# Read CSV file
df = DataFrame(CSV.File("../datasets/credit_train.csv"))
# Split dataframe into features and labels
features_train = Matrix(df[:,1:7])
labels_train = Vector(df[:,8])
# Store number of features
num_features = size(features_train,2)
# Number of training samples
num_train = size(features_train,1);
Let's visualize the data and see if we can make sense of it.
scatter(features_train[labels_train .== 0, 1], features_train[labels_train .== 0, 2], color="blue", label="unsuccessful", xlabel="feature1", ylabel="feature2")
scatter!(features_train[labels_train .== 1, 1], features_train[labels_train .== 1, 2], color="red", label="successful", size=(800,300))
Mmhh, it doesn't look like the samples can easily be separated. This will be challenging.
We have features $X$, labels $Y$ and parameters $\theta$. Same as with regression, we are looking for a posterior distribution of the classification parameters:
$$\underbrace{p(\theta \mid Y, X)}_{\text{posterior}} \propto\ \underbrace{p(Y \mid X, \theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}} \, .$$

The likelihood in this case will be of a probit form:

$$ p(Y \mid X, \theta) = \prod_{i=1}^{N} \mathcal{B} \big(Y_i \mid \Phi(f_\theta(X_i)) \big) \, .$$

As you can see, it is a Bernoulli distribution with a cumulative normal distribution as transfer (a.k.a. link) function:

$$ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp \left(-\frac{t^2}{2} \right) \mathrm{d}t \, .$$

The transfer function maps the input $f_\theta(X_i)$ to the interval $(0,1)$ so that the result acts as a rate parameter to the Bernoulli. Check Bert's lecture on discriminative classification for more information.
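In Julia, $\Phi$ is available as `normcdf` from StatsFuns (loaded above). A minimal sketch of how it squashes arbitrary real-valued inputs into valid Bernoulli rate parameters:

# The cumulative normal maps the real line to the interval (0,1)
f_values = [-3.0, -1.0, 0.0, 1.0, 3.0]
rates = normcdf.(f_values)
# Each rate can then parameterize a Bernoulli distribution over a binary label
example_likelihoods = Bernoulli.(rates)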
We will use a Gaussian prior distribution for the classification parameters $\theta$:
$$ p(\theta) = \mathcal{N}(\theta \mid \mu_\theta, \Sigma_\theta) \, .$$

You have probably noticed that this combination of likelihood and prior is not a conjugate pairing, so we do not have an exact posterior. The Laplace approximation is one way to obtain an approximate posterior, but under the hood our toolbox actually uses a different procedure to obtain the posterior parameters: moment matching. With the cumulative normal distribution, the first (mean) and second (variance) moments of the product of prior and likelihood can be integrated by hand. The toolbox essentially looks up the analytically derived formulas, which is computationally cheaper than the iterative steps required by the Laplace approximation.
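To give a feel for what moment matching does, below is a sketch of the one-dimensional case: the analytic mean and variance of the product of a Gaussian $\mathcal{N}(f \mid \mu, \sigma^2)$ and a probit factor $\Phi(f)$, i.e. the standard Gaussian-probit result. This is illustrative only; the toolbox applies its own multivariate message updates.

# Moment matching for a 1-D Gaussian prior N(μ, σ²) combined with a probit likelihood Φ(f),
# for an observation y = 1 (illustrative sketch, not the toolbox's internal code path)
function probit_moment_match(μ, σ2)
    z = μ / sqrt(1 + σ2)
    # Normalization constant of the product N(f | μ, σ²)Φ(f)
    Z = normcdf(z)
    # Analytic first and second moments of the normalized product
    m = μ + σ2 * normpdf(z) / (Z * sqrt(1 + σ2))
    v = σ2 - σ2^2 * normpdf(z) / ((1 + σ2) * Z) * (z + normpdf(z) / Z)
    return m, v
end
# Example: a standard-normal prior is shifted up and sharpened by observing a success
probit_moment_match(0.0, 1.0)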
# Parameters for priors
μ_θ = zeros(num_features+1,)
Σ_θ = diagm(ones(num_features+1));
@model function linear_classification(μ_θ, Σ_θ; N=1)
    "Bayesian classification model"
    # Allocate data variables
    X = datavar(Vector{Float64}, N)
    y = datavar(Int64, N)
    # Weight prior distribution
    θ ~ MvNormalMeanCovariance(μ_θ, Σ_θ)
    for i = 1:N
        # Binary likelihood
        y[i] ~ Probit(dot(θ, X[i]))
    end
    return y, X, θ
end
results = inference(
    model = linear_classification(μ_θ, Σ_θ, N=num_train),
    data  = (y = labels_train, X = [[features_train[i,:]; 1.0] for i in 1:num_train]),
    returnvars = (θ = KeepLast(),),
)
Inference results: Posteriors | available for (θ)
Unfortunately, we cannot easily visualize a distribution over more than two dimensions.
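What we can do is inspect the posterior numerically, for example its mean and the marginal standard deviation of each parameter. A small sketch, again assuming `mean` and `cov` are available on the marginal:

# Report posterior mean and marginal standard deviation per parameter
post_θ_clf = results.posteriors[:θ]
θ_means = mean(post_θ_clf)
θ_stds = sqrt.(diag(cov(post_θ_clf)))
for k in 1:(num_features+1)
    println("θ_$k: mean = ", round(θ_means[k], digits=3), ", std = ", round(θ_stds[k], digits=3))
end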
The bank has some test data for you as well.
# Read CSV file
df = DataFrame(CSV.File("../datasets/credit_test.csv"))
# Split dataframe into features and labels
features_test = Matrix(df[:,1:7])
labels_test = Vector(df[:,8])
# Number of test samples
num_test = size(features_test,1);
You can classify test samples by taking the MAP estimate of the classification parameters, computing the linear function $f_\theta$, pushing the result through the probit function, and rounding to obtain the most probable label.
# Extract MAP estimate of classification parameters
θ_MAP = mode(results.posteriors[:θ])
# Compute dot product between parameters and test data
fθ_pred = [features_test ones(num_test,)] * θ_MAP
# Predict labels through probit
labels_pred = round.(normcdf.(fθ_pred));
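As an aside, rounding the MAP-based score discards the uncertainty in $\theta$. For a Gaussian posterior with mean $m_\theta$ and covariance $S_\theta$, the predictive probability has the closed form $p(y=1 \mid x) = \Phi\big(m_\theta^{\top}x / \sqrt{1 + x^{\top} S_\theta x}\big)$. Below is a sketch of computing these probabilities; we still evaluate accuracy with the hard labels above.

# Posterior mean and covariance of the classification parameters
m_θ = mean(results.posteriors[:θ])
S_θ = cov(results.posteriors[:θ])
# Augment test features with a 1 for the intercept
X_test = [features_test ones(num_test,)]
# Predictive probability of success for each test sample, marginalizing over θ
probs_pred = [normcdf(dot(m_θ, X_test[i,:]) / sqrt(1 + X_test[i,:]' * S_θ * X_test[i,:])) for i in 1:num_test];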
# Compute classification accuracy of test data
accuracy_test = mean(labels_test .== labels_pred)
# Report result
println("Test Accuracy = "*string(accuracy_test*100)*"%")
Test Accuracy = 49.0%
Mmmhh... a test accuracy of 49% is no better than flipping a coin. If you were a bank, you might decide that you don't want to automatically assign credit to your customers based on this classifier.