Goal
Materials
Solution: To be solved later in this lesson.
Let's first generate a data set (see next slide).
using Pkg; Pkg.activate("../."); Pkg.instantiate();
using IJulia; try IJulia.clear_output(); catch _ end
using Plots, Distributions
N = 250; p_apple = 0.7; Σ = [0.2 0.1; 0.1 0.3]
p_given_apple = MvNormal([1.0, 1.0], Σ) # p(X|y=apple)
p_given_peach = MvNormal([1.7, 2.5], Σ) # p(X|y=peach)
X = Matrix{Float64}(undef,2,N); y = Vector{Bool}(undef,N) # true corresponds to apple
for n=1:N
y[n] = (rand() < p_apple) # Apple or peach?
X[:,n] = y[n] ? rand(p_given_apple) : rand(p_given_peach) # Sample features
end
X_apples = X[:,findall(y)]'; X_peaches = X[:,findall(.!y)]' # Split the samples by class
x_test = [2.3; 1.5] # Features of 'new' data point
scatter(X_apples[:,1], X_apples[:,2], label="apples", marker=:x, markerstrokewidth=3) # apples
scatter!(X_peaches[:,1], X_peaches[:,2], label="peaches", marker=:+, markerstrokewidth=3) # peaches
scatter!([x_test[1]], [x_test[2]], label="unknown") # 'new' unlabelled data point
where we used $m_k \triangleq \sum_n y_{nk}$, i.e., the number of training samples assigned to class $k$.
where $\hat{\pi}_k$, $\hat{\mu}_k$ and $\hat{\Sigma}_k$ are the sample proportion, sample mean and sample covariance for the $k$th class, respectively.
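These estimates can also be computed directly from the generated data, without a fitting routine. The following minimal sketch is added for illustration only (the names `π_hat_direct`, `μ1`, `μ2`, `Σ_hat` are not part of the lesson code) and assumes the data-generating cell above has been run; it pools the per-class covariances into a shared estimate, as the worked solution below will also do.
# Illustrative sketch: direct ML estimates (assumes X and y from the data-generating cell)
using Statistics
m = [sum(y), sum(.!y)]                            # class counts m_k
π_hat_direct = m ./ length(y)                     # sample proportions π̂_k = m_k/N
μ1 = vec(mean(X[:, y]; dims=2))                   # sample mean of the apple class
μ2 = vec(mean(X[:, .!y]; dims=2))                 # sample mean of the peach class
Σ1 = cov(X[:, y]; dims=2)                         # sample covariance of the apple class
Σ2 = cov(X[:, .!y]; dims=2)                       # sample covariance of the peach class
Σ_hat = π_hat_direct[1]*Σ1 + π_hat_direct[2]*Σ2   # shared (pooled) covariance estimate
# Note: cov uses the 1/(m_k-1) normalization; the difference with the ML 1/m_k estimate is negligible here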
where $\sigma(a_k) \triangleq \frac{\exp(a_k)}{\sum_{k^\prime}\exp(a_{k^\prime})}$ is called a softmax (a.k.a. normalized exponential) function, and $$\begin{align*} \beta_k &= \hat{\Sigma}^{-1} \hat{\mu}_k \\ \gamma_k &= - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + \log \hat{\pi}_k \\ Z &= \sum_{k^\prime}\exp\{\beta_{k^\prime}^T x_\bullet + \gamma_{k^\prime}\}\,. \quad \text{(normalization constant)} \end{align*}$$
where we defined $\beta_{kj} \triangleq \beta_k - \beta_j$ and similarly for $\gamma_{kj}$.
Given a new observation $x_\bullet$, selecting the class with maximal posterior probability, $\hat{k} = \arg\max_k p(\mathcal{C}_k|x_\bullet)$, is an appealing decision.
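As another illustrative sketch (not part of the original lesson code), the softmax expression above can be evaluated directly once $\hat{\pi}_k$, $\hat{\mu}_k$ and the shared $\hat{\Sigma}$ are available. It reuses the hypothetical variables `μ1`, `μ2`, `Σ_hat` and `π_hat_direct` from the sketch above.
# Illustrative sketch: posterior class probabilities via the softmax of linear discriminants
betas  = [Σ_hat \ μ1, Σ_hat \ μ2]                         # β_k = Σ̂⁻¹ μ̂_k
gammas = [-0.5*μ1'*(Σ_hat\μ1) + log(π_hat_direct[1]),
          -0.5*μ2'*(Σ_hat\μ2) + log(π_hat_direct[2])]     # γ_k = -½ μ̂_kᵀ Σ̂⁻¹ μ̂_k + log π̂_k
function class_posterior(x)
a = [betas[k]'*x + gammas[k] for k in 1:2]                # discriminants a_k = β_kᵀ x + γ_k
return exp.(a) ./ sum(exp.(a))                            # softmax σ(a_k) = exp(a_k)/Z
end
class_posterior([2.3; 1.5])                               # [p(apple|x∙), p(peach|x∙)] for the 'new' point
Because the class-conditional Gaussians share the covariance $\hat{\Sigma}$, this softmax form is algebraically equivalent to applying Bayes rule directly, which is what `predict_class` in the worked solution below does.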
We'll apply the above results to solve the "apple or peach" example problem.
# Make sure you run the data-generating code cell first
using Distributions, Plots
# Multinomial (in this case binomial) density estimation
p_apple_est = sum(y.==true) / length(y)
π_hat = [p_apple_est; 1-p_apple_est]
# Estimate class-conditional multivariate Gaussian densities
d1 = fit_mle(FullNormal, X_apples') # MLE density estimation d1 = N(μ₁, Σ₁)
d2 = fit_mle(FullNormal, X_peaches') # MLE density estimation d2 = N(μ₂, Σ₂)
Σ = π_hat[1]*cov(d1) + π_hat[2]*cov(d2) # Combine Σ₁ and Σ₂ into Σ
conditionals = [MvNormal(mean(d1), Σ); MvNormal(mean(d2), Σ)] # p(x|C)
# Calculate posterior class probability of x∙ (prediction)
function predict_class(k, X) # calculate p(Ck|X)
norm = π_hat[1]*pdf(conditionals[1],X) + π_hat[2]*pdf(conditionals[2],X)
return π_hat[k]*pdf(conditionals[k], X) ./ norm
end
println("p(apple|x=x∙) = $(predict_class(1,x_test))")
# Discrimination boundary of the posterior (p(apple|x;D) = p(peach|x;D) = 0.5)
β(k) = inv(Σ)*mean(conditionals[k])
γ(k) = -0.5 * mean(conditionals[k])' * inv(Σ) * mean(conditionals[k]) + log(π_hat[k])
function discriminant_x2(x1)
# Solve discriminant equation for x2
β12 = β(1) - β(2)                      # β_{12} = β_1 - β_2
γ12 = γ(1) - γ(2)                      # γ_{12} = γ_1 - γ_2
return -(β12[1]*x1 .+ γ12) ./ β12[2]   # solve β_{12}ᵀ[x1; x2] + γ_{12} = 0 for x2
end
scatter(X_apples[:,1], X_apples[:,2], label="apples", marker=:x, markerstrokewidth=3) # apples
scatter!(X_peaches[:,1], X_peaches[:,2], label="peaches", marker=:+, markerstrokewidth=3) # peaches
scatter!([x_test[1]], [x_test[2]], label="unknown") # 'new' unlabelled data point
x1 = range(-1,length=10,stop=3)
plot!(x1, discriminant_x2(x1), color="black", label="") # Plot discrimination boundary
plot!(x1, discriminant_x2(x1), fillrange=-10, alpha=0.2, color=:blue, label="")
plot!(x1, discriminant_x2(x1), fillrange=10, alpha=0.2, color=:red, xlims=(-0.5, 3), ylims=(-1, 4), label="")
p(apple|x=x∙) = 0.758239422055906
A student in one of the previous years posed the following question on Piazza:
" After re-reading topics regarding generative classification, this question popped into my mind: Besides the sole purpose of the lecture, which is getting to know the concepts of generative classification and how to implement them, are there any advantages of using this instead of using deep neural nets? DNNs seem simpler and more powerful."
The following answer was provided:
If you are only interested in approximating a function, say $y=f_\theta(x)$, and you have lots of examples $\{(x_i,y_i)\}$ of the desired behavior, then a non-probabilistic DNN is often a fine approach.
However, if you are willing to formulate your models in a probabilistic framework, you can improve on the deterministic approach in many ways, e.g.:
- Bayesian evidence for model performance assessment. This means you can use the whole data set for training without an ad-hoc split into testing and training data sets.
- Uncertainty about parameters in the model is a measure that allows you to do active learning, i.e., to choose the data that are most informative (see also the lesson on intelligent agents). This allows you to train on small data sets, whereas deterministic DNNs generally require much larger data sets.
- Prediction with uncertainty/confidence bounds.
- Explicit specification and separation of your assumptions.
- A framework that supports scoring both accuracy and model complexity in the same currency (probability). How are you going to penalize the size of your network in a deterministic framework?
- Automatic learning rates, no tuning parameters. For instance, the Kalman gain is a data-dependent, optimal learning rate. How will you choose your learning rates in a deterministic framework? Trial and error?
- Principled absorption of different sources of knowledge. E.g., the outcome of one set of experiments can be captured by a posterior distribution that serves as a prior distribution for the next set of experiments.
Admittedly, it's not easy to understand the probabilistic approach, but it is worth the effort.
where $\beta_k = \Sigma^{-1} \mu_k$ and $\gamma_k = -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$.
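For the two-class case these definitions make the decision boundary explicitly linear. The following step is added here for clarity and follows directly from the expressions above: $$\begin{align*} \log \frac{p(\mathcal{C}_1|x)}{p(\mathcal{C}_2|x)} &= \left(\beta_1^T x + \gamma_1\right) - \left(\beta_2^T x + \gamma_2\right) \\ &= \left(\beta_1 - \beta_2\right)^T x + \left(\gamma_1 - \gamma_2\right)\,, \end{align*}$$ so the boundary $p(\mathcal{C}_1|x) = p(\mathcal{C}_2|x)$ is the straight line $(\beta_1 - \beta_2)^T x + (\gamma_1 - \gamma_2) = 0$, which is exactly the line drawn by `discriminant_x2` in the code above.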
open("../../styles/aipstyle.html") do f
display("text/html", read(f,String))
end