Does this fragment resonate with your own experience?
In this lesson we introduce Probability Theory (PT) again. As we will see in the next lessons, PT is all you need to make sense of machine learning, artificial intelligence, statistics, etc.
where $p(a|b)$ means the probability that $a$ is true, given that $b$ is true.
The Bayesian interpretation contrasts with the frequentist interpretation of a probability as the relative frequency that an event would occur under repeated execution of an experiment.
“Compared with Bayesian methods, standard [frequentist] statistical techniques use only a small fraction of the available information about a research hypothesis (how well it predicts some observation), so naturally they will struggle when that limited information proves inadequate. Using standard statistical methods is like driving a car at night on a poorly lit highway: to keep from going in a ditch, we could build an elaborate system of bumpers and guardrails and equip the car with lane departure warnings and sophisticated navigation systems, and even then we could at best only drive to a few destinations. Or we could turn on the headlights.”
It will be helpful to introduce some terms concerning special probabilistic relationships between events.
Two events $A$ and $B$ are said to be independent if the probability of one is not altered by information about the truth of the other, i.e., $p(A|B) = p(A)$
$$\begin{align*} \sum_{Y\in \mathcal{Y}} p(X,Y) &= \sum_{Y\in \mathcal{Y}} p(Y|X) p(X) \\ &= p(X) \underbrace{\sum_{Y\in \mathcal{Y}} p(Y|X)}_{=1} \\ &= p(X) \end{align*}$$
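As a quick numerical illustration (a sketch that is not part of the lesson itself), the snippet below builds a hypothetical discrete joint distribution $p(X,Y)$ and checks that summing over $Y$ indeed recovers $p(X)$, and that in general $p(X|Y) \neq p(X)$, i.e., $X$ and $Y$ need not be independent.

# Hypothetical joint distribution p(X,Y) over X ∈ {1,2} (rows) and Y ∈ {1,2,3} (columns)
p_joint = [0.10 0.20 0.10;
           0.30 0.15 0.15]
@assert sum(p_joint) ≈ 1.0               # a valid joint distribution sums to one

p_X = vec(sum(p_joint, dims=2))          # marginalization: p(X) = Σ_Y p(X,Y)
p_Y = vec(sum(p_joint, dims=1))          # marginalization: p(Y) = Σ_X p(X,Y)
println("p(X) = ", p_X)                  # [0.4, 0.6]
println("p(Y) = ", p_Y)                  # [0.4, 0.35, 0.25]

# Independence check: p(X|Y=1) differs from p(X), so X and Y are not independent here
p_X_given_Y1 = p_joint[:, 1] ./ p_Y[1]
println("p(X|Y=1) = ", p_X_given_Y1, "  vs  p(X) = ", p_X)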
$p(D,\theta)=p(\theta,D)$, and hence that $$p(D|\theta)p(\theta)=p(\theta|D)p(D)$$ or, equivalently, $$ p(\theta|D) = \frac{p(D|\theta) }{p(D)}p(\theta)\,.\qquad \text{(Bayes rule)}$$
This last formula is called Bayes rule (or Bayes theorem). While Bayes rule is always true, a particularly useful application occurs when $D$ refers to an observed data set and $\theta$ is a set of model parameters. In that case,
$\Rightarrow$ Bayes rule tells us how to update our knowledge about model parameters when facing new data. Hence,
Consider the following simple model for the outcome (head or tail) $y \in \{0,1\}$ of a biased coin toss with parameter $\theta \in [0,1]$:
$$ p(y|\theta) \triangleq \theta^y (1-\theta)^{1-y} $$

We can plot both the sampling distribution $p(y|\theta=0.5)$ and the likelihood function $L(\theta) \triangleq p(y=1|\theta)$.
using Pkg; Pkg.activate("../."); Pkg.instantiate();
using IJulia; try IJulia.clear_output(); catch _ end
using Plots
using LaTeXStrings
f(y,θ) = θ.^y .* (1 .- θ).^(1 .- y) # p(y|θ)
θ = 0.5    # fixed parameter value for the sampling distribution plot
p1 = plot([0,1], f([0,1], θ),
line=:stem, marker=:circle, xrange=(-0.5, 1.5), yrange=(0,1), title="Sampling Distribution", xlabel="y", ylabel=L"p(y|θ=%$θ)", label="")
_θ = 0:0.01:1    # grid over θ ∈ [0,1] for the likelihood plot
y = 1            # fixed observation for the likelihood function
p2 = plot(_θ, f(y, _θ),
ylabel=L"p(y=%$y | θ)", xlabel=L"θ", title="Likelihood Function", label="")
plot(p1, p2)
The (discrete) sampling distribution is a valid probability distribution. However, the likelihood function $L(\theta)$ clearly isn't, since $\int_0^1 L(\theta) \mathrm{d}\theta \neq 1$.
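We can also check this numerically (a quick sketch, reusing the `f` and `_θ` defined above): since $L(\theta) = p(y=1|\theta) = \theta$, its integral over $[0,1]$ is $1/2$ rather than $1$, whereas the sampling distribution does sum to one over $y \in \{0,1\}$.

# Riemann-sum approximation of ∫₀¹ L(θ) dθ on the grid _θ defined above
Δθ = 0.01
println("∫ L(θ) dθ ≈ ", sum(f(1, _θ)) * Δθ)       # ≈ 0.5, so L(θ) is not a probability distribution
println("Σ_y p(y|θ=0.5) = ", sum(f([0,1], 0.5)))  # = 1.0, a valid distribution over y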
$$\begin{align*} p(\,\text{Mr.S.-killed-Mrs.S.} \;&|\; \text{he-has-her-blood-on-his-shirt}\,) \\ p(\,\text{transmitted-codeword} \;&|\;\text{received-codeword}\,) \end{align*}$$
where the 's' and 'p' above the equality sign indicate whether the sum or product rule was used.
Note that $p(\text{sick}|\text{positive test}) = 0.06$ while $p(\text{positive test} | \text{sick}) = 0.95$. The huge gap between these two numbers illustrates what is sometimes called the "medical test paradox" or the base rate fallacy.
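The computation behind these numbers is a direct application of Bayes rule. The exact prevalence and false-alarm rate are not restated in this fragment, so the values below (1% prevalence, 15% false-positive rate, 95% sensitivity) are assumptions chosen to reproduce the quoted result:

# Bayes rule for the medical test example (prevalence and false-positive rate assumed,
# chosen so that the quoted p(sick|positive test) ≈ 0.06 comes out)
p_sick              = 0.01    # prior: p(sick)
p_pos_given_sick    = 0.95    # sensitivity: p(positive test | sick)
p_pos_given_healthy = 0.15    # false-positive rate: p(positive test | healthy)

# sum rule: p(positive) = p(positive|sick) p(sick) + p(positive|healthy) p(healthy)
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)

# Bayes rule: p(sick|positive) = p(positive|sick) p(sick) / p(positive)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
println("p(sick|positive test) ≈ ", round(p_sick_given_pos, digits=3))   # ≈ 0.06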
Many people have trouble distinguishing $p(A|B)$ from $p(B|A)$ in their heads, and this confusion has had serious consequences, from unfounded convictions in courtrooms to a large number of unfounded conclusions in scientific research. See Ioannidis (2005) and Clayton (2021).
and a ball is drawn out, which proves to be white. What is now the chance of drawing a white ball?
Solution: Again, use Bayes and marginalization to arrive at $p(\text{white}|\text{data})=2/3$, see the Exercises notebook.
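For completeness, here is a small enumeration of the reasoning (a sketch; the full problem statement is not repeated in this fragment, so the setup below, a bag that initially holds one ball that is equally likely white or black, is taken as an assumption):

# Bayesian enumeration for the ball problem (setup assumed, see the lead-in above)
prior     = Dict("WW" => 0.5, "WB" => 0.5)    # bag contents after adding the white ball
lik_white = Dict("WW" => 1.0, "WB" => 0.5)    # p(drawn ball is white | contents)
evidence  = sum(prior[h] * lik_white[h] for h in keys(prior))
posterior = Dict(h => prior[h] * lik_white[h] / evidence for h in keys(prior))
# if the contents were WW, the remaining ball is white; if WB, it is black
p_next_white = posterior["WW"] * 1.0 + posterior["WB"] * 0.0
println("p(next draw is white | data) = ", p_next_white)    # = 2/3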
$\Rightarrow$ Note that probabilities describe a person's state of knowledge rather than a 'property of nature'.
Solution: (a) $5/12$. (b) $5/11$, see the Exercises notebook.
$\Rightarrow$ Again, we conclude that conditional probabilities reflect implications for a state of knowledge rather than temporal causality.
Consider an arbitrary distribution $p(X)$ with mean $\mu_x$ and variance $\Sigma_x$ and the linear transformation $$Z = A X + b \,.$$
No matter the specification of $p(X)$, we can derive that (see the Exercises notebook)
$$\mu_z = A\mu_x + b \qquad \text{and} \qquad \Sigma_z = A \Sigma_x A^T\,,$$
i.e., the mean and variance of $Z$ follow from the mean and variance of $X$ alone.
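As a sanity check (a sketch that is not part of the lesson), the snippet below verifies these moment relations with Monte Carlo samples from an arbitrary non-Gaussian $p(X)$; the Gamma components, $A$, and $b$ are chosen purely for illustration.

using Distributions, LinearAlgebra, Statistics, Random

Random.seed!(1)
# An arbitrary, non-Gaussian p(X): two independent Gamma-distributed components
components = [Gamma(2.0, 1.0), Gamma(3.0, 0.5)]
μx = mean.(components)                    # mean vector of X
Σx = Diagonal(var.(components))           # covariance of X (diagonal: independent components)

A = [1.0 2.0; 0.0 1.0]; b = [1.0, -1.0]   # linear transformation Z = A X + b
X = rand(product_distribution(components), 100_000)   # 2 × N sample matrix
Z = A * X .+ b

println("sample mean of Z ≈ ", vec(mean(Z, dims=2)), "   A*μx + b = ", A * μx + b)
println("sample cov of Z  ≈ ", cov(Z, dims=2), "   A*Σx*Aᵀ = ", A * Σx * A')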
Consider two independent random variables $X$ and $Y$ with PDFs $p_x(x)$ and $p_y(y)$. The PDF $p_z(z)$ for $Z=X+Y$ is given by the convolution
$$ p_z (z) = \int_{ - \infty }^\infty p_x (x)\, p_y (z - x)\,\mathrm{d}{x}\,. $$
using Plots, Distributions, LaTeXStrings
μx = 2.
σx = 1.
μy = 2.
σy = 0.5
μz = μx+μy; σz = sqrt(σx^2 + σy^2)    # sum of independent Gaussians: means and variances add
x = Normal(μx, σx)
y = Normal(μy, σy)
z = Normal(μz, σz)
range_min = minimum([μx-2*σx, μy-2*σy, μz-2*σz])
range_max = maximum([μx+2*σx, μy+2*σy, μz+2*σz])
range_grid = range(range_min, stop=range_max, length=100)
plot(range_grid, pdf.(x,range_grid), label=L"p_x", fill=(0, 0.1))
plot!(range_grid, pdf.(y,range_grid), label=L"p_y", fill=(0, 0.1))
plot!(range_grid, pdf.(z,range_grid), label=L"p_z", fill=(0, 0.1))
Similarly, for two independent random variables $X$ and $Y$ with PDFs $p_x(x)$ and $p_y(y)$, the PDF of the product $Z = X Y$ is given by $$ p_z(z) = \int_{-\infty}^{\infty} p_x(x) \,p_y(z/x)\, \frac{1}{|x|}\,\mathrm{d}x\,. $$
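As a quick check of this formula (a sketch, not part of the lesson), take $X, Y \sim \text{Uniform}(0,1)$ independently; the integral then evaluates to $p_z(z) = \int_z^1 \frac{1}{x}\,\mathrm{d}x = -\log(z)$ for $0 < z < 1$, which we can compare against a histogram of simulated products.

using Plots, LaTeXStrings

# Monte Carlo samples of Z = X*Y with X, Y ~ Uniform(0,1)
z_samples = rand(100_000) .* rand(100_000)
histogram(z_samples, normalize=:pdf, bins=50, alpha=0.4, label="histogram of X*Y")

# overlay the analytical result p_z(z) = -log(z) from the product formula
zgrid = 0.01:0.01:1.0
plot!(zgrid, -log.(zgrid), lw=2, label=L"-\log(z)")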
More generally, given a random variable $X$ with PDF $p_x(x)$ and an invertible transformation $y = h(x)$ with inverse $x = g(y) = h^{-1}(y)$, the PDF of $Y$ is obtained from
$$ p_y(y) = p_x\big(g(y)\big) \, \left| \frac{\mathrm{d}g(y)}{\mathrm{d}y} \right|\,, $$
which is also known as the Change-of-Variable theorem.
Let $p_x(x) = \mathcal{N}(x|\mu,\sigma^2)$ and $y = \frac{x-\mu}{\sigma}$.
Problem: What is $p_y(y)$?
Solution: Note that $h(x)$ is invertible with $x = g(y) = \sigma y + \mu$. The change-of-variable formula leads to
$$\begin{align*}
p_y(y) &= p_x\big(g(y)\big) \cdot \left|\frac{\mathrm{d}g(y)}{\mathrm{d}y}\right| \\
&= \mathcal{N}(\sigma y + \mu \,|\, \mu, \sigma^2) \cdot \sigma \\
&= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right) \\
&= \mathcal{N}(y\,|\,0,1)\,.
\end{align*}$$
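A numerical confirmation (a sketch, with example values $\mu=3$ and $\sigma=2$ chosen arbitrarily): standardized samples from $\mathcal{N}(\mu,\sigma^2)$ should follow the standard normal density.

using Plots, Distributions, LaTeXStrings

μ, σ = 3.0, 2.0                                   # example parameters (arbitrary choice)
x_samples = rand(Normal(μ, σ), 100_000)
y_samples = (x_samples .- μ) ./ σ                 # apply y = h(x) = (x - μ)/σ

histogram(y_samples, normalize=:pdf, bins=50, alpha=0.4, label="histogram of (x-μ)/σ")
ygrid = -4:0.01:4
plot!(ygrid, pdf.(Normal(0, 1), ygrid), lw=2, label=L"\mathcal{N}(y|0,1)")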
Finally, here is a notational convention that you should be precise about (but many authors are not).
If you want to write that a variable $x$ is distributed as a Gaussian with mean $\mu$ and covariance matrix $\Sigma$, you can write this properly in either of two ways:
$$p(x) = \mathcal{N}(x|\mu,\Sigma) \qquad \text{or} \qquad x \sim \mathcal{N}(\mu,\Sigma)\,.$$
In the second version, the symbol $\sim$ can be interpreted as "is distributed as" (a Gaussian with parameters $\mu$ and $\Sigma$).
Don't write $p(x) = \mathcal{N}(\mu,\Sigma)$ because $p(x)$ is a function of $x$ but $\mathcal{N}(\mu,\Sigma)$ is not.
Also, $x \sim \mathcal{N}(x|\mu,\Sigma)$ is not proper because you already named the argument on the right-hand side. On the other hand, $x \sim \mathcal{N}(\cdot|\mu,\Sigma)$ is fine, as is the shorter $x \sim \mathcal{N}(\mu,\Sigma)$.
open("../../styles/aipstyle.html") do f
display("text/html", read(f,String))
end