Probability Theory Review


  • Goal
    • Review of probability theory as a theory for rational/logical reasoning with uncertainties (i.e., a Bayesian interpretation)
  • Materials

Example Problem: Disease Diagnosis

  • Problem: Given a disease with prevalence of 1% and a test procedure with sensitivity ('true positive' rate) of 95% and specificity ('true negative' rate) of 85% , what is the chance that somebody who tests positive actually has the disease?
  • Solution: Use probabilistic inference, to be discussed in this lecture.

The Design of Probability Theory

  • Define an event (or "proposition") $A$ as a statement, whose truth can be contemplated by a person, e.g.,
$$𝐴= \texttt{'there is life on Mars'}$$
  • If we assume the fact $$I = \texttt{'All known life forms require water'}$$ and a new piece of information $$x = \texttt{'There is water on Mars'}$$ becomes available, how should our degree of belief in event $A$ be affected (if we were rational)?
  • Richard T. Cox, 1946 developed a calculus for rational reasoning about how to represent and update the degree of beliefs about the truth value of events when faced with new information.
  • In developing this calculus, only some very agreeable assumptions were made, e.g.,
    • (Transitivity). If the belief in $A$ is greater than the belief in $B$, and the belief in $B$ is greater than the belief in $C$, then the belief in $A$ must be greater than the belief in $C$.
    • (Consistency). If the belief in an event can be inferred in two different ways, then the two ways must agree on the resulting belief.
  • $\Rightarrow$ Probability theory (PT) provides the theory of optimal processing of incomplete information (see Cox theorem, and Caticha, pp.7-24), and as such provides the (only) proper quantitative framework for drawing conclusions from a finite (read: incomplete) data set.

Why Probability Theory for Machine Learning?

  • Machine learning concerns drawing conclusions about model parameter settings from (a finite set of) data and therefore PT provides the optimal calculus for machine learning.
  • In general, nearly all interesting questions in machine learning can be stated in the following form (a conditional probability):
$$p(\texttt{whatever-we-want-to-know}\, | \,\texttt{whatever-we-do-know})$$

where $p(a|b)$ means the probability that $a$ is true, given that $b$ is true.

  • Examples
    • Predictions $$p(\,\texttt{future-observations}\,|\,\texttt{past-observations}\,)$$
    • Classify a received data point $x$ $$p(\,x\texttt{-belongs-to-class-}k \,|\,x\,)$$
    • Update a model based on a new observation $$p(\,\texttt{model-parameters} \,|\,\texttt{new-observation},\,\texttt{past-observations}\,)$$

Frequentist vs. Bayesian Interpretation of Probabilities

  • The interpretation of a probability as a degree-of-belief about the truth value of an event is also called the Bayesian interpretation.
  • In the Bayesian interpretation, the probability is associated with a state-of-knowledge (usually held by a person).
    • For instance, in a coin tossing experiment, $p(\texttt{tail}) = 0.4$ should be interpreted as the belief that there is a 40% chance that $\texttt{tail}$ comes up if the coin were tossed.
    • Under the Bayesian interpretation, PT calculus (sum and product rules) extends boolean logic to rational reasoning with uncertainty.
  • The Bayesian interpretation contrasts with the frequentist interpretation of a probability as the relative frequency that an event would occur under repeated execution of an experiment.

    • For instance, if the experiment is tossing a coin, then $p(\texttt{tail}) = 0.4$ means that in the limit of a large number of coin tosses, 40% of outcomes turn up as $\texttt{tail}$.
  • The Bayesian viewpoint is more generally applicable than the frequentist viewpoint, e.g., it is hard to apply the frequentist viewpoint to events like '$\texttt{it will rain tomorrow}$'.
  • The Bayesian viewpoint is clearly favored in the machine learning community. (In this class, we also strongly favor the Bayesian interpretation).

Probability Theory Notation

  • Define an event $A$ as a statement, whose truth can be contemplated by a person, e.g.,
$$A = \text{'it will rain tomorrow'}$$
  • We write the denial of $A$, i.e. the event not-A, as $\bar{A}$.
  • Given two events $A$ and $B$, we write the conjunction "$A \wedge B$" as "$A,B$" or "$AB$". The conjunction $AB$ is true only if both $A$ and $B$ are true.
  • We will write the disjunction "$A \lor B$" as "$A + B$", which is true if either $A$ or $B$ is true or both $A$ and $B$ are true.
  • Note that, if $X$ is a variable, then an assignment $X=x$ (with $x$ a value, e.g., $X=5$) can be interpreted as an event.
  • For any event $A$, with background knowledge $I$, the conditional probability of $A$ given $I$, is written as $$p(A|I)\,.$$
  • All probabilities are in principle conditional probabilities of the type $p(A|I)$, since there is always some background knowledge.
Unfortunately, PT notation is usually rather sloppy :(
  • We often write $p(A)$ rather than $p(A|I)$ if the background knowledge $I$ is assumed to be obviously present. E.g., $p(A)$ rather than $p(\,A\,|\,\text{the-sun-comes-up-tomorrow}\,)$.
  • (In the context of random variable assignments) we often write $p(x)$ rather than $p(X=x)$, assuming that the reader understands the context.
  • In an apparent effort to further abuse notational conventions, $p(X)$ denotes the full distribution over random variable $X$, i.e., the distribution for all assignments for $X$.
  • If $X$ is a discretely valued variable, then $p(X=x)$ is a probability mass function (PMF) with $0\le p(X=x)\le 1$ and normalization $\sum_x p(x) =1$.
  • If $X$ is continuously valued, then $p(X=x)$ is a probability density function (PDF) with $p(X=x)\ge 0$ and normalization $\int_x p(x)\mathrm{d}x=1$.
    • Note that if $X$ is continuously valued, then the value of the PDF $p(x)$ is not necessarily $\le 1$. E.g., a uniform distribution on the continuous domain $[0,.5]$ has value $p(x) = 2$.
  • Often, we do not bother to specify if $p(x)$ refers to a continuous or discrete variable.
  • Let $p(A|I)$ indicate the belief in event $A$, given that $I$ is true.
  • The following product and sum rules are also known as the axioms of probability theory, but as discussed above, under some mild assumptions, they can be derived as the unique rules for rational reasoning under uncertainty (Cox theorem, 1946, and Caticha, 2012, pp.7-26).
  • Sum rule. The disjunction for two events $A$ and $B$ given background $I$ is given by $$ \boxed{p(A+B|I) = p(A|I) + p(B|I) - p(A,B|I)}$$
  • Product rule. The conjuction of two events $A$ and $B$ with given background $I$ is given by $$ \boxed{p(A,B|I) = p(A|B,I)\,p(B|I)}$$
  • All legitimate probabilistic relations can be derived from the sum and product rules!

Independent and Mutually Exclusive Events

  • Two events $A$ and $B$ are said to be independent if the probability of one is not altered by information about the truth of the other, i.e., $p(A|B) = p(A)$
    • $\Rightarrow$ If $A$ and $B$ are independent, given $I$, then the product rule simplifies to $$p(A,B|I) = p(A|I) p(B|I)$$
  • Two events $A_1$ and $A_2$ are said to be mutually exclusive if they cannot be true simultanously, i.e., if $p(A_1,A_2)=0$.
    • $\Rightarrow$ For mutually exclusive events, the sum rule simplifies to $$p(A_1+A_2) = p(A_1) + p(A_2)$$
  • A set of events $A_1, A_2, \ldots, A_N$ is said to be collectively exhaustive if one of the statements is necessarily true, i.e., $A_1+A_2+\cdots +A_N=\mathrm{TRUE}$, or equivalently $$p(A_1+A_2+\cdots +A_N)=1$$
  • Note that, if $A_1, A_2, \ldots, A_n$ are both mutually exclusive and collectively exhausitive (MECE) events, then $$\sum_{n=1}^N p(A_n) = p(A_1 + \ldots + A_N) = 1$$
    • More generally, if $\{A_n\}$ are MECE events, then $\sum_{n=1}^N p(A_n,B) = p(B)$

The Sum Rule and Marginalization

  • We mentioned that every inference problem in PT can be evaluated through the sum and product rules. Next, we present two useful corollaries: (1) Marginalization and (2) Bayes rule
  • If $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ are random variables over finite domains, than it follows from the above considerations about MECE events that $$ \sum_{Y\in \mathcal{Y}} p(X,Y) = p(X) \,. $$
  • Summing $Y$ out of a joint distribution $p(X,Y)$ is called marginalization and the result $p(X)$ is sometimes referred to as the marginal probability.
  • Note that this is just a generalized sum rule. In fact, Bishop (p.14) (and some other authors as well) calls this the sum rule.
  • Of course, in the continuous domain, the (generalized) sum rule becomes $$p(X)=\int p(X,Y) \,\mathrm{d}Y$$

The Product Rule and Bayes Rule

  • Consider two variables $D$ and $\theta$; it follows from symmetry arguments that $$p(D,\theta)=p(D|\theta)p(\theta)=p(\theta|D)p(D)$$ and hence that $$ p(\theta|D) = \frac{p(D|\theta) }{p(D)}p(\theta)\,.$$
  • This formula is called Bayes rule (or Bayes theorem). While Bayes rule is always true, a particularly useful application occurs when $D$ refers to an observed data set and $\theta$ is set of model parameters. In that case,

    • the prior probability $p(\theta)$ represents our state-of-knowledge about proper values for $\theta$, before seeing the data $D$.
    • the posterior probability $p(\theta|D)$ represents our state-of-knowledge about $\theta$ after we have seen the data.

$\Rightarrow$ Bayes rule tells us how to update our knowledge about model parameters when facing new data. Hence,

Bayes rule is the fundamental rule for learning from data!

Bayes Rule Nomenclature

  • Some nomenclature associated with Bayes rule: $$ \underbrace{p(\theta | D)}_{\text{posterior}} = \frac{\overbrace{p(D|\theta)}^{\text{likelihood}} \times \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(D)}_{\text{evidence}}} $$
  • Note that the evidence (a.k.a. marginal likelihood ) can be computed from the numerator through marginalization since $$ p(D) = \int p(D,\theta) \,\mathrm{d}\theta = \int p(D|\theta)\,p(\theta) \,\mathrm{d}\theta$$
  • Hence, having access to likelihood and prior is in principle sufficient to compute both the evidence and the posterior. To emphasize that point, Bayes rule is sometimes written as a transformation:
$$ \underbrace{\underbrace{p(\theta|D)}_{\text{posterior}}\cdot \underbrace{p(D)}_{\text{evidence}}}_{\text{this is what we want to compute}} = \underbrace{\underbrace{p(D|\theta)}_{\text{likelihood}}\cdot \underbrace{p(\theta)}_{\text{prior}}}_{\text{this is available}}$$

  • For a given data set $D$, the posterior probabilities of the parameters scale relatively against each other as
$$ p(\theta|D) \propto p(D|\theta) p(\theta) $$
  • $\Rightarrow$ All that we can learn from the observed data is contained in the likelihood function $p(D|\theta)$. This is called the likelihood principle.

The Likelihood Function vs the Sampling Distribution

  • Consider a distribution $p(D|\theta)$, where $D$ relates to variables that are observed (i.e., a "data set") and $\theta$ are model parameters.
  • In general, $p(D|\theta)$ is just a function of the two variables $D$ and $\theta$. We distinguish two interpretations of this function, depending on which variable is observed (or given by other means).
  • The sampling distribution (a.k.a. the data-generating distribution) $$p(D|\theta=\theta_0)$$ (which is a function of $D$ only) describes a probability distribution for data $D$, assuming that it is generated by the given model with parameters fixed at $\theta = \theta_0$.
  • In a machine learning context, often the data is observed, and $\theta$ is the free variable. In that case, for given observations $D=D_0$, the likelihood function (which is a function only of the model parameters $\theta$) is defined as $$\mathrm{L}(\theta) \triangleq p(D=D_0|\theta)$$
  • Note that $\mathrm{L}(\theta)$ is not a probability distribution for $\theta$ since in general $\sum_\theta \mathrm{L}(\theta) \neq 1$.

Code Example: Sampling Distribution and Likelihood Function for the Coin Toss

Consider the following simple model for the outcome (head or tail) $y \in \{0,1\}$ of a biased coin toss with parameter $\theta \in [0,1]$:

$$\begin{align*} p(y|\theta) &\triangleq \theta^y (1-\theta)^{1-y}\\ \end{align*}$$

We can plot both the sampling distribution $p(y|\theta=0.8)$ and the likelihood function $L(\theta) = p(y=0|\theta)$.

In [5]:
using Pkg; Pkg.activate("probprog/workspace");Pkg.instantiate();
In [6]:
using PyPlot
#using Plots
p(y,θ) = θ.^y .* (1 .- θ).^(1 .- y)
f = figure()

θ = 0.5 # Set parameter
# Plot the sampling distribution
subplot(221); stem([0,1], p([0,1],θ)); 
title("Sampling distribution");
xlim([-0.5,1.5]); ylim([0,1]); xlabel("y"); ylabel("p(y|θ=$(θ))");

 = 0:0.01:1
y = 1.0 # Plot p(y=1 | θ)
title("Likelihood function"); 
ylabel("L(θ) = p(y=$y)|θ)");

The (discrete) sampling distribution is a valid probability distribution. However, the likelihood function $L(\theta)$ clearly isn't, since $\int_0^1 L(\theta) \mathrm{d}\theta \neq 1$.

Probabilistic Inference

  • Probabilistic inference refers to computing $$ p(\,\text{whatever-we-want-to-know}\, | \,\text{whatever-we-already-know}\,) $$
    • For example: $$\begin{align*} p(\,\text{Mr.S.-killed-Mrs.S.} \;&|\; \text{he-has-her-blood-on-his-shirt}\,) \\ p(\,\text{transmitted-codeword} \;&|\;\text{received-codeword}\,) \end{align*}$$
  • This can be accomplished by repeated application of sum and product rules.
  • In particular, consider a joint distribution $p(X,Y,Z)$. Assume we are interested in $p(X|Z)$: $$\begin{align*} p(X|Z) \stackrel{p}{=} \frac{p(X,Z)}{p(Z)} \stackrel{s}{=} \frac{\sum_Y p(X,Y,Z)}{\sum_{X,Y} p(X,Y,Z)} \,, \end{align*}$$ where the 's' and 'p' above the equality sign indicate whether the sum or product rule was used.
  • In the rest of this course, we'll encounter many long probabilistic derivations. For each manipulation, you should be able to associate an 's' (for sum rule), a 'p' (for product or Bayes rule) or an 'm' (for a simplifying model assumption) above any equality sign.

Working out the example problem: Disease Diagnosis

  • Problem: Given a disease $D$ with prevalence of $1\%$ and a test procedure $T$ with sensitivity ('true positive' rate) of $95\%$ and specificity ('true negative' rate) of $85\%$, what is the chance that somebody who tests positive actually has the disease?
  • Solution: The given data are $p(D=1)=0.01$, $p(T=1|D=1)=0.95$ and $p(T=0|D=0)=0.85$. Then according to Bayes rule,
$$\begin{align*} p( D=1 &| T=1) \\ &\stackrel{p}{=} \frac{p(T=1|D=1)p(D=1)}{p(T=1)} \\ &\stackrel{s}{=} \frac{p(T=1|D=1)p(D=1)}{p(T=1|D=1)p(D=1)+p(T=1|D=0)p(D=0)} \\ &= \frac{0.95\times0.01}{0.95\times0.01 + 0.15\times0.99} = 0.0601 \end{align*}$$

Inference Exercise: Bag Counter

  • Problem: A bag contains one ball, known to be either white or black. A white ball is put in, the bag is shaken, and a ball is drawn out, which proves to be white. What is now the chance of drawing a white ball?
  • Solution: Again, use Bayes and marginalization to arrive at $p(\text{white}|\text{data})=2/3$, see the Exercises notebook.

  • $\Rightarrow$ Note that probabilities describe a person's state of knowledge rather than a 'property of nature'.

Inference Exercise: Causality?

  • Problem: A dark bag contains five red balls and seven green ones. (a) What is the probability of drawing a red ball on the first draw? Balls are not returned to the bag after each draw. (b) If you know that on the second draw the ball was a green one, what is now the probability of drawing a red ball on the first draw?
  • Solution: (a) $5/12$. (b) $5/11$, see the Exercises notebook.

  • $\Rightarrow$ Again, we conclude that conditional probabilities reflect implications for a state of knowledge rather than temporal causality.

Moments of the PDF

  • Consider a distribution $p(x)$. The expected value or mean is defined as $$\mu_x = \mathbb{E}[x] \triangleq \int x \,p(x) \,\mathrm{d}{x}$$
  • The variance of $x$ is defined as $$\Sigma_x \triangleq \mathbb{E} \left[(x-\mu_x)(x-\mu_x)^T \right]$$
  • The covariance matrix between vectors $x$ and $y$ is defined as $$\begin{align*} \Sigma_{xy} &\triangleq \mathbb{E}\left[ (x-\mu_x) (y-\mu_y)^T \right]\\ &= \mathbb{E}\left[ (x-\mu_x) (y^T-\mu_y^T) \right]\\ &= \mathbb{E}[x y^T] - \mu_x \mu_y^T \end{align*}$$
    • Clearly, if $x$ and $y$ are independent, then $\Sigma_{xy} = 0$, since $\mathbb{E}[x y^T] = \mathbb{E}[x] \mathbb{E}[y^T] = \mu_x \mu_y^T$.

Linear Transformations

  • Consider an arbitrary distribution $p(X)$ with mean $\mu_x$ and variance $\Sigma_x$ and the linear transformation $$Z = A X + b \,.$$

  • No matter the specification of $p(X)$, we can derive that (see Exercises notebook) $$\begin{align} \mu_z &= A\mu_x + b \tag{SRG-3a}\\ \Sigma_z &= A\,\Sigma_x\,A^T \tag{SRG-3b} \end{align}$$

    • (The tag (SRG-3a) refers to the corresponding eqn number in Sam Roweis Gaussian identities notes.)

PDF for the Sum of Two Variables

  • Given eqs SRG-3a and SRG-3b (previous cell), you should now be able to derive the following: for any distribution of variable $X$ and $Y$ and sum $Z = X+Y$ (proof by Exercise)
$$\begin{align*} \mu_z &= \mu_x + \mu_y \\ \Sigma_z &= \Sigma_x + \Sigma_y + 2\Sigma_{xy} \end{align*}$$
  • Clearly, it follows that if $X$ and $Y$ are independent, then
$$\Sigma_z = \Sigma_x + \Sigma_y $$
  • More generally, given two independent variables $X$ and $Y$, with PDF's $p_x(x)$ and $p_y(y)$. The PDF $p_z(z)$for $Z=X+Y$ is given by the convolution
$$ p_z (z) = \int_{ - \infty }^\infty {p_x (x)p_y (z - x)\,\mathrm{d}{x}} $$

  • Proof: Let $p_z(z)$ be the probability that $Z$ has value $z$. This occurs if $X$ has some value $x$ and at the same time $Y=z-x$, with joint probability $p_x(x)p_y(z-x)$. Since $x$ can be any value, we sum over all possible values for $x$ to get $ p_z (z) = \int_{ - \infty }^\infty {p_x (x)p_y (z - x)\,\mathrm{d}{x}} $

    • Note that $p_z(z) \neq p_x(x) + p_y(y)\,$ !!
  • In linear stochastic systems theory, the Fourier Transform of a PDF (i.e., the characteristic function) plays an important computational role. Why?

Code Example: Sum of Two Gaussian Distributed Variables

  • Consider the PDF of the sum of two independent Gaussian distributed $X$ and $Y$:
$$\begin{align*} p_X(x) &= \mathcal{N}(\,x\,|\,\mu_X,\sigma_X^2\,) \\ p_Y(y) &= \mathcal{N}(\,y\,|\,\mu_Y,\sigma_Y^2\,) \end{align*}$$
  • Let $Z = X + Y$. Performing the convolution (nice exercise) yields a Gaussian PDF for $Z$:
$$ p_Z(z) = \mathcal{N}(\,z\,|\,\mu_X+\mu_Y,\sigma_X^2+\sigma_Y^2\,). $$
In [7]:
using PyPlot, Distributions
μx = 2.
σx = 1.
μy = 2.
σy = 0.5
μz = μx+μy; σz = sqrt(σx^2 + σy^2)
x = Normal(μx, σx)
y = Normal(μy, σy)
z = Normal(μz, σz)
range_min = minimum([μx-2*σx, μy-2*σy, μz-2*σz])
range_max = maximum([μx+2*σx, μy+2*σy, μz+2*σz])
range_grid = range(range_min, stop=range_max, length=100)
plot(range_grid, pdf.(x,range_grid), "k-")
plot(range_grid, pdf.(y,range_grid), "b-")
plot(range_grid, pdf.(z,range_grid), "r-")
legend([L"p_X", L"p_Y", L"p_Z"])

PDF for the Product of Two Variables

  • For two continuous random independent variables $X$ and $Y$, with PDF's $p_x(x)$ and $p_y(y)$, the PDF of $Z = X Y $ is given by $$ p_z(z) = \int_{-\infty}^{\infty} p_x(x) \,p_y(z/x)\, \frac{1}{|x|}\,\mathrm{d}x $$

  • For proof, see

Variable Transformations

  • Suppose $x$ is a discrete random variable with probability mass function $P_x(x)$, and $y = h(x)$ is a one-to-one function with $x = g(y) = h^{-1}(y)$. Then
$$ P_y(y) = P_x(g(y))\,. $$
  • Proof: $P_y(\hat{y}) = P(y=\hat{y}) = P(h(x)=\hat{y}) = P(x=g(\hat{y})) = P_x(g(\hat{y})). \,\square$
  • If $x$ is defined on a continuous domain, and $p_x(x)$ is a probability density function, then probability mass is represented by the area under a (density) curve. Let $a=g(c)$ and $b=g(d)$. Then $$\begin{align*} P(a ≤ x ≤ b) &= \int_a^b p_x(x)\mathrm{d}x \\ &= \int_{g(c)}^{g(d)} p_x(x)\mathrm{d}x \\ &= \int_c^d p_x(g(y))\mathrm{d}g(y) \\ &= \int_c^d \underbrace{p_x(g(y)) g^\prime(y)}_{p_y(y)}\mathrm{d}y \\ &= P(c ≤ y ≤ d) \end{align*}$$

  • Equating the two probability masses leads to identificaiton of the relation $$p_y(y) = p_x(g(y)) g^\prime(y)\,,$$ which is also known as the Change-of-Variable theorem.

  • If the tranformation $y = h(x)$ is not invertible, then $x=g(y)$ does not exist. In that case, you can still work out the transformation by equating equivalent probability masses in the two domains.

Example: Transformation of a Gaussian Variable

  • Let $p_x(x) = \mathcal{N}(x|\mu,\sigma^2)$ and $y = \frac{x-\mu}{\sigma}$.

  • Problem: What is $p_y(y)$?

  • Solution: Note that $h(x)$ is invertible with $x = g(y) = \sigma y + \mu$. The change-of-variable formula leads to $$\begin{align*} p_y(y) &= p_x(g(y)) \cdot g^\prime(y) \\ &= p_x(\sigma y + \mu) \cdot \sigma \\ &= \frac{1}{\sigma\sqrt(2 \pi)} \exp\left( - \frac{(\sigma y + \mu - \mu)^2}{2\sigma^2}\right) \cdot \sigma \\ &= \frac{1}{\sqrt(2 \pi)} \exp\left( - \frac{y^2 }{2}\right)\\ &= \mathcal{N}(y|0,1) \end{align*}$$


  • Probabilities should be interpretated as degrees of belief, i.e., a state-of-knowledge, rather than a property of nature.
  • We can do everything with only the sum rule and the product rule. In practice, Bayes rule and marginalization are often very useful for inference, i.e., for computing
  • Bayes rule $$ p(\theta|D) = \frac{p(D|\theta)p(\theta)} {p(D)} $$ is the fundamental rule for learning from data!
  • For a variable $X$ with distribution $p(X)$ with mean $\mu_x$ and variance $\Sigma_x$, the mean and variance of the Linear Transformation $Z = AX +b$ is given by $$\begin{align} \mu_z &= A\mu_x + b \tag{SRG-3a}\\ \Sigma_z &= A\,\Sigma_x\,A^T \tag{SRG-3b} \end{align}$$
  • That's really about all you need to know about probability theory, but you need to really know it, so do the Exercises.
In [8]:
open("../../styles/aipstyle.html") do f
    display("text/html", read(f,String))
In [ ]: