- Goal
- Review of probability theory as a theory for rational/logical reasoning with uncertainties (i.e., a Bayesian interpretation)

Materials

Mandatory

- These lecture notes
- Ariel Caticha - 2012 - Entropic Inference and the Foundations of Physics, pp.7-26 (sections 2.1 through 2.5), on deriving probability theory. You may skip section 2.3.4: Cox's proof (pp.15-18).
- The assignment is only meant to appreciate how this line of "axiomatic derivation" of the rules of PT goes. I will not ask questions about any details of the derivations at the exam.

Optional

- the pre-recorded video guide and live class of 2020
- Ariel Caticha - 2012 - Entropic Inference and the Foundations of Physics, pp.7-56 (ch.2: probability)
- Great introduction to probability theory, in particular w.r.t. its correct interpretation as a state-of-knowledge.
- Absolutely worth your time to read the whole chapter!

- Edwin Jaynes - 2003 - Probability Theory -- The Logic of Science.
- Brilliant book on Bayesian view on probability theory. Just for fun, scan the annotated bibliography and references.

- Bishop pp. 12-24

**Problem**: Given a disease with prevalence of 1% and a test procedure with sensitivity ('true positive' rate) of 95% and specificity ('true negative' rate) of 85% , what is the chance that somebody who tests positive actually has the disease?

**Solution**: Use probabilistic inference, to be discussed in this lecture.

- Define an
**event**(or "proposition") $A$ as a statement, whose truth can be contemplated by a person, e.g.,

- If we assume the fact $$I = \texttt{'All known life forms require water'}$$ and a new piece of information $$x = \texttt{'There is water on Mars'}$$ becomes available, how
*should*our degree of belief in event $A$ be affected (if we were rational)?

- Richard T. Cox, 1946 developed a
**calculus for rational reasoning**about how to represent and update the degree of*beliefs*about the truth value of events when faced with new information.

- In developing this calculus, only some very agreeable assumptions were made, e.g.,
- (Transitivity). If the belief in $A$ is greater than the belief in $B$, and the belief in $B$ is greater than the belief in $C$, then the belief in $A$ must be greater than the belief in $C$.
- (Consistency). If the belief in an event can be inferred in two different ways, then the two ways must agree on the resulting belief.

- This effort resulted in confirming that the sum and product rules of Probability Theory are the
**only**proper rational way to process belief intensities.

- $\Rightarrow$ Probability theory (PT) provides
*the***theory of optimal processing of incomplete information**(see Cox theorem, and Caticha, pp.7-24), and as such provides the (only) proper quantitative framework for drawing conclusions from a finite (read: incomplete) data set.

- Machine learning concerns drawing conclusions about model parameter settings from (a finite set of) data and therefore PT provides the
*optimal calculus for machine learning*.

- In general, nearly all interesting questions in machine learning can be stated in the following form (a conditional probability):

where $p(a|b)$ means the probability that $a$ is true, given that $b$ is true.

- Examples
- Predictions $$p(\,\texttt{future-observations}\,|\,\texttt{past-observations}\,)$$
- Classify a received data point $x$ $$p(\,x\texttt{-belongs-to-class-}k \,|\,x\,)$$
- Update a model based on a new observation $$p(\,\texttt{model-parameters} \,|\,\texttt{new-observation},\,\texttt{past-observations}\,)$$

- The interpretation of a probability as a
**degree-of-belief**about the truth value of an event is also called the**Bayesian**interpretation.

- In the
**Bayesian**interpretation, the probability is associated with a**state-of-knowledge**(usually held by a person).- For instance, in a coin tossing experiment, $p(\texttt{tail}) = 0.4$ should be interpreted as the belief that there is a 40% chance that $\texttt{tail}$ comes up if the coin were tossed.
- Under the Bayesian interpretation, PT calculus (sum and product rules)
**extends boolean logic to rational reasoning with uncertainty**.

The Bayesian interpretation contrasts with the

**frequentist**interpretation of a probability as the relative frequency that an event would occur under repeated execution of an experiment.- For instance, if the experiment is tossing a coin, then $p(\texttt{tail}) = 0.4$ means that in the limit of a large number of coin tosses, 40% of outcomes turn up as $\texttt{tail}$.

- The Bayesian viewpoint is more generally applicable than the frequentist viewpoint, e.g., it is hard to apply the frequentist viewpoint to events like '$\texttt{it will rain tomorrow}$'.

- The Bayesian viewpoint is clearly favored in the machine learning community. (In this class, we also strongly favor the Bayesian interpretation).

- We write the denial of $A$, i.e. the event
**not**-A, as $\bar{A}$.

- Given two events $A$ and $B$, we write the
**conjunction**"$A \wedge B$" as "$A,B$" or "$AB$". The conjunction $AB$ is true only if both $A$ and $B$ are true.

- We will write the
**disjunction**"$A \lor B$" as "$A + B$", which is true if either $A$ or $B$ is true or both $A$ and $B$ are true.

- Note that, if $X$ is a variable, then an assignment $X=x$ (with $x$ a value, e.g., $X=5$) can be interpreted as an event.

- For any event $A$, with background knowledge $I$, the
**conditional probability of $A$ given $I$**, is written as $$p(A|I)\,.$$

- All probabilities are in principle conditional probabilities of the type $p(A|I)$, since there is always some background knowledge.

- We often write $p(A)$ rather than $p(A|I)$ if the background knowledge $I$ is assumed to be obviously present. E.g., $p(A)$ rather than $p(\,A\,|\,\text{the-sun-comes-up-tomorrow}\,)$.

- (In the context of random variable assignments) we often write $p(x)$ rather than $p(X=x)$, assuming that the reader understands the context.

- In an apparent effort to further abuse notational conventions, $p(X)$ denotes the full distribution over random variable $X$, i.e., the distribution for all assignments for $X$.

- If $X$ is a
*discretely*valued variable, then $p(X=x)$ is a probability*mass*function (PMF) with $0\le p(X=x)\le 1$ and normalization $\sum_x p(x) =1$.

- If $X$ is
*continuously*valued, then $p(X=x)$ is a probability*density*function (PDF) with $p(X=x)\ge 0$ and normalization $\int_x p(x)\mathrm{d}x=1$.- Note that if $X$ is continuously valued, then the value of the PDF $p(x)$ is not necessarily $\le 1$. E.g., a uniform distribution on the continuous domain $[0,.5]$ has value $p(x) = 2$.

- Often, we do not bother to specify if $p(x)$ refers to a continuous or discrete variable.

- Let $p(A|I)$ indicate the belief in event $A$, given that $I$ is true.

- The following product and sum rules are also known as the
**axioms of probability theory**, but as discussed above, under some mild assumptions, they can be derived as the unique rules for*rational reasoning under uncertainty*(Cox theorem, 1946, and Caticha, 2012, pp.7-26).

**Sum rule**. The disjunction for two events $A$ and $B$ given background $I$ is given by $$ \boxed{p(A+B|I) = p(A|I) + p(B|I) - p(A,B|I)}$$

**Product rule**. The conjuction of two events $A$ and $B$ with given background $I$ is given by $$ \boxed{p(A,B|I) = p(A|B,I)\,p(B|I)}$$

**All legitimate probabilistic relations can be derived from the sum and product rules!**

- Two events $A$ and $B$ are said to be
**independent**if the probability of one is not altered by information about the truth of the other, i.e., $p(A|B) = p(A)$- $\Rightarrow$ If $A$ and $B$ are independent, given $I$, then the product rule simplifies to $$p(A,B|I) = p(A|I) p(B|I)$$

- Two events $A_1$ and $A_2$ are said to be
**mutually exclusive**if they cannot be true simultanously, i.e., if $p(A_1,A_2)=0$.- $\Rightarrow$ For mutually exclusive events, the sum rule simplifies to $$p(A_1+A_2) = p(A_1) + p(A_2)$$

- A set of events $A_1, A_2, \ldots, A_N$ is said to be
**collectively exhaustive**if one of the statements is necessarily true, i.e., $A_1+A_2+\cdots +A_N=\mathrm{TRUE}$, or equivalently $$p(A_1+A_2+\cdots +A_N)=1$$

- Note that, if $A_1, A_2, \ldots, A_n$ are both
**mutually exclusive**and**collectively exhausitive**(MECE) events, then $$\sum_{n=1}^N p(A_n) = p(A_1 + \ldots + A_N) = 1$$- More generally, if $\{A_n\}$ are MECE events, then $\sum_{n=1}^N p(A_n,B) = p(B)$

- We mentioned that every inference problem in PT can be evaluated through the sum and product rules. Next, we present two useful corollaries: (1)
*Marginalization*and (2)*Bayes rule*

- If $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ are random variables over finite domains, than it follows from the above considerations about MECE events that $$ \sum_{Y\in \mathcal{Y}} p(X,Y) = p(X) \,. $$

- Summing $Y$ out of a joint distribution $p(X,Y)$ is called
**marginalization**and the result $p(X)$ is sometimes referred to as the**marginal probability**.

- Note that this is just a
**generalized sum rule**. In fact, Bishop (p.14) (and some other authors as well) calls this the sum rule.

- Of course, in the continuous domain, the (generalized) sum rule becomes $$p(X)=\int p(X,Y) \,\mathrm{d}Y$$

- Consider two variables $D$ and $\theta$; it follows from symmetry arguments that $$p(D,\theta)=p(D|\theta)p(\theta)=p(\theta|D)p(D)$$ and hence that $$ p(\theta|D) = \frac{p(D|\theta) }{p(D)}p(\theta)\,.$$

This formula is called

**Bayes rule**(or Bayes theorem). While Bayes rule is always true, a particularly useful application occurs when $D$ refers to an observed data set and $\theta$ is set of model parameters. In that case,- the
**prior**probability $p(\theta)$ represents our**state-of-knowledge**about proper values for $\theta$, before seeing the data $D$. - the
**posterior**probability $p(\theta|D)$ represents our state-of-knowledge about $\theta$ after we have seen the data.

- the

$\Rightarrow$ Bayes rule tells us how to update our knowledge about model parameters when facing new data. Hence,

Bayes rule is the fundamental rule for learning from data!

- Some nomenclature associated with Bayes rule: $$ \underbrace{p(\theta | D)}_{\text{posterior}} = \frac{\overbrace{p(D|\theta)}^{\text{likelihood}} \times \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(D)}_{\text{evidence}}} $$

- Note that the evidence (a.k.a.
*marginal likelihood*) can be computed from the numerator through marginalization since $$ p(D) = \int p(D,\theta) \,\mathrm{d}\theta = \int p(D|\theta)\,p(\theta) \,\mathrm{d}\theta$$

- Hence, having access to likelihood and prior is in principle sufficient to compute both the evidence and the posterior. To emphasize that point, Bayes rule is sometimes written as a transformation:

- For a given data set $D$, the posterior probabilities of the parameters scale relatively against each other as

- $\Rightarrow$ All that we can learn from the observed data is contained in the likelihood function $p(D|\theta)$. This is called the
**likelihood principle**.

- Consider a distribution $p(D|\theta)$, where $D$ relates to variables that are observed (i.e., a "data set") and $\theta$ are model parameters.

- In general, $p(D|\theta)$ is just a function of the two variables $D$ and $\theta$. We distinguish two interpretations of this function, depending on which variable is observed (or given by other means).

- The
**sampling distribution**(a.k.a. the**data-generating**distribution) $$p(D|\theta=\theta_0)$$ (which is a function of $D$ only) describes a probability distribution for data $D$, assuming that it is generated by the given model with parameters fixed at $\theta = \theta_0$.

- In a machine learning context, often the data is observed, and $\theta$ is the free variable. In that case, for given observations $D=D_0$, the
**likelihood function**(which is a function only of the model parameters $\theta$) is defined as $$\mathrm{L}(\theta) \triangleq p(D=D_0|\theta)$$

- Note that $\mathrm{L}(\theta)$ is not a probability distribution for $\theta$ since in general $\sum_\theta \mathrm{L}(\theta) \neq 1$.

Consider the following simple model for the outcome (head or tail) $y \in \{0,1\}$ of a biased coin toss with parameter $\theta \in [0,1]$:

$$\begin{align*} p(y|\theta) &\triangleq \theta^y (1-\theta)^{1-y}\\ \end{align*}$$We can plot both the sampling distribution $p(y|\theta=0.8)$ and the likelihood function $L(\theta) = p(y=0|\theta)$.

In [5]:

```
using Pkg; Pkg.activate("probprog/workspace");Pkg.instantiate();
IJulia.clear_output();
```

In [6]:

```
using PyPlot
#using Plots
p(y,θ) = θ.^y .* (1 .- θ).^(1 .- y)
f = figure()
θ = 0.5 # Set parameter
# Plot the sampling distribution
subplot(221); stem([0,1], p([0,1],θ));
title("Sampling distribution");
xlim([-0.5,1.5]); ylim([0,1]); xlabel("y"); ylabel("p(y|θ=$(θ))");
subplot(222);
_θ = 0:0.01:1
y = 1.0 # Plot p(y=1 | θ)
plot(_θ,p(y,_θ))
title("Likelihood function");
xlabel("θ");
ylabel("L(θ) = p(y=$y)|θ)");
```

The (discrete) sampling distribution is a valid probability distribution. However, the likelihood function $L(\theta)$ clearly isn't, since $\int_0^1 L(\theta) \mathrm{d}\theta \neq 1$.

**Probabilistic inference**refers to computing $$ p(\,\text{whatever-we-want-to-know}\, | \,\text{whatever-we-already-know}\,) $$- For example: $$\begin{align*} p(\,\text{Mr.S.-killed-Mrs.S.} \;&|\; \text{he-has-her-blood-on-his-shirt}\,) \\ p(\,\text{transmitted-codeword} \;&|\;\text{received-codeword}\,) \end{align*}$$

- This can be accomplished by repeated application of sum and product rules.

- In particular, consider a joint distribution $p(X,Y,Z)$. Assume we are interested in $p(X|Z)$: $$\begin{align*} p(X|Z) \stackrel{p}{=} \frac{p(X,Z)}{p(Z)} \stackrel{s}{=} \frac{\sum_Y p(X,Y,Z)}{\sum_{X,Y} p(X,Y,Z)} \,, \end{align*}$$ where the 's' and 'p' above the equality sign indicate whether the sum or product rule was used.

- In the rest of this course, we'll encounter many long probabilistic derivations. For each manipulation, you should be able to associate an 's' (for sum rule), a 'p' (for product or Bayes rule) or an 'm' (for a simplifying model assumption) above any equality sign.

**Problem**: Given a disease $D$ with prevalence of $1\%$ and a test procedure $T$ with sensitivity ('true positive' rate) of $95\%$ and specificity ('true negative' rate) of $85\%$, what is the chance that somebody who tests positive actually has the disease?

**Solution**: The given data are $p(D=1)=0.01$, $p(T=1|D=1)=0.95$ and $p(T=0|D=0)=0.85$. Then according to Bayes rule,

**Problem**: A bag contains one ball, known to be either white or black. A white ball is put in, the bag is shaken, and a ball is drawn out, which proves to be white. What is now the chance of drawing a white ball?

**Solution**: Again, use Bayes and marginalization to arrive at $p(\text{white}|\text{data})=2/3$, see the Exercises notebook.$\Rightarrow$ Note that probabilities describe

**a person's state of knowledge**rather than a 'property of nature'.

**Problem**: A dark bag contains five red balls and seven green ones. (a) What is the probability of drawing a red ball on the first draw? Balls are not returned to the bag after each draw. (b) If you know that on the second draw the ball was a green one, what is now the probability of drawing a red ball on the first draw?

**Solution**: (a) $5/12$. (b) $5/11$, see the Exercises notebook.$\Rightarrow$ Again, we conclude that conditional probabilities reflect

**implications for a state of knowledge**rather than temporal causality.

- Consider a distribution $p(x)$. The
**expected value**or**mean**is defined as $$\mu_x = \mathbb{E}[x] \triangleq \int x \,p(x) \,\mathrm{d}{x}$$

- The
**variance**of $x$ is defined as $$\Sigma_x \triangleq \mathbb{E} \left[(x-\mu_x)(x-\mu_x)^T \right]$$

- The
**covariance**matrix between*vectors*$x$ and $y$ is defined as $$\begin{align*} \Sigma_{xy} &\triangleq \mathbb{E}\left[ (x-\mu_x) (y-\mu_y)^T \right]\\ &= \mathbb{E}\left[ (x-\mu_x) (y^T-\mu_y^T) \right]\\ &= \mathbb{E}[x y^T] - \mu_x \mu_y^T \end{align*}$$- Clearly, if $x$ and $y$ are independent, then $\Sigma_{xy} = 0$, since $\mathbb{E}[x y^T] = \mathbb{E}[x] \mathbb{E}[y^T] = \mu_x \mu_y^T$.

Consider an arbitrary distribution $p(X)$ with mean $\mu_x$ and variance $\Sigma_x$ and the linear transformation $$Z = A X + b \,.$$

No matter the specification of $p(X)$, we can derive that (see Exercises notebook) $$\begin{align} \mu_z &= A\mu_x + b \tag{SRG-3a}\\ \Sigma_z &= A\,\Sigma_x\,A^T \tag{SRG-3b} \end{align}$$

- (The tag (SRG-3a) refers to the corresponding eqn number in Sam Roweis Gaussian identities notes.)

- Given eqs SRG-3a and SRG-3b (previous cell), you should now be able to derive the following: for any distribution of variable $X$ and $Y$ and sum $Z = X+Y$ (proof by Exercise)

- Clearly, it follows that if $X$ and $Y$ are
**independent**, then

- More generally, given two
**independent**variables $X$ and $Y$, with PDF's $p_x(x)$ and $p_y(y)$. The PDF $p_z(z)$for $Z=X+Y$ is given by the**convolution**

**Proof**: Let $p_z(z)$ be the probability that $Z$ has value $z$. This occurs if $X$ has some value $x$ and at the same time $Y=z-x$, with joint probability $p_x(x)p_y(z-x)$. Since $x$ can be any value, we sum over all possible values for $x$ to get $ p_z (z) = \int_{ - \infty }^\infty {p_x (x)p_y (z - x)\,\mathrm{d}{x}} $- Note that $p_z(z) \neq p_x(x) + p_y(y)\,$ !!

- https://en.wikipedia.org/wiki/List_of_convolutions_of_probability_distributions shows how these convolutions work out for a few common probability distributions.

- In linear stochastic systems theory, the Fourier Transform of a PDF (i.e., the characteristic function) plays an important computational role. Why?

- Consider the PDF of the sum of two independent Gaussian distributed $X$ and $Y$:

- Let $Z = X + Y$. Performing the convolution (nice exercise) yields a Gaussian PDF for $Z$:

In [7]:

```
using PyPlot, Distributions
μx = 2.
σx = 1.
μy = 2.
σy = 0.5
μz = μx+μy; σz = sqrt(σx^2 + σy^2)
x = Normal(μx, σx)
y = Normal(μy, σy)
z = Normal(μz, σz)
range_min = minimum([μx-2*σx, μy-2*σy, μz-2*σz])
range_max = maximum([μx+2*σx, μy+2*σy, μz+2*σz])
range_grid = range(range_min, stop=range_max, length=100)
plot(range_grid, pdf.(x,range_grid), "k-")
plot(range_grid, pdf.(y,range_grid), "b-")
plot(range_grid, pdf.(z,range_grid), "r-")
legend([L"p_X", L"p_Y", L"p_Z"])
grid()
```

For two continuous random

**independent**variables $X$ and $Y$, with PDF's $p_x(x)$ and $p_y(y)$, the PDF of $Z = X Y $ is given by $$ p_z(z) = \int_{-\infty}^{\infty} p_x(x) \,p_y(z/x)\, \frac{1}{|x|}\,\mathrm{d}x $$For proof, see https://en.wikipedia.org/wiki/Product_distribution

- Generally, this integral does not lead to an analytical expression for $p_z(z)$. For example,
**the product of two independent variables that are both normally (Gaussian) distributed does not lead to a normal distribution**.- Exception: the distribution of the product of two variables that both have log-normal distributions is again a lognormal distribution.
- (If $X$ has a normal distribution, then $Y=\exp(X)$ has a log-normal distribution.)

- Exception: the distribution of the product of two variables that both have log-normal distributions is again a lognormal distribution.

- Suppose $x$ is a
**discrete**random variable with probability**mass**function $P_x(x)$, and $y = h(x)$ is a one-to-one function with $x = g(y) = h^{-1}(y)$. Then

**Proof**: $P_y(\hat{y}) = P(y=\hat{y}) = P(h(x)=\hat{y}) = P(x=g(\hat{y})) = P_x(g(\hat{y})). \,\square$

If $x$ is defined on a

**continuous**domain, and $p_x(x)$ is a probability**density**function, then probability mass is represented by the area under a (density) curve. Let $a=g(c)$ and $b=g(d)$. Then $$\begin{align*} P(a ≤ x ≤ b) &= \int_a^b p_x(x)\mathrm{d}x \\ &= \int_{g(c)}^{g(d)} p_x(x)\mathrm{d}x \\ &= \int_c^d p_x(g(y))\mathrm{d}g(y) \\ &= \int_c^d \underbrace{p_x(g(y)) g^\prime(y)}_{p_y(y)}\mathrm{d}y \\ &= P(c ≤ y ≤ d) \end{align*}$$Equating the two probability masses leads to identificaiton of the relation $$p_y(y) = p_x(g(y)) g^\prime(y)\,,$$ which is also known as the Change-of-Variable theorem.

If the tranformation $y = h(x)$ is not invertible, then $x=g(y)$ does not exist. In that case, you can still work out the transformation by equating equivalent probability masses in the two domains.

Let $p_x(x) = \mathcal{N}(x|\mu,\sigma^2)$ and $y = \frac{x-\mu}{\sigma}$.

**Problem**: What is $p_y(y)$?**Solution**: Note that $h(x)$ is invertible with $x = g(y) = \sigma y + \mu$. The change-of-variable formula leads to $$\begin{align*} p_y(y) &= p_x(g(y)) \cdot g^\prime(y) \\ &= p_x(\sigma y + \mu) \cdot \sigma \\ &= \frac{1}{\sigma\sqrt(2 \pi)} \exp\left( - \frac{(\sigma y + \mu - \mu)^2}{2\sigma^2}\right) \cdot \sigma \\ &= \frac{1}{\sqrt(2 \pi)} \exp\left( - \frac{y^2 }{2}\right)\\ &= \mathcal{N}(y|0,1) \end{align*}$$

- Probabilities should be interpretated as degrees of belief, i.e., a state-of-knowledge, rather than a property of nature.

- We can do everything with only the
**sum rule**and the**product rule**. In practice,**Bayes rule**and**marginalization**are often very useful for inference, i.e., for computing

- Bayes rule $$ p(\theta|D) = \frac{p(D|\theta)p(\theta)} {p(D)} $$ is the fundamental rule for learning from data!

- For a variable $X$ with distribution $p(X)$ with mean $\mu_x$ and variance $\Sigma_x$, the mean and variance of the
**Linear Transformation**$Z = AX +b$ is given by $$\begin{align} \mu_z &= A\mu_x + b \tag{SRG-3a}\\ \Sigma_z &= A\,\Sigma_x\,A^T \tag{SRG-3b} \end{align}$$

- That's really about all you need to know about probability theory, but you need to
*really*know it, so do the Exercises.

In [8]:

```
open("../../styles/aipstyle.html") do f
display("text/html", read(f,String))
end
```

In [ ]:

```
```