Abstract: In this talk we review the probabilistic approach to machine learning. We start with a review of probability, and introduce the concepts of probabilistic modelling. We then apply the approach in practice to Naive Bayesian classification. In this session we review the Bayesian formalism in the context of linear models, reviewing initially maximum likelihood and introducing basis functions as a way of driving non-linearity in the model.
$$ \newcommand{\Amatrix}{\mathbf{A}} \newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)} \newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}} \newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}} \newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}} \newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}} \newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}} \newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}} \newcommand{\Kuui}{\Kuu^{-1}} \newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}} \newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}} \newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}} \newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\aMatrix}{\mathbf{A}} \newcommand{\aScalar}{a} \newcommand{\aVector}{\mathbf{a}} \newcommand{\acceleration}{a} \newcommand{\bMatrix}{\mathbf{B}} \newcommand{\bScalar}{b} \newcommand{\bVector}{\mathbf{b}} \newcommand{\basisFunc}{\phi} \newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}} \newcommand{\basisFunction}{\phi} \newcommand{\basisLocation}{\mu} \newcommand{\basisMatrix}{\boldsymbol{ \Phi}} \newcommand{\basisScalar}{\basisFunction} \newcommand{\basisVector}{\boldsymbol{ \basisFunction}} \newcommand{\activationFunction}{\phi} \newcommand{\activationMatrix}{\boldsymbol{ \Phi}} \newcommand{\activationScalar}{\basisFunction} \newcommand{\activationVector}{\boldsymbol{ \basisFunction}} \newcommand{\bigO}{\mathcal{O}} \newcommand{\binomProb}{\pi} \newcommand{\cMatrix}{\mathbf{C}} \newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}} \newcommand{\cdataMatrix}{\hat{\dataMatrix}} \newcommand{\cdataScalar}{\hat{\dataScalar}} \newcommand{\cdataVector}{\hat{\dataVector}} \newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}} \newcommand{\centeredKernelScalar}{b} 
\newcommand{\centeredKernelVector}{\centeredKernelScalar} \newcommand{\centeringMatrix}{\mathbf{H}} \newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)} \newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}} \newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}} \newcommand{\coregionalizationMatrix}{\mathbf{B}} \newcommand{\coregionalizationScalar}{b} \newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}} \newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)} \newcommand{\covSamp}[1]{\text{cov}\left(#1\right)} \newcommand{\covarianceScalar}{c} \newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}} \newcommand{\covarianceMatrix}{\mathbf{C}} \newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}} \newcommand{\croupierScalar}{s} \newcommand{\croupierVector}{\mathbf{ \croupierScalar}} \newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}} \newcommand{\dataDim}{p} \newcommand{\dataIndex}{i} \newcommand{\dataIndexTwo}{j} \newcommand{\dataMatrix}{\mathbf{Y}} \newcommand{\dataScalar}{y} \newcommand{\dataSet}{\mathcal{D}} \newcommand{\dataStd}{\sigma} \newcommand{\dataVector}{\mathbf{ \dataScalar}} \newcommand{\decayRate}{d} \newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}} \newcommand{\degreeScalar}{d} \newcommand{\degreeVector}{\mathbf{ \degreeScalar}} % Already defined by latex %\newcommand{\det}[1]{\left|#1\right|} \newcommand{\diag}[1]{\text{diag}\left(#1\right)} \newcommand{\diagonalMatrix}{\mathbf{D}} \newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}} \newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}} \newcommand{\displacement}{x} \newcommand{\displacementVector}{\textbf{\displacement}} \newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}} \newcommand{\distanceScalar}{d} \newcommand{\distanceVector}{\mathbf{ \distanceScalar}} \newcommand{\eigenvaltwo}{\ell} \newcommand{\eigenvaltwoMatrix}{\mathbf{L}} \newcommand{\eigenvaltwoVector}{\mathbf{l}} 
\newcommand{\eigenvalue}{\lambda} \newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}} \newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}} \newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}} \newcommand{\eigenvectorMatrix}{\mathbf{U}} \newcommand{\eigenvectorScalar}{u} \newcommand{\eigenvectwo}{\mathbf{v}} \newcommand{\eigenvectwoMatrix}{\mathbf{V}} \newcommand{\eigenvectwoScalar}{v} \newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)} \newcommand{\errorFunction}{E} \newcommand{\expDist}[2]{\left<#1\right>_{#2}} \newcommand{\expSamp}[1]{\left<#1\right>} \newcommand{\expectation}[1]{\left\langle #1 \right\rangle } \newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}} \newcommand{\expectedDistanceMatrix}{\mathcal{D}} \newcommand{\eye}{\mathbf{I}} \newcommand{\fantasyDim}{r} \newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}} \newcommand{\fantasyScalar}{z} \newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}} \newcommand{\featureStd}{\varsigma} \newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)} \newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)} \newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)} \newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)} \newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)} \newcommand{\given}{|} \newcommand{\half}{\frac{1}{2}} \newcommand{\heaviside}{H} \newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}} \newcommand{\hiddenScalar}{h} \newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}} \newcommand{\identityMatrix}{\eye} \newcommand{\inducingInputScalar}{z} \newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}} \newcommand{\inducingInputMatrix}{\mathbf{Z}} \newcommand{\inducingScalar}{u} \newcommand{\inducingVector}{\mathbf{ \inducingScalar}} \newcommand{\inducingMatrix}{\mathbf{U}} \newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2} \newcommand{\inputDim}{q} \newcommand{\inputMatrix}{\mathbf{X}} \newcommand{\inputScalar}{x} 
\newcommand{\inputSpace}{\mathcal{X}} \newcommand{\inputVals}{\inputVector} \newcommand{\inputVector}{\mathbf{ \inputScalar}} \newcommand{\iterNum}{k} \newcommand{\kernel}{\kernelScalar} \newcommand{\kernelMatrix}{\mathbf{K}} \newcommand{\kernelScalar}{k} \newcommand{\kernelVector}{\mathbf{ \kernelScalar}} \newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}} \newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}} \newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}} \newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}} \newcommand{\lagrangeMultiplier}{\lambda} \newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}} \newcommand{\lagrangian}{L} \newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}} \newcommand{\laplacianFactorScalar}{m} \newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}} \newcommand{\laplacianMatrix}{\mathbf{L}} \newcommand{\laplacianScalar}{\ell} \newcommand{\laplacianVector}{\mathbf{ \ell}} \newcommand{\latentDim}{q} \newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}} \newcommand{\latentDistanceScalar}{\delta} \newcommand{\latentDistanceVector}{\boldsymbol{ \delta}} \newcommand{\latentForce}{f} \newcommand{\latentFunction}{u} \newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}} \newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}} \newcommand{\latentIndex}{j} \newcommand{\latentScalar}{z} \newcommand{\latentVector}{\mathbf{ \latentScalar}} \newcommand{\latentMatrix}{\mathbf{Z}} \newcommand{\learnRate}{\eta} \newcommand{\lengthScale}{\ell} \newcommand{\rbfWidth}{\ell} \newcommand{\likelihoodBound}{\mathcal{L}} \newcommand{\likelihoodFunction}{L} \newcommand{\locationScalar}{\mu} \newcommand{\locationVector}{\boldsymbol{ \locationScalar}} \newcommand{\locationMatrix}{\mathbf{M}} \newcommand{\variance}[1]{\text{var}\left( #1 \right)} \newcommand{\mappingFunction}{f} 
\newcommand{\mappingFunctionMatrix}{\mathbf{F}} \newcommand{\mappingFunctionTwo}{g} \newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}} \newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}} \newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}} \newcommand{\scaleScalar}{s} \newcommand{\mappingScalar}{w} \newcommand{\mappingVector}{\mathbf{ \mappingScalar}} \newcommand{\mappingMatrix}{\mathbf{W}} \newcommand{\mappingScalarTwo}{v} \newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}} \newcommand{\mappingMatrixTwo}{\mathbf{V}} \newcommand{\maxIters}{K} \newcommand{\meanMatrix}{\mathbf{M}} \newcommand{\meanScalar}{\mu} \newcommand{\meanTwoMatrix}{\mathbf{M}} \newcommand{\meanTwoScalar}{m} \newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}} \newcommand{\meanVector}{\boldsymbol{ \meanScalar}} \newcommand{\mrnaConcentration}{m} \newcommand{\naturalFrequency}{\omega} \newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)} \newcommand{\neilurl}{http://inverseprobability.com/} \newcommand{\noiseMatrix}{\boldsymbol{ E}} \newcommand{\noiseScalar}{\epsilon} \newcommand{\noiseVector}{\boldsymbol{ \epsilon}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}} \newcommand{\normalizedLaplacianScalar}{\hat{\ell}} \newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}} \newcommand{\numActive}{m} \newcommand{\numBasisFunc}{m} \newcommand{\numComponents}{m} \newcommand{\numComps}{K} \newcommand{\numData}{n} \newcommand{\numFeatures}{K} \newcommand{\numHidden}{h} \newcommand{\numInducing}{m} \newcommand{\numLayers}{\ell} \newcommand{\numNeighbors}{K} \newcommand{\numSequences}{s} \newcommand{\numSuccess}{s} \newcommand{\numTasks}{m} \newcommand{\numTime}{T} \newcommand{\numTrials}{S} \newcommand{\outputIndex}{j} \newcommand{\paramVector}{\boldsymbol{ \theta}} \newcommand{\parameterMatrix}{\boldsymbol{ \Theta}} \newcommand{\parameterScalar}{\theta} \newcommand{\parameterVector}{\boldsymbol{ 
\parameterScalar}} \newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}} \newcommand{\precisionScalar}{j} \newcommand{\precisionVector}{\mathbf{ \precisionScalar}} \newcommand{\precisionMatrix}{\mathbf{J}} \newcommand{\pseudotargetScalar}{\widetilde{y}} \newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}} \newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}} \newcommand{\rank}[1]{\text{rank}\left(#1\right)} \newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)} \newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)} \newcommand{\responsibility}{r} \newcommand{\rotationScalar}{r} \newcommand{\rotationVector}{\mathbf{ \rotationScalar}} \newcommand{\rotationMatrix}{\mathbf{R}} \newcommand{\sampleCovScalar}{s} \newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}} \newcommand{\sampleCovMatrix}{\mathbf{s}} \newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle} \newcommand{\sign}[1]{\text{sign}\left(#1\right)} \newcommand{\sigmoid}[1]{\sigma\left(#1\right)} \newcommand{\singularvalue}{\ell} \newcommand{\singularvalueMatrix}{\mathbf{L}} \newcommand{\singularvalueVector}{\mathbf{l}} \newcommand{\sorth}{\mathbf{u}} \newcommand{\spar}{\lambda} \newcommand{\trace}[1]{\text{tr}\left(#1\right)} \newcommand{\BasalRate}{B} \newcommand{\DampingCoefficient}{C} \newcommand{\DecayRate}{D} \newcommand{\Displacement}{X} \newcommand{\LatentForce}{F} \newcommand{\Mass}{M} \newcommand{\Sensitivity}{S} \newcommand{\basalRate}{b} \newcommand{\dampingCoefficient}{c} \newcommand{\mass}{m} \newcommand{\sensitivity}{s} \newcommand{\springScalar}{\kappa} \newcommand{\springVector}{\boldsymbol{ \kappa}} \newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}} \newcommand{\tfConcentration}{p} \newcommand{\tfDecayRate}{\delta} \newcommand{\tfMrnaConcentration}{f} \newcommand{\tfVector}{\mathbf{ \tfConcentration}} \newcommand{\velocity}{v} \newcommand{\sufficientStatsScalar}{g} \newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}} 
\newcommand{\sufficientStatsMatrix}{\mathbf{G}} \newcommand{\switchScalar}{s} \newcommand{\switchVector}{\mathbf{ \switchScalar}} \newcommand{\switchMatrix}{\mathbf{S}} \newcommand{\tr}[1]{\text{tr}\left(#1\right)} \newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1} \newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2} \newcommand{\onenorm}[1]{\left\vert#1\right\vert_1} \newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert} \newcommand{\vScalar}{v} \newcommand{\vVector}{\mathbf{v}} \newcommand{\vMatrix}{\mathbf{V}} \newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)} % Already defined by latex %\newcommand{\vec}{#1:} \newcommand{\vecb}[1]{\left(#1\right):} \newcommand{\weightScalar}{w} \newcommand{\weightVector}{\mathbf{ \weightScalar}} \newcommand{\weightMatrix}{\mathbf{W}} \newcommand{\weightedAdjacencyMatrix}{\mathbf{A}} \newcommand{\weightedAdjacencyScalar}{a} \newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}} \newcommand{\onesVector}{\mathbf{1}} \newcommand{\zerosVector}{\mathbf{0}} $$. . .
$$ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}$$
pods

The pods library supports open data science (python open data science). It lets you load various data sets and provides tools for teaching in the notebook.
To install pods you can use pip:
pip install pods
The code is also available on github: https://github.com/sods/ods
Once pods is installed, it can be imported in the usual manner.
Terminology   Mathematical notation    Description
------------- ------------------------ ------------------------------------
joint         $P(X=x, Y=y)$            prob. that X=x and Y=y
marginal      $P(X=x)$                 prob. that X=x regardless of Y
conditional   $P(X=x\vert Y=y)$        prob. that X=x given that Y=y
import teaching_plots as plot
plot.prob_diagram(diagrams='../slides/diagrams/mlai')
[Inspired by lectures from Christopher Bishop]{align="right"}
Terminology              Definition                                                 Probability Notation
------------------------ ---------------------------------------------------------- ------------------------------
Joint Probability        $\lim_{N\rightarrow\infty}\frac{n_{X=3,Y=4}}{N}$           $P\left(X=3,Y=4\right)$
Marginal Probability     $\lim_{N\rightarrow\infty}\frac{n_{X=5}}{N}$               $P\left(X=5\right)$
Conditional Probability  $\lim_{N\rightarrow\infty}\frac{n_{X=3,Y=4}}{n_{Y=4}}$     $P\left(X=3\vert Y=4\right)$
Typically we should write out $P\left(X=x,Y=y\right)$.
In practice, we often use $P\left(x,y\right)$.
This looks very much like we might write a multivariate function, e.g. $f\left(x,y\right)=\frac{x}{y}$.
For a multivariate function though, in general $f\left(x,y\right)\neq f\left(y,x\right)$.
However $P\left(x,y\right)=P\left(y,x\right)$ because $P\left(X=x,Y=y\right)=P\left(Y=y,X=x\right)$.
We now quickly review the ‘rules of probability’.
All distributions are normalized. This is clear from the fact that $\sum_{x}n_{x}=N$, which gives $$\sum_{x}P\left(x\right)={\lim_{N\rightarrow\infty}}\frac{\sum_{x}n_{x}}{N}={\lim_{N\rightarrow\infty}}\frac{N}{N}=1.$$ A similar result can be derived for the marginal and conditional distributions.
Using the limit definitions above:

* The marginal probability $P\left(y\right)$ is ${\lim_{N\rightarrow\infty}}\frac{n_{y}}{N}$.
* The joint distribution $P\left(x,y\right)$ is ${\lim_{N\rightarrow\infty}}\frac{n_{x,y}}{N}$.
* $n_{y}=\sum_{x}n_{x,y}$ so $$ {\lim_{N\rightarrow\infty}}\frac{n_{y}}{N}={\lim_{N\rightarrow\infty}}\sum_{x}\frac{n_{x,y}}{N}, $$ in other words $$ P\left(y\right)=\sum_{x}P\left(x,y\right). $$

This is known as the sum rule of probability.
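As a quick numerical check of the sum rule, the sketch below builds a small table of joint counts, normalises it to a joint distribution and recovers the marginal by summing over $x$. The counts are made up for illustration and do not come from the talk.

```python
import numpy as np

# Hypothetical joint counts n_{x,y}: rows index x, columns index y.
counts = np.array([[10, 20, 5],
                   [15, 30, 20]])
N = counts.sum()

joint = counts / N               # finite-N estimate of P(x, y)
marginal_y = joint.sum(axis=0)   # sum rule: P(y) = sum_x P(x, y)

# Counting n_y directly and dividing by N gives the same marginal.
n_y = counts.sum(axis=0)
print(marginal_y, n_y / N)
```

Summing along the other axis (`axis=1`) would give the marginal $P\left(x\right)$ in the same way.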
Consider the following distribution.
$y$                1       2       3       4
------------------ ------- ------- ------- -------
$P\left(y\right)$  0.3     0.2     0.1     0.4
What is the mean of the distribution?
What is the standard deviation of the distribution?
Are the mean and standard deviation representative of the distribution form?
What is the expected value of $-\log P(y)$?
We are given that:
$y$                1       2       3       4
------------------ ------- ------- ------- -------
$P\left(y\right)$  0.3     0.2     0.1     0.4
$y^2$              1       4       9       16
$-\log(P(y))$      1.204   1.609   2.302   0.916
Mean: $1\times 0.3 + 2\times 0.2 + 3 \times 0.1 + 4 \times 0.4 = 2.6$
Second moment: $1 \times 0.3 + 4 \times 0.2 + 9 \times 0.1 + 16 \times 0.4 = 8.4$
Variance: $8.4 - 2.6\times 2.6 = 1.64$
Standard deviation: $\sqrt{1.64} = 1.2806$
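The arithmetic above can be reproduced in a few lines. This is a minimal sketch using NumPy; the variable names are illustrative only.

```python
import numpy as np

y = np.array([1, 2, 3, 4])
P = np.array([0.3, 0.2, 0.1, 0.4])

mean = np.sum(y * P)               # first moment, <y>
second_moment = np.sum(y**2 * P)   # second moment, <y^2>
variance = second_moment - mean**2
std = np.sqrt(variance)
print(mean, second_moment, variance, std)
```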
Expectation $-\log(P(y))$: $0.3\times 1.204 + 0.2\times 1.609 + 0.1\times 2.302 +0.4\times 0.916 = 1.280$
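The expected value of $-\log P(y)$ is the entropy of the distribution, measured in nats when the natural logarithm is used, as it is in the table. A short sketch of the same calculation without rounding the intermediate logarithms:

```python
import numpy as np

P = np.array([0.3, 0.2, 0.1, 0.4])

neg_log_P = -np.log(P)             # natural log, matching the tabulated values
expected = np.sum(P * neg_log_P)   # <-log P(y)>
print(neg_log_P, expected)
```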
$i$ 1 2 3 4 5 6
------- ------ ------ ------ ------ ------ ------
$y_i$ 1.76 1.73 1.79 1.81 1.85 1.80
We can compute:
$i$       1      2      3      4      5      6
--------- ------ ------ ------ ------ ------ ------
$y_i$     1.76   1.73   1.79   1.81   1.85   1.80
$y^2_i$   3.0976 2.9929 3.2041 3.2761 3.4225 3.2400
Mean: $\frac{1.76 + 1.73 + 1.79 + 1.81 + 1.85 + 1.80}{6} = 1.79$
Second moment: $ \frac{3.0976 + 2.9929 + 3.2041 + 3.2761 + 3.4225 + 3.2400}{6} = 3.2055$
Variance: $3.2055 - 1.79\times1.79 = 1.43\times 10^{-3}$
Standard deviation: $0.0379$
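The same sample-based estimates can be computed directly from the six values. Note that the working above divides by the number of samples rather than by one less than it, which is also NumPy's default for `np.var`.

```python
import numpy as np

y = np.array([1.76, 1.73, 1.79, 1.81, 1.85, 1.80])

mean = y.mean()
second_moment = np.mean(y**2)
variance = second_moment - mean**2   # divides by n, as in the working above
std = np.sqrt(variance)
print(mean, second_moment, variance, std)
```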
No, you can’t compute it. You don’t have access to $P(y)$ directly.
$\mappingMatrix$ contains $\mappingMatrix_1$ and $\mappingMatrix_2$
$p(\mappingMatrix | \dataVector, \inputMatrix)$ is the posterior density
$p(\dataScalar|\inputVector, \mappingMatrix)$ is the likelihood of a data point
Normally assume independence: $$ p(\dataVector|\inputMatrix, \mappingMatrix) = \prod_{i=1}^\numData p(\dataScalar_i|\inputVector_i, \mappingMatrix).$$
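Taking logarithms makes this factorisation easier to work with, because the product over data points becomes a sum. A short worked step, using the same symbols:

$$ \log p(\dataVector|\inputMatrix, \mappingMatrix) = \sum_{i=1}^\numData \log p(\dataScalar_i|\inputVector_i, \mappingMatrix). $$

Maximum likelihood then amounts to minimising the negative of this sum with respect to $\mappingMatrix$.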
Can also consider priors over latents $$ p(\dataVector_*|\dataVector) = \int p(\dataVector_*|\inputMatrix_*, \mappingMatrix) p(\mappingMatrix | \dataVector, \inputMatrix) p(\inputMatrix) p(\inputMatrix_*) \text{d} \mappingMatrix \text{d} \inputMatrix \text{d}\inputMatrix_* $$
This gives unsupervised learning.
Data: $\dataVector$
Model: $p(\dataVector, \dataVector^*)$
Prediction: $p(\dataVector^*| \dataVector)$
import daft
from matplotlib import rc
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)
rc("text", usetex=True)
pgm = daft.PGM(shape=[3, 1],
               origin=[0, 0],
               grid_unit=5,
               node_unit=1.9,
               observed_style='shaded',
               line_width=3)
pgm.add_node(daft.Node("y_1", r"$y_1$", 0.5, 0.5, fixed=False))
pgm.add_node(daft.Node("y_2", r"$y_2$", 1.5, 0.5, fixed=False))
pgm.add_node(daft.Node("y_3", r"$y_3$", 2.5, 0.5, fixed=False))
pgm.add_edge("y_1", "y_2")
pgm.add_edge("y_2", "y_3")
pgm.render().figure.savefig("../slides/diagrams/ml/markov.svg", transparent=True)
Predict Perioperative Risk of Clostridium Difficile Infection Following Colon Surgery [@Steele:predictive12]
$\dataScalar$ 0 1
------------------ ----------- -------
$P(\dataScalar)$ $(1-\pi)$ $\pi$
This is the Bernoulli distribution.
The Bernoulli distribution $$ P(\dataScalar) = \pi^\dataScalar (1-\pi)^{(1-\dataScalar)} $$
is a clever trick for switching between the two probabilities. As code it would be
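def bernoulli(y_i, pi):
    """Return P(y_i) for a Bernoulli variable with parameter pi."""
    if y_i == 1:
        return pi       # probability of the positive outcome
    else:
        return 1 - pi   # probability of the negative outcome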
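To see why the exponent form is a "switch", the sketch below compares it with the branching version for both outcomes. The function names are illustrative and the value of `pi` is arbitrary.

```python
import math

def bernoulli_branch(y_i, pi):
    """Branching form: pick pi or 1 - pi depending on the outcome."""
    return pi if y_i == 1 else 1 - pi

def bernoulli_closed_form(y_i, pi):
    """Exponent form: pi**y_i * (1 - pi)**(1 - y_i) for y_i in {0, 1}."""
    return pi**y_i * (1 - pi)**(1 - y_i)

pi = 0.3  # arbitrary illustrative value
for y_i in (0, 1):
    # One exponent is always zero, so that factor becomes 1 and drops out.
    assert math.isclose(bernoulli_branch(y_i, pi),
                        bernoulli_closed_form(y_i, pi))
```

The closed form is more convenient than the branch when taking logarithms and derivatives, which is what the maximum likelihood derivation needs.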
import pods
pods.notebook.display_google_book(id='CF4UAAAAQAAJ', page='PA87')
import matplotlib.pyplot as plt
import teaching_plots as plot
fig, ax = plt.subplots(figsize=plot.one_figsize)
plot.bernoulli_urn(ax, diagrams='../slides/diagrams/ml/')
import matplotlib.pyplot as plt
import teaching_plots as plot
fig, ax = plt.subplots(figsize=plot.one_figsize)
plot.bayes_billiard(ax, diagrams='../slides/diagrams/ml/')
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('bayes-billiard{counter:0>3}.svg',
                            directory='../slides/diagrams/ml',
                            counter=IntSlider(0,0,9,1))
Stationary point: set derivative to zero $$0 = -\frac{\sum_{i=1}^{\numData} \dataScalar_i}{\pi} + \frac{\sum_{i=1}^{\numData} (1-\dataScalar_i)}{1-\pi},$$
Rearrange to form $$(1-\pi)\sum_{i=1}^{\numData} \dataScalar_i = \pi\sum_{i=1}^{\numData} (1-\dataScalar_i),$$
Giving $$\sum_{i=1}^{\numData} \dataScalar_i = \pi\left(\sum_{i=1}^{\numData} (1-\dataScalar_i) + \sum_{i=1}^{\numData} \dataScalar_i\right),$$
Recognise that $\sum_{i=1}^{\numData} (1-\dataScalar_i) + \sum_{i=1}^{\numData} \dataScalar_i = \numData$ so we have $$\pi = \frac{\sum_{i=1}^{\numData} \dataScalar_i}{\numData}.$$
Estimate the probability associated with the Bernoulli by setting it to the number of observed positives divided by the total number of observations (the length of $\dataVector$).
This makes intuitive sense.
What is your best guess of the probability that a coin toss comes up heads when you get 47 heads from 100 tosses?
Show that the maximum likelihood solution we have found is a minimum for our objective.
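The estimate can also be checked numerically. The sketch below encodes the coin example as 47 ones and 53 zeros, applies the closed-form estimate, and compares it with a brute-force search over a grid of values of $\pi$. The grid search is a sanity check only, not the analytic argument the exercise asks for, and the helper name is illustrative.

```python
import numpy as np

# 47 heads (1) and 53 tails (0); the order does not affect the estimate.
y = np.array([1] * 47 + [0] * 53)

# Closed-form maximum likelihood estimate: positives over total.
pi_ml = y.sum() / len(y)

def neg_log_likelihood(pi, y):
    """Negative log likelihood of Bernoulli data y for parameter pi."""
    return -(y.sum() * np.log(pi) + (1 - y).sum() * np.log(1 - pi))

# Evaluate the objective on a grid and pick the smallest value.
grid = np.linspace(0.01, 0.99, 99)
objective = np.array([neg_log_likelihood(p, y) for p in grid])
pi_grid = grid[np.argmin(objective)]
print(pi_ml, pi_grid)
```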
Four components:
Probabilistic Machine Learning: place probability distributions (or densities) over all the variables of interest.
In naive Bayes this is exactly what we do.
Form a classification algorithm by modelling the joint density of our observations.
Need to make assumption about joint density.
Given model parameters $\paramVector$ we assume that all data points in the model are independent. $$ p(\dataScalar^*, \inputVector^*, \dataVector, \inputMatrix|\paramVector) = p(\dataScalar^*, \inputVector^*|\paramVector)\prod_{i=1}^{\numData} p(\dataScalar_i, \inputVector_i | \paramVector). $$
This is a conditional independence assumption.
We also make similar assumptions for regression (where $\paramVector = \left\{\mappingVector,\dataStd^2\right\}$).
Here we assume joint density of $\dataVector$ and $\inputMatrix$ is independent across the data given the parameters.
Computing the posterior distribution becomes easier in this case; this is known as the 'Bayes classifier'.
To specify the joint distribution we also need the marginal $p(\dataScalar_i)$: $$p(\inputScalar_{i,j},\dataScalar_i| \paramVector) = p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)p(\dataScalar_i).$$
Because $\dataScalar_i$ is binary the Bernoulli density makes a suitable choice for our prior over $\dataScalar_i$, $$p(\dataScalar_i|\pi) = \pi^{\dataScalar_i} (1-\pi)^{1-\dataScalar_i}$$ where $\pi$ now has the interpretation as being the prior probability that the classification should be positive.
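To make the pieces concrete, here is a minimal sketch of a naive Bayes classifier that combines the Bernoulli prior with a class-conditional density for each feature. The Gaussian form of $p(\inputScalar_{i,j}|\dataScalar_i, \paramVector)$ is an assumption made here for illustration, and the function names and toy data are hypothetical.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Fit the Bernoulli prior and per-class, per-feature Gaussians (assumed form)."""
    params = {"pi": y.mean()}  # maximum likelihood estimate of the prior
    for c in (0, 1):
        Xc = X[y == c]
        # Small jitter keeps the variance strictly positive.
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-6)
    return params

def log_gaussian(x, mean, var):
    """Elementwise log density of a univariate Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def predict_proba(X, params):
    """Posterior P(y=1 | x) via Bayes' rule with features independent given the class."""
    pi = params["pi"]
    log_joint = []
    for c, log_prior in ((0, np.log(1 - pi)), (1, np.log(pi))):
        mean, var = params[c]
        # The naive assumption: sum log densities over features.
        log_joint.append(log_prior + log_gaussian(X, mean, var).sum(axis=1))
    log_joint = np.stack(log_joint, axis=1)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # stabilise the exponentials
    joint = np.exp(log_joint)
    return joint[:, 1] / joint.sum(axis=1)

# Toy data, made up for illustration: two well separated classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

params = fit_naive_bayes(X, y)
proba = predict_proba(X, params)
```

Working with log densities and subtracting the row maximum before exponentiating avoids numerical underflow when there are many features.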
import pods
pods.notebook.display_google_book(id='1YQPAAAAQAAJ', page='PA16')
How can you improve your classification? Are all the features equally valid? Are some features more helpful than others? What happens if you remove features that appear to be less helpful? How might you select such features?
We have decided to classify positive if the probability of an R rating is greater than 0.5. This has led us to accidentally classify some films as 'safe for children' when they aren't in actuality. Imagine you wish to ensure that the film is safe for children. With your test set, how low do you have to set the threshold to avoid all the false negatives (i.e. films where you said it wasn't R-rated, but in actuality it was)?
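One way to approach the threshold question is sketched below. The probabilities and labels are hypothetical stand-ins for a real test set, and the helper name is illustrative. The idea is that no true positive is missed as long as the threshold is no higher than the smallest predicted probability among the truly positive examples.

```python
import numpy as np

def threshold_for_no_false_negatives(proba, y_true):
    """Largest threshold t such that every true positive has proba >= t."""
    return proba[y_true == 1].min()

# Hypothetical predicted probabilities of an R rating, with true labels.
proba = np.array([0.9, 0.35, 0.6, 0.2, 0.55, 0.1])
y_true = np.array([1, 1, 1, 0, 0, 0])

t = threshold_for_no_false_negatives(proba, y_true)
predicted = proba >= t   # classify positive at or above the threshold

false_negatives = np.sum((y_true == 1) & ~predicted)
false_positives = np.sum((y_true == 0) & predicted)
print(t, false_negatives, false_positives)
```

Lowering the threshold removes false negatives at the cost of more false positives, so in practice the choice depends on how costly each kind of error is.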