Original content created by Cam Davidson-Pilon
Ported to Python 3 and PyMC3 by Max Margenot (@clean_utensils) and Thomas Wiecki (@twiecki) at Quantopian (@quantopian)
This chapter of Bayesian Methods for Hackers focuses on the most debated and discussed part of Bayesian methodologies: how to choose an appropriate prior distribution. We also present how the prior's influence changes as our dataset increases, and an interesting relationship between priors and penalties on linear regression.
Up until now, we have mostly ignored our choice of priors. This is unfortunate as we can be very expressive with our priors, but we also must be careful about choosing them. This is especially true if we want to be objective, that is, not to express any personal beliefs in the priors.
Bayesian priors can be classified into two classes: objective priors, which aim to allow the data to influence the posterior the most, and subjective priors, which allow the practitioner to express his or her own views in the prior.
What is an example of an objective prior? We have seen some already, including the flat prior, which is a uniform distribution over the entire possible range of the unknown. Using a flat prior implies that we give each possible value an equal weighting. Choosing this type of prior is invoking what is called "The Principle of Indifference": literally, we have no prior reason to favor one value over another. Calling a flat prior over a restricted space an objective prior is not correct, though it seems similar. If we know $p$ in a Binomial model is greater than 0.5, then $\text{Uniform}(0.5,1)$ is not an objective prior (since we have used prior knowledge) even though it is "flat" over [0.5, 1]. The flat prior must be flat along the entire range of possibilities.
Aside from the flat prior, other examples of objective priors are less obvious, but they contain important characteristics that reflect objectivity. For now, it should be said that rarely is an objective prior truly objective. We will see this later.
On the other hand, if we added more probability mass to certain areas of the prior, and less elsewhere, we are biasing our inference towards the unknowns existing in the former area. This is known as a subjective, or informative prior. In the figure below, the subjective prior reflects a belief that the unknown likely lives around 0.5, and not around the extremes. The objective prior is insensitive to this.
%matplotlib inline
import numpy as np
import scipy.stats as stats
from IPython.core.pylabtools import figsize
import matplotlib.pyplot as plt
figsize(12.5,3)
colors = ["#348ABD", "#A60628", "#7A68A6", "#467821"]
x = np.linspace(0,1)
y1, y2 = stats.beta.pdf(x, 1,1), stats.beta.pdf(x, 10,10)
p = plt.plot(x, y1,
label='An objective prior \n(uninformative, \n"Principle of Indifference")')
plt.fill_between(x, 0, y1, color = p[0].get_color(), alpha = 0.3)
p = plt.plot(x,y2 ,
label = "A subjective prior \n(informative)")
plt.fill_between(x, 0, y2, color = p[0].get_color(), alpha = 0.3)
p = plt.plot(x[25:], 2*np.ones(25), label = "another subjective prior")
plt.fill_between(x[25:], 0, 2, color = p[0].get_color(), alpha = 0.3)
plt.ylim(0, 4)
leg = plt.legend(loc = "upper left")
leg.get_frame().set_alpha(0.4)
plt.title("Comparing objective vs. subjective priors for an unknown probability");
The choice of a subjective prior does not always imply that we are using the practitioner's subjective opinion: more often the subjective prior was once a posterior to a previous problem, and now the practitioner is updating this posterior with new data. A subjective prior can also be used to inject domain knowledge of the problem into the model. We will see examples of these two situations later.
The choice, either objective or subjective, mostly depends on the problem being solved, but there are a few cases where one is preferred over the other. In instances of scientific research, the choice of an objective prior is obvious. This eliminates any biases in the results, and two researchers who might have differing prior opinions would feel an objective prior is fair. Consider a more extreme situation:
A tobacco company publishes a report with a Bayesian methodology that refuted 60 years of medical research on tobacco use. Would you believe the results? Unlikely. The researchers probably chose a subjective prior that too strongly biased results in their favor.
Unfortunately, choosing an objective prior is not as simple as selecting a flat prior, and even today the problem is still not completely solved. The problem with naively choosing the uniform prior is that pathological issues can arise. Some of these issues are pedantic, but we delay more serious issues to the Appendix of this Chapter (TODO).
We must remember that choosing a prior, whether subjective or objective, is still part of the modeling process. To quote Gelman [5]:
... after the model has been fit, one should look at the posterior distribution and see if it makes sense. If the posterior distribution does not make sense, this implies that additional prior knowledge is available that has not been included in the model, and that contradicts the assumptions of the prior distribution that has been used. It is then appropriate to go back and alter the prior distribution to be more consistent with this external knowledge.
If the posterior does not make sense, then clearly one had an idea what the posterior should look like (not what one hopes it looks like), implying that the current prior does not contain all the prior information and should be updated. At this point, we can discard the current prior and choose a more reflective one.
Gelman [4] suggests that using a uniform distribution with large bounds is often a good choice for an objective prior. One should be wary, though, of Uniform priors with large bounds, as they can assign too much prior probability to non-intuitive points. Ask yourself: do you really think the unknown could be incredibly large? Often quantities are naturally biased towards 0. A Normal random variable with large variance (small precision) might be a better choice, or an Exponential with a fat tail in the strictly positive (or negative) case.
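To make this concrete, here is a quick sketch comparing the densities of a wide Uniform prior, a wide Normal prior, and a fat-tailed Exponential prior; the particular bounds and scales below are arbitrary choices for illustration.

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

x = np.linspace(0, 100, 500)

# A "flat" prior over a large range spreads its mass evenly, even over implausible values.
flat = stats.uniform.pdf(x, loc=0, scale=100)

# A wide Normal (large variance, small precision) still gently pulls the unknown towards 0.
wide_normal = stats.norm.pdf(x, loc=0, scale=30)

# An Exponential with a fat tail, appropriate for a strictly positive unknown.
fat_exponential = stats.expon.pdf(x, scale=30)

plt.plot(x, flat, label="Uniform(0, 100)")
plt.plot(x, wide_normal, label="Normal(0, $30^2$)")
plt.plot(x, fat_exponential, label="Exponential(scale=30)")
plt.legend();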
If using a particularly subjective prior, it is your responsibility to be able to explain the choice of that prior, else you are no better than the tobacco company's guilty parties.
While not a true Bayesian method, empirical Bayes is a trick that combines frequentist and Bayesian inference. As mentioned previously, for (almost) every inference problem there is a Bayesian method and a frequentist method. The significant difference between the two is that Bayesian methods have a prior distribution, with hyperparameters $\alpha$, while frequentist methods have no notion of a prior. Empirical Bayes combines the two by using frequentist methods to select $\alpha$, and then proceeding with Bayesian methods on the original problem.
A very simple example follows: suppose we wish to estimate the parameter $\mu$ of a Normal distribution, with $\sigma = 5$. Since $\mu$ could range over the whole real line, we can use a Normal distribution as a prior for $\mu$. How do we select the prior's hyperparameters, denoted ($\mu_p, \sigma_p^2$)? The $\sigma_p^2$ parameter can be chosen to reflect the uncertainty we have. For $\mu_p$, we have two options:

1. Empirical Bayes suggests using the empirical sample mean, which will center the prior around the observed data.

2. Traditional Bayesian inference suggests using prior knowledge, or a more objective prior (zero mean and a large standard deviation).
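As a rough sketch of option 1, the snippet below sets $\mu_p$ to the sample mean and then performs the conjugate Normal-Normal update in closed form; the simulated data, the prior scale $\sigma_p$, and the variable names are all illustrative choices.

import numpy as np

sigma = 5.0                                     # known standard deviation of the data
data = np.random.normal(10, sigma, size=100)    # hypothetical observations

# Empirical Bayes: peek at the data to choose the prior's mean.
mu_p = data.mean()
sigma_p = 10.0                                  # keep the prior wide to limit double-counting

# Conjugate Normal-Normal update gives the posterior of mu in closed form.
n = len(data)
posterior_var = 1.0 / (1.0 / sigma_p**2 + n / sigma**2)
posterior_mean = posterior_var * (mu_p / sigma_p**2 + data.sum() / sigma**2)

print("posterior mean: %.2f, posterior sd: %.2f" % (posterior_mean, np.sqrt(posterior_var)))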
Empirical Bayes can be argued as being semi-objective, since while the choice of prior model is ours (hence subjective), the parameters are solely determined by the data.
Personally, I feel that Empirical Bayes is double-counting the data. That is, we are using the data twice: once in the prior, which will influence our results towards the observed data, and again in the inferential engine of MCMC. This double-counting will understate our true uncertainty. To minimize this double-counting, I would only suggest using Empirical Bayes when you have lots of observations, else the prior will have too strong of an influence. I would also recommend, if possible, maintaining high uncertainty (either by setting a large $\sigma_p^2$ or equivalent).
Empirical Bayes also violates a theoretical axiom in Bayesian inference. The textbook Bayesian algorithm of:
prior $\Rightarrow$ observed data $\Rightarrow$ posterior
is violated by Empirical Bayes, which instead uses
observed data $\Rightarrow$ prior $\Rightarrow$ observed data $\Rightarrow$ posterior
Ideally, all priors should be specified before we observe the data, so that the data does not influence our prior opinions (see the volumes of research by Daniel Kahneman et al. on anchoring).
A Gamma random variable, denoted $X \sim \text{Gamma}(\alpha, \beta)$, is a random variable over the positive real numbers. It is in fact a generalization of the Exponential random variable, that is:
$$ \text{Exp}(\beta) \sim \text{Gamma}(1, \beta) $$

This additional parameter allows the probability density function to have more flexibility, hence allowing the practitioner to express his or her subjective priors more accurately. The density function for a $\text{Gamma}(\alpha, \beta)$ random variable is:
$$ f(x \mid \alpha, \beta) = \frac{\beta^{\alpha}x^{\alpha-1}e^{-\beta x}}{\Gamma(\alpha)} $$

where $\Gamma(\alpha)$ is the Gamma function. For differing values of $(\alpha, \beta)$, the density looks like:
figsize(12.5, 5)
gamma = stats.gamma
parameters = [(1, 0.5), (9, 2), (3, 0.5), (7, 0.5)]
x = np.linspace(0.001 ,20, 150)
for alpha, beta in parameters:
    y = gamma.pdf(x, alpha, scale=1./beta)
    lines = plt.plot(x, y, label = "(%.1f,%.1f)"%(alpha,beta), lw = 3)
    plt.fill_between(x, 0, y, alpha = 0.2, color = lines[0].get_color())
    plt.autoscale(tight=True)
plt.legend(title=r"$\alpha, \beta$ - parameters");
Until now, we have only seen random variables that are scalars. Of course, we can also have random matrices! Specifically, the Wishart distribution is a distribution over all positive semi-definite matrices. Why is this useful to have in our arsenal? (Proper) covariance matrices are positive-definite, hence the Wishart is an appropriate prior for covariance matrices. We can't really visualize a distribution of matrices, so I'll plot some realizations from the $4 \times 4$ (top row) and $15 \times 15$ (bottom row) Wishart distribution:
n = 4
for i in range(10):
    ax = plt.subplot(2, 5, i + 1)
    if i >= 5:
        n = 15
    plt.imshow(stats.wishart.rvs(n + 1, np.eye(n)), interpolation="none",
               cmap="hot")
    ax.axis("off")

plt.suptitle("Random matrices from a Wishart Distribution");
One thing to notice is the symmetry of these matrices. The Wishart distribution can be a little troubling to deal with, but we will use it in an example later.
You may have seen the term beta in previous code in this book. Often, I was implementing a Beta distribution. The Beta distribution is very useful in Bayesian statistics. A random variable $X$ has a $\text{Beta}$ distribution, with parameters $(\alpha, \beta)$, if its density function is:

$$ f_X(x \mid \alpha, \beta) = \frac{x^{(\alpha - 1)}(1-x)^{(\beta - 1)}}{B(\alpha, \beta)} $$
where $B$ is the Beta function (hence the name). The random variable $X$ is only allowed in [0,1], making the Beta distribution a popular distribution for decimal values, probabilities and proportions. The values of $\alpha$ and $\beta$, both positive values, provide great flexibility in the shape of the distribution. Below we plot some distributions:
figsize(12.5, 5)
params = [(2, 5), (1, 1), (0.5, 0.5), (5, 5), (20, 4), (5, 1)]
x = np.linspace(0.01, .99, 100)
beta = stats.beta
for a, b in params:
    y = beta.pdf(x, a, b)
    lines = plt.plot(x, y, label = "(%.1f,%.1f)"%(a,b), lw = 3)
    plt.fill_between(x, 0, y, alpha = 0.2, color = lines[0].get_color())
    plt.autoscale(tight=True)
plt.ylim(0)
plt.legend(loc = 'upper left', title="(a,b)-parameters");
One thing I'd like the reader to notice is the presence of the flat distribution above, specified by parameters $(1,1)$. This is the Uniform distribution. Hence the Beta distribution is a generalization of the Uniform distribution, something we will revisit many times.
There is an interesting connection between the Beta distribution and the Binomial distribution. Suppose we are interested in some unknown proportion or probability $p$. We assign a $\text{Beta}(\alpha, \beta)$ prior to $p$. We observe some data generated by a Binomial process, say $X \sim \text{Binomial}(N, p)$, with $p$ still unknown. Then our posterior is again a Beta distribution, i.e. $p | X \sim \text{Beta}( \alpha + X, \beta + N -X )$. Succinctly, one can relate the two by "a Beta prior with Binomial observations creates a Beta posterior". This is a very useful property, both computationally and heuristically.
In light of the above two paragraphs, if we start with a $\text{Beta}(1,1)$ prior on $p$ (which is a Uniform), observe data $X \sim \text{Binomial}(N, p)$, then our posterior is $\text{Beta}(1 + X, 1 + N - X)$.
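As a quick numerical sanity check, the snippet below draws some Binomial data (with a made-up $N$ and true $p$) and forms the conjugate posterior directly:

import numpy as np
import scipy.stats as stats

true_p = 0.35                              # hidden, made-up probability
N = 50
X = np.random.binomial(N, true_p)          # observed number of successes

# Beta(1, 1) prior (i.e. Uniform) + Binomial data  =>  Beta(1 + X, 1 + N - X) posterior
posterior = stats.beta(1 + X, 1 + N - X)
print("posterior mean: %.3f" % posterior.mean())
print("95%% credible interval: (%.3f, %.3f)" % posterior.interval(0.95))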
Adapted from an example by Ted Dunning of MapR Technologies
Suppose you are faced with $N$ slot machines (colourfully called multi-armed bandits). Each bandit has an unknown probability of distributing a prize (assume for now the prizes are the same for each bandit; only the probabilities differ). Some bandits are very generous, others not so much. Of course, you don't know what these probabilities are. By choosing only one bandit per round, our task is to devise a strategy to maximize our winnings.
Of course, if we knew the bandit with the largest probability, then always picking this bandit would yield the maximum winnings. So our task can be phrased as "Find the best bandit, and as quickly as possible".
The task is complicated by the stochastic nature of the bandits. A suboptimal bandit can return many winnings, purely by chance, which would make us believe that it is a very profitable bandit. Similarly, the best bandit can return many duds. Should we keep trying losers then, or give up?
A more troublesome problem is, if we have found a bandit that returns pretty good results, do we keep drawing from it to maintain our pretty good score, or do we try other bandits in hopes of finding an even-better bandit? This is the exploration vs. exploitation dilemma.
The Multi-Armed Bandit problem at first seems very artificial, something only a mathematician would love, but that is only before we address some applications: internet display advertising (which ad to serve), ecology (how an animal should allocate its foraging effort), finance (which asset to hold), and clinical trials (which treatment to assign to the next patient). Many of the questions that arise in these fields are fundamentally bandit problems.
It turns out that finding the optimal solution is incredibly difficult, and it took decades for an overall solution to develop. There are also many approximately-optimal solutions which are quite good. The one I wish to discuss is one of the few solutions that can scale incredibly well. The solution is known as Bayesian Bandits.
Any proposed strategy is called an online algorithm (not in the internet sense, but in the continuously-being-updated sense), and more specifically a reinforcement learning algorithm. The algorithm starts in an ignorant state, where it knows nothing, and begins to acquire data by testing the system. As it acquires data and results, it learns what the best and worst behaviours are (in this case, it learns which bandit is the best). With this in mind, perhaps we can add an additional application of the Multi-Armed Bandit problem: psychology, where we ask how reward and punishment affect behaviour, and how humans learn.
The Bayesian solution begins by assuming priors on the probability of winning for each bandit. In our vignette we assumed complete ignorance of these probabilities. So a very natural prior is the flat prior over 0 to 1. The algorithm proceeds as follows:
For each round:

1. Sample a random variable $X_b$ from the prior of bandit $b$, for all $b$.
2. Select the bandit with the largest sample, i.e. select $B = \text{argmax}\; X_b$.
3. Observe the result of pulling bandit $B$, and update your prior on bandit $B$.
4. Return to 1.
That's it. Computationally, the algorithm involves sampling from $N$ distributions. Since the initial priors are $\text{Beta}(\alpha=1, \beta=1)$ (a uniform distribution), and the observed result $X$ (a win or loss, encoded 1 and 0 respectively) is Binomial, the posterior is a $\text{Beta}(\alpha=1+X, \beta=1+1-X)$.
To answer our question from before, this algorithm suggests that we should not discard losers, but we should pick them at a decreasing rate as we gather confidence that there exist better bandits. This follows because there is always a non-zero chance that a loser will achieve the status of $B$, but the probability of this event decreases as we play more rounds (see figure below).
Below we implement Bayesian Bandits using two classes: Bandits, which defines the slot machines, and BayesianStrategy, which implements the above learning strategy.
rand = np.random.rand

class Bandits(object):
    """
    This class represents N bandit machines.

    parameters:
        p_array: a (n,) Numpy array of probabilities >0, <1.

    methods:
        pull( i ): return the results, 0 or 1, of pulling
                   the ith bandit.
    """
    def __init__(self, p_array):
        self.p = p_array
        self.optimal = np.argmax(p_array)

    def pull(self, i):
        # i is which arm to pull
        return np.random.rand() < self.p[i]

    def __len__(self):
        return len(self.p)
class BayesianStrategy(object):
    """
    Implements an online, learning strategy to solve
    the Multi-Armed Bandit problem.

    parameters:
        bandits: a Bandit class with .pull method

    methods:
        sample_bandits(n): sample and train on n pulls.

    attributes:
        N: the cumulative number of samples
        choices: the historical choices as a (N,) array
        bb_score: the historical score as a (N,) array
    """

    def __init__(self, bandits):
        self.bandits = bandits
        n_bandits = len(self.bandits)
        self.wins = np.zeros(n_bandits)
        self.trials = np.zeros(n_bandits)
        self.N = 0
        self.choices = []
        self.bb_score = []

    def sample_bandits(self, n=1):
        bb_score = np.zeros(n)
        choices = np.zeros(n)

        for k in range(n):
            # sample from the bandits' priors, and select the largest sample
            choice = np.argmax(np.random.beta(1 + self.wins, 1 + self.trials - self.wins))

            # sample the chosen bandit
            result = self.bandits.pull(choice)

            # update priors and score
            self.wins[choice] += result
            self.trials[choice] += 1
            bb_score[k] = result
            self.N += 1
            choices[k] = choice

        self.bb_score = np.r_[self.bb_score, bb_score]
        self.choices = np.r_[self.choices, choices]
        return
Below we visualize the learning of the Bayesian Bandit solution.
figsize(11.0, 10)

beta = stats.beta
x = np.linspace(0.001, .999, 200)

def plot_priors(bayesian_strategy, prob, lw = 3, alpha = 0.2, plt_vlines = True):
    ## plotting function
    wins = bayesian_strategy.wins
    trials = bayesian_strategy.trials
    for i in range(prob.shape[0]):
        y = beta(1 + wins[i], 1 + trials[i] - wins[i])
        p = plt.plot(x, y.pdf(x), lw = lw)
        c = p[0].get_markeredgecolor()
        plt.fill_between(x, y.pdf(x), 0, color = c, alpha = alpha,
                         label="underlying probability: %.2f" % prob[i])
        if plt_vlines:
            plt.vlines(prob[i], 0, y.pdf(prob[i]),
                       colors = c, linestyles = "--", lw = 2)
        plt.autoscale(tight = True)
        plt.title("Posteriors After %d pull" % bayesian_strategy.N +\
                  "s" * (bayesian_strategy.N > 1))
        plt.autoscale(tight = True)
    return
hidden_prob = np.array([0.85, 0.60, 0.75])
bandits = Bandits(hidden_prob)
bayesian_strat = BayesianStrategy(bandits)

draw_samples = [1, 1, 3, 10, 10, 25, 50, 100, 200, 600]

for j, i in enumerate(draw_samples):
    plt.subplot(5, 2, j + 1)
    bayesian_strat.sample_bandits(i)
    plot_priors(bayesian_strat, hidden_prob)
    #plt.legend()
    plt.autoscale(tight = True)
plt.tight_layout()
Note that we don't really care how accurate we become about the inference of the hidden probabilities — for this problem we are more interested in choosing the best bandit (or more accurately, becoming more confident in choosing the best bandit). For this reason, the distribution of the red bandit is very wide (representing ignorance about what that hidden probability might be) but we are reasonably confident that it is not the best, so the algorithm chooses to ignore it.
From the above, we can see that after 1000 pulls the bulk of the "blue" posterior's mass leads the pack, hence we will almost always choose this arm. This is good, as this arm is indeed the best.
Below is a D3 app that demonstrates our algorithm updating/learning three bandits. The first figure shows the raw counts of pulls and wins, and the second figure is a dynamically updating plot. I encourage you to try to guess which bandit is optimal, prior to revealing the true probabilities, by selecting the arm buttons.
from IPython.core.display import HTML
#try executing the below command twice if the first time doesn't work
HTML(filename = "BanditsD3.html")
Deviations of the observed ratio from the highest probability are a measure of performance. For example, in the long run we can optimally attain a reward/pull ratio equal to the maximum bandit probability. Long-term realized ratios less than the maximum represent inefficiencies. (Realized ratios larger than the maximum probability are due to randomness, and will eventually fall below it.)
We need a metric to calculate how well we are doing. Recall the absolute best we can do is to always pick the bandit with the largest probability of winning. Denote this best bandit's probability by $w_{opt}$. Our score should be relative to how well we would have done had we chosen the best bandit from the beginning. This motivates the total regret of a strategy, defined:
\begin{align}
R_T & = \sum_{i=1}^{T} \left( w_{opt} - w_{B(i)} \right) \\\\
& = T w_{opt} - \sum_{i=1}^{T} \; w_{B(i)}
\end{align}

where $w_{B(i)}$ is the probability of a prize of the chosen bandit in the $i$th round. A total regret of 0 means the strategy is matching the best possible score. This is likely not possible, as initially our algorithm will often make the wrong choice. Ideally, a strategy's total regret should flatten as it learns the best bandit (mathematically, this means we achieve $w_{B(i)} = w_{opt}$ often).
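As a tiny worked example of this formula, with made-up probabilities and choices:

import numpy as np

probabilities = np.array([0.85, 0.60, 0.75])   # hidden win probabilities of three bandits
choices = np.array([1, 2, 0, 0, 0])            # which bandit was chosen in rounds 1..5

w_opt = probabilities.max()                    # 0.85
per_round_regret = w_opt - probabilities[choices]
print(per_round_regret.cumsum())               # [0.25, 0.35, 0.35, 0.35, 0.35]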
Below we plot the total regret of this simulation, including the scores of some other strategies. The code for these strategies is in other_strats.py, where you can implement your own very easily.
figsize(12.5, 5)

from other_strats import *

# define a harder problem
hidden_prob = np.array([0.15, 0.2, 0.1, 0.05])
bandits = Bandits(hidden_prob)

# define regret
def regret(probabilities, choices):
    w_opt = probabilities.max()
    return (w_opt - probabilities[choices.astype(int)]).cumsum()

# create new strategies
strategies = [upper_credible_choice,
              bayesian_bandit_choice,
              ucb_bayes,
              max_mean,
              random_choice]
algos = []
for strat in strategies:
    algos.append(GeneralBanditStrat(bandits, strat))

# train 10000 times
for strat in algos:
    strat.sample_bandits(10000)

# test and plot
for i, strat in enumerate(algos):
    _regret = regret(hidden_prob, strat.choices)
    plt.plot(_regret, label = strategies[i].__name__, lw = 3)

plt.title("Total Regret of Bayesian Bandits Strategy vs. Random guessing")
plt.xlabel("Number of pulls")
plt.ylabel("Regret after $n$ pulls");
plt.legend(loc = "upper left");
As we hoped, the Bayesian bandit and other strategies have decreasing rates of regret, which means we are converging on the optimal choices. To be more scientific, and to remove any possible luck in the above simulation, we should instead look at the expected total regret:
$$\bar{R}_T = E[ R_T ]$$

It can be shown that any sub-optimal strategy's expected total regret is bounded below logarithmically. Formally,
$$ E[R_T] = \Omega \left( \log(T) \right) $$

Thus, any strategy that matches logarithmic-growing regret is said to "solve" the Multi-Armed Bandit problem [3].
Using the Law of Large Numbers, we can approximate Bayesian Bandit's expected total regret by performing the same experiment many times (500 times, to be fair):
# this can be slow, so I recommend NOT running it.
trials = 500
expected_total_regret = np.zeros((10000, 3))

for i_strat, strat in enumerate(strategies[:-2]):
    for i in range(trials):
        general_strat = GeneralBanditStrat(bandits, strat)
        general_strat.sample_bandits(10000)
        _regret = regret(hidden_prob, general_strat.choices)
        expected_total_regret[:, i_strat] += _regret
    plt.plot(expected_total_regret[:, i_strat] / trials, lw = 3, label = strat.__name__)

plt.title("Expected Total Regret of Multi-armed Bandit strategies")
plt.xlabel("Number of pulls")
plt.ylabel("Expected Total Regret \n after $n$ pulls");
plt.legend(loc = "upper left");