#!/usr/bin/env python
# coding: utf-8

# # Notebook 2: Distributions
#
# [Bayesian Decision Analysis](https://allendowney.github.io/BayesianDecisionAnalysis/)
#
# Copyright 2021 Allen B. Downey
#
# License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

# In[1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# ## Review
#
# In the previous lesson, we used a Bayes table to solve the cookie problem.
#
# In this notebook we'll solve an expanded version of the cookie problem with 101 bowls. It might seem silly, but it's not: the solution demonstrates a Bayesian way to estimate a proportion, and it applies to lots of real problems that don't involve cookies.
#
# Then I'll introduce an alternative to the Bayes table, a probability mass function (PMF), which is a useful way to represent and do computations with distributions.
#
# Here's the function from the previous notebook that we'll use to make Bayes tables:

# In[2]:

import pandas as pd

def make_bayes_table(hypos, prior, likelihood):
    """Make a Bayes table.

    hypos: sequence of hypotheses
    prior: prior probabilities
    likelihood: sequence of likelihoods

    returns: DataFrame
    """
    table = pd.DataFrame(index=hypos)
    table['prior'] = prior
    table['likelihood'] = likelihood
    table['unnorm'] = table['prior'] * table['likelihood']
    prob_data = table['unnorm'].sum()
    table['posterior'] = table['unnorm'] / prob_data
    return table

# ## 101 Bowls
#
# The Bayes table works with more than two hypotheses. As an example, we solved a cookie problem with five bowls.
#
# Now we'll take it even further and solve a cookie problem with 101 bowls:
#
# * Bowl 0 contains no vanilla cookies,
#
# * Bowl 1 contains 1% vanilla cookies,
#
# * Bowl 2 contains 2% vanilla cookies,
#
# and so on, up to
#
# * Bowl 99 contains 99% vanilla cookies, and
#
# * Bowl 100 contains all vanilla cookies.
#
# As in the previous problems, there are only two kinds of cookies, vanilla and chocolate. So Bowl 0 is all chocolate cookies, Bowl 1 is 99% chocolate, and so on.
#
# Suppose we choose a bowl at random, choose a cookie at random, and it turns out to be vanilla. What is the probability that the cookie came from Bowl $x$, for each value of $x$?
#
# To solve this problem, I'll use `np.arange` to represent 101 hypotheses, numbered from 0 to 100.

# In[3]:

xs = np.arange(101)

# The prior probability for each bowl is $1/101$. I could create a sequence with 101 identical values, but if all of the priors are equal, we only have to provide one value:

# In[4]:

prior = 1/101

# Because of the way I numbered the bowls, the probability of a vanilla cookie from Bowl $x$ is $x/100$. So we can compute the likelihoods like this:

# In[5]:

likelihood = xs / 100

# And that's all we need; the Bayes table does the rest:

# In[6]:

table = make_bayes_table(xs, prior, likelihood)

# Here's a feature we have not seen before: we can give the index of the Bayes table a name, which will appear when we display the table.

# In[7]:

table.index.name = 'Bowl'

# Here are the first few rows:

# In[8]:

table.head()

# Because Bowl 0 contains no vanilla cookies, its likelihood is 0, so its posterior probability is 0. That is, the cookie cannot have come from Bowl 0.
#
# Here are the last few rows of the table.

# In[9]:

table.tail()

# The posterior probabilities are substantially higher for the high-numbered bowls.
#
# There is a pattern here that will be clearer if we plot the results.

# ## Plotting posterior distributions

# In[10]:

import matplotlib.pyplot as plt

def plot_table(table):
    """Plot results from the 101 Bowls problem.

    table: DataFrame representing a Bayes table
    """
    table['prior'].plot()
    table['posterior'].plot()
    plt.xlabel('Bowl #')
    plt.ylabel('Probability')
    plt.legend()

# In[11]:

plot_table(table)
plt.title('One cookie');

# The prior probabilities are uniform; that is, they are the same for every bowl.
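# As a cross-check (not in the original notebook), the one-cookie posterior can also be computed directly: with a uniform prior, the posterior for Bowl $x$ is proportional to the likelihood $x/100$, so after normalizing it is exactly $x / (0 + 1 + \cdots + 100) = x/5050$. This sketch is self-contained:

```python
import numpy as np

# Rebuild the one-vanilla-cookie update from scratch
xs = np.arange(101)
prior = 1 / 101
likelihood = xs / 100
unnorm = prior * likelihood
posterior = unnorm / unnorm.sum()

# With a uniform prior, the normalizing constants cancel,
# so posterior[x] is exactly x / 5050
assert np.allclose(posterior, xs / 5050)
print(posterior[0], posterior[100])  # 0.0 for Bowl 0, ~0.0198 for Bowl 100
```
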
#
# The posterior probabilities increase linearly; Bowl 0 is the least likely (actually impossible), and Bowl 100 is the most likely.

# ## Two cookies
#
# Suppose we put the first cookie back, stir the bowl thoroughly, and draw another cookie from the same bowl.
# And suppose it turns out to be another vanilla cookie.
# Now what is the probability that we are drawing from Bowl $x$?
#
# To answer this question, we can use the posterior probabilities from the previous problem as prior probabilities for a new Bayes table, and then update with the new data.

# In[12]:

prior2 = table['posterior']
likelihood2 = likelihood
table2 = make_bayes_table(xs, prior2, likelihood2)
plot_table(table2)
plt.title('Two cookies');

# The blue line shows the posterior after one cookie, which is the prior before the second cookie.
#
# The orange line shows the posterior after two cookies, which curves upward. Having seen two vanilla cookies, the high-numbered bowls are more likely; the low-numbered bowls are less likely.
#
# I bet you can guess what's coming next.

# ## Three cookies
#
# Suppose we put the cookie back, stir, draw another cookie from the same bowl, and get a chocolate cookie.
# What do you think the posterior distribution looks like after these three cookies?
#
# Hint: what's the probability that the chocolate cookie came from Bowl 100?

# ### Exercise
#
# Compute the posterior distribution after two vanilla cookies and one chocolate cookie.

# In[13]:


# In[14]:


# In[15]:


# In[16]:


# In[17]:


# ## Summarizing the posterior
#
# The posterior distribution has a peak near 60%. We can use `idxmax` to find it:

# In[18]:


# The peak in the posterior distribution is at 67%.
#
# This value has a name; it is the **MAP**, which stands for "maximum a posteriori probability" ("a posteriori" is Latin for posterior).
#
# In this example, the MAP is close to the proportion of vanilla cookies in the dataset: 2/3.
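# The peak reported above can be verified with a self-contained sketch (the variable names here are illustrative, not from the notebook): starting from a uniform prior, the posterior after two vanilla cookies and one chocolate cookie is proportional to $(x/100)^2 (1 - x/100)$, and `idxmax` picks out the bowl where it is largest.

```python
import numpy as np
import pandas as pd

# Posterior after two vanilla cookies and one chocolate cookie,
# starting from a uniform prior over the 101 bowls
xs = np.arange(101)
posterior = pd.Series((xs / 100)**2 * (1 - xs / 100), index=xs)
posterior /= posterior.sum()

# idxmax returns the index label (bowl number) with the highest probability
print(posterior.idxmax())  # 67, close to the observed proportion 2/3
```
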
# ## Probability Mass Functions
#
# When we do more than one update, we don't always want to keep the whole Bayes table. In this section we'll replace the Bayes table with a more compact representation, a probability mass function, or PMF.
#
# A PMF is a set of possible outcomes and their corresponding probabilities. There are many ways to represent a PMF; in this notebook I'll use a Pandas Series.
#
# Here's a function that takes a sequence of quantities, `qs`, and a sequence of probabilities, `ps`, and returns a Pandas Series that represents a PMF.

# In[19]:

def make_pmf(qs, ps, **options):
    """Make a Series that represents a PMF.

    qs: sequence of quantities
    ps: sequence of probabilities
    options: keyword arguments passed to Series constructor

    returns: Pandas Series
    """
    pmf = pd.Series(ps, index=qs, **options)
    return pmf

# And here's a PMF that represents the prior from the 101 Bowls problem.

# In[20]:

xs = np.arange(101)
prior = 1/101
pmf = make_pmf(xs, prior)
pmf.head()

# Now that we have a prior, we need to compute likelihoods.
#
# Here are the likelihoods for a vanilla cookie:

# In[21]:

likelihood_vanilla = xs / 100

# And for a chocolate cookie.

# In[22]:

likelihood_chocolate = 1 - xs / 100

# To compute posterior probabilities, I'll use the following function, which takes a PMF and a sequence of likelihoods, and updates the PMF:

# In[23]:

def bayes_update(pmf, likelihood):
    """Do a Bayesian update.

    pmf: Series that represents the prior
    likelihood: sequence of likelihoods
    """
    pmf *= likelihood
    pmf /= pmf.sum()

# The steps here are the same as in the Bayes table:
#
# 1. Multiply the prior by the likelihoods.
#
# 2. Add up the products to get the total probability of the data.
#
# 3. Divide through to normalize the posteriors.

# Now we can do the update for a vanilla cookie.

# In[24]:

bayes_update(pmf, likelihood_vanilla)

# Here's what the PMF looks like after the update.
# In[25]:

pmf.plot()
plt.xlabel('Bowl #')
plt.ylabel('Probability')
plt.title('One cookie');

# That's consistent with what we got with the Bayes table.
#
# The advantage of using a PMF is that it is easier to do multiple updates. The following cell starts again with the uniform prior and does updates with two vanilla cookies and one chocolate cookie:

# In[26]:

pmf = make_pmf(xs, prior)
bayes_update(pmf, likelihood_vanilla)
bayes_update(pmf, likelihood_vanilla)
bayes_update(pmf, likelihood_chocolate)

# Here's what the results look like:

# In[27]:

pmf.plot()
plt.xlabel('Bowl #')
plt.ylabel('Probability')
plt.title('Three cookies');

# Again, that's consistent with what we got with the Bayes table.

# ## Summary
#
# In this notebook, we extended the cookie problem with more bowls.
#
# I defined the MAP, which is the quantity in a posterior distribution with the highest probability.
#
# Although the cookie problem is not particularly realistic or useful, the method we used to solve it applies to many problems in the real world where we want to estimate a proportion.
#
# In the next lesson we'll use the same method on a related problem, and take another step toward doing Bayesian statistics.

# ### Exercise
#
# Suppose we take away bowls 51 through 100, leaving only the first 51 bowls.
#
# You choose one of the remaining bowls, draw three times (replacing and stirring after each draw), and get a vanilla cookie twice.
#
# What is the probability you drew from each bowl?
#
# Make a `Series` that represents the prior probabilities of the bowls. Update it based on the data and plot the posterior distribution. Compute the MAP.
#
# In this example, does the MAP equal the observed proportion?

# In[28]:


# In[29]:


# In[30]:


# In[31]:


# In[32]:


# In[33]:


# In[34]:


# In[ ]:
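# One possible solution sketch for the exercise above (not part of the original notebook; the variable names are illustrative). With only bowls 0 through 50 available, no bowl's vanilla proportion can reach the observed proportion of 2/3, so the posterior keeps rising up to the boundary:

```python
import numpy as np
import pandas as pd

# Uniform prior over the 51 remaining bowls, 0 through 50
xs = np.arange(51)
pmf = pd.Series(1 / 51, index=xs)

# Update with two vanilla cookies and one chocolate cookie
for likelihood in (xs / 100, xs / 100, 1 - xs / 100):
    pmf *= likelihood
    pmf /= pmf.sum()

# The posterior is proportional to (x/100)^2 * (1 - x/100), which increases
# up to x ~ 66.7; restricted to x <= 50, it peaks at the boundary
print(pmf.idxmax())  # 50
```

# So the MAP is Bowl 50, which does not equal the observed proportion 2/3.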