#!/usr/bin/env python
# coding: utf-8

# # Estimating Counts
#
# Think Bayes, Second Edition
#
# Copyright 2020 Allen B. Downey
#
# License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

# In[1]:


# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/

import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    get_ipython().system('pip install empiricaldist')


# In[2]:


# Get utils.py

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

download('https://github.com/AllenDowney/ThinkBayes2/raw/master/soln/utils.py')


# In[3]:


from utils import set_pyplot_params
set_pyplot_params()


# In the previous chapter we solved problems that involve estimating proportions.
# In the Euro problem, we estimated the probability that a coin lands heads up; in the exercises, you estimated a batting average, the fraction of people who cheat on their taxes, and the chance of shooting down an invading alien.
#
# Clearly, some of these problems are more realistic than others, and some are more useful than others.
#
# In this chapter, we'll work on problems related to counting, or estimating the size of a population.
# Again, some of the examples will seem silly, but some of them, like the German tank problem, have real applications, sometimes in life-and-death situations.

# ## The Train Problem
#
# I found the train problem in Frederick Mosteller's [*Fifty Challenging Problems in Probability with Solutions*](https://store.doverpublications.com/0486653552.html):
#
# > "A railroad numbers its locomotives in order 1..N. One day you see a locomotive with the number 60. Estimate how many locomotives the railroad has."
#
# Based on this observation, we know the railroad has 60 or more locomotives.
# But how many more? To apply Bayesian reasoning, we can break this problem into two steps:
#
# * What did we know about $N$ before we saw the data?
#
# * For any given value of $N$, what is the likelihood of seeing the data (a locomotive with number 60)?
#
# The answer to the first question is the prior. The answer to the second is the likelihood.
#
# We don't have much basis to choose a prior, so we'll start with something simple and then consider alternatives.
# Let's assume that $N$ is equally likely to be any value from 1 to 1000.
#
# Here's the prior distribution:

# In[4]:


import numpy as np
from empiricaldist import Pmf

hypos = np.arange(1, 1001)
prior = Pmf(1, hypos)


# Now let's figure out the likelihood of the data.
# In a hypothetical fleet of $N$ locomotives, what is the probability that we would see number 60?
# If we assume that we are equally likely to see any locomotive, the chance of seeing any particular one is $1/N$.
#
# Here's the function that does the update:

# In[5]:


def update_train(pmf, data):
    """Update pmf based on new data."""
    hypos = pmf.qs
    likelihood = 1 / hypos
    impossible = (data > hypos)
    likelihood[impossible] = 0
    pmf *= likelihood
    pmf.normalize()


# This function might look familiar; it is the same as the update function for the dice problem in the previous chapter.
# In terms of likelihood, the train problem is the same as the dice problem.
#
# Here's the update:

# In[6]:


data = 60
posterior = prior.copy()
update_train(posterior, data)


# Here's what the posterior looks like:

# In[7]:


from utils import decorate

posterior.plot(label='Posterior after train 60', color='C4')
decorate(xlabel='Number of trains',
         ylabel='PMF',
         title='Posterior distribution')


# Not surprisingly, all values of $N$ below 60 have been eliminated.
#
# The most likely value, if you had to guess, is 60.
# In[8]:


posterior.max_prob()


# That might not seem like a very good guess; after all, what are the chances that you just happened to see the train with the highest number?
# Nevertheless, if you want to maximize the chance of getting the answer exactly right, you should guess 60.
#
# But maybe that's not the right goal.
# An alternative is to compute the mean of the posterior distribution.
# Given a set of possible quantities, $q_i$, and their probabilities, $p_i$, the mean of the distribution is:
#
# $$\mathrm{mean} = \sum_i p_i q_i$$
#
# Which we can compute like this:

# In[9]:


np.sum(posterior.ps * posterior.qs)


# Or we can use the method provided by `Pmf`:

# In[10]:


posterior.mean()


# The mean of the posterior is 333, so that might be a good guess if you want to minimize error.
# If you played this guessing game over and over, using the mean of the posterior as your estimate would minimize the [mean squared error](http://en.wikipedia.org/wiki/Minimum_mean_square_error) over the long run.

# ## Sensitivity to the Prior
#
# The prior I used in the previous section is uniform from 1 to 1000, but I offered no justification for choosing a uniform distribution or that particular upper bound.
# We might wonder whether the posterior distribution is sensitive to the prior.
# With so little data---only one observation---it is.
#
# This table shows what happens as we vary the upper bound:

# In[11]:


import pandas as pd

df = pd.DataFrame(columns=['Posterior mean'])
df.index.name = 'Upper bound'

for high in [500, 1000, 2000]:
    hypos = np.arange(1, high+1)
    pmf = Pmf(1, hypos)
    update_train(pmf, data=60)
    df.loc[high] = pmf.mean()

df


# As we vary the upper bound, the posterior mean changes substantially.
# So that's bad.
#
# When the posterior is sensitive to the prior, there are two ways to proceed:
#
# * Get more data.
#
# * Get more background information and choose a better prior.
#
# With more data, posterior distributions based on different priors tend to converge.
# For example, suppose that in addition to train 60 we also see trains 30 and 90.
#
# Here's how the posterior means depend on the upper bound of the prior, when we observe three trains:

# In[12]:


df = pd.DataFrame(columns=['Posterior mean'])
df.index.name = 'Upper bound'

dataset = [30, 60, 90]

for high in [500, 1000, 2000]:
    hypos = np.arange(1, high+1)
    pmf = Pmf(1, hypos)
    for data in dataset:
        update_train(pmf, data)
    df.loc[high] = pmf.mean()

df


# The differences are smaller, but apparently three trains are not enough for the posteriors to converge.

# ## Power Law Prior
#
# If more data are not available, another option is to improve the priors by gathering more background information.
# It is probably not reasonable to assume that a train-operating company with 1000 locomotives is just as likely as a company with only 1.
#
# With some effort, we could probably find a list of companies that operate locomotives in the area of observation.
# Or we could interview an expert in rail shipping to gather information about the typical size of companies.
#
# But even without getting into the specifics of railroad economics, we can make some educated guesses.
# In most fields, there are many small companies, fewer medium-sized companies, and only one or two very large companies.
#
# In fact, the distribution of company sizes tends to follow a power law, as Robert Axtell reports in *Science*.
#
# This law suggests that if there are 1000 companies with fewer than 10 locomotives, there might be 100 companies with 100 locomotives, 10 companies with 1000, and possibly one company with 10,000 locomotives.
#
# Mathematically, a power law means that the number of companies with a given size, $N$, is proportional to $(1/N)^{\alpha}$, where $\alpha$ is a parameter that is often near 1.
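# As a quick check of the scaling just described, here is a small sketch (not
# from the book) confirming that with $\alpha = 1$, each factor of 10 in company
# size cuts the expected number of companies by a factor of 10:

```python
from fractions import Fraction

# With alpha = 1, the number of companies of size N is proportional
# to (1/N)**alpha, so relative counts scale like 1/N.
alpha = 1
sizes = [10, 100, 1000, 10_000]
counts = {n: Fraction(1, n**alpha) for n in sizes}

# Each factor of 10 in size divides the count by a factor of 10
ratios = [counts[sizes[i]] / counts[sizes[i + 1]] for i in range(3)]
print(ratios)  # each ratio equals 10
```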
# We can construct a power law prior like this:

# In[13]:


alpha = 1.0
hypos = np.arange(1, 1001)    # reset hypos to 1..1000, the range used for the uniform prior
ps = hypos**(-alpha)
power = Pmf(ps, hypos, name='power law')
power.normalize()


# For comparison, here's the uniform prior again.

# In[14]:


hypos = np.arange(1, 1001)
uniform = Pmf(1, hypos, name='uniform')
uniform.normalize()


# Here's what a power law prior looks like, compared to the uniform prior:

# In[15]:


uniform.plot(color='C4')
power.plot(color='C1')
decorate(xlabel='Number of trains',
         ylabel='PMF',
         title='Prior distributions')


# Here's the update for both priors.

# In[16]:


dataset = [60]
update_train(uniform, dataset)
update_train(power, dataset)


# And here are the posterior distributions.

# In[17]:


uniform.plot(color='C4')
power.plot(color='C1')
decorate(xlabel='Number of trains',
         ylabel='PMF',
         title='Posterior distributions')


# The power law gives less prior probability to high values, which yields lower posterior means and less sensitivity to the upper bound.
#
# Here's how the posterior means depend on the upper bound when we use a power law prior and observe three trains:

# In[18]:


df = pd.DataFrame(columns=['Posterior mean'])
df.index.name = 'Upper bound'

alpha = 1.0
dataset = [30, 60, 90]

for high in [500, 1000, 2000]:
    hypos = np.arange(1, high+1)
    ps = hypos**(-alpha)
    power = Pmf(ps, hypos)
    for data in dataset:
        update_train(power, data)
    df.loc[high] = power.mean()

df


# Now the differences are much smaller. In fact, with an arbitrarily large upper bound, the mean converges on 134.
#
# So the power law prior is more realistic, because it is based on general information about the size of companies, and it behaves better in practice.

# ## Credible Intervals
#
# So far we have seen two ways to summarize a posterior distribution: the value with the highest posterior probability (the MAP) and the posterior mean.
# These are both **point estimates**, that is, single values that estimate the quantity we are interested in.
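# To make the two point estimates concrete, here is a plain-NumPy sketch (not
# using `Pmf`) that computes both for the power law posterior after observing
# trains 30, 60, and 90, with upper bound 2000:

```python
import numpy as np

# Power law prior (alpha = 1) over hypotheses 1..2000
hypos = np.arange(1, 2001)
probs = 1.0 / hypos

# Multiply in the likelihood 1/N for each observed train,
# zeroing out hypotheses smaller than the observation
for data in [30, 60, 90]:
    probs = probs * np.where(hypos >= data, 1.0 / hypos, 0.0)
probs /= probs.sum()

map_estimate = hypos[np.argmax(probs)]   # value with highest posterior probability
posterior_mean = np.sum(probs * hypos)
print(map_estimate, round(posterior_mean))  # MAP is 90; the mean is about 134
```

# The MAP is the smallest hypothesis consistent with the data, 90, while the
# posterior mean is pulled well above it by the long right tail.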
#
# Another way to summarize a posterior distribution is with percentiles.
# If you have taken a standardized test, you might be familiar with percentiles.
# For example, if your score is the 90th percentile, that means you did as well as or better than 90% of the people who took the test.
#
# If we are given a value, `x`, we can compute its **percentile rank** by finding all values less than or equal to `x` and adding up their probabilities.
#
# `Pmf` provides a method that does this computation.
# So, for example, we can compute the probability that the company has less than or equal to 100 trains:

# In[19]:


power.prob_le(100)


# With a power law prior and a dataset of three trains, the result is about 29%.
# So 100 trains is the 29th percentile.
#
# Going the other way, suppose we want to compute a particular percentile; for example, the median of a distribution is the 50th percentile.
# We can compute it by adding up probabilities until the total exceeds 0.5.
# Here's a function that does it:

# In[20]:


def quantile(pmf, prob):
    """Compute a quantile with the given prob."""
    total = 0
    for q, p in pmf.items():
        total += p
        if total >= prob:
            return q
    return np.nan


# The loop uses `items`, which iterates the quantities and probabilities in the distribution.
# Inside the loop we add up the probabilities of the quantities in order.
# When the total equals or exceeds `prob`, we return the corresponding quantity.
#
# This function is called `quantile` because it computes a quantile rather than a percentile.
# The difference is the way we specify `prob`.
# If `prob` is a percentage between 0 and 100, we call the corresponding quantity a percentile.
# If `prob` is a probability between 0 and 1, we call the corresponding quantity a **quantile**.
#
# Here's how we can use this function to compute the 50th percentile of the posterior distribution:

# In[21]:


quantile(power, 0.5)


# The result, 113 trains, is the median of the posterior distribution.
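# The loop in `quantile` can also be written without an explicit loop.
# Here is a vectorized sketch of the same idea (not the actual `Pmf`
# implementation), using a cumulative sum and a binary search; it assumes
# `prob` is between 0 and 1:

```python
import numpy as np

def quantile_vec(qs, ps, prob):
    """Vectorized version of the quantile computation above."""
    cumulative = np.cumsum(ps)
    # Index of the first quantity where the running total reaches prob
    index = np.searchsorted(cumulative, prob)
    return qs[index]

# Small check against the loop-based definition
qs = np.array([1, 2, 3])
ps = np.array([0.2, 0.3, 0.5])
print(quantile_vec(qs, ps, 0.6))  # 3
```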
#
# `Pmf` provides a method called `quantile` that does the same thing.
# We can call it like this to compute the 5th and 95th percentiles:

# In[22]:


power.quantile([0.05, 0.95])


# The result is the interval from 91 to 243 trains, which implies:
#
# * The probability is 5% that the number of trains is less than or equal to 91.
#
# * The probability is 5% that the number of trains is greater than 243.
#
# Therefore the probability is 90% that the number of trains falls between 91 and 243 (excluding 91 and including 243).
# For this reason, this interval is called a 90% **credible interval**.
#
# `Pmf` also provides `credible_interval`, which computes an interval that contains the given probability.

# In[23]:


power.credible_interval(0.9)


# ## The German Tank Problem
#
# During World War II, the Economic Warfare Division of the American Embassy in London used statistical analysis to estimate German production of tanks and other equipment.
#
# The Western Allies had captured log books, inventories, and repair records that included chassis and engine serial numbers for individual tanks.
#
# Analysis of these records indicated that serial numbers were allocated by manufacturer and tank type in blocks of 100 numbers, that numbers in each block were used sequentially, and that not all numbers in each block were used.
# So the problem of estimating German tank production could be reduced, within each block of 100 numbers, to a form of the train problem.
#
# Based on this insight, American and British analysts produced estimates substantially lower than estimates from other forms of intelligence.
# And after the war, records indicated that they were substantially more accurate.
#
# They performed similar analyses for tires, trucks, rockets, and other equipment, yielding accurate and actionable economic intelligence.
#
# The German tank problem is historically interesting; it is also a nice example of real-world application of statistical estimation.
#
# For more on this problem, see [this Wikipedia page](https://en.wikipedia.org/wiki/German_tank_problem) and Ruggles and Brodie, "An Empirical Approach to Economic Intelligence in World War II", *Journal of the American Statistical Association*, March 1947, [available here](https://web.archive.org/web/20170123132042/https://www.cia.gov/library/readingroom/docs/CIA-RDP79R01001A001300010013-3.pdf).

# ## Informative Priors
#
# Among Bayesians, there are two approaches to choosing prior distributions.
# Some recommend choosing the prior that best represents background information about the problem; in that case the prior is said to be **informative**.
# The problem with using an informative prior is that people might have different information or interpret it differently.
# So informative priors might seem arbitrary.
#
# The alternative is a so-called **uninformative prior**, which is intended to be as unrestricted as possible, in order to let the data speak for itself.
# In some cases you can identify a unique prior that has some desirable property, like representing minimal prior information about the estimated quantity.
#
# Uninformative priors are appealing because they seem more objective.
# But I am generally in favor of using informative priors.
# Why? First, Bayesian analysis is always based on modeling decisions.
# Choosing the prior is one of those decisions, but it is not the only one, and it might not even be the most subjective.
# So even if an uninformative prior is more objective, the entire analysis is still subjective.
#
# Also, for most practical problems, you are likely to be in one of two situations: either you have a lot of data or not very much.
# If you have a lot of data, the choice of the prior doesn't matter; informative and uninformative priors yield almost the same results.
# If you don't have much data, using relevant background information (like the power law distribution) makes a big difference.
#
# And if, as in the German tank problem, you have to make life-and-death decisions based on your results, you should probably use all of the information at your disposal, rather than maintaining the illusion of objectivity by pretending to know less than you do.

# ## Summary
#
# This chapter introduces the train problem, which turns out to have the same likelihood function as the dice problem, and which can be applied to the German tank problem.
# In all of these examples, the goal is to estimate a count, or the size of a population.
#
# In the next chapter, I'll introduce "odds" as an alternative to probabilities, and Bayes's Rule as an alternative form of Bayes's Theorem.
# We'll compute distributions of sums and products, and use them to estimate the number of Members of Congress who are corrupt, among other problems.
#
# But first, you might want to work on these exercises.

# ## Exercises

# **Exercise:** Suppose you are giving a talk in a large lecture hall and the fire marshal interrupts because they think the audience exceeds 1200 people, which is the safe capacity of the room.
#
# You think there are fewer than 1200 people, and you offer to prove it.
# It would take too long to count, so you try an experiment:
#
# * You ask how many people were born on May 11 and two people raise their hands.
#
# * You ask how many were born on May 23 and one person raises their hand.
#
# * Finally, you ask how many were born on August 1, and no one raises their hand.
#
# How many people are in the audience? What is the probability that there are more than 1200 people?
# Hint: Remember the binomial distribution.
# In[24]:


# Solution

# I'll use a uniform prior from 1 to 2000, in steps of 10 to speed
# things up (we'll see that the probability is small that there are
# more than 2000 people in the room)

hypos = np.arange(1, 2000, 10)
prior = Pmf(1, hypos)
prior.normalize()


# In[25]:


# Solution

# We can use the binomial distribution to compute the probability
# of the data for each hypothetical audience size

from scipy.stats import binom

likelihood1 = binom.pmf(2, hypos, 1/365)
likelihood2 = binom.pmf(1, hypos, 1/365)
likelihood3 = binom.pmf(0, hypos, 1/365)


# In[26]:


# Solution

# Here's the update

posterior = prior * likelihood1 * likelihood2 * likelihood3
posterior.normalize()


# In[27]:


# Solution

# And here's the posterior distribution

posterior.plot(color='C4', label='posterior')
decorate(xlabel='Number of people in the audience',
         ylabel='PMF')


# In[28]:


# Solution

# If we have to guess the audience size,
# we might use the posterior mean

posterior.mean()


# In[29]:


# Solution

# And we can use prob_gt to compute the probability
# of exceeding the capacity of the room.
# It's about 1%, which may or may not satisfy the fire marshal

posterior.prob_gt(1200)


# **Exercise:** I often see [rabbits](https://en.wikipedia.org/wiki/Eastern_cottontail) in the garden behind my house, but it's not easy to tell them apart, so I don't really know how many there are.
#
# Suppose I deploy a motion-sensing [camera trap](https://en.wikipedia.org/wiki/Camera_trap) that takes a picture of the first rabbit it sees each day. After three days, I compare the pictures and conclude that two of them are the same rabbit and the other is different.
#
# How many rabbits visit my garden?
#
# To answer this question, we have to think about the prior distribution and the likelihood of the data:
#
# * I have sometimes seen four rabbits at the same time, so I know there are at least that many. I would be surprised if there were more than 10. So, at least as a starting place, I think a uniform prior from 4 to 10 is reasonable.
#
# * To keep things simple, let's assume that all rabbits who visit my garden are equally likely to be caught by the camera trap in a given day. Let's also assume it is guaranteed that the camera trap gets a picture every day.

# In[30]:


# Solution

hypos = np.arange(4, 11)
prior = Pmf(1, hypos)


# In[31]:


# Solution

# The probability that the second rabbit is the same as the first is 1/N
# The probability that the third rabbit is different is (N-1)/N

N = hypos
likelihood = (N-1) / N**2


# In[32]:


# Solution

posterior = prior * likelihood
posterior.normalize()

posterior.bar(alpha=0.7)
decorate(xlabel='Number of rabbits',
         ylabel='PMF',
         title='The Rabbit Problem')


# **Exercise:** Suppose that in the criminal justice system, all prison sentences are either 1, 2, or 3 years, with an equal number of each. One day, you visit a prison and choose a prisoner at random. What is the probability that they are serving a 3-year sentence? What is the average remaining sentence of the prisoners you observe?

# In[33]:


# Solution

# Here's the prior distribution of sentences

hypos = np.arange(1, 4)
prior = Pmf(1/3, hypos)
prior


# In[34]:


# Solution

# If you visit a prison at a random point in time,
# the probability of observing any given prisoner
# is proportional to the duration of their sentence.

likelihood = hypos
posterior = prior * likelihood
posterior.normalize()
posterior


# In[35]:


# Solution

# The mean of the posterior is the average sentence.
# We can divide by 2 to get the average remaining sentence.

posterior.mean() / 2


# **Exercise:** If I chose a random adult in the U.S., what is the probability that they have a sibling? To be precise, what is the probability that their mother has had at least one other child?
#
# [This article from the Pew Research Center](https://www.pewsocialtrends.org/2015/05/07/family-size-among-mothers/) provides some relevant data.
# From it, I extracted the following distribution of family size for mothers in the U.S.
# who were 40-44 years old in 2014:

# In[36]:


import matplotlib.pyplot as plt

qs = [1, 2, 3, 4]
ps = [22, 41, 24, 14]
prior = Pmf(ps, qs)
prior.bar(alpha=0.7)

plt.xticks(qs, ['1 child', '2 children', '3 children', '4+ children'])
decorate(ylabel='PMF',
         title='Distribution of family size')


# For simplicity, let's assume that all families in the 4+ category have exactly 4 children.

# In[37]:


# Solution

# When you choose a person at random, you are more likely to get someone
# from a bigger family; in fact, the chance of choosing someone from
# any given family is proportional to the number of children

likelihood = qs
posterior = prior * likelihood
posterior.normalize()
posterior


# In[38]:


# Solution

# The probability that they have a sibling is the probability
# that they do not come from a family of 1

1 - posterior[1]


# In[39]:


# Solution

# Or we could use prob_gt again

posterior.prob_gt(1)


# **Exercise:** The [Doomsday argument](https://en.wikipedia.org/wiki/Doomsday_argument) is "a probabilistic argument that claims to predict the number of future members of the human species given an estimate of the total number of humans born so far."
#
# Suppose there are only two kinds of intelligent civilizations that can happen in the universe. The "short-lived" kind go extinct after only 200 billion individuals are born. The "long-lived" kind survive until 2,000 billion individuals are born.
# And suppose that the two kinds of civilization are equally likely.
# Which kind of civilization do you think we live in?
#
# The Doomsday argument says we can use the total number of humans born so far as data.
# According to the [Population Reference Bureau](https://www.prb.org/howmanypeoplehaveeverlivedonearth/), the total number of people who have ever lived is about 108 billion.
#
# Since you were born quite recently, let's assume that you are, in fact, human being number 108 billion.
# If $N$ is the total number who will ever live and we consider you to be a randomly-chosen person, it is equally likely that you could have been person 1, or $N$, or any number in between.
# So what is the probability that you would be number 108 billion?
#
# Given this data and dubious prior, what is the probability that our civilization will be short-lived?

# In[40]:


# Solution

hypos = [200, 2000]
prior = Pmf(1, hypos)


# In[41]:


# Solution

likelihood = 1/prior.qs
posterior = prior * likelihood
posterior.normalize()
posterior


# In[42]:


# Solution

# According to this analysis, the probability is about 91% that our
# civilization will be short-lived.
# But this conclusion is based on a dubious prior.

# And with so little data, the posterior depends strongly on the prior.
# To see that, run this analysis again with a different prior,
# and see what the results look like.
# What do you think of the Doomsday argument?


# In[ ]:
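# Following the suggestion in the last solution, here is one possible
# variation (not from the book), using a hypothetical prior that favors
# long-lived civilizations 9 to 1:

```python
import numpy as np

# Hypothetical skeptical prior: 9-to-1 in favor of long-lived civilizations
hypos = np.array([200.0, 2000.0])   # total births, in billions
prior = np.array([0.1, 0.9])

# Likelihood of being person number 108 billion, given N
likelihood = 1 / hypos

posterior = prior * likelihood
posterior /= posterior.sum()
print(posterior[0])  # P(short-lived) is about 0.53, down from about 0.91
```

# With this prior, a single observation moves the probability of a
# short-lived civilization only to about 53%, which illustrates how strongly
# the conclusion depends on the prior.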