Code and exercises from my workshop on Bayesian statistics in Python.
Copyright 2016 Allen Downey
MIT License: https://opensource.org/licenses/MIT
# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    !pip install empiricaldist
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style('white')
sns.set_context('talk')
import matplotlib.pyplot as plt
from empiricaldist import Pmf
Create a suite of hypotheses that represents dice with different numbers of sides.
dice = Pmf.from_seq([4, 6, 8, 12])
dice
| | probs |
|---|---|
| 4 | 0.25 |
| 6 | 0.25 |
| 8 | 0.25 |
| 12 | 0.25 |
Exercise 1: We'll solve this problem two ways. First we'll do it "by hand", as we did with the cookie problem; that is, we'll multiply each hypothesis by the likelihood of the data, and then renormalize.

In the space below, update `dice` based on the likelihood of the data (rolling a 6), then normalize and display the results.
# Solution
dice[4] *= 0      # a 4-sided die can't produce a 6
dice[6] *= 1/6
dice[8] *= 1/8
dice[12] *= 1/12
dice.normalize()
dice
| | probs |
|---|---|
| 4 | 0.000000 |
| 6 | 0.444444 |
| 8 | 0.333333 |
| 12 | 0.222222 |
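As a sanity check, here's the same arithmetic spelled out with plain NumPy arrays (a sketch added for illustration, not part of the original solution):

prior = np.array([0.25, 0.25, 0.25, 0.25])   # uniform prior over 4, 6, 8, 12 sides
like = np.array([0, 1/6, 1/8, 1/12])         # probability of rolling a 6 with each die
posterior = prior * like                     # multiply each hypothesis by the likelihood
posterior / posterior.sum()                  # renormalize

array([0.        , 0.44444444, 0.33333333, 0.22222222])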
Exercise 2: Now let's do the same calculation using `Pmf.update`, which encodes the structure of a Bayesian update.

Define a function called `likelihood_dice` that takes `data` and `hypo` and returns the probability of the data (the outcome of rolling the die) for a given hypothesis (number of sides on the die).
Hint: What should you do if the outcome exceeds the hypothetical number of sides on the die?
Here's an outline to get you started.
def likelihood_dice(data, hypo):
    """Likelihood function for the dice problem.

    data: outcome of the die roll
    hypo: number of sides
    returns: float probability
    """
    # TODO: fill this in!
    return 1
# Solution
def likelihood_dice(data, hypo):
    """Likelihood function for the dice problem.

    data: outcome of the die roll
    hypo: number of sides
    returns: float probability
    """
    if data > hypo:
        return 0
    else:
        return 1 / hypo
Now we can create a `Pmf` object and update it.
dice = Pmf.from_seq([4, 6, 8, 12])
dice.update(likelihood_dice, 6)
dice
| | probs |
|---|---|
| 4 | 0.000000 |
| 6 | 0.444444 |
| 8 | 0.333333 |
| 12 | 0.222222 |
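For intuition, here's roughly what `update` is doing under the hood: loop over the hypotheses, multiply each prior probability by the likelihood of the data, and renormalize. This is a sketch using names introduced here (`update_by_hand`, `dice_check`); the actual implementation in empiricaldist may differ in detail.

def update_by_hand(pmf, likelihood, data):
    """Bayesian update: multiply by the likelihood, then renormalize."""
    for hypo in pmf.qs:
        pmf[hypo] *= likelihood(data, hypo)
    pmf.normalize()

dice_check = Pmf.from_seq([4, 6, 8, 12])
update_by_hand(dice_check, likelihood_dice, 6)
dice_check    # matches the posterior above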
If we get more data, we can perform more updates.
for roll in [8, 7, 7, 5, 4]:
    dice.update(likelihood_dice, roll)
Here are the results. The 4-sided and 6-sided dice are now ruled out: the earlier roll of 6 is impossible on a 4-sided die, and the rolls of 7 and 8 are impossible on a 6-sided die.
dice
| | probs |
|---|---|
| 4 | 0.000000 |
| 6 | 0.000000 |
| 8 | 0.919294 |
| 12 | 0.080706 |
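Because the rolls are independent, updating one roll at a time yields the same posterior as a single update with the product of the likelihoods. Here's a sketch; `likelihood_rolls` and `dice_batch` are names introduced here, and this assumes `update` passes `data` through to the likelihood function unchanged, as the calls above suggest.

def likelihood_rolls(data, hypo):
    """Likelihood of a sequence of independent rolls.

    data: sequence of die-roll outcomes
    hypo: number of sides
    returns: float probability
    """
    result = 1
    for roll in data:
        result *= likelihood_dice(roll, hypo)
    return result

dice_batch = Pmf.from_seq([4, 6, 8, 12])
dice_batch.update(likelihood_rolls, [6, 8, 7, 7, 5, 4])
dice_batch    # same posterior as the sequential updates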
The German tank problem is actually identical to the dice problem: we observe the serial number of a captured tank and want to infer the total number of tanks produced. The serial number plays the role of the die roll, and the number of tanks plays the role of the number of sides.
def likelihood_tank(data, hypo):
    """Likelihood function for the tank problem.

    data: observed serial number
    hypo: number of tanks
    returns: float probability
    """
    if data > hypo:
        return 0
    else:
        return 1 / hypo
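Since the two likelihood functions are line-for-line identical, we could equally well have reused the dice version. This one-liner is a stylistic alternative, not part of the original workshop code:

likelihood_tank = likelihood_dice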
Here is the update after seeing Tank #42, starting from a uniform prior over 0 through 99 tanks.
tank = Pmf.from_seq(range(100))
tank.update(likelihood_tank, 42)
tank.mean()
66.32784309363326
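The posterior mean is one way to summarize the result; the single most probable hypothesis (the MAP estimate) is another. Because the likelihood 1/hypo favors the smallest number of tanks consistent with the data, the MAP is 42 itself. We can check with `idxmax`, which `Pmf` inherits from `pandas.Series`:

tank.idxmax()    # hypothesis with the highest posterior probability

42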
And here's what the posterior distribution looks like: it is zero below 42 and falls off like 1/N above it.
def decorate_tank(title):
    """Labels the axes.

    title: string
    """
    plt.xlabel('Number of tanks')
    plt.ylabel('PMF')
    plt.title(title)
tank.plot()
decorate_tank('Distribution after one tank')
Exercise 3: Suppose we see another tank with serial number 17. What effect does this have on the posterior probabilities?

Update the `Pmf` with the new data and plot the results.
# Solution
tank.plot(color='0.7', label='prior')
tank.update(likelihood_tank, 17)
tank.plot(label='posterior')
plt.legend()
decorate_tank('Distribution after two tanks')
tank.mean()

62.25994473449364

Tank #17 doesn't rule out any new hypotheses, since everything below 42 was already excluded, but it multiplies in another factor of 1/N, shifting the posterior toward smaller fleets and lowering the mean from about 66.3 to 62.3.