from __future__ import print_function, division
import thinkbayes2
import thinkplot
import numpy as np
from scipy import stats
%matplotlib inline
Here's an update function that takes the prior probability of being sick, prior $=x$, and the likelihoods of the data, prob_given_sick $=P(fever|sick)$ and prob_given_not $=P(fever|not sick)$:
def update(prior, prob_given_sick, prob_given_not):
    suite = thinkbayes2.Suite()
    suite['sick'] = prior * prob_given_sick
    suite['not sick'] = (1-prior) * prob_given_not
    suite.Normalize()
    return suite['sick']
If we start with $x=0.1$ and update with the assumption that fever is more likely if you're sick, the posterior goes up to $x' = 0.25$.
prior = 0.1
prob_given_sick = 0.9
prob_given_not = 0.3
post = update(prior, prob_given_sick, prob_given_not)
post
0.24999999999999997
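As a sanity check, here's the same update written out directly from Bayes's rule, without thinkbayes2 (an added sketch, not part of the original analysis):

```python
# Bayes's rule by hand: P(sick | fever) = P(sick) P(fever | sick) / P(fever)
prior = 0.1
prob_given_sick = 0.9    # P(fever | sick)
prob_given_not = 0.3     # P(fever | not sick)

numer = prior * prob_given_sick
denom = prior * prob_given_sick + (1 - prior) * prob_given_not
posterior = numer / denom
print(posterior)  # 0.25, up to floating-point noise
```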
Now suppose we don't know $s=$ prob_given_sick $=P(fever|sick)$ or $t=$ prob_given_not $=P(fever|not sick)$, but we think they are uniformly distributed and independent.
dist_s = thinkbayes2.Beta(1, 1)
dist_t = thinkbayes2.Beta(1, 1)
dist_s.Mean(), dist_t.Mean()
(0.5, 0.5)
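Beta(1, 1) is just the uniform distribution on [0, 1]; here's a quick check with scipy.stats (an added sketch, independent of thinkbayes2):

```python
from scipy import stats

# Beta(1, 1) has mean 1/2 and a flat density on [0, 1]
dist = stats.beta(1, 1)
print(dist.mean())                    # 0.5
print(dist.pdf(0.2), dist.pdf(0.8))   # both 1.0
```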
We can compute the distribution of $x'$ by drawing samples from the distributions of $s$ and $t$ and computing the posterior for each pair.
n = 1000
ss = dist_s.Sample(n)
ts = dist_t.Sample(n)
Just checking that the samples have the right distributions:
thinkplot.Cdf(thinkbayes2.Cdf(ss))
thinkplot.Cdf(thinkbayes2.Cdf(ts))
{'xscale': 'linear', 'yscale': 'linear'}
Now computing the posteriors:
posts = [update(prior, s, t) for s, t in zip(ss, ts)]
Here's what the distribution of values for $x'$ looks like:
cdf = thinkbayes2.Cdf(posts)
thinkplot.Cdf(cdf)
{'xscale': 'linear', 'yscale': 'linear'}
And here's the mean:
cdf.Mean()
0.15586200108737797
This result implies that if our prior probability for $x$ is 0.1, and then we learn that the patient has a fever, we should be uncertain about $x'$, and this distribution describes that uncertainty. It says that the fever probably has little predictive power, but might have quite a lot.
The mean of this distribution is a little higher than the prior, which suggests that our priors for $s$ and $t$ are not neutral with respect to updating $x$. It's surprising that the effect is not symmetric, because our beliefs about $s$ and $t$ are symmetric. But then again, we just computed an arithmetic mean on a set of probabilities, which is a bogus kind of thing to do. So maybe we deserve what we got.
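As a cross-check on that mean, here's the same computation vectorized in numpy with a much larger sample (an added sketch; the seed is arbitrary). With $n = 10^6$ the sample mean settles near 0.154, so the value above carries a little sampling noise:

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, for reproducibility
n = 1_000_000
prior = 0.1
ss = rng.random(n)                # s ~ Uniform(0, 1)
ts = rng.random(n)                # t ~ Uniform(0, 1)
posts = prior * ss / (prior * ss + (1 - prior) * ts)
m = posts.mean()
print(m)  # close to 0.154
```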
Just for fun, what would we have to believe about $s$ and $t$ to make them neutral with respect to the posterior mean of $x$?
dist_s = thinkbayes2.Beta(1, 1)
dist_t = thinkbayes2.Beta(1.75, 1)
n = 10000
ss = dist_s.Sample(n)
ts = dist_t.Sample(n)
thinkplot.Cdf(thinkbayes2.Cdf(ss))
thinkplot.Cdf(thinkbayes2.Cdf(ts))
{'xscale': 'linear', 'yscale': 'linear'}
posts = [update(prior, s, t) for s, t in zip(ss, ts)]
np.array(posts).mean()
0.097733895049857883
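A larger vectorized run (an added sketch, using numpy instead of thinkbayes2's samplers) confirms that this choice of priors is roughly neutral:

```python
import numpy as np

rng = np.random.default_rng(0)    # arbitrary seed
n = 1_000_000
prior = 0.1
ss = rng.random(n)                # s ~ Beta(1, 1) = Uniform(0, 1)
ts = rng.beta(1.75, 1, n)         # t ~ Beta(1.75, 1)
posts = prior * ss / (prior * ss + (1 - prior) * ts)
m = posts.mean()
print(m)  # close to the prior, 0.1
```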
Now here's a version that simulates worlds where $x$ is known and $s$ and $t$ are drawn from uniform distributions. For each $s$-$t$ pair, we generate one patient with a fever and compute the probability that they are sick.
def prob_sick(x, s, t):
    return x * s / (x * s + (1-x) * t)
dist_s = thinkbayes2.Beta(1, 1)
dist_t = thinkbayes2.Beta(1, 1)
n = 10000
ss = dist_s.Sample(n)
ts = dist_t.Sample(n)
x = 0.1
probs = [prob_sick(x, s, t) for s, t in zip(ss, ts)]
np.array(probs).mean()
0.15306534465160232
cohort = np.random.random(len(probs)) < probs
cohort.mean()
0.1439
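The mean of probs (about 0.153) is an unweighted average over worlds. If we instead simulate whole worlds and pool the patients, fever-prone worlds contribute more fever patients, and the pooled $P(sick|fever)$ comes back to $x = 0.1$. Here's a sketch of that fuller simulation (added, not part of the original):

```python
import numpy as np

rng = np.random.default_rng(1)    # arbitrary seed
n = 1_000_000
x = 0.1

ss = rng.random(n)                # one (s, t) world per patient
ts = rng.random(n)
sick = rng.random(n) < x
p_fever = np.where(sick, ss, ts)  # P(fever) depends on sickness status
fever = rng.random(n) < p_fever

frac = sick[fever].mean()         # P(sick | fever), pooled across worlds
print(frac)  # close to 0.1
```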
April 6, 2016
Suppose we take the average value of $t$ and compute $p = P(sick|fever)$; then we would treat $p$ as the known quantity 0.1:
s = 0.5
t = 0.5
x = 0.1
p = x * s / (x * s + (1-x) * t)
p
0.1
If we propagate the uncertainty about $t$ through the calculation, we consider $p$ to be either $p_1$ or $p_2$, but we don't know which.
t = 0.8
p1 = x * s / (x * s + (1-x) * t)
p1, 5/77
(0.06493506493506493, 0.06493506493506493)
t = 0.2
p2 = x * s / (x * s + (1-x) * t)
p2, 5/23
(0.21739130434782605, 0.21739130434782608)
If we were asked to make a prediction about a single patient, we would average the two possible values of p.
(p1 + p2) / 2
0.1411631846414455
On a log-odds scale, the two possible updates are not symmetric around the prior:
def logodds(p):
    return np.log(p / (1-p))
logodds(p1) - logodds(0.1), logodds(p2) - logodds(0.1)
(-0.47000362924573613, 0.91629073187415466)
p_mix = (p1 + p2) / 2
logodds(p_mix) - logodds(0.1)
0.39156220293917254
So let's simulate a series of patients by drawing a random value of $t$, computing $p$, and then tossing coins with probability $p$.
import random
def simulate_patient():
    s = 0.5
    t = random.choice([0.2, 0.8])
    x = 0.1
    p = x * s / (x * s + (1-x) * t)
    return random.random() < p
simulate_patient()
False
In this simulation, the fraction of patients with fever who turn out to be sick is close to 0.141:
patients = [simulate_patient() for _ in range(10000)]
sum(patients) / len(patients)
0.1402
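The same simulation vectorizes naturally in numpy (an added sketch; same model, arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x, s = 0.1, 0.5
ts = rng.choice([0.2, 0.8], size=n)   # pick a world for each patient
ps = x * s / (x * s + (1 - x) * ts)   # P(sick | fever) in that world
sick = rng.random(n) < ps
frac = sick.mean()
print(frac)  # close to (p1 + p2) / 2 = 0.1412
```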
x = 0.1
s = 0.5
t1 = 0.2
t2 = 0.8
import pandas as pd
d1 = dict(feverbad=0.5, fevergood=0.5)
d2 = dict(sick=x, notsick=1-x)
d3 = dict(fever='t', notfever='1-t')
iterables = [d1.keys(), d2.keys(), d3.keys()]
index = pd.MultiIndex.from_product(iterables, names=['first', 'second', 'third'])
df = pd.DataFrame(np.zeros(8), index=index, columns=['prob'])
t_map = dict(fevergood=t2, feverbad=t1)
for label1, p1 in d1.items():
    t = t_map[label1]
    for label2, p2 in d2.items():
        for label3, p3 in d3.items():
            if label2 == 'sick':
                # P(fever | sick) = s = 0.5 in both worlds
                p = p1 * p2 * 0.5
            else:
                # p3 is the string 't' or '1-t', evaluated with this world's t
                p = p1 * p2 * eval(p3)
            df.prob[label1, label2, label3] = p
df
| first | second | third | prob |
|---|---|---|---|
| fevergood | notsick | fever | 0.360 |
| fevergood | notsick | notfever | 0.090 |
| fevergood | sick | fever | 0.025 |
| fevergood | sick | notfever | 0.025 |
| feverbad | notsick | fever | 0.090 |
| feverbad | notsick | notfever | 0.360 |
| feverbad | sick | fever | 0.025 |
| feverbad | sick | notfever | 0.025 |
If there are two kinds of people, some more fever-prone than others, and we don't know which kind of patient we're dealing with, $P(sick|fever)$ is:
df.prob[:, 'sick', 'fever'].sum() / df.prob[:, :, 'fever'].sum()
0.09999999999999998
If we know fevergood, P(sick|fever) is
p_sick_fevergood = df.prob['fevergood', 'sick', 'fever'].sum() / df.prob['fevergood', :, 'fever'].sum()
p_sick_fevergood
0.064935064935064929
If we know feverbad, P(sick|fever) is
p_sick_feverbad = df.prob['feverbad', 'sick', 'fever'].sum() / df.prob['feverbad', :, 'fever'].sum()
p_sick_feverbad
0.21739130434782605
If we think there's a 50-50 chance of feverbad, the unweighted average is:
(p_sick_fevergood + p_sick_feverbad) / 2
0.1411631846414455
If fevergood, here's the fraction of all patients with fever:
p_fever_fevergood = df.prob['fevergood', :, 'fever'].sum()
p_fever_fevergood
0.38500000000000006
If feverbad, here's the fraction of all patients with fever:
p_fever_feverbad = df.prob['feverbad', :, 'fever'].sum()
p_fever_feverbad
0.11500000000000002
So if we started out thinking there's a 50-50 chance of feverbad, we should now think feverbad is less likely:
p_feverbad_fever = p_fever_feverbad / (p_fever_feverbad + p_fever_fevergood)
p_feverbad_fever
0.22999999999999998
And if we compute the weighted sum over the two possible worlds, we recover $P(sick|fever) = 0.1$:
p_feverbad_fever * p_sick_feverbad + (1-p_feverbad_fever) * p_sick_fevergood
0.099999999999999978
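The 0.0999... above is just floating-point noise; with exact rational arithmetic (an added sketch using fractions from the standard library) the weighted sum is exactly 1/10:

```python
from fractions import Fraction

x = Fraction(1, 10)    # prior P(sick)
s = Fraction(1, 2)     # P(fever | sick), same in both worlds
t1 = Fraction(1, 5)    # P(fever | not sick) in the feverbad world
t2 = Fraction(4, 5)    # P(fever | not sick) in the fevergood world

def prob_sick(x, s, t):
    return x * s / (x * s + (1 - x) * t)

p_bad = prob_sick(x, s, t1)    # 5/23
p_good = prob_sick(x, s, t2)   # 5/77

# P(fever) in each world, with a 50-50 prior over worlds
f_bad = Fraction(1, 2) * (x * s + (1 - x) * t1)
f_good = Fraction(1, 2) * (x * s + (1 - x) * t2)
w_bad = f_bad / (f_bad + f_good)   # P(feverbad | fever) = 23/100

result = w_bad * p_bad + (1 - w_bad) * p_good
print(result)  # 1/10
```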