In hypothesis testing, statistics can be thought of as observations of random variables from known distributions, which lets us make statements about how likely our assumptions are to hold.
The logic of falsification
The null hypothesis H0 represents some default position.
The alternative hypothesis H1 is the position we want to compare it with.
We never prove the alternative hypothesis; we merely lend it support by rejecting the null hypothesis.
Imagine we have a coin and we want to test whether it’s fair.
Null hypothesis H0: The coin is fair.
Each coin flip is a Bernoulli trial, so if X is the number of heads in n flips, X is a Binomial(n, p) random variable, which we can approximate using the normal distribution:
from probability import normal_cdf, inverse_normal_cdf
import math, random

def normal_approximation_to_binomial(n, p):
    """finds mu and sigma corresponding to a Binomial(n, p)"""
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma
Whenever a random variable follows a normal distribution, we can use normal_cdf to figure out the probability that its realized value lies within (or outside) a particular interval:
# the normal cdf _is_ the probability the variable is below a threshold
normal_probability_below = normal_cdf

# it's above the threshold if it's not below the threshold
def normal_probability_above(lo, mu=0, sigma=1):
    return 1 - normal_cdf(lo, mu, sigma)

# it's between if it's less than hi, but not less than lo
def normal_probability_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# it's outside if it's not between
def normal_probability_outside(lo, hi, mu=0, sigma=1):
    return 1 - normal_probability_between(lo, hi, mu, sigma)
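As a quick sanity check (a sketch, using the standard normal defaults mu = 0 and sigma = 1), about 68% of the probability lies within one standard deviation of the mean:
print(normal_probability_between(-1, 1))    # about 0.6827
print(normal_probability_outside(-2, 2))    # about 0.0455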
We can also find either the nontail region or the (symmetric) interval around the mean that accounts for a certain level of likelihood.
For example, if we want to find an interval centered at the mean and containing 60% probability, we find the cutoffs where the upper and lower tails each contain 20% of the probability:
def normal_upper_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z <= z) = probability"""
    return inverse_normal_cdf(probability, mu, sigma)

def normal_lower_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z >= z) = probability"""
    return inverse_normal_cdf(1 - probability, mu, sigma)

def normal_two_sided_bounds(probability, mu=0, sigma=1):
    """returns the symmetric (about the mean) bounds
    that contain the specified probability"""
    tail_probability = (1 - probability) / 2
    # upper bound should have tail_probability above it
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)
    # lower bound should have tail_probability below it
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)
    return lower_bound, upper_bound
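For the 60% example just mentioned (a quick check with the standard normal defaults), the interval should leave 20% in each tail, i.e. run from roughly -0.84 to 0.84:
print(normal_two_sided_bounds(0.60))    # approximately (-0.84, 0.84)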
In particular, let's say that we choose to flip the coin n = 1000 times. If our hypothesis of fairness is true, X should be distributed approximately normally with mean 500 and standard deviation 15.8:
import numpy as np
np.sqrt(1000*0.5*(1-0.5))
15.811388300841896
mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)
print("mu_0", mu_0)
print("sigma_0", sigma_0)
mu_0 500.0 sigma_0 15.811388300841896
For the null hypothesis H0, there are two kinds of errors we can make: a Type 1 error (rejecting H0 when it is true) and a Type 2 error (failing to reject H0 when it is false).
We need to make a decision about significance: how willing are we to make a type 1 error ("false positive"), in which we reject H0 even though it's true? Here we choose the conventional 5% significance level, and we reject H0 whenever X falls outside the two-sided bounds computed below.
Assuming p really equals 0.5 (i.e., H0 is true), there is just a 5% chance we observe an X that lies outside this interval, which is the exact significance we wanted.
Put differently, if H0 is true, approximately 19 times out of 20 this test will give the correct result.
print("normal_two_sided_bounds(0.95, mu_0, sigma_0)")
lo, hi = normal_two_sided_bounds(0.95, mu_0, sigma_0)
print("power of a test")
print("95% bounds based on assumption p is 0.5")
print("lo", lo)
print("hi", hi)
normal_two_sided_bounds(0.95, mu_0, sigma_0) power of a test 95% bounds based on assumption p is 0.5 lo 469.01026640487555 hi 530.9897335951244
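As a quick sanity check (a sketch), the interval we just computed should contain about 95% of the probability under H0:
print(normal_probability_between(lo, hi, mu_0, sigma_0))    # about 0.95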
We are also interested in the power of a test: the probability of not making a type 2 error ("false negative"), in which we fail to reject H0 even though it's false. To measure this, we have to specify what exactly H0 being false means. Let's check what happens if p is really 0.55:
print("actual mu and sigma based on p = 0.55")
mu_1, sigma_1 = normal_approximation_to_binomial(1000, 0.55)
print("mu_1", mu_1)
print("sigma_1", sigma_1)
print("Now, the coin is slightly biased toward heads.")
actual mu and sigma based on p = 0.55 mu_1 550.0 sigma_1 15.732132722552274 Now the coin is slightly biased toward heads.
# a type 2 error means we fail to reject the null hypothesis,
# which will happen when X is still in our original interval
type_2_probability = normal_probability_between(lo, hi, mu_1, sigma_1)
power = 1 - type_2_probability # 0.887
print("type 2 probability", type_2_probability)
print("power", power)
type 2 probability 0.113 power 0.887
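As a cross-check (a sketch), essentially all of that type 2 probability comes from the upper cutoff, since under p = 0.55 the chance that X falls below lo is vanishingly small:
print((hi - mu_1) / sigma_1)            # about -1.21 standard deviations
print(normal_cdf(hi, mu_1, sigma_1))    # about 0.113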
Now assume instead that our null hypothesis H0 is that the coin is not biased toward heads, that is, that p ≤ 0.5.
In that case we want a one-sided test that rejects the null hypothesis when X is much larger than 500 but not when X is smaller than 500.
print("one-sided test")
hi = normal_upper_bound(0.95, mu_0, sigma_0)
print("hi", hi) # is 526 (< 531, since we need more probability in the upper tail)
type_2_probability = normal_probability_below(hi, mu_1, sigma_1)
power = 1 - type_2_probability # = 0.936
print("type 2 probability", type_2_probability)
print("power", power)
one-sided test hi 526.0073585242053 type 2 probability 0.06362051966928273 power 0.9363794803307173
This is a more powerful test, since it no longer rejects H0 when X is below 469 (which is very unlikely to happen if H1 is true) and instead rejects H0 when X is between 526 and 531 (which is somewhat likely to happen if H1 is true).
An alternative way of thinking about the preceding test involves p-values: instead of choosing bounds in advance, we compute the probability, assuming H0 is true, of seeing a value at least as extreme as the one we actually observed:
def two_sided_p_value(x, mu=0, sigma=1):
    if x >= mu:
        # if x is greater than the mean, the tail is above x
        return 2 * normal_probability_above(x, mu, sigma)
    else:
        # if x is less than the mean, the tail is below x
        return 2 * normal_probability_below(x, mu, sigma)
# If we were to see 530 heads, we would compute:
print("two_sided_p_value(529.5, mu_0, sigma_0) is ", \
two_sided_p_value(529.5, mu_0, sigma_0))
two_sided_p_value(529.5, mu_0, sigma_0) is 0.06207721579598857
Why did we use 529.5 instead of 530? This is what's called a continuity correction. It reflects the fact that normal_probability_between(529.5, 530.5, mu_0, sigma_0) is a better estimate of the probability of seeing 530 heads than normal_probability_between(530, 531, mu_0, sigma_0) is. One way to convince yourself that this is a sensible estimate is with a simulation:
def count_extreme_values():
    extreme_value_count = 0
    for _ in range(100000):
        num_heads = sum(1 if random.random() < 0.5 else 0   # count # of heads
                        for _ in range(1000))               # in 1000 flips
        if num_heads >= 530 or num_heads <= 470:            # and count how often
            extreme_value_count += 1                        # the # is 'extreme'
    return extreme_value_count / 100000
count_extreme_values() # 0.062
0.06118
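We can also compare the two continuity-corrected windows directly (a quick check; the values below are approximate). The 529.5-to-530.5 window is centered on the count 530, while 530-to-531 sits half a count too high:
print(normal_probability_between(529.5, 530.5, mu_0, sigma_0))   # about 0.0042
print(normal_probability_between(530, 531, mu_0, sigma_0))       # about 0.0039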
Since the p-value is greater than our 5% significance, we don’t reject the null.
print("two_sided_p_value(531.5, mu_0, sigma_0) is ",\
two_sided_p_value(531.5, mu_0, sigma_0))
two_sided_p_value(531.5, mu_0, sigma_0) is 0.046345287837786575
It is smaller than the 5% significance, and we would reject the null.
For our one-sided test, if we saw 525 heads we would compute:
upper_p_value = normal_probability_above
lower_p_value = normal_probability_below
upper_p_value(524.5, mu_0, sigma_0) # 0.061
# wouldn’t reject the null.
0.06062885772582083
For our one-sided test, if we saw 527 heads we would compute:
upper_p_value(526.5, mu_0, sigma_0) # 0.047
# we would reject the null.
0.04686839508859242
Make sure your data is roughly normally distributed before using normal_probability_above to compute p-values.
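There are various statistical tests for normality. Here is a minimal sketch of one such check, assuming SciPy is available (the Shapiro-Wilk test via scipy.stats.shapiro; the choice of test and the stand-in data are illustrative, not from the original text):
from scipy import stats
sample = [random.gauss(0, 1) for _ in range(500)]   # stand-in data for illustration
stat, p = stats.shapiro(sample)
print(p)   # a large p-value is consistent with normality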
A third approach to inference is to construct a confidence interval around the observed value of the parameter.
Example
We can estimate the probability of the unfair coin by looking at the average value of the Bernoulli variables corresponding to each flip: 1 if heads, 0 if tails. If we observe 525 heads out of 1,000 flips, then we estimate p = 0.525.
How confident can we be about this estimate?
If we knew the exact value of p, the central limit theorem tells us that the average of those Bernoulli variables should be approximately normal, with mean p and standard deviation $\sigma = \sqrt{p(1 - p) / 1000}$. Here we don't know p, so instead we use our estimate:
p_hat = 525 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat)/ 1000) # 0.0158
print(mu, sigma)
0.525 0.015791611697353755
Using the normal approximation, we conclude that we are “95% confident” that the following interval contains the true parameter p:
normal_two_sided_bounds(0.95, mu, sigma)
(0.4940490278129096, 0.5559509721870904)
This is a statement about the interval, not about p: if you were to repeat the experiment many times, 95% of the time the “true” parameter (which is the same every time) would lie within the observed confidence interval (which might be different every time).
In this case, we do not conclude that the coin is unfair, since 0.5 falls within our confidence interval.
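To see the "repeat the experiment many times" interpretation in action, here is a minimal simulation sketch (assuming a truly fair coin, so the true p is 0.5; ci_contains_p is a hypothetical helper name, not from the original code):
def ci_contains_p(true_p=0.5, n=1000):
    num_heads = sum(1 for _ in range(n) if random.random() < true_p)
    p_hat = num_heads / n
    s = math.sqrt(p_hat * (1 - p_hat) / n)
    lo, hi = normal_two_sided_bounds(0.95, p_hat, s)
    return lo <= true_p <= hi

print(sum(ci_contains_p() for _ in range(1000)) / 1000)   # roughly 0.95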
If instead we’d seen 540 heads, then we’d have:
p_hat = 540 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)
print(mu, sigma, normal_two_sided_bounds(0.95, mu, sigma) )
0.54 0.015760710643876435 (0.5091095927295919, 0.5708904072704082)
Here, “fair coin” doesn’t lie in the confidence interval.
P-hacking is when researchers keep trying statistical analyses until they obtain p < .05; sometimes this happens without the researcher even being aware of it. See:
Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci. 2011 Nov;22(11):1359-66.
def run_experiment():
    """flip a fair coin 1000 times, True = heads, False = tails"""
    return [random.random() < 0.5 for _ in range(1000)]

def reject_fairness(experiment):
    """using the 5% significance levels"""
    num_heads = len([flip for flip in experiment if flip])
    return num_heads < 469 or num_heads > 531

print("P-hacking")
random.seed(0)
experiments = [run_experiment() for _ in range(1000)]
num_rejections = len([experiment
                      for experiment in experiments
                      if reject_fairness(experiment)])
print(num_rejections) # 46
P-hacking 46
What this means is that if you set out to find "significant" results, you usually can: test enough hypotheses against your data, and one of them will almost certainly appear significant.
A/B testing is another common task. Suppose we are trying to choose between advertisement A (“tastes great!”) and advertisement B (“less bias!”).
If 990 out of 1,000 A-viewers click their ad while only 10 out of 1,000 B-viewers click their ad, you can be pretty confident that A is the better ad.
But what if the differences are not so stark?
Let’s say that $N_A$ people see ad A, and that $n_A$ of them click it. Each ad view is a Bernoulli trial where $p_A$ is the probability that someone clicks ad A, so $\frac{n_A}{N_A}$ is approximately a normal random variable with mean $p_A$ and standard deviation
$$\sigma_A = \sqrt{\frac{p_A(1-p_A)}{N_A}}$$
Similarly, $\frac{n_B}{N_B}$ is approximately a normal random variable with mean $p_B$ and standard deviation
$$\sigma_B = \sqrt{\frac{p_B(1-p_B)}{N_B}}$$
# running an A/B test
def estimated_parameters(N, n):
    p = n / N
    sigma = math.sqrt(p * (1 - p) / N)
    return p, sigma
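For example (a quick check), ad A's 200 clicks out of 1,000 views give:
print(estimated_parameters(1000, 200))   # (0.2, about 0.0126)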
If we assume those two normals are independent (which seems reasonable, since the individual Bernoulli trials ought to be), then their difference should also be normal, with mean $p_B - p_A$ and standard deviation $\sqrt{\sigma_A^2 + \sigma_B^2}$. This means we can test the null hypothesis that $p_A$ and $p_B$ are the same (that is, that $p_A - p_B = 0$) using the statistic:
def a_b_test_statistic(N_A, n_A, N_B, n_B):
    p_A, sigma_A = estimated_parameters(N_A, n_A)
    p_B, sigma_B = estimated_parameters(N_B, n_B)
    return (p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2)
# which should approximately be a standard normal
For example, if “tastes great” gets 200 clicks out of 1,000 views and “less bias” gets 180 clicks out of 1,000 views, the statistic equals:
print("A/B testing")
z = a_b_test_statistic(1000, 200, 1000, 180)
print("a_b_test_statistic(1000, 200, 1000, 180)", z)
print("p-value", two_sided_p_value(z))
print('which is large enough that you can’t conclude there’s much of a difference. ')
A/B testing a_b_test_statistic(1000, 200, 1000, 180) -1.1403464899034472 p-value 0.254141976542236 which is large enough that you can’t conclude there’s much of a difference.
On the other hand, if “less bias” only got 150 clicks, we’d have:
z = a_b_test_statistic(1000, 200, 1000, 150)
print("a_b_test_statistic(1000, 200, 1000, 150)", z)
print("p-value", two_sided_p_value(z))
a_b_test_statistic(1000, 200, 1000, 150) -2.948839123097944 p-value 0.003189699706216853
This means that if the two ads were equally effective, we would see such a large difference only about 0.3% of the time, so here we would reject the null hypothesis.
An alternative approach to inference involves treating the unknown parameters themselves as random variables. We start with a prior distribution for the parameters, and then use the observed data and Bayes's theorem to compute an updated posterior distribution for the parameters. For example, when the unknown parameter is a probability (as in our coin-flipping example), we often use a prior from the Beta distribution, which puts all its probability between 0 and 1:
def B(alpha, beta):
    """a normalizing constant so that the total probability is 1"""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

def beta_pdf(x, alpha, beta):
    if x < 0 or x > 1:   # no weight outside of [0, 1]
        return 0
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / B(alpha, beta)
Generally speaking, this distribution centers its weight at alpha / (alpha + beta), and the larger alpha and beta are, the “tighter” the distribution is.
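As a quick numeric check of both claims (a sketch using a crude Riemann sum):
dx = 0.001
xs = [i * dx for i in range(1001)]
print(sum(beta_pdf(xi, 4, 16) * dx for xi in xs))        # about 1: it's a density
print(sum(xi * beta_pdf(xi, 4, 16) * dx for xi in xs))   # about 0.2 = 4 / (4 + 16)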
x = [i / 100 for i in range(101)]
y1 = [beta_pdf(i, 1, 1) for i in x]
y2 = [beta_pdf(i, 10, 10) for i in x]
y3 = [beta_pdf(i, 4, 16) for i in x]
y4 = [beta_pdf(i, 16, 4) for i in x]

import matplotlib.pyplot as plt
plt.plot(x, y1, label='Beta(1, 1)')
plt.plot(x, y2, '--', label='Beta(10, 10)')
plt.plot(x, y3, '-*', label='Beta(4, 16)')
plt.plot(x, y4, '_-', label='Beta(16, 4)')
plt.legend(loc=0, numpoints=1, fontsize=10)
plt.ylim(0, 6)
plt.show()
So let's say we assume a prior distribution on p, say Beta(alpha, beta). Then we flip our coin a bunch of times and see h heads and t tails. Bayes's theorem tells us that the posterior distribution for p is again a Beta distribution, with parameters alpha + h and beta + t.
It is no coincidence that the posterior is again a Beta distribution: Beta is the conjugate prior to the Binomial distribution. Whenever you update a Beta prior using observations from the corresponding binomial, you will get back a Beta posterior.
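A tiny sketch of that update rule (beta_posterior is a hypothetical helper name, not from the original code):
def beta_posterior(alpha, beta, h, t):
    """conjugate update: Beta(alpha, beta) prior plus h heads and t tails"""
    return alpha + h, beta + t

print(beta_posterior(1, 1, 3, 7))   # (4, 8), as plotted below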
Let’s say you flip the coin 10 times and see only 3 heads. If you started from the uniform prior Beta(1, 1) (i.e., you refused to take a stand on the coin’s fairness), your posterior would be Beta(4, 8):
y5 = [beta_pdf(i, 4, 8) for i in x]
plt.plot(x, y5, '--', label = 'Beta(4, 8)')
plt.legend(loc=0,numpoints=1,fontsize=10)
plt.ylim(0, 6)
plt.show()
If instead you started with Beta(20, 20), expressing a prior belief that the coin was roughly fair, your posterior would be Beta(23, 27):
y5 = [beta_pdf(i, 4, 8) for i in x]
y6 = [beta_pdf(i, 23, 27) for i in x]
plt.plot(x, y5, '--', label = 'Beta(4, 8)')
plt.plot(x, y6, '-*', label = 'Beta(23, 27)')
plt.legend(loc=0,numpoints=1,fontsize=10)
plt.ylim(0, 6)
plt.show()
And if you started with Beta(30, 10), expressing a prior belief that the coin flipped about 75% heads, your posterior would be Beta(33, 17):
y5 = [beta_pdf(i, 4, 8) for i in x]
y6 = [beta_pdf(i, 23, 27) for i in x]
y7 = [beta_pdf(i, 33, 17) for i in x]
plt.plot(x, y5, '--', label = 'Beta(4, 8)')
plt.plot(x, y6, '-*', label = 'Beta(23, 27)')
plt.plot(x, y7, '_-', label = 'Beta(33, 17)')
plt.legend(loc=0,numpoints=1,fontsize=10)
plt.ylim(0, 6)
plt.show()
What’s interesting is that this allows us to make probability statements about hypotheses:
“Based on the prior and the observed data, there is only a 5% likelihood the coin’s heads probability is between 49% and 51%.”
This is philosophically very different from a statement like
“if the coin were fair we would expect to observe data so extreme only 5% of the time.”
If you flipped the coin more and more times, the prior would matter less and less until eventually you’d have (nearly) the same posterior distribution no matter which prior you started with.
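As a small numeric illustration (a sketch; the flip counts are invented for illustration), the posterior means under quite different priors nearly agree after 1,000 flips:
h, t = 550, 450   # hypothetical: 550 heads in 1000 flips
for alpha, beta in [(1, 1), (20, 20), (30, 10)]:
    a, b = alpha + h, beta + t
    print((alpha, beta), "->", a / (a + b))   # all within about 0.01 of 0.55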
Using Bayesian inference to test hypotheses is considered somewhat controversial, in part because of the subjective nature of choosing a prior.