%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')
from thinkstats2 import Pmf, Cdf
import thinkstats2
import thinkplot
decorate = thinkplot.config
Suppose you buy a loaf of bread every day for a year, take it home, and weigh it. You suspect that the distribution of weights is more skewed than a normal distribution with the same mean and standard deviation.
To test your suspicion, write a definition for a class named SkewTest that extends thinkstats2.HypothesisTest and provides two methods:
TestStatistic should compute the skew of a given sample.
RunModel should simulate the null hypothesis and return simulated data.
class HypothesisTest(object):
    """Represents a hypothesis test."""

    def __init__(self, data):
        """Initializes.

        data: data in whatever form is relevant
        """
        self.data = data
        self.MakeModel()
        self.actual = self.TestStatistic(data)
        self.test_stats = None

    def PValue(self, iters=1000):
        """Computes the distribution of the test statistic and p-value.

        iters: number of iterations

        returns: float p-value
        """
        # Simulate the null hypothesis `iters` times and record each statistic.
        self.test_stats = np.array([self.TestStatistic(self.RunModel())
                                    for _ in range(iters)])

        # The p-value is the fraction of simulated statistics that are
        # at least as large as the observed one.
        count = sum(self.test_stats >= self.actual)
        return count / iters

    def MaxTestStat(self):
        """Returns the largest test statistic seen during simulations.
        """
        return np.max(self.test_stats)

    def PlotHist(self, label=None):
        """Draws a histogram of the simulated test statistics with a
        vertical line at the observed test stat.
        """
        plt.hist(self.test_stats, color='C4', alpha=0.5)
        plt.axvline(self.actual, linewidth=3, color='0.8')
        plt.xlabel('Test statistic')
        plt.ylabel('Count')
        plt.title('Distribution of the test statistic under the null hypothesis')

    def TestStatistic(self, data):
        """Computes the test statistic.

        data: data in whatever form is relevant
        """
        # Subclasses must override this method.
        raise NotImplementedError()

    def MakeModel(self):
        """Build a model of the null hypothesis.
        """
        pass

    def RunModel(self):
        """Run the model of the null hypothesis.

        returns: simulated data
        """
        # Subclasses must override this method.
        raise NotImplementedError()
# Solution goes here
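One possible way to approach the exercise is sketched below. It is not the book's official solution. To stay self-contained it uses only the standard library and a stripped-down stand-in base class, MinimalHypothesisTest, which is a hypothetical name introduced here; in the notebook you would subclass the HypothesisTest defined above instead. Two further assumptions: the test statistic is the absolute value of the moment-based skewness, so left and right skew count alike, and the null model is a Gaussian fitted with the sample mean and the population (divide-by-n) standard deviation.

```python
import math
import random


class MinimalHypothesisTest:
    """Hypothetical stand-in for the notebook's HypothesisTest (assumption)."""

    def __init__(self, data):
        self.data = data
        self.MakeModel()
        self.actual = self.TestStatistic(data)
        self.test_stats = None

    def PValue(self, iters=1000):
        # Fraction of simulated statistics at least as large as the observed one.
        self.test_stats = [self.TestStatistic(self.RunModel())
                           for _ in range(iters)]
        count = sum(1 for t in self.test_stats if t >= self.actual)
        return count / iters

    def MakeModel(self):
        pass


class SkewTest(MinimalHypothesisTest):
    """Tests whether a sample is more skewed than a matching Gaussian."""

    def MakeModel(self):
        # Null model: a Gaussian with the sample's mean and standard deviation.
        xs = list(self.data)
        self.n = len(xs)
        self.mu = sum(xs) / self.n
        self.sigma = math.sqrt(sum((x - self.mu) ** 2 for x in xs) / self.n)

    def TestStatistic(self, data):
        # Absolute moment-based skewness |m3 / m2**1.5| (assumed choice of
        # statistic; a signed skewness would give a one-sided test instead).
        xs = list(data)
        n = len(xs)
        mean = sum(xs) / n
        m2 = sum((x - mean) ** 2 for x in xs) / n
        m3 = sum((x - mean) ** 3 for x in xs) / n
        return abs(m3 / m2 ** 1.5)

    def RunModel(self):
        # Draw a fresh sample of the same size from the null model.
        return [random.gauss(self.mu, self.sigma) for _ in range(self.n)]


# Usage sketch: a Gaussian sample, so the null hypothesis is true.
sample = [random.gauss(1000, 35) for _ in range(365)]
sketch_test = SkewTest(sample)
print(sketch_test.actual, sketch_test.PValue(iters=200))
```

Fitting the null model in MakeModel means the parameters are computed once, and every call to RunModel only has to draw a new sample of the same size. The sketch does not guard against a constant sample, for which the variance is zero and the skewness is undefined.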
To test this class, I'll generate a sample from an actual Gaussian distribution, so the null hypothesis is true.
mu = 1000
sigma = 35
data = np.random.normal(mu, sigma, size=365)
Now we can make a SkewTest and compute the observed skewness.
test = SkewTest(data)
test.actual
Here's the p-value.
test = SkewTest(data)
test.PValue()
And the distribution of the test statistic under the null hypothesis.
test.PlotHist()
Most of the time the p-value exceeds 5%, so we would conclude that the observed skewness could plausibly be due to random sampling.
But let's see how often we get a false positive.
iters = 100
count = 0
for i in range(iters):
    data = np.random.normal(mu, sigma, size=365)
    test = SkewTest(data)
    p_value = test.PValue()
    if p_value < 0.05:
        count += 1
print(count/iters)
In the long run, the false positive rate is the threshold we used, 5%.
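With only 100 simulated experiments, the printed rate will usually not be exactly 5%. As a rough check, the sketch below treats each experiment as an independent trial that comes out positive with probability 0.05 and computes the standard error of the estimated rate, which works out to about 0.02. The variable names are illustrative, and the calculation ignores the extra noise from estimating each p-value by simulation.

```python
import math

alpha = 0.05   # significance threshold used in the loop above
iters = 100    # number of simulated experiments in the loop above

# Standard error of a proportion estimated from `iters` independent trials.
std_err = math.sqrt(alpha * (1 - alpha) / iters)

# Rough band of two standard errors around the nominal rate.
low = max(0.0, alpha - 2 * std_err)
high = alpha + 2 * std_err

print(f"standard error: {std_err:.3f}")
print(f"plausible range for the estimate: {low:.3f} to {high:.3f}")
```

Under these assumptions, an observed rate anywhere from roughly 1% to 9% is consistent with a true false positive rate of 5%. Increasing iters narrows the band in proportion to the square root of the number of experiments.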