%matplotlib inline
import pylab as pl
import warnings
warnings.filterwarnings(action="ignore", category=DeprecationWarning)
pl.xkcd()
import numpy as np
from scipy import stats
## data: heights sampled from two towns, used to compare the town averages
town1_hts = [5, 6, 7, 6, 7.1, 6, 4]
town2_hts = [5.5, 6.5, 7, 6, 7.1, 6]
## plotting - use boxplot to show the difference between two samples
fig, ax = pl.subplots(1, 1)
_ = ax.boxplot([town1_hts, town2_hts])
_ = ax.set_xticklabels(["town1", "town2"])
_ = ax.set_ylim((3.5, 7.5))
**Observations:**
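To put numbers next to the boxplot, here is a minimal sketch that summarizes each sample. The helper name `summarize` is hypothetical (not part of the original notebook), and the data lists are repeated so the snippet runs on its own.

```python
import numpy as np

town1_hts = [5, 6, 7, 6, 7.1, 6, 4]
town2_hts = [5.5, 6.5, 7, 6, 7.1, 6]


def summarize(sample):
    """Return size, mean, median and sample standard deviation (ddof=1)."""
    arr = np.asarray(sample, dtype=float)
    return {
        "n": int(arr.size),
        "mean": float(arr.mean()),
        "median": float(np.median(arr)),
        "std": float(arr.std(ddof=1)),
    }


for name, sample in [("town1", town1_hts), ("town2", town2_hts)]:
    s = summarize(sample)
    print("{}: n={n}, mean={mean:.3f}, median={median:.3f}, std={std:.3f}".format(name, **s))
```

On these data the sample means come out near 5.87 (town1) and 6.35 (town2), and town1 shows the larger spread. Whether that difference is more than noise is what the test in the next step addresses.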
**Run a statistical test**
## scipy implements both the paired and the unpaired t-test
## scipy.stats.ttest_rel implements paired test
## scipy.stats.ttest_ind implements unpaired test
%pdef stats.ttest_rel
%pdef stats.ttest_ind
## equal_var=False selects Welch's t-test (no equal-variance assumption)
print("Welch's T-test p-value %s" % stats.ttest_ind(town1_hts, town2_hts, equal_var=False)[1])
stats.ttest_rel(a, b, axis=0)
stats.ttest_ind(a, b, axis=0, equal_var=True)
Welch's T-test p-value 0.347028503558
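To make explicit what `equal_var=False` computes, the following sketch works out Welch's statistic by hand and compares it with SciPy. The function name `welch_t_test` is hypothetical. It uses the standard formulas t = (m1 - m2) / sqrt(s1²/n1 + s2²/n2) and the Welch-Satterthwaite degrees of freedom.

```python
import numpy as np
from scipy import stats

town1_hts = [5, 6, 7, 6, 7.1, 6, 4]
town2_hts = [5.5, 6.5, 7, 6, 7.1, 6]


def welch_t_test(a, b):
    """Two-sided Welch t-test computed from first principles."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error contributed by each sample
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Two-sided p-value from the Student t survival function
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    return t_stat, df, p_value


t_manual, df_manual, p_manual = welch_t_test(town1_hts, town2_hts)
t_scipy, p_scipy = stats.ttest_ind(town1_hts, town2_hts, equal_var=False)
print("manual: t=%.4f, df=%.2f, p=%.6f" % (t_manual, df_manual, p_manual))
print("scipy:  t=%.4f, p=%.6f" % (t_scipy, p_scipy))
```

The two p-values should agree to within floating-point error. With a p-value of about 0.35, the data give no evidence at the usual 0.05 level that the mean heights differ.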
Check the normality assumption of the t-test
%pdef stats.shapiro
print('Town1 Shapiro-Wilk p-value %s' % stats.shapiro(town1_hts)[1])
print('Town2 Shapiro-Wilk p-value %s' % stats.shapiro(town2_hts)[1])
stats.shapiro(x, a=None, reta=False)
Town1 Shapiro-Wilk p-value 0.380458295345
Town2 Shapiro-Wilk p-value 0.562481462955
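A small sketch of how the Shapiro-Wilk p-values might be turned into a decision. The helper `normality_not_rejected` and the threshold `alpha=0.05` are assumptions for illustration, not part of the original notebook. The null hypothesis of the test is that the sample comes from a normal distribution, so a large p-value means normality is not rejected (which is weaker than confirming it, especially with samples this small).

```python
from scipy import stats

town1_hts = [5, 6, 7, 6, 7.1, 6, 4]
town2_hts = [5.5, 6.5, 7, 6, 7.1, 6]


def normality_not_rejected(sample, alpha=0.05):
    """Return True if Shapiro-Wilk does not reject normality at level alpha."""
    _, p_value = stats.shapiro(sample)
    return p_value > alpha


for name, sample in [("town1", town1_hts), ("town2", town2_hts)]:
    verdict = "not rejected" if normality_not_rejected(sample) else "rejected"
    print("%s: normality %s at alpha=0.05" % (name, verdict))
```

Both samples pass this check here, so using the t-test above is defensible under the assumed threshold.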
But what if the data sets are NOT normally distributed? Use a non-parametric test such as Mann-Whitney U.
%pdef stats.mannwhitneyu
print('Mann-Whitney U p-value %s' % stats.mannwhitneyu(town1_hts, town2_hts)[1])
stats.mannwhitneyu(x, y, use_continuity=True)
Mann-Whitney U p-value 0.253597522173
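The signature printed above has no `alternative` argument, and older SciPy releases reported a one-sided p-value from this function, while later releases accept `alternative` and changed the default. The sketch below assumes a SciPy version that supports the `alternative` keyword and states the hypothesis explicitly, so the result does not depend on the installed default. The helper name `mann_whitney_p_values` is hypothetical.

```python
from scipy import stats

town1_hts = [5, 6, 7, 6, 7.1, 6, 4]
town2_hts = [5.5, 6.5, 7, 6, 7.1, 6]


def mann_whitney_p_values(x, y):
    """Return (two-sided, one-sided 'less') Mann-Whitney U p-values."""
    # Two-sided: the distributions differ in either direction
    p_two_sided = stats.mannwhitneyu(x, y, alternative="two-sided")[1]
    # One-sided: values in x tend to be smaller than values in y
    p_less = stats.mannwhitneyu(x, y, alternative="less")[1]
    return p_two_sided, p_less


p_two_sided, p_less = mann_whitney_p_values(town1_hts, town2_hts)
print("two-sided p-value: %.4f" % p_two_sided)
print("one-sided (town1 < town2) p-value: %.4f" % p_less)
```

For a comparison with the two-sided Welch result above, the two-sided value is the appropriate one to report.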