import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd
import pingouin as pg
Data come from Table 10.1 of McKillup and Dyar, Geostatistics Explained, Cambridge University Press, 2010 (excerpt available on class Google Drive). Values represent the weight percent of MgO present in tourmalines from three locations in Maine.
Use two different methods to test whether there is a significant difference in the mean MgO content between the three different sites.
df = pd.read_csv('data/MgO_example/MgO_Maine.csv') # dataframe
df
Mount Mica | Sebago Batholith | Black Mountain | |
---|---|---|---|
0 | 7 | 4 | 1 |
1 | 8 | 5 | 2 |
2 | 10 | 7 | 4 |
3 | 11 | 8 | 5 |
stats.f_oneway(df['Mount Mica'],df['Sebago Batholith'],df['Black Mountain'])
F_onewayResult(statistic=10.799999999999999, pvalue=0.004058306777237465)
df2 = pd.read_csv('data/MgO_example/MgO_Maine_list.csv')
df2
MgO | Location | |
---|---|---|
0 | 7 | Mount Mica |
1 | 8 | Mount Mica |
2 | 10 | Mount Mica |
3 | 11 | Mount Mica |
4 | 4 | Sebago Batholith |
5 | 5 | Sebago Batholith |
6 | 7 | Sebago Batholith |
7 | 8 | Sebago Batholith |
8 | 1 | Black Mountain |
9 | 2 | Black Mountain |
10 | 4 | Black Mountain |
11 | 5 | Black Mountain |
pg.anova(data=df2,dv='MgO',between='Location')
Source | ddof1 | ddof2 | F | p-unc | np2 | |
---|---|---|---|---|---|---|
0 | Location | 2 | 9 | 10.8 | 0.004058 | 0.705882 |
pg.pairwise_tukey(data=df2,dv='MgO',between='Location')
A | B | mean(A) | mean(B) | diff | se | T | p-tukey | hedges | |
---|---|---|---|---|---|---|---|---|---|
0 | Black Mountain | Mount Mica | 3 | 9 | -6 | 1.290994 | -4.64758 | 0.001000 | -2.857683 |
1 | Black Mountain | Sebago Batholith | 3 | 6 | -3 | 1.290994 | -2.32379 | 0.061925 | -1.428841 |
2 | Mount Mica | Sebago Batholith | 9 | 6 | 3 | 1.290994 | 2.32379 | 0.061925 | 1.428841 |
Write a summary of your interpretation of the statistical results conducted above. Address the following questions.
This example uses data from: http://www.biostathandbook.com/wilcoxonsignedrank.html
The data are observations of aluminum content in 13 different poplar clones in a polluted area. The scientific question is whether there is a significant change in the aluminum content between August and November.
df_al = pd.read_csv('data/wilcoxon_example/Al_content.csv',
delimiter='\t')
df_al
Clone | August | November | |
---|---|---|---|
0 | Columbia River | 18.3 | 12.7 |
1 | Fritzi Pauley | 13.3 | 11.1 |
2 | Hazendans | 16.5 | 15.3 |
3 | Primo | 12.6 | 12.7 |
4 | Raspalje | 9.5 | 10.5 |
5 | Hoogvorst | 13.6 | 15.6 |
6 | Balsam Spire | 8.1 | 11.2 |
7 | Gibecq | 8.9 | 14.2 |
8 | Beaupre | 10.0 | 16.3 |
9 | Unal | 8.3 | 15.5 |
10 | Trichobel | 7.9 | 19.9 |
11 | Gaver | 8.1 | 20.4 |
12 | Wolterson | 13.4 | 36.8 |
plt.figure()
plt.hist(df_al['November']);
stats.skewtest(df_al['November'])
SkewtestResult(statistic=3.449022139607473, pvalue=0.0005626205706886182)
stats.normaltest(df_al['November'])
NormaltestResult(statistic=21.55304457655946, pvalue=2.0884103462437462e-05)
stats.wilcoxon(df_al['August'],df_al['November'])
WilcoxonResult(statistic=16.0, pvalue=0.039794921875)
Under what situations are non-parametric statistics useful? What are the potential drawbacks in using non-parametric statistics when a parametric approach is justified?