Statistics example problems¶

In [15]:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd
import pingouin as pg

1. ANOVA geological example¶

a. Implementation¶

Data come from Table 10.1 of McKillup and Dyar, Geostatistics Explained, Cambridge University Press, 2010 (excerpt available on class Google Drive). Values represent the weight percent of MgO present in tourmalines from three locations in Maine.

Use two different methods to test whether there is a significant difference in the mean MgO content between the three different sites.

Method 1: Scipy¶

In [16]:

df = pd.read_csv('data/MgO_example/MgO_Maine.csv') # dataframe

In [17]:

df

Out[17]:

	Mount Mica	Sebago Batholith	Black Mountain
0	7	4	1
1	8	5	2
2	10	7	4
3	11	8	5

In [18]:

stats.f_oneway(df['Mount Mica'],df['Sebago Batholith'],df['Black Mountain'])

Out[18]:

F_onewayResult(statistic=10.799999999999999, pvalue=0.004058306777237465)

Method 2: Pingouin¶

In [19]:

df2 = pd.read_csv('data/MgO_example/MgO_Maine_list.csv')

In [20]:

df2

Out[20]:

	MgO	Location
0	7	Mount Mica
1	8	Mount Mica
2	10	Mount Mica
3	11	Mount Mica
4	4	Sebago Batholith
5	5	Sebago Batholith
6	7	Sebago Batholith
7	8	Sebago Batholith
8	1	Black Mountain
9	2	Black Mountain
10	4	Black Mountain
11	5	Black Mountain

In [21]:

pg.anova(data=df2,dv='MgO',between='Location')

Out[21]:

	Source	ddof1	ddof2	F	p-unc	np2
0	Location	2	9	10.8	0.004058	0.705882

Post-hoc test¶

In [22]:

pg.pairwise_tukey(data=df2,dv='MgO',between='Location')

Out[22]:

	A	B	mean(A)	mean(B)	diff	se	T	p-tukey	hedges
0	Black Mountain	Mount Mica	3	9	-6	1.290994	-4.64758	0.001000	-2.857683
1	Black Mountain	Sebago Batholith	3	6	-3	1.290994	-2.32379	0.061925	-1.428841
2	Mount Mica	Sebago Batholith	9	6	3	1.290994	2.32379	0.061925	1.428841

b. ANOVA interpretation¶

Write a summary of your interpretation of the statistical results conducted above. Address the following questions.

What is the null hypothesis being tested?
Should the null hypothesis be accepted or rejected?
What does the post-hoc test tell you?

2. Non-parametric tests¶

a. Wilcoxon signed-rank test: implementation¶

This example uses data from: http://www.biostathandbook.com/wilcoxonsignedrank.html

The data are observations of aluminum content in 13 different poplar clones in a polluted area. The scientific question is whether there is a significant change in the aluminum content between August and November.

In [24]:

df_al = pd.read_csv('data/wilcoxon_example/Al_content.csv',
                   delimiter='\t')

In [25]:

df_al

Out[25]:

	Clone	August	November
0	Columbia River	18.3	12.7
1	Fritzi Pauley	13.3	11.1
2	Hazendans	16.5	15.3
3	Primo	12.6	12.7
4	Raspalje	9.5	10.5
5	Hoogvorst	13.6	15.6
6	Balsam Spire	8.1	11.2
7	Gibecq	8.9	14.2
8	Beaupre	10.0	16.3
9	Unal	8.3	15.5
10	Trichobel	7.9	19.9
11	Gaver	8.1	20.4
12	Wolterson	13.4	36.8

In [28]:

plt.figure()
plt.hist(df_al['November']);

In [29]:

stats.skewtest(df_al['November'])

Out[29]:

SkewtestResult(statistic=3.449022139607473, pvalue=0.0005626205706886182)

In [30]:

stats.normaltest(df_al['November'])

Out[30]:

NormaltestResult(statistic=21.55304457655946, pvalue=2.0884103462437462e-05)

In [31]:

stats.wilcoxon(df_al['August'],df_al['November'])

Out[31]:

WilcoxonResult(statistic=16.0, pvalue=0.039794921875)

b. Interpretation¶

Under what situations are non-parametric statistics useful? What are the potential drawbacks in using non-parametric statistics when a parametric approach is justified?