The goal of this lesson is to introduce Python libraries, load and look at a dataset, and begin some common statistical testing. It includes:
Most of the common statistical models (t-test, correlation, ANOVA, chi-square, etc.) are special cases of linear models, or a very close approximation. This beautiful simplicity means that there is less to learn. In particular, it all comes down to $y = a \cdot x + b$, where $a$ is the slope of the line and $b$ is the y-intercept where the line crosses the y-axis.
There are certain assumptions we check to use these "parametric tests." When assumptions are not met, we have alternative "nonparametric" counterparts. We will think of "nonparametric" tests as ranked versions of the corresponding parametric tests.
This lesson is adapted from Tests as Linear, which is also available in R.
See the Cheat Sheet
In part 2, we will:
Think of these as useful, powerful toolboxes we are opening up on our workbench. Many additional libraries are available from the Python Package Index.
We import some key scientific libraries, sometimes using a shorter alias for ease of coding, with the syntax import library_name as alias_name.
# Python libraries for data structures, arrays, and math
import numpy as np
import pandas as pd
# Libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Importing the statistics module
import statistics
# Library for hypothesis testing
import scipy
from scipy import stats
# Libraries for regression modeling
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Library for latex
from IPython.display import Latex, display, Markdown
Reading the file loads a copy of the data into our environment. Unlike working with a spreadsheet, this does not affect the original file.
Our data: This synthetic dataset contains information on newborn babies and their parents. It comes from here.
Read a Comma Separated Values (CSV) data file with pd.read_csv().
Need to read in a different file type?
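pandas has a whole family of read_* functions for other formats. As a quick sketch (the file names in the comments are hypothetical; only the in-memory example below actually runs), note that read_csv also accepts any file-like object, not just a path or URL:

```python
import io
import pandas as pd

# Other readers in the same family (hypothetical file names):
#   pd.read_excel("data.xlsx")            # Excel (requires openpyxl)
#   pd.read_json("data.json")             # JSON
#   pd.read_table("data.tsv", sep="\t")   # tab-separated text

# read_csv works on any file-like object, e.g. an in-memory string
csv_text = io.StringIO("ID,Birthweight\n1,3.2\n2,2.9")
small = pd.read_csv(csv_text)
print(small.shape)  # (2, 2)
```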
# url is the name and path of the data file
url = "https://raw.githubusercontent.com/DeisData/datasets/main/Birthweight_reduced_kg_R.csv"
# data is the name of the dataFrame we are storing our data in
# pd is pandas and read_csv is a tool in pandas for reading in a csv file
data = pd.read_csv(url)
There are lots of functions and methods we can apply to the dataframe to start inspecting it.
# Method to view head of dataset
data.head()
This dataset contains information on newborn babies and their parents. It contains mostly continuous variables (although some have only a few values, e.g. number of cigarettes smoked per day) and is most useful for correlation and regression.
Main dependent variable = Birthweight (kg)
| Name | Variable | Data type |
| --- | --- | --- |
| ID | Baby number | |
| Length | Length of baby (cm) | Scale |
| Birthweight | Weight of baby (kg) | Scale |
| headcirumference | Head circumference | Scale |
| Gestation | Gestation (weeks) | Scale |
| smoker | Mother smokes: 1 = smoker, 0 = nonsmoker | Binary |
| motherage | Maternal age | Scale |
| mnocig | Number of cigarettes smoked per day by mother | Scale |
| mheight | Mother's height (cm) | Scale |
| mppwt | Mother's pre-pregnancy weight (kg) | Scale |
| fage | Father's age | Scale |
| fedyrs | Father's years in education | Scale |
| fnocig | Number of cigarettes smoked per day by father | Scale |
| fheight | Father's height (cm) | Scale |
| lowbwt | Low birth weight, 0 = No and 1 = yes | Binary |
| mage35 | Mother over 35, 0 = No and 1 = yes | Binary |
Your turn to explore! Uncomment and try more Pandas methods we can apply to our dataframe:
# TRY THESE!
# Method to see the tail end of the data
# data.tail()
# Use the DataFrame.info() method to find out more about a dataframe.
# data.info()
# Get Dimensions (rows, columns)
# data.shape
# Get data types
# data.dtypes
# The DataFrame.columns variable stores information about the dataframe’s columns.
# data.columns
# Use DataFrame.describe() to get summary statistics about data.
# data.describe()
When we are doing statistical testing, we form a null hypothesis that states that there is no relationship, either between the two variables being studied (one variable does not affect the other), or between a variable and some constant number like a population mean.
The $p$-value, or probability value, returned with your statistical test is a number describing how likely it is that your data would have occurred by random chance (i.e. that the null hypothesis is true). It is a number between 0 and 1. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis. A common threshold of significance is $p < 0.05$.
Image from "What a p-value tells you about statistical significance"
There is a lot of conversation around use and misuse of $p$-values in scientific analysis. Here are a few articles you may find interesting:
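As a small sketch of the idea (using synthetic data rather than the lesson's dataset), we can draw a sample whose true mean differs from the null value and look at the p-value the test returns:

```python
import numpy as np
from scipy import stats

# Illustrative synthetic data: seeded so the sketch is reproducible
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.5, scale=1.0, size=50)  # true mean is 0.5, not 0

# Null hypothesis: the population mean is 0
t, p = stats.ttest_1samp(sample, popmean=0)
print(t, p)  # compare p against a chosen threshold such as 0.05
```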
Model: the recipe for $y$ is a slope ($\beta_1$) times $x$ plus a y-intercept ($\beta_0$); in other words, a straight line.
$y = \beta_0 + \beta_1 x \qquad \mathcal{H}_0: \beta_1 = 0$
... which is a mathy way of writing the good old $y = ax + b$ (here ordered as $y = b + ax$). The task of linear models is simply to find the numbers that best predict $y$.
Either way you write it, it's an intercept ($\beta_0$) and a slope ($\beta_1$) yielding a straight line.
With our data, we can use the seaborn regression plot tool sns.regplot to quickly visualize a linear model fit between mother's height mheight and baby length Length when asking the question "Can mother's height predict baby length?"
sns.regplot(x='mheight',y='Length',data=data)
Practice for you. Use the seaborn regression tool to visualize a simple linear regression to consider the question: "Can gestational age (Gestation) predict baby weight (Birthweight)?"
# Your code goes here!:
#sns.regplot()
This is often simply called a regression model, which can be extended to multiple regression, where there are several $\beta$s multiplied with the predictors on the right-hand side.
Everything below, from the one-sample t-test to the two-way ANOVA, is just a special case of this system. Nothing more, nothing less.
We use scipy.stats.pearsonr to calculate a Pearson correlation coefficient and the p-value for testing non-correlation between two variables. Correlations of $-1$ or $+1$ imply an exact linear relationship, and $0$ implies no correlation.
# pearsonr returns the Pearson correlation coefficient and
# the 2-tailed p-value
stats.pearsonr(data.mheight, data.Length)
The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets.
Practice for you. Calculate the Pearson's R for the question: "Can gestational age (Gestation) predict baby weight (Birthweight)?"
# Your code goes here!:
As the name implies, the Spearman rank correlation is a Pearson correlation on rank-transformed $x$ and $y$:
$\text{rank}(y) = \beta_0 + \beta_1 \cdot \text{rank}(x) \qquad \mathcal{H}_0: \beta_1 = 0$
I'll introduce ranks in a minute. For now, notice that the correlation coefficient of the linear model is identical to a "real" Pearson correlation, but the p-values are an approximation which is appropriate for samples greater than N = 10 and almost perfect when N > 20.
Such a nice and non-mysterious equivalence!
The Spearman rank-order correlation coefficient is a nonparametric measure of the relationship between two variables. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed.
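We can check this equivalence directly: a Pearson correlation computed on rank-transformed data reproduces the Spearman coefficient. A minimal sketch with synthetic data (not the lesson's dataset):

```python
import numpy as np
from scipy import stats

# Illustrative synthetic data with a linear relationship plus noise
rng = np.random.default_rng(42)
x = rng.normal(size=30)
y = 2 * x + rng.normal(size=30)

# Spearman correlation is a Pearson correlation on the ranks
r_spearman, _ = stats.spearmanr(x, y)
r_pearson_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
print(np.isclose(r_spearman, r_pearson_on_ranks))  # True
```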
We will use scipy.stats.spearmanr to calculate the Spearman correlation coefficient with its associated p-value.
# spearmanr returns the rank correlation coefficient and
# the 2-tailed p-value
stats.spearmanr(data.mheight, data.Length)
Practice for you. Calculate the Spearman's R for the question: "Can gestational age (Gestation) predict baby weight (Birthweight)?"
# Your code goes here!:
scipy.stats.rankdata simply takes an array of numbers and "replaces" them with the integers of their rank (1st smallest, 2nd smallest, 3rd smallest, etc.). pd.DataFrame.rank performs a similar function, but with support for pandas DataFrames. So the result of the rank-transformation scipy.stats.rankdata([3.6, 3.4, -5.0, 8.2]) is [3, 2, 1, 4]. See that in the figure above?
A signed rank is the same, just where we rank according to absolute size first and then add in the sign second. So the signed rank here would be [2, 1, -3, 4]. Or in code:
def signed_rank(df):
    return np.sign(df) * df.abs().rank()
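A quick usage check of the two transformations on the example values above (the helper is repeated here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd
from scipy import stats

def signed_rank(df):
    # rank by absolute size, then restore the sign
    return np.sign(df) * df.abs().rank()

values = pd.Series([3.6, 3.4, -5.0, 8.2])
print(stats.rankdata(values))        # [3. 2. 1. 4.]
print(signed_rank(values).tolist())  # [2.0, 1.0, -3.0, 4.0]
```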
Ranks are all you need to do to convert most parametric tests into their "nonparametric" counterparts!
sns.pairplot() from seaborn combines joint and marginal views, but rather than focusing on a single relationship, it visualizes every pairwise combination of variables simultaneously:
# Uncomment and run to see a matrix of all relationships, color-coded by smoker / nonsmoker
# sns.pairplot(data=data, hue="smoker")
The t-test and Wilcoxon signed-rank test calculate probability for the mean of ONE group of scores. This is a test of the null hypothesis that the expected value (mean) of a sample of independent observations is equal to a given population mean, popmean.
t-test model: A single number predicts $y$.
$y = \beta_0 \qquad \mathcal{H}_0: \beta_0 = 0$
In other words, it's our good old $y = \beta_0 + \beta_1 \cdot x$ where the last term is gone since there is no $x$ (essentially $x=0$).
The same is, to a very close approximation, true for the Wilcoxon signed-rank test, just with the signed ranks of $y$ instead of $y$ itself.
$\text{signed\_rank}(y) = \beta_0$
We will use scipy.stats.ttest_1samp for the parametric comparison test, and scipy.stats.wilcoxon for the nonparametric comparison test.
Try running the Python code below and see that the linear model (smf.ols) produces the same $t$, $p$, and $r$ as scipy.stats.ttest_1samp.
First, we extract the subsets of data for smokers and nonsmokers.
# Subset the data into nonsmoker and smoker groups
data_nonsmokers = data[data.smoker==0]
data_smokers = data[data.smoker==1]
Next, we will calculate the average birthweight for each group.
avg_birthwgt_nonsmoker = statistics.mean(data_nonsmokers.Birthweight)
avg_birthwgt_smoker = statistics.mean(data_smokers.Birthweight)
print(avg_birthwgt_nonsmoker, avg_birthwgt_smoker)
Let's compare the data for smokers to the average birthweight from nonsmoking mothers.
t, p = stats.ttest_1samp(data_smokers.Birthweight, avg_birthwgt_nonsmoker)
print(t,p)
and the equivalent linear model with an intercept only is fit with smf.ols, the tool for ordinary least squares linear regression. If we want to make an equivalent comparison, we will shift the birthweight data by the average birthweight for nonsmokers, and then see what the linear model looks like!
# We will shift the birthweight data by the average for nonsmokers
# and store the result in a new column, Birthweight_adj
data["Birthweight_adj"] = data["Birthweight"] - avg_birthwgt_nonsmoker
# The new column was added to the full dataframe
data.head()
# Equivalent linear model: intercept-only
res = smf.ols(formula="Birthweight_adj ~ 1", data=data[data.smoker==1]).fit()
print(res.summary())
Notice that the t statistics and p values for the linear model and the t-test are the same!
We use scipy.stats.wilcoxon for the nonparametric comparison test and compare it to the $p$-value for a ranked-data linear model.
stats.wilcoxon(data[data.smoker==1].Birthweight_adj)
signed_rank_data = signed_rank(data[data.smoker==1])
res = smf.ols("Birthweight_adj ~ 1", data=signed_rank_data).fit()
print(res.summary())
A paired t-test or matched pairs test is used when we are interested in the difference between two variables for the same subject. Often the two variables might be measurements that are separated by time (e.g. cholesterol levels before and after a special diet).
t-test model: a single number (intercept) predicts the pairwise differences.
$y_2 - y_1 = \beta_0 \qquad \mathcal{H}_0: \beta_0 = 0$
This means that there is just one $y = y_2 - y_1$ to predict and it becomes a one-sample t-test on the pairwise differences. The idea is therefore also the same as for the one-sample t-test.
Similarly, the Wilcoxon matched pairs test only differs from the Wilcoxon signed-rank test in that it tests the signed ranks of the pairwise $y_2 - y_1$ differences.
$\text{signed\_rank}(y_2 - y_1) = \beta_0 \qquad \mathcal{H}_0: \beta_0 = 0$
Let's load some data with matched pairs from a study that tested whether cholesterol was reduced after using a certain brand of margarine as part of a low-fat, low-cholesterol diet.
The subjects consumed on average 2.31 g of the active ingredient, stanol ester, per day. This data set contains information on 18 people using margarine to reduce cholesterol over three time points.
url = "https://raw.githubusercontent.com/DeisData/datasets/main/Cholesterol_R.csv"
data_matched = pd.read_csv(url)
data_matched.head()
# Plot the data for each ID, Before and After 8 weeks
fig, ax = plt.subplots()
sns.scatterplot(x='ID',y='Before',data=data_matched, ax=ax)
sns.scatterplot(x='ID',y='After8weeks',data=data_matched, ax=ax)
ax.set_ylabel('Cholesterol Before and After 8 wks')
ax.set_xlabel('Patient ID')
t, p = scipy.stats.ttest_rel(data_matched.After8weeks, data_matched.Before)
print(t,p)
# Let's add a column to our dataFrame containing the difference in cholesterol measures
data_matched["cholesterol_diff"] = data_matched.After8weeks - data_matched.Before
data_matched.head()
# The ttest_rel is similar to calculating the difference in y values for each match, and
# fitting a linear model
res = smf.ols(formula="cholesterol_diff ~ 1", data=data_matched).fit()
print(res.summary())
You will notice the t statistics and p values are similar for the paired t-test and a linear fit of the difference in the matched data.
Again, we do the signed-ranks trick. This is still an approximation, but a close one:
stats.wilcoxon(data_matched.After8weeks, data_matched.Before)
Practice for you. Compare the matched data after 4 weeks and after 8 weeks. What are the results of the paired-sample t-test? Of the Wilcoxon matched pairs test?
# Your code goes here!:
ANOVAs are linear models with categorical predictors. The oneway ANOVA tests the null hypothesis that two or more groups have the same population mean. The test is applied to samples from two or more groups, possibly with differing sizes.
Model: One mean for each group predicts $y$.
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 +... \qquad \mathcal{H}_0: y = \beta_0$
where $x_i$ are indicators ($x=0$ or $x=1$) where at most one $x_i=1$ while all others are $x_i=0$.
Notice how this is just "more of the same" of what we already did in other models above. When there are only two groups, this model is $y = \beta_0 + \beta_1 \cdot x$, i.e. the independent t-test. If there is only one group, it is $y = \beta_0$, i.e. the one-sample t-test.
Since we now regress on more than one $x$, the oneway ANOVA is a multiple regression model.
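We can verify this equivalence on synthetic data (illustrative, not the lesson's dataset): the overall F-test of a regression on a categorical predictor matches scipy's dedicated one-way ANOVA.

```python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

# Illustrative synthetic data: three groups with different means
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 20),
    "y": np.concatenate([rng.normal(m, 1, 20) for m in (0, 0.5, 1.0)]),
})

# Dedicated one-way ANOVA
f, p = stats.f_oneway(*(df[df.group == g].y for g in ("a", "b", "c")))

# The same test as a linear model: overall F-test of y ~ C(group)
res = smf.ols("y ~ C(group)", data=df).fit()
print(np.isclose(f, res.fvalue), np.isclose(p, res.f_pvalue))  # True True
```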
The Kruskal-Wallis test is simply a one-way ANOVA on the rank-transformed $y$:
$\text{rank}(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 +...$
This approximation is good enough for 12 or more data points.
We make a two-level factor with the Margarine levels A and B, so that the one-way ANOVA basically becomes a "two-sample t-test". We will test the null hypothesis that the two groups have the same population mean of cholesterol level after 8 weeks.
data_matched.head()
The ANOVA test has important assumptions that must be satisfied in order for the associated p-value to be valid. We will see how to test these in 6. Checking Assumptions.
- The samples are independent.
- Each sample is from a normally distributed population. (KS Test)
- The population standard deviations of the groups are all equal. This property is known as homoscedasticity. (Bartlett Test)
Let's see scipy's dedicated ANOVA function scipy.stats.f_oneway.
# Extract subsets of data
data_matched_A = data_matched[data_matched.Margarine=='A']
data_matched_B = data_matched[data_matched.Margarine=='B']
scipy.stats.f_oneway(data_matched_A.After8weeks,
data_matched_B.After8weeks)
If ANOVA assumptions are not met, we can consider a nonparametric test.
The Kruskal-Wallis H-test (scipy.stats.kruskal) tests the null hypothesis that the population medians of all of the groups are equal.
# Notice here, we have combined the steps of extracting subsets of data,
# accessing the .After8weeks column from the subset,
# and doing the Kruskal test! Python is powerful!
stats.kruskal(data_matched[data_matched.Margarine=='A'].After8weeks,
data_matched[data_matched.Margarine=='B'].After8weeks)
# Python starts working inside the parentheses, just like algebra.
We do not reject the hypothesis that median cholesterol levels at 8 weeks are equal for the Margarine A and Margarine B groups.
Think back to the Birthweight data. Can you set up a test of equal means for the birthweights measured from four groups of mothers who are:
# Try it here!:
Parametric tests like the t-tests and ANOVA have important assumptions that must be satisfied in order for the associated p-value to be valid, including:
Here are just a few tools to check some of these assumptions.
Baby weight is normally distributed (i.e. bell-shaped). We can check this visually with a histogram (sns.histplot) and with the Kolmogorov-Smirnov test (scipy.stats.kstest).
# Are the samples normally distributed?
sns.histplot(data.Birthweight)
stats.kstest(data.Birthweight,'norm')
Your turn! Is the number of cigarettes smoked per day normally distributed?
# Try it here!:
The population standard deviations of the groups are all equal. This property is known as homoscedasticity. One of several tests available to check this is the Bartlett test, scipy.stats.bartlett.
# Are the standard deviations equal for the smokers and nonsmokers birthweight data?
stats.bartlett(data_smokers.Birthweight, data_nonsmokers.Birthweight)
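One design note, labeled as an addition beyond the lesson: Bartlett's test is sensitive to departures from normality, and scipy also provides Levene's test (scipy.stats.levene) as a more robust alternative. A minimal sketch with synthetic samples:

```python
import numpy as np
from scipy import stats

# Illustrative synthetic samples (equal true standard deviations)
rng = np.random.default_rng(7)
group1 = rng.normal(0, 1.0, 40)
group2 = rng.normal(0, 1.0, 40)

# Bartlett assumes normality; Levene is less sensitive to departures from it
stat_b, p_b = stats.bartlett(group1, group2)
stat_l, p_l = stats.levene(group1, group2)
print(p_b, p_l)
```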
Practice for you: Are the standard deviations of birthweight data equal for the groups of over 35 and under 35 mothers?
# Try it here:
Key Points
Hypothesis testing and p-values give you the significance of an effect / difference.
Parametric and Nonparametric tests are special cases of linear models!
Regressions enable you to express rich links in your data.
Visualizing your data and fitting simple models give insight into the data.