%load_ext autoreload
%autoreload 2
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["figure.dpi"] = 150
plt.rcParams["font.size"] = 14
plt.rcParams['font.family'] = ['sans-serif']
plt.rcParams['font.sans-serif'] = ['DejaVu Sans']
plt.style.use('ggplot')
sns.set_style("whitegrid", {'axes.grid': False})
Visualizing Genomic Data (General Visualization Techniques)
M.E. Wolak, D.J. Fairbairn, Y.R. Paulsen (2012) Guidelines for Estimating Repeatability. Methods in Ecology and Evolution 3(1):129-137.
David J.C. MacKay (1991) Bayesian Interpolation. [http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.27.9072]
So far the course has covered plenty of imaging and a few quantitative metrics, but the "big" part has not really entered the picture yet.
What does big mean?
So what is "big" imaging?
Going back to our original cell image
We have at least a few samples (or different regions), a large number of metrics, and an almost as large number of parameters to tune
One of the most repeated criticisms of scientific work is that correlation and causation are confused.
There are two broad classes of data and scientific studies.
| Observational | Controlled |
|---|---|
| We examined 100 people | In 100 cake samples |
Since most experiments in science are specific, noisy, and often very complicated, they are not usually good teaching examples, so we start with something simpler: coin flips.
You buy a magic coin at a shop
We normally assume
A fair coin has an expected value of $E(\mathcal{X})=\frac{1}{2}$:
An unbiased flipper means each flip is independent of the others
Coin flips are very simple, but they can be difficult to map onto a more realistic experiment.
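As a quick sanity check (a minimal simulation, not part of the original analysis), we can verify both assumptions numerically: the observed mean is close to $E(\mathcal{X})=0.5$ and consecutive flips are essentially uncorrelated.
import numpy as np
rng = np.random.default_rng(42)
flips = rng.integers(0, 2, size=10000)  # 10,000 fair coin flips (0 = tails, 1 = heads)
print('Observed mean:', flips.mean())   # should be close to E(X) = 0.5
# independence check: the correlation between each flip and the next one should be ~0
print('Lag-1 correlation:', np.corrcoef(flips[:-1], flips[1:])[0, 1])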
A very popular dataset for moving beyond coin flips is the Iris dataset, which covers sepal and petal measurements for three species of iris flowers.
%matplotlib inline
from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = load_iris()
iris_df = pd.DataFrame(data['data'], columns=data['feature_names'])
iris_df['target'] = data['target_names'][data['target']]
iris_df.sample(3)
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 16 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
| 43 | 5.0 | 3.5 | 1.6 | 0.6 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
sns.pairplot(iris_df, hue='target');
The intraclass correlation coefficient (ICC) basically looks at how much of the overall variation in a measurement is explained by differences between groups rather than by scatter within each group.
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(15,5))
sns.swarmplot(data=iris_df, ax = ax1,
x='target', y='sepal width (cm)');ax1.set_title('Low Group Similarity');
ax2.imshow(plt.imread('../common/figures/FlowerAnatomy.png')); ax2.axis('off');
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(15,5))
g = sns.swarmplot(data=iris_df, ax=ax1,
x='target', y='petal length (cm)');g.set_title('High Group Similarity');
ax2.imshow(plt.imread('../common/figures/FlowerAnatomy.png')); ax2.axis('off');
$$ ICC = \frac{S_a}{S_a + S_w} $$

where $S_a$ is the variance between the group means and $S_w$ is the mean variance within the groups, which we can compute directly:
def icc_calc(value_name, group_name, data_df):
    """Intraclass correlation: between-group variance over total variance."""
    data_agg = data_df.groupby(group_name).agg({value_name: ['mean', 'var']}).reset_index()
    data_agg.columns = data_agg.columns.get_level_values(1)
    S_w = data_agg['var'].mean()   # mean within-group variance
    S_a = data_agg['mean'].var()   # variance of the group means
    return S_a/(S_a+S_w)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
sns.swarmplot(data=iris_df, ax=ax1,
x='target', y='sepal width (cm)')
ax1.set_title('Low Group Similarity\nICC:{:2.1%}'.format(icc_calc('sepal width (cm)', 'target', iris_df)));
sns.swarmplot(data=iris_df,ax=ax2,
x='target', y='petal length (cm)')
ax2.set_title('High Group Similarity\nICC:{:2.1%}'.format(icc_calc('petal length (cm)', 'target', iris_df)));
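As a quick cross-check (a minimal addition reusing the `icc_calc` helper defined above), we can compute the ICC for every measurement at once; the petal measurements should show a much higher ICC than sepal width, matching the swarm plots.
# ICC for each numeric column, grouped by species
feature_cols = [c for c in iris_df.columns if c != 'target']
icc_summary = pd.Series({c: icc_calc(c, 'target', iris_df) for c in feature_cols})
print(icc_summary.sort_values(ascending=False).apply('{:2.1%}'.format))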
Once the reproducibility has been measured, it is possible to compare groups.
The idea is to construct a test that assesses the likelihood that two groups are the same, given the data.
We have 1 coin from a magic shop.
Our assumptions are:
we flip and observe flips of coins accurately and independently
the coin is invariant and always has the same expected value
Our null hypothesis is the coin is unbiased $E(\mathcal{X})=0.5$
we can calculate the likelihood of a given observation given the number of flips (p-value)
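For example, this can be computed directly with `scipy.stats.binomtest` (available in SciPy ≥ 1.7; a minimal sketch, not part of the original notebook):
from scipy.stats import binomtest
# 5 heads out of 5 flips, null hypothesis: P(heads) = 0.5
result = binomtest(k=5, n=5, p=0.5, alternative='greater')
print('p-value: {:2.1%}'.format(result.pvalue))  # 0.5**5 = 3.1%
This matches the $0.5^5 = 3.1\%$ computed by hand in the table below.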
How good is good enough?
We assume the distribution of our stochastic variable is normal (Gaussian) and the t-distribution provides an estimate for the mean of the underlying distribution based on few observations.
Student's t-test incorporates this distribution and provides an easy method for assessing the likelihood that two given sets of observations come from the same underlying process (the null hypothesis)
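A minimal illustration (a sketch, not from the original notebook) of how `scipy.stats.ttest_ind` behaves when the null hypothesis is true versus when the means really differ:
import numpy as np
from scipy.stats import ttest_ind
rng = np.random.default_rng(0)
same_a = rng.normal(loc=0.0, scale=1.0, size=30)
same_b = rng.normal(loc=0.0, scale=1.0, size=30)   # same underlying process
shifted = rng.normal(loc=1.0, scale=1.0, size=30)  # mean shifted by one standard deviation
print('same process p-value:    {:.3f}'.format(ttest_ind(same_a, same_b).pvalue))
print('shifted process p-value: {:.3f}'.format(ttest_ind(same_a, shifted).pvalue))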
Back to the magic coin: let's assume we are trying to publish a paper showing the coin really is magic (biased), and that we need a p-value below 0.05 to do so.
import pandas as pd
from scipy.stats import ttest_ind
from IPython.display import display
all_heads_df = pd.DataFrame({'n_flips': [1, 4, 5]})
all_heads_df['Probability of # Heads'] = all_heads_df['n_flips'].map(
lambda x: '{:2.1%}'.format(0.5**x))
display(all_heads_df)
| | n_flips | Probability of # Heads |
|---|---|---|
| 0 | 1 | 50.0% |
| 1 | 4 | 6.2% |
| 2 | 5 | 3.1% |
Let N friends make 5 tosses...
friends_heads_df = pd.DataFrame({'n_friends': [1, 10, 20, 40, 80]})
friends_heads_df['Probability of 5 Heads'] = friends_heads_df['n_friends'].map(
lambda n_friends: '{:2.1%}'.format((1-(1-0.5**5)**n_friends)))
display(friends_heads_df)
| | n_friends | Probability of 5 Heads |
|---|---|---|
| 0 | 1 | 3.1% |
| 1 | 10 | 27.2% |
| 2 | 20 | 47.0% |
| 3 | 40 | 71.9% |
| 4 | 80 | 92.1% |
Clearly this is not legitimate: otherwise we could keep flipping coins, or ask all of our friends to flip, until we got 5 heads and then publish.
The p-value is only meaningful when the experiment matches what we did.
There are many ways to correct for such multiple comparisons; most just involve scaling $p$ (or the significance cutoff).
This is very bad news for us: we have the ability to quantify all sorts of interesting metrics,
so let's throw them all into a magical statistics algorithm and push the publish button.
With our p-value threshold of 0.05 and a study with 10 samples in each group, how does increasing the number of variables affect our results?
import pandas as pd
import numpy as np
pd.set_option('display.precision', 2)
np.random.seed(2017)
def random_data_maker(rows, cols):
data_df = pd.DataFrame(
np.random.uniform(-1, 1, size=(rows, cols)),
columns=['Var_{:02d}'.format(c_col) for c_col in range(cols)])
data_df['Group'] = [1]*(rows-rows//2)+[2]*(rows//2)
return data_df
rand_df = random_data_maker(10, 5)
rand_df
| | Var_00 | Var_01 | Var_02 | Var_03 | Var_04 | Group |
|---|---|---|---|---|---|---|
| 0 | -0.96 | 0.53 | -0.10 | -0.76 | 0.86 | 1 |
| 1 | 0.30 | -0.72 | -0.54 | -0.55 | -0.48 | 1 |
| 2 | -0.77 | 0.26 | -0.23 | -0.37 | 0.26 | 1 |
| 3 | -0.41 | 0.89 | -0.70 | -0.85 | 0.41 | 1 |
| 4 | -0.86 | -0.39 | -0.34 | -0.38 | -0.12 | 1 |
| 5 | 0.53 | -0.05 | -0.99 | 0.40 | 0.26 | 2 |
| 6 | -0.94 | -0.83 | 0.41 | -0.09 | 0.41 | 2 |
| 7 | 0.86 | -0.18 | -0.92 | 0.24 | -0.28 | 2 |
| 8 | 0.84 | 0.83 | -0.46 | -0.39 | -0.97 | 2 |
| 9 | 0.08 | 0.34 | -0.09 | 0.07 | 0.82 | 2 |
from scipy.stats import ttest_ind
def show_significant(in_df, cut_off=0.05):
    # highlight p-values below the cutoff in yellow
    return in_df.sort_values('P-Value').style.apply(
        lambda x: ['background-color: yellow' if v < cut_off else '' for v in x])
def all_ttest(in_df):
    # run an independent two-sample t-test (Group 1 vs Group 2) for every variable
    return pd.DataFrame(
        {'P-Value': {c_col: ttest_ind(
            a=in_df[in_df['Group'] == 1][c_col],
            b=in_df[in_df['Group'] == 2][c_col]
        ).pvalue
            for c_col in
            in_df.columns if 'Group' not in c_col}})
show_significant(all_ttest(rand_df))
| | P-Value |
|---|---|
| Var_03 | 0.01 |
| Var_00 | 0.08 |
| Var_04 | 0.73 |
| Var_01 | 0.82 |
| Var_02 | 0.92 |
np.random.seed(2019)
show_significant(all_ttest(random_data_maker(150, 20)))
| | P-Value |
|---|---|
| Var_15 | 0.01 |
| Var_03 | 0.04 |
| Var_14 | 0.10 |
| Var_01 | 0.13 |
| Var_07 | 0.26 |
| Var_18 | 0.40 |
| Var_13 | 0.40 |
| Var_10 | 0.44 |
| Var_04 | 0.50 |
| Var_11 | 0.55 |
| Var_06 | 0.57 |
| Var_05 | 0.66 |
| Var_09 | 0.71 |
| Var_02 | 0.71 |
| Var_12 | 0.74 |
| Var_00 | 0.87 |
| Var_19 | 0.87 |
| Var_08 | 0.89 |
| Var_17 | 0.89 |
| Var_16 | 0.92 |
import seaborn as sns
from tqdm import notebook # progressbar
out_list = []
for n_vars in notebook.tqdm(range(1, 150, 10)):
for _ in range(50):
p_values = all_ttest(random_data_maker(100, n_vars)).values
out_list += [{'Variables in Study': n_vars,
'Significant Variables Found': np.sum(p_values < 0.05),
'raw_values': p_values}]
var_found_df = pd.DataFrame(out_list)
sns.swarmplot(data=var_found_df, x='Variables in Study', y='Significant Variables Found');
plt.figure(figsize=(12,7))
sns.boxplot(data=var_found_df,
x='Variables in Study', y='Significant Variables Found');
Using a simple correction factor (the number of tests performed), we can make the number of significant findings roughly constant again; this is the Bonferroni correction:

$$ p_{\textrm{cutoff}} = \frac{0.05}{\#\textrm{ of tests}} $$
var_found_df['Corrected Significant Count'] = var_found_df['raw_values'].map(lambda p_values:
np.sum(p_values<0.05/len(p_values)))
var_found_df.groupby('Variables in Study').agg('mean').reset_index().plot('Variables in Study', [
'Significant Variables Found',
'Corrected Significant Count'
]);
plt.title('Effect of significance correction');
So no harm done there, we just add this correction factor, right?
Well, what if we have exactly one variable whose mean is shifted by 1.0 standard deviations between the two groups,
in a dataset where we check $n$ variables?
table_df = random_data_maker(50, 10)
really_different_var = np.concatenate([
np.random.normal(loc=0, scale=1.0, size=(table_df.shape[0]//2)),
np.random.normal(loc=1, scale=1.0, size=(table_df.shape[0]//2))
])
table_df['Really Different Var'] = really_different_var
fig, ax1 = plt.subplots(1, 1, figsize=(10, 5))
ax1.hist(table_df.query('Group==1')['Really Different Var'], np.linspace(-5, 5, 20), label='Group 1')
ax1.hist(table_df.query('Group==2')['Really Different Var'], np.linspace(-5, 5, 20), label='Group 2', alpha=0.5);
ax1.legend();
out_p_value = []
for _ in range(200):
out_p_value += [ttest_ind(np.random.normal(loc=0, scale=1.0, size=(table_df.shape[0]//2)),
np.random.normal(loc=1, scale=1.0, size=(table_df.shape[0]//2))).pvalue]
fig, m_axs = plt.subplots(2, 3, figsize=(20, 10))
for c_ax, var_count in zip(m_axs.flatten(), np.linspace(1, 140, 9).astype(int)):
c_ax.hist(np.clip(np.array(out_p_value)*var_count, 0.01, 0.3), np.linspace(0.01, 0.3, 30))
c_ax.set_ylim(0, 100)
c_ax.set_title('p-value after multiple correction\n for {} variables'.format(var_count))
var_find_df = pd.DataFrame({'Variables': np.linspace(1, 100, 30).astype(int)})
var_find_df['Likelihood of Detecting Really Different Variable'] = var_find_df['Variables'].map(
lambda var_count: np.mean(np.array(out_p_value)*var_count<0.05)
)
fig, ax1 = plt.subplots(1, 1, figsize=(15, 5))
var_find_df.plot('Variables', 'Likelihood of Detecting Really Different Variable', ax=ax1)
ax1.set_ylabel('% Likelihood');
Borrowed from http://peekaboo-vision.blogspot.ch/2013/01/machine-learning-cheat-sheet-for-scikit.html
Basically all of these are ultimately functions which map inputs to outputs.
The most serious problem with machine learning and similar approaches is overfitting your model to your data. Particularly as models get increasingly complex (random forests, neural networks, deep learning, ...), it becomes more and more difficult to apply common sense, or even to understand exactly what a model is doing and why a given answer is produced.
magic_classifier = {}
# training
magic_classifier['Dog'] = 'Animal'
magic_classifier['Bob'] = 'Person'
magic_classifier['Fish'] = 'Animal'
Now use this classifier, on the training data it works really well
magic_classifier['Dog'] == 'Animal' # true, 1/1 so far!
magic_classifier['Bob'] == 'Person' # true, 2/2 still perfect!
magic_classifier['Fish'] == 'Animal' # true, 3/3, wow!
On new data it doesn't work at all; it doesn't even execute.
magic_classifier['Octopus'] == 'Animal' # exception?! but it was working so well
magic_classifier['Dan'] == 'Person' # exception?!
The above example appeared to be a perfect learner for mapping names to animals or people, but it just memorized the inputs and reproduced them at the output; it didn't actually learn anything, it just copied.
This is relevant for each of the categories of models above, but it is applied in a slightly different way depending on the group. The idea is to divide the dataset into groups called training and validation, or ideally training, validation, and testing.
The analysis is then performed on the training set, any free parameters are tuned on the validation set, and the final performance is reported on the held-out testing set.
Here we return to the iris data set and try to automatically classify flowers
from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = load_iris()
iris_df = pd.DataFrame(data['data'], columns=data['feature_names'])
iris_df['target'] = data['target_names'][data['target']]
iris_df.sample(3)
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 41 | 4.5 | 2.3 | 1.3 | 0.3 | setosa |
| 40 | 5.0 | 3.5 | 1.3 | 0.3 | setosa |
| 33 | 5.5 | 4.2 | 1.4 | 0.2 | setosa |
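A minimal sketch of fitting such a classifier with scikit-learn (the model choice and settings here are illustrative, not prescribed by the original material), using a held-out test set as discussed above:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
X = iris_df[data['feature_names']]
y = iris_df['target']
# hold out 25% of the flowers for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
print('Training accuracy: {:.1%}'.format(tree_clf.score(X_train, y_train)))
print('Testing accuracy:  {:.1%}'.format(tree_clf.score(X_test, y_test)))
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
plot_tree(tree_clf, feature_names=data['feature_names'],
          class_names=list(tree_clf.classes_), filled=True, ax=ax);
A large gap between training and testing accuracy would be exactly the overfitting problem described above.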
Given the complexity of the tree, we need to do some pruning.
With a quantitative approach, we can calculate the sensitivity associated
with each parameter and establish the relationship between the parameters we choose and the output metrics we measure.
from graphviz import Digraph
dot = Digraph()
dot.node('Raw images',color='limegreen'), dot.node('Gaussian filter', color='lightblue')
dot.node('sigma=0.5', color='gray',shape='box'), dot.node('3x3 Neighbors', color='gray',shape='box')
dot.node('Threshold', color='lightblue'), dot.node('100', color='gray',shape='box')
dot.node('Thickness analysis',color='hotpink'), dot.node('Shape analysis',color='hotpink')
dot.node('Input',color='limegreen'), dot.node('Functions', color='lightblue')
dot.node('Parameters', color='gray',shape='box'),dot.node('Output',color='hotpink')
dot.edge('Raw images', 'Gaussian filter'), dot.edge('sigma=0.5', 'Gaussian filter')
dot.edge('3x3 Neighbors', 'Gaussian filter'), dot.edge('Gaussian filter','Threshold')
dot.edge('Threshold', 'Thickness analysis'), dot.edge('Threshold', 'Shape analysis')
dot.edge('100','Threshold')
dot
The way we do this is usually a parameter sweep, which means re-running the full analysis many times while varying one parameter (for example the threshold) over a range of values and recording how each output metric changes.
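A minimal sketch of such a sweep (the synthetic image, filter setting, and threshold range below are invented for illustration): run the same pipeline repeatedly, varying only the threshold, and record the output metrics each time.
import numpy as np
import pandas as pd
from skimage.filters import gaussian
from skimage.measure import label, regionprops
# synthetic test image: three bright blobs on a noisy background (illustrative only)
rng = np.random.default_rng(42)
test_img = rng.normal(loc=0.1, scale=0.05, size=(128, 128))
yy, xx = np.ogrid[:128, :128]
for cx, cy in [(32, 32), (64, 96), (96, 48)]:
    test_img += 0.8 * np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * 8.0 ** 2))
smooth_img = gaussian(test_img, sigma=0.5)  # fixed Gaussian filter (sigma = 0.5)
sweep_rows = []
for thresh in np.linspace(0.1, 0.7, 25):  # the parameter being swept
    labeled = label(smooth_img > thresh)
    regions = regionprops(labeled)
    sweep_rows.append({'threshold': thresh,
                       'object_count': len(regions),
                       'mean_area': np.mean([r.area for r in regions]) if regions else 0})
sweep_df = pd.DataFrame(sweep_rows)
sweep_df.plot(x='threshold', y=['object_count', 'mean_area'], subplots=True);
Each row of `sweep_df` is one full run of the toy pipeline; the sensitivity calculation below summarizes how strongly each metric reacts to the threshold.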
Sensitivity is classically defined as the change in the output metric per unit change in the input parameter, i.e. the local slope $\partial\textrm{Metric}/\partial\textrm{Parameter}$.
Such a strict definition is not particularly useful for image processing, since its value depends on the units of both the metric and the parameter:
$\rightarrow$ for a threshold sweep of a volume measurement, the sensitivity becomes volume per intensity!
A more common approach is to estimate the variation in this parameter between images or within a single image (automatic threshold methods can be useful for this) and define the sensitivity based on this variation.
It is also common to normalize it with the mean value so the result is a percentage.
$$ S = \frac{\max(\textrm{Metric})-\min(\textrm{Metric})}{\textrm{avg}(\textrm{Metric})} $$

In a parameter-sweep graph this corresponds to the magnitude of the slope: the steeper the slope, the more the metric changes for a small change in the parameter.
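Continuing the hypothetical sweep above, a small helper (a sketch that assumes the `sweep_df` built in the previous cell) computes this normalized sensitivity for each metric:
def sweep_sensitivity(metric_values):
    """Normalized sensitivity: (max - min) / mean of a metric over the sweep, in percent."""
    metric_values = np.asarray(metric_values, dtype=float)
    return 100 * (metric_values.max() - metric_values.min()) / metric_values.mean()
for metric in ['object_count', 'mean_area']:
    print('{:>12s} sensitivity over the threshold sweep: {:5.1f}%'.format(
        metric, sweep_sensitivity(sweep_df[metric])))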
Reproducibility is a very broad topic with plenty of sub-areas and deeper meanings. We mean two things by reproducibility:
- Reproducible analysis: the process of going from images to numbers is documented in a clear manner so that anyone, anywhere could follow it and get exactly (within some tolerance) the same numbers from your samples.
- Reproducible measurement: everything required for a reproducible analysis, plus the requirement that taking the measurement several times (noise and exact alignment vary each time) does not change the statistics significantly.
Since we will need to perform the same analysis many times to understand how reproducible it is, the analysis should be scriptable so that it can be run directly from the command line.
#!/usr/bin/env python
import sys
from myAnalysis import analysisScript  # some analysis script you implemented
imageFile = sys.argv[1]  # file name from the command line (argv[0] is the script itself)
threshold = 130
analysisScript(fname=imageFile, threshold=threshold)
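A slightly more robust variant (again a sketch; `myAnalysis.analysisScript` is the same placeholder as above) passes the threshold on the command line too, so the parameter value is recorded in the call itself:
#!/usr/bin/env python
import argparse
from myAnalysis import analysisScript  # some analysis script you implemented
parser = argparse.ArgumentParser(description='Run the standard shape analysis on one image')
parser.add_argument('image_file', help='path to the input image')
parser.add_argument('--threshold', type=int, default=130, help='segmentation threshold')
args = parser.parse_args()
analysisScript(fname=args.image_file, threshold=args.threshold)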
IMAGEFILE=$1
THRESHOLD=130
matlab -r "inImage=$IMAGEFILE; threshImage=inImage>$THRESHOLD; analysisScript;"
java -jar ij.jar -macro TestMacro.ijm blobs.tif
Rscript -e "library(plyr);..."
In computer programming, unit testing is a method by which individual units of source code, sets of one or more computer program modules together with associated control data, usage procedures, and operating procedures, are tested to determine if they are fit for use.
The first requirement for unit testing to work well is to have your code divided up into small independent parts (functions)
Ideally with realistic but simulated test data
Given the following function
function shapeTable=shapeAnalysis(inImage)
We should decompose the function into sub-components with single tasks:
from graphviz import Digraph
dot = Digraph()
dot.edge('shapeAnalysis(inImage)', 'componentLabel(inImage)'), dot.edge('shapeAnalysis(inImage)', 'analyzeObject(inObject)')
dot.edge('analyzeObject(inObject)','countVoxs(inObject)'), dot.edge('analyzeObject(inObject)','calculateCOV(inObject)')
dot.edge('analyzeObject(inObject)','calcShapeT(covMat)'), dot.edge('analyzeObject(inObject)','calcOrientation(shapeT)')
dot.edge('analyzeObject(inObject)','calcAnisotropy(shapeT)')
dot
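For example (a sketch, where `count_voxs` stands in for the `countVoxs(inObject)` node above), each small function gets its own test on simple simulated data where the correct answer is known:
import numpy as np
def count_voxs(in_object):
    """Count the voxels belonging to a (boolean) object mask."""
    return int(np.sum(in_object > 0))
def test_count_voxs():
    # a 3x3x3 cube inside a 5x5x5 volume has exactly 27 voxels
    test_object = np.zeros((5, 5, 5), dtype=bool)
    test_object[1:4, 1:4, 1:4] = True
    assert count_voxs(test_object) == 27
    assert count_voxs(np.zeros((5, 5, 5), dtype=bool)) == 0
test_count_voxs()  # with pytest this would be discovered and run automatically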
https://github.com/scikit-image/scikit-image/tree/master/skimage
Test Connected Components https://github.com/scikit-image/scikit-image/blob/16d3fd07e7d882d7f6b74e8dc4028ff946ac7e63/skimage/morphology/tests/test_ccomp.py#L13
# excerpt from the scikit-image watershed tests; `watershed`, `diff`, and `eps`
# are imported/defined elsewhere in the same test module
class TestWatershed(unittest.TestCase):
    eight = np.ones((3, 3), bool)

    def test_watershed01(self):
        "watershed 1"
        data = np.array([[0, 0, 0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0, 0, 0],
                         [0, 1, 1, 1, 1, 1, 0],
                         [0, 1, 0, 0, 0, 1, 0],
                         [0, 1, 0, 0, 0, 1, 0],
                         [0, 1, 0, 0, 0, 1, 0],
                         [0, 1, 1, 1, 1, 1, 0],
                         [0, 0, 0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0, 0, 0]], np.uint8)
        markers = np.array([[-1, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 1, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0]],
                           np.int8)
        out = watershed(data, markers, self.eight)
        expected = np.array([[-1, -1, -1, -1, -1, -1, -1],
                             [-1, -1, -1, -1, -1, -1, -1],
                             [-1, -1, -1, -1, -1, -1, -1],
                             [-1, 1, 1, 1, 1, 1, -1],
                             [-1, 1, 1, 1, 1, 1, -1],
                             [-1, 1, 1, 1, 1, 1, -1],
                             [-1, 1, 1, 1, 1, 1, -1],
                             [-1, 1, 1, 1, 1, 1, -1],
                             [-1, -1, -1, -1, -1, -1, -1],
                             [-1, -1, -1, -1, -1, -1, -1]])
        error = diff(expected, out)
        assert error < eps
Keep the tests in the code itself: https://github.com/scikit-image/scikit-image/blob/16d3fd07e7d882d7f6b74e8dc4028ff946ac7e63/skimage/filters/thresholding.py#L886
def apply_hysteresis_threshold(image, low, high):
    """Apply hysteresis thresholding to `image`.

    This algorithm finds regions where `image` is greater than `high`
    OR `image` is greater than `low` *and* that region is connected to
    a region greater than `high`.

    Parameters
    ----------
    image : array, shape (M,[ N, ..., P])
        Grayscale input image.
    low : float, or array of same shape as `image`
        Lower threshold.
    high : float, or array of same shape as `image`
        Higher threshold.

    Returns
    -------
    thresholded : array of bool, same shape as `image`
        Array in which `True` indicates the locations where `image`
        was above the hysteresis threshold.

    Examples
    --------
    >>> image = np.array([1, 2, 3, 2, 1, 2, 1, 3, 2])
    >>> apply_hysteresis_threshold(image, 1.5, 2.5).astype(int)
    array([0, 1, 1, 1, 0, 0, 0, 1, 1])

    References
    ----------
    .. [1] J. Canny. A computational approach to edge detection.
           IEEE Transactions on Pattern Analysis and Machine Intelligence.
           1986; vol. 8, pp.679-698.
           DOI: 10.1109/TPAMI.1986.4767851
    """
    low = np.clip(low, a_min=None, a_max=high)  # ensure low always below high
    mask_low = image > low
    mask_high = image > high
    # ... (the rest of the function, which connects the two masks, is omitted in this excerpt)
Working primarily in notebooks makes regular testing more difficult, but not impossible. If we employ a few simple tricks, we can use doctests seamlessly inside of Jupyter: we write what in Python is called a decorator to set this up.
import doctest
import copy
import functools
def autotest(func):
    # run the examples in `func`'s docstring as doctests immediately when it is defined
    globs = copy.copy(globals())
    globs.update({func.__name__: func})
    doctest.run_docstring_examples(
        func, globs, verbose=True, name=func.__name__)
    return func
@autotest
def add_5(x):
"""
Function adds 5
>>> add_5(5)
10
"""
return x+5
Finding tests in add_5
Trying:
    add_5(5)
Expecting:
    10
ok
from skimage.measure import label
import numpy as np
@autotest
def simple_label(x):
"""
Label an image
>>> test_img = np.eye(3)
>>> test_img
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
>>> simple_label(test_img)
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1]])
>>> test_img[1,1] = 0
>>> simple_label(test_img)
array([[1, 0, 0],
[0, 0, 0],
[0, 0, 2]])
"""
return label(x)
Finding tests in simple_label
Trying:
    test_img = np.eye(3)
Expecting nothing
ok
Trying:
    test_img
Expecting:
    array([[1., 0., 0.],
           [0., 1., 0.],
           [0., 0., 1.]])
ok
Trying:
    simple_label(test_img)
Expecting:
    array([[1, 0, 0],
           [0, 1, 0],
           [0, 0, 1]])
**********************************************************************
File "__main__", line 12, in simple_label
Failed example:
    simple_label(test_img)
Expected:
    array([[1, 0, 0],
           [0, 1, 0],
           [0, 0, 1]])
Got:
    array([[1, 0, 0],
           [0, 1, 0],
           [0, 0, 1]], dtype=int64)
Trying:
    test_img[1,1] = 0
Expecting nothing
ok
Trying:
    simple_label(test_img)
Expecting:
    array([[1, 0, 0],
           [0, 0, 0],
           [0, 0, 2]])
**********************************************************************
File "__main__", line 17, in simple_label
Failed example:
    simple_label(test_img)
Expected:
    array([[1, 0, 0],
           [0, 0, 0],
           [0, 0, 2]])
Got:
    array([[1, 0, 0],
           [0, 0, 0],
           [0, 0, 2]], dtype=int64)
https://www.mathworks.com/help/matlab/matlab-unit-test-framework.html
Test-driven development is a style or approach to programming where the tests are written before the functional code, like very concrete specifications. It makes it easy to estimate how much time is left, since you can automatically see how many of the tests have been passed, and you and your collaborators are clear on the utility of the system.
Continuous integration is the process of running tests automatically every time changes are made.
This is possible to set up inside of many IDEs and is offered as a commercial service from companies like CircleCI and Travis.
We use it for the QBI course to make sure all of the code in the slides is correct.
Projects like scikit-image use it to ensure that changes do not break existing code, without requiring manual checks.
One of the biggest problems with big science is trying to visualize a lot of heterogeneous data.
Crameri, F., Shephard, G.E. & Heron, P.J. The misuse of colour in science communication. Nat Commun 11, 5444 (2020). https://doi.org/10.1038/s41467-020-19160-7
You visualize your data for different reasons:
from Knaflic 2015
There are too many graphs which say:
Plots to "show the results" or "get a feeling" are usually not good
from plotnine import *
from plotnine.data import *
import numpy as np
import pandas as pd
# Some data
xd = np.random.rand(80)
yd = xd + np.random.rand(80)
zd = np.random.rand(80)
df = pd.DataFrame(dict(x=xd,y=yd,z=zd))
ggplot(df,aes(x='x',y='y')) + geom_point()
"X is a little bit correlated with Y"
(ggplot(df,aes(x='x',y='y'))
+ geom_point()
+ geom_smooth(method="lm")
# + coord_equal()
+ labs(title="X is weakly correlated with Y")
+ theme_bw(20) )
Too much data makes it very difficult to derive a clear message
xd = np.random.rand(5000)
yd = (xd-0.5)*np.random.rand(5000)
df = pd.DataFrame(dict(x=xd,y=yd))
(ggplot(df,aes(x='x',y='y'))
 + geom_point()
 + coord_equal()
 + theme_bw(20))
Filter and reduce information until it is extremely simple
(ggplot(df,aes(x='x',y='y'))
+ stat_bin2d(bins=40)
+ geom_smooth(method="lm",color='red')
+ coord_equal()
+ theme_bw(20)
+ guides(color=False)  # hide the colour guide
)
(ggplot(df,aes(x='x',y='y'))
+ geom_density_2d(aes(x='x', y='y', color='..level..'))
+ geom_smooth(method="lm")
+ coord_equal()
+ labs(color="Type")
+ theme_bw(15)
)
If we develop a consistent way of
we can compose and decompose graphics easily
The most important modern work in graphical grammars is "The Grammar of Graphics" by Wilkinson, Anand, and Grossman (2005).
This work built on earlier work by Bertin (1983) and proposed a grammar that can be used to describe and construct a wide range of statistical graphics.
Normally we think of plots in terms of some sort of data which is fed into a plot command that produces a picture
plot(xdata,ydata,color/shape)
Separate the graph into its component parts
Data mapping | Points | Axes/Coordinate system | Labels/annotation |
---|---|---|---|
$var1 \rightarrow x$, $var2 \rightarrow y$ | ![]() |
![]() |
![]() |
Construct graphics by focusing on each portion independently.
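A small plotnine sketch of this composition, reusing the random `df` from above: the data mapping, points, coordinate system, and labels from the table are added one layer at a time.
base = ggplot(df, aes(x='x', y='y'))      # data mapping: x -> x axis, y -> y axis
layers = base + geom_point(alpha=0.1)     # points
layers = layers + coord_equal()           # axes / coordinate system
layers = layers + labs(title='Composed layer by layer',
                       x='Variable 1', y='Variable 2') + theme_bw(15)  # labels / annotation
layers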
I am not a statistician and this is not a statistics course
If you have questions or concerns
Both ETHZ and Uni Zurich offer free consultation with real statisticians
They are rarely bearers of good news - you always need more data...
Simulations (even simple ones) are very helpful
Try and understand the tests you are performing
# Old parameter sweep figure, needs translation from R
import sys
sys.path.append("../common/scripts/")
import shapeAnalysisProcess
import commonReportFunctions
# read and correct the coordinate system
# NOTE: the bodies below (and the plotting cells that follow) are still largely
# untranslated R (strsplit, cbind, Sys.glob, ldply, ...) kept here for reference
def threshfun(x):
    pth = rev(strsplit(x, "/")[[1]])[2]
    t = strsplit(pth, "_")[[1]][3]
    # as.numeric(substring(t, 2, nchar(t)))

def readfcn(x):
    cbind(compare.foam.corrected(x, checkProj=F), thresh=thresh.fun(x))

# Where are the csv files located
rootDir = "../common/data/mcastudy"
clpor.files = Sys.glob(paste(rootDir, "/a*/lacun_0.csv", sep="/"))  # list all of the files
data = ldply(clpor.files, readfcn, .parallel=T)  # read in all of the files
df = pd.DataFrame(data)
return df;
lacun=readfcn(10)
(ggplot(lacun,aes(y=VOLUME*1e9,x=thresh))
+ geom_jitter(alpha=0.1)+geom_smooth()
+ theme_bw(24)
+ labs(y="Volume (um3)",x="Threshold Value",color="Threshold")
+ ylim(0,1000))
(ggplot(subset(lacun,thresh %% 1000==0),aes(y=VOLUME*1e9,x=as.factor(thresh)))
+ geom_violin()
+ theme_bw(24)
+ labs(y="Volume (um3)",x="Threshold Value",color="Threshold")
+ ylim(0,1000))
(ggplot(lacun,aes(y=PCA1_Z,x=thresh))
 + geom_jitter(alpha=0.1)+geom_smooth()
 + theme_bw(24)
 + labs(y="Orientation",x="Threshold Value",color="Threshold"))
In this graph the sensitivity is the magnitude of the slope: the steeper the slope, the more the metric changes for a small change in the parameter.
poresum<-function(all.data) ddply(all.data,.(thresh),function(c.sample) {
data.frame(Count=nrow(c.sample),
Volume=mean(c.sample$VOLUME*1e9),
Stretch=mean(c.sample$AISO),
Oblateness=mean(c.sample$OBLATENESS),
#Lacuna_Density_mm=1/mean(c.sample$DENSITY_CNT),
Length=mean(c.sample$PROJ_PCA1*1000),
Width=mean(c.sample$PROJ_PCA2*1000),
Height=mean(c.sample$PROJ_PCA3*1000),
Orientation=mean(abs(c.sample$PCA1_Z)))
})
comb.summary<-cbind(poresum(all.lacun),Phase="Lacuna")
splot<-ggplot(comb.summary,aes(x=thresh))
splot+geom_line(aes(y=Count))+geom_point(aes(y=Count))+scale_y_log10()+
theme_bw(24)+labs(y="Object Count",x="Threshold",color="Phase")
Comparing the different variables, we see that the threshold range where the count sensitivity is best (lowest) is where the volume and anisotropy sensitivities are highest.
calc.sens<-function(in.df) {
data.frame(sens.cnt=100*with(in.df,(max(Count)-min(Count))/mean(Count)),
sens.vol=100*with(in.df,(max(Volume)-min(Volume))/mean(Volume)),
sens.stretch=100*with(in.df,(max(Stretch)-min(Stretch))/mean(Stretch))
)
}
sens.summary<-ddply.cutcols(comb.summary,.(cut_interval(thresh,5)),calc.sens)
ggplot(sens.summary,aes(x=thresh))+
geom_line(aes(y=sens.cnt,color="Count"))+
geom_line(aes(y=sens.vol,color="Volume"))+
geom_line(aes(y=sens.stretch,color="Anisotropy"))+
labs(x="Threshold",y="Sensitivity (%)",color="Metric")+
theme_bw(20)