%load_ext autoreload
%autoreload 2
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["figure.dpi"] = 150
plt.rcParams["font.size"] = 14
plt.rcParams['font.family'] = ['sans-serif']
plt.rcParams['font.sans-serif'] = ['DejaVu Sans']
plt.style.use('ggplot')
sns.set_style("whitegrid", {'axes.grid': False})
Visualizing Genomic Data (General Visualization Techniques)
M.E. Wolak, D.J. Fairbairn, Y.R. Paulsen (2012) Guidelines for Estimating Repeatability. Methods in Ecology and Evolution 3(1):129-137.
David J.C. MacKay (1991) Bayesian Interpolation. [http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.27.9072]
So far the course has covered plenty of imaging and a few quantitative metrics, but the "big" part has not really entered the picture yet.
What does big mean?
So what is "big" imaging?
Going back to our original cell image
We have at least a few samples (or different regions), a large number of metrics, and an almost as large number of parameters to tune
One of the most repeated criticisms of scientific work is that correlation and causation are confused.
There are two broad classes of data and scientific studies.
| Observational | Controlled |
|---|---|
| We examined 100 people | In 100 cake samples |
Since most experiments in science are specific, noisy, and often very complicated, they are not usually good teaching examples, so we start with something simpler: coin flips.
You buy a magic coin at a shop
We normally assume
A fair coin has an expected value of $E(\mathcal{X})=\frac{1}{2}$:
An unbiased flipper means each flip is independent of the others
Coin flips are very simple, but they can be difficult to map onto a more realistic experiment.
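As a quick sanity check (a minimal simulation, not part of the original analysis), we can verify both assumptions numerically: the observed mean is close to $E(\mathcal{X})=0.5$ and consecutive flips are essentially uncorrelated.
import numpy as np
rng = np.random.default_rng(42)
flips = rng.integers(0, 2, size=10000)  # 10,000 fair coin flips (0 = tails, 1 = heads)
print('Observed mean:', flips.mean())   # should be close to E(X) = 0.5
# independence check: the correlation between each flip and the next one should be ~0
print('Lag-1 correlation:', np.corrcoef(flips[:-1], flips[1:])[0, 1])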
A very popular dataset for moving beyond coin flips is the Iris dataset, which covers sepal and petal measurements for three species of iris flowers.
%matplotlib inline
from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = load_iris()
iris_df = pd.DataFrame(data['data'], columns=data['feature_names'])
iris_df['target'] = data['target_names'][data['target']]
iris_df.sample(3)
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 16 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
| 43 | 5.0 | 3.5 | 1.6 | 0.6 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
sns.pairplot(iris_df, hue='target');
The intraclass correlation coefficient (ICC) basically looks at how much of the overall variation in a measurement is explained by differences between groups rather than by scatter within each group.
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(15,5))
sns.swarmplot(data=iris_df, ax = ax1,
x='target', y='sepal width (cm)');ax1.set_title('Low Group Similarity');
ax2.imshow(plt.imread('../common/figures/FlowerAnatomy.png')); ax2.axis('off');
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(15,5))
g = sns.swarmplot(data=iris_df, ax=ax1,
x='target', y='petal length (cm)');g.set_title('High Group Similarity');
ax2.imshow(plt.imread('../common/figures/FlowerAnatomy.png')); ax2.axis('off');
$$ ICC = \frac{S_a}{S_a + S_w} $$

where $S_a$ is the variance between the group means and $S_w$ is the mean variance within the groups, which we can compute directly:
def icc_calc(value_name, group_name, data_df):
    """Intraclass correlation: between-group variance over total variance."""
    data_agg = data_df.groupby(group_name).agg({value_name: ['mean', 'var']}).reset_index()
    data_agg.columns = data_agg.columns.get_level_values(1)
    S_w = data_agg['var'].mean()   # mean within-group variance
    S_a = data_agg['mean'].var()   # variance of the group means
    return S_a/(S_a+S_w)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
sns.swarmplot(data=iris_df, ax=ax1,
x='target', y='sepal width (cm)')
ax1.set_title('Low Group Similarity\nICC:{:2.1%}'.format(icc_calc('sepal width (cm)', 'target', iris_df)));
sns.swarmplot(data=iris_df,ax=ax2,
x='target', y='petal length (cm)')
ax2.set_title('High Group Similarity\nICC:{:2.1%}'.format(icc_calc('petal length (cm)', 'target', iris_df)));
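As a quick cross-check (a minimal addition reusing the `icc_calc` helper defined above), we can compute the ICC for every measurement at once; the petal measurements should show a much higher ICC than sepal width, matching the swarm plots.
# ICC for each numeric column, grouped by species
feature_cols = [c for c in iris_df.columns if c != 'target']
icc_summary = pd.Series({c: icc_calc(c, 'target', iris_df) for c in feature_cols})
print(icc_summary.sort_values(ascending=False).apply('{:2.1%}'.format))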
Once the reproducibility has been measured, it is possible to compare groups.
The idea is to construct a test that assesses the likelihood that two groups are the same, given the data.
We have 1 coin from a magic shop.
Our assumptions are:
we flip and observe flips of coins accurately and independently
the coin is invariant and always has the same expected value
Our null hypothesis is the coin is unbiased $E(\mathcal{X})=0.5$
we can calculate the likelihood of a given observation given the number of flips (p-value)
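For example, this can be computed directly with `scipy.stats.binomtest` (available in SciPy ≥ 1.7; a minimal sketch, not part of the original notebook):
from scipy.stats import binomtest
# 5 heads out of 5 flips, null hypothesis: P(heads) = 0.5
result = binomtest(k=5, n=5, p=0.5, alternative='greater')
print('p-value: {:2.1%}'.format(result.pvalue))  # 0.5**5 = 3.1%
This matches the $0.5^5 = 3.1\%$ computed by hand in the table below.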
How good is good enough?
We assume the distribution of our stochastic variable is normal (Gaussian) and the t-distribution provides an estimate for the mean of the underlying distribution based on few observations.
Student's t-test incorporates this distribution and provides an easy method for assessing the likelihood that two given sets of observations come from the same underlying process (the null hypothesis)
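A minimal illustration (a sketch, not from the original notebook) of how `scipy.stats.ttest_ind` behaves when the null hypothesis is true versus when the means really differ:
import numpy as np
from scipy.stats import ttest_ind
rng = np.random.default_rng(0)
same_a = rng.normal(loc=0.0, scale=1.0, size=30)
same_b = rng.normal(loc=0.0, scale=1.0, size=30)   # same underlying process
shifted = rng.normal(loc=1.0, scale=1.0, size=30)  # mean shifted by one standard deviation
print('same process p-value:    {:.3f}'.format(ttest_ind(same_a, same_b).pvalue))
print('shifted process p-value: {:.3f}'.format(ttest_ind(same_a, shifted).pvalue))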
Back to the magic coin: let's assume we are trying to publish a paper showing the coin really is magic (biased), and that we need a p-value below 0.05 to do so.
import pandas as pd
from scipy.stats import ttest_ind
from IPython.display import display
all_heads_df = pd.DataFrame({'n_flips': [1, 4, 5]})
all_heads_df['Probability of # Heads'] = all_heads_df['n_flips'].map(
lambda x: '{:2.1%}'.format(0.5**x))
display(all_heads_df)
| | n_flips | Probability of # Heads |
|---|---|---|
| 0 | 1 | 50.0% |
| 1 | 4 | 6.2% |
| 2 | 5 | 3.1% |
Let N friends make 5 tosses...
friends_heads_df = pd.DataFrame({'n_friends': [1, 10, 20, 40, 80]})
friends_heads_df['Probability of 5 Heads'] = friends_heads_df['n_friends'].map(
lambda n_friends: '{:2.1%}'.format((1-(1-0.5**5)**n_friends)))
display(friends_heads_df)
| | n_friends | Probability of 5 Heads |
|---|---|---|
| 0 | 1 | 3.1% |
| 1 | 10 | 27.2% |
| 2 | 20 | 47.0% |
| 3 | 40 | 71.9% |
| 4 | 80 | 92.1% |
Clearly this is not legitimate: otherwise we could keep flipping coins, or ask all of our friends to flip, until we got 5 heads and then publish.
The p-value is only meaningful when the experiment matches what we did.
There are many ways to correct for such multiple comparisons; most just involve scaling $p$ (or the significance cutoff).
This is very bad news for us: we have the ability to quantify all sorts of interesting metrics,
so let's throw them all into a magical statistics algorithm and push the publish button.
With our p-value threshold of 0.05 and a study with 10 samples in each group, how does increasing the number of variables affect our results?
import pandas as pd
import numpy as np
pd.set_option('display.precision', 2)
np.random.seed(2017)
def random_data_maker(rows, cols):
data_df = pd.DataFrame(
np.random.uniform(-1, 1, size=(rows, cols)),
columns=['Var_{:02d}'.format(c_col) for c_col in range(cols)])
data_df['Group'] = [1]*(rows-rows//2)+[2]*(rows//2)
return data_df
rand_df = random_data_maker(10, 5)
rand_df
| | Var_00 | Var_01 | Var_02 | Var_03 | Var_04 | Group |
|---|---|---|---|---|---|---|
| 0 | -0.96 | 0.53 | -0.10 | -0.76 | 0.86 | 1 |
| 1 | 0.30 | -0.72 | -0.54 | -0.55 | -0.48 | 1 |
| 2 | -0.77 | 0.26 | -0.23 | -0.37 | 0.26 | 1 |
| 3 | -0.41 | 0.89 | -0.70 | -0.85 | 0.41 | 1 |
| 4 | -0.86 | -0.39 | -0.34 | -0.38 | -0.12 | 1 |
| 5 | 0.53 | -0.05 | -0.99 | 0.40 | 0.26 | 2 |
| 6 | -0.94 | -0.83 | 0.41 | -0.09 | 0.41 | 2 |
| 7 | 0.86 | -0.18 | -0.92 | 0.24 | -0.28 | 2 |
| 8 | 0.84 | 0.83 | -0.46 | -0.39 | -0.97 | 2 |
| 9 | 0.08 | 0.34 | -0.09 | 0.07 | 0.82 | 2 |
from scipy.stats import ttest_ind
def show_significant(in_df, cut_off=0.05):
    # highlight p-values below the cutoff in yellow
    return in_df.sort_values('P-Value').style.apply(
        lambda x: ['background-color: yellow' if v < cut_off else '' for v in x])
def all_ttest(in_df):
    # run an independent two-sample t-test (Group 1 vs Group 2) for every variable
    return pd.DataFrame(
        {'P-Value': {c_col: ttest_ind(
            a=in_df[in_df['Group'] == 1][c_col],
            b=in_df[in_df['Group'] == 2][c_col]
        ).pvalue
            for c_col in
            in_df.columns if 'Group' not in c_col}})
show_significant(all_ttest(rand_df))
| | P-Value |
|---|---|
| Var_03 | 0.01 |
| Var_00 | 0.08 |
| Var_04 | 0.73 |
| Var_01 | 0.82 |
| Var_02 | 0.92 |
np.random.seed(2019)
show_significant(all_ttest(random_data_maker(150, 20)))
| | P-Value |
|---|---|
| Var_15 | 0.01 |
| Var_03 | 0.04 |
| Var_14 | 0.10 |
| Var_01 | 0.13 |
| Var_07 | 0.26 |
| Var_18 | 0.40 |
| Var_13 | 0.40 |
| Var_10 | 0.44 |
| Var_04 | 0.50 |
| Var_11 | 0.55 |
| Var_06 | 0.57 |
| Var_05 | 0.66 |
| Var_09 | 0.71 |
| Var_02 | 0.71 |
| Var_12 | 0.74 |
| Var_00 | 0.87 |
| Var_19 | 0.87 |
| Var_08 | 0.89 |
| Var_17 | 0.89 |
| Var_16 | 0.92 |
import seaborn as sns
from tqdm import notebook # progressbar
out_list = []
for n_vars in notebook.tqdm(range(1, 150, 10)):
for _ in range(50):
p_values = all_ttest(random_data_maker(100, n_vars)).values
out_list += [{'Variables in Study': n_vars,
'Significant Variables Found': np.sum(p_values < 0.05),
'raw_values': p_values}]
var_found_df = pd.DataFrame(out_list)
sns.swarmplot(data=var_found_df, x='Variables in Study', y='Significant Variables Found');
plt.figure(figsize=(12,7))
sns.boxplot(data=var_found_df,
x='Variables in Study', y='Significant Variables Found');
Using a simple correction factor (the number of tests performed), we can make the number of significant findings roughly constant again; this is the Bonferroni correction:

$$ p_{\textrm{cutoff}} = \frac{0.05}{\#\textrm{ of tests}} $$
var_found_df['Corrected Significant Count'] = var_found_df['raw_values'].map(lambda p_values:
np.sum(p_values<0.05/len(p_values)))
var_found_df.groupby('Variables in Study').agg('mean').reset_index().plot('Variables in Study', [
'Significant Variables Found',
'Corrected Significant Count'
]);
plt.title('Effect of significance correction');
So no harm done there, we just add this correction factor, right?
Well, what if we have exactly one variable whose mean is shifted by 1.0 standard deviations between the two groups,
in a dataset where we check $n$ variables?
table_df = random_data_maker(50, 10)
really_different_var = np.concatenate([
np.random.normal(loc=0, scale=1.0, size=(table_df.shape[0]//2)),
np.random.normal(loc=1, scale=1.0, size=(table_df.shape[0]//2))
])
table_df['Really Different Var'] = really_different_var
fig, ax1 = plt.subplots(1, 1, figsize=(10, 5))
ax1.hist(table_df.query('Group==1')['Really Different Var'], np.linspace(-5, 5, 20), label='Group 1')
ax1.hist(table_df.query('Group==2')['Really Different Var'], np.linspace(-5, 5, 20), label='Group 2', alpha=0.5);
ax1.legend();
out_p_value = []
for _ in range(200):
out_p_value += [ttest_ind(np.random.normal(loc=0, scale=1.0, size=(table_df.shape[0]//2)),
np.random.normal(loc=1, scale=1.0, size=(table_df.shape[0]//2))).pvalue]
fig, m_axs = plt.subplots(2, 3, figsize=(20, 10))
for c_ax, var_count in zip(m_axs.flatten(), np.linspace(1, 140, 9).astype(int)):
c_ax.hist(np.clip(np.array(out_p_value)*var_count, 0.01, 0.3), np.linspace(0.01, 0.3, 30))
c_ax.set_ylim(0, 100)
c_ax.set_title('p-value after multiple correction\n for {} variables'.format(var_count))
var_find_df = pd.DataFrame({'Variables': np.linspace(1, 100, 30).astype(int)})
var_find_df['Likelihood of Detecting Really Different Variable'] = var_find_df['Variables'].map(
lambda var_count: np.mean(np.array(out_p_value)*var_count<0.05)
)
fig, ax1 = plt.subplots(1, 1, figsize=(15, 5))
var_find_df.plot('Variables', 'Likelihood of Detecting Really Different Variable', ax=ax1)
ax1.set_ylabel('% Likelihood');
Borrowed from http://peekaboo-vision.blogspot.ch/2013/01/machine-learning-cheat-sheet-for-scikit.html
Basically all of these are ultimately functions which map inputs to outputs.
The most serious problem with machine learning and similar approaches is overfitting your model to your data. Particularly as models get increasingly complex (random forests, neural networks, deep learning, ...), it becomes more and more difficult to apply common sense, or even to understand exactly what a model is doing and why a given answer is produced.
magic_classifier = {}
# training
magic_classifier['Dog'] = 'Animal'
magic_classifier['Bob'] = 'Person'
magic_classifier['Fish'] = 'Animal'
Now use this classifier, on the training data it works really well
magic_classifier['Dog'] == 'Animal' # true, 1/1 so far!
magic_classifier['Bob'] == 'Person' # true, 2/2 still perfect!
magic_classifier['Fish'] == 'Animal' # true, 3/3, wow!
On new data it doesn't work at all; it doesn't even execute.
magic_classifier['Octopus'] == 'Animal' # exception?! but it was working so well
magic_classifier['Dan'] == 'Person' # exception?!
The above example appeared to be a perfect learner for mapping names to animals or people, but it just memorized the inputs and reproduced them at the output; it didn't actually learn anything, it just copied.
This is relevant for each of the categories of models above, but it is applied in a slightly different way depending on the group. The idea is to divide the dataset into groups called training and validation, or ideally training, validation, and testing.
The analysis is then performed on the training set, any free parameters are tuned on the validation set, and the final performance is reported on the held-out testing set.
Here we return to the iris data set and try to automatically classify flowers
from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = load_iris()
iris_df = pd.DataFrame(data['data'], columns=data['feature_names'])
iris_df['target'] = data['target_names'][data['target']]
iris_df.sample(3)
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 41 | 4.5 | 2.3 | 1.3 | 0.3 | setosa |
| 40 | 5.0 | 3.5 | 1.3 | 0.3 | setosa |
| 33 | 5.5 | 4.2 | 1.4 | 0.2 | setosa |
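A minimal sketch of fitting such a classifier with scikit-learn (the model choice and settings here are illustrative, not prescribed by the original material), using a held-out test set as discussed above:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
X = iris_df[data['feature_names']]
y = iris_df['target']
# hold out 25% of the flowers for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
print('Training accuracy: {:.1%}'.format(tree_clf.score(X_train, y_train)))
print('Testing accuracy:  {:.1%}'.format(tree_clf.score(X_test, y_test)))
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
plot_tree(tree_clf, feature_names=data['feature_names'],
          class_names=list(tree_clf.classes_), filled=True, ax=ax);
A large gap between training and testing accuracy would be exactly the overfitting problem described above.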
Given the complexity of the tree, we need to do some pruning.
With a quantitative approach, we can calculate the sensitivity associated
with each parameter and establish the relationship between the parameters we choose and the output metrics we measure.
from graphviz import Digraph
dot = Digraph()
dot.node('Raw images',color='limegreen'), dot.node('Gaussian filter', color='lightblue')
dot.node('sigma=0.5', color='gray',shape='box'), dot.node('3x3 Neighbors', color='gray',shape='box')
dot.node('Threshold', color='lightblue'), dot.node('100', color='gray',shape='box')
dot.node('Thickness analysis',color='hotpink'), dot.node('Shape analysis',color='hotpink')
dot.node('Input',color='limegreen'), dot.node('Functions', color='lightblue')
dot.node('Parameters', color='gray',shape='box'),dot.node('Output',color='hotpink')
dot.edge('Raw images', 'Gaussian filter'), dot.edge('sigma=0.5', 'Gaussian filter')
dot.edge('3x3 Neighbors', 'Gaussian filter'), dot.edge('Gaussian filter','Threshold')
dot.edge('Threshold', 'Thickness analysis'), dot.edge('Threshold', 'Shape analysis')
dot.edge('100','Threshold')
dot
The way we do this is usually a parameter sweep, which means re-running the full analysis many times while varying one parameter (for example the threshold) over a range of values and recording how each output metric changes.
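A minimal sketch of such a sweep (the synthetic image, filter setting, and threshold range below are invented for illustration): run the same pipeline repeatedly, varying only the threshold, and record the output metrics each time.
import numpy as np
import pandas as pd
from skimage.filters import gaussian
from skimage.measure import label, regionprops
# synthetic test image: three bright blobs on a noisy background (illustrative only)
rng = np.random.default_rng(42)
test_img = rng.normal(loc=0.1, scale=0.05, size=(128, 128))
yy, xx = np.ogrid[:128, :128]
for cx, cy in [(32, 32), (64, 96), (96, 48)]:
    test_img += 0.8 * np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * 8.0 ** 2))
smooth_img = gaussian(test_img, sigma=0.5)  # fixed Gaussian filter (sigma = 0.5)
sweep_rows = []
for thresh in np.linspace(0.1, 0.7, 25):  # the parameter being swept
    labeled = label(smooth_img > thresh)
    regions = regionprops(labeled)
    sweep_rows.append({'threshold': thresh,
                       'object_count': len(regions),
                       'mean_area': np.mean([r.area for r in regions]) if regions else 0})
sweep_df = pd.DataFrame(sweep_rows)
sweep_df.plot(x='threshold', y=['object_count', 'mean_area'], subplots=True);
Each row of `sweep_df` is one full run of the toy pipeline; the sensitivity calculation below summarizes how strongly each metric reacts to the threshold.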
Sensitivity is classically defined as the change in the output metric per unit change in the input parameter, i.e. the local slope $\partial\textrm{Metric}/\partial\textrm{Parameter}$.
Such a strict definition is not particularly useful for image processing, since its value depends on the units of both the metric and the parameter:
$\rightarrow$ for a threshold sweep of a volume measurement, the sensitivity becomes volume per intensity!
A more common approach is to estimate the variation in this parameter between images or within a single image (automatic threshold methods can be useful for this) and define the sensitivity based on this variation.
It is also common to normalize it with the mean value so the result is a percentage.
$$ S = \frac{\max(\textrm{Metric})-\min(\textrm{Metric})}{\textrm{avg}(\textrm{Metric})} $$

In a parameter-sweep graph this corresponds to the magnitude of the slope: the steeper the slope, the more the metric changes for a small change in the parameter.
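Continuing the hypothetical sweep above, a small helper (a sketch that assumes the `sweep_df` built in the previous cell) computes this normalized sensitivity for each metric:
def sweep_sensitivity(metric_values):
    """Normalized sensitivity: (max - min) / mean of a metric over the sweep, in percent."""
    metric_values = np.asarray(metric_values, dtype=float)
    return 100 * (metric_values.max() - metric_values.min()) / metric_values.mean()
for metric in ['object_count', 'mean_area']:
    print('{:>12s} sensitivity over the threshold sweep: {:5.1f}%'.format(
        metric, sweep_sensitivity(sweep_df[metric])))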
Reproducibility is a very broad topic with plenty of sub-areas and deeper meanings. We mean two things by reproducibility:
- Reproducible analysis: the process of going from images to numbers is documented in a clear manner so that anyone, anywhere could follow it and get exactly (within some tolerance) the same numbers from your samples.
- Reproducible measurement: everything required for a reproducible analysis, plus the requirement that taking the measurement several times (noise and exact alignment vary each time) does not change the statistics significantly.
Since we will need to perform the same analysis many times to understand how reproducible it is, the analysis should be scriptable so that it can be run directly from the command line.
#!/usr/bin/env python
import sys
from myAnalysis import analysisScript  # some analysis script you implemented
imageFile = sys.argv[1]  # file name from the command line (argv[0] is the script itself)
threshold = 130
analysisScript(fname=imageFile, threshold=threshold)
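A slightly more robust variant (again a sketch; `myAnalysis.analysisScript` is the same placeholder as above) passes the threshold on the command line too, so the parameter value is recorded in the call itself:
#!/usr/bin/env python
import argparse
from myAnalysis import analysisScript  # some analysis script you implemented
parser = argparse.ArgumentParser(description='Run the standard shape analysis on one image')
parser.add_argument('image_file', help='path to the input image')
parser.add_argument('--threshold', type=int, default=130, help='segmentation threshold')
args = parser.parse_args()
analysisScript(fname=args.image_file, threshold=args.threshold)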
IMAGEFILE=$1
THRESHOLD=130
matlab -r "inImage=$IMAGEFILE; threshImage=inImage>$THRESHOLD; analysisScript;"
java -jar ij.jar -macro TestMacro.ijm blobs.tif
Rscript -e "library(plyr);..."
In computer programming, unit testing is a method by which individual units of source code, sets of one or more computer program modules together with associated control data, usage procedures, and operating procedures, are tested to determine if they are fit for use.
The first requirement for unit testing to work well is to have your code divided up into small independent parts (functions)
Ideally with realistic but simulated test data
Given the following function
function shapeTable=shapeAnalysis(inImage)
We should decompose the function into sub-components with single tasks:
from graphviz import Digraph
dot = Digraph()
dot.edge('shapeAnalysis(inImage)', 'componentLabel(inImage)'), dot.edge('shapeAnalysis(inImage)', 'analyzeObject(inObject)')
dot.edge('analyzeObject(inObject)','countVoxs(inObject)'), dot.edge('analyzeObject(inObject)','calculateCOV(inObject)')
dot.edge('analyzeObject(inObject)','calcShapeT(covMat)'), dot.edge('analyzeObject(inObject)','calcOrientation(shapeT)')
dot.edge('analyzeObject(inObject)','calcAnisotropy(shapeT)')
dot
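For example (a sketch, where `count_voxs` stands in for the `countVoxs(inObject)` node above), each small function gets its own test on simple simulated data where the correct answer is known:
import numpy as np
def count_voxs(in_object):
    """Count the voxels belonging to a (boolean) object mask."""
    return int(np.sum(in_object > 0))
def test_count_voxs():
    # a 3x3x3 cube inside a 5x5x5 volume has exactly 27 voxels
    test_object = np.zeros((5, 5, 5), dtype=bool)
    test_object[1:4, 1:4, 1:4] = True
    assert count_voxs(test_object) == 27
    assert count_voxs(np.zeros((5, 5, 5), dtype=bool)) == 0
test_count_voxs()  # with pytest this would be discovered and run automatically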
https://github.com/scikit-image/scikit-image/tree/master/skimage
Test Connected Components https://github.com/scikit-image/scikit-image/blob/16d3fd07e7d882d7f6b74e8dc4028ff946ac7e63/skimage/morphology/tests/test_ccomp.py#L13
# excerpt from the scikit-image watershed tests; `watershed`, `diff`, and `eps`
# are imported/defined elsewhere in the same test module
class TestWatershed(unittest.TestCase):
    eight = np.ones((3, 3), bool)

    def test_watershed01(self):
        "watershed 1"
        data = np.array([[0, 0, 0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0, 0, 0],
                         [0, 1, 1, 1, 1, 1, 0],
                         [0, 1, 0, 0, 0, 1, 0],
                         [0, 1, 0, 0, 0, 1, 0],
                         [0, 1, 0, 0, 0, 1, 0],
                         [0, 1, 1, 1, 1, 1, 0],
                         [0, 0, 0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0, 0, 0]], np.uint8)
        markers = np.array([[-1, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 1, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0, 0, 0]],
                           np.int8)
        out = watershed(data, markers, self.eight)
        expected = np.array([[-1, -1, -1, -1, -1, -1, -1],
                             [-1, -1, -1, -1, -1, -1, -1],
                             [-1, -1, -1, -1, -1, -1, -1],
                             [-1, 1, 1, 1, 1, 1, -1],
                             [-1, 1, 1, 1, 1, 1, -1],
                             [-1, 1, 1, 1, 1, 1, -1],
                             [-1, 1, 1, 1, 1, 1, -1],
                             [-1, 1, 1, 1, 1, 1, -1],
                             [-1, -1, -1, -1, -1, -1, -1],
                             [-1, -1, -1, -1, -1, -1, -1]])
        error = diff(expected, out)
        assert error < eps
Keep the tests in the code itself: https://github.com/scikit-image/scikit-image/blob/16d3fd07e7d882d7f6b74e8dc4028ff946ac7e63/skimage/filters/thresholding.py#L886
def apply_hysteresis_threshold(image, low, high):
    """Apply hysteresis thresholding to `image`.

    This algorithm finds regions where `image` is greater than `high`
    OR `image` is greater than `low` *and* that region is connected to
    a region greater than `high`.

    Parameters
    ----------
    image : array, shape (M,[ N, ..., P])
        Grayscale input image.
    low : float, or array of same shape as `image`
        Lower threshold.
    high : float, or array of same shape as `image`
        Higher threshold.

    Returns
    -------
    thresholded : array of bool, same shape as `image`
        Array in which `True` indicates the locations where `image`
        was above the hysteresis threshold.

    Examples
    --------
    >>> image = np.array([1, 2, 3, 2, 1, 2, 1, 3, 2])
    >>> apply_hysteresis_threshold(image, 1.5, 2.5).astype(int)
    array([0, 1, 1, 1, 0, 0, 0, 1, 1])

    References
    ----------
    .. [1] J. Canny. A computational approach to edge detection.
           IEEE Transactions on Pattern Analysis and Machine Intelligence.
           1986; vol. 8, pp.679-698.
           DOI: 10.1109/TPAMI.1986.4767851
    """
    low = np.clip(low, a_min=None, a_max=high)  # ensure low always below high
    mask_low = image > low
    mask_high = image > high
    # ... (the rest of the function, which connects the two masks, is omitted in this excerpt)
Working primarily in notebooks makes regular testing more difficult, but not impossible. If we employ a few simple tricks, we can use doctests seamlessly inside of Jupyter: we write what in Python is called a decorator to set this up.
import doctest
import copy
import functools
def autotest(func):
    # run the examples in `func`'s docstring as doctests immediately when it is defined
    globs = copy.copy(globals())
    globs.update({func.__name__: func})
    doctest.run_docstring_examples(
        func, globs, verbose=True, name=func.__name__)
    return func
@autotest
def add_5(x):
"""
Function adds 5
>>> add_5(5)
10
"""
return x+5
Finding tests in add_5
Trying:
    add_5(5)
Expecting:
    10
ok
from skimage.measure import label
import numpy as np
@autotest
def simple_label(x):
"""
Label an image
>>> test_img = np.eye(3)
>>> test_img
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
>>> simple_label(test_img)
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1]])
>>> test_img[1,1] = 0
>>> simple_label(test_img)
array([[1, 0, 0],
[0, 0, 0],
[0, 0, 2]])
"""
return label(x)
Finding tests in simple_label
Trying:
    test_img = np.eye(3)
Expecting nothing
ok
Trying:
    test_img
Expecting:
    array([[1., 0., 0.],
           [0., 1., 0.],
           [0., 0., 1.]])
ok
Trying:
    simple_label(test_img)
Expecting:
    array([[1, 0, 0],
           [0, 1, 0],
           [0, 0, 1]])
**********************************************************************
File "__main__", line 12, in simple_label
Failed example:
    simple_label(test_img)
Expected:
    array([[1, 0, 0],
           [0, 1, 0],
           [0, 0, 1]])
Got:
    array([[1, 0, 0],
           [0, 1, 0],
           [0, 0, 1]], dtype=int64)
Trying:
    test_img[1,1] = 0
Expecting nothing
ok
Trying:
    simple_label(test_img)
Expecting:
    array([[1, 0, 0],
           [0, 0, 0],
           [0, 0, 2]])
**********************************************************************
File "__main__", line 17, in simple_label
Failed example:
    simple_label(test_img)
Expected:
    array([[1, 0, 0],
           [0, 0, 0],
           [0, 0, 2]])
Got:
    array([[1, 0, 0],
           [0, 0, 0],
           [0, 0, 2]], dtype=int64)
https://www.mathworks.com/help/matlab/matlab-unit-test-framework.html
Test-driven development is a style or approach to programming where the tests are written before the functional code, like very concrete specifications. It makes it easy to estimate how much time is left, since you can automatically see how many of the tests have been passed, and you and your collaborators are clear on the utility of the system.
Continuous integration is the process of running tests automatically every time changes are made.
This is possible to set up inside of many IDEs and is offered as a commercial service from companies like CircleCI and Travis.
We use it for the QBI course to make sure all of the code in the slides is correct.
Projects like scikit-image use it to ensure that changes do not break existing code, without requiring manual checks.
One of the biggest problems with big science is trying to visualize a lot of heterogeneous data.
Crameri, F., Shephard, G.E. & Heron, P.J. The misuse of colour in science communication. Nat Commun 11, 5444 (2020). https://doi.org/10.1038/s41467-020-19160-7
You visualize your data for different reasons:
from Knaflic 2015
There are too many graphs which say:
Plots to "show the results" or "get a feeling" are usually not good
from plotnine import *
from plotnine.data import *
import numpy as np
import pandas as pd
# Some data
xd = np.random.rand(80)
yd = xd + np.random.rand(80)
zd = np.random.rand(80)
df = pd.DataFrame(dict(x=xd,y=yd,z=zd))
ggplot(df,aes(x='x',y='y')) + geom_point()
"X is a little bit correlated with Y"
(ggplot(df,aes(x='x',y='y'))
+ geom_point()
+ geom_smooth(method="lm")
# + coord_equal()
+ labs(title="X is weakly correlated with Y")
+ theme_bw(20) )
Too much data makes it very difficult to derive a clear message
xd = np.random.rand(5000)
yd = (xd-0.5)*np.random.rand(5000)
df = pd.DataFrame(dict(x=xd,y=yd))
(ggplot(df,aes(x='x',y='y'))
 + geom_point()
 + coord_equal()
 + theme_bw(20))
Filter and reduce information until it is extremely simple
(ggplot(df,aes(x='x',y='y'))
+ stat_bin2d(bins=40)
+ geom_smooth(method="lm",color='red')
+ coord_equal()
+ theme_bw(20)
+ guides(color=False)  # hide the colour guide
)
(ggplot(df,aes(x='x',y='y'))
+ geom_density_2d(aes(x='x', y='y', color='..level..'))
+ geom_smooth(method="lm")
+ coord_equal()
+ labs(color="Type")
+ theme_bw(15)
)
If we develop a consistent way of
we can compose and decompose graphics easily
The most important modern work in graphical grammars is "The Grammar of Graphics" by Wilkinson, Anand, and Grossman (2005).
This work built on earlier work by Bertin (1983) and proposed a grammar that can be used to describe and construct a wide range of statistical graphics.
Normally we think of plots in terms of some sort of data which is fed into a plot command that produces a picture
plot(xdata,ydata,color/shape)
Separate the graph into its component parts
Data mapping | Points | Axes/Coordinate system | Labels/annotation |
---|---|---|---|
$var1 \rightarrow x$, $var2 \rightarrow y$ | ![]() |
![]() |
![]() |
Construct graphics by focusing on each portion independently.
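A small plotnine sketch of this composition, reusing the random `df` from above: the data mapping, points, coordinate system, and labels from the table are added one layer at a time.
base = ggplot(df, aes(x='x', y='y'))      # data mapping: x -> x axis, y -> y axis
layers = base + geom_point(alpha=0.1)     # points
layers = layers + coord_equal()           # axes / coordinate system
layers = layers + labs(title='Composed layer by layer',
                       x='Variable 1', y='Variable 2') + theme_bw(15)  # labels / annotation
layers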
I am not a statistician and this is not a statistics course
If you have questions or concerns
Both ETHZ and Uni Zurich offer free consultation with real statisticians
They are rarely bearers of good news - you always need more data...
Simulations (even simple ones) are very helpful
Try and understand the tests you are performing
# Old parameter sweep figure, needs translation from R
import sys
sys.path.append("../common/scripts/")
import shapeAnalysisProcess
import commonReportFunctions
# read and correct the coordinate system
# NOTE: the bodies below (and the plotting cells that follow) are still largely
# untranslated R (strsplit, cbind, Sys.glob, ldply, ...) kept here for reference
def threshfun(x):
    pth = rev(strsplit(x, "/")[[1]])[2]
    t = strsplit(pth, "_")[[1]][3]
    # as.numeric(substring(t, 2, nchar(t)))

def readfcn(x):
    cbind(compare.foam.corrected(x, checkProj=F), thresh=thresh.fun(x))

# Where are the csv files located
rootDir = "../common/data/mcastudy"
clpor.files = Sys.glob(paste(rootDir, "/a*/lacun_0.csv", sep="/"))  # list all of the files
data = ldply(clpor.files, readfcn, .parallel=T)  # read in all of the files
df = pd.DataFrame(data)
return df;
lacun=readfcn(10)
(ggplot(lacun,aes(y=VOLUME*1e9,x=thresh))
+ geom_jitter(alpha=0.1)+geom_smooth()
+ theme_bw(24)
+ labs(y="Volume (um3)",x="Threshold Value",color="Threshold")
+ ylim(0,1000))
(ggplot(subset(lacun,thresh %% 1000==0),aes(y=VOLUME*1e9,x=as.factor(thresh)))
+ geom_violin()
+ theme_bw(24)
+ labs(y="Volume (um3)",x="Threshold Value",color="Threshold")
+ ylim(0,1000))
(ggplot(lacun,aes(y=PCA1_Z,x=thresh))
 + geom_jitter(alpha=0.1)+geom_smooth()
 + theme_bw(24)
 + labs(y="Orientation",x="Threshold Value",color="Threshold"))
In this graph the sensitivity is the magnitude of the slope: the steeper the slope, the more the metric changes for a small change in the parameter.
poresum<-function(all.data) ddply(all.data,.(thresh),function(c.sample) {
data.frame(Count=nrow(c.sample),
Volume=mean(c.sample$VOLUME*1e9),
Stretch=mean(c.sample$AISO),
Oblateness=mean(c.sample$OBLATENESS),
#Lacuna_Density_mm=1/mean(c.sample$DENSITY_CNT),
Length=mean(c.sample$PROJ_PCA1*1000),
Width=mean(c.sample$PROJ_PCA2*1000),
Height=mean(c.sample$PROJ_PCA3*1000),
Orientation=mean(abs(c.sample$PCA1_Z)))
})
comb.summary<-cbind(poresum(all.lacun),Phase="Lacuna")
splot<-ggplot(comb.summary,aes(x=thresh))
splot+geom_line(aes(y=Count))+geom_point(aes(y=Count))+scale_y_log10()+
theme_bw(24)+labs(y="Object Count",x="Threshold",color="Phase")
Comparing the different variables, we see that the threshold range where the count sensitivity is best (lowest) is where the volume and anisotropy sensitivities are highest.
calc.sens<-function(in.df) {
data.frame(sens.cnt=100*with(in.df,(max(Count)-min(Count))/mean(Count)),
sens.vol=100*with(in.df,(max(Volume)-min(Volume))/mean(Volume)),
sens.stretch=100*with(in.df,(max(Stretch)-min(Stretch))/mean(Stretch))
)
}
sens.summary<-ddply.cutcols(comb.summary,.(cut_interval(thresh,5)),calc.sens)
ggplot(sens.summary,aes(x=thresh))+
geom_line(aes(y=sens.cnt,color="Count"))+
geom_line(aes(y=sens.vol,color="Volume"))+
geom_line(aes(y=sens.stretch,color="Anisotropy"))+
labs(x="Threshold",y="Sensitivity (%)",color="Metric")+
theme_bw(20)