This notebook is intended for live tutorial sessions about TriScale.
Here is the self-study version.
To get started, we need to import a few Python modules. All the TriScale-specific functions are part of one module called triscale.
import os
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import triscale
TriScale's experiment_sizing() function computes the minimal number of samples required to estimate any percentile with any confidence level.
percentile = 50 # the median
confidence = 95 # the confidence level, in %
triscale.experiment_sizing(
    percentile,
    confidence,
    verbose=True);
We can change the values in the cell above to see how the number of samples evolves with a larger confidence level or more extreme percentiles.
Note that the probability distributions are symmetric: it takes the same number of samples to compute a lower bound for the $p$-th percentile as to compute an upper bound for the $(100-p)$-th percentile.
percentile = 20
confidence = 95 # the confidence level, in %
if (triscale.experiment_sizing(percentile, confidence) ==
    triscale.experiment_sizing(100-percentile, confidence)):
    print("It takes the same number of samples to estimate "
          "the {}-th and the {}-th percentiles.".format(percentile, 100-percentile))
# Sets of percentiles and confidence levels to try
percentiles = [0.1, 1, 5, 10, 25, 50, 75, 90, 95, 99, 99.9]
confidences = [75, 90, 95, 99, 99.9, 99.99]
# Computing the minimum number of runs for each (perc., conf.) pair
min_number_samples = []
for c in confidences:
    tmp = []
    for p in percentiles:
        N = triscale.experiment_sizing(p, c)
        tmp.append(N[0])
    min_number_samples.append(tmp)
# Put the results in a DataFrame for convenient display
df = pd.DataFrame(columns=percentiles, data=min_number_samples)
df['Confidence level'] = confidences
df.set_index('Confidence level', inplace=True)
display(df)
Let's visualize the same data with a heatmap...
colorbar = dict(
    title='Minimal N',
    tickvals=[0, 1, 2, 3, 3.699, 4],
    ticktext=['1', '10', '100', '1000', '5000', '10000']
)
fig = go.Figure(data=go.Heatmap(
    z=np.log10(df),
    y=df.index,
    x=df.columns,
    colorbar=colorbar,
    hovertemplate='N: 10^%{z}<br>percentile: %{x}<br>confidence: %{y}',
    )
)
fig.update_layout(
    title_text='Minimal number of samples',
    xaxis=dict(title='Percentile'),
    yaxis=dict(title='Confidence level')
)
fig.show()
Takeaway. The required number of samples increases exponentially as the percentile to estimate becomes more extreme. The increase induced by the confidence level is not as dramatic.
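We can relate this takeaway to the underlying order statistics. As a minimal closed-form sketch (assuming the one-sided binomial bound that we believe underlies TriScale's sizing; min_samples_closed_form is our own illustrative helper, not part of the library), the smallest of $N$ samples is a valid lower bound for the $p$-th percentile with confidence $C$ as soon as $1-(1-p)^N \geq C$, that is, $N \geq \log(1-C)/\log(1-p)$:
import math

# Illustrative sketch (our assumption, not TriScale's code): smallest N such
# that the minimum of N samples is a lower bound for the p-th percentile
# with confidence C. Here p and C are fractions, not percentages.
def min_samples_closed_form(p, C):
    return math.ceil(math.log(1 - C) / math.log(1 - p))

print(min_samples_closed_form(0.5, 0.95))  # 5, matching experiment_sizing(50, 95)
The denominator $\log(1-p)$ tends to $0$ for extreme percentiles, which drives $N$ up quickly, whereas the confidence only enters through the numerator $\log(1-C)$, which grows slowly; this matches the takeaway above.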
By default, experiment_sizing() returns the minimal number of samples such that the smallest sample serves as the percentile estimate (or the largest sample, if the percentile is $> 50$).
If the experiment is subject to outliers, or more generally to obtain tighter bounds, one may want to collect more samples. But how many? You can use the robustness argument to find out:
percentile = 10
confidence = 99
triscale.experiment_sizing(
    percentile,
    confidence,
    robustness=3,
    verbose=True);
Note. The robustness argument refers to the number of outliers that can be excluded from the confidence interval.
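The same binomial argument extends to nonzero robustness. As a sketch (again an assumption about the criterion behind experiment_sizing(), not the library's actual code), with robustness $r$ the lower bound becomes the $(r+1)$-th smallest sample, which is valid with confidence $C$ as soon as $\sum_{k=0}^{r} \binom{N}{k} p^k (1-p)^{N-k} \leq 1-C$:
from scipy.stats import binom

# Illustrative sketch (our assumption): smallest N such that the (r+1)-th
# smallest of N samples is a lower bound for the p-th percentile with
# confidence C. For r=0, the condition reduces to (1-p)^N <= 1-C, as above.
def min_samples(p, C, r=0):
    N = r + 1
    while binom.cdf(r, N, p) > 1 - C:
        N += 1
    return N

print(min_samples(0.10, 0.99, r=3))  # should match experiment_sizing(10, 99, robustness=3)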
Again, we can plot the minimal value of $N$ as robustness increases; for example, for a few percentiles with a 95% confidence level:
robustness_values = np.arange(1,10,dtype=int)
confidence = 95
N_50 = [triscale.experiment_sizing(50, confidence, robustness=int(x))[0] for x in robustness_values]
N_75 = [triscale.experiment_sizing(75, confidence, robustness=int(x))[0] for x in robustness_values]
N_90 = [triscale.experiment_sizing(90, confidence, robustness=int(x))[0] for x in robustness_values]
fig = go.Figure()
fig.add_trace(go.Scatter(x=robustness_values, y=N_90, name='90th'))
fig.add_trace(go.Scatter(x=robustness_values, y=N_75, name='75th'))
fig.add_trace(go.Scatter(x=robustness_values, y=N_50, name='median'))
fig.update_layout(
    title_text='Minimal number of samples for a 95% confidence level',
    xaxis=dict(title='robustness'),
    yaxis=dict(title='Minimal N')
)
fig.show()
Takeaway. The increase in the number of samples required with respect to the robustness parameter is essentially linear.
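As a quick sanity check of this claim, the successive increments in the lists computed above are (nearly) constant:
print(np.diff(N_50))  # near-constant steps: the growth is roughly linear
print(np.diff(N_90))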
Based on the explanations above, use TriScale's experiment_sizing() function to answer the following questions:
How many runs are required to estimate the 90th percentile with 90% confidence? With 95% confidence?
How many runs are required to estimate the 95th percentile with 90% confidence?
Optional question (harder): with $N=50$ runs, what is the largest robustness one can afford for a lower bound on the 25th percentile with 95% confidence?
########## YOUR CODE HERE ###########
# ...
#####################################
>>> print(triscale.experiment_sizing(90,90)[0])
22
>>> print(triscale.experiment_sizing(90,95)[0])
29
>>> print(triscale.experiment_sizing(95,90)[0])
45
We observe that it "costs" many more runs to estimate a more extreme percentile (95th instead of 90th) than to increase the confidence level (90% to 95%). This observation holds in general: the number of runs required increases exponentially as the percentiles get more extreme (close to $0$ or to $100$).
For the last question, we must play with the robustness parameter. We can write a simple loop that increases its value until the number of runs required exceeds 50.
>>> r = 0
>>> while (triscale.experiment_sizing(25, 95, r)[0] <= 50):
...     r += 1
>>> print(r-1)
7
We can exclude the 7 "worst" samples from the confidence interval. Hence, with $N=50$ samples, the best lower bound for the 25th percentile with 95% confidence is $x_8$ (assuming the first sample is $x_1$).
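To double-check that $r=7$ is indeed the tipping point, we can verify that the required $N$ is still at most 50 for robustness 7 but exceeds 50 for robustness 8:
>>> print(triscale.experiment_sizing(25, 95, 7)[0] <= 50 < triscale.experiment_sizing(25, 95, 8)[0])
True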
Next step: Data Analysis
Back to repo