This notebook presents a scalability analysis of TriScale. We focus on the execution time of the computations; that is, the production of outputs other than the numerical results (i.e., textual logs, plots, etc.) is excluded from this evaluation. The results presented below have been obtained on a simple laptop.
The evaluation results show that TriScale's data analysis scales very well with the input size; the data analysis time is practically negligible compared to the data collection time. In particular, we show that:

- computing metrics (analysis_metric()) generally scales with the complexity of the metric computation;
- computing KPIs (analysis_kpi()) and variability scores (analysis_variability()) generally scales linearly with the number of samples;
- computing the minimal number of runs/series required (implemented by experiment_sizing()) is generally independent of the percentile and confidence level, and completes within less than 3 µs 95% of the time.
import os
from pathlib import Path
import random
import timeit
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
import triscale
analysis_metric

The following cell generates random data and saves it to csv files. This can be used to reproduce, in a controlled way, the timing evaluation of the analysis_metric() function when taking a file as input.
This evaluation of analysis_metric() uses DataFrames as inputs. One can verify that using DataFrames or csv files as inputs has barely any impact on the execution time: either way, the input data is loaded into a DataFrame at the start of analysis_metric().
# Generation of synthetic data

## Create a random DataFrame
rand_x = [random.random()*100 for i in range(1000000)]
rand_y = [random.random()*100 for i in range(1000000)]
rand_df = pd.DataFrame(columns=['x','y'])
rand_df['x'] = np.sort(rand_x)
rand_df['y'] = rand_y

## Save chunks of it as csv files
file_path = Path('ExampleTraces/Scalability')
if not os.path.exists(file_path):
    os.makedirs(file_path)

sample_sizes = [20,50,100,150,200,300,400,500,1000,5000,10000,200000,400000,600000,800000,1000000]
for sample_size in sample_sizes:
    file_name = 'synthetic_%i_samples.csv' % sample_size
    df_chunk = rand_df[:sample_size]
    df_chunk.to_csv(str(file_path/file_name), index=False)
print('Done.')
Done.
verbose = False

columns = ['Sample size [#]', 'Execution time [ms]']
df = pd.DataFrame(columns=columns)

for sample_size in sample_sizes:

    setup = '''
import random
import numpy as np
import pandas as pd
import triscale
convergence = {'expected': True}
metric = {'measure':int(random.random()*50)+1}
rand_x = [random.random()*100 for i in range(1000000)]
rand_y = [random.random()*100 for i in range(1000000)]
rand_df = pd.DataFrame(columns=['x','y'])
rand_df['x'] = np.sort(rand_x)
rand_df['y'] = rand_y
df = rand_df[:%i]
''' % sample_size

    if verbose:
        print('Sample size\t', sample_size)

    nb_tries = 10
    time_vec = (timeit.Timer('triscale.analysis_metric(df, metric, convergence=convergence)',
                             setup=setup).repeat(10, nb_tries))
    time_vec = np.array(time_vec)
    time_vec = (time_vec/nb_tries)*1000 # Convert to ms

    sample_size_vec = np.ones(nb_tries)*sample_size
    df_new = pd.DataFrame(sample_size_vec, columns=[columns[0]])
    df_new[columns[1]] = np.array(time_vec)
    df = pd.concat([df, df_new], sort=False)

    if verbose:
        print('Done.')
        print('------------------')

fig = px.scatter(df, x=columns[0], y=columns[1])
fig.show()
# Zooming in on the small sample sizes...
fig = px.scatter(df, x=columns[0], y=columns[1], range_x=[0,1010], range_y=[0,30])
fig.show()
The data shows two modes in the execution time of the analysis_metric()
function: a step increase, followed by a slow linear increase. This can be easily explained.
The computationally expensive part of analysis_metric()
is the convergence test, which includes a Theil-Sen regression [ref]. This regressor computes the slopes between all pairs of points and returns the median slope value; thus, it scales as O(n²) with the number of points.
However, TriScale does not perform the regression on the input data directly. Instead, TriScale divides the input data into chunks. For each chunk, the metric is computed, eventually creating a new data series of metric values. The purpose of the convergence test is to verify that these metric values have converged; thus, TriScale executes the Theil-Sen regressor on this new data series.
The Theil-Sen regressor does not require many samples to produce a reliable result; a few tens of data points are often considered sufficient. Thus, we can cap the size of the metric data series (TriScale caps it to 100 values), which bounds the execution time of the Theil-Sen regressor. Ultimately, this allows analysis_metric()
to scale very well with the sample size.
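To make the O(n²) behavior and the capping idea concrete, here is a minimal sketch (an illustration of the principle only; TriScale's actual implementation may differ):

```python
import itertools
import numpy as np

def theil_sen_slope(x, y):
    """Median of the slopes over all point pairs: O(n^2) pairs."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in itertools.combinations(range(len(x)), 2)]
    return float(np.median(slopes))

# Capping the metric series bounds the regression cost
# (a sketch of the idea, not TriScale's actual subsampling code):
metric_series = np.random.default_rng(0).random(5000)
if len(metric_series) > 100:
    idx = np.linspace(0, len(metric_series) - 1, 100).astype(int)
    metric_series = metric_series[idx]  # at most 100*99/2 = 4950 pairs
```

With the series capped at 100 values, the regression cost is constant regardless of how many raw samples the experiment produced.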
The linear increase for large numbers of raw samples is due to the computation of the metric on increasingly large chunks. The more complex the metric, the longer the execution time.
In this evaluation, a percentile is used as the metric, which is computed efficiently by numpy.
- analysis_metric() is fast (less than 20 ms) for small numbers of samples (less than 200);
- analysis_metric() scales with the same complexity as the metric computation. In this example, the metric is a simple percentile, which appears to scale linearly when computed with numpy.
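The scaling of the percentile computation can be checked independently of TriScale by timing np.percentile directly (a rough, machine-dependent sketch; absolute numbers will differ):

```python
import timeit
import numpy as np

# Time np.percentile on growing inputs to observe the (roughly linear)
# growth of the metric-computation cost.
rng = np.random.default_rng(42)
times_ms = []
for n in (10_000, 100_000, 1_000_000):
    data = rng.random(n)
    t = timeit.timeit(lambda: np.percentile(data, 95), number=20) / 20
    times_ms.append(t * 1e3)
    print(f"{n:>9} samples: {t * 1e3:.3f} ms")
```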
Overall, in our experiments (running on a standard PC), the execution time of analysis_metric() remains small compared to the time required to collect the data in the first place. The data collection time depends on the networking experiment, but it is unlikely that any experiment would produce much more than a million (useful) data points per second.
Thus, we conclude that the computation time of analysis_metric() is negligible for networking experiments.
Note: analysis_metric() takes either a DataFrame or a csv file as input. This has barely any impact on the execution time: either way, the input data is loaded into a DataFrame at the start of analysis_metric().
analysis_kpi

verbose = False
columns = ['Sample size [#]', 'Execution time [ms]']
df = pd.DataFrame(columns=columns)

for sample_size in [10,20,30,40,50,100,200,300,400,600,800,1000]:

    setup = '''
import random
import triscale
sample_length = %i
percentile = int(random.random()*50)+1
confidence = int(random.random()*25)+75
KPI = { 'percentile': percentile,
        'confidence': confidence,
        'bounds'    : [0,0.001],
        'bound'     : 'lower'}
rand_data = [random.random() for i in range(sample_length)]
''' % sample_size

    if verbose:
        print('Sample size\t', sample_size)

    nb_tries = 10
    time_vec = (timeit.Timer('triscale.analysis_kpi(rand_data, KPI)', setup=setup).repeat(10, nb_tries))
    time_vec = np.array(time_vec)
    time_vec = (time_vec/nb_tries)*1000 # Convert to ms

    sample_size_vec = np.ones(nb_tries)*sample_size
    df_new = pd.DataFrame(sample_size_vec, columns=[columns[0]])
    df_new[columns[1]] = np.array(time_vec)
    df = pd.concat([df, df_new], sort=False)

    if verbose:
        print('Done.')
        print('------------------')

fig = px.scatter(df, x=columns[0], y=columns[1])
fig.show()
The data shows a clear linear correlation between the sample size and the execution time of the analysis_kpi()
function, which is not surprising: most computations are related to the determination of the confidence interval using Thompson's method, which is an iterative process through the ordered data samples (see ThompsonCI()
docstring for details).
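The order-statistic argument behind such a confidence interval can be sketched as follows (an illustration of the principle only, not TriScale's ThompsonCI() code; here the confidence level is given as a fraction):

```python
import math

def binom_cdf(k, n, p):
    """P(Binom(n, p) <= k), computed directly from the binomial pmf."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def lower_bound_index(n, p, confidence):
    """Largest 1-based rank k such that the k-th smallest of n i.i.d.
    samples is a valid one-sided lower bound for the p-th percentile:
    P(x_(k) <= true percentile) = 1 - BinomCDF(k-1; n, p) >= confidence.
    Returns None when no rank qualifies (sample too small)."""
    for k in range(n, 0, -1):  # walk down through the ordered samples
        if 1 - binom_cdf(k - 1, n, p) >= confidence:
            return k
    return None

print(lower_bound_index(5, 0.5, 0.95))  # -> 1: the minimum of 5 samples
                                        # lower-bounds the median at 95%
```

Since the search walks through the ordered samples once, a linear dependence of the execution time on the sample size is exactly what one would expect.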
analysis_kpi()
scales linearly with the number of samples.
In our experiments (running on a standard PC), the computation remains fast even for the largest sample size we tested (1000 samples).
Thus, we conclude that the computation time of analysis_kpi() is negligible for networking experiments.
analysis_variability

verbose = False
columns = ['Sample size [#]', 'Execution time [ms]']
df = pd.DataFrame(columns=columns)

for sample_size in [10,20,30,40,50,100,200,300,400,600,800,1000]:

    setup = '''
import random
import triscale
sample_length = %i
percentile = int(random.random()*50)+1
confidence = int(random.random()*25)+75
score = { 'percentile': percentile,
          'confidence': confidence}
rand_data = [random.random() for i in range(sample_length)]
''' % sample_size

    if verbose:
        print('Sample size\t', sample_size)

    nb_tries = 10
    time_vec = (timeit.Timer('triscale.analysis_variability(rand_data, score)', setup=setup).repeat(10, nb_tries))
    time_vec = np.array(time_vec)
    time_vec = (time_vec/nb_tries)*1000 # Convert to ms

    sample_size_vec = np.ones(nb_tries)*sample_size
    df_new = pd.DataFrame(sample_size_vec, columns=[columns[0]])
    df_new[columns[1]] = np.array(time_vec)
    df = pd.concat([df, df_new], sort=False)

    if verbose:
        print('Done.')
        print('------------------')

fig = px.scatter(df, x=columns[0], y=columns[1])
fig.show()
The data shows a clear linear correlation between the sample size and the execution time of the analysis_variability()
function, which is not surprising: most computations are related to the determination of the confidence interval using Thompson's method, which is an iterative process through the ordered data samples (see ThompsonCI()
docstring for details).
Unsurprisingly, the data is very similar to that of analysis_kpi(): analysis_kpi()
and analysis_variability()
perform the same computations; they differ only in the generation of outputs (logs and plots). Since the outputs are not considered in this scalability evaluation, we obtain very similar results for both functions.
analysis_variability()
scales linearly with the number of samples.
In our experiments (running on a standard PC), the computation remains fast even for the largest sample size we tested (1000 samples).
Thus, we conclude that the computation time of analysis_variability() is negligible for networking experiments.
experiment_sizing

setup = '''
import random
import triscale
percentile = int(random.random()*50)+1
confidence = int(random.random()*25)+75
'''
nb_tries = 10
time_vec = (timeit.Timer('triscale.experiment_sizing(percentile,confidence)', setup=setup).repeat(25000, nb_tries))
time_vec = np.array(time_vec)
time_vec /= nb_tries
data = pd.DataFrame(time_vec, columns=['exec_time'])
fig = px.scatter(data, y="exec_time")
fig.show()
We can use TriScale to evaluate the expected execution time of the experiment_sizing()
function. We choose a large percentile (95th) and a standard confidence level (95%).
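The core of what is being sized here rests on a short binomial argument: with N i.i.d. samples, the probability that all of them fall below the p-th percentile is p^N, so the extreme order statistic becomes a valid one-sided bound as soon as p^N ≤ 1 − confidence. A minimal sketch of this argument (not TriScale's experiment_sizing() implementation, which is more general):

```python
import math

def min_number_of_runs(percentile, confidence):
    """Smallest N such that the extreme order statistic (min or max)
    of N i.i.d. samples is a valid one-sided bound for the given
    percentile at the given confidence level (both in percent).
    Solves p**N <= 1 - confidence."""
    p = max(percentile, 100 - percentile) / 100  # bound the harder side
    c = confidence / 100
    return math.ceil(math.log(1 - c) / math.log(p))

print(min_number_of_runs(95, 95))  # -> 59 runs for the 95th percentile
```

Since this boils down to evaluating one logarithmic expression, it is unsurprising that the execution time is tiny and essentially independent of the chosen percentile and confidence level.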
However, the scatter plot of the timing measurements indicates that the data is correlated (which is not surprising: the experiment is disturbed by other processes running on the PC).
KPI = {'percentile': 95,
       'confidence': 95,
       'bounds'    : [0,0.001],
       'name'      : 'Execution time of TriScale experiment_sizing() function',
       'unit'      : 's'}
triscale.analysis_kpi(time_vec, KPI)
(False, 2.87230359390378e-06)
Indeed, TriScale indicates that the data do not appear to be i.i.d.; thus, the analysis module cannot confidently compute a KPI.
However, since the sample size is large, we can try to randomly subsample the data to dilute the correlation between data points.
for k in [2,4,8,16,64,128,256,512]:

    sample_size = int(len(time_vec)/k)
    ix = random.sample(range(len(time_vec)), sample_size)
    ix = np.sort(ix)
    time_vec_sub = time_vec[ix]

    iid_test, KPI_value = triscale.analysis_kpi(time_vec_sub, KPI)

    print('Subsampling factor\t ', k)
    print('Appear i.i.d.?\t\t ', iid_test)
    if iid_test:
        print('KPI value \t\t ', KPI_value)
    print('------------------')
Subsampling factor	  2
Appear i.i.d.?		  False
------------------
Subsampling factor	  4
Appear i.i.d.?		  False
------------------
Subsampling factor	  8
Appear i.i.d.?		  False
------------------
Subsampling factor	  16
Appear i.i.d.?		  False
------------------
Subsampling factor	  64
Appear i.i.d.?		  False
------------------
Subsampling factor	  128
Appear i.i.d.?		  False
------------------
Subsampling factor	  256
Appear i.i.d.?		  False
------------------
Subsampling factor	  512
Appear i.i.d.?		  False
------------------
With a subsampling factor of 256, we sometimes obtain sufficiently uncorrelated data to compute a KPI. This is not a deterministic process though, since the subsampling is random. Even with large subsampling factors, there may still be too much correlation in the remaining data to confidently compute a KPI.
Thus, we fall back to a simpler, purely descriptive metric: the 95th percentile of our data samples.
np.percentile(time_vec, 95 , interpolation='nearest')
2.820801455527544e-06
The 95th percentile of the execution time of the TriScale
experiment_sizing()
function is about
2.8 µs.