This notebook presents a scalability analysis of TriScale. We focus on the execution time of the computations; that is, the production of outputs other than the numerical results (i.e., textual logs, plots, etc.) is excluded from this evaluation. The results presented below have been obtained on a simple laptop.
The evaluation results show that TriScale's data analysis scales very well with the input sizes; the data analysis time is practically negligible compared to the data collection time. In particular, we show that:
- computing metrics (analysis_metric()), KPIs (analysis_kpi()), and variability scores (analysis_variability()) generally completes quickly, even for large input sizes;
- furthermore, computing the minimal number of runs/series required (implemented by experiment_sizing()) is generally independent of the percentile and confidence level and completes within less than 3 µs 95% of the time.
import os
from pathlib import Path
import random
import timeit
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
The entire dataset and source code of this case study are available on Zenodo:
The wget commands below download the required files to reproduce this case study.
# Set `download = True` to download (and extract) the data from this case study
# If needed, adjust the record_id for the file version you are interested in.
# For reproducing the original TriScale paper, set `record_id = 3451418`
download = False
record_id = 3458116  # version 2 (https://doi.org/10.5281/zenodo.3458116)
files = ['triscale.py',
         'helpers.py',
         'triplots.py']
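The download step itself is not reproduced here; the following is a minimal sketch of what it could look like, assuming Zenodo's standard file URL scheme (https://zenodo.org/record/<record_id>/files/<filename>) and that wget is available on the system. It reuses the download, record_id, and files variables from the cell above.
# Sketch of the download step, assuming Zenodo's file URL scheme
# https://zenodo.org/record/<record_id>/files/<filename>
import os

if download:
    for file in files:
        url = 'https://zenodo.org/record/%i/files/%s' % (record_id, file)
        os.system('wget -nc %s' % url)  # -nc: do not overwrite existing files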
We now import the custom modules that we just downloaded.
triscale is the main module of TriScale. It contains all the functions of TriScale's API; that is, the functions meant to be called by the user.
import triscale
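As a side note, the introduction above claims that experiment_sizing() completes within a few microseconds. The snippet below is a minimal sketch of how this could be spot-checked with timeit; the argument list (a percentile and a confidence level, written here as percentages) is an assumption about the function's signature and may need adjusting to the actual API.
# Sketch: timing triscale.experiment_sizing() with timeit.
# Assumption: experiment_sizing(percentile, confidence) is a valid call,
# with the percentile and the confidence level given as percentages.
import timeit
import numpy as np

nb_tries = 1000
time_vec = np.array(timeit.Timer('triscale.experiment_sizing(50, 95)',
                                 setup='import triscale').repeat(10, nb_tries))
time_vec = (time_vec/nb_tries)*1e6  # Convert to microseconds per call
print('95%% of calls complete within %0.2f us' % np.percentile(time_vec, 95))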
analysis_metric
The following cell generates random data and saves it to csv files. This can be used to reproduce, in a controlled way, the timing evaluation of the analysis_metric() function when taking a file as input.
This evaluation of analysis_metric() uses DataFrames as inputs. One can verify that using DataFrames or csv files as inputs has barely any impact on the execution time: either way, the input data is loaded into a DataFrame at the start of analysis_metric().
# Generation of synthetic data
## Create a random DataFrame
rand_x = [random.random()*100 for i in range(1000000)]
rand_y = [random.random()*100 for i in range(1000000)]
rand_df = pd.DataFrame(columns=['x','y'])
rand_df['x'] = np.sort(rand_x)
rand_df['y'] = rand_y
## Save chunks of it as csv files
file_path = Path('ExampleTraces/Scalability')
if not os.path.exists(file_path):
    os.makedirs(file_path)
sample_sizes = [20,50,100,150,200,300,400,500,1000,5000,10000,200000,400000,600000,800000,1000000]
for sample_size in sample_sizes:
    file_name = 'synthetic_%i_samples.csv' % sample_size
    df_chunk = rand_df[:sample_size]
    df_chunk.to_csv(str(file_path/file_name), index=False)
print('Done.')
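To back the claim above that DataFrame and csv inputs perform similarly, the snippet below is a small sketch (not part of the original evaluation) that times analysis_metric() on one of the csv files generated above and on the equivalent DataFrame; the metric and convergence dictionaries mirror those used in the timing cell below.
# Sketch: compare analysis_metric() timing for csv-file vs. DataFrame input.
# Assumes analysis_metric() accepts either a csv file path or a DataFrame,
# as stated above; both timings should be very close.
import timeit

setup_common = '''
import pandas as pd
import triscale
convergence = {'expected': True}
metric = {'measure': 50}
file_name = 'ExampleTraces/Scalability/synthetic_1000_samples.csv'
df = pd.read_csv(file_name)
'''
t_df  = min(timeit.Timer('triscale.analysis_metric(df, metric, convergence=convergence)',
                         setup=setup_common).repeat(5, 10)) / 10
t_csv = min(timeit.Timer('triscale.analysis_metric(file_name, metric, convergence=convergence)',
                         setup=setup_common).repeat(5, 10)) / 10
print('DataFrame input: %.2f ms | csv-file input: %.2f ms' % (t_df*1e3, t_csv*1e3))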
# Timing evaluation of analysis_metric() for increasing sample sizes
verbose = False
columns = ['Sample size [#]', 'Execution time [ms]']
df = pd.DataFrame(columns=columns)

for sample_size in sample_sizes:

    # Setup code (not timed): generate the synthetic input DataFrame
    setup = '''
import random
import numpy as np
import pandas as pd
import triscale
convergence = {'expected': True}
metric = {'measure':int(random.random()*50)+1}
rand_x = [random.random()*100 for i in range(1000000)]
rand_y = [random.random()*100 for i in range(1000000)]
rand_df = pd.DataFrame(columns=['x','y'])
rand_df['x'] = np.sort(rand_x)
rand_df['y'] = rand_y
df = rand_df[:%i]
''' % sample_size

    if verbose:
        print('Sample size\t', sample_size)

    # Repeat the measurement 10 times; each measurement averages over nb_tries calls
    nb_tries = 10
    time_vec = (timeit.Timer('triscale.analysis_metric(df, metric, convergence=convergence)',
                             setup=setup).repeat(10, nb_tries))
    time_vec = np.array(time_vec)
    time_vec = (time_vec/nb_tries)*1000  # Convert to ms per call

    # Append the timings for this sample size to the results DataFrame
    sample_size_vec = np.ones(nb_tries)*sample_size
    df_new = pd.DataFrame(sample_size_vec, columns=[columns[0]])
    df_new[columns[1]] = time_vec
    df = pd.concat([df, df_new], sort=False)

    if verbose:
        print('Done.')
        print('------------------')
fig = px.scatter(df, x=columns[0], y=columns[1])
fig.show()
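Optionally, the raw timings can also be summarized per sample size, for instance with the median and the 95th percentile of the execution time; this is a small addition on top of the DataFrame produced above.
# Median and 95th percentile of the execution time (in ms) per sample size
summary = (df.astype(float)
             .groupby(columns[0])[columns[1]]
             .agg(median='median', p95=lambda t: np.percentile(t, 95)))
print(summary)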