This notebook presents a scalability analysis of TriScale. We focus on the execution time of the computations; that is, the production of outputs other than the numerical results (e.g., textual logs, plots) is excluded from this evaluation. The results presented below were obtained on an ordinary laptop.

The evaluation results show that **TriScale's data analysis scales very well with the input size**; the data analysis time is practically negligible compared to the data collection time. In particular, we show that:

- For simple metrics such as percentiles, the execution of `analysis_metric()` generally
  - takes less than 50 ms for up to 10'000 data points,
  - takes about **1 s for up to one million data points**,
  - scales linearly.
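As a rough, hypothetical illustration of this linear scaling (the sizes below are illustrative, and `analysis_metric()` itself does additional work such as convergence testing, so this only times the percentile computation that dominates a simple metric):

```python
import timeit
import numpy as np

# Hypothetical sketch: time only the percentile computation, which is
# the core of a simple metric such as "95th percentile of the samples".
rng = np.random.default_rng(42)
sizes = [1_000, 10_000, 100_000]
timings_ms = []
for n in sizes:
    data = rng.random(n) * 100
    t = timeit.timeit(lambda: np.percentile(data, 95), number=20) / 20
    timings_ms.append(t * 1000)

for n, t in zip(sizes, timings_ms):
    print('%7i samples: %.3f ms' % (n, t))
```

Since `np.percentile` is dominated by sorting, the per-call time grows roughly linearly (up to a log factor) with the sample size, consistent with the trend reported above.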

- Computing KPIs (`analysis_kpi()`) and variability scores (`analysis_variability()`) generally
  - takes less than 10 ms for up to 100 data points,
  - takes less than **100 ms for up to 1000 data points**,
  - scales linearly.

- The computation of confidence intervals using Thompson's method is very efficient (as demonstrated by the scaling of `analysis_kpi()` and `analysis_variability()`).
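This efficiency is easy to see from the underlying math: with Thompson's (binomial-based) method, a confidence interval on a percentile reduces to selecting order statistics of the sorted data, which requires only a handful of binomial CDF evaluations. A minimal sketch of this idea (not TriScale's actual implementation; the function names are ours):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def lower_bound_index(n, percentile, confidence):
    """Largest 1-based index k such that the k-th smallest of n samples
    is a lower confidence bound on the given percentile.
    Returns None if n is too small for the requested confidence."""
    p = percentile / 100
    best = None
    for k in range(1, n + 1):
        # P(x_(k) <= q_p) = P(at least k samples fall below q_p)
        #                 = 1 - BinomCDF(k-1; n, p)
        if 1 - binom_cdf(k - 1, n, p) >= confidence:
            best = k
        else:
            break
    return best

print(lower_bound_index(5, 50, 0.95))  # prints 1: with 5 samples, the
                                       # minimum bounds the median at 95%
```

Because the confidence in `x_(k)` decreases monotonically with `k`, the search stops after at most `n` cheap CDF evaluations, which is why the confidence-interval step barely registers in the timings.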

Furthermore, computing the minimal number of runs/series required (implemented by `experiment_sizing()`) is generally independent of the percentile and confidence level and completes in **less than 3 µs** 95% of the time.
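Such a near-constant runtime is unsurprising, since the minimal number of runs follows from a closed-form expression: for an upper confidence bound on the p-th percentile, the maximum of N runs is a valid bound as soon as 1 - p^N >= confidence. A hedged sketch of this calculation (the function name is ours; `experiment_sizing()`'s exact interface may differ):

```python
from math import ceil, log

def min_runs(percentile, confidence):
    """Smallest N such that the maximum of N runs is an upper
    confidence bound on the given percentile:
    1 - p**N >= confidence  <=>  N >= log(1-confidence) / log(p)."""
    p = percentile / 100
    return ceil(log(1 - confidence) / log(p))

print(min_runs(95, 0.95))  # 59 runs for the 95th percentile at 95% confidence
print(min_runs(50, 0.95))  # 5 runs for the median at 95% confidence
```

A single logarithm ratio, independent of any data, which is consistent with the microsecond-scale timings reported above.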

In [3]:

```
import os
from pathlib import Path
import random
import timeit
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
```

The entire dataset and source code of this case study are available on Zenodo. The cell below sets up the download of the files required to reproduce this case study.

In [1]:

```
# Set `download = True` to download (and extract) the data from this case study.
# If needed, adjust the record_id for the file version you are interested in.
# To reproduce the original TriScale paper, set `record_id = 3451418`.
download = False
record_id = 3458116  # version 2 (https://doi.org/10.5281/zenodo.3458116)
files = ['triscale.py',
         'helpers.py',
         'triplots.py']
```
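For completeness, here is one way the flags above could be acted on, assuming the files are published at Zenodo's usual `zenodo.org/record/<id>/files/<name>` path (a sketch, not the notebook's actual download code):

```python
import urllib.request

download = False  # set to True to actually fetch the files
record_id = 3458116
files = ['triscale.py', 'helpers.py', 'triplots.py']

if download:
    for file in files:
        # Assumed Zenodo file-URL pattern for this record.
        url = 'https://zenodo.org/record/%i/files/%s' % (record_id, file)
        urllib.request.urlretrieve(url, file)
        print('Downloaded', file)
```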

We now import the custom modules that we just downloaded.

`triscale`

is the main module of TriScale. It contains all the functions of TriScale's API; that is, the functions meant to be called by the user.

In [ ]:

```
import triscale
```

`analysis_metric`

The following cell generates random data and saves it to CSV files. These files can be used to reproduce, in a controlled way, the timing evaluation of the `analysis_metric()` function when taking a file as input.

This evaluation of `analysis_metric()` uses DataFrames as inputs. One can verify that using DataFrames or CSV files as inputs has barely any impact on the execution time: either way, the input data is loaded into a DataFrame at the start of `analysis_metric()`.

In [6]:

```
# Generation of synthetic data
## Create a random DataFrame
rand_x = [random.random()*100 for i in range(1000000)]
rand_y = [random.random()*100 for i in range(1000000)]
rand_df = pd.DataFrame(columns=['x','y'])
rand_df['x'] = np.sort(rand_x)
rand_df['y'] = rand_y
## Save chunks of it as csv files
file_path = Path('ExampleTraces/Scalability')
if not os.path.exists(file_path):
    os.makedirs(file_path)
sample_sizes = [20,50,100,150,200,300,400,500,1000,5000,10000,200000,400000,600000,800000,1000000]
for sample_size in sample_sizes:
    file_name = 'synthetic_%i_samples.csv' % sample_size
    df_chunk = rand_df[:sample_size]
    df_chunk.to_csv(str(file_path/file_name), index=False)
print('Done.')
```

In [3]:

```
verbose = False
columns = ['Sample size [#]', 'Execution time [ms]']
df = pd.DataFrame(columns=columns)
for sample_size in sample_sizes:
    setup = '''
import random
import numpy as np
import pandas as pd
import triscale
convergence = {'expected': True}
metric = {'measure':int(random.random()*50)+1}
rand_x = [random.random()*100 for i in range(1000000)]
rand_y = [random.random()*100 for i in range(1000000)]
rand_df = pd.DataFrame(columns=['x','y'])
rand_df['x'] = np.sort(rand_x)
rand_df['y'] = rand_y
df = rand_df[:%i]
''' % sample_size
    if verbose:
        print('Sample size\t', sample_size)
    nb_tries = 10
    time_vec = (timeit.Timer('triscale.analysis_metric(df, metric, convergence=convergence)',
                             setup=setup).repeat(10, nb_tries))
    time_vec = np.array(time_vec)
    time_vec = (time_vec/nb_tries)*1000  # Convert to ms
    sample_size_vec = np.ones(nb_tries)*sample_size
    df_new = pd.DataFrame(sample_size_vec, columns=[columns[0]])
    df_new[columns[1]] = np.array(time_vec)
    df = pd.concat([df, df_new], sort=False)
    if verbose:
        print('Done.')
        print('------------------')
fig = px.scatter(df, x=columns[0], y=columns[1])
fig.show()
```
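Beyond eyeballing the scatter plot, the linear-scaling claim can be checked by fitting a line to the collected timings: under linear scaling the slope is the per-sample cost and the fit residuals are small. A sketch on stand-in data (the numbers below are illustrative, not measured):

```python
import numpy as np
import pandas as pd

columns = ['Sample size [#]', 'Execution time [ms]']
# Stand-in timings, perfectly linear purely for illustration.
df = pd.DataFrame({
    columns[0]: [1e4, 1e5, 1e6],
    columns[1]: [10.0, 100.0, 1000.0],
})

# Fit time = a*size + b; under linear scaling, a is the per-sample cost
# and b the fixed overhead of one call.
a, b = np.polyfit(df[columns[0]], df[columns[1]], 1)
print('per-sample cost: %.1f ns, offset: %.3f ms' % (a * 1e6, b))
```

The same fit applied to the real `df` produced above would quantify both the per-sample cost and the constant overhead of `analysis_metric()`.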