This notebook presents a scalability analysis of TriScale. We focus on the execution time of the computations; that is, the production of outputs other than the numerical results (i.e., textual logs, plots, etc.) is excluded from this evaluation. The results presented below have been obtained on a simple laptop.
The evaluation results show that TriScale's data analysis scales very well with the input size; the data analysis time is practically negligible compared to the data collection time. In particular, we show that:

- computing metrics (analysis_metric()) generally scales with the complexity of the metric computation;
- computing KPIs (analysis_kpi()) and variability scores (analysis_variability()) generally scales linearly with the number of samples;
- computing the minimal number of runs/series required (implemented by experiment_sizing()) is generally independent of the percentile and confidence level, and completes within less than 3 µs 95% of the time.
import os
from pathlib import Path
import random
import timeit
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
import triscale
analysis_metric

The following cell generates random data and saves it to csv files. This can be used to reproduce, in a controlled way, the timing evaluation of the analysis_metric() function when taking a file as input.
This evaluation of analysis_metric() uses DataFrames as inputs. One can verify that using DataFrames or csv files as inputs has barely any impact on the execution time: either way, the input data is loaded into a DataFrame at the start of analysis_metric().
# Generation of synthetic data

## Create a random DataFrame
rand_x = [random.random()*100 for i in range(1000000)]
rand_y = [random.random()*100 for i in range(1000000)]
rand_df = pd.DataFrame(columns=['x','y'])
rand_df['x'] = np.sort(rand_x)
rand_df['y'] = rand_y

## Save chunks of it as csv files
file_path = Path('ExampleTraces/Scalability')
if not os.path.exists(file_path):
    os.makedirs(file_path)

sample_sizes = [20,50,100,150,200,300,400,500,1000,5000,10000,200000,400000,600000,800000,1000000]
for sample_size in sample_sizes:
    file_name = 'synthetic_%i_samples.csv' % sample_size
    df_chunk = rand_df[:sample_size]
    df_chunk.to_csv(str(file_path/file_name), index=False)
print('Done.')
Done.
verbose = False

columns = ['Sample size [#]', 'Execution time [ms]']
df = pd.DataFrame(columns=columns)

for sample_size in sample_sizes:

    setup = '''
import random
import numpy as np
import pandas as pd
import triscale
convergence = {'expected': True}
metric = {'measure':int(random.random()*50)+1}
rand_x = [random.random()*100 for i in range(1000000)]
rand_y = [random.random()*100 for i in range(1000000)]
rand_df = pd.DataFrame(columns=['x','y'])
rand_df['x'] = np.sort(rand_x)
rand_df['y'] = rand_y
df = rand_df[:%i]
''' % sample_size

    if verbose:
        print('Sample size\t', sample_size)

    nb_tries = 10
    time_vec = (timeit.Timer('triscale.analysis_metric(df, metric, convergence=convergence)',
                             setup=setup).repeat(10, nb_tries))
    time_vec = np.array(time_vec)
    time_vec = (time_vec/nb_tries)*1000 # Convert to ms

    sample_size_vec = np.ones(nb_tries)*sample_size
    df_new = pd.DataFrame(sample_size_vec, columns=[columns[0]])
    df_new[columns[1]] = np.array(time_vec)
    df = pd.concat([df, df_new], sort=False)

    if verbose:
        print('Done.')
        print('------------------')

fig = px.scatter(df, x=columns[0], y=columns[1])
fig.show()
# Zooming in on the small sample sizes...
fig = px.scatter(df, x=columns[0], y=columns[1], range_x=[0,1010], range_y=[0,30])
fig.show()
The data shows two modes in the execution time of the analysis_metric()
function: a step increase, followed by a slow linear increase. This can be easily explained.
The computationally expensive part of analysis_metric()
is the convergence test, which includes a Theil-Sen regression [ref]. This regressor computes the slopes between all pairs of points and returns the median slope value; thus, it scales as O(n²) with the number of points.
However, TriScale does not perform the regression on the input data directly. Instead, TriScale divides the input data into chunks. For each chunk, the metric is computed, eventually creating a new data series of metric values. The purpose of the convergence test is to verify that these metric values have converged; thus, TriScale executes the Theil-Sen regressor on this new data series.
The Theil-Sen regressor does not require many samples to produce a reliable result; a few tens of data points are often considered sufficient. Thus, we can cap the size of the metric data series (TriScale caps it to 100 values), which bounds the execution time of the Theil-Sen regressor. Ultimately, this allows analysis_metric()
to scale very well with the sample size.
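To make the O(n²) behavior and the capping idea concrete, here is a minimal sketch (an illustration of the principle only; TriScale's actual implementation may differ):

```python
import itertools
import numpy as np

def theil_sen_slope(x, y):
    """Median of the slopes over all point pairs: O(n^2) pairs."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in itertools.combinations(range(len(x)), 2)]
    return float(np.median(slopes))

# Capping the metric series bounds the regression cost
# (a sketch of the idea, not TriScale's actual subsampling code):
metric_series = np.random.default_rng(0).random(5000)
if len(metric_series) > 100:
    idx = np.linspace(0, len(metric_series) - 1, 100).astype(int)
    metric_series = metric_series[idx]  # at most 100*99/2 = 4950 pairs
```

With the series capped at 100 values, the regression cost is constant regardless of how many raw samples the experiment produced.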
The linear increase for large numbers of raw samples is due to the computation of the metric on increasingly large chunks. The more complex the metric, the longer the execution time.
In this evaluation, a percentile is used as the metric, which is computed efficiently by numpy.
- analysis_metric() is fast (less than 20 ms) for small numbers of samples (less than 200);
- analysis_metric() scales with the same complexity as the metric computation. In this example, the metric is a simple percentile, which appears to scale linearly when computed with numpy.
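The scaling of the percentile computation can be checked independently of TriScale by timing np.percentile directly (a rough, machine-dependent sketch; absolute numbers will differ):

```python
import timeit
import numpy as np

# Time np.percentile on growing inputs to observe the (roughly linear)
# growth of the metric-computation cost.
rng = np.random.default_rng(42)
times_ms = []
for n in (10_000, 100_000, 1_000_000):
    data = rng.random(n)
    t = timeit.timeit(lambda: np.percentile(data, 95), number=20) / 20
    times_ms.append(t * 1e3)
    print(f"{n:>9} samples: {t * 1e3:.3f} ms")
```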
Overall, in our experiments (running on a standard PC), the execution time of analysis_metric() remains small compared to the time required to collect the data in the first place. The data collection time depends on the networking experiment, but it is unlikely that any experiment would produce much more than a million (useful) data points per second.
Thus, we conclude that the computation time of analysis_metric() is negligible for networking experiments.
Note: analysis_metric() takes either a DataFrame or a csv file as input. This has barely any impact on the execution time: either way, the input data is loaded into a DataFrame at the start of analysis_metric().
analysis_kpi

verbose = False
columns = ['Sample size [#]', 'Execution time [ms]']
df = pd.DataFrame(columns=columns)

for sample_size in [10,20,30,40,50,100,200,300,400,600,800,1000]:

    setup = '''
import random
import triscale
sample_length = %i
percentile = int(random.random()*50)+1
confidence = int(random.random()*25)+75
KPI = { 'percentile': percentile,
        'confidence': confidence,
        'bounds'    : [0,0.001],
        'bound'     : 'lower'}
rand_data = [random.random() for i in range(sample_length)]
''' % sample_size

    if verbose:
        print('Sample size\t', sample_size)

    nb_tries = 10
    time_vec = (timeit.Timer('triscale.analysis_kpi(rand_data, KPI)', setup=setup).repeat(10, nb_tries))
    time_vec = np.array(time_vec)
    time_vec = (time_vec/nb_tries)*1000 # Convert to ms

    sample_size_vec = np.ones(nb_tries)*sample_size
    df_new = pd.DataFrame(sample_size_vec, columns=[columns[0]])
    df_new[columns[1]] = np.array(time_vec)
    df = pd.concat([df, df_new], sort=False)

    if verbose:
        print('Done.')
        print('------------------')

fig = px.scatter(df, x=columns[0], y=columns[1])
fig.show()
The data shows a clear linear correlation between the sample size and the execution time of the analysis_kpi()
function, which is not surprising: most computations are related to the determination of the confidence interval using Thompson's method, which is an iterative process through the ordered data samples (see ThompsonCI()
docstring for details).
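The order-statistic argument behind such a confidence interval can be sketched as follows (an illustration of the principle only, not TriScale's ThompsonCI() code; here the confidence level is given as a fraction):

```python
import math

def binom_cdf(k, n, p):
    """P(Binom(n, p) <= k), computed directly from the binomial pmf."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def lower_bound_index(n, p, confidence):
    """Largest 1-based rank k such that the k-th smallest of n i.i.d.
    samples is a valid one-sided lower bound for the p-th percentile:
    P(x_(k) <= true percentile) = 1 - BinomCDF(k-1; n, p) >= confidence.
    Returns None when no rank qualifies (sample too small)."""
    for k in range(n, 0, -1):  # walk down through the ordered samples
        if 1 - binom_cdf(k - 1, n, p) >= confidence:
            return k
    return None

print(lower_bound_index(5, 0.5, 0.95))  # -> 1: the minimum of 5 samples
                                        # lower-bounds the median at 95%
```

Since the search walks through the ordered samples once, a linear dependence of the execution time on the sample size is exactly what one would expect.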
analysis_kpi()
scales linearly with the number of samples.
In our experiments (running on a standard PC), the computation remains fast even for the largest sample size we tested (1000 samples).
Thus, we conclude that the computation time of analysis_kpi() is negligible for networking experiments.
analysis_variability

verbose = False
columns = ['Sample size [#]', 'Execution time [ms]']
df = pd.DataFrame(columns=columns)

for sample_size in [10,20,30,40,50,100,200,300,400,600,800,1000]:

    setup = '''
import random
import triscale
sample_length = %i
percentile = int(random.random()*50)+1
confidence = int(random.random()*25)+75
score = { 'percentile': percentile,
          'confidence': confidence}
rand_data = [random.random() for i in range(sample_length)]
''' % sample_size

    if verbose:
        print('Sample size\t', sample_size)

    nb_tries = 10
    time_vec = (timeit.Timer('triscale.analysis_variability(rand_data, score)', setup=setup).repeat(10, nb_tries))
    time_vec = np.array(time_vec)
    time_vec = (time_vec/nb_tries)*1000 # Convert to ms

    sample_size_vec = np.ones(nb_tries)*sample_size
    df_new = pd.DataFrame(sample_size_vec, columns=[columns[0]])
    df_new[columns[1]] = np.array(time_vec)
    df = pd.concat([df, df_new], sort=False)

    if verbose:
        print('Done.')
        print('------------------')

fig = px.scatter(df, x=columns[0], y=columns[1])
fig.show()
The data shows a clear linear correlation between the sample size and the execution time of the analysis_variability()
function, which is not surprising: most computations are related to the determination of the confidence interval using Thompson's method, which is an iterative process through the ordered data samples (see ThompsonCI()
docstring for details).
Unsurprisingly, the data is very similar to that of analysis_kpi(): analysis_kpi()
and analysis_variability()
perform the same computations; they differ only in the generation of outputs (logs and plots). Since the outputs are not considered in this scalability evaluation, we obtain very similar results for both functions.
analysis_variability()
scales linearly with the number of samples.
In our experiments (running on a standard PC), the computation remains fast even for the largest sample size we tested (1000 samples).
Thus, we conclude that the computation time of analysis_variability() is negligible for networking experiments.
experiment_sizing

setup = '''
import random
import triscale
percentile = int(random.random()*50)+1
confidence = int(random.random()*25)+75
'''
nb_tries = 10
time_vec = (timeit.Timer('triscale.experiment_sizing(percentile,confidence)', setup=setup).repeat(25000, nb_tries))
time_vec = np.array(time_vec)
time_vec /= nb_tries
data = pd.DataFrame(time_vec, columns=['exec_time'])
fig = px.scatter(data, y="exec_time")
fig.show()
We can use TriScale to evaluate the expected execution time of the experiment_sizing()
function. We choose a large percentile (95th) and a standard confidence level (95%).
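The core of what is being sized here rests on a short binomial argument: with N i.i.d. samples, the probability that all of them fall below the p-th percentile is p^N, so the extreme order statistic becomes a valid one-sided bound as soon as p^N ≤ 1 − confidence. A minimal sketch of this argument (not TriScale's experiment_sizing() implementation, which is more general):

```python
import math

def min_number_of_runs(percentile, confidence):
    """Smallest N such that the extreme order statistic (min or max)
    of N i.i.d. samples is a valid one-sided bound for the given
    percentile at the given confidence level (both in percent).
    Solves p**N <= 1 - confidence."""
    p = max(percentile, 100 - percentile) / 100  # bound the harder side
    c = confidence / 100
    return math.ceil(math.log(1 - c) / math.log(p))

print(min_number_of_runs(95, 95))  # -> 59 runs for the 95th percentile
```

Since this boils down to evaluating one logarithmic expression, it is unsurprising that the execution time is tiny and essentially independent of the chosen percentile and confidence level.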
However, the scatter plot of the timing measurements indicates that the data is correlated (which is not surprising: the experiment is disturbed by other processes running on the PC).
KPI = {'percentile': 95,
       'confidence': 95,
       'bounds'    : [0,0.001],
       'name'      : 'Execution time of TriScale experiment_sizing() function',
       'unit'      : 's'}
triscale.analysis_kpi(time_vec, KPI)
(False, 2.87230359390378e-06)
Indeed, TriScale indicates that the data do not appear to be i.i.d.; thus, the analysis module cannot confidently compute a KPI.
However, since the sample size is large, we can try to randomly subsample the data to dilute the correlation between data points.
for k in [2,4,8,16,64,128,256,512]:

    sample_size = int(len(time_vec)/k)
    ix = random.sample(range(len(time_vec)), sample_size)
    ix = np.sort(ix)
    time_vec_sub = time_vec[ix]

    iid_test, KPI_value = triscale.analysis_kpi(time_vec_sub, KPI)

    print('Subsampling factor\t ', k)
    print('Appear i.i.d.?\t\t ', iid_test)
    if iid_test:
        print('KPI value \t\t ', KPI_value)
    print('------------------')
Subsampling factor	  2
Appear i.i.d.?		  False
------------------
Subsampling factor	  4
Appear i.i.d.?		  False
------------------
Subsampling factor	  8
Appear i.i.d.?		  False
------------------
Subsampling factor	  16
Appear i.i.d.?		  False
------------------
Subsampling factor	  64
Appear i.i.d.?		  False
------------------
Subsampling factor	  128
Appear i.i.d.?		  False
------------------
Subsampling factor	  256
Appear i.i.d.?		  False
------------------
Subsampling factor	  512
Appear i.i.d.?		  False
------------------
With a subsampling factor of 256, we sometimes obtain sufficiently uncorrelated data to compute a KPI. This is not a deterministic process though, since the subsampling is random. Even with large subsampling factors, there may still be too much correlation in the remaining data to confidently compute a KPI.
Thus, we fall back to a simpler, purely descriptive metric: the 95th percentile of our data samples.
np.percentile(time_vec, 95 , interpolation='nearest')
2.820801455527544e-06
The 95th percentile of the execution time of the TriScale
experiment_sizing()
function is about
2.8 µs.