TriScale¶

Case Study - Congestion-Control Schemes¶

This notebook presents a use case of the TriScale framework, which details a comparison between 17 congestion-control schemes (listed below). Elements of this use case are described in the TriScale paper (submitted to NSDI'2020).

The evaluation is based on the MahiMahi emulator.
The evaluation data have been collected through the Pantheon framework.
The evaluation uses an emulated link calibrated to the real path from AWS California to Mexico (see details in the Data Collection section).
The evaluation considers the single flow scenario only.
The evaluation focuses on the long-term behavior of full-throttle flows - that is, flows whose only throttling/limiting factor is the congestion control.
The evaluation considers two performance dimensions:
- The one-way delay
- The (egress) throughput

List of Schemes
Evaluation Objectives
List of Imports
Download Source Files and Data
Experiment Design
Data Collection
Analysis
Conclusions

List of Schemes¶

The congestion-control schemes included in the evaluation are:

TCP BBR
Copa
TCP Cubic
FillP
FillP-Sheep
Indigo
LEDBAT
PCC-Allegro
PCC-Expr
QUIC Cubic
SCReAM
TaoVA-100x
TCP Vegas
Verus
PCC-Vivace
WebRTC media

Evaluation Objectives¶

[Back to top]

This evaluation aims to compare congestion-control schemes. For a fair comparison, all schemes are tested using the MahiMahi network emulator.

The purpose of this evaluation is not to "crown" one scheme, but rather to illustrate with a concrete example how TriScale can be used and how the framework avoids certain shortcomings in the experiment design and data analysis.

List of Imports¶

[Back to top]

In [1]:

from pathlib import Path
import json
import yaml
import numpy as np
import os
import pandas as pd
import zipfile

from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = "notebook"

Download Source Files and Data¶

[Back to top]

The entire dataset and source code of this case study are available on Zenodo (version 2: https://doi.org/10.5281/zenodo.3458116).

The wget commands below download the files required to reproduce this case study. Beware: the .zip file is 2.7 GB, so downloading and unzipping might take a while...

In [2]:

# Set `download = True` to download (and extract) the data from this case study
# If needed, adjust the record_id for the file version you are interested in

# For reproducing the original TriScale paper, set `record_id = 3451418`

download = False
record_id = 3458116 # version 2 (https://doi.org/10.5281/zenodo.3458116)

files= ['triscale.py',
        'helpers.py',
        'triplots.py',
        'UseCase_Pantheon.zip']
if download:
    for file in files:
        print(file)
        url = 'https://zenodo.org/record/'+str(record_id)+'/files/'+file 
        os.system('wget %s' %url)
        if file[-4:] == '.zip':    
            with zipfile.ZipFile(file,"r") as zip_file:
                zip_file.extractall()
        print('Done.')
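
If wget is not available on your machine, the same download can be sketched with the Python standard library alone. This is only an illustrative alternative and not part of TriScale: the helper names zenodo_file_url and fetch_files are hypothetical, and the URL pattern is simply the one used in the cell above.

```python
from pathlib import Path
from urllib.request import urlretrieve
import zipfile


def zenodo_file_url(record_id, filename):
    """Build the download URL with the same pattern as the cell above."""
    return 'https://zenodo.org/record/%s/files/%s' % (record_id, filename)


def fetch_files(record_id, filenames, dest='.'):
    """Download each file into `dest` and extract .zip archives (hypothetical helper)."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    for name in filenames:
        target = dest / name
        urlretrieve(zenodo_file_url(record_id, name), str(target))
        if target.suffix == '.zip':
            with zipfile.ZipFile(target, 'r') as zip_file:
                zip_file.extractall(dest)


# Building a URL does not touch the network
print(zenodo_file_url(3458116, 'triscale.py'))
```

Calling fetch_files(record_id, files) would then replace the wget loop; the 2.7 GB archive still takes a while to download.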

We now import the custom modules that we just downloaded.

triscale is the main module of TriScale. It contains all the functions from TriScale's API; that is, functions meant to be called by the user.
triplots is a module from TriScale that contains all its plotting functions.
pantheon is a module specific to this case study. It contains helper functions that loop through the different congestion-control schemes, call TriScale API functions, and create visualizations.

In [3]:

import triscale
import triplots
import UseCase_Pantheon.pantheon as pantheon

Experiment Design¶

[Back to top]

One needs to answer a set of questions to fully design an experiment. They relate to three different time scales:

[Run] How long is a run?
[Series] How many runs in a series? When should they run?
[Sequel] How many series?

TriScale assists the user in answering these questions, as described below.

1. How long is a run?¶

[Back to top]

The scenario evaluates the long-term behavior of full-throttle flows - that is, flows whose only throttling/limiting factor is the congestion control. Thus, one must decide on a running time that is long enough to actually capture the long-term behavior.

TriScale implements a convergence test that helps assess whether a run is indeed long enough to estimate the long-term behavior. Passing this test suggests that the protocol behavior has converged (with a certain level of confidence): the run is then considered long enough to estimate the long-term behavior.

All networking protocols are different, and there is no reason to assume that the same run time is necessary for all (certain protocols may be very stable, others require more time to converge). The only way to estimate the required length is to actually test the different protocols, and observe when the runs become long enough to pass the convergence test.

Hence, we perform a preliminary series of tests with varying run time, and test the convergence of the different congestion-control schemes. The first thing to do is to specify the parameters for the convergence test and the performance metrics.

In [4]:

# Parameters for the convergence test
# -> TriScale defaults
convergence = {'expected'  : True,
               'confidence': 95,  # in %
               'tolerance' : 1,   # in %
              }
# Throughput metric -> Median
metric_tput = {'name':'Average Throughput',
               'unit': 'Mbit/s',
               'measure':50,
               'bounds':[0,120],  # expected value range
               'tag':'throughput' # do not change the tag
              }
# Delay metric -> 95th percentile
metric_delay = {'name':'95th perc. of One-way delay (ms)',
                'unit': 'ms',
                'measure':95,
                'bounds':[0,100], # expected value range
                'tag':'delay'     # do not change the tag
               }

These are all the inputs TriScale requires to perform the convergence test and compute the metrics.

The convergence test is implemented in a dedicated convergence_test() function in the helpers.py module.
The analysis_metric() function is the user API, implemented in the triscale.py module. analysis_metric() calls convergence_test(), handles the plotting, and produces the textual outputs.
The compute_metric() function is a simple wrapper that loops through all congestion-control schemes and runs, and calls the analysis_metric() function. It returns a Pandas dataframe with all convergence test results and metric values.
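
To give an intuition of what a convergence check looks at, here is a deliberately simplified stand-in. It is not TriScale's convergence_test(): it fits an ordinary least-squares line to the samples, ignores the confidence level, and declares the run converged when the fitted drift over the whole run stays within tolerance percent of the expected value range (the bounds entry of a metric). The function name drift_within_tolerance and the synthetic traces are assumptions made for illustration.

```python
def drift_within_tolerance(times, values, bounds, tolerance=1.0):
    """Return True if the fitted linear drift over the run is at most
    `tolerance` percent of the expected value range `bounds`."""
    n = len(times)
    if n < 2:
        return False
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        return False
    # Least-squares slope of values against time
    slope = sum((t - mean_t) * (v - mean_v)
                for t, v in zip(times, values)) / var_t
    drift = abs(slope) * (max(times) - min(times))
    return drift <= tolerance / 100.0 * (bounds[1] - bounds[0])


times = list(range(30))
stable = [90.0 + (0.5 if t % 2 == 0 else -0.5) for t in times]  # small jitter
ramp = [3.0 * t for t in times]                                  # still growing

print(drift_within_tolerance(times, stable, [0, 120]))  # True
print(drift_within_tolerance(times, ramp, [0, 120]))    # False
```

A run that is still ramping up at the end of the measurement fails such a check no matter how smooth it looks, which is the situation the preliminary tests below are designed to detect.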

In [5]:

# Construct the path to the different test results
# (you may need to adjust `result_dir_path` if you changed the default download location)
result_dir_path = Path('UseCase_Pantheon/PantheonData/10_20_30_40_50_60s')
result_dir_list = [x for x in result_dir_path.iterdir() if x.is_dir()]

# Meta data file name
meta_data_file  = 'pantheon_metadata.json'

# Config file name and path
config_file     = Path('UseCase_Pantheon/PantheonData/config.yml')

# Output file
out_name        = Path('UseCase_Pantheon/PantheonData/prelim_metrics.csv')

# Metric list
metric_list = [metric_tput, metric_delay]

# Execute convergence tests and compute the metrics
metrics_design = pantheon.compute_metric( result_dir_list, 
                                          meta_data_file, 
                                          convergence,
                                          metric_list,
                                          plot=False,
                                          out_name=out_name,
                                          verbose=False)

Output retrieved from file. Skipping computation.

In [6]:

# Visualize the results: plot the number of converged runs (out of 10)
convergence_results, figure = pantheon.compute_plot_convergence(metrics_design, 
                                                                config_file, 
                                                                'runtime', 
                                                                show=False)
figure.show()

The cell below displays the DataFrame containing the number of converged test runs (i.e., the data from the plot just above).

In [7]:

convergence_results

Out[7]:

	cc	runtime	run
0	bbr	10	7
1	bbr	20	9
2	bbr	30	10
3	bbr	40	10
4	bbr	50	10
5	bbr	60	10
6	copa	10	5
7	copa	20	3
8	copa	30	1
9	copa	50	1
10	copa	60	4
11	cubic	10	5
12	cubic	20	8
13	cubic	30	9
14	cubic	40	9
15	cubic	50	10
16	cubic	60	10
17	fillp	10	10
18	fillp	20	10
19	fillp	30	10
20	fillp	40	10
21	fillp	50	9
22	fillp	60	10
23	fillp_sheep	10	10
24	fillp_sheep	20	10
25	fillp_sheep	30	10
26	fillp_sheep	40	10
27	fillp_sheep	50	8
28	fillp_sheep	60	10
29	indigo	10	10
...	...	...	...
59	sprout	50	10
60	sprout	60	10
61	taova	10	2
62	taova	20	7
63	taova	30	6
64	taova	40	9
65	taova	50	7
66	taova	60	9
67	vegas	10	8
68	vegas	20	4
69	vegas	30	9
70	vegas	40	7
71	vegas	50	6
72	vegas	60	4
73	verus	10	1
74	verus	20	4
75	verus	30	4
76	verus	40	7
77	verus	50	5
78	verus	60	8
79	vivace	30	10
80	vivace	40	10
81	vivace	50	9
82	vivace	60	10
83	webrtc	10	10
84	webrtc	20	10
85	webrtc	30	10
86	webrtc	40	10
87	webrtc	50	10
88	webrtc	60	10

89 rows × 3 columns

Each scheme is run 10 times. For each run, we execute TriScale's convergence test for each performance dimension (the one-way delay and the egress throughput). We count a run as "converged" if both convergence tests passed.

As expected, the different schemes need different amounts of time to converge (see plot above).

The majority of schemes converge between 8 and 10 times (out of 10) with a runtime of 30s.
Some schemes converge less than half of the time even with a runtime of 60s.
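
The counting rule described above (a run counts only if both tests pass) can be sketched in a few lines of plain Python. The records are made-up illustrations, and the field names throughput_test and delay_test are assumed to mirror the column names used later in this notebook; this is not the code of compute_plot_convergence().

```python
from collections import Counter


def count_converged(records):
    """Count, per (scheme, runtime), the runs where both tests passed."""
    counts = Counter()
    for rec in records:
        key = (rec['cc'], rec['runtime'])
        # Adding 0 still registers the key, so failing groups show up as 0
        counts[key] += int(rec['throughput_test'] and rec['delay_test'])
    return dict(counts)


runs = [
    {'cc': 'bbr', 'runtime': 10, 'throughput_test': True, 'delay_test': True},
    {'cc': 'bbr', 'runtime': 10, 'throughput_test': True, 'delay_test': False},
    {'cc': 'copa', 'runtime': 10, 'throughput_test': False, 'delay_test': False},
]
print(count_converged(runs))  # {('bbr', 10): 1, ('copa', 10): 0}
```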

The LEDBAT case¶

The LEDBAT scheme is interesting, as it clearly illustrates the problem of choosing the wrong test runtime. Let us look at the data for one run of each runtime setting...

In [8]:

sample = {'cc':'ledbat',
          'run':1}
runtimes = [10,20,30,40,50,60]
cnt = 0
for i in [0,2,5,4,3,1]:
    title = 'Runtime - %i seconds' %runtimes[cnt]
    custom_layout = {'title': title , 'xaxis' : {'title':'Time [s]'}}
    metrics_design, figure = pantheon.compute_metric( [result_dir_list[i]], 
                                          meta_data_file, 
                                          convergence,
                                          [metric_tput],
                                          plot=True,
                                          showplot=False,
                                          layout=custom_layout,
                                          verbose=False,
                                          sample=sample)
    cnt += 1
    figure.show()

One clearly sees that LEDBAT's throughput eventually converges: reaching the stable throughput value takes about 38 seconds. However, even after 60s of runtime, TriScale's convergence test fails due to the effects of the transient phase.

Now, consider that if one performs 30s-long runs and does not test for convergence, the median throughput will be estimated around 40 Mbps. This is very far from the actual long-running performance of the scheme, which is closer to 92 Mbps. However, to be able to confidently estimate the long-running performance of LEDBAT, TriScale indicates that even 60s runtime is not enough.

When the impact of the start-up phase is too large, two solutions are possible:

Increase the runtime further, or
Prune the start-up time in the raw data, which is fine when one aims to estimate the long-running performance.
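
The pruning option can be illustrated with a small sketch. The trace below is synthetic (a linear ramp followed by a plateau, loosely shaped after the LEDBAT plots above) and the helper prune_startup is hypothetical; it only shows how dropping the start-up samples changes the median.

```python
from statistics import median


def prune_startup(times, values, startup):
    """Keep only the samples recorded at or after `startup` seconds."""
    return [v for t, v in zip(times, values) if t >= startup]


# Synthetic throughput trace (Mbit/s): linear ramp for 38 s, then stable
times = list(range(60))
values = [min(92.0, 92.0 * t / 38) for t in times]

print(median(values))                            # pulled down by the ramp
print(median(prune_startup(times, values, 38)))  # 92.0
```

Choosing the cut-off is itself a design decision: it should be fixed before looking at the results and reported together with the other experiment parameters.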

Conclusion¶

We observe that a 30s runtime is not sufficient for certain schemes to converge (i.e., they do not pass TriScale's convergence test, with confidence 95% and tolerance 1%). This means that, if one used a 30s runtime, the metric data may not be representative of the (expected) long-term performance; that is, the performance one would measure if the runs were longer.

Whether the (expected) long-term performance matters or not depends on the evaluation objective. This Case Study investigates the long-term behavior of full-throttle flows - that is, flows whose only throttling/limiting factor is the congestion control; thus, convergence does matter.

For practical reasons, Pantheon uses a 30s runtime for all congestion-control schemes. To be able to compare the outcomes of the same evaluation using only Pantheon or with TriScale, we keep runtimes of 30s for all tested protocols. However, this implies that some schemes will likely never converge and will yield unusable data.

2. How many runs in a series? When should they run?¶

[Back to top]

How many runs?¶

The decision on the number of test runs is based on the definition of the evaluation KPIs. In other words, what are we trying to evaluate?

TriScale defines KPIs as percentiles of the metric distribution, estimated with a certain level of confidence. The percentile and confidence level are chosen by the user. Generally, we can distinguish two cases:

To evaluate the average performance, one may use as KPIs 'middle' percentiles

For example: the median, or the quartiles (25th and 75th percentiles).

To evaluate the extreme performance, one should use as KPI 'large' (resp. 'small') percentiles

For example: the 95th, or 99th percentile (resp. 5th or 1st percentile).

Intuitively, estimating 'large' or 'small' percentiles requires more data than estimating the median. Furthermore, the higher the confidence level, the more data is required. TriScale allows one to quantify this relation between the percentile, the confidence level, and the required number of samples. Concretely, for a given percentile and confidence level, TriScale returns to the user the minimal number of data points necessary for the estimation. The relation follows from using the so-called Thompson's method to compute confidence intervals for percentiles. In TriScale, this functionality is implemented in the experiment_sizing() function from the triscale.py module.

In this case study, we have two performance metrics: the outgoing throughput and the one-way delay (see the metric definitions above). We choose the quartiles as KPIs to investigate the average performance of the different schemes for these two metrics:

[ KPI 1 ] - 25th percentile of the throughput distribution
[ KPI 2 ] - 75th percentile of the delay distribution

Symmetric percentiles require the same number of samples to be estimated; that is, for a given confidence level, computing a lower bound of the 25th percentile requires the same number of data points as computing an upper bound of the 75th percentile.

The "standard" level of confidence in statistical studies is 95%. Let us start with that...

In [9]:

KPI = {'percentile': 75,
       'confidence': 95}
triscale.experiment_sizing(KPI['percentile'], 
                           KPI['confidence'],
                           CI_class='one-sided',
                           verbose=True,);

A one-sided bound of the 	75-th percentile
with a confidence level of	95 % 
requires a minimum of 		11 samples

TriScale indicates that estimating the 75th/25th percentiles with a confidence level of 95% requires 11 data points; in other words, one should run 11 tests.

Concretely, this means that with 11 data samples, the largest (resp. smallest) data point is an upper bound (resp. lower bound) for the 75th percentile (resp. 25th percentile), and that the probability that this bound is indeed correct is at least 95%.
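
The number 11 can be checked by hand. Assuming independent samples, the largest of N samples fails to bound the 75th percentile only if all N samples fall below that percentile, which happens with probability 0.75^N. The smallest N with 1 - 0.75^N >= 0.95 is therefore the required sample count. The sketch below encodes this reasoning; min_samples is a hypothetical helper written for this explanation, not TriScale's experiment_sizing().

```python
def min_samples(percentile, confidence):
    """Smallest N such that the extreme sample bounds the percentile
    with at least the requested confidence (independent samples assumed)."""
    if not 0 < percentile < 100 or not 0 < confidence < 100:
        raise ValueError('percentile and confidence must be in (0, 100)')
    p = percentile / 100.0
    p = max(p, 1.0 - p)  # symmetric percentiles need the same N
    c = confidence / 100.0
    n = 1
    while 1.0 - p ** n < c:
        n += 1
    return n


print(min_samples(75, 95))  # 11
print(min_samples(25, 95))  # 11, by symmetry
```

Because the failure probability shrinks geometrically with N, relaxing the confidence level reduces the required number of runs quickly.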

However, the experiments available on the Pantheon website report a maximum of 10 runs per series. Thus, we choose to decrease the confidence level such that 10 data points are sufficient.

In [10]:

KPI = {'percentile': 75,
       'confidence': 75}
triscale.experiment_sizing(KPI['percentile'], 
                           KPI['confidence'],
                           CI_class='one-sided',
                           verbose=True,);

A one-sided bound of the 	75-th percentile
with a confidence level of	75 % 
requires a minimum of 		5 samples

With a confidence level of 75%, 5 data points are sufficient to estimate 75th/25th percentiles.

According to our preliminary tests, the majority of the congestion-control schemes we look at often converge more than 5 times out of 10; therefore, this setting is likely to allow TriScale to compute KPIs for most schemes.

TriScale can also compute the number of data points required such that the (k+1)th-largest value is an upper-bound for the 75th percentile (instead of using the largest value as bound). Practically, this means that the KPI would exclude the worst-performing k runs.

This can be done by passing the desired k value as the robustness parameter of the experiment_sizing() function. Let us try a few values...

In [11]:

KPI = {'percentile': 75,
       'confidence': 75}
to_ignore = [0,1,2,3,10]
for k in to_ignore:
    triscale.experiment_sizing(KPI['percentile'], 
                               KPI['confidence'],
                               robustness=k,
                               CI_class='one-sided',
                               verbose=True,);

A one-sided bound of the 	75-th percentile
with a confidence level of	75 % 
requires a minimum of 		5 samples

A one-sided bound of the 	75-th percentile
with a confidence level of	75 % 
requires a minimum of 		10 samples
with the worst 			1 run(s) excluded

A one-sided bound of the 	75-th percentile
with a confidence level of	75 % 
requires a minimum of 		15 samples
with the worst 			2 run(s) excluded

A one-sided bound of the 	75-th percentile
with a confidence level of	75 % 
requires a minimum of 		20 samples
with the worst 			3 run(s) excluded

A one-sided bound of the 	75-th percentile
with a confidence level of	75 % 
requires a minimum of 		51 samples
with the worst 			10 run(s) excluded

Naturally, the more outliers we would like our KPI to exclude, the more data points are required! However, it is interesting to see that with already 10 data points, we can "ignore" the most extreme value.

It is important to correctly understand what is meant by excluding the worst-performing runs. With this example and 10 runs: the 75th percentile of the delay is smaller than the 9th-smallest value (that is, the second-largest value) with a probability larger than or equal to 75%.

It is not that the worst run is "dropped"; rather, there are enough data points such that the percentile of interest is bounded by a data point which is not the largest (but the second largest, in this example).

Thus, if we have 10 metric values for a given protocol, the resulting KPIs will not be affected by the worst-performing test run.
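
The sample counts printed above can be reproduced with a short binomial computation. Assuming independent runs, the (k+1)-th largest of N values fails to bound the 75th percentile only when at most k values lie above that percentile, and the number of values above it follows a binomial distribution with success probability 0.25. The helpers below are hypothetical names written for this explanation; they are a sketch of the reasoning, not TriScale's implementation.

```python
from math import comb


def prob_bound_holds(n, p, k):
    """P(the (k+1)-th largest of n samples is >= the p-quantile)."""
    q = 1.0 - p
    # The bound fails when at most k samples lie above the quantile
    fail = sum(comb(n, i) * q ** i * p ** (n - i) for i in range(k + 1))
    return 1.0 - fail


def min_samples_robust(percentile, confidence, k=0):
    """Smallest n for which the (k+1)-th largest sample is a valid bound."""
    p = percentile / 100.0
    c = confidence / 100.0
    n = k + 1
    while prob_bound_holds(n, p, k) < c:
        n += 1
    return n


for k in (0, 1, 2, 3):
    print(k, min_samples_robust(75, 75, k))  # 5, 10, 15, 20
```

With 10 runs and k = 1, prob_bound_holds(10, 0.75, 1) is about 0.756, just above the 75% target, which is why 10 is the minimum in that case.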

When should the tests run?¶

This case study uses the MahiMahi network emulator as test environment. Although emulators may fail to capture some of the dynamics of "real" networks, they provide a reproducible test setup, which is particularly useful for comparisons.

Experiments running in MahiMahi do not have the time dependencies that are expected in real networks. Therefore, there is no need for profiling the emulated network: by design, it is always "the same".

Conclusion¶

We define our KPIs as the 75th and 25th percentiles for the delay and throughput metrics, respectively.
We choose a confidence level of 75% for the estimation of these KPIs.
This results in a minimum of 5 data points necessary to compute our KPIs.
We choose to perform 10 runs per series. This is motivated by three reasons:
- Most protocols do not always converge with 30s runtime. Performing more runs increases the chance of obtaining at least 5 data points (i.e., 5 runs that have converged) per series.
- If all runs do converge, having 10 data points would exclude the worst-performing run from our KPIs.
- Currently, the Pantheon website reports a maximum of 10 runs per series. Choosing the same setting facilitates the comparison between Pantheon and TriScale's evaluation procedure.
The test runs can be performed at any time since the evaluation is performed within the MahiMahi network emulator.

Remark

One may choose different KPI parameters for the different metrics (i.e., it does not have to be 25/75th, or 5/95th percentiles).
For example, the evaluation of a real-time protocol would use a 'large' percentile and high confidence level as KPI for the delays, but could simultaneously estimate the average power consumption using the median as a second KPI.

3. How many series?¶

[Back to top]

We chose the number of runs per series to be able to compute our performance KPIs. However, how confident will we be about the obtained KPI values? How likely is it that the computed values are representative of the true performance of the protocols we are evaluating?

In other words, is the experiment reproducible?

TriScale tackles this question by performing multiple series of test runs (also called sequels), where each series produces one value per KPI (two KPIs in this example). TriScale assesses the reproducibility of an experiment by quantifying the variability in the KPI values; TriScale computes a variability score for each KPI in an evaluation. TriScale's variability scores do not settle whether an experiment is reproducible or not; they quantify reproducibility. The larger the score, the more variability, and therefore the less reproducible the experiment is.

TriScale computes variability scores using the same approach as for the KPIs: it estimates the upper and lower bounds for a (symmetric) pair of percentiles of the KPI distributions. Thus, as we did for the choice of the number of runs (per series), we must perform the appropriate number of series to estimate the chosen percentiles.

In this case study, we choose the same percentiles and confidence level for the score as for the KPI. Thus, we must perform a minimum of 5 series to compute TriScale's variability scores.

Remarks

Using the same parameters for the KPIs and the score has no particular benefit. It just happens to be convenient in this example.
Like for the KPIs, it is possible to choose different score parameters for the different performance dimensions.

In [12]:

score = {'percentile': 75,
         'confidence': 75}
triscale.experiment_sizing(score['percentile'], 
                           score['confidence'],
                           CI_class='two-sided',
                           verbose=True,);

A two-sided bound of the 	75-th percentile
with a confidence level of	75 % 
requires a minimum of 		5 samples

Let us call UB and LB the upper and lower bounds for the 75th and 25th percentiles, respectively. The variability score is UB - LB (with the same dimension and unit as the KPI values).

The interpretation of such a score is the following:

With a probability of 75%,

75% of series result in KPI values below UB and
75% of series result in KPI values above LB.

In other words, 50% of series result in KPIs that differ by a maximum of UB - LB.
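
As a minimal sketch of the score computation: when exactly five series are available (the minimum found above), the only order statistics that can serve as bounds are the extremes, so UB is the largest and LB the smallest KPI value. This reasoning, the helper name variability_score, and the five KPI values are assumptions made for illustration; with more series, tighter order statistics could be used instead.

```python
def variability_score(kpi_values):
    """Return (LB, UB, score) using the extreme KPI values as bounds.
    Only meaningful when len(kpi_values) equals the minimum series count."""
    lower, upper = min(kpi_values), max(kpi_values)
    return lower, upper, upper - lower


# Five made-up delay KPIs (ms), one per series
kpis = [88.0, 86.5, 87.25, 87.0, 87.75]
print(variability_score(kpis))  # (86.5, 88.0, 1.5)
```

The score keeps the unit of the KPI (here milliseconds), so scores of different metrics are not directly comparable unless they are normalized.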

Summary of the Experiment Design¶

The experimental design is now completed. First, we have defined the evaluation objectives; that is,

The metrics,
The convergence,
The KPIs,
The variability scores.

In [13]:

# Metrics
metric_tput = {'name':'Average Throughput',
               'unit': 'Mbit/s',
               'measure':50,
               'bounds':[0,120],  # expected value range
               'tag':'throughput' # do not change the tag
              }
metric_delay = {'name':'95th perc. of One-way delay',
                'unit': 'ms',
                'measure':95,
                'bounds':[0,100], # expected value range
                'tag':'delay'     # do not change the tag
               }

# Convergence parameters
convergence = {'expected': True,
               'confidence': 95,  # in %
               'tolerance': 5,    # in %
              }

# KPIs
KPI_tput  = {'percentile': 25,
             'confidence': 75,
             'name': 'Average Throughput',
             'unit': 'Mbit/s',
             'bounds':[0,120],    # expected value range
             'tag':'throughput'   # do not change the tag
            }
KPI_delay = {'percentile': 75,
             'confidence': 75,
             'name': '95th perc. of One-way delay',
             'unit': 'ms',
             'bounds':[0,100],    # expected value range
             'tag':'delay'        # do not change the tag
            }

# Variability scores
score_tput  = {'percentile': 75,
             'confidence': 75,
             'name': 'Throughput',
             'unit': 'Mbit/s',
             'bounds':[0,120],    # expected value range
             'tag':'throughput'   # do not change the tag
            }
score_delay = {'percentile': 75,
             'confidence': 75,
             'name': 'One-way delay',
             'unit': 'ms',
             'bounds':[0,100],    # expected value range
             'tag':'delay'        # do not change the tag
            }

Based on these objectives, we decided on suitable parameters for the data collection; that is, we sized the data collection such that we obtain sufficient data to compute the defined KPIs and variability scores.

Parameter	Value	Short Description
#runs	10	number of runs per series
#series	5	number of series
runtime	30	length of one run
span	anytime	time interval for an entire series

This handful of parameters is sufficient to completely describe the entire evaluation using TriScale.

Data Collection¶

[Back to top]

TriScale does not perform the data collection (sorry); TriScale does help decide what data must be collected, but it is up to the user to actually collect it.

For this Case Study, we collect all the data using the Pantheon framework, which is open source and available on GitHub. We use the local mode of Pantheon, which runs the MahiMahi network emulator (detailed settings below). The only modification we made to the code is to save intermediate outputs in CSV files for performing the analysis with TriScale.

Git Commit

397dcf5960b462fc5497f8961856266bc9fbea78

Emulation command

mm-delay 45 mm-link 114.68mbps.trace 114.68mbps.trace --uplink-queue=droptail --uplink-queue-args=packets=450

Emulation description

Calibrated to the real path from AWS California to Mexico

All the data collected for this case study is publicly available.

Analysis¶

[Back to top]

The data has been collected (download instructions are available in this notebook). It is time for analysis.

TriScale divides the analysis in three steps (one for each time scale):

Compute the metrics,
Compute the KPIs,
Compute the variability scores.

1. Compute the Metrics¶

[Back to top]

In [14]:

# Construct the path to the different test results
result_dir_path = Path('UseCase_Pantheon/PantheonData/10runs_30s')
result_dir_list = [x for x in result_dir_path.iterdir() if x.is_dir()]

# Meta data file name
meta_data_file  = 'pantheon_metadata.json'

# Config file name and path
config_file     = Path('UseCase_Pantheon/PantheonData/config.yml')

# Metric list
metric_list = [metric_tput, metric_delay]

# Outfile name
# -> Either save output as csv, or retrieve from file if already exists
out_name = Path('UseCase_Pantheon/PantheonData/metrics.csv')

# Execute convergence tests and compute the metrics
metrics = pantheon.compute_metric(result_dir_list, 
                                  meta_data_file, 
                                  convergence,
                                  metric_list,
                                  out_name=out_name,
                                  force_computation=False,
                                  plot=False,
                                  verbose=False)

Output retrieved from file. Skipping computation.

For each series, we can look at the number of resulting data points;
that is, how many runs have converged and resulted in valid metric values.

In [15]:

# Visualize the results: plot the number of converged runs (out of 10)
convergence_results, figure = pantheon.compute_plot_convergence(metrics, config_file, 'datetime', show=False)
figure.show()

As expected from the preliminary study in the Experiment Design, some schemes rarely converge (Copa, LEDBAT, QUIC Cubic, PCC-Allegro, and Verus). Those schemes that do not converge (at least) 5 times in each series cannot be processed further: there is not enough data to compute the chosen KPI values.

2. Compute the KPIs¶

[Back to top]

The compute_kpi() function is a simple wrapper that loops through all protocols and test series, and calls the analysis_kpi() function from the TriScale module. compute_kpi() returns a Pandas dataframe with all KPI values.
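
To make the KPI computation concrete, the sketch below picks, for a list of metric values, the smallest order statistic that upper-bounds a percentile with the requested confidence, following the binomial reasoning of Thompson's method under the assumption of independent runs. The function names are hypothetical and this is not the code of analysis_kpi(); a lower bound for the 25th percentile follows by symmetry (for example by negating the values).

```python
from math import comb


def binom_cdf(m, n, p):
    """P(X <= m) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(m + 1))


def upper_bound(values, percentile, confidence):
    """Tightest order statistic that upper-bounds the percentile, or None
    when there are too few values for the requested confidence."""
    p = percentile / 100.0
    c = confidence / 100.0
    ordered = sorted(values)
    n = len(ordered)
    for m in range(1, n + 1):
        # The m-th smallest value bounds the quantile
        # iff at most m-1 samples fall below the quantile
        if binom_cdf(m - 1, n, p) >= c:
            return ordered[m - 1]
    return None


print(upper_bound(list(range(1, 11)), 75, 75))  # 9: the second-largest of 10
print(upper_bound([1, 2, 3, 4], 75, 75))        # None: fewer than 5 values
```

The None case corresponds to the series with fewer than 5 converged runs, for which no KPI can be computed.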

In [16]:

# Config file name and path
config_file = Path('UseCase_Pantheon/PantheonData/config.yml')

# KPIs list
kpi_list = [ KPI_tput, KPI_delay ]

# Compute the KPIs
KPIs = pantheon.compute_kpi(metrics,
                            kpi_list,
                            series='datetime',
                            plot=False,
                            verbose=False)

In [17]:

# Uncomment to visualize the output DataFrame

# KPIs

We can output the number of series that lead to a valid KPI for each scheme:

In [18]:

df = KPIs.copy()
df = df.loc[(df['throughput_test'] == True) & (df['delay_test'] == True)]
df.dropna(inplace=True)
df = df.groupby(['cc'], as_index=False, observed=False).count()
df = df[['cc','datetime']]
df

Out[18]:

	cc	datetime
0	bbr	5
1	cubic	5
2	fillp	5
3	fillp_sheep	5
4	indigo	5
5	pcc_experimental	5
6	scream	5
7	sprout	5
8	taova	4
9	vegas	2
10	vivace	5
11	webrtc	4

The Vegas scheme only had two series with at least 5 converged runs; thus we expected a maximum of 2 valid KPIs. No surprise here.

The case of WebRTC is more interesting. One of the series fails TriScale's independence test for the "Throughput" metric. Let us look at the autocorrelation plot for that series, and compare it to another series that successfully passed the test.

In [19]:

layout = {'title': 'Autocorrelation - Test failed'}
sample = metrics.loc[(metrics['cc'] == 'webrtc') & (metrics['datetime'] == '2019-08-21T12:14:13:+0200')]
plot_failed = triplots.autocorr_plot(sample.throughput_value.values,
                                    layout=layout)

layout = {'title': 'Autocorrelation - Test passed'}
sample = metrics.loc[(metrics['cc'] == 'webrtc') & (metrics['datetime'] == '2019-08-22T07:59:10:+0200')]
plot_passed = triplots.autocorr_plot(sample.throughput_value.values,
                                    layout=layout)

The autocorrelation coefficients must lie in the shaded grey area for the test to pass: the series plotted under "Test passed" satisfies this, whereas the one plotted under "Test failed" does not.

However, there is no clear difference in the correlation structure of the two series, i.e., the failed series does not seem significantly more correlated than the one that passed. All other series of WebRTC pass the independence test, which hints that the failed test is merely an artifact induced by the small number of runs in the series (which was set to 10 in this example). In such cases, it is important that the user critically assesses TriScale's results, in order to increase, when necessary, the number of runs or series and improve the significance of results.

In this case, we overrule TriScale's independence test for WebRTC and consider that all runs have converged.
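
For readers who want to experiment with such checks outside TriScale, the sketch below computes sample autocorrelation coefficients and compares them with the usual white-noise band of +/- 1.96/sqrt(N). This band is a common rule of thumb and is only assumed here to resemble the shaded area of the plots; the function names are hypothetical and this is not TriScale's independence test.

```python
from math import sqrt


def autocorr(values, lag):
    """Sample autocorrelation coefficient of `values` at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return 0.0  # constant series: no variation to correlate
    num = sum((values[i] - mean) * (values[i + lag] - mean)
              for i in range(n - lag))
    return num / denom


def looks_independent(values, max_lag=None):
    """True if all coefficients up to max_lag stay within +/- 1.96/sqrt(N)."""
    n = len(values)
    max_lag = n // 2 if max_lag is None else max_lag
    bound = 1.96 / sqrt(n)
    return all(abs(autocorr(values, lag)) <= bound
               for lag in range(1, max_lag + 1))


trend = list(range(1, 11))           # steadily increasing: strongly correlated
print(looks_independent(trend))      # False
```

With only 10 values the band is wide (about 0.62), yet a single coefficient drifting outside it is enough to fail, which is consistent with treating a lone failure among otherwise passing series with some caution.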

For each series, we plot a two-dimensional representation of the KPIs for the different schemes, which allows comparing them unambiguously.

In [20]:

data_path   = Path('UseCase_Pantheon/PantheonData/10runs_30s')
meta_file   = 'pantheon_metadata.json'
config_file = Path('UseCase_Pantheon/PantheonData/config.yml')

series_label = np.sort(metrics['datetime'].unique())
cnt = 0

for series_ix in series_label:
    cnt += 1

    # Get the metrics values for one series
    metric_series = metrics.loc[metrics['datetime'] == series_ix]
    meta_file_path = str(data_path / series_ix / meta_file)

    # Plot them
    custom_layout = {
        'title': 'Series %i' %cnt,
        "width":700,
        "height":700,
    }
    pantheon.plot_triscale_kpi(metric_series, 
                  meta_file_path, 
                  kpi_list,
                  config_file,
                  layout=custom_layout, 
                  show=True)

By comparison, the plots illustrating the Pantheon paper are much harder to interpret...

In [21]:

data_path = Path('UseCase_Pantheon/PantheonData/10runs_30s/2019-08-20T15:34:33:+0200')
plot_path = Path('plots/Pantheon')
perf_file = data_path / 'pantheon_perf.json'
meta_file = data_path / 'pantheon_metadata.json'
config_file = Path('UseCase_Pantheon/PantheonData/config.yml')

custom_layout = {
    "title":None,
    "width":700,
    "height":700,
}
pantheon.plot_pantheon(perf_file, 
                      meta_file, 
                      config_file, 
                      layout=custom_layout, 
                      show=True);

3. Compute the variability scores¶

[Back to top]

As before, the compute_score() function is a simple wrapper that loops through all protocols and test series, and calls the analysis_score() function from the TriScale module. compute_score() returns a Pandas dataframe with the scores of all schemes and metrics.
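The wrapper pattern described above can be sketched as follows. This is only an illustration of the loop structure: `spread_score()` is a hypothetical stand-in (a plain min/max spread), not TriScale's `analysis_score()`, and the rows are made-up values. The real `compute_score()` returns a Pandas dataframe, whereas this sketch returns a list of dictionaries to stay dependency-free.

```python
from collections import defaultdict

def spread_score(values):
    """Hypothetical stand-in for the per-scheme analysis (NOT analysis_score())."""
    return {
        "lower": min(values),
        "upper": max(values),
        "score": max(values) - min(values),
    }

def compute_score_sketch(kpi_rows, metric):
    """Group the KPI values (one per series) by scheme, then score each group."""
    per_scheme = defaultdict(list)
    for row in kpi_rows:
        per_scheme[row["cc"]].append(row[metric])
    return [{"cc": cc, **spread_score(vals)}
            for cc, vals in sorted(per_scheme.items())]

# Made-up KPI values: one row per (scheme, series)
kpi_rows = [
    {"cc": "scheme_a", "delay": 10.0},
    {"cc": "scheme_a", "delay": 12.0},
    {"cc": "scheme_b", "delay": 50.5},
    {"cc": "scheme_b", "delay": 49.5},
]
result = compute_score_sketch(kpi_rows, "delay")
```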

In [22]:

# Scores list
score_list = [ score_tput, score_delay ]

# Compute the variability scores
scores = pantheon.compute_score(KPIs,
                            score_list,
                            plot=False,
                            verbose=False)
# Display the results
scores

Out[22]:

	cc	runtime	throughput_test	throughput_lower	throughput_upper	throughput_score	throughput_relative	delay_test	delay_lower	delay_upper	delay_score	delay_relative
0	bbr	30	True	114.672000	114.672000	0.000000	0.000000	True	86.1780	88.3675	2.1895	0.025088
1	copa	30	False	NaN	NaN	NaN	NaN	False	NaN	NaN	NaN	NaN
2	cubic	30	True	114.672000	114.672000	0.000000	0.000000	True	89.1825	89.4120	0.2295	0.002570
3	fillp	30	True	110.633824	112.520768	1.886944	0.016912	True	72.1720	73.8380	1.6660	0.022820
4	fillp_sheep	30	True	110.038912	111.158016	1.119104	0.010119	True	70.9340	71.9830	1.0490	0.014680
5	indigo	30	True	101.544000	104.592000	3.048000	0.029573	True	49.0520	49.1580	0.1060	0.002159
6	ledbat	30	False	NaN	NaN	NaN	NaN	False	NaN	NaN	NaN	NaN
7	pcc	30	False	NaN	NaN	NaN	NaN	False	NaN	NaN	NaN	NaN
8	pcc_experimental	30	True	101.023360	104.695424	3.672064	0.035700	True	69.9010	70.5040	0.6030	0.008589
9	quic	30	False	NaN	NaN	NaN	NaN	False	NaN	NaN	NaN	NaN
10	scream	30	True	0.213880	0.214264	0.000384	0.001794	True	46.6315	47.4760	0.8445	0.017948
11	sprout	30	True	9.153008	9.273728	0.120720	0.013103	True	50.2700	50.3455	0.0755	0.001501
12	taova	30	True	85.210368	88.510400	3.300032	0.037992	True	69.6470	69.7260	0.0790	0.001134
13	vegas	30	True	82.656000	88.320000	5.664000	0.066255	True	73.8140	77.2950	3.4810	0.046073
14	verus	30	True	NaN	NaN	NaN	NaN	True	NaN	NaN	NaN	NaN
15	vivace	30	True	107.606080	108.000832	0.394752	0.003662	True	48.2960	48.4245	0.1285	0.002657
16	webrtc	30	True	2.313160	2.405408	0.092248	0.039100	True	47.1280	47.1775	0.0495	0.001050
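The columns of the table are related in a simple way, which the sketch below checks on three delay entries copied from it. The relationship is inferred from the reported values, not taken from TriScale's source: the score appears to be the width of the interval (`upper - lower`), and the relative score appears to be that width divided by the interval midpoint.

```python
# (lower, upper, score, relative) for the delay metric, copied from the table above
delay_rows = {
    "bbr":    (86.1780, 88.3675, 2.1895, 0.025088),
    "vegas":  (73.8140, 77.2950, 3.4810, 0.046073),
    "webrtc": (47.1280, 47.1775, 0.0495, 0.001050),
}

def recompute(lower, upper):
    """Recompute score and relative score from the interval bounds.

    Assumption (inferred from the table): relative = score / midpoint.
    """
    score = upper - lower
    relative = score / ((lower + upper) / 2)
    return score, relative

for cc, (lower, upper, score, relative) in delay_rows.items():
    new_score, new_relative = recompute(lower, upper)
    print(cc, round(new_score, 4), round(new_relative, 6))
```

Normalizing by the midpoint makes the variability of schemes with very different absolute values comparable, for instance SCReAM's sub-Mbit/s throughput against BBR's.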

We can finally display the results in a more legible way...

In [23]:

pantheon.plot_triscale_scores_matrix( scores,
                                      score_list,
                                      config_file)

Conclusions¶

[Back to top]

This case study considered only a single emulated path. As such, it does not aim to fully capture the performance of the different congestion-control schemes. Rather, it illustrates how TriScale may be used for an actual performance evaluation, and how important it is to carefully choose the parameters of an experiment, for example the runtime.

Two important takeaways are that

- it is important to critically consider TriScale's results: the tests are intentionally conservative to limit the risk of false positives (i.e., not detecting correlation in the data);
- collecting more samples than strictly necessary improves the significance of the tests and limits the risk of false negatives.

For further discussions about TriScale design, usage, and limitations, refer to the TriScale paper.
