#!/usr/bin/env python
# coding: utf-8

# ## TriScale
#
# # Case Study - Congestion-Control Schemes
#
# This notebook presents a use case of the TriScale framework: a comparison between 17 congestion-control schemes (listed below). Elements of this use case are described in the [TriScale paper](https://doi.org/10.5281/zenodo.3464273) (submitted to NSDI'2020).
# * The evaluation is based on the MahiMahi emulator.
# * The evaluation data have been collected through the Pantheon framework.
# * The evaluation uses the emulated network [calibrated to the real path from AWS California to Mexico](https://pantheon.stanford.edu/result/353/) (see details in the [Data Collection](#Data-Collection) section).
# * The evaluation considers the single-flow scenario only.
# * The evaluation focuses on the long-term behavior of full-throttle flows - that is, flows whose only throttling/limiting factor is the congestion control.
# * The evaluation considers two performance dimensions:
#     * The one-way delay
#     * The (egress) throughput
#
# ## Menu
# - [List of Schemes](#List-of-Schemes)
# - [Evaluation Objectives](#Evaluation-Objectives)
# - [List of Imports](#List-of-Imports)
# - [Download Source Files and Data](#Download-Source-Files-and-Data)
# - [Experiment Design](#Experiment-Design)
#     - [1. How long is a run?](#1.-How-long-is-a-run?)
#     - [2. How many runs in a series? When should they run?](#2.-How-many-runs-in-a-series?-When-should-they-run?)
#     - [3. How many series?](#3.-How-many-series?)
# - [Data Collection](#Data-Collection)
# - [Analysis](#Analysis)
#     - [1. Compute the Metrics](#1.-Compute-the-Metrics)
#     - [2. Compute the KPIs](#2.-Compute-the-KPIs)
#     - [3. Compute the variability scores](#3.-Compute-the-variability-scores)
# - [Conclusions](#Conclusions)
#
# ## List of Schemes
#
# The congestion-control schemes included in the evaluation are:
# - TCP BBR
# - Copa
# - TCP Cubic
# - FillP
# - FillP-Sheep
# - Indigo
# - LEDBAT
# - PCC-Allegro
# - PCC-Expr
# - QUIC Cubic
# - SCReAM
# - TaoVA-100x
# - TCP Vegas
# - Verus
# - PCC-Vivace
# - WebRTC media
#
# ## Evaluation Objectives
# [[Back to top](#Case-Study---Congestion-Control-Schemes)]
#
# This evaluation aims to compare congestion-control schemes. For a fair comparison, all schemes are tested using the [MahiMahi network emulator](http://mahimahi.mit.edu/).
#
# The **purpose of this evaluation is not to "crown"** one scheme, but rather to illustrate with a concrete example how TriScale can be used and how the framework avoids certain shortcomings in the experiment design and data analysis.

# ## List of Imports
#
# [[Back to top](#Case-Study---Congestion-Control-Schemes)]

# In[1]:

from pathlib import Path
import json
import yaml
import numpy as np
import os
import pandas as pd
import zipfile

from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = "notebook"


# ## Download Source Files and Data
# [[Back to top](#Case-Study---Congestion-Control-Schemes)]
#
# The entire dataset and source code of this case study are available on Zenodo:
#
# [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3458116.svg)](https://doi.org/10.5281/zenodo.3458116)
#
# The wget commands below download the files required to reproduce this case study. **Beware! The .zip file is 2.7 GB.** Downloading and unzipping might take a while...
# In[2]:

# Set `download = True` to download (and extract) the data from this case study.
# If needed, adjust the `record_id` for the file version you are interested in.
# For reproducing the original TriScale paper, set `record_id = 3451418`

download = False
record_id = 3458116  # version 2 (https://doi.org/10.5281/zenodo.3458116)

files = ['triscale.py', 'helpers.py', 'triplots.py', 'UseCase_Pantheon.zip']

if download:
    for file in files:
        print(file)
        url = 'https://zenodo.org/record/' + str(record_id) + '/files/' + file
        os.system('wget %s' % url)
        if file[-4:] == '.zip':
            with zipfile.ZipFile(file, "r") as zip_file:
                zip_file.extractall()
    print('Done.')


# We now import the custom modules that we just downloaded.
# - `triscale` is the main module of TriScale. It contains all the functions of TriScale's API; that is, the functions meant to be called by the user.
# - `triplots` is a module of TriScale that contains all its plotting functions.
# - `pantheon` is a module specific to this case study. It contains helper functions that loop through the different congestion-control schemes, call TriScale's API functions, and create visualizations.

# In[3]:

import triscale
import triplots
import UseCase_Pantheon.pantheon as pantheon


# ## Experiment Design
# [[Back to top](#Case-Study---Congestion-Control-Schemes)]
#
# One needs to answer a set of questions to fully design an experiment. They relate to three different time scales:
# 1. **\[Run\]** How long is a run?
# 2. **\[Series\]** How many runs in a series? When should they run?
# 3. **\[Sequel\]** How many series?
#
# TriScale assists the user in answering these questions, as described below.
#
# ### 1. How long is a run?
# [[Back to top](#Case-Study---Congestion-Control-Schemes)]
#
# The scenario evaluates the long-term behavior of full-throttle flows - that is, flows whose only throttling/limiting factor is the congestion control. Thus, one must decide on a running time that is long enough to actually capture this long-term behavior.
#
# TriScale implements a _convergence test_ that helps assess whether a run is indeed long enough to estimate the long-term behavior. Passing this test suggests that the protocol behavior has converged (with a certain level of confidence): the run is then considered long enough to estimate the long-term behavior.
#
# All networking protocols are different, and there is no reason to assume that the same run time is necessary for all (certain protocols may be very stable, while others require more time to converge). The only way to estimate the required length is to actually test the different protocols and observe when the runs become long enough to pass the convergence test.
#
# Hence, we perform a preliminary series of tests with varying run times and test the convergence of the different congestion-control schemes. The first thing to do is to specify the parameters for the convergence test and the performance metrics.

# In[4]:

# Parameters for the convergence test
# -> TriScale defaults
convergence = {'expected'  : True,
               'confidence': 95,  # in %
               'tolerance' : 1,   # in %
               }

# Throughput metric -> Median
metric_tput = {'name':    'Average Throughput',
               'unit':    'Mbit/s',
               'measure': 50,
               'bounds':  [0, 120],     # expected value range
               'tag':     'throughput'  # do not change the tag
               }

# Delay metric -> 95th percentile
metric_delay = {'name':    '95th perc. of One-way delay (ms)',
                'unit':    'ms',
                'measure': 95,
                'bounds':  [0, 100],  # expected value range
                'tag':     'delay'    # do not change the tag
                }
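
# To make the `measure` field concrete: as the comments in the cell above indicate, the metric of a run is the requested percentile of that run's raw measurements (the median for the throughput, the 95th percentile for the delay). The short sketch below illustrates this mapping with synthetic data; it is an illustration only and not part of the case study.

# Illustration only: hypothetical per-run measurements
rng = np.random.default_rng(0)
throughput_samples = rng.normal(90, 5, 1000)     # hypothetical throughput samples (Mbit/s)
delay_samples      = rng.gamma(2.0, 10.0, 1000)  # hypothetical one-way delays (ms)

# 'measure' is the percentile that summarizes one run into a single metric value
print('Throughput metric: %.1f Mbit/s' % np.percentile(throughput_samples, metric_tput['measure']))
print('Delay metric:      %.1f ms'     % np.percentile(delay_samples, metric_delay['measure']))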

# These are all the inputs TriScale requires to perform the convergence test and compute the metrics.
# * The convergence test is implemented in a dedicated `convergence_test()` function in the `helpers.py` module.
# * The `analysis_metric()` function is the user API, implemented in the `triscale.py` module. `analysis_metric()` calls `convergence_test()`, handles the plotting, and produces the textual outputs.
# * The `compute_metric()` function is a simple wrapper that loops through all congestion-control schemes and runs, and calls the `analysis_metric()` function. It returns a Pandas DataFrame with all convergence test results and metric values.

# In[5]:

# Construct the path to the different test results
# (you may need to adjust `result_dir_path` if you changed the default download location)
result_dir_path = Path('UseCase_Pantheon/PantheonData/10_20_30_40_50_60s')
result_dir_list = [x for x in result_dir_path.iterdir() if x.is_dir()]

# Meta data file name
meta_data_file = 'pantheon_metadata.json'

# Config file name and path
config_file = Path('UseCase_Pantheon/PantheonData/config.yml')

# Output file
out_name = Path('UseCase_Pantheon/PantheonData/prelim_metrics.csv')

# Metric list
metric_list = [metric_tput, metric_delay]

# Execute convergence tests and compute the metrics
metrics_design = pantheon.compute_metric(result_dir_list,
                                         meta_data_file,
                                         convergence,
                                         metric_list,
                                         plot=False,
                                         out_name=out_name,
                                         verbose=False)


# In[6]:

# Visualize the results: plot the number of converged runs (out of 10)
convergence_results, figure = pantheon.compute_plot_convergence(metrics_design,
                                                                config_file,
                                                                'runtime',
                                                                show=False)
figure.show()


# The call below displays the DataFrame containing the number of converged test runs (i.e., the data from the plot just above).

# In[7]:

convergence_results


# Each scheme is run 10 times. For each run, we execute TriScale's convergence test for each performance dimension (the one-way delay and the egress throughput). We count a run as "converged" if both convergence tests pass.
#
# As expected, the different schemes need different amounts of time to converge (see plot above).
# - The majority of schemes converge between 8 and 10 times (out of 10) with a runtime of 30s.
# - Some schemes converge less than half the time even with a runtime of 60s.
#
# #### The LEDBAT case
# The _LEDBAT_ scheme is interesting, as it illustrates well the problem of wrongly choosing the test runtime. Let us look at the data for one run of each runtime setting...

# In[8]:

sample = {'cc': 'ledbat', 'run': 1}
runtimes = [10, 20, 30, 40, 50, 60]
cnt = 0
for i in [0, 2, 5, 4, 3, 1]:
    title = 'Runtime - %i seconds' % runtimes[cnt]
    custom_layout = {'title': title,
                     'xaxis': {'title': 'Time [s]'}}
    metrics_design, figure = pantheon.compute_metric([result_dir_list[i]],
                                                     meta_data_file,
                                                     convergence,
                                                     [metric_tput],
                                                     plot=True,
                                                     showplot=False,
                                                     layout=custom_layout,
                                                     verbose=False,
                                                     sample=sample)
    cnt += 1
    figure.show()


# One clearly sees that the _LEDBAT_ throughput eventually converges: reaching the stable throughput value takes about 38 seconds. However, even after 60s of runtime, TriScale's convergence test fails due to the effects of the transient phase.
#
# Now, consider that if one performs 30s-long runs and does not test for convergence, the median throughput will be estimated at around 40 Mbps. This is very far from the actual long-running performance of the scheme, which is closer to 92 Mbps. However, to confidently estimate the long-running performance of _LEDBAT_, TriScale indicates that even a 60s runtime is not enough.
#
# When the impact of the start-up phase is too large, two solutions are possible:
# 1. Increase the runtime further, or
# 2. Prune the start-up phase from the raw data (as sketched below), which is fine when one aims to estimate the long-running performance.
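
# A minimal sketch of option 2 (illustrative only; the column names and values below are made up and are not part of the Pantheon/TriScale code): discard the transient phase before computing the metric of a run.

# Hypothetical per-run trace: time [s] and egress throughput [Mbit/s]
trace = pd.DataFrame({'time':       [5, 15, 25, 35, 45, 55],
                      'throughput': [10, 25, 60, 91, 92, 93]})

warmup = 40  # discard the first 40 seconds (transient phase)
steady_state = trace[trace['time'] >= warmup]
print(steady_state['throughput'].median())  # close to the long-running throughput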
#
# #### Conclusion
#
# We observe that a 30s runtime is not sufficient for certain schemes to converge (i.e., they do not pass TriScale's convergence test, with confidence 95% and tolerance 1%).
# This means that, with a 30s runtime, the metric data may not be representative of the (expected) long-term performance; that is, the performance one would measure if the runs were longer.
#
# Whether the (expected) long-term performance matters or not depends on the evaluation objective. This case study investigates the long-term behavior of full-throttle flows - that is, flows whose only throttling/limiting factor is the congestion control; thus, **convergence does matter**.
#
# For practical reasons, Pantheon uses a 30s runtime for all congestion-control schemes.
# To be able to compare the outcomes of the same evaluation using only Pantheon or with TriScale, **we keep runtimes of 30s for all tested protocols**.
# However, this implies that some schemes will likely never converge and will yield unexploitable data.

# ### 2. How many runs in a series? When should they run?
# [[Back to top](#Case-Study---Congestion-Control-Schemes)]
#
# #### How many runs?
#
# The decision on the number of test runs is based on the definition of the evaluation KPIs. In other words, what are we trying to evaluate?
#
# TriScale defines KPIs as percentiles of the metric distribution, estimated with a certain level of confidence. The percentile and confidence level are chosen by the user. Generally, we can distinguish two cases:
# - To evaluate the **average performance**, one may use 'middle' percentiles as KPIs.
#   For example: the median, or the quartiles (25th and 75th percentiles).
# - To evaluate the **extreme performance**, one should use 'large' (resp. 'small') percentiles as KPIs.
#   For example: the 95th or 99th percentile (resp. the 5th or 1st percentile).
#
# Intuitively, estimating 'large' or 'small' percentiles requires more data than estimating the median. Furthermore, the higher the confidence level, the more data is required.
# TriScale quantifies this relation between the percentile, the confidence level, and the required number of samples. Concretely, for a given percentile and confidence level, TriScale returns to the user the minimal number of data points necessary for the estimation. The relation follows from using the so-called Thompson's method to compute confidence intervals for percentiles. In TriScale, this functionality is implemented in the `experiment_sizing()` function of the `triscale.py` module.
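
# The reasoning behind this sizing can be sketched in a few lines (a simplified illustration of the underlying binomial argument, not TriScale's actual implementation): with N independent runs, the largest metric value fails to upper-bound the p-th percentile only if all N values fall below that percentile, which happens with probability (p/100)^N. Requiring 1 - (p/100)^N >= confidence yields the minimal N; by symmetry, the same N lets the smallest value lower-bound the (100-p)-th percentile.

import math

def min_number_of_runs(percentile, confidence):
    """Minimal N such that the largest of N samples upper-bounds the given
    percentile (>= 50, in %) with the given confidence (in %). Sketch only."""
    p = percentile / 100
    c = confidence / 100
    return math.ceil(math.log(1 - c) / math.log(p))

print(min_number_of_runs(75, 95))  # -> 11, as computed by experiment_sizing() below
print(min_number_of_runs(75, 75))  # -> 5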
#
# In this case study, we have two performance metrics: the outgoing throughput and the one-way delay (see the metric definitions above).
# We chose the quartiles as KPIs to investigate the average performance of the different schemes for these two metrics:
# - \[ KPI 1 \] - 25th percentile of the throughput distribution
# - \[ KPI 2 \] - 75th percentile of the delay distribution
#
# > Symmetric percentiles require the same number of samples to be estimated; that is, for a given confidence level, computing a lower bound for the 25th percentile requires the same number of data points as computing an upper bound for the 75th percentile.
#
# The "standard" level of confidence in statistical studies is 95%. Let us start with that...

# In[9]:

KPI = {'percentile': 75, 'confidence': 95}
triscale.experiment_sizing(KPI['percentile'],
                           KPI['confidence'],
                           CI_class='one-sided',
                           verbose=True,);


# TriScale indicates that estimating the 75th/25th percentiles with a confidence level of 95% requires 11 data points; in other words, one should **run 11 tests**.
#
# Concretely, this means that with 11 data samples, the largest (resp. smallest) data point is an upper bound (resp. lower bound) for the 75th percentile (resp. 25th percentile), and that **the probability that this bound is indeed correct is at least 95%**.
#
# However, the experiments available on the Pantheon website report a maximum of 10 runs per series. Thus, we choose to decrease the confidence level such that 10 data points are sufficient.

# In[10]:

KPI = {'percentile': 75, 'confidence': 75}
triscale.experiment_sizing(KPI['percentile'],
                           KPI['confidence'],
                           CI_class='one-sided',
                           verbose=True,);


# With a confidence level of 75%, **5 data points** are sufficient to estimate the 75th/25th percentiles.
#
# According to our preliminary tests, the majority of the congestion-control schemes we look at often converge more than 5 times out of 10; therefore, this setting is likely to allow TriScale to compute KPIs for most schemes.
#
# TriScale can also compute the number of data points required such that the (`k+1`)-th largest value is an upper bound for the 75th percentile (instead of using the largest value as the bound). Practically, this means that the **KPI would exclude the worst-performing `k` runs**.
#
# This can be done by passing the desired `k` value as the `robustness` parameter of the `experiment_sizing()` function. Let us try a few values...

# In[11]:

KPI = {'percentile': 75, 'confidence': 75}
to_ignore = [0, 1, 2, 3, 10]
for k in to_ignore:
    triscale.experiment_sizing(KPI['percentile'],
                               KPI['confidence'],
                               robustness=k,
                               CI_class='one-sided',
                               verbose=True,);


# Naturally, the more outliers we would like our KPI to exclude, the more data points are required! However, it is interesting to see that with 10 data points, we can already "ignore" the most extreme value.
#
# > It is important to correctly understand **what is meant by _excluding the worst-performing runs_**. With this example and 10 runs: the 75th percentile of the delay is smaller than the second-largest value with a probability larger than or equal to 75%.
# It is not that the worst run is "dropped"; rather, there are enough data points such that the percentile of interest is bounded by a data point which is not the largest (but the second largest, in this example).
#
# Thus, if we have 10 metric values for a given protocol, the resulting KPIs will not be affected by the worst-performing test run.
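
# Extending the sizing sketch from above to the `robustness` parameter (again an illustration of the binomial reasoning, not TriScale's implementation; the exact values reported by `experiment_sizing()` may differ in edge cases): the (k+1)-th largest of N samples upper-bounds the p-th percentile as soon as at least k+1 samples lie at or above that percentile.

import math

def min_runs_with_robustness(percentile, confidence, robustness=0):
    """Minimal N such that the (robustness+1)-th largest of N samples
    upper-bounds the given percentile with the given confidence. Sketch only."""
    p = percentile / 100
    c = confidence / 100
    n = robustness + 1
    while True:
        # Probability that at most `robustness` samples exceed the percentile
        prob_fail = sum(math.comb(n, i) * (1 - p)**i * p**(n - i)
                        for i in range(robustness + 1))
        if 1 - prob_fail >= c:
            return n
        n += 1

for k in [0, 1, 2, 3]:
    print(k, min_runs_with_robustness(75, 75, robustness=k))
# k=0 -> 5 runs and k=1 -> 10 runs, consistent with the discussion above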
#
# #### When should the tests run?
#
# This case study uses the [MahiMahi network emulator](http://mahimahi.mit.edu/) as the test environment. Although emulators may fail to capture some of the dynamics of "real" networks, they provide a reproducible test setup, which is particularly useful for comparisons.
#
# Experiments running in MahiMahi do not have the time dependencies that are expected in real networks. Therefore, there is no need to profile the emulated network: by design, it is always "the same".
#
# #### Conclusion
#
# - We define our **KPIs as the 75th and 25th percentiles** for the delay and throughput metrics, respectively.
# - We choose a **confidence level of 75%** for the estimation of these KPIs.
# - This results in a minimum of 5 data points necessary to compute our KPIs.
# - We choose to perform **10 runs per series**. This is motivated by three reasons:
#     * Most protocols do not always converge with a 30s runtime. Performing more runs increases the chance of obtaining at least 5 data points (i.e., 5 runs that have converged) per series.
#     * If all runs do converge, having 10 data points would exclude the worst-performing run from our KPIs.
#     * Currently, the [Pantheon website](https://pantheon.stanford.edu/) reports a maximum of 10 runs per series. Choosing the same setting facilitates the comparison between Pantheon's and TriScale's evaluation procedures.
# - The **test runs can be performed at any time**, since the evaluation is performed within the MahiMahi network emulator.
#
# >**Remark**
# One may choose different KPI parameters for the different metrics (i.e., it does not have to be the 25th/75th, or 5th/95th percentiles).
# For example, the evaluation of a real-time protocol would use a 'large' percentile and a high confidence level as the KPI for the delays, but could simultaneously estimate the average power consumption using the median as a second KPI.

# ### 3. How many series?
# [[Back to top](#Case-Study---Congestion-Control-Schemes)]
#
# We chose the number of runs per series such that we are able to compute our performance KPIs.
# However, how confident will we be about the obtained KPI values? How likely is it that the computed values are representative of the true performance of the protocols we are evaluating?
#
# In other words, **is the experiment reproducible?**
#
# TriScale tackles this question by performing _multiple series_ of test runs (also called _sequels_), where each series produces one value per KPI (two KPIs in this example). TriScale assesses the reproducibility of an experiment by quantifying the variability of the KPI values; TriScale computes a **variability score** for each KPI in an evaluation.
# TriScale's variability scores do not settle whether an experiment is reproducible or not; they **quantify reproducibility**. The larger the score, the more variability, and therefore the less reproducible the experiment is.
#
# TriScale computes variability scores using the same approach as for the KPIs: it estimates the upper and lower bounds for a (symmetric) pair of percentiles of the _KPI distributions_.
# Thus, as we did when choosing the number of runs (per series), we must perform the appropriate number of series to estimate the chosen percentiles.
#
# In this case study, we choose the same percentiles and confidence level for the scores as for the KPIs. Thus, we must perform a minimum of 5 series to compute TriScale's variability scores.
#
# >**Remarks**
# - Using the same parameters for the KPIs and the scores has no particular benefit. It just happens to be convenient in this example.
# - As for the KPIs, it is possible to choose different score parameters for the different performance dimensions.

# In[12]:

score = {'percentile': 75, 'confidence': 75}
triscale.experiment_sizing(score['percentile'],
                           score['confidence'],
                           CI_class='two-sided',
                           verbose=True,);


# Let us call `UB` and `LB` the upper and lower bounds for the 75th and 25th percentiles, respectively. The variability score is `UB - LB` (with the same dimension and unit as the KPI values).
#
# The interpretation of such a score is the following:
# > **With a probability of 75%,**
# **75% of series result in KPI values below `UB` and**
# **75% of series result in KPI values above `LB`.**
#
# > **In other words, 50% of series result in KPIs that differ by a maximum of `UB - LB`.**
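
# As a minimal illustration of this interpretation (hypothetical KPI values, not data from the case study): with the minimal number of series computed above, the bounds are simply the largest and smallest KPI values, so the variability score is the spread of the KPIs across series.

# Hypothetical throughput KPI values (Mbit/s) obtained from 5 series
kpi_values = np.array([88.2, 90.1, 86.7, 89.4, 91.0])

UB = kpi_values.max()  # upper bound for the 75th percentile of the KPI distribution
LB = kpi_values.min()  # lower bound for the 25th percentile of the KPI distribution
print('Variability score: %.1f Mbit/s' % (UB - LB))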

# #### Summary of the Experiment Design
#
# The experiment design is now complete. First, we defined the evaluation objectives; that is,
# 1. The metrics,
# 2. The convergence parameters,
# 3. The KPIs,
# 4. The variability scores.

# In[13]:

# Metrics
metric_tput = {'name':    'Average Throughput',
               'unit':    'Mbit/s',
               'measure': 50,
               'bounds':  [0, 120],     # expected value range
               'tag':     'throughput'  # do not change the tag
               }
metric_delay = {'name':    '95th perc. of One-way delay',
                'unit':    'ms',
                'measure': 95,
                'bounds':  [0, 100],  # expected value range
                'tag':     'delay'    # do not change the tag
                }

# Convergence parameters
convergence = {'expected':   True,
               'confidence': 95,  # in %
               'tolerance':  5,   # in %
               }

# KPIs
KPI_tput = {'percentile': 25,
            'confidence': 75,
            'name':       'Average Throughput',
            'unit':       'Mbit/s',
            'bounds':     [0, 120],     # expected value range
            'tag':        'throughput'  # do not change the tag
            }
KPI_delay = {'percentile': 75,
             'confidence': 75,
             'name':       '95th perc. of One-way delay',
             'unit':       'ms',
             'bounds':     [0, 100],  # expected value range
             'tag':        'delay'    # do not change the tag
             }

# Variability scores
score_tput = {'percentile': 75,
              'confidence': 75,
              'name':       'Throughput',
              'unit':       'Mbit/s',
              'bounds':     [0, 120],     # expected value range
              'tag':        'throughput'  # do not change the tag
              }
score_delay = {'percentile': 75,
               'confidence': 75,
               'name':       'One-way delay',
               'unit':       'ms',
               'bounds':     [0, 100],  # expected value range
               'tag':        'delay'    # do not change the tag
               }


# Based on these objectives, we decided on suitable parameters for the data collection; that is, we sized the data collection such that we obtain sufficient data to compute the defined KPIs and variability scores.
#
# |Parameter|Value|Short Description|
# |---|:---:|:--|
# |#runs|10|number of runs per series|
# |#series|5|number of series|
# |runtime|30s|length of one run|
# |span|anytime|time interval for an entire series|
#
# This handful of parameters is sufficient to **completely describe** the entire evaluation using TriScale.

# ## Data Collection
# [[Back to top](#Case-Study---Congestion-Control-Schemes)]
#
# TriScale does not perform the data collection (sorry); TriScale does help decide what data must be collected, but it is up to the user to actually collect it.
#
# For this case study, we collect all the data using the [Pantheon framework](https://pantheon.stanford.edu/), which is open source and [available on GitHub](https://github.com/StanfordSNR/pantheon).
# We use the local mode of Pantheon, which runs the MahiMahi network emulator (detailed settings below).
# The only modification we made to the code is to save intermediary outputs in `csv` files for performing the analysis with TriScale.
#
# - **Git Commit**
# [397dcf5960b462fc5497f8961856266bc9fbea78](https://github.com/StanfordSNR/pantheon/commit/397dcf5960b462fc5497f8961856266bc9fbea78)
# - **Emulation command**
# `mm-delay 45 mm-link 114.68mbps.trace 114.68mbps.trace --uplink-queue=droptail --uplink-queue-args=packets=450`
# - **Emulation description**
# [Calibrated to the real path from AWS California to Mexico](https://pantheon.stanford.edu/result/353/)
#
# All the data collected for this case study is [publicly available](#Download-Source-Files-and-Data).

# ## Analysis
# [[Back to top](#Case-Study---Congestion-Control-Schemes)]
#
# The data has been collected ([download instructions](#Download-Source-Files-and-Data) are available in this notebook). It is time for the analysis.
#
# TriScale divides the analysis into three steps (one for each time scale):
# 1. Compute the metrics,
# 2. Compute the KPIs,
# 3. Compute the variability scores.
#
# ### 1. Compute the Metrics
# [[Back to top](#Case-Study---Congestion-Control-Schemes)]

# In[14]:

# Construct the path to the different test results
result_dir_path = Path('UseCase_Pantheon/PantheonData/10runs_30s')
result_dir_list = [x for x in result_dir_path.iterdir() if x.is_dir()]

# Meta data file name
meta_data_file = 'pantheon_metadata.json'

# Config file name and path
config_file = Path('UseCase_Pantheon/PantheonData/config.yml')

# Metric list
metric_list = [metric_tput, metric_delay]

# Outfile name
# -> Either save the output as csv, or retrieve it from file if it already exists
out_name = Path('UseCase_Pantheon/PantheonData/metrics.csv')

# Execute convergence tests and compute the metrics
metrics = pantheon.compute_metric(result_dir_list,
                                  meta_data_file,
                                  convergence,
                                  metric_list,
                                  out_name=out_name,
                                  force_computation=False,
                                  plot=False,
                                  verbose=False)


# For each series, we can look at the number of resulting data points;
# that is, how many runs have converged and resulted in valid metric values.

# In[15]:

# Visualize the results: plot the number of converged runs (out of 10)
convergence_results, figure = pantheon.compute_plot_convergence(metrics,
                                                                config_file,
                                                                'datetime',
                                                                show=False)
figure.show()


# As expected from the preliminary study in the [Experiment Design](#Experiment-Design) section, some schemes rarely converge (_Copa, LEDBAT, QUIC Cubic, PCC Allegro,_ and _Verus_).
# Those schemes that do not converge (at least) 5 times in each series cannot be processed further: there is not enough data to compute the chosen KPI values.

# ### 2. Compute the KPIs
# [[Back to top](#Case-Study---Congestion-Control-Schemes)]
#
# The `compute_kpi()` function is a simple wrapper that loops through all protocols and test series, and calls the `analysis_kpi()` function from the TriScale module.
# `compute_kpi()` returns a Pandas DataFrame with all KPI values.

# In[16]:

# Config file name and path
config_file = Path('UseCase_Pantheon/PantheonData/config.yml')

# KPIs list
kpi_list = [KPI_tput, KPI_delay]

# Compute the KPIs
KPIs = pantheon.compute_kpi(metrics,
                            kpi_list,
                            series='datetime',
                            plot=False,
                            verbose=False)


# In[17]:

# Uncomment to visualize the output DataFrame
# KPIs


# We can output the number of series that lead to a valid KPI for each scheme:

# In[18]:

df = KPIs.copy()
df = df.loc[(df['throughput_test'] == True) & (df['delay_test'] == True)]
df.dropna(inplace=True)
df = df.groupby(['cc'], as_index=False, observed=False).count()
df = df[['cc', 'datetime']]
df


# The _Verus_ scheme only had two series with at least 5 converged runs; thus we expected a maximum of 2 valid KPIs. No surprise here.
#
# The case of WebRTC is more interesting.
# One of its series fails TriScale's independence test for the throughput metric. Let us look at the autocorrelation plot for that series and compare it to another series that successfully passed the test.

# In[19]:

layout = {'title': 'Autocorrelation - Test failed'}
sample = metrics.loc[(metrics['cc'] == 'webrtc')
                     & (metrics['datetime'] == '2019-08-21T12:14:13:+0200')]
plot_failed = triplots.autocorr_plot(sample.throughput_value.values,
                                     layout=layout)

layout = {'title': 'Autocorrelation - Test passed'}
sample = metrics.loc[(metrics['cc'] == 'webrtc')
                     & (metrics['datetime'] == '2019-08-22T07:59:10:+0200')]
plot_passed = triplots.autocorr_plot(sample.throughput_value.values,
                                     layout=layout)


# The autocorrelation coefficients must lie in the shaded grey area for the test to pass; the first series shown above fails the test, the second one passes.
#
# However, there is no clear difference in the correlation structure of the two series; that is, the failing series does not seem significantly more correlated than the passing one. All other series of WebRTC pass the independence test, which hints that the failed series is merely an artifact induced by the small number of runs in the series (which was set to 10 in this example). In such cases, it is important that the user critically assesses TriScale's results, in order to increase, when necessary, the number of runs or series and improve the significance of the results.
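
# For reference, the quantity plotted above can be computed in a few lines (an illustrative computation with made-up metric values, not TriScale's actual test): the sample autocorrelation coefficients of a series of metric values, compared against the approximate 95% band of +/- 1.96/sqrt(n) expected for an i.i.d. series.

def sample_autocorrelation(x, max_lag=3):
    """Sample autocorrelation coefficients for lags 1..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / var for k in range(1, max_lag + 1)])

# Hypothetical throughput metric values from one series of 10 runs
metric_values = [90.1, 89.3, 90.8, 90.2, 88.9, 90.5, 89.7, 90.9, 89.4, 90.0]

coeffs = sample_autocorrelation(metric_values)
band = 1.96 / np.sqrt(len(metric_values))  # approximate 95% band for white noise
print(coeffs, band, np.all(np.abs(coeffs) < band))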

# In this case, we overrule TriScale's independence test for _WebRTC_ and treat the KPI values of that series as valid.
#
# For each series, we plot a two-dimensional representation of the KPIs of the different schemes, which allows us to compare them unambiguously.

# In[20]:

data_path = Path('UseCase_Pantheon/PantheonData/10runs_30s')
meta_file = 'pantheon_metadata.json'
config_file = Path('UseCase_Pantheon/PantheonData/config.yml')

series_label = np.sort(metrics['datetime'].unique())
cnt = 0
for series_ix in series_label:
    cnt += 1

    # Get the metric values for one series
    metric_series = metrics.loc[metrics['datetime'] == series_ix]
    meta_file_path = str(data_path / series_ix / meta_file)

    # Plot them
    custom_layout = {'title':  'Series %i' % cnt,
                     "width":  700,
                     "height": 700,
                     }
    pantheon.plot_triscale_kpi(metric_series,
                               meta_file_path,
                               kpi_list,
                               config_file,
                               layout=custom_layout,
                               show=True)


# By comparison, the plots illustrating the [Pantheon paper](https://pantheon.stanford.edu/static/pantheon/documents/pantheon-paper.pdf) are much harder to interpret...

# In[21]:

data_path = Path('UseCase_Pantheon/PantheonData/10runs_30s/2019-08-20T15:34:33:+0200')
plot_path = Path('plots/Pantheon')
perf_file = data_path / 'pantheon_perf.json'
meta_file = data_path / 'pantheon_metadata.json'
config_file = Path('UseCase_Pantheon/PantheonData/config.yml')

custom_layout = {"title":  None,
                 "width":  700,
                 "height": 700,
                 }
pantheon.plot_pantheon(perf_file,
                       meta_file,
                       config_file,
                       layout=custom_layout,
                       show=True);


# ### 3. Compute the variability scores
# [[Back to top](#Case-Study---Congestion-Control-Schemes)]
#
# As before, the `compute_score()` function is a simple wrapper that loops through all protocols and test series, and calls the `analysis_score()` function from the TriScale module.
# `compute_score()` returns a Pandas DataFrame with the scores of all schemes and metrics.
# In[22]:

# Scores list
score_list = [score_tput, score_delay]

# Compute the variability scores
scores = pantheon.compute_score(KPIs,
                                score_list,
                                plot=False,
                                verbose=False)

# Display the results
scores


# We can finally display the results in a more legible way...

# In[23]:

pantheon.plot_triscale_scores_matrix(scores,
                                     score_list,
                                     config_file)


# ## Conclusions
# [[Back to top](#Case-Study---Congestion-Control-Schemes)]
#
# This case study only considered emulation with one emulated path. As such, it does not aim to fully capture the performance of the different congestion-control schemes. Rather, it illustrates how TriScale may be used for an actual performance evaluation, and the importance of carefully choosing the parameters of an experiment; for example, the runtime.
#
# Two important take-aways are that
# 1. it is important to critically consider TriScale's results: the tests are intentionally conservative to limit the risk of false positives (i.e., failing to detect correlation in the data);
# 2. collecting more samples than strictly necessary improves the significance of the tests and limits the risk of false negatives.
#
# For further discussion of TriScale's design, usage, and limitations, refer to the [TriScale paper](https://doi.org/10.5281/zenodo.3464273).
#
# ---

# In[ ]: