#!/usr/bin/env python
# coding: utf-8

# ## TriScale
#
# # Case Study - Glossy on FlockLab
#
# This notebook presents a short use case of the TriScale framework, which compares the performance of [Glossy](https://ieeexplore.ieee.org/document/5779066) for different parameter values. This use case relies on [the FlockLab testbed](https://gitlab.ethz.ch/tec/public/flocklab/wikis/home) as the experiment environment. Elements of this use case are described in the [TriScale paper](https://doi.org/10.5281/zenodo.3464273) (submitted to NSDI'2020).
#
# In particular, this use case illustrates the importance of network profiling: this example shows how one may reach wrong conclusions (even with high confidence!) when the environmental conditions are not properly assessed.
#
# ## Menu
#
# - [Evaluation Objectives](#Evaluation-Objectives)
# - [List of Imports](#List-of-Imports)
# - [Download Source Files and Data](#Download-Source-Files-and-Data)
# - [Data Collection](#Data-Collection)
# - [Preprocessing](#Preprocessing)
# - [Computing the Metrics](#Computing-the-Metrics)
# - [Computing the KPIs](#Computing-the-KPIs)
# - [Network Profiling - FlockLab](#Network-Profiling---FlockLab)
# - [Conclusions](#Conclusions)

# ## Evaluation Objectives
# [[Back to top](#TriScale)]
#
# This evaluation aims to compare two parameter settings for [Glossy](https://ieeexplore.ieee.org/document/5779066), a low-power wireless protocol based on synchronous transmissions. Glossy includes as a parameter the number of retransmissions of each packet, called $N$. We investigate the impact of two values of $N$ on the reliability of Glossy, measured as the packet reception ratio (PRR). We define our KPI as the median PRR with a 95% confidence level.
#
# It is expected that the larger the value of $N$, the more reliable the protocol (and the more energy it consumes). We are interested in assessing whether
# - this is generally true,
# - there are differences in reliability depending on the day and time of the runs.
#
# We test two values for the parameter $N$: 1 and 2 retransmissions.

# ## List of Imports
# [[Back to top](#TriScale)]

# In[1]:

import os
import zipfile
from pathlib import Path

import pandas as pd
import numpy as np


# ## Download Source Files and Data
# [[Back to top](#TriScale)]
#
# The entire dataset and source code of this case study are available on Zenodo:
#
# [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3458116.svg)](https://doi.org/10.5281/zenodo.3458116)
#
# The wget commands below download the required files to reproduce this case study.

# In[2]:

# Set `download = True` to download (and extract) the data from this case study
# If needed, adjust the record_id for the file version you are interested in.
# For reproducing the original TriScale paper, set `record_id = 3451418`
download = False
record_id = 3458116  # version 2 (https://doi.org/10.5281/zenodo.3458116)

files = ['triscale.py', 'helpers.py', 'triplots.py', 'UseCase_Glossy.zip']

if download:
    for file in files:
        print(file)
        url = 'https://zenodo.org/record/' + str(record_id) + '/files/' + file
        os.system('wget %s' % url)
        if file[-4:] == '.zip':
            with zipfile.ZipFile(file, "r") as zip_file:
                zip_file.extractall()

print('Done.')


# We now import the custom modules that we just downloaded.
# - `triscale` is the main module of TriScale. It contains all the functions from TriScale's API; that is, functions meant to be called by the user.
# - `flocklab` is a module specific to this case study. It contains helper functions that parse the FlockLab test result files for this use case.
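# Before importing these modules, the short check below (an illustrative sketch,
# not part of the original artifact) verifies that the downloaded files and the
# extracted data folder are in place; the expected names are taken from the
# download cell above.

expected = ['triscale.py', 'helpers.py', 'triplots.py', 'UseCase_Glossy']
missing = [name for name in expected if not Path(name).exists()]
if missing:
    print('Missing files/folders:', missing)
else:
    print('All case study files are available.')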
# In[3]:

import triscale
import UseCase_Glossy.flocklab as flocklab


# ## Data Collection
# [[Back to top](#TriScale)]
#
# The test scenario is very simple. During one communication round, each node in the network initiates in turn a Glossy flood (using $N=1$ retransmission). All the other nodes log whether they successfully received the packet. The same round is then repeated with $N=2$ retransmissions.
#
# - The evaluation runs on [TelosB motes](https://www.advanticsys.com/shop/mtmcm5000msp-p-14.html).
# - The motes use radio frequency channel 22 (2.46 GHz, which largely overlaps with WiFi traffic).
# - The payload size is set to 64 bytes.
# - The scenario is run 24 times per day, scheduled randomly throughout the day.
# - Data has been collected over three weeks, from 2019-08-22 to 2019-09-11.
#
# The collected data is available in the [TriScale artifacts repository](#Download-Source-Files-and-Data).

# ## Preprocessing
# [[Back to top](#TriScale)]
#
# First, the serial log files from FlockLab are parsed to create CSV files compatible with the TriScale API.

# In[4]:

# Expected list of node ids
node_list = [1, 2, 3, 4, 6, 7, 8, 10, 11, 13, 14, 15, 16, 17, 18,
             19, 20, 22, 23, 24, 25, 26, 27, 28, 32, 33]

# Path to results to parse
data_folder = Path('UseCase_Glossy/Data_Glossy')

# Loop through the tests and parse the serial log file
date_list = [x for x in data_folder.iterdir() if x.is_dir()]
for test_date in sorted(date_list)[:-2]:  # the last two days are not yet available
    test_list = [x for x in test_date.iterdir() if x.is_dir()]
    for test in test_list:
        test_file_name = str(test / 'serial.csv')
        flocklab.parse_serial_log_glossy_use_case(test_file_name, node_list, verbose=False)
print('Done.')


# ## Computing the Metrics
# [[Back to top](#TriScale)]
#
# The test scenario is terminating. Thus, there is no need to check whether the runs have converged: the test runtime must simply be large enough to complete the scenario; that is, the two rounds of Glossy floods.
#
# The analysis starts with the computation of the metric. In this evaluation, we define the metric as the median packet reception ratio (our Y values) across all the nodes (node IDs are the X values). In other words, our metric is the median fraction of floods that are successfully received by a node in the network.
#
# The metric values are stored in a DataFrame, which also contains the test number and the date and time of the test.
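# To make the metric concrete, here is a minimal sketch of the computation (not
# part of the original analysis): given one of the per-run CSV files produced by
# the preprocessing step (such as `glossy_reliability_N1.csv`), with one row per
# node and the per-node packet reception ratio in the last column (the exact
# column layout is an assumption), the metric is simply the median of that
# column. The actual analysis below uses `triscale.analysis_metric()` instead,
# which additionally handles convergence testing and plotting.

def median_prr(csv_file):
    """Return the median of the last column of a per-run CSV (assumed to hold the per-node PRR)."""
    run_data = pd.read_csv(csv_file)
    return run_data.iloc[:, -1].median()

# Example (hypothetical path): median_prr('UseCase_Glossy/Data_Glossy/<date>/<test>/glossy_reliability_N1.csv')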
# In[5]:

# Create the storing DataFrame
columns = ['date_time', 'test_number', 'PRR_N1', 'PRR_N2']
df = pd.DataFrame(columns=columns, dtype=np.int32)

# TriScale inputs
metric = {'name': 'Average Packet Reception Ratio',
          'unit': '%',
          'measure': 50,
          'bounds': [0, 1]}
convergence = {'expected': False}

# Loop through the test files and parse the CSV files
date_list = [x for x in data_folder.iterdir() if x.is_dir()]
for test_date in sorted(date_list)[:-2]:  # the last two days are not yet available
    test_list = [x for x in test_date.iterdir() if x.is_dir()]
    for test in test_list:

        # Get the test number
        test_number = int(str(test)[-5:])

        # Get the test date_time
        xml_file = str(test / 'testconfiguration.xml')
        with open(xml_file, 'r') as xml_file:
            for line in xml_file:
                if '<start>' in line:
                    tmp = line[0:-1].split('<start>')
                    test_datetime = tmp[1][:-8]
                    break

        # Compute PRR metric for N=1
        data_file_name = str(test / 'glossy_reliability_N1.csv')
        converge1, PRR_N1, figure1 = triscale.analysis_metric(data_file_name,
                                                              metric,
                                                              plot=False,
                                                              convergence=convergence,
                                                              verbose=False)

        # Compute PRR metric for N=2
        data_file_name = str(test / 'glossy_reliability_N2.csv')
        converge2, PRR_N2, figure2 = triscale.analysis_metric(data_file_name,
                                                              metric,
                                                              plot=False,
                                                              convergence=convergence,
                                                              verbose=False)

        df_new = pd.DataFrame([[test_datetime, test_number, PRR_N1, PRR_N2]],
                              columns=columns, dtype=np.int32)
        df = pd.concat([df, df_new])

# Parse dates
df['date_time'] = pd.to_datetime(df['date_time'], utc=True)
df.set_index('date_time', inplace=True)
df.sort_values("test_number", inplace=True)
df.head()


# We can already see from these data that the results strongly differ between days. For example, if we isolate the results from a weekday and a weekend day and compare them:

# In[6]:

weekend = df.loc['2019-08-24']
weekend.median()  # weekend


# In[7]:

weekday = df.loc['2019-08-26']
weekday.median()


# We can see that the median PRR for one node is much lower on a weekday than on a weekend:
#
# |Day|Type|$N$=1|$N$=2|
# |---|---|---|---|
# |2019-08-24|weekend|88|94|
# |2019-08-26|weekday|78|88|
#
# Now, if one does not pay attention to the days when the runs are performed, one may reach wrong conclusions with respect to the reliability of Glossy with $N=1$ or $2$ retransmissions. This is illustrated below.

# ## Computing the KPIs
# [[Back to top](#TriScale)]
#
# We desire high confidence in the comparison between Glossy with $N=1$ and $N=2$ retransmissions; we choose as TriScale KPI the median PRR, estimated with a confidence level of 95%.
# 24 runs (one day of tests) are more than enough to compute this KPI.

# In[8]:

KPI = {'name': 'PRR',
       'unit': '\%',
       'percentile': 50,
       'confidence': 95,
       'class': 'one-sided',
       'bounds': [0, 100],
       'bound': 'lower'}

for k in [0, 5, 6]:
    triscale.experiment_sizing(KPI['percentile'],
                               KPI['confidence'],
                               CI_class=KPI['class'],
                               robustness=k,
                               verbose=True);


# Thus, with 24 runs, we can compute our KPI: the 6th smallest PRR value is a lower bound on the median PRR with a probability larger than 95%.
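# The sizing argument can also be checked by hand (an illustrative sketch, not
# part of the original artifact): for a continuous metric, each run falls below
# the true median with probability 0.5, so the m-th smallest of n values is a
# lower bound on the median whenever at least m runs fall below the median,
# which happens with probability P(Binomial(n, 0.5) >= m).

from math import comb

def lower_bound_confidence(n, m, p=0.5):
    """Probability that the m-th smallest of n samples lower-bounds the p-quantile."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

print(lower_bound_confidence(24, 6))  # ~0.997 >= 0.95: the 6th smallest of 24 values is a valid 95% lower bound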
# We want to illustrate the issues that may occur when neglecting a seasonal component in the environmental conditions (in this case, the weekly seasonal component of the FlockLab testbed).
#
# Thus, we first compute the KPIs by (intentionally) selecting two days where conditions are "friendly" (i.e., a weekend day) and "harsh" (i.e., a weekday), and we evaluate Glossy with $N=1$ and $N=2$ respectively on the "friendly" and the "harsh" day.

# In[9]:

# Select the metric series for N=1
data = weekend.PRR_N1.dropna().values
to_plot = ['horizontal']
verbose = False
triscale.analysis_kpi(data, KPI, to_plot, verbose=verbose)


# In[10]:

# Select the metric series for N=2 (drop missing values!)
data = weekday.PRR_N2.dropna().values
triscale.analysis_kpi(data, KPI, to_plot, verbose=verbose)


# From this first analysis, one concludes that, with 95% probability,
# * $N=1$ results in a KPI of 88%
# * $N=2$ results in a KPI of 84%
#
# Naturally, this conclusion is wrong. As explained before, the issue lies in the fact that we neglected the differences in the environmental conditions between the two tested series: one series ran in a "friendly" network, the other in a "harsh" network.
#
# To mitigate this issue, TriScale implements a `network_profiling()` function. This function analyzes link quality data from the test environment to identify systematic patterns in the network conditions: the so-called seasonal components.

# ## Network Profiling - FlockLab
# [[Back to top](#TriScale)]
#
# We use TriScale's network profiling function on the [wireless link quality data for FlockLab](https://doi.org/10.5281/zenodo.3354717), which is collected by the FlockLab maintainers and made publicly available. They run the link quality test every two hours, resulting in 12 measurement points per day.
#
# In this case study, we use only the data from August 2019, which has a large overlap with our data collection period.

# In[11]:

data_file = Path('UseCase_Glossy/Data_FlockLab/2019-08_FlockLab_sky.csv')
link_quality = flocklab.parse_data_file(str(data_file), active_link_threshold=50)

link_quality_bounds = [0, 100]
link_quality_name = 'PRR [%]'
fig_theil, fig_autocorr = triscale.network_profiling(link_quality,
                                                     link_quality_bounds,
                                                     link_quality_name)
fig_autocorr.show()


# One can clearly see from the autocorrelation plot that the average link quality on FlockLab has strong seasonal components. The first peak at lag 12 (i.e., 24 h) reveals the daily seasonal component. The data also show a second main peak at lag 84, which corresponds to one week. Indeed, there is less interference on weekends than on weekdays, which creates a weekly seasonal component.
#
# Due to this weekly component, it becomes problematic (i.e., potentially wrong) to compare results from different time periods which span less than a week. In other words, the time span of a series of runs must be at least one week for series to be comparable.
#
# Indeed, when we compare $N=1$ and $N=2$ using the entire span of our runs (i.e., three weeks), we obtain very different results:

# In[12]:

dataN1 = df.PRR_N1.dropna().values
dataN2 = df.PRR_N2.dropna().values
triscale.analysis_kpi(dataN1, KPI, to_plot, verbose=verbose)
triscale.analysis_kpi(dataN2, KPI, to_plot, verbose=verbose)


# As expected, Glossy with $N=2$ retransmissions performs much better than with $N=1$:
#
# |Number of retransmissions | KPI |
# |---|---|
# |$N = 1$|80%|
# |$N = 2$|88%|
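# The weekly seasonal component is also visible directly in our own measurements.
# The short sketch below (illustrative only, not part of the original analysis)
# splits the metric DataFrame into weekend and weekday runs and compares the
# median PRR of the two groups.

prr = df[['PRR_N1', 'PRR_N2']].astype(float)
is_weekend = df.index.dayofweek >= 5  # Saturday and Sunday
print('Median PRR on weekends:\n', prr[is_weekend].median())
print('Median PRR on weekdays:\n', prr[~is_weekend].median())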
# ## Conclusions
# [[Back to top](#TriScale)]
#
# This simple case study illustrates that, even with high confidence, one may reach wrong conclusions due to discrepancies in the experimental conditions.
#
# On a real network, short-term variations are unpredictable and (often) unavoidable. This is why it is important to perform multiple runs in a series: it increases the chances of running the experiment in both favorable and unfavorable conditions.
#
# However, we illustrated that systematic patterns are also often present. In other words, there are times where there is consistently more or less interference. Knowing about these dependencies is important to
# - ensure fairness in the comparison between protocols, and
# - enable reproducibility of the evaluations: the series span must be long enough such that it is irrelevant when the series actually starts.

# In[ ]: