ANALYSING LATENCY IN AFRICA

USING RIPE NCC RECURRING TRACEROUTE MEASUREMENTS FROM PROBES IN AFRICA TO eu.thethings.network


In [1]:
from datetime import datetime
import os.path
import time
import requests
import numpy as np
import pandas as pd

import altair as alt
from vega_datasets import data as alt_data

# for the notebook only (not for JupyterLab) run this command once per session
alt.renderers.enable('notebook')
alt.data_transformers.disable_max_rows()

import utilities as utils

import warnings
warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2
In [2]:
# Globar vars
WIDTH = 860
AFRICA_TOPOJSON_URL = 'https://raw.githubusercontent.com/deldersveld/topojson/master/continents/africa.json'
SAMPLE_SIZE = 10000

I. Data Loading, preprocessing and cleaning

In [5]:
json_data = utils.get_measurements(measurement_id=17468431, start='01/12/2018', stop='31/12/2018')
Fetching data from:
https://atlas.ripe.net/api/v2/measurements/17468431/results/?start=1543618800&stop=1546210800&probe_ids=None
In [4]:
# Number of records
print(str(len(json_data)) + ' results')
607324 results
In [5]:
# An exemple of result
json_data[0]
Out[5]:
{'af': 4,
 'dst_addr': '52.169.76.255',
 'dst_name': '52.169.76.255',
 'endtime': 1543618976,
 'from': '169.0.103.214',
 'fw': 4940,
 'group_id': 17468431,
 'lts': 8,
 'msm_id': 17468431,
 'msm_name': 'Traceroute',
 'paris_id': 3,
 'prb_id': 21682,
 'proto': 'TCP',
 'result': [{'hop': 64,
   'result': [{'flags': 'SA',
     'from': '52.169.76.255',
     'hdropts': [{'mss': 1440}],
     'rtt': 176.065,
     'size': 4,
     'ttl': 50},
    {'flags': 'SA',
     'from': '52.169.76.255',
     'hdropts': [{'mss': 1440}],
     'rtt': 176.794,
     'size': 4,
     'ttl': 50},
    {'flags': 'SA',
     'from': '52.169.76.255',
     'hdropts': [{'mss': 1440}],
     'rtt': 178.843,
     'size': 4,
     'ttl': 50}]}],
 'size': 0,
 'src_addr': '192.168.3.14',
 'stored_timestamp': 1543619072,
 'timestamp': 1543618975,
 'type': 'traceroute'}
  • Measurement results data structure

Note that according to probe's firmware version, returned fields might differ

From: https://atlas.ripe.net/docs/data_struct/

  • af: adress familly with possible values: {4: IPv4, 6: IPv6}
  • dst_addr: IP address of the destination (string)
  • dst_name: name of the destination (string)
  • from: IP address of the probe as known by controller (string)
  • fw: firmware version of the probe
  • group_id: if the measurement belongs to a group of measurements, the identifier of the group (int)
  • lts: last time synchronised. How long ago (in seconds) the clock of the probe was found to be in sync with that of a controller. The value -1 is used to indicate that the probe does not know whether it is in sync (int)
  • msm_id: measurement identifier (int)
  • msm_name: measurement type "Ping" (string)
  • paris_id: variation for the Paris mode of traceroute (int)
  • prb_id: source probe ID (int)
  • proto: "UDP", "ICMP", or "TCP" (string)
  • result: list of hop elements (array of objects). Objects have the following fields:
    • hop: hop number (int)
    • error: [optional] when an error occurs trying to send a packet. In that case there will not be a result structure. (string)
    • result: variable content, depending on type of response (array of objects) objects have the following fields:
      • flags: "flags" -- (optional) TCP flags in the reply packet, for TCP traceroute, concatenated, in the order 'F' (FIN), 'S' (SYN), 'R' (RST), 'P' (PSH), 'A' (ACK), 'U' (URG) (fw >= 4600) (string)
      • from: IPv4 or IPv6 source address in reply (string)
      • rtt: round-trip-time of reply, not present when the response is late in ms (float)
      • size: size of reply (int)
      • ttl: time-to-live in reply (int)
    • size: packet size (int)
    • src_addr: source address used by probe (string)
    • timestamp: Unix timestamp for start of measurement (int)
    • type: "traceroute" (string)
In [6]:
# To check a specific record by attribute
# list(filter(lambda m: m['endtime'] == 1543619009, json_data))
  • Checking destination IP address and fetching lon, lat for later use
In [0]:
# Let's check quickly all probes trace to the same IP address
ip_dest = set(map(lambda x: x['dst_addr'], json_data))
print('Destination IP address: ', ip_dest)

# Building the url as a long string in one go
url_ipinfo = 'http://ipinfo.io/{}?token=your-token'.format(list(ip_dest)[0])
url_ipinfo

r = requests.get(url_ipinfo)
json_ipinfo = r.json()

print(json_ipinfo)
lat_dest, lon_dest = map(lambda x: float(x), json_ipinfo['loc'].split(','))
Destination IP address:  {'52.169.76.255'}
{'ip': '52.169.76.255', 'city': 'Dublin', 'region': 'Leinster', 'country': 'IE', 'loc': '53.3331,-6.2489', 'org': 'AS8075 Microsoft Corporation', 'postal': 'D02'}
In [0]:
json_ipinfo
Out[0]:
{'city': 'Dublin',
 'country': 'IE',
 'ip': '52.169.76.255',
 'loc': '53.3331,-6.2489',
 'org': 'AS8075 Microsoft Corporation',
 'postal': 'D02',
 'region': 'Leinster'}
In [106]:
# Utilities functions and lambdas
# Error codes: https://atlas.ripe.net/docs/data_struct

# Filtering/mapping predicates
has_result = lambda row: not('error' in row['result'][0])
is_late = lambda pack: 'late' in pack # number of packets a reply is late, in this case rtt is not present (int)
is_timed_out = lambda pack: 'x' in pack # {'x': '*'} - Timed out
is_error = lambda pack: 'error' in pack # Network, destination,... unreachable
is_err = lambda pack: 'err' in pack # Network, destination,... unreachable
is_empty = lambda pack: not(bool(pack))

get_result = lambda row: row['result'][0]['result']
get_nb_packets = lambda row: len(get_result(row))
get_max_nb_hops = lambda row: row['result'][0]['hop']
get_rtts = lambda row: list(filter(lambda pack: not(is_late(pack) or
                                                    is_timed_out(pack) or
                                                    is_error(pack) or 
                                                    is_err(pack) or
                                                    is_empty(pack)), get_result(row)))

get_nb_late_pack = lambda row: len(list(filter(lambda pack: is_late(pack), get_result(row))))
get_nb_timed_out_pack = lambda row: len(list(filter(lambda pack: is_timed_out(pack), get_result(row))))
get_nb_error_pack = lambda row: len(list(filter(lambda pack: (is_error(pack) or
                                                              is_err(pack)), get_result(row))))
get_nb_empty_pack = lambda row: len(list(filter(lambda pack: is_empty(pack), get_result(row))))

get_nb_rtts = lambda row: len(get_rtts(row))
get_rtts_mean = lambda row: np.mean(list(map(lambda pack: pack['rtt'], get_rtts(row))))
get_ttl = lambda row: get_rtts(row)[0]['ttl']

to_datetime = lambda x: datetime.utcfromtimestamp(x).strftime('%Y-%m-%d %H:%M:%S')
In [131]:
# Postprocess json data
measurements = []
for row in json_data:
    #is_valid = has_result(row) and get_nb_rtts(row)
    is_valid = has_result(row)
    measurements.append(
        {'prb_id': row['prb_id'],
         'ip_from': row['from'],
         'paris_id': row['paris_id'],
         'datetime': to_datetime(row['timestamp']),
         'start_time': row['timestamp'],
         'end_time': row['endtime'],
         'last_time_sync': row['lts'],
         'firmware_version': row['fw'],
         'nb_packets': get_nb_packets(row) if is_valid else np.NaN,
         'nb_rtts': get_nb_rtts(row) if is_valid else np.NaN, 
         'rtt': get_rtts_mean(row) if is_valid else np.NaN,
         'measurement_failed': not(is_valid),
         'nb_late_pack': get_nb_late_pack(row) if is_valid else np.NaN,
         'nb_timed_out_pack': get_nb_timed_out_pack(row) if is_valid else np.NaN,
         'nb_error_pack': get_nb_error_pack(row) if is_valid else np.NaN,
         'nb_empty_pack': get_nb_empty_pack(row) if is_valid else np.NaN,
         #'nb_hops': get_max_nb_hops(row) - get_ttl(row) if is_valid else np.NaN
        })    
In [132]:
# Convert to Pandas dataframe 
columns = ['datetime', 'prb_id', 'ip_from', 'paris_id', 'start_time', 
           'end_time', 'last_time_sync', 'firmware_version', 'nb_packets', 'nb_rtts',
           'rtt', 'measurement_failed', 'nb_late_pack', 'nb_timed_out_pack', 
           'nb_error_pack','nb_empty_pack']

df = pd.DataFrame(measurements, columns=columns)
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
df.sort_index(inplace=True)
In [692]:
df.head()
Out[692]:
prb_id ip_from paris_id start_time end_time last_time_sync firmware_version nb_packets nb_rtts rtt measurement_failed nb_late_pack nb_timed_out_pack nb_error_pack nb_empty_pack
datetime
2018-11-30 23:02:49 35067 41.248.208.162 8 1543618969 1543618970 204 4780 3.0 3.0 94.480667 False 0.0 0.0 0.0 0.0
2018-11-30 23:02:55 21682 169.0.103.214 3 1543618975 1543618976 8 4940 3.0 3.0 177.234000 False 0.0 0.0 0.0 0.0
2018-11-30 23:03:00 33073 196.15.205.17 3 1543618980 1543618980 48 4940 3.0 3.0 172.731000 False 0.0 0.0 0.0 0.0
2018-11-30 23:03:02 50355 155.93.192.173 4 1543618982 1543618982 49 4940 3.0 3.0 178.559000 False 0.0 0.0 0.0 0.0
2018-11-30 23:03:03 11383 196.1.95.103 3 1543618983 1543618984 91 4940 3.0 3.0 96.144667 False 0.0 0.0 0.0 0.0

I.2 Probes description

In [693]:
# Get list of probes involved in measurements
probes_id = list(df['prb_id'].unique())
In [694]:
json_data_probe = utils.get_probes(probes_id)
In [695]:
# An example of probe's description.
json_data_probe['results'][0]
Out[695]:
{'address_v4': '196.1.95.16',
 'address_v6': '2001:4278:1000:1::16',
 'asn_v4': 8346,
 'asn_v6': 8346,
 'country_code': 'SN',
 'description': 'UCAD Probe',
 'first_connected': 1291147153,
 'geometry': {'type': 'Point', 'coordinates': [-17.4415, 14.6715]},
 'id': 239,
 'is_anchor': False,
 'is_public': True,
 'last_connected': 1568990096,
 'prefix_v4': '196.1.92.0/22',
 'prefix_v6': '2001:4278::/32',
 'status': {'id': 1, 'name': 'Connected', 'since': '2019-09-19T22:17:28Z'},
 'status_since': 1568931448,
 'tags': [{'name': 'Office', 'slug': 'office'},
  {'name': 'No NAT', 'slug': 'no-nat'},
  {'name': 'system: V1', 'slug': 'system-v1'},
  {'name': 'system: Resolves A Correctly',
   'slug': 'system-resolves-a-correctly'},
  {'name': 'system: Resolves AAAA Correctly',
   'slug': 'system-resolves-aaaa-correctly'},
  {'name': 'system: IPv4 Works', 'slug': 'system-ipv4-works'},
  {'name': "system: IPv6 Doesn't Work", 'slug': 'system-ipv6-doesnt-work'},
  {'name': 'system: IPv4 Capable', 'slug': 'system-ipv4-capable'},
  {'name': 'system: IPv6 Capable', 'slug': 'system-ipv6-capable'}],
 'total_uptime': 236903509,
 'type': 'Probe'}
In [19]:
# Postprocess json data
probes = []
for i, row in enumerate(json_data_probe['results']):
        probes.append(
            {'prb_id': row['id'],
             'ip_v4': row['address_v4'],
             'asn': str(row['asn_v4']),
             'country_code': row['country_code'],
             'description': row['description'],
             'lon': row['geometry']['coordinates'][0],
             'lat': row['geometry']['coordinates'][1]})
In [20]:
# Convert to Pandas dataframe 
columns = ['prb_id','ip_v4', 'asn', 'country_code', 'description', 'lon', 'lat']
df_probes = pd.DataFrame(probes, columns=columns)
In [21]:
print('Number of countries: ', len(df_probes['country_code'].unique()))
print(df_probes['country_code'].unique())
Number of countries:  36
['SN' 'ZA' 'MU' 'TN' 'BW' 'CM' 'GH' 'TZ' 'KE' 'BF' 'UG' 'RW' 'NG' 'CG'
 'BJ' 'MA' 'BI' 'MZ' 'ET' 'ZM' 'SZ' 'AO' 'MW' 'LS' 'NA' 'MG' 'SD' 'SS'
 'SC' 'CD' 'DZ' 'ZW' 'GM' 'TG' 'EG' 'DJ']
In [22]:
# Loading country codes
country_codes = pd.read_csv('https://raw.githubusercontent.com/franckalbinet/' + 
                            'latency-internet-africa/master/data/country_codes.csv')

country_codes.head()
Out[22]:
name alpha-2 alpha-3 country-code iso_3166-2 region sub-region intermediate-region region-code sub-region-code intermediate-region-code
0 Afghanistan AF AFG 4 ISO 3166-2:AF Asia Southern Asia NaN 142.0 34.0 NaN
1 Åland Islands AX ALA 248 ISO 3166-2:AX Europe Northern Europe NaN 150.0 154.0 NaN
2 Albania AL ALB 8 ISO 3166-2:AL Europe Southern Europe NaN 150.0 39.0 NaN
3 Algeria DZ DZA 12 ISO 3166-2:DZ Africa Northern Africa NaN 2.0 15.0 NaN
4 American Samoa AS ASM 16 ISO 3166-2:AS Oceania Polynesia NaN 9.0 61.0 NaN
In [23]:
# Joining/merging with country code to get full country name
df_probes = pd.merge(df_probes, country_codes[['name', 'alpha-2']], left_on='country_code', 
                     right_on='alpha-2', how='left')
df_probes = df_probes.drop(['country_code'], axis=1)
df_probes.rename(columns={'alpha-2':'country_code', 'name': 'country_name'}, inplace=True)
df_probes.head()
Out[23]:
prb_id ip_v4 asn description lon lat country_name country_code
0 239 196.1.95.16 8346 UCAD Probe -17.4415 14.6715 Senegal SN
1 242 None 328453 1 Belvedere 18.4895 -33.9815 South Africa ZA
2 446 196.192.112.229 37708 AFRINIC Mauritius 57.4995 -20.2395 Mauritius MU
3 473 196.4.161.3 3741 Internet Solutions [http://www.is.co.za] - Ros... 28.0405 -26.1515 South Africa ZA
4 504 193.95.97.132 2609 ATI, KASBAH, 1G/s 10.1675 36.7995 Tunisia TN
  • It looks like we don't get the IP address from all probes. Let's use the ip address provided in measurement results instead.
In [24]:
# Note that each probe can have several IP addresses. We keep only one.
ip_from_measurement = df[['prb_id','ip_from']].reset_index().drop(columns=['datetime']).drop_duplicates(subset='prb_id')
In [25]:
df_probes = pd.merge(df_probes, ip_from_measurement, left_on='prb_id', right_on='prb_id', how='left') 
In [26]:
df_probes.head()
Out[26]:
prb_id ip_v4 asn description lon lat country_name country_code ip_from
0 239 196.1.95.16 8346 UCAD Probe -17.4415 14.6715 Senegal SN 196.1.95.16
1 242 None 328453 1 Belvedere 18.4895 -33.9815 South Africa ZA 196.210.11.182
2 446 196.192.112.229 37708 AFRINIC Mauritius 57.4995 -20.2395 Mauritius MU 196.192.112.229
3 473 196.4.161.3 3741 Internet Solutions [http://www.is.co.za] - Ros... 28.0405 -26.1515 South Africa ZA 196.4.161.3
4 504 193.95.97.132 2609 ATI, KASBAH, 1G/s 10.1675 36.7995 Tunisia TN 193.95.97.132
In [96]:
ipinfo_lookup = []
for ip in df_probes['ip_from'].values:
  r = requests.get('http://ipinfo.io/{}?token=your-token'.format(ip))
  ipinfo_lookup.append(r.json())
In [97]:
df_ip_info = pd.DataFrame(ipinfo_lookup)
df_ip_info.head()
Out[97]:
city country hostname ip loc org postal region
0 SN dkr-sn.probe.atlas.ucad.sn 196.1.95.16 14.0000,-14.0000 AS8346 SONATEL-AS Autonomous System NaN
1 Cape Town ZA 196-210-11-182.dynamic.isadsl.co.za 196.210.11.182 -33.9258,18.4232 AS3741 Internet Solutions 7945 Western Cape
2 MU p446.probes.atlas.ripe.net 196.192.112.229 -20.2833,57.5500 AS37708 African Network Information Center - (... NaN
3 ZA ripe-ncc.is.co.za 196.4.161.3 -29.0000,24.0000 AS3741 Internet Solutions NaN
4 Tunis TN NaN 193.95.66.40 36.8190,10.1658 AS2609 Tunisia BackBone AS NaN Tūnis
In [98]:
df_probes = pd.merge(df_probes, df_ip_info, left_on='ip_from', right_on='ip', how='left') 
df_probes.head()
Out[98]:
prb_id ip_v4 asn description lon lat country_name country_code ip_from city country hostname ip loc org postal region
0 239 196.1.95.16 8346 UCAD Probe -17.4415 14.6715 Senegal SN 196.1.95.16 SN dkr-sn.probe.atlas.ucad.sn 196.1.95.16 14.0000,-14.0000 AS8346 SONATEL-AS Autonomous System NaN
1 242 None 328453 1 Belvedere 18.4895 -33.9815 South Africa ZA 196.210.11.182 Cape Town ZA 196-210-11-182.dynamic.isadsl.co.za 196.210.11.182 -33.9258,18.4232 AS3741 Internet Solutions 7945 Western Cape
2 446 196.192.112.229 37708 AFRINIC Mauritius 57.4995 -20.2395 Mauritius MU 196.192.112.229 MU p446.probes.atlas.ripe.net 196.192.112.229 -20.2833,57.5500 AS37708 African Network Information Center - (... NaN
3 473 196.4.161.3 3741 Internet Solutions [http://www.is.co.za] - Ros... 28.0405 -26.1515 South Africa ZA 196.4.161.3 ZA ripe-ncc.is.co.za 196.4.161.3 -29.0000,24.0000 AS3741 Internet Solutions NaN
4 567 193.95.66.40 2609 ATI, BELVEDERE, 1G/s 10.1805 36.8205 Tunisia TN 193.95.66.40 Tunis TN NaN 193.95.66.40 36.8190,10.1658 AS2609 Tunisia BackBone AS NaN Tūnis
In [99]:
df_probes = df_probes[['prb_id', 'ip', 'asn', 'hostname', 'org', 'description', 'country_name', 'country_code', 'region', 'city', 'lon', 'lat']]
In [100]:
df_probes.head()
Out[100]:
prb_id ip asn hostname org description country_name country_code region city lon lat
0 239 196.1.95.16 8346 dkr-sn.probe.atlas.ucad.sn AS8346 SONATEL-AS Autonomous System UCAD Probe Senegal SN -17.4415 14.6715
1 242 196.210.11.182 328453 196-210-11-182.dynamic.isadsl.co.za AS3741 Internet Solutions 1 Belvedere South Africa ZA Western Cape Cape Town 18.4895 -33.9815
2 446 196.192.112.229 37708 p446.probes.atlas.ripe.net AS37708 African Network Information Center - (... AFRINIC Mauritius Mauritius MU 57.4995 -20.2395
3 473 196.4.161.3 3741 ripe-ncc.is.co.za AS3741 Internet Solutions Internet Solutions [http://www.is.co.za] - Ros... South Africa ZA 28.0405 -26.1515
4 567 193.95.66.40 2609 NaN AS2609 Tunisia BackBone AS ATI, BELVEDERE, 1G/s Tunisia TN Tūnis Tunis 10.1805 36.8205
In [101]:
# Saving probes description as csv
utils.df_to_csv(df_probes, prefix_name='probes')
In [19]:
df_probes = pd.read_csv('data/probes-2019-08-23.csv')
In [697]:
# Joining/merging with country code to get full country name
data = pd.merge(df.reset_index(), df_probes, left_on='prb_id', right_on='prb_id', how='left')
In [700]:
SPEED_OF_LIGHT = 299792.458 # km/s
DIST = 3620 # km (more or less shortest African point from Dublin)
print("Light takes: ", int(1000*DIST / SPEED_OF_LIGHT), "ms to travel") # in ms
Light takes:  12 ms to travel
In [7]:
# To be defined
THRESHOLD_RTT = 30 # in ms
  • Which probes yield rtts below 30 ms??
In [702]:
data[data['rtt'] < 30]['prb_id'].unique()
Out[702]:
array([29158, 35742, 32674, 18514, 33985, 29991, 33290, 14867, 23538,
       21610, 14958, 33090,  4274,   596, 15883, 35743, 31505, 13299,
       35074])
  • In which countries??
In [703]:
data[data['rtt'] < 30]['country_name'].unique()
Out[703]:
array(['Kenya', 'South Africa', 'Togo', 'Gambia', 'Cameroon',
       'Congo (Democratic Republic of the)', 'Sudan',
       'Tanzania, United Republic of', nan, 'Seychelles', 'Ghana',
       'Botswana', 'Tunisia', 'Egypt'], dtype=object)

In [704]:
# Remove missing country names - though can be retrieved easily but those countries are
# not relevant to ous NRENs vs. others analysis as no identified NRENs in those countries.
data.dropna(subset=['country_name'], inplace=True)
In [705]:
# Dumping data for further use
utils.df_to_csv(data, prefix_name='measurements')
In [3]:
data = pd.read_csv('./data/measurements-2019-09-20.csv')
In [4]:
# How many measurements and columns
data.shape
Out[4]:
(581762, 28)
In [5]:
# Quick descriptive statistics
data['rtt'].describe()
Out[5]:
count    576684.000000
mean        167.272890
std         123.419625
min           0.585333
25%         139.629500
50%         170.072167
75%         183.718333
max        6292.648333
Name: rtt, dtype: float64
In [97]:
# Time range fetched
print("Measurements from: {} to {}".format(min(data['datetime']),max(data['datetime'])))
Measurements from: 2018-11-30 23:02:49 to 2018-12-30 22:54:10
  • All Rtts Range
In [8]:
alt.Chart(data[data['rtt'] > THRESHOLD_RTT].sample(SAMPLE_SIZE))\
    .mark_bar()\
    .encode(
        alt.X("rtt:Q", bin=alt.Bin(step=100), title='Round-Trip-Time (ms)'),
        y='count()')\
    .properties(
        width = WIDTH,
        height = 200)\
    .save('./img/all-rtts.png')
In [9]:
utils.show_chart('./img/all-rtts.png')
Out[9]:

  • Focusing on [0, 600] ms range
In [712]:
alt.Chart(data.sample(SAMPLE_SIZE))\
    .mark_bar()\
    .encode(
        alt.X("rtt:Q", bin=alt.Bin(extent=[0, 600], step=20), title='Round-Trip-Time (ms)'),
        y='count()')\
    .properties(
        width = WIDTH,
        height = 200)\
    .save('./img/rtts-focus.png')
In [10]:
utils.show_chart('./img/rtts-focus.png')
Out[10]:

  • With peaks and outliers
In [714]:
alt.Chart(data[data['rtt'] > THRESHOLD_RTT].sample(SAMPLE_SIZE)).mark_boxplot().encode(
    alt.X('rtt:Q',title='Round-Trip-Time (ms)'),
    alt.Y(field = 'country_name', type = 'nominal', 
          sort=alt.EncodingSortField(field='rtt', op='max', order='ascending'),
          title='Country'))\
.properties(
        width = 0.8*WIDTH,
        height = 550).save('./img/rtts-boxplot-by-country.png')
In [11]:
utils.show_chart('./img/rtts-boxplot-by-country.png')
Out[11]:

  • Focusing on Interquartile range and median

    Let's forget outliers and peaks for now and focus on Q1, Q2, Q3.

In [716]:
points = alt.Chart(data[data['rtt'] > THRESHOLD_RTT].sample(SAMPLE_SIZE)).mark_point(
    filled=True,
    color='steelblue',
    size=80
).encode(
    x=alt.X('median(rtt)', title='Round-Trip-Time (ms)'),
    y=alt.Y(
        'country_name',
         sort=alt.EncodingSortField(
             field='rtt',
             op='median',
             order='ascending'
         ),
        title='Country'
    )
).properties(
    width=0.8*WIDTH,
    height=600
)

error_bars = points.mark_rule(size=1.5, color='steelblue').encode(
    x='q1(rtt)',
    x2='q3(rtt)',
)

(points + error_bars).save('./img/rtts-iqr-by-country.png')
In [12]:
utils.show_chart('./img/rtts-iqr-by-country.png')
Out[12]:

In [718]:
african_countries = alt.topo_feature(AFRICA_TOPOJSON_URL, feature='continent_Africa_subunits')

# US states background
background = alt.Chart(african_countries).mark_geoshape(
    fill='lightgray',
    stroke='white'
).properties(
    width=WIDTH,
    height=500
).project('equirectangular')

# probes positions on background
points = alt.Chart(df_probes).mark_circle(
    size=50,
    color='steelblue'
).encode(
    longitude='lon:Q',
    latitude='lat:Q'
)

(background + points).configure_view(stroke="transparent").save('./img/probes-locations.png')
In [13]:
utils.show_chart('./img/probes-locations.png')
Out[13]:

II.4 Packet loss

Below is considered a lost packet when:

  • no result;
  • packet late;
  • timed out.
In [91]:
data_valid = data[data['measurement_failed'] == False]
data_valid['packets_lost'] = data_valid['nb_packets'] - data_valid['nb_rtts']
In [92]:
def perc_loss(df):
    return 100 * sum(df['packets_lost']) / sum(df['nb_packets'])
In [93]:
print("Percentage of packets lost: {:.2f}".format(perc_loss(data_valid)))
Percentage of packets lost: 1.99
In [723]:
packet_loss_country = data_valid.groupby(['country_name'])\
                        .apply(perc_loss)\
                        .to_frame(name='% of packet loss')\
                        .sort_values(by='% of packet loss', ascending=False)
print(packet_loss_country)
                                    % of packet loss
country_name                                        
Congo (Democratic Republic of the)         27.323449
Egypt                                      12.523072
Kenya                                       4.948249
Ethiopia                                    3.988362
Algeria                                     2.344592
Morocco                                     1.936620
South Africa                                1.932597
Sudan                                       1.902569
Cameroon                                    1.687548
Madagascar                                  1.372378
Senegal                                     1.276268
Zimbabwe                                    1.048065
Tunisia                                     0.942401
Zambia                                      0.883190
Togo                                        0.830149
Burundi                                     0.807505
Benin                                       0.537149
Uganda                                      0.508824
Tanzania, United Republic of                0.216664
Rwanda                                      0.147547
Gambia                                      0.146935
Congo                                       0.138889
Ghana                                       0.077187
Malawi                                      0.077160
Angola                                      0.058377
South Sudan                                 0.058092
Mauritius                                   0.035770
Nigeria                                     0.027006
Djibouti                                    0.025621
Mozambique                                  0.025077
Eswatini                                    0.023148
Botswana                                    0.011575
Seychelles                                  0.000000
Lesotho                                     0.000000
In [724]:
alt.Chart(packet_loss_country.reset_index()).mark_bar().encode(
    x=alt.X('% of packet loss', title='Packet loss (%)'),
    y=alt.Y(
        'country_name:N',
         sort=alt.EncodingSortField(
             field='% of packet loss',
             order='descending'
         ),
        title='Country'
    )
).properties(
    width=0.8*WIDTH,
    height=600
).save('./img/packet-loss-per-country.png')
In [14]:
utils.show_chart('./img/packet-loss-per-country.png')
Out[14]:

III. Analysing National Education Research Networks (NREN) probes latency

For further information on NRENS: https://en.wikipedia.org/wiki/National_research_and_education_network

III.1 NRENs identification process

In [20]:
nren_probes = [15328, 30177, 13114, 13218, 12344, 14712, 19592, 6369, 33838, 33284, 3461, 13711, 21673]
print(str(len(nren_probes)) + ' NRENs probes identified.')
13 NRENs probes identified.
  • Adding a new column testing probe's NREN membership
In [21]:
data['is_nren'] = data['prb_id'].isin(nren_probes)
data.head(2)
Out[21]:
datetime prb_id ip_from paris_id start_time end_time last_time_sync firmware_version nb_packets nb_rtts ... hostname org description country_name country_code region city lon lat is_nren
0 2018-11-30 23:02:49 35067 41.248.208.162 8 1543618969 1543618970 204 4780 3.0 3.0 ... NaN AS36903 Office National des Postes et Telecomm... Rue Bruxelles Morocco MA Rabat-Salé-Kénitra Rabat -6.2395 33.8205 False
1 2018-11-30 23:02:55 21682 169.0.103.214 3 1543618975 1543618976 8 4940 3.0 3.0 ... 169-0-103-214.ip.afrihost.co.za AS37611 Afrihost (Pty) Ltd Strand South Africa ZA Western Cape Cape Town 18.8315 -34.1015 False

2 rows × 28 columns

III.2 NREN probes location

In [23]:
df_probes['is_nren'] = df_probes['prb_id'].isin(nren_probes)
In [731]:
african_countries = alt.topo_feature(AFRICA_TOPOJSON_URL, feature='continent_Africa_subunits')

# US states background
background = alt.Chart(african_countries).mark_geoshape(
    fill='lightgray',
    stroke='white'
).properties(
    width=WIDTH,
    height=500
).project('equirectangular')

# probes positions on background
points_not_nren = alt.Chart(df_probes[df_probes['is_nren'] == False]).mark_circle(
    size=30, fill='steelblue', stroke='white', strokeWidth=0.5, opacity=0.8
).encode(
    longitude='lon:Q',
    latitude='lat:Q'
)

points_nren = alt.Chart(df_probes[df_probes['is_nren'] == True]).mark_point(
    size=300, fill='#c44e51', opacity=1, shape='triangle-up'
).encode(
    longitude='lon:Q',
    latitude='lat:Q'
)

# NREN probes in red and non NREN probes in blue
(background + points_nren + points_not_nren)\
    .configure_view(stroke="transparent")\
    .save('./img/probes-locations-nrens.png')
In [25]:
utils.show_chart('./img/probes-locations-nrens.png')
Out[25]:

  • Select only countries where NREN probes have been identified
In [27]:
countries_nren = list(data[data['is_nren'] == True]['country_code'].unique())
In [28]:
data_nren = data[data['country_code'].isin(countries_nren)]
  • Nb. of records per country per type of probes (belonging to nren or not)
In [29]:
data_nren[data_nren['rtt'] > THRESHOLD_RTT]\
    .groupby(['country_name', 'is_nren'])\
    .size()
Out[29]:
country_name                  is_nren
Algeria                       False        5180
                              True         5748
Kenya                         False       25262
                              True         5756
Madagascar                    False        5748
                              True         5571
Morocco                       False        8540
                              True         2808
South Africa                  False      229449
                              True         4316
Sudan                         False        4121
                              True         2713
Tanzania, United Republic of  False       11636
                              True         2837
Zambia                        False       10790
                              True         2826
dtype: int64

III.3 Comparing RTTs of NREN probes vs. others

  • What percentage of NRENs measurements?
In [94]:
print("Percentage of NRENs measurements: {:.2f}".format(100 * sum(data_nren['is_nren'] == True) / len(data_nren)))
Percentage of NRENs measurements: 9.38
In [736]:
points = alt.Chart(data_nren[data_nren['rtt'] > THRESHOLD_RTT].sample(SAMPLE_SIZE)).mark_point(
    filled=True,
    color='steelblue',
    size=80
).encode(
    x=alt.X('median(rtt)', title='Round-Trip-Time (ms)'),
    y=alt.Y(
        'is_nren',
         sort=alt.EncodingSortField(
             field='rtt',
             op='median',
             order='ascending'
         ),
    title='NREN membership'
    )
).properties(
    width=0.8*WIDTH,
    height=100
)

error_bars = points.mark_rule(size=1.5, color='steelblue').encode(
    x='q1(rtt)',
    x2='q3(rtt)',
)

(points + error_bars).save('./img/rtt-nrens-vs-others.png')
In [31]:
utils.show_chart('./img/rtt-nrens-vs-others.png')
Out[31]:

III.4 Comparing RTTs of NREN probes vs. others per country

In [181]:
points = alt.Chart(data_nren[data_nren['rtt'] > THRESHOLD_RTT].sample(SAMPLE_SIZE)).mark_point(
    filled=True,
    color='steelblue',
    size=80
).encode(
    x=alt.X('median(rtt)', title='Round-Trip-Time (ms)'),
    y=alt.Y(
        'is_nren',
         sort=alt.EncodingSortField(
             field='rtt',
             op='median',
             order='ascending'
         ),
        title='',
        axis=None
    ),
    color=alt.Color('is_nren', legend=None, scale=alt.Scale(
            domain=['true', 'false'],
            range=['#c44e51','steelblue']))
    
).properties(
    width=0.8*WIDTH,
    height=80
)

error_bars = points.mark_rule(size=1.5, color='steelblue').encode(
    x='q1(rtt)',
    x2='q3(rtt)',
)

(points + error_bars)\
    .facet(
        row='country_name:N')\
    .configure_axis(
        labelFontSize=15,\
        titleFontSize=15,\
        gridDash=[4,4]\
    )\
    .configure_header(
        titleFontSize=0,\
        labelFontSize=15,\
        labelLimit=80,\
        title=None
    )\
    .save('./img/rtt-nrens-vs-others-by-country.png')
In [182]:
utils.show_chart('./img/rtt-nrens-vs-others-by-country.png')
Out[182]:

III.5 Comparing Packet Loss of NREN probes vs. others

  • Comparing mean Packet Loss
In [34]:
data_nren_valid = data_nren[data_nren['measurement_failed'] == False]
data_nren_valid['packets_lost'] = data_nren_valid['nb_packets'] - data_nren_valid['nb_rtts']
In [37]:
packet_loss_nren_country = data_nren_valid.groupby(['is_nren'])\
                            .apply(perc_loss)\
                            .to_frame(name='% of packet loss')\
                            .sort_values(by='% of packet loss', ascending=False)
print(packet_loss_nren_country)
         % of packet loss
is_nren                  
False            2.231209
True             0.818880

III.6 Comparing Packet Loss of NREN probes vs. others per country

  • Comparing mean Packet Loss per country
In [302]:
packet_loss_per_nren_per_country = data_nren_valid.groupby(['country_name', 'is_nren'])\
                            .apply(perc_loss)\
                            .to_frame(name='% of packet loss')\
                                             
packet_loss_per_nren_per_country
Out[302]:
% of packet loss
country_name is_nren
Algeria False 4.107719
True 0.700231
Kenya False 5.917119
True 0.109954
Madagascar False 0.416667
True 2.341549
Morocco False 2.188217
True 1.163611
South Africa False 1.955328
True 0.690290
Sudan False 2.535381
True 0.613422
Tanzania, United Republic of False 0.259367
True 0.000000
Zambia False 1.030548
True 0.317759
In [305]:
alt.Chart(packet_loss_per_nren_per_country.reset_index())\
    .mark_bar(height=20)\
    .encode(
        x=alt.X('% of packet loss', title='Packet loss (%)'),
        y=alt.Y('is_nren:N', title='', axis=None),
        color=alt.Color('is_nren', legend=None, scale=alt.Scale(domain=['true', 'false'], range=['#c44e51','steelblue'])),
        row=alt.Row('country_name:N', sort=['Kenya', 'Algeria', 'Sudan', 'Morocco', 'South Africa',\
                                            'Zambia', 'Madagascar', 'Tanzania, United Republic of']))\
    .properties(
        width=0.8*WIDTH,
        height=70)\
    .configure_axis(
        labelFontSize=15,\
        titleFontSize=15,\
        gridDash=[4,4]\
    )\
    .configure_header(
        titleFontSize=0,\
        labelFontSize=15,\
        labelLimit=80,\
        title=None
    ).save('./img/packet-loss-nrens-vs-others-by-country.png')
In [306]:
utils.show_chart('./img/packet-loss-nrens-vs-others-by-country.png')
Out[306]:

III.7 Is there enough evidence at this stage that a NREN make a difference?

  • Hypothesis testing

H0: there is no difference of mean of % of packet loss between NRENs vs. non-NRENS measurements.

In [39]:
# Observed difference of % of packet loss between NRENs and not-NRENs
observed_diff = packet_loss_nren_country.diff().values[1][0]
print(observed_diff)
-1.4123289321193973

Let's first simulate that HO Null hypothesis:

In [62]:
nren_idx = data_nren_valid[data_nren_valid['is_nren'] == True].index.values

# Sample non NREN measurements to get the same number of values as NRENs one as 10x more
not_nren_idx = data_nren_valid[data_nren_valid['is_nren'] == False].sample(len(nren_idx)).index.values

both_idx = np.concatenate((nren_idx, not_nren_idx))
In [68]:
nb_replicates = 10000
replicates = np.zeros(nb_replicates)
for i in range(nb_replicates):
    both_perm_idx = np.random.permutation(both_idx)
    perm_sample_nren_idx = both_perm_idx[:len(nren_idx)]
    perm_sample_not_nren_idx = both_perm_idx[len(not_nren_idx):]
    
    replicates[i] =  perc_loss(data_nren_valid.loc[perm_sample_nren_idx,])-\
                     perc_loss(data_nren_valid.loc[perm_sample_not_nren_idx,])
In [69]:
alt.Chart(pd.DataFrame({'x': list(replicates)}))\
    .mark_bar()\
    .encode(
        alt.X("x:Q", title='difference of mean of packet loss between NRENS and not-NRENS under simulated H0'),
        y='count()')\
    .properties(
        width = WIDTH,
        height = 200)\
    .save('./img/estimate-diff-percentage-packet-loss.png')
In [70]:
utils.show_chart('./img/estimate-diff-percentage-packet-loss.png')
Out[70]:

What's the probability of obtaining a value of our test statistic (difference of mean of packet loss) that is at least as extreme as what was observed, under the assumption the null hypothesis is true?

In [71]:
print(100 * sum(replicates <= observed_diff) / len(replicates), '%')
0.0 %