#!/usr/bin/env python
# coding: utf-8

# # Activity 7 - Violin Plots and Parallel Coordinates
#
# This notebook demonstrates the use of violin plots and parallel coordinates. The violin plot, similar to a box plot, is well suited to comparing multiple distributions: it shows a curved (kernel density) distribution for each feature under consideration. Below is a simple violin plot example using Seaborn.

# In[3]:

#!pip install seaborn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")
ax = sns.violinplot(x="day", y="total_bill", data=tips)

# ## Case Study: Comparing Benign and DDoS behaviours in network traffic analysis using CICIDS2017
#
# Let's consider an example where we want to compare benign and malicious network traffic. We will use the CICIDS2017 dataset for this. This dataset captures in the region of 80 numerical features that characterise network activity. Each data instance has been labelled as either benign or as an attack type (we focus on DDoS here; other attacks are present in the full dataset). What data attributes set these two classes apart? We can use violin plots to judge this visually over the entire dataset.
#
# ### Load in dataset and clean it up
#
# First we will load in the dataset and remove any Not-a-Number (NaN) and Infinity values that may be present. We will also remove columns that contain only zeros (i.e., features with no separating power).

# In[4]:

# Load in the dataset
df = pd.read_csv('./data/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv')

# Remove rows containing NaN or Inf values
df = df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]

# Remove columns with all-zero values
df = df.loc[:, (df != 0).any(axis=0)]

# Output table
df

# In[5]:

# What columns are left?
print(df.columns)

# How many columns?
print("Length: ", len(df.columns))

# ### Normalise each column
#
# Each column has its own range of values - some are quite narrow, some are quite large.
# We often normalise data to make it easier to work with and to draw comparisons - this essentially means scaling it to be within a fixed range. Here, we want to normalise each feature independently: each column will have a minimum value of zero and a maximum value of one, and all values for that particular feature will be scaled within this range.
#
# We will use the scikit-learn library to achieve this.

# In[6]:

# Import scikit-learn's preprocessing module
from sklearn import preprocessing

# Extract only the numerical feature columns
subset = df.iloc[:, 7:74].astype(float)

# Define the scaler
min_max_scaler = preprocessing.MinMaxScaler()

# Apply the scaler to each column of our dataframe
df2 = pd.DataFrame(min_max_scaler.fit_transform(subset), columns=subset.columns, index=subset.index)
df2

# ### Separate data based on class
#
# We have scaled the entire dataset so that all data for each feature is scaled in a consistent manner. We now want to split our dataset based on the classes of data that exist. Here, we know we have benign and DDoS classes.

# In[7]:

# Output the classes
outcome = df[' Label'].unique()
print(outcome)

# In[16]:

# Split data based on the identified classes
df2[' Label'] = df[' Label']
benign = df2[df2[' Label'] == outcome[0]]
ddos = df2[df2[' Label'] == outcome[1]]

# In[17]:

benign

# ### Visualise the output
#
# We now have our data split into the classes, so we can draw a violin plot for each class independently and compare the two figures.

# In[18]:

plt.figure(figsize=(30, 5))

# Drop the label column so only the numeric features are plotted
ax = sns.violinplot(data=benign.drop(columns=[' Label']))
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
ax.set_title("Violin Plot to show benign feature distributions");

# In[19]:

plt.figure(figsize=(30, 5))
ax = sns.violinplot(data=ddos.drop(columns=[' Label']))
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
ax.set_title("Violin Plot to show DDoS feature distributions");

# ### Findings
#
# Comparing the two charts, we can see that the following features differ across the two classes.
#
# * Flow Duration
# * Flow IAT (Mean, Std, Max, Min)
# * Fwd IAT (Total, Mean, Std, Max, Min)
# * Packet Length Variance
# * Idle (Mean, Std, Max, Min)
#
# We now have a clearer view (as far as this dataset is concerned) of what makes for a benign packet and what makes for a malicious DDoS packet.

# ### Extra: PCA decomposition to separate classes
#
# Given the high dimensionality of the data, what does the data look like if we perform dimensionality reduction? Can we better separate the two classes?

# In[224]:

from sklearn import decomposition

pca = decomposition.PCA(n_components=2)

# Drop the non-numeric label column before fitting PCA
X = pd.DataFrame(pca.fit_transform(df2.drop(columns=[' Label']).values), columns=['x', 'y'])
X['Label'] = df[' Label']

benignX = X[X['Label'] == outcome[0]]
ddosX = X[X['Label'] == outcome[1]]

plt.scatter(benignX['x'], benignX['y'])
plt.scatter(ddosX['x'], ddosX['y'])

# ***Unfortunately not*** - this is not a great surprise: our violin plots show overlap between the features of the two classes, and there is no clear decision boundary that separates them. PCA performs quite poorly when many features have little variance (as we have here), which is why the plot shows artefacts where straight lines appear. Other methods such as t-SNE and UMAP may perform better, but at greater computational cost.

# In[47]:

# Here's an example of selecting all columns that contain the phrase 'IAT'
df3 = df2[df2.columns[df2.columns.str.contains("IAT")]].copy()
df3[' Label'] = df2[' Label']

# Take a sample of each class to keep the plot readable
samples = 1000
benign3 = df3[df3[' Label'] == outcome[0]].iloc[0:samples, :]
ddos3 = df3[df3[' Label'] == outcome[1]].iloc[0:samples, :]
df3 = pd.concat([benign3, ddos3])
df3

plt.figure(figsize=(20, 5))
ax = pd.plotting.parallel_coordinates(df3, ' Label', color=('#556270', '#4ECDC4'))
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
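# As a quick sanity check of the row/column cleaning idiom used earlier, here is a minimal sketch on a toy frame (hypothetical values, not CICIDS2017 data):

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN row, one Inf row, and an all-zero column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                    "b": [1.0, 2.0, np.inf, 4.0],
                    "zeros": [0.0, 0.0, 0.0, 0.0]})

# Drop any row containing NaN or +/-Inf (pandas isin treats NaN as a match here)
clean = toy[~toy.isin([np.nan, np.inf, -np.inf]).any(axis=1)]

# Drop columns that are entirely zero
clean = clean.loc[:, (clean != 0).any(axis=0)]

print(clean.shape)  # → (2, 2): two rows dropped, 'zeros' column dropped
```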
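# The min-max normalisation applied earlier can also be written by hand as (x - min) / (max - min) per column. A minimal sketch on toy data (again, hypothetical values) confirms that MinMaxScaler computes exactly this:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Toy frame with two columns of different ranges
toy = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [2.0, 4.0, 6.0]})

# MinMaxScaler rescales each column to the [0, 1] range
scaled = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(toy),
                      columns=toy.columns)

# Manual equivalent of the same per-column formula
manual = (toy - toy.min()) / (toy.max() - toy.min())

print(np.allclose(scaled.values, manual.values))  # → True
```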
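# The PCA discussion above notes that two components separate the classes poorly. One way to judge how much structure two components can retain is the `explained_variance_ratio_` attribute of a fitted PCA. A minimal sketch on synthetic data (random toy values, not the CICIDS2017 features), where one direction deliberately dominates the variance:

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(0)

# Synthetic data: large variance in the first column, little elsewhere
X = rng.normal(size=(500, 10)) * np.array([10.0] + [1.0] * 9)

pca = decomposition.PCA(n_components=2)
pca.fit(X)

# Share of the total variance captured by each retained component
ratio = pca.explained_variance_ratio_
print(ratio)
```

# When the retained components capture only a small share of the variance - or, as in our case, when the variance that is captured does not align with the class labels - the 2D scatter plot cannot be expected to show a clean separation.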