This notebook will demonstrate the use of Violin Plots and Parallel Coordinates. The violin plot, similar to a box plot, is well suited to comparing multiple distributions: it shows a smoothed (kernel density) distribution curve for each feature under consideration. Below is a simple violin plot example using Seaborn.
#!pip install seaborn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
tips = sns.load_dataset("tips")
ax = sns.violinplot(x="day", y="total_bill", data=tips)
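Violin plots become especially useful when the same feature is split by a class label, which is exactly how we will use them below. As a quick illustration, here is one violin per class on synthetic data (the two-class frame is made up purely for demonstration):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
# Synthetic two-class data: class B is shifted and more spread out than class A
demo = pd.DataFrame({
    "value": np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(2.0, 1.5, 200)]),
    "class": ["A"] * 200 + ["B"] * 200,
})
ax = sns.violinplot(x="class", y="value", data=demo)
ax.set_title("One violin per class")
```

The difference in location and spread between the two violins is the kind of visual cue we will look for in the network traffic data.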
Let's consider an example where we want to compare benign and malicious network traffic. We will use the CICIDS2017 dataset for this. This dataset captures around 80 numerical features that characterise network activity. Each data instance has been labelled either as benign or as an attack type (we focus on DDoS here; other attacks are present in the full dataset). Which data attributes set these two classes apart? We can use violin plots to judge this visually over the entire dataset.
First we will load in the data set, and we will remove all Not-a-Number and Infinity values that may be present. We will also remove columns that contain only zeros (i.e., no separating features).
# Load in the dataset
df = pd.read_csv('./data/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv')
# Remove NaN and Inf
df = df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
# Remove columns with all zero values
df = df.loc[:, (df != 0).any(axis=0)]
# Output table
df
Flow ID | Source IP | Source Port | Destination IP | Destination Port | Protocol | Timestamp | Flow Duration | Total Fwd Packets | Total Backward Packets | ... | min_seg_size_forward | Active Mean | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | Label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 192.168.10.5-104.16.207.165-54865-443-6 | 104.16.207.165 | 443 | 192.168.10.5 | 54865 | 6 | 7/7/2017 3:30 | 3 | 2 | 0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
1 | 192.168.10.5-104.16.28.216-55054-80-6 | 104.16.28.216 | 80 | 192.168.10.5 | 55054 | 6 | 7/7/2017 3:30 | 109 | 1 | 1 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
2 | 192.168.10.5-104.16.28.216-55055-80-6 | 104.16.28.216 | 80 | 192.168.10.5 | 55055 | 6 | 7/7/2017 3:30 | 52 | 1 | 1 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
3 | 192.168.10.16-104.17.241.25-46236-443-6 | 104.17.241.25 | 443 | 192.168.10.16 | 46236 | 6 | 7/7/2017 3:30 | 34 | 1 | 1 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
4 | 192.168.10.5-104.19.196.102-54863-443-6 | 104.19.196.102 | 443 | 192.168.10.5 | 54863 | 6 | 7/7/2017 3:30 | 3 | 2 | 0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
225740 | 192.168.10.15-72.21.91.29-61374-80-6 | 72.21.91.29 | 80 | 192.168.10.15 | 61374 | 6 | 7/7/2017 5:02 | 61 | 1 | 1 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
225741 | 192.168.10.15-72.21.91.29-61378-80-6 | 72.21.91.29 | 80 | 192.168.10.15 | 61378 | 6 | 7/7/2017 5:02 | 72 | 1 | 1 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
225742 | 192.168.10.15-72.21.91.29-61375-80-6 | 72.21.91.29 | 80 | 192.168.10.15 | 61375 | 6 | 7/7/2017 5:02 | 75 | 1 | 1 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
225743 | 192.168.10.15-8.41.222.187-61323-80-6 | 8.41.222.187 | 80 | 192.168.10.15 | 61323 | 6 | 7/7/2017 5:02 | 48 | 2 | 0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
225744 | 192.168.10.15-8.43.72.21-61326-80-6 | 8.43.72.21 | 80 | 192.168.10.15 | 61326 | 6 | 7/7/2017 5:02 | 68 | 1 | 1 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
225711 rows × 75 columns
# What columns are left?
print(df.columns)
# How many columns?
print("Length:", len(df.columns))
Index(['Flow ID', ' Source IP', ' Source Port', ' Destination IP', ' Destination Port', ' Protocol', ' Timestamp', ' Flow Duration', ' Total Fwd Packets', ' Total Backward Packets', 'Total Length of Fwd Packets', ' Total Length of Bwd Packets', ' Fwd Packet Length Max', ' Fwd Packet Length Min', ' Fwd Packet Length Mean', ' Fwd Packet Length Std', 'Bwd Packet Length Max', ' Bwd Packet Length Min', ' Bwd Packet Length Mean', ' Bwd Packet Length Std', 'Flow Bytes/s', ' Flow Packets/s', ' Flow IAT Mean', ' Flow IAT Std', ' Flow IAT Max', ' Flow IAT Min', 'Fwd IAT Total', ' Fwd IAT Mean', ' Fwd IAT Std', ' Fwd IAT Max', ' Fwd IAT Min', 'Bwd IAT Total', ' Bwd IAT Mean', ' Bwd IAT Std', ' Bwd IAT Max', ' Bwd IAT Min', 'Fwd PSH Flags', ' Fwd Header Length', ' Bwd Header Length', 'Fwd Packets/s', ' Bwd Packets/s', ' Min Packet Length', ' Max Packet Length', ' Packet Length Mean', ' Packet Length Std', ' Packet Length Variance', 'FIN Flag Count', ' SYN Flag Count', ' RST Flag Count', ' PSH Flag Count', ' ACK Flag Count', ' URG Flag Count', ' ECE Flag Count', ' Down/Up Ratio', ' Average Packet Size', ' Avg Fwd Segment Size', ' Avg Bwd Segment Size', ' Fwd Header Length.1', 'Subflow Fwd Packets', ' Subflow Fwd Bytes', ' Subflow Bwd Packets', ' Subflow Bwd Bytes', 'Init_Win_bytes_forward', ' Init_Win_bytes_backward', ' act_data_pkt_fwd', ' min_seg_size_forward', 'Active Mean', ' Active Std', ' Active Max', ' Active Min', 'Idle Mean', ' Idle Std', ' Idle Max', ' Idle Min', ' Label'], dtype='object') Length: 75
Each column has its own range of values - some are quite narrow, some are quite large. We often normalise data to make it easier to work with and to draw comparisons - essentially, scaling it to lie within a fixed range. Here, we want to normalise each feature independently, so that each column has a minimum value of zero and a maximum value of one, with all values for that feature scaled within this range.
We will use the scikit-learn library to achieve this.
# Import scikit learn
from sklearn import preprocessing
# Extract only the numerical feature columns (skip the identifier columns and the Label)
subset = df.iloc[:, 7:74].astype(float)
# Define the scaler
min_max_scaler = preprocessing.MinMaxScaler()
# Apply the scaler to each column of our dataframe
df2 = pd.DataFrame(min_max_scaler.fit_transform(subset), columns=subset.columns, index=subset.index)
df2
Flow Duration | Total Fwd Packets | Total Backward Packets | Total Length of Fwd Packets | Total Length of Bwd Packets | Fwd Packet Length Max | Fwd Packet Length Min | Fwd Packet Length Mean | Fwd Packet Length Std | Bwd Packet Length Max | ... | act_data_pkt_fwd | min_seg_size_forward | Active Mean | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3.333335e-08 | 0.000518 | 0.00000 | 0.000066 | 0.000000 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000000 | ... | 0.000518 | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 9.166671e-07 | 0.000000 | 0.00034 | 0.000033 | 0.000001 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000514 | ... | 0.000000 | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 4.416669e-07 | 0.000000 | 0.00034 | 0.000033 | 0.000001 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000514 | ... | 0.000000 | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 2.916668e-07 | 0.000000 | 0.00034 | 0.000033 | 0.000001 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000514 | ... | 0.000000 | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 3.333335e-08 | 0.000518 | 0.00000 | 0.000066 | 0.000000 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000000 | ... | 0.000518 | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
225740 | 5.166669e-07 | 0.000000 | 0.00034 | 0.000033 | 0.000001 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000514 | ... | 0.000000 | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
225741 | 6.083336e-07 | 0.000000 | 0.00034 | 0.000033 | 0.000001 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000514 | ... | 0.000000 | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
225742 | 6.333337e-07 | 0.000000 | 0.00034 | 0.000033 | 0.000001 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000514 | ... | 0.000000 | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
225743 | 4.083335e-07 | 0.000518 | 0.00000 | 0.000066 | 0.000000 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000000 | ... | 0.000518 | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
225744 | 5.750003e-07 | 0.000000 | 0.00034 | 0.000033 | 0.000001 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000514 | ... | 0.000000 | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
225711 rows × 67 columns
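As a quick sanity check of what `MinMaxScaler` is doing: for each column it computes `(x - min) / (max - min)`, so the smallest value maps to 0 and the largest to 1. A tiny NumPy example:

```python
import numpy as np

col = np.array([3.0, 7.0, 11.0])
# MinMaxScaler's per-column transform: (x - min) / (max - min)
scaled = (col - col.min()) / (col.max() - col.min())
print(scaled)  # [0.  0.5 1. ]
```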
We have scaled the entire dataset so that all data for each feature is scaled in a consistent manner. We now want to split our dataset based on the classes of data that exist. Here, we know we have benign and DDoS classes.
# Output the classes
outcome = df[' Label'].unique()
print(outcome)
['BENIGN' 'DDoS']
# Split data based on identified classes
df2[' Label'] = df[' Label']
benign = df2[df2[' Label'] == outcome[0]]
ddos = df2[df2[' Label'] == outcome[1]]
benign
Flow Duration | Total Fwd Packets | Total Backward Packets | Total Length of Fwd Packets | Total Length of Bwd Packets | Fwd Packet Length Max | Fwd Packet Length Min | Fwd Packet Length Mean | Fwd Packet Length Std | Bwd Packet Length Max | ... | min_seg_size_forward | Active Mean | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | Label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3.333335e-08 | 0.000518 | 0.00000 | 0.000066 | 0.000000 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000000 | ... | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | BENIGN |
1 | 9.166671e-07 | 0.000000 | 0.00034 | 0.000033 | 0.000001 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000514 | ... | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | BENIGN |
2 | 4.416669e-07 | 0.000000 | 0.00034 | 0.000033 | 0.000001 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000514 | ... | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | BENIGN |
3 | 2.916668e-07 | 0.000000 | 0.00034 | 0.000033 | 0.000001 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000514 | ... | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | BENIGN |
4 | 3.333335e-08 | 0.000518 | 0.00000 | 0.000066 | 0.000000 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000000 | ... | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | BENIGN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
225740 | 5.166669e-07 | 0.000000 | 0.00034 | 0.000033 | 0.000001 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000514 | ... | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | BENIGN |
225741 | 6.083336e-07 | 0.000000 | 0.00034 | 0.000033 | 0.000001 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000514 | ... | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | BENIGN |
225742 | 6.333337e-07 | 0.000000 | 0.00034 | 0.000033 | 0.000001 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000514 | ... | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | BENIGN |
225743 | 4.083335e-07 | 0.000518 | 0.00000 | 0.000066 | 0.000000 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000000 | ... | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | BENIGN |
225744 | 5.750003e-07 | 0.000000 | 0.00034 | 0.000033 | 0.000001 | 0.000514 | 0.004076 | 0.001552 | 0.0 | 0.000514 | ... | 0.384615 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | BENIGN |
97686 rows × 68 columns
We now have our data split by class, so we can draw a violin plot for each class independently and compare the two figures.
plt.figure(figsize=(30,5))
ax = sns.violinplot(data=benign)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
ax.set_title("Violin Plot to show benign feature distributions");
plt.figure(figsize=(30,5))
ax = sns.violinplot(data=ddos)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
ax.set_title("Violin Plot to show DDoS feature distributions");
Comparing the two charts, we can see that several features show visibly different distributions across the two classes.
We now have a clearer view (as far as this dataset is concerned) with what makes for a benign packet, and what makes for a malicious DDoS packet.
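Rather than judging the violins purely by eye, one could also rank features by how far apart the per-class means sit after normalisation. This is a rough sketch on a synthetic two-class frame (the column values here are made up for illustration, not taken from CICIDS2017):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Toy stand-ins for the normalised per-class frames
benign_demo = pd.DataFrame({
    "Flow Duration": rng.uniform(0.0, 0.2, 500),   # low for benign
    "Fwd Packets/s": rng.uniform(0.0, 0.1, 500),   # similar in both classes
})
ddos_demo = pd.DataFrame({
    "Flow Duration": rng.uniform(0.5, 1.0, 500),   # high for DDoS
    "Fwd Packets/s": rng.uniform(0.0, 0.1, 500),
})

# Absolute difference in per-feature means, largest (most separating) first
gap = (benign_demo.mean() - ddos_demo.mean()).abs().sort_values(ascending=False)
print(gap)
```

On the real frames this would be `(benign.mean(numeric_only=True) - ddos.mean(numeric_only=True)).abs()`, though mean differences only capture shifts in location, not differences in distribution shape.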
Given the high dimensionality of the data, what does the data look like if we perform dimensionality reduction? Can we better separate between the two classes?
from sklearn import decomposition
pca = decomposition.PCA(n_components=2)
# PCA requires numerical data only, so drop the Label column before fitting
X = pd.DataFrame(pca.fit_transform(df2.drop(columns=[' Label']).values), columns=['x', 'y'])
X['Label'] = df[' Label']
benignX = X[X['Label'] == outcome[0]]
ddosX = X[X['Label'] == outcome[1]]
plt.scatter(benignX['x'], benignX['y'])
plt.scatter(ddosX['x'], ddosX['y'])
*Unfortunately not* - this is not a great surprise: our violin plots show overlap between the features of the two classes, and there is no clear decision boundary separating them. PCA also copes poorly when many features have little variance (as we have here), which is why the plot shows artefacts where points collapse onto straight lines. Other methods such as t-SNE and UMAP may perform better, but at greater computational cost.
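As a rough sketch of the t-SNE alternative mentioned above (shown here on small synthetic clustered data rather than the CICIDS2017 flows, since t-SNE scales poorly and would normally be run on a sample):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic 20-dimensional clusters standing in for a sample of the flows
X = np.vstack([rng.normal(0.0, 1.0, (50, 20)),
               rng.normal(5.0, 1.0, (50, 20))])

# t-SNE embeds into 2-D; perplexity must be smaller than the number of samples
emb = TSNE(n_components=2, perplexity=15, init="pca",
           random_state=0).fit_transform(X)
print(emb.shape)  # (100, 2)
```

UMAP follows the same `fit_transform` pattern via the separate `umap-learn` package.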
# Select all columns whose names contain the phrase 'IAT'
df3 = df2[df2.columns[df2.columns.str.contains("IAT")]].copy()
df3[' Label'] = df2[' Label']
# Take the first 1000 rows of each class to keep the plot readable
samples = 1000
benign3 = df3[df3[' Label'] == outcome[0]].iloc[0:samples, :]
ddos3 = df3[df3[' Label'] == outcome[1]].iloc[0:samples, :]
df3 = pd.concat([benign3, ddos3])
df3
plt.figure(figsize=(20,5))
ax = pd.plotting.parallel_coordinates(df3, ' Label', color=('#556270', '#4ECDC4'))
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);