Luca Di Bello (luca.dibello@student.supsi.ch) - SUPSI - 2023
The goal of this project is to build a machine learning model that can detect and classify network attacks. The model will be trained on a dataset of network flows, and will predict whether a new flow is an attack and, if so, what kind of attack it is.
The dataset used for this project is written in the NetFlow V9 format (a format by Cisco, documentation available here). The dataset consists of two files:
In this section we load the datasets and, since the full dataset is very large, take a sample of it when in development mode.
# Load data processing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
%matplotlib inline
# Load machine learning libraries
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
If the following flag is set to True, the model will be trained on a smaller dataset, in order to speed up the development process. If the flag is set to False, the model will be trained on the whole dataset.
# If true, only 3% of the data will be used for training and testing of the various models
_DEVMODE = True
# Loading the data from the train and test files
train_df = pd.read_csv('data/train_net.csv')
test_df = pd.read_csv('data/test_net.csv')
# Print total size
print("Test set size: ", test_df.shape)
print("Train set size: ", train_df.shape)
# Value counts
train_df['ALERT'].value_counts()
Test set size:  (2077339, 32)
Train set size:  (4217625, 33)

None                 3659000
Port Scanning         507845
Denial of Service      50392
Malware                  388
Name: ALERT, dtype: int64
if _DEVMODE:
train_df = train_df.sample(frac=0.03, random_state=1)
test_df = test_df.sample(frac=0.03, random_state=1)
# Print total size
print("Test set size: ", test_df.shape)
print("Train set size: ", train_df.shape)
Test set size:  (62320, 32)
Train set size:  (126529, 33)
In this section we preprocess the datasets in order to make them usable by the machine learning algorithms.
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 126529 entries, 1283232 to 959711
Data columns (total 33 columns):
 #   Column                      Non-Null Count   Dtype
---  ------                      --------------   -----
 0   FLOW_ID                     126529 non-null  int64
 1   PROTOCOL_MAP                126529 non-null  object
 2   L4_SRC_PORT                 126529 non-null  int64
 3   IPV4_SRC_ADDR               126529 non-null  object
 4   L4_DST_PORT                 126529 non-null  int64
 5   IPV4_DST_ADDR               126529 non-null  object
 6   FIRST_SWITCHED              126529 non-null  int64
 7   FLOW_DURATION_MILLISECONDS  126529 non-null  int64
 8   LAST_SWITCHED               126529 non-null  int64
 9   PROTOCOL                    126529 non-null  int64
 10  TCP_FLAGS                   126529 non-null  int64
 11  TCP_WIN_MAX_IN              126529 non-null  int64
 12  TCP_WIN_MAX_OUT             126529 non-null  int64
 13  TCP_WIN_MIN_IN              126529 non-null  int64
 14  TCP_WIN_MIN_OUT             126529 non-null  int64
 15  TCP_WIN_MSS_IN              126529 non-null  int64
 16  TCP_WIN_SCALE_IN            126529 non-null  int64
 17  TCP_WIN_SCALE_OUT           126529 non-null  int64
 18  SRC_TOS                     126529 non-null  int64
 19  DST_TOS                     126529 non-null  int64
 20  TOTAL_FLOWS_EXP             126529 non-null  int64
 21  MIN_IP_PKT_LEN              126529 non-null  int64
 22  MAX_IP_PKT_LEN              126529 non-null  int64
 23  TOTAL_PKTS_EXP              126529 non-null  int64
 24  TOTAL_BYTES_EXP             126529 non-null  int64
 25  IN_BYTES                    126529 non-null  int64
 26  IN_PKTS                     126529 non-null  int64
 27  OUT_BYTES                   126529 non-null  int64
 28  OUT_PKTS                    126529 non-null  int64
 29  ANALYSIS_TIMESTAMP          126529 non-null  int64
 30  ANOMALY                     69493 non-null   float64
 31  ALERT                       126529 non-null  object
 32  ID                          126529 non-null  int64
dtypes: float64(1), int64(28), object(4)
memory usage: 32.8+ MB
# Show information about the data
def printInfo(df):
print('Dataframe shape: ', df.shape)
print('Dataframe columns: ', df.columns)
print('==== Train data ====')
printInfo(train_df)
print()
print('==== Test data ====')
printInfo(test_df)
==== Train data ====
Dataframe shape:  (126529, 33)
Dataframe columns:  Index(['FLOW_ID', 'PROTOCOL_MAP', 'L4_SRC_PORT', 'IPV4_SRC_ADDR',
       'L4_DST_PORT', 'IPV4_DST_ADDR', 'FIRST_SWITCHED',
       'FLOW_DURATION_MILLISECONDS', 'LAST_SWITCHED', 'PROTOCOL', 'TCP_FLAGS',
       'TCP_WIN_MAX_IN', 'TCP_WIN_MAX_OUT', 'TCP_WIN_MIN_IN',
       'TCP_WIN_MIN_OUT', 'TCP_WIN_MSS_IN', 'TCP_WIN_SCALE_IN',
       'TCP_WIN_SCALE_OUT', 'SRC_TOS', 'DST_TOS', 'TOTAL_FLOWS_EXP',
       'MIN_IP_PKT_LEN', 'MAX_IP_PKT_LEN', 'TOTAL_PKTS_EXP',
       'TOTAL_BYTES_EXP', 'IN_BYTES', 'IN_PKTS', 'OUT_BYTES', 'OUT_PKTS',
       'ANALYSIS_TIMESTAMP', 'ANOMALY', 'ALERT', 'ID'],
      dtype='object')

==== Test data ====
Dataframe shape:  (62320, 32)
Dataframe columns:  Index(['FLOW_ID', 'PROTOCOL_MAP', 'L4_SRC_PORT', 'IPV4_SRC_ADDR',
       'L4_DST_PORT', 'IPV4_DST_ADDR', 'FIRST_SWITCHED',
       'FLOW_DURATION_MILLISECONDS', 'LAST_SWITCHED', 'PROTOCOL', 'TCP_FLAGS',
       'TCP_WIN_MAX_IN', 'TCP_WIN_MAX_OUT', 'TCP_WIN_MIN_IN',
       'TCP_WIN_MIN_OUT', 'TCP_WIN_MSS_IN', 'TCP_WIN_SCALE_IN',
       'TCP_WIN_SCALE_OUT', 'SRC_TOS', 'DST_TOS', 'TOTAL_FLOWS_EXP',
       'MIN_IP_PKT_LEN', 'MAX_IP_PKT_LEN', 'TOTAL_PKTS_EXP',
       'TOTAL_BYTES_EXP', 'IN_BYTES', 'IN_PKTS', 'OUT_BYTES', 'OUT_PKTS',
       'ANALYSIS_TIMESTAMP', 'ANOMALY', 'ID'],
      dtype='object')
train_df.head()
  | FLOW_ID | PROTOCOL_MAP | L4_SRC_PORT | IPV4_SRC_ADDR | L4_DST_PORT | IPV4_DST_ADDR | FIRST_SWITCHED | FLOW_DURATION_MILLISECONDS | LAST_SWITCHED | PROTOCOL | ... | TOTAL_PKTS_EXP | TOTAL_BYTES_EXP | IN_BYTES | IN_PKTS | OUT_BYTES | OUT_PKTS | ANALYSIS_TIMESTAMP | ANOMALY | ALERT | ID
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1283232 | 372246818 | tcp | 62980 | 10.114.241.166 | 47736 | 10.114.224.117 | 1647766047 | 0 | 1647766047 | 6 | ... | 0 | 0 | 44 | 1 | 40 | 1 | 1647766568 | 0.0 | Port Scanning | 1283232
3327778 | 372024895 | udp | 44246 | 10.114.227.143 | 53 | 10.114.226.5 | 1647761040 | 0 | 1647761040 | 17 | ... | 0 | 0 | 73 | 1 | 121 | 1 | 1647761116 | 0.0 | None | 3327778
3341312 | 334833539 | tcp | 36150 | 10.114.225.212 | 5228 | 74.125.133.188 | 1647346083 | 144 | 1647346083 | 6 | ... | 0 | 0 | 132 | 2 | 130 | 2 | 1647346141 | NaN | None | 3341312
1704357 | 369619889 | tcp | 33858 | 10.114.225.206 | 6443 | 10.114.232.94 | 1647693586 | 0 | 1647693586 | 6 | ... | 0 | 0 | 60 | 1 | 0 | 0 | 1647693660 | 1.0 | None | 1704357
4163521 | 371759146 | udp | 48586 | 10.114.227.52 | 53 | 10.114.226.5 | 1647752706 | 0 | 1647752706 | 17 | ... | 0 | 0 | 91 | 1 | 146 | 1 | 1647752852 | 0.0 | None | 4163521
5 rows × 33 columns
# Check for missing values
print('==== Train data ====')
print(train_df.isnull().sum())
print()
print('==== Test data ====')
print(test_df.isnull().sum())
print()
==== Train data ====
FLOW_ID                           0
PROTOCOL_MAP                      0
L4_SRC_PORT                       0
IPV4_SRC_ADDR                     0
L4_DST_PORT                       0
IPV4_DST_ADDR                     0
FIRST_SWITCHED                    0
FLOW_DURATION_MILLISECONDS        0
LAST_SWITCHED                     0
PROTOCOL                          0
TCP_FLAGS                         0
TCP_WIN_MAX_IN                    0
TCP_WIN_MAX_OUT                   0
TCP_WIN_MIN_IN                    0
TCP_WIN_MIN_OUT                   0
TCP_WIN_MSS_IN                    0
TCP_WIN_SCALE_IN                  0
TCP_WIN_SCALE_OUT                 0
SRC_TOS                           0
DST_TOS                           0
TOTAL_FLOWS_EXP                   0
MIN_IP_PKT_LEN                    0
MAX_IP_PKT_LEN                    0
TOTAL_PKTS_EXP                    0
TOTAL_BYTES_EXP                   0
IN_BYTES                          0
IN_PKTS                           0
OUT_BYTES                         0
OUT_PKTS                          0
ANALYSIS_TIMESTAMP                0
ANOMALY                       57036
ALERT                             0
ID                                0
dtype: int64

==== Test data ====
FLOW_ID                           0
PROTOCOL_MAP                      0
L4_SRC_PORT                       0
IPV4_SRC_ADDR                     0
L4_DST_PORT                       0
IPV4_DST_ADDR                     0
FIRST_SWITCHED                    0
FLOW_DURATION_MILLISECONDS        0
LAST_SWITCHED                     0
PROTOCOL                          0
TCP_FLAGS                         0
TCP_WIN_MAX_IN                    0
TCP_WIN_MAX_OUT                   0
TCP_WIN_MIN_IN                    0
TCP_WIN_MIN_OUT                   0
TCP_WIN_MSS_IN                    0
TCP_WIN_SCALE_IN                  0
TCP_WIN_SCALE_OUT                 0
SRC_TOS                           0
DST_TOS                           0
TOTAL_FLOWS_EXP                   0
MIN_IP_PKT_LEN                    0
MAX_IP_PKT_LEN                    0
TOTAL_PKTS_EXP                    0
TOTAL_BYTES_EXP                   0
IN_BYTES                          0
IN_PKTS                           0
OUT_BYTES                         0
OUT_PKTS                          0
ANALYSIS_TIMESTAMP                0
ANOMALY                       28016
ID                                0
dtype: int64
# Fill the missing ANOMALY values with 0 (no anomaly)
train_df['ANOMALY'].fillna(0, inplace=True)
test_df['ANOMALY'].fillna(0, inplace=True)
In this section we analyze the datasets in order to have a better understanding of the data.
train_df.dtypes
FLOW_ID                         int64
PROTOCOL_MAP                   object
L4_SRC_PORT                     int64
IPV4_SRC_ADDR                  object
L4_DST_PORT                     int64
IPV4_DST_ADDR                  object
FIRST_SWITCHED                  int64
FLOW_DURATION_MILLISECONDS      int64
LAST_SWITCHED                   int64
PROTOCOL                        int64
TCP_FLAGS                       int64
TCP_WIN_MAX_IN                  int64
TCP_WIN_MAX_OUT                 int64
TCP_WIN_MIN_IN                  int64
TCP_WIN_MIN_OUT                 int64
TCP_WIN_MSS_IN                  int64
TCP_WIN_SCALE_IN                int64
TCP_WIN_SCALE_OUT               int64
SRC_TOS                         int64
DST_TOS                         int64
TOTAL_FLOWS_EXP                 int64
MIN_IP_PKT_LEN                  int64
MAX_IP_PKT_LEN                  int64
TOTAL_PKTS_EXP                  int64
TOTAL_BYTES_EXP                 int64
IN_BYTES                        int64
IN_PKTS                         int64
OUT_BYTES                       int64
OUT_PKTS                        int64
ANALYSIS_TIMESTAMP              int64
ANOMALY                       float64
ALERT                          object
ID                              int64
dtype: object
We can observe that the dataset is highly imbalanced, with the majority of the flows being normal (no attack detected). The number of malware attacks is also very low compared to the other attack types.
These two facts will have a big impact on the model training, as we will see later.
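One standard way to compensate for such imbalance is to weight each class inversely to its frequency; scikit-learn can compute these weights directly (the class_weight='balanced' option used for the SVC later applies the same formula internally). A minimal sketch:

from sklearn.utils.class_weight import compute_class_weight

# Inverse-frequency weights for the ALERT labels: rare classes get large weights
classes = np.unique(train_df['ALERT'])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=train_df['ALERT'])
for cls, w in zip(classes, weights):
    print(f'{cls}: {w:.2f}')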
# Show the distribution of the target variable
sns.countplot(x='ALERT', data=train_df)
<Axes: xlabel='ALERT', ylabel='count'>
# Count the number of unique protocol_maps
train_df['PROTOCOL_MAP'].value_counts()
tcp          62514
udp          54250
icmp          9728
gre             36
ipv6-icmp        1
Name: PROTOCOL_MAP, dtype: int64
fig, axs = plt.subplots(1, 3, figsize=(20, 5))
# seaborn countplots
sns.countplot(x='ANOMALY', data=train_df, ax=axs[0]).set(title='ANOMALY')
# Seaborn countplot for the 'PROTOCOL_MAP' column, with enough space for the labels
sns.countplot(x='PROTOCOL_MAP', data=train_df, ax=axs[1]).set(title='PROTOCOL_MAP')
# Boxplot for L4_SRC_PORT to understand the distribution of the data
sns.boxplot(
x='L4_SRC_PORT', data=train_df, ax=axs[2],
notch=True, showcaps=True,
flierprops={"marker": "x"}, # Change the outlier marker
showmeans=True, # Show the mean
boxprops={"facecolor": (.4, .6, .8, .5)},
).set(title='L4_SRC_PORT')
[Text(0.5, 1.0, 'L4_SRC_PORT')]
# Show protocol_map distribution for kind of ALERT
sns.countplot(x='PROTOCOL_MAP', hue='ALERT', data=train_df)
<Axes: xlabel='PROTOCOL_MAP', ylabel='count'>
Knowing the number of unique hosts in the dataset is important for understanding its size, since I expect that a bigger dataset will be harder to train on properly.
# Find unique hosts (IP addresses) in the train data
train_src_hosts = train_df['IPV4_SRC_ADDR'].unique()
train_dst_hosts = train_df['IPV4_DST_ADDR'].unique()
train_hosts = np.union1d(train_src_hosts, train_dst_hosts)
# Print the number of unique hosts
print('Number of unique hosts in the train data: ', len(train_hosts))
# Find unique hosts (IP addresses) in the test data
test_src_hosts = test_df['IPV4_SRC_ADDR'].unique()
test_dst_hosts = test_df['IPV4_DST_ADDR'].unique()
test_hosts = np.union1d(test_src_hosts, test_dst_hosts)
# Percentage (floored) by which the test host set is smaller than the train host set
ratio = math.floor((1.0-len(test_hosts)/len(train_hosts)) * 100)
# Print the number of unique hosts
print("Number of unique hosts in the test data: {} (~{}% smaller)".format(len(test_hosts), ratio))
Number of unique hosts in the train data:  16875
Number of unique hosts in the test data: 11085 (~34% smaller)
# Select a subset of columns for the distribution analysis
train_df_columns = train_df[['L4_SRC_PORT', 'L4_DST_PORT', 'PROTOCOL', 'ANOMALY', 'ALERT']]
# Distribution analysis using pairplot
sns.pairplot(train_df_columns, hue='ALERT')
<seaborn.axisgrid.PairGrid at 0x2d62c6290>
# Columns to drop (revoked from training)
revoked_columns = [
'FLOW_ID', # Completely random
'ID', # Completely random
'ANALYSIS_TIMESTAMP', # Completely random
'IPV4_SRC_ADDR', # Not useful for the model
'IPV4_DST_ADDR', # Not useful for the model
'PROTOCOL_MAP', # There is a numerical column for the protocol
'MIN_IP_PKT_LEN', # Always 0 since it is a minimum value
'MAX_IP_PKT_LEN', # Always 0 (maybe it means that the packets have unbounded length?)
'TOTAL_PKTS_EXP', # Always 0
'TOTAL_BYTES_EXP', # Always 0
]
# Create dummy columns for the ALERT column
alert_dummies = pd.get_dummies(train_df['ALERT'], prefix='ALERT', drop_first=True)
# Copy + drop the revoked columns
train_df = train_df.copy().drop(revoked_columns, axis=1)
We can observe that there are some features that are highly correlated with each other, such as IN_BYTES - OUT_BYTES and IN_PKTS - OUT_PKTS. This is not surprising, since these features are related to the amount of data exchanged between the two hosts.
We can also observe that a port scanning alert is highly correlated with the L4_DST_PORT and ANOMALY features. This is not surprising, since a port scanning attack is a type of attack that tries to find open ports on a host. It is highly correlated with ANOMALY probably because the forged packets are built in a way that they are not recognized as an attack by the network.
Unfortunately, since malware alerts are varied and have different characteristics/features, it is not possible to find a correlation between them and the other features. This could mean that the features in this dataset are not enough to detect malware attacks.
On the other hand, 'None' alerts are strongly negatively correlated with ANOMALY and L4_DST_PORT. This is not surprising, since normally a flow contains valid packets and the destination is usually a well-known port.
# Correlation heatmap using pandas
corr = pd.concat([train_df.drop('ALERT', axis=1), alert_dummies], axis=1).corr(
    numeric_only=False,  # All remaining columns are numeric at this point
)
# Correlation heatmap using seaborn + make annotations fit the heatmap
plt.figure(figsize=(20, 20))
sns.heatmap(corr, annot=True, fmt=".1f", cmap="YlGnBu")
<Axes: >
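Beyond the heatmap, the strongest pairwise correlations can be listed directly from the corr matrix computed above; a short sketch:

# Rank feature pairs by absolute correlation (self-correlations of 1.0 excluded;
# mirrored pairs share the same value, so drop_duplicates keeps one of each)
pairs = corr.abs().unstack()
pairs = pairs[pairs < 1.0].sort_values(ascending=False)
print(pairs.drop_duplicates().head(10))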
In this section we prepare the dataset for the machine learning algorithms. We will split the dataset into training and testing sets, and we will also scale the data to make it more suitable for the algorithms.
Since we already have a test set, we split our training set into training and validation sets. We will use Sklearn's StratifiedShuffleSplit to split the training set into 80% training and 20% validation while maintaining the same distribution of the target variable. This is needed since the dataset is highly imbalanced.
def split_maintain_distribution(X, y):
sss=StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=9)
indexes = sss.split(X, y)
train_indices, test_indices = next(indexes)
return X.iloc[train_indices], X.iloc[test_indices], y.iloc[train_indices], y.iloc[test_indices]
X_train, X_val, y_train, y_val = split_maintain_distribution(train_df.drop('ALERT', axis=1), train_df['ALERT'])
Now we check whether the distribution of the target variable really is the same in the training and validation sets.
# Print distribution of the target variable in the train and validation sets
print('Train set distribution:')
print(y_train.value_counts(normalize=True))
print()
print('Validation set distribution:')
print(y_val.value_counts(normalize=True))
Train set distribution:
None                 0.868508
Port Scanning        0.119380
Denial of Service    0.011983
Malware              0.000128
Name: ALERT, dtype: float64

Validation set distribution:
None                 0.868490
Port Scanning        0.119379
Denial of Service    0.011973
Malware              0.000158
Name: ALERT, dtype: float64
We can confirm that the distribution of the target variable is the same in the training and validation sets.
Scaling the data is important to prevent some features from having a bigger impact on the model training than others. This is especially important when dealing with features that have different units of measure.
# Fit the scaler on the train set
scaler = StandardScaler()
fitter = scaler.fit(X_train)
# Scale train and validation sets
x_train_scaled = fitter.transform(X_train)
x_validation_scaled = fitter.transform(X_val)
# Convert to pandas dataframe
df_feat_train = pd.DataFrame(x_train_scaled, columns=X_train.columns)
df_feat_validation = pd.DataFrame(x_validation_scaled, columns=X_val.columns)
In this section we will use a Random Forest classifier to find the most important features in the dataset. This will help us to reduce the number of features used in the model training, and therefore speed up the training process.
# Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=100) # 100 trees = default value
# Fit the model
rfc.fit(x_train_scaled, y_train)
RandomForestClassifier()
# Print features importance
feature_importances = pd.DataFrame(
rfc.feature_importances_,
index=X_train.columns,
columns=['importance']
).sort_values('importance', ascending=False)
print(feature_importances)
                            importance
IN_BYTES                      0.182619
ANOMALY                       0.170634
TCP_WIN_MSS_IN                0.106685
L4_DST_PORT                   0.079761
TCP_WIN_MAX_IN                0.071403
TCP_WIN_MIN_IN                0.059129
OUT_BYTES                     0.046427
FIRST_SWITCHED                0.043752
TOTAL_FLOWS_EXP               0.036191
FLOW_DURATION_MILLISECONDS    0.035744
LAST_SWITCHED                 0.029963
L4_SRC_PORT                   0.026551
TCP_FLAGS                     0.024626
TCP_WIN_SCALE_IN              0.019076
TCP_WIN_MAX_OUT               0.018523
IN_PKTS                       0.012175
PROTOCOL                      0.009500
SRC_TOS                       0.009323
OUT_PKTS                      0.007262
TCP_WIN_MIN_OUT               0.007026
TCP_WIN_SCALE_OUT             0.002606
DST_TOS                       0.001025
# Plot feature importance
plt.figure(figsize=(20, 10))
plt.xticks(rotation=-90)
sns.barplot(x=feature_importances.index, y=feature_importances['importance'])
<Axes: ylabel='importance'>
Select the most important features using the Random Forest classifier results
MIN_IMPORTANCE_THRESHOLD = 0.02
# Select all columns with importance > 0.02
COLUMNS = feature_importances[feature_importances['importance'] > MIN_IMPORTANCE_THRESHOLD].index
COLUMNS
Index(['IN_BYTES', 'ANOMALY', 'TCP_WIN_MSS_IN', 'L4_DST_PORT', 'TCP_WIN_MAX_IN', 'TCP_WIN_MIN_IN', 'OUT_BYTES', 'FIRST_SWITCHED', 'TOTAL_FLOWS_EXP', 'FLOW_DURATION_MILLISECONDS', 'LAST_SWITCHED', 'L4_SRC_PORT', 'TCP_FLAGS'], dtype='object')
X_train, X_val, y_train, y_val = split_maintain_distribution(
train_df[COLUMNS],
train_df['ALERT']
)
# Fit the scaler on the train set
scaler = StandardScaler()
fitter = scaler.fit(X_train)
# Scale train and validation sets
x_train_scaled = fitter.transform(X_train)
x_validation_scaled = fitter.transform(X_val)
# Convert to pandas dataframe
df_feat_train = pd.DataFrame(x_train_scaled, columns=X_train.columns)
df_feat_validation = pd.DataFrame(x_validation_scaled, columns=X_val.columns)
# The test set has no target variable and is never used for fitting, so fit and transform in one step
x_test_scaled = StandardScaler().fit_transform(test_df[COLUMNS])
# Convert to pandas dataframe
df_feat_test = pd.DataFrame(x_test_scaled, columns=test_df[COLUMNS].columns)
In this section we will use UMAP to visualize the dataset in 2D. This will help us understand if it is possible to separate the different classes of attacks in the dataset.
import umap
reducer = umap.UMAP(
random_state=42,
n_neighbors=50,
min_dist=0.3,
)
mapper = reducer.fit(x_train_scaled)
Reduce data dimensionality to 2 dimensions and plot the data using matplotlib.
# Transform the train set
embedding = mapper.transform(x_train_scaled)
# Plot the train set
plt.figure(figsize=(10, 10))
plt.scatter(
embedding[:, 0],
embedding[:, 1],
c=[sns.color_palette()[x] for x in y_train.map({'None': 0, 'Port Scanning': 1, 'Denial of Service': 2, 'Malware': 3})],
)
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the train set', fontsize=24)
Text(0.5, 1.0, 'UMAP projection of the train set')
Plot the data using umap.plot (which uses matplotlib under the hood).
import umap.plot
# Map the string labels to integers (kept for reference; umap.plot below uses the string labels directly)
labels = y_train.map({'None': 0, 'Port Scanning': 1, 'Denial of Service': 2, 'Malware': 3})
# Visualize the embedding using umap.plot
p = umap.plot.points(
mapper,
labels=y_train,
width=1000,
height=900,
)
umap.plot.show(p)
In this section we will train different models and compare their results. We will use the following models:
We can notice that the models are suspiciously precise, with a precision close to 1.0 for every kind of attack. This is probably due to the fact that the dataset is highly imbalanced, with the majority of the flows being normal (no attack detected), and that the number of malware attacks is very low compared to the other attacks.
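Given this imbalance, scoring the model selection with macro-averaged F1 instead of plain accuracy would let the rare classes weigh as much as the majority class. A hedged sketch of such a grid search (illustrative only; the runs below keep accuracy):

# Hypothetical variant: score the KNN grid search with macro F1,
# so the rare Malware class counts as much as the dominant None class
grid_f1 = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5]},
                       cv=3, scoring='f1_macro', n_jobs=-1)
grid_f1.fit(x_train_scaled, y_train)
print(grid_f1.best_params_, grid_f1.best_score_)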
To find the best K hyperparameter for KNN, we will run a cross-validated grid search on the training set. We will then use the chosen K to train the model and evaluate it on the validation set.
# Find best K using GridSearchCV
MAX_DEGREE = 30
k_range = list(range(1, MAX_DEGREE+1))
param_grid = dict(n_neighbors=k_range)
knn = KNeighborsClassifier()
grid = GridSearchCV(knn, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid.fit(x_train_scaled, y_train)
# Print information about the model
print(f"Best k: {grid.best_params_}")
print(f"Best score: {grid.best_score_}")
Best k: {'n_neighbors': 1}
Best score: 0.997569722296316
# Plot results
plt.figure(num=0, dpi=96, figsize=(10, 6))
plt.plot(k_range, grid.cv_results_['mean_test_score'])
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.xticks(k_range)
plt.show()
By looking at the graphical outcome, the best parameter for KNN is K = 1. Since this value would likely lead to overfitting, we will use the next odd value, K = 3.
This outcome is not surprising, since the training and validation sets probably come from the same network and the same hosts, so the flows are very similar to each other. This means that the best way to test our model is to use the test set.
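As a sanity check on this choice, the already-imported cross_val_score can compare K = 1 and K = 3 directly on the scaled training data; a small sketch:

# Cross-validated comparison of the two candidate K values
for k in (1, 3):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             x_train_scaled, y_train, cv=5, n_jobs=-1)
    print(f"K={k}: mean accuracy {scores.mean():.4f} (+/- {scores.std():.4f})")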
# Create a KNN classifier with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3) # 3 = view note above
# Fit the classifier to the data
knn.fit(x_train_scaled, y_train)
# Make predictions on validation set
predictions = knn.predict(x_validation_scaled)
# Print the classification report
print(classification_report(y_val, predictions))
                   precision    recall  f1-score   support

Denial of Service       0.94      1.00      0.97       303
          Malware       1.00      1.00      1.00         4
             None       1.00      1.00      1.00     21978
    Port Scanning       0.98      1.00      0.99      3021

         accuracy                           1.00     25306
        macro avg       0.98      1.00      0.99     25306
     weighted avg       1.00      1.00      1.00     25306
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])
# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
<Axes: >
Unfortunately, the test set doesn't include the target variable, so we can't evaluate the model on it. We can only evaluate the model on the validation set.
# Prediction on the test set
predictions = knn.predict(x_test_scaled)
# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class
None                 53884
Port Scanning         7621
Denial of Service      810
Malware                  5
dtype: int64
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges, though it is mostly used for classification. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Then we perform classification by finding the hyperplane that best separates the classes.
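To make the idea concrete, here is a minimal, self-contained sketch on synthetic 2-D data (not the flow dataset): the RBF SVC fits a boundary between two clusters, and only the points nearest the boundary become support vectors.

from sklearn.datasets import make_blobs

# Toy example: two 2-D clusters separated by an RBF decision boundary
X_toy, y_toy = make_blobs(n_samples=200, centers=2, random_state=0)
toy_svc = SVC(kernel='rbf').fit(X_toy, y_toy)
print("Support vectors per class:", toy_svc.n_support_)
print("Training accuracy on the toy data:", toy_svc.score(X_toy, y_toy))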
# Create grid search parameters
param_grid = {
'C': [0.1, 1, 10, 100, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
}
# Create grid search
svc_grid = GridSearchCV(
SVC(kernel="rbf"),
param_grid,
cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
n_jobs=-1, # Use all cores
)
# Fit grid search
svc_grid.fit(x_train_scaled, y_train)
# Print information about the model
print(f"Best params: {svc_grid.best_params_}")
print(f"Best score: {svc_grid.best_score_}")
/opt/homebrew/Caskroom/miniconda/base/envs/datascience/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py:700: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak. warnings.warn(
Best params: {'C': 1000, 'gamma': 1}
Best score: 0.998360055142362
# Create SVM with best parameters
svc = SVC(
kernel='rbf',
C=svc_grid.best_params_['C'],
gamma=svc_grid.best_params_['gamma'],
)
svc.fit(x_train_scaled, y_train)
SVC(C=1000, gamma=1)
# Make predictions on validation set
predictions = svc.predict(x_validation_scaled)
# Print the classification report
print(classification_report(y_val, predictions))
                   precision    recall  f1-score   support

Denial of Service       0.98      1.00      0.99       303
          Malware       1.00      1.00      1.00         4
             None       1.00      1.00      1.00     21978
    Port Scanning       0.99      1.00      1.00      3021

         accuracy                           1.00     25306
        macro avg       0.99      1.00      1.00     25306
     weighted avg       1.00      1.00      1.00     25306
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])
# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
<Axes: >
# Prediction on the test set
predictions = svc.predict(x_test_scaled)
# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class
None                 54007
Port Scanning         7541
Denial of Service      767
Malware                  5
dtype: int64
# Create the two pipeline steps
pca = PCA(whiten=True, random_state=42) # PCA (Principal Component Analysis)
svc = SVC(kernel='rbf', class_weight='balanced') # SVC (Support Vector Classification)
# Create pipeline
model = make_pipeline(pca, svc)
# Generate a valid n_components range (from 5 up to the number of features, in steps of 3)
n_features = x_train_scaled.shape[1]
n_components = np.arange(5, n_features, 3)
param_grid = {
'pca__n_components': n_components,
'svc__C': [50, 100, 500, 1000, 5000, 10000],
'svc__gamma': [0.001, 0.01, 0.1, 1, 10]
}
# Grid search
pipeline_grid = GridSearchCV(
model,
param_grid,
cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
n_jobs=-1 # Use all cores
)
pipeline_grid.fit(x_train_scaled, y_train)
# Print information about the model
print(f"Best params: {pipeline_grid.best_params_}")
print(f"Best score: {pipeline_grid.best_score_}")
/opt/homebrew/Caskroom/miniconda/base/envs/datascience/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py:700: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak. warnings.warn(
Best params: {'pca__n_components': 11, 'svc__C': 5000, 'svc__gamma': 1}
Best score: 0.9983402973726345
# Now, create the desired pipeline
pca = PCA(
n_components=pipeline_grid.best_params_['pca__n_components'],
whiten=True,
random_state=42
)
svc = SVC(kernel='rbf',
class_weight='balanced',
# Use the best parameters found by the grid search
C=pipeline_grid.best_params_['svc__C'],
gamma=pipeline_grid.best_params_['svc__gamma']
)
model = make_pipeline(pca, svc)
model.fit(x_train_scaled, y_train)
Pipeline(steps=[('pca', PCA(n_components=11, random_state=42, whiten=True)), ('svc', SVC(C=5000, class_weight='balanced', gamma=1))])
# Make predictions on validation set
predictions = model.predict(x_validation_scaled)
# Print the classification report
print(classification_report(y_val, predictions))
                   precision    recall  f1-score   support

Denial of Service       0.98      1.00      0.99       303
          Malware       1.00      1.00      1.00         4
             None       1.00      1.00      1.00     21978
    Port Scanning       0.99      1.00      1.00      3021

         accuracy                           1.00     25306
        macro avg       0.99      1.00      1.00     25306
     weighted avg       1.00      1.00      1.00     25306
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])
# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
<Axes: >
# Prediction on the test set
predictions = model.predict(x_test_scaled)
# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class
None                 54005
Port Scanning         7421
Denial of Service      891
Malware                  3
dtype: int64
Bagging Classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then making an ensemble out of it.
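The variance-reduction effect is easiest to see with a decision tree as the base estimator; a short illustrative sketch (a single tree versus a bag of 30 bootstrapped trees on the same scaled training data):

from sklearn.tree import DecisionTreeClassifier

# Compare a single tree against a bag of 30 trees, each fit on a bootstrap sample
single_tree = DecisionTreeClassifier(random_state=42)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=30,
                                 random_state=42, n_jobs=-1)
for name, est in [('single tree', single_tree), ('bagged trees', bagged_trees)]:
    scores = cross_val_score(est, x_train_scaled, y_train, cv=3, n_jobs=-1)
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")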
svc = SVC(kernel='rbf',
class_weight='balanced',
C=svc_grid.best_params_['C'],
gamma=svc_grid.best_params_['gamma']
)
clf = BaggingClassifier(
svc,
n_estimators=30,
n_jobs=-1, # Use all cores
random_state=42
)
clf.fit(x_train_scaled, y_train)
BaggingClassifier(base_estimator=SVC(C=1000, class_weight='balanced', gamma=1), n_estimators=30, n_jobs=-1, random_state=42)
predictions = clf.predict(x_validation_scaled)
# Print the classification report
print(classification_report(y_val, predictions))
                   precision    recall  f1-score   support

Denial of Service       0.96      1.00      0.98       303
          Malware       1.00      1.00      1.00         4
             None       1.00      1.00      1.00     21978
    Port Scanning       0.99      1.00      1.00      3021

         accuracy                           1.00     25306
        macro avg       0.99      1.00      0.99     25306
     weighted avg       1.00      1.00      1.00     25306
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])
# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
<Axes: >
# Prediction on the test set
predictions = clf.predict(x_test_scaled)
# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class
None                 53947
Port Scanning         7558
Denial of Service      810
Malware                  5
dtype: int64
Random Forest is an ensemble method that combines multiple decision trees to create a more accurate model. It is a supervised learning algorithm that can be used for both classification and regression tasks.
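A side benefit of the bootstrap: each tree leaves out roughly a third of the rows, which gives a free validation estimate (the out-of-bag score). A brief sketch (illustrative; the grid search below selects n_estimators by cross-validation instead):

# Out-of-bag estimate: each tree is scored on the rows its bootstrap sample missed
rfc_oob = RandomForestClassifier(n_estimators=100, oob_score=True,
                                 n_jobs=-1, random_state=42)
rfc_oob.fit(x_train_scaled, y_train)
print('OOB accuracy estimate:', rfc_oob.oob_score_)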
# Create random forest classifier
rfc = RandomForestClassifier()
# Create a dictionary of all values we want to test for n_estimators
parameters = {'n_estimators': [1, 2, 4, 10, 15, 20, 30, 40, 50, 100, 200, 500, 1000]}
# Used to find the best n_estimators value to use to train the model
rfc_grid = GridSearchCV(
rfc,
parameters,
scoring='accuracy',
cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
n_jobs=-1 # Use all cores
)
# Fit model to data
rfc_grid.fit(x_train_scaled, y_train)
# Extract best params
print(f"Best params: {rfc_grid.best_params_}")
print(f"Best score: {rfc_grid.best_score_}")
Best params: {'n_estimators': 40}
Best score: 0.9996739870396807
rfc = RandomForestClassifier(n_estimators=rfc_grid.best_params_['n_estimators'])
rfc.fit(x_train_scaled, y_train)
RandomForestClassifier(n_estimators=40)
# Make predictions on validation set
predictions = rfc.predict(x_validation_scaled)
# Print the classification report
print(classification_report(y_val, predictions))
                   precision    recall  f1-score   support

Denial of Service       1.00      1.00      1.00       303
          Malware       1.00      0.75      0.86         4
             None       1.00      1.00      1.00     21978
    Port Scanning       1.00      1.00      1.00      3021

         accuracy                           1.00     25306
        macro avg       1.00      0.94      0.96     25306
     weighted avg       1.00      1.00      1.00     25306
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])
# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
<Axes: >
# Prediction on the test set
predictions = rfc.predict(x_test_scaled)
# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class
None             62161
Port Scanning      155
Malware              4
dtype: int64
This kind of classifier is an ensemble of decision trees. It is similar to a Random Forest classifier, but each tree is trained on the whole dataset instead of a bootstrap sample, and split thresholds are drawn at random instead of being optimized.
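A quick side-by-side sketch of the two ensembles on the same data (same number of trees; only the tree-building strategy differs):

# RandomForest bootstraps rows and searches for the best split;
# ExtraTrees uses the whole dataset and draws split thresholds at random
for name, est in [('RandomForest', RandomForestClassifier(n_estimators=30, random_state=42)),
                  ('ExtraTrees', ExtraTreesClassifier(n_estimators=30, random_state=42))]:
    scores = cross_val_score(est, x_train_scaled, y_train, cv=3, n_jobs=-1)
    print(f"{name}: {scores.mean():.4f}")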
# Create extra trees classifier
etc = ExtraTreesClassifier()
# Create a dictionary of all values we want to test for n_estimators
parameters = {'n_estimators': [1, 2, 4, 10, 15, 20, 30, 40, 50, 100, 200, 500]}
# Used to find the best n_estimators value to use to train the model
etc_grid = GridSearchCV(
etc,
parameters,
scoring='accuracy',
cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
n_jobs=-1 # Use all cores
)
# Fit model to data
etc_grid.fit(x_train_scaled, y_train)
# Extract best params
print(f"Best params: {etc_grid.best_params_}")
print(f"Best score: {etc_grid.best_score_}")
Best params: {'n_estimators': 30}
Best score: 0.9995949538136113
etc = ExtraTreesClassifier(n_estimators=etc_grid.best_params_['n_estimators'])
etc.fit(x_train_scaled, y_train)
ExtraTreesClassifier(n_estimators=30)
# Make predictions on validation set
predictions = etc.predict(x_validation_scaled)
# Print the classification report
print(classification_report(y_val, predictions))
                   precision    recall  f1-score   support

Denial of Service       1.00      1.00      1.00       303
          Malware       1.00      1.00      1.00         4
             None       1.00      1.00      1.00     21978
    Port Scanning       1.00      1.00      1.00      3021

         accuracy                           1.00     25306
        macro avg       1.00      1.00      1.00     25306
     weighted avg       1.00      1.00      1.00     25306
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])
# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
<Axes: >
# Prediction on the test set
predictions = etc.predict(x_test_scaled)
# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class
None                 54738
Port Scanning         7496
Denial of Service       81
Malware                  5
dtype: int64
# Create MLPClassifier
mlp = MLPClassifier(
max_iter=1000,
random_state=42
)
# Grid search for MLPClassifier
parameters = {
'hidden_layer_sizes': [(50,), (100,), (50, 50)],
'activation': ['relu', 'tanh'],
'alpha': [0.0001, 0.001],
'solver': ['adam', 'lbfgs'],
'learning_rate': ['constant', 'invscaling'],
}
mlp_grid = GridSearchCV(
mlp,
parameters,
cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
n_jobs=-1, # Use all cores
)
mlp_grid.fit(x_train_scaled, y_train)
/opt/homebrew/Caskroom/miniconda/base/envs/datascience/lib/python3.10/site-packages/sklearn/neural_network/_multilayer_perceptron.py:559: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. Increase the number of iterations (max_iter) or scale the data as shown in: https://scikit-learn.org/stable/modules/preprocessing.html self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
(the same ConvergenceWarning is emitted repeatedly, once per lbfgs fit in the grid search)
GridSearchCV(cv=2, estimator=MLPClassifier(max_iter=1000, random_state=42), n_jobs=-1, param_grid={'activation': ['relu', 'tanh'], 'alpha': [0.0001, 0.001], 'hidden_layer_sizes': [(50,), (100,), (50, 50)], 'learning_rate': ['constant', 'invscaling'], 'solver': ['adam', 'lbfgs']})
# Extract best params
print(f"Best params: {mlp_grid.best_params_}")
print(f"Best score: {mlp_grid.best_score_}")
Best params: {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50), 'learning_rate': 'constant', 'solver': 'lbfgs'}
Best score: 0.9990022021781368
# Create MLPClassifier with best parameters
mlp = MLPClassifier(
hidden_layer_sizes=mlp_grid.best_params_['hidden_layer_sizes'],
activation=mlp_grid.best_params_['activation'],
alpha=mlp_grid.best_params_['alpha'],
solver=mlp_grid.best_params_['solver'],
learning_rate=mlp_grid.best_params_['learning_rate'],
max_iter=1000,
random_state=42
)
mlp.fit(x_train_scaled, y_train)
/opt/homebrew/Caskroom/miniconda/base/envs/datascience/lib/python3.10/site-packages/sklearn/neural_network/_multilayer_perceptron.py:559: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. Increase the number of iterations (max_iter) or scale the data as shown in: https://scikit-learn.org/stable/modules/preprocessing.html self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
MLPClassifier(activation='tanh', hidden_layer_sizes=(50, 50), max_iter=1000, random_state=42, solver='lbfgs')
# Make predictions on validation set
predictions = mlp.predict(x_validation_scaled)
# Print the classification report
print(classification_report(y_val, predictions))
                   precision    recall  f1-score   support

Denial of Service       0.98      1.00      0.99       303
          Malware       1.00      1.00      1.00         4
             None       1.00      1.00      1.00     21978
    Port Scanning       1.00      1.00      1.00      3021

         accuracy                           1.00     25306
        macro avg       0.99      1.00      1.00     25306
     weighted avg       1.00      1.00      1.00     25306
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])
# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
<Axes: >
# Prediction on the test set
predictions = mlp.predict(x_test_scaled)
# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class
None                 54397
Port Scanning         7317
Denial of Service      601
Malware                  5
dtype: int64
I have observed outstanding performance across all tested classification models: nearly every model reached an almost perfect f1-score (close to 1.0).
On reflection, this stems from the substantial resemblance between the validation set (created by splitting the training set, given the absence of the target variable in the provided test set) and the training set used to fit the models.
Presumably, the training set was constructed using data from a specific network, resulting in a significant overlap of features between both sets. Consequently, the models achieved highly accurate classifications for almost all flows within the validation set. However, their performance may not be equally robust when applied to a distinct test set derived from a different network.
Nevertheless, I have opted to present the classification outcomes of each model on the provided test set, despite the unavailability of performance metrics for evaluation.
Acquiring a test set originating from a diverse network would have been advantageous, enabling a more precise assessment of the models' performance. Regrettably, obtaining such a test set proved unfeasible for this particular project.
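As a possible follow-up, the similarity between training and validation flows could be probed without a second network by splitting on hosts instead of rows, so that no IP address appears on both sides of the split. A hedged sketch using scikit-learn's GroupShuffleSplit (note: it must run on a fresh copy of the raw training data, since the address columns were dropped earlier):

from sklearn.model_selection import GroupShuffleSplit

# Hypothetical leakage check: hold out entire source hosts, not random rows
raw = pd.read_csv('data/train_net.csv')
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=9)
train_idx, val_idx = next(gss.split(raw, raw['ALERT'], groups=raw['IPV4_SRC_ADDR']))
print('Held-out source hosts:', raw['IPV4_SRC_ADDR'].iloc[val_idx].nunique())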