Luca Di Bello (luca.dibello@student.supsi.ch) - SUPSI - 2023
The goal of this project is to build a machine learning model that can detect and classify network attacks. The model will be trained on a dataset of network flows, and will predict whether a new flow is an attack and, if so, what kind of attack it is.
The dataset used for this project is written in the NetFlow V9 format (a format by Cisco, documentation available here). The dataset consists of two files:
In this section we load the datasets and, since the full dataset is very large, take a sample of it when in development mode.
# Load data processing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
%matplotlib inline
# Load machine learning libraries
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
If the following flag is set to True, the model will be trained on a smaller dataset, in order to speed up the development process. If the flag is set to False, the model will be trained on the whole dataset.
# If true, only 3% of the data will be used for training and testing of the various models
_DEVMODE = True
# Loading the data from the train and test files
train_df = pd.read_csv('data/train_net.csv')
test_df = pd.read_csv('data/test_net.csv')
# Print total size
print("Test set size: ", test_df.shape)
print("Train set size: ", train_df.shape)
# Value counts
train_df['ALERT'].value_counts()
Test set size:  (2077339, 32)
Train set size:  (4217625, 33)

None                 3659000
Port Scanning         507845
Denial of Service      50392
Malware                  388
Name: ALERT, dtype: int64
if _DEVMODE:
train_df = train_df.sample(frac=0.03, random_state=1)
test_df = test_df.sample(frac=0.03, random_state=1)
# Print total size
print("Test set size: ", test_df.shape)
print("Train set size: ", train_df.shape)
Test set size:  (62320, 32)
Train set size:  (126529, 33)
In this section we preprocess the datasets in order to make them usable by the machine learning algorithms.
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 126529 entries, 1283232 to 959711
Data columns (total 33 columns):
 #   Column                      Non-Null Count   Dtype
---  ------                      --------------   -----
 0   FLOW_ID                     126529 non-null  int64
 1   PROTOCOL_MAP                126529 non-null  object
 2   L4_SRC_PORT                 126529 non-null  int64
 3   IPV4_SRC_ADDR               126529 non-null  object
 4   L4_DST_PORT                 126529 non-null  int64
 5   IPV4_DST_ADDR               126529 non-null  object
 6   FIRST_SWITCHED              126529 non-null  int64
 7   FLOW_DURATION_MILLISECONDS  126529 non-null  int64
 8   LAST_SWITCHED               126529 non-null  int64
 9   PROTOCOL                    126529 non-null  int64
 10  TCP_FLAGS                   126529 non-null  int64
 11  TCP_WIN_MAX_IN              126529 non-null  int64
 12  TCP_WIN_MAX_OUT             126529 non-null  int64
 13  TCP_WIN_MIN_IN              126529 non-null  int64
 14  TCP_WIN_MIN_OUT             126529 non-null  int64
 15  TCP_WIN_MSS_IN              126529 non-null  int64
 16  TCP_WIN_SCALE_IN            126529 non-null  int64
 17  TCP_WIN_SCALE_OUT           126529 non-null  int64
 18  SRC_TOS                     126529 non-null  int64
 19  DST_TOS                     126529 non-null  int64
 20  TOTAL_FLOWS_EXP             126529 non-null  int64
 21  MIN_IP_PKT_LEN              126529 non-null  int64
 22  MAX_IP_PKT_LEN              126529 non-null  int64
 23  TOTAL_PKTS_EXP              126529 non-null  int64
 24  TOTAL_BYTES_EXP             126529 non-null  int64
 25  IN_BYTES                    126529 non-null  int64
 26  IN_PKTS                     126529 non-null  int64
 27  OUT_BYTES                   126529 non-null  int64
 28  OUT_PKTS                    126529 non-null  int64
 29  ANALYSIS_TIMESTAMP          126529 non-null  int64
 30  ANOMALY                     69493 non-null   float64
 31  ALERT                       126529 non-null  object
 32  ID                          126529 non-null  int64
dtypes: float64(1), int64(28), object(4)
memory usage: 32.8+ MB
# Show information about the data
def printInfo(df):
print('Dataframe shape: ', df.shape)
print('Dataframe columns: ', df.columns)
print('==== Train data ====')
printInfo(train_df)
print()
print('==== Test data ====')
printInfo(test_df)
==== Train data ====
Dataframe shape:  (126529, 33)
Dataframe columns:  Index(['FLOW_ID', 'PROTOCOL_MAP', 'L4_SRC_PORT', 'IPV4_SRC_ADDR',
       'L4_DST_PORT', 'IPV4_DST_ADDR', 'FIRST_SWITCHED',
       'FLOW_DURATION_MILLISECONDS', 'LAST_SWITCHED', 'PROTOCOL', 'TCP_FLAGS',
       'TCP_WIN_MAX_IN', 'TCP_WIN_MAX_OUT', 'TCP_WIN_MIN_IN',
       'TCP_WIN_MIN_OUT', 'TCP_WIN_MSS_IN', 'TCP_WIN_SCALE_IN',
       'TCP_WIN_SCALE_OUT', 'SRC_TOS', 'DST_TOS', 'TOTAL_FLOWS_EXP',
       'MIN_IP_PKT_LEN', 'MAX_IP_PKT_LEN', 'TOTAL_PKTS_EXP',
       'TOTAL_BYTES_EXP', 'IN_BYTES', 'IN_PKTS', 'OUT_BYTES', 'OUT_PKTS',
       'ANALYSIS_TIMESTAMP', 'ANOMALY', 'ALERT', 'ID'],
      dtype='object')

==== Test data ====
Dataframe shape:  (62320, 32)
Dataframe columns:  Index(['FLOW_ID', 'PROTOCOL_MAP', 'L4_SRC_PORT', 'IPV4_SRC_ADDR',
       'L4_DST_PORT', 'IPV4_DST_ADDR', 'FIRST_SWITCHED',
       'FLOW_DURATION_MILLISECONDS', 'LAST_SWITCHED', 'PROTOCOL', 'TCP_FLAGS',
       'TCP_WIN_MAX_IN', 'TCP_WIN_MAX_OUT', 'TCP_WIN_MIN_IN',
       'TCP_WIN_MIN_OUT', 'TCP_WIN_MSS_IN', 'TCP_WIN_SCALE_IN',
       'TCP_WIN_SCALE_OUT', 'SRC_TOS', 'DST_TOS', 'TOTAL_FLOWS_EXP',
       'MIN_IP_PKT_LEN', 'MAX_IP_PKT_LEN', 'TOTAL_PKTS_EXP',
       'TOTAL_BYTES_EXP', 'IN_BYTES', 'IN_PKTS', 'OUT_BYTES', 'OUT_PKTS',
       'ANALYSIS_TIMESTAMP', 'ANOMALY', 'ID'],
      dtype='object')
train_df.head()
  | FLOW_ID | PROTOCOL_MAP | L4_SRC_PORT | IPV4_SRC_ADDR | L4_DST_PORT | IPV4_DST_ADDR | FIRST_SWITCHED | FLOW_DURATION_MILLISECONDS | LAST_SWITCHED | PROTOCOL | ... | TOTAL_PKTS_EXP | TOTAL_BYTES_EXP | IN_BYTES | IN_PKTS | OUT_BYTES | OUT_PKTS | ANALYSIS_TIMESTAMP | ANOMALY | ALERT | ID
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1283232 | 372246818 | tcp | 62980 | 10.114.241.166 | 47736 | 10.114.224.117 | 1647766047 | 0 | 1647766047 | 6 | ... | 0 | 0 | 44 | 1 | 40 | 1 | 1647766568 | 0.0 | Port Scanning | 1283232
3327778 | 372024895 | udp | 44246 | 10.114.227.143 | 53 | 10.114.226.5 | 1647761040 | 0 | 1647761040 | 17 | ... | 0 | 0 | 73 | 1 | 121 | 1 | 1647761116 | 0.0 | None | 3327778
3341312 | 334833539 | tcp | 36150 | 10.114.225.212 | 5228 | 74.125.133.188 | 1647346083 | 144 | 1647346083 | 6 | ... | 0 | 0 | 132 | 2 | 130 | 2 | 1647346141 | NaN | None | 3341312
1704357 | 369619889 | tcp | 33858 | 10.114.225.206 | 6443 | 10.114.232.94 | 1647693586 | 0 | 1647693586 | 6 | ... | 0 | 0 | 60 | 1 | 0 | 0 | 1647693660 | 1.0 | None | 1704357
4163521 | 371759146 | udp | 48586 | 10.114.227.52 | 53 | 10.114.226.5 | 1647752706 | 0 | 1647752706 | 17 | ... | 0 | 0 | 91 | 1 | 146 | 1 | 1647752852 | 0.0 | None | 4163521
5 rows × 33 columns
# Check for missing values
print('==== Train data ====')
print(train_df.isnull().sum())
print()
print('==== Test data ====')
print(test_df.isnull().sum())
print()
==== Train data ====
FLOW_ID                           0
PROTOCOL_MAP                      0
L4_SRC_PORT                       0
IPV4_SRC_ADDR                     0
L4_DST_PORT                       0
IPV4_DST_ADDR                     0
FIRST_SWITCHED                    0
FLOW_DURATION_MILLISECONDS        0
LAST_SWITCHED                     0
PROTOCOL                          0
TCP_FLAGS                         0
TCP_WIN_MAX_IN                    0
TCP_WIN_MAX_OUT                   0
TCP_WIN_MIN_IN                    0
TCP_WIN_MIN_OUT                   0
TCP_WIN_MSS_IN                    0
TCP_WIN_SCALE_IN                  0
TCP_WIN_SCALE_OUT                 0
SRC_TOS                           0
DST_TOS                           0
TOTAL_FLOWS_EXP                   0
MIN_IP_PKT_LEN                    0
MAX_IP_PKT_LEN                    0
TOTAL_PKTS_EXP                    0
TOTAL_BYTES_EXP                   0
IN_BYTES                          0
IN_PKTS                           0
OUT_BYTES                         0
OUT_PKTS                          0
ANALYSIS_TIMESTAMP                0
ANOMALY                       57036
ALERT                             0
ID                                0
dtype: int64

==== Test data ====
FLOW_ID                           0
PROTOCOL_MAP                      0
L4_SRC_PORT                       0
IPV4_SRC_ADDR                     0
L4_DST_PORT                       0
IPV4_DST_ADDR                     0
FIRST_SWITCHED                    0
FLOW_DURATION_MILLISECONDS        0
LAST_SWITCHED                     0
PROTOCOL                          0
TCP_FLAGS                         0
TCP_WIN_MAX_IN                    0
TCP_WIN_MAX_OUT                   0
TCP_WIN_MIN_IN                    0
TCP_WIN_MIN_OUT                   0
TCP_WIN_MSS_IN                    0
TCP_WIN_SCALE_IN                  0
TCP_WIN_SCALE_OUT                 0
SRC_TOS                           0
DST_TOS                           0
TOTAL_FLOWS_EXP                   0
MIN_IP_PKT_LEN                    0
MAX_IP_PKT_LEN                    0
TOTAL_PKTS_EXP                    0
TOTAL_BYTES_EXP                   0
IN_BYTES                          0
IN_PKTS                           0
OUT_BYTES                         0
OUT_PKTS                          0
ANALYSIS_TIMESTAMP                0
ANOMALY                       28016
ID                                0
dtype: int64
# Fill the missing ANOMALY values with 0 (no anomaly)
train_df['ANOMALY'].fillna(0, inplace=True)
test_df['ANOMALY'].fillna(0, inplace=True)
In this section we analyze the datasets in order to have a better understanding of the data.
train_df.dtypes
FLOW_ID                         int64
PROTOCOL_MAP                   object
L4_SRC_PORT                     int64
IPV4_SRC_ADDR                  object
L4_DST_PORT                     int64
IPV4_DST_ADDR                  object
FIRST_SWITCHED                  int64
FLOW_DURATION_MILLISECONDS      int64
LAST_SWITCHED                   int64
PROTOCOL                        int64
TCP_FLAGS                       int64
TCP_WIN_MAX_IN                  int64
TCP_WIN_MAX_OUT                 int64
TCP_WIN_MIN_IN                  int64
TCP_WIN_MIN_OUT                 int64
TCP_WIN_MSS_IN                  int64
TCP_WIN_SCALE_IN                int64
TCP_WIN_SCALE_OUT               int64
SRC_TOS                         int64
DST_TOS                         int64
TOTAL_FLOWS_EXP                 int64
MIN_IP_PKT_LEN                  int64
MAX_IP_PKT_LEN                  int64
TOTAL_PKTS_EXP                  int64
TOTAL_BYTES_EXP                 int64
IN_BYTES                        int64
IN_PKTS                         int64
OUT_BYTES                       int64
OUT_PKTS                        int64
ANALYSIS_TIMESTAMP              int64
ANOMALY                       float64
ALERT                          object
ID                              int64
dtype: object
We can observe that the dataset is highly imbalanced, with the majority of the flows being normal (no attack detected). The number of malware attacks is also very low compared to the other attack types.
These two facts will have a big impact on the model training, as we will see later.
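One standard way to compensate for such imbalance is to weight each class inversely to its frequency; scikit-learn can compute these weights directly (the class_weight='balanced' option used for the SVC later applies the same formula internally). A minimal sketch:

from sklearn.utils.class_weight import compute_class_weight

# Inverse-frequency weights for the ALERT labels: rare classes get large weights
classes = np.unique(train_df['ALERT'])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=train_df['ALERT'])
for cls, w in zip(classes, weights):
    print(f'{cls}: {w:.2f}')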
# Show the distribution of the target variable
sns.countplot(x='ALERT', data=train_df)
<Axes: xlabel='ALERT', ylabel='count'>
# Count the number of unique protocol_maps
train_df['PROTOCOL_MAP'].value_counts()
tcp          62514
udp          54250
icmp          9728
gre             36
ipv6-icmp        1
Name: PROTOCOL_MAP, dtype: int64
fig, axs = plt.subplots(1, 3, figsize=(20, 5))
# seaborn countplots
sns.countplot(x='ANOMALY', data=train_df, ax=axs[0]).set(title='ANOMALY')
# Seaborn countplot for the 'PROTOCOL_MAP' column, with enough space for the labels
sns.countplot(x='PROTOCOL_MAP', data=train_df, ax=axs[1]).set(title='PROTOCOL_MAP')
# Boxplot for L4_SRC_PORT to understand the distribution of the data
sns.boxplot(
x='L4_SRC_PORT', data=train_df, ax=axs[2],
notch=True, showcaps=True,
flierprops={"marker": "x"}, # Change the outlier marker
showmeans=True, # Show the mean
boxprops={"facecolor": (.4, .6, .8, .5)},
).set(title='L4_SRC_PORT')
[Text(0.5, 1.0, 'L4_SRC_PORT')]
# Show protocol_map distribution for kind of ALERT
sns.countplot(x='PROTOCOL_MAP', hue='ALERT', data=train_df)
<Axes: xlabel='PROTOCOL_MAP', ylabel='count'>
Knowing the number of unique hosts in the dataset is important for understanding its size, since I expect that a bigger dataset will be harder to train on properly.
# Find unique hosts (IP addresses) in the train data
train_src_hosts = train_df['IPV4_SRC_ADDR'].unique()
train_dst_hosts = train_df['IPV4_DST_ADDR'].unique()
train_hosts = np.union1d(train_src_hosts, train_dst_hosts)
# Print the number of unique hosts
print('Number of unique hosts in the train data: ', len(train_hosts))
# Find unique hosts (IP addresses) in the test data
test_src_hosts = test_df['IPV4_SRC_ADDR'].unique()
test_dst_hosts = test_df['IPV4_DST_ADDR'].unique()
test_hosts = np.union1d(test_src_hosts, test_dst_hosts)
# Percentage (floored) by which the test host set is smaller than the train host set
ratio = math.floor((1.0-len(test_hosts)/len(train_hosts)) * 100)
# Print the number of unique hosts
print("Number of unique hosts in the test data: {} (~{}% smaller)".format(len(test_hosts), ratio))
Number of unique hosts in the train data:  16875
Number of unique hosts in the test data: 11085 (~34% smaller)
# Select a subset of columns for the distribution analysis
train_df_columns = train_df[['L4_SRC_PORT', 'L4_DST_PORT', 'PROTOCOL', 'ANOMALY', 'ALERT']]
# Distribution analysis using pairplot
sns.pairplot(train_df_columns, hue='ALERT')
<seaborn.axisgrid.PairGrid at 0x2d62c6290>
# Columns to drop (revoked from training)
revoked_columns = [
'FLOW_ID', # Completely random
'ID', # Completely random
'ANALYSIS_TIMESTAMP', # Completely random
'IPV4_SRC_ADDR', # Not useful for the model
'IPV4_DST_ADDR', # Not useful for the model
'PROTOCOL_MAP', # There is a numerical column for the protocol
'MIN_IP_PKT_LEN', # Always 0 since it is a minimum value
'MAX_IP_PKT_LEN', # Always 0 (maybe it means that the packets have unbounded length?)
'TOTAL_PKTS_EXP', # Always 0
'TOTAL_BYTES_EXP', # Always 0
]
# Create dummy columns for the ALERT column
alert_dummies = pd.get_dummies(train_df['ALERT'], prefix='ALERT', drop_first=True)
# Copy + drop the revoked columns
train_df = train_df.copy().drop(revoked_columns, axis=1)
We can observe that there are some features that are highly correlated with each other, such as IN_BYTES - OUT_BYTES and IN_PKTS - OUT_PKTS. This is not surprising, since these features are related to the amount of data exchanged between the two hosts.
We can also observe that a port scanning alert is highly correlated with the L4_DST_PORT and ANOMALY features. This is not surprising, since a port scanning attack is a type of attack that tries to find open ports on a host. It is highly correlated with ANOMALY probably because the forged packets are built in a way that they are not recognized as an attack by the network.
Unfortunately, since malware alerts are varied and have different characteristics/features, it is not possible to find a correlation between them and the other features. This could mean that the features in this dataset are not enough to detect malware attacks.
On the other hand, 'None' alerts are strongly negatively correlated with ANOMALY and L4_DST_PORT. This is not surprising, since normally a flow contains valid packets and the destination is usually a well-known port.
# Correlation heatmap using pandas
corr = pd.concat([train_df.drop('ALERT', axis=1), alert_dummies], axis=1).corr(
    numeric_only=False,  # All remaining columns are numeric at this point
)
# Correlation heatmap using seaborn + make annotations fit the heatmap
plt.figure(figsize=(20, 20))
sns.heatmap(corr, annot=True, fmt=".1f", cmap="YlGnBu")
<Axes: >
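Beyond the heatmap, the strongest pairwise correlations can be listed directly from the corr matrix computed above; a short sketch:

# Rank feature pairs by absolute correlation (self-correlations of 1.0 excluded;
# mirrored pairs share the same value, so drop_duplicates keeps one of each)
pairs = corr.abs().unstack()
pairs = pairs[pairs < 1.0].sort_values(ascending=False)
print(pairs.drop_duplicates().head(10))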
In this section we prepare the dataset for the machine learning algorithms. We will split the dataset into training and testing sets, and we will also scale the data to make it more suitable for the algorithms.
Since we already have a test set, we split our training set into training and validation sets. We will use Sklearn's StratifiedShuffleSplit to split the training set into 80% training and 20% validation while maintaining the same distribution of the target variable. This is needed since the dataset is highly imbalanced.
def split_maintain_distribution(X, y):
sss=StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=9)
indexes = sss.split(X, y)
train_indices, test_indices = next(indexes)
return X.iloc[train_indices], X.iloc[test_indices], y.iloc[train_indices], y.iloc[test_indices]
X_train, X_val, y_train, y_val = split_maintain_distribution(train_df.drop('ALERT', axis=1), train_df['ALERT'])
Now we check whether the distribution of the target variable really is the same in the training and validation sets.
# Print distribution of the target variable in the train and validation sets
print('Train set distribution:')
print(y_train.value_counts(normalize=True))
print()
print('Validation set distribution:')
print(y_val.value_counts(normalize=True))
Train set distribution:
None                 0.868508
Port Scanning        0.119380
Denial of Service    0.011983
Malware              0.000128
Name: ALERT, dtype: float64

Validation set distribution:
None                 0.868490
Port Scanning        0.119379
Denial of Service    0.011973
Malware              0.000158
Name: ALERT, dtype: float64
We can confirm that the distribution of the target variable is the same in the training and validation sets.
Scaling the data is important to prevent some features from having a bigger impact on the model training than others. This is especially important when dealing with features that have different units of measure.
# Fit the scaler on the train set
scaler = StandardScaler()
fitter = scaler.fit(X_train)
# Scale train and validation sets
x_train_scaled = fitter.transform(X_train)
x_validation_scaled = fitter.transform(X_val)
# Convert to pandas dataframe
df_feat_train = pd.DataFrame(x_train_scaled, columns=X_train.columns)
df_feat_validation = pd.DataFrame(x_validation_scaled, columns=X_val.columns)
In this section we will use a Random Forest classifier to find the most important features in the dataset. This will help us to reduce the number of features used in the model training, and therefore speed up the training process.
# Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=100) # 100 trees = default value
# Fit the model
rfc.fit(x_train_scaled, y_train)
RandomForestClassifier()
# Print features importance
feature_importances = pd.DataFrame(
rfc.feature_importances_,
index=X_train.columns,
columns=['importance']
).sort_values('importance', ascending=False)
print(feature_importances)
                            importance
IN_BYTES                      0.182619
ANOMALY                       0.170634
TCP_WIN_MSS_IN                0.106685
L4_DST_PORT                   0.079761
TCP_WIN_MAX_IN                0.071403
TCP_WIN_MIN_IN                0.059129
OUT_BYTES                     0.046427
FIRST_SWITCHED                0.043752
TOTAL_FLOWS_EXP               0.036191
FLOW_DURATION_MILLISECONDS    0.035744
LAST_SWITCHED                 0.029963
L4_SRC_PORT                   0.026551
TCP_FLAGS                     0.024626
TCP_WIN_SCALE_IN              0.019076
TCP_WIN_MAX_OUT               0.018523
IN_PKTS                       0.012175
PROTOCOL                      0.009500
SRC_TOS                       0.009323
OUT_PKTS                      0.007262
TCP_WIN_MIN_OUT               0.007026
TCP_WIN_SCALE_OUT             0.002606
DST_TOS                       0.001025
# Plot feature importance
plt.figure(figsize=(20, 10))
plt.xticks(rotation=-90)
sns.barplot(x=feature_importances.index, y=feature_importances['importance'])
<Axes: ylabel='importance'>
Select the most important features using the Random Forest classifier results
MIN_IMPORTANCE_THRESHOLD = 0.02
# Select all columns with importance > 0.02
COLUMNS = feature_importances[feature_importances['importance'] > MIN_IMPORTANCE_THRESHOLD].index
COLUMNS
Index(['IN_BYTES', 'ANOMALY', 'TCP_WIN_MSS_IN', 'L4_DST_PORT', 'TCP_WIN_MAX_IN', 'TCP_WIN_MIN_IN', 'OUT_BYTES', 'FIRST_SWITCHED', 'TOTAL_FLOWS_EXP', 'FLOW_DURATION_MILLISECONDS', 'LAST_SWITCHED', 'L4_SRC_PORT', 'TCP_FLAGS'], dtype='object')
X_train, X_val, y_train, y_val = split_maintain_distribution(
train_df[COLUMNS],
train_df['ALERT']
)
# Fit the scaler on the train set
scaler = StandardScaler()
fitter = scaler.fit(X_train)
# Scale train and validation sets
x_train_scaled = fitter.transform(X_train)
x_validation_scaled = fitter.transform(X_val)
# Convert to pandas dataframe
df_feat_train = pd.DataFrame(x_train_scaled, columns=X_train.columns)
df_feat_validation = pd.DataFrame(x_validation_scaled, columns=X_val.columns)
# The test set has no target variable and is never used for fitting, so fit and transform in one step
x_test_scaled = StandardScaler().fit_transform(test_df[COLUMNS])
# Convert to pandas dataframe
df_feat_test = pd.DataFrame(x_test_scaled, columns=test_df[COLUMNS].columns)
In this section we will use UMAP to visualize the dataset in 2D. This will help us understand if it is possible to separate the different classes of attacks in the dataset.
import umap
reducer = umap.UMAP(
random_state=42,
n_neighbors=50,
min_dist=0.3,
)
mapper = reducer.fit(x_train_scaled)
Reduce data dimensionality to 2 dimensions and plot the data using matplotlib.
# Transform the train set
embedding = mapper.transform(x_train_scaled)
# Plot the train set
plt.figure(figsize=(10, 10))
plt.scatter(
embedding[:, 0],
embedding[:, 1],
c=[sns.color_palette()[x] for x in y_train.map({'None': 0, 'Port Scanning': 1, 'Denial of Service': 2, 'Malware': 3})],
)
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the train set', fontsize=24)
Text(0.5, 1.0, 'UMAP projection of the train set')
Plot the data using umap.plot (which uses matplotlib under the hood).
import umap.plot
# Map the string labels to integers (kept for reference; umap.plot below uses the string labels directly)
labels = y_train.map({'None': 0, 'Port Scanning': 1, 'Denial of Service': 2, 'Malware': 3})
# Visualize the embedding using umap.plot
p = umap.plot.points(
mapper,
labels=y_train,
width=1000,
height=900,
)
umap.plot.show(p)
In this section we will train different models and compare their results. We will use the following models:
We can notice that the models are suspiciously precise, with a precision close to 1.0 for every kind of attack. This is probably due to the fact that the dataset is highly imbalanced, with the majority of the flows being normal (no attack detected), and that the number of malware attacks is very low compared to the other attacks.
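Given this imbalance, scoring the model selection with macro-averaged F1 instead of plain accuracy would let the rare classes weigh as much as the majority class. A hedged sketch of such a grid search (illustrative only; the runs below keep accuracy):

# Hypothetical variant: score the KNN grid search with macro F1,
# so the rare Malware class counts as much as the dominant None class
grid_f1 = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5]},
                       cv=3, scoring='f1_macro', n_jobs=-1)
grid_f1.fit(x_train_scaled, y_train)
print(grid_f1.best_params_, grid_f1.best_score_)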
To find the best K hyperparameter for KNN, we will run a cross-validated grid search on the training set. We will then use the chosen K to train the model and evaluate it on the validation set.
# Find best K using GridSearchCV
MAX_DEGREE = 30
k_range = list(range(1, MAX_DEGREE+1))
param_grid = dict(n_neighbors=k_range)
knn = KNeighborsClassifier()
grid = GridSearchCV(knn, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid.fit(x_train_scaled, y_train)
# Print information about the model
print(f"Best k: {grid.best_params_}")
print(f"Best score: {grid.best_score_}")
Best k: {'n_neighbors': 1}
Best score: 0.997569722296316
# Plot results
plt.figure(num=0, dpi=96, figsize=(10, 6))
plt.plot(k_range, grid.cv_results_['mean_test_score'])
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.xticks(k_range)
plt.show()
By looking at the graphical outcome, the best parameter for KNN is K = 1. Since this value would likely lead to overfitting, we will use the next odd value, K = 3.
This outcome is not surprising, since the training and validation sets probably come from the same network and the same hosts, so the flows are very similar to each other. This means that the best way to test our model is to use the test set.
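As a sanity check on this choice, the already-imported cross_val_score can compare K = 1 and K = 3 directly on the scaled training data; a small sketch:

# Cross-validated comparison of the two candidate K values
for k in (1, 3):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             x_train_scaled, y_train, cv=5, n_jobs=-1)
    print(f"K={k}: mean accuracy {scores.mean():.4f} (+/- {scores.std():.4f})")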
# Create a KNN classifier with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3) # 3 = view note above
# Fit the classifier to the data
knn.fit(x_train_scaled, y_train)
# Make predictions on validation set
predictions = knn.predict(x_validation_scaled)
# Print the classification report
print(classification_report(y_val, predictions))
                   precision    recall  f1-score   support

Denial of Service       0.94      1.00      0.97       303
          Malware       1.00      1.00      1.00         4
             None       1.00      1.00      1.00     21978
    Port Scanning       0.98      1.00      0.99      3021

         accuracy                           1.00     25306
        macro avg       0.98      1.00      0.99     25306
     weighted avg       1.00      1.00      1.00     25306
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])
# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
<Axes: >
Unfortunately, the test set doesn't include the target variable, so we can't evaluate the model on it. We can only evaluate the model on the validation set.
# Prediction on the test set
predictions = knn.predict(x_test_scaled)
# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class
None                 53884
Port Scanning         7621
Denial of Service      810
Malware                  5
dtype: int64
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges, though it is mostly used for classification. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Then we perform classification by finding the hyperplane that best separates the classes.
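To make the idea concrete, here is a minimal, self-contained sketch on synthetic 2-D data (not the flow dataset): the RBF SVC fits a boundary between two clusters, and only the points nearest the boundary become support vectors.

from sklearn.datasets import make_blobs

# Toy example: two 2-D clusters separated by an RBF decision boundary
X_toy, y_toy = make_blobs(n_samples=200, centers=2, random_state=0)
toy_svc = SVC(kernel='rbf').fit(X_toy, y_toy)
print("Support vectors per class:", toy_svc.n_support_)
print("Training accuracy on the toy data:", toy_svc.score(X_toy, y_toy))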
# Create grid search parameters
param_grid = {
'C': [0.1, 1, 10, 100, 1000],
'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
}
# Create grid search
svc_grid = GridSearchCV(
SVC(kernel="rbf"),
param_grid,
cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
n_jobs=-1, # Use all cores
)
# Fit grid search
svc_grid.fit(x_train_scaled, y_train)
# Print information about the model
print(f"Best params: {svc_grid.best_params_}")
print(f"Best score: {svc_grid.best_score_}")
/opt/homebrew/Caskroom/miniconda/base/envs/datascience/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py:700: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak. warnings.warn(
Best params: {'C': 1000, 'gamma': 1}
Best score: 0.998360055142362
# Create SVM with best parameters
svc = SVC(
kernel='rbf',
C=svc_grid.best_params_['C'],
gamma=svc_grid.best_params_['gamma'],
)
svc.fit(x_train_scaled, y_train)
SVC(C=1000, gamma=1)
# Make predictions on validation set
predictions = svc.predict(x_validation_scaled)
# Print the classification report
print(classification_report(y_val, predictions))
                   precision    recall  f1-score   support

Denial of Service       0.98      1.00      0.99       303
          Malware       1.00      1.00      1.00         4
             None       1.00      1.00      1.00     21978
    Port Scanning       0.99      1.00      1.00      3021

         accuracy                           1.00     25306
        macro avg       0.99      1.00      1.00     25306
     weighted avg       1.00      1.00      1.00     25306
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])
# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
<Axes: >
# Prediction on the test set
predictions = svc.predict(x_test_scaled)
# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class
None                 54007
Port Scanning         7541
Denial of Service      767
Malware                  5
dtype: int64
# Create the two pipeline steps
pca = PCA(whiten=True, random_state=42) # PCA (Principal Component Analysis)
svc = SVC(kernel='rbf', class_weight='balanced') # SVC (Support Vector Classification)
# Create pipeline
model = make_pipeline(pca, svc)
# Generate a valid n_components range (from 5 up to the number of features, in steps of 3)
n_features = x_train_scaled.shape[1]
n_components = np.arange(5, n_features, 3)
param_grid = {
'pca__n_components': n_components,
'svc__C': [50, 100, 500, 1000, 5000, 10000],
'svc__gamma': [0.001, 0.01, 0.1, 1, 10]
}
# Grid search
pipeline_grid = GridSearchCV(
model,
param_grid,
cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
n_jobs=-1 # Use all cores
)
pipeline_grid.fit(x_train_scaled, y_train)
# Print information about the model
print(f"Best params: {pipeline_grid.best_params_}")
print(f"Best score: {pipeline_grid.best_score_}")
/opt/homebrew/Caskroom/miniconda/base/envs/datascience/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py:700: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak. warnings.warn(
Best params: {'pca__n_components': 11, 'svc__C': 5000, 'svc__gamma': 1}
Best score: 0.9983402973726345
# Now, create the desired pipeline
pca = PCA(
n_components=pipeline_grid.best_params_['pca__n_components'],
whiten=True,
random_state=42
)
svc = SVC(kernel='rbf',
class_weight='balanced',
# Use the best parameters found by the grid search
C=pipeline_grid.best_params_['svc__C'],
gamma=pipeline_grid.best_params_['svc__gamma']
)
model = make_pipeline(pca, svc)
model.fit(x_train_scaled, y_train)
Pipeline(steps=[('pca', PCA(n_components=11, random_state=42, whiten=True)), ('svc', SVC(C=5000, class_weight='balanced', gamma=1))])
# Make predictions on validation set
predictions = model.predict(x_validation_scaled)
# Print the classification report
print(classification_report(y_val, predictions))
                   precision    recall  f1-score   support

Denial of Service       0.98      1.00      0.99       303
          Malware       1.00      1.00      1.00         4
             None       1.00      1.00      1.00     21978
    Port Scanning       0.99      1.00      1.00      3021

         accuracy                           1.00     25306
        macro avg       0.99      1.00      1.00     25306
     weighted avg       1.00      1.00      1.00     25306
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])
# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
<Axes: >
# Prediction on the test set
predictions = model.predict(x_test_scaled)
# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class
None                 54005
Port Scanning         7421
Denial of Service      891
Malware                  3
dtype: int64
Bagging Classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then making an ensemble out of it.
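The variance-reduction effect is easiest to see with a decision tree as the base estimator; a short illustrative sketch (a single tree versus a bag of 30 bootstrapped trees on the same scaled training data):

from sklearn.tree import DecisionTreeClassifier

# Compare a single tree against a bag of 30 trees, each fit on a bootstrap sample
single_tree = DecisionTreeClassifier(random_state=42)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=30,
                                 random_state=42, n_jobs=-1)
for name, est in [('single tree', single_tree), ('bagged trees', bagged_trees)]:
    scores = cross_val_score(est, x_train_scaled, y_train, cv=3, n_jobs=-1)
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")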
svc = SVC(kernel='rbf',
class_weight='balanced',
C=svc_grid.best_params_['C'],
gamma=svc_grid.best_params_['gamma']
)
clf = BaggingClassifier(
svc,
n_estimators=30,
n_jobs=-1, # Use all cores
random_state=42
)
clf.fit(x_train_scaled, y_train)
BaggingClassifier(base_estimator=SVC(C=1000, class_weight='balanced', gamma=1), n_estimators=30, n_jobs=-1, random_state=42)
predictions = clf.predict(x_validation_scaled)
# Print the classification report
print(classification_report(y_val, predictions))
                   precision    recall  f1-score   support

Denial of Service       0.96      1.00      0.98       303
          Malware       1.00      1.00      1.00         4
             None       1.00      1.00      1.00     21978
    Port Scanning       0.99      1.00      1.00      3021

         accuracy                           1.00     25306
        macro avg       0.99      1.00      0.99     25306
     weighted avg       1.00      1.00      1.00     25306
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])
# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
<Axes: >
# Prediction on the test set
predictions = clf.predict(x_test_scaled)
# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class
None                 53947
Port Scanning         7558
Denial of Service      810
Malware                  5
dtype: int64
Random Forest is an ensemble method that combines multiple decision trees to create a more accurate model. It is a supervised learning algorithm that can be used for both classification and regression tasks.
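A side benefit of the bootstrap: each tree leaves out roughly a third of the rows, which gives a free validation estimate (the out-of-bag score). A brief sketch (illustrative; the grid search below selects n_estimators by cross-validation instead):

# Out-of-bag estimate: each tree is scored on the rows its bootstrap sample missed
rfc_oob = RandomForestClassifier(n_estimators=100, oob_score=True,
                                 n_jobs=-1, random_state=42)
rfc_oob.fit(x_train_scaled, y_train)
print('OOB accuracy estimate:', rfc_oob.oob_score_)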
# Create random forest classifier
rfc = RandomForestClassifier()
# Create a dictionary of all values we want to test for n_estimators
parameters = {'n_estimators': [1, 2, 4, 10, 15, 20, 30, 40, 50, 100, 200, 500, 1000]}
# Used to find the best n_estimators value to use to train the model
rfc_grid = GridSearchCV(
rfc,
parameters,
scoring='accuracy',
cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
n_jobs=-1 # Use all cores
)
# Fit model to data
rfc_grid.fit(x_train_scaled, y_train)
# Extract best params
print(f"Best params: {rfc_grid.best_params_}")
print(f"Best score: {rfc_grid.best_score_}")
Best params: {'n_estimators': 40}
Best score: 0.9996739870396807
rfc = RandomForestClassifier(n_estimators=rfc_grid.best_params_['n_estimators'])
rfc.fit(x_train_scaled, y_train)
RandomForestClassifier(n_estimators=40)
# Make predictions on validation set
predictions = rfc.predict(x_validation_scaled)
# Print the classification report
print(classification_report(y_val, predictions))
                   precision    recall  f1-score   support

Denial of Service       1.00      1.00      1.00       303
          Malware       1.00      0.75      0.86         4
             None       1.00      1.00      1.00     21978
    Port Scanning       1.00      1.00      1.00      3021

         accuracy                           1.00     25306
        macro avg       1.00      0.94      0.96     25306
     weighted avg       1.00      1.00      1.00     25306
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])
# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
<Axes: >
# Prediction on the test set
predictions = rfc.predict(x_test_scaled)
# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class
None             62161
Port Scanning      155
Malware              4
dtype: int64
This kind of classifier is an ensemble of decision trees. It is similar to a Random Forest classifier, but each tree is trained on the whole dataset instead of a bootstrap sample, and split thresholds are drawn at random instead of being optimized.
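A quick side-by-side sketch of the two ensembles on the same data (same number of trees; only the tree-building strategy differs):

# RandomForest bootstraps rows and searches for the best split;
# ExtraTrees uses the whole dataset and draws split thresholds at random
for name, est in [('RandomForest', RandomForestClassifier(n_estimators=30, random_state=42)),
                  ('ExtraTrees', ExtraTreesClassifier(n_estimators=30, random_state=42))]:
    scores = cross_val_score(est, x_train_scaled, y_train, cv=3, n_jobs=-1)
    print(f"{name}: {scores.mean():.4f}")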
# Create extra trees classifier
etc = ExtraTreesClassifier()
# Create a dictionary of all values we want to test for n_estimators
parameters = {'n_estimators': [1, 2, 4, 10, 15, 20, 30, 40, 50, 100, 200, 500]}
# Used to find the best n_estimators value to use to train the model
etc_grid = GridSearchCV(
etc,
parameters,
scoring='accuracy',
cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
n_jobs=-1 # Use all cores
)
# Fit model to data
etc_grid.fit(x_train_scaled, y_train)
# Extract best params
print(f"Best params: {etc_grid.best_params_}")
print(f"Best score: {etc_grid.best_score_}")
Best params: {'n_estimators': 30}
Best score: 0.9995949538136113
etc = ExtraTreesClassifier(n_estimators=etc_grid.best_params_['n_estimators'])
etc.fit(x_train_scaled, y_train)
ExtraTreesClassifier(n_estimators=30)
# Make predictions on validation set
predictions = etc.predict(x_validation_scaled)
# Print the classification report
print(classification_report(y_val, predictions))
                   precision    recall  f1-score   support

Denial of Service       1.00      1.00      1.00       303
          Malware       1.00      1.00      1.00         4
             None       1.00      1.00      1.00     21978
    Port Scanning       1.00      1.00      1.00      3021

         accuracy                           1.00     25306
        macro avg       1.00      1.00      1.00     25306
     weighted avg       1.00      1.00      1.00     25306
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])
# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
<Axes: >
# Prediction on the test set
predictions = etc.predict(x_test_scaled)
# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class
None                 54738
Port Scanning         7496
Denial of Service       81
Malware                  5
dtype: int64
# Create MLPClassifier
mlp = MLPClassifier(
max_iter=1000,
random_state=42
)
# Grid search for MLPClassifier
parameters = {
'hidden_layer_sizes': [(50,), (100,), (50, 50)],
'activation': ['relu', 'tanh'],
'alpha': [0.0001, 0.001],
'solver': ['adam', 'lbfgs'],
'learning_rate': ['constant', 'invscaling'],
}
mlp_grid = GridSearchCV(
mlp,
parameters,
cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
n_jobs=-1, # Use all cores
)
mlp_grid.fit(x_train_scaled, y_train)
/opt/homebrew/Caskroom/miniconda/base/envs/datascience/lib/python3.10/site-packages/sklearn/neural_network/_multilayer_perceptron.py:559: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. Increase the number of iterations (max_iter) or scale the data as shown in: https://scikit-learn.org/stable/modules/preprocessing.html self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
(the same ConvergenceWarning is emitted repeatedly, once per lbfgs fit in the grid search)
GridSearchCV(cv=2, estimator=MLPClassifier(max_iter=1000, random_state=42), n_jobs=-1, param_grid={'activation': ['relu', 'tanh'], 'alpha': [0.0001, 0.001], 'hidden_layer_sizes': [(50,), (100,), (50, 50)], 'learning_rate': ['constant', 'invscaling'], 'solver': ['adam', 'lbfgs']})
# Extract best params
print(f"Best params: {mlp_grid.best_params_}")
print(f"Best score: {mlp_grid.best_score_}")
Best params: {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50), 'learning_rate': 'constant', 'solver': 'lbfgs'}
Best score: 0.9990022021781368
# Create MLPClassifier with best parameters
mlp = MLPClassifier(
hidden_layer_sizes=mlp_grid.best_params_['hidden_layer_sizes'],
activation=mlp_grid.best_params_['activation'],
alpha=mlp_grid.best_params_['alpha'],
solver=mlp_grid.best_params_['solver'],
learning_rate=mlp_grid.best_params_['learning_rate'],
max_iter=1000,
random_state=42
)
mlp.fit(x_train_scaled, y_train)
/opt/homebrew/Caskroom/miniconda/base/envs/datascience/lib/python3.10/site-packages/sklearn/neural_network/_multilayer_perceptron.py:559: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. Increase the number of iterations (max_iter) or scale the data as shown in: https://scikit-learn.org/stable/modules/preprocessing.html self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
MLPClassifier(activation='tanh', hidden_layer_sizes=(50, 50), max_iter=1000, random_state=42, solver='lbfgs')
# Make predictions on validation set
predictions = mlp.predict(x_validation_scaled)
# Print the classification report
print(classification_report(y_val, predictions))
                   precision    recall  f1-score   support

Denial of Service       0.98      1.00      0.99       303
          Malware       1.00      1.00      1.00         4
             None       1.00      1.00      1.00     21978
    Port Scanning       1.00      1.00      1.00      3021

         accuracy                           1.00     25306
        macro avg       0.99      1.00      1.00     25306
     weighted avg       1.00      1.00      1.00     25306
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])
# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
<Axes: >
# Prediction on the test set
predictions = mlp.predict(x_test_scaled)
# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class
None                 54397
Port Scanning         7317
Denial of Service      601
Malware                  5
dtype: int64
I have observed outstanding performance across all tested classification models: nearly every model reached an almost perfect f1-score (close to 1.0).
On reflection, this stems from the substantial resemblance between the validation set (created by splitting the training set, given the absence of the target variable in the provided test set) and the training set used to fit the models.
Presumably, the training set was constructed using data from a specific network, resulting in a significant overlap of features between both sets. Consequently, the models achieved highly accurate classifications for almost all flows within the validation set. However, their performance may not be equally robust when applied to a distinct test set derived from a different network.
Nevertheless, I have opted to present the classification outcomes of each model on the provided test set, despite the unavailability of performance metrics for evaluation.
Acquiring a test set originating from a diverse network would have been advantageous, enabling a more precise assessment of the models' performance. Regrettably, obtaining such a test set proved unfeasible for this particular project.
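As a possible follow-up, the similarity between training and validation flows could be probed without a second network by splitting on hosts instead of rows, so that no IP address appears on both sides of the split. A hedged sketch using scikit-learn's GroupShuffleSplit (note: it must run on a fresh copy of the raw training data, since the address columns were dropped earlier):

from sklearn.model_selection import GroupShuffleSplit

# Hypothetical leakage check: hold out entire source hosts, not random rows
raw = pd.read_csv('data/train_net.csv')
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=9)
train_idx, val_idx = next(gss.split(raw, raw['ALERT'], groups=raw['IPV4_SRC_ADDR']))
print('Held-out source hosts:', raw['IPV4_SRC_ADDR'].iloc[val_idx].nunique())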