Anomaly Detection - Network Activity

After attackers gain initial access to a victim's network, their follow-up techniques often involve moving from one network machine to another, for example by executing scripts remotely. These follow-up activities can generate anomalous network traffic, and "anomalous" can mean two different things: unusual compared to what normally happens on the network, or unusual relative to what everything else on the network is doing right now. The distinction matters. The first assumes some ground-truth knowledge of what "normal" looks like -- that's "novelty" detection. The second needs no ground truth -- that's plain "outlier" detection.
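To make that distinction concrete, here's a minimal sketch (not part of this notebook's pipeline) using scikit-learn's `LocalOutlierFactor`, whose `novelty` flag toggles between exactly these two modes. The traffic numbers are made up.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
baseline = rng.normal(loc=100, scale=10, size=(500, 1))       # historical "normal" flow sizes
window = np.vstack([rng.normal(100, 10, (50, 1)), [[400.0]]])  # live window with one spike

# Novelty detection: fit on a trusted baseline, then score the new window against it
novelty_preds = LocalOutlierFactor(novelty=True).fit(baseline).predict(window)

# Outlier detection: no ground truth -- score the window's points against each other
outlier_preds = LocalOutlierFactor(novelty=False).fit_predict(window)

print(novelty_preds[-1], outlier_preds[-1])  # the spike is flagged -1 either way
```

Either mode flags the 400-byte-scale spike, but only the novelty mode would also flag a *shift* in the whole window away from the baseline.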

My favorite example of network anomaly detection gone wrong is the university network that detected an enormous spike in traffic on a high port number between midnight and about 3am every night. Were they under attack? Follow-up investigation determined that the IP addresses involved all belonged to on-campus freshman housing, and that the port was a common Minecraft server configuration. Oops, false alarm! The fix was to create a new "normalcy" model specific to on-campus housing in the wee nighttime hours, one that wouldn't be alarmed by late-night Minecraft gaming. I'm probably not remembering that story exactly right, but the point stands: "normalcy" models can quickly spiral into a huge number of hyper-specific models or parameters just to keep false positives down.

This notebook demonstrates a simple way to do outlier detection on netflow records, using one of the CTU-13 datasets. It clusters flows on standard-scaled TotBytes using DBSCAN with an eps of 2.5. Any point labeled -1 failed to group with any other point; because the data is standard-scaled, eps=2.5 roughly mirrors the common "2.5 standard deviations from the mean" outlier threshold for gaussian (normal) distributions.
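Before the real data, here's a tiny toy illustration of DBSCAN's -1 convention, with made-up byte counts:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Five ordinary flows and one enormous one, measured in total bytes
tot_bytes = np.array([[120.0], [135.0], [180.0], [150.0], [140.0], [98_000_000.0]])

X = StandardScaler().fit_transform(tot_bytes)
labels = DBSCAN(eps=2.5).fit_predict(X)
print(labels)  # the five small flows share cluster 0; the huge flow is labeled -1
```

After scaling, the five small flows sit nearly on top of each other while the huge flow lands about 2.7 scaled units away, past the eps=2.5 radius, so DBSCAN leaves it as noise.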

I'm using the deargle/my-datascience-notebook Docker image.

In [1]:
import pandas as pd
import sklearn
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

print(f'pandas version: {pd.__version__}')
print(f'sklearn version: {sklearn.__version__}')
pandas version: 1.4.0
sklearn version: 1.0.2
In [2]:
# first, download https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-42/detailed-bidirectional-flow-labels/capture20110810.binetflow into the current directory
from pathlib import Path

path_to_file = 'capture20110810.binetflow'
path = Path(path_to_file)

if not path.is_file():
    !wget https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-42/detailed-bidirectional-flow-labels/capture20110810.binetflow

df = pd.read_csv(path_to_file)
In [3]:
df.head()
Out[3]:
StartTime Dur Proto SrcAddr Sport Dir DstAddr Dport State sTos dTos TotPkts TotBytes SrcBytes Label
0 2011/08/10 09:46:53.047277 3550.182373 udp 212.50.71.179 39678 <-> 147.32.84.229 13363 CON 0.0 0.0 12 875 413 flow=Background-UDP-Established
1 2011/08/10 09:46:53.048843 0.000883 udp 84.13.246.132 28431 <-> 147.32.84.229 13363 CON 0.0 0.0 2 135 75 flow=Background-UDP-Established
2 2011/08/10 09:46:53.049895 0.000326 tcp 217.163.21.35 80 <?> 147.32.86.194 2063 FA_A 0.0 0.0 2 120 60 flow=Background
3 2011/08/10 09:46:53.053771 0.056966 tcp 83.3.77.74 32882 <?> 147.32.85.5 21857 FA_FA 0.0 0.0 3 180 120 flow=Background
4 2011/08/10 09:46:53.053937 3427.768066 udp 74.89.223.204 21278 <-> 147.32.84.229 13363 CON 0.0 0.0 42 2856 1596 flow=Background-UDP-Established
In [4]:
# Set the StartTime as the row index
df = df.set_index(pd.to_datetime(df['StartTime']))
In [5]:
# Streaming network analytics would bin incoming netflows by some time window. Below, I use a
# 1-minute window; the choice is arbitrary, and a short window keeps iterative testing faster
# than, say, a 10-minute window would.
grouped = df.groupby(pd.Grouper(freq='1min'))
In [6]:
# For demonstration purposes, I extract just the first window
grouped = list(grouped)
first_window = grouped[0]
print(f'Window: {first_window[0]}')
data = first_window[1]
Window: 2011-08-10 09:46:00
In [7]:
# I'll cluster just based on netflow TotBytes, but this is also arbitrary. Other numerical features could be included.
X = data[['TotBytes']]
In [8]:
# Standard-scale the data (zero mean, unit variance).
X = StandardScaler().fit_transform(X)
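For reference, StandardScaler's transform is just the per-column z-score, z = (x - mean) / std. A quick sanity check with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[2.0], [4.0], [6.0]])
scaled = StandardScaler().fit_transform(x)
manual = (x - x.mean(axis=0)) / x.std(axis=0)  # population std, same as StandardScaler
print(scaled.ravel())  # [-1.22474487  0.          1.22474487]
```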
In [9]:
# Since the data is standard-scaled, setting DBSCAN's `eps` parameter to 2.5 roughly corresponds
# to the common "2.5 standard deviations from the mean" threshold for calling a point an
# "outlier" under a gaussian (normal) distribution.

clf = DBSCAN(eps=2.5)
y_preds = clf.fit_predict(X)
print(f'n netflows: {len(X)}')
print(f'n clusters: {len(set(clf.labels_) - {-1})}')
print(f'n outliers: {(y_preds == -1).sum()}')
n netflows: 1654
n clusters: 1
n outliers: 6
In [11]:
# Show the anomalous netflows:
data[y_preds == -1]
Out[11]:
StartTime Dur Proto SrcAddr Sport Dir DstAddr Dport State sTos dTos TotPkts TotBytes SrcBytes Label
StartTime
2011-08-10 09:46:53.078297 2011/08/10 09:46:53.078297 3599.972412 tcp 147.32.80.13 80 <?> 147.32.84.162 51769 PA_A 0.0 0.0 72157 61638544 60214264 flow=From-Background-CVUT-Proxy
2011-08-10 09:46:53.106431 2011/08/10 09:46:53.106431 507.347626 tcp 147.32.80.13 80 <?> 147.32.85.112 10885 FPA_FA 0.0 0.0 162760 137136528 132816366 flow=From-Background-CVUT-Proxy
2011-08-10 09:46:53.346951 2011/08/10 09:46:53.346951 3598.887695 tcp 195.250.146.99 554 <?> 147.32.86.99 16786 PA_PA 0.0 0.0 51576 60964440 60363789 flow=Background
2011-08-10 09:46:53.709666 2011/08/10 09:46:53.709666 3599.047607 tcp 195.250.146.6 554 <?> 147.32.84.59 49375 PA_PA 0.0 0.0 55068 61164753 60334389 flow=Background-Established-cmpgw-CVUT
2011-08-10 09:46:58.680289 2011/08/10 09:46:58.680289 3591.222656 tcp 147.32.85.103 49317 <?> 88.159.8.10 22 PA_PA 0.0 0.0 105555 70963308 68636560 flow=Background
2011-08-10 09:46:59.367587 2011/08/10 09:46:59.367587 3350.645508 tcp 147.32.87.5 524 ?> 147.32.85.2 49350 PA_ 0.0 NaN 214827 248405120 248405120 flow=Background