Anomaly Detection - Network Activity

After attackers gain initial access to a victim's network, their follow-up techniques often involve moving from one network machine to another, for example by executing scripts remotely. These follow-up activities can generate anomalous network traffic, and "anomalous" can mean two different things: unusual compared to what normally happens on the network, or unusual relative to what everything else on the network is doing right now. The distinction matters. The first assumes some ground-truth knowledge of what "normal" looks like -- that's "novelty" detection. The second needs no ground truth -- that's plain "outlier" detection.
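To make that distinction concrete, here's a minimal sketch (not part of this notebook's pipeline) using scikit-learn's `LocalOutlierFactor`, whose `novelty` flag toggles between exactly these two modes. The traffic numbers are made up.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
baseline = rng.normal(loc=100, scale=10, size=(500, 1))       # historical "normal" flow sizes
window = np.vstack([rng.normal(100, 10, (50, 1)), [[400.0]]])  # live window with one spike

# Novelty detection: fit on a trusted baseline, then score the new window against it
novelty_preds = LocalOutlierFactor(novelty=True).fit(baseline).predict(window)

# Outlier detection: no ground truth -- score the window's points against each other
outlier_preds = LocalOutlierFactor(novelty=False).fit_predict(window)

print(novelty_preds[-1], outlier_preds[-1])  # the spike is flagged -1 either way
```

Either mode flags the 400-byte-scale spike, but only the novelty mode would also flag a *shift* in the whole window away from the baseline.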

My favorite example of network anomaly detection gone wrong is the university network that detected an enormous spike in traffic on a high port number between midnight and about 3am every night. Were they under attack? Follow-up investigation determined that the IP addresses involved all belonged to on-campus freshman housing, and that the port was a common Minecraft server configuration. Oops, false alarm! The fix was to create a new "normalcy" model specific to on-campus housing in the wee nighttime hours, one that wouldn't be alarmed by late-night Minecraft gaming. I'm probably not remembering that story exactly right, but the point stands: "normalcy" models can quickly spiral into a huge number of hyper-specific models or parameters just to keep false positives down.

This notebook demonstrates a simple way to do outlier detection on netflow records, using one of the CTU-13 datasets. It clusters flows on standard-scaled TotBytes using DBSCAN with an eps of 2.5. Any point labeled -1 failed to group with any other point; because the data is standard-scaled, eps=2.5 roughly mirrors the common "2.5 standard deviations from the mean" outlier threshold for gaussian (normal) distributions.
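Before the real data, here's a tiny toy illustration of DBSCAN's -1 convention, with made-up byte counts:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Five ordinary flows and one enormous one, measured in total bytes
tot_bytes = np.array([[120.0], [135.0], [180.0], [150.0], [140.0], [98_000_000.0]])

X = StandardScaler().fit_transform(tot_bytes)
labels = DBSCAN(eps=2.5).fit_predict(X)
print(labels)  # the five small flows share cluster 0; the huge flow is labeled -1
```

After scaling, the five small flows sit nearly on top of each other while the huge flow lands about 2.7 scaled units away, past the eps=2.5 radius, so DBSCAN leaves it as noise.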

I'm using the deargle/my-datascience-notebook Docker image.

In [1]:
import pandas as pd
import sklearn
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

print(f'pandas version: {pd.__version__}')
print(f'sklearn version: {sklearn.__version__}')
pandas version: 1.4.0
sklearn version: 1.0.2
In [2]:
# first, download https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-42/detailed-bidirectional-flow-labels/capture20110810.binetflow into the current directory
from pathlib import Path

path_to_file = 'capture20110810.binetflow'
path = Path(path_to_file)

if not path.is_file():
    !wget https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-42/detailed-bidirectional-flow-labels/capture20110810.binetflow

df = pd.read_csv(path_to_file)
In [3]:
df.head()
Out[3]:
StartTime Dur Proto SrcAddr Sport Dir DstAddr Dport State sTos dTos TotPkts TotBytes SrcBytes Label
0 2011/08/10 09:46:53.047277 3550.182373 udp 212.50.71.179 39678 <-> 147.32.84.229 13363 CON 0.0 0.0 12 875 413 flow=Background-UDP-Established
1 2011/08/10 09:46:53.048843 0.000883 udp 84.13.246.132 28431 <-> 147.32.84.229 13363 CON 0.0 0.0 2 135 75 flow=Background-UDP-Established
2 2011/08/10 09:46:53.049895 0.000326 tcp 217.163.21.35 80 <?> 147.32.86.194 2063 FA_A 0.0 0.0 2 120 60 flow=Background
3 2011/08/10 09:46:53.053771 0.056966 tcp 83.3.77.74 32882 <?> 147.32.85.5 21857 FA_FA 0.0 0.0 3 180 120 flow=Background
4 2011/08/10 09:46:53.053937 3427.768066 udp 74.89.223.204 21278 <-> 147.32.84.229 13363 CON 0.0 0.0 42 2856 1596 flow=Background-UDP-Established
In [4]:
# Set the StartTime as the row index
df = df.set_index(pd.to_datetime(df['StartTime']))
In [5]:
# Streaming network analytics would bin incoming netflows by some time window. Below, I use a
# 1-minute window; the choice is arbitrary, and a short window keeps iterative testing faster
# than, say, a 10-minute window would.
grouped = df.groupby(pd.Grouper(freq='1min'))
In [6]:
# For demonstration purposes, I extract just the first window
grouped = list(grouped)
first_window = grouped[0]
print(f'Window: {first_window[0]}')
data = first_window[1]
Window: 2011-08-10 09:46:00
In [7]:
# I'll cluster just based on netflow TotBytes, but this is also arbitrary. Other numerical features could be included.
X = data[['TotBytes']]
In [8]:
# Standard-scale the data (zero mean, unit variance).
X = StandardScaler().fit_transform(X)
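For reference, StandardScaler's transform is just the per-column z-score, z = (x - mean) / std. A quick sanity check with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[2.0], [4.0], [6.0]])
scaled = StandardScaler().fit_transform(x)
manual = (x - x.mean(axis=0)) / x.std(axis=0)  # population std, same as StandardScaler
print(scaled.ravel())  # [-1.22474487  0.          1.22474487]
```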
In [9]:
# Since the data is standard-scaled, setting DBSCAN's `eps` parameter to 2.5 roughly corresponds
# to the common "2.5 standard deviations from the mean" threshold for calling a point an
# "outlier" under a gaussian (normal) distribution.

clf = DBSCAN(eps=2.5)
y_preds = clf.fit_predict(X)
print(f'n netflows: {len(X)}')
print(f'n clusters: {len(set(clf.labels_) - {-1})}')
print(f'n outliers: {(y_preds == -1).sum()}')
n netflows: 1654
n clusters: 1
n outliers: 6
In [11]:
# Show the anomalous netflows:
data[y_preds == -1]
Out[11]:
StartTime Dur Proto SrcAddr Sport Dir DstAddr Dport State sTos dTos TotPkts TotBytes SrcBytes Label
StartTime
2011-08-10 09:46:53.078297 2011/08/10 09:46:53.078297 3599.972412 tcp 147.32.80.13 80 <?> 147.32.84.162 51769 PA_A 0.0 0.0 72157 61638544 60214264 flow=From-Background-CVUT-Proxy
2011-08-10 09:46:53.106431 2011/08/10 09:46:53.106431 507.347626 tcp 147.32.80.13 80 <?> 147.32.85.112 10885 FPA_FA 0.0 0.0 162760 137136528 132816366 flow=From-Background-CVUT-Proxy
2011-08-10 09:46:53.346951 2011/08/10 09:46:53.346951 3598.887695 tcp 195.250.146.99 554 <?> 147.32.86.99 16786 PA_PA 0.0 0.0 51576 60964440 60363789 flow=Background
2011-08-10 09:46:53.709666 2011/08/10 09:46:53.709666 3599.047607 tcp 195.250.146.6 554 <?> 147.32.84.59 49375 PA_PA 0.0 0.0 55068 61164753 60334389 flow=Background-Established-cmpgw-CVUT
2011-08-10 09:46:58.680289 2011/08/10 09:46:58.680289 3591.222656 tcp 147.32.85.103 49317 <?> 88.159.8.10 22 PA_PA 0.0 0.0 105555 70963308 68636560 flow=Background
2011-08-10 09:46:59.367587 2011/08/10 09:46:59.367587 3350.645508 tcp 147.32.87.5 524 ?> 147.32.85.2 49350 PA_ 0.0 NaN 214827 248405120 248405120 flow=Background