After attackers gain initial access to a victim's network, their next techniques may involve executing scripts from one network machine on another. These follow-up activities could generate anomalous network activity. It might be anomalous in two different ways: compared to what normally happens on the network, or relative to what everything else on the network is currently doing. The distinction between the two matters. The first assumes some ground-truth knowledge of what "normal" looks like; this is "novelty" detection. The second doesn't need a ground truth; it's just "outlier" detection.
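To make the novelty/outlier distinction concrete, here is a minimal sketch using scikit-learn's `LocalOutlierFactor`, which supports both modes through its `novelty` flag. This is not the method used later in this notebook, and the byte counts are synthetic values made up for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical byte counts: a "known-normal" baseline and a new batch to score.
baseline = rng.normal(loc=1_000, scale=100, size=(500, 1))
new_batch = np.vstack([rng.normal(loc=1_000, scale=100, size=(50, 1)),
                       [[50_000.0]]])  # one obviously unusual flow

# Novelty detection: fit on the trusted baseline, then score unseen data.
novelty_model = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(baseline)
novelty_labels = novelty_model.predict(new_batch)  # +1 = inlier, -1 = novel

# Outlier detection: no baseline; the batch is scored against itself.
outlier_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(new_batch)

print(novelty_labels[-1], outlier_labels[-1])
```

In both modes the injected flow should be labeled -1, but only the novelty model needed a separate, trusted picture of "normal" to get there.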
My favorite example of network anomaly detection gone wrong is the university network which detected an enormous spike in traffic on a high port number between the hours of midnight and about 3am every night. Were they under attack? Follow-up investigation determined that the involved IP addresses all belonged to on-campus freshman housing, and that the port was a common Minecraft server config. Oops, false alarm! The solution was to create a new "normalcy" model specific to on-campus housing in the wee nighttime hours, which wouldn't be alarmed by late-night Minecraft gaming. I'm probably not remembering that story correctly, but the point is that a "normalcy" approach can quickly spiral into requiring a huge number of hyper-specific models or parameters to reduce false positives.
This notebook demonstrates a simple way to do outlier detection on netflow records, using one of the CTU-13 datasets. It clusters on standard-scaled TotBytes using DBSCAN with an eps of 2.5. Any points labeled as cluster -1 don't group with any other points. The choice of 2.5 assumes roughly gaussian (normal) distributions.
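As a toy illustration of what a -1 label means, the sketch below runs the same scale-then-DBSCAN steps on a handful of made-up byte counts (not taken from CTU-13): thirty small flows plus one very large transfer.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Made-up byte counts: 30 small flows plus one very large transfer.
tot_bytes = np.array([100.0 + 10 * i for i in range(30)] + [5_000_000.0]).reshape(-1, 1)

# Standard-scale so that distances are in units of standard deviations.
scaled = StandardScaler().fit_transform(tot_bytes)

# min_samples is left at scikit-learn's default of 5.
labels = DBSCAN(eps=2.5).fit_predict(scaled)
print(labels)
```

The thirty small flows end up in cluster 0, while the large transfer sits more than 2.5 scaled units from every other point and is labeled -1.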
I'm using deargle/my-datascience-notebook
import pandas as pd
import sklearn
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
print(f'pandas version: {pd.__version__}')
print(f'sklearn version: {sklearn.__version__}')
pandas version: 1.4.0
sklearn version: 1.0.2
# first, download https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-42/detailed-bidirectional-flow-labels/capture20110810.binetflow into the current directory
from pathlib import Path
path_to_file = 'capture20110810.binetflow'
path = Path(path_to_file)
if not path.is_file():
    !wget https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-42/detailed-bidirectional-flow-labels/capture20110810.binetflow
df = pd.read_csv(path_to_file)
df.head()
 | StartTime | Dur | Proto | SrcAddr | Sport | Dir | DstAddr | Dport | State | sTos | dTos | TotPkts | TotBytes | SrcBytes | Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011/08/10 09:46:53.047277 | 3550.182373 | udp | 212.50.71.179 | 39678 | <-> | 147.32.84.229 | 13363 | CON | 0.0 | 0.0 | 12 | 875 | 413 | flow=Background-UDP-Established |
1 | 2011/08/10 09:46:53.048843 | 0.000883 | udp | 84.13.246.132 | 28431 | <-> | 147.32.84.229 | 13363 | CON | 0.0 | 0.0 | 2 | 135 | 75 | flow=Background-UDP-Established |
2 | 2011/08/10 09:46:53.049895 | 0.000326 | tcp | 217.163.21.35 | 80 | <?> | 147.32.86.194 | 2063 | FA_A | 0.0 | 0.0 | 2 | 120 | 60 | flow=Background |
3 | 2011/08/10 09:46:53.053771 | 0.056966 | tcp | 83.3.77.74 | 32882 | <?> | 147.32.85.5 | 21857 | FA_FA | 0.0 | 0.0 | 3 | 180 | 120 | flow=Background |
4 | 2011/08/10 09:46:53.053937 | 3427.768066 | udp | 74.89.223.204 | 21278 | <-> | 147.32.84.229 | 13363 | CON | 0.0 | 0.0 | 42 | 2856 | 1596 | flow=Background-UDP-Established |
# Set the StartTime as the row index
df = df.set_index(pd.to_datetime(df['StartTime']))
# Streaming network analytics would bin incoming netflows by some time window. Below, I set the window
# to 1 minute, but this is arbitrary to make iterative testing faster than, say, a 10-minute window
grouped = df.groupby(pd.Grouper(freq='1min'))
# For demonstration purposes, I extract just the first window
grouped = list(grouped)
first_window = grouped[0]
print(f'Window: {first_window[0]}')
data = first_window[1]
Window: 2011-08-10 09:46:00
# I'll cluster just based on netflow TotBytes, but this is also arbitrary. Other numerical features could be included.
X = data[['TotBytes']]
# Standardize the data: subtract the mean and divide by the standard deviation.
X = StandardScaler().fit_transform(X)
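# Since we standard-scaled, distances are in units of standard deviations. I set `DBSCAN`'s `eps`
# parameter to 2.5, which roughly corresponds to a common threshold for what is considered an
# "outlier" under a gaussian (normal) distribution. Note that `eps` is a distance between
# neighboring points, not a distance from the mean, so the correspondence is only rough.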
clf = DBSCAN(eps=2.5)
y_preds = clf.fit_predict(X)
print(f'n netflows: {len(X)}')
print(f'n clusters: {len(set(clf.labels_) - {-1})}')
print(f'n outliers: {(y_preds == -1).sum()}')
n netflows: 1654
n clusters: 1
n outliers: 6
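The 2.5 threshold mentioned in the comment above can be put in numbers. The sketch below computes how much of a standard normal distribution lies beyond 2.5 standard deviations, then compares a plain z-score rule against DBSCAN on synthetic Gaussian data (an assumption for illustration, not the netflow data). It is meant only to show that the two rules are related but not identical.

```python
import math

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Two-sided tail probability of a standard normal beyond |z| = 2.5 (about 0.0124).
tail = math.erfc(2.5 / math.sqrt(2))
print(f'P(|z| > 2.5) = {tail:.4f}')

rng = np.random.default_rng(0)
sample = rng.normal(size=(1000, 1))  # synthetic, genuinely Gaussian data
z = StandardScaler().fit_transform(sample)

# Rule 1: flag anything more than 2.5 standard deviations from the mean.
n_z_flagged = int((np.abs(z) > 2.5).sum())

# Rule 2: DBSCAN noise, i.e. points with no dense neighborhood within eps.
n_dbscan_noise = int((DBSCAN(eps=2.5).fit_predict(z) == -1).sum())

print(f'|z| > 2.5: {n_z_flagged}, DBSCAN noise: {n_dbscan_noise}')
```

On densely sampled Gaussian data the z-score rule flags roughly 1% of points, while DBSCAN flags only points separated from the rest by a gap wider than `eps`. That is consistent with the result above, where the handful of flagged flows are far from everything else.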
# Show the anomalous netflows:
data[y_preds == -1]
StartTime | StartTime | Dur | Proto | SrcAddr | Sport | Dir | DstAddr | Dport | State | sTos | dTos | TotPkts | TotBytes | SrcBytes | Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2011-08-10 09:46:53.078297 | 2011/08/10 09:46:53.078297 | 3599.972412 | tcp | 147.32.80.13 | 80 | <?> | 147.32.84.162 | 51769 | PA_A | 0.0 | 0.0 | 72157 | 61638544 | 60214264 | flow=From-Background-CVUT-Proxy |
2011-08-10 09:46:53.106431 | 2011/08/10 09:46:53.106431 | 507.347626 | tcp | 147.32.80.13 | 80 | <?> | 147.32.85.112 | 10885 | FPA_FA | 0.0 | 0.0 | 162760 | 137136528 | 132816366 | flow=From-Background-CVUT-Proxy |
2011-08-10 09:46:53.346951 | 2011/08/10 09:46:53.346951 | 3598.887695 | tcp | 195.250.146.99 | 554 | <?> | 147.32.86.99 | 16786 | PA_PA | 0.0 | 0.0 | 51576 | 60964440 | 60363789 | flow=Background |
2011-08-10 09:46:53.709666 | 2011/08/10 09:46:53.709666 | 3599.047607 | tcp | 195.250.146.6 | 554 | <?> | 147.32.84.59 | 49375 | PA_PA | 0.0 | 0.0 | 55068 | 61164753 | 60334389 | flow=Background-Established-cmpgw-CVUT |
2011-08-10 09:46:58.680289 | 2011/08/10 09:46:58.680289 | 3591.222656 | tcp | 147.32.85.103 | 49317 | <?> | 88.159.8.10 | 22 | PA_PA | 0.0 | 0.0 | 105555 | 70963308 | 68636560 | flow=Background |
2011-08-10 09:46:59.367587 | 2011/08/10 09:46:59.367587 | 3350.645508 | tcp | 147.32.87.5 | 524 | ?> | 147.32.85.2 | 49350 | PA_ | 0.0 | NaN | 214827 | 248405120 | 248405120 | flow=Background |
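
The steps above handle only the first window and a single feature. A rough sketch of how they might extend to every window and to more than one numerical feature follows. The `flag_outliers` helper, the synthetic `flows` frame, and the minimum window size of 5 are all assumptions made for illustration; with the real data you would pass `df` and its columns instead.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


def flag_outliers(window, features=('TotBytes',), eps=2.5):
    """Return the rows of one time window that DBSCAN labels as noise (-1)."""
    X = StandardScaler().fit_transform(window[list(features)])
    labels = DBSCAN(eps=eps).fit_predict(X)
    return window[labels == -1]


# Synthetic stand-in for the netflow frame: three minutes of small flows,
# one per second, with one very large flow injected into the second minute.
rng = np.random.default_rng(0)
index = pd.date_range('2011-08-10 09:46:00', periods=180, freq=pd.Timedelta(seconds=1))
flows = pd.DataFrame(
    {'TotBytes': rng.integers(60, 1500, size=180).astype(float),
     'TotPkts': rng.integers(1, 20, size=180).astype(float)},
    index=index,
)
flows.iloc[90, flows.columns.get_loc('TotBytes')] = 5e7

results = {}
for window_start, window in flows.groupby(pd.Grouper(freq='1min')):
    if len(window) < 5:  # too few flows for DBSCAN's default min_samples
        continue
    results[window_start] = flag_outliers(window, features=('TotBytes', 'TotPkts'))

for window_start, outliers in results.items():
    print(f'{window_start}: {len(outliers)} outlier(s)')
```

Each window is scaled on its own, so "outlier" always means "unusual relative to the other flows in the same minute", which matches the outlier-detection framing rather than the novelty one.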