#!/usr/bin/env python
# coding: utf-8

# # Anomaly Detection - Network Activity

# After cybersecurity attackers get initial access to a victim's network, their next techniques may involve executing scripts from one network machine to another. These follow-up activities could generate network activity that is *anomalous*. It might be anomalous in two different ways: maybe compared to what *normally happens* on the network, or maybe relative to what everything else on the network is *currently doing*. Those two kinds of anomalies have an important distinction -- the first one assumes some ground-truth knowledge of what "normal" *looks* like -- this is "novelty" detection. The second one doesn't need a ground truth -- it's just "outlier" detection.
#
# My favorite example of network anomaly detection gone wrong is the university network which detected an enormous spike in traffic on a high port number between midnight and about 3am every night. Were they under attack? Follow-up investigation determined that the involved IP addresses were all for on-campus freshmen housing, and that the port was a common Minecraft server config. Oops, false alarm! The solution was to create a *new* "normalcy" model specific to on-campus housing in the wee nighttime hours, one which wouldn't be alarmed by late-night Minecraft gaming. I'm probably not remembering that story correctly, but the point is that "normalcy" models can quickly spiral into requiring a huge number of hyper-specific models or parameters to reduce false positives.
#
# This notebook demonstrates a simple way to do outlier detection on netflow records, using one of the CTU-13 datasets. It clusters netflows on standard-scaled `TotBytes` using DBSCAN with an `eps` of 2.5; any points labeled as cluster `-1` don't group with any other points and are treated as outliers, since 2.5 roughly corresponds to a common z-score cutoff for outliers under a gaussian (normal) distribution.
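# To make the novelty-vs-outlier distinction concrete, here is a minimal sketch -- using
# synthetic byte counts, not the CTU-13 data -- contrasting the two modes with scikit-learn's
# `LocalOutlierFactor`, which supports both. The byte-count values are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Hypothetical byte counts: a known-good baseline, and a current batch with one extreme flow
baseline = rng.normal(loc=500, scale=50, size=(200, 1))
current = np.vstack([rng.normal(loc=500, scale=50, size=(50, 1)), [[50_000.0]]])

# Novelty detection: fit on the baseline (the ground truth for "normal"),
# then score the new batch against it
novelty_detector = LocalOutlierFactor(novelty=True).fit(baseline)
novelty_labels = novelty_detector.predict(current)  # -1 = novel relative to the baseline

# Outlier detection: no baseline needed -- points in the batch are scored against each other
outlier_labels = LocalOutlierFactor().fit_predict(current)  # -1 = outlier within this batch

print('flagged as novel:  ', int((novelty_labels == -1).sum()))
print('flagged as outlier:', int((outlier_labels == -1).sum()))
```

# Both modes flag the extreme flow here, but only the novelty detector could catch a batch
# that is internally consistent yet unlike the historical baseline.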
#
# I'm using [deargle/my-datascience-notebook](https://hub.docker.com/r/deargle/my-datascience-notebook)

# In[1]:

import pandas as pd
import sklearn
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

print(f'pandas version: {pd.__version__}')
print(f'sklearn version: {sklearn.__version__}')

# In[2]:

# First, download
# https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-42/detailed-bidirectional-flow-labels/capture20110810.binetflow
# into the current directory
from pathlib import Path

path_to_file = 'capture20110810.binetflow'
path = Path(path_to_file)
if not path.is_file():
    get_ipython().system('wget https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-42/detailed-bidirectional-flow-labels/capture20110810.binetflow')

df = pd.read_csv(path_to_file)

# In[3]:

df.head()

# In[4]:

# Set the StartTime as the row index
df = df.set_index(pd.to_datetime(df['StartTime']))

# In[5]:

# Streaming network analytics would bin incoming netflows by some time window. Below, I set the window
# to 1 minute, but this is arbitrary, chosen to make iterative testing faster than, say, a 10-minute window
grouped = df.groupby(pd.Grouper(freq='1min'))

# In[6]:

# For demonstration purposes, I extract just the first window
grouped = list(grouped)
first_window = grouped[0]
print(f'Window: {first_window[0]}')
data = first_window[1]

# In[7]:

# I'll cluster based on just netflow TotBytes, but this is also arbitrary. Other numerical features could be included.
X = data[['TotBytes']]

# In[8]:

# Standard-scale the data: subtract the mean and divide by the standard deviation, so each value becomes a z-score
X = StandardScaler().fit_transform(X)

# In[9]:

# Since we standard-scaled, we can set `DBSCAN`'s `eps` parameter to 2.5, which roughly
# corresponds to a common z-score threshold for what is considered an "outlier" under a
# gaussian (normal) distribution
clf = DBSCAN(eps=2.5)
y_preds = clf.fit_predict(X)

print(f'n netflows: {len(X)}')
print(f'n clusters: {len(set(y for y in clf.labels_ if y != -1))}')
print(f'n outliers: {len([y for y in y_preds if y == -1])}')

# In[11]:

# Show the anomalous netflows:
data[y_preds == -1]
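# A natural next step is to run the same scale-and-cluster procedure over every window instead
# of only the first. Here's a sketch of that loop. So that it stands alone, it uses a small
# synthetic DataFrame (made-up `TotBytes` values with a few injected extreme flows) in place of
# the binetflow data, but the loop body would work the same against `df` from above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the netflow DataFrame: 5 minutes of one-per-second flows
rng = np.random.default_rng(42)
times = pd.date_range('2011-08-10', periods=300, freq='1s')
tot_bytes = rng.normal(loc=1000, scale=200, size=300)
tot_bytes[::97] = 250_000  # inject a few extreme flows
df = pd.DataFrame({'TotBytes': tot_bytes}, index=times)

# Re-fit the scaler and DBSCAN inside every window, mirroring the single-window cells above
anomalies = []
for window_start, window in df.groupby(pd.Grouper(freq='1min')):
    if len(window) < 2:
        continue  # not enough flows in this window to cluster
    X = StandardScaler().fit_transform(window[['TotBytes']])
    labels = DBSCAN(eps=2.5).fit_predict(X)
    anomalies.append(window[labels == -1])

anomalies = pd.concat(anomalies)
print(f'n anomalous netflows: {len(anomalies)}')
```

# Note that each window gets its own scaler fit, so "outlier" always means "relative to this
# window" -- a per-window flavor of the outlier (not novelty) detection discussed in the intro.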