Zeek Network Data to Scikit-Learn

In this notebook we're going to be using the zat Python module and explore the functionality that enables us to easily go from Zeek data to Pandas to Scikit-Learn. Once we get our data in a form that is usable by Scikit-Learn we have a wide array of data analysis and machine learning algorithms at our disposal.




Code Availability

All this code in this notebook is from the examples/bro_to_scikit.py file in the zat repository (https://github.com/SuperCowPowers/zat). If you have any questions/problems please don't hesitate to open up an Issue in GitHub or even better submit a PR. :)

In [4]:
# Third Party Imports
import pandas as pd
import sklearn
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Local imports
import zat
from zat.log_to_dataframe import LogToDataFrame
from zat.dataframe_to_matrix import DataFrameToMatrix

# Good to print out versions of stuff
print('zat: {:s}'.format(zat.__version__))
print('Pandas: {:s}'.format(pd.__version__))
print('Numpy: {:s}'.format(np.__version__))
print('Scikit Learn Version:', sklearn.__version__)
zat: 0.3.6
Pandas: 0.25.1
Numpy: 1.16.4
Scikit Learn Version: 0.21.2

Quickly go from Zeek log to Pandas DataFrame

In [5]:
# Create a Pandas dataframe from a Zeek log
log_to_df = LogToDataFrame()
bro_df = log_to_df.create_dataframe('../data/dns.log')

# Print out the head of the dataframe
uid id.orig_h id.orig_p id.resp_h id.resp_p proto trans_id query qclass qclass_name ... rcode rcode_name AA TC RD RA Z answers TTLs rejected
2013-09-15 23:44:27.631940126 CZGShC2znK1sV7jdI7 1030 53 udp 44949 guyspy.com 1 C_INTERNET ... 0 NOERROR F F T T 0 36.000000 F
2013-09-15 23:44:27.696868896 CZGShC2znK1sV7jdI7 1030 53 udp 50071 www.guyspy.com 1 C_INTERNET ... 0 NOERROR F F T T 0 guyspy.com, 1000.000000,36.000000 F
2013-09-15 23:44:28.060639143 CZGShC2znK1sV7jdI7 1030 53 udp 39062 devrubn8mli40.cloudfront.net 1 C_INTERNET ... 0 NOERROR F F T T 0,,,54.230... 60.000000,60.000000,60.000000,60.000000,60.000... F
2013-09-15 23:44:28.141794920 CZGShC2znK1sV7jdI7 1030 53 udp 7312 d31qbv1cthcecs.cloudfront.net 1 C_INTERNET ... 0 NOERROR F F T T 0,,,54.230.... 60.000000,60.000000,60.000000,60.000000,60.000... F
2013-09-15 23:44:28.422703981 CZGShC2znK1sV7jdI7 1030 53 udp 41872 crl.entrust.net 1 C_INTERNET ... 0 NOERROR F F T T 0 cdn.entrust.net.c.footprint.net, 4993.000000,129.000000,129.000000,129.000000 F

5 rows × 22 columns

So... what just happened?

Yep it was quick... the two little lines of code above turned a Zeek log (any log) into a Pandas DataFrame. The zat package also supports streaming data from dynamic/active logs, handles log rotations and in general tries to make your life a bit easier when doing data analysis and machine learning on Zeek data.

Now that we have the data in a dataframe there are a million wonderful things we could do for data munging, processing and analysis but that will have to wait for another time/notebook.

In [6]:
# Using Pandas we can easily and efficiently compute additional data metrics
# Here we use the vectorized operations of Pandas/Numpy to compute query length
bro_df['query_length'] = bro_df['query'].str.len()

DNS records are a mix of numeric and categorical data

When we look at the dns records some of the data is numerical and some of it is categorical so we'll need a way of handling both data types in a generalized way. zat has a DataFrameToMatrix class that handles a lot of the details and mechanics of combining numerical and categorical data, we'll use below.

In [9]:
# These are the features we want (note some of these are categorical :)
features = ['AA', 'RA', 'RD', 'TC', 'Z', 'rejected', 'proto', 'qclass_name', 
            'qtype_name', 'rcode_name', 'query_length']
feature_df = bro_df[features]
AA RA RD TC Z rejected proto qclass_name qtype_name rcode_name query_length
2013-09-15 23:44:27.631940126 F T T F 0 F udp C_INTERNET A NOERROR 10.0
2013-09-15 23:44:27.696868896 F T T F 0 F udp C_INTERNET A NOERROR 14.0
2013-09-15 23:44:28.060639143 F T T F 0 F udp C_INTERNET A NOERROR 28.0
2013-09-15 23:44:28.141794920 F T T F 0 F udp C_INTERNET A NOERROR 29.0
2013-09-15 23:44:28.422703981 F T T F 0 F udp C_INTERNET A NOERROR 15.0


We'll now use a zat scikit-learn tranformer class to convert the Pandas DataFrame to a numpy ndarray (matrix). Yes it's awesome... I'm not sure it's Optimus Prime awesome.. but it's still pretty nice.

In [10]:
# Use the zat DataframeToMatrix class (handles categorical data)
# You can see below it uses a heuristic to detect category data. When doing
# this for real we should explicitly convert before sending to the transformer.
to_matrix = dataframe_to_matrix.DataFrameToMatrix()
bro_matrix = to_matrix.fit_transform(feature_df)
Normalizing column Z...
Normalizing column query_length...


Now that we have a numpy ndarray(matrix) we're ready to rock with scikit-learn...

In [63]:
# Now we're ready for scikit-learn!
# Just some simple stuff for this example, KMeans and TSNE projection
kmeans = KMeans(n_clusters=5).fit_predict(bro_matrix)
projection = TSNE().fit_transform(bro_matrix)

# Now we can put our ML results back onto our dataframe!
bro_df['x'] = projection[:, 0] # Projection X Column
bro_df['y'] = projection[:, 1] # Projection Y Column
bro_df['cluster'] = kmeans
bro_df[['query', 'proto', 'x', 'y', 'cluster']].head()  # Showing the scikit-learn results in our dataframe
query proto x y cluster
2013-09-15 23:44:27.631940126 guyspy.com udp 39.936161 -18.896549 0
2013-09-15 23:44:27.696868896 www.guyspy.com udp 24.945557 3.408248 0
2013-09-15 23:44:28.060639143 devrubn8mli40.cloudfront.net udp -28.303942 2.064817 0
2013-09-15 23:44:28.141794920 d31qbv1cthcecs.cloudfront.net udp -22.633984 6.704806 0
2013-09-15 23:44:28.422703981 crl.entrust.net udp 18.326878 -2.018499 0
In [64]:
# Plotting defaults
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 18.0
plt.rcParams['figure.figsize'] = 15.0, 7.0
In [65]:
# Now use dataframe group by cluster
cluster_groups = bro_df.groupby('cluster')

# Plot the Machine Learning results
fig, ax = plt.subplots()
colors = {0:'red', 1:'blue', 2:'green', 3:'orange', 4:'purple', 5:'black'}
for key, group in cluster_groups:
    group.plot(ax=ax, kind='scatter', x='x', y='y', alpha=0.5, s=250,
               label='Cluster: {:d}'.format(key), color=colors[key])

Lets Investigate the 5 clusters of DNS data

We cramed a bunch of features into the clustering algorithm. The features were both numerical and categorical. So did the clustering 'do the right thing'? Well first some caveats and disclaimers:

  • We're obviously working with a small amount of Zeek DNS data
  • This is an example to show how the tranformations work (from Zeek to Pandas to Scikit)
  • The DNS data is real data but for this example and others we obviously pulled in 'weird' stuff on purpose
  • We knew that the K in KMeans should be 5 :)

Okay will all those caveats lets look at how the clustering did on both numeric and categorical data combined

Cluster details

  • Cluster 0: (42 observations) Looks like 'normal' DNS requests
  • Cluster 1: (11 observations) All the queries are '-' (Zeek for NA/not found/etc)
  • Cluster 2: ( 6 observations) The protocol is TCP instead of the normal UDP
  • Cluster 3: ( 4 observations) All the DNS queries are exceptionally long
  • Cluster 4: ( 4 observations) The reserved Z bit is set to 1 (required to be 0)

Numerical + Categorical = AOK

With our example data we've successfully gone from Zeek logs to Pandas to scikit-learn. The clusters appear to make sense and certainly from an investigative and threat hunting perspective being able to cluster the data and use PCA for dimensionality reduction might come in handy depending on your use case

In [66]:
# Now print out the details for each cluster
pd.set_option('display.width', 1000)
show_fields = ['query', 'Z', 'proto', 'qtype_name', 'cluster']
for key, group in cluster_groups:
    print('\nCluster {:d}: {:d} observations'.format(key, len(group)))
Cluster 0: 42 observations
                                                       query  Z proto qtype_name  cluster
2013-09-15 23:44:27.631940126                     guyspy.com  0   udp          A        0
2013-09-15 23:44:27.696868896                 www.guyspy.com  0   udp          A        0
2013-09-15 23:44:28.060639143   devrubn8mli40.cloudfront.net  0   udp          A        0
2013-09-15 23:44:28.141794920  d31qbv1cthcecs.cloudfront.net  0   udp          A        0
2013-09-15 23:44:28.422703981                crl.entrust.net  0   udp          A        0

Cluster 1: 3 observations
                              query  Z proto qtype_name  cluster
2013-09-15 23:44:50.827882051   NaN  0   udp        NaN        1
2013-09-15 23:44:50.829301834   NaN  0   udp        NaN        1
2013-09-15 23:44:51.026231050   NaN  0   udp        NaN        1

Cluster 2: 3 observations
                                       query  Z proto qtype_name  cluster
2013-09-15 23:48:05.897021055  j.maxmind.com  0   tcp          A        2
2013-09-15 23:48:06.897021055  j.maxmind.com  0   tcp          A        2
2013-09-15 23:48:07.897021055  j.maxmind.com  0   tcp          A        2

Cluster 3: 3 observations
                                                                           query  Z proto qtype_name  cluster
2013-09-15 23:48:02.497021198  superlongcrazydnsqueryforoutlierdetectionj.max...  0   udp          A        3
2013-09-15 23:48:02.597021103  xyzqsuperlongcrazydnsqueryforoutlierdetectionj...  0   udp          A        3
2013-09-15 23:48:02.697021008  abcsuperlongcrazydnsqueryforoutlierdetectionj....  0   udp          A        3

Cluster 4: 3 observations
                                       query  Z proto qtype_name  cluster
2013-09-15 23:48:02.897021055  j.maxmind.com  1   udp          A        4
2013-09-15 23:48:04.897021055  j.maxmind.com  1   udp          A        4
2013-09-15 23:48:04.997021198  j.maxmind.com  1   udp          A        4

Wrap Up

Well that's it for this notebook, we'll have an upcoming notebook that addresses some of the issues that we overlooked in this simple example. We use Isolation Forest for anomaly detection which works well for high dimensional data. The notebook will cover both training and streaming evalution against the model.

If you liked this notebook please visit the zat project for more notebooks and examples.