Dirk Loss, http://dirk-loss.de, @dloss. v1.1, 2013-06-02
This IPython notebook shows how to analyse network traffic using tshark (from the Wireshark suite), Pandas and Matplotlib.
Pandas allows for very flexible analysis, treating your PCAP files as a timeseries of packet data.
So if the statistics provided by Wireshark are not enough, you might want to try this. And it's more fun, of course. :)
First we need a PCAP file. I chose a sample file from the Digital Corpora site that has been used for courses in network forensics:
from IPython.display import HTML
HTML('<iframe src=http://digitalcorpora.org/corpora/scenarios/nitroba-university-harassment-scenario width=600 height=300></iframe>')
!mkdir -p pcap
cd pcap
/home/dirk/projects/pcap
We can download it using curl or pure Python. Just uncomment one of the following cells:
url="http://digitalcorpora.org/corp/nps/packets/2008-nitroba/nitroba.pcap"
# If you have curl installed, we can get nice progress bars:
#!curl -o nitroba.pcap $url
# Or use pure Python:
# import urllib
# urllib.urlretrieve(url, "nitroba.pcap")
ls -l nitroba.pcap
-rw-rw-r-- 1 dirk dirk 56795590 Jun 2 12:10 nitroba.pcap
!md5sum nitroba.pcap
d6b5df10fc572b54ceb9c543d11f10a4 nitroba.pcap
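The same integrity check can be done from Python instead of `md5sum`. This is only a minimal sketch, assuming Python 3; the helper names `md5_of_file` and `verify_download` are made up for illustration, and the expected digest is the one printed above.

```python
import hashlib

# Digest printed by md5sum in the cell above
EXPECTED_MD5 = "d6b5df10fc572b54ceb9c543d11f10a4"

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected=EXPECTED_MD5):
    """True if the file at `path` has the expected MD5 digest."""
    return md5_of_file(path) == expected
```

Calling `verify_download("nitroba.pcap")` after the download should return `True` if the file matches the copy used here.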
We can use the tshark command from the Wireshark tool suite to read the PCAP file and convert it into a tab-separated file. This might not be very fast, but it is very flexible, because all of Wireshark's display filters can be used to select the packets that we are interested in.
!tshark -v
TShark 1.6.7
Copyright 1998-2012 Gerald Combs <gerald@wireshark.org> and contributors.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Compiled (32-bit) with GLib 2.32.0, with libpcap (version unknown), with libz 1.2.3.4, with POSIX capabilities (Linux), without libpcre, with SMI 0.4.8, with c-ares 1.7.5, with Lua 5.1, without Python, with GnuTLS 2.12.14, with Gcrypt 1.5.0, with MIT Kerberos, with GeoIP.
Running on Linux 3.2.0-45-generic, with libpcap version 1.1.1, with libz 1.2.3.4.
Built using gcc 4.6.3.
For now, I just select the frame number and the frame length and redirect the output to a file:
!tshark -n -r nitroba.pcap -T fields -Eheader=y -e frame.number -e frame.len > frame.len
Let's have a look at the file:
!head -10 frame.len
frame.number  frame.len
1             70
2             70
3             1421
4             70
5             1284
6             70
7             70
8             70
9             78
Two columns, tab-separated. (Not exactly CSV, but who cares. ;-)
Pandas can read those tables into a DataFrame object:
import pandas as pd
df=pd.read_table("frame.len")
The object has a nice default representation that shows the number of non-null values in each column:
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95175 entries, 0 to 95174
Data columns (total 2 columns):
frame.number    95175  non-null values
frame.len       95175  non-null values
dtypes: int64(2)
Some statistics about the frame length:
df["frame.len"].describe()
count    95175.000000
mean       580.748789
std        625.757017
min         42.000000
25%         70.000000
50%         87.000000
75%       1466.000000
max       1466.000000
dtype: float64
The minimum and maximum frame lengths are plausible for an Ethernet connection.
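To try the loading and `describe()` steps without the capture file, here is a small sketch that parses a few tab-separated lines from memory. It assumes Python 3 and a recent pandas; the sample text is just the first rows of the tshark output shown earlier, and `df_sample` is a name made up for this example.

```python
import io
import pandas as pd

# First rows of the tab-separated tshark output, held in memory
sample = "frame.number\tframe.len\n1\t70\n2\t70\n3\t1421\n4\t70\n"

# pd.read_table uses a tab separator by default; read_csv with sep="\t"
# does the same thing explicitly
df_sample = pd.read_csv(io.StringIO(sample), sep="\t")

# Summary statistics for the frame length column
stats = df_sample["frame.len"].describe()
print(stats)
```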
For a better overview, we plot the frame length over time.
We initialise IPython to show inline graphics:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline]. For more information, type 'help(pylab)'.
Set a figure size in inches:
figsize(10,6)
Pandas automatically uses Matplotlib for plotting. We plot with small dots and an alpha channel of 0.2:
df["frame.len"].plot(style=".", alpha=0.2)
title("Frame length")
ylabel("bytes")
xlabel("frame number")
<matplotlib.text.Text at 0x9aaf58c>
So there are always lots of small packets (< 100 bytes) and lots of large packets (> 1400 bytes). Some bursts of packets with other sizes (around 400 bytes, 1000 bytes, etc.) can be clearly seen.
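The visual impression can be backed up with numbers by binning the frame lengths. The following is only a sketch under assumptions: Python 3, a recent pandas, and made-up bin labels; the sample lengths are the first nine rows printed earlier, not the whole capture.

```python
import numpy as np
import pandas as pd

# First nine frame lengths from the output of `head` above
lengths = pd.Series([70, 70, 1421, 70, 1284, 70, 70, 70, 78], name="frame.len")

# Left-closed bins: [0, 100), [100, 1401), [1401, inf)
# For integer lengths this means "< 100", "100..1400" and "> 1400"
bins = [0, 100, 1401, np.inf]
labels = ["small (<100)", "medium", "large (>1400)"]

size_class = pd.cut(lengths, bins=bins, labels=labels, right=False)
counts = size_class.value_counts()
print(counts)
```

With the full DataFrame, `df["frame.len"]` would take the place of `lengths`.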
Passing all those arguments to tshark is quite cumbersome. Here is a convenience function that reads the given fields into a Pandas DataFrame:
import subprocess
import datetime
import pandas as pd
def read_pcap(filename, fields=[], display_filter="",
              timeseries=False, strict=False):
    """ Read PCAP file into Pandas DataFrame object.
    Uses tshark command-line tool from Wireshark.

    filename:       Name or full path of the PCAP file to read
    fields:         List of fields to include as columns
    display_filter: Additional filter to restrict frames
    strict:         Only include frames that contain all given fields
                    (Default: false)
    timeseries:     Create DatetimeIndex from frame.time_epoch
                    (Default: false)

    Syntax for fields and display_filter is specified in
    Wireshark's Display Filter Reference:

      http://www.wireshark.org/docs/dfref/
    """
    if timeseries:
        fields = ["frame.time_epoch"] + fields
    fieldspec = " ".join("-e %s" % f for f in fields)

    # Copy the list so that appending the display filter below
    # does not modify the caller's list of fields
    display_filters = list(fields) if strict else []
    if display_filter:
        display_filters.append(display_filter)
    filterspec = "-R '%s'" % " and ".join(f for f in display_filters)

    options = "-r %s -n -T fields -Eheader=y" % filename
    cmd = "tshark %s %s %s" % (options, filterspec, fieldspec)
    proc = subprocess.Popen(cmd, shell = True,
                            stdout=subprocess.PIPE)
    if timeseries:
        df = pd.read_table(proc.stdout,
                           index_col = "frame.time_epoch",
                           parse_dates=True,
                           date_parser=datetime.datetime.fromtimestamp)
    else:
        df = pd.read_table(proc.stdout)
    return df
We will use this function in the rest of the analysis.
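The command line that the function assembles can also be built as an argument list, which avoids shell quoting problems with file names or filters. This is a sketch, not the function used in this notebook: `build_tshark_args` and its `filter_flag` parameter are hypothetical names. It keeps the `-R` option used above; as far as I know, newer tshark releases expect `-Y` for display filters, so the flag is left configurable.

```python
def build_tshark_args(filename, fields, display_filter="",
                      timeseries=False, strict=False, filter_flag="-R"):
    """Return the tshark command as a list suitable for subprocess.Popen."""
    fields = list(fields)  # never modify the caller's list
    if timeseries:
        fields = ["frame.time_epoch"] + fields

    args = ["tshark", "-r", filename, "-n", "-T", "fields", "-Eheader=y"]

    # In strict mode every field name doubles as a filter term, so only
    # frames that contain all of the fields are kept
    filters = list(fields) if strict else []
    if display_filter:
        filters.append(display_filter)
    if filters:
        args += [filter_flag, " and ".join(filters)]

    for field in fields:
        args += ["-e", field]
    return args

print(build_tshark_args("nitroba.pcap", ["frame.len"], timeseries=True))
```

The resulting list can be passed to `subprocess.Popen(args, stdout=subprocess.PIPE)` without `shell=True`.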
By summing up the frame lengths we can calculate the complete (Ethernet) bandwidth used. First use our convenience function to read the PCAP into a DataFrame:
framelen=read_pcap("nitroba.pcap", ["frame.len"], timeseries=True)
framelen
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 95175 entries, 2008-07-22 03:51:07.095278 to 2008-07-22 08:13:47.046029
Data columns (total 1 columns):
frame.len    95175  non-null values
dtypes: int64(1)
Then we re-sample the timeseries into buckets of 1 second, summing over the lengths of all frames that were captured in that second:
bytes_per_second=framelen.resample("S", how="sum")
Here are the first 5 rows. We get NaN for those timestamps where no frames were captured:
bytes_per_second.head()
| frame.time_epoch | frame.len |
|---|---|
| 2008-07-22 03:51:07 | 20729 |
| 2008-07-22 03:51:08 | 8426 |
| 2008-07-22 03:51:09 | 13565 |
| 2008-07-22 03:51:10 | NaN |
| 2008-07-22 03:51:11 | NaN |
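Here is a small self-contained sketch of the same per-second summing, assuming Python 3 and a recent pandas, where the `how="sum"` argument has been replaced by calling `.sum()` on the resampler. The three timestamps and lengths are made up. Passing `min_count=1` is my choice to keep NaN for seconds without any frames, as in the table above; without it, recent pandas versions report 0 for empty buckets.

```python
import pandas as pd

# Three made-up frames: two in the first second, none in the second,
# one in the third
index = pd.to_datetime([
    "2008-07-22 03:51:07.095",
    "2008-07-22 03:51:07.500",
    "2008-07-22 03:51:09.200",
])
frames = pd.DataFrame({"frame.len": [70, 1421, 70]}, index=index)

# One bucket per second; min_count=1 leaves empty buckets as NaN
per_second = frames["frame.len"].resample("1s").sum(min_count=1)
print(per_second)
```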
bytes_per_second.plot()
<matplotlib.axes.AxesSubplot at 0x9f40d2c>
Let's try to replicate the TCP Time-Sequence Graph that is known from Wireshark (Statistics > TCP Stream Analysis > Time-Sequence Graph (Stevens)).
fields=["tcp.stream", "ip.src", "ip.dst", "tcp.seq", "tcp.ack", "tcp.window_size", "tcp.len"]
ts=read_pcap("nitroba.pcap", fields, timeseries=True, strict=True)
ts
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 81451 entries, 2008-07-22 03:51:07.095278 to 2008-07-22 08:13:47.046029
Data columns (total 7 columns):
tcp.stream         81451  non-null values
ip.src             81451  non-null values
ip.dst             81451  non-null values
tcp.seq            81451  non-null values
tcp.ack            81451  non-null values
tcp.window_size    81451  non-null values
tcp.len            81451  non-null values
dtypes: int64(5), object(2)
Now we have to select a TCP stream to analyse. As an example, we just pick stream number 10:
stream=ts[ts["tcp.stream"] == 10]
stream
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 26 entries, 2008-07-22 03:51:08.431406 to 2008-07-22 03:53:29.160668
Data columns (total 7 columns):
tcp.stream         26  non-null values
ip.src             26  non-null values
ip.dst             26  non-null values
tcp.seq            26  non-null values
tcp.ack            26  non-null values
tcp.window_size    26  non-null values
tcp.len            26  non-null values
dtypes: int64(5), object(2)
Pandas only prints the overview because the table is too wide. So we force a display:
print stream.to_string()
                            tcp.stream         ip.src         ip.dst  tcp.seq  tcp.ack  tcp.window_size  tcp.len
frame.time_epoch
2008-07-22 03:51:08.431406          10  209.85.171.97   192.168.1.64        0        1             5672        0
2008-07-22 03:51:08.437600          10   192.168.1.64  209.85.171.97        1        1           524280        0
2008-07-22 03:51:08.438156          10   192.168.1.64  209.85.171.97        1        1           524280      153
2008-07-22 03:51:08.467383          10  209.85.171.97   192.168.1.64        1      154             6784        0
2008-07-22 03:51:08.469846          10  209.85.171.97   192.168.1.64        1      154             6784     1177
2008-07-22 03:51:08.474440          10   192.168.1.64  209.85.171.97      154     1178           523712        0
2008-07-22 03:51:08.547444          10   192.168.1.64  209.85.171.97      154     1178           524280      267
2008-07-22 03:51:08.547498          10   192.168.1.64  209.85.171.97      421     1178           524280        6
2008-07-22 03:51:08.547768          10   192.168.1.64  209.85.171.97      427     1178           524280       41
2008-07-22 03:51:08.589823          10  209.85.171.97   192.168.1.64     1178      468             7872       47
2008-07-22 03:51:08.592029          10   192.168.1.64  209.85.171.97      468     1225           524280        0
2008-07-22 03:51:08.594719          10   192.168.1.64  209.85.171.97      468     1225           524280      604
2008-07-22 03:51:08.633074          10  209.85.171.97   192.168.1.64     1225     1072             9024     1344
2008-07-22 03:51:08.635798          10   192.168.1.64  209.85.171.97     1072     2569           523552        0
2008-07-22 03:51:09.295395          10   192.168.1.64  209.85.171.97     1072     2569           524280     1024
2008-07-22 03:51:09.337628          10  209.85.171.97   192.168.1.64     2569     2096            11072      354
2008-07-22 03:51:09.340889          10   192.168.1.64  209.85.171.97     2096     2923           524280        0
2008-07-22 03:53:09.324698          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:09.561366          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:10.020463          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:10.734440          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:11.956795          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:13.662067          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:15.876856          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:20.305760          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:29.160668          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
Add a column that shows who sent the packet (client or server).
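The lambda expression distinguishes between the client and the server side of the stream by comparing each packet's source IP address with the source IP address of the first packet in the stream (for TCP streams that should have been sent by the client). This only holds if the first row really is the client's opening packet. Here the first row has tcp.seq 0 and tcp.ack 1 and comes from 209.85.171.97, which looks like the server's SYN/ACK, so the two labels may well be swapped for this stream.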
stream["type"] = stream.apply(lambda x: "client" if x["ip.src"] == stream.irow(0)["ip.src"] else "server", axis=1)
print stream.to_string()
                            tcp.stream         ip.src         ip.dst  tcp.seq  tcp.ack  tcp.window_size  tcp.len    type
frame.time_epoch
2008-07-22 03:51:08.431406          10  209.85.171.97   192.168.1.64        0        1             5672        0  client
2008-07-22 03:51:08.437600          10   192.168.1.64  209.85.171.97        1        1           524280        0  server
2008-07-22 03:51:08.438156          10   192.168.1.64  209.85.171.97        1        1           524280      153  server
2008-07-22 03:51:08.467383          10  209.85.171.97   192.168.1.64        1      154             6784        0  client
2008-07-22 03:51:08.469846          10  209.85.171.97   192.168.1.64        1      154             6784     1177  client
2008-07-22 03:51:08.474440          10   192.168.1.64  209.85.171.97      154     1178           523712        0  server
2008-07-22 03:51:08.547444          10   192.168.1.64  209.85.171.97      154     1178           524280      267  server
2008-07-22 03:51:08.547498          10   192.168.1.64  209.85.171.97      421     1178           524280        6  server
2008-07-22 03:51:08.547768          10   192.168.1.64  209.85.171.97      427     1178           524280       41  server
2008-07-22 03:51:08.589823          10  209.85.171.97   192.168.1.64     1178      468             7872       47  client
2008-07-22 03:51:08.592029          10   192.168.1.64  209.85.171.97      468     1225           524280        0  server
2008-07-22 03:51:08.594719          10   192.168.1.64  209.85.171.97      468     1225           524280      604  server
2008-07-22 03:51:08.633074          10  209.85.171.97   192.168.1.64     1225     1072             9024     1344  client
2008-07-22 03:51:08.635798          10   192.168.1.64  209.85.171.97     1072     2569           523552        0  server
2008-07-22 03:51:09.295395          10   192.168.1.64  209.85.171.97     1072     2569           524280     1024  server
2008-07-22 03:51:09.337628          10  209.85.171.97   192.168.1.64     2569     2096            11072      354  client
2008-07-22 03:51:09.340889          10   192.168.1.64  209.85.171.97     2096     2923           524280        0  server
2008-07-22 03:53:09.324698          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:09.561366          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:10.020463          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:10.734440          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:11.956795          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:13.662067          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:15.876856          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:20.305760          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:29.160668          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
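The same labelling can be written without a row-wise `apply`. This is a sketch on a tiny made-up stream, assuming Python 3, NumPy and a recent pandas (where `.iloc[0]` takes the place of `irow(0)`); it uses the same heuristic as above, namely that the first row belongs to the side we call "client".

```python
import numpy as np
import pandas as pd

# Tiny made-up stream with two endpoints (addresses are illustrative)
demo = pd.DataFrame({
    "ip.src":  ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.2"],
    "tcp.seq": [0, 0, 1, 1],
})

# Source address of the first row decides which side is "client"
first_src = demo["ip.src"].iloc[0]

# Vectorised comparison instead of a lambda per row
demo["type"] = np.where(demo["ip.src"] == first_src, "client", "server")
print(demo)
```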
client_stream=stream[stream.type == "client"]
client_stream["tcp.seq"].plot(style="r-o")
<matplotlib.axes.AxesSubplot at 0xa1e454c>
Notice that the x-axis shows the real timestamps.
For comparison, change the x-axis to be the packet number in the stream:
client_stream.index = arange(len(client_stream))
client_stream["tcp.seq"].plot(style="r-o")
<matplotlib.axes.AxesSubplot at 0xa1d91ac>
Looks different of course.
per_stream=ts.groupby("tcp.stream")
per_stream.head()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 9913 entries, (0, 2008-07-22 03:51:07.095278) to (2765, 2008-07-22 08:11:35.496780)
Data columns (total 7 columns):
tcp.stream         9913  non-null values
ip.src             9913  non-null values
ip.dst             9913  non-null values
tcp.seq            9913  non-null values
tcp.ack            9913  non-null values
tcp.window_size    9913  non-null values
tcp.len            9913  non-null values
dtypes: int64(5), object(2)
bytes_per_stream = per_stream["tcp.len"].sum()
bytes_per_stream.head()
tcp.stream
0                0
1             2565
5             5158
6             8266
10            5017
Name: tcp.len, dtype: int64
bytes_per_stream.plot()
<matplotlib.axes.AxesSubplot at 0xac810ac>
bytes_per_stream.max()
5150771
biggest_stream=bytes_per_stream.idxmax()
biggest_stream
88
bytes_per_stream.ix[biggest_stream]
5150771
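The cells above group the packets by `tcp.stream`, sum the payload lengths per stream and look up the stream with the largest total. Here is a compact sketch of that pattern on made-up packets, assuming Python 3 and a recent pandas, where `.loc` is used for label lookups instead of `.ix`:

```python
import pandas as pd

# Made-up packets: stream number and TCP payload length
packets = pd.DataFrame({
    "tcp.stream": [0, 1, 1, 5, 5, 5],
    "tcp.len":    [0, 1200, 1365, 153, 4000, 1005],
})

# Total payload bytes per stream
demo_bytes_per_stream = packets.groupby("tcp.stream")["tcp.len"].sum()

# Label (stream number) of the largest total, then its value
demo_biggest = demo_bytes_per_stream.idxmax()
demo_biggest_bytes = demo_bytes_per_stream.loc[demo_biggest]
print(demo_biggest, demo_biggest_bytes)
```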
Let's have a look at the padding of the Ethernet frames. Some cards have been leaking data in the past. For more details, see http://www.securiteam.com/securitynews/5BP01208UO.html
trailer_df = read_pcap("nitroba.pcap", ["eth.src", "eth.trailer"], timeseries=True)
trailer_df
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 95175 entries, 2008-07-22 03:51:07.095278 to 2008-07-22 08:13:47.046029
Data columns (total 2 columns):
eth.src        95175  non-null values
eth.trailer    12851  non-null values
dtypes: object(2)
trailer=trailer_df["eth.trailer"]
trailer
frame.time_epoch
2008-07-22 03:51:07.095278    NaN
2008-07-22 03:51:07.103728    NaN
2008-07-22 03:51:07.114897    NaN
2008-07-22 03:51:07.139448    NaN
2008-07-22 03:51:07.319680    NaN
2008-07-22 03:51:07.321990    NaN
2008-07-22 03:51:07.326517    NaN
2008-07-22 03:51:07.335554    NaN
2008-07-22 03:51:07.376171    NaN
2008-07-22 03:51:07.378392    NaN
2008-07-22 03:51:07.389299    NaN
2008-07-22 03:51:07.390478    NaN
2008-07-22 03:51:07.404056    NaN
2008-07-22 03:51:07.416518    NaN
2008-07-22 03:51:07.423663    NaN
...
2008-07-22 08:13:44.266370    NaN
2008-07-22 08:13:44.266638    NaN
2008-07-22 08:13:44.293692    00:00:00:00:00:00
2008-07-22 08:13:44.585477    NaN
2008-07-22 08:13:44.863535    NaN
2008-07-22 08:13:44.873602    NaN
2008-07-22 08:13:44.883737    NaN
2008-07-22 08:13:44.893510    NaN
2008-07-22 08:13:44.903460    NaN
2008-07-22 08:13:44.913495    NaN
2008-07-22 08:13:44.923654    NaN
2008-07-22 08:13:44.933648    NaN
2008-07-22 08:13:44.943515    NaN
2008-07-22 08:13:44.953453    NaN
2008-07-22 08:13:47.046029    NaN
Name: eth.trailer, Length: 95175, dtype: object
Ok. Most frames do not seem to have padding, but some do. Let's count per value to get an overview:
trailer.value_counts()
00:00:00:00:00:00                                        7989
3b:02:a7:19:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02     913
00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00     606
3b:02:a7:19:00:1d:6b:99:98:6a:88:64:11:00:8f:da:00:42     303
00:00                                                     299
00:00:c0:a8:01:40:00:00:00:00:00:00:00:00:00:1d:d9:2e     259
32:01:67:06:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02     254
2d:66:6f:6f:65:05:79:61:68:6f:6f:03:63:6f:6d:00:00:01     253
04:67:6b:64:63:03:75:61:73:03:61:6f:6c:03:63:6f:6d:00     160
70:03:6d:73:67:05:79:61:68:6f:6f:03:63:6f:6d:00:00:01     151
73:6b:03:6d:61:63:03:63:6f:6d:00:00:01:00:01:00:01:00     146
2d:66:6f:6f:62:05:79:61:68:6f:6f:03:63:6f:6d:00:00:01     101
73:6b:03:6d:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02      66
72:65:76:73:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02      54
00:00:00:00:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02      52
...
2d:66:6f:6f:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02       1
00:00:c9:6e:87:fc                                           1
00:00:44:b7:84:43                                           1
00:00:3b:fc:30:86                                           1
00:00:7b:1f:5b:03                                           1
00:00:78:27:f5:37                                           1
00:00:f0:2c:e6:35                                           1
00:00:6e:f5:46:41                                           1
00:00:00:00:00:00:00:00:00:00:00:00:00:16:39:da:a9          1
00:00:7a:e4:d0:27                                           1
00:00:61:c8:85:63                                           1
00:00:e7:99:00:70                                           1
00:00:68:25:eb:a0                                           1
00:00:34:ba:2b:52                                           1
00:00:53:8a:e9:05                                           1
Length: 635, dtype: int64
Mostly zeros, but some data. Let's decode the hex strings:
import binascii
def unhex(s, sep=":"):
    return binascii.unhexlify("".join(s.split(sep)))
s=unhex("3b:02:a7:19:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02")
s
';\x02\xa7\x19\xaa\xaa\x03\x00\x80\xc2\x00\x07\x00\x00\x00\x02;\x02'
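On Python 3 the same decoding can be done without `binascii`. This is a sketch under that assumption; the name `unhex_bytes` is made up to keep it apart from the function above, and the result is a `bytes` object rather than a Python 2 string.

```python
def unhex_bytes(s, sep=":"):
    """Decode a separator-delimited hex string such as '3b:02' into bytes."""
    # bytes.fromhex expects plain hex digits, so drop the separators first
    return bytes.fromhex(s.replace(sep, ""))

decoded = unhex_bytes("3b:02:a7:19:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02")
print(decoded)
```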
padding = trailer_df.dropna()
padding["unhex"]=padding["eth.trailer"].map(unhex)
def printable(s):
    chars = []
    for c in s:
        if c.isalnum():
            chars.append(c)
        else:
            chars.append(".")
    return "".join(chars)
printable("\x95asd\x33")
'.asd3'
padding["printable"]=padding["unhex"].map(printable)
padding["printable"].value_counts()
......                8145
..................    1927
......k..j.d.....B     303
..                     299
2.g...............     254
.fooe.yahoo.com...     253
.gkdc.uas.aol.com.     160
p.msg.yahoo.com...     151
sk.mac.com........     148
.foob.yahoo.com...     101
sk.m..............      66
revs..............      54
ge.w..............      45
1.1...............      44
.goo..............      42
...
..........Wz......       1
..M...                   1
...i.Z                   1
..x...                   1
..N...                   1
..n.oN                   1
....fK                   1
....fk                   1
..Y8..                   1
..n.FA                   1
...O.r                   1
....Qn                   1
..PK.e                   1
...w..                   1
..1...                   1
Length: 375, dtype: int64
def ratio_printable(s):
    printable = sum(1.0 for c in s if c.isalnum())
    return printable / len(s)
ratio_printable("a\x93sdfs")
0.8333333333333334
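The two helpers above iterate over a Python 2 string. On Python 3 the decoded padding is a `bytes` object, and iterating over it yields integers, so `c.isalnum()` no longer works as written. A hedged sketch of equivalent helpers follows; the `_bytes` names are made up, and restricting the check to ASCII is my assumption about the intended behaviour.

```python
def printable_bytes(data):
    """Replace every byte that is not an ASCII letter or digit with '.'."""
    chars = []
    for byte in data:  # iterating over bytes yields ints in Python 3
        char = chr(byte)
        # byte < 128 keeps non-ASCII letters such as 0xe9 out
        chars.append(char if byte < 128 and char.isalnum() else ".")
    return "".join(chars)

def ratio_printable_bytes(data):
    """Fraction of bytes that are ASCII letters or digits (0.0 if empty)."""
    if not data:
        return 0.0
    alnum = sum(1 for byte in data if byte < 128 and chr(byte).isalnum())
    return alnum / len(data)

print(printable_bytes(b"\x95asd\x33"))
print(ratio_printable_bytes(b"a\x93sdfs"))
```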
padding["ratio_printable"] = padding["unhex"].map(ratio_printable)
padding[padding["ratio_printable"] > 0.5]
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 727 entries, 2008-07-22 03:51:20.018817 to 2008-07-22 05:40:13.338449
Data columns (total 5 columns):
eth.src            727  non-null values
eth.trailer        727  non-null values
unhex              727  non-null values
printable          727  non-null values
ratio_printable    727  non-null values
dtypes: float64(1), object(4)
_.printable.value_counts()
.fooe.yahoo.com...    253
.gkdc.uas.aol.com.    160
p.msg.yahoo.com...    151
.foob.yahoo.com...    101
.weather.com......     31
ge.weather.com....     26
1.1..HOST.239.255.      1
..CDWW                  1
.foof.yahoo.com...      1
..3rbo                  1
..BIKM                  1
dtype: int64
Now find out which Ethernet cards sent those packets with more than 50% alphanumeric characters in their padding:
padding[padding["ratio_printable"] > 0.5]['eth.src'].drop_duplicates()
frame.time_epoch
2008-07-22 03:51:20.018817    00:1d:d9:2e:4f:61
2008-07-22 04:10:14.155085    00:1d:6b:99:98:68
Name: eth.src, dtype: object
HTML('<iframe src=http://www.coffer.com/mac_find/?string=00%3A1d%3Ad9%3A2e%3A4f%3A61 width=600 height=300></iframe>')
That's 'Hon Hai Precision' (and 'Netopia Inc' for the other MAC address).
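Vendor lookups like this are keyed on the first three octets of the MAC address (the OUI). As a small sketch, assuming Python 3 and a made-up helper name `oui`, the prefixes of the two senders found above can be extracted like this:

```python
def oui(mac):
    """Return the vendor prefix (first three octets) of a MAC address."""
    octets = mac.lower().replace("-", ":").split(":")
    if len(octets) != 6:
        raise ValueError("expected six octets: %r" % mac)
    return ":".join(octets[:3])

# The two source addresses from the drop_duplicates() output above
senders = ["00:1d:d9:2e:4f:61", "00:1d:6b:99:98:68"]
prefixes = [oui(mac) for mac in senders]
print(prefixes)
```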