Driveby PCAP analysis

Check yo'self before you wreck yo'self

We sumbled upon a really great collection of PCAPs that were interactions with suspicious websites. Which seemed to lead to the suspicion being correct, and an end system got exploited, or the suspicion was wrong and nothing bad happened. While we do no machine learning in this notebook, there are several techniques that are usful for collecting statistics about features across all the samples. However, when dealing with network traffic it's often very useful to have the additional context that a full session can provide. Here we try to break out of the sample box and begin showing groups of data as viewed by Bro in each network session.


What we did:

  • Data gathered with Bro (default Bro content)
  • bro -C -r <pcap file> local
  • Data cleanup
  • Explored the Data!
    • Found some patterns
    • Exploited the patterns to us find new things!


  • To the people that we borrowed the data from. Once they get back to me I'll be sure to give proper recognition. :)
In [78]:
# All the imports and some basic level setting with various versions
import IPython
import os
import pylab
import string
import pandas
import pickle
import matplotlib
import collections
import numpy as np
import pandas as pd
import matplotlib as plt
from __future__ import division

print "IPython version: %s" %IPython.__version__
print "pandas version: %s" %pd.__version__
print "numpy version: %s" %np.__version__
print "matplotlib version: %s" %plt.__version__

%matplotlib inline
pylab.rcParams['figure.figsize'] = (16.0, 5.0)
IPython version: 2.0.0
pandas version: 0.13.0rc1-32-g81053f9
numpy version: 1.6.1
matplotlib version: 1.3.1
In [24]:
# Mapping of fields of the files we want to read in and initial setup of pandas dataframes
# Borrowed from aonther notebook, this time we're just going to focus on notice and files for starters
# But the rest are here when we need 'em
logs_to_process = {
                    'conn.log' : ['ts','uid','id.orig_h','id.orig_p','id.resp_h','id.resp_p','proto','service','duration','orig_bytes','resp_bytes','conn_state','local_orig','missed_bytes','history','orig_pkts','orig_ip_bytes','resp_pkts','resp_ip_bytes','tunnel_parents','sample'],
                    'dns.log' : ['ts','uid','id.orig_h','id.orig_p','id.resp_h','id.resp_p','proto','trans_id','query','qclass','qclass_name','qtype','qtype_name','rcode','rcode_name','AA','TC','RD','RA','Z','answers','TTLs','rejected','sample'],
                    'files.log' : ['ts','fuid','tx_hosts','rx_hosts','conn_uids','source','depth','analyzers','mime_type','filename','duration','local_orig','is_orig','seen_bytes','total_bytes','missing_bytes','overflow_bytes','timedout','parent_fuid','md5','sha1','sha256','extracted','sample'],
                    'ftp.log' : ['ts','uid','id.orig_h','id.orig_p','id.resp_h','id.resp_p','user','password','command','arg','mime_type','file_size','reply_code','reply_msg','data_channel.passive','data_channel.orig_h','data_channel.resp_h','data_channel.resp_p','fuid','sample'],
                    'http.log' : ['ts','uid','id.orig_h','id.orig_p','id.resp_h','id.resp_p','trans_depth','method','host','uri','referrer','user_agent','request_body_len','response_body_len','status_code','status_msg','info_code','info_msg','filename','tags','username','password','proxied','orig_fuids','orig_mime_types','resp_fuids','resp_mime_types','sample'],
                    'irc.log' : ['ts','uid','id.orig_h','id.orig_p','id.resp_h','id.resp_p','nick','user','command','value','addl','dcc_file_name','dcc_file_size','dcc_mime_type','fuid','sample'],
                    'notice.log' : ['ts','uid','id.orig_h','id.orig_p','id.resp_h','id.resp_p','fuid','file_mime_type','file_desc','proto','note','msg','sub','src','dst','p','n','peer_descr','actions','suppress_for','dropped','remote_location.country_code','remote_location.region','','remote_location.latitude','remote_location.longitude','sample'],
                    'signatures.log' : ['ts','src_addr','src_port','dst_addr','dst_port','note','sig_id','event_msg','sub_msg','sig_count','host_count','sample'],
                    'smtp.log' : ['ts','uid','id.orig_h','id.orig_p','id.resp_h','id.resp_p','trans_depth','helo','mailfrom','rcptto','date','from','to','reply_to','msg_id','in_reply_to','subject','x_originating_ip','first_received','second_received','last_reply','path','user_agent','fuids','is_webmail','sample'],
                    'ssl.log' : ['ts','uid','id.orig_h','id.orig_p','id.resp_h','id.resp_p','version','cipher','server_name','session_id','subject','issuer_subject','not_valid_before','not_valid_after','last_alert','client_subject','client_issuer_subject','cert_hash','validation_status','sample'],
                    'tunnel.log' : ['ts','uid','id.orig_h','id.orig_p','id.resp_h','id.resp_p','tunnel_type','action','sample'],
                    'weird.log' : ['ts','uid','id.orig_h','id.orig_p','id.resp_h','id.resp_p','name','addl','notice','peer','sample']

conndf   = pd.DataFrame(columns=logs_to_process['conn.log'])
dnsdf    = pd.DataFrame(columns=logs_to_process['dns.log'])
filesdf  = pd.DataFrame(columns=logs_to_process['files.log'])
ftpdf    = pd.DataFrame(columns=logs_to_process['ftp.log'])
httpdf   = pd.DataFrame(columns=logs_to_process['http.log'])
ircdf    = pd.DataFrame(columns=logs_to_process['irc.log'])
noticedf = pd.DataFrame(columns=logs_to_process['notice.log'])
smtpdf   = pd.DataFrame(columns=logs_to_process['smtp.log'])
ssldf    = pd.DataFrame(columns=logs_to_process['ssl.log'])
weirddf  = pd.DataFrame(columns=logs_to_process['weird.log'])
In [25]:
process_files = ['notice.log','files.log']
for dirName, subdirList, fileList in os.walk('..'):
    for fname in fileList:
        tags = dirName.split('/')
        if len(tags) == 2 and fname in logs_to_process:
            logname = fname.split('.')
                if fname in process_files:
                    #print "Processing %s - %s" %(tags[1], fname)
                    tempdf = pd.read_csv(dirName+'/'+fname, sep='\t',skiprows=8, header=None, 
                                     names=logs_to_process[fname][:-1], skipfooter=1)
                    tempdf['sample'] = tags[1]
                    if fname == 'conn.log':
                        conndf = conndf.append(tempdf)
                    if fname == 'dns.log':
                        dnsdf = dnsdf.append(tempdf)
                    if fname == 'files.log':
                        filesdf = filesdf.append(tempdf)
                    if fname == 'ftp.log':
                        ftpdf = ftpdf.append(tempdf)
                    if fname == 'http.log':
                        httpdf = httpdf.append(tempdf)
                    if fname == 'notice.log':
                        noticedf = noticedf.append(tempdf)
                    if fname == 'signatures.log':
                        sigdf = sigdf.append(tempdf)
                    if fname == 'smtp.log':
                        smtpdf = smtpdf.append(tempdf)
                    if fname == 'ssl.log':
                        ssldf = ssldf.append(tempdf)
                    if fname == 'tunnel.log':
                        tunneldf = tunneldf.append(tempdf)
                    if fname == 'weird.log':
                        weirddf = weirddf.append(tempdf)
            except Exception as e:
                print "[*] error: %s, on %s/%s" % (str(e), dirName, fname)
In [26]:
#You can use these to save a copy of the raw dataframe, because reading in the files over-and-over again is awful
#pickle.dump(filesdf, open('files.dataframe', 'wb'))
filesdf = pickle.load(open('files.dataframe', 'rb'))
#pickle.dump(noticedf, open('notice.dataframe', 'wb'))
noticedf = pickle.load(open('notice.dataframe', 'rb'))

Well, it took a while to get the data read into the dataframes let's take a quick peek at what it looks like. If everything looks pretty, or at least how we'd expect we can move on with some analysis.

In [27]:
ts uid id.orig_h id.orig_p id.resp_h id.resp_p fuid file_mime_type file_desc proto note msg sub src dst p n peer_descr actions suppress_for
0 1.338423e+09 C2cBNO2DuqzRxvfac6 1068 80 FEdFLYNt00bHPa9sb application/x-dosexec tcp TeamCymruMalwareHashRegistry::Match Malware Hash Registry Detection rate: 33% Las... 80 - bro Notice::ACTION_LOG 3600 ...
0 1.336524e+09 Cbs89uRcl9HnSASEc 1104 80 FaeRAd49IO4E2V2ndb application/x-dosexec tcp TeamCymruMalwareHashRegistry::Match Malware Hash Registry Detection rate: 24% Las... 80 - bro Notice::ACTION_LOG 3600 ...
1 1.336524e+09 Cpcwv13LGlszc39oQc 1105 80 F2ZPuu3GQcfZNFezk application/x-shockwave-flash tcp TeamCymruMalwareHashRegistry::Match Malware Hash Registry Detection rate: 60% Las... 80 - bro Notice::ACTION_LOG 3600 ...

3 rows × 27 columns

In [28]:
ts fuid tx_hosts rx_hosts conn_uids source depth analyzers mime_type filename duration local_orig is_orig seen_bytes total_bytes missing_bytes overflow_bytes timedout parent_fuid md5
0 1.320786e+09 FDbQJR35HG4EfZ0nb4 C9KyD7n902u8uko9j HTTP 0 SHA1,MD5 text/html - 0.000024 - F 4305 - 0 0 F - 8b07497c411ce08842cf2145380d238e ...
1 1.320786e+09 Fhv2Jx2f5EiIs52jf4 C9KyD7n902u8uko9j HTTP 0 SHA1,MD5 text/plain - 0.000323 - F 3233 3233 0 0 F - db8f4e6949c0fc0fc9cadf85d02e099a ...
2 1.320786e+09 FUJQ4k1k9m9EUOiDL5 CaTFVw42sBoMSa9Zcl HTTP 0 SHA1,MD5 text/html - 0.000000 - F 2717 2717 0 0 F - b030b8e337724d4a2786041a38c0951f ...
3 1.320786e+09 F7p0uQ1a9czI7BIwW4 CgQksU3fAxqa2VzJmi HTTP 0 SHA1,MD5 text/html - 0.000000 - F 1567 1567 0 0 F - 644a4a82d7580c2cf9f97e13b0ee1ced ...
4 1.320786e+09 FitWMF3UfEasz2Vyql C9KyD7n902u8uko9j HTTP 0 SHA1,MD5 application/x-shockwave-flash - 1.356016 - F 1177469 1177469 0 0 F - c2045c6a5e95a99f6ebbc073e62894cd ...

5 rows × 24 columns

In [29]:
TeamCymruMalwareHashRegistry::Match    7212
Scan::Address_Scan                      105
SSL::Invalid_Server_Cert                105
dtype: int64

Everything checks out nicely, and we've got a super-high-level view of what kinds of alerts were generated by Bro. I have a feeling the Team Cymru MHR results will come in handy if we want to get a feel for what's being detected vs. what's sneaking past the goalie.

In [30]:
hashes = set()
def grab_hash(s):
    if 'virustotal' in s:
    return ''

throwaway = noticedf['sub'].map(grab_hash)
In [31]:
def box_plot_df_setup(series_a, series_b): 
    # Count up all the times that a category from series_a
    # matches up with a category from series_b. This is
    # basically a gigantic contingency table
    cont_table = collections.defaultdict(lambda : collections.Counter())
    for val_a, val_b in zip(series_a.values, series_b.values):
        cont_table[val_a][val_b] += 1
    # Create a dataframe
    # A dataframe with keys from series_a as the index, series_b_keys
    # as the columns and the counts as the values.
    dataframe = pd.DataFrame(cont_table.values(), index=cont_table.keys())
    dataframe.fillna(0, inplace=True)
    return dataframe

Everybody loves a good stacked bar graph. Here we can get a general feel for the data and there seem to be a lot of files! Apparently in some of the driveby activities some malware was (probably) run and grabbed something from an FTP site

In [32]:
ax = box_plot_df_setup(filesdf['source'], filesdf['mime_type']).T.plot(kind='bar', stacked=True)
pylab.ylabel('Number of Files')
patches, labels = ax.get_legend_handles_labels()
ax.legend(patches, labels, title="Service Type")
<matplotlib.legend.Legend at 0x130e284d0>
In [33]:
text/html     252401
text/plain    244182
image/jpeg    200075
image/gif     141428
image/png     116836
dtype: int64
In [34]:
print "Lots of files!"
print "Total # of files (across all samples): %s" %filesdf.shape[0]
print "Total # of unique files: %s" %len(filesdf['sha1'].unique())
print "Total # of network sessions involving files: %s" %len(filesdf['conn_uids'].unique())
print "Total # of unique mime_types: %s" %len(filesdf['mime_type'].unique())
print "Total # of unique filenames: %s" %len(filesdf['filename'].unique())
Lots of files!
Total # of files (across all samples): 1100106
Total # of unique files: 312453
Total # of network sessions involving files: 238211
Total # of unique mime_types: 46
Total # of unique filenames: 18324
In [35]:
# We can use some of the output from above and get rid of them and look are more exciting files
# Just an example, I don't think we'll do much with this data frame today
boring = set(['text/html','text/plain','image/jpeg','image/gif','image/png','application/xml','image/x-icon'])
exciting_filesdf = filesdf[filesdf['mime_type'].apply(lambda x: x not in boring)]
ts fuid tx_hosts rx_hosts conn_uids source depth analyzers mime_type filename duration local_orig is_orig seen_bytes total_bytes missing_bytes overflow_bytes timedout parent_fuid md5
4 1.320786e+09 FitWMF3UfEasz2Vyql C9KyD7n902u8uko9j HTTP 0 SHA1,MD5 application/x-shockwave-flash - 1.356016 - F 1177469 1177469 0 0 F - c2045c6a5e95a99f6ebbc073e62894cd ...
7 1.320786e+09 FAnnpf1X8a1w6JLtV9 CENibi3Z3bv3ZVlyqf HTTP 0 SHA1,MD5 application/x-dosexec - 0.895983 - F 342016 - 0 0 F - ae12a0a1d449b9f2816d20546c477c0b ...

2 rows × 24 columns

After getting a high-level look at the data and doing our Python and pandas stretches we're set to move on to some more intresting ways to view the data and how different properties relate to one another.

Instead of looking at the top 10 of this or that, let's see how we can examine the top N broken down by various categories.

We'll start off easy and look at the most popular protocols and then the most popular mime-types by count within those protocls. Followed by a view of the same data but with the restriction that Bro know something about the filename, and last but not least we can look at the most popular filenames within a mime-type within a protocol! </font>

In [36]:
filesdf['count'] = 1
filesdf[['source','mime_type','count']].groupby(['source','mime_type']).sum().sort('count', ascending=0).head(10)
source mime_type
HTTP text/html 252401
text/plain 244182
image/jpeg 200075
image/gif 141428
image/png 116836
binary 39964
application/x-dosexec 30456
application/x-shockwave-flash 17878
application/jar 14372
application/zip 12247

10 rows × 1 columns

In [37]:
# We can get a slightly different view if we look at percentages of files
# Wonder how accurate the percentages are vs. monitored network traffic?
filesdf.groupby('source')['mime_type'].apply(lambda x: pd.value_counts(x)/x.count().astype(float)).head(20)
FTP_DATA  application/x-java-applet            1.000000
HTTP      text/html                            0.229434
          text/plain                           0.221963
          image/jpeg                           0.181869
          image/gif                            0.128559
          image/png                            0.106205
          binary                               0.036328
          application/x-dosexec                0.027685
          application/x-shockwave-flash        0.016251
          application/jar                      0.013064
          application/zip                      0.011133
          application/xml                      0.006645
          application/pdf                      0.004922
          application/        0.002872
          application/x-java-applet            0.002678
          text/x-c                             0.002036
          application/octet-stream             0.001929
          text/x-asm                           0.001733
          application/x-elc                    0.000960
          application/    0.000865
dtype: float64
In [38]:
filesdf['count'] = 1
filesdf[filesdf['filename'] != '-'][['source','mime_type','count']].groupby(['source','mime_type']).sum().sort('count', ascending=0).head(10)
source mime_type
HTTP application/x-dosexec 17793
binary 16810
application/jar 8874
image/jpeg 4756
application/pdf 4672
application/zip 4247
image/png 847
image/gif 161
text/plain 120
text/html 80

10 rows × 1 columns

In [39]:
filesdf[filesdf['filename'] != '-'][['source','mime_type','filename','count']].groupby(['source','mime_type','filename']).sum().sort('count', ascending=0).head(20)
source mime_type filename
HTTP application/x-dosexec setup.exe 4471
binary setup.exe 1461
application/x-dosexec about.exe 1204
contacts.exe 1170
info.exe 1146
calc.exe 1101
readme.exe 1081
scandsk.exe 645
PluginInstall.exe 598
files/load1.exe 432
application/jar 938a99f1be85e19406438f9a572fdf71.jar 374
loading.jar 351
a03cb4b8e5bb148134a57a2c87ddacd9.jar 300
application/x-dosexec files/load2.exe 231
uplayermediaplayer-setup.exe 231
foto43.exe 188
binary ./files/cit_video.module 133
image/png ad516503a11cd5ca435acc9bb6523536.png 127
application/jar app.jar 112
application/x-dosexec 2.exe 108

20 rows × 1 columns

The takeaway seems to be that if you're going to cause a file to be downloaded by a user over HTTP (malicious or not, but likely malicious) you should really call it 'setup.exe' followed not-so closely by: 'about.exe', 'contacts.exe', 'info.exe', 'calc.exe', 'readme.exe'. And looking for a mime-type of 'binary' or 'application/x-dosexec' should do pretty well.

In [40]:
# Filenames with a '/' in them??
# Just some random exploring, wonder why these have a path associated w/them and not the rest? Questions for another day.
print filesdf[filesdf['filename'].str.contains('/')]['filename'].value_counts().head(10)
print filesdf[filesdf['filename'].str.contains('\.\.')]['filename'].value_counts()
files/load1.exe                     432
files/load2.exe                     231
./files/cit_video.module            133
./files/cit_ffcookie.module          60
users/leftunch/file/ractrupt.exe     44
files/load5.exe                      42
files/load3.exe                      37
./files/barman.png                   36
./files/up.bin                       26
users/root/file/file.exe             26
dtype: int64

./../load/12.exe                                                4
../admin/files/cit_video.module                                 2
Ray Rice had three TD\xe2\x80\x99s..jpeg                        2
../admin/files/cit_ffcookie.module                              2
../admin/files/gconfig8.dll                                     2
\xe1\xba\xa3nh h\xc3\xa0i h\xc3\xb3a ra ng\xc3\xa0nh c\xc6\xa1 kh\xc3\xad b\xc3\xa1ch khoa....jpg    2
../admin/files/webinjects/merged-1.txt                          2
Sexy Dr..jpg                                                    2
dtype: int64
In [41]:
# Lots of duplicate files Wonder what these look like?
28d6814f309ea289f847c69cf91194c6    8328
b4491705564909da7f9eaf749dbbfbb1    7527
cd2e0e43980a00fb6a2742d3afd803b8    5831
d9feff91276e487e595cc23f62d259bc    4740
325472601571f31e1bf00674c368d335    4320
dtype: int64
In [42]:
filesdf[filesdf['filename'] != '-'][['filename','mime_type','count']].groupby(['filename','mime_type']).sum().sort('count', ascending=0).head(10)
filename mime_type
setup.exe application/x-dosexec 4471
binary 1461
about.exe application/x-dosexec 1204
contacts.exe application/x-dosexec 1170
info.exe application/x-dosexec 1146
calc.exe application/x-dosexec 1101
readme.exe application/x-dosexec 1081
scandsk.exe application/x-dosexec 645
PluginInstall.exe application/x-dosexec 598
files/load1.exe application/x-dosexec 432

10 rows × 1 columns

In [43]:
filesdf[filesdf['filename'] != '-'][['filename','md5','count']].groupby(['filename','md5']).sum().sort('count', ascending=0).head(10)
filename md5
loading.jar 2d272e75e6be0d397dcc9b493936d873 238
setup.exe 7776d42f2e2167591de7321b18704e9a 134
b87668c676063c0f36f2c7faef6b7d3d 92
618ba1bbd7e537482f7e058419fa8a28 92
405f6ef1172501148ac36495780a69d0 78
938a99f1be85e19406438f9a572fdf71.jar e872a71a6060413c1abfa5e41839aa8d 72
ad516503a11cd5ca435acc9bb6523536.png 102345b9de00a7f5d7ee00f688ba68ab 72
938a99f1be85e19406438f9a572fdf71.jar 88f7ddd10321d1a035b05a7a0ca263f1 66
foto43.exe 0cd1f2bc16529f88a919830c87f180f7 66
Applet.jar 0e658e5217ea9f0d28e60ab491e716f2 63

10 rows × 1 columns

It seems (from this data) there is a fair amount of resuse of both filenames and actual samples. Perhaps these samples were of the more boring exploit kits that don't try to obfuscate each individual download. Oh well, at least we've got some easy potential IOCs to look for.

Now we've got some hunches, how can we back these up using the results of the MHR events we found earlier? </font>

In [44]:
# Lookup to see what the Bro - Team Cymru Malware Hash Registry picks up
def tc_mhr_present_single(sha1):
    if sha1 in hashes:
        return True
    return False

tempdf = filesdf
tempdf['count'] = 1
tempdf['mhr'] = tempdf['sha1'].map(tc_mhr_present_single)
# The following 2 Commands Print out the tables below
#tempdf[tempdf['filename'] != '-'][['mhr','filename','count']].groupby(['mhr','filename']).sum().sort('count', ascending=0).head(12)
#tempdf[tempdf['filename'] != '-'][['filename','mhr','count']].groupby(['filename','mhr']).sum().sort('count', ascending=0).head(12)

The tables below were converted to .png files for pretty display, but they're just screen-caps of the above commands

In [45]:
tempdf.groupby('mhr')['filename'].apply(lambda x: pd.value_counts(x)/len(tempdf.index))
False  -                                       0.940176
       setup.exe                               0.003344
       scandsk.exe                             0.000546
       PluginInstall.exe                       0.000544
       contacts.exe                            0.000528
       about.exe                               0.000525
       readme.exe                              0.000515
       calc.exe                                0.000509
       info.exe                                0.000499
       files/load1.exe                         0.000230
       a03cb4b8e5bb148134a57a2c87ddacd9.jar    0.000224
       uplayermediaplayer-setup.exe            0.000210
       938a99f1be85e19406438f9a572fdf71.jar    0.000204
       foto43.exe                              0.000157
       files/load2.exe                         0.000135
True  windows-update-sp4-kb66639-setup.exe    0.000001
      f13f6a19.jar                            0.000001
      34dcaa0c.jar                            0.000001
      7af76.pdf                               0.000001
      windows-update-sp3-kb74463-setup.exe    0.000001
      windows-update-sp2-kb66551-setup.exe    0.000001
      a37002b0.jar                            0.000001
      windows-update-sp2-kb88572-setup.exe    0.000001
      d1af818e.jar                            0.000001
      load/49.exe                             0.000001
      FIX_KB111703.exe                        0.000001
      windows-update-sp2-kb81951-setup.exe    0.000001
      c5d53367.jar                            0.000001
      2bba99f3.jar                            0.000001
      3.exe                                   0.000001
Length: 18400, dtype: float64
<SURPRISE> AV isn't perfect </SURPRISE>

At least we got that out of our systems we can keep moving foward and see what else we can discover!
In [46]:
# Number of files per sample
filesdf[['sample','count']].groupby(['sample']).sum().sort('count', ascending=0).head(10)
fffa22264e6ebebdfcc3c9526d8bc19c_20130313 14598
c352624409de5b479f26a904c29d413f_20130719 4936
ef3be687d525c85b00a91efdfda711e9_20121204 4928
51e711110049f0788bcac16b0acd558a_20130814 4674
e259f0c751b02bd2419538ddc6b99772_20140314 3496
d3819a56d678838f0e01fa2c637f3b1a_20111004 3480
7138d1cd36d383673844013d4eae97af_20130116 3312
ae3d23caa091be1d3df1107898ea5231_20110902 3292
a1da196baea54a71d34e6421b7a8f040_20130908 3210
0f0fd9ada450841ab1fd767d7387e489_20130803 3198

10 rows × 1 columns

In [47]:
filesdf[['conn_uids','count']].groupby(['conn_uids']).sum().sort('count', ascending=0).head(10)
C9KyD7n902u8uko9j 2916
CENibi3Z3bv3ZVlyqf 972
COhNOY1kIa174wPKK6 972
CCBO2p2JDrryA2k1gb 972
CaTFVw42sBoMSa9Zcl 972
CBiXTD427Vvn9DXymd 972
CN0i562AIvEXAXVdS6 972
CVANI54q1eA8PJEPNl 972
C9O3nw2UUAuRVkxqHe 972
CgQrNE10zxxoKNF1O7 972

10 rows × 1 columns

It's always fun when data causes more questions than it answers! What's with all the repeated 972s above? Maybe we're not looking at it correctly

In [48]:
filesdf[['sample','conn_uids','count']].groupby(['sample','conn_uids']).sum().sort('count', ascending=0).head(20)
sample conn_uids
fffa22264e6ebebdfcc3c9526d8bc19c_20130313 C9KyD7n902u8uko9j 2910
CVANI54q1eA8PJEPNl 970
C3lCz83lC3PGZKMUR9 970
C9O3nw2UUAuRVkxqHe 970
CBiXTD427Vvn9DXymd 970
CCBO2p2JDrryA2k1gb 970
CN0i562AIvEXAXVdS6 970
COhNOY1kIa174wPKK6 970
CENibi3Z3bv3ZVlyqf 970
Cf6EYC22yHQsulV7mc 970
CgQksU3fAxqa2VzJmi 970
CgQrNE10zxxoKNF1O7 970
CaTFVw42sBoMSa9Zcl 970
c352624409de5b479f26a904c29d413f_20130719 CFPUpI2zTBwvTHZWe5 468
C7HuQHe3tlEc740dk 450
42c918193f5f3fe008e1ad8007ee0a64_20120612 CNCa6Z2Ybjei8Cu0B 446
CJtgFl3ASjdriC7976 438
b1321a9f0dd6c48c296a07205557b969_20120201 CNzvnh3VdsIivzr2z8 346
CyP6oYOs7782sfNO1 310
9003a3423d9e3d05d96b4a09d8d64c61_20120131 CgDZVO1O4jpz7Rz3uj 292

20 rows × 1 columns

In [49]:
filesdf[filesdf['conn_uids'] == 'CVANI54q1eA8PJEPNl'].shape[0]
In [50]:
print filesdf[filesdf['conn_uids'] == 'CVANI54q1eA8PJEPNl']['sample'].unique()
print filesdf[filesdf['conn_uids'] == 'C9KyD7n902u8uko9j']['sample'].unique() 

Guess Bro conn uids weren't as unique as I thought they should be.

Let's move on, quickly, shall we!

Now that we have a good handle on the data and the successes/failurs of AV, what if we could figure out what samples and sessions had "interesting" combinations of file types. Could we build a better driveby detector?

In [51]:
# These are so we can pick up combinations of executable/persistent mime-types along with a mime-type that is frequently
# associated with exploits/drivebys
executable_types = set(['application/x-dosexec', 'application/octet-stream', 'binary', 'application/'])
common_exploit_types = set(['application/x-java-applet','application/pdf','application/zip','application/jar','application/x-shockwave-flash'])

# If there is at least one executable type and one exploit type in a list of mime-types
def intresting_combo_data(mimetypes):
    mt = set(mimetypes.tolist())
    et = set()
    cet = set()
    et = mt.intersection(executable_types)
    cet = mt.intersection(common_exploit_types)
    if len(et) > 0 and len(cet) > 0:
        return ":".join(cet) + ":" + ":".join(et)
    if len(et) > 0 and len(cet) == 0:
        return ":".join(et)
    if len(cet) > 0 and len(et) == 0:
        return ":".join(cet)
    return "NONE"

def intresting_combo_label(mimetypes):
    mt = set(mimetypes.tolist())
    et = set()
    cet = set()
    et = mt.intersection(executable_types)
    cet = mt.intersection(common_exploit_types)
    if len(et) > 0 and len(cet) > 0:
        return "C-c-c-combo"
    if len(et) > 0 and len(cet) == 0:
        return "Executable only"
    if len(cet) > 0 and len(et) == 0:
        return "Exploit only"
    return "NONE"

# Lookup to see what the Bro - Team Cymru Malware Hash Registry picks up
def tc_mhr_present(sha1):
    for h in set(sha1.tolist()):
        if h in hashes:
            return True
    return False
In [52]:
# Get the data in a list (Series) of <sample name> -> <nparray of mime-types>
sample_groups = filesdf.groupby('sample')
s = sample_groups['mime_type'].apply(lambda x: x.unique())

# Rebuild the series into a dataframe and then "collapse" the dataframe with a reset index
sample_combos = pd.DataFrame(s, columns=['mime_types'])
sample_combos['sample'] = s.index
sample_combos['combos_data'] =
sample_combos['combos_label'] =

# Add some more columns and reset the index
sample_combos['sha1'] = sample_groups['sha1'].apply(lambda x: x.unique())
sample_combos['num_files'] = sample_groups['sha1'].apply(lambda x: len(x))
sample_combos = sample_combos.reset_index(drop=True)
sample_combos['mhr'] = sample_combos['sha1'].map(tc_mhr_present)
# Now we have a nice "flat" dataframe that for each sample has a list of sha1s associated with it, along with mime-types
# associated with it. Including the list of interesting mime-type combinations and if any of the sha1 hashes were picked up in
# the MHR
mime_types sample combos_data combos_label sha1 num_files mhr
0 [text/html, text/plain, image/jpeg, applicatio... 00065098d5b9f76f15b7eafadc0fd262_20120531 application/zip:application/x-dosexec:binary C-c-c-combo [faa7c496d19abe7fdb0e46deeac899353a6d29f8, a37... 84 True
1 [text/html, image/png, text/plain, image/jpeg,... 0009e3e91e87268dfa577f4626072bab_20120615 application/zip:application/x-dosexec:binary C-c-c-combo [7f140fddab21ae35f577f3527aed9933d05b58d3, 32f... 66 False
2 [text/plain, text/html, image/jpeg, image/gif,... 001d54c3e3f4cb05bc7676c98d70d0ec_20120509 application/x-shockwave-flash:application/pdf:... C-c-c-combo [3e611f6de96f98df23791dc837aa62f0d627a366, 17a... 350 True
3 [text/html, text/plain, application/jar, appli... 0020e9fc0bf7538cdf75b0308c5a236c_20121230 application/x-shockwave-flash:application/jar:... C-c-c-combo [fad057c5e1022a584646517aa2a4ef2998937490, cda... 126 True
4 [text/html, text/plain, image/gif, application... 002487431d095f45222491166c16f2ae_20120121 application/x-shockwave-flash:application/jar:... C-c-c-combo [fea5705f0bfa0f45a6dac2dbc8ffa9106a7bc3f4, be0... 132 True

5 rows × 7 columns

With the nice table above we have a sample view of the unique mime-types, unique files, total # of files and if any of those files were detected by the MHR all rolled into once nice package.

Assumption Alert!
In the case of using sample (eg. PCAP) we can treat this as activity from a single host w/o worry about trying to unique out all the information about the sandbox env and it's IPs. </font>

In [53]:
print "Reminder"
print "Total files found: %s" %len(filesdf.index)
print "Total Samples: %s" %len(sample_combos.index)
print "\nData summary"
print sample_combos.combos_label.value_counts()
sample_combos['count'] = 1
sample_combos[['combos_label','combos_data','count']].groupby(['combos_label','combos_data']).sum().sort('count', ascending=0).head(15)
Total files found: 1100106
Total Samples: 6662

Data summary
C-c-c-combo        5410
Executable only    1218
Exploit only         25
NONE                  9
dtype: int64
combos_label combos_data
Executable only application/x-dosexec 801
C-c-c-combo application/jar:application/x-dosexec 403
application/zip:application/x-dosexec:binary 361
application/jar:application/x-dosexec:binary 352
Executable only application/x-dosexec:binary 305
C-c-c-combo application/jar:application/pdf:application/x-dosexec:binary 303
application/zip:application/pdf:application/x-dosexec:binary 274
application/zip:application/pdf:application/x-shockwave-flash:application/x-dosexec:binary 248
application/zip:application/pdf:application/x-shockwave-flash:application/x-dosexec 238
application/x-shockwave-flash:application/x-dosexec 228
application/zip:application/x-dosexec 184
application/zip:application/x-shockwave-flash:application/x-dosexec:binary 181
application/x-shockwave-flash:application/pdf:application/x-dosexec 172
application/x-shockwave-flash:application/jar:application/pdf:application/x-dosexec:binary 171
application/x-shockwave-flash:application/jar:application/x-dosexec:binary 168

15 rows × 1 columns

In [54]:
(100. * sample_combos.combos_data.value_counts() / len(sample_combos.index)).head(10)
application/x-dosexec                                           12.023416
application/jar:application/x-dosexec                            6.049234
application/zip:application/x-dosexec:binary                     5.418793
application/jar:application/x-dosexec:binary                     5.283699
application/x-dosexec:binary                                     4.578205
application/jar:application/pdf:application/x-dosexec:binary     4.548184
application/zip:application/pdf:application/x-dosexec:binary     4.112879
application/zip:application/pdf:application/x-shockwave-flash:application/x-dosexec:binary     3.722606
application/zip:application/pdf:application/x-shockwave-flash:application/x-dosexec     3.572501
application/x-shockwave-flash:application/x-dosexec              3.422396
dtype: float64
In [55]:
# Sorry for the lousy formatting, I really wanted (me) and you to see all the crazy combinations of files in some of these samples.
# Maybe they're multiple malicious sites, or maybe it's just some crazy spray-and-pray happening!
sample_mhr = sample_combos[sample_combos['mhr'] == True]
print "Total Samples: %s" %len(sample_mhr.index) 
print sample_mhr.combos_label.value_counts()
print (100. * sample_mhr.combos_data.value_counts() / len(sample_mhr.index)).head(10)
Total Samples: 2849

C-c-c-combo        2761
Executable only      85
Exploit only          3
dtype: int64

application/jar:application/x-dosexec                           9.582310
application/jar:application/pdf:application/x-dosexec:binary    7.230607
application/zip:application/pdf:application/x-shockwave-flash:application/x-dosexec:binary    7.195507
application/zip:application/pdf:application/x-shockwave-flash:application/x-dosexec    6.739207
application/zip:application/x-dosexec:binary                    6.598807
application/zip:application/pdf:application/x-dosexec:binary    5.440505
application/jar:application/x-dosexec:binary                    5.229905
application/x-shockwave-flash:application/pdf:application/x-dosexec    4.563005
application/x-shockwave-flash:application/pdf:application/x-dosexec:binary    3.580204
application/x-shockwave-flash:application/jar:application/pdf:application/x-dosexec:binary    3.545104
dtype: float64

We've got a great picture at how things could look coming from a host in a "short" amount of itme, wonder what it looks like when it's broken up at the network layer. Let's mimic the layout and some of the analysis from above and see what pops up.


In [56]:
# Get the data in a list (Series) of <sample name> -> <nparray of mime-types>
uid_groups = filesdf.groupby('conn_uids')
s = uid_groups['mime_type'].apply(lambda x: x.unique())

# Rebuild the series into a dataframe and then "collapse" the dataframe with a reset index
uid_combos = pd.DataFrame(s, columns=['mime_types'])
uid_combos['conn_uid'] = s.index
uid_combos['combos_data'] =
uid_combos['combos_label'] =

# Same trick, different day
uid_combos['sha1'] = uid_groups['sha1'].apply(lambda x: x.unique())
uid_combos['num_files'] = uid_groups['sha1'].apply(lambda x: len(x))
uid_combos = uid_combos.reset_index(drop=True)
uid_combos = uid_combos[uid_combos['conn_uid'] != '(empty)']
uid_combos['mhr'] = uid_combos['sha1'].map(tc_mhr_present)
# Now we've got the same type of dataframe as above in sample_combos.
mime_types conn_uid combos_data combos_label sha1 num_files mhr
1 [image/png, text/plain] C000io4MJcSJMUYhmb NONE NONE [d34f84ee29438bba5c353a9d20f3536badb38178, 577... 32 False
2 [image/jpeg] C001r8hbIKrnTGyHf NONE NONE [74d87f7b1ca35f84cd4f1a08314c2d2b8c2aa763, 59d... 4 False
3 [application/zip] C002bZ2fhBI7mYwQxf application/zip Exploit only [c17ac3fb131adc485b32bebb3bcad742a0a676d1] 2 False
4 [text/html] C007502KmyrlCzKHqb NONE NONE [36124e9da41313a4bc3a903abe5fccf0ea9c7736] 2 False
5 [image/gif] C009392KwetlBZpxC NONE NONE [75e91ae3e549dab12ed1c9787ade9131aef1c981] 2 False

5 rows × 7 columns

In [57]:
# Same deal as above, these 2 make the same tables as the images below

The tables below were converted to .png files for pretty display, but they're just screen-caps of the above commands

Stats bassed on session (conn_uid)
Stats based on sample (sample)
In [82]:
df_uid = pd.DataFrame()
df_uid['num_files'] = uid_combos['num_files']
df_uid['label'] = "session"
df_sample = pd.DataFrame()
df_sample['num_files'] = sample_combos['num_files']
df_sample['label'] = "sample"
df = pd.concat([df_sample, df_uid], ignore_index=True)
plt.pyplot.xlabel('Number of Files, Session v. Sample')
plt.pyplot.ylabel('# of Files')
plt.pyplot.title('Comparision of # Files')
<matplotlib.text.Text at 0x149bf4790>
In [83]:
print "Total connections: %s" %len(uid_combos.index)
#100. * uid_combos.combos.value_counts() / len(uid_combos.index)
uid_combos['count'] = 1
uid_combos[['combos_label','combos_data','count']].groupby(['combos_label','combos_data']).sum().sort('count', ascending=0).head(15)
Total connections: 238210
combos_label combos_data
NONE NONE 198045
Executable only application/x-dosexec 10395
binary 10136
Exploit only application/x-shockwave-flash 6035
application/jar 3276
application/zip 2570
application/x-java-applet 1483
application/pdf 1398
C-c-c-combo application/pdf:application/x-dosexec 1030
application/jar:application/x-dosexec 902
application/zip:application/x-dosexec 877
Executable only application/octet-stream 862
application/ 242
application/ 198
C-c-c-combo application/x-shockwave-flash:application/pdf:application/x-dosexec 186

15 rows × 1 columns

In [84]:
uid_mhr = uid_combos[uid_combos['mhr'] == True]
print "Total connections: %s" %len(uid_mhr.index)
print uid_mhr.combos_label.value_counts()
100. * uid_mhr.combos_data.value_counts() / len(uid_mhr.index)
Total connections: 6066

Executable only    2998
C-c-c-combo        1748
Exploit only       1320
dtype: int64

application/x-dosexec                                           49.010880
application/pdf:application/x-dosexec                            9.347181
application/jar:application/x-dosexec                            9.066930
application/x-shockwave-flash                                    8.127267
application/jar                                                  6.594131
application/zip:application/x-dosexec                            5.637982
application/x-java-applet                                        4.648863
application/x-shockwave-flash:application/pdf:application/x-dosexec     2.884932
application/pdf                                                  2.176063
application/x-shockwave-flash:application/x-dosexec              1.483680
application/x-dosexec:binary                                     0.395648
application/pdf:application/x-dosexec:binary                     0.263765
application/x-shockwave-flash:application/pdf                    0.214309
application/x-shockwave-flash:application/pdf:application/x-dosexec:binary     0.049456
application/jar:binary                                           0.049456
application/pdf:binary                                           0.016485
application/x-shockwave-flash:application/pdf:binary             0.016485
application/x-dosexec:application/octet-stream                   0.016485
dtype: float64

After a long slog through the data (some interesting, and some ... eh) we've got a few more questions we can answer. We'll stick to just the session view from now on, but these can just as easily be done above via the sample dataframe.

Great, AV was able to detect some things, but what is it missing? Can we find any relationships between what AV missed and what it found? </font>

In [85]:
uid_mhr = uid_combos[uid_combos['mhr'] != True]
print uid_mhr.combos_label.value_counts()
uid_mhr[uid_mhr['combos_label'] == 'C-c-c-combo']['combos_data'].value_counts()
NONE               198045
Executable only     18981
Exploit only        13496
C-c-c-combo          1622
dtype: int64

application/zip:application/x-dosexec                           535
application/pdf:application/x-dosexec                           463
application/jar:application/x-dosexec                           352
application/jar:binary                                          150
application/x-shockwave-flash:application/x-dosexec              35
application/x-shockwave-flash:binary                             33
application/pdf:binary                                           26
application/x-shockwave-flash:application/pdf:application/x-dosexec     11
application/zip:binary                                            4
application/zip:application/jar:application/x-dosexec             4
application/x-shockwave-flash:application/octet-stream            3
application/pdf:application/x-dosexec:binary                      3
application/x-shockwave-flash:application/x-dosexec:binary        1
application/zip:application/jar:binary                            1
application/x-java-applet:application/x-dosexec                   1
dtype: int64
In [86]:
combos = uid_combos[uid_combos['combos_label'] == 'C-c-c-combo'].shape[0]
print "% of MHR hits in sessions with an \"interesting\" file combination"
print uid_combos[uid_combos['combos_label'] == 'C-c-c-combo']['mhr'].value_counts().apply(lambda x: x/combos)
print "\n% of MHR hits from samples (end systems) with an \"interesting\" file combination"
combos = sample_combos[sample_combos['combos_label'] == 'C-c-c-combo'].shape[0]
print sample_combos[sample_combos['combos_label'] == 'C-c-c-combo']['mhr'].value_counts().apply(lambda x: x/combos)
% of MHR hits in sessions with an "interesting" file combination
True     0.518694
False    0.481306
dtype: float64

% of MHR hits from samples (end systems) with an "interesting" file combination
True     0.510351
False    0.489649
dtype: float64

Wait, what!?

Did we just figure out that (according to our samples) that if you're relying on AV to detect bad files instead of looking for a super easy pattern in network traffic you're missing out on 1/2 of the possible malware driveby downloads. Granted this requires an "interesting" file combination be present in both the same session or from the same source host, but wow. </font>

Assumption Alert!
Again, we believe this data to be composed of mostly driveby downloads, so we're only looking at labels within a known-malicous set. Looking for those interesting file combinations on the network may yield additional false-positives.


We were able to successfully dissect properties of files that traverse the network. Based on the properties the effectiveness of current solutions, proposed patterns to look for new and "un-known" attacks were discovered! Never underestimate the value of having better data. With this set, some insights were gained, but if the data came pre-labeled or we knew more about how it was collected perhaps we'd have more or different assumptions and gotten different results. </font>

In [ ]:

This website does not host notebooks, it only renders notebooks available on other websites.

Delivered by Fastly, Rendered by OVHcloud

nbviewer GitHub repository.

nbviewer version: d25d3c3

nbconvert version: 5.6.1

Rendered (Wed, 07 Dec 2022 04:33:16 UTC)