System logs are particularly challenging because they often lack any kind of structure at all. For this exercise we're going to be looking at the /var/log/system.log of a typical Mac OS X laptop. The first steps will be standard data inspection and plots; after that we'll pull out the big guns!
# This exercise is mostly for us to understand what kind of data we have and then
# run some simple stats on the fields/values in the data. Pandas will be great for that
import pandas as pd
pd.__version__
'0.13.0rc1-32-g81053f9'
# Set default figure sizes (assumes the notebook was started in pylab mode, e.g. %pylab inline)
pylab.rcParams['figure.figsize'] = (16.0, 5.0)
# Let's take a peek at our system.log, with IPython you
# can execute system commands with '!' (also see %alias)
!head /var/log/system.log
Jan 11 00:16:09 voltron kernel[0]: Wake reason: RTC (Alarm)
Jan 11 00:16:09 voltron kernel[0]: RTC: SleepService 2014/1/11 07:16:09, sleep 2014/1/11 06:16:16
Jan 11 00:16:09 voltron kernel[0]: AirPort_Brcm43xx::powerChange: System Wake - Full Wake/ Dark Wake / Maintenance wake
Jan 11 00:16:09 voltron kernel[0]: Previous Sleep Cause: 5
Jan 11 00:16:09 voltron kernel[0]: IOPPF: Sent gpu-internal-plimit-notification last value 0 (rounded time weighted average 0)
Jan 11 00:16:09 voltron kernel[0]: IOThunderboltSwitch<0xffffff8042864000>(0x0)::listenerCallback - Thunderbolt HPD packet for route = 0x0 port = 11 unplug = 0
Jan 11 00:16:09 voltron kernel[0]: IOThunderboltSwitch<0xffffff8042864000>(0x0)::listenerCallback - Thunderbolt HPD packet for route = 0x0 port = 12 unplug = 0
Jan 11 00:16:09 voltron kernel[0]: wlEvent: en0 en0 Link DOWN virtIf = 0
Jan 11 00:16:09 voltron kernel[0]: AirPort: Link Down on en0. Reason 8 (Disassociated because station leaving).
Jan 11 00:16:09 voltron kernel[0]: TBT W (2): 0x0100 [x]
# Okay, so: a weird timestamp, a machine name and then random stuff...
import dateutil.parser
# Helper function to make compact, composite labels for the syslog rows
def make_label(features):
    return unicode(':'.join([f[:6] for f in features]), 'utf-8').encode('ascii', 'ignore')
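For a quick sanity check, here's what make_label produces for the tokens of the first log line (each token truncated to 6 characters and joined with ':'; compare with the 'label' column in the table further down):

make_label(['kernel[0]:', 'Wake', 'reason:', 'RTC', '(Alarm)'])
'kernel:Wake:reason:RTC:(Alarm'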
# Now process the syslog file; part of the challenge is parsing the darn thing.
# Typically each row in a syslog will have a timestamp first and then some random
# set of other fields split on whitespace. We're going to carefully pull out the
# timestamp but then just treat everything else as a bag of tokens (sparse data).
date_index = []
with open('/var/log/system.log') as syslog:
    syslog_rows = syslog.readlines()
syslog_events = []
for row in syslog_rows:
    split_list = row.split()
    date_string = ' '.join(split_list[:3])  # the first three tokens are the timestamp
    try:
        date = dateutil.parser.parse(date_string)
    except (ValueError, TypeError):
        continue  # Going to skip rows where we can't get a timestamp
    features = split_list[4:]  # everything after the timestamp and machine name
    syslog_events.append({'features': features, 'label': make_label(features),
                          'type': features[0].split('[')[0]})
    date_index.append(date)
dataframe = pd.DataFrame(syslog_events, index=date_index)
# Make sure our timestamp got pulled/parsed/converted correctly
type(dataframe.index[0])
pandas.tslib.Timestamp
dataframe.head(5)
# So what do we have? Datetime is our index, 'features' is just the
# syslog row split on whitespace and placed into a list. The 'label'
# is just the features (truncated) flattened together with ':' separators.
# 'type' is the syslog identifier right after the datetime and machine name.
# Note: Parsing a syslog is more art than science and syslogs can vary
# greatly, but we haven't done any super magic here, just basically
# thrown the row data into some bins...
 | features | label | type |
---|---|---|---|
2014-01-11 00:16:09 | [kernel[0]:, Wake, reason:, RTC, (Alarm)] | kernel:Wake:reason:RTC:(Alarm | kernel |
2014-01-11 00:16:09 | [kernel[0]:, RTC:, SleepService, 2014/1/11, 07... | kernel:RTC::SleepS:2014/1:07:16::sleep:2014/1:... | kernel |
2014-01-11 00:16:09 | [kernel[0]:, AirPort_Brcm43xx::powerChange:, S... | kernel:AirPor:System:Wake:-:Full:Wake/:Dark:Wa... | kernel |
2014-01-11 00:16:09 | [kernel[0]:, Previous, Sleep, Cause:, 5] | kernel:Previo:Sleep:Cause::5 | kernel |
2014-01-11 00:16:09 | [kernel[0]:, IOPPF:, Sent, gpu-internal-plimit... | kernel:IOPPF::Sent:gpu-in:last:value:0:(round:... | kernel |
5 rows × 3 columns
dataframe.tail(5)
 | features | label | type |
---|---|---|---|
2014-01-11 14:03:10 | [Google, Chrome, Helper[9524]:, CoreText, Copy... | Google:Chrome:Helper:CoreTe:CopyFo:receiv:mig:... | |
2014-01-11 14:03:10 | [last, message, repeated, 1, time, ---] | last:messag:repeat:1:time:--- | last |
2014-01-11 14:03:10 | [kernel[0]:, SMC::smcReadKeyAction, ERROR, TC0... | kernel:SMC::s:ERROR:TC0D:kSMCBa:fKeyHa | kernel |
2014-01-11 14:03:40 | [last, message, repeated, 24, times, ---] | last:messag:repeat:24:times:--- | last |
2014-01-11 14:03:40 | [kernel[0]:, SMC::smcReadKeyAction, ERROR, TC0... | kernel:SMC::s:ERROR:TC0D:kSMCBa:fKeyHa | kernel |
5 rows × 3 columns
# We can get a count of all the unique values by running value_counts()
dataframe['label'].value_counts()
# Okay the breakout below gives us a sense for our challenge...
kernel:SMC::s:ERROR:TC0D:kSMCBa:fKeyHa    651
last:messag:repeat:24:times:---    539
last:messag:repeat:1:time:---    27
config:networ:change:v4(en0:v6(en0:DNS!:Proxy:SMB    26
Google:Chrome:Helper:Intern:of:CFAllo:not:known;:out-of:failur:via:CFAllo:will:not:result:in:termin:http:/    18
Google:Chrome:Helper:CGSLoo:Failed:to:look:up:the:port:for:"com.a:(1100)    17
Google:Chrome:Helper:Proces:unable:to:create:connec:becaus:the:sandbo:denied:the:right:to:lookup:com.ap:and:so:this:proces:cannot:talk:to:launch:::LSXPCC:#426:___ZN2:q=com.    17
Google:Chrome:Helper:Proces:unable:to:create:connec:becaus:the:sandbo:denied:the:right:to:lookup:com.ap:and:so:this:proces:cannot:talk:to:launch    17
kernel:IOThun:-:Thunde:HPD:packet:for:route:=:0x0:port:=:11:unplug:=:0    16
last:messag:repeat:2:times:---    14
kernel:IOThun:-:Thunde:HPD:packet:for:route:=:0x0:port:=:12:unplug:=:0    13
Google:Chrome:Helper:CoreTe:CopyFo:receiv:mig:IPC:error:(FFFFF:from:font:server    12
kernel:en0::BSSID:change:to:c8:d7:    11
Window:device:authw:0x0(0):shield:0x7fcd    10
kernel:AirPor:RSN:handsh:comple:on:en0    10
...
coreau:2014-0:08:30::AM:[AirPl:Sent:retran:to:192.16:(56:retran:0:futile    1
coreau:2014-0:01:23::PM:[AirPl:###:Start:taking:too:long..:asynch    1
kernel:efi:pageco:65    1
coreau:2014-0:01:23::PM:[AirPl:Quiesc:endpoi:'Downs    1
coreau:2014-0:12:57::PM:[AirPl:Quiesc:endpoi:'Downs    1
kernel:did:discar:act:47067:inact:201504:purgea:134512:spec:40403:cleane:2    1
coreau:2014-0:07:34::AM:[AirPl:Sent:retran:to:192.16:(43:retran:0:futile    1
kernel:[0x2b3:0x2700    1
coreau:2014-0:07:41::AM:[AirPl:Sent:retran:to:192.16:(45:retran:0:futile    1
kernel:CODE:SIGNIN:cs_inv:p=8615:final:status:0x0,:allow:(remov:VALID):page    1
kernel:hibern:prefli:pageCo:241516:est:comp:47:setfil:523491:min:107374    1
coreau:2014-0:12:57::PM:[AirPl:###:Report:networ:status:(2,:en0):failed:1/0x1:kCFHos:/:kCFStr:/:kCFStr:/:kCFStr:/:evtNot:/:siInit:/:kUSBPe:/:dsBusE:/:kStatu:/:kOTSer:/:cdevRe:/:EPERM    1
AirPla:2014-0:12:57::PM:[AirPl:AirPla:became:the:defaul:audio:device    1
sudo[8:brifor:::TTY=tt:;:PWD=/U:;:USER=r:;:COMMAN:8769    1
kernel:CODE:SIGNIN:cs_inv:p=8906:final:status:0x0,:allow:(remov:VALID):page    1
Length: 275, dtype: int64
# We realize that we don't want the 'last message repeated N times' events, so filter them out
dataframe = dataframe[dataframe['type'] != 'last']
# Now prepare the data for plotting by pivoting on the
# feature to create a new column (series) for each value
# Only pull out the top 7 syslog event types
topN = dataframe['type'].value_counts()[:7].index
subset = dataframe[dataframe['type'].isin(topN)]
print 'Subset: %d rows %d columns' % subset.shape
# We're going to add a new column called 'count' (needed for the pivot). This
# is a bit dorky, but needed as the new columns that get created should
# really have a value in them; it also gives us a value to sum over.
subset['count'] = 1
pivot = pd.pivot_table(subset, values='count', rows=subset.index, cols=['type'], fill_value=.01)
by = lambda x: lambda y: getattr(y, x)  # by('hour') pulls the hour attribute off each timestamp
grouped = pivot.groupby([by('hour')]).sum()
grouped.plot(kind='bar', stacked=True)
grouped.plot()
Subset: 1227 rows 3 columns
<matplotlib.axes._subplots.AxesSubplot at 0x10d4695d0>
Other popular clustering algorithms (a quick DBSCAN sketch follows the list):
- [DBSCAN](http://en.wikipedia.org/wiki/DBSCAN)
- [OPTICS Algorithm](http://en.wikipedia.org/wiki/OPTICS_algorithm)
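As a quick taste (our own illustration, not part of the original exercise), here's how DBSCAN could run on the same kind of token-list data using scikit-learn with a precomputed Jaccard distance matrix. The docs list, eps and min_samples values are purely illustrative, not tuned:

import numpy as np
from sklearn.cluster import DBSCAN

docs = [['a','b','c','d'], ['a','b','d'], ['w','x','y','z'], ['x','y','z'], ['q']]

def jaccard_distance(a, b):
    # Jaccard distance = 1 - Jaccard similarity over the token sets
    sa, sb = set(a), set(b)
    return 1.0 - float(len(sa & sb)) / len(sa | sb)

# Full pairwise distance matrix: fine for toy data, but for a real syslog
# you'd want MinHash candidates (below) instead of all N**2 pairs
dist = np.array([[jaccard_distance(a, b) for b in docs] for a in docs])
labels = DBSCAN(eps=0.5, min_samples=2, metric='precomputed').fit_predict(dist)
print labels  # [ 0  0  1  1 -1 ]  (-1 marks the noise point ['q'])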
# Even for small syslogs the number of similarity pairs to compute quickly
# becomes quite large O(N**2), so for 100k rows that's 10 billion possible
# pairs. Using Banded MinHash will drastically reduce the number of
# candidates that we have to compute.
import data_hacking.lsh_sims as lsh_sims
# Note: The parameters here are set up for feeding the results into a Hierarchical
# Clustering algorithm, which needs as many similarities as you can get.
# In general you'd use parameters like num_hashes:20, lsh_bands:5, lsh_rows:4.
# Note: lsh_bands*lsh_rows ^must^ equal num_hashes
params = {'num_hashes':20, 'lsh_bands':20, 'lsh_rows':1, 'drop_duplicates':True}
lsh = lsh_sims.LSHSimilarities(dataframe['features'], mh_params=params)
sims = lsh.batch_compute_similarities(distance_metric='jaccard', threshold=.2)
12694 (0.64% out of 1979649) pairs returned from MinHash
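Under the hood, banding works by bucketing rows on slices of their MinHash signatures: only rows that collide in at least one band ever become candidate pairs, which is where the O(N**2) savings come from. Here's a minimal, self-contained sketch of the idea (our own simplified illustration, not the data_hacking implementation; the salted-hash scheme is just for demonstration):

import random
from collections import defaultdict

NUM_HASHES, LSH_BANDS, LSH_ROWS = 20, 20, 1  # bands * rows must equal num_hashes
random.seed(42)
salts = [str(random.random()) for _ in xrange(NUM_HASHES)]

def minhash_signature(tokens):
    # One min-hash per (salted) hash function over the row's token set
    return [min(hash(salt + t) for t in set(tokens)) for salt in salts]

def candidate_pairs(docs):
    buckets = defaultdict(list)
    for idx, tokens in enumerate(docs):
        sig = minhash_signature(tokens)
        for band in xrange(LSH_BANDS):
            key = (band, tuple(sig[band * LSH_ROWS:(band + 1) * LSH_ROWS]))
            buckets[key].append(idx)
    pairs = set()
    for members in buckets.itervalues():
        pairs.update((a, b) for a in members for b in members if a < b)
    return pairs

With 20 bands of 1 row each (the settings above), a single shared min-hash is enough to put two rows in the same bucket, which is why these parameters hand lots of candidate pairs to the hierarchical clustering step.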
- Jaccard Index: a set-based similarity measure, J(A, B) = |A ∩ B| / |A ∪ B| (overlap between the sets of elements; order doesn't matter).
- Levenshtein Distance: based on the edit distance between the element sequences (so order matters).
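If you want to compute these without the LSH class, here's a minimal pure-Python sketch of our own (not the data_hacking implementation; the l_sim normalization is one plausible choice, consistent with the numbers printed below):

def jaccard_sim(a, b):
    # |A intersect B| / |A union B| over the token sets
    sa, sb = set(a), set(b)
    return float(len(sa & sb)) / len(sa | sb)

def levenshtein(a, b):
    # Classic dynamic-programming edit distance over the element lists
    prev = range(len(b) + 1)
    for i, x in enumerate(a):
        curr = [i + 1]
        for j, y in enumerate(b):
            curr.append(min(prev[j + 1] + 1,      # deletion
                            curr[j] + 1,          # insertion
                            prev[j] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def l_sim(a, b):
    # Normalized similarity: 1 - edit_distance / max(len)
    return 1.0 - float(levenshtein(a, b)) / max(len(a), len(b))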
# Lets look at the difference between Jaccard Similarity and Levenshtein Similarity
# So here similarity is a normalized measure of inverse distance...
print 'Jaccard Index (Sim): %f ' % lsh.jaccard_sim(['a','b','c','d'], ['a','b','d','c'])
print 'Levenshtein Distance: %f ' % lsh.levenshtein(['a','b','c','d'], ['a','b','d','c'])
print 'Levenshtein (Sim): %f ' % lsh.l_sim(['a','b','c','d'], ['a','b','d','c'])
Jaccard Index (Sim): 1.000000
Levenshtein Distance: 2.000000
Levenshtein (Sim): 0.500000
# One more example for intuition (also note they don't have to be the same size)
print 'Jaccard Index (Sim): %f ' % lsh.jaccard_sim(['a','b','c'], ['a','b','c','x'])
print 'Levenshtein Distance: %f ' % lsh.levenshtein(['a','b','c'], ['a','b','c','x'])
print 'Levenshtein (Sim): %f ' % lsh.l_sim(['a','b','c'], ['a','b','c','x'])
Jaccard Index (Sim): 0.750000
Levenshtein Distance: 1.000000
Levenshtein (Sim): 0.750000
# Okay now that we have the similarities between all the rows in our syslog
# we can start to investigate the results.
sims.sort(reverse=True)
print '*** Top sims ***'
sims[:10]
#sims[-10:]
*** Top sims ***
[(1.0, 277, 1186),
 (0.9705882352941176, 376, 1090),
 (0.9666666666666667, 375, 1089),
 (0.9629629629629629, 1382, 1400),
 (0.9629629629629629, 1355, 1400),
 (0.9629629629629629, 1355, 1382),
 (0.9629629629629629, 1343, 1400),
 (0.9629629629629629, 1343, 1382),
 (0.9629629629629629, 1343, 1355),
 (0.9629629629629629, 1334, 1400)]
print dataframe.iloc[376]['features']
['System', 'Events[8642]:', '.sdef', 'warning', 'for', 'type', "'text", '|', 'missing', 'value', '|', "any'", 'attribute', "'uniqueID'", 'of', 'class', "'XML", "element'", 'in', 'suite', "'XML", "Suite':", 'AppleScript', 'ID', 'references', 'may', 'not', 'work', 'for', 'this', 'property', 'because', 'its', 'type', 'is', 'not', 'NSNumber-', 'or', 'NSString-derived.']
print dataframe.iloc[1090]['features']
['System', 'Events[9028]:', '.sdef', 'warning', 'for', 'type', "'text", '|', 'missing', 'value', '|', "any'", 'attribute', "'uniqueID'", 'of', 'class', "'XML", "element'", 'in', 'suite', "'XML", "Suite':", 'AppleScript', 'ID', 'references', 'may', 'not', 'work', 'for', 'this', 'property', 'because', 'its', 'type', 'is', 'not', 'NSNumber-', 'or', 'NSString-derived.']
# The two feature sets should look quite similar (recall that this
# is just our syslog row split on white space and thrown into a list)
# So now for any row in our syslog we can see what rows are highly
# similar to that row.
query_item = ['Google', 'Chrome', 'Helper[11545]:', 'Process', 'unable', 'to', 'create', 'connection', 'because', 'the', 'sandbox', 'denied', 'the', 'right', 'to', 'lookup', 'com.apple.coreservices.launchservicesd', 'and', 'so', 'this', 'process', 'cannot', 'talk', 'to', 'launchservicesd.', ':', 'LSXPCClient.cp', '#426', '___ZN26LSClientToServerConnection21setupServerConnectionEiPK14__CFDictionary_block_invoke()', 'q=com.apple.main-thread']
lsh.top_N(query_item,dataframe['label'], 5)
[{'label': 'Google:Chrome:Helper:Proces:unable:to:create:connec:becaus:the:sandbo:denied:the:right:to:lookup:com.ap:and:so:this:proces:cannot:talk:to:launch:::LSXPCC:#426:___ZN2:q=com.', 'sim': 0.9629629629629629},
 {'label': 'Google:Chrome:Helper:Proces:unable:to:create:connec:becaus:the:sandbo:denied:the:right:to:lookup:com.ap:and:so:this:proces:cannot:talk:to:launch:::LSXPCC:#426:___ZN2:q=com.', 'sim': 0.9629629629629629},
 {'label': 'Google:Chrome:Helper:Proces:unable:to:create:connec:becaus:the:sandbo:denied:the:right:to:lookup:com.ap:and:so:this:proces:cannot:talk:to:launch:::LSXPCC:#426:___ZN2:q=com.', 'sim': 0.9629629629629629},
 {'label': 'Google:Chrome:Helper:Proces:unable:to:create:connec:becaus:the:sandbo:denied:the:right:to:lookup:com.ap:and:so:this:proces:cannot:talk:to:launch:::LSXPCC:#426:___ZN2:q=com.', 'sim': 0.9629629629629629},
 {'label': 'Google:Chrome:Helper:Proces:unable:to:create:connec:becaus:the:sandbo:denied:the:right:to:lookup:com.ap:and:so:this:proces:cannot:talk:to:launch:::LSXPCC:#426:___ZN2:q=com.', 'sim': 0.9629629629629629}]
# Note the query object doesn't have all the original features
query_item = ['Google', 'Chrome', 'Process', 'unable', 'to', 'sandbox']
lsh.top_N(query_item,dataframe['label'], 5)
[{'label': 'Google:Chrome:Helper:Proces:unable:to:create:connec:becaus:the:sandbo:denied:the:right:to:lookup:com.ap:and:so:this:proces:cannot:talk:to:launch', 'sim': 0.2727272727272727},
 {'label': 'Google:Chrome:Helper:Proces:unable:to:create:connec:becaus:the:sandbo:denied:the:right:to:lookup:com.ap:and:so:this:proces:cannot:talk:to:launch', 'sim': 0.2727272727272727},
 {'label': 'Google:Chrome:Helper:Proces:unable:to:create:connec:becaus:the:sandbo:denied:the:right:to:lookup:com.ap:and:so:this:proces:cannot:talk:to:launch', 'sim': 0.2727272727272727},
 {'label': 'Google:Chrome:Helper:Proces:unable:to:create:connec:becaus:the:sandbo:denied:the:right:to:lookup:com.ap:and:so:this:proces:cannot:talk:to:launch', 'sim': 0.2727272727272727},
 {'label': 'Google:Chrome:Helper:Proces:unable:to:create:connec:becaus:the:sandbo:denied:the:right:to:lookup:com.ap:and:so:this:proces:cannot:talk:to:launch', 'sim': 0.2727272727272727}]
We're using a bottom-up method (the image is flipped :). You simply sort the similarities and build your tree from the bottom: if B and C are the most similar pair you link them, then D/E, and so on until the tree is complete. The devil is definitely in the details of the implementation, so luckily we have a Python class that does it for us (a rough sketch of the idea follows).
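To make the bottom-up idea concrete, here's a minimal single-linkage sketch of our own (illustration only; the HCluster class used below handles the real tree construction). Sort the pairs by similarity, then walk down the list merging clusters with union-find, Kruskal style:

def single_linkage_merges(sims):
    """sims: list of (similarity, i, j) tuples; returns the merge order."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    merges = []
    for sim, i, j in sorted(sims, reverse=True):  # most similar pairs first
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # pairs already inside the same cluster add no new link
            parent[root_i] = root_j
            merges.append((sim, i, j))
    return merges

Each entry in the returned list corresponds to one internal node of the hierarchical tree, built from the bottom up.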
# A simple example for illustration
import data_hacking.hcluster as hcluster
toy_data = [['a','b','c','d'], ['a','b','d'], ['a','b','e','d'], ['a','b','f'],
            ['w','x','y','z'], ['x','y','z'], ['w','x','q','z','y'],
            ['r','s','t'], ['u','s','t']]
toy_lsh = lsh_sims.LSHSimilarities(toy_data, mh_params=params)
toy_sims = toy_lsh.batch_compute_similarities(distance_metric='jaccard', threshold=.2)
# Compute a hierarchical clustering from the similarity list
toy_h_clustering = hcluster.HCluster(toy_data)
toy_h_clustering.set_sim_method(toy_lsh.jaccard_sim)
toy_h_tree, toy_root = toy_h_clustering.sims_to_hcluster(toy_sims)
# Plot the hierarchical tree
toy_h_clustering.plot_htree(toy_h_tree, prog='dot')
10 (12.35% out of 81) pairs returned from MinHash
# Now we'll do the same thing for our syslog data
# Compute a hierarchical clustering from the similarity list
h_clustering = hcluster.HCluster(dataframe['features'])
h_clustering.set_sim_method(lsh.jaccard_sim)
h_tree, root = h_clustering.sims_to_hcluster(sims, agg_sim=.3)
# Here's an image for those folks not looking at this interactively.
# Even better, save the graph out to JSON format and visualize it with D3 (d3js.org)
import networkx.readwrite.json_graph
import json
graph_json = networkx.readwrite.json_graph.tree_data(h_tree, root=root)
json.dump(graph_json, open('h_tree.json','w'))
To run the visualization in your web browser (we're using port 9999 instead of the standard 8888 because IPython may already be running on 8888):

> cd data_hacking/fun_with_syslog
> python -m SimpleHTTPServer 9999 &

Now point your browser at http://localhost:9999/syslog_vis.html
We'd really like to improve the interactive D3-based hierarchical tree visualization ([D3: Data-Driven Documents](http://d3js.org)).
We pulled some syslog data into a Pandas dataframe, made some plots, computed row similarities with banded MinHash, and used single-linkage clustering to build an agglomerative hierarchical cluster. Lots of possibilities from here: you could use just the LSH data structure, or the H-Cluster...