This is a small colab built to demonstrate how to interact with Timesketch from colab and do some additional exploration of the data.
Colab can greatly complement investigations by giving the analyst the power of Python to manipulate the data stored in Timesketch. It also lets developers experiment with the data in order to speed up development of analyzers, aggregators and graphs. The purpose of this colab is simply to briefly introduce these capabilities to analysts and developers, in the hope of inspiring more people to take advantage of this powerful platform. Using a Jupyter notebook instead of colab is also an option; both are equally valid.
Each code cell (denoted by the [] and grey color) can be run simply by hitting "shift + enter" inside it. The first code cell you execute will automatically connect you to a public colab runtime and to the publicly available demo Timesketch server. You can easily add new code cells, or modify the code that is already there, to experiment.
If you simply click the connect button in the upper right corner you will connect to a kernel runtime running in the cloud. This is a great way to explore what colab has to offer and provides a quick way to play with the demo data.
However, if you want to connect to your own Timesketch instance, load data from a local drive, or avoid having the data read into a cloud machine, then it is better to run from a local runtime environment. Install Jupyter on your machine and follow the guidelines posted here. These instructions are also available from the pop-up that appears when you select a local runtime.
Once you have your local runtime setup you should be able to reach your local Timesketch instance.
You cannot save changes to this colab document. If you want your own copy of the colab to make changes or do some other experimentation, simply select "File / Save a Copy in Drive" to make your own copy and start making changes.
Let's start by installing the TS API client... all commands that start with ! are executed in the shell, so if you are missing Python packages you can install them with pip.
This is not needed if you are running a local kernel that already has the library installed.
!pip install --upgrade timesketch-api-client
Just a gentle reminder that the cell below is a code cell, so it needs to be executed (you can see the "play" button next to it)
# @title Import Libraries
# @markdown We first need to import libraries that we will use throughout the colab.
import altair as alt # For graphing.
import numpy as np # Never know when this will come in handy.
import pandas as pd # We will be using pandas quite heavily.
from timesketch_api_client import config
from timesketch_api_client import search
(notice that the cell above is an actual code cell, it is just using formatting to look nice. If you want to see the code behind it, select the cell, and click the three dots and select "Form | Show Code")
And now we can start creating a timesketch client. The client is the object used to connect to the TS server and provides the API to interact with it.
The TS API consists of several objects, each with its own purpose. Some of them are:
client: the gateway to TS. That includes authenticating to TS, keeping a session to interact with the REST API, and providing functions that allow you to create new sketches, get a sketch, and work with search indices.
Let's start by getting a TS client object. There are multiple ways of getting one, but the easiest is to use the configuration object. That automates most of the actions that are needed, and prompts the user with questions if data is missing (reading information from a configuration file to fill in the blanks).
The first time you request the client it will ask you questions (since there will be no configuration file yet). For this demonstration we are going to use the demo server, so we will use the following configs:
Keep in mind that after answering these questions for the first time, the configuration files ~/.timesketchrc and ~/.timesketch.token will be saved so you don't need to answer them again.
ts_client = config.get_client(confirm_choices=True)
And now we can start to explore. The first thing is to get all the sketches that are available. Most of the operations you want to do with TS are available in the sketch API.
sketches = ts_client.list_sketches()
Now that we've got a list of all available sketches, let's print out the names of the sketches as well as their index into the list, so that we can more easily choose a sketch that interests us.
for i, sketch in enumerate(sketches):
    print(f'[{i}] <sketch ID: {sketch.id} | {sketch.name} - {sketch.description}')
Another way is to create a dictionary where the keys are the names of the sketches and the values are the sketch objects.
sketch_dict = dict((x.name, x) for x in sketches)
sketch_dict
Let's now take a closer look at some of the data we've got in the "Greendale" investigation.
gd_sketch = sketch_dict.get('Greendale', ts_client.get_sketch(1))
Now that we've connected to a sketch we can do all sorts of things.
Try doing: gd_sketch.<TAB>
In colab you can use TAB completion to get a list of all attributes of the object you are working with. See a function you may want to call? Try calling it with gd_sketch.function_name? and hit enter... let's look at an example:
gd_sketch.list_saved_searches?
This way you'll get a list of all the parameters you may want or need to use. You can also use tab completion as soon as you start typing: gd_sketch.e<TAB> will give you all options that start with an e, etc.
You can also type gd_sketch.list_saved_searches(<TAB>) and get a pop-up with a list of the parameters this function accepts.
Now let's look at some things we can do with the sketch object and the TS client. For example, if we want to get all starred events in the sketch, we can do that by querying the sketch for available labels. You can think of a label as a "sketch specific tag": unlike a tag, which is stored in the Elastic document and therefore shared among all sketches that have the same timeline attached, a label is bound to the actual sketch and not available outside of it. Labels are used in various places, most notably to indicate which events have comments, are hidden from views, or are starred. One such pre-defined label is __ts_star, which marks starred events; we will use it below.
Let's for instance look at all starred events in the Greendale index. We will use the parameter as_pandas=True, which means the events will be returned as a pandas DataFrame. This is a very flexible object that we will use throughout this colab. We will introduce some basic operations on a pandas object; for more details there are plenty of guides online. One way to think about pandas is to think about spreadsheets or databases, where the data is stored in a table (data frame) that consists of columns and rows, and operations work on either a column or a row.
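If pandas is new to you, here is a minimal, self-contained sketch of the kind of column and row operations we will rely on later. The data here is made up purely for illustration, it is not Timesketch data.
toy_frame = pd.DataFrame({
    'domain': ['example.com', 'example.com', 'other.org'],
    'count': [3, 5, 1],
})
toy_frame.shape                    # (rows, columns) -> (3, 2)
toy_frame['domain'].unique()       # column operation: unique values in a column
toy_frame[toy_frame['count'] > 2]  # row operation: filter rows by a condition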
But let's start by looking at one such data frame, by looking for all starred events.
There are two ways of doing that: either by using the search object (preferred) or via the sketch object (soon to be deprecated).
Once we get the data frame back, we call data_frame.shape, which returns a tuple with two items: the number of rows and the number of columns. That way we can assess the size of the dataframe.
starred_events = gd_sketch.search_by_label('__ts_star', as_pandas=True)
starred_events.shape
Let's look at how to achieve the same using the search object.
search_obj = search.Search(gd_sketch)
label_chip = search.LabelChip()
label_chip.use_star_label()
search_obj.add_chip(label_chip)
search_obj.query_string = '*'
starred_events = search_obj.table
starred_events.shape
As you noticed, there are quite a few starred events... to limit the output, let's look at just the first 10.
starred_events.head(10)
Or a single one...
pd.set_option('display.max_colwidth', 100) # this is just meant to make the output wider
starred_events.iloc[9]
To continue let's look at what searches have been stored in the sketch:
saved_searches = gd_sketch.list_saved_searches()
for index, saved_search in enumerate(saved_searches):
    print('[{0:d}] {1:s}'.format(index, saved_search.name))
You can then start to query the API to get back results from these saved searches. Let's try one of them...
Word of caution: try to limit your search so that you don't get too many results back. The API will happily return as many results as you ask for, but the more records you request, the longer the API call will take (10k events per API call).
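As a quick illustration of capping results (a sketch only; the search object and max_entries are covered in more detail later in this colab), you could do something like:
# Hypothetical example: cap a query at 1,000 events before fetching the results.
capped_search = search.Search(gd_sketch)
capped_search.query_string = 'tag:"phishy-domain"'
capped_search.max_entries = 1000  # keeps the number of API calls small
capped_search.table.shape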
# You can change this number if you would like to test out another saved search.
# The way the code works is that it first checks if you set "saved_search_text", and uses that to pick a saved search, otherwise the number is used.
saved_search_id = 1
saved_search_text = 'Phishy Domains'

if saved_search_text:
    for index, saved_search in enumerate(saved_searches):
        if saved_search.name == saved_search_text:
            saved_search_id = index
            break

print('Fetching data from : {0:s}'.format(saved_searches[saved_search_id].name))
print(' Query used : {0:s}'.format(
    saved_searches[saved_search_id].query_string if saved_searches[saved_search_id].query_string else saved_searches[saved_search_id].query_dsl))
If you want to issue this query, run the cell below; otherwise you can change saved_search_id (or saved_search_text) above to try another one.
greendale_frame = saved_searches[saved_search_id].table
One thing you may notice is that throughout this colab we will use the ".table" property of the search object. That means that the data we get back is a pandas DataFrame that we can start exploring.
Let's start with seeing how many entries we got back.
greendale_frame.shape
This tells us that we got back only 40 records... and that's because we are using a saved search that limited the number of records returned. Let's confirm that:
saved_search = saved_searches[saved_search_id]
saved_search.query_filter
You can see that the size of the return value is only 40 entries... but the log entry before told us that there were 2240 entries to be gathered, so let's increase that.
Warning: since this is a saved search, the API will attempt to update the actual saved search on the backend, but the demo user you are using is not allowed to change it, so a RuntimeError will be raised. Don't worry, you can still change the value locally.
try:
    saved_search.max_entries = 4000
except RuntimeError:
    pass
greendale_frame = saved_search.table
greendale_frame.shape
This tells us that the search returned 2,284 events with 12 columns. Let's explore the first few entries, just so that we can wrap our head around what we got back.
This is a great way to get a feeling for what the returned data looks like. To see the first five entries we can use the .head(5) function, and likewise for the last entries we can use .tail(5).
greendale_frame.head(5)
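And, as mentioned above, the last five entries:
greendale_frame.tail(5)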
Let's look at what columns we got back... and maybe create a slice that contains fewer columns, or at least just the ones that we want to see.
greendale_frame.columns
Since this is a result from the analyzers, we have a few extra fields we can pull in.
Looking at the results you see the same column names as in the UI, but when you click an event you'll notice that it has a lot more fields than the default view shows. This can also be changed in the API client. For that we use the variable return_fields. Let's set that one.
return_fields: List of fields that should be included in the response.
We can use that to specify which fields we would like to get back. Let's add a few more fields (you can see what fields are available in the UI).
search_obj = saved_searches[saved_search_id]
try:
    search_obj.return_fields = 'datetime,timestamp_desc,tag,message,label,url,domain,human_readable,access_count,title,domain_count,search_string'
except RuntimeError:
    pass

try:
    search_obj.max_entries = 10000
except RuntimeError:
    pass
greendale_frame = search_obj.table
greendale_frame.head(4)
Let's briefly look at these events.
OK... since this is the phishy domain analyzer, and all the results we got back are essentially from that analyzer, let's look at a few things. First of all, let's look at the tags that are available.
Let's start with a simple method: convert the tags, which are currently lists, into strings and then find the unique strings.
greendale_frame.tag.str.join('|').unique()
Then we can do this slightly differently; this time we want to collect all the different tags into a set.
tags = set()

def add_tag(tag_list):
    list(map(tags.add, tag_list))

greendale_frame.tag.apply(add_tag)
print(tags)
Let's go over the code above to understand what just happened.
First of all a set is created, called tags; since this is a set it cannot contain duplicates (duplicates are ignored).
Then we define a function that accepts a list and applies the function tags.add to every item in the list (the map function). This means that for each entry in the supplied tag_list the function tags.add is called.
Then finally we take the dataframe greendale_frame and call the apply function on the series tag. That takes the column (or series) tag, which contains the lists of applied tags, and for each row in the data frame applies the function add_tag that we created.
This code effectively does the following: for each row in greendale_frame, it extracts the tag list and applies the add_tag function, which adds every tag in that list to the tags set.
This gives us a final set that contains exactly one copy of each tag that was applied to the records.
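As an aside, if you are running a reasonably recent pandas (0.25 or later, which provides Series.explode), an equivalent and more idiomatic sketch is to explode the tag lists into one row per tag and then de-duplicate:
# Alternative sketch: one row per tag, drop empty rows, keep unique values.
set(greendale_frame.tag.explode().dropna().unique())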
Looking at the results from the tags, we do see some outside-active-hours tags. Let's look at those specifically. What does that mean? It means that the timeframe analyzer determined that the browsing activity occurred outside the regular hours of the timeline it analyzed.
greendale_frame[greendale_frame.tag.str.join(',').str.contains('outside-active-hours')].domain.value_counts()
OK... now we get to see all the domains that the domain analyzer considered to be potentially "phishy"... is there a domain that stands out? What about that grendale one?
greendale_frame[greendale_frame.domain == 'grendale.xyz'][['datetime', 'url']]
OK... this seems odd... let's look at a few things: the human_readable strings as well as the URLs...
grendale = greendale_frame[greendale_frame.domain == 'grendale.xyz']

string_set = set()
for string_list in grendale.human_readable:
    new_list = [x for x in string_list if 'phishy_domain' in x]
    _ = list(map(string_set.add, new_list))

for entry in string_set:
    print('Human readable string is: {0:s}'.format(entry))

print('')
print('Counts for URL connections to the grendale domain:')
grendale_count = grendale.url.value_counts()
for index in grendale_count.index:
    print('[{0:d}] {1:s}'.format(grendale_count[index], index))
We can start doing a lot more now if we want to... let's look at when these things occurred...
grendale_array = grendale.url.unique()
greendale_frame[greendale_frame.url.isin(grendale_array)]
OK... we can then start to look at surrounding events.... let's look at one date in particular... "2015-08-29 12:21:06"
search_obj = search.Search(gd_sketch)
date_chip = search.DateIntervalChip()
# Let's set the date
date_chip.date = '2015-08-29T12:21:06'
# And now how much time we want before and after.
date_chip.before = 1
date_chip.after = 1
# and the unit, we want minutes.. so that is m
date_chip.unit = 'm'
search_obj.query_string = '*'
search_obj.add_chip(date_chip)
search_obj.return_fields = 'message,human_readable,datetime,timestamp_desc,source_short,data_type,tags,url,domain'
data = search_obj.table
And now we can start to look at the results:
data[['datetime', 'message', 'human_readable', 'url']].head(4)
Let's find the grendale entries and look at events just two seconds before/after:
data[(data.datetime > '2015-08-29 12:21:04') & (data.datetime < '2015-08-29 12:21:08')][['datetime', 'message', 'timestamp_desc']]
Timesketch also has aggregation capabilities that we can call from the client. Let's take a quick look.
Start by checking out whether there are any stored aggregations that we can just take a look at.
You can also store your own aggregations using the gd_sketch.store_aggregation function. However, we are not going to do that in this colab.
[(x.id, x.name, x.title, x.description) for x in gd_sketch.list_aggregations()]
OK, so there are some aggregations stored. Let's just pick one of those to take a closer look at.
aggregation = gd_sketch.get_aggregation(24)
Now we've got an aggregation object that we can take a closer look at.
aggregation.description
OK, so from the name we can guess what it contains. We can also look at all of the stored aggregations:
pd.DataFrame([{'name': x.name, 'description': x.description} for x in gd_sketch.list_aggregations()])
Let's look at the aggregation visually, both as a table and a chart.
aggregation.table
aggregation.chart
The chart there is empty, since the aggregation didn't contain a chart.
We can also take a look at what aggregators can be used, if we want to run our own custom aggregator.
gd_sketch.list_available_aggregators()
Now we can see that there are at least the "field_bucket" and "query_bucket" aggregators that we can look at. The field_bucket one is a terms bucket aggregation, which means we can take any field in the dataset and aggregate on it.
So if we want to, for instance, see the top 20 domains that were visited, we can just ask for an aggregation of the field domain and limit it to 20 records (which will be the top 20). Let's do that:
aggregator = gd_sketch.run_aggregator(
    aggregator_name='field_bucket',
    aggregator_parameters={'field': 'domain', 'limit': 20, 'supported_charts': 'barchart'})
Now we've got an aggregation object that we can take a closer look at... let's look at the data it stored. What we were trying to get out was the top 20 domains that were visited.
aggregator.table
Or we can look at this visually... as a chart
aggregator.chart
We can also do something a bit more complex. The other aggregator, query_bucket, works in a similar way, except you can filter the results first. We want to aggregate all the domains that have been tagged with the phishy domain tag.
tag_aggregator = gd_sketch.run_aggregator(
    aggregator_name='query_bucket',
    aggregator_parameters={
        'field': 'domain',
        'query_string': 'tag:"phishy-domain"',
        'supported_charts': 'barchart',
    }
)
Let's look at the results.
tag_aggregator.table
We can also look at all the tags in the timeline: what tags have been applied and how frequently they occur.
gd_sketch.run_aggregator(
    aggregator_name='field_bucket',
    aggregator_parameters={
        'field': 'tag',
        'limit': 10,
    }
).table
And then let's see which applications were most frequently executed on the machine.
Since not all of the execution events have the same fields, we'll have to create a few tables here... let's start by looking at what data types there are.
gd_sketch.run_aggregator(
    aggregator_name='query_bucket',
    aggregator_parameters={
        'field': 'data_type',
        'query_string': 'tag:"browser-search"',
        'supported_charts': 'barchart',
    }
).table
And then we can do a summary for each one.
gd_sketch.run_aggregator(
    aggregator_name='query_bucket',
    aggregator_parameters={
        'field': 'domain',
        'query_string': 'tag:"browser-search"',
        'supported_charts': 'barchart',
    }
).table
agg = gd_sketch.run_aggregator(
    aggregator_name='query_bucket',
    aggregator_parameters={
        'field': 'search_string',
        'query_string': 'tag:"browser-search"',
        'supported_charts': 'hbarchart',
    }
)
agg.table
Or as a chart
agg.chart
Let's do a search to look at login entries...
search_obj = search.Search(gd_sketch)
search_obj.query_string = 'tag:"logon-event"'
search_obj.max_entries = 500000
search_obj.return_fields = (
    'datetime,timestamp_desc,human_readable,message,tag,event_identifier,hostname,record_number,'
    'recovered,strings,username,strings_parsed,logon_type,logon_process,windows_domain,'
    'source_username,user_id,computer_name')
login_data = search_obj.table
This will produce quite a few events... let's look at how many.
login_data.shape
Let's look at usernames....
login_data.username.value_counts()
Let's also look at what Windows domains were used:
login_data.windows_domain.value_counts()
And the logon types:
login_data.logon_type.value_counts()
login_data.computer_name.value_counts()
Let's graph... and you can then interact with the graph... try zooming in, etc.
First we'll define a graph function that we can then call with parameters...
def GraphLogins(data_frame, machine_name=None):
    if machine_name:
        data_slice = data_frame[data_frame.computer_name == machine_name]
        title = 'Accounts Logged In - {0:s}'.format(machine_name)
    else:
        data_slice = data_frame
        title = 'Accounts Logged In'

    data_grouped = data_slice[['username', 'datetime']].groupby('username', as_index=False).count()
    data_grouped.rename(columns={'datetime': 'count'}, inplace=True)

    return alt.Chart(data_grouped, width=400).mark_bar().encode(
        x='username', y='count',
        tooltip=['username', 'count']
    ).properties(
        title=title
    ).interactive()
Start by graphing all machines
GraphLogins(login_data)
Or we can look at this for a particular machine:
GraphLogins(login_data, 'Student-PC1.internal.greendale.edu')
Or we can look at this as a scatter plot...
First we'll define a function that munges the data for us. This function will essentially graph all logins per day in a scatter plot, using colors to denote the count value.
This graph will be very interactive... try selecting a time period by clicking on the upper graph and dragging out a selection.
login_data['day'] = login_data['datetime'].dt.strftime('%Y-%m-%d')

def GraphScatterLogin(data_frame, machine_name=''):
    if machine_name:
        data_slice = data_frame[data_frame.computer_name == machine_name]
        title = 'Accounts Logged In - {0:s}'.format(machine_name)
    else:
        data_slice = data_frame
        title = 'Accounts Logged In'

    login_grouped = data_slice[['day', 'computer_name', 'username', 'message']].groupby(['day', 'computer_name', 'username'], as_index=False).count()
    login_grouped.rename(columns={'message': 'count'}, inplace=True)

    brush = alt.selection_interval(encodings=['x'])
    click = alt.selection_multi(encodings=['color'])
    color = alt.Color('count:Q')

    chart1 = alt.Chart(login_grouped).mark_point().encode(
        x='day',
        y='username',
        color=alt.condition(brush, color, alt.value('lightgray')),
    ).properties(
        title=title,
        width=600
    ).add_selection(
        brush
    ).transform_filter(
        click
    )

    chart2 = alt.Chart(login_grouped).mark_bar().encode(
        x='count',
        y='username',
        color=alt.condition(brush, color, alt.value('lightgray')),
        tooltip=['count'],
    ).transform_filter(
        brush
    ).properties(
        width=600
    ).add_selection(
        click
    )

    return chart1 & chart2
OK, let's start by graphing for all logins...
GraphScatterLogin(login_data)
And now just for the Student-PC1
GraphScatterLogin(login_data, 'Student-PC1.internal.greendale.edu')
And now it is your turn to shine: experiment with Python, pandas, the graphing library and other data science techniques.
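As one possible starting point (a sketch only, reusing the login_data frame and the day column created above), you could chart how many logon events occurred per day:
# Sketch: count logon events per day and plot them as an interactive bar chart.
events_per_day = login_data['day'].value_counts().rename_axis('day').reset_index(name='count')
alt.Chart(events_per_day).mark_bar().encode(
    x='day',
    y='count',
    tooltip=['day', 'count']
).properties(
    title='Logon events per day',
    width=600
).interactive()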