Please ensure that MSTICPy is installed before continuing with this notebook.
The nbinit module loads required libraries and optionally installs require packages.
from pathlib import Path
from IPython.display import display, HTML
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
REQ_PYTHON_VER = "3.6"
REQ_MSTICPY_VER = "1.0.0"
REQ_MP_EXTRAS = ["ml"]
display(HTML("<h3>Starting Notebook setup...</h3>"))
import msticpy as mp
mp.init_notebook(namespace=globals());
from urllib.request import urlretrieve
from pathlib import Path
from tqdm.auto import tqdm
github_uri = "https://raw.githubusercontent.com/Azure/Azure-Sentinel-Notebooks/master/{file_name}"
github_files = {
"exchange_admin.pkl": "src/data",
"processes_on_host.pkl": "src/data",
"timeseries.pkl": "src/data",
"data_queries.yaml": "src/data",
}
Path("./src/data").mkdir(exist_ok=True, parents=True)
for file, path in tqdm(github_files.items(), desc="File download"):
file_path = Path(path).joinpath(file)
url_path = f"{path}/{file}" if path else file
urlretrieve(
github_uri.format(file_name=url_path),
file_path
)
assert Path(file_path).is_file()
The starting point is ingesting data to analyze.
MSTICpy contains a number of query providers that let you query and return data from several different sources.
Below we are using the LocalData query provider to return data from sample files.
Data is returned in a Pandas DataFrame for easy manipulation and to provide a common interface for other features in MSTICpy.
Here we are getting a summary of our network traffic for the time period we are interested in.
query_range = nbwidgets.QueryTime(
origin_time=pd.Timestamp("2020-07-13 00:00:00"),
before=1,
units="week"
)
query_range
This query fetches the total number of bytes send outbound on the network, grouped by hour.
The input to the Timeseries analysis needs to be in the form of:
# Initialize the data provider and connect to our Splunk instance.
qry_prov = QueryProvider("LocalData", data_paths=["./src/data"], query_paths=["./src/data"])
qry_prov.connect()
ob_bytes_per_hour = qry_prov.Network.get_network_summary(query_range)
md("Sample data:", "large")
ob_bytes_per_hour.head(3)
Below we use MSTICpy's time series analysis
machine learning capabilities to identify anomalies in our network traffic for further investigation.
As well as computing anomalies we visualize the data so that we can more easily see where these anomalies present themselves.
from msticpy.analysis.timeseries import display_timeseries_anomalies
from msticpy.analysis.timeseries import timeseries_anomalies_stl
# Conduct our timeseries analysis
ts_analysis = timeseries_anomalies_stl(ob_bytes_per_hour)
# Visualize the timeseries and any anomalies
display_timeseries_anomalies(data=ts_analysis, y= 'TotalBytesSent');
md("We can see two clearly anomalous data points representing unusual outbound traffic.<hr>", "bold")
max_score, min_score = ts_analysis.score.max(), ts_analysis.min()
ts_analysis[ts_analysis["anomalies"] == 1]
We can extract the start and end times of anomalous events and use this more-focused time range to query for unusual activity in this period.
Note: if more than one anomalous period is indicated we can use
msticpy.analysis.timeseries.extract_anomaly_periods()
function to isolate time blocks around the anomalous periods.
# Identify when the anomalies occur so that we can use this timerange
# to scope the next stage of our investigation.
# Add a 1 hour buffer around the anomalies
start = ts_analysis[ts_analysis['anomalies']==1]['TimeGenerated'].min() - pd.to_timedelta(1, unit='h')
end = ts_analysis[ts_analysis['anomalies']==1]['TimeGenerated'].max() + pd.to_timedelta(1, unit='h')
# md and md_warn are MSTICpy features to provide simple, and clean output in notebook cells
md("Anomalous network traffic detected between:", "large")
md(f"Start time: <b>{start}</b><br>End time: <b>{end}</b><hr>",)
We would take these start and end times to zero in on which machines were responsible for the anomalous traffic. Once we find them we can use other techniques to analyze what's going on on these hosts.
You can use the msticpy query function MultiDataSource.get_timeseries_anomalies
on most Microsoft Sentinel tables to do this summarization directly.
Three examples are shown below.
start = pd.Timestamp("2020-09-01T00:00:00")
end = pd.Timestamp("2020-09-301T00:00:00")
# Sent bytes by hour (default) from Palo Alto devices
time_series_net_bytes = qry_prov.MultiDataSource.get_timeseries_decompose(
start=start,
end=end,
table="CommonSecurityLog",
timestampcolumn="TimeGenerated",
aggregatecolumn="SentBytes",
groupbycolumn="DeviceVendor",
aggregatefunction="sum(SentBytes)",
where_clause='|where DeviceVendor=="Palo Alto Networks"',
)
# Sign-in failure count in AAD
time_series_signin_fail = qry_prov.MultiDataSource.get_timeseries_anomalies(
table="SigninLogs",
start=start,
end=end,
timestampcolumn="TimeGenerated",
aggregatecolumn="AppDisplayName",
groupbycolumn="ResultType",
aggregatefunction="count(AppDisplayName)",
where_clause='| where ResultType in (50126, 50053, 50074, 50076)',
add_query_items='| project-rename Total=AppDisplayName',
)
# Number of distinct processes by hour
time_series_procs = qry_prov.MultiDataSource.get_timeseries_anomalies(
table="SecurityEvent",
start=start,
end=end,
timestampcolumn="TimeGenerated",
aggregatecolumn="DistinctProcesses",
groupbycolumn="Account",
aggregatefunction="dcount(NewProcessName)",
where_clause="| where Computer='myhost.domain.con'",
)
# Then submit to ts anomalies decomposition
ts_analysis = timeseries_anomalies_stl(time_series_procs)
# Visualize the timeseries and any anomalies
display_timeseries_anomolies(data=ts_analysis, y='Total');
Sifting through thousands of events from a host is tedious in the extreme. We want to find a better way of identifying suspicious clusters of activity.
Query the data and do some initial analysis of the results
print("Getting process events...", end="")
processes_on_host = qry_prov.WindowsSecurity.list_host_processes(
query_range, host_name="MSTICAlertsWin1"
)
md("Initial analysis of data set", "large, bold")
md(f"Total processes in data set <b>{len(processes_on_host)}</b>")
for column in ("Account", "NewProcessName", "CommandLine"):
md(f"Total distinct {column} in data <b>{processes_on_host[column].nunique()}</b>")
md("<hr>")
md("Try grouping by distinct Account, Process, Commandline<br> - we still have 1000s of rows!", "large")
display(
processes_on_host
.groupby(["Account", "NewProcessName", "CommandLine"])
[["TimeGenerated"]]
.count()
.rename(columns={"TimeGenerated": "Count"})
)
We want to find atypical commands being run and see if they are associated with
the same user or time period
It is tedious to do repeated queries grouping on different attributes of events.
Instead we can specify features that we are interested in grouping around and use
clustering, a form of unsupervised learning, to group the data.
A challenge when using simple grouping is that commands (commandlines) may vary slightly but are essentially repetitions of the same thing (e.g. contain dynamically-generated GUIDs or other temporary data).
We can extract features of the commandline rather than using it in its raw form.
Using clustering we can add arbitrarily many features to group on. Here we are using the following features:
Note: A downside to clustering is that text features (usually) need to be transformed from a string
into a numeric representation.
from msticpy.sectools.eventcluster import dbcluster_events, add_process_features, char_ord_score
from collections import Counter
print(f"Input data: {len(processes_on_host)} events")
print("Extracting features...", end="")
feature_procs = add_process_features(input_frame=processes_on_host, path_separator="\\")
feature_procs["accountNum"] = feature_procs.apply(
lambda x: char_ord_score(x.Account), axis=1
)
print(".", end="")
# you might need to play around with the max_cluster_distance parameter.
# decreasing this gives more clusters.
cluster_columns = ["commandlineTokensFull", "pathScore", "accountNum", "isSystemSession"]
print("done")
print("Clustering...", end="")
(clus_events, dbcluster, x_data) = dbcluster_events(
data=feature_procs,
cluster_columns=cluster_columns,
max_cluster_distance=0.0001,
)
print("done")
print("Number of input events:", len(feature_procs))
print("Number of clustered events:", len(clus_events))
print("Merging with source data and computing rarity...", end="")
# Join the clustered results back to the original process frame
procs_with_cluster = feature_procs.merge(
clus_events[[*cluster_columns, "ClusterSize"]],
on=["commandlineTokensFull", "accountNum", "pathScore", "isSystemSession"],
)
# Compute Process pattern Rarity = inverse of cluster size
procs_with_cluster["Rarity"] = 1 / procs_with_cluster["ClusterSize"]
# count the number of processes for each logon ID
lgn_proc_count = (
pd.concat(
[
processes_on_host.groupby("TargetLogonId")["TargetLogonId"].count(),
processes_on_host.groupby("SubjectLogonId")["SubjectLogonId"].count(),
]
).sum(level=0)
).to_dict()
print("done")
# Display the results
md("<br><hr>Sessions ordered by process rarity", "large, bold")
md("Higher score indicates higher number of unusual processes")
process_rarity = (procs_with_cluster.groupby(["SubjectUserName", "SubjectLogonId"])
.agg({"Rarity": "mean", "TimeGenerated": "count"})
.rename(columns={"TimeGenerated": "ProcessCount"})
.reset_index())
display(
process_rarity
.sort_values("Rarity", ascending=False)
.style.bar(subset=["Rarity"], color="#d65f5f")
)
# get the logon ID of the rarest session
rarest_logonid = process_rarity[process_rarity["Rarity"] == process_rarity.Rarity.max()].SubjectLogonId.iloc[0]
# extract processes with this logonID
sample_processes = (
processes_on_host
[processes_on_host["SubjectLogonId"] == rarest_logonid]
[["TimeGenerated", "CommandLine"]]
.sort_values("TimeGenerated")
)[5:25]
# compute duration of session
duration = sample_processes.TimeGenerated.max() - sample_processes.TimeGenerated.min()
md(f"{len(sample_processes)} processes executed in {duration.total_seconds()} sec", "bold")
display(sample_processes)
md("Note: '[PLACEHOLDER]' in the CommandLine values replaces the password value.")
We have narrowed down the task of sifting through > 20,000 processes to a few 10s and have them grouped into sessions ordered by the relative rarity of the process patterns
You can use this technique on other datasets where you want to group by multiple features of the data.
The caveat is that you need to transform any non-numeric data field into a numeric form.
msticpy has a few built-in functions to help with this:
from msticpy.sectools import eventcluster
# This will group similar names together (e.g. "Administrator" and "administrator")
my_df["account_num"] = eventcluster.char_ord_score_df(data=my_df, column="Account")
# This will create a distinct hash for even minor differences in the input.
# This might be useful to detect imperfectly faked UA strings.
my_df["ua_hash"] = eventcluster.crc32_hash_df(data=my_df, column="UserAgent")
# This will return the number of delimiter chars in the string - often a
# a good proxy for the structure of an input while ignoring variable text values
# e.g.
# "https://my.dom.com/path1?u1=163.174.4.23" will produce the same score as
# "https://www.contoso.com/azure?srcdom=moon.base.alpha.com"
# but
# "curl https://www.contoso.com/top/next?u=23 > ~/my_page"
# "curl https://www.contoso.com/top_next?u=2.3 > ~/my_page.sh"
# will produce different values despite the similarity of the strings.
# Note - you can override the default delimiter list of " \-/.,"'|&:;%\$()]"
my_df["request_struct"] = eventcluster.delim_count_df(my_df, column="Request")
You can use a combination of these and other functions on the same fields to measure different aspects of the data. For example, the following takes a hash of the browser version of the UA (user agent) string and a structural count of the delimiters used.
Use the ua_pref_hash
and ua_delims
to cluster on identical browser versions that
have the same UA string
my_df["ua_prefix"] = data=my_df["UserAgent"].str.split(")")[-1])
my_df["ua_pref_hash"] = eventcluster.crc32_hash_df(data=my_df, column="ua_prefix")
my_df["ua_delims"] = eventcluster.delim_count_df(data=my_df, column="UserAgent")
The anomalous_sequence MSTICPy package uses Markov Chain analysis to predict the probability
that a particular sequence of events will occur given what has happened in the past.
Here we're applying it to Office activity.
query = """
| where TimeGenerated >= ago(60d)
| where RecordType_s == 'ExchangeAdmin'
| where UserId_s !startswith "NT AUTHORITY"
| where UserId_s !contains "prod.outlook.com"
| extend params = todynamic(strcat('{"', Operation_s, '" : ', tostring(Parameters_s), '}'))
| extend UserId = UserId_s, ClientIP = ClientIP_s, Operation = Operation_s
| project TimeGenerated= Start_Time_t, UserId, ClientIP, Operation, params
| sort by UserId asc, ClientIP asc, TimeGenerated asc
| extend begin = row_window_session(TimeGenerated, 20m, 2m, UserId != prev(UserId) or ClientIP != prev(ClientIP))
| summarize cmds=makelist(Operation), end=max(TimeGenerated), nCmds=count(), nDistinctCmds=dcount(Operation),
params=makelist(params) by UserId, ClientIP, begin
| project UserId, ClientIP, nCmds, nDistinctCmds, begin, end, duration=end-begin, cmds, params
"""
exchange_df = qry_prov.Azure.OfficeActivity(add_query_items=query)
print(f"Number of events {len(exchange_df)}")
exchange_df.drop(columns="params").head()
The analysis groups events into sessions (time-bounded and linked by a common account). It then
builds a probability model for the types of command (E.g. "SetMailboxProperty")
and the parameters and parameter values used for that command.
I.e. how likely is it that a given user would be running this sequence of commands in a logon session?
Using this probability model, we can highlight sequences that have an extremely low probability, based
on prior behaviour.
from msticpy.analysis.anomalous_sequence.utils.data_structures import Cmd
from msticpy.analysis.anomalous_sequence import anomalous
def process_exchange_session(session_with_params, include_vals):
new_ses = []
for cmd in session_with_params:
c = list(cmd.keys())[0]
par = list(cmd.values())[0]
new_pars = set()
if include_vals:
new_pars = dict()
for p in par:
if include_vals:
new_pars[p['Name']] = p['Value']
else:
new_pars.add(p['Name'])
new_ses.append(Cmd(name=c, params=new_pars))
return new_ses
sessions = exchange_df.cmds.values.tolist()
param_sessions = []
param_value_sessions = []
for ses in exchange_df.params.values.tolist():
new_ses_set = process_exchange_session(session_with_params=ses, include_vals=False)
new_ses_dict = process_exchange_session(session_with_params=ses, include_vals=True)
param_sessions.append(new_ses_set)
param_value_sessions.append(new_ses_dict)
data = exchange_df
data['session'] = sessions
data['param_session'] = param_sessions
data['param_value_session'] = param_value_sessions
modelled_df = anomalous.score_sessions(
data=data,
session_column='param_value_session',
window_length=3
)
anomalous.visualise_scored_sessions(
data_with_scores=modelled_df,
time_column='begin', # this will appear on the x-axis
score_column='rarest_window3_likelihood', # this will appear on the y axis
window_column='rarest_window3', # this will represent the session in the tool-tips
source_columns=['UserId', 'ClientIP'], # specify any additional columns to appear in the tool-tips
)
The events are shown in descending order of likelihood (vertically), so the
events at the bottom of the chart are the ones most interesting to us.
Looking at these rare events, we can see potentially suspicious activity changing role memberships.
pd.set_option("display.html.table_schema", False)
likelihood_max=modelled_df["rarest_window3_likelihood"].max()
likelihood_min=modelled_df["rarest_window3_likelihood"].min()
slider_step = (likelihood_max - likelihood_min) / 20
start_val = likelihood_min + slider_step
threshold = widgets.FloatSlider(
description="Select likelihood threshold",
max=likelihood_max,
min=likelihood_min,
value=start_val,
step=start_val,
layout=widgets.Layout(width="60%"),
style={"description_width": "200px"},
readout_format=".7f"
)
def show_rows(change):
thresh = change["new"]
pd_disp.update(modelled_df[modelled_df["rarest_window3_likelihood"] < thresh])
threshold.observe(show_rows, names="value")
md("Move the slider to see event sessions below the selected <i>likelihood</i> threshold", "bold")
display(HTML("<hr>"))
display(threshold)
display(HTML("<hr>"))
md(f"Range is {likelihood_min:.7f} (min likelihood) to {likelihood_max:.7f} (max likelihood)<br><br><hr>")
pd_disp = display(
modelled_df[modelled_df["rarest_window3_likelihood"] < start_val],
display_id=True
)
Note for many events the output will be long
import pprint
rarest_events = (
modelled_df[modelled_df["rarest_window3_likelihood"] < threshold.value]
[[
"UserId", "ClientIP", "begin", "end", "param_value_session", "rarest_window3_likelihood"
]]
.rename(columns={"rarest_window3_likelihood": "likelihood"})
.sort_values("likelihood")
)
for idx, (_, rarest_event) in enumerate(rarest_events.iterrows(), 1):
md(f"Event {idx}", "large")
display(pd.DataFrame(rarest_event[["UserId", "ClientIP", "begin", "end", "likelihood"]]))
md("<hr>")
md("Param session details:", "bold")
for cmd in rarest_event.param_value_session:
md(f"Command: {cmd.name}")
md(pprint.pformat(cmd.params))
md("<hr><br>")