Microsoft Sentinel Notebooks and MSTICPy

Examples of machine learning techniques in Jupyter notebooks

Author: Ian Hellen
Co-Authors: Pete Bryan, Ashwin Patil

Released: 26 Oct 2020

Notebook Setup

Please ensure that MSTICPy is installed before continuing with this notebook.

The nbinit module loads required libraries and optionally installs any missing packages.

In [1]:
from pathlib import Path
from IPython.display import display, HTML
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

REQ_PYTHON_VER = "3.6"
REQ_MSTICPY_VER = "1.0.0"
REQ_MP_EXTRAS = ["ml"]


display(HTML("<h3>Starting Notebook setup...</h3>"))
  
from msticpy.nbtools import nbinit
nbinit.init_notebook(namespace=globals());

Starting Notebook setup...

Exception reporting mode: Verbose
Note: you may need to scroll down this cell to see the full output.

Starting notebook pre-checks...

Checking Python kernel version...
Recommended: switch to using the 'Python 3.8 - AzureML' notebook kernel if this is available.
Info: Python kernel version 3.7.11 OK
Checking msticpy version...
Info: msticpy version 1.0.0 OK
Running install to ensure extras are installed...

Running pip install --no-input --quiet msticpy[ml]>=1.0.0 - this may take a few moments...
Note: you may need to restart the kernel to use updated packages.

Notebook pre-checks complete.


Starting Notebook initialization...

msticpy version installed: 1.5.0rc3 latest published: 1.4.5
Latest version is installed.

Processing imports....
Imported: pd (pandas), IPython.get_ipython, IPython.display.display, IPython.display.HTML, IPython.display.Markdown, widgets (ipywidgets), pathlib.Path, plt (matplotlib.pyplot), matplotlib.MatplotlibDeprecationWarning, sns (seaborn), np (numpy), msticpy, msticpy.data.QueryProvider, msticpy.nbtools.foliummap.FoliumMap, msticpy.common.utility.md, msticpy.common.utility.md_warn, msticpy.common.wsconfig.WorkspaceConfig, msticpy.datamodel.pivot.Pivot, msticpy.datamodel.entities, msticpy.vis.mp_pandas_plot
Checking configuration....
Azure CLI credentials available.
Setting notebook options....

Notebook initialization complete

Retrieve sample data files

In [2]:
from urllib.request import urlretrieve
from pathlib import Path
from tqdm.auto import tqdm

github_uri = "https://raw.githubusercontent.com/Azure/Azure-Sentinel-Notebooks/master/{file_name}"
github_files = {
    "exchange_admin.pkl": "data",
    "processes_on_host.pkl": "data",
    "timeseries.pkl": "data",
    "data_queries.yaml": "data",
}

Path("data").mkdir(exist_ok=True)
for file, path in tqdm(github_files.items(), desc="File download"):
    file_path = Path(path).joinpath(file)
    print(file_path, end=", ")
    url_path = f"{path}/{file}" if path else file
    urlretrieve(
        github_uri.format(file_name=url_path),
        file_path
    )
    assert Path(file_path).is_file()
File download: 100%|██████████| 4/4 [00:03<00:00,  1.26it/s]
data\exchange_admin.pkl, data\processes_on_host.pkl, data\timeseries.pkl, data\data_queries.yaml

Time Series Analysis

Query network data

The starting point is ingesting data to analyze.

MSTICpy contains a number of query providers that let you query and return data from several different sources.

Below we are using the LocalData query provider to return data from sample files.

Data is returned in a Pandas DataFrame for easy manipulation and to provide a common interface for other features in MSTICpy.
Here we are getting a summary of our network traffic for the time period we are interested in.

In [3]:
query_range = nbwidgets.QueryTime(
    origin_time=pd.Timestamp("2020-07-13 00:00:00"),
    before=1,
    units="week"
)
query_range

This query fetches the total number of bytes sent outbound on the network, grouped by hour.

The input to the Timeseries analysis needs to be in the form of:

  • a datetime index (in a regular interval like an hour or day)
  • a scalar value used to determine anomalous values based on periodicity
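That input shape can be sketched with plain pandas (synthetic values; the column name is borrowed from the sample data used in this notebook):

```python
import numpy as np
import pandas as pd

# A regular hourly datetime index covering one week (synthetic example)
idx = pd.date_range(
    "2020-07-06", periods=7 * 24, freq="h", tz="UTC", name="TimeGenerated"
)
rng = np.random.default_rng(42)
ts_input = pd.DataFrame(
    {"TotalBytesSent": rng.integers(8_000, 16_000, size=len(idx))}, index=idx
)
# A regular interval plus a single scalar column is the shape
# the time series analysis functions expect.
print(ts_input.shape)
```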
In [4]:
# Initialize the data provider and load the local sample data.
qry_prov = QueryProvider("LocalData", data_paths=["./data"], query_paths=["./data"])
qry_prov.connect()

ob_bytes_per_hour = qry_prov.Network.get_network_summary(query_range)
md("Sample data:", "large")
ob_bytes_per_hour.head(3)
Connected.

Sample data:

Out[4]:
TotalBytesSent
TimeGenerated
2020-07-06 00:00:00+00:00 10823
2020-07-06 01:00:00+00:00 14821
2020-07-06 02:00:00+00:00 13532

Using Timeseries decomposition to detect anomalous network activity

Below we use MSTICpy's time series analysis machine learning capabilities to identify anomalies in our network traffic for further investigation.
As well as computing the anomalies, we visualize the data so that we can more easily see where they occur.
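The core idea (decompose the series into a baseline, then score the residuals) can be sketched without MSTICPy. This is a simplified stand-in using a rolling median in place of a full seasonal-trend (STL) decomposition, not MSTICPy's actual implementation; column names mirror the output shown later:

```python
import pandas as pd

def simple_anomalies(series: pd.Series, window: int = 24, threshold: float = 3.0) -> pd.DataFrame:
    """Flag points whose residual from a rolling baseline exceeds `threshold` sigmas."""
    baseline = series.rolling(window, center=True, min_periods=1).median()
    residual = series - baseline
    score = residual / residual.std(ddof=0)
    return pd.DataFrame(
        {
            "value": series,
            "baseline": baseline,
            "score": score,
            "anomalies": (score.abs() > threshold).astype(int),
        }
    )

# Flat synthetic traffic with one injected spike
idx = pd.date_range("2020-07-06", periods=168, freq="h", tz="UTC")
traffic = pd.Series(10_000.0, index=idx)
traffic.iloc[114] = 48_000.0  # the anomalous hour
result = simple_anomalies(traffic)
print(result[result["anomalies"] == 1].index.tolist())
```

Only the injected spike is flagged; everything else scores near zero.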

In [5]:
from msticpy.nbtools.timeseries import display_timeseries_anomolies
from msticpy.analysis.timeseries import timeseries_anomalies_stl

# Conduct our timeseries analysis
ts_analysis = timeseries_anomalies_stl(ob_bytes_per_hour)
# Visualize the timeseries and any anomalies
display_timeseries_anomolies(data=ts_analysis, y="TotalBytesSent");

md("We can see two clearly anomalous data points representing unusual outbound traffic.<hr>", "bold")
Loading BokehJS ...

We can see two clearly anomalous data points representing unusual outbound traffic.


View the summary events marked as anomalous

In [6]:
max_score, min_score = ts_analysis.score.max(), ts_analysis.score.min()
ts_analysis[ts_analysis["anomalies"] == 1]
Out[6]:
TimeGenerated TotalBytesSent residual trend seasonal weights baseline score anomalies
114 2020-07-10 18:00:00+00:00 48616 16383 21598 10633 1 32232 6.220873 1
115 2020-07-10 19:00:00+00:00 45856 15949 21373 8532 1 29906 6.055974 1

Extract the anomaly period

We can extract the start and end times of anomalous events and use this more-focused time range to query for unusual activity in this period.

Note: if more than one anomalous period is indicated, we can use the
msticpy.analysis.timeseries.extract_anomaly_periods() function to isolate time blocks around each anomalous period.
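The grouping logic can be sketched directly in pandas. This is a simplified stand-in for extract_anomaly_periods (the function name and merge behavior here are an illustration, not MSTICPy's exact implementation):

```python
import pandas as pd

def anomaly_periods(times, max_gap: pd.Timedelta, buffer: pd.Timedelta):
    """Merge anomalous timestamps separated by <= max_gap into (start, end)
    periods, padded by `buffer` on each side."""
    times = sorted(pd.to_datetime(times))
    periods = []
    start = prev = times[0]
    for t in times[1:]:
        if t - prev > max_gap:
            # Gap too large: close the current period and start a new one
            periods.append((start - buffer, prev + buffer))
            start = t
        prev = t
    periods.append((start - buffer, prev + buffer))
    return periods

hits = ["2020-07-10 18:00", "2020-07-10 19:00", "2020-07-12 03:00"]
periods = anomaly_periods(hits, max_gap=pd.Timedelta(hours=2), buffer=pd.Timedelta(hours=1))
for start, end in periods:
    print(start, "->", end)
```

The two adjacent hits on 2020-07-10 merge into a single buffered period; the isolated hit two days later becomes its own period.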

In [7]:
# Identify when the anomalies occur so that we can use this timerange
# to scope the next stage of our investigation.
# Add a 1 hour buffer around the anomalies
start = ts_analysis[ts_analysis["anomalies"] == 1]["TimeGenerated"].min() - pd.to_timedelta(1, unit="h")
end = ts_analysis[ts_analysis["anomalies"] == 1]["TimeGenerated"].max() + pd.to_timedelta(1, unit="h")

# md and md_warn are MSTICpy helpers that provide simple, clean output in notebook cells
md("Anomalous network traffic detected between:", "large")
md(f"Start time: <b>{start}</b><br>End time: <b>{end}</b><hr>",)

Anomalous network traffic detected between:

Start time: 2020-07-10 17:00:00+00:00
End time: 2020-07-10 20:00:00+00:00


Time Series Conclusion

We would use these start and end times to zero in on the machines responsible for the anomalous traffic. Once we find them, we can use other techniques to analyze what is happening on those hosts.

Other Applications

You can use the msticpy query functions MultiDataSource.get_timeseries_decompose and MultiDataSource.get_timeseries_anomalies on most Microsoft Sentinel tables to do this summarization directly.

Three examples are shown below.

start = pd.Timestamp("2020-09-01T00:00:00")
end = pd.Timestamp("2020-09-30T00:00:00")

# Sent bytes by hour (default) from Palo Alto devices
time_series_net_bytes = qry_prov.MultiDataSource.get_timeseries_decompose(
    start=start,
    end=end,
    table="CommonSecurityLog",
    timestampcolumn="TimeGenerated",
    aggregatecolumn="SentBytes",
    groupbycolumn="DeviceVendor",
    aggregatefunction="sum(SentBytes)",
    where_clause='|where DeviceVendor=="Palo Alto Networks"',
)

# Sign-in failure count in AAD
time_series_signin_fail = qry_prov.MultiDataSource.get_timeseries_anomalies(
    table="SigninLogs",
    start=start,
    end=end,
    timestampcolumn="TimeGenerated",
    aggregatecolumn="AppDisplayName",
    groupbycolumn="ResultType",
    aggregatefunction="count(AppDisplayName)",
    where_clause='| where ResultType in (50126, 50053, 50074, 50076)',
    add_query_items='| project-rename Total=AppDisplayName',
)

# Number of distinct processes by hour
time_series_procs = qry_prov.MultiDataSource.get_timeseries_anomalies(
    table="SecurityEvent",
    start=start,
    end=end,
    timestampcolumn="TimeGenerated",
    aggregatecolumn="DistinctProcesses",
    groupbycolumn="Account",
    aggregatefunction="dcount(NewProcessName)",
    where_clause="| where Computer == 'myhost.domain.com'",
)

# Then submit to ts anomalies decomposition
ts_analysis = timeseries_anomalies_stl(time_series_procs)
# Visualize the timeseries and any anomalies
display_timeseries_anomolies(data=ts_analysis, y='Total');

Using Clustering

- Example: aggregating similar process patterns to highlight unusual logon sessions

Sifting through thousands of events from a host is tedious in the extreme. We want to find a better way of identifying suspicious clusters of activity.

Query the data and do some initial analysis of the results

In [8]:
print("Getting process events...", end="")
processes_on_host = qry_prov.WindowsSecurity.list_host_processes(
    query_range, host_name="MSTICAlertsWin1"
)

md("Initial analysis of data set", "large, bold")
md(f"Total processes in data set <b>{len(processes_on_host)}</b>")
for column in ("Account", "NewProcessName", "CommandLine"):
    md(f"Total distinct {column} in data <b>{processes_on_host[column].nunique()}</b>")
md("<hr>")
md("Try grouping by distinct Account, Process, Commandline<br> - we still have 1000s of rows!", "large")
display(
    processes_on_host
    .groupby(["Account", "NewProcessName", "CommandLine"])
    [["TimeGenerated"]]
    .count()
    .rename(columns={"TimeGenerated": "Count"})
)
Getting process events...

Initial analysis of data set

Total processes in data set 22979

Total distinct Account in data 5

Total distinct NewProcessName in data 192

Total distinct CommandLine in data 4551


Try grouping by distinct Account, Process, Commandline
- we still have 1000s of rows!

Count
Account NewProcessName CommandLine
MSTICAlertsWin1\MSTICAdmin C:\Program Files (x86)\Internet Explorer\iexplore.exe "C:\Program Files (x86)\Internet Explorer\IEXPLORE.EXE" SCODEF:30680 CREDAT:82945 /prefetch:2 1
"C:\Program Files (x86)\Internet Explorer\IEXPLORE.EXE" SCODEF:5820 CREDAT:82945 /prefetch:2 1
C:\Program Files\Internet Explorer\iexplore.exe "C:\Program Files\Internet Explorer\iexplore.exe" 1
"C:\Program Files\Internet Explorer\iexplore.exe" -restart /WERRESTART 1
C:\Program Files\PuTTY\putty.exe "C:\Program Files\PuTTY\putty.exe" 1
... ... ... ...
WORKGROUP\MSTICAlertsWin1$ C:\Windows\Temp\CR_42BC8.tmp\setup.exe C:\Windows\TEMP\CR_42BC8.tmp\setup.exe --type=crashpad-handler /prefetch:7 --monitor-self-annotation=ptype=crashpad-handler --database=C:\Windows\TEMP\Crashpad --url=https://clients2.google.com/cr/report --annotation=channel= --annotation=plat=Win64 --annotation=prod=Chrome --annotation=ver=72.0.3626.109 --initial-client-data=0x1e0,0x1e4,0x1e8,0x1dc,0x1ec,0x7ff728255098,0x7ff7282550a8,0x7ff7282550b8 1
C:\Windows\Temp\D398059B-A17E-43B8-95E4-8F0453629D9F\DismHost.exe C:\Windows\TEMP\D398059B-A17E-43B8-95E4-8F0453629D9F\dismhost.exe {5B5DC19A-0D8F-4B1F-8B28-CAE7B134263A} 1
C:\Windows\WinSxS\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_10.0.14393.2602_none_7ee6020e2207416d\TiWorker.exe C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_10.0.14393.2602_none_7ee6020e2207416d\TiWorker.exe -Embedding 16
C:\Windows\WinSxS\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_10.0.14393.2782_none_7ee3347222082816\TiWorker.exe C:\Windows\winsxs\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_10.0.14393.2782_none_7ee3347222082816\TiWorker.exe -Embedding 11
C:\Windows\servicing\TrustedInstaller.exe C:\Windows\servicing\TrustedInstaller.exe 26

4594 rows × 1 columns

Clustering motivation

We want to find atypical commands being run, and see whether they are associated with the same user or time period.

It is tedious to do repeated queries grouping on different attributes of events.
Instead we can specify features that we are interested in grouping around and use
clustering, a form of unsupervised learning, to group the data.

A challenge when using simple grouping is that commands (commandlines) may vary slightly but are essentially repetitions of the same thing (e.g. contain dynamically-generated GUIDs or other temporary data).

We can extract features of the commandline rather than using it in its raw form.
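For example, a simple structural feature such as a delimiter-based token count maps near-identical commandlines (differing only in GUIDs or other variable values) to the same number. This is an illustration only; MSTICPy's feature extraction is richer:

```python
import re

def cmdline_tokens(commandline: str) -> int:
    """Structural feature: number of delimiter-separated tokens.
    Variable values (GUIDs, temp paths) change the text but not the structure."""
    return len([t for t in re.split(r'[\s\\/\-.,"\'|&:;%$()]+', commandline) if t])

# Two invocations of the same command with different GUIDs
a = r"dismhost.exe {5B5DC19A-0D8F-4B1F-8B28-CAE7B134263A}"
b = r"dismhost.exe {A17E43B8-95E4-4B1F-8B28-111111111111}"
print(cmdline_tokens(a), cmdline_tokens(b))
```

Both commandlines score the same, so they land in the same group, while a structurally different command scores differently.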

Using clustering we can add arbitrarily many features to group on. Here we are using the following features:

  • Account name
  • Process name
  • Command line structure
  • Whether the process is a system session or not

Note: A downside to clustering is that text features (usually) need to be transformed from a string
into a numeric representation.
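A minimal numeric transform, in the spirit of char_ord_score (but not its actual algorithm), sums character ordinals so that case variants of the same name map to the same value:

```python
def char_ord_total(value: str) -> int:
    """Toy numeric encoding: sum of character ordinals of the lowercased string.
    Case variants of the same account name map to the same number."""
    return sum(ord(ch) for ch in value.lower())

print(char_ord_total("Administrator"), char_ord_total("administrator"))
```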

In [9]:
from msticpy.sectools.eventcluster import dbcluster_events, add_process_features, char_ord_score

print(f"Input data: {len(processes_on_host)} events")
print("Extracting features...", end="")
feature_procs = add_process_features(input_frame=processes_on_host, path_separator="\\")

feature_procs["accountNum"] = feature_procs.apply(
    lambda x: char_ord_score(x.Account), axis=1
)
print(".", end="")

# you might need to play around with the max_cluster_distance parameter.
# decreasing this gives more clusters.
cluster_columns = ["commandlineTokensFull", "pathScore", "accountNum", "isSystemSession"]
print("done")
print("Clustering...", end="")
(clus_events, dbcluster, x_data) = dbcluster_events(
    data=feature_procs,
    cluster_columns=cluster_columns,
    max_cluster_distance=0.0001,
)
print("done")
print("Number of input events:", len(feature_procs))
print("Number of clustered events:", len(clus_events))

print("Merging with source data and computing rarity...", end="")

# Join the clustered results back to the original process frame
procs_with_cluster = feature_procs.merge(
    clus_events[[*cluster_columns, "ClusterSize"]],
    on=["commandlineTokensFull", "accountNum", "pathScore", "isSystemSession"],
)

# Compute Process pattern Rarity = inverse of cluster size
procs_with_cluster["Rarity"] = 1 / procs_with_cluster["ClusterSize"]
# count the number of processes for each logon ID
lgn_proc_count = (
    pd.concat(
        [
            processes_on_host.groupby("TargetLogonId")["TargetLogonId"].count(),
            processes_on_host.groupby("SubjectLogonId")["SubjectLogonId"].count(),
        ]
    ).groupby(level=0).sum()
).to_dict()
print("done")
# Display the results
md("<br><hr>Sessions ordered by process rarity", "large, bold")
md("Higher score indicates higher number of unusual processes")
process_rarity = (procs_with_cluster.groupby(["SubjectUserName", "SubjectLogonId"])
    .agg({"Rarity": "mean", "TimeGenerated": "count"})
    .rename(columns={"TimeGenerated": "ProcessCount"})
    .reset_index())
display(
    process_rarity
    .sort_values("Rarity", ascending=False)
    .style.bar(subset=["Rarity"], color="#d65f5f")
)
Input data: 22979 events
Extracting features....done
Clustering...done
Number of input events: 22979
Number of clustered events: 318
Merging with source data and computing rarity...done



Sessions ordered by process rarity

Higher score indicates higher number of unusual processes

SubjectUserName SubjectLogonId Rarity ProcessCount
15 ian 0x5d5af2 0.607143 56
9 MSTICAdmin 0xbd57571 0.484735 38
2 MSTICAdmin 0x109c408 0.432549 10
5 MSTICAdmin 0x2e2017 0.408239 33
0 - 0x3e7 0.350000 20
7 MSTICAdmin 0x78225e 0.297775 21
10 MSTICAdmin 0xbed1e13 0.297775 21
3 MSTICAdmin 0x1e821b5 0.239992 8
8 MSTICAdmin 0xab5a5ac 0.239992 8
11 MSTICAdmin 0xc277459 0.236656 6
4 MSTICAdmin 0x1f388a3 0.202848 7
12 MSTICAdmin 0xc54c7b9 0.202848 7
6 MSTICAdmin 0x527d50d 0.058824 3
1 LOCAL SERVICE 0x3e5 0.038462 26
14 MSTICAlertsWin1$ 0x3e7 0.012226 14508
13 MSTICAlertsWin1$ 0x3e4 0.004636 8197
In [10]:
# get the logon ID of the rarest session
rarest_logonid = process_rarity[process_rarity["Rarity"] == process_rarity.Rarity.max()].SubjectLogonId.iloc[0]
# extract processes with this logonID
sample_processes = (
    processes_on_host
    [processes_on_host["SubjectLogonId"] == rarest_logonid]
    [["TimeGenerated", "CommandLine"]]
    .sort_values("TimeGenerated")
)[5:25]
# compute duration of session
duration = sample_processes.TimeGenerated.max() - sample_processes.TimeGenerated.min()
md(f"{len(sample_processes)} processes executed in {duration.total_seconds()} sec", "bold")
display(sample_processes)

20 processes executed in 1.323 sec

TimeGenerated CommandLine
22656 2021-06-27 07:48:54.286344 ftp -s:MG06.dll
22657 2021-06-27 07:48:55.166344 cacls.exe C:\Windows\system32\cscript.exe /e /t /g SYSTEM:F
22658 2021-06-27 07:48:55.256344 net users
22659 2021-06-27 07:48:55.269344 findstr "abai$"
22660 2021-06-27 07:48:55.296344 C:\Windows\system32\net1 users
22661 2021-06-27 07:48:55.356344 net user abai$ Wf9k44_9d[=$ /add
22662 2021-06-27 07:48:55.366344 C:\Windows\system32\net1 user abai$ Wf9k44_9d[=$ /add
22663 2021-06-27 07:48:55.392344 net user abai$ Wf9k44_9d[=$
22664 2021-06-27 07:48:55.406344 C:\Windows\system32\net1 user abai$ Wf9k44_9d[=$
22665 2021-06-27 07:48:55.432344 net localgroup administrators
22666 2021-06-27 07:48:55.439344 findstr "abai$"
22667 2021-06-27 07:48:55.449344 C:\Windows\system32\net1 localgroup administrators
22668 2021-06-27 07:48:55.472344 net localgroup administrators abai$ /add
22669 2021-06-27 07:48:55.489344 C:\Windows\system32\net1 localgroup administrators abai$ /add
22670 2021-06-27 07:48:55.529344 net users
22671 2021-06-27 07:48:55.536344 findstr "www.401hk.com"
22672 2021-06-27 07:48:55.542344 C:\Windows\system32\net1 users
22673 2021-06-27 07:48:55.569344 net user www.401hk.com Wf9k44_9d[=$ /add
22674 2021-06-27 07:48:55.582344 C:\Windows\system32\net1 user www.401hk.com Wf9k44_9d[=$ /add
22675 2021-06-27 07:48:55.609344 net user www.401hk.com Wf9k44_sinc9d3[=$

Clustering conclusion

We have narrowed the task of sifting through more than 20,000 processes down to a few tens, grouped into sessions and ordered by the relative rarity of their process patterns.

Other Applications

You can use this technique on other datasets where you want to group by multiple features of the data.

The caveat is that you need to transform any non-numeric data field into a numeric form.

msticpy has a few built-in functions to help with this:

from msticpy.sectools import eventcluster

# This will group similar names together (e.g. "Administrator" and "administrator")
my_df["account_num"] = eventcluster.char_ord_score_df(data=my_df, column="Account")

# This will create a distinct hash for even minor differences in the input.
# This might be useful to detect imperfectly faked UA strings.
my_df["ua_hash"] = eventcluster.crc32_hash_df(data=my_df, column="UserAgent")

# This will return the number of delimiter chars in the string - often a 
# a good proxy for the structure of an input while ignoring variable text values
# e.g. 
# "https://my.dom.com/path1?u1=163.174.4.23" will produce the same score as
# "https://www.contoso.com/azure?srcdom=moon.base.alpha.com"
# but
# "curl https://www.contoso.com/top/next?u=23 > ~/my_page"
# "curl https://www.contoso.com/top_next?u=2.3  > ~/my_page.sh"
# will produce different values despite the similarity of the strings.
# Note - you can override the default delimiter list.
my_df["request_struct"] = eventcluster.delim_count_df(my_df, column="Request")
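The delimiter-count idea can be demonstrated standalone on the strings from the comments above (a simplified stand-in for delim_count_df, using an assumed delimiter set):

```python
DELIMS = set(' :/.?=>~&-')  # assumed delimiter set for this illustration

def delim_count(value: str) -> int:
    """Count delimiter characters - a rough proxy for the structure of a string."""
    return sum(ch in DELIMS for ch in value)

url_a = "https://my.dom.com/path1?u1=163.174.4.23"
url_b = "https://www.contoso.com/azure?srcdom=moon.base.alpha.com"
cmd_a = "curl https://www.contoso.com/top/next?u=23 > ~/my_page"
cmd_b = "curl https://www.contoso.com/top_next?u=2.3  > ~/my_page.sh"
print(delim_count(url_a), delim_count(url_b), delim_count(cmd_a), delim_count(cmd_b))
```

The two URLs score identically despite having entirely different text, while the two curl commands score differently despite looking almost the same.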

You can use a combination of these and other functions on the same fields to measure different aspects of the data. For example, the following takes a hash of the browser version of the UA (user agent) string and a structural count of the delimiters used.

Use the ua_pref_hash and ua_delims features to cluster on identical browser versions that share the same UA string structure.

my_df["ua_prefix"] = my_df["UserAgent"].str.split(")").str[-1]
my_df["ua_pref_hash"] = eventcluster.crc32_hash_df(data=my_df, column="ua_prefix")
my_df["ua_delims"] = eventcluster.delim_count_df(data=my_df, column="UserAgent")

Detecting anomalous sequences using Markov Chain

The MSTICPy anomalous_sequence subpackage uses Markov chain analysis to estimate the probability
that a particular sequence of events will occur, given what has happened in the past.

Here we're applying it to Office activity.
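The underlying idea can be sketched with a toy first-order Markov chain (a simplification: the MSTICPy model also scores command parameters and values, and the command names here are illustrative):

```python
from collections import defaultdict

def transition_probs(sessions):
    """Estimate P(next_cmd | cmd) from observed sessions."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for a, b in zip(session, session[1:]):
            counts[a][b] += 1
    return {
        a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
        for a, nxt in counts.items()
    }

def sequence_likelihood(session, probs):
    """Product of transition probabilities; unseen transitions get a tiny floor."""
    likelihood = 1.0
    for a, b in zip(session, session[1:]):
        likelihood *= probs.get(a, {}).get(b, 1e-6)
    return likelihood

# Historical sessions: the account alternates between two routine commands
history = [
    ["Set-Mailbox", "Get-Mailbox", "Set-Mailbox"],
    ["Get-Mailbox", "Set-Mailbox", "Get-Mailbox"],
] * 10
probs = transition_probs(history)
common = sequence_likelihood(["Set-Mailbox", "Get-Mailbox"], probs)
rare = sequence_likelihood(["Get-Mailbox", "Update-RoleGroupMember"], probs)
print(common, rare)
```

A sequence seen many times before scores high; a never-before-seen transition scores near zero, which is exactly what makes it interesting.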

Query the data

In [11]:
query = """
| where TimeGenerated >= ago(60d)
| where RecordType_s == 'ExchangeAdmin'
| where UserId_s !startswith "NT AUTHORITY"
| where UserId_s !contains "prod.outlook.com"  
| extend params = todynamic(strcat('{"', Operation_s, '" : ', tostring(Parameters_s), '}')) 
| extend UserId = UserId_s, ClientIP = ClientIP_s, Operation = Operation_s
| project TimeGenerated= Start_Time_t, UserId, ClientIP, Operation, params
| sort by UserId asc, ClientIP asc, TimeGenerated asc
| extend begin = row_window_session(TimeGenerated, 20m, 2m, UserId != prev(UserId) or ClientIP != prev(ClientIP))
| summarize cmds=makelist(Operation), end=max(TimeGenerated), nCmds=count(), nDistinctCmds=dcount(Operation),
params=makelist(params) by UserId, ClientIP, begin
| project UserId, ClientIP, nCmds, nDistinctCmds, begin, end, duration=end-begin, cmds, params
"""
exchange_df = qry_prov.Azure.OfficeActivity(add_query_items=query)
print(f"Number of events {len(exchange_df)}")
exchange_df.drop(columns="params").head()
Number of events 146
Out[11]:
UserId ClientIP nCmds nDistinctCmds begin end duration cmds
0 NAMPRD06\Administrator (Microsoft.Office.Datacenter.Torus.PowerShellWorker) nan 28 1 2020-06-21 02:36:46+00:00 2020-06-21 02:36:46+00:00 0 days [Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-Cond...
1 NAMPRD06\Administrator (Microsoft.Office.Datacenter.Torus.PowerShellWorker) nan 28 1 2020-06-21 05:31:34+00:00 2020-06-21 05:31:34+00:00 0 days [Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-Cond...
2 NAMPRD06\Administrator (Microsoft.Office.Datacenter.Torus.PowerShellWorker) nan 2 1 2020-06-22 02:27:06+00:00 2020-06-22 02:27:06+00:00 0 days [Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy]
3 NAMPRD06\Administrator (Microsoft.Office.Datacenter.Torus.PowerShellWorker) nan 26 1 2020-06-22 02:30:52+00:00 2020-06-22 02:30:52+00:00 0 days [Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-Cond...
4 NAMPRD06\Administrator (Microsoft.Office.Datacenter.Torus.PowerShellWorker) nan 28 1 2020-06-22 04:55:59+00:00 2020-06-22 04:55:59+00:00 0 days [Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-Cond...

Perform Anomalous Sequence analysis on the data

The analysis groups events into sessions (time-bounded and linked by a common account). It then
builds a probability model for the types of command (e.g. "SetMailboxProperty")
and for the parameters and parameter values used with each command.

In other words: how likely is it that a given user would run this sequence of commands in a logon session?

Using this probability model, we can highlight sequences that have an extremely low probability based
on prior behaviour.
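The "rarest window" score used in the results below can be sketched as the minimum joint likelihood over a sliding window of events (an illustration of the idea, not MSTICPy's exact computation):

```python
from math import prod

def rarest_window(event_likelihoods, window_length=3):
    """Return the lowest product of likelihoods over any run of
    `window_length` consecutive events - the session's most improbable stretch."""
    windows = [
        event_likelihoods[i : i + window_length]
        for i in range(len(event_likelihoods) - window_length + 1)
    ]
    scored = [(prod(w), w) for w in windows]
    return min(scored, key=lambda x: x[0])

# Per-event likelihoods for one session; one event is highly unusual
likelihoods = [0.9, 0.8, 0.7, 0.001, 0.9, 0.85]
score, window = rarest_window(likelihoods)
print(score, window)
```

The window containing the improbable event dominates the score, so a single rare command is enough to surface the whole session.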

In [12]:
from msticpy.analysis.anomalous_sequence.utils.data_structures import Cmd
from msticpy.analysis.anomalous_sequence import anomalous

def process_exchange_session(session_with_params, include_vals):
    new_ses = []
    for cmd in session_with_params:
        c = list(cmd.keys())[0]
        par = list(cmd.values())[0]
        new_pars = set()
        if include_vals:
            new_pars = dict()
        for p in par:
            if include_vals:
                new_pars[p['Name']] = p['Value']
            else:
                new_pars.add(p['Name'])
        new_ses.append(Cmd(name=c, params=new_pars))
    return new_ses

sessions = exchange_df.cmds.values.tolist()
param_sessions = []
param_value_sessions = []

for ses in exchange_df.params.values.tolist():
    new_ses_set = process_exchange_session(session_with_params=ses, include_vals=False)
    new_ses_dict = process_exchange_session(session_with_params=ses, include_vals=True)
    param_sessions.append(new_ses_set)
    param_value_sessions.append(new_ses_dict)

data = exchange_df
data['session'] = sessions
data['param_session'] = param_sessions
data['param_value_session'] = param_value_sessions

modelled_df = anomalous.score_sessions(
    data=data,
    session_column='param_value_session',
    window_length=3
)

anomalous.visualise_scored_sessions(
    data_with_scores=modelled_df,
    time_column='begin',  # this will appear on the x-axis
    score_column='rarest_window3_likelihood',  # this will appear on the y axis
    window_column='rarest_window3',  # this will represent the session in the tool-tips
    source_columns=['UserId', 'ClientIP'],  # specify any additional columns to appear in the tool-tips
)
Loading BokehJS ...

The events are shown in descending order of likelihood (vertically), so the
events at the bottom of the chart are the ones most interesting to us.

Looking at these rare events, we can see potentially suspicious activity changing role memberships.

In [13]:
pd.set_option("display.html.table_schema", False)

likelihood_max=modelled_df["rarest_window3_likelihood"].max()
likelihood_min=modelled_df["rarest_window3_likelihood"].min()
slider_step = (likelihood_max - likelihood_min) / 20
start_val = likelihood_min + slider_step
threshold = widgets.FloatSlider(
    description="Select likelihood threshold",
    max=likelihood_max,
    min=likelihood_min,
    value=start_val,
    step=slider_step,
    layout=widgets.Layout(width="60%"),
    style={"description_width": "200px"},
    readout_format=".7f"
)


def show_rows(change):
    thresh = change["new"]
    pd_disp.update(modelled_df[modelled_df["rarest_window3_likelihood"] < thresh])

threshold.observe(show_rows, names="value")
md("Move the slider to see event sessions below the selected <i>likelihood</i> threshold", "bold")
display(HTML("<hr>"))
display(threshold)
display(HTML("<hr>"))
md(f"Range is {likelihood_min:.7f} (min likelihood) to {likelihood_max:.7f} (max likelihood)<br><br><hr>")
pd_disp = display(
    modelled_df[modelled_df["rarest_window3_likelihood"] < start_val],
    display_id=True
)

Move the slider to see event sessions below the selected likelihood threshold



Range is 0.0000025 (min likelihood) to 0.0106819 (max likelihood)


UserId ClientIP nCmds nDistinctCmds begin end duration cmds params session param_session param_value_session rarest_window3_likelihood rarest_window3
18 NAMPRD06\Administrator (Microsoft.Office.Datacenter.Torus.PowerShellWorker) nan 10 1 2020-06-28 05:30:53+00:00 2020-06-28 05:30:53+00:00 0 days [Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-Cond... [{'Set-ConditionalAccessPolicy': [{'Name': 'Identity', 'Value': 'seccxpninja.onmicrosoft.com\\23... [Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-ConditionalAccessPolicy, Set-Cond... [Cmd(name='Set-ConditionalAccessPolicy', params={'PolicyIdentifierString', 'PolicyDetails', 'Ten... [Cmd(name='Set-ConditionalAccessPolicy', params={'Identity': 'seccxpninja.onmicrosoft.com\\235be... 0.003386 [Cmd(name='Set-ConditionalAccessPolicy', params={'Identity': 'seccxpninja.onmicrosoft.com\\6490d...
145 [email protected] 20.185.182.48:37965 6 1 2020-07-29 20:11:27+00:00 2020-07-29 20:11:27+00:00 0 days [Update-RoleGroupMember, Update-RoleGroupMember, Update-RoleGroupMember, Update-RoleGroupMember,... [{'Update-RoleGroupMember': [{'Name': 'Members', 'Value': 'CBoehmSA;pcadmin;SecurityAdmins_20075... [Update-RoleGroupMember, Update-RoleGroupMember, Update-RoleGroupMember, Update-RoleGroupMember,... [Cmd(name='Update-RoleGroupMember', params={'Identity', 'Members'}), Cmd(name='Update-RoleGroupM... [Cmd(name='Update-RoleGroupMember', params={'Members': 'CBoehmSA;pcadmin;SecurityAdmins_20075581... 0.000003 [Cmd(name='Update-RoleGroupMember', params={'Members': 'CBoehmSA;ComplianceAdmins_939735849', 'I...

Note: for many events the output will be long.

In [14]:
import pprint

rarest_events = (
    modelled_df[modelled_df["rarest_window3_likelihood"] < threshold.value]
    [[
        "UserId", "ClientIP", "begin", "end", "param_value_session", "rarest_window3_likelihood"
    ]]
    .rename(columns={"rarest_window3_likelihood": "likelihood"})
    .sort_values("likelihood")
)
for idx, (_, rarest_event) in enumerate(rarest_events.iterrows(), 1):
    md(f"Event {idx}", "large")
    display(pd.DataFrame(rarest_event[["UserId", "ClientIP", "begin", "end", "likelihood"]]))

    md("<hr>")
    md("Param session details:", "bold")
    for cmd in rarest_event.param_value_session:
        md(f"Command: {cmd.name}")
        md(pprint.pformat(cmd.params))
    md("<hr><br>")

Event 1

145
UserId [email protected]
ClientIP 20.185.182.48:37965
begin 2020-07-29 20:11:27+00:00
end 2020-07-29 20:11:27+00:00
likelihood 0.000003


Param session details:

Command: Update-RoleGroupMember

{'Identity': 'Security Administrator', 'Members': 'CBoehmSA;pcadmin;SecurityAdmins_2007558133'}

Command: Update-RoleGroupMember

{'Identity': 'Compliance Management', 'Members': 'CBoehmSA;ComplianceAdmins_939735849'}

Command: Update-RoleGroupMember

{'Identity': 'Discovery Management', 'Members': 'CBoehmSA'}

Command: Update-RoleGroupMember

{'Identity': 'Compliance Management', 'Members': 'CBoehmSA;ComplianceAdmins_939735849'}

Command: Update-RoleGroupMember

{'Identity': 'Discovery Management', 'Members': 'CBoehmSA'}

Command: Update-RoleGroupMember

{'Identity': 'Security Administrator', 'Members': 'CBoehmSA;pcadmin;SecurityAdmins_2007558133'}




Resources

MSTICpy: https://github.com/microsoft/msticpy

MSTICpy maintainers: Ian Hellen, Pete Bryan, Ashwin Patil

Microsoft Sentinel Notebooks: https://github.com/Azure/Azure-Sentinel-Notebooks