Notebook: https://aka.ms/Jupyterthon-ws-2-2
License: Creative Commons Attribution-ShareAlike 4.0 International
Q&A - OTR Discord #Jupyterthon #WORKSHOP DAY 2 - VISUALIZING DATA
This is not intended to be a comprehensive overview of visualization in Python/Jupyter. There are many libraries and techniques not covered here. These are just a few options that we have used and liked, and that give you a lot of scope.
Resources:
Cheatsheets:
Refer to the Bar Plots section for more examples and customization options.
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
import pandas as pd
logons_full_df = pd.read_pickle("../data/host_logons.pkl")
net_full_df = pd.read_pickle("../data/az_net_comms_df.pkl")
logons_full_df.head()
Account | EventID | TimeGenerated | Computer | SubjectUserName | SubjectDomainName | SubjectUserSid | TargetUserName | TargetDomainName | TargetUserSid | TargetLogonId | LogonType | IpAddress | WorkstationName | TimeCreatedUtc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NT AUTHORITY\SYSTEM | 4624 | 2019-02-12 04:56:34.307 | MSTICAlertsWin1 | MSTICAlertsWin1$ | WORKGROUP | S-1-5-18 | SYSTEM | NT AUTHORITY | S-1-5-18 | 0x3e7 | 5 | - | - | 2019-02-12 04:56:34.307 |
1 | MSTICAlertsWin1\MSTICAdmin | 4624 | 2019-02-12 04:37:25.340 | MSTICAlertsWin1 | - | - | S-1-0-0 | MSTICAdmin | MSTICAlertsWin1 | S-1-5-21-996632719-2361334927-4038480536-500 | 0xc90e957 | 3 | 131.107.147.209 | IANHELLE-DEV17 | 2019-02-12 04:37:25.340 |
2 | MSTICAlertsWin1\MSTICAdmin | 4624 | 2019-02-12 04:37:27.997 | MSTICAlertsWin1 | - | - | S-1-0-0 | MSTICAdmin | MSTICAlertsWin1 | S-1-5-21-996632719-2361334927-4038480536-500 | 0xc90ea44 | 3 | 131.107.147.209 | IANHELLE-DEV17 | 2019-02-12 04:37:27.997 |
3 | MSTICAlertsWin1\MSTICAdmin | 4624 | 2019-02-12 04:38:16.550 | MSTICAlertsWin1 | - | - | S-1-0-0 | MSTICAdmin | MSTICAlertsWin1 | S-1-5-21-996632719-2361334927-4038480536-500 | 0xc912d62 | 3 | 131.107.147.209 | IANHELLE-DEV17 | 2019-02-12 04:38:16.550 |
4 | MSTICAlertsWin1\MSTICAdmin | 4624 | 2019-02-12 04:38:21.370 | MSTICAlertsWin1 | - | - | S-1-0-0 | MSTICAdmin | MSTICAlertsWin1 | S-1-5-21-996632719-2361334927-4038480536-500 | 0xc913737 | 3 | 131.107.147.209 | IANHELLE-DEV17 | 2019-02-12 04:38:21.370 |
# Preprocess the data: group by LogonType and count the Account entries (logon events) for each type
logontypebyacc = logons_full_df.groupby(['LogonType'])['Account'].count()
logontypebyacc.head()
LogonType
0      2
2     12
3     13
4      9
5    126
Name: Account, dtype: int64
logontypebyacc.plot(kind='bar')
<AxesSubplot:xlabel='LogonType'>
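The bar chart above uses the pandas defaults. As a minimal sketch of the kind of customization available, the cell below rebuilds the counts shown in the output above as a small Series (so it runs without the pickle file) and sets a title, axis labels, label rotation, and a count label on each bar. The `Agg` backend line is only an assumption for running outside a notebook; drop it in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Counts copied from the logontypebyacc output above
logon_counts = pd.Series(
    [2, 12, 13, 9, 126],
    index=pd.Index([0, 2, 3, 4, 5], name="LogonType"),
    name="Account",
)

# pandas forwards title, rot and figsize to matplotlib and returns the Axes
ax = logon_counts.plot(kind="bar", title="Logons by logon type", rot=0, figsize=(8, 4))
ax.set_xlabel("Logon type")
ax.set_ylabel("Number of logon events")

# Write the count above each bar
for bar, count in zip(ax.patches, logon_counts):
    ax.annotate(
        str(count),
        (bar.get_x() + bar.get_width() / 2, bar.get_height()),
        ha="center",
        va="bottom",
    )
```

Because `plot` returns a regular matplotlib Axes, anything from the matplotlib API can be applied to it afterwards.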
# Preprocess the dataframe: index by TimeGenerated, resample into daily bins and count the logons per day
logonaccountbyday = logons_full_df.set_index('TimeGenerated').resample('D')['Account'].count()
logonaccountbyday.head()
TimeGenerated
2019-02-09     3
2019-02-10    11
2019-02-11     6
2019-02-12    72
2019-02-13    15
Freq: D, Name: Account, dtype: int64
logonaccountbyday.plot(figsize = (20,8))
<AxesSubplot:xlabel='TimeGenerated'>
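To make the resampling step concrete, here is a small sketch on made-up timestamps (the `events` frame and its values are invented for illustration, not taken from the workshop data). `resample("D")` needs a datetime index, which is why the notebook calls `set_index('TimeGenerated')` first. Days with no events still appear, with a count of 0.

```python
import pandas as pd

# Hypothetical events: two on 1 Feb, none on 2 Feb, one on 3 Feb
events = pd.DataFrame(
    {
        "TimeGenerated": pd.to_datetime(
            ["2019-02-01 08:00", "2019-02-01 17:30", "2019-02-03 09:15"]
        ),
        "Account": ["alice", "bob", "alice"],
    }
)

# Bucket rows into calendar days and count the non-null Account values in each
daily_counts = events.set_index("TimeGenerated").resample("D")["Account"].count()
print(daily_counts)
```

Changing `"D"` to another frequency string such as `"h"` or `"W"` changes the bin size without touching the rest of the chain.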
import matplotlib.pyplot as plt
# On matplotlib 3.6 and later this style is named "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")
plt.figure(figsize = (20,8))
plt.plot(logonaccountbyday, marker='o')
plt.title("Daily trend of account logons")
plt.xlabel("Date")
plt.ylabel("Logon Count")
# another example of customization with plot
# plt.plot(logonaccountbyday, color='green', marker='o', linestyle='dashed',linewidth=2)
plt.show()
Bokeh is a very flexible JS-based visualization framework. It produces beautiful interactive charts but is somewhat complex.
Example Bokeh Ridge plot
Holoviews is a higher-level, declarative layer built on top of Bokeh (or Matplotlib).
Example Holoviews Violin plot
hvPlot provides some of the Holoviews functionality as a pandas extension.
conda install -c pyviz hvplot
pip install hvplot
import hvplot.pandas
count_of_logons = logons_full_df[["TimeGenerated", "Account"]].groupby("Account").count()
count_of_logons.hvplot.barh(height=300)
plot_df = (
net_full_df[["L7Protocol", "AllExtIPs", "TotalAllowedFlows"]]
.groupby(["L7Protocol", "TotalAllowedFlows"])
.nunique()
)
display(plot_df.head(3))
plot_df.hvplot.scatter(by="L7Protocol")
L7Protocol | TotalAllowedFlows | AllExtIPs |
---|---|---|
ftp | 1.0 | 1 |
http | 1.0 | 12 |
http | 2.0 | 16 |
plot_df = (
logons_full_df[["TimeCreatedUtc", "Account", "LogonType"]]
.assign(HourOfDay=logons_full_df.TimeCreatedUtc.dt.hour)
)
display(plot_df.head(3))
plot_df.hvplot.hist(y="HourOfDay", by="Account", title="Logons by Hour")
TimeCreatedUtc | Account | LogonType | HourOfDay | |
---|---|---|---|---|
0 | 2019-02-12 04:56:34.307 | NT AUTHORITY\SYSTEM | 5 | 4 |
1 | 2019-02-12 04:37:25.340 | MSTICAlertsWin1\MSTICAdmin | 3 | 4 |
2 | 2019-02-12 04:37:27.997 | MSTICAlertsWin1\MSTICAdmin | 3 | 4 |
plot_df.hvplot.hist(y="HourOfDay", by="Account", subplots=True, width=400).cols(2)
More parameters
plot_df.hvplot.hist(y="HourOfDay", by="Account", subplots=True, shared_axes=False, width=400).cols(2)
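The histograms above bin logons by hour of day for each account. As a sketch of the same idea in table form, the cell below uses made-up logon times (the `sample` frame is hypothetical) and `pd.crosstab` to count rows per account and hour, which is handy when you want the numbers behind the chart.

```python
import pandas as pd

# Hypothetical logons; column names mirror the notebook
sample = pd.DataFrame(
    {
        "TimeCreatedUtc": pd.to_datetime(
            ["2019-02-12 04:56", "2019-02-12 04:37", "2019-02-12 13:05", "2019-02-13 04:10"]
        ),
        "Account": ["SYSTEM", "MSTICAdmin", "MSTICAdmin", "MSTICAdmin"],
    }
)

# Same derived column as in the notebook
sample = sample.assign(HourOfDay=sample["TimeCreatedUtc"].dt.hour)

# Rows are accounts, columns are hours, cells are logon counts
hour_table = pd.crosstab(sample["Account"], sample["HourOfDay"])
print(hour_table)
```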
plot_df = (
net_full_df[["L7Protocol", "AllExtIPs", "TotalAllowedFlows"]]
.groupby(["L7Protocol", "TotalAllowedFlows"])
.nunique()
)
display(plot_df.head(3))
plot_df.hvplot.violin(by="L7Protocol", height=600)
L7Protocol | TotalAllowedFlows | AllExtIPs |
---|---|---|
ftp | 1.0 | 1 |
http | 1.0 | 12 |
http | 2.0 | 16 |
plot_df = (
net_full_df[["L7Protocol", "AllExtIPs", "TotalAllowedFlows"]]
.groupby(["L7Protocol", "TotalAllowedFlows"])
.nunique()
)
plot_df.hvplot.scatter(by="L7Protocol", height=600) + plot_df.hvplot.violin(by="L7Protocol", height=600)
plot2_df = (
net_full_df[["FlowStartTime", "AllExtIPs", "L7Protocol", "RemoteRegion"]]
.groupby(["RemoteRegion", pd.Grouper(key="FlowStartTime", freq="5min")])
.agg({"L7Protocol": "nunique", "AllExtIPs": "nunique"})
.sort_index()
# .head(500)
.reset_index()
)
plot2_df.hvplot.scatter(y="AllExtIPs", alpha=0.5, height=500, by="RemoteRegion") * plot2_df.hvplot.line(y="L7Protocol", color="blue")
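The `pd.Grouper(key="FlowStartTime", freq="5min")` call above is what buckets flows into five-minute windows within each region. A minimal sketch on invented rows shows the effect (column names mirror the notebook; the regions and addresses are made up):

```python
import pandas as pd

# Hypothetical flows: three in one region, one in another
flows = pd.DataFrame(
    {
        "FlowStartTime": pd.to_datetime(
            ["2019-02-12 10:01", "2019-02-12 10:03", "2019-02-12 10:07", "2019-02-12 10:02"]
        ),
        "RemoteRegion": ["eastus", "eastus", "eastus", "westus"],
        "AllExtIPs": ["192.0.2.1", "192.0.2.2", "192.0.2.1", "192.0.2.3"],
    }
)

# Group by region and by 5-minute window, then count distinct IPs in each group
binned = (
    flows.groupby(["RemoteRegion", pd.Grouper(key="FlowStartTime", freq="5min")])
    .agg({"AllExtIPs": "nunique"})
    .reset_index()
)
print(binned)
```

The 10:01 and 10:03 flows land in the same 10:00 window, while the 10:07 flow falls into the 10:05 window, so the first region produces two rows.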
plot_df = (
net_full_df[["FlowStartTime", "L7Protocol", "RemoteRegion", "TotalAllowedFlows", "AllExtIPs"]]
.assign(
MinOfDay=net_full_df.FlowStartTime.dt.hour * 60 + net_full_df.FlowStartTime.dt.minute
)
.groupby(["FlowStartTime", "L7Protocol", "RemoteRegion", "TotalAllowedFlows"])
.nunique()
.reset_index()
)
plot_df.hvplot.box(y="TotalAllowedFlows", by="RemoteRegion", rot=30, height=400) * plot_df.hvplot.violin(y="TotalAllowedFlows", by="RemoteRegion")
Intro: Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Statistical specialization
Resources:
In the example below, we visualize regression models using a demo dataset provided by seaborn. The plot uses two quantitative variables from that dataset, total_bill and tip, and shows how they relate to each other.
You can check more examples based on the data you have:
import seaborn as sns
sns.set_theme(style="darkgrid")
tips = sns.load_dataset("tips")
tips.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips, height= 8, aspect=15/8)
<seaborn.axisgrid.FacetGrid at 0x268db708048>
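`lmplot` with `hue` fits a separate linear regression for each level of the hue variable. As a rough sketch of what one of those fitted lines represents, the cell below fits a straight line with `numpy.polyfit` on made-up bill and tip values (not the `tips` dataset). seaborn also draws a confidence band around each line, which this sketch does not reproduce.

```python
import numpy as np

# Hypothetical data constructed to lie exactly on tip = 0.15 * total_bill + 0.5
total_bill = np.array([10.0, 20.0, 30.0, 40.0])
tip = np.array([2.0, 3.5, 5.0, 6.5])

# Degree-1 least-squares fit returns the slope first, then the intercept
slope, intercept = np.polyfit(total_bill, tip, deg=1)
print(f"tip = {slope:.2f} * total_bill + {intercept:.2f}")
```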
By Meenal Sarda.
Plotly is an open-source library that provides a whole set of chart types as well as tools to create dynamic dashboards. You can think of Plotly as a suite of tools, because it integrates with or is extended by libraries such as Dash and Chart Studio to provide interactive dashboards. Plotly’s Python graphing library makes interactive, publication-quality graphs.
Plotly treats dynamic charts and animations as first-class features, which is its main difference from visualization libraries such as matplotlib or seaborn.
Main Properties of Plotly:
The plotly.express module (usually imported as px) contains functions that can create entire figures at once, and is referred to as Plotly Express or PX. Plotly Express is a built-in part of the plotly library, and is the recommended starting point for creating most common figures.
import plotly.express as px
# Preparing Dataframe
df = logontypebyacc.to_frame(name = 'Frequency')
df.reset_index(inplace = True)
# Creating bar chart
fig = px.bar(df, x = 'LogonType', y = 'Frequency', title = 'Logon Frequency by Logon Type')
# Forcing the X axis to be categorical. Reference: https://plotly.com/python/categorical-axes/
fig.update_xaxes(type='category')
# Presenting chart
fig.show()
import pandas as pd
import json
# Opening the log file (one JSON record per line)
with open('../data/combined_zeek.log', 'r') as zeek_data:
    # Creating a list of dictionaries, one per log line
    zeek_list = [json.loads(line) for line in zeek_data]
# The file is closed automatically when the with block exits
# Creating a dataframe
zeek_df = pd.DataFrame(data = zeek_list)
zeek_df.head()
@stream | @system | @proc | ts | uid | id_orig_h | id_orig_p | id_resp_h | id_resp_p | proto | ... | is_64bit | uses_aslr | uses_dep | uses_code_integrity | uses_seh | has_import_table | has_export_table | has_cert_table | has_debug_data | section_names | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | conn | bobs.bigwheel.local | zeek | 1.588205e+09 | Cvf4XX17hSAgXDdGEd | 10.0.1.6 | 54243.0 | 10.0.0.4 | 53.0 | udp | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | conn | bobs.bigwheel.local | zeek | 1.588205e+09 | CJ21Le4zsTUcyKKi98 | 10.0.1.6 | 56880.0 | 10.0.0.4 | 445.0 | tcp | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | conn | bobs.bigwheel.local | zeek | 1.588205e+09 | CnOP7t1eGGHf6LFfuk | 10.0.1.6 | 65108.0 | 10.0.0.4 | 53.0 | udp | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | conn | bobs.bigwheel.local | zeek | 1.588205e+09 | CvxbPE3MuO7boUdSc8 | 10.0.1.6 | 138.0 | 10.0.1.255 | 138.0 | udp | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | conn | bobs.bigwheel.local | zeek | 1.588205e+09 | CuRbE21APSQo2qd6rk | 10.0.1.6 | 123.0 | 10.0.0.4 | 123.0 | udp | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 148 columns
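Because the log holds one JSON object per line, pandas can also parse it directly. The cell below is a sketch of that alternative, using `pd.read_json(..., lines=True)` on two invented records held in memory (the field values are made up). With the real file you would pass the path instead of the `StringIO` object.

```python
import io

import pandas as pd

# Two hypothetical records in JSON Lines format (one object per line)
raw = io.StringIO(
    '{"@stream": "conn", "id_orig_h": "10.0.1.6", "id_resp_p": 53, "proto": "udp"}\n'
    '{"@stream": "conn", "id_orig_h": "10.0.1.6", "id_resp_p": 445, "proto": "tcp"}\n'
)

# lines=True tells pandas to treat every line as a separate JSON document
sample_df = pd.read_json(raw, lines=True)
print(sample_df)
```

Records with different keys are handled the same way as in the list-of-dictionaries approach: missing fields become NaN.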
# Creating histogram chart
fig = px.histogram(zeek_df, x = 'duration', title = 'Distribution of Frequencies', nbins = 1000)
# Presenting chart
fig.show()
# Creating box plot
fig = px.box(zeek_df, x = 'id_resp_h', y = 'duration', title = 'Variability of Duration by Response IP Address')
# Presenting chart
fig.show()
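A box plot summarizes each group by its quartiles. To see the kind of numbers behind the boxes, the sketch below computes quartiles per responder IP with pandas on invented durations (hypothetical values, not from the Zeek data). Quartile conventions differ between tools, so box edges drawn by a plotting library may not match these values exactly.

```python
import pandas as pd

# Hypothetical connection durations for two responder addresses
conn = pd.DataFrame(
    {
        "id_resp_h": ["10.0.0.4"] * 5 + ["10.0.1.255"] * 3,
        "duration": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0],
    }
)

# One row per address, one column per quantile (pandas default: linear interpolation)
quartiles = (
    conn.groupby("id_resp_h")["duration"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```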
import msticpy.vis.mp_pandas_plot
net_data = net_full_df.sort_values("FlowStartTime").tail(500)
net_data.mp_plot.timeline(time_column="FlowStartTime")
net_data.mp_plot.timeline(
time_column="FlowStartTime",
source_columns=["L7Protocol", "RemoteRegion", "AllExtIPs"]
)
net_data.mp_plot.timeline(
time_column="FlowStartTime",
source_columns=["L7Protocol", "RemoteRegion", "AllExtIPs"],
group_by="L7Protocol",
)
help(net_data.mp_plot.timeline)
Help on method timeline in module msticpy.vis.mp_pandas_plot:

timeline(**kwargs) -> bokeh.models.layouts.LayoutDOM method of msticpy.vis.mp_pandas_plot.MsticpyPlotAccessor instance
    Display a timeline of events.

    Parameters
    ----------
    time_column : str, optional
        Name of the timestamp column
        (the default is 'TimeGenerated')
    source_columns : list, optional
        List of default source columns to use in tooltips
        (the default is None)

    Other Parameters
    ----------------
    title : str, optional
        Title to display (the default is None)
    alert : SecurityAlert, optional
        Add a reference line/label using the alert time (the default is None)
    ref_event : Any, optional
        Add a reference line/label using the alert time (the default is None)
    ref_time : datetime, optional
        Add a reference line/label using `ref_time` (the default is None)
    group_by : str
        The column to group timelines on.
    legend: str, optional
        "left", "right", "inline" or "none"
        (the default is to show a legend when plotting multiple series
        and not to show one when plotting a single series)
    yaxis : bool, optional
        Whether to show the yaxis and labels (default is False)
    ygrid : bool, optional
        Whether to show the yaxis grid (default is False)
    xgrid : bool, optional
        Whether to show the xaxis grid (default is True)
    range_tool : bool, optional
        Show the the range slider tool (default is True)
    height : int, optional
        The height of the plot figure
        (the default is auto-calculated height)
    width : int, optional
        The width of the plot figure (the default is 900)
    color : str
        Default series color (default is "navy")
    overlay_data : pd.DataFrame:
        A second dataframe to plot as a different series.
    overlay_color : str
        Overlay series color (default is "green")
    ref_events : pd.DataFrame, optional
        Add references line/label using the event times in the dataframe.
        (the default is None)
    ref_time_col : str, optional
        Add references line/label using the this column in `ref_events`
        for the time value (x-axis).
        (this defaults the value of the `time_column` parameter or 'TimeGenerated'
        `time_column` is None)
    ref_col : str, optional
        The column name to use for the label from `ref_events`
        (the default is None)
    ref_times : List[Tuple[datetime, str]], optional
        Add one or more reference line/label using (the default is None)

    Returns
    -------
    LayoutDOM
        The bokeh plot figure.
net_data.mp_plot.timeline_duration(group_by="L7Protocol")
Simple interactions
net_data.mp_plot.matrix(x="RemoteRegion", y="AllExtIPs")
(
net_data[~net_data["L7Protocol"]
.isin(["http", "https"])]
.mp_plot.matrix(x="L7Protocol", y="AllExtIPs", invert=True)
)
net_data.mp_plot.matrix(x="RemoteRegion", y="AllExtIPs", invert=True)
process_df = pd.read_pickle("../data/processes_test.pkl")
process_df.mp_plot.process_tree(legend_col="Account")