Author: Vani Asawa
Date: December 2020
Notebook Version: 1.0
Python Version: Python 3.6
Required Packages: msticpy, pandas, kqlmagic
Data Sources Required: WAF data (AzureDiagnostics)
Web Application Firewall (WAF) data records the monitored and blocked HTTP traffic to and from a web service. Due to the large magnitudes of HTTP requests made to such services in any workspace, the data tends to be incredibly noisy, and hence may prevent an analyst from determining if there are any bad requests made to the servers, which could result in a potentially malicious attack.
This notebook analyses the blocked WAF Alerts and aim to surface any unusual HTTP requests made by the client IPs to the servers, using a variety of statistical techniques applied on several features of the WAF data, such as the Rule ID of the triggering event, the HTTP status code returned to the client from the alerts, and the contents of the request URIs themselves
Distribution of WAF logs and blocked alerts over an extended time frame
Cluster the request URIs in WAF blocked alerts, based on TFIDF scores
Term frequency-inverse document frequency (TFIDF) score is a numerical statistic of how important a variable is to a document. The value of the statistic is directly proportional to the variable's frequency in the document, and inversely proportional to the number of documents that contain the variable. More information about TFIDF can be found here
In our analysis, the variable will be the 'split URIs' and 'rule IDs', while a single document is all the blocked alerts for a single client IP in the selected time frame. We will be assessing the relative importance of every single token of the split request URIs and the number of times a ruleID is triggered for our blocked alerts over multiple such 'documents'. We will be using these two sets of scores to cluster the request URIs, and obtain single/grouped sets of interesting (and potentially malicious) request URIs that were blocked by the WAF.
Prerequisites
Running the Notebook
The best way of using the notebook is as follows:
Individually run all of the cells up to the start of Section 1:
Default paramenters will allow the entire notebook to run from Section I using the 'Run Selected Cell and All Below' option under the Run tab. However, for added value, run the cells sequentially in any given section.
from pathlib import Path
import os
import sys
from pathlib import Path
from IPython.display import display, HTML
REQ_PYTHON_VER=(3, 6)
REQ_MSTICPY_VER=(1, 0, 0)
REQ_MP_EXTRAS = ["ml", "kql"]
display(HTML("<h3>Starting Notebook setup...</h3>"))
if Path("./utils/nb_check.py").is_file():
from utils.nb_check import check_versions
check_versions(REQ_PYTHON_VER, REQ_MSTICPY_VER, REQ_MP_EXTRAS)
# If not using Azure Notebooks, install msticpy with
# !pip install msticpy
from msticpy.nbtools import nbinit
nbinit.init_notebook(
namespace=globals(),
additional_packages=["adjustText", "plotly"]
);
from ipywidgets import widgets
import plotly.graph_objects as go
import plotly.express as px
import re
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from adjustText import adjust_text
import itertools
import ipaddress
import traceback
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_colwidth', 40)
pd.set_option('display.max_colwidth', None)
layout = widgets.Layout(width="50%", height="80px")
style = {"description_width": "200px"}
class color:
BOLD = '\033[1m'
END = '\033[0m'
# See if we have an Azure Sentinel Workspace defined in our config file.
# If not, let the user specify Workspace and Tenant IDs
ws_config = WorkspaceConfig()
if not ws_config.config_loaded:
ws_config.prompt_for_ws()
qry_prov = QueryProvider(data_environment="AzureSentinel")
print("done")
Text(value='', description='Enter Tenant ID: ', layout=Layout(height='30px', width='50%'), placeholder='Enter …
Text(value='', description='Enter Workspace ID: ', layout=Layout(height='30px', width='50%'), placeholder='Ent…
# Authenticate to Azure Sentinel workspace
qry_prov.connect(ws_config)
Querying Function : Accessing the results of the Kusto query as a pandas dataframe, and removing empty/null columns from the dataframe
def showQuery(query):
df = qry_prov.exec_query(query)
trimDF(df)
return df
def trimDF(df):
# Store names of columns with null values for all entries
empty_null_cols = [col for col in df.columns if df[col].isnull().all()]
# Store names of columns with empty string '' values for all entries
empty_str_cols = []
for col in df.columns:
try:
if ''.join(df[col].map(str)) == '':
empty_str_cols = empty_str_cols + [col]
except:
continue
df.drop(empty_null_cols + empty_str_cols, axis=1, inplace=True)
binIntervals = ['1m', '5m', '10m', '15m', '30m', '1h', '12h', '1d', '5d', '10d']
Selecting a Host
def queryHost(startTime, endTime):
query = '''
AzureDiagnostics
| where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
| where Category == "ApplicationGatewayFirewallLog"
| where action_s == 'Blocked' or isempty(action_s)
| summarize AlertCountPerHost = count() by hostname_s, bin(timeStamp_t, {binInterval})
| render timechart
'''.format(startTime = startTime, endTime = endTime, binInterval = '1h')
return(query)
Auto determine masking bits for clubbing IPs
def maskBitsVal(uniqueIPLen):
if uniqueIPLen > 150:
return '/8'
elif uniqueIPLen > 40:
return '/16'
elif uniqueIPLen > 15:
return '/24'
return '/32'
Select an extended time frame to view the distribution of WAF logs and blocked alerts over all hosts.
query_times_1 = nbwidgets.QueryTime(units='day', max_before=30, before=-15, max_after=-1)
query_times_1.display()
HTML(value='<h4>Set query time boundaries</h4>')
HBox(children=(DatePicker(value=datetime.date(2020, 12, 4), description='Origin Date'), Text(value='10:31:59.0…
VBox(children=(IntRangeSlider(value=(-15, 1), description='Time Range (day):', layout=Layout(width='80%'), max…
categories = ['ApplicationGatewayAccessLog', 'ApplicationGatewayFirewallLog']
def viewLogs(category):
log_alert_query = '''
AzureDiagnostics
| where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
| where Category == "{category}"
| where action_s == 'Blocked' or isempty(action_s)
| summarize NoOfAlerts= count() by bin(timeStamp_t, {binInterval})
| render timechart '''.format(startTime = query_times_1.start, endTime = query_times_1.end, category = category, binInterval = '1h')
%kql -query log_alert_query
rawDataQuery = """
AzureDiagnostics
| where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
| where Category == '{category}'
| where action_s == 'Blocked' or isempty(action_s)
| take 15
""".format(startTime = query_times_1.start, endTime = query_times_1.end, category = category)
display(showQuery(rawDataQuery).head(5))
category = widgets.Select(options = categories, style = style, layout = layout, description = 'Choose logs/alerts: ')
display(category)
interactive(children=(Select(description='Choose logs/alerts: ', layout=Layout(height='80px', width='50%'), op…
viewLogs(category = category.value)
Select a time frame of interest to view the distribution of WAF blocked alerts over all hosts.
Recommended: Analyse a shorter time frame than Section I for more detail
query_times_2 = nbwidgets.QueryTime(units='day', max_before=30, before=-10, max_after=-1)
query_times_2.display()
HTML(value='<h4>Set query time boundaries</h4>')
HBox(children=(DatePicker(value=datetime.date(2020, 12, 4), description='Origin Date'), Text(value='10:32:15.6…
VBox(children=(IntRangeSlider(value=(-10, 1), description='Time Range (day):', layout=Layout(width='80%'), max…
The following host entity will be used for the remainder of this section
query = queryHost(query_times_2.start, query_times_2.end)
%kql -query query
try:
df_host = showQuery(query)
list_hosts = set([x for x in df_host['hostname_s']])
df = df_host.groupby(['hostname_s']).agg({'AlertCountPerHost': sum}).rename(columns = {'AlertCountPerHost': 'Num_blocked_alerts'})
hosts = widgets.Select(options=list_hosts, style = style, layout = layout, value=df['Num_blocked_alerts'].idxmax(), description = 'Select Host: ')
display(df)
display(hosts)
except Exception as e:
print('Error: ' + e)
traceback.print_exc()
* 8ecf8077-cf51-4820-aadd-14040956f35d@loganalytics
Done (00:02.647): 60 records
Num_blocked_alerts | |
---|---|
hostname_s | |
4 | |
127.0.0.1 | 2 |
13.89.108.163 | 120 |
13.89.108.163:80 | 83 |
<undefined> | 18 |
Select(description='Select Host: ', index=4, layout=Layout(height='80px', width='50%'), options=('', '127.0.0.…
We will be using balloon plots to visualise the number of WAF alerts over rule IDs, http-status codes, and client IP entities, for the selected host entity.
query_distribution = '''
AzureDiagnostics
| where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
| where Category == "ApplicationGatewayFirewallLog"
| where hostname_s == "{host}"
| where action_s == 'Blocked' or isempty(action_s)
| join kind=leftouter ( AzureDiagnostics
| where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
| where Category == "ApplicationGatewayAccessLog"
| summarize by requestUri_s, httpStatus_d
) on requestUri_s
| summarize NoOfAlerts = count(), make_set(requestUri_s), DistinctURIs = dcount(requestUri_s) by clientIp_s, ruleId_s, httpStatus_d1
'''.format(startTime = query_times_2.start, endTime = query_times_2.end, host = hosts.value)
try:
df_distribution = showQuery(query_distribution)
df_distribution.rename(columns = {'clientIp_s':'Ip Address', 'ruleId_s':'Rule ID', 'set_requestUri_s': 'Request Uris'}, inplace = True)
if 'httpStatus_d1' in df_distribution.columns:
df_distribution = df_distribution.sort_values(by=['httpStatus_d1'], ascending = True).reset_index(drop = True)
df_distribution.rename(columns = {'httpStatus_d1':'Http status'}, inplace = True)
df_distribution['Http status'] = 'h: ' + df_distribution['Http status'].astype(str)
maskBits = maskBitsVal(len(df_distribution['Ip Address'].unique()))
df_distribution['Ip Address'] = df_distribution['Ip Address'].apply(lambda x: ipaddress.IPv4Network(x + maskBits, strict = False))
df_distribution['Ip Address'], df_distribution['Rule ID'] = 'Ip ' + df_distribution['Ip Address'].astype(str), 'rID ' + df_distribution['Rule ID'].astype(str)
except Exception as e:
print('Error: ' + e)
traceback.print_exc()
options = ['Ip Address', 'Rule ID']
if 'Http status' in df_distribution.columns:
options += ['Http status']
def viewBalloonPlot(x_axis, y_axis, display_rawResult):
try:
df_balloon_plot = (df_distribution
.groupby([x_axis, y_axis], as_index=False)
.agg({'NoOfAlerts': sum, 'DistinctURIs': sum, 'Request Uris': list})
.reset_index(drop = True))
fig = px.scatter(df_balloon_plot, x=df_balloon_plot[x_axis], y = df_balloon_plot[y_axis],
size= np.log(1 + df_balloon_plot['NoOfAlerts'] ), color = 'NoOfAlerts',
hover_data=['NoOfAlerts', 'DistinctURIs'])
fig.update_layout(height = max(300, 30 * len(set(df_balloon_plot[y_axis]))), title_text='Alert Distribution for host ID '+ str(hosts.value))
fig.show()
if display_rawResult == 'Yes':
print('Top 5 raw results with the highest number of alerts: \n')
df_balloon_plot['Request Uris'] = [np.unique(list(itertools.chain(*row['Request Uris']))) for index, row in df_balloon_plot.iterrows() ]
df_balloon_plot['DistinctURIs'] = df_balloon_plot['Request Uris'].str.len()
display(df_balloon_plot[[y_axis, x_axis, 'NoOfAlerts','Request Uris', 'DistinctURIs']].sort_values(by='NoOfAlerts', ascending = False).head(5))
except ValueError:
print('ValueError: Choose distinct x and y axes')
except Exception as e:
print('Error: ' + e)
traceback.print_exc()
x_axis = widgets.Select(options = options, style = style, layout = layout, description = 'Select x-axis: ')
y_axis = widgets.Select(options = options, style = style, layout = layout, description = 'Select y-axis: ')
display_rawResult = widgets.Select(options = ['Yes', 'No'], description = 'Display raw results: ')
md("Select graph properties:", "bold")
display(x_axis)
display(y_axis)
display(display_rawResult)
interactive(children=(Select(description='Select x-axis: ', layout=Layout(height='80px', width='50%'), options…
viewBalloonPlot, x_axis = x_axis.value,
y_axis = y_axis.value, display_rawResult = display_rawResult.value)
display(w)
Select the timeframe and host entity for this section of the notebook.
Recommended: Set a timeframe of >20 days
query_times_3 = nbwidgets.QueryTime(units='day', max_before=30, before=10, max_after=-1)
query_times_3.display()
HTML(value='<h4>Set query time boundaries</h4>')
HBox(children=(DatePicker(value=datetime.date(2020, 12, 4), description='Origin Date'), Text(value='10:32:42.8…
VBox(children=(IntRangeSlider(value=(-10, 1), description='Time Range (day):', layout=Layout(width='80%'), max…
df_host_2 = showQuery(queryHost(query_times_3.start, query_times_3.end))
df = df_host_2.groupby(['hostname_s']).agg({'AlertCountPerHost': sum}).rename(columns = {'AlertCountPerHost': 'Num_blocked_alerts'})
hosts_2 = widgets.Select(options=set([x for x in df_host_2['hostname_s']]), value=df['Num_blocked_alerts'].idxmax(), description = 'Select Host: ')
display(df)
display(hosts_2)
Num_blocked_alerts | |
---|---|
hostname_s | |
20 | |
112.124.42.80:63435 | 1 |
123.125.114.144 | 1 |
127.0.0.1 | 7 |
127.0.0.1:80 | 11 |
13.89.108.163 | 37341 |
13.89.108.163:80 | 317 |
<undefined> | 86 |
Select(description='Select Host: ', index=6, options=('', '127.0.0.1', '13.89.108.163:80', '112.124.42.80:6343…
Enter min_df and max_df value parameters
min_df: The min_df variable is used to eliminate terms that do not appear very frequently in our data. A min_df value of 0.01 implies eliminating terms that apear in less than 1% of the data.
max_df: The max_df variable eliminates terms that appear very frequently in our data. A max_df value of 0.9 implies eliminating terms that appear in more than 90% of the data.
For more information about these parameters in the TFIDF vectorizer, please see here
Note: In the case of errors running the code below for the two approaches (Request URIs split on "/" against the client IP entities OR Number of blocked alerts for every Rule ID against the client IPs), run the TFIDF vectoriser for ALL the data
If you would like to view the TFIDF scores for all the data, change the following code in the tfidfScores
function:
vectorizer = TfidfVectorizer(tokenizer=identity_tokenizer, lowercase=False, min_df = min_df_value, max_df = max_df_value)
to
vectorizer = TfidfVectorizer(tokenizer=identity_tokenizer, lowercase=False)
min_df_widget = widgets.Text(style = style, layout = widgets.Layout(width="50%", height="30px"), description = 'Enter min_df: ', placeholder = '% or Integer or None', value = '0.01')
max_df_widget = widgets.Text(style = style, layout = widgets.Layout(width="50%", height="30px"), description = 'Enter max_df: ', placeholder = '% or Integer or None', value = '0.9')
display(min_df_widget)
display(max_df_widget)
Text(value='0.01', description='Enter min_df: ', layout=Layout(height='30px', width='50%'), placeholder='% or …
Text(value='0.9', description='Enter max_df: ', layout=Layout(height='30px', width='50%'), placeholder='% or I…
try:
min_df_value = float(min_df_widget.value)
max_df_value = float(max_df_widget.value)
except Exception as e:
print('Error: ' + str(e))
traceback.print_exc()
def tfidfScores(df, tokenList = None):
def identity_tokenizer(text):
return text
vectorizer = TfidfVectorizer(tokenizer=identity_tokenizer, lowercase=False, min_df = min_df_value, max_df = max_df_value)
vectors = vectorizer.fit_transform(tokenList)
feature_names = vectorizer.get_feature_names()
dense = vectors.todense()
denselist = dense.tolist()
df_scores = pd.DataFrame(denselist, columns = feature_names)
multicol1 = pd.MultiIndex.from_tuples([('weight', str(j)) for j in df_scores.columns])
df_multiIndex = pd.DataFrame([list(df_scores.iloc[i]) for i in range(0, len(df_scores))], index=[df['Ip Address']], columns=multicol1)
return df_multiIndex
query_URIs = '''
AzureDiagnostics
| where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
| where Category == "ApplicationGatewayFirewallLog"
| where hostname_s startswith "{host}"
| where action_s == 'Blocked' or isempty(action_s)
| distinct clientIp_s, requestUri_s
| summarize make_list(requestUri_s) by clientIp_s
'''.format(startTime = query_times_3.start, endTime = query_times_3.end, host = hosts_2.value)
try:
df_URIs = showQuery(query_URIs)
df_URIs.rename(columns = {'clientIp_s':'Ip Address', 'list_requestUri_s': 'RequestUris'}, inplace = True)
viewData_splitUri = df_URIs.copy()
maskBits = maskBitsVal(len(viewData_splitUri['Ip Address'].unique()))
viewData_splitUri['Ip Address'] = viewData_splitUri['Ip Address'].apply(lambda x: ipaddress.IPv4Network(x + maskBits, strict = False))
viewData_splitUri.groupby(["Ip Address"], as_index=False).agg({'RequestUris': list})
tokenList = []
for index, row in viewData_splitUri.iterrows():
splitUris = re.split('/', ''.join(row['RequestUris']))
tokenList = tokenList + [splitUris]
df_splitUri_tfidf = tfidfScores(viewData_splitUri, tokenList)
except Exception as e:
print('Error: ' + str(e))
traceback.print_exc()
weight | |||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
!.php | $DefaultNav | ${@die(md5(HelloThinkPHP))} | %62%61%73%65 | %68%74%6D%6C%6F%66%66%69%63%65%73%65%72%76%6C%65%74 | %69%6D | %69%73%70%69%72%69%74 | %70%6F%73%74%2E%70%68%70 | %72%65%67%69%73%74%65%72?%65%6c%65%6d%65%6e%74%5f%70%61%72%65%6e%74%73=%74%69%6d%65%7a%6f%6e%65%2f%74%69%6d%65%7a%6f%6e%65%2f%23%76%61%6c%75%65\u0026%61%6a%61%78%5f%66%6f%72%6d=1\u0026%5f%77%72%61%70%70%65%72%5f%66%6f%72%6d%61%74=%64%72%75%70%61%6c%5f%61%6a%61%78 | %73%65%65%79%6F%6E | %75%70%6C%6F%61%64%2E%70%68%70 | %75%73%65%72 | %75%73%65%72%2e%70%68%70 | - | .bzr | .bzrignore | .config.php | .env | .git | .git.php | .gitignore | .hg | .hgignore | .htaccess | .htaccess.svn-base | ... | zencart | zhk.php | zhui.php | zikula | zip | zipfiles | zmp.php | zp-core | zshmindex.php | zuo.php | zuoindex.php | zuos.php | zuoshou.php | zuoshss.php | zuoss.php | zxc.php | zxc0.php | zxc1.php | zxc2.php | zxy.php | zyc.php | zz.php | zza.php | zzk.php | zzz.php | |
Ip Address | |||||||||||||||||||||||||||||||||||||||||||||||||||
20.51.0.0/16 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
45.249.0.0/16 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
20.57.0.0/16 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
20.51.0.0/16 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
91.241.0.0/16 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 2737 columns
query_RuleIds = '''
AzureDiagnostics
| where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
| where Category == "ApplicationGatewayFirewallLog"
| where hostname_s startswith "{host}"
| where action_s == 'Blocked'
| summarize alertCount = count(), make_set(requestUri_s) by clientIp_s, ruleId_s
'''.format(startTime = query_times_3.start, endTime = query_times_3.end, host = hosts_2.value)
try:
dfPrac = showQuery(query_RuleIds)
df_RuleIds = showQuery(query_RuleIds)
df_RuleIds.rename(columns = {'clientIp_s':'Ip Address', 'ruleId_s':'RuleId', 'set_requestUri_s': 'RequestUris'}, inplace = True)
maskBits = maskBitsVal(len(df_RuleIds['Ip Address'].unique()))
df_RuleIds['Ip Address'] = df_RuleIds['Ip Address'].apply(lambda x: ipaddress.IPv4Network(x + maskBits, strict = False))
viewData_ruleId = df_RuleIds.groupby(["Ip Address"], as_index=False).agg({'RuleId': list, 'alertCount': list, 'RequestUris': list})
tokenList = [sum([[s] * n for s, n in zip(viewData_ruleId['RuleId'][x], viewData_ruleId['alertCount'][x])], []) for x in range(0, len(viewData_ruleId))]
df_ruleId_tfidf = tfidfScores(viewData_ruleId, tokenList)
except Exception as e:
print('Error: ' + e)
traceback.print_exc()
weight | ||||
---|---|---|---|---|
949110 | 980130 | BlockGeoLocationChina | BlockInternetExplorer11 | |
Ip Address | ||||
3.16.0.0/16 | 0.707107 | 0.707107 | 0.0 | 0.0 |
3.236.0.0/16 | 0.707107 | 0.707107 | 0.0 | 0.0 |
13.65.0.0/16 | 0.707107 | 0.707107 | 0.0 | 0.0 |
13.68.0.0/16 | 0.707107 | 0.707107 | 0.0 | 0.0 |
13.84.0.0/16 | 0.707107 | 0.707107 | 0.0 | 0.0 |
We will be using balloon plots to view the TFIDF scores for the two approaches
options = ['RuleId', 'SplitUris']
def visualiseTFIDF(TfidfCategory):
try:
max_category = 30
df = pd.DataFrame()
if TfidfCategory == 'RuleId': df = df_ruleId_tfidf.copy()
else:
df = df_splitUri_tfidf.copy()
df_tfidf = df.iloc[:, : max_category].stack().reset_index(drop = False).rename(columns = {'level_1':TfidfCategory, 'weight':'tfidf'})
df_tfidf['Ip Address'] = 'Ip ' + df_tfidf['Ip Address'].astype(str)
if 'RuleId' == TfidfCategory:
df_tfidf['RuleId'] = 'rID ' + df_tfidf['RuleId'].astype(str)
else:
df_tfidf['SplitUris'] = df_tfidf['SplitUris'].apply(lambda x: (x[0:20]+ '...') if len(x)> 20 else x)
fig = px.scatter(df_tfidf, x = df_tfidf[TfidfCategory], y = df_tfidf['Ip Address'],
size= np.log(1 + df_tfidf['tfidf']), color = df_tfidf['tfidf'],
hover_data=[df_tfidf['tfidf']])
fig.update_layout(height = max(800, 20 * len(set(df_tfidf[TfidfCategory]))), title_text= 'TFIDF distribution of ' + TfidfCategory + ' against client IPs', width = 1700)
fig.show()
except Exception as e:
print('Error: ' + e)
traceback.print_exc()
TfidfCategory = widgets.Select(options = options, style = style, layout = layout, description = 'TFIDF approach: ')
display(TfidfCategory)
interactive(children=(Select(description='TFIDF approach: ', layout=Layout(height='80px', width='50%'), option…
visualiseTFIDF(TfidfCategory = TfidfCategory.value)
DBSCAN is a non-parametric density-based spatial clustering algorithm, which groups together points that are "closely packed" together. Points which lie in low density regions are marked as outliers. For more information, please see here. We use DBScan on our data in order to aggregate request URIs which are similar to each other, and surface unusual request URIs as outliers. The clustering uses the Tfidf scores data obtained for the rule ID and split URIs approaches respectively.
Select the eps and min_samples value for DBScan and n_components value for PCA below. More information about these parameters can be found here and here.
DBScan:
eps value: Eps value is a measure of the distance below which two points are considered neighbors.
min_samples: The minimum number of neighbors that a point should have in order to be classified as a core point. The core point is included in the min_samples count.
PCA: PCA is a dimensionality reduction technique that compresses the multivariate data into principal components, which describe most of the variation in the original dataset. In our case, we are able to better visualise the clubbing of similar and outlier request URIs by visualising the first two Principal components.
n_components: Number of principal components
eps_widget = widgets.Text(style = style, layout = widgets.Layout(width="50%", height="30px"), description = 'DBSCAN - Enter eps value', value = '0.4')
min_samples_widget = widgets.IntSlider(style = style, layout = widgets.Layout(width="50%", height="30px"), description='DBSCAN - Enter min samples', start=1, end=15, step=1, value=5)
n_components_widget = widgets.IntSlider(style = style, layout = widgets.Layout(width="50%", height="30px"), description='PCA - Enter n_components', start=1, end=15, step=1, value=2)
display(eps_widget)
display(min_samples_widget)
display(n_components_widget)
Text(value='0.4', description='DBSCAN : Enter eps value', layout=Layout(height='30px', width='50%'), style=Des…
IntSlider(value=5, description='DBSCAN : Enter min samples', layout=Layout(height='30px', width='50%'), style=…
IntSlider(value=2, description='PCA : Enter n_components', layout=Layout(height='30px', width='50%'), style=Sl…
def db_scan_clustering(data, eps = float(eps_widget.value)):
dbscan = DBSCAN(eps=eps, min_samples = int(min_samples.value))
dbscan.fit(data)
return dbscan.labels_
def principal_component_analysis(data, eps = float(eps_widget.value)):
while True:
try:
pca = PCA(n_components=int(n_components_widget.value))
pca.fit(data)
x_pca = pca.transform(data)
break
except:
continue
clusters = db_scan_clustering(data.values, eps)
label = list(range(0, len(data), 1))
plt.figure(figsize=(20,15))
scatter = plt.scatter(x_pca[:,0],x_pca[:,1],c = clusters,cmap='rainbow')
handles, labels = scatter.legend_elements(prop="colors", alpha=0.6)
plt.legend(handles, labels, loc="upper right", title="Clusters")
n = list(range(0, len(x_pca[:,0]), 1))
texts = []
for i, txt in enumerate(n):
texts.append(plt.text(x_pca[:,0][i], x_pca[:,1][i], txt))
adjust_text(texts)
plt.show()
options1 = ['RuleId', 'SplitUris']
def viewPCA(tfidfCategory):
df = df_splitUri_tfidf.copy()
viewData = viewData_splitUri.copy()
if tfidfCategory == 'RuleId':
df = df_ruleId_tfidf.copy()
viewData = viewData_ruleId.copy()
print(tfidfCategory + ' approach (Outliers + Clustered request URI data): \n')
while True:
try:
principal_component_analysis(df)
break
except:
continue
print(color.BOLD + 'Principal Component Analysis \n' + color.END)
tfidfCategory = widgets.Select(options = options1, style = style, layout = layout, description = 'TFIDF approach: ')
display(tfidfCategory)
Principal Component Analysis
interactive(children=(Select(description='TFIDF approach: ', layout=Layout(height='80px', width='50%'), option…
viewPCA(tfidfCategory = tfidfCategory.value)
options1 = ['RuleId', 'SplitUris']
options2 = ['Outlier', 'Clustered']
def viewClusters(tfidfCategory, requestURIs):
try:
df = df_splitUri_tfidf.copy()
viewData = viewData_splitUri.copy()
if tfidfCategory == 'RuleId':
df = df_ruleId_tfidf.copy()
viewData = viewData_ruleId.copy()
clusters = db_scan_clustering(df.values)
print(requestURIs + ' URIs for ' + tfidfCategory+ ': \n')
clusterList = list(set(clusters))
try:
clusterList.remove(-1)
except:
print()
if requestURIs == 'Outlier':
clusterList = [-1]
if clusterList:
for k in clusterList:
print('Cluster ' + str(k))
display(viewData[viewData['Ip Address'].isin(df.index.get_level_values(0)[clusters == k])])
else:
print('No Data')
except Exception as e:
print('Error: ' + e)
traceback.print_exc()
print(color.BOLD + 'DBScan Clustering of the Request URIs \n' + color.END)
tfidfCategory = widgets.Select(options = options1, style = style, layout = layout, description = 'TFIDF approach: ')
requestURIs = widgets.Select(options = options2, style = style, layout = layout, description = 'Request URIs: ')
display(tfidfCategory)
display(requestURIs)
DBScan Clustering of the Request URIs
interactive(children=(Select(description='TFIDF approach: ', layout=Layout(height='80px', width='50%'), option…
viewClusters, tfidfCategory = widgets.Select(options = options1, style = style, layout = layout, description = 'TFIDF approach: '), requestURIs = widgets.Select(options = options2, style = style, layout = layout, description = 'Request URIs: ') ))
ipAddress = widgets.Text(style = style, layout = widgets.Layout(width="50%", height="30px"), description = 'IP address: ', placeholder = 'Enter masked IP address from the results above. Include masking bits.')
requestURI = widgets.Text(style = style, layout = widgets.Layout(width="50%", height="30px"), description = 'Request URI: ', placeholder = 'Enter request URI from the results above')
print(color.BOLD + '\nStart time: ' + color.END + str(query_times_3.start) + '\n')
print(color.BOLD + 'End time: ' + color.END + str(query_times_3.end) + '\n')
display(ipAddress)
display(requestURI)
Start time: 2020-11-04 10:32:42.885697 End time: 2020-11-26 10:32:42.885697
Text(value='', description='IP address: ', layout=Layout(height='30px', width='50%'), placeholder='Enter maske…
Text(value='', description='Request URI: ', layout=Layout(height='30px', width='50%'), placeholder='Enter requ…
try:
pd.set_option('display.max_colwidth', 20)
kql_query = '''
AzureDiagnostics
| where TimeGenerated between (datetime({startTime}).. datetime({endTime}))
| where Category == "{category}"
| where {hostname} startswith "{host}"
| where action_s == 'Blocked' or isempty(action_s)
| where {ip} startswith "{ipaddress}"
| extend originalRequestUriWithArgs_s = column_ifexists("originalRequestUriWithArgs_s", "")
| where requestUri_s contains {uri} or originalRequestUriWithArgs_s contains {uri}
| take 10
'''
cutOff = [1, 2, 3, 4]
intlist = [8, 16, 24, 32]
if ipAddress.value != '':
ipaddress = str(ipAddress.value).strip().split('/')[0]
maskBits = int(str(ipAddress.value).strip().split('/')[1])
ipaddress = '.'.join(ipaddress.split('.')[0:cutOff[intlist.index(maskBits)]])
else:
ipaddress = ''
print(color.BOLD + '\nStart time: ' + color.END + str(query_times_3.start) + '\n')
print(color.BOLD + 'End time: '+ color.END + str(query_times_3.end) + '\n')
print(color.BOLD + 'Ip Address entered: ' + color.END + str(ipAddress.value) + '\n')
print(color.BOLD + 'Request Uri entered: ' + color.END + str((requestURI.value).strip()) + '\n' )
category = 'ApplicationGatewayAccessLog'
ip_var = 'clientIP_s'
host_var = 'host_s'
uri = '\'' + (requestURI.value).strip() + '\''
kql_accessLogs = kql_query.format(hostname = host_var, startTime = query_times_3.start, endTime = query_times_3.end, host = hosts_2.value, category = category, ip = ip_var, ipaddress = ipaddress, uri = uri)
df_rawAccessKustoQuery = showQuery(kql_accessLogs)
print(category + ' (Raw) Data- \n')
display(df_rawAccessKustoQuery.head(10))
category = 'ApplicationGatewayFirewallLog'
ip_var = 'clientIp_s'
host_var = 'hostname_s'
uri = '@' + '\'' + (requestURI.value).strip() + '\''
kql_firewallLogs = kql_query.format(hostname = host_var, startTime = query_times_3.start, endTime = query_times_3.end, host = hosts_2.value, category = category, ip = ip_var, ipaddress = ipaddress, uri = uri,)
df_rawFirewallKustoQuery = showQuery(kql_firewallLogs)
print(category + ' (Alert) Data- \n')
display(df_rawFirewallKustoQuery.head(10))
pd.reset_option('max_colwidth')
except Exception as e:
print('Error: ' + str(e))
traceback.print_exc()
Start time: 2020-11-04 10:32:42.885697 End time: 2020-11-26 10:32:42.885697 Ip Address entered: 108.4.0.0/16 Request Uri entered: \\xcc\\xb2\\xcc\\x85]-1572603645543.jpg
ApplicationGatewayAccessLog (Raw) Data-
ApplicationGatewayFirewallLog (Alert) Data-
TenantId | TimeGenerated | ResourceId | Category | ResourceGroup | SubscriptionId | ResourceProvider | Resource | ResourceType | OperationName | requestUri_s | Message | instanceId_s | SourceSystem | ruleSetType_s | ruleSetVersion_s | ruleId_s | action_s | site_s | details_message_s | details_file_s | details_line_s | hostname_s | transactionId_g | policyId_s | policyScope_s | policyScopeName_s | timeStamp_t | clientIp_s | Type | _ResourceId | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8ecf8077-cf51-4820-aadd-14040956f35d | 2020-11-10 16:17:16.533000+00:00 | /SUBSCRIPTIONS/D1D8779D-38D7-4F06-91DB-9CBC8DE0176F/RESOURCEGROUPS/SOC-NS/PROVIDERS/MICROSOFT.NETWORK/APPLICATIONGATEWAYS/SOC-NS-AG-WAFV2 | ApplicationGatewayFirewallLog | SOC-NS | d1d8779d-38d7-4f06-91db-9cbc8de0176f | MICROSOFT.NETWORK | SOC-NS-AG-WAFV2 | APPLICATIONGATEWAYS | ApplicationGatewayFirewall | /assets/public/images/uploads/my-rare-collectors-item!-[\\xcc\\xb2\\xcc\\x85$\\xcc\\xb2\\xcc\\x85(\\xcc\\xb2\\xcc\\x85-\\xcd\\xa1\\xc2\\xb0-\\xcd\\x9c\\xca\\x96-\\xcd\\xa1\\xc2\\xb0\\xcc\\xb2\\xcc\\x85)\\xcc\\xb2\\xcc\\x85$\\xcc\\xb2\\xcc\\x85]-1572603645543.jpg | Mandatory rule. Cannot be disabled. Inbound Anomaly Score Exceeded (Total Score: 5) | appgw_1 | Azure | OWASP_CRS | 3.1.0 | 949110 | Blocked | Global | Access denied with code 403 (phase 2). Operator GE matched 5 at TX:anomaly_score. | rules/REQUEST-949-BLOCKING-EVALUATION.conf | 93 | 13.89.108.163 | ecdc12b7-045a-1063-0485-9ca508f6fd2b | default | Global | Global | 2020-11-10 16:16:00+00:00 | 108.4.232.173 | AzureDiagnostics | /subscriptions/d1d8779d-38d7-4f06-91db-9cbc8de0176f/resourcegroups/soc-ns/providers/microsoft.network/applicationgateways/soc-ns-ag-wafv2 |
1 | 8ecf8077-cf51-4820-aadd-14040956f35d | 2020-11-10 16:17:16.533000+00:00 | /SUBSCRIPTIONS/D1D8779D-38D7-4F06-91DB-9CBC8DE0176F/RESOURCEGROUPS/SOC-NS/PROVIDERS/MICROSOFT.NETWORK/APPLICATIONGATEWAYS/SOC-NS-AG-WAFV2 | ApplicationGatewayFirewallLog | SOC-NS | d1d8779d-38d7-4f06-91db-9cbc8de0176f | MICROSOFT.NETWORK | SOC-NS-AG-WAFV2 | APPLICATIONGATEWAYS | ApplicationGatewayFirewall | /assets/public/images/uploads/my-rare-collectors-item!-[\\xcc\\xb2\\xcc\\x85$\\xcc\\xb2\\xcc\\x85(\\xcc\\xb2\\xcc\\x85-\\xcd\\xa1\\xc2\\xb0-\\xcd\\x9c\\xca\\x96-\\xcd\\xa1\\xc2\\xb0\\xcc\\xb2\\xcc\\x85)\\xcc\\xb2\\xcc\\x85$\\xcc\\xb2\\xcc\\x85]-1572603645543.jpg | Mandatory rule. Cannot be disabled. Inbound Anomaly Score Exceeded (Total Inbound Score: 5 - SQLI=0,XSS=0,RFI=0,LFI=0,RCE=0,PHPI=5,HTTP=0,SESS=0): PHP Injection Attack: Variable Function Call Found; individual paranoia level scores: 5, 0, 0, 0 | appgw_1 | Azure | OWASP_CRS | 3.1.0 | 980130 | Blocked | Global | Warning. Operator GE matched 5 at TX:inbound_anomaly_score. | rules/RESPONSE-980-CORRELATION.conf | 86 | 13.89.108.163 | ecdc12b7-045a-1063-0485-9ca508f6fd2b | default | Global | Global | 2020-11-10 16:16:00+00:00 | 108.4.232.173 | AzureDiagnostics | /subscriptions/d1d8779d-38d7-4f06-91db-9cbc8de0176f/resourcegroups/soc-ns/providers/microsoft.network/applicationgateways/soc-ns-ag-wafv2 |