You can install it via pip: pip install attackcti
from attackcti import attack_client
import pandas
from pandas.io.json import json_normalize
import numpy as np
import altair as alt
alt.renderers.enable('notebook')
import itertools
lift = attack_client()
Getting ALL ATT&CK Techniques
all_techniques = lift.get_techniques(stix_format=False)
Showing the first technique in our list
all_techniques[0]
{'external_references': [{'external_id': 'T1500', 'source_name': 'mitre-attack', 'url': 'https://attack.mitre.org/techniques/T1500'}, {'url': 'https://www.clearskysec.com/wp-content/uploads/2018/11/MuddyWater-Operations-in-Lebanon-and-Oman.pdf', 'source_name': 'ClearSky MuddyWater Nov 2018', 'description': 'ClearSky Cyber Security. (2018, November). MuddyWater Operations in Lebanon and Oman: Using an Israeli compromised domain for a two-stage campaign. Retrieved November 29, 2018.'}, {'url': 'https://blog.trendmicro.com/trendlabs-security-intelligence/windows-app-runs-on-mac-downloads-info-stealer-and-adware/', 'source_name': 'TrendMicro WindowsAppMac', 'description': 'Trend Micro. (2019, February 11). Windows App Runs on Mac, Downloads Info Stealer and Adware. Retrieved April 25, 2019.'}], 'kill_chain_phases': [{'phase_name': 'defense-evasion', 'kill_chain_name': 'mitre-attack'}], 'x_mitre_version': '1.0', 'url': 'https://attack.mitre.org/techniques/T1500', 'matrix': 'mitre-attack', 'technique_id': 'T1500', 'object_marking_refs': ['marking-definition--fa42a846-8d90-4e51-bc29-71d5b4802168'], 'type': 'attack-pattern', 'modified': '2019-04-29T21:13:49.686Z', 'created_by_ref': 'identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5', 'tactic': ['defense-evasion'], 'id': 'attack-pattern--cf7b3a06-8b42-4c33-bbe9-012120027925', 'technique': 'Compile After Delivery', 'created': '2019-04-25T20:53:07.719Z', 'technique_description': 'Adversaries may attempt to make payloads difficult to discover and analyze by delivering files to victims as uncompiled code. Similar to [Obfuscated Files or Information](https://attack.mitre.org/techniques/T1027), text-based source code files may subvert analysis and scrutiny from protections targeting executables/binaries. These payloads will need to be compiled before execution; typically via native utilities such as csc.exe or GCC/MinGW.(Citation: ClearSky MuddyWater Nov 2018)\n\nSource code payloads may also be encrypted, encoded, and/or embedded within other files, such as those delivered as a [Spearphishing Attachment](https://attack.mitre.org/techniques/T1193). Payloads may also be delivered in formats unrecognizable and inherently benign to the native OS (ex: EXEs on macOS/Linux) before later being (re)compiled into a proper executable binary with a bundled compiler and execution framework.(Citation: TrendMicro WindowsAppMac)\n', 'contributors': ['Ye Yint Min Thu Htut, Offensive Security Team, DBS Bank', 'Praetorian'], 'permissions_required': ['User'], 'data_sources': ['Process command-line parameters', 'Process monitoring', 'File monitoring'], 'technique_detection': 'Monitor the execution file paths and command-line arguments for common compilers, such as csc.exe and GCC/MinGW, and correlate with other suspicious behavior to reduce false positives from normal user and administrator behavior. The compilation of payloads may also generate file creation and/or file write events. Look for non-native binary formats and cross-platform compiler and execution frameworks like Mono and determine if they have a legitimate purpose on the system.(Citation: TrendMicro WindowsAppMac) Typically these should only be used in specific and limited cases, like for software development.', 'platform': ['Linux', 'macOS', 'Windows'], 'system_requirements': ['Compiler software (either native to the system or delivered by the adversary)'], 'defense_bypassed': ['Static File Analysis', 'Binary Analysis', 'Anti-virus', 'Host intrusion prevention systems', 'Signature-based detection']}
Normalizing semi-structured JSON data into a flat table via pandas.io.json.json_normalize
techniques_normalized = json_normalize(all_techniques)
techniques_normalized[0:1]
external_references | kill_chain_phases | x_mitre_version | url | matrix | technique_id | object_marking_refs | type | modified | created_by_ref | ... | effective_permissions | network_requirements | x_mitre_old_attack_id | detectable_by_common_defenses | difficulty_explanation | difficulty_for_adversary | detectable_explanation | x_mitre_deprecated | tactic_type | revoked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | [{'external_id': 'T1500', 'source_name': 'mitr... | [{'phase_name': 'defense-evasion', 'kill_chain... | 1.0 | https://attack.mitre.org/techniques/T1500 | mitre-attack | T1500 | [marking-definition--fa42a846-8d90-4e51-bc29-7... | attack-pattern | 2019-04-29T21:13:49.686Z | identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 36 columns
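As a minimal sketch of what json_normalize does (toy records here, not real ATT&CK objects; in newer pandas versions the function is exposed directly as pandas.json_normalize):

```python
import pandas as pd

# Toy semi-structured records (hypothetical, for illustration only)
records = [
    {'technique': 'A', 'meta': {'id': 'T0001', 'version': '1.0'}},
    {'technique': 'B', 'meta': {'id': 'T0002', 'version': '1.1'}},
]

flat = pd.json_normalize(records)
# Nested keys become dotted column names: 'meta.id', 'meta.version'
print(sorted(flat.columns))  # ['meta.id', 'meta.version', 'technique']
```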
techniques = techniques_normalized.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
techniques.head()
matrix | platform | tactic | technique | technique_id | data_sources | |
---|---|---|---|---|---|---|
0 | mitre-attack | [Linux, macOS, Windows] | [defense-evasion] | Compile After Delivery | T1500 | [Process command-line parameters, Process moni... |
1 | mitre-attack | [Linux] | [persistence] | Systemd Service | T1501 | [Process command-line parameters, Process moni... |
2 | mitre-attack | [Linux, macOS, Windows] | [impact] | Endpoint Denial of Service | T1499 | [SSL/TLS inspection, Web logs, Web application... |
3 | mitre-attack | [Windows] | [defense-evasion, discovery] | Virtualization/Sandbox Evasion | T1497 | [Process monitoring, Process command-line para... |
4 | mitre-attack | [Linux, macOS, Windows] | [impact] | Network Denial of Service | T1498 | [Sensor health and status, Network protocol an... |
print('A total of ',len(techniques),' techniques')
A total of 500 techniques
all_techniques_no_revoked = lift.remove_revoked(all_techniques)
print('A total of ',len(all_techniques_no_revoked),' techniques')
A total of 485 techniques
all_techniques_revoked = lift.extract_revoked(all_techniques)
print('A total of ',len(all_techniques_revoked),' techniques that have been revoked')
A total of 15 techniques that have been revoked
The revoked techniques are the following ones:
for t in all_techniques_revoked:
print(t['technique'])
Remotely Install Application
Insecure Third-Party Libraries
Fake Developer Accounts
Detect App Analysis Environment
Malicious Software Development Tools
Biometric Spoofing
Device Unlock Code Guessing or Brute Force
Malicious Media Content
Abuse of iOS Enterprise App Signing Key
App Delivered via Web Download
App Delivered via Email Attachment
Malicious or Vulnerable Built-in Device Functionality
Malicious SMS Message
Exploit Baseband Vulnerability
Stolen Developer Credentials or Signing Keys
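The remove_revoked and extract_revoked helpers handle this filtering for us; conceptually, the same partition can be sketched by checking each object's revoked flag (an assumption about the dict layout, shown here on hypothetical sample data):

```python
def split_revoked(techniques):
    """Partition technique dicts by their 'revoked' flag (assumed key)."""
    active = [t for t in techniques if not t.get('revoked', False)]
    revoked = [t for t in techniques if t.get('revoked', False)]
    return active, revoked

# Hypothetical sample, for illustration only
sample = [{'technique': 'A'}, {'technique': 'B', 'revoked': True}]
active, revoked = split_revoked(sample)
print(len(active), len(revoked))  # 1 1
```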
techniques_normalized = json_normalize(all_techniques_no_revoked)
techniques = techniques_normalized.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
Using the Altair Python library, we can start showing a few charts stacking the number of techniques with or without data sources. Reference: https://altair-viz.github.io/
data = techniques
data_2 = data.groupby(['matrix'])['technique'].count()
data_3 = data_2.to_frame().reset_index()
data_3
matrix | technique | |
---|---|---|
0 | mitre-attack | 244 |
1 | mitre-mobile-attack | 67 |
2 | mitre-pre-attack | 174 |
alt.Chart(data_3).mark_bar().encode(x='technique', y='matrix', color='matrix').properties(height = 200)
data_source_distribution = pandas.DataFrame({
'Techniques': ['Without DS','With DS'],
'Count of Techniques': [techniques['data_sources'].isna().sum(),techniques['data_sources'].notna().sum()]})
bars = alt.Chart(data_source_distribution).mark_bar().encode(x='Techniques',y='Count of Techniques',color='Techniques').properties(width=200,height=300)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text
What is the distribution of techniques based on ATT&CK Matrix?
data = techniques
data['Count_DS'] = data['data_sources'].str.len()
data['Ind_DS'] = np.where(data['Count_DS']>0,'With DS','Without DS')
data_2 = data.groupby(['matrix','Ind_DS'])['technique'].count()
data_3 = data_2.to_frame().reset_index()
data_3
matrix | Ind_DS | technique | |
---|---|---|---|
0 | mitre-attack | With DS | 240 |
1 | mitre-attack | Without DS | 4 |
2 | mitre-mobile-attack | Without DS | 67 |
3 | mitre-pre-attack | Without DS | 174 |
alt.Chart(data_3).mark_bar().encode(x='technique', y='Ind_DS', color='matrix').properties(height = 200)
What are those mitre-attack techniques without data sources?
data[(data['matrix']=='mitre-attack') & (data['Ind_DS']=='Without DS')]
matrix | platform | tactic | technique | technique_id | data_sources | Count_DS | Ind_DS | |
---|---|---|---|---|---|---|---|---|
54 | mitre-attack | [Linux, macOS] | [defense-evasion, persistence, command-and-con... | Port Knocking | T1205 | NaN | NaN | Without DS |
104 | mitre-attack | [macOS] | [defense-evasion] | Gatekeeper Bypass | T1144 | NaN | NaN | Without DS |
107 | mitre-attack | [macOS] | [persistence] | Re-opened Applications | T1164 | NaN | NaN | Without DS |
124 | mitre-attack | [Windows] | [discovery] | Peripheral Device Discovery | T1120 | NaN | NaN | Without DS |
techniques_without_data_sources=techniques[techniques.data_sources.isnull()].reset_index(drop=True)
techniques_without_data_sources.head()
matrix | platform | tactic | technique | technique_id | data_sources | Count_DS | Ind_DS | |
---|---|---|---|---|---|---|---|---|
0 | mitre-attack | [Linux, macOS] | [defense-evasion, persistence, command-and-con... | Port Knocking | T1205 | NaN | NaN | Without DS |
1 | mitre-attack | [macOS] | [defense-evasion] | Gatekeeper Bypass | T1144 | NaN | NaN | Without DS |
2 | mitre-attack | [macOS] | [persistence] | Re-opened Applications | T1164 | NaN | NaN | Without DS |
3 | mitre-attack | [Windows] | [discovery] | Peripheral Device Discovery | T1120 | NaN | NaN | Without DS |
4 | mitre-pre-attack | NaN | [technical-information-gathering] | Spearphishing for Information | T1397 | NaN | NaN | Without DS |
print('There are ',techniques['data_sources'].isna().sum(),' techniques without data sources (',"{0:.0%}".format(techniques['data_sources'].isna().sum()/len(techniques)),' of ',len(techniques),' techniques)')
There are 245 techniques without data sources ( 51% of 485 techniques)
techniques_with_data_sources=techniques[techniques.data_sources.notnull()].reset_index(drop=True)
techniques_with_data_sources.head()
matrix | platform | tactic | technique | technique_id | data_sources | Count_DS | Ind_DS | |
---|---|---|---|---|---|---|---|---|
0 | mitre-attack | [Linux, macOS, Windows] | [defense-evasion] | Compile After Delivery | T1500 | [Process command-line parameters, Process moni... | 3.0 | With DS |
1 | mitre-attack | [Linux] | [persistence] | Systemd Service | T1501 | [Process command-line parameters, Process moni... | 3.0 | With DS |
2 | mitre-attack | [Linux, macOS, Windows] | [impact] | Endpoint Denial of Service | T1499 | [SSL/TLS inspection, Web logs, Web application... | 7.0 | With DS |
3 | mitre-attack | [Windows] | [defense-evasion, discovery] | Virtualization/Sandbox Evasion | T1497 | [Process monitoring, Process command-line para... | 2.0 | With DS |
4 | mitre-attack | [Linux, macOS, Windows] | [impact] | Network Denial of Service | T1498 | [Sensor health and status, Network protocol an... | 5.0 | With DS |
print('There are ',techniques['data_sources'].notna().sum(),' techniques with data sources (',"{0:.0%}".format(techniques['data_sources'].notna().sum()/len(techniques)),' of ',len(techniques),' techniques)')
There are 240 techniques with data sources ( 49% of 485 techniques)
Let's create a graph to represent the number of techniques per matrix:
matrix_distribution = pandas.DataFrame({
'Matrix': list(techniques_with_data_sources.groupby(['matrix'])['matrix'].count().keys()),
'Count of Techniques': techniques_with_data_sources.groupby(['matrix'])['matrix'].count().tolist()})
bars = alt.Chart(matrix_distribution).mark_bar().encode(y='Matrix',x='Count of Techniques').properties(width=300,height=100)
text = bars.mark_text(align='center',baseline='middle',dx=10,dy=0).encode(text='Count of Techniques')
bars + text
All the techniques with data sources belong to the mitre-attack matrix, which is the main Enterprise matrix. Reference: https://attack.mitre.org/wiki/Main_Page
First, we need to split the platform column values because a technique might be mapped to more than one platform
techniques_platform=techniques_with_data_sources
attributes_1 = ['platform'] # In attributes we are going to indicate the name of the columns that we need to split
for a in attributes_1:
s = techniques_platform.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
# "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
s.name = a
# We name "s" with the same name of "a".
techniques_platform=techniques_platform.drop(a, axis=1).join(s).reset_index(drop=True)
# We drop the column "a" from "techniques_platform", and then join "techniques_platform" with "s"
# Let's re-arrange the columns from general to specific
techniques_platform_2=techniques_platform.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
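In newer pandas versions (0.25 and later), the same list-splitting can be done more concisely with DataFrame.explode; a sketch on toy data:

```python
import pandas as pd

# Toy frame mirroring the structure above (hypothetical values)
df = pd.DataFrame({
    'technique': ['Compile After Delivery'],
    'platform': [['Linux', 'macOS', 'Windows']],
})

# One row per (technique, platform) pair
exploded = df.explode('platform').reset_index(drop=True)
print(exploded['platform'].tolist())  # ['Linux', 'macOS', 'Windows']
```

The same call applies to the tactic and data_sources columns used below.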
We can now show techniques with data sources mapped to one platform at a time
techniques_platform_2.head()
matrix | platform | tactic | technique | technique_id | data_sources | |
---|---|---|---|---|---|---|
0 | mitre-attack | Linux | [defense-evasion] | Compile After Delivery | T1500 | [Process command-line parameters, Process moni... |
1 | mitre-attack | macOS | [defense-evasion] | Compile After Delivery | T1500 | [Process command-line parameters, Process moni... |
2 | mitre-attack | Windows | [defense-evasion] | Compile After Delivery | T1500 | [Process command-line parameters, Process moni... |
3 | mitre-attack | Linux | [persistence] | Systemd Service | T1501 | [Process command-line parameters, Process moni... |
4 | mitre-attack | Linux | [impact] | Endpoint Denial of Service | T1499 | [SSL/TLS inspection, Web logs, Web application... |
Let's create a visualization to show the number of techniques grouped by platform:
platform_distribution = pandas.DataFrame({
'Platform': list(techniques_platform_2.groupby(['platform'])['platform'].count().keys()),
'Count of Techniques': techniques_platform_2.groupby(['platform'])['platform'].count().tolist()})
bars = alt.Chart(platform_distribution).mark_bar().encode(x='Platform',y='Count of Techniques',color='Platform').properties(width=200,height=300)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text
In the bar chart above we can see that the Windows platform has the highest number of techniques with data sources mapped to it.
Again, first we need to split the tactic column values because a technique might be mapped to more than one tactic:
techniques_tactic=techniques_with_data_sources
attributes_2 = ['tactic'] # In attributes we are going to indicate the name of the columns that we need to split
for a in attributes_2:
s = techniques_tactic.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
# "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
s.name = a
# We name "s" with the same name of "a".
techniques_tactic = techniques_tactic.drop(a, axis=1).join(s).reset_index(drop=True)
# We drop the column "a" from "techniques_tactic", and then join "techniques_tactic" with "s"
# Let's re-arrange the columns from general to specific
techniques_tactic_2=techniques_tactic.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
We can now show techniques with data sources mapped to one tactic at a time
techniques_tactic_2.head()
matrix | platform | tactic | technique | technique_id | data_sources | |
---|---|---|---|---|---|---|
0 | mitre-attack | [Linux, macOS, Windows] | defense-evasion | Compile After Delivery | T1500 | [Process command-line parameters, Process moni... |
1 | mitre-attack | [Linux] | persistence | Systemd Service | T1501 | [Process command-line parameters, Process moni... |
2 | mitre-attack | [Linux, macOS, Windows] | impact | Endpoint Denial of Service | T1499 | [SSL/TLS inspection, Web logs, Web application... |
3 | mitre-attack | [Windows] | defense-evasion | Virtualization/Sandbox Evasion | T1497 | [Process monitoring, Process command-line para... |
4 | mitre-attack | [Windows] | discovery | Virtualization/Sandbox Evasion | T1497 | [Process monitoring, Process command-line para... |
Let's create a visualization to show the number of techniques grouped by tactic:
tactic_distribution = pandas.DataFrame({
'Tactic': list(techniques_tactic_2.groupby(['tactic'])['tactic'].count().keys()),
'Count of Techniques': techniques_tactic_2.groupby(['tactic'])['tactic'].count().tolist()}).sort_values(by='Count of Techniques',ascending=True)
bars = alt.Chart(tactic_distribution).mark_bar().encode(x='Tactic',y='Count of Techniques',color='Tactic').properties(width=400,height=300)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text
Defense-evasion and Persistence are the tactics with the highest number of techniques with data sources
We need to split the data source column values because a technique might be mapped to more than one data source:
techniques_data_source=techniques_with_data_sources
attributes_3 = ['data_sources'] # In attributes we are going to indicate the name of the columns that we need to split
for a in attributes_3:
s = techniques_data_source.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
# "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
s.name = a
# We name "s" with the same name of "a".
techniques_data_source = techniques_data_source.drop(a, axis=1).join(s).reset_index(drop=True)
# We drop the column "a" from "techniques_data_source", and then join "techniques_data_source" with "s"
# Let's re-arrange the columns from general to specific
techniques_data_source_2 = techniques_data_source.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
# We are going to edit some names inside the dataframe to improve the consistency:
techniques_data_source_3 = techniques_data_source_2.replace(['Process monitoring','Application logs'],['Process Monitoring','Application Logs'])
We can now show techniques with data sources mapped to one data source at a time
techniques_data_source_3.head()
matrix | platform | tactic | technique | technique_id | data_sources | |
---|---|---|---|---|---|---|
0 | mitre-attack | [Linux, macOS, Windows] | [defense-evasion] | Compile After Delivery | T1500 | Process command-line parameters |
1 | mitre-attack | [Linux, macOS, Windows] | [defense-evasion] | Compile After Delivery | T1500 | Process Monitoring |
2 | mitre-attack | [Linux, macOS, Windows] | [defense-evasion] | Compile After Delivery | T1500 | File monitoring |
3 | mitre-attack | [Linux] | [persistence] | Systemd Service | T1501 | Process command-line parameters |
4 | mitre-attack | [Linux] | [persistence] | Systemd Service | T1501 | Process Monitoring |
Let's create a visualization to show the number of techniques grouped by data sources:
data_source_distribution = pandas.DataFrame({
'Data Source': list(techniques_data_source_3.groupby(['data_sources'])['data_sources'].count().keys()),
'Count of Techniques': techniques_data_source_3.groupby(['data_sources'])['data_sources'].count().tolist()})
bars = alt.Chart(data_source_distribution).mark_bar().encode(x='Data Source',y='Count of Techniques',color='Data Source').properties(width=1200,height=300)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text
A few interesting things from the bar chart above:
Although identifying the data sources with the highest number of techniques is a good start, data sources usually do not work alone. You might already be collecting Process Monitoring, but you could still be missing a lot of context from a data perspective.
data_source_distribution_2 = pandas.DataFrame({
'Techniques': list(techniques_data_source_3.groupby(['technique'])['technique'].count().keys()),
'Count of Data Sources': techniques_data_source_3.groupby(['technique'])['technique'].count().tolist()})
data_source_distribution_3 = pandas.DataFrame({
'Number of Data Sources': list(data_source_distribution_2.groupby(['Count of Data Sources'])['Count of Data Sources'].count().keys()),
'Count of Techniques': data_source_distribution_2.groupby(['Count of Data Sources'])['Count of Data Sources'].count().tolist()})
bars = alt.Chart(data_source_distribution_3).mark_bar().encode(x ='Number of Data Sources',y='Count of Techniques').properties(width=500)
text = bars.mark_text(align='center',baseline='middle',dx=0,dy=-5).encode(text='Count of Techniques')
bars + text
The chart above shows the number of data sources needed per technique according to ATT&CK:
Let's create subsets of data sources from the data sources column by defining and applying a Python function:
# https://stackoverflow.com/questions/26332412/python-recursive-function-to-display-all-subsets-of-given-set
def subs(l):
res = []
for i in range(1, len(l) + 1):
for combo in itertools.combinations(l, i):
res.append(list(combo))
return res
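A quick sanity check of the function (redefined here so the snippet is self-contained): a two-element list yields its three non-empty subsets.

```python
import itertools

def subs(l):
    # All non-empty subsets of a list, ordered by subset size
    res = []
    for i in range(1, len(l) + 1):
        for combo in itertools.combinations(l, i):
            res.append(list(combo))
    return res

print(subs(['a', 'b']))  # [['a'], ['b'], ['a', 'b']]
```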
Before applying the function, we lowercase and sort the data source names to improve consistency:
df = techniques_with_data_sources[['data_sources']]
for index, row in df.iterrows():
row["data_sources"]=[x.lower() for x in row["data_sources"]]
row["data_sources"].sort()
df.head()
data_sources | |
---|---|
0 | [file monitoring, process command-line paramet... |
1 | [file monitoring, process command-line paramet... |
2 | [netflow/enclave netflow, network device logs,... |
3 | [process command-line parameters, process moni... |
4 | [netflow/enclave netflow, network device logs,... |
Let's apply the function and split the subsets column:
df['subsets']=df['data_sources'].apply(subs)
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy """Entry point for launching an IPython kernel.
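The warning above is raised because df is a slice of techniques_with_data_sources and we mutate it in place. One way to avoid it is to take an explicit .copy() and build the columns with .apply; a sketch on toy data (the real frame comes from attackcti):

```python
import itertools
import pandas as pd

def subs(l):
    # All non-empty subsets of a list (same logic as above)
    return [list(c) for i in range(1, len(l) + 1)
            for c in itertools.combinations(l, i)]

# Hypothetical source frame, for illustration only
source = pd.DataFrame({'data_sources': [['B Monitoring', 'a logs']]})
df = source[['data_sources']].copy()  # explicit copy: no chained-assignment warning
df['data_sources'] = df['data_sources'].apply(lambda x: sorted(s.lower() for s in x))
df['subsets'] = df['data_sources'].apply(subs)
print(df['subsets'].iloc[0])
```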
df.head()
data_sources | subsets | |
---|---|---|
0 | [file monitoring, process command-line paramet... | [[file monitoring], [process command-line para... |
1 | [file monitoring, process command-line paramet... | [[file monitoring], [process command-line para... |
2 | [netflow/enclave netflow, network device logs,... | [[netflow/enclave netflow], [network device lo... |
3 | [process command-line parameters, process moni... | [[process command-line parameters], [process m... |
4 | [netflow/enclave netflow, network device logs,... | [[netflow/enclave netflow], [network device lo... |
We need to split the subsets column values:
techniques_with_data_sources_preview = df
attributes_4 = ['subsets']
for a in attributes_4:
s = techniques_with_data_sources_preview.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
s.name = a
techniques_with_data_sources_preview = techniques_with_data_sources_preview.drop(a, axis=1).join(s).reset_index(drop=True)
techniques_with_data_sources_subsets = techniques_with_data_sources_preview.reindex(['data_sources','subsets'], axis=1)
techniques_with_data_sources_subsets.head()
data_sources | subsets | |
---|---|---|
0 | [file monitoring, process command-line paramet... | [file monitoring] |
1 | [file monitoring, process command-line paramet... | [process command-line parameters] |
2 | [file monitoring, process command-line paramet... | [process monitoring] |
3 | [file monitoring, process command-line paramet... | [file monitoring, process command-line paramet... |
4 | [file monitoring, process command-line paramet... | [file monitoring, process monitoring] |
Let's add three columns to analyse the dataframe: subsets_name (Changing Lists to Strings), subsets_number_elements ( Number of data sources per subset) and number_data_sources_per_technique
techniques_with_data_sources_subsets['subsets_name']=techniques_with_data_sources_subsets['subsets'].apply(lambda x: ','.join(map(str, x)))
techniques_with_data_sources_subsets['subsets_number_elements']=techniques_with_data_sources_subsets['subsets'].str.len()
techniques_with_data_sources_subsets['number_data_sources_per_technique']=techniques_with_data_sources_subsets['data_sources'].str.len()
techniques_with_data_sources_subsets.head()
data_sources | subsets | subsets_name | subsets_number_elements | number_data_sources_per_technique | |
---|---|---|---|---|---|
0 | [file monitoring, process command-line paramet... | [file monitoring] | file monitoring | 1 | 3 |
1 | [file monitoring, process command-line paramet... | [process command-line parameters] | process command-line parameters | 1 | 3 |
2 | [file monitoring, process command-line paramet... | [process monitoring] | process monitoring | 1 | 3 |
3 | [file monitoring, process command-line paramet... | [file monitoring, process command-line paramet... | file monitoring,process command-line parameters | 2 | 3 |
4 | [file monitoring, process command-line paramet... | [file monitoring, process monitoring] | file monitoring,process monitoring | 2 | 3 |
As described above, we need to find groups of data sources, so we are going to filter out all the subsets with only one data source:
subsets = techniques_with_data_sources_subsets
subsets_ok=subsets[subsets.subsets_number_elements != 1]
subsets_ok.head()
data_sources | subsets | subsets_name | subsets_number_elements | number_data_sources_per_technique | |
---|---|---|---|---|---|
3 | [file monitoring, process command-line paramet... | [file monitoring, process command-line paramet... | file monitoring,process command-line parameters | 2 | 3 |
4 | [file monitoring, process command-line paramet... | [file monitoring, process monitoring] | file monitoring,process monitoring | 2 | 3 |
5 | [file monitoring, process command-line paramet... | [process command-line parameters, process moni... | process command-line parameters,process monito... | 2 | 3 |
6 | [file monitoring, process command-line paramet... | [file monitoring, process command-line paramet... | file monitoring,process command-line parameter... | 3 | 3 |
10 | [file monitoring, process command-line paramet... | [file monitoring, process command-line paramet... | file monitoring,process command-line parameters | 2 | 3 |
Finally, we calculate the most relevant groups of data sources (Top 15):
subsets_graph = subsets_ok.groupby(['subsets_name'])['subsets_name'].count().to_frame(name='subsets_count').sort_values(by='subsets_count',ascending=False)[0:15]
subsets_graph
subsets_count | |
---|---|
subsets_name | |
process command-line parameters,process monitoring | 88 |
file monitoring,process monitoring | 74 |
file monitoring,process command-line parameters | 49 |
file monitoring,process command-line parameters,process monitoring | 42 |
process monitoring,process use of network | 33 |
api monitoring,process monitoring | 32 |
process monitoring,windows registry | 29 |
packet capture,process use of network | 21 |
packet capture,process monitoring | 19 |
netflow/enclave netflow,process monitoring | 18 |
netflow/enclave netflow,process use of network | 17 |
process command-line parameters,windows registry | 17 |
netflow/enclave netflow,packet capture | 17 |
process monitoring,windows event logs | 16 |
packet capture,process monitoring,process use of network | 16 |
subsets_graph_2 = pandas.DataFrame({
'Data Sources': list(subsets_graph.index),
'Count of Techniques': subsets_graph['subsets_count'].tolist()})
bars = alt.Chart(subsets_graph_2).mark_bar().encode(x ='Data Sources', y ='Count of Techniques', color='Data Sources').properties(width=500)
text = bars.mark_text(align='center',baseline='middle',dx= 0,dy=-5).encode(text='Count of Techniques')
bars + text
The group (Process Command-line parameters, Process Monitoring) is the group of data sources with the highest number of techniques. This group of data sources is suggested for hunting 88 techniques
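Since only pairs and larger groups matter here, co-occurrence can also be counted directly with collections.Counter over itertools.combinations, without materializing every subset as rows; a sketch on hypothetical per-technique lists:

```python
import itertools
from collections import Counter

# Hypothetical per-technique data source lists, for illustration only
techniques_ds = [
    ['file monitoring', 'process monitoring'],
    ['process command-line parameters', 'process monitoring'],
    ['file monitoring', 'process monitoring'],
]

pair_counts = Counter()
for ds in techniques_ds:
    # Count each unordered pair of data sources per technique
    for pair in itertools.combinations(sorted(ds), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))
# [(('file monitoring', 'process monitoring'), 2)]
```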
Let's split all the relevant columns of the dataframe:
techniques_data = techniques_with_data_sources
attributes = ['platform','tactic','data_sources'] # In attributes we are going to indicate the name of the columns that we need to split
for a in attributes:
s = techniques_data.apply(lambda x: pandas.Series(x[a]),axis=1).stack().reset_index(level=1, drop=True)
# "s" is going to be a column of a frame with every value of the list inside each cell of the column "a"
s.name = a
# We name "s" with the same name of "a".
techniques_data=techniques_data.drop(a, axis=1).join(s).reset_index(drop=True)
# We drop the column "a" from "techniques_data", and then join "techniques_data" with "s"
# Let's re-arrange the columns from general to specific
techniques_data_2=techniques_data.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
# We are going to edit some names inside the dataframe to improve the consistency:
techniques_data_3 = techniques_data_2.replace(['Process monitoring','Application logs'],['Process Monitoring','Application Logs'])
techniques_data_3.head()
matrix | platform | tactic | technique | technique_id | data_sources | |
---|---|---|---|---|---|---|
0 | mitre-attack | Linux | defense-evasion | Compile After Delivery | T1500 | Process command-line parameters |
1 | mitre-attack | Linux | defense-evasion | Compile After Delivery | T1500 | Process Monitoring |
2 | mitre-attack | Linux | defense-evasion | Compile After Delivery | T1500 | File monitoring |
3 | mitre-attack | macOS | defense-evasion | Compile After Delivery | T1500 | Process command-line parameters |
4 | mitre-attack | macOS | defense-evasion | Compile After Delivery | T1500 | Process Monitoring |
Do you remember data source names with a reference to Windows? After splitting the dataframe by platform, tactic and data source, are there any macOS or Linux techniques that reference Windows data sources? Let's identify those rows:
# After splitting the rows of the dataframe, there are some values that relate Windows data sources with platforms like Linux and macOS.
# We need to identify those rows
conditions = [(techniques_data_3['platform']=='Linux')&(techniques_data_3['data_sources'].str.contains('windows',case=False)== True),
(techniques_data_3['platform']=='macOS')&(techniques_data_3['data_sources'].str.contains('windows',case=False)== True),
(techniques_data_3['platform']=='Linux')&(techniques_data_3['data_sources'].str.contains('powershell',case=False)== True),
(techniques_data_3['platform']=='macOS')&(techniques_data_3['data_sources'].str.contains('powershell',case=False)== True),
(techniques_data_3['platform']=='Linux')&(techniques_data_3['data_sources'].str.contains('wmi',case=False)== True),
(techniques_data_3['platform']=='macOS')&(techniques_data_3['data_sources'].str.contains('wmi',case=False)== True)]
# In conditions we indicate a logical test
choices = ['NO OK','NO OK','NO OK','NO OK','NO OK','NO OK']
# In choices, we indicate the result when the logical test is true
techniques_data_3['Validation'] = np.select(conditions,choices,default='OK')
# We add a column "Validation" to "techniques_data_3" with the result of the logical test. The default value is going to be "OK"
What does the inconsistent data look like?
techniques_analysis_data_no_ok = techniques_data_3[techniques_data_3.Validation == 'NO OK']
# Finally, we are filtering all the values with NO OK
techniques_analysis_data_no_ok.head()
matrix | platform | tactic | technique | technique_id | data_sources | Validation | |
---|---|---|---|---|---|---|---|
105 | mitre-attack | macOS | impact | Inhibit System Recovery | T1490 | Windows Registry | NO OK |
107 | mitre-attack | macOS | impact | Inhibit System Recovery | T1490 | Windows event logs | NO OK |
110 | mitre-attack | Linux | impact | Inhibit System Recovery | T1490 | Windows Registry | NO OK |
112 | mitre-attack | Linux | impact | Inhibit System Recovery | T1490 | Windows event logs | NO OK |
181 | mitre-attack | Linux | defense-evasion | File Permissions Modification | T1222 | Windows event logs | NO OK |
print('There are ',len(techniques_analysis_data_no_ok),' rows with inconsistent data')
There are 37 rows with inconsistent data
What is the impact of this inconsistent data from a platform and data sources perspective?
df = techniques_with_data_sources
attributes = ['platform','data_sources']
for a in attributes:
    s = df.apply(lambda x: Series(x[a]), axis=1).stack().reset_index(level=1, drop=True)
    s.name = a
    df = df.drop(a, axis=1).join(s).reset_index(drop=True)
df_2=df.reindex(['matrix','platform','tactic','technique','technique_id','data_sources'], axis=1)
df_3 = df_2.replace(['Process monitoring','Application logs'],['Process Monitoring','Application Logs'])
conditions = [df_3['data_sources'].str.contains('windows', case=False, na=False),
              df_3['data_sources'].str.contains('powershell', case=False, na=False),
              df_3['data_sources'].str.contains('wmi', case=False, na=False)]
choices = ['Windows','Windows','Windows']
df_3['Validation'] = np.select(conditions,choices,default='Other')
df_3['Num_Tech'] = 1
df_4 = df_3[df_3.Validation == 'Windows']
df_5 = df_4.groupby(['data_sources','platform'])['technique'].nunique()
df_6 = df_5.to_frame().reset_index()
alt.Chart(df_6).mark_bar().encode(x=alt.X('technique', stack="normalize"), y='data_sources', color='platform').properties(height=200)
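As a side note, the stack/join loop used above to un-nest the list-valued columns predates `DataFrame.explode`, which pandas added in version 0.25. A minimal sketch of the same un-nesting with `explode` (toy data, hypothetical values):

```python
import pandas as pd

# One technique with list-valued platform and data_sources columns (hypothetical)
df = pd.DataFrame({
    'technique': ['Compile After Delivery'],
    'platform': [['Linux', 'macOS', 'Windows']],
    'data_sources': [['Process Monitoring', 'File monitoring']]
})

# Explode one list-valued column at a time, as the loop above does
for col in ['platform', 'data_sources']:
    df = df.explode(col).reset_index(drop=True)

print(len(df))  # 3 platforms x 2 data sources = 6 rows
```

Each `explode` call repeats the remaining columns once per list element, so exploding both columns yields the full cross product of platforms and data sources per technique.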
There are techniques that list Windows Error Reporting, Windows Registry, and Windows event logs as data sources while also listing platforms like Linux and macOS. We do not need to consider these rows because those data sources can only be collected in a Windows environment. These are the techniques that we should not consider in our dataset:
techniques_analysis_data_no_ok[['technique','data_sources']].drop_duplicates().sort_values(by='data_sources',ascending=True)
technique | data_sources | |
---|---|---|
667 | Input Prompt | PowerShell logs |
1990 | Credential Dumping | PowerShell logs |
244 | Exploitation of Remote Services | Windows Error Reporting |
317 | Exploitation for Defense Evasion | Windows Error Reporting |
378 | Exploitation for Credential Access | Windows Error Reporting |
1384 | Exploitation for Privilege Escalation | Windows Error Reporting |
105 | Inhibit System Recovery | Windows Registry |
1182 | Disabling Security Tools | Windows Registry |
1311 | Third-party Software | Windows Registry |
1480 | Input Capture | Windows Registry |
1505 | Process Injection | Windows Registry |
107 | Inhibit System Recovery | Windows event logs |
181 | File Permissions Modification | Windows event logs |
654 | Create Account | Windows event logs |
1364 | Indicator Removal on Host | Windows event logs |
1781 | Obfuscated Files or Information | Windows event logs |
After dropping this inconsistent data, the final dataframe is:
techniques_analysis_data_ok = techniques_data_3[techniques_data_3.Validation == 'OK']
techniques_analysis_data_ok.head()
matrix | platform | tactic | technique | technique_id | data_sources | Validation | |
---|---|---|---|---|---|---|---|
0 | mitre-attack | Linux | defense-evasion | Compile After Delivery | T1500 | Process command-line parameters | OK |
1 | mitre-attack | Linux | defense-evasion | Compile After Delivery | T1500 | Process Monitoring | OK |
2 | mitre-attack | Linux | defense-evasion | Compile After Delivery | T1500 | File monitoring | OK |
3 | mitre-attack | macOS | defense-evasion | Compile After Delivery | T1500 | Process command-line parameters | OK |
4 | mitre-attack | macOS | defense-evasion | Compile After Delivery | T1500 | Process Monitoring | OK |
print('There are ',len(techniques_analysis_data_ok),' rows of data that you can play with')
There are 1983 rows of data that you can play with
This function retrieves information about techniques that reference specific data sources
data_source = 'PROCESS MONITORING'
results = lift.get_techniques_by_datasources(data_source)
len(results)
169
type(results)
list
results2 = lift.get_techniques_by_datasources('pRoceSS MoniTorinG','process commAnd-linE parameters')
len(results2)
178
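The two calls above suggest that `get_techniques_by_datasources` matches data source names case-insensitively and unions the results across its arguments. Under that assumption, a rough equivalent over the `all_techniques` list can be sketched in plain Python (a sketch with a hypothetical mini-list, not the library's actual implementation):

```python
# Hypothetical mini-list standing in for all_techniques (stix_format=False)
all_techniques = [
    {'technique': 'Systemd Service',
     'data_sources': ['Process command-line parameters', 'Process monitoring']},
    {'technique': 'Compile After Delivery',
     'data_sources': ['File monitoring']},
    {'technique': 'No Sources'}  # some techniques have no data_sources field
]

def techniques_by_data_sources(techniques, *names):
    """Return techniques whose data sources match any given name, ignoring case."""
    wanted = {n.lower() for n in names}
    return [t for t in techniques
            if any(ds.lower() in wanted for ds in t.get('data_sources', []))]

results = techniques_by_data_sources(all_techniques, 'pRoceSS MoniTorinG', 'File Monitoring')
print([t['technique'] for t in results])  # → ['Systemd Service', 'Compile After Delivery']
```

Lower-casing both sides before comparing is what makes the oddly cased queries in the cells above work.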
results2[1]
AttackPattern(type='attack-pattern', id='attack-pattern--0fff2797-19cb-41ea-a5f1-8a9303b8158e', created_by_ref='identity--c78cb6e5-0c4b-4611-8297-d1b8b55e40b5', created='2019-04-23T15:34:30.008Z', modified='2019-04-29T14:14:08.450Z', name='Systemd Service', description="Systemd services can be used to establish persistence on a Linux system. The systemd service manager is commonly used for managing background daemon processes (also known as services) and other system resources.(Citation: Linux man-pages: systemd January 2014)(Citation: Freedesktop.org Linux systemd 29SEP2018) Systemd is the default initialization (init) system on many Linux distributions starting with Debian 8, Ubuntu 15.04, CentOS 7, RHEL 7, Fedora 15, and replaces legacy init systems including SysVinit and Upstart while remaining backwards compatible with the aforementioned init systems.\n\nSystemd utilizes configuration files known as service units to control how services boot and under what conditions. By default, these unit files are stored in the <code>/etc/systemd/system</code> and <code>/usr/lib/systemd/system</code> directories and have the file extension <code>.service</code>. Each service unit file may contain numerous directives that can execute system commands. \n\n* ExecStart, ExecStartPre, and ExecStartPost directives cover execution of commands when a services is started manually by 'systemctl' or on system start if the service is set to automatically start. \n* ExecReload directive covers when a service restarts. 
\n* ExecStop and ExecStopPost directives cover when a service is stopped or manually by 'systemctl'.\n\nAdversaries have used systemd functionality to establish persistent access to victim systems by creating and/or modifying service unit files that cause systemd to execute malicious commands at recurring intervals, such as at system boot.(Citation: Anomali Rocke March 2019)(Citation: gist Arch package compromise 10JUL2018)(Citation: Arch Linux Package Systemd Compromise BleepingComputer 10JUL2018)(Citation: acroread package compromised Arch Linux Mail 8JUL2018)\n\nWhile adversaries typically require root privileges to create/modify service unit files in the <code>/etc/systemd/system</code> and <code>/usr/lib/systemd/system</code> directories, low privilege users can create/modify service unit files in directories such as <code>~/.config/systemd/user/</code> to achieve user-level persistence.(Citation: Rapid7 Service Persistence 22JUNE2016)", kill_chain_phases=[KillChainPhase(kill_chain_name='mitre-attack', phase_name='persistence')], external_references=[ExternalReference(source_name='mitre-attack', url='https://attack.mitre.org/techniques/T1501', external_id='T1501'), ExternalReference(source_name='Linux man-pages: systemd January 2014', description='Linux man-pages. (2014, January). systemd(1) - Linux manual page. Retrieved April 23, 2019.', url='http://man7.org/linux/man-pages/man1/systemd.1.html'), ExternalReference(source_name='Freedesktop.org Linux systemd 29SEP2018', description='Freedesktop.org. (2018, September 29). systemd System and Service Manager. Retrieved April 23, 2019.', url='https://www.freedesktop.org/wiki/Software/systemd/'), ExternalReference(source_name='Anomali Rocke March 2019', description='Anomali Labs. (2019, March 15). Rocke Evolves Its Arsenal With a New Malware Family Written in Golang. 
Retrieved April 24, 2019.', url='https://www.anomali.com/blog/rocke-evolves-its-arsenal-with-a-new-malware-family-written-in-golang'), ExternalReference(source_name='gist Arch package compromise 10JUL2018', description='Catalin Cimpanu. (2018, July 10). ~x file downloaded in public Arch package compromise. Retrieved April 23, 2019.', url='https://gist.github.com/campuscodi/74d0d2e35d8fd9499c76333ce027345a'), ExternalReference(source_name='Arch Linux Package Systemd Compromise BleepingComputer 10JUL2018', description='Catalin Cimpanu. (2018, July 10). Malware Found in Arch Linux AUR Package Repository. Retrieved April 23, 2019.', url='https://www.bleepingcomputer.com/news/security/malware-found-in-arch-linux-aur-package-repository/'), ExternalReference(source_name='acroread package compromised Arch Linux Mail 8JUL2018', description='Eli Schwartz. (2018, June 8). acroread package compromised. Retrieved April 23, 2019.', url='https://lists.archlinux.org/pipermail/aur-general/2018-July/034153.html'), ExternalReference(source_name='Rapid7 Service Persistence 22JUNE2016', description='Rapid7. (2016, June 22). Service Persistence. Retrieved April 23, 2019.', url='https://www.rapid7.com/db/modules/exploit/linux/local/service_persistence')], object_marking_refs=['marking-definition--fa42a846-8d90-4e51-bc29-71d5b4802168'], x_mitre_contributors=['Tony Lambert, Red Canary'], x_mitre_data_sources=['Process command-line parameters', 'Process monitoring', 'File monitoring'], x_mitre_detection="Systemd service unit files may be detected by auditing file creation and modification events within the <code>/etc/systemd/system</code>, <code>/usr/lib/systemd/system/</code>, and <code>/home/<username>/.config/systemd/user/</code> directories, as well as associated symbolic links. 
Suspicious processes or scripts spawned in this manner will have a parent process of ‘systemd’, a parent process ID of 1, and will usually execute as the ‘root’ user.\n\nSuspicious systemd services can also be identified by comparing results against a trusted system baseline. Malicious systemd services may be detected by using the systemctl utility to examine system wide services: <code>systemctl list-units -–type=service –all</code>. Analyze the contents of <code>.service</code> files present on the file system and ensure that they refer to legitimate, expected executables.\n\nAuditing the execution and command-line arguments of the 'systemctl' utility, as well related utilities such as <code>/usr/sbin/service</code> may reveal malicious systemd service execution.", x_mitre_permissions_required=['root', 'User'], x_mitre_platforms=['Linux'], x_mitre_version='1.0')