Notebook

Activity 2 - Hello, Security Analytics¶

For this “Hello World” example, you are working on a problem given to you by the manager of the Security Operations Center (SOC). It seems the SOC analysts are becoming inundated with “trivial” alerts ever since a new data set of indicators was introduced into the Security Information and Event Management (SIEM) system. They have asked for your help in reducing the number of “trivial” alerts with- out sacrificing visibility.

This is a good problem to tackle through data analysis, and we should be able to form a solid, practical question to ask after we perform some exploratory data analysis and hopefully arrive at an answer that helps out the SOC.

Loading in Data¶

In [ ]:

import pandas as pd
import sys
# read in the data into a pandas data frame
avRep = "./data/reputation.data"
av = pd.read_csv(avRep, sep="#", header=None)

# make smarter column names
av.columns = ["IP","Reliability","Risk","Type","Country",
              "Locale","Coords","x"]
print (av)

In [ ]:

av.head().to_csv(sys.stdout)

In [ ]:

from IPython.display import HTML 
# display the first 10 lines of the dataframe as formatted HTML
HTML(av.head(10).to_html())

### This is actually deprecated - we can instead just say 'av' to see this output

In [ ]:

av

Above we have looked at how you can load in a dataset, and then how you can print this in the notebook - either using print(), or the HTML function, or simply by giving the variable as the last command in the cell.

Before going any deeper lets just look at the data so that we know what we are working with:

Reliability, Risk, and x are integers.
IP, Type, Country, Locale, and Coords are character strings.
The IP address is stored in the dotted-quad notation, not in hostnames or decimal format.
Each record is associated with a unique IP address, so there are 258,626 IP addresses (in this download).
Each IP address has been geo-located into the latitude and longitude pair in the Coords field, but they are in a single field separated by a comma. You will have to parse that further if you want to use that field.

Descriptive Statistics¶

What do we mean by descriptive statistics? As the name suggests, these essential describe the properties of our data. They help for summarisation, and for providing easier forms of comparison when consider two groups of data. You will likely be familiar with some of these concepts, but nevertheless, it is important to think further about how they can be used to summarise a data set (and also, if there are any potential issues are with using them and how we can overcome these?)

Commonly used descriptive statistics include:

Minimum and maximum values; taking the difference of these will give you the range (range = max - min)
Median (the value at the middle of the dataset)
First and third quartiles (the 25th and 75th percentiles, or you could think of it as the median value of the first and last halves of the data, respectively)
Mean (sum of all values divided by the number of count)

Whilst we can calculate these "in code", or use in-built functions such as np.mean(), Pandas provides a convenient describe() function that will perform all of these together.

In [ ]:

av['Reliability'].describe()

In [ ]:

av['Risk'].describe()

Above, note how we can select the specific column using av['Reliability'] or av['Risk'].

An important note to make (from the Alienvault documentation) is that Risk and Reliability are scored 1-10, however these are ordinal values rather than numerical.

What does this mean? Essentially, ordinal values denote order, however they are not quantities. Therefore, a score of 4 is not specifically twice the risk of 2, however it is greater.

In [ ]:

def factor_col(col):
    factor = pd.Categorical(col)
    return pd.value_counts(factor,sort=False)

rel_ct = pd.value_counts(av['Reliability'])
risk_ct = pd.value_counts(av['Risk'])
type_ct = pd.value_counts(av['Type'])
country_ct = pd.value_counts(av['Country'])

print (factor_col(av['Reliability']))

print (factor_col(av['Risk']))

print (factor_col(av['Type']).head(n=10))

print (factor_col(av['Country']).head(n=10))

Above, we define our own function called factor_col() that will essentially identify all possible values within a given dataset, and count the number of occurrence for each. A similar function Pandas can called using group_by.

Ploting our data¶

In [ ]:

import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline

# sort by country
country_ct = pd.value_counts(av['Country'])

# plot the data
plt.axes(frameon=0) # reduce chart junk
country_ct[:20].plot(kind='bar', 
   rot=0, title="Summary By Country", figsize=(20,10)).grid(False)
plt.show()

In [ ]:

plt.axes(frameon=0) # reduce chart junk
factor_col(av['Reliability']).plot(kind='bar', rot=0,
           title="Summary By 'Reliability'", figsize=(20,10)).grid(False)
plt.show()

In [ ]:

plt.axes(frameon=0) # reduce chart junk
factor_col(av['Risk']).plot(kind='bar', rot=0, 
           title="Summary By 'Risk'", figsize=(20,10)).grid(False)
plt.show()

Above, we have created bar plots that show the values for each attribute, allowing us to examine these in greater detail.

In [ ]:

top10 = pd.value_counts(av['Country'])[0:9] 
# calculate the % for each of the top 10
top10.astype(float) / len(av['Country'])

Note that above, we have divided through by the length of the Country column, essentially giving a percentage of countries rather than absolute counts.

Further Exploration¶

Perhaps we want to look at both risk and reliability together? We can use a crosstab to achieve this.

In [ ]:

from matplotlib import cm
from numpy import arange

print(pd.crosstab(av['Risk'], av['Reliability']).to_string())

In [ ]:

# graphical view of contingency table (swapping risk/reliability)
xtab = pd.crosstab(av['Reliability'], av['Risk'])
fig = plt.figure(figsize=(5,5))
plt.pcolormesh(xtab,cmap=cm.Greens, figure=fig)
plt.yticks(arange(0.5,len(xtab.index), 1),xtab.index)
plt.xticks(arange(0.5,len(xtab.columns), 1),xtab.columns)
plt.colorbar()
plt.title("Risk ~ Reliability")
plt.show()

Ok, so this starts to highlight some interesting details about risk and reliability however it is lacking in context - can we identify risk/reliability against each type of alert as observed in the SOC? Let's try below.

In [ ]:

# Listing 3-23
# require object: av (3-5)
# See corresponding output in Figure 3-9
# compute contingency table for Risk/Reliability factors which 
# produces a matrix of counts of rows that have attributes at

# create new column as a copy of Type column
av['newtype'] = av['Type']

# replace multi-Type entries with “Multiples”
av[av['newtype'].str.contains(";")] = "Multiples"

# setup new crosstab structures
typ = av['newtype']
rel = av['Reliability']
rsk = av['Risk']

# compute crosstab making it split on the
# new “type” column
xtab = pd.crosstab(typ, [ rel, rsk ], rownames=['typ'], colnames=['rel', 'rsk'])



# the following print statement will show a huge text
# representation of the contingency table. The output
# is too large for the book, but is worth looking at 
# as you run through the exercise to see how useful 
# visualizations can be over raw text/numeric output

print (xtab.to_string())

This data is difficult to observe in tabular form - as discussed, there is simply too much and it is nested which also makes it challenging to follow.

Instead, let's consider a bar chart.

In [ ]:

xtab.plot(kind='bar',legend=False, title="Risk ~ Reliabilty | Type", figsize=(20,10)).grid(False)
plt.show()

Excellent! We have a bar chart that shows the combined risk/reliability measures against each type of SOC alert. This starts to look useful. However, perhaps we want to exclude Scanning Host - we expect this behaviour on our network and showing this is making it harder to observe other details about the data. Let's exclude this next.

In [ ]:

# Here we remove Scanning Host
rrt_df = av[av['newtype'] != "Scanning Host"]

# And then we do the chart again
typ = rrt_df['newtype']
rel = rrt_df['Reliability']
rsk = rrt_df['Risk']
xtab = pd.crosstab(typ, [ rel, rsk ], 
       rownames=['typ'], colnames=['rel', 'rsk'])
xtab.plot(kind='bar',legend=False, 
   title="Risk ~ Reliabilty | Type", figsize=(20,10)).grid(False)
plt.show()

Ok this looks more interesting now. We see Malware Domain and Malware distribution cropping up, which would make sense - however we may not necessarily be interested in these for this particular story. Let's exclude these and see what we are left with.

In [ ]:

rrt_df = rrt_df[rrt_df['newtype'] != "Malware distribution" ]
rrt_df = rrt_df[rrt_df['newtype'] != "Malware Domain" ]
typ = rrt_df['newtype']
rel = rrt_df['Reliability']
rsk = rrt_df['Risk']
xtab = pd.crosstab(typ, [ rel, rsk ],
        rownames=['typ'], colnames=['rel', 'rsk'])

print ("Count: %d; Percent: %2.1f%%" % (len(rrt_df), (float(len(rrt_df))
   / len(av)) * 100))
## Count: 15171; Percent: 5.9%

xtab.plot(kind='bar',legend=False, title="Risk ~ Reliabilty | Type", figsize=(20,10)).grid(False)

plt.show()

xtab

We have now managed to filter our data down to reveal aspects about malware IP, malicious hosts, as well as command and control servers, spamming addresses, and multiples (which would require a separate investigation). Importantly, we can gain more insight into these now, and we have worked logically through to filter unnecessary information for our story - we now want to learn more about the malware IPs since there are a great number of these. Also worth noting, we are now working with 5.9% of our original data (15171 rows) making it much more manageable to explore and find relavent details, rather than being inudated with irrelavent information. Our SOC team were struggling with the number of alerts they were dealing with - this workflow would allow them to manage the alerts much more effectively, and concentrate on the key details of interest.

Having developed this in a Notebook for the purpose of exploration, we could easier export this as a Python script that would run periodically to filter our alerts as needed.

This example should help to demonstrate the benefit of interactive analysis of the data, and how this can be used to rapidly design a suitable analysis workflow for deployment.

In [ ]: