In this notebook we're going to use some great Python modules to explore, understand and classify domains as being 'legit' or as having a high probability of being generated by a DGA (Domain Generation Algorithm). We have 'legit' in quotes because we're using the domains in Alexa as the 'legit' set. The primary motivation is to explore the nexus of IPython, Pandas and scikit-learn, with DGA classification as a vehicle for that exploration. The exercise intentionally shows common missteps, warts in the data, paths that didn't work out that well, and results that could definitely be improved upon. In general, capturing what worked and what didn't is not only more realistic but often much more informative. :)
Suggestions/Comments: Please send suggestions or bugs (I'm sure) to bwylie at clicksecurity.com. Also if you have some datasets or would like to explore alternative approaches please touch base.
import sklearn.feature_extraction
sklearn.__version__
'0.14.1'
import pandas as pd
pd.__version__
'0.12.0'
# Set default pylab stuff
# (assumes the notebook was started with %pylab inline, which also pulls in
#  numpy as np, matplotlib.pyplot as plt, and the pylab namespace used below)
pylab.rcParams['figure.figsize'] = (14.0, 5.0)
pylab.rcParams['axes.grid'] = True
# Version 0.12.0 of Pandas has a DeprecationWarning about Height blah that I'm ignoring
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
# This is the Alexa 100k domain list, we're not using the 1 Million just for speed reasons. Results
# for the Alexa 1M are given at the bottom of the notebook.
alexa_dataframe = pd.read_csv('data/alexa_100k.csv', names=['rank','uri'], header=None, encoding='utf-8')
alexa_dataframe.head()
 | rank | uri |
---|---|---|
0 | 1 | facebook.com |
1 | 2 | google.com |
2 | 3 | youtube.com |
3 | 4 | yahoo.com |
4 | 5 | baidu.com |
# Okay for this exercise we need the 2LD (the registered, second-level domain label) and nothing else
import numpy as np
import tldextract

def domain_extract(uri):
    ext = tldextract.extract(uri)
    if not ext.suffix:
        return np.nan
    else:
        return ext.domain
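To see what tldextract actually pulls apart, here's a quick illustration (the repr below is tldextract's ExtractResult named tuple; exact formatting may vary by version):

tldextract.extract('www.bbc.co.uk')
# ExtractResult(subdomain='www', domain='bbc', suffix='co.uk')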
alexa_dataframe['domain'] = [ domain_extract(uri) for uri in alexa_dataframe['uri']]
del alexa_dataframe['rank']
del alexa_dataframe['uri']
alexa_dataframe.head()
 | domain |
---|---|
0 | facebook |
1 | google |
2 | youtube |
3 | yahoo |
4 | baidu |
alexa_dataframe.tail()
 | domain |
---|---|
99995 | rhbabyandchild |
99996 | rm |
99997 | sat1 |
99998 | nahimunkar |
99999 | musi |
# It's possible we have NaNs from blank lines or whatever
alexa_dataframe = alexa_dataframe.dropna()
alexa_dataframe = alexa_dataframe.drop_duplicates()
# Set the class
alexa_dataframe['class'] = 'legit'
# Shuffle the data (important for training/testing)
alexa_dataframe = alexa_dataframe.reindex(np.random.permutation(alexa_dataframe.index))
alexa_total = alexa_dataframe.shape[0]
print 'Total Alexa domains %d' % alexa_total
# Hold out 10% (slice endpoints must be ints; float slicing is asking for trouble)
split = int(alexa_total*.9)
hold_out_alexa = alexa_dataframe[split:]
alexa_dataframe = alexa_dataframe[:split]
print 'Number of Alexa domains: %d' % alexa_dataframe.shape[0]
Total Alexa domains 91712
Number of Alexa domains: 82540
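As an aside, scikit-learn's train_test_split can carve off the same 10% hold-out in one call (a sketch; train_test_split shuffles for you, which is why we shuffled manually before slicing above):

from sklearn.cross_validation import train_test_split
train_doms, hold_doms = train_test_split(alexa_dataframe['domain'].values, test_size=0.1)
print len(train_doms), len(hold_doms)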
alexa_dataframe.head()
 | domain | class |
---|---|---|
20904 | transworld | legit |
82690 | lkfun | legit |
85167 | islam2all | legit |
62859 | pulitzer | legit |
85573 | sge | legit |
# Read in the DGA domains
dga_dataframe = pd.read_csv('data/dga_domains.txt', names=['raw_domain'], header=None, encoding='utf-8')
# We noticed that the blacklist values just differ by capitalization or .com/.org/.info
dga_dataframe['domain'] = dga_dataframe['raw_domain'].map(lambda x: x.split('.')[0].strip().lower())
del dga_dataframe['raw_domain']
# It's possible we have NaNs from blank lines or whatever
dga_dataframe = dga_dataframe.dropna()
dga_dataframe = dga_dataframe.drop_duplicates()
dga_total = dga_dataframe.shape[0]
print 'Total DGA domains %d' % dga_total
# Set the class
dga_dataframe['class'] = 'dga'
# Hold out 10%
split = int(dga_total*.9)
hold_out_dga = dga_dataframe[split:]
dga_dataframe = dga_dataframe[:split]
print 'Number of DGA domains: %d' % dga_dataframe.shape[0]
Total DGA domains 2664
Number of DGA domains: 2397
dga_dataframe.head()
 | domain | class |
---|---|---|
0 | 04055051be412eea5a61b7da8438be3d | dga |
1 | 1cb8a5f36f | dga |
2 | 30acd347397c34fc273e996b22951002 | dga |
3 | 336c986a284e2b3bc0f69f949cb437cb | dga |
5 | 40a43e61e56a5c218cf6c22aca27f7ee | dga |
# Concatenate the domains in a big pile!
all_domains = pd.concat([alexa_dataframe, dga_dataframe], ignore_index=True)
# Add a length field for the domain
all_domains['length'] = [len(x) for x in all_domains['domain']]
# Okay, since we're trying to detect dynamically generated domains and short
# domains (length <= 6) are crazy random even for 'legit' domains, we're going
# to punt on short domains (perhaps just white/black list the short domains?)
all_domains = all_domains[all_domains['length'] > 6]
# Grabbed this from Rosetta Code (rosettacode.org)
import math
from collections import Counter
def entropy(s):
p, lns = Counter(s), float(len(s))
return -sum( count/lns * math.log(count/lns, 2) for count in p.values())
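A few hand-checkable values to sanity check the function (exact by the formula above):

print entropy('aaaa')  # -0.0 : one repeated symbol, zero bits (the minus sign is a float quirk of -sum)
print entropy('abab')  #  1.0 : two equally likely symbols -> 1 bit per character
print entropy('abcd')  #  2.0 : four equally likely symbols -> 2 bits per character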
# Add an entropy field for the domain
all_domains['entropy'] = [entropy(x) for x in all_domains['domain']]
all_domains.head()
 | domain | class | length | entropy |
---|---|---|---|---|
0 | transworld | legit | 10 | 3.121928 |
2 | islam2all | legit | 9 | 2.419382 |
3 | pulitzer | legit | 8 | 3.000000 |
6 | danarimedia | legit | 11 | 2.663533 |
7 | heartbreakers | legit | 13 | 2.815072 |
all_domains.tail()
 | domain | class | length | entropy |
---|---|---|---|---|
84932 | ulxxqduryvv | dga | 11 | 2.913977 |
84933 | ummvzhin | dga | 8 | 2.750000 |
84934 | umsgnwgc | dga | 8 | 2.750000 |
84935 | umzsbhpkrgo | dga | 11 | 3.459432 |
84936 | umzuyjrfwyf | dga | 11 | 2.913977 |
# Boxplots show you the distribution of the data (spread).
# http://en.wikipedia.org/wiki/Box_plot
# Plot the length and entropy of domains
all_domains.boxplot('length','class')
pylab.ylabel('Domain Length')
all_domains.boxplot('entropy','class')
pylab.ylabel('Domain Entropy')
# Split the classes up so we can set colors, size, labels
cond = all_domains['class'] == 'dga'
dga = all_domains[cond]
alexa = all_domains[~cond]
plt.scatter(alexa['length'], alexa['entropy'], s=140, c='#aaaaff', label='Alexa', alpha=.2)
plt.scatter(dga['length'], dga['entropy'], s=40, c='r', label='DGA', alpha=.3)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Domain Entropy')
# Below you can see that our DGA domains do tend to have higher entropy than Alexa on average.
# Let's look at the types of domains that have entropy higher than 4
high_entropy_domains = all_domains[all_domains['entropy'] > 4]
print 'Num Domains above 4 entropy: %.2f%% %d (out of %d)' % \
(100.0*high_entropy_domains.shape[0]/all_domains.shape[0],high_entropy_domains.shape[0],all_domains.shape[0])
print "Num high entropy legit: %d" % high_entropy_domains[high_entropy_domains['class']=='legit'].shape[0]
print "Num high entropy DGA: %d" % high_entropy_domains[high_entropy_domains['class']=='dga'].shape[0]
high_entropy_domains[high_entropy_domains['class']=='legit'].head()
# Looking at the results below, we do see that there are more domains
# in the DGA group that are high entropy but only a small percentage
# of the domains are in that high entropy range...
Num Domains above 4 entropy: 0.57% 361 (out of 63294)
Num high entropy legit: 3
Num high entropy DGA: 358
 | domain | class | length | entropy |
---|---|---|---|---|
29392 | theukwebdesigncompany | legit | 21 | 4.070656 |
37378 | texaswithlove1982-amomentlikethis | legit | 33 | 4.051822 |
55073 | congresomundialjjrperu2009 | legit | 26 | 4.056021 |
high_entropy_domains[high_entropy_domains['class']=='dga'].head()
 | domain | class | length | entropy |
---|---|---|---|---|
82558 | a17btkyb38gxe41pwd50nxmzjxiwjwdwfrp52 | dga | 37 | 4.540402 |
82559 | a17c49l68ntkqnuhvkrmyb28fubvn30e31g43dq | dga | 39 | 4.631305 |
82560 | a17d60gtnxk47gskti15izhvlviyksh64nqkz | dga | 37 | 4.270132 |
82561 | a17erpzfzh64c69csi35bqgvp52drita67jzmy | dga | 38 | 4.629249 |
82562 | a17fro51oyk67b18ksfzoti55j36p32o11fvc29cr | dga | 41 | 4.305859 |
# In preparation for using scikit learn we're just going to use
# some handles that help take us from pandas land to scikit land
# List of feature vectors (scikit learn uses 'X' for the matrix of feature vectors)
X = all_domains.as_matrix(['length', 'entropy'])
# Labels (scikit learn uses 'y' for classification labels)
y = np.array(all_domains['class'].tolist()) # Yes, this is weird but it needs
# to be an np.array of strings
# Random Forest is a popular ensemble machine learning classifier.
# http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html
#
import sklearn.ensemble
import sklearn.cross_validation
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=20, compute_importances=True) # 20 trees in the forest
# Now we can use scikit learn's cross validation to assess predictive performance.
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=5, n_jobs=4)
print scores
[ 0.9688759 0.96784896 0.96729599 0.96753298 0.96887344]
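A compact way to summarize those folds (the usual mean plus/minus two standard deviations):

print 'Accuracy: %0.3f (+/- %0.3f)' % (scores.mean(), scores.std() * 2)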
# Wow 96% accurate! At this point we could claim success and we'd be gigantic morons...
# Recall that (after the hold-out split and the short-domain filter) we have ~61k 'legit'
# domains and only ~2.4k DGA domains. So a classifier that marked everything as
# legit would be about 96% accurate....
# So we dive in a bit and look at the predictive performance more deeply.
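To make that base rate concrete, here's what a classifier that answers 'legit' for everything would score (computed from the labels we just built):

print 'All-legit baseline accuracy: %.2f%%' % (100.0 * (y == 'legit').sum() / len(y))
# ~96%, which is why the cross validation scores above aren't impressive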
# Train on a 80/20 split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Now plot the results of the 80/20 split in a confusion matrix
from sklearn.metrics import confusion_matrix
labels = ['legit', 'dga']
cm = confusion_matrix(y_test, y_pred, labels)
def plot_cm(cm, labels):
    # Compute percentages (row-normalized: each row sums to 100%)
    percent = (cm * 100.0) / cm.sum(axis=1)[:, np.newaxis]
    print 'Confusion Matrix Stats'
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            print "%s/%s: %.2f%% (%d/%d)" % (label_i, label_j, (percent[i][j]), cm[i][j], cm[i].sum())

    # Show confusion matrix
    # Thanks kermit666 from stackoverflow :)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.grid(b=False)
    cax = ax.matshow(percent, cmap='coolwarm')
    pylab.title('Confusion matrix of the classifier')
    fig.colorbar(cax)
    ax.set_xticklabels([''] + labels)
    ax.set_yticklabels([''] + labels)
    pylab.xlabel('Predicted')
    pylab.ylabel('True')
    pylab.show()
plot_cm(cm, labels)
# We can see below that our suspicions were correct and the classifier is
# marking almost everything as Alexa. We FAIL.. science is hard... let's go drinking....
Confusion Matrix Stats
legit/legit: 99.89% (12152/12165)
legit/dga: 0.11% (13/12165)
dga/legit: 80.16% (396/494)
dga/dga: 19.84% (98/494)
# Well our Mom told us we were still cool.. so with that encouragement we're
# going to compute NGrams for every Alexa domain and see if we can use the
# NGrams to help us better differentiate and mark DGA domains...
# Scikit learn has a nice NGram generator that can generate either char NGrams or word NGrams (we're using char).
# Parameters:
#  - ngram_range=(3,5)  # Give me all ngrams of length 3, 4, and 5
#  - min_df=1e-4        # Minimum document frequency. At 1e-4 we're saying give us NGrams that
#                       # happen in at least 0.01% of the domains (so for 100k domains... at least 10 of them)
alexa_vc = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(3,5), min_df=1e-4, max_df=1.0)
# I'm SURE there's a better way to store all the counts but not sure...
# At least the min_df parameter has already done some thresholding
counts_matrix = alexa_vc.fit_transform(alexa_dataframe['domain'])
alexa_counts = np.log10(counts_matrix.sum(axis=0).getA1())
ngrams_list = alexa_vc.get_feature_names()
# For fun sort it and show it
import operator
_sorted_ngrams = sorted(zip(ngrams_list, alexa_counts), key=operator.itemgetter(1), reverse=True)
print 'Alexa NGrams: %d' % len(_sorted_ngrams)
for ngram, count in _sorted_ngrams[:10]:
print ngram, count
Alexa NGrams: 27012
ing 3.40001963507
lin 3.3818368
ine 3.35295391171
tor 3.22349594096
ter 3.21827285357
ion 3.20411998266
ent 3.18184358794
por 3.1562461904
the 3.15228834438
ree 3.11693964655
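If the NGram parameters feel abstract, here's a toy example (toy_vc is just for illustration) of what the character NGrams of a single domain look like; min_df is left at its default so nothing gets thresholded:

toy_vc = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(3,5))
toy_vc.fit(['facebook'])
print toy_vc.get_feature_names()
# [u'ace', u'aceb', u'acebo', u'boo', u'book', u'ceb', u'cebo', u'ceboo',
#  u'ebo', u'eboo', u'ebook', u'fac', u'face', u'faceb', u'ook']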
# We're also going to throw in a bunch of dictionary words
word_dataframe = pd.read_csv('data/words.txt', names=['word'], header=None, dtype={'word': np.str}, encoding='utf-8')
# Cleanup words from dictionary
word_dataframe = word_dataframe[word_dataframe['word'].map(lambda x: str(x).isalpha())]
word_dataframe = word_dataframe.applymap(lambda x: str(x).strip().lower())
word_dataframe = word_dataframe.dropna()
word_dataframe = word_dataframe.drop_duplicates()
word_dataframe.head(10)
 | word |
---|---|
37 | a |
48 | aa |
51 | aaa |
53 | aaaa |
54 | aaaaaa |
55 | aaal |
56 | aaas |
57 | aaberg |
58 | aachen |
59 | aae |
# Now compute NGrams on the dictionary words
# Same logic as above...
dict_vc = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(3,5), min_df=1e-5, max_df=1.0)
counts_matrix = dict_vc.fit_transform(word_dataframe['word'])
dict_counts = np.log10(counts_matrix.sum(axis=0).getA1())
ngrams_list = dict_vc.get_feature_names()
# For fun sort it and show it
import operator
_sorted_ngrams = sorted(zip(ngrams_list, dict_counts), key=operator.itemgetter(1), reverse=True)
print 'Word NGrams: %d' % len(_sorted_ngrams)
for ngram, count in _sorted_ngrams[:10]:
print ngram, count
Word NGrams: 142275
ing 4.38730082245
ess 4.20487933376
ati 4.19334725639
ion 4.16503647999
ter 4.16241503611
nes 4.11250445877
tio 4.07682242334
ate 4.07236020396
ent 4.06963110262
tion 4.04960561259
# We use the transform method of the CountVectorizer to form a vector
# of ngrams contained in the domain; that vector is then multiplied
# by the counts vector (which is a column sum of the count matrix).
def ngram_count(domain):
alexa_match = alexa_counts * alexa_vc.transform([domain]).T # Woot vector multiply and transpose Woo Hoo!
dict_match = dict_counts * dict_vc.transform([domain]).T
print '%s Alexa match:%d Dict match: %d' % (domain, alexa_match, dict_match)
# Examples:
ngram_count('google')
ngram_count('facebook')
ngram_count('1cb8a5f36f')
ngram_count('pterodactylfarts')
ngram_count('ptes9dro-dwacty2lfa5rrts')
ngram_count('beyonce')
ngram_count('bey666on4ce')
google Alexa match:17 Dict match: 14
facebook Alexa match:30 Dict match: 27
1cb8a5f36f Alexa match:0 Dict match: 0
pterodactylfarts Alexa match:34 Dict match: 77
ptes9dro-dwacty2lfa5rrts Alexa match:19 Dict match: 28
beyonce Alexa match:15 Dict match: 16
bey666on4ce Alexa match:2 Dict match: 1
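Under the hood that one-liner is just a weighted NGram count: the domain's NGram count vector dotted with the log10 frequency weights. Spelled out the long way (purely for intuition, not for speed):

counts = alexa_vc.transform(['facebook']).toarray()[0]  # NGram counts for this one domain
print sum(c * w for c, w in zip(counts, alexa_counts))  # same ~30 'Alexa match' as above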
# Compute NGram matches for all the domains and add to our dataframe
all_domains['alexa_grams']= alexa_counts * alexa_vc.transform(all_domains['domain']).T
all_domains['word_grams']= dict_counts * dict_vc.transform(all_domains['domain']).T
all_domains.head()
 | domain | class | length | entropy | alexa_grams | word_grams |
---|---|---|---|---|---|---|
0 | transworld | legit | 10 | 3.121928 | 39.051439 | 44.033642 |
2 | islam2all | legit | 9 | 2.419382 | 15.475215 | 17.367964 |
3 | pulitzer | legit | 8 | 3.000000 | 14.458222 | 28.441721 |
6 | danarimedia | legit | 11 | 2.663533 | 40.189599 | 54.829856 |
7 | heartbreakers | legit | 13 | 2.815072 | 45.354321 | 69.734483 |
all_domains.tail()
 | domain | class | length | entropy | alexa_grams | word_grams |
---|---|---|---|---|---|---|
84932 | ulxxqduryvv | dga | 11 | 2.913977 | 3.745231 | 6.464859 |
84933 | ummvzhin | dga | 8 | 2.750000 | 6.183945 | 7.180022 |
84934 | umsgnwgc | dga | 8 | 2.750000 | 3.272306 | 3.847079 |
84935 | umzsbhpkrgo | dga | 11 | 3.459432 | 1.653213 | 2.546543 |
84936 | umzuyjrfwyf | dga | 11 | 2.913977 | 0.000000 | 0.000000 |
# Use the vectorized operations of the dataframe to investigate differences
# between the alexa and word grams
all_domains['diff'] = all_domains['alexa_grams'] - all_domains['word_grams']
all_domains.sort(['diff'], ascending=True).head(10)
# The table below shows those domain names that are more 'dictionary' and less 'web'
 | domain | class | length | entropy | alexa_grams | word_grams | diff |
---|---|---|---|---|---|---|---|
63819 | bipolardisorderdepressionanxiety | legit | 32 | 3.616729 | 115.885999 | 193.844156 | -77.958157 |
34524 | stirringtroubleinternationally | legit | 30 | 3.481728 | 131.209086 | 207.204729 | -75.995643 |
63954 | americansforresponsiblesolutions | legit | 32 | 3.667838 | 145.071369 | 218.363956 | -73.292587 |
49070 | channel4embarrassingillnesses | legit | 29 | 3.440070 | 98.201709 | 169.721499 | -71.519790 |
5902 | pragmatismopolitico | legit | 19 | 3.326360 | 59.877723 | 121.536223 | -61.658500 |
49210 | egaliteetreconciliation | legit | 23 | 3.186393 | 92.257111 | 152.125325 | -59.868214 |
74130 | interoperabilitybridges | legit | 23 | 3.588354 | 93.803640 | 153.626312 | -59.822673 |
36976 | foreclosurephilippines | legit | 22 | 3.447402 | 72.844280 | 132.514638 | -59.670358 |
47055 | corazonindomablecapitulos | legit | 25 | 3.813661 | 74.706878 | 133.762750 | -59.055872 |
70113 | annamalicesissyselfhypnosis | legit | 27 | 3.429908 | 68.066490 | 126.667692 | -58.601201 |
all_domains.sort(['diff'], ascending=False).head(50)
# The table below shows those domain names that are more 'web' and less 'dictionary'
# Good ol' web....
 | domain | class | length | entropy | alexa_grams | word_grams | diff |
---|---|---|---|---|---|---|---|
22647 | gay-sex-pics-porn-pictures-gay-sex-porn-gay-se... | legit | 56 | 3.661056 | 160.035734 | 85.124184 | 74.911550 |
44091 | article-directory-free-submission-free-content | legit | 46 | 3.786816 | 233.518879 | 188.230453 | 45.288426 |
63865 | stream-free-movies-online | legit | 25 | 3.509275 | 118.944026 | 74.496915 | 44.447110 |
38570 | top-bookmarking-site-list | legit | 25 | 3.723074 | 117.162056 | 74.126061 | 43.035995 |
79963 | best-online-shopping-site | legit | 25 | 3.452879 | 122.152194 | 79.596640 | 42.555554 |
12532 | watch-free-movie-online | legit | 23 | 3.708132 | 101.010995 | 58.943451 | 42.067543 |
30198 | free-online-directory | legit | 21 | 3.403989 | 122.359797 | 80.735030 | 41.624767 |
40859 | free-links-articles-directory | legit | 29 | 3.702472 | 152.063809 | 110.955361 | 41.108448 |
30875 | online-web-directory | legit | 20 | 3.584184 | 114.439863 | 74.082948 | 40.356915 |
79001 | web-directory-online | legit | 20 | 3.584184 | 114.313583 | 74.082948 | 40.230634 |
78947 | movie-news-online | legit | 17 | 3.175123 | 81.036910 | 41.705735 | 39.331174 |
51532 | xxx-porno-sexvideos | legit | 19 | 3.260828 | 73.025165 | 35.176549 | 37.848617 |
42200 | free-tv-video-online | legit | 20 | 3.284184 | 83.341214 | 45.662984 | 37.678230 |
40771 | freegamesforyourwebsite | legit | 23 | 3.551191 | 114.291735 | 78.515881 | 35.775855 |
58275 | free-web-mobile-themes | legit | 22 | 3.356492 | 88.503556 | 54.149725 | 34.353831 |
70724 | seowebdirectoryonline | legit | 21 | 3.499228 | 126.111921 | 91.819498 | 34.292423 |
28283 | download-free-games | legit | 19 | 3.576618 | 84.492962 | 50.661490 | 33.831472 |
18894 | web-link-directory-site | legit | 23 | 3.729446 | 102.993078 | 69.367186 | 33.625893 |
4838 | the-web-directory | legit | 17 | 3.454822 | 87.520339 | 54.697986 | 32.822353 |
65871 | social-bookmarking-site | legit | 23 | 3.762267 | 116.664791 | 84.545021 | 32.119769 |
21743 | free-links-directory | legit | 20 | 3.646439 | 104.050046 | 71.956644 | 32.093402 |
74449 | money-news-online | legit | 17 | 3.101881 | 77.587799 | 45.775375 | 31.812424 |
48456 | free-sexvideosfc2 | legit | 17 | 3.381580 | 63.659477 | 31.878432 | 31.781045 |
57427 | your-new-directory-site | legit | 23 | 3.555533 | 99.130671 | 67.468067 | 31.662605 |
49041 | addsiteurlfreewebdirectory | legit | 26 | 3.609496 | 134.446230 | 103.178748 | 31.267482 |
34821 | own-free-website | legit | 16 | 3.250000 | 59.564153 | 28.839294 | 30.724859 |
10080 | web-directory-plus | legit | 18 | 3.836592 | 89.030979 | 58.484138 | 30.546841 |
43762 | web-directory-sites | legit | 19 | 3.471354 | 98.528255 | 68.088416 | 30.439839 |
34811 | free-sex-for-you | legit | 16 | 3.030639 | 46.653059 | 16.670504 | 29.982555 |
21390 | online-deal-coupon | legit | 18 | 3.308271 | 77.862004 | 47.886115 | 29.975889 |
48204 | acme-people-search-forum | legit | 24 | 3.553509 | 87.829242 | 57.898987 | 29.930255 |
73304 | free-webdirectory | legit | 17 | 3.337175 | 93.606205 | 63.858372 | 29.747833 |
44221 | good-web-directory | legit | 18 | 3.461320 | 88.201881 | 58.629789 | 29.572091 |
50633 | all-free-download | legit | 17 | 3.219528 | 69.337916 | 39.909696 | 29.428220 |
57095 | free-link-directory | legit | 19 | 3.536887 | 95.869062 | 66.507042 | 29.362020 |
58652 | global-web-directory | legit | 20 | 3.721928 | 100.465474 | 71.293587 | 29.171887 |
74259 | online-games-zone | legit | 17 | 3.292770 | 74.987811 | 45.881826 | 29.105985 |
77290 | us-web-directory | legit | 16 | 3.625000 | 80.044863 | 50.969551 | 29.075312 |
72128 | bookmarking-sites-lists | legit | 23 | 3.621176 | 115.664939 | 86.595393 | 29.069546 |
64948 | web-marketing-directory | legit | 23 | 3.849224 | 125.587313 | 96.714227 | 28.873086 |
79557 | freewebdirectory101 | legit | 19 | 3.471354 | 100.131488 | 71.474824 | 28.656664 |
72737 | free-seo-news | legit | 13 | 2.777363 | 45.267539 | 17.089020 | 28.178520 |
53449 | website-traffic-hog | legit | 19 | 3.721612 | 77.199578 | 49.156126 | 28.043452 |
50837 | myonlinewebdirectory | legit | 20 | 3.584184 | 121.155376 | 93.276322 | 27.879054 |
29303 | business-web-directorys | legit | 23 | 3.621176 | 125.854338 | 98.160126 | 27.694212 |
41310 | free-online-submission | legit | 22 | 3.413088 | 113.459411 | 85.792712 | 27.666699 |
76645 | linkdirectoryonline | legit | 19 | 3.326360 | 116.879367 | 89.392747 | 27.486621 |
30430 | online-deal-site | legit | 16 | 3.202820 | 68.103656 | 40.887484 | 27.216172 |
27227 | free-site-submit | legit | 16 | 3.202820 | 64.158023 | 37.127294 | 27.030729 |
62951 | mybusiness-web-directory | legit | 24 | 3.772055 | 124.553982 | 97.538670 | 27.015312 |
# Let's plot some stuff!
# Here we want to see whether our new 'alexa_grams' feature can help us differentiate between Legit/DGA
cond = all_domains['class'] == 'dga'
dga = all_domains[cond]
legit = all_domains[~cond]
plt.scatter(legit['length'], legit['alexa_grams'], s=120, c='#aaaaff', label='Alexa', alpha=.1)
plt.scatter(dga['length'], dga['alexa_grams'], s=40, c='r', label='DGA', alpha=.3)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Alexa NGram Matches')
# Let's plot some stuff!
# Here we want to see whether our new 'alexa_grams' feature can help us differentiate between Legit/DGA
cond = all_domains['class'] == 'dga'
dga = all_domains[cond]
legit = all_domains[~cond]
plt.scatter(legit['entropy'], legit['alexa_grams'], s=120, c='#aaaaff', label='Alexa', alpha=.2)
plt.scatter(dga['entropy'], dga['alexa_grams'], s=40, c='r', label='DGA', alpha=.3)
plt.legend()
pylab.xlabel('Domain Entropy')
pylab.ylabel('Alexa Gram Matches')
# Let's plot some stuff!
# Here we want to see whether our new 'word_grams' feature can help us differentiate between Legit/DGA
# Note: It doesn't look quite as good as the alexa_grams but it might generalize better (less overfitting).
cond = all_domains['class'] == 'dga'
dga = all_domains[cond]
legit = all_domains[~cond]
plt.scatter(legit['length'], legit['word_grams'], s=120, c='#aaaaff', label='Alexa', alpha=.2)
plt.scatter(dga['length'], dga['word_grams'], s=40, c='r', label='DGA', alpha=.3)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Dictionary NGram Matches')
# Let's look at which legit domains are scoring low on the word gram count
all_domains[(all_domains['word_grams']==0)].head()
 | domain | class | length | entropy | alexa_grams | word_grams | diff |
---|---|---|---|---|---|---|---|
3429 | dftc777 | legit | 7 | 2.128085 | 2.707570 | 0 | 2.707570 |
3715 | 5221766 | legit | 7 | 2.235926 | 0.000000 | 0 | 0.000000 |
4144 | 28365365 | legit | 8 | 2.250000 | 4.050612 | 0 | 4.050612 |
4235 | mm-mm-mm | legit | 8 | 0.811278 | 4.260668 | 0 | 4.260668 |
4297 | fzzfgjj | legit | 7 | 1.950212 | 0.954243 | 0 | 0.954243 |
# Okay these look kinda weird, let's use some nice Pandas functionality
# to look at some statistics around our new features.
all_domains[all_domains['class']=='legit'].describe()
 | length | entropy | alexa_grams | word_grams | diff |
---|---|---|---|---|---|
count | 60897.000000 | 60897.000000 | 60897.000000 | 60897.000000 | 60897.000000 |
mean | 10.873032 | 2.930306 | 33.083440 | 40.901852 | -7.818413 |
std | 3.393407 | 0.347134 | 19.233994 | 23.302539 | 9.388916 |
min | 7.000000 | -0.000000 | 0.000000 | 0.000000 | -77.958157 |
25% | 8.000000 | 2.725481 | 19.136340 | 24.056214 | -12.938013 |
50% | 10.000000 | 2.947703 | 28.703813 | 36.259089 | -7.108820 |
75% | 13.000000 | 3.169925 | 42.400101 | 53.036218 | -1.995136 |
max | 56.000000 | 4.070656 | 233.518879 | 233.648571 | 74.911550 |
# Let's look at how many domains are low in both word_grams and alexa_grams (just plotting the max of the two)
legit = all_domains[(all_domains['class']=='legit')]
max_grams = np.maximum(legit['alexa_grams'],legit['word_grams'])
ax = max_grams.hist(bins=80)
ax.figure.suptitle('Histogram of the Max NGram Score for Domains')
pylab.xlabel('Maximum NGram Score')
pylab.ylabel('Number of Domains')
# Let's look at which legit domains score low on both the alexa and word gram counts
weird_cond = (all_domains['class']=='legit') & (all_domains['word_grams']<3) & (all_domains['alexa_grams']<2)
weird = all_domains[weird_cond]
print weird.shape[0]
weird.head(30)
79
 | domain | class | length | entropy | alexa_grams | word_grams | diff |
---|---|---|---|---|---|---|---|
85 | 9to5lol | legit | 7 | 2.235926 | 1.991226 | 2.359835 | -0.368609 |
2611 | akb48mt | legit | 7 | 2.807355 | 1.301030 | 1.041393 | 0.259637 |
3715 | 5221766 | legit | 7 | 2.235926 | 0.000000 | 0.000000 | 0.000000 |
4297 | fzzfgjj | legit | 7 | 1.950212 | 0.954243 | 0.000000 | 0.954243 |
6045 | crx7601 | legit | 7 | 2.807355 | 0.000000 | 0.000000 | 0.000000 |
8531 | mw7zrv2 | legit | 7 | 2.807355 | 0.000000 | 0.000000 | 0.000000 |
10802 | jmm1818 | legit | 7 | 1.950212 | 0.903090 | 0.000000 | 0.903090 |
11961 | qq66699 | legit | 7 | 1.556657 | 1.322219 | 0.000000 | 1.322219 |
13200 | twcczhu | legit | 7 | 2.521641 | 1.724276 | 0.000000 | 1.724276 |
13756 | hljdns4 | legit | 7 | 2.807355 | 1.724276 | 0.000000 | 1.724276 |
14763 | 6470355 | legit | 7 | 2.521641 | 0.000000 | 0.000000 | 0.000000 |
17322 | d20pfsrd | legit | 8 | 2.750000 | 0.000000 | 0.000000 | 0.000000 |
20591 | lgcct27 | legit | 7 | 2.521641 | 1.176091 | 0.845098 | 0.330993 |
23458 | jdoqocy | legit | 7 | 2.521641 | 0.000000 | 2.813581 | -2.813581 |
24661 | 95178114 | legit | 8 | 2.405639 | 1.591065 | 0.000000 | 1.591065 |
24720 | ggmmxxoo | legit | 8 | 2.000000 | 1.113943 | 0.602060 | 0.511883 |
26454 | ggmm777 | legit | 7 | 1.556657 | 1.477121 | 0.602060 | 0.875061 |
27222 | rkg1866 | legit | 7 | 2.521641 | 0.954243 | 0.000000 | 0.954243 |
27676 | 1616bbs | legit | 7 | 1.950212 | 1.806180 | 1.322219 | 0.483961 |
29142 | 5278bbs | legit | 7 | 2.521641 | 1.806180 | 1.322219 | 0.483961 |
29551 | 05tz2e9 | legit | 7 | 2.807355 | 0.000000 | 0.000000 | 0.000000 |
29858 | 1532777 | legit | 7 | 2.128085 | 1.477121 | 0.000000 | 1.477121 |
30119 | 5311314 | legit | 7 | 1.842371 | 1.000000 | 0.000000 | 1.000000 |
30290 | zzgcjyzx | legit | 8 | 2.405639 | 0.000000 | 0.000000 | 0.000000 |
30739 | xn--g5t518j | legit | 11 | 3.095795 | 1.000000 | 0.000000 | 1.000000 |
31465 | 7210578 | legit | 7 | 2.521641 | 0.903090 | 0.000000 | 0.903090 |
31951 | fj96336 | legit | 7 | 2.235926 | 0.000000 | 0.000000 | 0.000000 |
34455 | xn--42cgk1gc8crdb1htg3d | legit | 23 | 3.849224 | 1.255273 | 2.411620 | -1.156347 |
35554 | 720pmkv | legit | 7 | 2.807355 | 0.000000 | 0.000000 | 0.000000 |
36166 | d4ffr55 | legit | 7 | 2.235926 | 1.079181 | 2.260071 | -1.180890 |
# Epiphany... Alexa really may not be the best 'exemplar' set...
#             (probably a no-shit moment for everyone else :)
#
# Discussion: If you're using these as exemplars of NOT DGA, then you're probably
#             making things very hard on your machine learning algorithm.
#             Perhaps we should have two categories of Alexa domains: 'legit'
#             and 'weird', based on some definition of weird.
#             Looking at the entries above... we have approx 80 domains
#             that we're going to mark as 'weird'.
#
all_domains.loc[weird_cond, 'class'] = 'weird'
print all_domains['class'].value_counts()
all_domains[all_domains['class'] == 'weird'].head()
legit    60818
dga       2397
weird       79
dtype: int64
 | domain | class | length | entropy | alexa_grams | word_grams | diff |
---|---|---|---|---|---|---|---|
85 | 9to5lol | weird | 7 | 2.235926 | 1.991226 | 2.359835 | -0.368609 |
2611 | akb48mt | weird | 7 | 2.807355 | 1.301030 | 1.041393 | 0.259637 |
3715 | 5221766 | weird | 7 | 2.235926 | 0.000000 | 0.000000 | 0.000000 |
4297 | fzzfgjj | weird | 7 | 1.950212 | 0.954243 | 0.000000 | 0.954243 |
6045 | crx7601 | weird | 7 | 2.807355 | 0.000000 | 0.000000 | 0.000000 |
# Now we try our machine learning algorithm again with the new features
# (Alexa and Dictionary NGrams) and with the bad exemplars relabeled as 'weird'.
X = all_domains.as_matrix(['length', 'entropy', 'alexa_grams', 'word_grams'])
# Labels (scikit learn uses 'y' for classification labels)
y = np.array(all_domains['class'].tolist())
# Train on a 80/20 split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Now plot the results of the 80/20 split in a confusion matrix
labels = ['legit', 'weird', 'dga']
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
legit/legit: 99.58% (12140/12191)
legit/weird: 0.01% (1/12191)
legit/dga: 0.41% (50/12191)
weird/legit: 0.00% (0/10)
weird/weird: 30.00% (3/10)
weird/dga: 70.00% (7/10)
dga/legit: 14.63% (67/458)
dga/weird: 0.22% (1/458)
dga/dga: 85.15% (390/458)
# Huh, well that seemed to work 'ok', but you don't really want a classifier
# that outputs 3 classes; you'd like a classifier that flags domains as DGA or not.
# This was a path that seemed like a good idea until it wasn't....
# Perhaps we will just exclude the weird class from our ML training
not_weird = all_domains[all_domains['class'] != 'weird']
X = not_weird.as_matrix(['length', 'entropy', 'alexa_grams', 'word_grams'])
# Labels (scikit learn uses 'y' for classification labels)
y = np.array(not_weird['class'].tolist())
# Train on a 80/20 split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Now plot the results of the 80/20 split in a confusion matrix
labels = ['legit', 'dga']
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
legit/legit: 99.56% (12111/12165)
legit/dga: 0.44% (54/12165)
dga/legit: 17.99% (86/478)
dga/dga: 82.01% (392/478)
# Well it's definitely better.. but haven't we just cheated by removing
# the weird domains? Well perhaps, but on some level we're removing
# outliers that are bad exemplars. So to validate that the model is still
# doing the right thing let's try our new model prediction on our hold out sets.
# First train on the whole thing before looking at prediction performance
clf.fit(X, y)
# Pull together our hold out set
hold_out_domains = pd.concat([hold_out_alexa, hold_out_dga], ignore_index=True)
# Add a length field for the domain
hold_out_domains['length'] = [len(x) for x in hold_out_domains['domain']]
hold_out_domains = hold_out_domains[hold_out_domains['length'] > 6]
# Add an entropy field for the domain
hold_out_domains['entropy'] = [entropy(x) for x in hold_out_domains['domain']]
# Compute NGram matches for all the domains and add to our dataframe
hold_out_domains['alexa_grams']= alexa_counts * alexa_vc.transform(hold_out_domains['domain']).T
hold_out_domains['word_grams']= dict_counts * dict_vc.transform(hold_out_domains['domain']).T
hold_out_domains.head()
 | domain | class | length | entropy | alexa_grams | word_grams |
---|---|---|---|---|---|---|
0 | alcatelonetouch | legit | 15 | 3.106891 | 49.001768 | 79.015001 |
1 | optumhealthfinancial | legit | 20 | 3.584184 | 68.667084 | 87.158661 |
4 | elderscrollsonline | legit | 18 | 3.016876 | 76.441834 | 94.462092 |
5 | mobango | legit | 7 | 2.521641 | 18.020832 | 22.072036 |
6 | costaud | legit | 7 | 2.807355 | 16.037393 | 25.008755 |
# List of feature vectors (scikit learn uses 'X' for the matrix of feature vectors)
hold_X = hold_out_domains.as_matrix(['length', 'entropy', 'alexa_grams', 'word_grams'])
# Labels (scikit learn uses 'y' for classification labels)
hold_y = np.array(hold_out_domains['class'].tolist())
# Now run through the predictive model
hold_y_pred = clf.predict(hold_X)
# Add the prediction array to the dataframe
hold_out_domains['pred'] = hold_y_pred
# Now plot the results
labels = ['legit', 'dga']
cm = confusion_matrix(hold_y, hold_y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
legit/legit: 99.51% (6713/6746)
legit/dga: 0.49% (33/6746)
dga/legit: 15.73% (42/267)
dga/dga: 84.27% (225/267)
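scikit-learn also has a built-in per-class precision/recall summary that works directly on these labels (a quick sketch using the hold-out predictions above):

from sklearn.metrics import classification_report
print classification_report(hold_y, hold_y_pred)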
# Okay so on our 10% hold out set (~7k domains after the short-domain filter)
# 75 domains were mis-classified. At this point we've made some good progress
# so we're going to claim success :)
#  - Out of ~7k domains 75 were mismarked
#  - false positives (Alexa marked as DGA) = ~0.5%
#  - about 84% of the DGA are getting marked
# Note: Alexa 1M results on the 10% hold out (100k domains) were in the same ballpark
#  - Out of 100k domains 432 were mismarked
#  - false positives (Alexa marked as DGA) = 0.4%
#  - about 70% of the DGA are getting marked
# Now we're going to do some post analysis on how the ML algorithm performed.
# Let's look at a couple of plots to see which domains were misclassified.
# Looking at Length vs. Alexa NGrams
fp_cond = ((hold_out_domains['class'] == 'legit') & (hold_out_domains['pred']=='dga'))
fp = hold_out_domains[fp_cond]
fn_cond = ((hold_out_domains['class'] == 'dga') & (hold_out_domains['pred']=='legit'))
fn = hold_out_domains[fn_cond]
okay = hold_out_domains[hold_out_domains['class'] == hold_out_domains['pred']]
plt.scatter(okay['length'], okay['alexa_grams'], s=100, c='#aaaaff', label='Okay', alpha=.1)
plt.scatter(fp['length'], fp['alexa_grams'], s=40, c='r', label='False Positive', alpha=.5)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Alexa NGram Matches')
# Looking at Length vs. Dictionary NGrams
cond = (hold_out_domains['class'] != hold_out_domains['pred'])
misclassified = hold_out_domains[cond]
okay = hold_out_domains[~cond]
plt.scatter(okay['length'], okay['word_grams'], s=100, c='#aaaaff', label='Okay', alpha=.2)
plt.scatter(misclassified['length'], misclassified['word_grams'], s=40, c='r', label='Misclassified', alpha=.3)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Dictionary NGram Matches')
misclassified.head()
 | domain | class | length | entropy | alexa_grams | word_grams | pred |
---|---|---|---|---|---|---|---|
896 | dom2-fan | legit | 8 | 3.000000 | 6.568955 | 5.656685 | dga |
1296 | mm8mm8-6642 | legit | 11 | 2.368523 | 0.000000 | 0.000000 | dga |
1378 | 4390208 | legit | 7 | 2.521641 | 0.000000 | 0.000000 | dga |
1514 | sqrt121 | legit | 7 | 2.521641 | 0.000000 | 0.000000 | dga |
1687 | 02022222222 | legit | 11 | 0.684038 | 0.903090 | 0.000000 | dga |
misclassified[misclassified['class'] == 'dga'].head()
 | domain | class | length | entropy | alexa_grams | word_grams | pred |
---|---|---|---|---|---|---|---|
9184 | usbiezgac | dga | 9 | 3.169925 | 7.825928 | 9.172547 | legit |
9185 | ushcnewo | dga | 8 | 3.000000 | 12.265642 | 13.904812 | legit |
9187 | usnspdph | dga | 8 | 2.500000 | 5.182278 | 6.556287 | legit |
9190 | utamehz | dga | 7 | 2.807355 | 10.741352 | 14.733893 | legit |
9192 | utfowept | dga | 8 | 2.750000 | 7.095911 | 17.416355 | legit |
# We can also look at what features the learning algorithm thought were the most important
importances = zip(['length', 'entropy', 'alexa_grams', 'word_grams'], clf.feature_importances_)
importances
# From the list below we see our feature importance scores. There's a lot of feature selection,
# sensitivity analysis, etc. that you could do at this point if you wanted.
[('length', 0.13110737655160343), ('entropy', 0.15589784074688856), ('alexa_grams', 0.48657282029928439), ('word_grams', 0.22642196240222362)]
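If a picture helps, here's a quick bar chart of those same importance scores (a sketch using the pylab conveniences already loaded; feat_names is just a local helper):

feat_names = ['length', 'entropy', 'alexa_grams', 'word_grams']
pylab.bar(range(len(feat_names)), clf.feature_importances_, align='center')
pylab.xticks(range(len(feat_names)), feat_names)
pylab.ylabel('Importance')
pylab.title('Random Forest Feature Importances')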
# Discussion for how to use the resulting models.
# Typically Machine Learning comes in two phases
# - Training of the Model
# - Evaluation of new observations against the Model
# This notebook is about exploration of the data and training the model.
# After you have a model that you are satisfied with, just 'pickle' it
# at the end of your training script and then in a separate
# evaluation script 'unpickle' it and evaluate/score new observations
# coming in (through a file, or ZeroMQ, or whatever...)
#
# In this case we'd have to pickle the RandomForest classifier
# and the two vectorizing transforms (alexa_grams and word_grams).
# See 'test_it' below for how to use them in evaluation mode.
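Here's a minimal sketch of that round trip (the file name 'dga_model.pkl' and the dict bundling are just one reasonable way to do it):

import pickle
# Training side: bundle the classifier with the transforms/weights it depends on
model = {'clf': clf, 'alexa_vc': alexa_vc, 'alexa_counts': alexa_counts,
         'dict_vc': dict_vc, 'dict_counts': dict_counts}
pickle.dump(model, open('dga_model.pkl', 'wb'))

# Evaluation side: load it back and score new domains with test_it-style logic
model = pickle.load(open('dga_model.pkl', 'rb'))
clf, alexa_vc, alexa_counts = model['clf'], model['alexa_vc'], model['alexa_counts']
dict_vc, dict_counts = model['dict_vc'], model['dict_counts']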
# test_it shows how to do evaluation, also fun for manual testing below :)
def test_it(domain):
_alexa_match = alexa_counts * alexa_vc.transform([domain]).T # Woot matrix multiply and transpose Woo Hoo!
_dict_match = dict_counts * dict_vc.transform([domain]).T
_X = [len(domain), entropy(domain), _alexa_match, _dict_match]
print '%s : %s' % (domain, clf.predict(_X)[0])
# Examples (feel free to change these and see the results!)
test_it('google')
test_it('google88')
test_it('facebook')
test_it('1cb8a5f36f')
test_it('pterodactylfarts')
test_it('ptes9dro-dwacty2lfa5rrts')
test_it('beyonce')
test_it('bey666on4ce')
test_it('supersexy')
test_it('yourmomissohotinthesummertime')
test_it('35-sdf-09jq43r')
test_it('clicksecurity')
google : legit
google88 : legit
facebook : legit
1cb8a5f36f : dga
pterodactylfarts : legit
ptes9dro-dwacty2lfa5rrts : dga
beyonce : legit
bey666on4ce : dga
supersexy : legit
yourmomissohotinthesummertime : legit
35-sdf-09jq43r : dga
clicksecurity : legit
The combination of IPython, Pandas and scikit-learn let us pull in some junky data, clean it up, plot it, understand it and slap it with some machine learning!

Clearly a lot more formality could be applied: plotting learning curves, adjusting for overfitting, feature selection, and so on. There are some really great machine learning resources that cover this deeper material. In particular we highly recommend the work and presentations of Olivier Grisel at INRIA Saclay. http://ogrisel.com/
Some papers on detecting DGA domains: