In this notebook we're going to use some great Python modules to explore, understand and classify domains as being 'legit' or as having a high probability of being generated by a DGA (Domain Generation Algorithm). We have 'legit' in quotes because we're using the domains in Alexa as the 'legit' set. The primary motivation is to explore the nexus of IPython, Pandas and scikit-learn, with DGA classification as a vehicle for that exploration. The exercise intentionally shows common missteps, warts in the data, paths that didn't work out that well, and results that could definitely be improved upon. In general, capturing what worked and what didn't is not only more realistic but often much more informative. :)
Suggestions/Comments: Please send suggestions or bugs (I'm sure) to bwylie at clicksecurity.com. Also if you have some datasets or would like to explore alternative approaches please touch base.
import sklearn.feature_extraction
sklearn.__version__
'0.14.1'
import pandas as pd
pd.__version__
'0.12.0'
# Set default pylab stuff
# (assumes the notebook was started with %pylab inline, which also pulls in
#  numpy as np, matplotlib.pyplot as plt, and the pylab namespace used below)
pylab.rcParams['figure.figsize'] = (14.0, 5.0)
pylab.rcParams['axes.grid'] = True
# Version 0.12.0 of Pandas has a DeprecationWarning about Height blah that I'm ignoring
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
# This is the Alexa 100k domain list, we're not using the 1 Million just for speed reasons. Results
# for the Alexa 1M are given at the bottom of the notebook.
alexa_dataframe = pd.read_csv('data/alexa_100k.csv', names=['rank','uri'], header=None, encoding='utf-8')
alexa_dataframe.head()
 | rank | uri |
---|---|---|
0 | 1 | facebook.com |
1 | 2 | google.com |
2 | 3 | youtube.com |
3 | 4 | yahoo.com |
4 | 5 | baidu.com |
# Okay for this exercise we need the 2LD (the registered, second-level domain label) and nothing else
import numpy as np
import tldextract

def domain_extract(uri):
    ext = tldextract.extract(uri)
    if not ext.suffix:
        return np.nan
    else:
        return ext.domain
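To see what tldextract actually pulls apart, here's a quick illustration (the repr below is tldextract's ExtractResult named tuple; exact formatting may vary by version):

tldextract.extract('www.bbc.co.uk')
# ExtractResult(subdomain='www', domain='bbc', suffix='co.uk')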
alexa_dataframe['domain'] = [ domain_extract(uri) for uri in alexa_dataframe['uri']]
del alexa_dataframe['rank']
del alexa_dataframe['uri']
alexa_dataframe.head()
 | domain |
---|---|
0 | facebook |
1 | google |
2 | youtube |
3 | yahoo |
4 | baidu |
alexa_dataframe.tail()
 | domain |
---|---|
99995 | rhbabyandchild |
99996 | rm |
99997 | sat1 |
99998 | nahimunkar |
99999 | musi |
# It's possible we have NaNs from blank lines or whatever
alexa_dataframe = alexa_dataframe.dropna()
alexa_dataframe = alexa_dataframe.drop_duplicates()
# Set the class
alexa_dataframe['class'] = 'legit'
# Shuffle the data (important for training/testing)
alexa_dataframe = alexa_dataframe.reindex(np.random.permutation(alexa_dataframe.index))
alexa_total = alexa_dataframe.shape[0]
print 'Total Alexa domains %d' % alexa_total
# Hold out 10% (slice endpoints must be ints; float slicing is asking for trouble)
split = int(alexa_total*.9)
hold_out_alexa = alexa_dataframe[split:]
alexa_dataframe = alexa_dataframe[:split]
print 'Number of Alexa domains: %d' % alexa_dataframe.shape[0]
Total Alexa domains 91712
Number of Alexa domains: 82540
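As an aside, scikit-learn's train_test_split can carve off the same 10% hold-out in one call (a sketch; train_test_split shuffles for you, which is why we shuffled manually before slicing above):

from sklearn.cross_validation import train_test_split
train_doms, hold_doms = train_test_split(alexa_dataframe['domain'].values, test_size=0.1)
print len(train_doms), len(hold_doms)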
alexa_dataframe.head()
 | domain | class |
---|---|---|
20904 | transworld | legit |
82690 | lkfun | legit |
85167 | islam2all | legit |
62859 | pulitzer | legit |
85573 | sge | legit |
# Read in the DGA domains
dga_dataframe = pd.read_csv('data/dga_domains.txt', names=['raw_domain'], header=None, encoding='utf-8')
# We noticed that the blacklist values just differ by capitalization or .com/.org/.info
dga_dataframe['domain'] = dga_dataframe['raw_domain'].map(lambda x: x.split('.')[0].strip().lower())
del dga_dataframe['raw_domain']
# It's possible we have NaNs from blank lines or whatever
dga_dataframe = dga_dataframe.dropna()
dga_dataframe = dga_dataframe.drop_duplicates()
dga_total = dga_dataframe.shape[0]
print 'Total DGA domains %d' % dga_total
# Set the class
dga_dataframe['class'] = 'dga'
# Hold out 10%
split = int(dga_total*.9)
hold_out_dga = dga_dataframe[split:]
dga_dataframe = dga_dataframe[:split]
print 'Number of DGA domains: %d' % dga_dataframe.shape[0]
Total DGA domains 2664
Number of DGA domains: 2397
dga_dataframe.head()
 | domain | class |
---|---|---|
0 | 04055051be412eea5a61b7da8438be3d | dga |
1 | 1cb8a5f36f | dga |
2 | 30acd347397c34fc273e996b22951002 | dga |
3 | 336c986a284e2b3bc0f69f949cb437cb | dga |
5 | 40a43e61e56a5c218cf6c22aca27f7ee | dga |
# Concatenate the domains in a big pile!
all_domains = pd.concat([alexa_dataframe, dga_dataframe], ignore_index=True)
# Add a length field for the domain
all_domains['length'] = [len(x) for x in all_domains['domain']]
# Okay, since we're trying to detect dynamically generated domains and short
# domains (length <= 6) are crazy random even for 'legit' domains, we're going
# to punt on short domains (perhaps just white/black list the short domains?)
all_domains = all_domains[all_domains['length'] > 6]
# Grabbed this from Rosetta Code (rosettacode.org)
import math
from collections import Counter
def entropy(s):
p, lns = Counter(s), float(len(s))
return -sum( count/lns * math.log(count/lns, 2) for count in p.values())
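A few hand-checkable values to sanity check the function (exact by the formula above):

print entropy('aaaa')  # -0.0 : one repeated symbol, zero bits (the minus sign is a float quirk of -sum)
print entropy('abab')  #  1.0 : two equally likely symbols -> 1 bit per character
print entropy('abcd')  #  2.0 : four equally likely symbols -> 2 bits per character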
# Add an entropy field for the domain
all_domains['entropy'] = [entropy(x) for x in all_domains['domain']]
all_domains.head()
 | domain | class | length | entropy |
---|---|---|---|---|
0 | transworld | legit | 10 | 3.121928 |
2 | islam2all | legit | 9 | 2.419382 |
3 | pulitzer | legit | 8 | 3.000000 |
6 | danarimedia | legit | 11 | 2.663533 |
7 | heartbreakers | legit | 13 | 2.815072 |
all_domains.tail()
 | domain | class | length | entropy |
---|---|---|---|---|
84932 | ulxxqduryvv | dga | 11 | 2.913977 |
84933 | ummvzhin | dga | 8 | 2.750000 |
84934 | umsgnwgc | dga | 8 | 2.750000 |
84935 | umzsbhpkrgo | dga | 11 | 3.459432 |
84936 | umzuyjrfwyf | dga | 11 | 2.913977 |
# Boxplots show you the distribution of the data (spread).
# http://en.wikipedia.org/wiki/Box_plot
# Plot the length and entropy of domains
all_domains.boxplot('length','class')
pylab.ylabel('Domain Length')
all_domains.boxplot('entropy','class')
pylab.ylabel('Domain Entropy')
# Split the classes up so we can set colors, size, labels
cond = all_domains['class'] == 'dga'
dga = all_domains[cond]
alexa = all_domains[~cond]
plt.scatter(alexa['length'], alexa['entropy'], s=140, c='#aaaaff', label='Alexa', alpha=.2)
plt.scatter(dga['length'], dga['entropy'], s=40, c='r', label='DGA', alpha=.3)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Domain Entropy')
# Below you can see that our DGA domains do tend to have higher entropy than Alexa on average.
# Let's look at the types of domains that have entropy higher than 4
high_entropy_domains = all_domains[all_domains['entropy'] > 4]
print 'Num Domains above 4 entropy: %.2f%% %d (out of %d)' % \
(100.0*high_entropy_domains.shape[0]/all_domains.shape[0],high_entropy_domains.shape[0],all_domains.shape[0])
print "Num high entropy legit: %d" % high_entropy_domains[high_entropy_domains['class']=='legit'].shape[0]
print "Num high entropy DGA: %d" % high_entropy_domains[high_entropy_domains['class']=='dga'].shape[0]
high_entropy_domains[high_entropy_domains['class']=='legit'].head()
# Looking at the results below, we do see that there are more domains
# in the DGA group that are high entropy but only a small percentage
# of the domains are in that high entropy range...
Num Domains above 4 entropy: 0.57% 361 (out of 63294)
Num high entropy legit: 3
Num high entropy DGA: 358
 | domain | class | length | entropy |
---|---|---|---|---|
29392 | theukwebdesigncompany | legit | 21 | 4.070656 |
37378 | texaswithlove1982-amomentlikethis | legit | 33 | 4.051822 |
55073 | congresomundialjjrperu2009 | legit | 26 | 4.056021 |
high_entropy_domains[high_entropy_domains['class']=='dga'].head()
 | domain | class | length | entropy |
---|---|---|---|---|
82558 | a17btkyb38gxe41pwd50nxmzjxiwjwdwfrp52 | dga | 37 | 4.540402 |
82559 | a17c49l68ntkqnuhvkrmyb28fubvn30e31g43dq | dga | 39 | 4.631305 |
82560 | a17d60gtnxk47gskti15izhvlviyksh64nqkz | dga | 37 | 4.270132 |
82561 | a17erpzfzh64c69csi35bqgvp52drita67jzmy | dga | 38 | 4.629249 |
82562 | a17fro51oyk67b18ksfzoti55j36p32o11fvc29cr | dga | 41 | 4.305859 |
# In preparation for using scikit learn we're just going to use
# some handles that help take us from pandas land to scikit land
# List of feature vectors (scikit learn uses 'X' for the matrix of feature vectors)
X = all_domains.as_matrix(['length', 'entropy'])
# Labels (scikit learn uses 'y' for classification labels)
y = np.array(all_domains['class'].tolist()) # Yes, this is weird but it needs
# to be an np.array of strings
# Random Forest is a popular ensemble machine learning classifier.
# http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html
#
import sklearn.ensemble
import sklearn.cross_validation
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=20, compute_importances=True) # 20 trees in the forest
# Now we can use scikit learn's cross validation to assess predictive performance.
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=5, n_jobs=4)
print scores
[ 0.9688759 0.96784896 0.96729599 0.96753298 0.96887344]
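A compact way to summarize those folds (the usual mean plus/minus two standard deviations):

print 'Accuracy: %0.3f (+/- %0.3f)' % (scores.mean(), scores.std() * 2)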
# Wow 96% accurate! At this point we could claim success and we'd be gigantic morons...
# Recall that (after the hold-out split and the short-domain filter) we have ~61k 'legit'
# domains and only ~2.4k DGA domains. So a classifier that marked everything as
# legit would be about 96% accurate....
# So we dive in a bit and look at the predictive performance more deeply.
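To make that base rate concrete, here's what a classifier that answers 'legit' for everything would score (computed from the labels we just built):

print 'All-legit baseline accuracy: %.2f%%' % (100.0 * (y == 'legit').sum() / len(y))
# ~96%, which is why the cross validation scores above aren't impressive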
# Train on a 80/20 split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Now plot the results of the 80/20 split in a confusion matrix
from sklearn.metrics import confusion_matrix
labels = ['legit', 'dga']
cm = confusion_matrix(y_test, y_pred, labels)
def plot_cm(cm, labels):
    # Compute percentages (row-normalized: each row sums to 100%)
    percent = (cm * 100.0) / cm.sum(axis=1)[:, np.newaxis]
    print 'Confusion Matrix Stats'
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            print "%s/%s: %.2f%% (%d/%d)" % (label_i, label_j, (percent[i][j]), cm[i][j], cm[i].sum())

    # Show confusion matrix
    # Thanks kermit666 from stackoverflow :)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.grid(b=False)
    cax = ax.matshow(percent, cmap='coolwarm')
    pylab.title('Confusion matrix of the classifier')
    fig.colorbar(cax)
    ax.set_xticklabels([''] + labels)
    ax.set_yticklabels([''] + labels)
    pylab.xlabel('Predicted')
    pylab.ylabel('True')
    pylab.show()
plot_cm(cm, labels)
# We can see below that our suspicions were correct and the classifier is
# marking almost everything as Alexa. We FAIL.. science is hard... let's go drinking....
Confusion Matrix Stats
legit/legit: 99.89% (12152/12165)
legit/dga: 0.11% (13/12165)
dga/legit: 80.16% (396/494)
dga/dga: 19.84% (98/494)
# Well our Mom told us we were still cool.. so with that encouragement we're
# going to compute NGrams for every Alexa domain and see if we can use the
# NGrams to help us better differentiate and mark DGA domains...
# Scikit learn has a nice NGram generator that can generate either char NGrams or word NGrams (we're using char).
# Parameters:
#  - ngram_range=(3,5)  # Give me all ngrams of length 3, 4, and 5
#  - min_df=1e-4        # Minimum document frequency. At 1e-4 we're saying give us NGrams that
#                       # happen in at least 0.01% of the domains (so for 100k domains... at least 10 of them)
alexa_vc = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(3,5), min_df=1e-4, max_df=1.0)
# I'm SURE there's a better way to store all the counts but not sure...
# At least the min_df parameter has already done some thresholding
counts_matrix = alexa_vc.fit_transform(alexa_dataframe['domain'])
alexa_counts = np.log10(counts_matrix.sum(axis=0).getA1())
ngrams_list = alexa_vc.get_feature_names()
# For fun sort it and show it
import operator
_sorted_ngrams = sorted(zip(ngrams_list, alexa_counts), key=operator.itemgetter(1), reverse=True)
print 'Alexa NGrams: %d' % len(_sorted_ngrams)
for ngram, count in _sorted_ngrams[:10]:
print ngram, count
Alexa NGrams: 27012
ing 3.40001963507
lin 3.3818368
ine 3.35295391171
tor 3.22349594096
ter 3.21827285357
ion 3.20411998266
ent 3.18184358794
por 3.1562461904
the 3.15228834438
ree 3.11693964655
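If the NGram parameters feel abstract, here's a toy example (toy_vc is just for illustration) of what the character NGrams of a single domain look like; min_df is left at its default so nothing gets thresholded:

toy_vc = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(3,5))
toy_vc.fit(['facebook'])
print toy_vc.get_feature_names()
# [u'ace', u'aceb', u'acebo', u'boo', u'book', u'ceb', u'cebo', u'ceboo',
#  u'ebo', u'eboo', u'ebook', u'fac', u'face', u'faceb', u'ook']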
# We're also going to throw in a bunch of dictionary words
word_dataframe = pd.read_csv('data/words.txt', names=['word'], header=None, dtype={'word': np.str}, encoding='utf-8')
# Cleanup words from dictionary
word_dataframe = word_dataframe[word_dataframe['word'].map(lambda x: str(x).isalpha())]
word_dataframe = word_dataframe.applymap(lambda x: str(x).strip().lower())
word_dataframe = word_dataframe.dropna()
word_dataframe = word_dataframe.drop_duplicates()
word_dataframe.head(10)
 | word |
---|---|
37 | a |
48 | aa |
51 | aaa |
53 | aaaa |
54 | aaaaaa |
55 | aaal |
56 | aaas |
57 | aaberg |
58 | aachen |
59 | aae |
# Now compute NGrams on the dictionary words
# Same logic as above...
dict_vc = sklearn.feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(3,5), min_df=1e-5, max_df=1.0)
counts_matrix = dict_vc.fit_transform(word_dataframe['word'])
dict_counts = np.log10(counts_matrix.sum(axis=0).getA1())
ngrams_list = dict_vc.get_feature_names()
# For fun sort it and show it
import operator
_sorted_ngrams = sorted(zip(ngrams_list, dict_counts), key=operator.itemgetter(1), reverse=True)
print 'Word NGrams: %d' % len(_sorted_ngrams)
for ngram, count in _sorted_ngrams[:10]:
print ngram, count
Word NGrams: 142275
ing 4.38730082245
ess 4.20487933376
ati 4.19334725639
ion 4.16503647999
ter 4.16241503611
nes 4.11250445877
tio 4.07682242334
ate 4.07236020396
ent 4.06963110262
tion 4.04960561259
# We use the transform method of the CountVectorizer to form a vector
# of ngrams contained in the domain; that vector is then multiplied
# by the counts vector (which is a column sum of the count matrix).
def ngram_count(domain):
alexa_match = alexa_counts * alexa_vc.transform([domain]).T # Woot vector multiply and transpose Woo Hoo!
dict_match = dict_counts * dict_vc.transform([domain]).T
print '%s Alexa match:%d Dict match: %d' % (domain, alexa_match, dict_match)
# Examples:
ngram_count('google')
ngram_count('facebook')
ngram_count('1cb8a5f36f')
ngram_count('pterodactylfarts')
ngram_count('ptes9dro-dwacty2lfa5rrts')
ngram_count('beyonce')
ngram_count('bey666on4ce')
google Alexa match:17 Dict match: 14
facebook Alexa match:30 Dict match: 27
1cb8a5f36f Alexa match:0 Dict match: 0
pterodactylfarts Alexa match:34 Dict match: 77
ptes9dro-dwacty2lfa5rrts Alexa match:19 Dict match: 28
beyonce Alexa match:15 Dict match: 16
bey666on4ce Alexa match:2 Dict match: 1
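Under the hood that one-liner is just a weighted NGram count: the domain's NGram count vector dotted with the log10 frequency weights. Spelled out the long way (purely for intuition, not for speed):

counts = alexa_vc.transform(['facebook']).toarray()[0]  # NGram counts for this one domain
print sum(c * w for c, w in zip(counts, alexa_counts))  # same ~30 'Alexa match' as above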
# Compute NGram matches for all the domains and add to our dataframe
all_domains['alexa_grams']= alexa_counts * alexa_vc.transform(all_domains['domain']).T
all_domains['word_grams']= dict_counts * dict_vc.transform(all_domains['domain']).T
all_domains.head()
 | domain | class | length | entropy | alexa_grams | word_grams |
---|---|---|---|---|---|---|
0 | transworld | legit | 10 | 3.121928 | 39.051439 | 44.033642 |
2 | islam2all | legit | 9 | 2.419382 | 15.475215 | 17.367964 |
3 | pulitzer | legit | 8 | 3.000000 | 14.458222 | 28.441721 |
6 | danarimedia | legit | 11 | 2.663533 | 40.189599 | 54.829856 |
7 | heartbreakers | legit | 13 | 2.815072 | 45.354321 | 69.734483 |
all_domains.tail()
 | domain | class | length | entropy | alexa_grams | word_grams |
---|---|---|---|---|---|---|
84932 | ulxxqduryvv | dga | 11 | 2.913977 | 3.745231 | 6.464859 |
84933 | ummvzhin | dga | 8 | 2.750000 | 6.183945 | 7.180022 |
84934 | umsgnwgc | dga | 8 | 2.750000 | 3.272306 | 3.847079 |
84935 | umzsbhpkrgo | dga | 11 | 3.459432 | 1.653213 | 2.546543 |
84936 | umzuyjrfwyf | dga | 11 | 2.913977 | 0.000000 | 0.000000 |
# Use the vectorized operations of the dataframe to investigate differences
# between the alexa and word grams
all_domains['diff'] = all_domains['alexa_grams'] - all_domains['word_grams']
all_domains.sort(['diff'], ascending=True).head(10)
# The table below shows those domain names that are more 'dictionary' and less 'web'
 | domain | class | length | entropy | alexa_grams | word_grams | diff |
---|---|---|---|---|---|---|---|
63819 | bipolardisorderdepressionanxiety | legit | 32 | 3.616729 | 115.885999 | 193.844156 | -77.958157 |
34524 | stirringtroubleinternationally | legit | 30 | 3.481728 | 131.209086 | 207.204729 | -75.995643 |
63954 | americansforresponsiblesolutions | legit | 32 | 3.667838 | 145.071369 | 218.363956 | -73.292587 |
49070 | channel4embarrassingillnesses | legit | 29 | 3.440070 | 98.201709 | 169.721499 | -71.519790 |
5902 | pragmatismopolitico | legit | 19 | 3.326360 | 59.877723 | 121.536223 | -61.658500 |
49210 | egaliteetreconciliation | legit | 23 | 3.186393 | 92.257111 | 152.125325 | -59.868214 |
74130 | interoperabilitybridges | legit | 23 | 3.588354 | 93.803640 | 153.626312 | -59.822673 |
36976 | foreclosurephilippines | legit | 22 | 3.447402 | 72.844280 | 132.514638 | -59.670358 |
47055 | corazonindomablecapitulos | legit | 25 | 3.813661 | 74.706878 | 133.762750 | -59.055872 |
70113 | annamalicesissyselfhypnosis | legit | 27 | 3.429908 | 68.066490 | 126.667692 | -58.601201 |
all_domains.sort(['diff'], ascending=False).head(50)
# The table below shows those domain names that are more 'web' and less 'dictionary'
# Good ol' web....
 | domain | class | length | entropy | alexa_grams | word_grams | diff |
---|---|---|---|---|---|---|---|
22647 | gay-sex-pics-porn-pictures-gay-sex-porn-gay-se... | legit | 56 | 3.661056 | 160.035734 | 85.124184 | 74.911550 |
44091 | article-directory-free-submission-free-content | legit | 46 | 3.786816 | 233.518879 | 188.230453 | 45.288426 |
63865 | stream-free-movies-online | legit | 25 | 3.509275 | 118.944026 | 74.496915 | 44.447110 |
38570 | top-bookmarking-site-list | legit | 25 | 3.723074 | 117.162056 | 74.126061 | 43.035995 |
79963 | best-online-shopping-site | legit | 25 | 3.452879 | 122.152194 | 79.596640 | 42.555554 |
12532 | watch-free-movie-online | legit | 23 | 3.708132 | 101.010995 | 58.943451 | 42.067543 |
30198 | free-online-directory | legit | 21 | 3.403989 | 122.359797 | 80.735030 | 41.624767 |
40859 | free-links-articles-directory | legit | 29 | 3.702472 | 152.063809 | 110.955361 | 41.108448 |
30875 | online-web-directory | legit | 20 | 3.584184 | 114.439863 | 74.082948 | 40.356915 |
79001 | web-directory-online | legit | 20 | 3.584184 | 114.313583 | 74.082948 | 40.230634 |
78947 | movie-news-online | legit | 17 | 3.175123 | 81.036910 | 41.705735 | 39.331174 |
51532 | xxx-porno-sexvideos | legit | 19 | 3.260828 | 73.025165 | 35.176549 | 37.848617 |
42200 | free-tv-video-online | legit | 20 | 3.284184 | 83.341214 | 45.662984 | 37.678230 |
40771 | freegamesforyourwebsite | legit | 23 | 3.551191 | 114.291735 | 78.515881 | 35.775855 |
58275 | free-web-mobile-themes | legit | 22 | 3.356492 | 88.503556 | 54.149725 | 34.353831 |
70724 | seowebdirectoryonline | legit | 21 | 3.499228 | 126.111921 | 91.819498 | 34.292423 |
28283 | download-free-games | legit | 19 | 3.576618 | 84.492962 | 50.661490 | 33.831472 |
18894 | web-link-directory-site | legit | 23 | 3.729446 | 102.993078 | 69.367186 | 33.625893 |
4838 | the-web-directory | legit | 17 | 3.454822 | 87.520339 | 54.697986 | 32.822353 |
65871 | social-bookmarking-site | legit | 23 | 3.762267 | 116.664791 | 84.545021 | 32.119769 |
21743 | free-links-directory | legit | 20 | 3.646439 | 104.050046 | 71.956644 | 32.093402 |
74449 | money-news-online | legit | 17 | 3.101881 | 77.587799 | 45.775375 | 31.812424 |
48456 | free-sexvideosfc2 | legit | 17 | 3.381580 | 63.659477 | 31.878432 | 31.781045 |
57427 | your-new-directory-site | legit | 23 | 3.555533 | 99.130671 | 67.468067 | 31.662605 |
49041 | addsiteurlfreewebdirectory | legit | 26 | 3.609496 | 134.446230 | 103.178748 | 31.267482 |
34821 | own-free-website | legit | 16 | 3.250000 | 59.564153 | 28.839294 | 30.724859 |
10080 | web-directory-plus | legit | 18 | 3.836592 | 89.030979 | 58.484138 | 30.546841 |
43762 | web-directory-sites | legit | 19 | 3.471354 | 98.528255 | 68.088416 | 30.439839 |
34811 | free-sex-for-you | legit | 16 | 3.030639 | 46.653059 | 16.670504 | 29.982555 |
21390 | online-deal-coupon | legit | 18 | 3.308271 | 77.862004 | 47.886115 | 29.975889 |
48204 | acme-people-search-forum | legit | 24 | 3.553509 | 87.829242 | 57.898987 | 29.930255 |
73304 | free-webdirectory | legit | 17 | 3.337175 | 93.606205 | 63.858372 | 29.747833 |
44221 | good-web-directory | legit | 18 | 3.461320 | 88.201881 | 58.629789 | 29.572091 |
50633 | all-free-download | legit | 17 | 3.219528 | 69.337916 | 39.909696 | 29.428220 |
57095 | free-link-directory | legit | 19 | 3.536887 | 95.869062 | 66.507042 | 29.362020 |
58652 | global-web-directory | legit | 20 | 3.721928 | 100.465474 | 71.293587 | 29.171887 |
74259 | online-games-zone | legit | 17 | 3.292770 | 74.987811 | 45.881826 | 29.105985 |
77290 | us-web-directory | legit | 16 | 3.625000 | 80.044863 | 50.969551 | 29.075312 |
72128 | bookmarking-sites-lists | legit | 23 | 3.621176 | 115.664939 | 86.595393 | 29.069546 |
64948 | web-marketing-directory | legit | 23 | 3.849224 | 125.587313 | 96.714227 | 28.873086 |
79557 | freewebdirectory101 | legit | 19 | 3.471354 | 100.131488 | 71.474824 | 28.656664 |
72737 | free-seo-news | legit | 13 | 2.777363 | 45.267539 | 17.089020 | 28.178520 |
53449 | website-traffic-hog | legit | 19 | 3.721612 | 77.199578 | 49.156126 | 28.043452 |
50837 | myonlinewebdirectory | legit | 20 | 3.584184 | 121.155376 | 93.276322 | 27.879054 |
29303 | business-web-directorys | legit | 23 | 3.621176 | 125.854338 | 98.160126 | 27.694212 |
41310 | free-online-submission | legit | 22 | 3.413088 | 113.459411 | 85.792712 | 27.666699 |
76645 | linkdirectoryonline | legit | 19 | 3.326360 | 116.879367 | 89.392747 | 27.486621 |
30430 | online-deal-site | legit | 16 | 3.202820 | 68.103656 | 40.887484 | 27.216172 |
27227 | free-site-submit | legit | 16 | 3.202820 | 64.158023 | 37.127294 | 27.030729 |
62951 | mybusiness-web-directory | legit | 24 | 3.772055 | 124.553982 | 97.538670 | 27.015312 |
# Let's plot some stuff!
# Here we want to see whether our new 'alexa_grams' feature can help us differentiate between Legit/DGA
cond = all_domains['class'] == 'dga'
dga = all_domains[cond]
legit = all_domains[~cond]
plt.scatter(legit['length'], legit['alexa_grams'], s=120, c='#aaaaff', label='Alexa', alpha=.1)
plt.scatter(dga['length'], dga['alexa_grams'], s=40, c='r', label='DGA', alpha=.3)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Alexa NGram Matches')
# Let's plot some stuff!
# Here we want to see whether our new 'alexa_grams' feature can help us differentiate between Legit/DGA
cond = all_domains['class'] == 'dga'
dga = all_domains[cond]
legit = all_domains[~cond]
plt.scatter(legit['entropy'], legit['alexa_grams'], s=120, c='#aaaaff', label='Alexa', alpha=.2)
plt.scatter(dga['entropy'], dga['alexa_grams'], s=40, c='r', label='DGA', alpha=.3)
plt.legend()
pylab.xlabel('Domain Entropy')
pylab.ylabel('Alexa Gram Matches')
# Let's plot some stuff!
# Here we want to see whether our new 'word_grams' feature can help us differentiate between Legit/DGA
# Note: It doesn't look quite as good as the alexa_grams but it might generalize better (less overfitting).
cond = all_domains['class'] == 'dga'
dga = all_domains[cond]
legit = all_domains[~cond]
plt.scatter(legit['length'], legit['word_grams'], s=120, c='#aaaaff', label='Alexa', alpha=.2)
plt.scatter(dga['length'], dga['word_grams'], s=40, c='r', label='DGA', alpha=.3)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Dictionary NGram Matches')
# Let's look at which legit domains are scoring low on the word gram count
all_domains[(all_domains['word_grams']==0)].head()
 | domain | class | length | entropy | alexa_grams | word_grams | diff |
---|---|---|---|---|---|---|---|
3429 | dftc777 | legit | 7 | 2.128085 | 2.707570 | 0 | 2.707570 |
3715 | 5221766 | legit | 7 | 2.235926 | 0.000000 | 0 | 0.000000 |
4144 | 28365365 | legit | 8 | 2.250000 | 4.050612 | 0 | 4.050612 |
4235 | mm-mm-mm | legit | 8 | 0.811278 | 4.260668 | 0 | 4.260668 |
4297 | fzzfgjj | legit | 7 | 1.950212 | 0.954243 | 0 | 0.954243 |
# Okay these look kinda weird, let's use some nice Pandas functionality
# to look at some statistics around our new features.
all_domains[all_domains['class']=='legit'].describe()
 | length | entropy | alexa_grams | word_grams | diff |
---|---|---|---|---|---|
count | 60897.000000 | 60897.000000 | 60897.000000 | 60897.000000 | 60897.000000 |
mean | 10.873032 | 2.930306 | 33.083440 | 40.901852 | -7.818413 |
std | 3.393407 | 0.347134 | 19.233994 | 23.302539 | 9.388916 |
min | 7.000000 | -0.000000 | 0.000000 | 0.000000 | -77.958157 |
25% | 8.000000 | 2.725481 | 19.136340 | 24.056214 | -12.938013 |
50% | 10.000000 | 2.947703 | 28.703813 | 36.259089 | -7.108820 |
75% | 13.000000 | 3.169925 | 42.400101 | 53.036218 | -1.995136 |
max | 56.000000 | 4.070656 | 233.518879 | 233.648571 | 74.911550 |
# Let's look at how many domains are low in both word_grams and alexa_grams (just plotting the max of the two)
legit = all_domains[(all_domains['class']=='legit')]
max_grams = np.maximum(legit['alexa_grams'],legit['word_grams'])
ax = max_grams.hist(bins=80)
ax.figure.suptitle('Histogram of the Max NGram Score for Domains')
pylab.xlabel('Maximum NGram Score')
pylab.ylabel('Number of Domains')
# Let's look at which legit domains score low on both the alexa and word gram counts
weird_cond = (all_domains['class']=='legit') & (all_domains['word_grams']<3) & (all_domains['alexa_grams']<2)
weird = all_domains[weird_cond]
print weird.shape[0]
weird.head(30)
79
 | domain | class | length | entropy | alexa_grams | word_grams | diff |
---|---|---|---|---|---|---|---|
85 | 9to5lol | legit | 7 | 2.235926 | 1.991226 | 2.359835 | -0.368609 |
2611 | akb48mt | legit | 7 | 2.807355 | 1.301030 | 1.041393 | 0.259637 |
3715 | 5221766 | legit | 7 | 2.235926 | 0.000000 | 0.000000 | 0.000000 |
4297 | fzzfgjj | legit | 7 | 1.950212 | 0.954243 | 0.000000 | 0.954243 |
6045 | crx7601 | legit | 7 | 2.807355 | 0.000000 | 0.000000 | 0.000000 |
8531 | mw7zrv2 | legit | 7 | 2.807355 | 0.000000 | 0.000000 | 0.000000 |
10802 | jmm1818 | legit | 7 | 1.950212 | 0.903090 | 0.000000 | 0.903090 |
11961 | qq66699 | legit | 7 | 1.556657 | 1.322219 | 0.000000 | 1.322219 |
13200 | twcczhu | legit | 7 | 2.521641 | 1.724276 | 0.000000 | 1.724276 |
13756 | hljdns4 | legit | 7 | 2.807355 | 1.724276 | 0.000000 | 1.724276 |
14763 | 6470355 | legit | 7 | 2.521641 | 0.000000 | 0.000000 | 0.000000 |
17322 | d20pfsrd | legit | 8 | 2.750000 | 0.000000 | 0.000000 | 0.000000 |
20591 | lgcct27 | legit | 7 | 2.521641 | 1.176091 | 0.845098 | 0.330993 |
23458 | jdoqocy | legit | 7 | 2.521641 | 0.000000 | 2.813581 | -2.813581 |
24661 | 95178114 | legit | 8 | 2.405639 | 1.591065 | 0.000000 | 1.591065 |
24720 | ggmmxxoo | legit | 8 | 2.000000 | 1.113943 | 0.602060 | 0.511883 |
26454 | ggmm777 | legit | 7 | 1.556657 | 1.477121 | 0.602060 | 0.875061 |
27222 | rkg1866 | legit | 7 | 2.521641 | 0.954243 | 0.000000 | 0.954243 |
27676 | 1616bbs | legit | 7 | 1.950212 | 1.806180 | 1.322219 | 0.483961 |
29142 | 5278bbs | legit | 7 | 2.521641 | 1.806180 | 1.322219 | 0.483961 |
29551 | 05tz2e9 | legit | 7 | 2.807355 | 0.000000 | 0.000000 | 0.000000 |
29858 | 1532777 | legit | 7 | 2.128085 | 1.477121 | 0.000000 | 1.477121 |
30119 | 5311314 | legit | 7 | 1.842371 | 1.000000 | 0.000000 | 1.000000 |
30290 | zzgcjyzx | legit | 8 | 2.405639 | 0.000000 | 0.000000 | 0.000000 |
30739 | xn--g5t518j | legit | 11 | 3.095795 | 1.000000 | 0.000000 | 1.000000 |
31465 | 7210578 | legit | 7 | 2.521641 | 0.903090 | 0.000000 | 0.903090 |
31951 | fj96336 | legit | 7 | 2.235926 | 0.000000 | 0.000000 | 0.000000 |
34455 | xn--42cgk1gc8crdb1htg3d | legit | 23 | 3.849224 | 1.255273 | 2.411620 | -1.156347 |
35554 | 720pmkv | legit | 7 | 2.807355 | 0.000000 | 0.000000 | 0.000000 |
36166 | d4ffr55 | legit | 7 | 2.235926 | 1.079181 | 2.260071 | -1.180890 |
# Epiphany... Alexa really may not be the best 'exemplar' set...
#             (probably a no-shit moment for everyone else :)
#
# Discussion: If you're using these as exemplars of NOT DGA, then you're probably
#             making things very hard on your machine learning algorithm.
#             Perhaps we should have two categories of Alexa domains: 'legit'
#             and 'weird', based on some definition of weird.
#             Looking at the entries above... we have approx 80 domains
#             that we're going to mark as 'weird'.
#
all_domains.loc[weird_cond, 'class'] = 'weird'
print all_domains['class'].value_counts()
all_domains[all_domains['class'] == 'weird'].head()
legit    60818
dga       2397
weird       79
dtype: int64
 | domain | class | length | entropy | alexa_grams | word_grams | diff |
---|---|---|---|---|---|---|---|
85 | 9to5lol | weird | 7 | 2.235926 | 1.991226 | 2.359835 | -0.368609 |
2611 | akb48mt | weird | 7 | 2.807355 | 1.301030 | 1.041393 | 0.259637 |
3715 | 5221766 | weird | 7 | 2.235926 | 0.000000 | 0.000000 | 0.000000 |
4297 | fzzfgjj | weird | 7 | 1.950212 | 0.954243 | 0.000000 | 0.954243 |
6045 | crx7601 | weird | 7 | 2.807355 | 0.000000 | 0.000000 | 0.000000 |
# Now we try our machine learning algorithm again with the new features
# (Alexa and Dictionary NGrams) and with the bad exemplars relabeled as 'weird'.
X = all_domains.as_matrix(['length', 'entropy', 'alexa_grams', 'word_grams'])
# Labels (scikit learn uses 'y' for classification labels)
y = np.array(all_domains['class'].tolist())
# Train on a 80/20 split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Now plot the results of the 80/20 split in a confusion matrix
labels = ['legit', 'weird', 'dga']
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
legit/legit: 99.58% (12140/12191)
legit/weird: 0.01% (1/12191)
legit/dga: 0.41% (50/12191)
weird/legit: 0.00% (0/10)
weird/weird: 30.00% (3/10)
weird/dga: 70.00% (7/10)
dga/legit: 14.63% (67/458)
dga/weird: 0.22% (1/458)
dga/dga: 85.15% (390/458)
# Huh, well that seemed to work 'ok', but you don't really want a classifier
# that outputs 3 classes; you'd like a classifier that flags domains as DGA or not.
# This was a path that seemed like a good idea until it wasn't....
# Perhaps we will just exclude the weird class from our ML training
not_weird = all_domains[all_domains['class'] != 'weird']
X = not_weird.as_matrix(['length', 'entropy', 'alexa_grams', 'word_grams'])
# Labels (scikit learn uses 'y' for classification labels)
y = np.array(not_weird['class'].tolist())
# Train on a 80/20 split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Now plot the results of the 80/20 split in a confusion matrix
labels = ['legit', 'dga']
cm = confusion_matrix(y_test, y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
legit/legit: 99.56% (12111/12165)
legit/dga: 0.44% (54/12165)
dga/legit: 17.99% (86/478)
dga/dga: 82.01% (392/478)
# Well it's definitely better.. but haven't we just cheated by removing
# the weird domains? Well perhaps, but on some level we're removing
# outliers that are bad exemplars. So to validate that the model is still
# doing the right thing let's try our new model prediction on our hold out sets.
# First train on the whole thing before looking at prediction performance
clf.fit(X, y)
# Pull together our hold out set
hold_out_domains = pd.concat([hold_out_alexa, hold_out_dga], ignore_index=True)
# Add a length field for the domain
hold_out_domains['length'] = [len(x) for x in hold_out_domains['domain']]
hold_out_domains = hold_out_domains[hold_out_domains['length'] > 6]
# Add an entropy field for the domain
hold_out_domains['entropy'] = [entropy(x) for x in hold_out_domains['domain']]
# Compute NGram matches for all the domains and add to our dataframe
hold_out_domains['alexa_grams']= alexa_counts * alexa_vc.transform(hold_out_domains['domain']).T
hold_out_domains['word_grams']= dict_counts * dict_vc.transform(hold_out_domains['domain']).T
hold_out_domains.head()
 | domain | class | length | entropy | alexa_grams | word_grams |
---|---|---|---|---|---|---|
0 | alcatelonetouch | legit | 15 | 3.106891 | 49.001768 | 79.015001 |
1 | optumhealthfinancial | legit | 20 | 3.584184 | 68.667084 | 87.158661 |
4 | elderscrollsonline | legit | 18 | 3.016876 | 76.441834 | 94.462092 |
5 | mobango | legit | 7 | 2.521641 | 18.020832 | 22.072036 |
6 | costaud | legit | 7 | 2.807355 | 16.037393 | 25.008755 |
# List of feature vectors (scikit learn uses 'X' for the matrix of feature vectors)
hold_X = hold_out_domains.as_matrix(['length', 'entropy', 'alexa_grams', 'word_grams'])
# Labels (scikit learn uses 'y' for classification labels)
hold_y = np.array(hold_out_domains['class'].tolist())
# Now run through the predictive model
hold_y_pred = clf.predict(hold_X)
# Add the prediction array to the dataframe
hold_out_domains['pred'] = hold_y_pred
# Now plot the results
labels = ['legit', 'dga']
cm = confusion_matrix(hold_y, hold_y_pred, labels)
plot_cm(cm, labels)
Confusion Matrix Stats
legit/legit: 99.51% (6713/6746)
legit/dga: 0.49% (33/6746)
dga/legit: 15.73% (42/267)
dga/dga: 84.27% (225/267)
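scikit-learn also has a built-in per-class precision/recall summary that works directly on these labels (a quick sketch using the hold-out predictions above):

from sklearn.metrics import classification_report
print classification_report(hold_y, hold_y_pred)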
# Okay so on our 10% hold out set (~7k domains after the short-domain filter)
# 75 domains were mis-classified. At this point we've made some good progress
# so we're going to claim success :)
#  - Out of ~7k domains 75 were mismarked
#  - false positives (Alexa marked as DGA) = ~0.5%
#  - about 84% of the DGA are getting marked
# Note: Alexa 1M results on the 10% hold out (100k domains) were in the same ballpark
#  - Out of 100k domains 432 were mismarked
#  - false positives (Alexa marked as DGA) = 0.4%
#  - about 70% of the DGA are getting marked
# Now we're going to do some post analysis on how the ML algorithm performed.
# Let's look at a couple of plots to see which domains were misclassified.
# Looking at Length vs. Alexa NGrams
fp_cond = ((hold_out_domains['class'] == 'legit') & (hold_out_domains['pred']=='dga'))
fp = hold_out_domains[fp_cond]
fn_cond = ((hold_out_domains['class'] == 'dga') & (hold_out_domains['pred']=='legit'))
fn = hold_out_domains[fn_cond]
okay = hold_out_domains[hold_out_domains['class'] == hold_out_domains['pred']]
plt.scatter(okay['length'], okay['alexa_grams'], s=100, c='#aaaaff', label='Okay', alpha=.1)
plt.scatter(fp['length'], fp['alexa_grams'], s=40, c='r', label='False Positive', alpha=.5)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Alexa NGram Matches')
# Looking at Length vs. Dictionary NGrams
cond = (hold_out_domains['class'] != hold_out_domains['pred'])
misclassified = hold_out_domains[cond]
okay = hold_out_domains[~cond]
plt.scatter(okay['length'], okay['word_grams'], s=100, c='#aaaaff', label='Okay', alpha=.2)
plt.scatter(misclassified['length'], misclassified['word_grams'], s=40, c='r', label='Misclassified', alpha=.3)
plt.legend()
pylab.xlabel('Domain Length')
pylab.ylabel('Dictionary NGram Matches')
misclassified.head()
 | domain | class | length | entropy | alexa_grams | word_grams | pred |
---|---|---|---|---|---|---|---|
896 | dom2-fan | legit | 8 | 3.000000 | 6.568955 | 5.656685 | dga |
1296 | mm8mm8-6642 | legit | 11 | 2.368523 | 0.000000 | 0.000000 | dga |
1378 | 4390208 | legit | 7 | 2.521641 | 0.000000 | 0.000000 | dga |
1514 | sqrt121 | legit | 7 | 2.521641 | 0.000000 | 0.000000 | dga |
1687 | 02022222222 | legit | 11 | 0.684038 | 0.903090 | 0.000000 | dga |
misclassified[misclassified['class'] == 'dga'].head()
 | domain | class | length | entropy | alexa_grams | word_grams | pred |
---|---|---|---|---|---|---|---|
9184 | usbiezgac | dga | 9 | 3.169925 | 7.825928 | 9.172547 | legit |
9185 | ushcnewo | dga | 8 | 3.000000 | 12.265642 | 13.904812 | legit |
9187 | usnspdph | dga | 8 | 2.500000 | 5.182278 | 6.556287 | legit |
9190 | utamehz | dga | 7 | 2.807355 | 10.741352 | 14.733893 | legit |
9192 | utfowept | dga | 8 | 2.750000 | 7.095911 | 17.416355 | legit |
# We can also look at what features the learning algorithm thought were the most important
importances = zip(['length', 'entropy', 'alexa_grams', 'word_grams'], clf.feature_importances_)
importances
# From the list below we see our feature importance scores. There's a lot of feature selection,
# sensitivity analysis, etc. that you could do at this point if you wanted.
[('length', 0.13110737655160343), ('entropy', 0.15589784074688856), ('alexa_grams', 0.48657282029928439), ('word_grams', 0.22642196240222362)]
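If a picture helps, here's a quick bar chart of those same importance scores (a sketch using the pylab conveniences already loaded; feat_names is just a local helper):

feat_names = ['length', 'entropy', 'alexa_grams', 'word_grams']
pylab.bar(range(len(feat_names)), clf.feature_importances_, align='center')
pylab.xticks(range(len(feat_names)), feat_names)
pylab.ylabel('Importance')
pylab.title('Random Forest Feature Importances')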
# Discussion for how to use the resulting models.
# Typically Machine Learning comes in two phases
# - Training of the Model
# - Evaluation of new observations against the Model
# This notebook is about exploration of the data and training the model.
# After you have a model that you are satisfied with, just 'pickle' it
# at the end of your training script and then in a separate
# evaluation script 'unpickle' it and evaluate/score new observations
# coming in (through a file, or ZeroMQ, or whatever...)
#
# In this case we'd have to pickle the RandomForest classifier
# and the two vectorizing transforms (alexa_grams and word_grams).
# See 'test_it' below for how to use them in evaluation mode.
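Here's a minimal sketch of that round trip (the file name 'dga_model.pkl' and the dict bundling are just one reasonable way to do it):

import pickle
# Training side: bundle the classifier with the transforms/weights it depends on
model = {'clf': clf, 'alexa_vc': alexa_vc, 'alexa_counts': alexa_counts,
         'dict_vc': dict_vc, 'dict_counts': dict_counts}
pickle.dump(model, open('dga_model.pkl', 'wb'))

# Evaluation side: load it back and score new domains with test_it-style logic
model = pickle.load(open('dga_model.pkl', 'rb'))
clf, alexa_vc, alexa_counts = model['clf'], model['alexa_vc'], model['alexa_counts']
dict_vc, dict_counts = model['dict_vc'], model['dict_counts']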
# test_it shows how to do evaluation, also fun for manual testing below :)
def test_it(domain):
_alexa_match = alexa_counts * alexa_vc.transform([domain]).T # Woot matrix multiply and transpose Woo Hoo!
_dict_match = dict_counts * dict_vc.transform([domain]).T
_X = [len(domain), entropy(domain), _alexa_match, _dict_match]
print '%s : %s' % (domain, clf.predict(_X)[0])
# Examples (feel free to change these and see the results!)
test_it('google')
test_it('google88')
test_it('facebook')
test_it('1cb8a5f36f')
test_it('pterodactylfarts')
test_it('ptes9dro-dwacty2lfa5rrts')
test_it('beyonce')
test_it('bey666on4ce')
test_it('supersexy')
test_it('yourmomissohotinthesummertime')
test_it('35-sdf-09jq43r')
test_it('clicksecurity')
google : legit
google88 : legit
facebook : legit
1cb8a5f36f : dga
pterodactylfarts : legit
ptes9dro-dwacty2lfa5rrts : dga
beyonce : legit
bey666on4ce : dga
supersexy : legit
yourmomissohotinthesummertime : legit
35-sdf-09jq43r : dga
clicksecurity : legit
The combination of IPython, Pandas and scikit-learn let us pull in some junky data, clean it up, plot it, understand it and slap it with some machine learning!

Clearly a lot more formality could be applied: plotting learning curves, adjusting for overfitting, feature selection, and so on. There are some really great machine learning resources that cover this deeper material. In particular we highly recommend the work and presentations of Olivier Grisel at INRIA Saclay. http://ogrisel.com/
Some papers on detecting DGA domains: