%load_ext autoreload
%autoreload 2
This notebook is equivalent to demo-word.sh, demo-analogy.sh, demo-phrases.sh and demo-classes.sh from Google.
Download some data, for example: http://mattmahoney.net/dc/text8.zip
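If you want to fetch and unpack it from the notebook itself, a minimal sketch using only the standard library (Python 3; the download path is just an example):
import urllib.request
import zipfile

# download the text8 corpus and extract it next to the zip file
urllib.request.urlretrieve('http://mattmahoney.net/dc/text8.zip', '/Users/drodriguez/Downloads/text8.zip')
with zipfile.ZipFile('/Users/drodriguez/Downloads/text8.zip') as zf:
    zf.extractall('/Users/drodriguez/Downloads/')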
import word2vec
Run word2phrase to group words that frequently appear together into phrases, turning "Los Angeles" into "Los_Angeles".
word2vec.word2phrase('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-phrases', verbose=True)
[u'word2phrase', u'-train', u'/Users/drodriguez/Downloads/text8', u'-output', u'/Users/drodriguez/Downloads/text8-phrases', u'-min-count', u'5', u'-threshold', u'100', u'-debug', u'2']
Starting training using file /Users/drodriguez/Downloads/text8
Words processed: 17000K  Vocab size: 4399K
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206
This will create a text8-phrases file that we can use as a better input for word2vec. Note that you could easily skip this previous step and use the original data as input for word2vec.
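As a quick sanity check that the phrase pass worked, you could count a known bigram token in the output file; a small sketch (the file is around 100 MB, so this reads it all into memory):
# the underscore-joined token should now appear in the phrases file
with open('/Users/drodriguez/Downloads/text8-phrases') as f:
    print(f.read().count('los_angeles'))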
Train the model using the word2phrase output.
word2vec.word2vec('/Users/drodriguez/Downloads/text8-phrases', '/Users/drodriguez/Downloads/text8.bin', size=100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8-phrases
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 286.52k
That generated a text8.bin file containing the word vectors in a binary format.
Cluster the vectors based on the trained model.
word2vec.word2clusters('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-clusters.txt', 100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.02%  Words/thread/sec: 287.55k
That created a text8-clusters.txt file with the cluster number for every word in the vocabulary.
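The clusters file should be plain text with one word and its cluster number per line; a quick peek at the first few lines:
with open('/Users/drodriguez/Downloads/text8-clusters.txt') as f:
    for _ in range(5):
        print(f.readline().strip())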
import word2vec
Load the word2vec binary file created above.
model = word2vec.load('/Users/drodriguez/Downloads/text8.bin')
We can take a look at the vocabulary as a numpy array
model.vocab
array([u'</s>', u'the', u'of', ..., u'dakotas', u'nias', u'burlesques'], dtype='<U78')
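Since the vocabulary is a plain numpy array, looking up the index of a given word is a standard numpy operation; a small sketch (assumes the word is in the vocabulary):
import numpy as np

# position of 'dog' in the vocabulary; model.vectors uses the same row ordering
idx = np.where(model.vocab == 'dog')[0][0]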
Or take a look at the whole matrix
model.vectors.shape
(98331, 100)
model.vectors
array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,  0.10955409,  0.00693387],
       [ 0.1220774 ,  0.04939618,  0.09545057, ..., -0.00804222, -0.05441621, -0.10076696],
       [ 0.16844609,  0.03734054,  0.22085373, ...,  0.05854521,  0.04685341,  0.02546694],
       ...,
       [-0.06760896,  0.03737842,  0.09344187, ...,  0.14559349, -0.11704484, -0.05246212],
       [ 0.02228479, -0.07340827,  0.15247506, ...,  0.01872172, -0.18154132, -0.06813737],
       [ 0.02778879, -0.06457976,  0.07102411, ..., -0.00270281, -0.0471223 , -0.135444  ]])
We can retrieve the vector of individual words
model['dog'].shape
(100,)
model['dog'][:10]
array([ 0.05753701, 0.0585594 , 0.11341395, 0.02016246, 0.11514406, 0.01246986, 0.00801256, 0.17529851, 0.02899276, 0.0203866 ])
We can do simple queries to retrieve words similar to "socks" based on cosine similarity:
indexes, metrics = model.cosine('socks')
indexes, metrics
(array([20002, 28915, 30711, 33874, 27482, 14631, 22992, 24195, 25857, 23705]), array([ 0.8375354 , 0.83590846, 0.82818749, 0.82533614, 0.82278399, 0.81476386, 0.8139092 , 0.81253798, 0.8105933 , 0.80850171]))
This returned a tuple with 2 items: a numpy array with the indexes of the similar words in the vocabulary, and a numpy array with the cosine similarity of each one.
It's possible to get the words for those indexes
model.vocab[indexes]
array([u'hairy', u'pumpkin', u'gravy', u'nosed', u'plum', u'winged', u'bock', u'petals', u'biscuits', u'striped'], dtype='<U78')
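The metric here is plain cosine similarity, so the first value above can be reproduced by hand with numpy:
import numpy as np

v1, v2 = model['socks'], model['hairy']
# cosine similarity computed directly; should match metrics[0] above
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))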
There is a helper function to create a combined response: a numpy record array
model.generate_response(indexes, metrics)
rec.array([(u'hairy', 0.8375353970603848), (u'pumpkin', 0.8359084628493809),
           (u'gravy', 0.8281874915608026), (u'nosed', 0.8253361379785071),
           (u'plum', 0.8227839904046932), (u'winged', 0.8147638561412592),
           (u'bock', 0.8139092031538545), (u'petals', 0.8125379796045767),
           (u'biscuits', 0.8105933044655644), (u'striped', 0.8085017054444408)],
          dtype=[(u'word', '<U78'), (u'metric', '<f8')])
It's easy to make that numpy array a pure Python response:
model.generate_response(indexes, metrics).tolist()
[(u'hairy', 0.8375353970603848), (u'pumpkin', 0.8359084628493809), (u'gravy', 0.8281874915608026), (u'nosed', 0.8253361379785071), (u'plum', 0.8227839904046932), (u'winged', 0.8147638561412592), (u'bock', 0.8139092031538545), (u'petals', 0.8125379796045767), (u'biscuits', 0.8105933044655644), (u'striped', 0.8085017054444408)]
Since we trained the model with the output of word2phrase, we can ask for the similarity of "phrases"
indexes, metrics = model.cosine('los_angeles')
model.generate_response(indexes, metrics).tolist()
[(u'san_francisco', 0.886558000570455), (u'san_diego', 0.8731961018831669), (u'seattle', 0.8455603712285231), (u'las_vegas', 0.8407843553947962), (u'miami', 0.8341796009062884), (u'detroit', 0.8235412519780195), (u'cincinnati', 0.8199138493085706), (u'st_louis', 0.8160655356728751), (u'chicago', 0.8156786240847214), (u'california', 0.8154244925085712)]
It's possible to do more complex queries like analogies, such as: king - man + woman = queen
This method returns the same as cosine: the indexes of the words in the vocabulary and the metric
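Under the hood this is just vector arithmetic plus a nearest-neighbor search; a hand-rolled sketch of the same king - man + woman query (unlike the library call below, this does not exclude the input words, so expect 'king' itself near the top):
import numpy as np

# combined query vector: king - man + woman
target = model['king'] - model['man'] + model['woman']
# cosine similarity of every vocabulary word against the query vector
sims = np.dot(model.vectors, target) / (np.linalg.norm(model.vectors, axis=1) * np.linalg.norm(target))
print(model.vocab[np.argsort(sims)[::-1][:10]])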
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'], n=10)
indexes, metrics
(array([1087, 1145, 7523, 3141, 6768, 1335, 8419, 1826, 648, 1426]), array([ 0.2917969 , 0.27353295, 0.26877692, 0.26596514, 0.26487509, 0.26428581, 0.26315492, 0.26261258, 0.26136635, 0.26099078]))
model.generate_response(indexes, metrics).tolist()
[(u'queen', 0.2917968955611075), (u'prince', 0.27353295205311695), (u'empress', 0.2687769174818083), (u'monarch', 0.2659651399832089), (u'regent', 0.26487508713026797), (u'wife', 0.2642858109968327), (u'aragon', 0.2631549214361766), (u'throne', 0.26261257728511833), (u'emperor', 0.2613663460665488), (u'bishop', 0.26099078142148696)]
clusters = word2vec.load_clusters('/Users/drodriguez/Downloads/text8-clusters.txt')
We can get the cluster number for individual words
clusters['dog']
11
We can get all the words grouped in a specific cluster
clusters.get_words_on_cluster(90).shape
(221,)
clusters.get_words_on_cluster(90)[:10]
array(['along', 'together', 'associated', 'relationship', 'deal', 'combined', 'contact', 'connection', 'bond', 'respect'], dtype=object)
We can add the clusters to the word2vec model and generate a response that includes the clusters
model.clusters = clusters
indexes, metrics = model.analogy(pos=['paris', 'germany'], neg=['france'], n=10)
model.generate_response(indexes, metrics).tolist()
[(u'berlin', 0.32333651414395953, 20), (u'munich', 0.28851564633559, 20), (u'vienna', 0.2768927258877336, 12), (u'leipzig', 0.2690537010929304, 91), (u'moscow', 0.26531859560322785, 74), (u'st_petersburg', 0.259534503067277, 61), (u'prague', 0.25000637367753303, 72), (u'dresden', 0.2495974800117785, 71), (u'bonn', 0.24403155303236473, 8), (u'frankfurt', 0.24199720792200027, 31)]
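For nicer display, that combined response converts naturally into a pandas DataFrame; a convenience sketch (assumes pandas is installed; this is not part of the word2vec API):
import pandas as pd

# one row per result: word, cosine metric, and cluster number
pd.DataFrame(model.generate_response(indexes, metrics).tolist(), columns=['word', 'metric', 'cluster'])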