This notebook is equivalent to demo-word.sh, demo-analogy.sh, demo-phrases.sh and demo-classes.sh from Google.
%load_ext autoreload
%autoreload 2
Download some data, for example: http://mattmahoney.net/dc/text8.zip
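For instance, a minimal sketch to download and extract it from Python (the destination paths are an assumption, adjust to your setup):
import urllib.request
import zipfile

# Download and extract text8 (destination paths are hypothetical)
url = 'http://mattmahoney.net/dc/text8.zip'
urllib.request.urlretrieve(url, '/Users/drodriguez/Downloads/text8.zip')
with zipfile.ZipFile('/Users/drodriguez/Downloads/text8.zip') as zf:
    zf.extractall('/Users/drodriguez/Downloads/')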
import word2vec
Run word2phrase to group related words into phrases, e.g. "Los Angeles" becomes "Los_Angeles".
word2vec.word2phrase('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-phrases', verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8 Words processed: 17000K Vocab size: 4399K Vocab size (unigrams + bigrams): 2419827 Words in train file: 17005206
This created a text8-phrases file that we can use as a better input for word2vec.
Note that you could easily skip this previous step and use the original text data as input for word2vec directly.
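For instance, training on the raw text8 file would look like this (the output filename here is hypothetical):
# Train directly on the raw corpus, skipping word2phrase
word2vec.word2vec('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-plain.bin', size=100, verbose=True)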
Now actually train the word2vec model.
word2vec.word2vec('/Users/drodriguez/Downloads/text8-phrases', '/Users/drodriguez/Downloads/text8.bin', size=100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8-phrases Vocab size: 98331 Words in train file: 15857306 Alpha: 0.000002 Progress: 100.03% Words/thread/sec: 323.95k
That created a text8.bin file containing the word vectors in a binary format.
Now we generate the clusters of the vectors based on the trained model.
word2vec.word2clusters('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-clusters.txt', 100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8 Vocab size: 71291 Words in train file: 16718843 Alpha: 0.000002 Progress: 100.04% Words/thread/sec: 317.72k
That created a text8-clusters.txt file with the cluster number for every word in the vocabulary.
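The file is plain text with one word and its cluster number per line; a quick way to peek at it (a sketch, assuming the path above):
from itertools import islice

# Each line is "<word> <cluster-number>"
with open('/Users/drodriguez/Downloads/text8-clusters.txt') as f:
    for line in islice(f, 5):
        print(line.strip())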
Load the word2vec binary file created above:
model = word2vec.load('/Users/drodriguez/Downloads/text8.bin')
We can take a look at the vocabulary as a numpy array:
model.vocab
array(['</s>', 'the', 'of', ..., 'dakotas', 'nias', 'burlesques'], dtype='<U78')
Or take a look at the whole matrix:
model.vectors.shape
(98331, 100)
model.vectors
array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,
         0.10955409,  0.00693387],
       [ 0.07306823,  0.1179086 ,  0.10995189, ...,  0.09345266,
        -0.1312812 , -0.00915683],
       [ 0.26229969,  0.02270839,  0.05854911, ...,  0.03924898,
        -0.03867628,  0.21437503],
       ...,
       [-0.1427108 ,  0.10650002,  0.07283197, ...,  0.14563465,
        -0.06967127,  0.037186  ],
       [ 0.06538665, -0.04184594,  0.13385373, ...,  0.08183857,
        -0.07006828, -0.09386028],
       [-0.00991228, -0.12096601,  0.10771658, ...,  0.01684521,
        -0.143217  , -0.10602982]])
We can retrieve the vector of individual words:
model['dog'].shape
(100,)
model['dog'][:10]
array([ 0.06666815, 0.12450022, 0.02513653, 0.12673911, 0.13396765, -0.00938436, 0.06476378, 0.15387769, 0.05472341, -0.08388881])
We can calculate the distance between two or more words (all pairwise combinations):
model.distance("dog", "cat", "fish")
[('dog', 'cat', 0.8693732680572173), ('dog', 'fish', 0.5900484800297155), ('cat', 'fish', 0.6269017149314428)]
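The metric is the cosine similarity of the two word vectors, which we can verify by hand with numpy:
import numpy as np

# Cosine similarity computed manually; should match model.distance("dog", "cat")
dog, cat = model['dog'], model['cat']
print(dog.dot(cat) / (np.linalg.norm(dog) * np.linalg.norm(cat)))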
We can do simple queries to retrieve words similar to "dog" based on cosine similarity:
indexes, metrics = model.similar("dog")
indexes, metrics
(array([ 2437,  5478,  7593, 10230,  3964,  9963,  2428, 10309,  4812,
         2391]),
 array([0.86937327, 0.83396105, 0.77854628, 0.7692265 , 0.76743628,
        0.7612772 , 0.7600788 , 0.75935677, 0.75693881, 0.75438956]))
This returned a tuple with 2 items: a numpy array with the indexes of the similar words in the vocabulary, and a numpy array with the cosine similarity of each one.
We can get the words for those indexes:
model.vocab[indexes]
array(['cat', 'cow', 'goat', 'pig', 'dogs', 'rabbit', 'bear', 'rat', 'wolf', 'girl'], dtype='<U78')
There is a helper function to create a combined response as a numpy record array:
model.generate_response(indexes, metrics)
rec.array([('cat', 0.86937327), ('cow', 0.83396105), ('goat', 0.77854628), ('pig', 0.7692265 ), ('dogs', 0.76743628), ('rabbit', 0.7612772 ), ('bear', 0.7600788 ), ('rat', 0.75935677), ('wolf', 0.75693881), ('girl', 0.75438956)], dtype=[('word', '<U78'), ('metric', '<f8')])
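The fields of the record array can be accessed by name, like any numpy record array:
response = model.generate_response(indexes, metrics)
print(response['word'])    # the similar words
print(response['metric'])  # their cosine similarities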
It is easy to turn that numpy array into a pure Python response:
model.generate_response(indexes, metrics).tolist()
[('cat', 0.8693732680572173), ('cow', 0.8339610529888226), ('goat', 0.7785462766666428), ('pig', 0.7692265048531302), ('dogs', 0.7674362783482181), ('rabbit', 0.7612771996422674), ('bear', 0.7600788045286304), ('rat', 0.7593567655129181), ('wolf', 0.7569388070301634), ('girl', 0.754389556345068)]
Since we trained the model with the output of word2phrase, we can ask for the similarity of "phrases", that is, combined words such as "Los Angeles":
indexes, metrics = model.similar('los_angeles')
model.generate_response(indexes, metrics).tolist()
[('san_francisco', 0.8876351265573288), ('san_diego', 0.8652920422732189), ('seattle', 0.8387625165949533), ('las_vegas', 0.8325965377422355), ('california', 0.8252775393303263), ('miami', 0.8167069457881345), ('detroit', 0.8164911899252103), ('chicago', 0.813283620659967), ('cincinnati', 0.8116379669114295), ('cleveland', 0.810708205429068)]
It's possible to do more complex queries like analogies, such as: king - man + woman = queen.
This method returns the same as similar: the indexes of the words in the vocabulary and the metric for each.
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'])
indexes, metrics
(array([1087, 6768, 1145, 7523, 1335, 8419, 3141, 1827,  344, 4980]),
 array([0.28823424, 0.26614362, 0.26265608, 0.26111525, 0.26091172,
        0.25844542, 0.25781944, 0.25678284, 0.25424551, 0.2529607 ]))
model.generate_response(indexes, metrics).tolist()
[('queen', 0.28823424120681784), ('regent', 0.26614361576778933), ('prince', 0.2626560787162791), ('empress', 0.2611152451318436), ('wife', 0.26091172315990346), ('aragon', 0.25844541581050506), ('monarch', 0.25781944140528035), ('throne', 0.256782835877586), ('son', 0.25424550637754495), ('heir', 0.25296070456687614)]
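Under the hood this is plain vector arithmetic followed by a nearest-neighbor search over the vocabulary. A minimal sketch with numpy, assuming the loaded vectors are unit-normalized (note that unlike model.analogy, this does not exclude the query words from the results):
import numpy as np

# king - man + woman, then rank the whole vocabulary by cosine similarity
query = model['king'] - model['man'] + model['woman']
query /= np.linalg.norm(query)
# with unit-normalized rows, the dot product is the cosine similarity
scores = model.vectors.dot(query)
top = np.argsort(scores)[::-1][:10]
print(model.vocab[top])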
Now we load the clusters file created earlier:
clusters = word2vec.load_clusters('/Users/drodriguez/Downloads/text8-clusters.txt')
We can take a look at the vocabulary of the clusters as a numpy array:
clusters.vocab
array(['</s>', 'the', 'of', ..., 'bredon', 'skirting', 'santamaria'], dtype='<U29')
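To look up the cluster number for an individual word we can index the clusters object by the word (a sketch, assuming the dict-style lookup used in the package's examples):
# Assumed indexing API: returns the cluster number for "dog"
clusters['dog']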
We can get all the words grouped in a specific cluster:
clusters.get_words_on_cluster(90).shape
(206,)
clusters.get_words_on_cluster(90)[:10]
array(['along', 'associated', 'relations', 'relationship', 'deal', 'combined', 'contact', 'connection', 'respect', 'mixed'], dtype='<U29')
We can add the clusters to the word2vec model and generate a response that includes the cluster number for each word:
model.clusters = clusters
indexes, metrics = model.analogy(pos=["paris", "germany"], neg=["france"])
model.generate_response(indexes, metrics).tolist()
[('berlin', 0.3187078682472152, 15), ('vienna', 0.28562803640143397, 12), ('munich', 0.28527806428082675, 21), ('moscow', 0.27085681100243797, 74), ('leipzig', 0.2697639527846636, 8), ('st_petersburg', 0.25841328545046965, 61), ('prague', 0.2571333430942206, 72), ('bonn', 0.2546126113385251, 8), ('dresden', 0.2471285069069249, 71), ('warsaw', 0.2450778083401204, 74)]