This notebook is equivalent to demo-word.sh, demo-analogy.sh, demo-phrases.sh and demo-classes.sh from Google.
%load_ext autoreload
%autoreload 2
Download some data, for example: http://mattmahoney.net/dc/text8.zip
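For instance, a minimal sketch to download and extract it from Python (the destination paths are an assumption, adjust to your setup):
import urllib.request
import zipfile

# Download and extract text8 (destination paths are hypothetical)
url = 'http://mattmahoney.net/dc/text8.zip'
urllib.request.urlretrieve(url, '/Users/drodriguez/Downloads/text8.zip')
with zipfile.ZipFile('/Users/drodriguez/Downloads/text8.zip') as zf:
    zf.extractall('/Users/drodriguez/Downloads/')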
import word2vec
Run word2phrase to group related words into phrases, e.g. "Los Angeles" becomes "Los_Angeles".
word2vec.word2phrase('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-phrases', verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8 Words processed: 17000K Vocab size: 4399K Vocab size (unigrams + bigrams): 2419827 Words in train file: 17005206
This created a text8-phrases file that we can use as a better input for word2vec.
Note that you could easily skip this previous step and use the original text data as input for word2vec directly.
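For instance, training on the raw text8 file would look like this (the output filename here is hypothetical):
# Train directly on the raw corpus, skipping word2phrase
word2vec.word2vec('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-plain.bin', size=100, verbose=True)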
Now actually train the word2vec model.
word2vec.word2vec('/Users/drodriguez/Downloads/text8-phrases', '/Users/drodriguez/Downloads/text8.bin', size=100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8-phrases Vocab size: 98331 Words in train file: 15857306 Alpha: 0.000002 Progress: 100.03% Words/thread/sec: 323.95k
That created a text8.bin file containing the word vectors in a binary format.
Now we generate the clusters of the vectors based on the trained model.
word2vec.word2clusters('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-clusters.txt', 100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8 Vocab size: 71291 Words in train file: 16718843 Alpha: 0.000002 Progress: 100.04% Words/thread/sec: 317.72k
That created a text8-clusters.txt file with the cluster number for every word in the vocabulary.
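The file is plain text with one word and its cluster number per line; a quick way to peek at it (a sketch, assuming the path above):
from itertools import islice

# Each line is "<word> <cluster-number>"
with open('/Users/drodriguez/Downloads/text8-clusters.txt') as f:
    for line in islice(f, 5):
        print(line.strip())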
Load the word2vec binary file created above:
model = word2vec.load('/Users/drodriguez/Downloads/text8.bin')
We can take a look at the vocabulary as a numpy array:
model.vocab
array(['</s>', 'the', 'of', ..., 'dakotas', 'nias', 'burlesques'], dtype='<U78')
Or take a look at the whole matrix:
model.vectors.shape
(98331, 100)
model.vectors
array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,
         0.10955409,  0.00693387],
       [ 0.07306823,  0.1179086 ,  0.10995189, ...,  0.09345266,
        -0.1312812 , -0.00915683],
       [ 0.26229969,  0.02270839,  0.05854911, ...,  0.03924898,
        -0.03867628,  0.21437503],
       ...,
       [-0.1427108 ,  0.10650002,  0.07283197, ...,  0.14563465,
        -0.06967127,  0.037186  ],
       [ 0.06538665, -0.04184594,  0.13385373, ...,  0.08183857,
        -0.07006828, -0.09386028],
       [-0.00991228, -0.12096601,  0.10771658, ...,  0.01684521,
        -0.143217  , -0.10602982]])
We can retrieve the vector of individual words:
model['dog'].shape
(100,)
model['dog'][:10]
array([ 0.06666815, 0.12450022, 0.02513653, 0.12673911, 0.13396765, -0.00938436, 0.06476378, 0.15387769, 0.05472341, -0.08388881])
We can calculate the distance between two or more words (all pairwise combinations):
model.distance("dog", "cat", "fish")
[('dog', 'cat', 0.8693732680572173), ('dog', 'fish', 0.5900484800297155), ('cat', 'fish', 0.6269017149314428)]
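The metric is the cosine similarity of the two word vectors, which we can verify by hand with numpy:
import numpy as np

# Cosine similarity computed manually; should match model.distance("dog", "cat")
dog, cat = model['dog'], model['cat']
print(dog.dot(cat) / (np.linalg.norm(dog) * np.linalg.norm(cat)))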
We can do simple queries to retrieve words similar to "dog" based on cosine similarity:
indexes, metrics = model.similar("dog")
indexes, metrics
(array([ 2437,  5478,  7593, 10230,  3964,  9963,  2428, 10309,  4812,
         2391]),
 array([0.86937327, 0.83396105, 0.77854628, 0.7692265 , 0.76743628,
        0.7612772 , 0.7600788 , 0.75935677, 0.75693881, 0.75438956]))
This returned a tuple with 2 items: a numpy array with the indexes of the similar words in the vocabulary, and a numpy array with the cosine similarity of each one.
We can get the words for those indexes:
model.vocab[indexes]
array(['cat', 'cow', 'goat', 'pig', 'dogs', 'rabbit', 'bear', 'rat', 'wolf', 'girl'], dtype='<U78')
There is a helper function to create a combined response as a numpy record array:
model.generate_response(indexes, metrics)
rec.array([('cat', 0.86937327), ('cow', 0.83396105), ('goat', 0.77854628), ('pig', 0.7692265 ), ('dogs', 0.76743628), ('rabbit', 0.7612772 ), ('bear', 0.7600788 ), ('rat', 0.75935677), ('wolf', 0.75693881), ('girl', 0.75438956)], dtype=[('word', '<U78'), ('metric', '<f8')])
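The fields of the record array can be accessed by name, like any numpy record array:
response = model.generate_response(indexes, metrics)
print(response['word'])    # the similar words
print(response['metric'])  # their cosine similarities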
It is easy to turn that numpy array into a pure Python response:
model.generate_response(indexes, metrics).tolist()
[('cat', 0.8693732680572173), ('cow', 0.8339610529888226), ('goat', 0.7785462766666428), ('pig', 0.7692265048531302), ('dogs', 0.7674362783482181), ('rabbit', 0.7612771996422674), ('bear', 0.7600788045286304), ('rat', 0.7593567655129181), ('wolf', 0.7569388070301634), ('girl', 0.754389556345068)]
Since we trained the model with the output of word2phrase, we can ask for the similarity of "phrases", that is, combined words such as "Los Angeles":
indexes, metrics = model.similar('los_angeles')
model.generate_response(indexes, metrics).tolist()
[('san_francisco', 0.8876351265573288), ('san_diego', 0.8652920422732189), ('seattle', 0.8387625165949533), ('las_vegas', 0.8325965377422355), ('california', 0.8252775393303263), ('miami', 0.8167069457881345), ('detroit', 0.8164911899252103), ('chicago', 0.813283620659967), ('cincinnati', 0.8116379669114295), ('cleveland', 0.810708205429068)]
It's possible to do more complex queries like analogies, such as: king - man + woman = queen.
This method returns the same as similar: the indexes of the words in the vocabulary and the metric for each.
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'])
indexes, metrics
(array([1087, 6768, 1145, 7523, 1335, 8419, 3141, 1827,  344, 4980]),
 array([0.28823424, 0.26614362, 0.26265608, 0.26111525, 0.26091172,
        0.25844542, 0.25781944, 0.25678284, 0.25424551, 0.2529607 ]))
model.generate_response(indexes, metrics).tolist()
[('queen', 0.28823424120681784), ('regent', 0.26614361576778933), ('prince', 0.2626560787162791), ('empress', 0.2611152451318436), ('wife', 0.26091172315990346), ('aragon', 0.25844541581050506), ('monarch', 0.25781944140528035), ('throne', 0.256782835877586), ('son', 0.25424550637754495), ('heir', 0.25296070456687614)]
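Under the hood this is plain vector arithmetic followed by a nearest-neighbor search over the vocabulary. A minimal sketch with numpy, assuming the loaded vectors are unit-normalized (note that unlike model.analogy, this does not exclude the query words from the results):
import numpy as np

# king - man + woman, then rank the whole vocabulary by cosine similarity
query = model['king'] - model['man'] + model['woman']
query /= np.linalg.norm(query)
# with unit-normalized rows, the dot product is the cosine similarity
scores = model.vectors.dot(query)
top = np.argsort(scores)[::-1][:10]
print(model.vocab[top])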
Now we load the clusters file created earlier:
clusters = word2vec.load_clusters('/Users/drodriguez/Downloads/text8-clusters.txt')
We can take a look at the vocabulary of the clusters as a numpy array:
clusters.vocab
array(['</s>', 'the', 'of', ..., 'bredon', 'skirting', 'santamaria'], dtype='<U29')
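To look up the cluster number for an individual word we can index the clusters object by the word (a sketch, assuming the dict-style lookup used in the package's examples):
# Assumed indexing API: returns the cluster number for "dog"
clusters['dog']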
We can get all the words grouped in a specific cluster:
clusters.get_words_on_cluster(90).shape
(206,)
clusters.get_words_on_cluster(90)[:10]
array(['along', 'associated', 'relations', 'relationship', 'deal', 'combined', 'contact', 'connection', 'respect', 'mixed'], dtype='<U29')
We can add the clusters to the word2vec model and generate a response that includes the cluster number for each word:
model.clusters = clusters
indexes, metrics = model.analogy(pos=["paris", "germany"], neg=["france"])
model.generate_response(indexes, metrics).tolist()
[('berlin', 0.3187078682472152, 15), ('vienna', 0.28562803640143397, 12), ('munich', 0.28527806428082675, 21), ('moscow', 0.27085681100243797, 74), ('leipzig', 0.2697639527846636, 8), ('st_petersburg', 0.25841328545046965, 61), ('prague', 0.2571333430942206, 72), ('bonn', 0.2546126113385251, 8), ('dresden', 0.2471285069069249, 71), ('warsaw', 0.2450778083401204, 74)]