This is experimental code developed by Tomas Mikolov, found in the word2vec Google group: https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ
This is not yet available on PyPI; you need the latest master branch from git.
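Assuming the danielfrg/word2vec wrapper used throughout this post (the repository URL is an assumption on my part, check the project page), installing from the master branch would look something like:

pip install git+https://github.com/danielfrg/word2vec.git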
The input format for doc2vec is still one big text file, but every line should be one document prepended with a unique id, for example:
_*0 This is sentence 1
_*1 This is sentence 2
The preprocessing below uses nltk for tokenization, so install it first. Then extract the IMDB dataset:
tar -xvf aclImdb_v1.tar.gz
Merge the data into one big file with one document (and its id) per line, and do some basic preprocessing with the nltk word tokenizer.
from __future__ import unicode_literals
import os
import nltk

directories = ['train/pos', 'train/neg', 'test/pos', 'test/neg', 'train/unsup']
input_file = open('/Users/drodriguez/Downloads/alldata.txt', 'w')

id_ = 0
for directory in directories:
    rootdir = os.path.join('/Users/drodriguez/Downloads/aclImdb', directory)
    for subdir, dirs, files in os.walk(rootdir):
        for file_ in files:
            with open(os.path.join(subdir, file_), 'r') as f:
                # Prepend the unique doc2vec id to each document
                doc_id = '_*%i' % id_
                id_ = id_ + 1

                text = f.read()
                text = text.decode('utf-8')
                # Tokenize, lowercase, and strip non-ASCII characters
                tokens = nltk.word_tokenize(text)
                doc = ' '.join(tokens).lower()
                doc = doc.encode('ascii', 'ignore')
                input_file.write('%s %s\n' % (doc_id, doc))
input_file.close()
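A quick sanity check, not part of the original write-up, is to look at the first line of the generated file and confirm it starts with the '_*0' id:

# Print the beginning of the first merged line; it should start with '_*0 '
with open('/Users/drodriguez/Downloads/alldata.txt') as f:
    print(f.readline()[:80])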
%load_ext autoreload
%autoreload 2
import word2vec
word2vec.doc2vec('/Users/drodriguez/Downloads/alldata.txt',
                 '/Users/drodriguez/Downloads/vectors.bin',
                 cbow=0, size=100, window=10, negative=5, hs=0,
                 sample='1e-4', threads=12, iter_=20, min_count=1,
                 verbose=True)
Starting training using file /Users/drodriguez/Downloads/alldata.txt
Vocab size: 355046
Words in train file: 28300990
Alpha: 0.000002  Progress: 100.01%  Words/thread/sec: 92.57k
It is possible to load the vectors using the same WordVectors class as for a regular word2vec binary file.
%load_ext autoreload
%autoreload 2
import word2vec
model = word2vec.load('/Users/drodriguez/Downloads/vectors.bin')
model.vectors.shape
(355046, 100)
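Since the document ids are just extra tokens in the same vocabulary, ordinary word queries still work on this model. A minimal sketch, assuming the word 'movie' made it into the vocabulary (this query is illustrative, not from the original output):

# Word-level query on the same model (illustrative word choice)
indexes, metrics = model.cosine('movie')
model.generate_response(indexes, metrics).tolist()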
The document vectors are identified by the ids we used in the preprocessing step; for example, document 1 has the vector:
model['_*1']
array([-0.09961915, -0.02504692, -0.00935447, -0.06283784, 0.0496435 , 0.08408608, 0.07249928, 0.02332756, 0.04755085, 0.14972655, -0.13693954, 0.01296212, -0.05615778, 0.05408363, -0.01922397, 0.00903398, 0.11205222, 0.02491457, 0.04302743, -0.06734619, -0.2004746 , -0.10970256, -0.04777983, -0.05336951, -0.10399633, -0.06500414, 0.0393892 , -0.08285502, 0.05692215, 0.01362013, 0.0013779 , -0.24589944, -0.16099831, -0.11000603, -0.08007748, -0.05447361, 0.10116527, 0.06073807, 0.00416331, 0.00434075, -0.02536621, 0.12531835, -0.0312396 , -0.03754066, 0.10542928, -0.01937485, 0.03270554, 0.03367785, -0.31589472, 0.00840659, -0.09368768, -0.11164349, -0.02970047, -0.11497822, 0.06357043, -0.16664146, 0.02935979, 0.25292322, 0.01335857, -0.19644944, 0.08630948, 0.05118916, -0.08062234, 0.03329093, -0.13994266, 0.07419056, -0.0284326 , -0.04101218, -0.01186225, 0.10280388, 0.00699921, 0.07681306, 0.0986157 , -0.06155488, -0.17678751, -0.0433546 , -0.1698599 , 0.00764652, -0.11591533, -0.12973167, -0.01140277, -0.02404138, -0.06018848, -0.02115276, -0.14684282, -0.18135296, -0.03216174, -0.02125036, 0.2539596 , 0.16910006, -0.11961638, 0.03961169, 0.07747228, 0.02761923, 0.07856126, 0.06564176, 0.05922704, 0.10623101, 0.04387141, -0.14101151])
We can ask for the words or documents most similar to document 1:
indexes, metrics = model.cosine('_*1')
model.generate_response(indexes, metrics).tolist()
[(u'houselessness', 0.9697854490765448), (u'_*62909', 0.8435915200187546), (u'_*92297', 0.8382383325156331), (u'_*62902', 0.8354628520801568), (u'_*31249', 0.8321578405132342), (u'_*20758', 0.8302829157776485), (u'_*12342', 0.8263513274559964), (u'_*32435', 0.8237210585123108), (u'_*67836', 0.823590134267539), (u'_*31245', 0.8230957438394273)]
Now it's just a matter of matching the ids back to the data created in the preprocessing step, as sketched below.
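For example, a minimal sketch (the helper below is hypothetical, assuming the alldata.txt file written during preprocessing) that maps ids like '_*62909' back to their original text:

# Hypothetical helper: look up the raw text stored under a given doc id
def get_document(doc_id, path='/Users/drodriguez/Downloads/alldata.txt'):
    prefix = doc_id + ' '
    with open(path) as f:
        for line in f:
            if line.startswith(prefix):
                return line[len(prefix):].strip()
    return ''

indexes, metrics = model.cosine('_*1')
for token, score in model.generate_response(indexes, metrics).tolist():
    if token.startswith('_*'):  # skip plain words like 'houselessness'
        print('%.3f %s' % (score, get_document(token)[:80]))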