Word embeddings are an approach to representing text in NLP. In this notebook we demonstrate how to train embeddings with both the CBOW and SkipGram architectures, using Gensim's Word2Vec and FastText implementations.
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')
# define training data
#Gensim's word2vec requires the training data as a 'list of lists': the corpus is a list of documents,
#and every document is a list of that document's tokens.
corpus = [['dog','bites','man'], ["man", "bites" ,"dog"],["dog","eats","meat"],["man", "eats","food"]]
#Training the model
model_cbow = Word2Vec(corpus, min_count=1, sg=0) #using the CBOW architecture for training
model_skipgram = Word2Vec(corpus, min_count=1, sg=1) #using the SkipGram architecture for training
In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which it appears.
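To make this concrete, here is a minimal sketch (plain Python over the toy corpus above) of the context-to-target pairs CBOW trains on. The window size of 1 is purely illustrative; gensim's default window is 5.
#Illustration only: enumerate the (context words -> center word) pairs CBOW learns from
window_size = 1 #illustrative window; Word2Vec's default is 5
for sentence in corpus:
    for i, center in enumerate(sentence):
        context = sentence[max(0, i - window_size):i] + sentence[i + 1:i + 1 + window_size]
        print(f"context {context} -> center '{center}'")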
#Summarize the loaded model
print(model_cbow)
#Summarize vocabulary
words = list(model_cbow.wv.vocab)
print(words)
#Access the vector for one word
print(model_cbow['dog'])
Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[-3.1667745e-03 2.5268614e-03 -4.9504861e-03 2.3797194e-03 -3.3511904e-03 1.7659335e-03 -9.6838089e-04 3.6862001e-03 3.3760078e-03 -1.1944126e-03 -4.7475514e-03 -4.6677454e-03 4.7231275e-03 2.1875298e-03 4.9989321e-03 -4.7024325e-04 4.6936749e-03 4.5417100e-03 -4.8383311e-03 4.5522186e-03 9.4010920e-04 -2.8778350e-03 -2.3938445e-03 7.6240452e-04 2.8537741e-05 -1.0585956e-03 1.5203804e-03 1.1994856e-04 4.3881699e-03 3.5755127e-04 1.9964906e-03 -3.3893189e-03 2.5362791e-03 -3.8559963e-03 -4.6814438e-03 -1.0485576e-03 1.9576577e-03 -5.4296525e-04 2.5505766e-03 1.4563937e-03 1.1214090e-03 3.1200200e-03 3.5230191e-03 4.4931062e-03 -5.5389071e-04 1.6268899e-03 -4.6736463e-03 -1.9612674e-04 1.5486709e-03 -3.5581242e-03 1.5163666e-03 2.2859944e-03 -3.5728619e-03 -3.5505979e-03 7.8282715e-04 -4.8093311e-03 -3.1324120e-03 -3.6213300e-03 -1.4478542e-03 3.4006054e-03 2.2276146e-03 -4.1698264e-03 -3.6997625e-03 -4.1264743e-03 -4.9103238e-03 -2.2635974e-03 -3.9036905e-03 3.8846405e-03 -7.9726276e-05 -2.0692295e-03 -3.0645117e-04 -3.0288144e-03 -3.4682599e-03 -3.1768843e-03 -1.1148058e-03 -2.8012963e-03 -6.5973290e-04 -2.3705217e-03 4.3961490e-03 3.2166531e-03 3.6933657e-04 -6.2054797e-04 2.0661615e-04 3.7390803e-04 -3.5061471e-03 3.6587315e-03 2.1328868e-03 -2.5964181e-03 4.3381471e-03 4.0168604e-03 1.8054987e-03 -1.2192487e-03 1.5615283e-03 -1.8635839e-03 2.9529419e-03 -3.3825964e-03 -3.2592549e-03 -4.7523994e-04 -5.3210353e-04 -9.8173530e-04]
#Compute similarity
print("Similarity between eats and bites:",model_cbow.similarity('eats', 'bites'))
print("Similarity between eats and man:",model_cbow.similarity('eats', 'man'))
Similarity between eats and bites: -0.09852024
Similarity between eats and man: -0.17088428
From the above similarity scores we can conclude that 'eats' is more similar to 'bites' than to 'man'.
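Under the hood, similarity() computes the cosine similarity between the two word vectors. A quick sanity check with NumPy (assuming numpy is available) should reproduce the score above:
import numpy as np
#cosine similarity computed by hand; should match model_cbow.similarity('eats', 'bites')
v1, v2 = model_cbow.wv['eats'], model_cbow.wv['bites']
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))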
#Most similar words
model_cbow.most_similar('meat')
[('bites', 0.1353721022605896), ('man', 0.1094527617096901), ('food', -0.02215239405632019), ('dog', -0.1444159597158432), ('eats', -0.16309654712677002)]
# save model
model_cbow.save('model_cbow.bin')
# load model
new_model_cbow = Word2Vec.load('model_cbow.bin')
print(new_model_cbow)
Word2Vec(vocab=6, size=100, alpha=0.025)
In SkipGram, the task is the reverse: predict the context words given the center word.
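Mirroring the CBOW sketch above, here is a minimal illustration of the center-to-context pairs SkipGram trains on (again with an illustrative window of 1):
#Illustration only: enumerate the (center word -> context word) pairs SkipGram learns from
window_size = 1 #illustrative window; Word2Vec's default is 5
for sentence in corpus:
    for i, center in enumerate(sentence):
        for context_word in sentence[max(0, i - window_size):i] + sentence[i + 1:i + 1 + window_size]:
            print(f"center '{center}' -> context '{context_word}'")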
#Summarize the loaded model
print(model_skipgram)
#Summarize vocabulary
words = list(model_skipgram.wv.vocab)
print(words)
#Access the vector for one word
print(model_skipgram['dog'])
Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[-3.1667745e-03 2.5268614e-03 -4.9504861e-03 2.3797194e-03 -3.3511904e-03 1.7659335e-03 -9.6838089e-04 3.6862001e-03 3.3760078e-03 -1.1944126e-03 -4.7475514e-03 -4.6677454e-03 4.7231275e-03 2.1875298e-03 4.9989321e-03 -4.7024325e-04 4.6936749e-03 4.5417100e-03 -4.8383311e-03 4.5522186e-03 9.4010920e-04 -2.8778350e-03 -2.3938445e-03 7.6240452e-04 2.8537741e-05 -1.0585956e-03 1.5203804e-03 1.1994856e-04 4.3881699e-03 3.5755127e-04 1.9964906e-03 -3.3893189e-03 2.5362791e-03 -3.8559963e-03 -4.6814438e-03 -1.0485576e-03 1.9576577e-03 -5.4296525e-04 2.5505766e-03 1.4563937e-03 1.1214090e-03 3.1200200e-03 3.5230191e-03 4.4931062e-03 -5.5389071e-04 1.6268899e-03 -4.6736463e-03 -1.9612674e-04 1.5486709e-03 -3.5581242e-03 1.5163666e-03 2.2859944e-03 -3.5728619e-03 -3.5505979e-03 7.8282715e-04 -4.8093311e-03 -3.1324120e-03 -3.6213300e-03 -1.4478542e-03 3.4006054e-03 2.2276146e-03 -4.1698264e-03 -3.6997625e-03 -4.1264743e-03 -4.9103238e-03 -2.2635974e-03 -3.9036905e-03 3.8846405e-03 -7.9726276e-05 -2.0692295e-03 -3.0645117e-04 -3.0288144e-03 -3.4682599e-03 -3.1768843e-03 -1.1148058e-03 -2.8012963e-03 -6.5973290e-04 -2.3705217e-03 4.3961490e-03 3.2166531e-03 3.6933657e-04 -6.2054797e-04 2.0661615e-04 3.7390803e-04 -3.5061471e-03 3.6587315e-03 2.1328868e-03 -2.5964181e-03 4.3381471e-03 4.0168604e-03 1.8054987e-03 -1.2192487e-03 1.5615283e-03 -1.8635839e-03 2.9529419e-03 -3.3825964e-03 -3.2592549e-03 -4.7523994e-04 -5.3210353e-04 -9.8173530e-04]
#Compute similarity
print("Similarity between eats and bites:",model_skipgram.similarity('eats', 'bites'))
print("Similarity between eats and man:",model_skipgram.similarity('eats', 'man'))
Similarity between eats and bites: -0.09852936
Similarity between eats and man: -0.17089055
From the above similarity scores we can conclude that 'eats' is more similar to 'bites' than to 'man'.
#Most similarity
model_skipgram.most_similar('meat')
[('bites', 0.1353721022605896), ('man', 0.10945276916027069), ('food', -0.022152386605739594), ('dog', -0.1444159746170044), ('eats', -0.16317100822925568)]
# save model
model_skipgram.save('model_skipgram.bin')
# load model
new_model_skipgram = Word2Vec.load('model_skipgram.bin')
print(new_model_skipgram)
Word2Vec(vocab=6, size=100, alpha=0.025)
The entire English Wikipedia dump as of 28/04/2020 is just over 16 GB in size. Due to computation constraints, we will train our word2vec and fasttext embeddings on only a part of this corpus.
The file is 294 MB, so it can take a while to download.
Source for code which downloads files from Google Drive: https://stackoverflow.com/questions/25010369/wget-curl-large-file-from-google-drive/39225039#39225039
import os
import requests
os.makedirs('data/en', exist_ok= True)
file_name = "data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2"
file_id = "11804g0GcWnBIVDahjo5fQyc05nQLXGwF"
def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params={'id': id}, stream=True)
    token = get_confirm_token(response)
    if token:
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

if not os.path.exists(file_name):
    download_file_from_google_drive(file_id, file_name)
else:
    print("file already exists, skipping download")
print(f"File at: {file_name}")
file already exists, skipping download
File at: data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec
from gensim.models.fasttext import FastText
import time
#Preparing the Training data
wiki = WikiCorpus(file_name, lemmatize=False, dictionary={})
sentences = list(wiki.get_texts())
#if you get a memory error executing the lines above,
#comment them out and uncomment the lines below.
#loading will be slower, but more stable.
# wiki = WikiCorpus(file_name, processes=4, lemmatize=False, dictionary={})
# sentences = list(wiki.get_texts())
#if you still get a memory error, try setting processes to 1 or 2 and run it again.
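Before training, a quick sanity check (illustrative, using only the sentences list built above) that the corpus loaded as expected:
#Sanity check on the preprocessed corpus
print(f"Number of documents: {len(sentences)}")
print("First 15 tokens of the first document:", sentences[0][:15])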
#CBOW
start = time.time()
word2vec_cbow = Word2Vec(sentences,min_count=10, sg=0)
end = time.time()
print("CBOW Model Training Complete.\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))
CBOW Model Training Complete.
Time taken for training is:0.04 hrs
#Summarize the loaded model
print(word2vec_cbow)
print("-"*30)
#Summarize vocabulary
words = list(word2vec_cbow.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)
#Access the vector for one word
print(f"Length of vector: {len(word2vec_cbow['film'])}")
print(word2vec_cbow['film'])
print("-"*30)
#Compute similarity
print("Similarity between film and drama:",word2vec_cbow.similarity('film', 'drama'))
print("Similarity between film and tiger:",word2vec_cbow.similarity('film', 'tiger'))
print("-"*30)
Word2Vec(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[-0.25941572 -1.6287326 2.5331333 -1.5818936 0.9024474 0.8614945 2.4875445 -0.95802265 -1.3792082 -1.1744157 -4.300686 1.0071316 0.10418405 4.855032 0.6251962 -0.06472338 0.19993098 -0.7291219 2.342258 -1.7298651 0.7895099 -2.2819378 0.7158192 -0.62419826 0.6720258 3.6712303 1.3836899 0.17808275 -3.7205396 0.2529162 1.0290879 -0.9228959 0.9451632 1.7415334 1.9618814 1.4535053 2.670452 0.9272077 0.25056183 -0.4078236 0.5795217 0.6316829 0.50204426 -0.19865237 -2.697352 0.75351495 1.0796617 2.247825 -2.956658 2.6606686 -0.42392135 -0.44319883 -2.9274392 -1.0198026 1.404833 -0.10840467 0.50829273 1.0767945 -0.65002084 -3.4231277 4.719826 -1.5996053 0.82882935 1.635043 -0.45730942 -1.3166244 -1.3349417 -2.3565981 1.7141095 -2.6643796 -1.2148786 0.2972199 -2.2865987 -1.6022073 2.0965865 -0.87479544 -1.4143106 -0.9149557 2.2900226 1.1464663 -2.6113467 -1.5517493 1.3018385 4.1072307 1.1441547 1.0222696 0.4847384 2.4148073 -2.881392 -0.67044157 -2.482836 -0.417894 3.1442287 -1.6087203 1.865813 -3.717568 0.5994761 1.8819104 3.355772 -1.9087372 ]
------------------------------
Similarity between film and drama: 0.4986632
Similarity between film and tiger: 0.15477756
------------------------------
# save model
from gensim.models import Word2Vec, KeyedVectors
word2vec_cbow.wv.save_word2vec_format('word2vec_cbow.bin', binary=True)
# load model
# new_model_cbow_wv = KeyedVectors.load_word2vec_format('word2vec_cbow.bin', binary=True)
# print(new_model_cbow_wv)
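Because we saved only the word vectors in word2vec format (not a full trainable model), the matching loader is KeyedVectors.load_word2vec_format. A minimal sketch, assuming the file saved above:
#load the saved vectors and query them; this is a KeyedVectors object, not a trainable model
cbow_vectors = KeyedVectors.load_word2vec_format('word2vec_cbow.bin', binary=True)
print(cbow_vectors.most_similar('film', topn=5))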
#SkipGram
start = time.time()
word2vec_skipgram = Word2Vec(sentences,min_count=10, sg=1)
end = time.time()
print("SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))
SkipGram Model Training Complete
Time taken for training is:0.10 hrs
#Summarize the loaded model
print(word2vec_skipgram)
print("-"*30)
#Summarize vocabulary
words = list(word2vec_skipgram.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)
#Access the vector for one word
print(f"Length of vector: {len(word2vec_skipgram['film'])}")
print(word2vec_skipgram['film'])
print("-"*30)
#Compute similarity
print("Similarity between film and drama:",word2vec_skipgram.similarity('film', 'drama'))
print("Similarity between film and tiger:",word2vec_skipgram.similarity('film', 'tiger'))
print("-"*30)
Word2Vec(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[ 1.94889292e-01 -7.88324535e-01 4.66947220e-02 2.57520348e-01 2.65304267e-01 3.63538593e-01 4.63590741e-01 -1.62654325e-01 9.11010578e-02 -6.58479631e-02 -6.97350129e-02 -6.56900406e-02 2.19506964e-01 2.20394313e-01 1.05092540e-01 8.26439075e-03 -9.39796269e-02 5.50851583e-01 7.65753444e-04 -2.22807571e-01 -3.17346871e-01 3.20529372e-01 4.51157093e-02 -1.93709806e-01 2.07626969e-02 1.69344515e-01 2.77250055e-02 1.10369585e-02 -4.75540310e-01 1.10796697e-01 4.28172469e-01 4.06191871e-02 5.15495241e-01 -6.85295224e-01 -5.06723702e-01 -4.52192919e-03 1.51265517e-03 -3.84557724e-01 -2.22782314e-01 5.11201501e-01 1.42252162e-01 -7.73397386e-01 -2.78606623e-01 4.70017433e-01 -2.70037323e-01 5.04850507e-01 -1.48356587e-01 2.26073325e-01 -3.36060971e-01 -1.19667962e-01 -2.59654850e-01 -4.44965392e-01 1.11614995e-01 1.62986945e-02 4.82374012e-01 -7.87460804e-02 -1.13825299e-01 -2.24003598e-01 4.93353546e-01 -5.57069406e-02 2.43176505e-01 -1.84876159e-01 2.13489812e-02 3.42909366e-01 2.02496469e-01 -4.25657362e-01 8.17572057e-01 -2.83644646e-01 -5.23434244e-02 -3.27616245e-01 4.43994589e-02 -3.90237272e-01 2.12029487e-01 -7.25788534e-01 5.52469850e-01 -4.72590374e-03 -2.02829018e-01 -9.59078223e-03 3.68973225e-01 -2.69762665e-01 -2.85591751e-01 -2.68359333e-01 3.10093671e-01 2.02198789e-01 5.80960453e-01 -2.47493789e-01 -7.37856887e-03 -3.59723950e-03 3.14893663e-01 1.12885557e-01 -5.09416103e-01 -7.58459032e-01 5.30587435e-01 -1.51896626e-01 -3.37440372e-01 4.22841489e-01 -3.34523350e-01 3.21759552e-01 7.44457126e-01 -1.26014173e-01]
------------------------------
Similarity between film and drama: 0.63833964
Similarity between film and tiger: 0.22270091
------------------------------
# save model
word2vec_skipgram.wv.save_word2vec_format('word2vec_sg.bin', binary=True)
# load model
# new_model_skipgram_wv = KeyedVectors.load_word2vec_format('word2vec_sg.bin', binary=True)
# print(new_model_skipgram_wv)
#CBOW
start = time.time()
fasttext_cbow = FastText(sentences, sg=0, min_count=10)
end = time.time()
print("FastText CBOW Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))
FastText CBOW Model Training Complete
Time taken for training is:0.12 hrs
#Summarize the loaded model
print(fasttext_cbow)
print("-"*30)
#Summarize vocabulary
words = list(fasttext_cbow.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)
#Access the vector for one word
print(f"Length of vector: {len(fasttext_cbow['film'])}")
print(fasttext_cbow['film'])
print("-"*30)
#Compute similarity
print("Similarity between film and drama:",fasttext_cbow.similarity('film', 'drama'))
print("Similarity between film and tiger:",fasttext_cbow.similarity('film', 'tiger'))
print("-"*30)
FastText(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[ 0.47473213 1.6783198 -4.766255 -3.2404876 0.80164665 1.993539 3.4226568 -0.7035685 -3.0426116 1.5137119 3.8207133 1.3821473 -0.7379625 -0.6726444 1.8303355 -2.1288188 1.2368282 -3.0745962 1.4226121 -2.8884995 7.2847705 -1.564321 2.869352 0.6962616 4.469778 2.5569658 2.621335 -4.612509 -2.2389078 3.6648748 0.7189718 1.0702186 -3.175641 2.7648733 0.13811935 -2.441776 -3.9559126 -0.03163956 -1.1257534 -0.64402825 -1.5076644 -0.58919376 -0.14338583 4.2466817 4.3784213 3.0076942 -5.972965 2.2950342 -0.50719374 -3.916504 -2.1366098 -2.661619 2.3540869 2.1862476 5.1004434 4.1282 -4.164653 1.1288711 -4.001655 -4.051289 2.5718336 -0.40600455 3.8396242 2.214367 1.8413899 4.5216975 -1.6419586 2.7617378 -2.0902452 2.598776 4.041824 -5.1805005 -2.777213 -0.02546828 -0.07393612 -3.2800605 -2.9874747 -0.6490991 3.6039045 -1.4168853 3.6110177 -1.0872458 -0.6365031 -1.0161037 3.7344344 0.29839793 0.421953 -1.811646 1.3730506 7.575645 3.3998368 5.0335827 -0.2107324 -2.331183 0.19383769 3.0550041 4.1529713 3.988616 0.04955976 1.3424706 ]
------------------------------
Similarity between film and drama: 0.5669882
Similarity between film and tiger: 0.24975622
------------------------------
#SkipGram
start = time.time()
fasttext_skipgram = FastText(sentences, sg=1, min_count=10)
end = time.time()
print("FastText SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))
FastText SkipGram Model Training Complete
Time taken for training is:0.20 hrs
#Summarize the loaded model
print(fasttext_skipgram)
print("-"*30)
#Summarize vocabulary
words = list(fasttext_skipgram.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)
#Access the vector for one word
print(f"Length of vector: {len(fasttext_skipgram['film'])}")
print(fasttext_skipgram['film'])
print("-"*30)
#Compute similarity
print("Similarity between film and drama:",fasttext_skipgram.similarity('film', 'drama'))
print("Similarity between film and tiger:",fasttext_skipgram.similarity('film', 'tiger'))
print("-"*30)
FastText(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[-8.4101312e-02 -6.9478154e-04 3.3954462e-01 -3.6973858e-01 1.6844368e-01 3.4855682e-01 8.0026442e-01 -5.0405812e-01 -6.0389137e-01 2.1694953e-02 4.0937051e-01 -3.5893116e-02 -1.3717794e-01 4.0389201e-01 3.9567137e-01 2.4365921e-01 5.6551516e-02 -1.5994829e-01 -1.8148309e-01 -2.6480275e-01 -4.8462763e-01 9.5473409e-02 -1.1126036e-02 -1.8805853e-01 2.4277805e-01 2.4251699e-01 -1.7501226e-01 -4.3078136e-01 -3.6442232e-01 9.1702184e-03 -3.2344624e-01 -1.0232232e-01 -5.2684498e-01 -2.7622378e-01 4.2112619e-01 -4.3196991e-02 3.1967857e-01 1.7001998e-01 3.3157614e-01 -2.4995559e-01 -1.3239473e-01 -3.4502399e-01 2.1341468e-01 5.8890671e-01 1.7721146e-01 1.5974782e-01 -3.8579264e-01 -2.8241745e-01 6.7402735e-02 -7.1903253e-01 1.3665260e-01 -5.9633050e-02 -5.9002697e-01 -6.1173952e-01 -1.0246418e-03 -5.1254374e-01 -1.5101396e-01 1.6967247e-01 2.8125226e-01 -4.6728057e-01 -5.4966863e-02 -1.3736627e-02 -1.5689149e-01 8.3176725e-02 1.8850440e-02 4.1858605e-01 -1.1376646e-02 -4.0758383e-02 -1.7871203e-01 2.7792713e-01 5.5813068e-01 -3.5465869e-01 1.3662770e-01 2.5777066e-01 -3.0423281e-01 7.8141141e-01 1.1446947e-02 -4.0541172e-01 2.9406905e-01 6.0151044e-02 4.9637925e-02 -3.9679220e-01 4.5333567e-01 1.0888510e-02 2.7147910e-01 -1.7305572e-01 -2.8098795e-01 -6.1907400e-03 -2.3080334e-01 5.8609635e-01 -1.0097053e-01 6.6119152e-01 1.8578811e-01 -5.9025098e-02 -5.3886050e-01 2.6664239e-01 -2.2193529e-02 7.0487672e-01 3.9477929e-01 3.7981489e-01]
------------------------------
Similarity between film and drama: 0.626041
Similarity between film and tiger: 0.27831402
------------------------------
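Unlike Word2Vec, FastText composes a word's vector from its character n-grams, so it can return a vector even for a word it never saw during training. A small sketch; the misspelling below is a hypothetical out-of-vocabulary example:
#FastText can embed out-of-vocabulary words via their character n-grams
oov_word = 'filmmakinng' #deliberate misspelling, hypothetical OOV example
print(oov_word in fasttext_skipgram.wv.vocab) #False: the word is not in the vocabulary
print(fasttext_skipgram.wv[oov_word][:5]) #...but FastText still returns a vector for it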
An interesting observation, if you noticed, is that CBOW trains faster than SkipGram in both cases. We will leave it to the reader to figure out why; as a hint, revisit how CBOW and SkipGram construct their training pairs.