https://github.com/RaRe-Technologies/gensim-data is a central repository of data and pre-trained models for Gensim.
model | size | number of vectors | description
---|---|---|---
glove-twitter-25 | 104 MB | 1,193,514 | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)
word2vec-google-news-300 | 1,662 MB | 3,000,000 | Google News (about 100 billion words)
glove-wiki-gigaword-300 | 376 MB | 400,000 | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)
TODO: Inspect some of the models available
# Get a list of the models available for download
import gensim.downloader as downloader
from pprint import pprint

info = downloader.info()
# The returned dict is large; uncomment to inspect it:
# pprint(info)
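The dict returned by `info()` nests the catalog under a `models` key (there is a sibling `corpora` key for datasets). A minimal sketch of pulling out just the names and sizes, using a small hypothetical excerpt of that dict so it runs offline; the field names `num_records` and `file_size` and the byte counts here are assumptions for illustration:

```python
# Hypothetical excerpt of the dict returned by downloader.info();
# the real catalog is fetched from the gensim-data repository.
info = {
    "models": {
        "glove-twitter-25": {"num_records": 1193514, "file_size": 109051904},
        "word2vec-google-news-300": {"num_records": 3000000, "file_size": 1742733312},
    }
}

for name, meta in info["models"].items():
    # file_size is in bytes; convert to MB for readability
    print(f"{name}: {meta['num_records']:,} vectors, {meta['file_size'] / 2**20:.0f} MB")
```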
%%time
# this will download and load the model
# data will be downloaded to ~/gensim-data
model_glove_tw25 = downloader.load("glove-twitter-25")
print(model_glove_tw25)
<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7f03c5b424a8>
CPU times: user 41.7 s, sys: 262 ms, total: 41.9 s
Wall time: 42.1 s
model_glove_tw25.most_similar('cat')
[('dog', 0.9590819478034973), ('monkey', 0.9203578233718872), ('bear', 0.9143137335777283), ('pet', 0.9108031392097473), ('girl', 0.8880630135536194), ('horse', 0.8872727155685425), ('kitty', 0.8870542049407959), ('puppy', 0.886769711971283), ('hot', 0.8865255117416382), ('lady', 0.8845518827438354)]
# Print the cosine similarity between a few word pairs
print("similarity(woman, man) : ", model_glove_tw25.similarity('woman', 'man'))
print("similarity(girl, boy) : ", model_glove_tw25.similarity('girl', 'boy'))
print("similarity(prince, princess) : ", model_glove_tw25.similarity('prince', 'princess'))
similarity(woman, man) :  0.76541775
similarity(girl, boy) :  0.95961404
similarity(prince, princess) :  0.87682575
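Under the hood, `similarity()` is just the cosine of the angle between the two word vectors: dot(u, v) / (|u| · |v|). A minimal sketch with made-up 3-d vectors (real GloVe vectors here have 25 dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, not real embeddings
woman = [0.2, 0.9, 0.1]
man = [0.3, 0.8, 0.2]
print(cosine_similarity(woman, man))  # close to 1.0: nearly parallel directions
```

Because the measure depends only on direction, not magnitude, scaling a vector does not change its similarity to anything.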
model_glove_tw25.most_similar(positive=['woman', 'king'], negative=['man'])
[('meets', 0.8841923475265503), ('prince', 0.832163393497467), ('queen', 0.8257461190223694), ('’s', 0.8174097537994385), ('crow', 0.8134994506835938), ('hunter', 0.8131038546562195), ('father', 0.811583399772644), ('soldier', 0.8111359477043152), ('mercy', 0.8082392811775208), ('hero', 0.8082262873649597)]
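`most_similar(positive=..., negative=...)` implements the classic vector-arithmetic analogy: add the positive vectors, subtract the negative ones, then return the vocabulary words closest (by cosine) to the result, excluding the input words. A simplified sketch with made-up 2-d vectors; gensim additionally unit-normalizes each vector before combining, which this sketch omits:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 2-d "embeddings" chosen so the analogy works; real vectors have 25-300 dims.
vocab = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.5, 0.5],
}

def most_similar(positive, negative, topn=1):
    # Combine: sum of positive vectors minus sum of negative vectors
    target = [0.0, 0.0]
    for w in positive:
        target = [t + x for t, x in zip(target, vocab[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vocab[w])]
    # Rank the rest of the vocabulary by cosine to the target
    candidates = {w: cosine(vocab[w], target)
                  for w in vocab if w not in positive + negative}
    return sorted(candidates.items(), key=lambda kv: -kv[1])[:topn]

print(most_similar(positive=["woman", "king"], negative=["man"]))  # "queen" ranks first
```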
The next model is a large download (about 1.6 GB), and loading it can exhaust memory on smaller machines!
%%time
# this will download and load the model
# data will be downloaded to ~/gensim-data
model_google_news = downloader.load("word2vec-google-news-300")
print(model_google_news)
<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7f0384f14518>
CPU times: user 1min 49s, sys: 2.58 s, total: 1min 52s
Wall time: 1min 51s
model_google_news.most_similar('cat')
[('cats', 0.8099379539489746), ('dog', 0.7609456777572632), ('kitten', 0.7464985251426697), ('feline', 0.7326233983039856), ('beagle', 0.7150583267211914), ('puppy', 0.7075453996658325), ('pup', 0.6934291124343872), ('pet', 0.6891531348228455), ('felines', 0.6755931377410889), ('chihuahua', 0.6709762215614319)]
# Print the cosine similarity between the same word pairs
print("similarity(woman, man) : ", model_google_news.similarity('woman', 'man'))
print("similarity(girl, boy) : ", model_google_news.similarity('girl', 'boy'))
print("similarity(prince, princess) : ", model_google_news.similarity('prince', 'princess'))
similarity(woman, man) :  0.76640123
similarity(girl, boy) :  0.8543272
similarity(prince, princess) :  0.69865096
model_google_news.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.7118192911148071), ('monarch', 0.6189674139022827), ('princess', 0.5902431011199951), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321243286133), ('kings', 0.5236844420433044), ('Queen_Consort', 0.5235945582389832), ('queens', 0.518113374710083), ('sultan', 0.5098593235015869), ('monarchy', 0.5087411999702454)]
model_google_news.most_similar(positive=['Paris', 'France'], negative=['Rome'])
[('French', 0.6156964302062988), ('PARIS_AFX_Gaz_de', 0.5759478807449341), ('Villebon_Sur_Yvette', 0.5739676356315613), ('extradites_Noriega', 0.5631396174430847), ('Belgium', 0.5630872845649719), ('Dordogne_region', 0.5518567562103271), ('called_Xynthia_blew', 0.5481809973716736), ('Evian_Les_Bains', 0.5411219000816345), ('Nantes', 0.5385209321975708), ('Anny_Cazenave', 0.5256197452545166)]
model_google_news.most_similar(positive=['father', 'son'], negative=['mother'], topn=10)
[('brother', 0.8007099628448486), ('nephew', 0.7814147472381592), ('younger_brother', 0.7780234217643738), ('eldest_son', 0.769737720489502), ('uncle', 0.7542626857757568), ('grandson', 0.7425380349159241), ('sons', 0.7106431722640991), ('grandfather', 0.6886354684829712), ('dad', 0.6858929395675659), ('elder_brother', 0.6622239947319031)]
Now that we have tried both models, which one seems to give more accurate answers? Why?
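One way to make that comparison concrete is to run the same analogy questions through both models and count how often the expected answer appears in the top-n results. A sketch written against any object exposing gensim's `most_similar` interface; here a hypothetical `FakeModel` stand-in is hard-coded with truncated results copied from the outputs above, so the sketch runs without downloading either model (with the real models loaded, you would pass `model_glove_tw25` or `model_google_news` to `score` instead):

```python
class FakeModel:
    """Hypothetical stand-in for a gensim KeyedVectors object."""
    def __init__(self, answers):
        self._answers = answers

    def most_similar(self, positive, negative, topn=10):
        key = (tuple(positive), tuple(negative))
        return self._answers.get(key, [])[:topn]

QUESTIONS = [
    # (positive, negative, expected answer)
    (["woman", "king"], ["man"], "queen"),
    (["father", "son"], ["mother"], "brother"),
]

def score(model, questions, topn=5):
    """Fraction of analogy questions whose expected answer is in the top-n results."""
    hits = 0
    for positive, negative, expected in questions:
        results = model.most_similar(positive=positive, negative=negative, topn=topn)
        if any(word == expected for word, _ in results):
            hits += 1
    return hits / len(questions)

# Truncated results copied from the notebook outputs above:
google = FakeModel({
    (("woman", "king"), ("man",)): [("queen", 0.71), ("monarch", 0.62)],
    (("father", "son"), ("mother",)): [("brother", 0.80), ("nephew", 0.78)],
})
twitter = FakeModel({
    (("woman", "king"), ("man",)): [("meets", 0.88), ("prince", 0.83), ("queen", 0.83)],
})

print("word2vec-google-news-300 :", score(google, QUESTIONS))
print("glove-twitter-25         :", score(twitter, QUESTIONS))
```

The larger, news-trained model tends to do better on these word-relation questions, which is consistent with its much bigger training corpus and 300-dimensional vectors versus 25 for the Twitter model.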