https://github.com/RaRe-Technologies/gensim-data is a central repository of data and pre-trained models for Gensim.
model | size | number of vectors | description
---|---|---|---
glove-twitter-25 | 104 MB | 1,193,514 | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)
word2vec-google-news-300 | 1,662 MB | 3,000,000 | Google News (about 100 billion words)
glove-wiki-gigaword-300 | 376 MB | 400,000 | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)
TODO: Inspect some of the models available
# Get a list of the models available for download
import gensim.downloader as downloader
from pprint import pprint

info = downloader.info()
# The returned dict is large; uncomment to inspect it:
# pprint(info)
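The dict returned by `info()` nests the catalog under a `models` key (there is a sibling `corpora` key for datasets). A minimal sketch of pulling out just the names and sizes, using a small hypothetical excerpt of that dict so it runs offline; the field names `num_records` and `file_size` and the byte counts here are assumptions for illustration:

```python
# Hypothetical excerpt of the dict returned by downloader.info();
# the real catalog is fetched from the gensim-data repository.
info = {
    "models": {
        "glove-twitter-25": {"num_records": 1193514, "file_size": 109051904},
        "word2vec-google-news-300": {"num_records": 3000000, "file_size": 1742733312},
    }
}

for name, meta in info["models"].items():
    # file_size is in bytes; convert to MB for readability
    print(f"{name}: {meta['num_records']:,} vectors, {meta['file_size'] / 2**20:.0f} MB")
```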
%%time
# this will download and load the model
# data will be downloaded to ~/gensim-data
model_glove_tw25 = downloader.load("glove-twitter-25")
print(model_glove_tw25)
<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7f03c5b424a8>
CPU times: user 41.7 s, sys: 262 ms, total: 41.9 s
Wall time: 42.1 s
model_glove_tw25.most_similar('cat')
[('dog', 0.9590819478034973), ('monkey', 0.9203578233718872), ('bear', 0.9143137335777283), ('pet', 0.9108031392097473), ('girl', 0.8880630135536194), ('horse', 0.8872727155685425), ('kitty', 0.8870542049407959), ('puppy', 0.886769711971283), ('hot', 0.8865255117416382), ('lady', 0.8845518827438354)]
# Print the cosine similarity between a few word pairs
print("similarity(woman, man) : ", model_glove_tw25.similarity('woman', 'man'))
print("similarity(girl, boy) : ", model_glove_tw25.similarity('girl', 'boy'))
print("similarity(prince, princess) : ", model_glove_tw25.similarity('prince', 'princess'))
similarity(woman, man) :  0.76541775
similarity(girl, boy) :  0.95961404
similarity(prince, princess) :  0.87682575
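Under the hood, `similarity()` is just the cosine of the angle between the two word vectors: dot(u, v) / (|u| · |v|). A minimal sketch with made-up 3-d vectors (real GloVe vectors here have 25 dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, not real embeddings
woman = [0.2, 0.9, 0.1]
man = [0.3, 0.8, 0.2]
print(cosine_similarity(woman, man))  # close to 1.0: nearly parallel directions
```

Because the measure depends only on direction, not magnitude, scaling a vector does not change its similarity to anything.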
model_glove_tw25.most_similar(positive=['woman', 'king'], negative=['man'])
[('meets', 0.8841923475265503), ('prince', 0.832163393497467), ('queen', 0.8257461190223694), ('’s', 0.8174097537994385), ('crow', 0.8134994506835938), ('hunter', 0.8131038546562195), ('father', 0.811583399772644), ('soldier', 0.8111359477043152), ('mercy', 0.8082392811775208), ('hero', 0.8082262873649597)]
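`most_similar(positive=..., negative=...)` implements the classic vector-arithmetic analogy: add the positive vectors, subtract the negative ones, then return the vocabulary words closest (by cosine) to the result, excluding the input words. A simplified sketch with made-up 2-d vectors; gensim additionally unit-normalizes each vector before combining, which this sketch omits:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 2-d "embeddings" chosen so the analogy works; real vectors have 25-300 dims.
vocab = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.5, 0.5],
}

def most_similar(positive, negative, topn=1):
    # Combine: sum of positive vectors minus sum of negative vectors
    target = [0.0, 0.0]
    for w in positive:
        target = [t + x for t, x in zip(target, vocab[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vocab[w])]
    # Rank the rest of the vocabulary by cosine to the target
    candidates = {w: cosine(vocab[w], target)
                  for w in vocab if w not in positive + negative}
    return sorted(candidates.items(), key=lambda kv: -kv[1])[:topn]

print(most_similar(positive=["woman", "king"], negative=["man"]))  # "queen" ranks first
```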
The next model is a large download (about 1.6 GB), and loading it can exhaust memory on smaller machines!
%%time
# this will download and load the model
# data will be downloaded to ~/gensim-data
model_google_news = downloader.load("word2vec-google-news-300")
print(model_google_news)
<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7f0384f14518>
CPU times: user 1min 49s, sys: 2.58 s, total: 1min 52s
Wall time: 1min 51s
model_google_news.most_similar('cat')
[('cats', 0.8099379539489746), ('dog', 0.7609456777572632), ('kitten', 0.7464985251426697), ('feline', 0.7326233983039856), ('beagle', 0.7150583267211914), ('puppy', 0.7075453996658325), ('pup', 0.6934291124343872), ('pet', 0.6891531348228455), ('felines', 0.6755931377410889), ('chihuahua', 0.6709762215614319)]
# Print the cosine similarity between the same word pairs
print("similarity(woman, man) : ", model_google_news.similarity('woman', 'man'))
print("similarity(girl, boy) : ", model_google_news.similarity('girl', 'boy'))
print("similarity(prince, princess) : ", model_google_news.similarity('prince', 'princess'))
similarity(woman, man) :  0.76640123
similarity(girl, boy) :  0.8543272
similarity(prince, princess) :  0.69865096
model_google_news.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.7118192911148071), ('monarch', 0.6189674139022827), ('princess', 0.5902431011199951), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321243286133), ('kings', 0.5236844420433044), ('Queen_Consort', 0.5235945582389832), ('queens', 0.518113374710083), ('sultan', 0.5098593235015869), ('monarchy', 0.5087411999702454)]
model_google_news.most_similar(positive=['Paris', 'France'], negative=['Rome'])
[('French', 0.6156964302062988), ('PARIS_AFX_Gaz_de', 0.5759478807449341), ('Villebon_Sur_Yvette', 0.5739676356315613), ('extradites_Noriega', 0.5631396174430847), ('Belgium', 0.5630872845649719), ('Dordogne_region', 0.5518567562103271), ('called_Xynthia_blew', 0.5481809973716736), ('Evian_Les_Bains', 0.5411219000816345), ('Nantes', 0.5385209321975708), ('Anny_Cazenave', 0.5256197452545166)]
model_google_news.most_similar(positive=['father', 'son'], negative=['mother'], topn=10)
[('brother', 0.8007099628448486), ('nephew', 0.7814147472381592), ('younger_brother', 0.7780234217643738), ('eldest_son', 0.769737720489502), ('uncle', 0.7542626857757568), ('grandson', 0.7425380349159241), ('sons', 0.7106431722640991), ('grandfather', 0.6886354684829712), ('dad', 0.6858929395675659), ('elder_brother', 0.6622239947319031)]
Now that we have tried both models, which one seems to give more accurate answers? Why?
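One way to make that comparison concrete is to run the same analogy questions through both models and count how often the expected answer appears in the top-n results. A sketch written against any object exposing gensim's `most_similar` interface; here a hypothetical `FakeModel` stand-in is hard-coded with truncated results copied from the outputs above, so the sketch runs without downloading either model (with the real models loaded, you would pass `model_glove_tw25` or `model_google_news` to `score` instead):

```python
class FakeModel:
    """Hypothetical stand-in for a gensim KeyedVectors object."""
    def __init__(self, answers):
        self._answers = answers

    def most_similar(self, positive, negative, topn=10):
        key = (tuple(positive), tuple(negative))
        return self._answers.get(key, [])[:topn]

QUESTIONS = [
    # (positive, negative, expected answer)
    (["woman", "king"], ["man"], "queen"),
    (["father", "son"], ["mother"], "brother"),
]

def score(model, questions, topn=5):
    """Fraction of analogy questions whose expected answer is in the top-n results."""
    hits = 0
    for positive, negative, expected in questions:
        results = model.most_similar(positive=positive, negative=negative, topn=topn)
        if any(word == expected for word, _ in results):
            hits += 1
    return hits / len(questions)

# Truncated results copied from the notebook outputs above:
google = FakeModel({
    (("woman", "king"), ("man",)): [("queen", 0.71), ("monarch", 0.62)],
    (("father", "son"), ("mother",)): [("brother", 0.80), ("nephew", 0.78)],
})
twitter = FakeModel({
    (("woman", "king"), ("man",)): [("meets", 0.88), ("prince", 0.83), ("queen", 0.83)],
})

print("word2vec-google-news-300 :", score(google, QUESTIONS))
print("glove-twitter-25         :", score(twitter, QUESTIONS))
```

The larger, news-trained model tends to do better on these word-relation questions, which is consistent with its much bigger training corpus and 300-dimensional vectors versus 25 for the Twitter model.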