Embeddings play an important role in today's recommender systems. In this notebook, we'll look at trained word embeddings. We'll plot the embeddings so we can compare them visually, then look at analogies and word similarities. We'll use the Gensim library, which makes it easy to work with embeddings.
import gensim
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# Download GloVe embeddings (66MB, trained on Wikipedia + Gigaword)
model = api.load("glove-wiki-gigaword-50")
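If you'd like to try other pre-trained vectors, the downloader can list everything it knows about. A minimal sketch (the exact model names and sizes available depend on your Gensim version):
# List the pre-trained models that gensim.downloader offers
# (api was imported above as gensim.downloader)
available_models = api.info()["models"]
for name in sorted(available_models):
    print(name)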
What's the embedding of 'king'?
model['king']
array([ 0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 , -0.08813 , 0.47377 , -0.61798 , -0.31012 , -0.076666, 1.493 , -0.034189, -0.98173 , 0.68229 , 0.81722 , -0.51874 , -0.31503 , -0.55809 , 0.66421 , 0.1961 , -0.13495 , -0.11476 , -0.30344 , 0.41177 , -2.223 , -1.0756 , -1.0783 , -0.34354 , 0.33505 , 1.9927 , -0.04234 , -0.64319 , 0.71125 , 0.49159 , 0.16754 , 0.34344 , -0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 , -1.0137 , -0.21585 , -0.15155 , 0.78321 , -0.91241 , -1.6106 , -0.64426 , -0.51042 ], dtype=float32)
model.vectors.shape
(400000, 50)
Which means the vocabulary contains 400,000 words, and each word is represented by a 50-dimensional embedding vector.
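As a quick sanity check, we can read those numbers straight off the model. A minimal sketch, assuming the Gensim 4.x KeyedVectors API (older versions expose the vocabulary as model.vocab instead of model.key_to_index):
# Vocabulary size, vector dimensionality, and a membership check
print(len(model.key_to_index))        # 400000 words
print(model['king'].shape)            # (50,) dimensions per word
print('king' in model.key_to_index)   # True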
Let's define a helper that plots embedding vectors as a heatmap, so we get a colorful visual of the values in each vector:
def plot_embeddings(vectors, labels=None):
    n_vectors = len(vectors)
    fig = plt.figure(figsize=(12, n_vectors))
    ax = plt.gca()

    sns.heatmap(vectors, cmap='RdBu', vmax=2, vmin=-2, ax=ax)

    if labels:
        ax.set_yticklabels(labels, rotation=0)
        ax.tick_params(axis='both', which='major', labelsize=30)

    plt.tick_params(axis='x',           # changes apply to the x-axis
                    which='both',       # both major and minor ticks are affected
                    bottom=False,       # ticks along the bottom edge are off
                    top=False,          # ticks along the top edge are off
                    labelbottom=False)  # labels along the bottom edge are off

    # From https://github.com/mwaskom/seaborn/issues/1773
    # fix for mpl bug that cuts off top/bottom of seaborn viz
    b, t = plt.ylim()  # discover the values for bottom and top
    b += 0.5           # add 0.5 to the bottom
    t -= 0.5           # subtract 0.5 from the top
    plt.ylim(b, t)     # update the ylim(bottom, top) values
    plt.show()
Let's plot the embedding of 'king':
plot_embeddings([model['king']], ['king'])
We can also compare multiple embeddings:
plot_embeddings([model['king'], model['man'], model['woman'], model['girl'], model['boy']],
['king', 'man', 'woman', 'girl', 'boy'])
Here's another example including a number of different concepts:
plot_embeddings([model['king'], model['water'], model['god'], model['love'], model['star']],
['king', 'water', 'god', 'love', 'star'])
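The heatmaps give a qualitative sense of how similar two words are; we can also quantify it. A minimal sketch using Gensim's similarity method, which returns the cosine similarity between two words' vectors:
# Cosine similarity between pairs of words
print(model.similarity('king', 'queen'))
print(model.similarity('king', 'water'))
print(model.similarity('god', 'love'))
Related pairs such as king/queen should score noticeably higher than unrelated ones.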
Now for the most famous property of word embeddings: analogies. Gensim's most_similar adds the positive vectors, subtracts the negative ones, and returns the closest words, so king - man + woman gives us:
model.most_similar(positive=["king", "woman"], negative=["man"])
[('queen', 0.8523603677749634), ('throne', 0.7664334177970886), ('prince', 0.759214460849762), ('daughter', 0.7473883032798767), ('elizabeth', 0.7460220456123352), ('princess', 0.7424569725990295), ('kingdom', 0.7337411642074585), ('monarch', 0.7214490175247192), ('eldest', 0.7184861898422241), ('widow', 0.7099430561065674)]
plot_embeddings([model['king'],
model['man'],
model['woman'],
model['king'] - model['man'] + model['woman'],
model['queen']],
['king', 'man', 'woman', 'king-man+woman', 'queen'])
2019 update: this "king - man + woman = queen" result turned out to be partly a misconception. The resulting vector is actually closer to "king" than it is to "queen"; it only comes out as "queen" because most_similar rules out the input words as possible answers. See the paper "Fair is Better than Sensational: Man is to Doctor as Woman is to Doctor" (Nissim et al., 2019).
To verify, let's calculate the cosine similarity between the result of the analogy and 'queen'.
result = model['king'] - model['man'] + model['woman']
# Similarity between result and 'queen'
cosine_similarity(result.reshape(1, -1), model['queen'].reshape(1, -1))
array([[0.8609581]], dtype=float32)
Let's compare that to the similarity between the result and 'king':
# Similarity between result and 'king'
cosine_similarity(result.reshape(1, -1), model['king'].reshape(1, -1))
array([[0.8859834]], dtype=float32)
So the result is more similar to king (0.8859834 similarity score) than it is to queen (0.8609581 similarity score).
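Another way to see this is to rank every word in the vocabulary against the raw result vector, without excluding the input words. A minimal sketch using Gensim's similar_by_vector; I'd expect 'king' to come out on top, with 'queen' near the top of the list:
# Rank all vocabulary words by cosine similarity to the raw analogy result
model.similar_by_vector(result, topn=5)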
plot_embeddings( [model['king'],
result,
model['queen']],
['king', 'king-man+woman', 'queen'])
Exercise: try the analogy from the paper above: man is to doctor as woman is to...? Fill in the words below and check whether the result is closer to 'doctor' or to 'nurse'.
# TODO: fill in the positive and negative word lists
model.most_similar(positive=[], negative=[])
# TODO: do analogy algebra
result = model[''] - model[''] + model['']
# Similarity between result and 'nurse'
cosine_similarity(result.reshape(1, -1), model['nurse'].reshape(1, -1))
# Similarity between result and 'doctor'
cosine_similarity(result.reshape(1, -1), model['doctor'].reshape(1, -1))