*A guide to scikit-learn compatible nearest neighbors classification using the recently introduced word mover's distance (WMD). *
Joint post with the awesome Matt Kusner!

Source of this Jupyter notebook.

In document classification and other natural language processing applications, having a good measure of the similarity of two texts can be a valuable building block. Ideally, such a measure would capture semantic information. Cosine similarity on bag-of-words vectors is known to do well in practice, but it inherently cannot capture when documents say the same thing in completely different words.

Take, for example, two headlines:

*Obama speaks to the media in Illinois**The President greets the press in Chicago*

These have no content words in common, so according to most bag of words--based metrics, their distance would be maximal. (For such applications, you probably don't want to count stopwords such as *the* and *in*, which don't truly signal semantic similarity.)

One way out of this conundrum is the word mover's distance (WMD), introduced in
*From Word Embeddings To Document Distances*,
(Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger, ICML 2015).
WMD adapts the earth mover's distance to the space of documents: the distance between two texts is given by the total amount of "mass" needed to move the words from one side into the other, multiplied by the distance the words need to move. So, starting from a measure of the distance between different words, we can get a principled document-level distance. Here is a visualisation of the idea, from the ICML slides:

The key ingredient in WMD is a good distance measure between words. Dense representations of words, also known by the trendier name "word embeddings" (because "distributed word representations" didn't stick), do the trick here. We could train the embeddings ourselves, but for meaningful results we would need tons of documents, and that might take a while. So let's just use the ones from the `word2vec`

team. (download link)

In [1]:

```
import os
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cross_validation import train_test_split
```

In [2]:

```
if not os.path.exists("data/embed.dat"):
print("Caching word embeddings in memmapped format...")
from gensim.models.word2vec import Word2Vec
wv = Word2Vec.load_word2vec_format(
"data/GoogleNews-vectors-negative300.bin.gz",
binary=True)
fp = np.memmap("data/embed.dat", dtype=np.double, mode='w+', shape=wv.syn0norm.shape)
fp[:] = wv.syn0norm[:]
with open("data/embed.vocab", "w") as f:
for _, w in sorted((voc.index, word) for word, voc in wv.vocab.items()):
print(w, file=f)
del fp, wv
W = np.memmap("data/embed.dat", dtype=np.double, mode="r", shape=(3000000, 300))
with open("data/embed.vocab") as f:
vocab_list = map(str.strip, f.readlines())
```

In [3]:

```
vocab_dict = {w: k for k, w in enumerate(vocab_list)}
```

In [33]:

```
d1 = "Obama speaks to the media in Illinois"
d2 = "The President addresses the press in Chicago"
vect = CountVectorizer(stop_words="english").fit([d1, d2])
print("Features:", ", ".join(vect.get_feature_names()))
```

Features: addresses, chicago, illinois, media, obama, president, press, speaks

The two documents are completely orthogonal in terms of bag-of-words

In [71]:

```
from scipy.spatial.distance import cosine
v_1, v_2 = vect.transform([d1, d2])
v_1 = v_1.toarray().ravel()
v_2 = v_2.toarray().ravel()
print(v_1, v_2)
print("cosine(doc_1, doc_2) = {:.2f}".format(cosine(v_1, v_2)))
```

[0 0 1 1 1 0 0 1] [1 1 0 0 0 1 1 0] cosine(doc_1, doc_2) = 1.00

In [58]:

```
from sklearn.metrics import euclidean_distances
W_ = W[[vocab_dict[w] for w in vect.get_feature_names()]]
D_ = euclidean_distances(W_)
print("d(addresses, speaks) = {:.2f}".format(D_[0, 7]))
print("d(addresses, chicago) = {:.2f}".format(D_[0, 1]))
```

d(addresses, speaks) = 1.16 d(addresses, chicago) = 1.37

`pyemd`

, a Python wrapper for Pele and Werman's implementation of the earth mover's distance.

In [74]:

```
from pyemd import emd
# pyemd needs double precision input
v_1 = v_1.astype(np.double)
v_2 = v_2.astype(np.double)
v_1 /= v_1.sum()
v_2 /= v_2.sum()
D_ = D_.astype(np.double)
D_ /= D_.max() # just for comparison purposes
print("d(doc_1, doc_2) = {:.2f}".format(emd(v_1, v_2, D_)))
```

d(doc_1, doc_2) = 0.74

We will use the *20 Newsgroups* classification task. Because WMD is an expensive computation, for this demo we just use a subset. To emphasize the power of the method, we use a larger test size, but train on relatively few samples.

In [4]:

```
newsgroups = fetch_20newsgroups()
docs, y = newsgroups.data, newsgroups.target
```

In [5]:

```
docs_train, docs_test, y_train, y_test = train_test_split(docs, y,
train_size=100,
test_size=300,
random_state=0)
```

`W`

embedding array is pretty huge, we might as well restrict it to just the words that actually occur in the dataset.

In [6]:

```
vect = CountVectorizer(stop_words="english").fit(docs_train + docs_test)
common = [word for word in vect.get_feature_names() if word in vocab_dict]
W_common = W[[vocab_dict[w] for w in common]]
```

We can then create a fixed-vocabulary vectorizer using only the words we have embeddings for.

In [7]:

```
vect = CountVectorizer(vocabulary=common, dtype=np.double)
X_train = vect.fit_transform(docs_train)
X_test = vect.transform(docs_test)
```

One way to proceed is to just pre-compute the pairwise distances between all documents, and use them to search for hyperparameters and evaluate the model. However, that would incur some extra computation, and WMD is expensive. Also, it's not the most pleasant user interface. So we define some scikit-learn compatible estimators for computing the WMD.

** WordMoversKNN** subclasses from

`KNeighborsClassifier`

and overrides the `predict`

function to compute the WMD between all training and test samples.In practice, however, we often don't know what is the best `n_neighbors`

to use. Simply wrapping `WordMoversKNN`

in a `GridSearchCV`

would be rather expensive because of all the distances that would need to be recomputed for every value of `n_neighbors`

. So we introduce ** WordMoversKNNCV**, which, when fitted, performs

`n_neighbors`

(under any given evaluation metric), while only computing the WMD once per fold, and only across folds (saving `n_folds * fold_size ** 2`

evaluations).In [16]:

```
"""%%file word_movers_knn.py"""
# Authors: Vlad Niculae, Matt Kusner
# License: Simplified BSD
import numpy as np
from sklearn.metrics import euclidean_distances
from sklearn.externals.joblib import Parallel, delayed
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import check_array
from sklearn.cross_validation import check_cv
from sklearn.metrics.scorer import check_scoring
from sklearn.preprocessing import normalize
from pyemd import emd
class WordMoversKNN(KNeighborsClassifier):
"""K nearest neighbors classifier using the Word Mover's Distance.
Parameters
----------
W_embed : array, shape: (vocab_size, embed_size)
Precomputed word embeddings between vocabulary items.
Row indices should correspond to the columns in the bag-of-words input.
n_neighbors : int, optional (default = 5)
Number of neighbors to use by default for :meth:`k_neighbors` queries.
n_jobs : int, optional (default = 1)
The number of parallel jobs to run for Word Mover's Distance computation.
If ``-1``, then the number of jobs is set to the number of CPU cores.
verbose : int, optional
Controls the verbosity; the higher, the more messages. Defaults to 0.
References
----------
Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger
From Word Embeddings To Document Distances
The International Conference on Machine Learning (ICML), 2015
http://mkusner.github.io/publications/WMD.pdf
"""
_pairwise = False
def __init__(self, W_embed, n_neighbors=1, n_jobs=1, verbose=False):
self.W_embed = W_embed
self.verbose = verbose
super(WordMoversKNN, self).__init__(n_neighbors=n_neighbors, n_jobs=n_jobs,
metric='precomputed', algorithm='brute')
def _wmd(self, i, row, X_train):
"""Compute the WMD between training sample i and given test row.
Assumes that `row` and train samples are sparse BOW vectors summing to 1.
"""
union_idx = np.union1d(X_train[i].indices, row.indices)
W_minimal = self.W_embed[union_idx]
W_dist = euclidean_distances(W_minimal)
bow_i = X_train[i, union_idx].A.ravel()
bow_j = row[:, union_idx].A.ravel()
return emd(bow_i, bow_j, W_dist)
def _wmd_row(self, row, X_train):
"""Wrapper to compute the WMD of a row with all training samples.
Assumes that `row` and train samples are sparse BOW vectors summing to 1.
Useful for parallelization.
"""
n_samples_train = X_train.shape[0]
return [self._wmd(i, row, X_train) for i in range(n_samples_train)]
def _pairwise_wmd(self, X_test, X_train=None):
"""Computes the word mover's distance between all train and test points.
Parallelized over rows of X_test.
Assumes that train and test samples are sparse BOW vectors summing to 1.
Parameters
----------
X_test: scipy.sparse matrix, shape: (n_test_samples, vocab_size)
Test samples.
X_train: scipy.sparse matrix, shape: (n_train_samples, vocab_size)
Training samples. If `None`, uses the samples the estimator was fit with.
Returns
-------
dist : array, shape: (n_test_samples, n_train_samples)
Distances between all test samples and all train samples.
"""
n_samples_test = X_test.shape[0]
if X_train is None:
X_train = self._fit_X
dist = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)(
delayed(self._wmd_row)(test_sample, X_train)
for test_sample in X_test)
return np.array(dist)
def fit(self, X, y):
"""Fit the model using X as training data and y as target values
Parameters
----------
X : scipy sparse matrix, shape: (n_samples, n_features)
Training data.
y : {array-like, sparse matrix}
Target values of shape = [n_samples] or [n_samples, n_outputs]
"""
X = check_array(X, accept_sparse='csr', copy=True)
X = normalize(X, norm='l1', copy=False)
return super(WordMoversKNN, self).fit(X, y)
def predict(self, X):
"""Predict the class labels for the provided data
Parameters
----------
X : scipy.sparse matrix, shape (n_test_samples, vocab_size)
Test samples.
Returns
-------
y : array of shape [n_samples]
Class labels for each data sample.
"""
X = check_array(X, accept_sparse='csr', copy=True)
X = normalize(X, norm='l1', copy=False)
dist = self._pairwise_wmd(X)
return super(WordMoversKNN, self).predict(dist)
class WordMoversKNNCV(WordMoversKNN):
"""Cross-validated KNN classifier using the Word Mover's Distance.
Parameters
----------
W_embed : array, shape: (vocab_size, embed_size)
Precomputed word embeddings between vocabulary items.
Row indices should correspond to the columns in the bag-of-words input.
n_neighbors_try : sequence, optional
List of ``n_neighbors`` values to try.
If None, tries 1-5 neighbors.
scoring : string, callable or None, optional, default: None
A string (see model evaluation documentation) or
a scorer callable object / function with signature
``scorer(estimator, X, y)``.
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the default 3-fold cross-validation,
- integer, to specify the number of folds.
- An object to be used as a cross-validation generator.
- An iterable yielding train/test splits.
For integer/None inputs, StratifiedKFold is used.
n_jobs : int, optional (default = 1)
The number of parallel jobs to run for Word Mover's Distance computation.
If ``-1``, then the number of jobs is set to the number of CPU cores.
verbose : int, optional
Controls the verbosity; the higher, the more messages. Defaults to 0.
Attributes
----------
cv_scores_ : array, shape (n_folds, len(n_neighbors_try))
Test set scores for each fold.
n_neighbors_ : int,
The best `n_neighbors` value found.
References
----------
Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger
From Word Embeddings To Document Distances
The International Conference on Machine Learning (ICML), 2015
http://mkusner.github.io/publications/WMD.pdf
"""
def __init__(self, W_embed, n_neighbors_try=None, scoring=None, cv=3,
n_jobs=1, verbose=False):
self.cv = cv
self.n_neighbors_try = n_neighbors_try
self.scoring = scoring
super(WordMoversKNNCV, self).__init__(W_embed,
n_neighbors=None,
n_jobs=n_jobs,
verbose=verbose)
def fit(self, X, y):
"""Fit KNN model by choosing the best `n_neighbors`.
Parameters
-----------
X : scipy.sparse matrix, (n_samples, vocab_size)
Data
y : ndarray, shape (n_samples,) or (n_samples, n_targets)
Target
"""
if self.n_neighbors_try is None:
n_neighbors_try = range(1, 6)
else:
n_neighbors_try = self.n_neighbors_try
X = check_array(X, accept_sparse='csr', copy=True)
X = normalize(X, norm='l1', copy=False)
cv = check_cv(self.cv, X, y)
knn = KNeighborsClassifier(metric='precomputed', algorithm='brute')
scorer = check_scoring(knn, scoring=self.scoring)
scores = []
for train_ix, test_ix in cv:
dist = self._pairwise_wmd(X[test_ix], X[train_ix])
knn.fit(X[train_ix], y[train_ix])
scores.append([
scorer(knn.set_params(n_neighbors=k), dist, y[test_ix])
for k in n_neighbors_try
])
scores = np.array(scores)
self.cv_scores_ = scores
best_k_ix = np.argmax(np.mean(scores, axis=0))
best_k = n_neighbors_try[best_k_ix]
self.n_neighbors = self.n_neighbors_ = best_k
return super(WordMoversKNNCV, self).fit(X, y)
```

Overwriting word_movers_knn.py

In [10]:

```
knn_cv = WordMoversKNNCV(cv=3,
n_neighbors_try=range(1, 20),
W_embed=W_common, verbose=5, n_jobs=3)
knn_cv.fit(X_train, y_train)
```

[Parallel(n_jobs=3)]: Done 12 tasks | elapsed: 30.8s [Parallel(n_jobs=3)]: Done 34 out of 34 | elapsed: 2.0min finished [Parallel(n_jobs=3)]: Done 12 tasks | elapsed: 25.7s [Parallel(n_jobs=3)]: Done 33 out of 33 | elapsed: 2.9min finished [Parallel(n_jobs=3)]: Done 12 tasks | elapsed: 53.3s [Parallel(n_jobs=3)]: Done 33 out of 33 | elapsed: 2.0min finished

Out[10]:

WordMoversKNNCV(W_embed=memmap([[ 0.04283, -0.01124, ..., -0.05679, -0.00763], [ 0.02884, -0.05923, ..., -0.04744, 0.06698], ..., [ 0.08428, -0.15534, ..., -0.01413, 0.04561], [-0.02052, 0.08666, ..., 0.03659, 0.10445]]), cv=3, n_jobs=3, n_neighbors_try=range(1, 20), scoring=None, verbose=5)

In [11]:

```
print("CV score: {:.2f}".format(knn_cv.cv_scores_.mean(axis=0).max()))
```

CV score: 0.38

In [12]:

```
print("Test score: {:.2f}".format(knn_cv.score(X_test, y_test)))
```

[Parallel(n_jobs=3)]: Done 12 tasks | elapsed: 32.2s [Parallel(n_jobs=3)]: Done 66 tasks | elapsed: 4.3min [Parallel(n_jobs=3)]: Done 156 tasks | elapsed: 12.5min [Parallel(n_jobs=3)]: Done 282 tasks | elapsed: 30.5min [Parallel(n_jobs=3)]: Done 300 out of 300 | elapsed: 48.9min finished

Test score: 0.31

Now let's see how WMD compares with some common approaches, on bag of words features. The most apples-to-apples comparison would be K nearest neighbors with a cosine similarity metric. This approach performs worse than using WMD. (All scores are accuracies.)

In [13]:

```
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.grid_search import GridSearchCV
```

In [21]:

```
knn_grid = GridSearchCV(KNeighborsClassifier(metric='cosine', algorithm='brute'),
dict(n_neighbors=list(range(1, 20))),
cv=3)
knn_grid.fit(X_train, y_train)
print("CV score: {:.2f}".format(knn_grid.best_score_))
print("Test score: {:.2f}".format(knn_grid.score(X_test, y_test)))
```

CV score: 0.34 Test score: 0.22

In [17]:

```
svc_grid = GridSearchCV(LinearSVC(),
dict(C=np.logspace(-6, 6, 13, base=2)),
cv=3)
svc_grid.fit(X_train, y_train)
print("CV score: {:.2f}".format(svc_grid.best_score_))
print("Test score: {:.2f}".format(svc_grid.score(X_test, y_test)))
```

CV score: 0.35 Test score: 0.27

WMD is much better at capturing semantic similarity between documents than cosine, due to its ability to generalize to unseen words. The SVM does somewhat better than cosine KNN, but still lacks such out-of-vocabulary generalization. Given enough data, WMD can probably improve this margin, especially using something like metric learning on top.

The exact WMD, as we have used it here, is pretty slow. This code is not optimized as much as it could be, there is potential through caching and using Cython. However, a major limitation remains the cost of actually computing the EMD. To scale even higher, exactness can be relaxed by using lower bounds. In our next post, we will compare such optimization strategies, as discussed in the WMD paper.