(Incomplete!)
In a previous notebook, we discussed how to quickly find words with meanings similar to other words. In this notebook, I demonstrate how to find words that sound like other words.
I'm going to make use of some of my recent research in phonetic similarity. The algorithm I made uses phoneme transcriptions from the CMU pronouncing dictionary along with information about articulatory/acoustic features of those phonemes to produce vector representations of the sound of every word in the dictionary.
In this notebook, I show how to make a fast approximate nearest neighbor lookup of words by their phonetic similarity. Then I show a few potential applications in creative language generation using that lookup, plus a bit of vector arithmetic.
You'll need the numpy, spacy and simpleneighbors packages to run this code.
import random
from collections import defaultdict
import numpy as np
import spacy
from simpleneighbors import SimpleNeighbors
You can download the phonetic similarity vectors using the following command:
!curl -L -s https://github.com/aparrish/phonetic-similarity-vectors/blob/master/cmudict-0.7b-simvecs?raw=true >cmudict-0.7b-simvecs
The vector file looks like this:
WARNER -1.178800 1.883123 -1.101779 -0.698869 -0.109708 -0.482693 -0.291353 1.179281 0.191032 -1.192597 -0.684268 -1.132983 0.072473 -0.626924 0.569412 -1.639735 -3.000464 -1.414111 1.806220 -1.075352 1.274347 -0.111253 0.675737 -0.579840 -1.111530 -0.960682 -1.664172 0.872162 1.311749 -0.182414 3.062428 -1.333462 1.375817 0.947289 1.699605 1.799368 2.434342 0.382153 0.383062 2.583699 -0.756335 1.862328 -0.189235 -2.033432 -0.609034 -0.782589 0.394311 -1.056266 -1.288209 0.055472
word_vecs = []
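# each line holds a word followed by its space-separated vector components
for line in open("./cmudict-0.7b-simvecs", encoding='latin1'):
    line = line.strip()
    # split only once, so the vector stays together as one string
    word, vec = line.split(maxsplit=1)
    # drop alternate-pronunciation markers such as "(1)"
    word = word.rstrip('(0123)').lower()
    vec = tuple(float(n) for n in vec.split())
    word_vecs.append((word, vec))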
len(word_vecs)
133859
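To make the parsing step concrete, here is a minimal sketch that applies the same logic to two sample lines. The helper name `parse_simvec_line` is hypothetical, and the sample lines are shortened stand-ins with made-up values (real entries carry 50 numbers each); the point is only to show how the trailing `(1)` marker on an alternate pronunciation gets stripped.

```python
def parse_simvec_line(line):
    """Split one line into (word, vector). Hypothetical helper, for illustration only."""
    word, vec = line.strip().split(maxsplit=1)
    # rstrip() removes any trailing run of the characters ( ) 0 1 2 3,
    # which drops alternate-pronunciation markers such as "(1)".
    word = word.rstrip('(0123)').lower()
    return word, tuple(float(n) for n in vec.split())

# Shortened sample lines with illustrative values, not real entries.
print(parse_simvec_line("WARNER  -1.1788 1.883123 -1.101779"))
print(parse_simvec_line("READ(1)  0.5 -0.25 1.0"))
```

Because alternate pronunciations collapse onto the same lowercase word, the list can contain the same word more than once with different vectors.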
group_by_vec = defaultdict(list)
for word, vec in word_vecs:
    group_by_vec[vec].append(word)
len(group_by_vec)
113694
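There are fewer distinct vectors than entries, presumably because words that are pronounced identically end up with identical vectors. The sketch below shows the grouping idea on toy data; the words and two-dimensional vectors are invented for illustration and are not taken from the vector file.

```python
from collections import defaultdict

# Toy data: three homophones sharing one made-up vector, plus one other word.
toy_vecs = [
    ("pair", (0.1, 0.2)),
    ("pear", (0.1, 0.2)),
    ("pare", (0.1, 0.2)),
    ("soap", (0.9, -0.4)),
]

# Tuples are hashable, so a vector can serve directly as a dictionary key.
toy_groups = defaultdict(list)
for word, vec in toy_vecs:
    toy_groups[vec].append(word)

for vec, words in toy_groups.items():
    print(vec, words)
```

This is also why the vectors are stored as tuples rather than numpy arrays at this stage: arrays cannot be used as dictionary keys.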
lookup = {}
for word, vec in word_vecs:
    # keep only the first vector seen for each word
    if word in lookup:
        continue
    lookup[word] = np.array(vec)
len(lookup)
125071
nlp = spacy.load('en_core_web_md')
nns = SimpleNeighbors(50)
for vec, words in group_by_vec.items():
    # index one representative word for each distinct vector
    sort_by_prob = sorted(words, key=lambda x: nlp.vocab[x].prob)
    nns.add_one(sort_by_prob[0], vec)
nns.build(50)
nns.nearest(lookup['parrish'])
['parrish', 'perished', 'parish', 'buresh', 'parrishes', "paris'", 'barrish', 'marish', 'cherish', 'perishing', 'garish', 'maresh']
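For intuition about what the lookup returns, here is a minimal sketch of an exact, brute-force nearest-neighbor search in plain numpy. It is not how simpleneighbors works internally (that library builds an approximate index so that lookups stay fast); the function name `exact_nearest`, the toy labels and vectors, and the choice of cosine similarity as the closeness measure are all assumptions made for illustration.

```python
import numpy as np

def exact_nearest(query, items, n=3):
    """Return the n labels whose vectors are most cosine-similar to query."""
    labels = [label for label, _ in items]
    matrix = np.array([vec for _, vec in items], dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine similarity of the query against every row at once.
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # Sort by descending similarity and keep the first n.
    order = np.argsort(-sims)[:n]
    return [labels[i] for i in order]

# Toy two-dimensional vectors, invented for illustration.
toy_items = [
    ("a", (1.0, 0.0)),
    ("b", (0.9, 0.1)),
    ("c", (0.0, 1.0)),
    ("d", (-1.0, 0.0)),
]
print(exact_nearest((1.0, 0.05), toy_items, n=2))
```

A search like this compares the query against every item, which is fine for four vectors but slow for more than a hundred thousand; that cost is the reason to use an approximate index instead.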
current = 'allison'
for i in range(50):
    print(current, end=" ")
    # skip the first result so the walk moves to a different word
    current = random.choice(nns.nearest(lookup[current])[1:])
allison allinson allison's ilalis's alissa isolate oscillate ocelot assad facade futch sutch suss tussle solicitous solicits tussle tussles suss genesis janus dishon zisson sissom cynicism nissei taisei chace chaste tastes chests sets stet stetz test's pests pastes missteps misstates pastes misstates misstates allstate's tastes pastes mists mistrust trust's strutz constructs
frost_doc = nlp(open("frost.txt").read())
output = []
for word in frost_doc:
    if word.text.lower() in lookup:
        new_word = random.choice(nns.nearest(lookup[word.text.lower()]))
        output.append(new_word)
    else:
        output.append(word.text)
    output.append(word.whitespace_)
print(''.join(output))
thuy lodes diverging inning eh colello woodward, unbend sarra eh toogood knoche travels boeve ende gyi awan traveller, lall aah stowed edmond tookes downtime urwin ass farb ige ee couldn't khuu hewell h. belt engh leitha undergrowth; then cookout the futher, as juts eye's fier, anand misbehaving hypercard uther geter claymore, because h. wah's glassey edmond wounded beware; xio ahs four's jass yother gassing geniere ahead whorl jim relay abide uther simm, edmond goethe that norling equality loye in. reeves' mono steppes hedge janardhan brakke. ayo, eh speck rather thirst form otherness deady! whet renewing haugh woy needs amman tucci byway, aux undoubted f. oooh shooed ivor cupp gapp. uhh schill bedke tailing matthes whiz oooh thigh somewheres gauges ende eases rench: toupee inroads diverged inning aue woodis, unland I— aigner put leitha one letsch raveled baye, odland sajak has mabe lall the difference.
frost_doc = nlp(open("frost.txt").read())
tint_word = 'soap'
tint_vec = lookup[tint_word]
tint_factor = 0.4
output = []
for word in frost_doc:
    if word.text.lower() in lookup:
        vec = lookup[word.text.lower()]
        # blend the word's own vector with the tint word's vector
        target_vec = (vec * (1-tint_factor)) + (tint_vec * tint_factor)
        new_word = random.choice(nns.nearest(target_vec))
        output.append(new_word)
    else:
        output.append(word.text)
    output.append(word.whitespace_)
print(''.join(output))
tope rogues survivor's in. aue yellow woodwork, anand sa aw kote topknot travel busch england soapy urwin travenol, nall aw stowed oakland choke downe youn ahs fart edge aue tooke tope wickware chipote ghent ein posa choke; them choke ertha suther, ige justo eye's serr, umland heavy soper otha mater claymore, soco's schoepf swatch grassi unland footnoted swimwear; zhou aase fornoff that judge psychopath gehr hieb sworn zemke nilly taub ertha simm, earned putsch zag phoning emotionally sope innate reeves' lomonaco stake heid radant black. oooh, oie kepp judge scherf fuhr another's jade! hait lowing hah wye ladd's nahm touche erway, i. soot efface oooh photo's ever upham tabak. aw chalet beebe sellick this' which ee saye footware outages anand rages hentz: souk rototilles sope ame i. woodwork, earned I— ae cooke schoepf one selloff sope bip, odland jap hass mib auld soak referenced.
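The "tinting" step is a weighted average of two vectors. The sketch below works through that arithmetic on toy two-dimensional vectors; the function name `tint` and the vectors are hypothetical and chosen so the result is easy to check by hand.

```python
import numpy as np

def tint(vec, tint_vec, tint_factor):
    """Blend vec toward tint_vec; a factor of 0 leaves vec unchanged, 1 returns tint_vec."""
    return vec * (1 - tint_factor) + tint_vec * tint_factor

# Toy stand-ins for a word's vector and the tint word's vector.
word_vec = np.array([1.0, 0.0])
soap_vec = np.array([0.0, 1.0])

for factor in (0.0, 0.4, 1.0):
    print(factor, tint(word_vec, soap_vec, factor))
```

With a factor of 0.4 each coordinate moves 40% of the way toward the tint vector, so the nearest neighbors of the blended vector tend to sound partly like the original word and partly like the tint word.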
from scipy.spatial.distance import cosine
def cosine_similarity(a, b):
    # scipy's cosine() returns a distance and expects 1-D inputs
    return 1 - cosine(a, b)
cosine_similarity(np.array([1,2,3]), np.array([4,5,6]))
0.9746318461970761
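As a cross-check, the same quantity can be computed directly from its definition, the dot product divided by the product of the two vector lengths. This is a sketch using only numpy; the name `cosine_similarity_np` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np

def cosine_similarity_np(a, b):
    """Cosine similarity from the definition: dot(a, b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same inputs as the scipy-based call: 32 / (sqrt(14) * sqrt(77)), about 0.9746
print(cosine_similarity_np(np.array([1, 2, 3]), np.array([4, 5, 6])))
```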
semantic_nns = SimpleNeighbors(300)
for item in nlp.vocab:
    # keep reasonably common lowercase words that have a vector
    if item.has_vector and item.prob > -15 and item.is_lower:
        semantic_nns.add_one(item.text, item.vector)
semantic_nns.build(50)
def soundalike_synonym(word, target_vec, n=5):
    # take the 50 nearest words by meaning, then rank them by sound
    return sorted(
        [item for item in semantic_nns.nearest(nlp.vocab[word].vector, 50) if item in lookup],
        key=lambda x: cosine_similarity(target_vec, lookup[x]), reverse=True)[:n]
soundalike_synonym('mastodon', lookup['soap'])
['chimp', 'hippo', 'toad', 'platypus', 'shark']
semantic_nns.nearest(nlp.vocab['mastodon'].vector, 5)
['velociraptor', 'dinosaur', 'caveman', 'dino', 'skeleton']
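The difference between the two results above comes from a two-stage process: first gather candidates that are close in meaning, then re-rank those candidates by how close they sound to a target. Here is a minimal, self-contained sketch of the re-ranking stage on toy data; `rerank_by_sound`, the candidate list, and the two-dimensional "sound" vectors are all hypothetical.

```python
import numpy as np

def rerank_by_sound(candidates, sound_vecs, target_vec, n=2):
    """Keep candidates that have a sound vector, ordered by similarity to target_vec."""
    def sim(word):
        v = sound_vecs[word]
        return np.dot(v, target_vec) / (np.linalg.norm(v) * np.linalg.norm(target_vec))
    # Candidates without a sound vector are dropped, as in the lookup check above.
    known = [w for w in candidates if w in sound_vecs]
    return sorted(known, key=sim, reverse=True)[:n]

# Toy sound vectors; "missing" stands for a word absent from the phonetic lookup.
toy_sound_vecs = {
    "x": np.array([1.0, 0.0]),
    "y": np.array([0.0, 1.0]),
    "z": np.array([0.7, 0.7]),
}
candidates = ["x", "y", "z", "missing"]
print(rerank_by_sound(candidates, toy_sound_vecs, np.array([0.0, 1.0])))
```

The size of the candidate pool sets the trade-off: a larger pool gives the sound ranking more to choose from, at the cost of drifting further from the original meaning.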
target_vec = lookup['green']
words = random.sample(semantic_nns.corpus, 16)
for item in words:
    print(item, "→", soundalike_synonym(item, target_vec, 1)[0])
fog → grille
willingly → gladly
tolerates → gravitate
casino → grand
farmland → graze
micromanage → discretionary
grok → query
crappy → crap
arguably → greatest
naughty → brunette
prior → preceding
encountered → initially
dandruff → tanning
gendered → transcends
airborne → aircraft
natures → glamour
frost_doc = nlp(open("frost.txt").read())
target_word = 'soap'
target_vec = lookup[target_word]
output = []
for word in frost_doc:
    # only replace content words that have a phonetic vector
    if word.is_alpha \
            and word.pos_ in ('NOUN', 'VERB', 'ADJ') \
            and word.text.lower() in lookup:
        new_word = random.choice(soundalike_synonym(word.text.lower(), target_vec))
        output.append(new_word)
    else:
        output.append(word.text)
    output.append(word.whitespace_)
print(''.join(output))
Two motorists emerged in a silver spruce, And sorry I could not tours both And not one oasis, long I looked And thought down one as far as I not To where it crook in the foliage; Then stopped the particular, as just as but, And thought perhaps the make purported, Because it did tree and chose duds; Though as for that the turning there took pajama them really about the not, And both that night equally stood In fig no take took strut violet. Oh, I stopped the same for another summer! Yet knows how it resulted on to take, I admit if I not ever say back. I abide also surprised this with a sigh Somewhere child and mos hence: Two crossing transformed in a laminate, and I— I saw the same less embark by, And that could it most the because.