A *portmanteau* is a word that squishes together two words, like math + athlete = mathlete. Inspired by Darius Bacon, I covered this as a programming exercise in my 2012 Udacity course. In 2018 I was re-inspired by Tom Murphy VII, who added a new twist: *portmantout words* (*tout* from the French for all), which are defined as:
A portmantout of a set of words W is a string S such that:
- Every word in W is a substring of S.
- The words overlap: every word (except the first) starts at an index ≤ the end of the previous word.
- Nothing else is in S: every letter in S comes from the overlapping words.
Note that a word may appear more than once within S. Although not part of the definition, the goal is to get as short an S as possible, and to do it for a set W of over 100,000 words. This notebook develops a program that found these portmanteaux:
My intuition is that finding a shortest S is an NP-hard problem, and with 100,000 words to cover, it is unlikely that I can find the shortest possible solution in a reasonable amount of time. A common approach to NP-hard problems is a greedy algorithm: make the locally best choice at each step, in the hope that the steps will fit together into a solution that is not too far from the best solution.
Thus, my approach will be to build up a path, starting with one word, and then adding steps to the end of the evolving path, one at a time. Each step consists of a word from W that overlaps the end of the previous word by at least one letter. I will choose the step that seems to be the best choice at the time (the one that minimizes the number of excess letters added to the path, and will never undo a step, even if the path seems to get stuck later on. I distinguish two types of steps:
There's actually a third type of word, but it doesn't need a corresponding type of step:
ajar
is in W, then we know we have to place it in some step along the path. But if jar
is also in W, we don't need a separate step for it—whenever we place ajar
, we have automatically placed jar
. We can save computation time by initializing the unused words to be the nonsubwords in W.A subword will never be added as an unused word, but it may be used as a bridging word. (Note: I use the clumsy term "nonsubword" rather than "superword", because there are a small number of words, like "cozy" and "july," that are neither subwords nor superwords.)
Here is the exact definition of the metric we are trying to minimize:
Examples: In each row of the table below, 'ajar'
is the previous word, but each row makes different assumptions about what unused words remain, and thus we get different choices for the step to take. The table shows the overlapping letters between the previous word and the step, and in the case of bridges, it shows the next unused word that the step is bridging to. The final column shows the excess score (and letters).
Previous | Step(s) | Overlap | Bridge to | Type of Step | Excess |
---|---|---|---|---|---|
ajar | jarring | jar | unused word | -3 | |
ajar | arbitrary | ar | unused word | -2 | |
ajar | rabbits | r | unused word | -1 | |
ajar | argot | ar | goths | one-step bridge | 0 |
ajar | arrow | ar | owlets | one-step bridge | 1 (r) |
ajar | rani, iraq | r | quizzed | two-step bridge | 5 (anira) |
Let's go over the examples:
jarring
is an unused word, and it overlaps by 3 letters, giving it an excess cost of -3, which is the best possible (an overlap of 4 would mean ajar
is a subword, and we already agreed to eliminate subwords).quizzed
is the only remaining unused word. There is no single word that bridges from any suffix of ajar
to any prefix of quizzed
. But rani
can bridge from 'r'
to 'i'
and iraq
can bridge from 'i'
to 'q'
. This two-word bridge has an excess score of 5 due to the letters anira
not overlapping anything.We see that unused word steps always have a negative excess cost (that's good) while bridge steps always have a zero or positive excess cost; thus an unused word step is always better than a bridge step (according to this metric).
Here I describe how to implement the main data types in Python:
str
(as are subparts of words, like suffixes or individual letters).set
, denoting a set of words, plus some cached attributes.list
of steps.jarring
to ajar
is Step(3, 'jarring')
.Bridge(1, [Step(2, 'arrow')])
.Here are two example bridges:
W.bridges['ar']['ow'] == Bridge(1, [Step(2, 'arrow')])
W.bridges['r']['q'] == Bridge(5, [Step(1, 'rani'), Step(1, 'iraq')])
Here are the data types:
from collections import defaultdict, Counter, namedtuple
from typing import List, Tuple, Set, Dict, Any
class Wordset(set): """A set of words."""
Word = str
Step = namedtuple('Step', 'overlap, word')
Bridge = namedtuple('Bridge', 'excess, steps')
Path = List[Step]
Bridges = Dict[str, Dict[str, Bridge]]
I originally thought I would define a major function, portman
, to generate the portmantout string S from the set of words W, and a minor function, is_portman
, to verify the result. But verification was difficult. For example, S = 'helloworld'
, would be rejected as non-overlapping if parsed as 'hello'
+ 'world'
, but accepted if parsed as 'hello'
+ 'low'
+ 'world'
. It was hard for is_portman
to decide which parse was intended, which is a shame because portman
knew which was intended, as it built up the path of steps, but didn't return the path.
Therefore, I decided on the following calling and naming conventions:
P = natalie(W: Wordset) # Find a portmantout path P for a Wordset W
S = portman(P: Path) # Compute the string S from the path P
is_portman(P: Path, W: Wordset) # Check whether P is a valid path covering W
Thus I can generate a string S with:
S = portman(natalie(W))
portman
and is_portman
¶Here we define the functions portman
and is_portman
, and the tiny examples W1
, P1
, and S1
:
def portman(P: Path) -> Word:
"""Compute the portmantout string S from the path P."""
return ''.join(word[overlap:] for (overlap, word) in P)
def is_portman(P: Path, W: Wordset) -> bool:
"""Is the Path P a portmantout of the Wordset W?"""
S = portman(P)
return (all(word in S for word in W) # 1. Every word in W is a substring of S
and all(step.overlap > 0 # 2. The words overlap
for step in P[1:])
and all(step.word in S # 3. Nothing else is in S
for step in P))
W1 = Wordset({'anarchy', 'dashiki', 'grammarian', 'kimono', 'monogram',
'a', 'am', 'an', 'arc', 'arch', 'aria', 'as', 'ash', 'dash', 'gram',
'grammar', 'i', 'mar', 'maria', 'mono', 'narc', 'no', 'on', 'ram'})
P1 = [Step(0, 'dashiki'),
Step(2, 'kimono'),
Step(4, 'monogram'),
Step(4, 'grammarian'),
Step(2, 'anarchy')]
S1 = portman(P1)
S1
is_portman(P1, W1)
W
: Murphy's Wordset of 108,709 words¶We can make Tom Murphy's 108,709 word file "wordlist.asc"
into a Wordset
called W
:
! [ -e wordlist.asc ] || curl -O https://norvig.com/ngrams/wordlist.asc
W = Wordset(open('wordlist.asc').read().split())
len(W)
108709
natalie
¶The function natalie
does a greedy search for a portmantout path. As stated above, the approach is to start with a path of one word (either given as an optional argument or chosen arbitrarily from the word set W), and then repeatedly add steps, each step being either an unused_step
or one of bridging_steps
.
def natalie(W: Wordset, start=None) -> Path:
"""Return a portmantout path containing all words in W."""
cache_attributes(W)
P = add_step([], W, Step(0, start or first(W.unused_words)))
while W.unused_words:
for step in unused_step(W, P[-1].word) or bridging_steps(W, P[-1].word):
P = add_step(P, W, step)
return P
unused_step
and bridging_steps
¶unused_step
considers every suffix of the previous word, longest suffix first. If a suffix starts an unused words, we choose it. Since we're going longest-suffix first, no other word could do better.
bridging_steps
also considers every suffix of the previous word, and for each one it looks in the W.bridges[suf]
table (see below) to see what prefixes (of unused words) we can bridge to from this suffix. Consider all such W.bridges[suf][pre]
entries that bridge to the prefix of an unused word (as maintained in W.startswith[pre]
). Out of all such bridges, take one with the minimal excess cost, and return the steps that make up the bridge.
def unused_step(W: Wordset, prev_word: Word) -> List[Step]:
"""Return [Step(overlap, unused_word)] or []."""
for suf in suffixes(prev_word):
unused_word = first(W.startswith.get(suf, ()))
if unused_word:
return [Step(len(suf), unused_word)]
return []
def bridging_steps(W: Wordset, prev_word: Word) -> List[Step]:
"""The steps from the shortest bridge that bridges
from a suffix of prev_word to a prefix of any unused word."""
bridges = [W.bridges[suf][pre]
for suf in suffixes(prev_word) if suf in W.bridges
for pre in W.bridges[suf] if W.startswith[pre]]
if not bridges: print('prev_word', prev_word)
return min(bridges).steps
(Python trivia: in unused_step
I do W.startswith.get(suf, ())
, not W.startswith[suf]
because the dict in question is a defaultdict(set)
, and if there is no entry there, I don't want to insert an empty set entry.)
cache_attributes
and add_step
¶To make this efficient, cache_attributes(W)
caches the following information:
W.subwords
: a set of all the words that are contained within another word in W
.W.shortwords
: a set of short words used to build bridges.W.unused_words
: initially the set of nonsubwords in W
; when a word is used it is removed from the set.W.bridges
: a dict where W.bridges[suf][pre]
gives the best bridge between the affixes.W.startswith
: a dict that maps from a prefix to all the unused words that start with the prefix. A word is removed from all the places it appears when it is used. Example: W.startswith['somet'] == {'something', 'sometimes'}
.These structures are complicated, so don't be discouraged if you have to go over the code several times.
def cache_attributes(W, maxlen=5, end_letters='qujvz'):
"""Precompute and cache data structures on attributes of W:
.subwords, .shortwords, and .bridges are computed once and not changed.
.unused_words and .startswith are recomputed on each call to `natalie`,
and are updated by calls to `add_step`."""
if not hasattr(W, 'bridges'):
W.subwords = subwords(W)
W.shortwords = {w for w in W if len(w) <= maxlen + (w[-1] in end_letters)}
W.bridges = build_bridges(W)
W.unused_words = W - W.subwords
W.startswith = startswith_table(W.unused_words)
def add_step(P, W, step) -> Path:
"""Add step to P; remove word from `W.unused_words` and `W.startswith[pre] for each pre`."""
P.append(step)
word = step.word
assert word in W, f'attempt to add "{word}", which is not in the word set'
if word in W.unused_words:
W.unused_words.remove(word)
for pre in prefixes(word):
W.startswith[pre].remove(word)
if not W.startswith[pre]:
del W.startswith[pre]
return P
def multimap(pairs) -> Dict[Any, set]:
"""Given (key, val) pairs, make a dict of {key: {val,...}}."""
result = defaultdict(set)
for key, val in pairs:
result[key].add(val)
return result
def startswith_table(words) -> Dict[str, Set[Word]]:
"""A dict mapping a prefix to all the words it starts:
{'somet': {'something', 'sometimes'},...}."""
return multimap((pre, w) for w in words for pre in prefixes(w))
def subwords(W: Wordset) -> Set[str]:
"""All the words in W that are subparts of some other word."""
return {subword for w in W for subword in subparts(w) & W}
def suffixes(word) -> List[str]:
"""All non-empty proper suffixes of word, longest first."""
return [word[i:] for i in range(1, len(word))]
def prefixes(word) -> List[str]:
"""All non-empty proper prefixes of word."""
return [word[:i] for i in range(1, len(word))]
def subparts(word) -> Set[str]:
"""All non-empty proper substrings of word"""
return {word[i:j]
for i in range(len(word))
for j in range(i + 1, len(word) + 1)} - {word}
def first(iterable) -> object:
"""The first element in an iterable, or None"""
return next(iter(iterable), None)
(Math trivia: In this context, "proper" means "not whole". A proper subset is a subset that is not the whole set itself; a proper substring of a word is a substring that is not the whole word itself.)
The last piece of the program is the construction of the W.bridges
table. Recall that we want W.bridges[suf][pre]
to be a bridge between a suffix of the previous word and a prefix of an unused word, as in the examples:
W.bridges['ar']['ow'] == Bridge(1, [Step(2, 'arrow')])
W.bridges['ar']['c'] == Bridge(0, [Step(2, 'arc')])
W.bridges['r']['q'] == Bridge(5, [Step(1, 'rani'), Step(1, 'iraq')])
We build all the bridges in cache_attributes
, and don't update them as words are used. Thus, W.bridges['r']['q']
says "if there are any unused words starting with 'q'
, you can use this bridge, but I'm not promising there are any." The caller is responsible for checking that W.startswith['q']
contains unused word(s).
Bridges should be short. We don't need to consider antidisestablishmentarianism
as a possible bridge word. Instead, from our 108,709 word set W, we'll use W.shortwords
: 10,273 words with length up to 5, plus six-letter words that end in any of 'qujvz', the rarest letters (there are 20 of these). I also compute a shortstartswith
table for the shortwords
, where, for example,
shortstartswith['som'] == {'soma', 'somas', 'some'} # but not 'somebodies', 'something', ...
To build one-word bridges, consider every shortword, and split it up in all possible ways into a prefix that will overlap the previous word, a suffix that will overlap the next word, and a count of zero or more excess letters in the middle that don't overlap anything.
def splits(word) -> List[Tuple[int, str, str]]:
"""A sequence of (excess, pre, suf) tuples."""
return [(word[:i], excess, word[i+excess:])
for excess in range(len(word) - 1)
for i in range(1, len(word) - excess)]
splits('arrow')
[('a', 0, 'rrow'), ('ar', 0, 'row'), ('arr', 0, 'ow'), ('arro', 0, 'w'), ('a', 1, 'row'), ('ar', 1, 'ow'), ('arr', 1, 'w'), ('a', 2, 'ow'), ('ar', 2, 'w'), ('a', 3, 'w')]
The first element of the list says that 'arrow'
can bridge from 'a'
to 'rrow'
with 0 excess letters; the last says it can bridge from 'a'
to 'w'
with 3 excess letters (which happen to be 'rro'
). Each possible split is passed on to build_bridge
, which records the bridge in the table under bridges[pre][suf]
unless there is already a shorter bridge stored there.
def build_bridge(bridges, word, pre, excess, suf, step2=None):
"""Store a new bridge if it has less excess than the previous bridges[pre][suf]."""
if suf not in bridges[pre] or excess < bridges[pre][suf].excess:
steps = [Step(len(pre), word)]
if step2: steps.append(step2)
bridges[pre][suf] = Bridge(excess, steps)
Now for two-word bridges. I thought that if I allowed all possible two-word bridges the program would be slow because there would be so many of them, and most of them would be too long to be of any use. Thus, I decided to only use two-word bridges that bridge from the last letter in the previous word to the first letter in an unused word.
We start out the same way, looking at every shortword. But this time we look at every suffix of each shortword, and see if the suffix starts another shortword. If it does, then we have a two-word bridge. Here's the complete build_bridges
function:
def build_bridges(W: Wordset):
"""A table of bridges[pre][suf] == Bridge(excess, [Step(overlap, word)]), e.g.
bridges['ar']['c'] == Bridge(0, [Step(2, 'arc')])."""
bridges = defaultdict(dict)
shortstartswith = startswith_table(W.shortwords)
# One-word bridges
for word in W.shortwords:
for split in splits(word):
build_bridge(bridges, word, *split)
# Two-word bridges
for word1 in W.shortwords:
for suf in suffixes(word1):
for word2 in shortstartswith[suf]:
excess = len(word1) + len(word2) - len(suf) - 2
A, B = word1[0], word2[-1] # First and last letters
if A != B: # No sense bridging from A to A
step2 = Step(len(suf), word2)
build_bridge(bridges, word1, A, excess, B, step2)
return bridges
Is natalie
guaranteed to terminate? Every iteration either uses up an unused word, or builds a bridge to an unused word that will be used on the next iteration. So, eventually all the words will be used and natalie
will return a solution. The only way this can fail to happen is if there is no bridge to an unused word. I can prove that this can't happen if I can verify that there is a bridge from every one-letter suffix to every one-letter prefix. The function missing_bridges
checks for this.
def missing_bridges(W, alphabet='abcdefghijklmnopqrstuvwxyz'):
"""What 1-to-1-letter bridges are missing from W.bridges?"""
return {(A, B) for A in alphabet for B in alphabet
if A != B and B not in W.bridges[A]}
cache_attributes(W)
assert not missing_bridges(W)
Great! W has no missing bridges. But the tiny W1 is missing 623 out of 26 × 25 = 650 1-to-1-letter bridges:
cache_attributes(W1)
len(missing_bridges(W1))
623
Finally! We're ready to make portmantouts. First for the tiny word set W1
, for which we must carefully choose the starting word:
natalie(W1, start='dashiki')
[Step(overlap=0, word='dashiki'), Step(overlap=2, word='kimono'), Step(overlap=4, word='monogram'), Step(overlap=4, word='grammarian'), Step(overlap=2, word='anarchy')]
portman(natalie(W1, start='dashiki'))
'dashikimonogrammarianarchy'
Now for the big word set W
:
%time P = natalie(W)
CPU times: user 6.81 s, sys: 25.6 ms, total: 6.84 s Wall time: 6.84 s
I thought it might take 10 minutes, so under 10 seconds is great!
S = portman(P)
len(P), len(S)
(103178, 553669)
The portmantout is about 100,000 steps and a half-million letters long.
Notice I haven't actually looked at the portmantout yet. I didn't want to dump half a million letters into an output cell. Instead, I'll define report
to print various statistics, summarize the begin and end of the portmantout, and save the full string S into the file natalie.txt.
def report(W, P, steps=100, letters=1000, save='natalie.txt'):
S = portman(P)
sub = W.subwords
nonsub = W - sub
uniq = {step.word for step in P} # unique step words in P
bridge = len(P) - len(nonsub) # number of bridge steps in P
bridges = sum(len(W.bridges[pre]) for pre in W.bridges) # number of bridges in W
def L(words): return sum(map(len, words)) # Number of letters
print(f'W has {len(W):,d} words ({len(nonsub):,d} nonsubwords; {len(sub):,d} subwords).')
print(f'P has {len(P):,d} steps ({len(uniq):,d} unique words; {bridge:,d} bridge words).')
print(f'S has {len(S):,d} letters; W has {L(W):,d}; nonsubs have {L(nonsub):,d}.')
print(f'P has an average overlap of {(L(s.word for s in P)-len(S))/(len(P)-1):.2f} letters.')
print(f'S has a compression ratio (letters(W)/letters(S)) of {L(W)/len(S):.2f}.')
print(f'P (and thus S) is {"" if is_portman(P, W) else "NOT "}a valid portmantout of W.')
print(f'W has {bridges:,d} bridges from {len(W.shortwords):,d} shortwords, '
f'and {len(missing_bridges(W))} missing 1-to-1-letter bridges.')
open(save, "w").write(S)
print(f'S saved as the file "{save}".')
items = ['\n...' if w is ... else w[:i] + '⋅' + w[i:]
for i, w in P[:steps] + [(..., ...)] + P[-steps:]]
print(f'\nThe first and last {letters} letters:\n\n{S[:letters]}\n...{S[-letters:]}')
print(f'\nThe first and last {steps} steps:\n\n{", ".join(items)[1:]}')
The step Step(1, 'sir')
is printed as s⋅ir
to indicate that s
is the 1-letter overlap.
I will redefine is_portman
to be faster. Python trivia: if X, Y
and Z
are sets, X <= Y <= Z
means "is X
a subset of Y
and Y
a subset of Z
?" We use the notation here to say that the set of words in P must contain all the nonsubwords and can only contain words from W.
def is_portman(P: Path, W: Wordset) -> str:
"""Verify that P forms a valid portmantout string for W."""
all_words = (W - W.subwords) <= set(step.word for step in P) <= W
overlaps = all((overlap > 0 and P[i - 1].word[-overlap:] == word[:overlap])
for i, (overlap, word) in enumerate(P[1:], 1)) and P[0].overlap == 0
return all_words and overlaps
report(W, P)
W has 108,709 words (64,389 nonsubwords; 44,320 subwords). P has 103,178 steps (65,051 unique words; 38,789 bridge words). S has 553,669 letters; W has 931,823; nonsubs have 595,805. P has an average overlap of 1.65 letters. S has a compression ratio (letters(W)/letters(S)) of 1.68. P (and thus S) is a valid portmantout of W. W has 56,477 bridges from 10,293 shortwords, and 0 missing 1-to-1-letter bridges. S saved as the file "natalie.txt". The first and last 1000 letters: circumspectionicitywidelysiannealerstwhiledinburgherselfnessentiallylsubtenanciescortinglesbianschlussrefectionisinglassertionstagehandsomenessayingsandpileupstairstreamiestrousseaustrianswersatzestfulnessesquicentenniallyricistsktskedgingivalutassellingulauncheddarseniousheredityrannosaursinewyvernshortcakestrelsesquipedalianastigmatickingsidelinesmanganesiannihilatorsoesophagussiestashesitancestriesteemskulledificesspoolstuntsaristsarismskepticsanatoriumswappingrassessableaterserializingersarsaparillaserdiskswallowingspansophiesteeminglershortchangesticulationshoredistrictsarinasmuchnessencesuraeronauticallysergichthyologynecologicallyriformaldehydescriberserksamphirescindeduciblentossupshiftsimmeshinglessenedithersiticalvingloriouslylysinecurescheduledemasculinizedscullerythrocytestatrixescritoiresharpenedginessentiallopathsolariatediouslyerbassettinglershorteddieselsynspiraeastboundlessnessayeditorialistablerstillbirthstoneselflessnessayistsardomshrubbiersnakierasurestringingeredis ...ymenoiraquothaquandariesiraquarreledeliraquieteningorkiraquainteraniraquartzesiraquarrelersiraquarterdecksiraquondamaquickiesiraquagsiraquailedeliraqindarsiraquarrellersiraquartilesiraqophsiraquailsiraquantityoniraquaffersiraquakersiraquiversiraquartsiraquakedeliraquotidianoiraquietensiraquantifiesiraquackishlyoniraquaintnessiraquebecolloquorumsiraquilledeliraquittorsiraquirksiraquicksilveraniraquicklimemiraquantitiesiraquackieraniraquotersiraquintalsiraquintillionthsiraquietusesiraquagmiryoniraquintanoiraquackismsiraquotationallyoniraquantizedeliraqueeringorkiraquantedeliraquackingorkiraquarantinablemiraquarriesiraquackedeliraquarterbacksiraquakiestuqueuesiraquoinedeliraquailingorkiraquintillionsiraquahogsiraquiveringlyoniraquaveredeliraquaaludesiraquakerismaquintuplicatingorkiraquininesiraquichesiraquickeningorkiraquackyoniraquackeryoniraquaggyoniraquakieraniraquixoticallyoniraquintupledeliraquirkinessiraquincunxesiraquietismsiraquirkedeliraquittancesiraquoinsiraqueazyoniraquantified The first and last 100 steps: circumspection, ion⋅icity, city⋅wide, wide⋅ly, ely⋅sian, an⋅nealers, ers⋅twhile, while⋅d, ed⋅inburgh, burgh⋅ers, hers⋅elf, self⋅ness, ess⋅entially, ally⋅ls, s⋅ubtenancies, es⋅corting, ting⋅les, les⋅bians, ans⋅chluss, uss⋅r, r⋅efection, ion⋅ising, ising⋅lass, glass⋅er, asser⋅tions, ons⋅tage, stage⋅hands, hands⋅omeness, ess⋅aying, saying⋅s, s⋅andpile, pile⋅ups, ups⋅tairs, airs⋅tream, stream⋅iest, est⋅rous, trous⋅seaus, aus⋅trians, ans⋅wers, ers⋅atzes, zes⋅tfulness, fulness⋅es, ses⋅quicentennially, ly⋅ricists, ts⋅ktsked, ked⋅ging, ging⋅ival, val⋅utas, tas⋅selling, ling⋅ula, la⋅unched, ched⋅dars, ars⋅enious, us⋅hered, hered⋅ity, ty⋅rannosaurs, urs⋅ine, sine⋅wy, wy⋅verns, s⋅hortcakes, kes⋅trels, els⋅es, ses⋅quipedalian, lian⋅as, anas⋅tigmatic, tic⋅kings, kings⋅ide, side⋅lines, lines⋅man, man⋅ganesian, an⋅nihilators, tors⋅oes, oes⋅ophagus, gus⋅sies, sies⋅tas, stas⋅hes, hes⋅itance, ance⋅stries, es⋅teems, s⋅kulled, ed⋅ifices, ces⋅spools, s⋅tunts, ts⋅arists, ts⋅arisms, s⋅keptics, s⋅anatoriums, s⋅wapping, ping⋅rasses, asses⋅sable, ble⋅aters, ters⋅er, ser⋅ializing, zing⋅ers, s⋅arsaparillas, las⋅erdisks, s⋅wallowing, wing⋅spans, pans⋅ophies, es⋅teeming, ..., g⋅orki, i⋅raq, q⋅uanted, d⋅eli, i⋅raq, q⋅uacking, g⋅orki, i⋅raq, q⋅uarantinable, e⋅mir, ir⋅aq, q⋅uarries, s⋅ir, ir⋅aq, q⋅uacked, d⋅eli, i⋅raq, q⋅uarterbacks, s⋅ir, ir⋅aq, q⋅uakiest, t⋅uque, que⋅ues, s⋅ir, ir⋅aq, q⋅uoined, d⋅eli, i⋅raq, q⋅uailing, g⋅orki, i⋅raq, q⋅uintillions, s⋅ir, ir⋅aq, q⋅uahogs, s⋅ir, ir⋅aq, q⋅uiveringly, y⋅oni, i⋅raq, q⋅uavered, d⋅eli, i⋅raq, q⋅uaaludes, s⋅ir, ir⋅aq, q⋅uakerism, m⋅aqui, qui⋅ntuplicating, g⋅orki, i⋅raq, q⋅uinines, s⋅ir, ir⋅aq, q⋅uiches, s⋅ir, ir⋅aq, q⋅uickening, g⋅orki, i⋅raq, q⋅uacky, y⋅oni, i⋅raq, q⋅uackery, y⋅oni, i⋅raq, q⋅uaggy, y⋅oni, i⋅raq, q⋅uakier, r⋅ani, i⋅raq, q⋅uixotically, y⋅oni, i⋅raq, q⋅uintupled, d⋅eli, i⋅raq, q⋅uirkiness, s⋅ir, ir⋅aq, q⋅uincunxes, s⋅ir, ir⋅aq, q⋅uietisms, s⋅ir, ir⋅aq, q⋅uirked, d⋅eli, i⋅raq, q⋅uittances, s⋅ir, ir⋅aq, q⋅uoins, s⋅ir, ir⋅aq, q⋅ueazy, y⋅oni, i⋅raq, q⋅uantified
The program is complete, but there are still many interesting things to explore, and questions to answer.
Question: is there an imbalance in starting and ending letters of words? That could lead to a need for many two-word bridges. We saw in the last 100 steps of P multiple repetitions of the two-word bridge "s⋅ir, ir⋅aq". That suggests there are too many words that end in "s" and too many that start with "q". Let's investigate:
cache_attributes(W)
len(W.unused_words)
64389
starts = Counter(w[0] for w in W.unused_words)
ends = Counter(w[-1] for w in W.unused_words)
def ratio(L) -> str:
"""Approximate ratio of words that start with L to words that end with L."""
s, e = starts[L], ends[L]
return f'{round(s/e)}:1' if (s > e and e != 0) else f'1:{round(e/s)}'
print('Letter Starts Ends Ratio')
print('------ ------ ------ -----')
for L in sorted(starts):
print(f'{L:>5} {starts[L]:6,d} {ends[L]:6,d} {ratio(L):>5}')
Letter Starts Ends Ratio ------ ------ ------ ----- a 3,528 384 9:1 b 3,776 6 629:1 c 5,849 908 6:1 d 4,093 7,520 1:2 e 2,470 3,215 1:1 f 2,794 51 55:1 g 2,177 6,343 1:3 h 2,169 351 6:1 i 2,771 128 22:1 j 638 0 1:0 k 566 157 4:1 l 1,634 1,182 1:1 m 3,405 657 5:1 n 1,542 1,860 1:1 o 1,797 113 16:1 p 4,977 123 40:1 q 330 0 1:0 r 3,811 1,994 2:1 s 7,388 29,056 1:4 t 3,097 2,107 1:1 u 2,557 11 232:1 v 1,032 6 172:1 w 1,561 42 37:1 x 51 68 1:1 y 207 8,086 1:39 z 169 21 8:1
Yes, there is a problem: there are many more words that start with b
, f
, p
, u
, u
and v
than that end with those letters. In the other direction 45% of all words end in s
, but only a quarter of that number start with s
. The start:end ratio for y
is 1:39.
ends['s'] / len(W.unused_words)
0.451257202317166
Question: what are the most common words in P? These will be bridge words. What do they have in common?
Counter(step.word for step in P).most_common(25)
[('sac', 3172), ('so', 2212), ('lyre', 1655), ('of', 1651), ('dab', 1622), ('gab', 1498), ('sun', 1491), ('sin', 1427), ('sip', 1214), ('yam', 1076), ('sew', 1000), ('lye', 730), ('spa', 602), ('gun', 500), ('erst', 486), ('yen', 471), ('type', 463), ('go', 401), ('econ', 399), ('she', 395), ('semi', 356), ('yep', 331), ('gap', 328), ('sex', 317), ('simp', 317)]
Indeed, bridging away from s
is a big concern (half of the top dozen bridges). Even though sir
and iraq
dominated the last 50 steps, that's not true of P
overall. Also, lyre
and lye
bridge away from an adverb ending, ly
.
I'm surprised that of
shows up so frequently. Let's see what it is bridging from:
Counter(P[i-1].word for i, step in enumerate(P) if step.word == 'of').most_common()
[('so', 1210), ('go', 203), ('do', 190), ('to', 31), ('maleficio', 1), ('whereto', 1), ('mexico', 1), ('vitro', 1), ('modulo', 1), ('monaco', 1), ('poco', 1), ('proximo', 1), ('pronto', 1), ('vulgo', 1), ('puerto', 1), ('pizzicato', 1), ('furioso', 1), ('fresno', 1), ('franco', 1), ('francisco', 1), ('fortissimo', 1)]
We see that of
is used in two-word bridges with so
, go
, do
and to
to bridge away from four letters with a surplus of ends-with over starts-with.
Question: What is the distribution of word lengths?
Counter(sorted(map(len, W.unused_words))) # Counter of word lengths
Counter({3: 2, 4: 186, 5: 1796, 6: 4364, 7: 8672, 8: 11964, 9: 11950, 10: 8443, 11: 6093, 12: 4423, 13: 2885, 14: 1765, 15: 1017, 16: 469, 17: 198, 18: 91, 19: 33, 20: 22, 21: 9, 22: 4, 23: 2, 28: 1})
Question: What is the longest word?
max(W, key=len)
'antidisestablishmentarianism'
Question: What is the distribution of letters in the Wordset?
Counter(L for w in W.unused_words for L in w).most_common() # Counter of letters
[('e', 68038), ('s', 60080), ('i', 53340), ('a', 43177), ('n', 42145), ('r', 41794), ('t', 38093), ('o', 35027), ('l', 32356), ('c', 23100), ('d', 22448), ('u', 19898), ('g', 17815), ('p', 16128), ('m', 16062), ('h', 12673), ('y', 11889), ('b', 11581), ('f', 7885), ('v', 5982), ('k', 4892), ('w', 4880), ('z', 2703), ('x', 1677), ('j', 1076), ('q', 1066)]
Question: How many bridges are there?
# Make a list of all bridges, B
B = [W.bridges[suf][pre] for suf in W.bridges for pre in W.bridges[suf]]
len(B)
56477
B[::2000] # Sample every 2000th bridge
[Bridge(excess=0, steps=[Step(overlap=1, word='umbel')]), Bridge(excess=0, steps=[Step(overlap=1, word='circe')]), Bridge(excess=0, steps=[Step(overlap=1, word='pin')]), Bridge(excess=1, steps=[Step(overlap=1, word='frosh')]), Bridge(excess=0, steps=[Step(overlap=1, word='top')]), Bridge(excess=1, steps=[Step(overlap=1, word='mixed')]), Bridge(excess=0, steps=[Step(overlap=1, word='ben')]), Bridge(excess=0, steps=[Step(overlap=2, word='beer')]), Bridge(excess=1, steps=[Step(overlap=1, word='spurs')]), Bridge(excess=0, steps=[Step(overlap=2, word='boffo')]), Bridge(excess=0, steps=[Step(overlap=1, word='yuks')]), Bridge(excess=1, steps=[Step(overlap=1, word='herby')]), Bridge(excess=0, steps=[Step(overlap=1, word='kiwis')]), Bridge(excess=1, steps=[Step(overlap=2, word='micro')]), Bridge(excess=1, steps=[Step(overlap=1, word='idles')]), Bridge(excess=0, steps=[Step(overlap=3, word='lingo')]), Bridge(excess=0, steps=[Step(overlap=3, word='lulu')]), Bridge(excess=1, steps=[Step(overlap=2, word='kill')]), Bridge(excess=0, steps=[Step(overlap=2, word='joker')]), Bridge(excess=0, steps=[Step(overlap=3, word='swop')]), Bridge(excess=1, steps=[Step(overlap=2, word='yang')]), Bridge(excess=1, steps=[Step(overlap=3, word='dater')]), Bridge(excess=0, steps=[Step(overlap=4, word='notre')]), Bridge(excess=0, steps=[Step(overlap=3, word='biffy')]), Bridge(excess=1, steps=[Step(overlap=3, word='eland')]), Bridge(excess=0, steps=[Step(overlap=4, word='aglee')]), Bridge(excess=0, steps=[Step(overlap=4, word='forma')]), Bridge(excess=0, steps=[Step(overlap=4, word='lapel')]), Bridge(excess=0, steps=[Step(overlap=4, word='kafir')])]
Question: How many excess letters do the bridges have?
# Counter of bridge excess letters
BC = Counter(b.excess for b in B)
BC
Counter({0: 37189, 1: 16708, 2: 2425, 3: 95, 5: 21, 4: 32, 6: 6, 8: 1})
from statistics import mean
mean(BC.elements())
0.3916638631655364
Question: How many 1-step and 2-step bridges are there?
Counter(len(b.steps) for b in B)
Counter({1: 56327, 2: 150})
Question: What strange letter combinations are there? Let's look at two-letter suffixes or prefixes that only appear in one or two nonsubwords.
{pre: W.startswith[pre] # Rare two-letter prefixes
for pre in W.startswith if len(pre) == 2 and len(W.startswith[pre]) <= 2}
{'tc': {'tchaikovsky'}, 'hd': {'hdqrs'}, 'qa': {'qaids', 'qatar'}, 'uf': {'ufos'}, 'qo': {'qophs'}, 'ik': {'ikebanas', 'ikons'}, 'if': {'iffiness'}, 'ez': {'ezekiel'}, 'ip': {'ipecacs'}, 'mc': {'mcdonald'}, 'bw': {'bwanas'}, 'fb': {'fbi'}, 'gw': {'gweducks', 'gweducs'}, 'sf': {'sforzatos'}, 'ek': {'ekistics'}, 'jn': {'jnanas'}, 'xm': {'xmases'}, 'ay': {'ayahs', 'ayatollahs'}, 'kw': {'kwachas', 'kwashiorkor'}, 'ym': {'ymca'}, 'yc': {'ycleped', 'yclept'}, 'll': {'llamas', 'llanos'}, 'aj': {'ajar'}, 'zl': {'zlotys'}, 'iv': {'ivories', 'ivory'}, 'ie': {'ieee'}, 'dv': {'dvorak'}, 'xi': {'xiphoids', 'xiphosuran'}, 'wu': {'wurzel'}, 'ee': {'eelgrasses', 'eelworm'}, 'zw': {'zwiebacks'}, 'gj': {'gjetosts'}, 'ct': {'ctrl'}, 'pf': {'pfennigs'}, 'dn': {'dnieper'}, 'oj': {'ojibwas'}, 'fj': {'fjords'}}
endswith = multimap((w[-2:], w) for w in W.unused_words)
{suf: endswith[suf] # Rare two-letter suffixes
for suf in endswith if len(endswith[suf]) <= 2}
{'ui': {'maqui', 'prosequi'}, 'ua': {'joshua'}, 'nx': {'bronx', 'meninx'}, 'hm': {'microhm'}, 'lu': {'honolulu'}, 'ec': {'filespec', 'quebec'}, 'ud': {'aloud', 'overproud'}, 'yx': {'styx'}, 'tl': {'peyotl', 'shtetl'}, 'xe': {'deluxe', 'maxixe'}, 'ep': {'asleep', 'shlep'}, 'td': {'retd'}, 'oi': {'hanoi', 'polloi'}, 'zt': {'liszt'}, 'gm': {'apophthegm'}, 'eh': {'mikveh', 'yahweh'}, 'nc': {'dezinc', 'quidnunc'}, 'mt': {'daydreamt', 'undreamt'}, 'ao': {'chiao', 'ciao'}, 'wa': {'kiowa', 'okinawa'}, 'su': {'shiatsu'}, 'zo': {'diazo', 'palazzo'}, 'xo': {'convexo'}, 'mb': {'clomb', 'whitecomb'}, 'ob': {'blowjob'}, 'pa': {'tampa'}, 'ku': {'haiku'}, 'vo': {'concavo'}, 'fa': {'khalifa'}, 'zm': {'transcendentalizm'}, 'oe': {'monroe'}, 'bm': {'ibm', 'icbm'}, 'dt': {'rembrandt'}, 'uc': {'caoutchouc'}, 'ko': {'gingko', 'stinko'}, 'ab': {'skylab'}, 'sr': {'ussr'}, 'ou': {'thankyou'}, 'za': {'organza'}, 'lm': {'stockholm', 'unhelm'}, 'dn': {'haydn'}, 'hn': {'mendelssohn'}, 'ho': {'groucho'}, 'hu': {'buchu'}, 'mp': {'prestamp'}, 'ug': {'bedrug', 'sparkplug'}, 'xs': {'duplexs'}, 'sz': {'grosz'}, 'we': {'zimbabwe'}, 'tu': {'impromptu'}, 'aa': {'markkaa'}, 'rf': {'waldorf', 'windsurf'}, 'ji': {'fiji'}, 'ai': {'bonsai'}, 'po': {'troppo'}, 'ef': {'unicef'}, 'gn': {'champaign'}, 'ub': {'beelzebub'}, 'vt': {'govt'}, 'ru': {'nehru'}, 'rb': {'cowherb'}, 'nu': {'vishnu'}, 'nz': {'franz'}, 'oz': {'kolkhoz'}, 'hr': {'kieselguhr'}, 'ln': {'lincoln'}, 'cd': {'recd'}}
The two-letter prefixes definitely include some strange words.
The list of two-letter suffixes is mostly picking out proper names and pointing out flaws in the word list. For example, lots of words end in ab
: blab, cab, dab, gab, jab, lab, etc. But must of them are subwords of plural forms; only skylab
made it into the word list in singular form but not plural.
To compare my program to Tom Murphy's:
I'm reminded of the Traveling Salesperson Problem where one algorithm is to form a single path, always progressing to the nearest neighbor, and another algorithm is to maintain a pool of shorter segments and repeatedly join together the two closest segments. The two approaches are different, but they are both suboptimal greedy methods, andit is not clear whether one is better than the other. You could try it!
(English trivia: my program builds a single path of words, and when the path gets stuck and I need something to allow me to continue, it makes sense to call that thing a bridge. Murphy's program starts by building a large pool of small portmanteaux that he calls particles, and when he can build no more particles, his next step is to put two particles together, so he calls it a join. The different metaphors for what our programs are doing lead to different terminology for the same idea.)
In terms of implementation:
It appears Murphy perhaps didn't quite have the complete concept of subwords. He did mention that when he adds 'bulleting'
, he crosses 'bullet'
and 'bulletin'
off the list, but somehow his string contains both 'spectacular'
and 'spectaculars'
. My guess is that when he adds 'spectaculars'
he crosses off 'spectacular'
, but if he happens to add 'spectacular'
first, he will later add 'spectaculars'
. Support for this view is that his output in bench.txt
says "I skipped 24319 words that were already substrs", but I computed that there are 44,320 such subwords; he found about half of them. I think those missing 20,001 words are the main reason why my strings are coming in at around 554,000 letters; less than Murphy's 611,820 letters.
Also, Murphy's joins are always between one-letter prefixes and suffixes. I do the same thing for two-word bridges, because having a W.bridges[A][B]
for every letter A
and B
is the easiest way to prove that the program will terminate. But for one-word bridges, I allow prefixes and suffixes of any length up to a total of 6 for len(pre) + len(suf)
. I can get away with this because I limited my candidate pool to the 10,000 W.shortwords
. It would have been time-consuming to build all bridges for all 100,000 words, and probably would not have helped shorten S appreciably.
I should say that I stole one important trick from Murphy. After I finished the first version of my program, I looked at his highly-entertaining video and paper and I noticed that I had a problem in my use of bridges. My natalie
function originally contained something like this:
... unused_step(...) or one_word_bridge(...) or two_word_bridge(...)
That is, I only considered two-word bridges when there was no one-word bridge, on the assumption that one word is shorter than two. But Murphy showed that my assumption was wrong: for bridges['w']['c']
I had 'workaholic'
, the best one-word bridge, but he had the two-word bridge 'war' + 'arc' = 'warc'
, which saves six excess letters over my single word. After seeing that, I shamelessly copied his approach, and now I too get a four-letter bridges['w']['c']
(sometimes 'war' + 'arc'
and sometimes 'wet' + 'etc'
or 'we' + 'etc'
).
W.bridges['w']['c']
Bridge(excess=2, steps=[Step(overlap=1, word='war'), Step(overlap=2, word='arc')])
I'll stop here, but you should feel free to do more experimentation of your own.
Here are some things you could do to make the portmantouts more interesting:
Here are some things you could do to make S shorter:
haydn
and dnieper
are made to go together in that order; they're the only two words with dn
as an affix. Similarly, womenfolk
and menfolks
should go together in that order, for a 7-letter overlap. But if we happened to place dnieper
or menfolks
first, we would loose the chance of these nice overlaps. Maybe there could be a system that assures the proper ordering, or a preprocessing step that joins together words that go together uniquely well.startswith_table
could sort the words in each key's bucket so that the "difficult" words (say, the ones that end in unusual letters) are encountered earlier in the program's execution, when there are more available words for them to connect to.natalie
.Here are some things you could do to make the program more robust:
P
.