In [1]:
import sys
print(sys.version)

3.6.2 (default, Jul 17 2017, 16:44:45)
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)]


# Common Boggle Words¶

What words appear most often in a game of Boggle?

Naively, we can count $16! \cdot 6^{16} \approx 5.9 \cdot 10^{25}$ different boards, not taking duplicate letters into account. This many boards is too a numerous; even if discovering words per board was as fast as $\mathcal{O}(\mathtt{pass})$, it would be too much.

In [2]:
import time
import numpy as np
from datetime import timedelta

def simple_loop(n):
"""Return how many seconds a simple loop takes."""
start = time.time()
for _ in range(n):
pass
return time.time() - start

x = np.arange(start=1, stop=1e6, step=1e4)
y = [simple_loop(int(n)) for n in x]

a, b = np.polyfit(x, y, deg=1)
seconds = a*1e25 - b
print(timedelta(seconds))

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-2-305267f68761> in <module>()
18 a, b = np.polyfit(x, y, deg=1)
19 seconds = a*1e25 - b
---> 20 print(timedelta(seconds))

OverflowError: normalized days too large to fit in a C int

Great Scott!

In [3]:
from math import log10

C_MAX_INT = 2**31
log10(C_MAX_INT/365)

Out[3]:
6.769637001126942

So we will recourse to sampling!

The game Boggle has a square board, with 16 slots for cubes. Each cube has letters on its 6 faces. Starting each game, the board is shuffled so that each cube is in a random slot, and each cube has a random face showing.

Players compete to find most words showing on the resultant board. A player's score is based on the number of words that no other players found. Where longer words that no other players found are scored higher.

Words can be constructed from any non-repeating contiguous sequence of horizontal, vertical, or diagonal cubes. But sequences can't wrap around the board. Note too that the same letter may appear twice in a path, but not the same cube.

Lastly, the words must belong to an agreed upon dictionary, and be at least some minimum length.

## Simulating Boggle¶

There are 16 cubes one will find in a standard Boggle board:

In [7]:
cubes = [
('a', 'a', 'e', 'e', 'g', 'n'),
('a', 'b', 'b', 'j', 'o', 'o'),
('a', 'c', 'h', 'o', 'p', 's'),
('a', 'f', 'f', 'k', 'p', 's'),
('a', 'o', 'o', 't', 't', 'w'),
('c', 'i', 'm', 'o', 't', 'u'),
('d', 'e', 'i', 'l', 'r', 'x'),
('d', 'e', 'l', 'r', 'v', 'y'),
('d', 'i', 's', 't', 't', 'y'),
('e', 'e', 'g', 'h', 'n', 'w'),
('e', 'e', 'i', 'n', 's', 'u'),
('e', 'h', 'r', 't', 'v', 'w'),
('e', 'i', 'o', 's', 's', 't'),
('e', 'l', 'r', 't', 't', 'y'),
('h', 'i', 'm', 'n', 'qu', 'u'),
('h', 'l', 'n', 'n', 'r', 'z'),
]


To simulate the process of shuffling a board, we uniformly choose a slots and faces.

In [8]:
from random import choice, shuffle
from itertools import islice

def rand_board():
board = []
shuffle(cubes)
cube_order = iter(cubes)
for i in range(4):
row = islice(cube_order, 4)
board.append([choice(cube) for cube in row])
return board

In [9]:
rand_board()

Out[9]:
[['w', 't', 'i', 'p'],
['n', 'x', 'e', 'n'],
['r', 'e', 's', 's'],
['t', 'b', 'n', 't']]

To find viable words on a board, we'll recursively traverse paths across the showing letters. Once a sequence goes out of bounds or repeats a cube, traversal in that direction will stop.

Now, at each step in this process we could look up the sequence in a hash table. But, it may be the case that the current sequence is not a prefix of any word in the dictionary. So, it will be more efficient to store the dictionary in a trie. Here is an inspiring elegant implementation.

In [10]:
def make_trie(words):
root = {}
for word in words:
node = root
for letter in word:
node = node.setdefault(letter, {})
node[None] = None    # signify that node is a word
return root

In [11]:
import json

t = make_trie(['apply', 'apple', 'apt'])
print(json.dumps(t, indent=2))

{
"a": {
"p": {
"p": {
"l": {
"y": {
"null": null
},
"e": {
"null": null
}
}
},
"t": {
"null": null
}
}
}
}


Now let's write that recursive function that generates unique words on a given board, of at least some length.

In [12]:
from itertools import product

def get_words(board, trie, min_len=3):
"""Yield possible words from a given board and trie."""
offsets = [(-1, -1), (-1, 0), (-1, 1),
(0,  -1),          (0,  1),
(1,  -1), (1,  0), (1,  1)]

def recur(i, j, path, seen, node):
if (i, j) in seen:    # path can't cross itself
return
if not (0 <= i < 4) or not (0 <= j < 4):    # path can't wrap around board
return
letter = board[i][j]
path += letter
if letter in node:    # path is a prefix
if None in node[letter]:    # path is a word
yield path
for i_off, j_off in offsets:    # traverse neighbors even if path is word
yield from recur(i + i_off, j + j_off, path, seen.copy(), node[letter])

seen = set()
for i, j in product(range(4), range(4)):
for word in recur(i, j, '', set(), trie):
if len(word) >= min_len and word not in seen:
yield word


Finding words necessitates a dictionary, and a Scrabble-esque dictionary seems fitting. Conveniently, the Words With Friends game's dictionary is publicly available: the Enhanced North American Benchmark Lexicon.

In [13]:
with open('enable1.txt') as f:
trie = make_trie(word.strip() for word in f)


## Estimating Word Probabilities¶

To estimate word frequencies, we'll gather relative frequencies from random boards. But how many boards to sample from?

We can view the appearance of a specific word as a Bernoulli random variable with some unknown probability. We'd like to sample enough boards so that the resultant $k$ most common words is are in fact the most common, and in order.

In the land of probability bounds, where Markov and Chebyshev wander wide, one called Hoeffding's inequality is particularly powerful when it comes to Bernoulli variables. It can tell us how tight an $\epsilon$-sized confidence interval should be with $n$ samples.

Specifically, let $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$. Then, for any $\epsilon > 0$,

$$P(\left| \overline{X_n} - p \right| > \epsilon) \le 2 e^{-2n\epsilon^2}$$

So, to be $\alpha$ confident that some variable is within error $\epsilon$ of it's true value, we'll need

$$n \ge \frac{1}{2\epsilon^2}\log{\frac{2}{\alpha}}$$

samples.

This applies to any Bernoullie variable, and each random Boggle board gives a Bernoulli value to every word (it appears or it doesn't). So, we can just sample a bunch of boards, which will probably produce a tight enough bound.

However, we'll have to be very diligent about choosing our confidence $\alpha$, and tightness $\epsilon$. We can't guarantee that the top $k$ words are in fact the top $k$, and in order, but it does ossify those top probabilities with a diligently chosen $\epsilon$.

Experimenting with 1000 boards showed that top 10 words' probabilities overlap, suggesting that we need $\epsilon < 0.001$ to be, say, $\alpha = 0.1$ confident that the ordering is correct. Unfortunately, this would lead to $n \ge 1.5 \times 10^6$ boards, which is too many hours of computation. Instead we'll shoot for just having the same set of top words, rather than correct ranking. The difference in probability between the empirical most common word and the 20th (we choose $k$ arbitrarily) was about 0.01, yielding $n \ge 14979$ boards, which is more manageable.

In [20]:
%%time
from collections import Counter
from tabulate import tabulate
import multiprocessing as mp

NBOARDS = 14979

def update_counts(_):
board = rand_board()
return Counter(get_words(board, trie, min_len=4))

counts = Counter()
nprocs = mp.cpu_count()
with mp.Pool(processes=nprocs) as pool:
it = pool.imap_unordered(update_counts, range(NBOARDS),
chunksize=500)
for sampled_counts in it:
counts += sampled_counts

table = [(i, word, count/NBOARDS) for i, (word, count)
in enumerate(counts.most_common(20), start=1)]

  Rank  Word      Frequency
------  ------  -----------
1  teen         0.0642
2  tees         0.0639
3  toes         0.0575
4  note         0.0575
5  teat         0.0573
6  tent         0.0565
7  tone         0.0565
8  toea         0.0558
9  neat         0.0525
10  test         0.0524
11  nets         0.0517
12  sent         0.0511
13  nest         0.0510
14  teas         0.0508
15  thee         0.0505
16  seta         0.0502
17  ates         0.0502
18  tens         0.0501
19  ante         0.0501
20  etna         0.0501
CPU times: user 52.8 s, sys: 562 ms, total: 53.4 s
Wall time: 1min 13s


## Common and Ordinary Word Ranking¶

The most common list is interesting, but some of the words are so ordinary that other players will probably know them and also find them. To improve one's chances of winning Boggle, it would be nice to have a ranking that takes into account both a word's probability of occurence, and ordinariness.

This is reminiscent tf-idf weighting! Our current list gives us the term frequency, so all that remains is to find each term's inverse document frequency. Here, we assume that usage corresponds to how likely it is for a word to be known.

We'll use word usage counts from American National Corpus. However, some of the words from Boggle might have various endings (e.g. "apple", "apples"), and those should add in frequency for "ordinariness." So we'll stem the words for lookup.

In [21]:
from nltk.stem import SnowballStemmer
from collections import defaultdict

snow = SnowballStemmer('english')
freqs = defaultdict(int)
min_freq = 1    # smallest frequency found, for use with Boggle words not in list

with open('ANC-token-count.txt', encoding='ISO-8859-1') as f:    # there's weird characters
for word, *_, freq in map(str.split, f):
if word == 'Total':    # there was a total count of words at the end
continue
word, freq = snow.stem(word), float(freq)
freqs[word] += freq
min_freq = freqs[word]


Now, we just need to calculate the scores of the words output a new sorted list.

In [22]:
from math import log

tfs, idfs = [], []
for word, count in counts.items():
word = snow.stem(word)
tfs.append(1 + count)
idfs.append(log(1 + 1/freqs.get(word, min_freq)))

scores = sorted([(tf*idf, word) for tf, idf, word in zip(tfs, idfs, counts.keys())], reverse=True)
table = [(i, word, score) for i, (score, word) in enumerate(scores[:20], start=1)]

  Rank  Word      Score
------  ------  -------
1  toea    14157
2  seta    12736.3
3  nett    12668.6
4  teat    12641.7
5  stet    12228.8
6  tret    11721.4
7  rete    11719.2
8  sett    11653.8
9  tees    11617.2
10  etna    11493.7
11  teel    11078.7
12  hest    10994.1
13  teth    10981.5
14  haet    10841.9
15  nite    10547.4
16  hent    10419
17  eath    10419
18  tine    10407.1
19  leet    10359.1
20  sate    10339.6


# Epilogue¶

After completing this task, several sub-topics came to mind. We address some of those here. Unlike in previous sections, we will be lax proving the correct number of boards.

## Number of Possible Words Distribution¶

It would be interesting to see the minimum, maximum, mean, etc. of the number of words found over all boards. For the sake of brevity, rather than fitting the distribution and devising estimators, we'll just plot a simple histogram. And I choose an arbitrarily large number of boards to sample from.

In [23]:
import matplotlib.pyplot as plt
import seaborn

%matplotlib inline
seaborn.set_context('notebook')
seaborn.set_style('whitegrid')

num_words = []
for _ in range(2000):
num_words.append(len(list(get_words(rand_board(), trie))))

plt.hist(num_words, bins=40, normed=True)
plt.title('Number of Possible Words Distribution')
plt.xlabel('Number of words')
plt.ylabel('Proportion')
plt.show()


## Best Proportion of Vowels¶

A friend gave the comment that the "best" Boggle boards are comprised of 40% vowels. Best, here, was intended to mean highest number of word solutions. Now to experimentally prove this claim.

In [24]:
from itertools import chain
from collections import defaultdict

NBOARDS = 1000
data = defaultdict(int)    # store how many words found per vowel, summed over boards
vowels = set('aeiou')    # for simplicity we assume that these are the only vowels

for _ in range(NBOARDS):
num_vowels = 0
board = rand_board()
for letter in chain(*board):
if letter in vowels:
num_vowels += 1
num_words = len(list(get_words(board, trie, min_len=4)))
data[num_vowels] += num_words


Now to plot the mean number of found words per number of vowels:

In [28]:
x, y = zip(*sorted(data.items()))

plt.plot(x, [t/NBOARDS for t in y], 'bo:')
plt.title(r'Number of Words in Board per Vowel')
plt.xlabel('Number of vowels')
plt.ylabel('Total number of found words')
plt.show()

In [26]:
6/16

Out[26]:
0.375

Would you look at that, 40% was pretty spot on for an off the cuff comment.

## Super Big Boggle¶

In [27]:
big_cubes = [
('s', 's', 's', 'u', 'n', 'e'),
('qu', 'j', 'k', 'z', 'w', 'x'),
('a', 'a', 'a', 's', 'f', 'r'),
('h', 'o', 'r', 'd', 'l', 'n'),
('g', 'o', 'r', 'w', 'r', 'v'),
('f', 's', 'i', 'r', 'a', 'a'),
('an', 'in', 'er', 'he', 'th', 'qu'),
('a', 'e', 'a', 'e', 'e', 'e'),
('l', 'i', 'e', 'a', 'n', 'm'),
('g', 'f', 'n', 'u', 'y', 'c'),
('a', 'e', 'm', 'e', 'e', 'e'),
('h', 'd', 't', 'n', 'o', 'd'),
('h', 's', 'e', 'r', 'i', 'l'),
('n', 'u', 'e', 'o', 'i', 'a'),
('o', 'n', 'd', 'w', 'h', 'h'),
('a', 'o', 'i', 'e', 'b', 'd'),
('l', 'h', 'h', 'r', 'd', 'o'),
('g', 'n', 'a', 'm', 'n', 'e'),
('s', 't', 'r', 'h', 'p', 'o'),
('n', 'd', 'e', 'a', 'n', 'n'),
('o', 'o', 'a', 'e', 'a', 'e'),
('t', 'm', 't', 'e', 'o', 't'),
('t', 'i', 's', 'r', 'v', 'h'),
('a', 'e', 'm', 'g', 'u', 'e'),
('t', 'i', 'l', 's', 'p', 'e'),
('n', 'o', 'w', 'o', 'u', 't'),
('y', 'r', 'y', 's', 'i', 'p'),
('e', 's', 'p', 't', 'i', 'c'),
('c', 't', 'n', 's', 'e', 'c'),
('y', 'r', 'f', 's', 'i', 'a'),
('l', 'c', 'd', 'd', 'n', 'n'),
('t', 'i', 't', 'c', 't', 'e', 'i'),
('e', 't', 'i', 'l', 'i', 's'),
('j', 'b', 'b', 'z', 'x', 'k'),
('o', 'o', 'o', 't', 'u', 't'),
('o', 'i', None, None, 'e', None)
]