5x5 Grids of English Words

The Problem

Here's an example of a valid 5x5 grid. It contains 10 English words (5 across words and 5 down words).

a w a r e
m o l a l
a v o i d
s e n s e
s n e e r

This project aims to find all 5x5 grids of English words. The word list I'm using is english_words_lower_alpha_set from the Python english_words library (from english_words import english_words_lower_alpha_set). This set contains 3213 words. I removed the majority of the proper nouns and all of the non-words (I hope!), which brings the word count down to 2481.
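Here's a minimal sketch of pulling in that set and keeping only the 5-letter words a 5x5 grid can use (the manual cleanup of proper nouns and non-words isn't shown, and the variable name is mine):

from english_words import english_words_lower_alpha_set

# keep only the 5-letter words; manually removing proper nouns and
# non-words would happen after this step
five_letter_words = sorted(w for w in english_words_lower_alpha_set if len(w) == 5)
print(len(five_letter_words))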

There are \(2481^5 = 9.4 \times 10^{16} = 94\) million billion possible ordered groups of 5 non-unique words. If we can process 100 grids per second, then we'd need about \((2481^5) / 100\) seconds (so, 29.8 million years!) to compute the set of all grids. This is back-of-the-napkin math, but it gives us a general sense of the orders of magnitude we're dealing with here.
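A quick sanity check on that arithmetic:

n_words = 2481
grids = n_words ** 5                    # ordered choices of 5 (non-unique) across words
seconds = grids / 100                   # at an assumed 100 grids per second
years = seconds / (60 * 60 * 24 * 365)
print(f"{grids:.1e} grids, {years:.1e} years")  # about 9.4e16 grids, about 3.0e7 years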

Here is a naive implementation that finds all valid 5x5 grids (without excluding duplicate words).

from tqdm.notebook import tqdm

def find_grids_slow(words):
    words2 = set(words) # checking set containment is faster than checking list containment
    for a0 in words:
        for a1 in words:
            for a2 in tqdm(words):
                for a3 in words:
                    for a4 in words:
                        d0 = a0[0] + a1[0] + a2[0] + a3[0] + a4[0]
                        d1 = a0[1] + a1[1] + a2[1] + a3[1] + a4[1]
                        d2 = a0[2] + a1[2] + a2[2] + a3[2] + a4[2]
                        d3 = a0[3] + a1[3] + a2[3] + a3[3] + a4[3]
                        d4 = a0[4] + a1[4] + a2[4] + a3[4] + a4[4]
                        if (d0 in words2) and (d1 in words2) and (d2 in words2) and (d3 in words2) and (d4 in words2):
                            grid = "\n".join((a0,a1,a2,a3,a4,"\n"))
                            return grid

I ran this function using a set of 3069 words. For a0 = "sloth", it took 9 hours to run for each of the first two values of a2. (I didn't print out what those two values were.) There are 3069 possible values for a2, so it will take \(9 \times 3069 = 27621\) hours to complete the computation for all values of a2. Then we need to run that loop for each value of a1 and each value of a0, bringing us to 260,155,597,581 hours, which is 29.7 million years. That's too long. Even if we improved this by a few orders of magnitude and ran it in parallel, the brute force approach isn't going to be viable.

Note that this works out to the naive function processing \(3069^5 / (260{,}155{,}597{,}581 \times 60 \times 60) \approx 290\) grids per second.
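Backing that throughput out explicitly:

# implied throughput of the naive search, given the projected total runtime
grids_per_second = 3069 ** 5 / (260_155_597_581 * 60 * 60)
print(grids_per_second)  # roughly 290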

The Implementation

Starting out, I had a poor intuition for how uncommon 5x5 grids actually are. While it would be nice to compute all grids, a few hundred grids would suffice for my use case. So, I decided to start out by trying to compute just one grid. I fixed a0. Then I chose a random d0 (starting with the first letter of a0). Then I chose a random a1. Then a random d1. Then a random a2. Such a grid looked like this.

s t e e r
i r a t e
l a p s e
k c _ _ _
y e _ _ _

There are no English words that can complete this grid. I ran this function enough times to convince myself that statistically I'd never find a complete grid using this method. And I definitely wouldn't find hundreds (if there even are hundreds!).
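For the record, here's a rough sketch of that random-fill experiment (the original code isn't shown above, so the function name and structure here are assumptions):

import random

def random_partial_grid(a0, words):
    # try to extend a fixed a0 by alternating random down and across words,
    # giving up as soon as no word fits the current prefix
    def pick(prefix):
        candidates = [w for w in words if w.startswith(prefix)]
        return random.choice(candidates) if candidates else None

    d0 = pick(a0[0])           # random down word starting with a0's first letter
    if d0 is None:
        return None
    a1 = pick(d0[1])           # random across word consistent with d0
    if a1 is None:
        return None
    d1 = pick(a0[1] + a1[1])   # random down word consistent with a0 and a1
    if d1 is None:
        return None
    a2 = pick(d0[2] + d1[2])   # random across word consistent with d0 and d1
    if a2 is None:
        return None
    return a0, d0, a1, d1, a2

Calling this repeatedly produces partial grids like the one above; as described, continuing the process to a complete grid essentially never succeeds.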

Next I experimented with fixing a0, and then iterating over all valid d0s, then all valid a1 words, then all valid d1 words, and so on. The indexing was kind of unwieldy, and I realized I wasn't gaining anything from alternating between across words and down words.

So, I decided to fix a0, then loop over each a1, filtering out those values of a1 for which some column prefix had no valid down words, and then loop over each a2 in the same fashion. After running this function I noticed that, once the first three across words were fixed, there were often only a handful of valid down words remaining. So, instead of iterating over the full set of words two more times (once for each of a3 and a4), we can just iterate over the valid remaining down words, checking that the a3 and a4 they produce are valid. For example, after fixing the first three across words:

s t o v e
h a v e n
a l e r t
_ _ _ _ _ 
_ _ _ _ _

Here's the implementation.
from tqdm.notebook import tqdm

def has_prefix(prefix, words):
    # True if any word starts with the given prefix
    for word in words:
        if word.startswith(prefix):
            return True
    return False

def filter_prefix(prefix, words):
    # return the subset of words that start with the given prefix
    out = set()
    for word in words:
        if word.startswith(prefix):
            out.add(word)
    return out

def find_grids(a0_words, all_words, output_file):
    all_words_set = set(all_words)
    for a0 in a0_words:
        print('a0:', a0)
        for a1 in tqdm(all_words_set - {a0}):
            p0 = a0[0] + a1[0]
            p1 = a0[1] + a1[1]
            p2 = a0[2] + a1[2]
            p3 = a0[3] + a1[3]
            p4 = a0[4] + a1[4]
            if not all(has_prefix(p, all_words_set) for p in (p0, p1, p2, p3, p4)):
                continue
            for a2 in all_words_set - {a0,a1}:
                p0 = a0[0] + a1[0] + a2[0]
                p1 = a0[1] + a1[1] + a2[1]
                p2 = a0[2] + a1[2] + a2[2]
                p3 = a0[3] + a1[3] + a2[3]
                p4 = a0[4] + a1[4] + a2[4]
                # precomputing the candidate down words directly (rather than
                # checking has_prefix first) is maybe faster
                remaining = all_words_set - {a0, a1, a2}
                d0s = filter_prefix(p0, remaining)
                d1s = filter_prefix(p1, remaining)
                d2s = filter_prefix(p2, remaining)
                d3s = filter_prefix(p3, remaining)
                d4s = filter_prefix(p4, remaining)
                if not (d0s and d1s and d2s and d3s and d4s):
                    continue
                for d0 in d0s:
                    for d1 in d1s:
                        for d2 in d2s:
                            for d3 in d3s:
                                for d4 in d4s:
                                    # the last two across words are forced by the chosen down words
                                    a3 = d0[3] + d1[3] + d2[3] + d3[3] + d4[3]
                                    a4 = d0[4] + d1[4] + d2[4] + d3[4] + d4[4]
                                    if (a3 in all_words_set - {d0,d1,d2,d3,d4,a0,a1,a2}) and (a4 in all_words_set - {d0,d1,d2,d3,d4,a0,a1,a2,a3}):
                                        grid = "\n".join((a0,a1,a2,a3,a4,"\n"))
                                        with open(output_file, 'a') as file:
                                            file.write(grid)

(I'd be surprised if I couldn't rewrite this function in a more general way using fancy numpy array index tricks. But, this is fine for now. Also, I could improve this roughly 2x by not double-computing grid transposes (e.g. after computing a grid, store d0 in a file and then check that file each time the function starts with a fresh a0), but that doesn't seem like a big enough win given the work involved.)
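As an aside, here's a rough illustration of how transposes could be deduplicated after the fact (this isn't part of the code above; the helper is just a sketch): canonicalize each grid by comparing it with its transpose.

def canonical(grid_rows):
    # grid_rows is a tuple of 5 five-letter across words;
    # the transpose reads the columns off as rows
    transpose = tuple("".join(row[i] for row in grid_rows) for i in range(5))
    # pick a consistent representative so a grid and its transpose collapse together
    return min(grid_rows, transpose)

grid = ("aware", "molal", "avoid", "sense", "sneer")
flipped = tuple("".join(row[i] for row in grid) for i in range(5))
assert canonical(grid) == canonical(flipped)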

I wrote this function so that I could run it in parallel (simply by running it at the same time in different jupyter notebook tabs) by passing in a subset of the full word set as a0_words. I divided the full word set into subsets of size 100 for this purpose.
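Concretely, the splitting amounts to something like this (the chunk variable and output file names are just for illustration):

# all_words is the cleaned 2481-word list (as a list, so it can be sliced);
# each chunk of 100 a0 values goes to its own notebook tab
chunks = [all_words[i:i + 100] for i in range(0, len(all_words), 100)]

# then, in the k-th notebook tab:
# find_grids(chunks[k], all_words, f"grids_chunk_{k}.txt")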

The Results

I ran find_grids in 8 or 9 separate jupyter notebook tabs simultaneously. This was the first time I'd heard the fan spin up loudly on my computer.

I ran the first 800 words and the last 881 words using the above function. I ran the middle 800 words using a different function, which considered all down words after fixing only a0 and a1. This was slower, but still completed.

The timing output looks like this.

a0: bless 

100%
2480/2480 [08:00<00:00, 7.16it/s]

a0: lathe 

100%
2480/2480 [03:07<00:00, 12.77it/s]

I ran awk 'NR % 5 == 0' timedresults.txt to output every 5th row (the rows that contain the times). Then, in vim, I stripped out everything but the times themselves (starting with :%s/<.*// to delete everything from the < onward). The rows look like this.

00:28
08:17
04:02

Faster Algorithm

Verify that all rows have length 5 (i.e. just minutes and seconds). Convert each to seconds (minutes times 60, plus seconds), and then sum them.

awk '{print length}' timedresults_first800_fast.txt | sort -n | uniq -c
  800 5

cat timedresults_first800_fast.txt | awk -F: '{ print ($1 * 60) + $2 }' | awk '{n += $1}; END{print n}'
  308379

awk '{print length}' timedresults_third881_fast.txt | sort -n | uniq -c 
  881 5

cat timedresults_third881_fast.txt | awk -F: '{ print ($1 * 60) + $2 }' | awk '{n += $1}; END{print n}' 
  364620

These 1681 values for a0 took a total of \((308379 + 364620) / 3600 \approx 187\) CPU hours to run. I ran them in 17 jupyter notebook tabs of 100 a0 values each (one tab had 81), so each tab took on average 11 hours to complete.

Slower Algorithm

Same check as before, except 42 of these 800 runs took longer than an hour, so their times are formatted as h:mm:ss and summed from a separate file.

awk '{print length}' timedresults_second800_slow_minutes.txt | sort -n | uniq -c
  758 5

cat timedresults_second800_slow_minutes.txt | awk -F: '{ print ($1 * 60) + $2 }' | awk '{n += $1}; END{print n}' 
  377945

awk '{print length}' timedresults_second800_slow_hours.txt | sort -n | uniq -c
  42 7

cat timedresults_second800_slow_hours.txt | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }' | awk '{n += $1}; END{print n}'
  360329

These 800 values for a0 took a total of \((377945 + 360329) / 3600 \approx 205\) CPU hours to run. I ran them in 8 jupyter notebook tabs, so each tab took on average 25 hours to complete.

These data have a long tail. Here are the results greater than 2 hours.

2:01:28
2:01:48
2:06:33
2:09:05
2:11:10
2:19:48
2:36:54
2:37:35
2:52:42
3:12:39
3:12:42
3:30:57
3:33:50
3:57:20
4:40:07
6:13:06
7:20:52
9:07:59

For those curious, a0 = spasm took 9:07:59.

The Finale

If I'd used the faster function exclusively, it'd have taken approximately 276 CPU hours to compute all grids (scaling the 187 CPU hours for 1681 a0 words up to the full 2481). This is a considerable improvement over 30 million years.

There are 210 unique grids (excluding transposes). Far fewer than I expected!

Optimization doesn't need to make things as fast as possible; it just needs to make things fast enough.

Instead of making the fundamental operations faster, I executed far fewer operations.