Wordle letters

Like all internet hipsters, I started playing Wordle a few days before the New York Times article that introduced it to the great unwashed. I don’t think I’ll stick with it for very long—the universe of five-letter words seems like something that will wear thin soon—but I am interested in the strategy. So I did a little scripting.

Clearly, the idea is to identify as many letters in the target word as quickly as possible. Letter frequencies in English text famously follow the ETAOIN SHRDLU order, an ordering that was built into Linotype keyboards back in the days of hot metal type. But Wordle isn’t based on general English text; it’s based specifically on five-letter words. So we need the letter frequencies for that restricted set.

Mac and Linux computers carry on the Unix tradition of including a file, /usr/share/dict/words, that’s used for spell checking. It’s an alphabetical list with one word per line, which is very convenient for working out letter frequencies. But first, we’ll need to pull out just the five-letter words, leaving behind any proper nouns. That can be done with a simple Perl one-liner:[1]

perl -nle 'print if /^[a-z]{5}$/' /usr/share/dict/words > words5.txt

The regular expression that is the backbone of this command matches only five-letter words with no capitals. After running this, we have a file, words5.txt, that contains just the words we need for Wordle. It has about 8,500 entries.
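For anyone who would rather stay out of Perl, the same filter can be sketched in Python. This is only an illustrative equivalent of the one-liner, not part of the workflow described here, and the function name five_letter_words is made up for the example.

```python
import re

# Same pattern as the Perl one-liner: exactly five lowercase ASCII letters.
FIVE_LOWER = re.compile(r'[a-z]{5}')

def five_letter_words(lines):
    """Return the entries that consist of exactly five lowercase letters."""
    words = []
    for line in lines:
        word = line.rstrip('\n')          # drop the trailing newline, like -l
        if FIVE_LOWER.fullmatch(word):    # whole entry must match, like ^...$
            words.append(word)
    return words

# A small in-memory sample standing in for /usr/share/dict/words.
sample = ['apple\n', 'Apple\n', 'pear\n', 'grapes\n', 'zebra\n']
print(five_letter_words(sample))
```

Passing an open file handle for /usr/share/dict/words instead of the sample list should give the same words as the Perl command, since iterating over a file yields one line at a time.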

Now that we have a file with just five-letter words, we can compute the letter frequencies with this script:

 1:  #!/usr/bin/perl
 3:  while($word = <>){
 4:    chomp $word;
 5:    foreach (split //, $word){
 6:      $freq{$_}++;
 7:    }
 8:  }
10:  foreach $letter (sort keys %freq){
11:    print "$letter\t$freq{$letter}\n";
12:  }

Lines 3–8 loop through the lines of the input file and build up a hash (or associative array) named %freq with letters as the keys and their counts as the values. Lines 10–12 then print out the hash in alphabetical order:

a   4467
b   1162
c   1546
d   1399
e   4255
f   661
g   1102
h   1323
i   2581
j   163
k   882
l   2368
m   1301
n   2214
o   2801
p   1293
q   84
r   3043
s   2383
t   2381
u   1881
v   466
w   685
x   189
y   1605
z   250

I could have included a sorting command to reorder the hash in frequency order, but it was easier to just copy the output, paste it into a spreadsheet, and do the reordering there. I also used the spreadsheet to sum the counts and present the frequencies as percentages.

Letter Count Frequency
a 4467 10.5%
e 4255 10.0%
r 3043 7.2%
o 2801 6.6%
i 2581 6.1%
s 2383 5.6%
t 2381 5.6%
l 2368 5.6%
n 2214 5.2%
u 1881 4.4%
y 1605 3.8%
c 1546 3.6%
d 1399 3.3%
h 1323 3.1%
m 1301 3.1%
p 1293 3.0%
b 1162 2.7%
g 1102 2.6%
k 882 2.1%
w 685 1.6%
f 661 1.6%
v 466 1.1%
z 250 0.6%
x 189 0.4%
j 163 0.4%
q 84 0.2%
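The spreadsheet step (sorting by count and converting to percentages) could also be done in code. Here’s a minimal Python sketch of that idea; it is not the script used for the table above, and the name letter_table and the three sample words are just for illustration.

```python
from collections import Counter

def letter_table(words):
    """Return (letter, count, percent) rows sorted by descending count."""
    counts = Counter()
    for word in words:
        counts.update(word)        # every occurrence counts, repeats included
    total = sum(counts.values())
    rows = []
    # Largest counts first; ties fall back to alphabetical order.
    for letter, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        rows.append((letter, n, 100 * n / total))
    return rows

# A made-up sample; in practice the words would come from words5.txt.
sample = ['abbey', 'cable', 'eerie']
print('Letter\tCount\tFrequency')
for letter, n, pct in letter_table(sample):
    print(f'{letter}\t{n}\t{pct:.1f}%')
```

The percentages here are shares of all letters, so they sum to 100% just as the Frequency column in the table does.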

Using /usr/share/dict/words as a source of words was convenient, as it was already on my computer, but I doubted it was the source of legal words in Wordle. Wouldn’t a word gamer use a Scrabble dictionary? A little searching led me to this page and this one. I copied the source code for each, opened them in BBEdit, and after a few search-and-replaces, had two new lists of five-letter words. They were nearly the same, differing by about 60 words out of 8,900. I merged the two and named the result scrabble5.txt.[2]

Running the letter frequency script on this new file and doing the same sorting and percentage calculations as before gave me this list:

Letter Count Frequency
s 4623 10.4%
e 4585 10.3%
a 3986 8.9%
o 2977 6.7%
r 2916 6.5%
i 2633 5.9%
l 2440 5.5%
t 2319 5.2%
n 2022 4.5%
d 1727 3.9%
u 1697 3.8%
c 1475 3.3%
y 1403 3.1%
p 1384 3.1%
m 1339 3.0%
h 1214 2.7%
g 1113 2.5%
b 1096 2.5%
k 949 2.1%
f 790 1.8%
w 690 1.5%
v 475 1.1%
z 249 0.6%
x 213 0.5%
j 186 0.4%
q 79 0.2%

The leap of s from 5.6% to 10.4% suggests plurals play a big role in Scrabble dictionaries and not much of one in /usr/share/dict/words. I checked this by running

perl -nle 'print if /s$/' words5.txt | wc -l

and

perl -nle 'print if /s$/' scrabble5.txt | wc -l

to tell me how many words that end in s are in each of the two files. There were 357 such words in words5.txt and 2771 in scrabble5.txt. This told me two things:

  1. Spell checkers that use /usr/share/dict/words must use algorithmic methods to deal with plurals.
  2. A lot of legal five-letter Scrabble words are just pluralized four-letter words.
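The same check can be sketched in Python, with the count turned into a share of the list so the two files are easier to compare. The function names and the tiny sample lists are made up for illustration; using the approximate list sizes mentioned above, 357 of about 8,500 is roughly 4%, while 2771 of about 8,900 is roughly 31%.

```python
def count_ending_in(words, ending='s'):
    """Count the words that end with the given letters."""
    return sum(1 for word in words if word.endswith(ending))

def percent_ending_in(words, ending='s'):
    """Percentage of the list that ends with the given letters."""
    return 100 * count_ending_in(words, ending) / len(words)

# Tiny made-up lists standing in for words5.txt and scrabble5.txt.
dict_sample = ['abbey', 'cable', 'glass', 'tapir']
scrabble_sample = ['vrows', 'dorks', 'glass', 'tapir']

for name, words in [('dict', dict_sample), ('scrabble', scrabble_sample)]:
    print(f'{name}\t{count_ending_in(words)}\t{percent_ending_in(words):.1f}%')
```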

I did this on Monday and was pretty happy with it until I read that Times article on Wednesday, where it said that Josh Wardle, the creator of Wordle, had started with a list of 12,000 words but then

…narrowed down the list of Wordle words to about 2,500, which should last for a few years.

That would mean my frequencies are based on a much broader set of words than Wordle considers legal, which could throw off my calculated frequencies.

And yet…

Today I sacrificed my score by trying out some oddball Scrabble words to see if Wordle would accept them. It did. Here’s my game:

Bad Wordle game

I don’t know about you, but if I were limiting myself to just 2,500 words, things like heuch and vrows wouldn’t make the cut. (I would definitely include rebus and tapir, which some dorks—I would also include dorks—have apparently complained about.) So I’m wondering if the Times got this part of the story mixed up somehow. (Update: Nope, see below.)

You might be wondering if counting the number of times each letter appears in the list of legal five-letter words is the right way to characterize the frequency of letters. Maybe we should be counting the number of words each letter appears in. This script, which uses the uniq function in the List::Util module to filter out repeated letters, does just that:

#!/usr/bin/perl

use List::Util qw(uniq);

while($word = <>){
  chomp $word;
  foreach (uniq split //, $word){
    $freq{$_}++;
  }
}

foreach $letter (sort keys %freq){
  print "$letter\t$freq{$letter}\n";
}

Using this way of counting gives us another letter frequency table:

Letter Count Frequency
s 4106 46.1%
e 3993 44.8%
a 3615 40.5%
r 2751 30.9%
o 2626 29.5%
i 2509 28.1%
l 2231 25.0%
t 2137 24.0%
n 1912 21.4%
u 1655 18.6%
d 1615 18.1%
c 1403 15.7%
y 1371 15.4%
p 1301 14.6%
m 1267 14.2%
h 1185 13.3%
g 1050 11.8%
b 1023 11.5%
k 913 10.2%
f 707 7.9%
w 686 7.7%
v 465 5.2%
z 227 2.5%
x 212 2.4%
j 184 2.1%
q 79 0.9%

In this table, the frequency column gives the percentage of words in which each letter appears. The ordering of the letters is basically the same as before, so I don’t think this way of counting will change your strategy.
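For comparison, here’s how the words-containing-each-letter count might look in Python, where a set plays the role that uniq plays in the Perl script. It’s a sketch with made-up function names and sample words, not the code that produced the table.

```python
from collections import Counter

def words_containing(words):
    """Count, for each letter, the number of words it appears in."""
    counts = Counter()
    for word in words:
        counts.update(set(word))   # set() drops repeated letters within a word
    return counts

def percent_of_words(words):
    """Percentage of words that contain each letter."""
    counts = words_containing(words)
    return {letter: 100 * n / len(words) for letter, n in counts.items()}

sample = ['eerie', 'abbey', 'cable']
for letter, pct in sorted(percent_of_words(sample).items(),
                          key=lambda kv: (-kv[1], kv[0])):
    print(f'{letter}\t{pct:.1f}%')
```

Because a word can contain several different letters, these percentages are shares of the word list and are not expected to sum to 100%.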

Update 1/8/2022 5:29 PM
People have gently tweeted me that the Wordle source code—which you can easily download, and I could have easily downloaded before writing this post—has two lists. One is words that might be answers (2,315), and the other is additional words that can be guessed (10,657). You can run my scripts on either of these lists (or their concatenation) to refine your strategies. If you’re a serious Wordle player, think carefully about whether doing so would spoil your fun.

Thanks to Tim Dierks and Antonio Bueno.

Update 01/9/2022 11:22 AM
Todd Wells devised a way to grade five-letter words according to how well they match a decent-sized corpus of five-letter words. His grade is based on both the number of letters that match and whether they’re in the right position. Both his Python code and his explanation of how it works are very clear, and you should go take a look.

I suppose I should have said this in the original post, but I’ll say it here: These methods of scoring letters and words are really just for your first—and maybe your second—guess (Todd makes this explicit by naming his script wordle_starting_guess). They are ways of increasing the probability of “hits” early in the game. After that, it’s a matter of vocabulary and logic to bring you home.

  1. Please don’t tweet me shell commands that can do this with less typing. I know they exist, but this was the most efficient for me because I could type it out with virtually no thought at all. 

  2. You may remember I did a similar thing a couple of years ago to create lists of words with 6–9 letters to help me cheat at Countdown.

Automating the annotation of PDFs

A few days ago, on the Automators podcast forum, thatchrisharper asked about a way to automatically add the filename to the first page of a PDF. While I knew of many tools that allow you to overlay one PDF on top of another, I didn’t know of any to directly add text to a specific spot of a PDF. But it was the sort of thing I’ve occasionally needed to do, so I went looking for a solution. This post, most of which is in my answer to thatchrisharper’s question, is what I found.

The solution comes from a combination of Ghostscript, which you can install through Homebrew, and pdfMark (or maybe pdfmark without the intercap—the naming isn’t consistent), a system created by Adobe and described this way:

The pdfmark operator is a PostScript-language extension that describes features that are present in PDF, but not in standard PostScript.

Basically, pdfMark was a way for Adobe’s Distiller application to add PDF-specific bits (like annotations) to PostScript files as it was converting them to PDF. It came out back in the 90s, when PostScript was well established, but PDF was still in its infancy. We can use Ghostscript in place of Distiller.

Let’s say we have a PDF that consists of several letter-sized pages, and we want to add some text centered in the top margin. If our original looks like this,

PDF before annotation

we want the annotated version to look like this,

PDF after annotation

where the annotation appears on the first page only. Here’s what we do:

First, create a text file (we’ll call it pdfmark.txt, but the name can be anything) with the following contents:

1:  [
2:  /Subtype /FreeText
3:  /SrcPg 1
4:  /Rect [206 758 406 774]
5:  /Color [1 1 .75]
6:  /DA (/HeBo 14 Tf 0 0 .5 rg)
7:  /Contents (My Annotation Text Here)
8:  /ANN pdfmark

This file can be saved anywhere, but for convenience we’ll assume it’s in the same folder as the original PDF, which I will cleverly name original.pdf.

Now we run this Ghostscript command,

gs -dBATCH -dNOPAUSE -dQUIET -sDEVICE=pdfwrite -sOutputFile=annotated.pdf  pdfmark.txt original.pdf

and we end up with a new PDF, annotated.pdf, with the annotation shown above.

You can probably figure out what each line of pdfmark.txt does, but let’s run through it anyway.

The opening bracket on Line 1 is the necessary start of every pdfMark command. If you go looking for the matching closing bracket, you won’t find one. If you want to know why there’s no closing bracket, you’ll have to ask Adobe. Seems like really dumb syntax to me.

Line 2 declares this mark to be of the FreeText subtype. There are over a dozen subtypes you can use; see page 16 of the manual.

Line 3 tells the annotation to appear on the first page of the output document. As far as I can tell, there’s no convenient way to extend this command to multiple pages. /SrcPg can only be followed by a single integer argument, so if you want the same thing on several pages, your pdfmark.txt file will have to have this command repeated for each page.

Lines 4 and 5 define the bounding box for the annotation and set its background color. PostScript and PDF coordinates are in points (1/72 inch) with the origin at the lower left corner of the page. Unlike a lot of graphics formats, but like most graphs you see in math class, the y-coordinate increases as you go up. A letter-sized page is 612 points wide and 792 points tall, so the bounding box in Line 4 is 200 points wide, centered left/right, and its top edge is ¼ inch down from the top of the page. Colors are defined by a red-green-blue triplet of numbers that run from 0 to 1. Black is 0 0 0 and white is 1 1 1. White is the default, so if you leave out Line 5, it’s equivalent to /Color [1 1 1].

Line 6 defines the font used in the annotation. /DA means default appearance, and the rest of the line tells the text to appear in 14-point Helvetica Bold with a dark blue color.

Line 7 defines the text of the annotation between the parentheses.

Finally, Line 8 identifies the type of pdfmark as an annotation.

(There’s a nice document with other examples of pdfMark commands and another Adobe reference manual.)
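Two of the points above lend themselves to a little code: working out the /Rect numbers for a centered box, and repeating the command when the same annotation is wanted on several pages. The following Python sketch does both. The function names are hypothetical, the page size defaults to letter, and the text is dropped into /Contents as-is, so it assumes the text has no unbalanced parentheses or backslashes.

```python
def centered_rect(width, height, top_margin, page_width=612, page_height=792):
    """Return [llx, lly, urx, ury] in points for a box centered left/right.

    top_margin is the gap between the top of the page and the top of the box.
    """
    llx = (page_width - width) / 2
    ury = page_height - top_margin
    return [llx, ury - height, llx + width, ury]

def freetext_marks(text, pages, rect):
    """Build one FreeText pdfmark command for each page number in pages."""
    rect_str = ' '.join(f'{v:g}' for v in rect)
    commands = []
    for page in pages:
        commands.append(
            '[\n'
            '/Subtype /FreeText\n'
            f'/SrcPg {page}\n'
            f'/Rect [{rect_str}]\n'
            '/DA (/HeBo 14 Tf)\n'
            f'/Contents ({text})\n'
            '/ANN pdfmark\n'
        )
    return ''.join(commands)

# A 200 x 16 point box whose top edge is 18 points (1/4 inch) below the top.
rect = centered_rect(width=200, height=16, top_margin=18)
print(freetext_marks('My Annotation Text Here', [1, 2, 3], rect))
```

With these inputs the rectangle works out to the same 206 758 406 774 used in pdfmark.txt, and the printed text could be saved to a file and passed to the Ghostscript command in place of pdfmark.txt.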

So how do we automate this? Fundamentally, we create a temporary file for the pdfMark commands, run the Ghostscript command, and then delete the temporary file. Here’s a quickly written script, annotatePDF:

 1:  #!/usr/bin/env python
 3:  import os
 4:  import subprocess
 5:  import sys
 6:  import tempfile
 8:  # Set the parameters
 9:  annText = sys.argv[1]
10:  originalPDF = sys.argv[2]
11:  annotatedPDF = sys.argv[3]
13:  # Build the pdfMark command
14:  pdfMarkCommand = f"""[
15:  /Subtype /FreeText
16:  /SrcPg 1
17:  /Rect [156 758 456 774]
18:  /DA (/HeBo 14 Tf)
19:  /Contents ({annText})
20:  /ANN pdfmark
21:  """
23:  # Create a temporary file for the pdfMark commands and write to it
24:  fh, fpath = tempfile.mkstemp()
25:  with os.fdopen(fh, 'w') as f:
26:    f.write(pdfMarkCommand)
28:  # Run the Ghostscript command to make the annotated file
29:  subprocess.run(['gs', '-dBATCH', '-dNOPAUSE', '-dQUIET', '-sDEVICE=pdfwrite', f'-sOutputFile={annotatedPDF}',  fpath, originalPDF])
31:  # Delete the pdfMark command file
32:  os.remove(fpath)

This is not a great script. No error handling, no options, and no way to automate the naming of the output file. But it works.

annotatePDF 'Hello, world!' original.pdf annotated.pdf
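Two of the shortcomings just mentioned are easy to sketch fixes for. The helpers below are hypothetical additions, not part of annotatePDF as written: one escapes the characters that have special meaning inside a PostScript string, so annotation text containing parentheses doesn’t break the pdfMark command, and the other derives an output filename from the input filename.

```python
import os

def ps_escape(text):
    """Escape backslashes and parentheses for a PostScript string literal."""
    # Backslashes go first so the ones added for parentheses aren't doubled.
    return (text.replace('\\', '\\\\')
                .replace('(', '\\(')
                .replace(')', '\\)'))

def default_output_name(original_pdf, suffix='-annotated'):
    """Turn a name like report.pdf into report-annotated.pdf."""
    root, ext = os.path.splitext(original_pdf)
    return f'{root}{suffix}{ext}'

print(ps_escape('Draft (rev. 2)'))
print(default_output_name('original.pdf'))
```

For basic error handling, subprocess.run accepts check=True, which raises subprocess.CalledProcessError when Ghostscript exits with a nonzero status.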

As you can see from Lines 14–21, I don’t really want a yellow box with dark blue text. That was just to show some of pdfMark’s features.

By the way, the annotations you add through pdfMark are just like annotations you add in Preview or PDFpen. They can be selected, moved around, and edited. Here’s what the output file looks like in Preview after clicking on the added text.

Annotation selected in Preview

How many Thursdays?

A couple of weeks ago, this question from ldebritto appeared in the Automators forum:

I was looking for a way to have Shortcuts peek into last month’s calendar and count how many Thursdays were that month (4 or 5).

Any ideas on how to make it?

Stephen Millard, who posts there under the name sylumer, answered the question with a straightforward shortcut that loops through the days of last month, counting the Thursdays it meets along the way. It’s the kind of solution I employ often: a brute force approach that can be written quickly and uses looping instead of excessive cleverness. And in this case, since there can’t be more than 31 trips around the loop, it will run extremely fast. It’s a good solution.

But it nagged at me. This blog has a long history of showing calendrical calculations, and I had written a post on a similar problem (finding months with five Fridays, Saturdays, and Sundays) eleven years ago. The scripts in that post had some brute force aspects, but they also took advantage of how months are structured to filter the loops. I wanted to do something similar for the “how many Thursdays” problem.

As ldebritto said in his question, every month has either four or five Thursdays (indeed, four or five of any weekday). Whether it’s four or five depends on the number of days in the month and the date of the first Thursday of the month. We’ll consider each of these pieces of information and how they affect the answer.

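Since every month has 28–31 days, we know that each month has four weeks plus 0–3 “extra” days. If the date of the first Thursday is less than or equal to the number of extra days, there will be five Thursdays in that month. For example, a 30-day month has two extra days, so it has five Thursdays only if its first Thursday falls on the 1st or 2nd, while a 28-day February has no extra days and always has exactly four.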

The logic is simple, but implementing it in Shortcuts takes a little effort because there’s no “date of the first Thursday” function. Instead, we’ll figure out how many days there are from the first of the month to the first Thursday. And because this is one less than the date of the first Thursday, the “less than or equal to” comparison we used above will turn into a simple “less than” comparison.
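Stripped of the Shortcuts details, the rule is small enough to write out in Python. This is just a sketch of the logic described above, with made-up function names; it shows both forms of the comparison, one using the date of the first Thursday and one using the number of days from the 1st to the first Thursday.

```python
def count_from_first_date(days_in_month, first_date):
    """Occurrences of a weekday, given the date (1 to 7) of its first occurrence."""
    extra_days = days_in_month - 28
    return 5 if first_date <= extra_days else 4

def count_from_offset(days_in_month, days_to_first):
    """Same rule, using days from the 1st to the first occurrence (0 to 6)."""
    extra_days = days_in_month - 28
    return 5 if days_to_first < extra_days else 4

# A 30-day month whose first Thursday is the 2nd: Thursdays on 2, 9, 16, 23, 30.
print(count_from_first_date(30, 2), count_from_offset(30, 1))
# A 31-day month whose first Thursday is the 7th: Thursdays on 7, 14, 21, 28.
print(count_from_first_date(31, 7), count_from_offset(31, 6))
```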

So how do we get the number of days from the first of the month to its first Thursday? We’ll need to use custom date formatting codes and do a little modulo arithmetic. Here’s the whole shortcut:

Step 1: Get the current date.
Step 2: Move to the start of this month. We call this magic variable StartOfThisMonth.
Step 3: Go back one month. We call this magic variable StartOfLastMonth.
Step 4: Get the number of days between StartOfLastMonth and StartOfThisMonth, which is the length of last month.
Step 5: Get the number of days over 4 weeks in last month. We call this magic variable ExtraDaysInMonth.
Step 6: Format StartOfLastMonth with the c code, which gives the day of the week as a number from 1 (Sunday in my locale) to 7 (Saturday).
Step 7: This is where we do the modulo arithmetic to get the number of days from the start of last month to last month’s first Thursday. We call this magic variable DaysToFirstThursday.
Step 8: If this is true…
Step 9: Return 5.
Step 10: Otherwise…
Step 11: Return 4.
Step 12: End If block.
Step 13: Show the answer.

The only tricky parts here are Steps 6 and 7. Custom date formatting patterns in Shortcuts follow Unicode Technical Standard #35, where c will give the day of the week as a digit from 1 to 7. In my locale Day 1 is Sunday and Day 7 is Saturday; it may be different where you live.

Step 7 uses the number we got in Step 6 and calculates the number of days from the start of the month to the first Thursday. This involves modulo 7 arithmetic, so the first thing we have to do is subtract 1 from the Step 6 result. Numbers in mod 7 go from 0 to 6, not 1 to 7. After subtraction, we’re in a day numbering system where Thursday is 4. To get the number of days from the start of the month to the first Thursday, we subtract the weekday number from 4, add 7 (to force a positive result), and then get the remainder (mod) after dividing by 7.

Let’s do a couple of examples. If I run the shortcut today, last month is September 2021, which started on a Wednesday. Step 6 returns 4, and Step 7 returns

(4 - (4 - 1) + 7) mod 7 = (4 - 3 + 7) mod 7 = 8 mod 7 = 1

which makes sense, as there’s one day from the start of September 2021 to the first Thursday of the month.

If you’re wondering why we bothered adding 7 and doing the mod stuff—why not just do 4 - (4 - 1) = 1?—think of what happens when we run the shortcut next month. October 2021 starts on a Friday. Step 6 returns 6, and Step 7 returns

(4 - (6 - 1) + 7) mod 7 = (4 - 5 + 7) mod 7 = 6 mod 7 = 6

and there are, in fact, 6 days from the start of October 2021 to the first Thursday of the month. Here, without the + 7 we would have gotten a result of -1, which is not the number of days from the start of October to its first Thursday. By including the + 7 and mod 7 terms, we have an equation that handles all cases.
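To double-check the arithmetic, here’s a Python sketch that mirrors the shortcut’s steps. It’s a translation made for illustration, not an export of the shortcut: the function name is made up, the extra days are taken to be the month length minus 28, and the weekday number is converted to the Sunday = 1 convention described above.

```python
from datetime import date, timedelta

def thursdays_last_month(today):
    """Count the Thursdays in the month before the one containing today."""
    start_of_this_month = today.replace(day=1)
    # Step back one day to land in last month, then move to its first day.
    start_of_last_month = (start_of_this_month - timedelta(days=1)).replace(day=1)

    days_in_last_month = (start_of_this_month - start_of_last_month).days
    extra_days_in_month = days_in_last_month - 28

    # isoweekday() runs Monday = 1 to Sunday = 7; convert to Sunday = 1
    # through Saturday = 7 to match the c format code in the author's locale.
    c = start_of_last_month.isoweekday() % 7 + 1

    # Same formula as Step 7: Thursday is 4 once the numbering starts at 0.
    days_to_first_thursday = (4 - (c - 1) + 7) % 7
    return 5 if days_to_first_thursday < extra_days_in_month else 4

print(thursdays_last_month(date(2021, 10, 15)))   # last month: September 2021
print(thursdays_last_month(date(2021, 11, 15)))   # last month: October 2021
```

Python’s % operator never returns a negative result for a positive modulus, so the + 7 isn’t strictly needed there, but it’s kept to match the formula in Step 7.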

This is not nearly as easy to understand as Stephen Millard’s solution, so I didn’t post it to the forum. On the other hand, his solution didn’t exercise my modulo muscles.

By the way, it didn’t take me two weeks to work this out. I had the shortcut written the day of the question, but I soon learned that my splitflow script for creating the table of Shortcuts steps you see above was broken. The Shortcuts team changed the look of the app, and the computer vision functions that splitflow uses weren’t able to figure out the top and bottom boundaries of the steps. So I put this post on ice until I had energy to update splitflow. It still needs some work—the padding above the steps shouldn’t be wider than the padding below—but it’s passable.

Semiautomated LaTeX tables

As a followup to my last post about automating the creation of Markdown tables, here’s a simple pair of functions that’ve been very helpful in making LaTeX tables quickly.

A few years ago, I wrote about how much I hated the syntax of LaTeX tables and how I was shifting to building tables as graphics files that I could insert using the \includegraphics{} command. In the early stages, I was doing this by hand, as outlined in that post, but after I developed a sense of the kind of spacing I liked, I translated that into a set of Python functions that used the ReportLab module to create decent-looking tables as PDF files. A Python solution made the most sense, as the data from which I made the tables typically came out of data analysis done in Python.

The functions I wrote worked well enough, but the overall system was more fussy than it should have been, and I realized I’m better at programming the manipulation of text than the manipulation of graphics. So I started thinking about ways to make LaTeX tables directly from Python data.

Here’s an artificial example that isn’t too far away from the kinds of tables I need to make. Let’s say we want a short table of (base 10) logarithms. We can generate the data—a list of values and a parallel list of their logs—like this:

from math import log10

x = [ (10+i)/10 for i in range(90) ]
lx = [ f'{log10(y):.4f}' for y in x ]

And what I want is a table that looks like this:

Log table

I use the booktabs package for making tables. I like its clean look, and I like its nice \addlinespace command for adding a little extra space between certain rows to make reading long tables easier. The code that produced the table above is

$x$ & $\log x$ & $x$ & $\log x$ & $x$ & $\log x$ \\
\midrule
1.0 & 0.0000 & 4.0 & 0.6021 & 7.0 & 0.8451 \\
1.1 & 0.0414 & 4.1 & 0.6128 & 7.1 & 0.8513 \\
1.2 & 0.0792 & 4.2 & 0.6232 & 7.2 & 0.8573 \\
1.3 & 0.1139 & 4.3 & 0.6335 & 7.3 & 0.8633 \\
1.4 & 0.1461 & 4.4 & 0.6435 & 7.4 & 0.8692 \\
\addlinespace
1.5 & 0.1761 & 4.5 & 0.6532 & 7.5 & 0.8751 \\
...
3.7 & 0.5682 & 6.7 & 0.8261 & 9.7 & 0.9868 \\
3.8 & 0.5798 & 6.8 & 0.8325 & 9.8 & 0.9912 \\
3.9 & 0.5911 & 6.9 & 0.8388 & 9.9 & 0.9956 \\
\caption{Partial logarithm table}

The boilerplate at the top and bottom is produced by a Keyboard Maestro macro. Except for the column alignment line and caption—which need to be set for each table—it never changes. The troublesome parts are the header and, especially, the body. For those, I use two functions, theader and tbody, to build the LaTeX code without having to touch the ampersand or backslash keys. To make the table in the same script as the data, I adjust the script to this:

from math import log10
from latex_tables import tbody, theader

x = [ (10+i)/10 for i in range(90) ]
lx = [ f'{log10(y):.4f}' for y in x ]
headers = ['$x$', r'$\log x$']*3
print(theader(headers))
print(tbody(x[:30], lx[:30], x[30:60], lx[30:60], x[60:], lx[60:], group=5))

The output is

$x$ & $\log x$ & $x$ & $\log x$ & $x$ & $\log x$ \\
\midrule
1.0 & 0.0000 & 4.0 & 0.6021 & 7.0 & 0.8451 \\
1.1 & 0.0414 & 4.1 & 0.6128 & 7.1 & 0.8513 \\
1.2 & 0.0792 & 4.2 & 0.6232 & 7.2 & 0.8573 \\
1.3 & 0.1139 & 4.3 & 0.6335 & 7.3 & 0.8633 \\
1.4 & 0.1461 & 4.4 & 0.6435 & 7.4 & 0.8692 \\
\addlinespace
1.5 & 0.1761 & 4.5 & 0.6532 & 7.5 & 0.8751 \\
...
3.7 & 0.5682 & 6.7 & 0.8261 & 9.7 & 0.9868 \\
3.8 & 0.5798 & 6.8 & 0.8325 & 9.8 & 0.9912 \\
3.9 & 0.5911 & 6.9 & 0.8388 & 9.9 & 0.9956 \\

which, as you can see, is exactly what goes between the top and bottom boilerplate.
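The hand-written slices in the tbody call (x[:30], x[30:60], and so on) could be generated instead. Here’s a sketch of a helper for that; split_into_columns is a hypothetical name and is not part of latex_tables.py. It returns the column lists in the order tbody expects, so the call would become tbody(*split_into_columns(x, lx, ncols=3), group=5).

```python
def split_into_columns(*series, ncols):
    """Split parallel lists into ncols side-by-side groups of columns.

    For two lists x and lx and ncols=3, the result is the flat list
    [x[:n], lx[:n], x[n:2*n], lx[n:2*n], x[2*n:], lx[2*n:]],
    where n is the number of rows needed per group.
    """
    nrows = -(-len(series[0]) // ncols)    # ceiling division
    cols = []
    for k in range(ncols):
        for s in series:
            cols.append(s[k * nrows:(k + 1) * nrows])
    return cols

x = list(range(7))
lx = [str(v) for v in x]
for col in split_into_columns(x, lx, ncols=3):
    print(col)
```

When the data doesn’t divide evenly, the last group comes out shorter than the others, which is the situation tbody already handles by padding short columns with blanks.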

The functions are defined in the file latex_tables.py, which is saved in my site-packages directory. Here’s the code:

def tbody(*cols, group=0):
  "Given the columns, return the body of a LaTeX table."

  # Add blanks at the ends of short columns
  lens = [ len(c) for c in cols ]
  nrows = max(lens)
  cols = [ c + [' ']*(nrows - len(c)) for c in cols ]

  # Assemble the rows of the table
  rows = []
  for i in range(nrows):
    if (group > 0) and (i > 0) and (i % group == 0):
      rows.append(r'\addlinespace')
    row = [ str(c[i]) for c in cols ]
    rows.append(' & '.join(row) + r' \\')

  return '\n'.join(rows)

def theader(hcols):
  "Given a list of column headers, return the header lines of a LaTeX table."

  # Figure out the maximum number of lines in the header
  hcols = [ x.splitlines() for x in hcols ]
  maxlines = max(len(x) for x in hcols)
  hcols = [ [' ']*(maxlines - len(x)) + x for x in hcols ]

  # Assemble the rows of the header
  rows = []
  for i in range(maxlines):
    row = [ str(c[i]) for c in hcols ]
    rows.append(' & '.join(row) + r' \\')

  return '\n'.join(rows) + '\n\\midrule'

Like the example log table, most of the tables I make come from lists of data that are meant to go into the columns of the table. So the main feature of tbody is assembling the rows from those lists. The other significant feature is placing the \addlinespace command according to the group parameter.

The theader function is even simpler. The only interesting thing about it is how it handles multiline headers. If it’s called like this,

print(theader(['One line', 'Two\nlines', 'And\nthree\nlines']))

the output will be

  &   & And \\
  & Two & three \\
One line & lines & lines \\
\midrule

which will produce a table that has a header that looks like this:

Multiline header

The header lines are bottom-aligned instead of top-aligned. The main purpose of theader is to automate that alignment.

Trickier table arrangements still have me hauling out my copy of Kopka & Daly, but theader and tbody do the bulk of the work even in those cases.