Holes in the Wolfram Knowledgebase

Wolfram touts its Knowledgebase as “the world’s largest and broadest repository of computable knowledge” and “carefully curated expert knowledge directly derived from primary sources.” There’s certainly a lot in there, but there are some inexplicable holes that could be filled with little effort.

I used the Knowledgebase last year in my post about orbital curvature. Things like

Entity["Planet", "Earth"]["AverageOrbitDistance"]

and

Entity["Planet", "Earth"]["Mass"]

pulled information out of the Knowledgebase so I didn’t have to look it up outside of Mathematica and paste it into my code. Very convenient.

But I learned a few months ago that another part of the Knowledgebase was missing data, which could get in the way of other types of calculation. I was testing out the kind of state- and county-level information I could access, and my initial explorations focused on where I live: DuPage County, Illinois.

To be sure, the Knowledgebase has lots of info on DuPage County. It knows, for example, the area, the population, the per capita income, and the number of annual births and deaths. But it doesn’t know the county seat, which I would think is easier to determine and enter into the Knowledgebase than most of the other stuff—not to mention more stable than transient figures like population and income.

Broadening my exploration to all the counties in Illinois, I learned that of our 102 counties, Wolfram knew the capitals of all of them except DuPage and DeKalb counties. So this command

AdministrativeDivisionData[
 Entity["AdministrativeDivision", {"CookCounty", "Illinois", 
  "UnitedStates"}], "CapitalName"]

returns

Chicago

as expected, while both of these commands,

AdministrativeDivisionData[
 Entity["AdministrativeDivision", {"DuPageCounty", "Illinois", 
  "UnitedStates"}], "CapitalName"]

and

AdministrativeDivisionData[
 Entity["AdministrativeDivision", {"DeKalbCounty", "Illinois", 
  "UnitedStates"}], "CapitalName"]

return

Missing["NotAvailable"]

This is not exactly obscure information, and there are reliable sources from which to get it. Here, for example, is a map from the Illinois Blue Book, an official publication of the state.

Illinois county map

As you (and the folks at Wolfram) can see, the DuPage and DeKalb county seats are Wheaton and Sycamore, respectively.

I sent an email to Wolfram about the missing county seat data and got a boilerplate reply saying their development team would review it. That was in August; the Knowledgebase still returns Missing["NotAvailable"].

Recently, I decided to look for missing county seats in every state. Here are all the counties (or administrative divisions that Wolfram treats like counties) that are missing their capitals in the Knowledgebase:

County w/o seat                      State
DeKalb County                        Alabama
Aleutians West                       Alaska
Bethel                               Alaska
Chugach                              Alaska
Copper River                         Alaska
Dillingham                           Alaska
Hoonah-Angoon                        Alaska
Nome                                 Alaska
Prince of Wales-Hyder                Alaska
Petersburg                           Alaska
Skagway                              Alaska
Southeast Fairbanks                  Alaska
Kusilvak Census Area                 Alaska
Wrangell                             Alaska
Yukon-Koyukuk                        Alaska
Mono County                          California
Sierra County                        California
Conejos County                       Colorado
Wakulla County                       Florida
Columbia County                      Georgia
Crawford County                      Georgia
DeKalb County                        Georgia
Echols County                        Georgia
Kalawao County                       Hawaii
Owyhee County                        Idaho
DeKalb County                        Illinois
DuPage County                        Illinois
DeKalb County                        Indiana
LaPorte County                       Indiana
Plaquemines Parish                   Louisiana
St. James Parish                     Louisiana
Keweenaw County                      Michigan
Lake of the Woods County             Minnesota
DeSoto County                        Mississippi
Franklin County                      Mississippi
DeKalb County                        Missouri
McPherson County                     Nebraska
Esmeralda County                     Nevada
Eureka County                        Nevada
Lincoln County                       Nevada
Storey County                        Nevada
Burlington County                    New Jersey
Mora County                          New Mexico
Rio Arriba County                    New Mexico
Bronx County (The Bronx)             New York
Broome County                        New York
Kings County (Brooklyn)              New York
New York County (Manhattan)          New York
Queens County (Queens)               New York
Richmond County (Staten Island)      New York
Camden County                        North Carolina
Currituck County                     North Carolina
Hyde County                          North Carolina
Dunn County                          North Dakota
Bristol County                       Rhode Island
Kent County                          Rhode Island
Buffalo County                       South Dakota
DeKalb County                        Tennessee
Borden County                        Texas
Glasscock County                     Texas
Kenedy County                        Texas
King County                          Texas
Loving County                        Texas
McMullen County                      Texas
Montague County                      Texas
Palo Pinto County                    Texas
Young County                         Texas
Rich County                          Utah
Alexandria (independent city)        Virginia
Amelia County                        Virginia
Bath County                          Virginia
Bland County                         Virginia
Bristol (independent city)           Virginia
Buckingham County                    Virginia
Buena Vista (independent city)       Virginia
Charles City County                  Virginia
Charlottesville (independent city)   Virginia
Chesapeake (independent city)        Virginia
Colonial Heights (independent city)  Virginia
Covington (independent city)         Virginia
Cumberland County                    Virginia
Danville (independent city)          Virginia
Dinwiddie County                     Virginia
Emporia (independent city)           Virginia
Fairfax (independent city)           Virginia
Falls Church (independent city)      Virginia
Fluvanna County                      Virginia
Franklin (independent city)          Virginia
Fredericksburg (independent city)    Virginia
Galax (independent city)             Virginia
Goochland County                     Virginia
Hampton (independent city)           Virginia
Hanover County                       Virginia
Harrisonburg (independent city)      Virginia
Hopewell (independent city)          Virginia
Isle of Wight County                 Virginia
King and Queen County                Virginia
King George County                   Virginia
King William County                  Virginia
Lancaster County                     Virginia
Lexington (independent city)         Virginia
Lunenburg County                     Virginia
Lynchburg (independent city)         Virginia
Manassas (independent city)          Virginia
Manassas Park (independent city)     Virginia
Martinsville (independent city)      Virginia
Mathews County                       Virginia
Middlesex County                     Virginia
Nelson County                        Virginia
New Kent County                      Virginia
Newport News (independent city)      Virginia
Norfolk (independent city)           Virginia
Northumberland County                Virginia
Norton (independent city)            Virginia
Nottoway County                      Virginia
Petersburg (independent city)        Virginia
Poquoson (independent city)          Virginia
Portsmouth (independent city)        Virginia
Powhatan County                      Virginia
Prince George County                 Virginia
Radford (independent city)           Virginia
Richmond County                      Virginia
Roanoke County                       Virginia
Salem (independent city)             Virginia
Stafford County                      Virginia
Staunton (independent city)          Virginia
Suffolk (independent city)           Virginia
Sussex County                        Virginia
Virginia Beach (independent city)    Virginia
Waynesboro (independent city)        Virginia
Williamsburg (independent city)      Virginia
Winchester (independent city)        Virginia

Quite a list. Now there are legitimate (or at least arguable) reasons some of these counties are missing their county seat:

  1. Some counties really don’t have a county seat. Kalawao County in Hawaii, for example.
  2. Some administrative districts in Alaska don’t have a capital. Alaska has boroughs rather than counties, and one of them, called the Unorganized Borough,[1] is further subdivided into “census areas,” none of which have a capital.
  3. Virginia has, in addition to counties, “independent cities,” which appear to be at the same level as counties.[2] While I think Wolfram would be justified in saying these independent cities are their own capitals, it’s decided to say the capitals are missing, which is a legitimate choice, too.
  4. Five counties in New York match up with the boroughs of New York City. Pretty hard to choose the capital of a portion of a city, so these counties also have missing capitals.

But most of the counties with missing county seats are like DuPage and DeKalb counties—regular counties with regular county seats that are just not included in the Knowledgebase, despite them being easy to look up and verify. There are fewer than 100 of them. I don’t know why they’re missing, but filling in missing values like this is a pretty standard data cleaning operation. And as I said earlier, this is pretty much a one-time operation; counties just don’t change seats very often.
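As a rough illustration of that kind of cleanup, here is a minimal Python sketch. It is not how Wolfram maintains the Knowledgebase. The dictionary names and the `fill_missing_seats` helper are invented for the example, `None` stands in for Missing["NotAvailable"], and the three counties and their seats are the ones discussed above.

```python
# What the Knowledgebase returned in the queries above (None = Missing["NotAvailable"]).
knowledgebase_seats = {
    "Cook County": "Chicago",
    "DuPage County": None,
    "DeKalb County": None,
}

# Seats read off the Illinois Blue Book map mentioned earlier.
blue_book_seats = {
    "DuPage County": "Wheaton",
    "DeKalb County": "Sycamore",
}

def fill_missing_seats(seats, reference):
    """Return a copy of seats with None entries filled from reference when possible."""
    return {
        county: seat if seat is not None else reference.get(county)
        for county, seat in seats.items()
    }

filled = fill_missing_seats(knowledgebase_seats, blue_book_seats)
still_missing = [county for county, seat in filled.items() if seat is None]
print(filled)
print(still_missing)
```

Anything left in `still_missing` would be a county the reference source doesn’t cover, which is where the genuinely seatless cases from the list above would end up.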

I haven’t sent this list off to Wolfram. If the people on its development team can’t be bothered to clean the data in their own home state, how likely is it that they’ll fill in all the other states’ data? But they should.

Update 4 Dec 2023 11:45 AM
Chon Torres on Mastodon informed me that the two California counties, Mono and Sierra, do have county seats, but they’re unincorporated, and that might explain why they’re missing from the Knowledgebase. That’s a good explanation, but I would argue with Wolfram that it’s a poor reason for excluding a capital. A county seat should have county government offices—Chon mentioned that he’s been at the Sierra County Courthouse in Downieville—but I don’t see why it needs a municipal government.

Adding to the madness of Virginia government, Sam Davies told me that an independent city can also be the county seat of a county that it’s been carved out of. The example he gave was Charlottesville, which is both an independent city and the capital of Albemarle County. To me, a more disturbing example is Fairfax, which is an independent city but also the county seat of (yes, that’s right) Fairfax County.

Thanks to Chon and Sam for the local government expertise.


  1. Which is apparently not actually a borough itself, despite its name. This is more than I wanted to know about Alaska’s government. 

  2. As with the Unorganized Borough in Alaska, this is more than I wanted to know about Virginia’s government. 


Dates, triangles, and Python

Tuesday morning, which was November 28, John D. Cook started a post with

The numbers in today’s date—11, 28, and 23—make up the sides of a triangle. This doesn’t always happen; the two smaller numbers have to add up to more than the larger number.

He went on to figure out the angles of the plane triangle with side lengths of 11, 28, and 23 and then extended his analysis to triangles on a sphere and a pseudosphere. But I got hung up on the quoted paragraph. Which days can and can’t be the sides of a triangle? And how does the number of such “triangle days” change from year to year?

So I wrote a little Python to answer these questions.

python:
 1:  from datetime import date, timedelta
 2:  
 3:  def isTriangleDay(dt):
 4:    "Can the year, month, and day of the given date be the sides of a triangle?"
 5:    y = dt.year % 100
 6:    m = dt.month
 7:    d = dt.day
 8:    sides = sorted(int(x) for x in (y, m, d))
 9:    return sides[0] + sides[1] > sides[2]
10:  
11:  def allDays(y):
12:    "Return a list of all days in the given year."
13:    start = date(y, 1, 1)
14:    end = date(y, 12, 31)
15:    numDays = (end - start).days + 1
16:    return [ start + timedelta(days=n) for n in range(numDays) ]
17:  
18:  def triangleDays(y):
19:    "Return a list of all the triangle days in the given year."
20:    return [x for x in allDays(y) if isTriangleDay(x) ]

isTriangleDay is a Boolean function that implements the test Cook described for a datetime.date object. Note that Line 5 extracts just the last two digits of the year, which is what Cook intends. You could, I suppose, change Line 9 to

python:
 9:    return sides[0] + sides[1] >= sides[2]

if you want to accept degenerate triangles, where the three sides collapse onto a single line. I don’t.
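To make the difference concrete, here is a small sketch that puts both tests behind one switch. The name `is_triangle_day` and the `allow_degenerate` parameter are assumptions made for this example and are not part of the code above. November 12, 2023 is a handy borderline case because 11 + 12 equals 23 exactly.

```python
from datetime import date

def is_triangle_day(dt, allow_degenerate=False):
    """Check whether (two-digit year, month, day) can be the sides of a triangle."""
    a, b, c = sorted((dt.year % 100, dt.month, dt.day))
    if allow_degenerate:
        return a + b >= c   # flat "triangles" count
    return a + b > c        # strict triangle inequality

borderline = date(2023, 11, 12)   # 11 + 12 == 23
print(is_triangle_day(borderline))                         # False
print(is_triangle_day(borderline, allow_degenerate=True))  # True
print(is_triangle_day(date(2023, 11, 13)))                 # True
```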

The allDays function uses a list comprehension to return a list of all the days in a given year, and triangleDays calls isTriangleDay to filter the results of allDays down to just triangle days. I think both of these functions are self-explanatory.

With these functions defined, I got all the triangle days for 2023 via

python:
print('\n'.join(x.strftime('%Y-%m-%d') for x in triangleDays(2023)))

which returned this list of dates (after reshaping into four columns):

2023-01-23     2023-06-27     2023-09-19     2023-11-17
2023-02-22     2023-06-28     2023-09-20     2023-11-18
2023-02-23     2023-07-17     2023-09-21     2023-11-19
2023-02-24     2023-07-18     2023-09-22     2023-11-20
2023-03-21     2023-07-19     2023-09-23     2023-11-21
2023-03-22     2023-07-20     2023-09-24     2023-11-22
2023-03-23     2023-07-21     2023-09-25     2023-11-23
2023-03-24     2023-07-22     2023-09-26     2023-11-24
2023-03-25     2023-07-23     2023-09-27     2023-11-25
2023-04-20     2023-07-24     2023-09-28     2023-11-26
2023-04-21     2023-07-25     2023-09-29     2023-11-27
2023-04-22     2023-07-26     2023-09-30     2023-11-28
2023-04-23     2023-07-27     2023-10-14     2023-11-29
2023-04-24     2023-07-28     2023-10-15     2023-11-30
2023-04-25     2023-07-29     2023-10-16     2023-12-12
2023-04-26     2023-08-16     2023-10-17     2023-12-13
2023-05-19     2023-08-17     2023-10-18     2023-12-14
2023-05-20     2023-08-18     2023-10-19     2023-12-15
2023-05-21     2023-08-19     2023-10-20     2023-12-16
2023-05-22     2023-08-20     2023-10-21     2023-12-17
2023-05-23     2023-08-21     2023-10-22     2023-12-18
2023-05-24     2023-08-22     2023-10-23     2023-12-19
2023-05-25     2023-08-23     2023-10-24     2023-12-20
2023-05-26     2023-08-24     2023-10-25     2023-12-21
2023-05-27     2023-08-25     2023-10-26     2023-12-22
2023-06-18     2023-08-26     2023-10-27     2023-12-23
2023-06-19     2023-08-27     2023-10-28     2023-12-24
2023-06-20     2023-08-28     2023-10-29     2023-12-25
2023-06-21     2023-08-29     2023-10-30     2023-12-26
2023-06-22     2023-08-30     2023-10-31     2023-12-27
2023-06-23     2023-09-15     2023-11-13     2023-12-28
2023-06-24     2023-09-16     2023-11-14     2023-12-29
2023-06-25     2023-09-17     2023-11-15     2023-12-30
2023-06-26     2023-09-18     2023-11-16     2023-12-31

That’s 136 triangle days for this year. To see how this count changes from year to year, I ran

python:
for y in range(2000, 2051):
  print(f'{y}   {len(triangleDays(y)):3d}')

which returned

2000     0
2001    12
2002    34
2003    54
2004    72
2005    88
2006   102
2007   114
2008   124
2009   132
2010   138
2011   142
2012   144
2013   144
2014   144
2015   144
2016   144
2017   144
2018   144
2019   144
2020   144
2021   142
2022   140
2023   136
2024   132
2025   127
2026   120
2027   113
2028   104
2029    93
2030    82
2031    72
2032    61
2033    51
2034    41
2035    33
2036    25
2037    19
2038    13
2039     8
2040     5
2041     2
2042     1
2043     0
2044     0
2045     0
2046     0
2047     0
2048     0
2049     0
2050     0

I knew there was no point in checking on years later in the century. Once the two-digit year reaches 31 it is the longest side, and the month and day can add up to at most 12 + 31 = 43, so every year after 2042 has no triangle days. As you can see, the 2010s were the peak decade for triangle days. We’re now in the early stages of a 20-year decline.
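The 2042 cutoff can be checked with a few lines of Python. This is a sketch with a hypothetical helper name, `max_month_plus_day`. It finds the largest month-plus-day sum in a year, which is the most the two shorter sides can total once the two-digit year is the longest side.

```python
from datetime import date, timedelta

def max_month_plus_day(year):
    """Largest value of month + day over every date in the given year."""
    d = date(year, 1, 1)
    best = 0
    while d.year == year:
        best = max(best, d.month + d.day)
        d += timedelta(days=1)
    return best

print(max_month_plus_day(2043))  # 43, reached on December 31
```

Since 43 is greater than 42 but not greater than 43, December 31, 2042 is the century’s last triangle day under the strict test.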

After doing this, I looked back at my code and decided that most serious Python programmers wouldn’t have done it the way I did. Instead of functions that returned lists, they would build allDays and triangleDays as iterators.[1] Not because there’s any need to save space (the space used by 366 datetime.date objects is hardly even noticeable) but because that’s more the current style.

So to make myself feel more like a real Pythonista, I rewrote the code like this:

python:
from datetime import date, timedelta

def isTriangleDay(dt):
  "Can the year, month, and day of the given date be the sides of a triangle?"
  y = dt.year % 100    # two-digit year
  m = dt.month
  d = dt.day
  sides = sorted(int(x) for x in (y, m, d))
  # Strict triangle inequality: the two shorter sides must exceed the longest.
  return sides[0] + sides[1] > sides[2]

def allDays(y):
  "Iterator for all days in the given year."
  d = date(y, 1, 1)
  end = date(y, 12, 31)
  while d <= end:
    yield d
    d = d + timedelta(days=1)

def triangleDays(y):
  "Iterator for all the triangle days in the given year."
  return filter(isTriangleDay, allDays(y))

isTriangleDay is unchanged, but allDays now works its way through the days of the year with a while loop and the yield statement, and triangleDays uses the filter function to iterate through just the triangle days.

Using these functions is basically the same as using the list-based versions, except that you can’t pass an iterator to len. So determining the number of triangle days over a range of years can be done either by converting the iterator to a list before passing it to len,

python:
for y in range(2000, 2051):
  print(f'{y}   {len(list(triangleDays(y))):3d}')

or by using the sum command with an argument that produces a one for each element of triangleDays,

python:
for y in range(2000, 2051):
  print(f'{y}    {sum(1 for x in triangleDays(y))}')

The former sort of defeats the purpose of using an iterator, so I guess it’s better practice to use the latter, even though I find it weird looking.

It may well be that my perception of “real” Python programmers is wrong and they wouldn’t bother with yield and filter in such a piddly little problem as this. But at least I got some practice with them.


  1. A confession: I find it hard to distinguish between the proper uses of the terms generator and iterator. My sense is that generators provide a way of creating iterators. So once the function is written, do you have a generator, an iterator, or both? 
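One way to probe the footnote’s question is to ask Python itself. In the sketch below, `countdown` is a made-up example function. The checks use only the standard library’s `inspect` and `collections.abc` modules.

```python
import inspect
from collections.abc import Generator, Iterator

def countdown(n):
    """A generator function: calling it returns a generator object."""
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
evens = filter(lambda k: k % 2 == 0, range(6))

print(inspect.isgeneratorfunction(countdown))  # True: the def is a generator function
print(isinstance(gen, Generator))    # True: the call result is a generator
print(isinstance(gen, Iterator))     # True: every generator is also an iterator
print(isinstance(evens, Iterator))   # True: filter returns an iterator
print(isinstance(evens, Generator))  # False: but not a generator
```

In these terms, a function written with yield is a generator function, and what it returns when called is a generator, which is one kind of iterator. The filter-based triangleDays returns an iterator that isn’t a generator.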


A couple of game followups

Here are some little tricks associated with Wordle and Conlextions, the Connections-like puzzle put out by Lex Friedman.

Last week, I mentioned that I needed a new starting guess for Wordle, and I wrote about how I used some simple command-line tools to see if the word I was considering was an appropriate choice. In a nutshell, I had been using IRATE as my first guess, but because it was the answer recently I wanted to try a new initial guess that might be the answer in the future. ALTER seemed like a good choice, but I wanted to make sure it hadn’t already been used. I had a list of all the answers in chronological order in a file named answers.txt, and the most recent answer was DWELT.

My solution involved three separate commands. All three involved grep and one included head in a pipeline. Reader Leon Cowle was unsatisfied with this and began casting about for a cleaner solution. Here’s the single command he came up with to replace the three I used:

egrep 'alter|dwelt' answers.txt

The output was

dwelt
alter

which told me that ALTER was an answer and that it would come after DWELT. This was everything I needed to know in one step. Beautiful!
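The same idea carries over to Python. This is only a sketch: `matches_in_order` is a hypothetical helper, and the sample list is stand-in data in which only the relative order of dwelt and alter comes from the output above. The other entries are placeholders, not real answers.

```python
import re

def matches_in_order(lines, words):
    """Return the lines that match any of the words, in their original order."""
    pattern = re.compile("|".join(re.escape(w) for w in words))
    return [line for line in lines if pattern.search(line)]

# Placeholder data standing in for the contents of answers.txt.
sample_answers = ["aaaaa", "dwelt", "bbbbb", "alter", "ccccc"]
print(matches_in_order(sample_answers, ["alter", "dwelt"]))  # ['dwelt', 'alter']
```

With a real file, the lines could be read with `open("answers.txt").read().splitlines()`. As with the egrep command, the order of the output shows which word comes first.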

You might argue that for a one-off like this, the solution that occurs to you first is the best because it takes the least amount of your time. Generally speaking, I agree with that, and I’m not unhappy with my three-command solution. But Leon has given me the best of both worlds. I got to have my own inefficient solution that I thought of quickly and I got to learn from his more elegant solution. Knowing which of two text strings appears first in a file is something I’m pretty sure I’ve had to do before and will have to do again. Now I have a simple and effective solution to pull out of my toolbox. Thanks, Leon!

As for Conlextions, I’ve been playing it for a few weeks and recommend it to anyone who likes the NY Times Connections game but is finding it a little too easy. I’ve been playing Connections since early summer and while it has gotten more difficult, the sense of accomplishment I get in solving it with no mistakes is wearing off. Conlextions is more diabolical and more satisfying when you solve it in four guesses.

I have only three gripes with Conlextions:

  1. I don’t think the t belongs in its name.
  2. Lex’s tendency to define groups as “words that begin with S,” or some other letter, is frustrating as hell because it’s both ridiculously easy and a connection I almost never see.
  3. The info that you share with others when you solve the puzzle includes the time it took. I’m a slow and methodical player, certain that Lex has laid so many traps that I need to think through all the possibilities three or four times before committing myself to any grouping. Including the time I spend is embarrassing, especially when I see the solution times of people like Dan Moren and Greg Pierce.

Conlextions solution

In protest of this prejudice against the excessively careful, I’ve recently taken to deleting the solve time from my posts on Mastodon. And to avoid the tedium of backspacing, I made this Shortcut that does the deleting for me:

Shortcut for deleting solution time text from clipboard

It searches the clipboard for a linefeed, the word “Solve,” and all the text after that. It replaces that with the empty string and puts the updated text back onto the clipboard.
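For readers who don’t use Shortcuts, the same search-and-replace can be sketched in Python. This is an approximation of the behavior described, not the Shortcut itself. The `strip_solve_time` name and the sample share text are made up, and the real Conlextions share format may differ.

```python
import re

def strip_solve_time(text):
    """Remove a linefeed, the word 'Solve', and everything after it."""
    # DOTALL lets '.' match newlines, so the match runs to the end of the text.
    return re.sub(r"\nSolve.*", "", text, flags=re.DOTALL)

# Invented share text for illustration only.
shared = "Conlextions result\nrow one\nrow two\nSolve time: 12:34"
print(strip_solve_time(shared))
```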

So now when I want to post my Conlextions result to Mastodon, I tap the “Share these results” button, press and hold the side button on my phone to activate Siri, and say “Delete Time.” The name of the Shortcut seems to be distinct enough that Siri hasn’t misinterpreted it yet.

I would not normally use Shortcuts for an automation like this. Keyboard Maestro would allow me to invoke it with a keystroke and could also do the pasting. But since I always play Conlextions on my phone, Shortcuts was the best option.


Nearly cheating

Last week I solved Wordle in a single guess. My go-to first guess, IRATE, finally came in after months (over a year, I think) of use. So what do I do now? Seems like a good time to switch my starting guess.

Before we go any further, I should mention, for those of you who don’t commit every utterance of mine to memory and are thinking “There was no IRATE last week,” that I don’t play the NY Times version of Wordle. Back in early 2022, right after the Times bought Wordle, I downloaded the original version and set it up on a server I control. That’s the game my family and I have been playing ever since. Overall, I’d say this was unnecessary. The Times hasn’t screwed up the game the way I thought it would,[1] but it’s too late now for us to change.

Based on letter frequency tables, I figured ALTER would be a good new initial guess. But because I was so delighted when IRATE came up, I’d like to choose a word that

  1. Is among the list of 2315 words that are answers.
  2. Hasn’t been an answer already.

And I want to see if ALTER passes these two tests without spoiling myself with other answers, especially those that may be coming up soon.

I have the data to do this if I’m careful. In order to build my wordle script, I needed to pull out all the answers and acceptable guesses from the original Wordle JavaScript source code. I was able to do this nearly blind—I saw just the first and last couple of words in each list and have long since forgotten them—and save them to three files: guesses.txt, answers.txt, and wordle.txt, the last of which is a blending of the two others.

Some simple shell commands got me what I wanted. First, the simple part:

grep alter answers.txt

returned alter, which confirmed my belief that ALTER is an answer and didn’t reveal any other answers. Checking whether it had already been an answer was only a little trickier.

The answers.txt file has the answer words in chronological order, one per line. To figure out where I currently am in that list, I got the line number of yesterday’s word, DWELT.

grep -n dwelt answers.txt

The -n option causes the line number of the found text to be printed along with the text:

878:dwelt

With this information, I can now see if ALTER has already been the answer:

head -n 878 answers.txt | grep alter

The head command outputs just the first 878 lines of answers.txt and the grep command looks for alter in that text. It returned nothing, so ALTER has not been an answer so far. It passes my test and will be my first guess from now on.
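The two-step check translates to Python roughly as follows. The `already_used` helper is hypothetical and the sample list is placeholder data, with only dwelt and alter taken from the text above. Unlike grep, the `in` test here matches whole lines rather than substrings, which makes no difference for a file of one five-letter word per line.

```python
def already_used(lines, candidate, latest):
    """True if candidate appears at or before latest in a chronological list."""
    cutoff = lines.index(latest) + 1      # like the line number from grep -n
    return candidate in lines[:cutoff]    # like head -n cutoff piped to grep

# Placeholder data standing in for the contents of answers.txt.
sample_answers = ["aaaaa", "dwelt", "bbbbb", "alter"]
print(already_used(sample_answers, "alter", "dwelt"))  # False
```

Like the shell version, this reports only whether the word has been used, not where it sits in the rest of the list.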

What I like about this solution is that I was able to run my test quickly without seeing any other answer words and without tipping myself off to when ALTER will appear, which I could have easily done with

grep -n alter answers.txt

I recognize that some of you will read this and think I’ve crossed the line of Wordle ethics into outright cheating. You can keep your opinions to yourself.


  1. Although I recall some outrage about a year ago when FEAST was the too-on-the-nose word for Thanksgiving Day.