Holes in the Wolfram Knowledgebase
December 3, 2023 at 3:02 PM by Dr. Drang
Wolfram touts its Knowledgebase as “the world’s largest and broadest repository of computable knowledge” and “carefully curated expert knowledge directly derived from primary sources.” There’s certainly a lot in there, but there are some inexplicable holes that could be filled with little effort.
I used the Knowledgebase last year in my post about orbital curvature. Things like
Entity["Planet", "Earth"]["AverageOrbitDistance"]
and
Entity["Planet", "Earth"]["Mass"]
pulled information out of the Knowledgebase so I didn’t have to look it up outside of Mathematica and paste it into my code. Very convenient.
But I learned a few months ago that another part of the Knowledgebase was missing data, which could get in the way of other types of calculation. I was testing out the kind of state- and county-level information I could access, and my initial explorations focused on where I live: DuPage County, Illinois.
To be sure, the Knowledgebase has lots of info on DuPage County. It knows, for example, the area, the population, the per capita income, and the number of annual births and deaths. But it doesn’t know the county seat, which I would think is easier to determine and enter into the Knowledgebase than most of the other stuff—not to mention more stable than transient figures like population and income.
Broadening my exploration to all the counties in Illinois, I learned that of our 102 counties, Wolfram knew the capitals of all of them except DuPage and DeKalb counties. So this command
AdministrativeDivisionData[
Entity["AdministrativeDivision", {"CookCounty", "Illinois",
"UnitedStates"}], "CapitalName"]
returns
Chicago
as expected, while both of these commands,
AdministrativeDivisionData[
Entity["AdministrativeDivision", {"DuPageCounty", "Illinois",
"UnitedStates"}], "CapitalName"]
and
AdministrativeDivisionData[
Entity["AdministrativeDivision", {"DeKalbCounty", "Illinois",
"UnitedStates"}], "CapitalName"]
return
Missing["NotAvailable"]
This is not exactly obscure information, and there are reliable sources from which to get it. Here, for example, is a map from the Illinois Blue Book, an official publication of the state.
As you (and the folks at Wolfram) can see, the DuPage and DeKalb county seats are Wheaton and Sycamore, respectively.
I sent an email to Wolfram about the missing county seat data and got a boilerplate reply saying their development team would review it. That was in August; the Knowledgebase still returns `Missing["NotAvailable"]`.
Recently, I decided to look for missing county seats in every state. Here are all the counties (or administrative divisions that Wolfram treats like counties) that are missing their capitals in the Knowledgebase:
County w/o seat | State |
---|---|
DeKalb County | Alabama |
Aleutians West | Alaska |
Bethel | Alaska |
Chugach | Alaska |
Copper River | Alaska |
Dillingham | Alaska |
Hoonah-Angoon | Alaska |
Nome | Alaska |
Prince of Wales-Hyder | Alaska |
Petersburg | Alaska |
Skagway | Alaska |
Southeast Fairbanks | Alaska |
Kusilvak Census Area | Alaska |
Wrangell | Alaska |
Yukon-Koyukuk | Alaska |
Mono County | California |
Sierra County | California |
Conejos County | Colorado |
Wakulla County | Florida |
Columbia County | Georgia |
Crawford County | Georgia |
DeKalb County | Georgia |
Echols County | Georgia |
Kalawao County | Hawaii |
Owyhee County | Idaho |
DeKalb County | Illinois |
DuPage County | Illinois |
DeKalb County | Indiana |
LaPorte County | Indiana |
Plaquemines Parish | Louisiana |
St. James Parish | Louisiana |
Keweenaw County | Michigan |
Lake of the Woods County | Minnesota |
DeSoto County | Mississippi |
Franklin County | Mississippi |
DeKalb County | Missouri |
McPherson County | Nebraska |
Esmeralda County | Nevada |
Eureka County | Nevada |
Lincoln County | Nevada |
Storey County | Nevada |
Burlington County | New Jersey |
Mora County | New Mexico |
Rio Arriba County | New Mexico |
Bronx County (The Bronx) | New York |
Broome County | New York |
Kings County (Brooklyn) | New York |
New York County (Manhattan) | New York |
Queens County (Queens) | New York |
Richmond County (Staten Island) | New York |
Camden County | North Carolina |
Currituck County | North Carolina |
Hyde County | North Carolina |
Dunn County | North Dakota |
Bristol County | Rhode Island |
Kent County | Rhode Island |
Buffalo County | South Dakota |
DeKalb County | Tennessee |
Borden County | Texas |
Glasscock County | Texas |
Kenedy County | Texas |
King County | Texas |
Loving County | Texas |
McMullen County | Texas |
Montague County | Texas |
Palo Pinto County | Texas |
Young County | Texas |
Rich County | Utah |
Alexandria (independent city) | Virginia |
Amelia County | Virginia |
Bath County | Virginia |
Bland County | Virginia |
Bristol (independent city) | Virginia |
Buckingham County | Virginia |
Buena Vista (independent city) | Virginia |
Charles City County | Virginia |
Charlottesville (independent city) | Virginia |
Chesapeake (independent city) | Virginia |
Colonial Heights (independent city) | Virginia |
Covington (independent city) | Virginia |
Cumberland County | Virginia |
Danville (independent city) | Virginia |
Dinwiddie County | Virginia |
Emporia (independent city) | Virginia |
Fairfax (independent city) | Virginia |
Falls Church (independent city) | Virginia |
Fluvanna County | Virginia |
Franklin (independent city) | Virginia |
Fredericksburg (independent city) | Virginia |
Galax (independent city) | Virginia |
Goochland County | Virginia |
Hampton (independent city) | Virginia |
Hanover County | Virginia |
Harrisonburg (independent city) | Virginia |
Hopewell (independent city) | Virginia |
Isle of Wight County | Virginia |
King and Queen County | Virginia |
King George County | Virginia |
King William County | Virginia |
Lancaster County | Virginia |
Lexington (independent city) | Virginia |
Lunenburg County | Virginia |
Lynchburg (independent city) | Virginia |
Manassas (independent city) | Virginia |
Manassas Park (independent city) | Virginia |
Martinsville (independent city) | Virginia |
Mathews County | Virginia |
Middlesex County | Virginia |
Nelson County | Virginia |
New Kent County | Virginia |
Newport News (independent city) | Virginia |
Norfolk (independent city) | Virginia |
Northumberland County | Virginia |
Norton (independent city) | Virginia |
Nottoway County | Virginia |
Petersburg (independent city) | Virginia |
Poquoson (independent city) | Virginia |
Portsmouth (independent city) | Virginia |
Powhatan County | Virginia |
Prince George County | Virginia |
Radford (independent city) | Virginia |
Richmond County | Virginia |
Roanoke County | Virginia |
Salem (independent city) | Virginia |
Stafford County | Virginia |
Staunton (independent city) | Virginia |
Suffolk (independent city) | Virginia |
Sussex County | Virginia |
Virginia Beach (independent city) | Virginia |
Waynesboro (independent city) | Virginia |
Williamsburg (independent city) | Virginia |
Winchester (independent city) | Virginia |
Quite a list. Now there are legitimate (or at least arguable) reasons some of these counties are missing their county seat:
- Some counties really don’t have a county seat. Kalawao County in Hawaii, for example.
- Some administrative districts in Alaska don’t have a capital. Alaska has boroughs rather than counties, and one of them, called the Unorganized Borough,[1] is further subdivided into “census areas,” none of which have a capital.
- Virginia has, in addition to counties, “independent cities,” which appear to be at the same level as counties.[2] While I think Wolfram would be justified in saying these independent cities are their own capitals, it’s decided to say the capitals are missing, which is a legitimate choice, too.
- Five counties in New York match up with the boroughs of New York City. Pretty hard to choose the capital of a portion of a city, so these counties also have missing capitals.
But most of the counties with missing county seats are like DuPage and DeKalb counties—regular counties with regular county seats that are just not included in the Knowledgebase, despite them being easy to look up and verify. There are fewer than 100 of them. I don’t know why they’re missing, but filling in missing values like this is a pretty standard data cleaning operation. And as I said earlier, this is pretty much a one-time operation; counties just don’t change seats very often.
I haven’t sent this list off to Wolfram. If the people on its development team can’t be bothered to clean the data in their own home state, how likely is it that they’ll fill in all the other states’ data? But they should.
Update 4 Dec 2023 11:45 AM
Chon Torres on Mastodon informed me that the two California counties, Mono and Sierra, do have county seats, but they’re unincorporated, and that might explain why they’re missing from the Knowledgebase. That’s a good explanation, but I would argue with Wolfram that it’s a poor reason for excluding a capital. A county seat should have county government offices—Chon mentioned that he’s been at the Sierra County Courthouse in Downieville—but I don’t see why it needs a municipal government.
Adding to the madness of Virginia government, Sam Davies told me that an independent city can also be the county seat of a county that it’s been carved out of. The example he gave was Charlottesville, which is both an independent city and the capital of Albemarle County. To me, a more disturbing example is Fairfax, which is an independent city but also the county seat of (yes, that’s right) Fairfax County.
Thanks to Chon and Sam for the local government expertise.
[1] Which is apparently not actually a borough itself, despite its name. This is more than I wanted to know about Alaska’s government.
[2] As with the Unorganized Borough in Alaska, this is more than I wanted to know about Virginia’s government.
Dates, triangles, and Python
December 1, 2023 at 12:31 PM by Dr. Drang
Tuesday morning, which was November 28, John D. Cook started a post with
The numbers in today’s date—11, 28, and 23—make up the sides of a triangle. This doesn’t always happen; the two smaller numbers have to add up to more than the larger number.
He went on to figure out the angles of the plane triangle with side lengths of 11, 28, and 23 and then extended his analysis to triangles on a sphere and a pseudosphere. But I got hung up on the quoted paragraph. Which days can and can’t be the sides of a triangle? And how does the number of such “triangle days” change from year to year?
So I wrote a little Python to answer these questions.
python:
 1: from datetime import date, timedelta
 2:
 3: def isTriangleDay(dt):
 4:     "Can the year, month, and day of the given date be the sides of a triangle?"
 5:     y = dt.year % 100
 6:     m = dt.month
 7:     d = dt.day
 8:     sides = sorted(int(x) for x in (y, m, d))
 9:     return sides[0] + sides[1] > sides[2]
10:
11: def allDays(y):
12:     "Return a list of all days in the given year."
13:     start = date(y, 1, 1)
14:     end = date(y, 12, 31)
15:     numDays = (end - start).days + 1
16:     return [ start + timedelta(days=n) for n in range(numDays) ]
17:
18: def triangleDays(y):
19:     "Return a list of all the triangle days in the given year."
20:     return [x for x in allDays(y) if isTriangleDay(x) ]
`isTriangleDay` is a Boolean function that implements the test Cook described for a `datetime.date` object. Note that Line 5 extracts just the last two digits of the year, which is what Cook intends. You could, I suppose, change Line 9 to
python:
 9:     return sides[0] + sides[1] >= sides[2]
if you want to accept degenerate triangles, where the three sides collapse onto a single line. I don’t.
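To make the difference between the strict and non-strict tests concrete, here is a minimal sketch. The helper `is_triangle` and its `allow_degenerate` flag are made up for this illustration and are not part of the script above.

```python
def is_triangle(a, b, c, allow_degenerate=False):
    """Test the triangle inequality on three side lengths.

    Hypothetical helper for illustration; not part of the script above.
    """
    short, mid, long_ = sorted((a, b, c))
    if allow_degenerate:
        return short + mid >= long_   # sides may collapse onto a line
    return short + mid > long_        # strict test: a real triangle


# Cook's date, 11/28/23: 11 + 23 = 34, which is more than 28
print(is_triangle(11, 28, 23))                          # True
# December 11, 2023: 11 + 12 = 23, a degenerate case
print(is_triangle(12, 11, 23))                          # False
print(is_triangle(12, 11, 23, allow_degenerate=True))   # True
```

With the strict test, a date like December 11, 2023 is rejected because its two shorter sides exactly equal the longest one.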
The `allDays` function uses a list comprehension to return a list of all the days in a given year, and `triangleDays` calls `isTriangleDay` to filter the results of `allDays` down to just triangle days. I think both of these functions are self-explanatory.
With these functions defined, I got all the triangle days for 2023 via
python:
print('\n'.join(x.strftime('%Y-%m-%d') for x in triangleDays(2023)))
which returned this list of dates (after reshaping into four columns):
2023-01-23 2023-06-27 2023-09-19 2023-11-17
2023-02-22 2023-06-28 2023-09-20 2023-11-18
2023-02-23 2023-07-17 2023-09-21 2023-11-19
2023-02-24 2023-07-18 2023-09-22 2023-11-20
2023-03-21 2023-07-19 2023-09-23 2023-11-21
2023-03-22 2023-07-20 2023-09-24 2023-11-22
2023-03-23 2023-07-21 2023-09-25 2023-11-23
2023-03-24 2023-07-22 2023-09-26 2023-11-24
2023-03-25 2023-07-23 2023-09-27 2023-11-25
2023-04-20 2023-07-24 2023-09-28 2023-11-26
2023-04-21 2023-07-25 2023-09-29 2023-11-27
2023-04-22 2023-07-26 2023-09-30 2023-11-28
2023-04-23 2023-07-27 2023-10-14 2023-11-29
2023-04-24 2023-07-28 2023-10-15 2023-11-30
2023-04-25 2023-07-29 2023-10-16 2023-12-12
2023-04-26 2023-08-16 2023-10-17 2023-12-13
2023-05-19 2023-08-17 2023-10-18 2023-12-14
2023-05-20 2023-08-18 2023-10-19 2023-12-15
2023-05-21 2023-08-19 2023-10-20 2023-12-16
2023-05-22 2023-08-20 2023-10-21 2023-12-17
2023-05-23 2023-08-21 2023-10-22 2023-12-18
2023-05-24 2023-08-22 2023-10-23 2023-12-19
2023-05-25 2023-08-23 2023-10-24 2023-12-20
2023-05-26 2023-08-24 2023-10-25 2023-12-21
2023-05-27 2023-08-25 2023-10-26 2023-12-22
2023-06-18 2023-08-26 2023-10-27 2023-12-23
2023-06-19 2023-08-27 2023-10-28 2023-12-24
2023-06-20 2023-08-28 2023-10-29 2023-12-25
2023-06-21 2023-08-29 2023-10-30 2023-12-26
2023-06-22 2023-08-30 2023-10-31 2023-12-27
2023-06-23 2023-09-15 2023-11-13 2023-12-28
2023-06-24 2023-09-16 2023-11-14 2023-12-29
2023-06-25 2023-09-17 2023-11-15 2023-12-30
2023-06-26 2023-09-18 2023-11-16 2023-12-31
That’s 136 triangle days for this year. To see how this count changes from year to year, I ran
python:
for y in range(2000, 2051):
    print(f'{y} {len(triangleDays(y)):3d}')
which returned
2000 0
2001 12
2002 34
2003 54
2004 72
2005 88
2006 102
2007 114
2008 124
2009 132
2010 138
2011 142
2012 144
2013 144
2014 144
2015 144
2016 144
2017 144
2018 144
2019 144
2020 144
2021 142
2022 140
2023 136
2024 132
2025 127
2026 120
2027 113
2028 104
2029 93
2030 82
2031 72
2032 61
2033 51
2034 41
2035 33
2036 25
2037 19
2038 13
2039 8
2040 5
2041 2
2042 1
2043 0
2044 0
2045 0
2046 0
2047 0
2048 0
2049 0
2050 0
I knew there was no point in checking on years later in the century—it was obvious that every year after 2042 would have no triangle days. As you can see, the 2010s were the peak decade for triangle days. We’re now in the early stages of a 20-year decline.
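The cutoff follows from a little arithmetic. Once the two-digit year is 31 or more, it is the longest side (or tied for it), so the test reduces to month + day > year. A quick sketch that finds the largest possible month-plus-day sum (the function name is made up for this illustration):

```python
from calendar import monthrange

def max_month_plus_day(year):
    """Largest value of month + day over all dates in the given year."""
    # monthrange returns (weekday of the 1st, number of days in the month)
    return max(m + monthrange(year, m)[1] for m in range(1, 13))

print(max_month_plus_day(2043))   # 43, from December 31
```

Since month + day tops out at 43, it can exceed a two-digit year of 42 (only on December 31) but never 43 or more, which matches the table above.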
After doing this, I looked back at my code and decided that most serious Python programmers wouldn’t have done it the way I did. Instead of functions that returned lists, they would build `allDays` and `triangleDays` as iterators.[1] Not because there’s any need to save space (the space used by 366 `datetime.date` objects is hardly even noticeable) but because that’s more the current style.
So to make myself feel more like a real Pythonista, I rewrote the code like this:
python:
from datetime import date, timedelta

def isTriangleDay(dt):
    "Can the year, month, and day of the given date be the sides of a triangle?"
    y = dt.year % 100
    m = dt.month
    d = dt.day
    sides = sorted(int(x) for x in (y, m, d))
    return sides[0] + sides[1] > sides[2]

def allDays(y):
    "Iterator for all days in the given year."
    d = date(y, 1, 1)
    end = date(y, 12, 31)
    while d <= end:
        yield d
        d = d + timedelta(days=1)

def triangleDays(y):
    "Iterator for all the triangle days in the given year."
    return filter(isTriangleDay, allDays(y))
`isTriangleDay` is unchanged, but `allDays` now works its way through the days of the year with a `while` loop and the `yield` statement, and `triangleDays` uses the `filter` function to iterate through just the triangle days.
Using these functions is basically the same as using the list-based versions, except that you can’t pass an iterator to `len`. So determining the number of triangle days over a range of years can be done either by converting the iterator to a list before passing it to `len`,
python:
for y in range(2000, 2051):
    print(f'{y} {len(list(triangleDays(y))):3d}')
or by using the `sum` function with an argument that produces a one for each element of `triangleDays`,
python:
for y in range(2000, 2051):
    print(f'{y} {sum(1 for x in triangleDays(y))}')
The former sort of defeats the purpose of using an iterator, so I guess it’s better practice to use the latter, even though I find it weird looking.
It may well be that my perception of “real” Python programmers is wrong and they wouldn’t bother with `yield` and `filter` in such a piddly little problem as this. But at least I got some practice with them.
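One way to probe the generator-versus-iterator terminology is to ask Python itself. The sketch below repeats `allDays` so it runs on its own and uses the standard library’s `inspect` module and the abstract base classes in `collections.abc`. It’s a quick check, not a full treatment of the iterator protocol.

```python
import inspect
from collections.abc import Generator, Iterator
from datetime import date, timedelta

def allDays(y):
    "Iterator for all days in the given year."
    d = date(y, 1, 1)
    end = date(y, 12, 31)
    while d <= end:
        yield d
        d = d + timedelta(days=1)

days = allDays(2023)

print(inspect.isgeneratorfunction(allDays))    # True: the def is a generator function
print(isinstance(days, Generator))             # True: calling it returns a generator object
print(isinstance(days, Iterator))              # True: every generator is an iterator
print(isinstance(iter([1, 2, 3]), Generator))  # False: not every iterator is a generator
```

So the function written with `yield` is a generator function, and what it returns when called is a generator, which is one particular kind of iterator.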
[1] A confession: I find it hard to distinguish between the proper use of the terms generator and iterator. My sense is that generators provide a way of creating iterators. So once the function is written, do you have a generator, an iterator, or both?
A couple of game followups
November 20, 2023 at 12:20 PM by Dr. Drang
Here are some little tricks associated with Wordle and Conlextions, the Connections-like puzzle put out by Lex Friedman.
Last week, I mentioned that I needed a new starting guess for Wordle, and I wrote about how I used some simple command-line tools to see if the word I was considering was an appropriate choice. In a nutshell, I had been using IRATE as my first guess, but because it was the answer recently I wanted to try a new initial guess that might be the answer in the future. ALTER seemed like a good choice, but I wanted to make sure it hadn’t already been used. I had a list of all the answers in chronological order in a file named `answers.txt`, and the most recent answer was DWELT.
My solution involved three separate commands. All three involved `grep` and one included `head` in a pipeline. Reader Leon Cowle was unsatisfied with this and began casting about for a cleaner solution. Here’s the single command he came up with to replace the three I used:
egrep 'alter|dwelt' answers.txt
The output was
dwelt
alter
which told me that ALTER was an answer and that it would come after DWELT. This was everything I needed to know in one step. Beautiful!
You might argue that for a one-off like this, the solution that occurs to you first is the best because it takes the least amount of your time. Generally speaking, I agree with that, and I’m not unhappy with my three-command solution. But Leon has given me the best of both worlds. I got to have my own inefficient solution that I thought of quickly and I got to learn from his more elegant solution. Knowing which of two text strings appears first in a file is something I’m pretty sure I’ve had to do before and will have to do again. Now I have a simple and effective solution to pull out of my toolbox. Thanks, Leon!
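The same which-comes-first check can be sketched in Python. This is not Leon’s command, just a rough analogue: the function name is made up for this illustration, and unlike `egrep` it matches whole lines only rather than substrings.

```python
def first_appearances(lines, *words):
    """Return the given words in the order they first appear in lines.

    Only exact whole-line matches count; other lines are never shown.
    """
    wanted = set(words)
    found = []
    for line in lines:
        word = line.strip()
        if word in wanted and word not in found:
            found.append(word)
    return found


# Made-up stand-in for answers.txt, so no real answers are spoiled
sample = ["aaaaa\n", "dwelt\n", "bbbbb\n", "alter\n", "ccccc\n"]
print(first_appearances(sample, "alter", "dwelt"))   # ['dwelt', 'alter']
```

An open file is an iterable of lines, so the first argument could just as well be the result of `open('answers.txt')`.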
As for Conlextions, I’ve been playing it for a few weeks and recommend it to anyone who likes the NY Times Connections game but is finding it a little too easy. I’ve been playing Connections since early summer and while it has gotten more difficult, the sense of accomplishment I get in solving it with no mistakes is wearing off. Conlextions is more diabolical and more satisfying when you solve it in four guesses.
I have only three gripes with Conlextions:
- I don’t think the t belongs in its name.
- Lex’s tendency to define groups as “words that begin with S,” or some other letter, is frustrating as hell because it’s both ridiculously easy and a connection I almost never see.
- The info that you share with others when you solve the puzzle includes the time it took. I’m a slow and methodical player, certain that Lex has laid so many traps that I need to think through all the possibilities three or four times before committing myself to any grouping. Including the time I spend is embarrassing, especially when I see the solution times of people like Dan Moren and Greg Pierce.
In protest of this prejudice against the excessively careful, I’ve recently taken to deleting the solve time from my posts on Mastodon. And to avoid the tedium of backspacing, I made this Shortcut that does the deleting for me:
It searches the clipboard for a linefeed, the word “Solve,” and all the text after that. It replaces that with the empty string and puts the updated text back onto the clipboard.
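For comparison, the same replacement is easy to sketch in Python with the `re` module. The pattern is an assumption based on the description above, and the sample text is invented; the real shared result may be formatted differently.

```python
import re

def delete_time(text):
    """Drop a linefeed, the word "Solve", and everything after it."""
    # DOTALL lets "." match newlines, so any later lines are removed too
    return re.sub(r"\nSolve.*", "", text, flags=re.DOTALL)


# Made-up stand-in for a shared result; the real format may differ
shared = "Puzzle result\nrow 1\nrow 2\nSolve time: 12:34"
print(delete_time(shared))
```

Text that doesn’t contain a line starting with “Solve” passes through unchanged.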
So now when I want to post my Conlextions result to Mastodon, I tap the “Share these results” button, press and hold the side button on my phone to activate Siri, and say “Delete Time.” The name of the Shortcut seems to be distinct enough that Siri hasn’t misinterpreted it yet.
I would not normally use Shortcuts for an automation like this. Keyboard Maestro would allow me to invoke it with a keystroke and could also do the pasting. But since I always play Conlextions on my phone, Shortcuts was the best option.
Nearly cheating
November 14, 2023 at 11:47 AM by Dr. Drang
Last week I solved Wordle in a single guess. My go-to first guess, IRATE, finally came in after months (over a year, I think) of use. So what do I do now? Seems like a good time to switch my starting guess.
Before we go any further, I should mention, for those of you who don’t commit every utterance of mine to memory and are thinking “There was no IRATE last week,” that I don’t play the NY Times version of Wordle. Back in early 2022, right after the Times bought Wordle, I downloaded the original version and set it up on a server I control. That’s the game my family and I have been playing ever since. Overall, I’d say this was unnecessary. The Times hasn’t screwed up the game the way I thought it would,[1] but it’s too late now for us to change.
Based on letter frequency tables, I figured ALTER would be a good new initial guess. But because I was so delighted when IRATE came up, I’d like to choose a word that
- Is among the list of 2315 words that are answers.
- Hasn’t been an answer already.
And I want to see if ALTER passes these two tests without spoiling myself with other answers, especially those that may be coming up soon.
I have the data to do this if I’m careful. In order to build my `wordle` script, I needed to pull out all the answers and acceptable guesses from the original Wordle JavaScript source code. I was able to do this nearly blind (I saw just the first and last couple of words in each list and have long since forgotten them) and save them to three files: `guesses.txt`, `answers.txt`, and `wordle.txt`, the last of which is a blending of the two others.
Some simple shell commands got me what I wanted. First, the simple part:
grep alter answers.txt
returned `alter`, which confirmed my belief that ALTER is an answer and didn’t reveal any other answers. Checking whether it had already been an answer was only a little trickier.
The `answers.txt` file has the answer words in chronological order, one per line. To figure out where I currently am in that list, I got the line number of yesterday’s word, DWELT.
grep -n dwelt answers.txt
The `-n` option causes the line number of the found text to be printed along with the text:
878:dwelt
With this information, I can now see if ALTER has already been the answer:
head -n 878 answers.txt | grep alter
The `head` command outputs just the first 878 lines of `answers.txt` and the `grep` command looks for `alter` in that text. It returned nothing, so ALTER has not been an answer so far. It passes my test and will be my first guess from now on.
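The `head`-then-`grep` pipeline has a straightforward Python analogue. This is a sketch, not a replacement for the shell commands: the function name is made up for this illustration, and it matches whole lines where `grep` would match substrings.

```python
from itertools import islice

def already_used(lines, word, latest_line):
    """True if word is among the first latest_line lines (1-based, as in grep -n)."""
    # islice stops reading after latest_line lines, like head -n
    return any(line.strip() == word for line in islice(lines, latest_line))


# Made-up stand-in for answers.txt; "dwelt" is on line 3 here
sample = ["aaaaa\n", "bbbbb\n", "dwelt\n", "alter\n"]
print(already_used(sample, "alter", 3))   # False: alter comes later
print(already_used(sample, "bbbbb", 3))   # True
```

Like the pipeline, it answers only yes or no, so it doesn’t reveal where in the file an unused word sits.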
What I like about this solution is that I was able to run my test quickly without seeing any other answer words and without tipping myself off to when ALTER will appear, which I could have easily done with
grep -n alter answers.txt
I recognize that some of you will read this and think I’ve crossed the line of Wordle ethics into outright cheating. You can keep your opinions to yourself.
[1] Although I recall some outrage about a year ago when FEAST was the too-on-the-nose word for Thanksgiving Day.