How many Thursdays?

A couple of weeks ago, this question from ldebritto appeared in the Automators forum:

I was looking for a way to have Shortcuts peek into last month’s calendar and count how many Thursdays were that month (4 or 5).

Any ideas on how to make it?

Stephen Millard, who posts there under the name sylumer, answered the question with a straightforward shortcut that loops through the days of last month, counting the Thursdays it meets along the way. It’s the kind of solution I employ often: a brute force approach that can be written quickly and uses looping instead of excessive cleverness. And in this case, since there can’t be more than 31 trips around the loop, it will run extremely fast. It’s a good solution.

But it nagged at me. This blog has a long history of showing calendrical calculations, and I had written a post on a similar problem (finding months with five Fridays, Saturdays, and Sundays) eleven years ago. The scripts in that post had some brute force aspects, but they also took advantage of how months are structured to filter the loops. I wanted to do something similar for the “how many Thursdays” problem.

As ldebritto said in his question, every month has either four or five Thursdays (indeed, four or five of any weekday). Whether it’s four or five depends on the number of days in the month and the date of the first Thursday of the month. We’ll consider each of these pieces of information and how they affect the answer.

Since every month has 28–31 days, we know that each month has four weeks plus 0–3 “extra” days. If the date of the first Thursday is less than or equal to the number of extra days, there will be five Thursdays in that month. For example:

The logic is simple, but implementing it in Shortcuts takes a little effort because there’s no “date of the first Thursday” function. Instead, we’ll figure out how many days from the first of the month to the first Thursday. And because this is one less than the date of the first Thursday the “less than or equal to” comparison we used above will turn into a simple “less than” comparison.

So how do we get the number of days from the first of the month to its first Thursday? We’ll need to use custom date formatting codes and do a little modulo arithmetic. Here’s the whole shortcut:

StepActionComment
1 Thursdays Last Month Step 01 Get the current date
2 Thursdays Last Month Step 02 Move to the start of this month. We call this magic variable StartOfThisMonth.
3 Thursdays Last Month Step 03 Go back one month. We call this magic variable StartOfLastMonth.
4 Thursdays Last Month Step 04 Get the number of days between StartOfLastMonth and StartOfThisMonth, which is the length of last month.
5 Thursdays Last Month Step 05 Get the number of days over 4 weeks in last month. We call this magic variable ExtraDaysInMonth.
6 Thursdays Last Month Step 06 Format StartOfLastMonth with the c code, which gives the day of the week as a number from 1 (Sunday in my locale) to 7 (Saturday).
7 Thursdays Last Month Step 07 This is where we do the modulo arithmetic to get the number of days from the start of last month to last month’s first Thursday. We call this magic variable DaysToFirstThursday.
8 Thursdays Last Month Step 08 If this is true…
9 Thursdays Last Month Step 09 Return 5.
10 Thursdays Last Month Step 10 Otherwise…
11 Thursdays Last Month Step 11 Return 4.
12 Thursdays Last Month Step 12 End If block.
13 Thursdays Last Month Step 13 Show the answer.

The only tricky parts here are Steps 6 and 7. Custom date formatting patterns in Shortcuts follow Unicode Technical Standard #35, where c will give the day of the week as a digit from 1 to 7. In my locale Day 1 is Sunday and Day 7 is Saturday; it may be different where you live.

Step 7 uses the number we got in Step 6 and calculates the number of days from the start of the month to the first Thursday. This involves modulo 7 arithmetic, so the first thing we have to do is subtract 1 from the Step 6 result. Numbers in mod 7 go from 0 to 6, not 1 to 7. After subtraction, we’re in a day numbering system where Thursday is 4. To get the number of days from the start of the month to the first Thursday, we subtract the weekday number from 4, add 7 (to force a positive result), and then get the remainder (mod) after dividing by 7.

Let’s do a couple of examples. If I run the shortcut today, last month is September 2021, which started on a Wednesday. Step 6 returns 4, and Step 7 returns

(4 - (4 - 1) + 7) mod 7 = (4 - 3 + 7) mod 7 = 8 mod 7 = 1

which makes sense, as there’s one day from the start of September 2021 to the first Thursday of the month.

If you’re wondering why we bothered adding 7 and doing the mod stuff—why not just do 4 - (4 - 1) = 1?—think of what happens when we run the shortcut next month. October 2021 starts on a Friday. Step 6 returns 6, and Step 7 returns

(4 - (6 - 1) + 7) mod 7 = (4 - 5 + 7) mod 7 = 6 mod 7 = 6

and there are, in fact, 6 days from the start of October 2021 to the first Thursday of the month. Here, without the + 7 we would have gotten a result of -1 which is not the number of days from the start of October to its first Thursday. By including the + 7 and mod 7 terms, we have an equation that handles all cases.

This is not nearly as easy to understand as Stephen Millard’s solution, so I didn’t post it to the forum. On the other hand, his solution didn’t exercise my modulo muscles.

By the way, it didn’t take me two weeks to work this out. I had the shortcut written the day of the question, but I soon learned that my splitflow script for creating the table of Shortcuts steps you see above was broken. The Shortcuts team changed the look of the app, and the computer vision functions that splitflow uses weren’t able to figure out the top and bottom boundaries of the steps. So I put this post on ice until I had energy to update splitflow. It still needs some work—the padding above the steps shouldn’t be wider than the padding below—but it’s passable.


Semiautomated LaTeX tables

As a followup to my last post about automating the creation of Markdown tables, here’s a simple pair of functions that’ve been very helpful in making LaTeX tables quickly.

A few years ago, I wrote about how much I hated the syntax of LaTeX tables and how I was shifting to building tables as graphics files that I could insert using the \includegraphics{} command. In the early stages, I was doing this by hand, as outlined in that post, but after I developed a sense of the kind of spacing I liked, I translated that into a set of Python functions that used the ReportLab module to create decent-looking tables as PDF files. A Python solution made the most sense, as the data from which I made the tables typically came out of data analysis done in Python.

The functions I wrote worked well enough, but the overall system was more fussy than it should have been, and I realized I’m better at programming the manipulation of text than the manipulation of graphics. So I started thinking about ways to make LaTeX tables directly from Python data.

Here’s an artificial example that isn’t too far away from the kinds of tables I need to make. Let’s say we want a short table of (base 10) logarithms. We can generate the data—a list of values and a parallel list of their logs—like this:

python:
from math import log10

x = [ (10+i)/10 for i in range(90) ]
lx = [ f'{log10(y):.4f}' for y in x ]

And what I want is a table that looks like this:

Log table

I use the booktabs package for making tables. I like its clean look, and I like its nice \addlinespace command for adding a little extra space between certain rows to make reading long tables easier. The code that produced the table above is

\setlength{\tabcolsep}{.125in}
\begin{table}[htbp]
\begin{center}
\begin{tabular}{
    @{\hspace*{5pt}}
    cc@{\hspace{.75in}}cc@{\hspace{.75in}}cc
    @{\hspace*{5pt}}
}
\toprule
$x$ & $\log x$ & $x$ & $\log x$ & $x$ & $\log x$ \\
\midrule
1.0 & 0.0000 & 4.0 & 0.6021 & 7.0 & 0.8451 \\
1.1 & 0.0414 & 4.1 & 0.6128 & 7.1 & 0.8513 \\
1.2 & 0.0792 & 4.2 & 0.6232 & 7.2 & 0.8573 \\
1.3 & 0.1139 & 4.3 & 0.6335 & 7.3 & 0.8633 \\
1.4 & 0.1461 & 4.4 & 0.6435 & 7.4 & 0.8692 \\
\addlinespace
1.5 & 0.1761 & 4.5 & 0.6532 & 7.5 & 0.8751 \\
    .
    .
    .
3.7 & 0.5682 & 6.7 & 0.8261 & 9.7 & 0.9868 \\
3.8 & 0.5798 & 6.8 & 0.8325 & 9.8 & 0.9912 \\
3.9 & 0.5911 & 6.9 & 0.8388 & 9.9 & 0.9956 \\
\bottomrule
\end{tabular}
\caption{Partial logarithm table}
\label{table}
\end{center}
\end{table}

The boilerplate at the top and bottom is produced by a Keyboard Maestro macro. Except for the column alignment line and caption—which need to be set for each table—it never changes. The troublesome parts are the header and, especially, the body. For those, I use two functions, theader and tbody, to build the LaTeX code without having to touch the ampersand or backslash keys. To make the table in the same script as the data, I adjust the script to this:

python:
from math import log10
from latex_tables import tbody, theader

x = [ (10+i)/10 for i in range(90) ]
lx = [ f'{log10(y):.4f}' for y in x ]
headers = ['$x$', '$\log x$']*3
print(theader(headers))
print(tbody(x[:30], lx[:30], x[30:60], lx[30:60], x[60:], lx[60:], group=5))

The output is

$x$ & $\log x$ & $x$ & $\log x$ & $x$ & $\log x$ \\
\midrule
1.0 & 0.0000 & 4.0 & 0.6021 & 7.0 & 0.8451 \\
1.1 & 0.0414 & 4.1 & 0.6128 & 7.1 & 0.8513 \\
1.2 & 0.0792 & 4.2 & 0.6232 & 7.2 & 0.8573 \\
1.3 & 0.1139 & 4.3 & 0.6335 & 7.3 & 0.8633 \\
1.4 & 0.1461 & 4.4 & 0.6435 & 7.4 & 0.8692 \\
\addlinespace
1.5 & 0.1761 & 4.5 & 0.6532 & 7.5 & 0.8751 \\
    .
    .
    .
3.7 & 0.5682 & 6.7 & 0.8261 & 9.7 & 0.9868 \\
3.8 & 0.5798 & 6.8 & 0.8325 & 9.8 & 0.9912 \\
3.9 & 0.5911 & 6.9 & 0.8388 & 9.9 & 0.9956 \\

which, as you can see, is exactly what goes between the top and bottom boilerplate.

The functions are defined in the file latex_tables.py, which is saved in my site-packages directory. Here’s the code:

python:
 1:  def tbody(*cols, group=0):
 2:    "Given the columns, return the body of a LaTeX table."
 3:  
 4:    # Add blanks at the ends of short columns
 5:    lens = [ len(c) for c in cols ]
 6:    nrows = max(lens)
 7:    cols = [ c + [' ']*(nrows - len(c)) for c in cols ]
 8:  
 9:    # Assemble the rows of the table
10:    rows = []
11:    for i in range(nrows):
12:      if (group > 0) and (i > 0) and (i % group == 0):
13:        rows.append(r'\addlinespace')
14:      row = [ str(c[i]) for c in cols ]
15:      rows.append(' & '.join(row) + r' \\')
16:  
17:    return '\n'.join(rows)
18:  
19:  def theader(hcols):
20:    "Given a list of column headers, return the header lines of a LaTeX table."
21:  
22:    # Figure out the maximum number of lines in the header
23:    hcols = [ x.splitlines() for x in hcols ]
24:    maxlines = max(len(x) for x in hcols)
25:    hcols = [ [' ']*(maxlines - len(x)) + x for x in hcols ]
26:  
27:    # Assemble the rows of the header
28:    rows = []
29:    for i in range(maxlines):
30:      row = [ str(c[i]) for c in hcols ]
31:      rows.append(' & '.join(row) + r' \\')
32:  
33:    return '\n'.join(rows) + '\n\\midrule'

Like the example log table, most of the tables I make come from lists of data that are meant to go into the columns of the table. So the main feature of tbody is assembling the rows from those lists. The other significant feature is placing the \addlinespace command according to the group parameter.

The theader function is even simpler. The only interesting thing about it is how it handles multiline headers. If it’s called like this,

print(theader(['One line', 'Two\nlines', 'And\nthree\nlines']))

the output will be

    &   & And \\
    & Two & three \\
One line & lines & lines \\
\midrule

which will produce a table that has a header that looks like this:

Multiline header

The header lines are bottom-aligned instead of top-aligned. The main purpose of theader is to automate that alignment.

Trickier table arrangements still have me hauling out my copy of Kopka & Daly, but theader and tbody do the bulk of the work even in those cases.


Under the table

In my last post, I mentioned an AppleScript I wrote some time ago that turns a table in Numbers into a MultiMarkdown1 table. A couple of days later, jacobio asked a question in the Mac Power Users forum, that led me to posting the script there. After looking it over—and seeing jacobio’s independently developed script—I made a few small changes to the script and decided to talk about it here.

The premise for the script is that I’m writing a blog post in BBEdit, and I want to include a table. Writing a Markdown table isn’t especially hard, but doing so involves lots of typing of formatting characters, the pipes and colons that tell the Markdown processor how to make the table but aren’t part of the table’s content. It’s much easier to write a table in a spreadsheet or some other dedicated table-writing tool and then invoke a script that converts it to Markdown. And if the table I want to insert is already in a spreadsheet, using a conversion script is even faster.

(Let me pause here and mention TableFlip, which Rosemary Orchard talked about in that same thread. I haven’t used it, but it sounds great, especially if most of the tables you deal with have to be written from scratch. It’s killer feature is how it ties the Markdown table in your document to the spreadsheet-like form in TableFlip itself. Changes you make to the table in TableFlip are conveyed automatically to the Markdown document. The downside to using TableFlip is that you don’t have a version of the table in a spreadsheet, which is often very useful.)

I’ve created a BBEdit package of scripts and text filters that help me write blog posts. The Table from Numbers script is in that package and has a few lines that are package-specific. But most of it could be used as a standalone conversion script. Here’s the code:

applescript:
 1:  -- Get the path to the Normalize Table text filter in the package.
 2:  tell application "Finder" to set cPath to container of container of container of (path to me) as text
 3:  set normalize to quoted form of (POSIX path of cPath & "Text Filters/06)Blogging/02)Normalize Table.py")
 4:  
 5:  -- Construct a MultiMarkdown table from the top Numbers table
 6:  tell application "Numbers"
 7:    tell table 1 of sheet 1 of document 1
 8:      set tbl to ""
 9:      set h to header row count
10:      set c to column count
11:      set r to row count
12:      
13:      -- header lines
14:      repeat with i from 1 to h
15:        set tbl to tbl & "|"
16:        repeat with j from 1 to c
17:          set val to formatted value of cell j of row i
18:          if val = missing value then
19:            set val to ""
20:          end if
21:          set tbl to tbl & " " & val & " |"
22:        end repeat
23:        set tbl to tbl & linefeed
24:      end repeat
25:      
26:      -- formatting line
27:      set tbl to tbl & "|"
28:      repeat with j from 1 to c
29:        set a to alignment of cell j of row h
30:        if a = right then
31:          set tbl to tbl & "---:|"
32:        else if a = center then
33:          set tbl to tbl & ":--:|"
34:        else
35:          set tbl to tbl & ":---|"
36:        end if
37:      end repeat
38:      set tbl to tbl & linefeed
39:      
40:      -- body lines
41:      repeat with i from h + 1 to r
42:        set tbl to tbl & "|"
43:        repeat with j from 1 to c
44:          set val to formatted value of cell j of row i
45:          if val = missing value then
46:            set val to ""
47:          end if
48:          set tbl to tbl & " " & val & " |"
49:        end repeat
50:        set tbl to tbl & linefeed
51:      end repeat
52:      
53:    end tell -- table
54:  end tell -- Numbers
55:  
56:  -- Normalize the table
57:  set the clipboard to tbl
58:  set tbl to do shell script ("pbpaste | " & normalize)
59:  
60:  tell application "BBEdit" to set selection to tbl

Lines 1–3 are some of the package-specific stuff I talked about before. Let’s skip over that for now and get to the meat of the script, which starts on Line 6 with a long tell block that controls Numbers.

The script assumes that the table of interest is the first table of the first sheet of the frontmost Numbers document. Line 7 starts a tell block that speaks to that table. Line 8 initializes the tbl variable, which is where we’re going to put all the text of the Markdown table. Lines 9–11 get the sizes of the header and the table as a whole.

A word about headers is in order. Some Markdown implementations allow for only one header line, but others—including MultiMarkdown—allow for any number of header lines. This script accommodates the more general case. In Numbers, the header lines (rows) are set in the Table section of the Format side panel. There’s a pop-up button that lets you set the number of header rows, and this is typically reflected in the number of shaded rows at the top of the table.

Header rows

Lines 14–24 use nested loops to create the header lines of the Markdown table. The outer loop, which starts on Line 14, goes through the header rows of the table. The inner loop, which starts on Line 16, goes through the columns of each row. As is often the case, AppleScript’s idiosyncrasies are more difficult to deal with than the logic of the process. To make our Markdown table show what the Numbers table shows, we need to ask for the formatted value of each cell. Also, instead of asking for column j of row i in Line 17, we have to ask for cell j of row i. And finally, if a cell is empty, AppleScript will set val to the literal text “missing value” in Line 17, so we need Lines 18–20 to correct that to an empty string before appending val and the appropriate space and pipe (|) characters to tbl in Line 21.

Lines 27–38 use the alignment of the cells in the last header row (row h) to create the formatting line in the Markdown table. I think the logic here is pretty easy to understand. If the alignment of the cell is right, we add a short string of hyphens with a colon at the right end; if the alignment is left, we add a short string of hyphens with a colon at each end; for any other alignment, we treat the column as left-aligned and add a short string of hyphens with a colon at the left end. “Any other alignment” includes auto align; this is the default and is presented as left-aligned for text, which is what header cells usually contain.

Lines 41–51 loop through the body of the table. Apart from the limits on the repeat command in Line 41, this is the same code as in Lines 14–24. I probably should factor this out into a function.

When the code exits the tell block on Line 54, tbl contains a valid but ugly Markdown table. Something like this:

|  | County | Yes | No | Margin | CMargin |
|:---|:---|---:|---:|---:|---:|
| 1 | Kern | 126,999 | 78,477 | 48,522 | 48,522 |
| 2 | Placer | 114,643 | 85,302 | 29,341 | 77,863 |
| 3 | Shasta | 49,141 | 21,655 | 27,486 | 105,349 |
| 4 | Tulare | 64,372 | 41,009 | 23,363 | 128,712 |
| 5 | El Dorado | 58,393 | 39,907 | 18,486 | 147,198 |
| 6 | Stanislaus | 82,911 | 69,247 | 13,664 | 160,862 |
| 7 | Tehama | 15,958 | 6,186 | 9,772 | 170,634 |
| 8 | Madera | 25,638 | 16,233 | 9,405 | 180,039 |
| 9 | Sutter | 20,458 | 11,593 | 8,865 | 188,904 |
| 10 | Kings | 19,710 | 11,242 | 8,468 | 197,372 |

This can be made nicer looking by running it through my “Normalize Table” text filter, which is also part of my Blogging package and which I’ve written about before. Lines 2–3 figure out the path to the text filter, and Lines 57–58 apply it to the contents of tbl, putting the nicely-formatted result, e.g.,

|    | County     |     Yes |     No | Margin | CMargin |
|---:|:-----------|--------:|-------:|-------:|--------:|
|  1 | Kern       | 126,999 | 78,477 | 48,522 |  48,522 |
|  2 | Placer     | 114,643 | 85,302 | 29,341 |  77,863 |
|  3 | Shasta     |  49,141 | 21,655 | 27,486 | 105,349 |
|  4 | Tulare     |  64,372 | 41,009 | 23,363 | 128,712 |
|  5 | El Dorado  |  58,393 | 39,907 | 18,486 | 147,198 |
|  6 | Stanislaus |  82,911 | 69,247 | 13,664 | 160,862 |
|  7 | Tehama     |  15,958 |  6,186 |  9,772 | 170,634 |
|  8 | Madera     |  25,638 | 16,233 |  9,405 | 180,039 |
|  9 | Sutter     |  20,458 | 11,593 |  8,865 | 188,904 |
| 10 | Kings      |  19,710 | 11,242 |  8,468 | 197,372 |

back into tbl. Note that the code uses the clipboard and pbpaste to send the text through the “Normalize Table” filter. I could avoid the clipboard by setting up a shell command that uses a here-document, but that code is messy and easy to screw up (I speak from experience). Because I use Keyboard Maestro’s clipboard history manager, I can always get back to what was on the clipboard before I ran this script.

Finally, Line 60 puts the text of tbl into my current BBEdit document. If there’s a selection, it replaces the selection; if not, it inserts the text at the cursor.

If you want to adapt this script for use with any text editor, delete Lines 60 and 1–3 and change Line 58 to

set tbl to do shell script "pbpaste | /path/to/normalize | pbcopy"

For this last part to work, you’ll need to have a normalization script and replace /path/to/normalize with the path to it. When you get it working, you’ll have a script that will put the formatted Markdown table on your clipboard, ready to be pasted anywhere.

This script is no speed demon. AppleScript is very deliberate as it walks through the table, so there’s always a pause between choosing the Table from Numbers command and seeing the table appear in BBEdit. If you adapt it to put the table on your clipboard, you might want to add a command at the end to play a sound when the clipboard is ready for pasting.


  1. From now on, I’m just going to call these Markdown tables. Gruber’s Markdown doesn’t include tables, but pretty much every Markdown implementation has them—either built in or as an option. I bet most Markdown users don’t even know that tables are an extension. 


Data cleaning without boredom

The worst part of data analysis is organizing the data into an analyzable form. You want to get on with the real work of analysis, but first you have to collect and massage the data into a shape that your analysis tools can deal with. Often, this takes more time than the analysis itself. Even when it doesn’t take more time, it feels like it does because you have the sense that nothing important is getting done. I’m an analyst, dammit, I shouldn’t be stuck making sure that tabs and commas are in the right place.

But an unorganized pile of numbers is useless, and wishing it was well-ordered won’t make it so. When I have to organize disparate sets of data, I treat the process as an independent intellectual challenge, something I can be proud of in its own right. This keeps me interested and makes the time go faster. For me, the key is to figure out a way to do as little organization “by hand” as I can.

For the recall election post I wrote on Monday, the biggest problem was collecting the voting results from the California Secretary of State’s office. As far as I could tell, the SoS has published no single table of votes organized by county. Instead, there are 58 individual web pages that look like this:

Calaveras results page

Certainly, we can copy the Yes and No votes from this page and paste them into a spreadsheet or text file, but that process would have to be done 58 times. There’s no way I can do that without making several mistakes as I try to speed through it. So the challenge is in automating the process to avoid boredom-induced blunders.

First, I saw that the Calaveras County page shown above has this URL:

https://electionresults.sos.ca.gov/returns/governor-recall/county/calaveras

It’s a longish prefix followed by the lower-cased name of the county. A quick check on counties with multi-word names—of which California has a lot—showed that those URLs use dashes to separated the names, like

https://electionresults.sos.ca.gov/returns/governor-recall/county/santa-barbara

So if I have a list of the county names, I can use that list to construct the URL for each county in turn. A Google search turned up this page at Ballotpedia. I copied the list and pasted it into a text document named county-names.txt. There was one extra entry in the Ballotpedia list, “Los Angeles County not covered,” which I took to be all of Los Angeles County outside of the city. I deleted that line from the file.

Now that I can automate the generation of URLs, I need to be able to extract the Yes and No votes from an individual county’s page. Scanning through the HTML source of the Calaveras page, I found the code for the vote table:

<table class="propTblCounty stateCountyResultsTbl">
  <thead>
    <tr class="crsTblHdrTop">
      <th colspan="3" rowspan="2">Question</th>
      <th colspan="4">County</th>
      <th colspan="4">State</th>
    </tr>
    <tr class="crsTblHdrTop">
      <th>Yes<br />Votes</th>
      <th>%</th>
      <th>No<br />Votes</th>
      <th>%</th>
      <th>Yes<br />Votes</th>
      <th>%</th>
      <th>No<br />Votes</th>
      <th>%</th>
    </tr>
  </thead>
  <tbody>
    <tr class="evenRow resultsTblBorder">
      <td class="propYes" style="font-size:1.2em">Yes</td>
      <td class="bold"></td>
      <td class="bold">Shall Gavin Newsom</span> Be Recalled (Removed) From the Office of Governor?</td>
      <td class="votesProp">15,133</td>
      <td class="percProp">64.5%</td>
      <td class="votesProp">8,320</td>
      <td class="percProp">35.5%</td>
      <td class="votesProp">4,874,057</td>
      <td class="percProp">38.1%</td>
      <td class="votesProp">7,927,249</td>
      <td class="percProp">61.9%</td>
    </tr>
  </tbody>
</table>

A search through the full HTML source told me that the stateCountyResultsTbl class uniquely identifies the table of interest. And within the body of that table, the first two data cells with a votesProp class will yield the Yes and No votes for the county.

There are lots of ways to extract data from HTML. Because I’m most comfortable programming in Python, I chose to use it and the Beautiful Soup module, which I’ve used often, to get the Yes and No votes.

In the time it took me to drink a couple of cups of tea, I had the following script, collect-votes.py, written.

python:
 1:  #!/usr/bin/env python
 2:  
 3:  import requests
 4:  from bs4 import BeautifulSoup
 5:  
 6:  urlPrefix = 'https://electionresults.sos.ca.gov/returns/governor-recall/county/'
 7:  
 8:  print('County,Yes,No')
 9:  
10:  # Loop through the counties, collecting their Yes and No votes
11:  with open('county-names.txt') as f:
12:    for line in f:
13:      # Construct the URL and parse the HTML from it.
14:      name = line.rstrip()
15:      url = urlPrefix + name.lower().replace(' ', '-')
16:      r = requests.get(url)
17:      soup = BeautifulSoup(r.text, 'html.parser')
18:  
19:      # The table we want is the first (and only)
20:      # of class stateCountyResultsTbl
21:      voteTable = soup.find('table', {"class": "stateCountyResultsTbl"})
22:  
23:      # The Yes and No votes are the first two cells of class votesProp
24:      # in the body of the table
25:      votes = voteTable.tbody.find_all('td', {'class': 'votesProp'})[:2]
26:  
27:      # Strip out any thousand separators
28:      votes = [ x.text.replace(',', '') for x in votes ]
29:      outLine = ','.join([name] + votes)
30:      print(outLine)

This script goes through the lines of county-names.txt and for each county

  1. Generates the URL (Line 15).
  2. Gets the HTML source using the Requests module (Line 16).
  3. Finds the vote table (Line 21).
  4. Finds the Yes and No votes (Line 25).
  5. Strips the thousands separators from the votes (Line 28).
  6. Prints a CSV-style line for the county (Lines 29-30).

A lot of Python programmers would say I should have used the csv module to write out the data. If output were more complicated, I would have done so, but this seemed simple enough to do with just the print function.

(You might think I stripped out the thousands separators simply so I could avoid having commas within the fields of my CSV output. No, that was just a nice side benefit. I stripped the thousands separators to make the numbers in the CSV file easy to parse. Although lots of systems can handle numbers with commas, every system can handle numbers without commas. This is in keeping with Postel’s Law: Be conservative in what you do, be liberal in what you accept from others.)

Running the script via

python collect-votes.py > alphabetical-votes.csv

gave me a CSV data file that looked like this:

County,Yes,No
Alameda,108081,465901
Alpine,225,354
Amador,12895,6957
Butte,43129,36128
Calaveras,15133,8320
Colusa,3977,1996
Contra Costa,130058,324747
Del Norte,5243,3505
El Dorado,58393,39907
Fresno,132691,125710
    etc.

Clean data makes the analysis simple. I could have opened the CSV file in a spreadsheet and done all the analysis there, but I prefer using Pandas, even when it’s overkill, because it keeps my Pandas skills fresh. The script that generated the ordered data table in my earlier post was this:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  import pandas as pd
 4:  
 5:  df = pd.read_csv('alphabetical-votes.csv')
 6:  df['Margin'] = df.Yes - df.No
 7:  df.sort_values('Margin', ascending=False, inplace=True, ignore_index=True)
 8:  df['CMargin'] = df.Margin.cumsum()
 9:  df.index += 1
10:  df.to_csv('ordered-votes.csv')

The only tricky thing here is Line 9. Pandas, like Python itself, uses zero-based indexing for the rows of a data frame. I wanted one-based indexing for the table in my post because most people like lists to start with one. The output ordered-votes.csv file looks like this:

,County,Yes,No,Margin,CMargin
1,Kern,126999,78477,48522,48522
2,Placer,114643,85302,29341,77863
3,Shasta,49141,21655,27486,105349
4,Tulare,64372,41009,23363,128712
5,El Dorado,58393,39907,18486,147198
6,Stanislaus,82911,69247,13664,160862
7,Tehama,15958,6186,9772,170634
8,Madera,25638,16233,9405,180039
9,Sutter,20458,11593,8865,188904
10,Kings,19710,11242,8468,197372
    etc.

Long ago, I wrote an AppleScript that converts a table in Numbers to (Multi)Markdown format and pastes it into BBEdit. So I opened ordered-votes.csv in Numbers, turned on the thousands separators in the numeric columns, and ran the conversion script to get this:

|  | County | Yes | No | Margin | CMargin |
|---:|:---|---:|---:|---:|---:|
| 1 | Kern | 126,999 | 78,477 | 48,522 | 48,522 |
| 2 | Placer | 114,643 | 85,302 | 29,341 | 77,863 |
| 3 | Shasta | 49,141 | 21,655 | 27,486 | 105,349 |
| 4 | Tulare | 64,372 | 41,009 | 23,363 | 128,712 |
| 5 | El Dorado | 58,393 | 39,907 | 18,486 | 147,198 |
| 6 | Stanislaus | 82,911 | 69,247 | 13,664 | 160,862 |
| 7 | Tehama | 15,958 | 6,186 | 9,772 | 170,634 |
| 8 | Madera | 25,638 | 16,233 | 9,405 | 180,039 |
| 9 | Sutter | 20,458 | 11,593 | 8,865 | 188,904 |
| 10 | Kings | 19,710 | 11,242 | 8,468 | 197,372 |
    etc.

Another script normalized the table to get all the cells aligned.

|    | County     |     Yes |     No | Margin | CMargin |
|---:|:-----------|--------:|-------:|-------:|--------:|
|  1 | Kern       | 126,999 | 78,477 | 48,522 |  48,522 |
|  2 | Placer     | 114,643 | 85,302 | 29,341 |  77,863 |
|  3 | Shasta     |  49,141 | 21,655 | 27,486 | 105,349 |
|  4 | Tulare     |  64,372 | 41,009 | 23,363 | 128,712 |
|  5 | El Dorado  |  58,393 | 39,907 | 18,486 | 147,198 |
|  6 | Stanislaus |  82,911 | 69,247 | 13,664 | 160,862 |
|  7 | Tehama     |  15,958 |  6,186 |  9,772 | 170,634 |
|  8 | Madera     |  25,638 | 16,233 |  9,405 | 180,039 |
|  9 | Sutter     |  20,458 | 11,593 |  8,865 | 188,904 |
| 10 | Kings      |  19,710 | 11,242 |  8,468 | 197,372 |
    etc.

This isn’t strictly necessary—the HTML generated from the Markdown will be same regardless of the alignment—but it looks nicer as I’m writing the post. By the way, since my earlier post, the numbers have changed a bit. You can now take a subset of 46 counties and get Yes votes to outnumber the No votes:

County Yes No Margin CMargin
1 Kern 126,999 78,477 48,522 48,522
2 Placer 114,643 85,302 29,341 77,863
3 Shasta 49,141 21,655 27,486 105,349
4 Tulare 64,372 41,009 23,363 128,712
5 El Dorado 58,393 39,907 18,486 147,198
6 Stanislaus 82,911 69,247 13,664 160,862
7 Tehama 15,958 6,186 9,772 170,634
8 Madera 25,638 16,233 9,405 180,039
9 Sutter 20,458 11,593 8,865 188,904
10 Kings 19,710 11,242 8,468 197,372
11 Yuba 15,291 7,961 7,330 204,702
12 Riverside 362,958 355,630 7,328 212,030
13 Butte 43,129 36,128 7,001 219,031
14 Fresno 132,691 125,710 6,981 226,012
15 Lassen 8,538 1,604 6,934 232,946
16 Calaveras 15,133 8,320 6,813 239,759
17 Tuolumne 15,832 9,850 5,982 245,741
18 Amador 12,895 6,957 5,938 251,679
19 Siskiyou 11,282 6,951 4,331 256,010
20 Glenn 6,317 2,479 3,838 259,848
21 Plumas 5,837 3,407 2,430 262,278
22 Merced 30,210 27,867 2,343 264,621
23 Mariposa 5,378 3,376 2,002 266,623
24 Colusa 3,977 1,996 1,981 268,604
25 Modoc 2,508 706 1,802 270,406
26 Del Norte 5,243 3,505 1,738 272,144
27 Inyo 4,128 3,496 632 272,776
28 Trinity 2,699 2,106 593 273,369
29 Sierra 1,064 616 448 273,817
30 Alpine 225 354 -129 273,688
31 Mono 2,245 2,805 -560 273,128
32 Lake 3,728 5,605 -1,877 271,251
33 San Bernardino 285,650 288,291 -2,641 268,610
34 San Benito 9,181 12,595 -3,414 265,196
35 Nevada 25,426 29,851 -4,425 260,771
36 Imperial 12,193 18,210 -6,017 254,754
37 San Luis Obispo 61,148 68,365 -7,217 247,537
38 Mendocino 11,870 21,852 -9,982 237,555
39 San Joaquin 94,877 105,405 -10,528 227,027
40 Humboldt 18,179 32,585 -14,406 212,621
41 Napa 18,681 38,948 -20,267 192,354
42 Yolo 24,769 52,444 -27,675 164,679
43 Santa Barbara 57,355 92,905 -35,550 129,129
44 Orange 547,685 586,457 -38,772 90,357
45 Solano 58,372 97,935 -39,563 50,794
46 Monterey 38,169 80,664 -42,495 8,299
47 Ventura 136,610 182,470 -45,860 -37,561
48 Santa Cruz 25,454 90,874 -65,420 -102,981
49 Marin 24,273 108,599 -84,326 -187,307
50 Sonoma 57,396 160,550 -103,154 -290,461
51 Sacramento 220,498 329,952 -109,454 -399,915
52 San Mateo 64,329 227,174 -162,845 -562,760
53 San Diego 504,915 674,417 -169,502 -732,262
54 Contra Costa 130,058 324,747 -194,689 -926,951
55 San Francisco 47,193 292,744 -245,551 -1,172,502
56 Santa Clara 166,930 468,680 -301,750 -1,474,252
57 Alameda 108,081 465,901 -357,820 -1,832,072
58 Los Angeles 855,234 2,076,354 -1,221,120 -3,053,192

The overall change in votes makes Newsom’s winning margin only about 3500 votes lower than it was last Monday, but the way the pluses and minuses were distributed increased our worthless count of counties by one.

This, by the way, demonstrates another advantage of automating the data collection and organization. In addition to keeping me interested, it also makes it easy to rerun the analysis if the data change. When the vote-counting is finalized later in the month, I can check the numbers again by just running a couple of scripts.