Kilometers and the golden ratio
September 3, 2024 at 8:38 AM by Dr. Drang
Regular readers know I enjoy reading John D. Cook’s blog and often comment on it here. But I was a little creeped out by it a couple of days ago. It started off with something I’ve been thinking about a lot over the past several weeks, and it was as if he’d been reading my mind.
The post is about how the conversion factor between miles and kilometers, $1.609$, is close to the golden ratio, $\varphi \approx 1.618$. To convert kilometers to miles, you can make a good estimate by multiplying by $1/\varphi = \varphi - 1 \approx 0.618$, which means that you can convert in the other direction by multiplying by $\varphi$.
You may think multiplying by an irrational number is a pain in the ass, and you’d be right. Cook gets around this by approximating $\varphi$ by the ratio of consecutive Fibonacci numbers, so you can, for example, convert from miles to kilometers by multiplying by 21 and dividing by 13. Similarly, you can use consecutive Lucas numbers in the same fashion, multiplying, say, by 29 and dividing by 18.
The problem with these calculations is that I’m not John D. Cook, and I can’t do multiplications and divisions like this in my head without losing digits along the way. So my conversion method is much cruder: to go from miles to kilometers, I multiply by 16; and to go in the opposite direction, I multiply by 6. In both cases, I finish by shifting the decimal point to divide by 10. If I need more precision than this, I pull out my phone, launch PCalc, and use its built-in conversions.
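To get a feel for how much accuracy each shortcut gives up, here is a minimal sketch that compares them against the exact factor. The names `KM_PER_MILE`, `PHI`, and `relative_error` are made up for this illustration; they don't come from Cook's post or from PCalc.

```python
# Compare a few ways of approximating the miles-to-kilometers factor.
KM_PER_MILE = 1.609344        # exact, by definition of the international mile
PHI = (1 + 5 ** 0.5) / 2      # golden ratio, about 1.618

approximations = {
    "golden ratio": PHI,
    "Fibonacci 21/13": 21 / 13,
    "Lucas 29/18": 29 / 18,
    "times 16, shift": 16 / 10,
}

def relative_error(factor):
    """Signed relative error of an approximate miles-to-km factor."""
    return (factor - KM_PER_MILE) / KM_PER_MILE

for name, factor in approximations.items():
    print(f"{name:16s} {factor:.4f}  error {relative_error(factor):+.2%}")

# Going the other way: kilometers to miles by multiplying by 6 and shifting.
exact_km_to_miles = 1 / KM_PER_MILE
km_to_miles_error = (6 / 10 - exact_km_to_miles) / exact_km_to_miles
print(f"{'times 6, shift':16s} {0.6:.4f}  error {km_to_miles_error:+.2%}")
```

All four miles-to-kilometers shortcuts land within one percent of the exact factor. Multiplying by 6 in the other direction is cruder, coming in a few percent low.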
By the way, the main reason I’ve been converting between miles and kilometers lately is that I recently switched the units I use in the Activity/Fitness app1 from miles to kilometers. I now often find myself in the middle of one of my walking routes wondering how long it is in kilometers. I know the lengths of all my routes in miles but haven’t memorized their lengths in kilometers yet. It was while doing conversions like this that I noticed that the conversion factor was close to $\varphi$ and started doing a lot of multiplication by 16.
If you’re wondering why I bothered switching units, it’s because I enter 5k races a few times a year and I like my time to be under 45 minutes. To keep myself in shape for this, every couple of weeks I push myself to do my first 5k in that time. It’s much easier to look at my watch and know that my pace should be about 9:00 per kilometer than 14:29 per mile. Also, it’s easier to know when I’m done; marking my time at 3.11 miles is as much a pain in the ass as multiplying by $\varphi$.
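The pace figures can be checked with a few lines of arithmetic. This is just a sketch; the `pace` helper is a name invented for the example.

```python
KM_PER_MILE = 1.609344  # exact, by definition

def pace(total_minutes, distance):
    """Return the pace as an 'm:ss' string per unit of distance."""
    seconds = round(total_minutes * 60 / distance)
    return f"{seconds // 60}:{seconds % 60:02d}"

race_km = 5
race_miles = race_km / KM_PER_MILE   # a bit under 3.11 miles

print(pace(45, race_km))      # pace per kilometer
print(pace(45, race_miles))   # pace per mile
```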
-
Does it really have to have different names on the watch and the phone? I don’t get confused when I’m using them because the icons are the same, but I never know which name to use when talking about them. ↩
The Electoral College again, this time with aggregation
August 31, 2024 at 11:27 PM by Dr. Drang
After the last post, I had a conversation with ondaiwai on Mastodon about the `to_markdown` and `agg` functions in Pandas. After seeing his example code here and here, I rewrote my script, and I think it’s much better.
In a nutshell, the `agg` function aggregates the data pulled together by `groupby` and creates a new dataframe that can be formatted as a MultiMarkdown table via `to_markdown`. I had used `agg` several years ago but forgot about it. And I’d never seen `to_markdown` before. Let’s see how they work to make a nicely formatted Electoral College table.
This time, instead of showing you the script in pieces, I’ll show the whole thing at once, and we’ll consider each section in turn:
python:
1: #!/usr/bin/env python
2:
3: import pandas as pd
4:
5: # Import the raw data
6: df = pd.read_csv('states.csv')
7:
8: # Generate new fields for percentages
9: popSum = df.Population.sum()
10: ecSum = df.Electors.sum()
11: df['PopPct'] = df.Population/popSum
12: df['ECPct'] = df.Electors/ecSum
13:
14: # Collect states with same numbers of electors
15: basicAgg = {'State': 'count', 'Population': 'sum', 'PopPct': 'sum',\
16: 'Electors': 'sum', 'ECPct': 'sum'}
17: dfgBasic = df.groupby(by='Electors').agg(basicAgg)
18:
19: # Print out the summary table
20: print(dfgBasic)
21:
22: print()
23: print()
24:
25: # Collect as above but for summary table in blog post
26: tableAgg = {'Abbrev': lambda s: ', '.join(s), 'PopPct': 'sum', 'ECPct': 'sum'}
27: dfgTable = df.groupby(by='Electors').agg(tableAgg)
28:
29: # Print as Markdown table
30: print(dfgTable.to_markdown(floatfmt='.2%',\
31: headers=['Electors', 'States', 'Pop Pct', 'EC Pct'],\
32: colalign=['center', 'center', 'right', 'right'] ))
Lines 1–12 are the same as before; they import the state population and Electoral College information into a dataframe, `df`, and then add a couple of columns for percentages of the totals. After Line 12, the first five rows of `df` are
State | Abbrev | Population | Electors | PopPct | ECPct |
---|---|---|---|---|---|
Alabama | AL | 5108468 | 9 | 0.015253 | 0.016729 |
Alaska | AK | 733406 | 3 | 0.002190 | 0.005576 |
Arizona | AZ | 7431344 | 11 | 0.022189 | 0.020446 |
Arkansas | AR | 3067732 | 6 | 0.009160 | 0.011152 |
California | CA | 38965193 | 54 | 0.116344 | 0.100372 |
The code in Lines 15–20 groups the data according to the number of Electoral College votes per state and prints out a summary table intended for my use. “Intended for my use” means the output is clear but not the sort of thing I’d want to present to anyone else. “Quick and dirty” would be another way to describe the output, which is this:
State Population PopPct Electors ECPct
Electors
3 7 5379033 0.016061 21 0.039033
4 7 10196485 0.030445 28 0.052045
5 2 4092750 0.012220 10 0.018587
6 6 18766882 0.056035 36 0.066914
7 2 7671000 0.022904 14 0.026022
8 3 13333261 0.039811 24 0.044610
9 2 10482023 0.031298 18 0.033457
10 5 29902889 0.089285 50 0.092937
11 4 28421431 0.084862 44 0.081784
12 1 7812880 0.023328 12 0.022305
13 1 8715698 0.026024 13 0.024164
14 1 9290841 0.027741 14 0.026022
15 1 10037261 0.029970 15 0.027881
16 2 21864718 0.065284 32 0.059480
17 1 11785935 0.035191 17 0.031599
19 2 25511372 0.076173 38 0.070632
28 1 19571216 0.058436 28 0.052045
30 1 22610726 0.067512 30 0.055762
40 1 30503301 0.091078 40 0.074349
54 1 38965193 0.116344 54 0.100372
The `agg` function has many options for passing arguments, one of which is a dictionary in which the column names are the keys and the aggregation functions to be applied to those columns are the values. We create that dictionary, `basicAgg`, in Lines 15–16, where the `State` column is counted and the other columns are summed. Notice that the values of `basicAgg` are strings with the names of the functions to be applied.1
Line 17 then groups the dataframe by the `Electors` column and runs the aggregation functions specified in `basicAgg` on the appropriate columns. The output is a new dataframe, `dfgBasic`, which is then printed out in Line 20. Nothing fancy in the `print` function because this is meant to be quick and dirty.
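To see the dictionary form of `agg` in isolation, here is a small sketch on a toy dataframe. The four rows are copied from the tables in these posts; the names `toy`, `spec`, and `summary` are made up for the example and aren't part of the script.

```python
import pandas as pd

# Toy data: four rows copied from the post's tables, not the full states.csv
toy = pd.DataFrame({
    "State": ["Alaska", "Delaware", "Connecticut", "Colorado"],
    "Population": [733406, 1031890, 3617176, 5877610],
    "Electors": [3, 3, 7, 10],
})

# Keys are column names; values name the aggregation applied to each column
spec = {"State": "count", "Population": "sum"}

# One row per distinct number of electors, one column per key in spec
summary = toy.groupby(by="Electors").agg(spec)
print(summary)
```

The result is indexed by `Electors`, just as `dfgBasic` is, with the two 3-elector states collapsed into a single row.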
After a couple of `print()`s on Lines 22–23 to give us some whitespace, the rest of the code creates a Markdown table that looks like this:
| Electors | States | Pop Pct | EC Pct |
|:----------:|:--------------------------:|----------:|---------:|
| 3 | AK, DE, DC, ND, SD, VT, WY | 1.61% | 3.90% |
| 4 | HI, ID, ME, MT, NH, RI, WV | 3.04% | 5.20% |
| 5 | NE, NM | 1.22% | 1.86% |
| 6 | AR, IA, KS, MS, NV, UT | 5.60% | 6.69% |
| 7 | CT, OK | 2.29% | 2.60% |
| 8 | KY, LA, OR | 3.98% | 4.46% |
| 9 | AL, SC | 3.13% | 3.35% |
| 10 | CO, MD, MN, MO, WI | 8.93% | 9.29% |
| 11 | AZ, IN, MA, TN | 8.49% | 8.18% |
| 12 | WA | 2.33% | 2.23% |
| 13 | VA | 2.60% | 2.42% |
| 14 | NJ | 2.77% | 2.60% |
| 15 | MI | 3.00% | 2.79% |
| 16 | GA, NC | 6.53% | 5.95% |
| 17 | OH | 3.52% | 3.16% |
| 19 | IL, PA | 7.62% | 7.06% |
| 28 | NY | 5.84% | 5.20% |
| 30 | FL | 6.75% | 5.58% |
| 40 | TX | 9.11% | 7.43% |
| 54 | CA | 11.63% | 10.04% |
which presents just the data I want to show and formats it nicely for pasting into the blog post.
Line 26 creates a new aggregation dictionary, `tableAgg`. It’s similar to `basicAgg` except for the function applied to the `Abbrev` column. Because there’s no built-in function (as far as I know) for turning a column into a comma-separated string of its contents, I made one. Because this function is very short and I’m using it only once, I added it to `tableAgg` as a lambda function:
python:
lambda s: ', '.join(s)
This illustrates the overall structure of an aggregation function. It takes a Pandas series (the column) as input and produces a scalar value as output. Because a series acts like a list, it can be passed directly to the `join` function, producing the comma-separated string I wanted in the table.
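The series-in, scalar-out pattern is easier to see with the lambda pulled out into a named function. This is a sketch on toy data; `join_abbrevs`, `toy`, and `states` are names invented for the example.

```python
import pandas as pd

def join_abbrevs(s):
    """Aggregation function: takes a series of strings, returns one string."""
    return ", ".join(s)

# Called directly on a series, it returns a single scalar value
print(join_abbrevs(pd.Series(["AL", "SC"])))

# Toy data: abbreviations and elector counts taken from the post's table
toy = pd.DataFrame({
    "Abbrev": ["AK", "DE", "CT", "OK"],
    "Electors": [3, 3, 7, 7],
})

# Used inside agg, it is called once per group
states = toy.groupby(by="Electors").agg({"Abbrev": join_abbrevs})
print(states)
```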
The last few lines print out the table in Markdown format, using the `to_markdown` function to generate the header line, the format line, and the pipe characters between the columns. While I didn’t know about `to_markdown` until ondaiwai told me about it, it does its work via the `tabulate` module, which I have used (badly) in the past. The keyword arguments given to `to_markdown` in Lines 30–32 are passed on to `tabulate` to control the formatting:
- `floatfmt='.2%'` sets the formatting of all the floating point numbers, which in this case are the `PopPct` and `ECPct` columns. Note that in the quick-and-dirty table, these columns are printed as decimal values with more digits than necessary.
- `headers=['Electors', 'States', 'Pop Pct', 'EC Pct']` sets the header names to something nicer than the default, which would be the names in `df`.
- `colalign=['center', 'center', 'right', 'right']` sets the alignment of the four columns. Without this, the Electors column would be right-aligned (the default for numbers) and the States column would be left-aligned (the default for strings).
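Here is a cut-down sketch of the same three keyword arguments on a two-row dataframe. The values are copied from the quick-and-dirty table, and `toy` and `table` are names invented for the example. Because `tabulate` is an optional dependency of Pandas, the call is wrapped in a `try` so the sketch still runs where it isn't installed.

```python
import pandas as pd

# Two rows of unrounded percentages copied from the quick-and-dirty table
toy = pd.DataFrame(
    {"PopPct": [0.016061, 0.030445], "ECPct": [0.039033, 0.052045]},
    index=pd.Index([3, 4], name="Electors"),
)

# What the '.2%' format spec does to a single value
print(format(0.016061, ".2%"))

try:
    # The index counts as a column, so each list has three entries here
    table = toy.to_markdown(
        floatfmt=".2%",
        headers=["Electors", "Pop Pct", "EC Pct"],
        colalign=["center", "right", "right"],
    )
    print(table)
except ImportError:
    table = None  # tabulate is not installed
```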
Why do I think this code is better than what I showed in the last post? Mainly because this code operates on dataframes and series all at once instead of looping through items as my previous code did. This is how Pandas is meant to be used. Also, the functions being applied to the columns are more obvious because of how the `basicAgg` and `tableAgg` dictionaries are built. In my previous code, these functions are inside braces in f-strings, where they’re harder to see. Similarly, the formatting of the Markdown table is easier to see when it’s given as arguments to a function instead of buried in `print` commands.
My thanks again to ondaiwai for introducing me to `to_markdown` and getting me to think in a more Pandas-like way. He rewrote and extended the code snippets he posted on Mastodon and made a gist of it. After seeing his code, I felt pretty good about my script; it’s different from his, but it shows that I understood what he was saying in the tweets.
Update 7 Sep 2024 12:33 PM
Kieran Healy followed up with a post on doing this summary analysis in R, but there’s more to his post than a simple language translation. He does a better job than I did at explaining why we should use higher-level abstractions in our analysis code. While I just kind of waved my hands and said “this is how Pandas is meant to be used,” Kieran explains that this is a general principle, not just a feature of Pandas. He also does a good job of relating the Pandas `agg` function to the split-apply-combine strategy, which has been formalized in Pandas, R, and Julia (and, I assume, other data analysis systems).
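As a rough illustration of split-apply-combine, the sketch below does the three steps by hand on toy data and then lets `groupby` do them in one expression. This is my own illustration, not code from Kieran’s post, and the variable names are invented.

```python
import pandas as pd

# Toy data: populations and elector counts copied from the states table
toy = pd.DataFrame({
    "Electors": [3, 3, 7, 10],
    "Population": [733406, 1031890, 3617176, 5877610],
})

# Split: one sub-dataframe per distinct number of electors
pieces = {k: toy[toy.Electors == k] for k in sorted(toy.Electors.unique())}

# Apply: reduce each piece to a single number
totals = {k: piece.Population.sum() for k, piece in pieces.items()}

# Combine: gather the results into one series
by_hand = pd.Series(totals, name="Population")
by_hand.index.name = "Electors"

# The same three steps in one expression
with_groupby = toy.groupby(by="Electors").Population.sum()

print(by_hand)
print(with_groupby)
```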
Using abstractions like this in programming is analogous to how we advance in mathematics. If you’re doing a problem that requires matrix multiplication, you don’t write out every step of row-column multiplication and addition, you just write the product as a single expression, such as $C = AB$.
This is how you get to think about the purpose of the operation, not the nitty-gritty of how it’s done.
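The same contrast shows up in code. In this sketch (which assumes NumPy, and uses two small made-up matrices), the triple loop spells out every row-column multiplication and addition, while the `@` operator says only what is wanted.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# The nitty-gritty: every row-column multiplication and addition
C_loops = np.zeros((2, 2), dtype=int)
for i in range(2):
    for j in range(2):
        for k in range(2):
            C_loops[i, j] += A[i, k] * B[k, j]

# The abstraction: one expression for the whole product
C = A @ B

print(C_loops)
print(C)
```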
-
While you can, in theory, use the functions themselves as the values (e.g., `sum` instead of `'sum'`), I found that doesn’t necessarily work for all aggregation functions. The `count` function, for example, seems to work only if it’s given as a string. I’m sure there’s a good reason for this, but I haven’t figured it out yet. ↩
Pandas and the Electoral College
August 29, 2024 at 12:15 PM by Dr. Drang
A couple of weeks ago, I used the Pandas `groupby` function in some data analysis for work, so when I started writing my previous post on the Electoral College, `groupby` came immediately to mind when I realized I wanted to add this table to the post:
Electors | States | Pop Pct | EC Pct |
---|---|---|---|
3 | AK, DE, DC, ND, SD, VT, WY | 1.61% | 3.90% |
4 | HI, ID, ME, MT, NH, RI, WV | 3.04% | 5.20% |
5 | NE, NM | 1.22% | 1.86% |
6 | AR, IA, KS, MS, NV, UT | 5.60% | 6.69% |
7 | CT, OK | 2.29% | 2.60% |
8 | KY, LA, OR | 3.98% | 4.46% |
9 | AL, SC | 3.13% | 3.35% |
10 | CO, MD, MN, MO, WI | 8.93% | 9.29% |
11 | AZ, IN, MA, TN | 8.49% | 8.18% |
12 | WA | 2.33% | 2.23% |
13 | VA | 2.60% | 2.42% |
14 | NJ | 2.77% | 2.60% |
15 | MI | 3.00% | 2.79% |
16 | GA, NC | 6.53% | 5.95% |
17 | OH | 3.52% | 3.16% |
19 | IL, PA | 7.62% | 7.06% |
28 | NY | 5.84% | 5.20% |
30 | FL | 6.75% | 5.58% |
40 | TX | 9.11% | 7.43% |
54 | CA | 11.63% | 10.04% |
I got the population and elector information from the Census Bureau and the National Archives, respectively, and put them in a CSV file named `states.csv` (which you can download). The header and first ten rows are
State | Abbrev | Population | Electors |
---|---|---|---|
Alabama | AL | 5108468 | 9 |
Alaska | AK | 733406 | 3 |
Arizona | AZ | 7431344 | 11 |
Arkansas | AR | 3067732 | 6 |
California | CA | 38965193 | 54 |
Colorado | CO | 5877610 | 10 |
Connecticut | CT | 3617176 | 7 |
Delaware | DE | 1031890 | 3 |
District of Columbia | DC | 678972 | 3 |
Florida | FL | 22610726 | 30 |
(Because the District of Columbia has a spot in the Electoral College, I’m lumping it in with the 50 states and referring to all of them as “states” in the remainder of this post. That makes it easier for both you and me.)
Here’s the initial code I used to summarize the data:
python:
1: #!/usr/bin/env python
2:
3: import pandas as pd
4:
5: # Import the raw data
6: df = pd.read_csv('states.csv')
7:
8: # Generate new fields for percentages
9: popSum = df.Population.sum()
10: ecSum = df.Electors.sum()
11: df['PopPct'] = df.Population/popSum
12: df['ECPct'] = df.Electors/ecSum
13:
14: # Collect states with same numbers of electors
15: dfg = df.groupby(by='Electors')
16:
17: # Print out the summary table
18: print('State EC States Population Pop Pct Electors EC Pct')
19: for k, v in dfg:
20: print(f' {k:2d} {v.State.count():2d} {v.Population.sum():11,d}\
21: {v.PopPct.sum():6.2%} {v.Electors.sum():3d} {v.ECPct.sum():6.2%}')
The variable `df` contains the dataframe of all the states. It’s read in from the CSV in Line 6, and then extended in Lines 9–12. After Line 12, the first five rows of `df` are
State | Abbrev | Population | Electors | PopPct | ECPct |
---|---|---|---|---|---|
Alabama | AL | 5108468 | 9 | 0.015253 | 0.016729 |
Alaska | AK | 733406 | 3 | 0.002190 | 0.005576 |
Arizona | AZ | 7431344 | 11 | 0.022189 | 0.020446 |
Arkansas | AR | 3067732 | 6 | 0.009160 | 0.011152 |
California | CA | 38965193 | 54 | 0.116344 | 0.100372 |
The summarizing is done first by using the `groupby` function in Line 15, which groups the states according to the number of electors they have. The resulting variable, `dfg`, is of type `pandas.core.groupby.generic.DataFrameGroupBy`, which, when looped over, works like the items of a dictionary: the keys are the numbers of electors per state and the values are dataframes of the subset of states with the number of electors given by that key. Lines 19–21 loop through these key-value pairs and print out summary information for each subset of states. The output is this:
State EC States Population Pop Pct Electors EC Pct
3 7 5,379,033 1.61% 21 3.90%
4 7 10,196,485 3.04% 28 5.20%
5 2 4,092,750 1.22% 10 1.86%
6 6 18,766,882 5.60% 36 6.69%
7 2 7,671,000 2.29% 14 2.60%
8 3 13,333,261 3.98% 24 4.46%
9 2 10,482,023 3.13% 18 3.35%
10 5 29,902,889 8.93% 50 9.29%
11 4 28,421,431 8.49% 44 8.18%
12 1 7,812,880 2.33% 12 2.23%
13 1 8,715,698 2.60% 13 2.42%
14 1 9,290,841 2.77% 14 2.60%
15 1 10,037,261 3.00% 15 2.79%
16 2 21,864,718 6.53% 32 5.95%
17 1 11,785,935 3.52% 17 3.16%
19 2 25,511,372 7.62% 38 7.06%
28 1 19,571,216 5.84% 28 5.20%
30 1 22,610,726 6.75% 30 5.58%
40 1 30,503,301 9.11% 40 7.43%
54 1 38,965,193 11.63% 54 10.04%
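To see what the loop is working with, here is a stripped-down sketch on three rows copied from the states table. The names `toy` and `rows` are invented for the example; the format specs are the same ones used in the script.

```python
import pandas as pd

# Toy data: three rows copied from the states table
toy = pd.DataFrame({
    "State": ["Alaska", "Delaware", "Connecticut"],
    "Population": [733406, 1031890, 3617176],
    "Electors": [3, 3, 7],
})

rows = []
for k, v in toy.groupby(by="Electors"):
    # k is the number of electors; v is a dataframe of the matching states
    rows.append(f"{k:2d}  {v.State.count():2d}  {v.Population.sum():11,d}")

print("\n".join(rows))
```

The `11,d` spec pads the population to eleven characters and adds thousands separators, which is what keeps the columns lined up in the output above.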
This is what I was looking at when I wrote the last few paragraphs of the post, but I didn’t want to present the data to you in this way; I wanted a nice table with only the necessary columns. I added a couple of `print()` statements to the code above to add a little whitespace and then this code to create a MultiMarkdown table.
python:
25: # Print out the Markdown summary table
26: print('| Electors | States | Pop Pct | EC Pct |')
27: print('|:--:|:--:|--:|--:|')
28: for k, v in dfg:
29: print(f'| {k:2d} | {", ".join(v.Abbrev)} | {v.PopPct.sum():6.2%} \
30: | {v.ECPct.sum():6.2%} |')
This is very much like the previous output code. Apart from adding the pipe characters and the formatting line, there’s the
python:
{", ".join(v.Abbrev)}
piece in the f-string of Line 29. The `join` part concatenates all the state abbreviations, putting a comma-space between them. I decided a list of states with the given number of electors would be better than just a count of how many such states there are.
The output from this additional code was
| Electors | States | Pop Pct | EC Pct |
|:--:|:--:|--:|--:|
| 3 | AK, DE, DC, ND, SD, VT, WY | 1.61% | 3.90% |
| 4 | HI, ID, ME, MT, NH, RI, WV | 3.04% | 5.20% |
| 5 | NE, NM | 1.22% | 1.86% |
| 6 | AR, IA, KS, MS, NV, UT | 5.60% | 6.69% |
| 7 | CT, OK | 2.29% | 2.60% |
| 8 | KY, LA, OR | 3.98% | 4.46% |
| 9 | AL, SC | 3.13% | 3.35% |
| 10 | CO, MD, MN, MO, WI | 8.93% | 9.29% |
| 11 | AZ, IN, MA, TN | 8.49% | 8.18% |
| 12 | WA | 2.33% | 2.23% |
| 13 | VA | 2.60% | 2.42% |
| 14 | NJ | 2.77% | 2.60% |
| 15 | MI | 3.00% | 2.79% |
| 16 | GA, NC | 6.53% | 5.95% |
| 17 | OH | 3.52% | 3.16% |
| 19 | IL, PA | 7.62% | 7.06% |
| 28 | NY | 5.84% | 5.20% |
| 30 | FL | 6.75% | 5.58% |
| 40 | TX | 9.11% | 7.43% |
| 54 | CA | 11.63% | 10.04% |
which I then ran through my filter in BBEdit to produce the nicer-looking version:
| Electors | States | Pop Pct | EC Pct |
|:--------:|:--------------------------:|--------:|-------:|
| 3 | AK, DE, DC, ND, SD, VT, WY | 1.61% | 3.90% |
| 4 | HI, ID, ME, MT, NH, RI, WV | 3.04% | 5.20% |
| 5 | NE, NM | 1.22% | 1.86% |
| 6 | AR, IA, KS, MS, NV, UT | 5.60% | 6.69% |
| 7 | CT, OK | 2.29% | 2.60% |
| 8 | KY, LA, OR | 3.98% | 4.46% |
| 9 | AL, SC | 3.13% | 3.35% |
| 10 | CO, MD, MN, MO, WI | 8.93% | 9.29% |
| 11 | AZ, IN, MA, TN | 8.49% | 8.18% |
| 12 | WA | 2.33% | 2.23% |
| 13 | VA | 2.60% | 2.42% |
| 14 | NJ | 2.77% | 2.60% |
| 15 | MI | 3.00% | 2.79% |
| 16 | GA, NC | 6.53% | 5.95% |
| 17 | OH | 3.52% | 3.16% |
| 19 | IL, PA | 7.62% | 7.06% |
| 28 | NY | 5.84% | 5.20% |
| 30 | FL | 6.75% | 5.58% |
| 40 | TX | 9.11% | 7.43% |
| 54 | CA | 11.63% | 10.04% |
This is the Markdown I added to the post.
Update 1 Sep 2024 7:11 AM
There’s a better version of the script here.
What I didn't learn about the Electoral College
August 27, 2024 at 4:28 PM by Dr. Drang
People my age1 like to talk about how our education was superior to that of these kids today. I can say with absolute certainty that one aspect of my education was terribly deficient: how the Electoral College works.
I’m not talking about the existence of the Electoral College or how it works on a superficial level—I definitely learned that. But there were two aspects of the Electoral College system that were never mentioned.
First was the likelihood of the popular vote and the electoral vote being at odds with one another. My classmates and I were taught that it could happen, but it was treated as a sort of theoretical thing, not something of any practical significance. Maybe that sort of thing had happened long ago—I say “maybe” because the Hayes-Tilden election of 1876 was never a big focus in history class—but we modern Americans would never have to worry about that.
I suspect this misguided optimism came from the extremely close elections that had recently occurred: Kennedy-Nixon in 1960 and Nixon-Humphrey in 1968. In both cases, the winner of the popular vote won the electoral vote with room to spare, and that may have given people the feeling that the Electoral College would always reflect the popular vote.2
The second problem was how we were taught that the Electoral College could (theoretically, remember) lead to an inversion of results. The example we were given was two candidates, call them A and B, and two states, a large one and a small one. If A wins the large state by a very small margin and B wins the small state by a large margin, A would get more electoral votes with fewer popular votes.
While this is true, it’s misleading in that it leads you to think that it’s the large states that have sort of an unfair advantage, when the truth is the exact opposite. Because every state gets two senators, they also get two “free” electoral votes, votes that are unrelated to their population.
Here’s a summary of the states, ranked by their number of electors and presenting their percentages of the population and the Electoral College. I took the 2023 population estimates from the Census Bureau and the current Electoral College vote counts from the National Archives.
Electors | States | Pop Pct | EC Pct |
---|---|---|---|
3 | AK, DE, DC, ND, SD, VT, WY | 1.61% | 3.90% |
4 | HI, ID, ME, MT, NH, RI, WV | 3.04% | 5.20% |
5 | NE, NM | 1.22% | 1.86% |
6 | AR, IA, KS, MS, NV, UT | 5.60% | 6.69% |
7 | CT, OK | 2.29% | 2.60% |
8 | KY, LA, OR | 3.98% | 4.46% |
9 | AL, SC | 3.13% | 3.35% |
10 | CO, MD, MN, MO, WI | 8.93% | 9.29% |
11 | AZ, IN, MA, TN | 8.49% | 8.18% |
12 | WA | 2.33% | 2.23% |
13 | VA | 2.60% | 2.42% |
14 | NJ | 2.77% | 2.60% |
15 | MI | 3.00% | 2.79% |
16 | GA, NC | 6.53% | 5.95% |
17 | OH | 3.52% | 3.16% |
19 | IL, PA | 7.62% | 7.06% |
28 | NY | 5.84% | 5.20% |
30 | FL | 6.75% | 5.58% |
40 | TX | 9.11% | 7.43% |
54 | CA | 11.63% | 10.04% |
There are lots of ways to slice this data, but it always looks bad for the large states. For example, the 14 states that have either 3 or 4 Electors combine to have 9.11% of the total Electoral College vote3 but only 4.65% of the population. Compare this to Texas, which has nearly twice the population (9.11% of the US total) but gets only 7.43% of the Electoral College vote. You could make a similar comparison between the 16 states with 5 or fewer Electors and California.
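The figures in that comparison can be reproduced from the summary numbers. This sketch uses the elector counts and unrounded population fractions from the aggregated output shown in the follow-up post, and the 538 total from the footnote; the variable names are made up for the illustration.

```python
TOTAL_ELECTORS = 538   # denominator used in the footnote (49/538)

# The 14 states with 3 or 4 electors
small_electors = 21 + 28                  # electors in the two groups
small_ec_share = small_electors / TOTAL_ELECTORS
small_pop_share = 0.016061 + 0.030445     # unrounded population fractions

# Texas, for comparison
texas_pop_share = 0.091078
texas_ec_share = 0.074349

print(f"Small states: {small_pop_share:.2%} of population, "
      f"{small_ec_share:.2%} of electors")
print(f"Texas:        {texas_pop_share:.2%} of population, "
      f"{texas_ec_share:.2%} of electors")
print(f"Texas has {texas_pop_share / small_pop_share:.2f} times the population")
```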
Starting in 2000, every election except the two Obama wins has either inverted the popular and electoral winner or come close to doing so. I’d like to think that’s changed the way civics is taught.
-
I turned 64 last Friday, which will be the last time my age will be a power of 2 or of 4. Still hoping to make it to the next power of 3. ↩
-
The Carter-Ford election of 1976 probably confirmed that feeling, but I was past civics classes by then. ↩
-
The table suggests it should be 9.10% (3.90% + 5.20%), but that’s because of rounding. To three digits, it’s 9.11% (49/538). ↩