Jake Arrieta and Python

Although Jake Arrieta still has good looking statistics for the season, my sense is that he’s been a mediocre pitcher since the All Star break. Am I right about this? Because it’s easy to get game-by-game statistics for almost any aspect of baseball, I figured it wouldn’t be hard to answer this question. More important, I could use this as an opportunity to practice my Python, Matplotlib, and Pandas skills.

Modern, stats-loving baseball fans can probably come up with a dozen ways to measure the quality of a pitcher, but I wasn’t interested in cutting-edge sabermetric research, so I stuck with the pitching quality statistic of my youth, the earned run average. Remember, this is mostly an excuse to keep my Python skills sharp. Don’t write to tell me about your favorite pitching stat; I don’t care.

I started with a simple text file that looked like this:


Each line represents one game. The date is in month/day format, the earned runs are integers, and the innings pitched are in a hybrid decimal/ternary format common to baseball, with the full innings before the period and the partial innings after it. The full innings are regular decimal numbers, but the partials represent thirds of an inning. The number after the period can take on only the values 0, 1, or 2.
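As a small illustration of that notation, here is a sketch of converting an innings-pitched value to true decimal innings in Python. The function name `innings_to_decimal` is made up for this example; it isn't part of the analysis that follows, which handles the conversion by splitting the field in the input file instead.

```python
def innings_to_decimal(ip):
    """Convert baseball notation such as '6.2' to true innings (6 + 2/3)."""
    # Text before the period is full innings; the digit after it counts thirds
    full, _, outs = str(ip).partition('.')
    thirds = int(outs) if outs else 0
    if thirds not in (0, 1, 2):
        raise ValueError(f"partial innings must be 0, 1, or 2, got {outs!r}")
    return int(full) + thirds/3

print(innings_to_decimal('6.2'))   # six and two-thirds innings
print(innings_to_decimal('7.0'))   # seven full innings
```

The value goes in as a string so that the digit after the period is never mistaken for tenths.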

Because I knew Pandas would look at the IP values and interpret them as regular decimal numbers, I decided to massage the input file first. Also, I figured it wouldn’t hurt to add the year to the game dates. After a couple of find-and-replaces in BBEdit, the file looked like this:


Now there are separate fields for the full innings and partial innings pitched. Time to fire up Jupyter and commence to analyzin’.

import pandas as pd
from functools import partial
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (12,9)

This is mostly standard preamble stuff. The odd ones are the rcParams call, which makes the inline graphs bigger than the tiny Jupyter default, and the functools import, which will help us create ERAs over small portions of the season.

Next, we read in the data, tell Pandas how to interpret the dates, and create an innings pitched field in standard decimal form:

df = pd.read_csv('arrieta.txt')
df['Date'] = pd.to_datetime(df.Date, format='%m/%d/%Y')
df['IP'] = df.FIP + df.PIP/3

Now we can calculate Jake’s ERA for each game.

df['GERA'] = df.ER/df.IP*9

To get his season ERA as the season develops, we need the cumulative numbers of innings pitched and earned runs given up. That turns out to be easy to do with the Pandas cumsum method.

df['CIP'] = df.IP.cumsum()
df['CER'] = df.ER.cumsum()
df['ERA'] = df.CER/df.CIP*9
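To see what cumsum is doing, here is the same calculation on a tiny made-up table. The three games below are invented for illustration and are not from Arrieta's game log; the column names match the ones used above.

```python
import pandas as pd

# Three invented games: earned runs and innings pitched (already in decimal form)
df = pd.DataFrame({'ER': [0, 3, 1],
                   'IP': [7.0, 6.0, 5.0]})

df['CIP'] = df.IP.cumsum()       # running innings: 7, 13, 18
df['CER'] = df.ER.cumsum()       # running earned runs: 0, 3, 4
df['ERA'] = df.CER/df.CIP*9      # season ERA after each game

print(df)
```

Each row of ERA uses every game up to and including that row, which is why a single bad outing moves the season number less and less as the innings pile up.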

Let’s see how his ERA has developed:

plt.plot_date(df.Date, df.ERA, '-k', lw=2)

Arrieta ERA through season

The rise in May, though steep, doesn’t indicate poor pitching. Arrieta just started the season out so well that even very good pitching looks bad in comparison. It’s the jump in late June and early July that looks really bad. And now that I think of it, the Cubs did have a terrible three-week stretch just before the All Star break. Arrieta’s performance was part of that, so his slide started earlier than I was thinking.

But even after the big jump, there’s been a slow climb. So even though the Cubs have played excellent ball since mid-July, Arrieta hasn’t. To get a better handle on how poorly he’s been pitching, we could plot the game-by-game ERAs, but that’s likely to be too jumpy to see any patterns. A way to smooth that out is to calculate moving averages.

But what’s the appropriate number of games to average over? Two is certainly too few, but three might work. Or maybe four. I decided to try both three and four. To do this, I defined a function that operates on a row of the DataFrame to create a running average of ERA over the last n games:

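def rera(games, row):
    # row.name is the row's integer index in df
    if row.name+1 < games:
        # Early in the season: use every game so far
        ip = df.IP[:row.name+1].sum()
        er = df.ER[:row.name+1].sum()
    else:
        # Otherwise use the most recent `games` games
        ip = df.IP[row.name+1-games:row.name+1].sum()
        er = df.ER[row.name+1-games:row.name+1].sum()
    return er/ip*9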

The if part handles the early portion of the season when there aren’t yet n games to average over. Looking at the code now, I realize this could have been simplified by eliminating the if/else and just using

max(0, row.name+1-games)

as the lower bound of the slice. Oh, well. I’ll leave my crude code here as a reminder to think more clearly next time.
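For the record, a sketch of what that simplification might look like. The function name `rera_clamped`, the explicit `frame` argument, and the five invented games are assumptions made so the example runs on its own; `.iloc` is used to make the positional slicing explicit.

```python
import pandas as pd

# Invented numbers for illustration only
df = pd.DataFrame({'ER': [0, 3, 1, 2, 4],
                   'IP': [7.0, 6.0, 5.0, 8.0, 6.0]})

def rera_clamped(frame, games, row):
    # Clamp the start of the window at 0 so early rows use all games so far
    start = max(0, row.name + 1 - games)
    stop = row.name + 1
    ip = frame.IP.iloc[start:stop].sum()
    er = frame.ER.iloc[start:stop].sum()
    return er/ip*9

era3 = df.apply(lambda row: rera_clamped(df, 3, row), axis=1)
print(era3)
```

Like the original, this relies on the DataFrame having the default 0, 1, 2, … index, so that row.name is also the row's position.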

With this general function defined, we can use the partial function imported from functools to quickly define running average functions for three and four games and add those fields to the DataFrame.

era4 = partial(rera, 4)
era3 = partial(rera, 3)
df['ERA4'] = df.apply(era4, axis=1)
df['ERA3'] = df.apply(era3, axis=1)
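A different route to the same numbers, not the one used in this post, is Pandas's built-in rolling window. The sketch below sums earned runs and innings over a three-game window and then divides; setting min_periods=1 lets the first couple of games use whatever is available, which plays the role of the if branch in rera. The data is invented for illustration.

```python
import pandas as pd

# Invented numbers for illustration only
df = pd.DataFrame({'ER': [0, 3, 1, 2, 4],
                   'IP': [7.0, 6.0, 5.0, 8.0, 6.0]})

# Rolling sums over the last 3 games (fewer at the start of the season)
roll = df[['ER', 'IP']].rolling(3, min_periods=1).sum()
df['ERA3'] = roll.ER/roll.IP*9

print(df)
```

Note that this sums runs and innings before dividing. Taking a rolling mean of the game-by-game ERAs would give a different answer, because it would weight a two-inning disaster the same as a complete game.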

Now we can plot everything:

plt.plot_date(df.Date, df.ERA3, '-b', lw=2)
plt.plot_date(df.Date, df.ERA4, '-r', lw=2)
plt.plot_date(df.Date, df.GERA, '.k', ms=10)
plt.plot_date(df.Date, df.ERA, '--k', lw=2)

Arrieta various ERAs

As you can see, I didn’t bother to make this look nice. I just wanted to dump it all out and take a look at it. I don’t see much practical difference between the three-game average (blue) and the four-game average (red). Arrieta did have a good stretch there in late July and early August (which I hadn’t noticed), but he’s not been pitching well since then. It’s not unlike his early years with Baltimore. I’ll be curious to see where he’s put in the playoff rotation.

As is usually the case with my baseball posts, I’ve learned more about programming tools than I have about the game. I’ve used partial many times in the past, and I always feel like a real wizard when I do. But I’ve never used cumsum before, and I’m really impressed with it. Not only does it perform a useful function, it’s implemented in a way that couldn’t be easier to use for common cases like the ones here. I hope I don’t forget it.

Update Sep 24, 2016 9:33 PM
Of course I take full credit for Arrieta’s performance the day after this was posted. Jake obviously reads this blog, especially for the Python-related posts, and realized he needed to step up his game after seeing the clear evidence of how he’s gone downhill recently.

Fractionally improved

There’s a new version of PCalc in the App Store, and it has a great new feature for those of us who have to work with lengths measured in inches and fractions of an inch. It’s a display mode that allows all your calculations to be kept in fractional notation rather than converted to decimal form. This is remarkably generous for a developer, James Thomson, who lives in Europe, where you can, I believe, be thrown in jail for failing to worship the Decimal God of Metric.

PCalc has for some time been able to accept fractional input. If, for example, you want to enter 1⅝, you can tap out the sequence

1 . 5 . 8

and the number you want will appear in the display (assuming you have the Quick Fraction Entry preference set in Advanced Settings).

Quick Fraction Entry

Up until now, that number would change from fractional form to decimal form when you tapped any operation key or (if you work in RPN, as all right-thinking people do) the Enter key.

But with PCalc 3.6, you can set the display mode to Fraction, either in the Settings

Fraction Display Setting

or by tapping and holding on the display area.

Action Buttons

What’s great about this is that it allows you to get fractional answers when that’s the natural way to present the results. For example, we can add 1⅝″ and ¾″1

Fractional Input

to get 2⅜″.

Fractional Answer

Sure, it’s easy to see 2.375 and think 2⅜, but even us old hands at decimal-fraction conversion hesitate when dealing with 16ths and 32nds.
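For anyone who wants to check this sort of arithmetic away from the calculator, here is a rough sketch using Python's standard fractions module. It has nothing to do with how PCalc works internally; the helper names `nearest_fraction` and `mixed` are made up for this example.

```python
from fractions import Fraction

def nearest_fraction(value, denominator=32):
    """Round a decimal length to the nearest 1/denominator, in lowest terms."""
    return Fraction(round(value * denominator), denominator)

def mixed(frac):
    """Format a non-negative Fraction as a mixed number such as '2 3/8'."""
    whole, rem = divmod(frac.numerator, frac.denominator)
    if rem == 0:
        return str(whole)
    if whole == 0:
        return f"{rem}/{frac.denominator}"
    return f"{whole} {rem}/{frac.denominator}"

total = Fraction(13, 8) + Fraction(3, 4)    # 1 5/8 + 3/4
print(mixed(total))                         # 2 3/8
print(mixed(nearest_fraction(2.375)))       # 2 3/8
print(mixed(nearest_fraction(0.40625)))     # 13/32
```

Fraction reduces to lowest terms automatically, so a value that lands on an even number of 32nds comes out in 16ths, 8ths, or whatever is simplest.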

I’m told PCalc 3.6 has some cool features that work with iOS 10. I wouldn’t know, as I haven’t tried out Apple’s public beta and don’t intend to install the released version of iOS 10 until the initial excitement—and the initial bugs—have passed by. What I do know is that PCalc 3.6 is an excellent update even on my horribly outdated iPhone 6S running iOS 9.

  1. To enter a pure fraction, tap the decimal point key twice between the numerator and the denominator, e.g., 3 . . 4

S Club 7?

As a rule, I don’t buy a new phone every year. I like my iPhones, but a new phone every other year is good enough for me. I’ve come to prefer the S versions. With an S, you don’t get the newest enclosure design, but you do get the refinements that come with a year of experience, and you typically get a serious improvement to the internals. But it looks like that pattern will be broken with the 7.

I switched to the S Club with the 5S. I had a 5 (actually, I had three 5s—two of them were replaced because of dust between the camera lens and sensor), but I went ahead and bought a 5S because my son’s 4 was on its last legs and needed to be replaced. He got my 5 as a hand-me-down.

The 5S, as you may recall, was a big jump up from the 5 despite having essentially the same case geometry and overall look. The processor went from 32 bits to 64, the home button got Touch ID, the camera got burst mode, and the case became more scratch-resistant. Also, I never had any problems with dust under the 5S’s camera lens.

I skipped the 6 and waited for the 6S,1 which meant I got the way way faster Touch ID, the stronger frame, twice the RAM, improved LTE networking, and the usual faster processor and better camera. S Club wins again.

But the 7 is the third version of this enclosure design, and it seems extremely unlikely we’ll see an S version of it.2 More important, next year is the 10th anniversary of the original iPhone, and Apple will certainly mark the occasion with a major change in case design. Sadly, there will be no S Club 7.

What will I do? Get the 10th Anniversary iPhone? Wait around another year for what will certainly be a distinctly better version of that design? I suspect I’ll be unable to resist whatever comes out in 2017, and I’ll have to tear up my S Club membership card.

  1. By the way, I don’t care whether Apple capitalizes the S. I do because otherwise it looks like a plural. 

  2. Were it not for the Nazi connotations, the 7 would be more properly named the 6SS. 

Just a little more Lagrange

I don’t want to turn this into a blog exclusively about Lagrange points,1 but I got what I believe is the correct answer to the question I posed a few days ago:

Why, in A Fall of Moondust, did Arthur C. Clarke get the numbering of the first two Lagrange points of the Earth-Moon system backward?

While it’s certainly true that the numbering of the points is arbitrary, the convention I’ve always seen has been to put L1 between the Earth and the Moon and L2 beyond the Moon. Clarke, no neophyte when it came to orbital mechanics, turned those numbers around.

Today, the likely answer arrived in my Twitter feed. Here’s Peter Demarest with a bit of scholarship:

@drdrang Szebehely’s “Theory of Orbits” used that numbering for the Langrange points, but it fell out of fashion long ago.
Peter Demarest (@pdemarest) Sep 1 2016 9:47 AM

It didn’t take me long to find Victor Szebehely and his wonderfully focused book, Theory of Orbits: The Restricted Problem of Three Bodies, published in 1967 by Academic Press. On page 133, we see this:

Szebehely page 133

As in the analysis we went through a couple of weeks ago, Szebehely sets his origin at the barycenter. But as Peter tweeted, Szebehely defines L1 as the colinear point beyond the smaller mass and L2 as the colinear point between the masses. Clarke may have used Szebehely directly or he may have referred to works that used the same nomenclature. Either way, it wasn’t a mistake.

Theory of Orbits is a delightful book, very much in keeping with the era in which it was written. I have many engineering texts from this period, and it’s common for them to treat numerical methods as something exotic. Here’s a passage from page 135 of Szebehely, where he gives advice on the most effective ways to solve the fifth-order algebraic equation that leads to the positions of the colinear Lagrange points:

Szebehely page 135

Take a look at the paragraph after Equation 20. He’s describing a direct iteration technique and actually concerns himself with the time savings that can be achieved by eliminating a single iteration. How quaint!
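To give a sense of how little an iteration costs now, here is a sketch of a direct (fixed-point) iteration for the colinear point between the two masses. This is not Szebehely's scheme, which I'm not reproducing here. It assumes the usual rotating-frame units (separation of the masses, total mass, and angular velocity all equal to 1), takes mu as the smaller body's share of the total mass, and uses 0.01215 as an approximate Earth-Moon value. The function names are made up for the example.

```python
def residual(mu, gamma):
    """Net acceleration at a distance gamma from the smaller mass, on the
    line between the masses; zero at the equilibrium point."""
    x = 1 - mu - gamma                        # position measured from the barycenter
    return x - (1 - mu)/(1 - gamma)**2 + mu/gamma**2

def colinear_gamma(mu, tol=1e-12, max_iter=200):
    """Solve residual(mu, gamma) = 0 by rearranging it as gamma = g(gamma)."""
    gamma = (mu/3)**(1/3)                     # Hill-sphere estimate as a first guess
    for _ in range(max_iter):
        denom = (1 - mu)/(1 - gamma)**2 - (1 - mu) + gamma
        new = (mu/denom)**0.5
        if abs(new - gamma) < tol:
            return new
        gamma = new
    raise RuntimeError("iteration did not converge")

mu = 0.01215                                  # approximate Earth-Moon mass ratio (assumption)
gamma = colinear_gamma(mu)
print(gamma, residual(mu, gamma))
```

The result is the distance from the Moon to the point as a fraction of the Earth-Moon distance. This particular rearrangement converges fairly slowly, taking dozens of passes to reach the tolerance, and nobody will notice.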

Thanks to Peter for solving the mystery and introducing me to this wonderful book.

  1. On the other hand, success in blogging is often said to be the result of a tight focus on a single topic. Maybe an all-Lagrange blog would be the big break I’ve been waiting for.