Baseball as an excuse for programming

I find it really hard to watch baseball nowadays because the game moves so slowly, but I do still like to look at statistics and standings. The standings in Yahoo! Sports include a figure that was uncommon when I was a kid: each team’s run differential, the difference between its runs scored and runs given up.

Final standings

What jumped out at me this year was how small the Orioles’ run differential was for a playoff-bound team. They ended up 24 games above .500 but outscored their opponents by only 7 runs over the course of the year. Contrast that with Tampa Bay, who had the third highest run differential in the majors but will be watching the playoffs on TV.

I decided to see how well a team’s run differential predicts its final record. I copied the standings to a text file and deleted most of the columns, ending up with a file that looked like this:

NYY 95  67  .586  +136
BAL 93  69  .574  +7
TAM 90  72  .556  +120
TOR 73  89  .451  -68
BOS 69  93  .426  -72
DET 88  74  .543  +56
CHS 85  77  .525  +72
KAN 72  90  .444  -70
CLE 68  94  .420  -178
MIN 66  96  .407  -131
OAK 94  68  .580  +99
TEX 93  69  .574  +101
LAA 89  73  .549  +68
SEA 75  87  .463  -32
WAS 98  64  .605  +137
ATL 94  68  .580  +100
PHI 81  81  .500  +4
NYM 74  88  .457  -59
MIA 69  93  .426  -115
CIN 97  65  .599  +81
STL 88  74  .543  +117
MIL 83  79  .512  +43
PIT 79  83  .488  -23
CHC 61  101 .377  -146
HOU 55  107 .340  -211
SFG 94  68  .580  +69
LAD 86  76  .531  +40
ARI 81  81  .500  +46
SDP 76  86  .469  -59
COL 64  98  .395  -132

Because every team played a 162-game season, I was able to use wins as a proxy for a team’s record. Here’s the plot of wins vs. run differential for all the teams, along with the regression line.

Wins vs. run differential

As expected, the Orioles are way above the regression line, but so are the Reds and the Indians (this is cold comfort for Cleveland fans). The plot also shows how Tampa Bay’s wins are indeed fewer than expected for its run differential, but the Cardinals and Diamondbacks got even less out of their run differentials. St. Louis, at least, got into the playoffs.
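To put rough numbers on those gaps, here is a minimal sketch that fits the same kind of least-squares line and ranks teams by how far their win totals sit above or below it. It is not the script that made the plot: it uses NumPy's polyfit instead of linregress, and the data is typed in from the table above rather than read from a file. The names DATA, residuals, and ranked are just for this example.

```python
import numpy as np

# (team, wins, run differential), copied from the standings table above.
DATA = [
    ("NYY", 95, 136), ("BAL", 93, 7), ("TAM", 90, 120), ("TOR", 73, -68), ("BOS", 69, -72),
    ("DET", 88, 56), ("CHS", 85, 72), ("KAN", 72, -70), ("CLE", 68, -178), ("MIN", 66, -131),
    ("OAK", 94, 99), ("TEX", 93, 101), ("LAA", 89, 68), ("SEA", 75, -32),
    ("WAS", 98, 137), ("ATL", 94, 100), ("PHI", 81, 4), ("NYM", 74, -59), ("MIA", 69, -115),
    ("CIN", 97, 81), ("STL", 88, 117), ("MIL", 83, 43), ("PIT", 79, -23),
    ("CHC", 61, -146), ("HOU", 55, -211),
    ("SFG", 94, 69), ("LAD", 86, 40), ("ARI", 81, 46), ("SDP", 76, -59), ("COL", 64, -132),
]

teams = [t for t, _, _ in DATA]
wins = np.array([w for _, w, _ in DATA], dtype=float)
rdiff = np.array([rd for _, _, rd in DATA], dtype=float)

# Least-squares line: wins ~ slope * rdiff + intercept.
slope, intercept = np.polyfit(rdiff, wins, 1)

# Residual = actual wins minus the wins the line predicts.
residuals = wins - (slope * rdiff + intercept)
ranked = sorted(zip(teams, residuals), key=lambda pair: pair[1], reverse=True)

# Three biggest overperformers, then three biggest underperformers.
for team, res in ranked[:3] + ranked[-3:]:
    print(f"{team} {res:+.1f}")
```

A positive residual means a team won more games than the line predicts for its run differential, and a negative one means it won fewer.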

This graph was, as much as anything, an excuse for me to practice using the NumPy and Matplotlib Python packages. Here’s the code that generated the plot:

 1:  #!/usr/bin/python
 3:  from scipy import stats
 4:  from pylab import *
 6:  # Read in the data.
 7:  mlb = loadtxt('mlb.txt', dtype=[('team', 'S3'), ('w', 'i'), ('l', 'i'), 
 8:                                   ('pct', 'f'),  ('rdiff', 'i')])
10:  # Plot the data with invisible points.
11:  scatter(mlb['rdiff'], mlb['w'], s=0)
12:  xlabel('Run differential')
13:  ylabel('Wins')
15:  # Put team names at the data points.
16:  for (t, w, rd) in zip(mlb['team'], mlb['w'], mlb['rdiff']):
17:    text(rd, w, t, size=9,
18:         horizontalalignment='center', verticalalignment='center')
20:  # Perform the linear regression
21:  m, b, r, p, stderr = stats.linregress(mlb['rdiff'], mlb['w'])
23:  # Get endpoints of regression line and plot it.
24:  rdMin = min(mlb['rdiff'])
25:  wMin = m*rdMin + b
26:  rdMax = max(mlb['rdiff'])
27:  wMax = m*rdMax + b
28:  plot([rdMin, rdMax], [wMin, wMax])
30:  show()

As you can see in Line 4, I’m using the PyLab package, which is supposed to provide a combination of NumPy, SciPy, and Matplotlib. And as you can see in Line 3, I was still forced to import the stats library from SciPy to do the regression analysis in Line 21. I’m afraid I’m not experienced enough with these libraries to understand why that’s the case.
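For what it's worth, my understanding is that `from pylab import *` pulls in names from NumPy and Matplotlib's pyplot but not SciPy's submodules, which would explain the separate stats import. A sketch of the regression step with every import spelled out, assuming a reasonably recent SciPy where linregress returns a result object with named fields, might look like this. The two lists are the run differential and wins columns from the table above, in the same order.

```python
import numpy as np
from scipy import stats

# Run differential and wins, in the same team order as the table above.
rdiff = np.array([136, 7, 120, -68, -72, 56, 72, -70, -178, -131,
                  99, 101, 68, -32, 137, 100, 4, -59, -115,
                  81, 117, 43, -23, -146, -211, 69, 40, 46, -59, -132])
wins = np.array([95, 93, 90, 73, 69, 88, 85, 72, 68, 66,
                 94, 93, 89, 75, 98, 94, 81, 74, 69,
                 97, 88, 83, 79, 61, 55, 94, 86, 81, 76, 64])

# Same call as Line 21, but keeping the result object instead of unpacking it.
fit = stats.linregress(rdiff, wins)

print(f"slope     = {fit.slope:.6f}")
print(f"intercept = {fit.intercept:.6f}")
print(f"r         = {fit.rvalue:.6f}")
print(f"p         = {fit.pvalue:e}")
print(f"stderr    = {fit.stderr:.6f}")
```

Explicit imports cost a couple of extra lines but make it obvious which library each function comes from.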

Overall, I think the script is pretty easy to understand, with only a couple of things that may need additional explanation.

Because the input file has a mix of string, integer, and floating point values, I had to specify on Lines 7-8 how the loadtxt function was to interpret the fields.
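Here is a small, self-contained sketch of that field specification. It assumes Python 3 and a recent NumPy, so it uses 'U3' (a three-character Unicode string) where the script uses 'S3', and it reads three rows from an in-memory string instead of mlb.txt so it can run on its own.

```python
import io
import numpy as np

# Three rows in the same layout as the data file shown earlier.
sample = """NYY 95  67  .586  +136
BAL 93  69  .574  +7
TAM 90  72  .556  +120
"""

# One (name, type) pair per whitespace-separated column.
fields = [("team", "U3"), ("w", "i4"), ("l", "i4"),
          ("pct", "f4"), ("rdiff", "i4")]

mlb = np.loadtxt(io.StringIO(sample), dtype=fields)

# Columns can then be pulled out by field name.
print(mlb["team"])
print(mlb["rdiff"])
```

The result is a structured array, which is why the script can refer to mlb['rdiff'] and mlb['w'] by name.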

The s=0 parameter in the scatter call on Line 11 sets the size of the markers to zero, rendering them invisible. I did this so I could then plot the team names at their respective points in Lines 16-18. I had hopes that the scatter function could be instructed to set the markers to the team names directly, but I couldn’t find a way to do that. Consider this two-part solution a workaround.
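One possible alternative to the invisible markers, sketched here under the assumption that only the labels are wanted, is to skip scatter entirely and set the axis limits by hand. As far as I know, text labels alone don't make Matplotlib rescale the axes, which is the job the zero-size scatter points were quietly doing. The padding values and the three-team sample are arbitrary choices for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

# Three rows from the standings: (team, wins, run differential).
teams = [("NYY", 95, 136), ("BAL", 93, 7), ("HOU", 55, -211)]

fig, ax = plt.subplots()

# Put each team name at its (run differential, wins) point.
for name, wins, rdiff in teams:
    ax.text(rdiff, wins, name, size=9, ha="center", va="center")

# Set the limits explicitly, with a little padding around the data.
rds = [rd for _, _, rd in teams]
ws = [w for _, w, _ in teams]
ax.set_xlim(min(rds) - 20, max(rds) + 20)
ax.set_ylim(min(ws) - 5, max(ws) + 5)

ax.set_xlabel("Run differential")
ax.set_ylabel("Wins")
```

Calling fig.savefig with a file name, or plt.show() under an interactive backend, would then produce the figure. Whether this is any tidier than the s=0 trick is a matter of taste.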

I could have spent more time formatting the axes and doing other beautification, but the defaults seemed decent enough. At this point in the learning curve, I’m more interested in plotting than formatting.

10 Responses to “Baseball as an excuse for programming”

  1. Tony McDaniel says:

    Thanks for posting this. Using the team names as plot labels is really nice.

    I was wondering why you didn’t mention the correlation coefficient. While it’s interesting to note the outliers (particularly wrt the playoffs), a numerical measure of correlation would be interesting, particularly when comparing results from multiple years.

    You might be interested in Baseball Hacks by Joseph Adler. He uses R to do the analysis. It was an interesting read. I was surprised how much free data is available on baseball.

  2. Matthew Wyatt says:

    I’m with you on the relative “enjoyability” of baseball vs the statistics it generates. The Run differential is kind of fun - have you looked at the Pythagorean Expectation?

  3. Dr. Drang says:

    I didn’t mention the correlation coefficient (or the other statistics provided by the linregress function) because I thought the goodness of fit was pretty clear from the plot itself. It would be easy enough to add something like

    print '''m = %f
    b = %f
    r = %f
    p = %e
    stderr = %f''' % (m, b, r, p, stderr)

    to the code to have all those results printed. The results are

    m = 0.112356
    b = 81.000000
    r = 0.947662
    p = 2.055260e-15
    stderr = 0.007154

    So yes, pretty good correlation this year, at least.

    Also, I’ve read somewhere that sabermetricians see an equivalence between 9 runs and a win. I don’t know if our slope being very close to 1/9 is related to that or not, but it is suggestive. You’d have to look at more years, obviously, and I’m sure many people have already done that.

    I’ve used R in the past, but I dislike its syntax, and I don’t do serious statistics often enough to justify the effort needed to learn it well. Octave and Python can handle both the elementary stats and the more general numerical work I do on a regular basis.

  4. Ed says:

    Just echoing #2. Bill James came up with Pythagorean forecasts of W/L based on RF and RA almost 30 years ago. Is the author not aware?

  5. Rick Mc says:

    You know about pythagorean wins or pythag projection done by Bill James?

  6. Bob Delaney says:

    One wonders, if chess pieces were as sentient as baseball players, what their game actions might be. In no other two games can the actions on the field/board be as free from random events (roll of the dice, tripping over the blue line, etc.) as they are in baseball and chess. Both can be recreated: chess literally move-by-move, and baseball in very enjoyable algorithmic terms. Whether you are right or wrong, whether your analysis proves or disproves something, isn’t the point. Baseball is to a quantitative mind what fine wine is to an epicurean. Your points are very interesting. Keep it up, and have fun.

  7. Danny says:

    One of the things I look at before picking a game to go to, especially when buying the tickets a long time in advance, is the number of runs generated in the opposing team’s games, as an indicator of how many runs I’ll see. This brings up all kinds of questions, such as whether I want to go to a game where the opposing team may score a lot of runs or whether I just want to see my team win… This brings up another interesting one I used to ask my friends: When you play against a computer, you need to set the computer’s level. Say you always win at level 3, win most of the time at level 4, lose most of the time at level 5, and win very infrequently at level 6. How do you set the computer’s level? Most people said 5. Some of my closer friends, who were more candid, said 4… We like to win, but we like everyone else to think we like the challenge and the learning.

  8. Gary Huckabone says:

    What a geek/dork. I love it! :)

  9. Dr. Drang says:

    Yes, Virginia, I have managed to exist without hearing of the Pythagorean Expectation before now. Just as I hate the designated hitter and that the Brewers are in the National League, I’m suspicious of statistics that didn’t exist when I was a young fan. I worry that they’ll get used for nefarious purposes, like trying to deny the MVP award to the first Triple Crown winner in 45 years.

  10. Edward says:

    “In no other two games can the actions on the field/board be as free from random events (roll of the dice, tripping over the blue line etc.)”

    Anyone who believes this did not see this weekend’s playoff games. Some of the things that affected the games this week: drag marks left by placing and removing the tarp from the field, the pattern used in cutting the infield grass, the shadow of the stadium as it covers home plate but leaves the mound well lit, the hardness of the ground based on how much or how little rain has been received throughout the year. The really spectacular once-in-a-lifetime catch made by a journeyman fielder versus the mind-boggling error on the most incredibly simple play by a Gold Glove winner.

    Not to mention the hitting. How many times have you seen a great hitter get served up a thoroughly fat, extremely hittable pitch, only to foul it off? How many times has a massive, powerful blast ended up being foul by inches? How many times have you seen an easy popup end up in no man’s land and result in a double, or a dribbler in front of the plate that no one can get to result in driving in a run?

    Baseball is full of random events, and that’s what makes it so much fun and so unpredictable. For example, look at the pre-season forecasts of where the Nationals, the O’s, and the A’s were supposed to place based on statistics.