Plastic and nature

Over the weekend, my wife and I went on a bike ride to a nearby lake and wetland area. The park district built a wooden deck that extends out a bit into the wetland, and we went out onto the deck to look at the herons, frogs, and turtles. There are benches on the deck that use plastic instead of wood for the seat and back slats, and we noticed that several of the seat slats were broken. Vandalism, I thought, until I looked at the breaks more carefully.

“Go get your phone and take a picture,” my wife said. “You know you want to.”

So I did.

[Photo: broken seat slats]

The first thing you should notice is that the breaks occur where the screws come up from below to secure the slats to the steel frame of the bench. You should also see that each fracture surface has an oval that’s a darker color than the outer edges. Let’s look closer at the fracture on the right.

[Photo: broken slat detail]

A foaming agent was used to get those little bubbles in the plastic. It reduces the amount of resin necessary to fill the mold, which cuts down on the weight and the cost to produce the slat.

The fracture has a few important features:

  1. The cracks started at the screws because of the high stresses induced by driving the screws directly into the slats.

  2. The cracks grew outward, probably through a process of creep rupture, which created the dark ovals.

  3. As the cracks grew, the stresses driving the crack growth lessened because the slat became more compliant.

  4. The final failure probably occurred when someone stepped on the slats, but possibly when someone sat out on the cantilevered end of the seat. The outer rind of the weakened slat stretched, leaving behind the whitened zone.

I’ll bet most of the other slats have similar internal oval cracks and are just one step away from snapping off. I’ll also bet the manufacturer of these slats recommends that screw holes be predrilled to avoid this problem.

The frogs and turtles and herons were fun to look at, too.


Pandas and Cubs

A few days ago, while looking at the major league standings, I felt certain the Cubs had been at a standstill since May. For the Cubs, this is an improvement—they spend most seasons in decline—but it falls short of what several preseason predictions expected of them. I decided to use Pandas and Matplotlib to see how (and if) their season was progressing.

I last wrote about Matplotlib a couple of weeks ago, right after Apple disclosed its quarterly financials. Mike Williams looked over what I’d done and made a suggestion:

@drdrang 👍 did you try pandas? makes typical time series plots like this much easier (incl moving average and human-friendly date labels)
mikew (@williamsmj_) Jul 22 2015 2:44 PM

He also whipped up a quick IPython notebook and posted it as a gist. I don’t agree that Mike’s work is equivalent to mine—part of what I did was deliberately more complex than necessary because I wanted to practice certain data transformations, and Mike avoided those—but I do agree that Pandas would eliminate some of the drudgery of data import.

Pandas, in case you haven’t heard of it, is Wes McKinney’s Python module for data analysis. If you’re familiar with the statistics language R, you’ll recognize certain parts of Pandas. I have McKinney’s book, and I’ve used Pandas a few times at work for analyzing test results, but it’s not something I’ve used enough to feel completely comfortable with. I decided the Cubs problem would be a good one to practice on.

Here’s the plot I ended up with. I expanded the problem to include the Pirates and the Giants, the teams the Cubs have been competing with recently for the National League wildcard spots. I suppose I should’ve included the Mets, too, but I’m still pissed at the Mets for 1969, so to hell with them.

[Plot: 2015 Wildcard]

What we see here is that the Cubs have progressed a little bit over the past few months, but only a little bit. The Giants have been a streaky team in both directions. And the Pirates have been consistently good since pulling themselves out of the hole they dug in May.

The data for this year’s games came from Baseball Reference. I found that the more limited information in the mobile version of the site was easier to use. The Cubs data, for example, looked like this:

CHC 0 v. STL 3, Apr 5th
CHC 2 v. STL 0, Apr 8th
CHC 1 @ COL 5, Apr 10th
CHC 9 @ COL 5, Apr 11th
CHC 6 @ COL 5, Apr 12th
CHC 7 v. CIN 6, Apr 13th
CHC 2 v. CIN 3, Apr 14th
CHC 5 v. CIN 0, Apr 15th
etc.

Each line represents one game. I copied it out of the web page and pasted it into a BBEdit document. After a bit of manipulation, it looked like this:

R     A     O     W     D
0     0     XXX   X     Apr-1
0     3     STL   H     Apr-5
2     0     STL   H     Apr-8
1     5     COL   A     Apr-10
9     5     COL   A     Apr-11
6     5     COL   A     Apr-12
7     6     CIN   H     Apr-13
2     3     CIN   H     Apr-14
5     0     CIN   H     Apr-15
etc.

In the real file, called cubs-2015.txt, the columns are separated by tabs, whereas here I’ve separated them by enough spaces to look like tabs. The first line is for the column labels: Runs for, runs Against, Opponent, Where played, and Date. The second line is a fake game I put there for reasons I’ll explain later.

The script that generates the plot is this:

python:
 1:  #!/usr/bin/env python
 2:  
 3:  import numpy as np
 4:  import pandas as pd
 5:  import matplotlib.pyplot as plt
 6:  from matplotlib.ticker import MultipleLocator
 7:  import sys
 8:  
 9:  cubs = pd.read_table('cubs-2015.txt')
10:  cubs['G'] = np.cumsum(np.sign(cubs['R'] - cubs['A']))
11:  pirates = pd.read_table('pirates-2015.txt')
12:  pirates['G'] = np.cumsum(np.sign(pirates['R'] - pirates['A']))
13:  giants = pd.read_table('giants-2015.txt')
14:  giants['G'] = np.cumsum(np.sign(giants['R'] - giants['A']))
15:  
16:  fig, ax = plt.subplots(figsize=(12, 4))
17:  fig.set_tight_layout({'pad': 1.75})
18:  ax.plot(pirates['G'], color='#F8C633', linestyle='-', linewidth=3, label='Pirates')
19:  ax.plot(giants['G'], color='#EB5234', linestyle='-', linewidth=3, label='Giants')
20:  ax.plot(cubs['G'], color='#0B2B85', linestyle='-', linewidth=3, label='Cubs')
21:  
22:  # Tweak the layout.
23:  plt.title('2015 Wildcard')
24:  plt.xlabel('Game number')
25:  plt.ylabel('Games above .500')
26:  ax.grid(linewidth=1, which='major', color='#dddddd', linestyle='-')
27:  ax.set_axisbelow(True)
28:  plt.xlim((0, 162))
29:  ax.xaxis.set_major_locator(MultipleLocator(10))
30:  ax.xaxis.set_minor_locator(MultipleLocator(2))
31:  plt.ylim((-5, 20))
32:  ax.yaxis.set_major_locator(MultipleLocator(5))
33:  ax.yaxis.set_minor_locator(MultipleLocator(1))
34:  plt.legend(loc=(.04, .52))
35:  
36:  
37:  plt.savefig('20150802-Wildcard plot.png', format='png', dpi=100)

My use of Pandas is in Lines 9–14, where the data files are read in and the “Games above .500” value is calculated. Pandas reads the data into what it calls a DataFrame (R users are now nodding). From McKinney’s book:

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index.

Our DataFrames, which are given the team names, are created by the read_table commands on Lines 9, 11, and 13. Because read_table is smart, it figures out that the first line of each of the data files contains the names of the columns. The columns can then be referenced by cubs['R'], cubs['A'], cubs['O'], and so on.

The row indices start at zero; therefore the “first” game is assigned an index of zero. Because I wanted the index to represent the game number, and because I wanted each team to start at .500 after “Game 0,” I inserted a fake game, a 0–0 tie against opponent XXX, at the top of each data file.
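To see the alignment, here’s a quick peek at what read_table produces, using the cubs DataFrame from the script above (the printed output is my sketch of what it should look like, not a captured run):

python:
import pandas as pd

cubs = pd.read_table('cubs-2015.txt')

# Row 0 is the fake 0-0 tie, so from Row 1 on, the integer
# index matches the real game number.
print(cubs.head(3))
#    R  A    O  W      D
# 0  0  0  XXX  X  Apr-1
# 1  0  3  STL  H  Apr-5
# 2  2  0  STL  H  Apr-8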

Lines 10, 12, and 14 create a new column for each DataFrame: the number of games above .500. For each game, the runs against is subtracted from the runs scored, and the np.sign function is applied to the result, giving a +1 for a win, a –1 for a loss, and a 0 for the fake tie. The np.cumsum function is then applied to the series of positive and negative ones, which gives us the running count of games above .500. This new column is added to the DataFrame under the label G, and G is what we’re going to plot.
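If it helps, here’s the calculation in miniature, with a made-up set of run margins rather than the real data:

python:
import numpy as np

# R - A for the fake game and five invented games.
margins = np.array([0, -3, 2, -4, 4, 1])

print(np.sign(margins))             # [ 0 -1  1 -1  1  1]
print(np.cumsum(np.sign(margins)))  # [ 0 -1  0 -1  0  1]

The second array is the games-above-.500 series: after five real games, this imaginary team has three wins and two losses, so it sits one game over.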

Lines 16–20 set up the figure and plot the data series. For fun, I used Acorn’s eyedropper tool to get the teams’ colors from their official MLB home pages.

Lines 23–34 format the plot. As I’ve mentioned before, I’m picky about how my plots look, and I’ve never been satisfied with the default layout of any plotting package. Default layouts are fine during the exploratory phase—there’s no point in tweaking plots that aren’t going to be seen by anyone else—but not for final copy. Even the supposedly good ones like Seaborn need extra work. So I’m putting together a little template I can insert into every plotting script to set the limits, the ticks, the labels, etc. I’m not sure yet whether I want the template to be a TextExpander snippet or a BBEdit clipping, but whatever it turns out to be, Lines 23–34 are going to be the basis for it.
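As a first stab, the template might look something like this helper function (the name and parameter list are placeholders, not settled choices):

python:
from matplotlib.ticker import MultipleLocator

def format_axes(ax, title, xlabel, ylabel, xlim, ylim,
                xmajor, xminor, ymajor, yminor):
    "Apply my usual formatting: title, labels, grid, limits, and ticks."
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.grid(linewidth=1, which='major', color='#dddddd', linestyle='-')
    ax.set_axisbelow(True)
    ax.set_xlim(xlim)
    ax.xaxis.set_major_locator(MultipleLocator(xmajor))
    ax.xaxis.set_minor_locator(MultipleLocator(xminor))
    ax.set_ylim(ylim)
    ax.yaxis.set_major_locator(MultipleLocator(ymajor))
    ax.yaxis.set_minor_locator(MultipleLocator(yminor))

With something like that in place, Lines 23–33 would collapse to a single call, leaving only the legend placement to handle separately.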

After I wrote out Lines 23–34, I realized that almost every line in that block is analogous to a part of the preparatory work I used to have to do when I made plots by hand on graph paper. This is not a coincidence. A good plot needs attention to how it’s scaled and labeled.

Speaking of which, you might be wondering why I set the horizontal axis to run out to the full 162-game length of a season, even though the teams have played just over 100 games so far. Large empty areas are generally considered a mistake, but here I wanted the plot to give a sense of how much of the season is left.

Overall, I was happy with Pandas. For each team, I needed just one line to read in the file and one line to create the series I wanted to plot. That’s pretty damned efficient. I suspect that as I get more comfortable with it, I’ll be able to cut back on the file preparation and let Pandas parse more complicated input files.
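For instance, the raw mobile-site lines could probably be parsed directly, skipping the BBEdit cleanup. Here’s a sketch; the file name is hypothetical, and the pattern is my guess at the format based on the sample above:

python:
import re
import pandas as pd

# Matches lines like "CHC 1 @ COL 5, Apr 10th".
game = re.compile(r'(\w+) (\d+) (v\.|@) (\w+) (\d+), (\w+ \d+)')

rows = []
with open('cubs-2015-raw.txt') as f:    # hypothetical raw copy/paste
    for line in f:
        m = game.match(line.strip())
        if m:
            team, r, where, opp, a, date = m.groups()
            rows.append({'R': int(r), 'A': int(a), 'O': opp,
                         'W': 'H' if where == 'v.' else 'A',
                         'D': date.replace(' ', '-')})

# The fake Apr-1 "Game 0" would still need to be prepended.
cubs = pd.DataFrame(rows)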


Floating upstream and downstream

I had a lot of fun recording the Revolver episode of Inquisitive that Myke Hurley released earlier this week.

[Image: Revolver album cover]

I even enjoyed listening to it, except for a few goofy misstatements I made along the way:

  1. The Beatles went to Rishikesh to visit the Maharishi, not the Maharaja. I will say, though, that they would have been better off if they’d made a pilgrimage to study at the feet of Oscar Peterson. I’m sure they’d’ve gotten more out of it than Dick Cavett did.

  2. The George song I expect to hear coming out of “Eleanor Rigby” is “Love You To,” not “For No One.” I have no idea what made me give the name of a Paul song, but at least it was a Paul song on this album. Chances are, I also called George’s song “Love To You” at least once because that’s what the lyrics are. That mistake is not age-related; I’ve been saying “Love To You” instead of “Love You To” since I was a teenager.

  3. The artist who made Revolver’s cover is, of course, Klaus Voormann, not John Voormann. I did at least have the grace to call him Klaus every time other than when I first mentioned him. Although I said that Klaus was a friend of the Beatles from their Hamburg days, I forgot to say that he was also a friend of Astrid Kirchherr, whose photographs of the boys are so well known (including the cover of John’s Rock ’n’ Roll album) and who gave the boys their first Beatle haircuts.

  4. Not actually a misstatement, but as I was trying to remember the names of Donovan songs, I completely forgot “Mellow Yellow,” which probably wouldn’t have meant anything to Myke anyway. If he, or anyone in the UK, wants to bone up on the Scottish psychedelic folk singer, there’s a tour coming up in October and early November. Special VIP tickets, including lunch with Donovan, are only £2,500. Quite rightly.

One last thing: Myke and I talked a little after the show about the Yesterday and Today album, which is where the three John songs that were cut from the American release of Revolver ended up. He didn’t know anything about that album, so I told him to Google “Beatles butcher cover.” The sounds of disgust that came over Skype were delightful.


Two problems

As I read Jason Snell’s commentary on my Patterns post, I realized that no one—not on Twitter, not in email—had sent me Jamie Zawinski’s famous saying about regular expressions:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Zawinski was making tweet-sized aphorisms long before Twitter.

Jeffrey Friedl, he of the definitive book on regular expressions, wrote a much-updated post on the origin of this saying and its antecedents. I recommend it if you have any interest in regexes, programming, or Unix. The comments are nearly as fun as the post itself.

Literal-minded people take Zawinski’s gibe too seriously and either parrot it as great wisdom or get into long arguments on how it’s wrong. I agree with Friedl and Jeff Atwood that Zawinski’s real target was not regexes themselves, but their overuse, which he saw as a bad programming practice encouraged by Perl. It may be hard to believe now, but back in the 90s, Perl was being used everywhere. In his post, Friedl includes this other Zawinski quote:

The heavy use of regexps in Perl is due to them being far and away the most obvious hammer in the box.

This strikes me as exactly right and not hyperbolic in the least. When I switched from Perl to Python, the hardest thing to get used to was Python’s more verbose way of handling regexes. Regular expressions are so easy to use in Perl, and so intimately woven into the language, that I had gotten into the habit of using them everywhere. It wasn’t until I had been programming for a while in Python, where I preferred not to import re unless necessary, that I realized how much I had been overusing regexes. Judicious slicing combined with literal string searches and splits often works just as well as regexes on simple input, and proper parsing into data structures is almost always the better choice for highly organized data formats.
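Here’s a trivial illustration of the kind of overuse I mean. Both versions pull the extension off a file name; the second skips the import and reads just as plainly:

python:
import re

name = 'cubs-2015.txt'

# The regex habit:
ext = re.search(r'\.([^.]+)$', name).group(1)    # 'txt'

# A literal split does the same job:
ext = name.rpartition('.')[2]                    # 'txt'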

Even so, regexes are so often exactly what’s needed that I’m glad I have Patterns in my toolbox.