Spreadsheet programming problems

You may have heard today that a well-known paper in macroeconomics has, after three years of being cited as proof that national deficits must be held in check or ruin will result, finally had its underlying calculations scrutinized. The conclusions: the famous paper excluded some data, weighted other data in an unusual manner, and—our topic for today—included a bug in the spreadsheet that did the calculations.

The paper, known as “Reinhart and Rogoff” after its authors, concluded that

… across both advanced countries and emerging markets, high debt/GDP levels (90 percent and above) are associated with notably lower growth outcomes.

It refers to the 90% level as a threshold, above which growth takes a nosedive. A recent paper, which corrects the bug and recalculates the figures in Reinhart and Rogoff, says the threshold doesn’t exist.

I’ll leave the public policy dispute to the economists. The bug is pretty simple and is explained in both the new paper and Mike Konczal’s blog post. One calculation was supposed to take an average of the growth rates of the 20 countries included in the study, but the range used as the argument to the AVERAGE function left off the last five countries in the table.1

This is a truly cringeworthy bug, but we’ve all been there. As Kieran Healy tweeted:

R&R’s Excel errors do look bad, but if you didn’t get a tiny pit-of-the-stomach sympathy pain you’ve never done any data analysis.
Kieran Healy (@kjhealy) Tue Apr 16 2013 12:18 PM CDT

My relationship with spreadsheets is long and complicated. In the early days, they simply didn’t have the power to do the kinds of calculations I needed. In the late 80s, though, that changed,2 and I got very good at using Excel. Soon, though, I began to hate Excel as Microsoft made the Mac version more like the Windows version. Still, I kept using it. When I switched to Linux in the late 90s, there were no decent spreadsheets available, so I went back to more traditional programming/scripting. I found it superior to working with spreadsheets and haven’t looked back.

Clearly, everyone’s data analysis needs are different, and there are some real advantages to using spreadsheets, the main one being that all the calculated values are on display in front of you (provided your tables aren’t too large). But the disadvantages of spreadsheets tip the scale for me.

First, there’s the unfortunate flip side of seeing all the numbers: the formulas that give rise to those numbers are, for the most part, hidden. Ultimately, it’s the formulas that tell the tale of your analysis, and a traditional language displays the logic of the analysis (and the flaws in that logic) better than a spreadsheet does.

In fact, you seldom want to see all the numbers. The results of intermediate calculations are usually best kept offstage. You can do that in a spreadsheet by hiding cells, but when you’re in the process of building the spreadsheet you need those cells available for editing. What seems to happen most often is that people make longer formulas to avoid having to deal with cells with intermediate results. The longer the formula, the more likely it is to include an error.

Formulas with branching (IF) are almost always too long to be easily read because all three parts—the test and the two alternatives—are bound together in a single statement. It’s true that lots of languages have the similar ternary operator (?:), but most programmers use that only when the two alternatives are very short. Spreadsheet programmers use the IF function no matter how long the alternatives are. And if there’s nested branching, the formula can grow to absurd lengths.

Some of the difficulties in writing clearly understood formulas could be alleviated by the use of named variables. While it’s true that Excel allows you to name cells and cell ranges and even create local variables for use within individual cells, in my experience very few people do that. It’s so easy to just click, formulas get littered with A1s and B3:C5s, which is terrible for debugging. Good descriptive variable (and function) naming is what Edsger Dijkstra had in mind when he wrote

Besides a mathematical inclination, an exceptionally good mastery of one’s native tongue is the most vital asset of a competent programmer.

This is a truth that’s lost in most spreadsheet programming.

Clarity would also be served by including comments, which Excel allows but few people use.

You could argue that many of the problems I’ve outlined are due to poor programming practice, not spreadsheets per se. That’s true, but spreadsheets seem to attract especially poor programming, even from people who ought to know better.

  1. Because I said I was going to leave the dispute to the economists, I won’t give my opinion on the propriety of taking an average of 20 values when 12 of them are missing. 

  2. The key improvement for me was the addition of matrix inversion so I could solve simultaneous equations.