Spreadsheet programming problems

You may have heard today that a well-known paper in macroeconomics has, after three years of being cited as proof that national deficits must be held in check or ruin will result, finally had its underlying calculations scrutinized. The conclusions: the famous paper excluded some data, weighted other data in an unusual manner, and—our topic for today—included a bug in the spreadsheet that did the calculations.

The paper, known as “Reinhart and Rogoff” after its authors, concluded that

… across both advanced countries and emerging markets, high debt/GDP levels (90 percent and above) are associated with notably lower growth outcomes.

It refers to the 90% level as a threshold, above which growth takes a nosedive. A recent paper, which corrects the bug and recalculates the figures in Reinhart and Rogoff, says the threshold doesn’t exist.

I’ll leave the public policy dispute to the economists. The bug is pretty simple and is explained in both the new paper and Mike Konczal’s blog post. One calculation was supposed to take an average of the growth rates of the 20 countries included in the study, but the range used as the argument to the AVERAGE function left off the last five countries in the table.1

This is a truly cringeworthy bug, but we’ve all been there. As Kieran Healy tweeted:

R&R’s Excel errors do look bad, but if you didn’t get a tiny pit-of-the-stomach sympathy pain you’ve never done any data analysis.
Kieran Healy (@kjhealy) Tue Apr 16 2013 12:18 PM CDT

My relationship with spreadsheets is long and complicated. In the early days, they simply didn’t have the power to do the kinds of calculations I needed. In the late 80s, though, that changed,2 and I got very good at using Excel. Soon, though, I began to hate Excel as Microsoft made the Mac version more like the Windows version. Still, I kept using it. When I switched to Linux in the late 90s, there were no decent spreadsheets available, so I went back to more traditional programming/scripting. I found it superior to working with spreadsheets and haven’t looked back.

Clearly, everyone’s data analysis needs are different, and there are some real advantages to using spreadsheets, the main one being that all the calculated values are on display in front of you (provided your tables aren’t too large). But the disadvantages of spreadsheets tip the scale for me.

First, there’s the unfortunate flip side of seeing all the numbers: the formulas that give rise to those numbers are, for the most part, hidden. Ultimately, it’s the formulas that tell the tale of your analysis, and a traditional language displays the logic of the analysis (and the flaws in that logic) better than a spreadsheet does.

In fact, you seldom want to see all the numbers. The results of intermediate calculations are usually best kept offstage. You can do that in a spreadsheet by hiding cells, but when you’re in the process of building the spreadsheet you need those cells available for editing. What seems to happen most often is that people make longer formulas to avoid having to deal with cells with intermediate results. The longer the formula, the more likely it is to include an error.

Formulas with branching (IF) are almost always too long to be easily read because all three parts—the test and the two alternatives—are bound together in a single statement. It’s true that lots of languages have the similar ternary operator (?:), but most programmers use that only when the two alternatives are very short. Spreadsheet programmers use the IF function no matter how long the alternatives are. And if there’s nested branching, the formula can grow to absurd lengths.

Some of the difficulties in writing clearly understood formulas could be alleviated by the use of named variables. While it’s true that Excel allows you to name cells and cell ranges and even create local variables for use within individual cells, in my experience very few people do that. It’s so easy to just click, formulas get littered with A1s and B3:C5s, which is terrible for debugging. Good descriptive variable (and function) naming is what Edsger Dijkstra had in mind when he wrote

Besides a mathematical inclination, an exceptionally good mastery of one’s native tongue is the most vital asset of a competent programmer.

This is a truth that’s lost in most spreadsheet programming.

Clarity would also be served by including comments, which Excel allows but few people use.

You could argue that many of the problems I’ve outlined are due to poor programming practice, not spreadsheets per se. That’s true, but spreadsheets seem to attract especially poor programming, even from people who ought to know better.


  1. Because I said I was going to leave the dispute to the economists, I won’t give my opinion on the propriety of taking an average of 20 values when 12 of them are missing. 

  2. The key improvement for me was the addition of matrix inversion so I could solve simultaneous equations. 


33 Responses to “Spreadsheet programming problems”

  1. Tom S. says:

    There are programs out there that offer a good intermediate between analysis with programing and spreadsheets. I haven’t found a good free one but if you don’t mind popping for the money, Matlab is a pretty good alternative. It’s a high level programing language that will show you the values of the intermediate variables in a grid if you want to see them during or after a run. Its very handy for debugging.

  2. Nathan Gouwens says:

    You’ve inspired me to improve my programming practices with spreadsheet analysis! In grad school, I did a lot more complicated analyses than I do now, and I mostly used programs like the ones Tom S. refers to (Matlab, R, Igor Pro). I’d also write small scripts if I needed to process data files or things like that.

    But in my current job, most of my analysis ends up being in Excel, mostly because it’s what we have and what others know how to use. I’ve fallen into exactly the bad habits you mentioned even though I should know better. I guess I just didn’t view it as “real” programming and so never applied the same principles and practices. So thank you for writing this post - I’ll be much more mindful in the future!

  3. has says:

    “First, there’s the unfortunate flip side of seeing all the numbers: the formulas that give rise to those numbers are, for the most part, hidden.”

    That’s just an implementation problem. There’s no reason the UI couldn’t provide more than one view onto the document: the standard input/result view, an alternate view that displays all the equations and their relationships to cells, a test view that allows users to enter dummy data to exercise their code without affecting production data, and so on.

    The problem here is a common one: having cornered the spreadsheet market with Excel, MS have no real incentive to innovate - even if users realize things could be a lot better there’s no alternatives available. It’s the same sort of deal with the big Adobe apps. The only reason vendors even add new features is to keep users on the upgrade treadmill by defining new file formats that previous app versions can’t open. Indeed, their biggest competitors are the previous versions of their apps already installed on users’ computers.

    In theory, OpenOffice/LibreOffice could give MS a good drubbing here, but in practice they’ve even less likely to break the existing Excel mold. Unfortunately, FOSS culture is a lot better at commoditizing existing ideas rather than creating novel new ones; its comparative homogeneity (being mostly programmers scratching their own itch) means that there’s plenty folk capable of replicating existing specialist systems but few with the broad range of alternate skills to conceive completely new ones from scratch - in this case, knowledge and experience in fields such as accounting, end-user programming, and pedagogy.

    e.g. You only have to look at the current plague of “learn to code” sites to realize that the vast bulk of mainstream coders don’t have the first clue about EUP: how to design the platform (paradigm, language, editor, libraries, documentation), how to teach key programming concepts effectively or even what those concepts are (hint: the only thing that actually matters is Abstraction; see Papert et al). They just think it’s a matter of creating yet more complexity-loving micromanaging imperative code grinders just like themselves - which is precisely the opposite to what’s needed. And since a spreadsheet system is just one massive end-user-programmable dataflow-driven calculator… well, you see what the problem might be in leaving programmers to design it.

    So how do you crack such vendor apathy and developer ignorance to build a new generation of spreadsheet apps that apply the last 20 years’ research in advanced interactive environments and end-user language design? I don’t rightly know: it’s fundamentally a people problem, not a technical one: making the right folks aware that better approaches are out there, and motivating them to bring those advances to bear in mass-market products.

  4. has says:

    p.s. OT, but of interest to you:

    https://pypi.python.org/pypi/py-applescript

  5. Robert says:

    You’ve inspired me to improve my programming practices with spreadsheet analysis! In grad school, I did a lot more complicated analyses than I do now, and I mostly used programs like the ones Tom S. refers to (Matlab, R, Igor Pro). I’d also write small scripts if I needed to process data files or things like that.

    But in my current job, most of my analysis ends up being in Excel, mostly because it’s what we have and what others know how to use. I’ve fallen into exactly the bad habits you mentioned even though I should know better. I guess I just didn’t view it as “real” programming and so never applied the same principles and practices. So thank you for writing this post - I’ll be much more mindful in the future!

  6. tom says:

    Agree with the coments about Excel… I am certainly guilty of such errors… though I try. Naming does work as I was working a formula yesterday. Evaluate formula has helped me in revealing my errors.

    However there is a new data analysis capability that the scientific community is starting to use and that is IPython and IPython Notebooks. I like IPython just for the fact that you get feedback on every step! in fact there are some graduate level courses on research tools that are built on this approach!

  7. W. Clardy says:

    Leaving aside has’ “they’re all clueless” rant, Microsoft does offer a much better piece of software for end-user data-crunching — it’s called Access, and despite how much database professionals cringe when we look at what users create, it is much less prone to the typos and omissions common in spreadsheet formulas. It’s not used because spreadsheets are the hammer in everybody’s toolbelt, regardless of how screwy the problem might be.

    Personally, I think that the reason most users reach for spreadsheets even for extremely intricate data analysis is because the spreadsheet looks like a giant sheet of butcher paper, where they can scribble out calculations (formulas) and see what the results are looking like while they scribble.

  8. Paul Leiby says:

    @ Tom S: “There are programs out there that offer a good intermediate between analysis with programing and spreadsheets. I haven’t found a good free one but if you don’t mind popping for the money, Matlab”

    Agreed, but good news: see Octave, a highly compatible open-source substitute for MatLab. Also see R, if your application is long on data and statistics.

  9. Don says:

    First, there seems to be a misconception that MS-Excel “hides” formulas. Balderdash! Most obvious, there is the Formula bar. Less obvious to those who may work less with Excel, or have not worked with it for some time, there is an option to view formulas instead of values. Additionally, when a cell is selected and the sheet is in formula view mode, the precedent cells are highlighted.

    Second, there seems to be a position among some that MS sits on their laurels when it comes to the Excel product. While still not a fan of some of their recent changes in the user interface, I have noticed some steps forward in supporting data analysis, e.g. Tables. (Tables and Names, as mentioned above, make formulas much more understandable.)

    Now, on to Access. I’ll concede that Access may be the better tool for some data analysis, but it is less prevalently used than Excel . Like someone said, it’s the hammer in everybody’s toolbelt. That’s the “PRO” for Excel. We all know the interface and what it means.

    People I’ve known don’t use Access for several reasons, here are my top 4.

    (1) It’s not been part of the basic MS-Office suite (Don’t know about currently, but in the past Access was part of MS-Office Professional and not the standard product.) This means less likely to have used it and more difficulty sharing.

    (2) “Database” seems scary to people. Databases methodology has words like “normalization” in it! Graphical query design, an Access feature considered a mark of the weak-minded to “real” DB folks, is great for those who know how to use it, but an unfamiliar metaphor implying unnecessary complexity (especially when one starts imbedding formulas in result fields) that makes it Too Techie for some.

    (3) The capabilities for review and publishing are far behind Excel’s. In Excel, if a reviewer questions a result, s/he can add a Comment. If a reviewer questions a result from Access, the feedback needs to be either designed into a “results” table (anathema to relational bigots like me) or captured and fed back in a separate document.

  10. Andy says:

    The type of workflow you are describing is technically called scripting. Programming, in this environment, would involve using the visual basic programming interface to write a code module. For the most part these still are basically individual script steps that are executed in sequence. Programming is when the code you write goes through a compiler. Not really nitpicking your overall complaint but since I am a programmer who also programs the actual spreadsheet for a company, I just wanted to draw the line for you.

  11. Rand Rowlands says:

    I agree with W. Clardy to an extent. Most spreadsheet gurus know when it’s time to upgrade their data to Access. Often the curve to do so properly is long though. It requires the data to be in good form (i.e. full, complete records) which is often not the case especially if it originally came from an unskilled user. And then as Nathan Gowens pointed out, sharing Access is not as easy as sharing spreadsheets.

    I also wish Access had as good a stats package built in as Excel. Maybe I missed it in the last release but I still do my ANOVA’s in Excel. (MatLab is an alternative - have to up my skills on that one).

  12. Dan Sutton says:

    Funny. Every now and then, I find myself having to program Excel with the abysmal VBA… I’ll be the first one to admit that the code I write sucks, has no structure and is generally abysmal. Why? Because I can’t be bothered to do it properly — it’s a spreadsheet, fer chrissakes… I think this might be most of the problem with this sort of thing!

  13. Herbie Mountjoy says:

    Yes, VBA sucks. You can become boggle eyed trying to track formulae across hundreds of cells. Why not use the API to perform the maths and leave Excel to just display the result? Sometimes this approach is faster and simpler than the VBA route even though it uses pretty much the same methods and properties.

  14. J. Andrew Smith says:

    “Formulas with branching (IF) are almost always too long to be easily read because all three parts—the test and the two alternatives—are bound together in a single statement. It’s true that lots of languages have the similar ternary operator (?:), but most programmers use that only when the two alternatives are very short. Spreadsheet programmers use the IF function no matter how long the alternatives are. And if there’s nested branching, the formula can grow to absurd lengths.”

    At the risk of intruducing more intermediate results I tend to separate the condition and two alternatives into three cells, then in a fourth cell write a now-very-short IF formula referring to those three cells.

    “Some of the difficulties in writing clearly understood formulas could be alleviated by the use of named variables. While it’s true that Excel allows you to name cells and cell ranges and even create local variables for use within individual cells, in my experience very few people do that. It’s so easy to just click, formulas get littered with A1s and B3:C5s, which is terrible for debugging…”

    The problem with names is that you can’t immediately see where the cells are upon which the formula depends. I like cell addresses, but in the R1C1 style, where you can see immediately and clearly not only relative and absolute refernces but either relative offsets or absolute columns.

  15. has says:

    @ W. Clardy: Sorry, but many mainstream devs are mostly clueless. Having cut my teeth 15 years ago as a self-taught scripter, and slowly crawled my way up from there via every deadly tarpit and humiliating pratfall possible, these days I have a pretty good idea of just how little I actually know.

    What’s surprised me is realizing that while trained professional programmers do have a great depth of knowledge, that its scope is actually extremely narrow. Present them with a problem like relational data management, network communication, distributed processing, etc. and the first thing many of them will do is try to shoehorn it into an object-oriented solution (hence technological abominations like ORMs and SOAP), even though such problems are fundamentally incompatible with OO idioms. (Specifically single-dispatch class-based OO idioms, because they do the same thing to stuff like JavaScript as well, e.g. the oxymoronic Prototype.)

    It’s an unhealthy combination of arrogance, ignorance and laziness: having been spoon-fed C*/Java all the way through college (many so-called Computer Science courses now being a degenerate form of Software Engineering/Java code monkey diploma mill), they see everything as highly reductionist, mechanistic problems with highly reductionist, mechanistic solutions just waiting to be coded using their utterly ingrained imperative procedural tools and idioms. Don’t believe me? Go tally the number of C*/Java devs who spend a significant proportion of their time thinking and working in functional, dataflow, concatenative, logic, etc. etc. idioms and languages. (And I don’t mean dallying in the kiddie pool for two minutes a year; I mean spending months immersed at the bottom of the dive pool, solving serious and substantial problems with it.) I’ll be happy to wait: you can probably do it on your fingers, so it won’t take long.

    I may be a Bear of Very Little Brain, but a scarily high percentage of modern developers are high-functioning Dunning-Krugerites with a unhealthily self-replicating evangelical streak and a high enough concentration to generate permanent toxicity. Because, of course, the whole thing eventually goes full circle, where that whole mindet gets rote-trained into the next generation of professional developers, who inflict it on the next, and so on. And even that isn’t enough, because now they’re all driving this fresh plague of populist Learn to Code sites, pedagogical monstrosities that suck up hapless novices and precious educational resources alike in order to emplace the same crippled defective thinking in the minds of ordinary folks as well.

    I fear for a future populated by nothing but OO programmers who wouldn’t know how to wipe their own ass without a ToiletRoll factory to tear off a sheet and wrap it in an object for them.

    And why does all this matter here? Because applications like Excel and Access are fundamentally about empowering end users - people who may be highly trained and skilled in their own specialist fields (accounting, economics, science, etc) but who have neither the interest nor tolerance for swallowing the vast piles of conceptual and mechanical complexity that the aforementioned mainstream has spent decades constructing for its own amusement.

    When a finance officer sits at their screen, they are not interested in seeing a spreadsheet application or database application; their only concern is to use an accounts ledger tailored to the business’s specific requirements. And this is where end-user programming comes in, because the whole reason VisiCalc and the other 1st-gen spreadsheet systems gave manual book-keeping an absolute smackdown is not that it enabled such domain experts to construct their own accounts ledgers (which paper ledgers could also do) but to automate all the operations required to validate the manually-entered source data and populate all the calculated fields with correct results.

    The challenge then is to minimize the amount of learning, complexity and risk that such domain experts must deal with in order to build their custom systems. If the task in hand is to add up the values from cells C1 to C8 and enter the result into C10, a non-programmer will naturally express this requirement in those terms. While such natural language processing is an AI-complete problem, it is not actually hard to come up with a formal language that lets the user express this requirement quickly, easily and precisely. And indeed the early spreadsheet pioneers did exactly that, allowing the user to attach the expression:

    SUM(C1..C8)

    to the C10 field, such that the value of C10 would be automatically [re]calculated whenever the values of rows 1 to 8 of the C column changed.

    This is a fantastic and lovely bit of HCI design, a profoundly user-centric solution that places the users wants, needs, capabilities and limitations above all else. And all achieved despite the enormous platform limitations at the time (1KHz 8-bit CPUs, 32KB RAM, etc).

    But those old-school computing creatives lived in a very different age, when the scope and nature of mainstream programming was anything but settled - heck, VisiCalc was written in 6502 Assembler, about as free from idioms, metaphors, preconceptions and prejudices as you can get.

    Ask a modern OO-spoonfed Python/JavaScript/C*/Java kid to implement a user-programmable spreadsheet system completely from scratch, and you’re liable to get something that requires those same users to write stuff like this:

    void CalcC10(Document *doc) {
        int result = 0;
        for (int i=0; i<8; i++) {
            result += doc->Column('C')->Row(i).GetValue();
        }
        doc->Column('C')->Row(9).SetValue(result);
    }

    And if you can’t see the disaster where that’s heading, I’ll be happy to pursue this screed later, our esteemed host willing.

  16. has says:

    @W. Clardy: Another thought: It’s ironic you suggest Access as an alternative to Excel for heavier-duty data processing. Users who understand data and work with it on the desktop may, if they stop to think about it a moment, immediately realize that ‘spreadsheet’ and ‘database’ are merely two points on a continuum: tabular data. The only significant distinction is whether or not they which multiple tables to relate to one another, and if so how (i.e. referential vs relational).

    The only reason the distinction between ‘spreadsheet application’ and ‘database application’ exists is because (you guessed it) mainstream programmers consider them to be fundamentally different concepts. Why? Because the back-end storage mechanism is different for each case (non-relational vs relational models). But that’s merely an implementation detail; from the users’ POV there should be only one front-end interface: the one that lets them program, input and view data.

    It’s another example of implementation dictating interface, instead of the other way around: i.e. programmers designing programs in a way that suits their own desires, preconceptions and prejudices, rather than in the way that best fulfills the wants and needs of the programs’ actual users. And with little competition and even less creativity or vision, there’s no motivation for developers ever to rectify this badly factored mess.

    So, yes, you will get users who will use spreadsheets when they should be using a database - but your flawed implication that ‘spreadsheets are for small volumes of data, databases are for large’ just confirms that the current poor arrangement is causing users a lot of confusion and grief. Don’t blame ordinary users for the faults of ‘user-oriented’ tools. Data volume should be irrelevant; their only determining factor should be whether they want related tables to be connected via references or relationships.

    Consider if the whole lot was provided by a single application with a single graphical and programmatic user interface and unified interaction model, the application itself could then automatically choose between whatever back-end storage best suited the custom system the user was constructing in the UI. Which wouldn’t just be limited to ‘spreadsheet’ or ‘local database’, but could be ‘local text or XML file’, ‘central server-based database’, ‘web service for some online application’, etc. Not only would it be a lot more logical and simpler for end-users to work with, it could be a lot more powerful and less mistake-prone too.

    It really is painful how dreadfully architected many of our longest-serving desktop productivity suites are, for no other reason than they’ve been around for so long in their current form that nobody even stops to think it could all be done so much better. Geeks like to warn about the dangers of data silos to user freedom and empowerment, but some of the biggest, most impenetrable silos are the conceptual ones in their own heads.

  17. Tom S. says:

    Hi, Paul (and Nathan).

    Thanks for the suggestion. Its been a long time since I looked at Octave and I think maybe its come a long way. Unfortunately, I don’t think it has the IDE and the Workspace with the grid that you can view the intermediate values in (yet?). As I recall I was using Emacs to edit the data and GNU plot to graph the data. Man, I loved Emacs. Used to live in it in my Linux days.

    In any case, you’re right. I should have remembered Octave. Very powerful and the language is obviously meant to be very Matlab-like. And the price is certainly right.

    Cheers.

  18. Dr. Drang says:

    Such long comments, and me with only a narrow pipette of bandwidth. I’ll respond (and clean up the Markdown) when I return to civilization in a day or so.

  19. John N says:

    While it’s true that Excel allows you to name cells and cell ranges and even create local variables for use within individual cells…

    You can’t use names as local variables because the only available scopes are “workbook” and “worksheet”.

    Setting that aside, a name is never a variable in the usual sense because when Excel evaluates a name it doesn’t just fetch the corresponding value from memory, it first fetches whatever the name represents (e.g. a cell reference or a formula) and then evaluates it. So in conventional programming terms a name is more like a macro or a call to Eval() than a variable.

  20. Jess says:

    I’m surprised that in all that commentary no one has remarked upon one of the most salutary effects of Dr. Drang’s suggestion to actually code one’s work in a real language rather than cobbling about in a spreadsheet: actual code may be checked into a DVCS and shared, with one’s immediate colleagues or with the world. (Yes, of course one may also check in binary blobs like excel files but what good would that do?) As the old folks say, “all bugs are shallow”. In a research scenario, this would allow one to ask one’s buddy down the hall to clone a repo and read through the code before publishing the paper and suffering such ignominy as having one’s work corrected by pimply-faced students of less-prestigious schools after it has collected years of cites and political weight. Like this:

    $ git clone https://github.com/rogoff/growth_debt.git
    $ cd growth_debt
    $ python calc.py < data.csv
    $ vi calc.py
    $ python calc.py < data.csv
    $ git commit -am 'fixed avg() call'
    $ git push

    “Hey Ken you better look at those figures again!”

  21. Jess says:

    Haha yes I can confirm that all is not right with Markdown at this site.

    Impertinent editorial comment from the proprietor:
    No Jess, the problem is that you don’t know how to include multiline code blocks in Markdown. Perhaps the indentation isn’t necessary with some versions of Markdown, but in standard Markdown it is.

  22. Geraldo Coen says:

    It should be clear that spreadsheets were not invented for “serious” programming or data analysis. Spreadsheets are one of the best ideas in software, in fact a killer application, whose objective is to replicate the traditional accountant spreadsheet. That is it. For more serious applications, you need mathematical or statistical software, or genuine programming. By the way, I have a project to develop a true programming spreadsheet. Not VBA or complicated conditional formulae, but true programming. Universal machine. More on this when it will be ready.

  23. has says:

    @Jess: “I’m surprised that in all that commentary no one has remarked upon one of the most salutary effects of Dr. Drang’s suggestion to actually code one’s work in a real language rather than cobbling about in a spreadsheet: actual code may be checked into a DVCS and shared, with one’s immediate colleagues or with the world.”

    Difficulty with code management an implementation problem, same as above. There’s no fundamental reason spreadsheet logic couldn’t be decoupled from spreadsheet data, encoded in a form that is friendly to SCM, and so on. The reason MS don’t do it is because they don’t need to (since they’ve no competition to fear) and the reason FOSS doesn’t do it is because they’ve no clue (being creatively crippled by self-perpetuating homogeneity).

    And, once again, the mere fact that ‘real’ (i.e. general-purpose) languages already support SCM workflows does not make them good domain-specific tools, e.g. compare:

    C10 = SUM(C1..C8)
    

    to:

    total = 0
    for i in range(8):
        total += doc.columns['C'].rows[i].value
    doc.columns['C'].rows[9].value = total
    

    It is absolutely ludicrous to suggest the second solution is better than the first, yet there are many programmers who would. The reason for which is really simple once you actually examine it: their deep understanding of general programming is equalled only to their fundamental ignorance of the actual problem domain at hand.

    Apologies for ranting again, but as someone who’s been all round all sides of the block here, such deliberately unhealthy attitudes seriously honk me off now. For rather than eliminate that ignorance by educating themselves about the problem domain at hand, too many programmers ferociously maintain that ignorance by turning everything into general programming problems which they can then solve using the techniques they already know, without having to learn or do anything outside of their cripplingly narrow comfort zone.

    Such developers take a personal pride in not soiling their hands with such menial tasks such as talking to users or understanding what those users’ jobs involve. They’d far rather spend their entire lives in their own self-imposed soundproofed Skinner Box, diligently typing out code according to The Specification, with no responsibility for either the validity of that specification or the real-world consequences of the programs produced from it. But frankly it is ignorant, arrogant and lazy, not to mention an outright abdication of personal and professional responsibility. The only reason they so often get away with it is because so much of management is no better than they are: programmers who only know how to program, managers who only know how to manage, and so on.

    TheDailyWTF.com, for example, is full of horror stories caused by just this mentality. And while those may be amongst the more egregious WTFs, there’s no end of less dramatic but far more widespread misery, all thanks to the same willfully ignorant, apathetic and even contemptuous attitudes towards any activity except churning out endless reams of code.

  24. has says:

    TL;DR version: The problem isn’t the tools, it’s the culture that produces them. So if you wish to improve the tools, you will have to fix the culture.

  25. Jess says:

    haha has, you made real coding look awful but I have to admit you made spreadsheetery look pretty bad too. Thanks for your honesty, but what is C1 or C8? What is doc or for that matter columns, rows, or value? This is how I would code if I were an economist:

    python
    gross_global_prod = sum(nation.gdp_nom_dollars(year) for nation in nations)

    But of course that is an aesthetic preference, and I wouldn’t want to shatter any illusions about domain expertise. I actually just suspect anyone who could say, “Difficulty with code management [is] an implementation problem,” hasn’t really committed to a DVCS yet. Anyone old enough to have been plagued by p4 or VSS had his shit turned upside down when he started using git or hg on a regular basis. Everybody younger is just very lucky.

    My main complaint about your thesis is that it’s very impractical. E.g., widely-used spreadsheets will be made amenable to use with any sort of version control shortly after Hell freezes over and is colonized by Eskimos riding flying pigs. M$ can’t do it, because at this time they can’t do anything. Have you seen Windows 8? FOSS folks won’t do it not only because as you imply they are basement-dwelling mouth-breathers, but also because they clearly see there would be no point to polishing that turd. They are already part of the world of thousands of overlapping cooperative communities of programmers, engineers, and scientists who collaborate using DVCS. I’m merely suggesting that economists wouldn’t look so silly if they joined in. (Actually I could see Google adding this sort of functionality to their spreadsheet app, ten months before deprecating it.)

    I am assuming that economists are smart people who care about their lives’ work; I’m certainly not suggesting that Milton from Accounts Payable should ever see a CLI. We’ll build him a web app, but we won’t let him near a spreadsheet, for the company’s sake.

    TL;DR: My single suggestion is of course a concrete class of intellectual tools and their associated practical techniques that economists ought to use, which would instantly improve their culture.

  26. Jess says:

    OK I’ll shut up now because clearly Markdown is beyond my capacity.

  27. has says:

    @Jess:”haha has, you made real coding look awful but I have to admit you made spreadsheetery look pretty bad too. Thanks for your honesty, but what is C1 or C8?”

    Spreadsheet systems automatically assign generic cell names, e.g. C8 is cell in row 8 of column C. They are highly convenient for quick-n-dirty coding, and quite adequate for small applications with few parts. As with any programming venture, it is up to the application’s author to decide at what level of complexity they should switch from a quick-n-dirty approach to a planned and formal one.

    Just as any remotely decent general programming language allows its users to assign their own descriptive names to fields and/or values, so too will a spreadsheet system. For example, to better describe your application’s construction and logic, you might choose to label the C column ‘Balance’ and call the C10 cell ‘Total’, thus allowing you to express the formula as this:

    Total = SUM(Balance1..Balance8)
    

    This provides you all the readability you asked for while still retaining all the advantages of using a domain-specific vocabulary over a generic one.

    “What is doc or for that matter columns, rows, or value? This is how I would code if I were an economist:

    python
    gross_global_prod = sum(nation.gdp_nom_dollars(year) for nation in nations)"
    

    That code is very incomplete though as it lacks both concrete inputs and outputs. If you’re going to compare generic language programming against domain-specific language programming, please take care to show like for like, e.g. by including the necessary SELECT and UPDATE code for a database-based data source, or the cell referencing code for a spreadsheet (e.g. look at the xlrd and xlwt modules on PyPI for realistic examples of Python-spreadsheet interaction). Remember, this is the level of conciseness and expressiveness you need to beat:

    Total = SUM(Balance1..Balance8)
    

    I’ll be happy to wait. (Though I feel it’s only fair to warn you that not only do I have a fair bit existing experience in writing task-specific libraries for generic languages, but over the last 5 years have also designed, implemented, documented and trained my own end-user-oriented domain-specific languages. So I might have the slight advantage here.;)

  28. has says:

    @Jess: “I actually just suspect anyone who could say, “Difficulty with code management [is] an implementation problem,” hasn’t really committed to a DVCS yet.”

    I’ve been using Mercurial for the last 5 years, and SVN several years before that, and am well familiar with their strengths and weaknesses when it comes to dealing with file formats. Incidentally, the product I’m currently working on actually requires fully integrated SCM support, such that it can automatically deploy any client-specific/live/test/dev branch or revision of the system’s own repos to any machine so it then run a specific job on that particular codebase. So I’ve spent an unhealthy amount of time thinking about how to make DCVSes do some slightly exotic stuff, and all in a way that is highly automatable and just about accessible to mere mortals. And anyone who boldly states “Impossible!” without even exploring the possibility gets no shrift from me.

    “FOSS folks won’t do it not only because as you imply they are basement-dwelling mouth-breathers, but also because they clearly see there would be no point to polishing that turd.”

    You mean the same FOSSers who are currently sinking serious resources into madly polishing LibreOffice, itself is a big fat third-rate knock-off of MS’s crusty old Office designs? Perhaps if they stopped blindly mimicking both the development techniques and end products of the huge proprietary vendors long enough to notice a smell…? Nah.

    Those vendors’ vast monolithic applications that are murderously expensive to build and maintain are actually the greatest military strategy since the Greeks built that wooden horse, for anyone who tries to blindly clone them will be forever crippling themselves with equally insane development costs, thus ensuring they will never come close to equaling, never mind surpassing, the original.

    What FOSSers should be doing is asking themselves: “How can we build equivalent user functionality, but doing it in a way that plays to our strengths rather than to our competitors’ strengths and our weaknesses?” But they don’t, because as a painfully homogeneous bunch, they lack the wide mix of external knowledge, abilities and interests that true innovation requires. And, being a largely ahistoric bunch, they’ve long since forgotten the great insights that their vastly more resource-constrained forefathers had, such as how to achieve an enormously rich, flexible return on virtually bugger-all investment (see: Unix Philosophy). They don’t blindly mimic others because it’s the right thing to do, but because it’s the only thing they can do.

    For FOSS, truly innovating their design and construction of desktop productivity apps would mean stealing the old component architecture approach that IBM and Apple abandoned once the major application suite vendors made it clear that components were in everyone else’s interest except those vendors’ own. Because once the time and labor cost of development of functionality is lowered enough that almost anyone can create additional WP/SS/DB/etc features, allowing users to buy just the low-cost mix-n-match features they actually need, the bottom will drop straight out of the huge high-cost monstrosities that are the contrived lifeblood of MS and Adobe et al.

    FOSSers could, if they played their cards right, do a complete end run around the big boys, leaving them for the bureaucrats and shysters they are. But nope, they’d rather Work Hard Not Smart. So they continue to assemble these vast creaking monuments to their own general cleverness and unflinching devotion in the face of incredible odds, while the likes of MS and Adobe must laugh into their sleeves at how easily the mooks are played.

  29. has says:

    “I am assuming that economists are smart people who care about their lives’ work; I’m certainly not suggesting that Milton from Accounts Payable should ever see a CLI. We’ll build him a web app, but we won’t let him near a spreadsheet, for the company’s sake.”

    Oh great. Let me know when you manage to assemble a room full of web programmers who actually know the first thing about economics. Economists need yet another web app created by people who have the first clue about what economists actually do_ (what they know, how they think, how they work, what their responsibilities and liabilities are, and so on) like they need a hole in the head. (Actually, a hole would be more use; at least it lets the dust in.) Heck, even just find me a half-dozen programmers who _don’t think that working to acquire expert domain knowledge is utterly beneath them, menial work to be performed solely by Business Analysts who don’t mind touching Dirty Users, an unforgivable waste of precious bodily programmer time that should be devoted entirely to writing code.

    I’ve just let fly over on OSNews (http://www.osnews.com/permalink?559345) about such toxic and all-too-common programmer attitudes, so won’t go on here. Shorter version: programmers have no damn right telling economists and other domain experts what they should or shouldn’t do when they are too ignorant/arrogant/lazy to even try learning what it is those people do.

    p.s. You’re talking to a cranky, crusty 20-year domain expert who has on too many of those years had to roll up their sleeves and do programmers’ jobs for them. All because while I was open and willing to learn how programming worked, those buggers regularly refused to reciprocate. Maybe I’ve just been unlucky in my dealings (I have worked with some good ones too), but more often than not professional programmers have been the bane of my life when anything outside their own personal comfort zone needs to be done. So if I sound tetchy about the state of end-user programming in general and professional programmers’ ignorant, arrogant and often contemptuous attitudes to end-user programmers in particular, it is not without cause.

  30. Dr. Drang says:

    Well, the comments have taken on a life of their own, but I do want to follow up on a few things:

    1. The Formula bar shows only a single formula at a time, yes? That’s not enough to understand the flow of data.
    2. The option to view formulas in their cells (rather than the results) is inadequate because the cells are too small to view all but the most trivial formulas.
    3. Highlighting the argument cells is nice, but is no substitute for knowing the operation being performed on those cells. Obviously, many spreadsheets do little more than total and average; for simple sheets like that, highlighting may be sufficient (although it clearly wasn’t for Reinhart and Rogoff).
    4. The success of spreadsheets is due in part to their ability to beguile users into thinking they’re not programming. And they work really well for simple calculations. But as simple sheets grow more complicated, the discipline of real programming becomes more important, and is, in my experience, seldom imposed by sheet’s developer.
    5. I thought the distinction between “scripting” and “programming” had been consigned to the dustbin of history. Certainly, Fortran is compiled and shell scripts are interpreted, but so many languages use a mixture of compilation and interpretation that trying to categorize them one way or the other is fruitless. There’s also an implicit value judgement in saying that some code is programming while other code is (mere) scripting—a value judgement that is unrelated to the code’s correctness, subtlety, clarity, or elegance.

    Finally, I’ve edited the code sections of Jess and has’s comments to conform to standard Markdown rules. Remember, folks, the lines in a multiline code block must be preceded by four spaces (plus whatever spaces you use for indentation). Backticks are for inline code, not code blocks. It says so right in the manual.

  31. has says:

    @Drang:

    1-3. Yep, but it’s a point I’ve made already: modern Excel is only bad at visualizing non-data information (formulas, references, relationships, etc) because, with no real competition to drive innovation, MS have no motivation to improve their own products. So users are still stuck with an interaction model that was originally designed for 80x24 character displays and CPU and RAM hardware that couldn’t run a modern microwave.

    Now we have all this fantastic hardware and several more decades of learning and practice in communication and visualization techniques, and it’s all being squandered on frivolities and insular self-congratulatory circle jerks. Tragic? Appalling? Damn right. But such failed opportunities are not users’ fault.

    1. This is a general problem, not only in end-user programming, but professional programming as well. e.g. Consider the percentage of OO programmers wouldn’t know how to wipe their own arse without a ToiletRoll factory to tear off a sheet and wrap it in an object for them; now imagine what happens when you let that mindset loose on a profoundly non-OO domain like IPC or distributed computing. When the only tool in your toolbox is a nuclear-powered jackhammer, each of those craters once looked like a nail.

    The only difference between end-users and professionals is how much average damage they cause in overstepping the bounds of their competence: if anything, the pros are worse. End-users (when the penny finally drops) are generally willing to acknowledge they are out of their depth and need to improve their skills and knowledge or delegate advanced tasks to others. Whereas professional programmers - whether through ignorance/arrogance/fear of losing face before their peers - are liable to double down on the stupid instead, and theirs are more often the largest systems with the biggest responsibilities of all.

    Blindly believing they know everything worth knowing, and anything else is just an ‘abstraction problem’ for some Cocoa/Java/.Net class library to deal with, too many OO jobsworths won’t even contemplate raising their own knowledge and understanding till they can approach domain-specific problems on their own terms. Instead, they beat and drag each problem down to their own ingrained comfort level, regardless of how distorted, broken or irresovably mismatched with the original domain this leaves it. Look at SOAP, look at Rails, look at ORMs; the list goes on. Heck, compare the many Apple event bridges proudly presented by Real [OO] Programmers versus the tiny handful (appscript, Frontier and, to some extent, Mac::Glue) authored by individuals who invested the bulk of their time and effort in living native.

    BTW, I know the AE thing’s my favorite bugbear and I do rather bang on about it. But it also serves as a useful microcosm, neatly illustrating the kinds of lasting damage (both subtle and gross) that the same faulty attitudes can wreak on far greater platforms such as the WWW. For instance, consider Rails’ recent critical serialization vulnerability [1], ultimately rooted the unquestioned belief that Objects are absolute monarchy before which all others must kneel. “Let’s make all this stuff real easy for our OO users so they never have to think about it,” is how this particular road to hell is paved; and so a quarter-million website and their millions of users are put at dreadful risk for six whole years, all because Rails’ best customers are so freaking lazy they have to be fed nothing but pre-chewed pablum from the mouths of those little more aware than they are.

    And that’s just a superficial example: the rot runs right to the heart of what the Web is and what it should be about: an egalitarian global free exchange of user-authored documents of every imaginable shape and kind, with hyperlinks the only constant, not a one-way silly straw for the HTML-Only Writers Club to push their crippled vision of What the Web Is down to all us Poor Dumb Consumers. Honestly, I cannot emphasize enough the permanent fallout effects of orthodox developers who believe they are doing Good Work.

    So when Real Programmers sneer and belittle those end-user developers’ foolish, ignorant mistakes, “Look in a damn mirror!” is the only response I have for that. Because I will take a single domain expert honestly open to fresh learning over a million such time-served professionals any day of the year.

    1. The distinctions should be between ‘end-user programming’ and ‘professional software development’, and ‘domain-specific’ languages and ‘general’ ones. The terms ‘scripting’ and ‘programming’ are often used as a lazy shorthand for these, but as you indicate are frustratingly imprecise and dangerously misleading. For non-experts to make such mistakes is forgivable; for professional trained programmers to do so it highly telling.

    Both end-users (who are experts in their own specialist domain) and professional developers (experts in the programming domain) could achieve a lot by communicating and working effectively together. But how can that happen when end-users don’t know how to think and problem-solve in a logical structured way, and when programmers have zero desire to learn anything outside their own narrow comfort zone?

    Instead, those same programmers would claim The Answer is to make end-users into tiny replicas of themselves; e.g. see recent swarms of ‘learn to code’ sites. But evangelizing their own limited mindsets doesn’t turn everyone else into good thinkers and problem solvers, it just generates fresh waves of hapless rubes with utterly unrealistic views of their own abilities, and sticks them at the wheel of vast, brutal, unforgiving machines they do not truly understand and can barely control.

    [1] http://www.informationweek.co.uk/security/vulnerabilities/critical-ruby-on-rails-issue-threatens-2/240145891

    p.s. Ta for fixing my busted markup. (Y’wouldn’t like to be my copy ed. too?;p)

    p.p.s. Manuals are nice, but Preview buttons are Infinitely Better. :)

  32. has says:

    One last GYOB illustration of how spreadsheet systems could, and should, be - and then I really have to go do real work… :p

    OK, so as hinted at earlier, once you decouple the storage of user logic and user data from each other and decouple visual presentation from both of those, you create huge freedom in how and where and how each can be stored.

    In the simplest case, you can continue the classic single-user workflow. Install your desktop Spreadsheet.exe, and have it read/write documents as zip-compressed folders containing all data, logic and view files that make up a complete user-authored spreadsheet application. When a Spreadsheet.exe opens the zip document, it uses the view file to construct the GUI and the logic file to wire it all up, then flows in the contents of the data file as normal.

    Now let’s say you want to break out the data portion so it can be hosted on a central file server and shared (with locking) amongst other users. The existing data files are extracted by Spreadsheet.exe to a new zip and exported to the server, leaving the logic and view files unchanged. A new ‘data link’ file is inserted into the logic+view zip, telling any Spreadsheet.exe that opens the document where to find the data to be displayed. The logic+view zip is then copied to the other users’ computers, allowing each of them to access the centrally-held data zip in turn.

    Let’s go a step further, and get the logic and view into DCVS for development and audit purposes. Tell Spreadsheet.exe to convert the zip to unzipped folder format, more friendly to traditional DCVSes while still being openable and editable by Spreadsheet.exe.

    Now let’s convert the data file to hosted form. This might be as simple as telling Spreadsheet.exe to upload the original data zip to a central SpreadsheetServer web application, then replace the old data zip with a new ‘data link’ file that points to the web app instead. Now when users open their local logic+view zips, they get the updated data link that tells Spreadsheet.exe to pull the data from the SpreadsheetServer. And since SpreadsheetServer knows how to handle simultaneous access by multiple users, you no longer have the problem where each user has a local spreadsheet file containing stale or conflicting data.

    Or, you might wish to export the data to a standard RDBMS, allowing unrelated applications to query it as well. That’s a rather harder problem that would no doubt depend greatly on the type of data modeling, logic constraints, etc. involved. Tables linked by relationships would convert better than those linked by references; some or all of the custom logic would also need translated to SQL form as field types and constraints, procedures, triggers, etc. so that the integrity of the data is just as thoroughly protected within the SQL database as it was within the self-contained spreadsheet format.

    And, of course, since everyone likes their bells and whistles, there’s no reason the logic+view zips couldn’t be hosted by SpreadsheetServer too, with the latest version being pulled down and opened in Spreadsheet.exe each time a desktop user opens a local ‘spreadsheet’ document that is really no more than a URL bookmark, or just clicks on a spread:// URL in a web browser.

    Oh, and notice how through all of the above I’ve never said anything about what the internal formats of the data, logic or view files should be. That’s because we’re not just separating the different types of information from each other, we’re also giving ourselves total freedom in how best to represent each type - including not limiting ourselves to just one format per type either.

    It’s reasonable to assume that view files are just serialized versions of spreadsheet table objects (sheets, rows, columns, cells) so would probably just be XML blobs a-la .xibs, though they could be other formats too, such as human-readable declarative code. The data files could be any format: CSV, XML, SQLite, binary; again, it’s just an internal implementation detail. The logic files would, for practical purposes, most likely use a human-readable domain-specific language (thankfully, the fad for ‘XML Languages’ such as XSLT has largely collapsed), something with a minimalist syntax like LOGO or Io that is easily read and written by both humans and machines.

    Critically though, the logic language would be declarative/dataflow-based, same as existing spreadsheet ‘macro’ languages (which are really DSLs themselves). For example:

    C10 = SUM(C1..C8)
    C1 = 42
    

    would be a perfectly valid script here, since it is the machine’s job to determine the correct execution order, not the author’s.

    BTW, I have developed and taught a similarly non-linear domain-specific end-user language myself, so can state from firsthand experience that this is often a far more appropriate model for many such audiences and tasks than bog-standard generic imperative languages like JavaScript and Python. And this is yet another reason why you can’t trust the Python/Ruby/JavaScript/Lua is The One True Way to End-User Programming mob (or the Java/C* mob for that matter) to determine how the ‘programming’ side of a spreadsheet system should work. Because left to themselves I can pretty much guarantee they’ll blindly replicate the imperative, class-based model they are already intimately experienced in, and in no time at all the whole lot will resemble some vast Towers of Hanoi, full of classes and methods and callbacks and domain libraries and onclick/onchange triggers and so on… plus race conditions and performance sumps and all the other hazards that bites like a shark as soon as you start dealing with complex, highly parallel processing systems.

    And then even the most skilled Python/Ruby/JavaScript/Lua developer will struggle to eke correctly functioning imperative code from the whole hideously Lovecraftian horror. And finally that self-same bunch will declare that clearly the real problem is that Spreadsheets Are The Wrong Way To Do Things. And then they’ll toss the entire platform down the memory hole and make a fortune developing endless streams of miserably specced but excitingly AJAXy and Enterprisey Web2.0 PHP apps that Do It Right, which they then inflict on the ever more disempowered users.

    That’s not to say general-purpose languages couldn’t make a useful contribution, but their role would be much more limited, acting as an extension programming language that experienced developers can use to define completely new features that the DSL alone cannot express. Just as Python can be extended by C, so too can the DSL be extended by Python, allowing much more flexibility and agility in who develops DSL extensions and how they do it. These .py/.js/etc. files could then be bundled with the DSL-based logic files and automatically loaded by Spreadsheet.exe as necessary, allowing the DLS scripts to call those additional features.

    (I’ve done this too, BTW, and the result is a collaborative working arrangement exactly as described by Nardi. Basic and advanced end-users write and run most of their DSL code themselves. A single professional developer rides shotgun, producing Python-based plugins as and when users discover especially complex or totally new functionality is required. It works very well, as long as the pro developer meets users on their turf, not his own.)

  33. Patrick O'Beirne says:

    We’ve discussed this on the yahoogroup of the European Spreadsheet Risk Interest Group, of which I’m the current chair. Our next conference is in Greenwich, London on 4-5 July.

    1) The authors did not catch the error, and only when they released the spreadsheet did others find it. Lesson: review by a different pair of eyes can catch errors before they cause loss or embarrassment. “Peer review” is the gold standard in academic research, and the best self-protection for business users of spreadsheets. Here’s a story from Edward Krudy of Reuters quoted in the Toronto Star: the person who checked the R&R study also checked their own work:

    http://www.thestar.com/business/2013/04/18/student_finds_glaring_spreadsheet_errors_in_study_used_to_justify_budget_slashing.html Student finds glaring spreadsheet errors in study used to justify budget slashing “I almost didn’t believe my eyes when I saw just the basic spreadsheet error,” said Herndon, 28. “I was like, am I just looking at this wrong? There has to be some other explanation. So I asked my girlfriend, ‘Am I seeing this wrong?’” His girlfriend, Kyla Walters, replied: “I don’t think so, Thomas.”

    2) Like most spreadsheet creators, they possibly did not realise at the start how important it would become; in their case, frequently cited in support of austerity and arguably the cause of more grief to the world at large than other problems that have merely cost millions. (See http://www.eusprig.org/horror-stories.htm ) Lesson: The more important something is, the more care you have to take with it.

    3) Technically, the error could have been spotted by simple tests, such as pressing Ctrl+[ on a formula to show what cells feed into the total. There are many software tools, such as XLTEST, to point out structural flaws. There is much guidance on safer spreadsheet construction, books such as “Spreadsheet Check and Control”, “Spreadsheet Safe”, etc. In the business world, common checks include cross-total balances and reconciliations.

    Eusprig papers are published at http://www.eusprig.org/conference-abstracts.htm

    The BBC also have a piece on this - good interviews

    http://www.bbc.co.uk/programmes/p0173h00 http://www.bbc.co.uk/news/magazine-22223190