More shell, less egg

My TextExpander/shell script post of last week reminded me of Doug McIlroy and some unfinished business from back in October. So let’s talk about shell scripts and Unix again.

In the comments to my October post about the early Unix developers and what good writers they were, Bill Cheswick mentioned Doug McIlroy, and said it was a shame I’d omitted him from the post. Bill was right; I should have included McIlroy. The problem was I couldn’t find a link to any of his writings.

I had an example of his writing—a good one—but it was on paper only. I still haven’t found an online copy of it, but it’s worth talking about anyway.

The piece requires a bit of backstory. Jon Bentley had a regular column called “Programming Pearls” in the Communications of the ACM (you may have come across this collection of some of his columns). In 1986 he got interested in literate programming, so he asked Donald Knuth to write a program in that style as a guest column and Doug McIlroy to write a literary-style critique of it.

Literate programming is an interesting topic in its own right. The idea, which originated with Knuth, is to write a program and its documentation at the same time, interleaved with each other. It’s not just writing good comments or including docstrings or using systems like POD. In literate programming, the code is subservient to the documentation. For example, the various sections of the code are written not in the order the compiler (or interpreter) wants them, but in the order most appropriate for the explanation. Two utility programs transform the combination file: one to typeset the documentation and the code via TeX, and the other to extract and rearrange the source code and send it to the compiler.
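
As a rough sketch of that round trip (the file names here are my own illustration, and the Pascal compiler name varies by system; WEB’s two utilities really were called weave and tangle):

weave wordcount.web     # extract documentation plus pretty-printed code, producing wordcount.tex
tex wordcount.tex       # typeset the article, code listings included
tangle wordcount.web    # rearrange the code into compiler order as wordcount.pas
pc wordcount.pas        # compile the extracted Pascal (hypothetical compiler name)

The same .web file feeds both paths, which is what keeps the prose and the code from drifting apart.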

I’m interested in literate programming, but that’s not our topic today.

Both Knuth’s article/program and McIlroy’s critique were published in Bentley’s column and then republished in Knuth’s Literate Programming collection, a copy of which I happened to pick up for a few bucks in a used bookstore several years ago.


The program Bentley asked Knuth to write is one that’s become familiar to people who use languages with serious text-handling capabilities: Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.

Knuth wrote his program in WEB, a literate programming system of his own devising that used Pascal as its programming language. His program used a clever, purpose-built data structure to keep track of the words and frequency counts, and the article interleaved with it presented the program lucidly. McIlroy’s review started with an appreciation of Knuth’s presentation and of the literate programming technique in general. He discussed the cleverness of the data structure and Knuth’s implementation,[1] pointed out a bug or two, and made suggestions as to how the article could be improved.

And then he calmly and clearly eviscerated the very foundation of Knuth’s program.

What people remember about his review is that McIlroy wrote a six-command shell pipeline[2] that was a complete (and bug-free) replacement for Knuth’s 10+ pages of Pascal. Here’s the script, with each command given its own line:

1:  tr -cs A-Za-z '\n' |
2:  tr A-Z a-z |
3:  sort |
4:  uniq -c |
5:  sort -rn |
6:  sed ${1}q

And here’s McIlroy’s short, impossible-to-misunderstand explanation:

If you are not a UNIX adept, you may need a little explanation, but not much, to understand this pipeline of processes. The plan is easy:

  1. Make one-word lines by transliterating the complement (-c) of the alphabet into newlines (note the quoted newline), and squeezing out (-s) multiple newlines.
  2. Transliterate upper case to lower case.
  3. Sort to bring identical words together.
  4. Replace each run of duplicate words with a single representative and include a count (-c).
  5. Sort in reverse (-r) numeric (-n) order.
  6. Pass through a stream editor; quit (q) after printing the number of lines designated by the script’s first parameter (${1}).

I need to put that on a Post-it note as an example of how to explain a script. The best part? It would fit on a Post-it note.

(I can’t help wondering, though, why he didn’t use head -${1} in the last line. It seems more natural than sed. Is it possible that head hadn’t been written yet?)
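
For anyone who wants to try the pipeline, it drops straight into an executable script (the wordfreq name and the sample invocation below are mine, not McIlroy’s):

#!/bin/sh
# wordfreq: print the $1 most frequent words on stdin, with counts
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

Saved as wordfreq and made executable, wordfreq 10 < moby-dick.txt prints the ten most common words in the file.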

Update 12/8/11
Thanks to Terry and Dan in the comments for pointing out the typos in the excerpt above. The mistakes were mine, not McIlroy’s, and are fixed now.

What’s often overlooked when this review is discussed is McIlroy’s explanation of why his solution is better—and it’s not just because it’s shorter. Here are some excerpts:

A wise engineering solution would produce—or better, exploit—reusable parts.

Very few people can obtain the virtuoso services of Knuth (or afford the equivalent person-weeks of lesser personnel) to attack nonce problems such as Bentley’s from the ground up. But old UNIX hands know instinctively how to solve this one in a jiffy.

To return to Knuth’s paper: Everything there—even input conversion and sorting—is programmed monolithically and from scratch. In particular the isolation of words, the handling of punctuation, and the treatment of case distinctions are built in. Even if data-filtering programs for these exact purposes were not at hand, these operations would well be implemented separately: for separation of concerns, for easier development, for piecewise debugging, and for potential reuse.

The simple pipeline given above will suffice to get answers right now, not next week or next month. It could well be enough to finish the job. But even for a production project, say for the Library of Congress, it would make a handsome down payment, useful for testing the value of the answers and for smoking out follow-on questions.

McIlroy’s review is both an explanation and an exemplar of the Unix Way: small programs that do elementary tasks, but which are written so they can be combined in complex ways.

I can’t help but include one more excerpt. It’s the last paragraph of the review:

Knuth has shown us here how to program intelligibly, but not wisely. I buy the discipline. I do not buy the result. He has fashioned a sort of industrial-strength Fabergé egg—intricate, wonderfully worked, refined beyond all ordinary desires, a museum piece from the start.

Just remember, he’s saying this about Donald Knuth.

No Fabergé eggs for McIlroy. Just brass balls.


  [1] My favorite line from this section of the review: “Knuth tiptoes around the tarpits of Pascal I/O—as I do myself.” Maybe you have to have programmed in a true, Wirthian Pascal to appreciate the dry humor.

  [2] Have I mentioned that McIlroy invented pipes? Yeah.


33 Responses to “More shell, less egg”

  1. Clark says:

    One thing that I think is a bit unfair is this huge distinction between programming in code vs. the command line. In practice every programmer has their cache of favorite functions and classes. (Well, this was before OO, but you get the idea.) So when they do something, they are rarely re-inventing the wheel. Rather, programs have their private libraries, as well as a slew of standard libraries they can access.

    At this time in history, what was unique about many programming languages was how basic the standard libraries were. Even by the ’90s a lot of that had changed, as most OSes provided pretty robust libraries at a fairly high level.

    Today, while I frequently access command line functions I also tend to just have some standard templates and libraries I frequently reuse. The benefit is that I just need to tweak a few lines and I have a finished program.

  2. Dr. Drang says:

    The expansion in the availability and use of libraries has made programming in “regular” languages more shell-like than it was in the ’70s and ’80s. And in this case Knuth was hobbled by Pascal’s lack of some very practical tools, like separate compilation. You’re probably familiar with Brian Kernighan’s famous “Why Pascal is Not My Favorite Programming Language,” which lays out several problems that Pascal programmers back then had to deal with.

    I suspect, though, that one of the reasons good library support is now the norm rather than the exception is the popularity of Unix and C among programmers. The ideas they pushed have become almost universally accepted now.

  3. Clark says:

    Yeah. I learned to program on an Apple ][+ with UCSD Pascal. So believe me, I remember the days. The second language I learned was C, on an old Vax running Unix. I remember a bug where, if you mistyped one particular command, you’d end up unmounting the hard drive.

  4. Gary Wheeler says:

    McIlroy misses the point. Knuth was demonstrating a principle (literate programming). The fact that the sample task could have been implemented more efficiently with a small shell script isn’t germane to the discussion. He lost the opportunity to critique what Knuth was actually doing.

  5. Terry Colligan says:

    I suspect this is just transliteration errors, but McIlroy’s “short, impossible-to-misunderstand explanation” has several errors:

    1. The explanation references ‘-a’, but I think the tr command uses ‘-s’.

    2. “to lower case” doesn’t make any sense here. ‘sort’ doesn’t “bring” anything “to lower case”.

    3. Similarly, “with a single representative” doesn’t make any sense. Maybe you mean “with a single line”?

    With the above type-o and word-o problems, the explanation is less than clear.

  6. Dr. Drang says:

    No, Gary, as I said, McIlroy’s review started with a discussion of literate programming itself. He was complimentary.

  7. Dan Franklin says:

    Terry Colligan (#5) is correct: 1 should read -s instead of -a; 3 should read “to bring identical lines together”; and in 4, “representative” should be “line”.

    In addition, in 6, “quite” should be “quit”.

    It may be helpful to note in 4. that the count is prefixed to each line; thus sort -n -r, which without a key spec sorts entire lines, is sorting the initial number on each line.
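
    For example, a toy run (the exact spacing of uniq -c’s counts varies by system):

    $ printf 'the\ncat\nthe\ndog\nthe\ncat\n' | sort | uniq -c | sort -rn
       3 the
       2 cat
       1 dog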

    And to answer the question in the article, yes, sed predates head by many years, so I doubt head existed then.

  8. Dr. Drang says:

    Thanks, Terry and Dan, for pointing out the mistakes. I’m surprised nobody mentioned them sooner. The typos are mine, and I’ll fix them when I have the book in my hands again tonight.

    Profreading is hard.

  9. Paul W. Homer says:

    The beauty of shell scripting is that it employs a nearly-consistent philosophy in its decomposition of the underlying primitives. The level of organization, once understood, makes it easy to leverage the existing work of others. This type of approach toward structuring functionality in reusable pieces, however, seems to be disappearing from the landscape.

    The beauty of literate programming is that it synchronizes the often terse instructions for the computer with the much-needed fuller explanation required to allow future generations of programmers to carry on with the work. It never caught on, and so the industry spends a lot of time and effort starting from scratch again.

    At their core they are very different, complementary solutions to different problems.

    Paul. BTW: The TextExpander post was excellent; we just don’t see enough people out there showing how a bit more understanding can produce significantly better results. Instead we just get a lot of code splatted out quickly that gums up the works.

  10. Aaron says:

    “representative” would be a perfectly cromulent word choice here. The list of identical words is replaced with one that represents them all.

  11. mark says:

    This is just unfair.

    What if Knuth had been born in 1970 and had used Ruby instead? And what if, rather than bash, he’d had a shell that natively supported object data types, where pipes become data that is passed from object to object, as messages?

    UNIX and its old mantra are coming to an end. Look at the TIOBE programming index; look at the evolution of the WWW.

    What the example above shows can be done in pure Ruby or pure Python just as easily. There is only one difference, which is speed; but other than that, ALL the transformations can be done in better programming languages, on ANY system where these work, not just UNIX.

  12. Peter Aronoff says:

    Re head and sed: the Plan 9 site suggested using sed 10q in place of head even much later. So perhaps some people always felt that head was an unnecessary addition.

    Is it bothering anyone else that McIlroy’s pipeline can’t handle contractions or compound words? I haven’t seen Knuth’s version, but did he ignore those cases as well?
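
    (For reference, a toy run showing what the first tr stage does to a contraction:

    $ echo "don't stop" | tr -cs A-Za-z '\n'
    don
    t
    stop

    so “don’t” gets counted as two “words,” don and t.)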

  13. Alex Measday says:

    One aspect not mentioned is portability. Knuth’s program could have been compiled and run on almost any operating system that had a Pascal compiler. While I agree with McIlroy’s sentiments, his command-line solution would only work on a UNIX system.

    This reminds me of a comp.lang.c discussion many years ago. One C/COBOL programmer countered the COBOL bashing with a complete, portable, 4-line program that sorted a file. (Different sections of a COBOL program are optional and sorting is built-in.) It’s not possible to write a simple, portable, file-sorting program in C. Under UNIX and other operating systems that support spawning subprocesses, you can execute an external sort program, but that approach is not portable between systems.

  14. André says:

    It’s functional elegance vs. imperative stumbling around. :P

  15. Dr. Drang says:

    Knuth’s doesn’t han’le c’ntractions, Pet’r.

    My suspicion is that McIlroy wrote his script to follow Knuth’s specification, which broke words on apostrophes.

  16. Tall Papa B says:

    Literate Programming survives.

  17. Ross says:

    Andre is quite right that this is an example of the elegance of FP vs. imperative — but the flip side is that FP often fails when it runs up against real-world edge cases. Knuth’s version may not handle contractions — but it could, easily. Could McIlroy’s? What if I want British/US spellings (such as “color” and “colour”) to be counted as one word? What about hyphenated words?

  18. Ramki says:

    The original solution is a classic Unix way of doing map-reduce. It can handle word exclusions easily by adding functionality to the map section. The code won’t grow much, the quality is still retained, although a little elegance is lost.

    (filter_foo 2>&1 | tr -cs 'A-Za-z' '\n') | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr

    $ cat filter_foo
    #!/usr/bin/perl -w
    use strict;

    # build an exclude hash
    my %h = map { chomp; ($_, 1) } grep /^\w/, <DATA>;

    while (<>) {
        chomp;
        for my $w (split) {
            $h{ lc($w) } ? print STDERR "$w\n" : print "$w ";
        }
    }

    __END__
    # commented lines will be ignored
    don't isn't didn't l'hospital

  19. Dave says:

    I fail to see the point here.

    Knuth was illustrating literate programming by creating code to solve a problem. His focus was the programming, not the solving of the problem. This is a common approach when illustrating technique. (finger points to moon, please look at the finger. The finger is important in this context.)

    McIlroy solved the problem using “the UNIX way”. His focus was solving the problem. He didn’t advance the topic of literate programming. (finger points to moon, please look at the moon. The moon is important in this context.)

    It seems to me that these gentlemen had different objectives and that should be recognised.

  20. Josh Rehman says:

    It certainly sounds like McIlroy made a strong point, but a significantly different one than you’ve reiterated. It sounds less like an indictment of literate programming than an indictment of literate programming for this particular problem. It would be difficult to show that LP is problematic for all programming problems; certainly it would take the execution of more than a single programming task.

  21. Wences says:

    I think that the fact that McIlroy accompanied his script with a line-by-line explanation proves Knuth’s point. Knuth used WEB, which is a tool for doing literate programming in Pascal. If there were a tool for literate shell programming, McIlroy could have put each line of explanation just before each command, and tangle would have produced his shell script while weave produced his explanation. And note that the explanation is indeed very “literate” in style, in the sense that it explains why he is doing certain things (e.g., bringing all the lines containing the same word together and then having just one line with the word and the count represent them all).

  22. Ken says:

    “Knuth was illustrating literate programming by creating code to solve a problem. […] McIlroy solved the problem using “the UNIX way”.”

    Sure, but what’s the point of LP if it’s not practical for even a toy problem?

    For a 6-line shell pipeline, the LP solution is a dozen pages of source code. What’s the LP solution going to look like for a serious program? Will it be 100 times longer? (A 100,000 line program — not at all unusual for an application — would then be 150,000 pages of text, and take over 100 feet of shelf space if printed.) Will it have 100 times more bugs (as Knuth’s did)? Is it only feasible if the requirements are perfectly fixed before coding starts? For what class of problems is LP a good solution, if any?

    LP to me is like writing out source code as calligraphy: it’s beautiful to look at and fun to write but I see no advantages to using it for a program you actually intend to run.

  23. ChrisD says:

    It seems that there is also an apples-to-oranges comparison here: these nifty off-the-shelf Unix tools are many pages of code in length. It is unsurprising that you can do the work of ten pages of code by calling six ten-page-ish utilities. So the comparison is disingenuous. We are comparing a programming-language solution to a reusable-component command line. Others could simply loop through, scanf()ing the input file and push()ing the char* onto a suitably customized (subclass of) C++ List, then loop again, pop()ing the results directly off: two 1-line loops. This does not take away from either example’s merit: Knuth’s original (pun intended) idea of readable programs remains, and McIlroy’s UNIX Power Tools lesson persists.

  24. Dr. Drang says:

    I think literate programming has its advantages, Ken, but it’s certainly going to be of more value to an academic and researcher like Knuth than to a working programmer. As for serious programs, TeX is written in WEB, and it’s both serious and well-debugged.

  25. Charles Wolfe says:

    Literate programming is reasonable on very large-scale projects that include software architects, software engineers, programmers, and others.

    Having written and maintained programs with hundreds of thousands (millions) of lines of either assembly or higher-level language code, I’ve found that most comments tend to become useless rather rapidly, and that having the documents wholly separated from the code makes cross-referencing, or consulting the docs to understand the code, difficult most of the time.

    While shell programming and scripting languages can solve many problems, sometimes efficiently, they are far from capable of dealing with many real-world programming problems.

    I’ve been at this for over 40 years, and have formed some strong opinions over that time frame about coding styles and documentation. I’ve taught, and most of my students do not appreciate my “preaching” and taking points off for poor documentation until they start working in the “real world”. Good programmers are good at writing in their native, natural language as well as in the programming language in use.

  26. Russ Cox says:

    You can find some papers by McIlroy listed at http://cm.bell-labs.com/cm/cs/cstr.html

    The paper that introduced diff’s algorithm is http://cm.bell-labs.com/cm/cs/cstr/41.pdf.

  27. Dr. Drang says:

    Thank you, Russ. I liked CSTR #155, the trilogy of papers on drawing ellipses. The last of the three is another McIlroy critique, this time of a paper by Niklaus Wirth.

  28. Franklin Chen says:

    I wrote a response to this post on my blog: http://franklinchen.com/blog/2011/12/08/revisiting-knuth-and-mcilroys-word-count-programs/

    I discussed both the literate programming angle and the programming language angle.

  29. Jens Ayton says:

    Ken, sophisticated techniques are by their nature least practical for toy problems.

  30. Anonymous coward says:

    20+ years later, we still don’t have sed or awk available as a library, afaik. We also don’t have a tr-like function that works on streams, or a sort for text files, in the standard libraries. Which is why you simply can’t use them when you write something that isn’t completely solvable via shell scripting: they don’t get compiled in, and your team leader/senior programmer/whatever will definitely not allow you to call shell commands from C, C++, or Pascal, even when it pays off.

    Aside from this, I do, however, agree with the comment above stating that McIlroy completely missed the point; or rather, instead of critiquing what Knuth produced, he chose to make another point. While I wholeheartedly agree with McIlroy’s style of solving things, this has bitten me several times: most SW engineers nowadays don’t appreciate simplicity, instead focusing on a sort of academic beauty of code.

  31. Doug says:

    Wouldn’t it be more efficient to convert to lower case before converting the non-alphabetic characters to newlines? I.e.:

    tr A-Z a-z |
    tr -cs a-z '\n' |

  32. Christian Lynbech says:

    The point has been made, but just to reiterate: you need all those small utilities, and they must be implemented somehow. LP would be very applicable to that task as well.

    I also think it’s worth mentioning that while Knuth certainly operates at a completely different level than most mortals, I would be surprised if the same could not also be said of McIlroy. I am not convinced that the average UNIX hack would be able to shake out a bug-free simple solution like that. And LP is not that complicated in itself; Knuth’s program may have been, but those are the cases where you most need the power of LP to be able to follow the logic of the mastermind.

    It is not unlikely that Knuth would never have invented LP if the Pascal environment he was working in had not been as constrained as it was. One can easily see in the original WEB how he was battling various compiler issues. By his own account, he chose Pascal not out of a passion for the language but because it was “everybody’s second choice,” i.e., for portability reasons.

  33. nrolland says:

    If shell scripts have never completely caught on and remain, for the most part, wizardry, there must be a reason.

    The reason seems quite obvious to me as a programmer:

    - they manipulate text only, failing to raise the level of abstraction to the desired one
    - their environment (pipelining rules, etc.) is not itself composable (and we all know anything of importance should be)

    Hence the lack of libraries, as noted in an earlier comment. If something were a useful abstraction outside its niche, there would be a library to build tools on top of it, and lots of people would want to use it inside other projects.

    Given these limitations, those powerful toys will stay in their own little world and not come to the big boys’ land.