Shellhead

One of the nice things about having a plain text archive of all my posts on my local machine is that I can learn more about my writing through the standard Unix toolset. Sometimes it takes a while to figure out the best tool for the job.

As I said last night, I now have the Markdown source of ANIAT in a set of files, one per post, in my Dropbox folder. Here’s the directory structure:

[Image: Local blog archive structure]

How many posts have I written? If I cd into the source directory, I can run

bash:
find . -name "*.md" | wc -l

Since all the files end in .md, the find command with the -name option seemed like the obvious solution. It’ll print out the paths of all the matching files below the current directory (.), one per line. Piping that output to wc with the -l (lines) option gives the count I was looking for—in this case, 1796. This isn’t the best way to get the number of posts, but it’s what I thought of first.

Find is a pretty complicated beast, one that I’ve never really tamed. But I do know how to do a few things with it. Once I had the number of files, I wondered how many words I’d written. I knew I could get the number of words in each file through

bash:
find . -name "*.md" -exec wc -w {} \;

The first part of this command, up through "*.md", generates the list of all my posts. The rest of the command executes wc on all of those files. The empty braces act as a sort of variable that holds the name of each file in turn, and the escaped semicolon tells -exec that its command is finished. You have to escape the semicolon with a backslash so that it gets handled by -exec instead of the shell.
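
Quoting the semicolon instead of backslashing it works just as well, by the way:

bash:
find . -name "*.md" -exec wc -w {} ';'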

The result is 1796 lines that look like this:

    1170 ./2013/01/local-archive-of-wordpress-posts.md
     788 ./2013/01/lucre.md
    1289 ./2013/01/revisiting-castigliano-with-scipy.md
     545 ./2013/01/textexpander-posix-paths-and-forgetfulness.md
     144 ./2013/01/what-apples-working-on.md

Each line has the word count of a file followed by its pathname relative to the given directory. We want to add all those counts. Awk would be a natural tool for the summation because it splits lines on white space automatically, but I’m not very good with awk. I chose to use Perl, which has an awk-like autosplit command line option. Here’s the full command:

bash:
find . -name "*.md" -exec wc -w {} \; | perl -ane '$s+=$F[0];END{print $s}'

The -ane is a set of three switches ganged together that tell Perl to loop over the input a line at a time without printing anything (-n), autosplit each line on whitespace into the @F array (-a), and execute the little program given on the command line (-e).

The $s variable springs to life already initialized to zero the first time it’s referenced. It acts as the accumulator of the word count of the individual files by repeatedly adding the first item ($F[0]) of each line to itself. The END construct, which prints the value of $s, is a special portion of the code that doesn’t get executed with each input line; it’s run only after all the input lines are processed. The output is 932041, so I’m heading toward a million words here.
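
If I were more comfortable with awk, I suppose the same summation would look something like this, with awk doing the splitting on its own:

bash:
find . -name "*.md" -exec wc -w {} \; | awk '{s += $1} END {print s}'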

This was fun, but I realized after getting the answer that find was overkill for this problem. Find is a good command to use when you have files at several (or unknown) levels in the directory hierarchy, but that’s not what I have. From the source directory, each Markdown file is exactly two levels below. I could have gotten the file count more easily by

bash:
ls */*/*.md | wc -l

Even better, when given a list of files, wc will return not only the counts for each file in the list, but also the cumulative counts in a final line at the bottom of the output. Thus,

bash:
wc -w */*/*.md

will return a long output that ends with

     861 2013/01/google-lucky-links-in-bbedit.md
    1170 2013/01/local-archive-of-wordpress-posts.md
     788 2013/01/lucre.md
    1289 2013/01/revisiting-castigliano-with-scipy.md
     545 2013/01/textexpander-posix-paths-and-forgetfulness.md
     144 2013/01/what-apples-working-on.md
  932041 total

which gives the same total word count I got with the more complicated find/Perl pipeline. And if I didn’t want to clutter my Terminal window with the word counts for all 1796 individual files, I could’ve run

bash:
wc -w */*/*.md | tail -1

to get just the cumulative line.

If there’s a moral to this story, I guess it would be “think before you write,” although sometimes you can learn by doing things the wrong way.


11 Responses to “Shellhead”

  1. Dexter Ang says:

    My obsessive-compulsive self could probably not stand that the posts are sorted alphabetically for each month. Curious why you don’t append the day of the month to the file to really sort by when they are posted. Or you perhaps have a variable in the file you can just grep?

  2. Jasker says:

    Reckon it’s possible to monkey with the metadata, so that these local files have the same last modified date as those on the server? Creation dates would be a plus, but can be roughly seen from their path.

    I’m doing a manual version of this with my site, whose Markdown posts are saved in my Dropbox as I write them and kept there once I’ve sent them to Tumblr. The downside to this simple scheme is the mess Tumblr shoves into the middle of my URLs, like so:

    http://andala.org/post/40449783781/know-it-all

    To move elsewhere, I’d have to flatten my hierarchy to something sensible like yours, and fix my internal links.

  3. Neal Lippman says:

    The one downside to using the */*/* construct is that the shell expansion creates the entire file list and passes it as the command line to the executed command (ls or wc here). This means all 1700-odd file names are on the command line.

    With enough files, you could run up against whatever internal buffer size limit bash (or whatever shell you prefer) has. I don’t know exactly what that is, but clearly it’s pretty darn large given your test case.
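
    If you’re curious, on most Unix systems you can ask for that limit directly:

    getconf ARG_MAX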

    The advantage find has here is that the file names are generated one at a time, so you don’t have that command line length issue. Sort of like range vs xrange in Python. And, as you noted, it’s more general if the folder structure is of arbitrary depth. But, you do have to invoke wc repeatedly, so it’s probably a bit slower, and you have to add up the numbers yourself.

    Of course, you could have just gone with: cat */*/*.md | wc -w … :)

  4. Aristotle Pagaltzis says:

    perl -le '$_ = () = <*/*/*.md>; print'

    A list assignment in scalar context returns the number of list elements on the right-hand side. Since you don’t actually care about the file names here, they are assigned to the empty list, and the number of RHS list elements from that assignment is captured in $_, which the bare print then spits out. The less idiomatic version that does more work than required looks like this:

    perl -le '@f = <*/*/*.md>; print scalar @f'
    

    Doing the glob within Perl like this has the advantage (in the general case) that the zero-matches case generates a correct “0” response (which is, I concede, not a concern here) rather than the bogus “1” you get from the shell if you don’t monkey with the shell’s nullglob option. If that weren’t the case, you could write this very simply as

    perl -le 'print scalar @ARGV' */*/*.md
    

    Note that you can do that in a shell with arrays, too:

    ( f=( */*/*.md ); echo "${#f[@]}" )
    

    I use a subshell here so I won’t have the f array lying around in the environment afterwards. If you’re inside a subshell already, you can use that to temporarily set nullglob too:

    ( f=( */*/*.md ); shopt -s nullglob; echo "${#f[@]}" )
    

    That gets you parity with the second Perl one-liner in pure shell… the first one, which doesn’t uselessly store the data, is out of reach.

    But you can also do better using wc:

    printf %c */*/*.md | wc -c
    

    Mechanics left as an exercise for the reader. :-) (Again, the fix-nullglob-within-subshell dance would be necessary to correctly handle the no-matches edge case.)

    The immediate difference between this and the ls version is far fewer redundant stat calls, an advantage all of the one-liners in this comment share. The other difference, which matters in the general case (but I concede is not an issue here), is that this solution will not miscount when file names contain newline characters; that, too, is shared by all the alternatives given.

  5. Aristotle Pagaltzis says:

    Argh, I forget that your comment system eats leading whitespace and thus mangles comments that start with a Markdown code block. My comment was, of course, supposed to open with

    perl -le '$_ = () = <*/*/*.md>; print'
    

    Also, Neal’s point is a good one – expanding the glob within Perl like this means you are unaffected by any command line size limits.

  6. Alan says:

    Of course, you now should rerun the commands and update this post to reflect its own inclusion in the list!

  7. Dr. Drang says:

    Sadly, Aristotle, the comment system isn’t mine. A few years ago I looked into how WordPress comments worked because I wanted more control and more consistency between how Markdown is handled in the comments and how it’s handled in the body of the post. It was such a convoluted mess, I just backed out and vowed never to return. Anyway, I edited your comments to what I think you wanted.

    Neal, your point about the possibility of overrunning a bash buffer with globbing on the command line is real but, in my opinion, remote. I don’t know what the limits are, but I’ve never run into them. You’re absolutely right about the speed; the find solution was very poky compared to the globbing solution.

    I sympathize with Dexter and Jasker’s dislike of the files being sorted by name instead of post order—I’d prefer them in post order, too. I don’t want to change their names, though, because I refuse to break old links, and any static blog system I use would name the generated HTML files according to the Markdown source file names. (OK, I suppose I could write something that strips the date and time from the post’s name before processing.) The post number, by the way, is in each file’s header, so there is a way to recreate the original order.

    Alan: 932847.

  8. Aristotle Pagaltzis says:

    Ah, WordPress… I should have figured it wouldn’t be something you would simply miss. Thanks for fixing my comments!

    I noticed a stupid mistake of my own in the pure-shell one-liner – the shopt obviously has to go first, not in the middle:

    ( shopt -s nullglob; f=( */*/*.md ); echo "${#f[@]}" )
    

    D’oh.

  9. Matthew Wyatt says:

    Just an FYI - I pretty much never use find’s -exec flag anymore. Instead, I pipe to xargs. (Sometimes you need to use the -print0 flag for find and the --null flag for xargs, if there are spaces in the file names.) It’s supposedly faster in some cases, but I usually just do it because it’s easier for me to remember piping, which I do all the time, than “-exec foo {} \;”, which I only ever do when I’m using find.
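
    For these files, that would be something like:

    find . -name "*.md" -print0 | xargs -0 wc -w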

  10. John says:

    Matthew Wyatt: find has the -exec utility [argument ...] {} + primary, where + is used instead of \;. It gives the same functionality as using xargs, and it would have improved the speed of find in Dr. Drang’s analysis.
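
    For the word counts here, that would look something like:

    find . -name "*.md" -exec wc -w {} +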

    And an explanation of why, generally, the output of ls shouldn’t be used to provide a file listing for subsequent processing.

  11. Dr. Drang says:

    I didn’t know about using + in find, John—thanks for pointing it out. And it is definitely faster.

    As for the problems with ls, I was in a fortunate situation because each filename came from its post’s slug, and WordPress doesn’t allow hinky characters in the slug. I suppose the shell could choke on something that WordPress allows, but I’ve never noticed anything weird like that in my URLs.