Learning Awk

Although I learned how to use many of the traditional Unix tools during my Linux years, Awk[1] wasn’t one of them. I learned Perl to write CGI scripts and that evolved into using it for little utility programs and one-liners. Perl was said to be a superset of Awk, so I didn’t see a need to learn the older language.

Years of Python use have made my Perl rusty. Now when I need a one-liner to do some data transformation in a pipeline, I struggle with Perl’s options and I forget to put sigils in front of the variable names. I’ve decided to use Awk instead.

The internet is loaded with Awk tutorials, but for my purposes the brief description in the Wikipedia article has been the most useful for getting Awk’s fundamental structure into my head. When looking at the tutorials and at other people’s scripts, it’s easy to get bogged down in the details and lose sight of the fundamentals. Here’s the description that set me straight:

An AWK program is a series of pattern action pairs, written as:

condition { action }

where condition is typically an expression and action is a series of commands. The input is split into records, where by default records are separated by newline characters so that the input is split into lines. The program tests each record against each of the conditions in turn, and executes the action for each expression that is true. Either the condition or the action may be omitted. The condition defaults to matching every record. The default action is to print the record.

Boom. The condition selects lines; the action does things to them. That’s Awk in a nutshell. The other three things that are handy to know if you want to start writing Awk one-liners are

Yes, of course there are many, many other things to learn about Awk, but those are details that are easy to pick up, especially if you have some experience with other scripting languages or C. Would you be surprised to learn that the condition can be a regular expression? Or that you can use printf to format the output? Of course not. But if you don’t understand the condition {action} structure (as I didn’t when I first tried to figure out some Awk one-liners), you’ll never understand Awk.[2]
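To make the condition {action} structure concrete, here is a minimal sketch of the three forms: a condition alone, an action alone, and a regular-expression condition paired with a printf action. The names and scores are made-up sample data, not anything from a real file.

```shell
# Hypothetical sample data: a name and a score on each line.
data='alice 90
bob 72
carol 85'

# Condition only: the default action prints each matching record.
high=$(printf '%s\n' "$data" | awk '$2 > 80')

# Action only: with no condition, the action runs on every record.
names=$(printf '%s\n' "$data" | awk '{ print $1 }')

# A regular expression as the condition, printf as the action.
formatted=$(printf '%s\n' "$data" | awk '/^b/ { printf "%-6s%3d\n", $1, $2 }')

echo "$high"
echo "$names"
echo "$formatted"
```

In each case Awk reads the input one line at a time, tests the condition against that line, and runs the action only when the test succeeds.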

I use Gnuplot for making graphs from data files, and Gnuplot expects its data files to consist of whitespace-delimited lines—the same structure Awk expects. This makes it easy to use Awk to manipulate Gnuplot’s data files before, during, and after plotting.

For example, the other day I wanted to import a Gnuplot data file into Numbers and do a few exploratory calculations. Numbers can open text files, but they have to be tab-delimited, not whitespace-delimited. An obvious Awk solution is

awk '{gsub(/ +/, "\t"); print}' < data.txt > data-tsv.txt

There’s no condition in the command because I want it to act on all the lines in the file. The gsub function does a global substitution of the first argument with the second. It’s like s/pattern/replacement/g in Perl.
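One thing to watch with the gsub approach: the pattern / +/ matches runs of spaces wherever they occur, so a line that starts with spaces ends up with a leading tab. If that matters, an alternative sketch (my assumption about what you might want, not something the file above required) is to let Awk do its usual field splitting and rejoin the fields with a tab as the output field separator. The sample line here is hypothetical.

```shell
# Hypothetical sample line with leading spaces and uneven spacing.
line='  2011-02    20  38'

# The gsub approach: every run of spaces becomes one tab,
# including the run at the start of the line.
with_gsub=$(printf '%s\n' "$line" | awk '{ gsub(/ +/, "\t"); print }')

# Alternative: set the output field separator (OFS) to a tab.
# Assigning $1 to itself forces Awk to rebuild the record with OFS,
# which also discards leading and trailing whitespace.
with_ofs=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ $1 = $1; print }')

printf '[%s]\n' "$with_gsub"
printf '[%s]\n' "$with_ofs"
```

The brackets in the output are only there to make the leading tab in the first result visible.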

A better solution on a Mac is to use the clipboard instead of creating a TSV file. Select the entire data file, copy it, and execute

pbpaste | awk '{gsub(/ +/, "\t"); print}' | pbcopy  

in Terminal. The clipboard now has TSV data ready to paste into a Numbers spreadsheet.

I tweeted this solution a couple of days ago, but I forgot that my Twitter client, Dr. Twoot, is set up to “educate” any straight quotation marks it sees. The result was readable, but not exactly what I wanted.

Convert space-separated clipboard data to tab-separated for pasting into Numbers:

pbpaste | awk ‘{gsub(/ +/,”\t”);print}’ | pbcopy

9:33 PM Sat Aug 6, 2011

I didn’t notice the screwup until just a few minutes ago when I looked for the tweet in my stream. I hope the people who favorited it recognized the mistake with the curly quotes and straightened them out before trying to use it.

Another Gnuplot-related use of Awk is in the scripts I use to automatically generate the monthly Afghanistan and Iraq casualty charts. The data file for Afghanistan casualties is in the form

2011-02    20    38
2011-03    31    39
2011-04    46    51
2011-05    35    56
2011-06    47    65
2011-07    37    53

where each line consists of the month, the US casualties, and the coalition casualties. Because I wanted the generated plot to include casualty totals, the Gnuplot script file has had to “shell out” to do the sums. The lines that did that calculation used to use Perl:

ustot = `perl -e '$s=0;while(<>){($m,$a,$b)=split;$s+=$a}print$s;' acasualties.txt`
tot   = `perl -e '$s=0;while(<>){($m,$a,$b)=split;$s+=$b}print$s;' acasualties.txt`

I recently made them more compact by using Awk:

ustot = `awk '{s+=$2} END {print s}' < acasualties.txt`
tot   = `awk '{s+=$3} END {print s}' < acasualties.txt`
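The pattern in those two commands is an action with no condition, which accumulates a sum on every line, followed by an END action that runs once after the last line is read. As a sketch of the Awk side only, both sums can be accumulated in a single pass; the variable names us and all are mine, and the Gnuplot script would still need some way to pull the two numbers apart.

```shell
# The six sample rows shown earlier: month, US, coalition.
data='2011-02    20    38
2011-03    31    39
2011-04    46    51
2011-05    35    56
2011-06    47    65
2011-07    37    53'

# Accumulate both columns on every line; print both totals at the end.
totals=$(printf '%s\n' "$data" | awk '{ us += $2; all += $3 } END { print us, all }')

echo "$totals"   # prints "216 302" for the six rows above
```

Awk variables start out empty and are treated as zero in arithmetic, which is why neither sum needs the explicit initialization the Perl version had.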

I don’t think I’ll ever try to write many-line Awk programs; it’s just easier for me to use Python when the problem gets more complicated. But with Python not especially adept at doing one-liners, and with my Perl skills fading by the hour, I expect to start using Awk more often.

If I start using sed, though, an intervention may be necessary to prevent me from turning into a Solaris sysadmin.

  1. Although Awk is an acronym, and the standard reference spells it with all caps (AWK), Brian Kernighan now spells it with just a leading capital, like a proper noun. I figure his is a good example to follow. 

  2. I liken this to when I first “got” Lisp while reading Robert Chassell’s Emacs Lisp tutorial. I suddenly realized that every Lisp command is a function call, and it’s just like calling functions in other languages except that the parentheses go around the entire thing and the arguments are separated by whitespace instead of commas: (function arg1 arg2) instead of function(arg1, arg2). Once I saw that fundamental point, I could read and sort of understand Lisp and Scheme.