Capitalization and titlecase.py

In this post on Monday, I mentioned the titlecase module by Stuart Colville, which is a Python rewrite of John Gruber’s original Perl script, TitleCase.pl. My interest in capitalizing titles came from a desire to regularize the track titles in my iTunes library. Much of my library was built during my Linux days, and the track metadata were set by scripts that used the freeDB database, a database that was littered with misspellings and inconsistent capitalization. I had some hope of writing a Python script that used the titlecase and appscript modules to run through the library and standardize the capitalization. I’m not sure I’ll go through with that—I have a terrible fear of screwing up my whole library—but I did learn some things about titles along the way.

Let me first say that I’ve always used sentence style capitalization on And now it’s all this, a style I also use for section headings in the reports I write for work. But I do use headline style capitalization for the titles of my reports, and I often have to refer to other works that use the headline style. So a good capitalizing script would be a help.

(Also, I think song titles look better with headline capitalization, which is how I started down this path.)

My guide for proper headline capitalization has been the Chicago Manual of Style. In my edition (15th), the rules for Titles of Works are on pages 366–370. Its rules are very similar to those Gruber used, although it prefers all prepositions to be lowercase, regardless of length, unless they are stressed when saying the title or are used as another part of speech. I’ve always had a problem lowercasing longer prepositions—it just looks funny to me. An example in the Chicago Manual is

Four Theories concerning the Gospel according to Matthew

TitleCase.pl would turn this into

Four Theories Concerning the Gospel According to Matthew

and so would I.

Although Colville’s Python module was written to provide a Python function that matched the behavior of Gruber’s Perl script, it differs from TitleCase.pl in its handling of hyphenated compounds. Gruber’s script applies the same rules to the individual elements of a compound as it does to space-separated words. Thus, sugar-and-spice becomes Sugar-and-Spice and forget-me-not becomes Forget-Me-Not. Colville’s module capitalizes the first element but leaves the others as-is. Thus, forget-me-not becomes Forget-me-not, sugar-and-spice becomes Sugar-and-spice, but sugar-And-Spice becomes Sugar-And-Spice.
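
Here’s a rough sketch of the two policies, just to make the difference concrete. It is not the code from either program: the function names and the very short SMALL_WORDS list are made up for illustration, and the real scripts handle many more cases.

```python
# Hypothetical, much-shortened small-word list (an assumption for this sketch).
SMALL_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def cap_word(word):
    """Uppercase the first letter and leave the rest of the word untouched."""
    return word[:1].upper() + word[1:]


def per_element(compound):
    """TitleCase.pl-like policy: treat each hyphen-separated element as a word."""
    out = []
    for i, part in enumerate(compound.split("-")):
        # Small words stay lowercase unless they start the compound.
        if i > 0 and part.lower() in SMALL_WORDS:
            out.append(part.lower())
        else:
            out.append(cap_word(part))
    return "-".join(out)


def first_only(compound):
    """titlecase.py-like policy: capitalize the first element, leave the rest as-is."""
    return cap_word(compound)


for title in ("sugar-and-spice", "forget-me-not", "sugar-And-Spice"):
    print(title, "|", per_element(title), "|", first_only(title))
```

The first two inputs reproduce the results described above under both policies; the third shows that the first-element-only policy passes existing capitals through untouched.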

The Chicago Manual wouldn’t approve of either program; it prefers Gruber’s Sugar-and-Spice but Colville’s Forget-me-not. It generally approves of Gruber’s approach, but adds so many exceptions that one might be better off just giving up on hyphenated compounds, as Colville does. Applying TitleCase.pl to a hyphenated title written with no regard for capitalization will often fix it, but applying it to a carefully capitalized hyphenated compound could very well mess it up.

I decided to change titlecase.py to work like TitleCase.pl, recognizing that it may give bad results. This required two edits. First, I changed Line 40 from

words = re.split('\s', text)

to

words = re.split('(\s|-|‑)', text)

where that second hyphen is the non-breaking one: Unicode U+2011, UTF-8 E2 80 91. You’ll find it in the Punctuation section of the Mac’s Character Palette. I never use it, but it’s in Aristotle Pagaltzis’s TitleCase test suite, so it must be of value to someone.

Second, I changed Line 51 from

line = " ".join(line)

to

line = "".join(line)

The first change splits the input on hyphens as well as white space and, because the pattern is wrapped in parentheses, preserves the separating characters as items in the words list. The second change just puts all the elements back together; because the separating characters are already in the list, there’s no longer any reason to put a space between each item.
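
A small standalone demonstration of why the pair of edits works, separate from titlecase.py itself. The sample text and variable names are mine, and the patterns are written as raw strings to keep current versions of Python from complaining about the backslash.

```python
import re

# The non-breaking hyphen is U+2011; its UTF-8 encoding is E2 80 91.
NB_HYPHEN = "\u2011"
assert NB_HYPHEN.encode("utf-8") == b"\xe2\x80\x91"

text = "sugar-and-spice and forget-me-not"

# No capturing group: split on white space only, separators are thrown away.
plain = re.split(r"\s", text)
print(plain)

# Capturing group: split on white space or either hyphen, and keep each
# separator as its own item in the result.
kept = re.split(r"(\s|-|\u2011)", text)
print(kept)

# Because the separators are in the list, joining on the empty string
# rebuilds the original text exactly.
rebuilt = "".join(kept)
print(rebuilt == text)
```

Joining kept with a space instead would put extra spaces around every hyphen, which is why the second edit is needed once the first is made.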

If you plan to use TitleCase.pl or titlecase.py, or any of the variants written in other scripting languages, you have to recognize that they aren’t foolproof, even when there is no hyphenation in the title. My favorite example is the song “The In Crowd.” If you have the Mamas and the Papas version from the iTunes Store, these scripts will turn it into the anticlimactic “The in Crowd.” If, on the other hand, you have the Ramsey Lewis version, the title will have the troublesome word quoted (“The ‘In’ Crowd”), and the scripts will work fine: the quoted word is treated as the first word of a subphrase, so it gets capitalized even though it’s in the small word list.

You should prefer the Ramsey Lewis version anyway.