Regexes, BBEdit, and Twitter screen names

I found another bug in Dr. Twoot today. A pretty embarrassing one, given the context in which I found it.

Here’s what my Twitter timeline looked like when I opened Dr. Twoot this morning:

[Screenshot: Improper screen name detection]

For some reason, the screen name for @bbedit_hints wasn’t turned into a link, only the @bbedit part was. It wasn’t too hard to find the reason. Here’s the section of code that finds screen names and turns them into links:

javascript:
57: // Handle Twitter names, ignoring case.
58: $.each(users, function(i, u) {
59:   iname = new RegExp('@' + u.screen_name, 'gi');
60:   link = '<a href="http://twitter.com/' + u.screen_name + '">' + '@' + u.screen_name + '</a>';
61:   body = body.replace(iname, link);
62: }) // each

The users list comes from the user_mentions tweet entity. The anonymous function called by the jQuery each method searches the tweet for the mentioned user’s screen name and replaces it with a link to the user’s Twitter page. If the screen name was miscapitalized in the tweet, the replace changes it to the canonical form. For example, a tweet that says

I love @BBEdit!

will be turned into

I love <a href="http://twitter.com/bbedit">@bbedit</a>!

and will be a working link when displayed by Dr. Twoot.
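The two flags on the regex do the work here: i ignores case and g replaces every occurrence. A minimal sketch of just that step, outside Dr. Twoot, with the screen name hard-coded and a made-up sample tweet:

```javascript
// Sketch only: the screen name and the sample tweet are invented for illustration.
var iname = new RegExp('@' + 'bbedit', 'gi');
var link = '<a href="http://twitter.com/bbedit">@bbedit</a>';

// Both the miscapitalized and the lowercase mentions are replaced
// with the canonical form taken from the link string.
var linked = 'I love @BBEdit! Thanks, @bbedit.'.replace(iname, link);
console.log(linked);
```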

This code has been working for months. The bug that was lying in wait until today only appears when one of the users mentioned in a tweet has a screen name that matches the beginning of another screen name in the tweet. When that happens, the shorter screen name can transform the initial portion of the longer screen name into a link and make it impossible for the longer name to be found.

In the tweet above, @bbedit_hints got turned into

<a href="http://twitter.com/bbedit">@bbedit</a>_hints

when the each method was looking for instances of @bbedit. When each was later looking for @bbedit_hints, it was nowhere to be found: the closing anchor tag before the underscore made the search fail.[1]
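The failure is easy to reproduce in isolation. The sketch below uses a plain forEach instead of jQuery's each, and the users array and tweet text are invented stand-ins for the real user_mentions data:

```javascript
// Reproduces the prefix problem with the original pattern (no word boundary).
// The users array and tweet text are hypothetical sample data.
var users = [{screen_name: 'bbedit'}, {screen_name: 'bbedit_hints'}];
var body = 'Tips from @bbedit_hints';

users.forEach(function(u) {
  var iname = new RegExp('@' + u.screen_name, 'gi');
  var link = '<a href="http://twitter.com/' + u.screen_name + '">' +
             '@' + u.screen_name + '</a>';
  body = body.replace(iname, link);
});

// The first pass links the @bbedit prefix; the second pass finds nothing,
// because '</a>' now sits between 'bbedit' and '_hints'.
console.log(body);
```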

The solution was pretty easy. Like many regular expression engines, JavaScript’s has a token, \b, that matches a word boundary. Including the word boundary in the regex search pattern in Line 59 fixed the problem.

javascript:
57: // Handle Twitter names, ignoring case.
58: $.each(users, function(i, u) {
59:   iname = new RegExp('@' + u.screen_name + '\\b', 'gi');
60:   link = '<a href="http://twitter.com/' + u.screen_name + '">' + '@' + u.screen_name + '</a>';
61:   body = body.replace(iname, link);
62: }) // each

The extra backslash is one of those annoying, easy-to-miss string escaping things. The backslash is JavaScript’s escape character, so to get a literal backslash into a string you have to use \\.
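To see why the doubled backslash matters, compare the two spellings. In a JavaScript string literal, '\b' is a backspace character, so the regex built from it looks for a literal backspace instead of a word boundary. The sample tweet here is made up:

```javascript
// '\b' in a string literal is a backspace; '\\b' is a backslash followed by b.
var wrong = new RegExp('@bbedit' + '\b', 'gi');   // pattern ends with a backspace character
var right = new RegExp('@bbedit' + '\\b', 'gi');  // pattern ends with the \b word boundary token

var tweet = 'Ask @bbedit or @bbedit_hints';       // hypothetical sample text

var wrongMatches = tweet.match(wrong);  // no backspace in the tweet, so nothing matches
var rightMatches = tweet.match(right);  // matches the standalone @bbedit only
console.log(wrongMatches, rightMatches);
```

The underscore counts as a word character, so there is no boundary between @bbedit and _hints and the longer name is left alone.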

With the fix in place, the screen names were detected properly and they all linked to the correct Twitter pages.
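As a quick check, here is a standalone sketch of the fixed loop run with the users list in both orders. The function name linkNames and the sample data are made up for illustration, and forEach stands in for jQuery's each:

```javascript
// Sketch of the fixed replacement; linkNames and the sample data are hypothetical.
function linkNames(body, users) {
  users.forEach(function(u) {
    // The trailing \b keeps @bbedit from matching inside @bbedit_hints.
    var iname = new RegExp('@' + u.screen_name + '\\b', 'gi');
    var link = '<a href="http://twitter.com/' + u.screen_name + '">' +
               '@' + u.screen_name + '</a>';
    body = body.replace(iname, link);
  });
  return body;
}

var text = 'Tips from @bbedit_hints via @BBEdit';
var shortFirst = linkNames(text, [{screen_name: 'bbedit'}, {screen_name: 'bbedit_hints'}]);
var longFirst = linkNames(text, [{screen_name: 'bbedit_hints'}, {screen_name: 'bbedit'}]);

console.log(shortFirst);
console.log(shortFirst === longFirst);
```

With the word boundary in the pattern, the result no longer depends on the order of the names.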

[Screenshot: Proper screen name detection]

The bitter irony of this is that the tweet that pointed out the bug was part of a discussion of the regular expression chapter of the BBEdit manual—the manual that was my introduction to the wonderful world of regular expressions back in 1995 or so. My mistakes are not a reflection on the quality of the manual.


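  1. Order is important. If @bbedit_hints had come before @bbedit in the user_mentions entity, the longer name would have been linked first. The later search for @bbedit would then have matched inside that link's text, so the missing word boundary would still have caused trouble, just in a different form.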

