Aligning text tables on the decimal point

I’ve recently written several reports with tables of numbers, and as I made more and more tables of this type, it became clear that my Normalize Table filter for BBEdit needed some updating. Now it can align numbers on the decimal point.

Here’s an example. Suppose I start with a table in MultiMarkdown format:

| Very long description | Short | Intermediate |
|.--.|.--.|.--.|
| 17,750 | 9,240.152 | 7,253.0 |
| 3,517 | 13,600.0 | 6,675 |
| 18,580.586 | 8,353.3 | 13,107.3 |
| 7,725 | 355.3 | 14,823.2 |
| None | 10,721.6 | 7,713 |
| 12,779.1592 | 14,583.867 | 3,153.2 |
| 18,850.03 | -4,756.6686 | 13,081.74 |
| -2,264.80 | 13,772.729 | 12,557 |
| -17,141.001 | 26,227.27 | — |
| 2,772.35 | 14,772.1509 | -3,814.0 |
| 934 | 20,853 | -10,272.2 |
| 8,082.139 | 7,864.0048 | 1,583.010 |

This is legal, but it’s impossible to read as plain text. Since readability is the chief virtue of Markdown, I want to adjust the spacing so all the columns line up. The older version of my filter worked well for left-aligned, right-aligned, and centered columns, but it made no attempt to align numeric columns on the decimal point. Now it does. After running the above through the Normalize Table filter, it comes out looking like this:

| Very long description |    Short    | Intermediate |
|.---------------------.|.-----------.|.------------.|
|       17,750          |  9,240.152  |   7,253.0    |
|        3,517          | 13,600.0    |   6,675      |
|       18,580.586      |  8,353.3    |  13,107.3    |
|        7,725          |    355.3    |  14,823.2    |
|          None         | 10,721.6    |   7,713      |
|       12,779.1592     | 14,583.867  |   3,153.2    |
|       18,850.03       | -4,756.6686 |  13,081.74   |
|       -2,264.80       | 13,772.729  |  12,557      |
|      -17,141.001      | 26,227.27   |      —       |
|        2,772.35       | 14,772.1509 |  -3,814.0    |
|          934          | 20,853      | -10,272.2    |
|        8,082.139      |  7,864.0048 |   1,583.010  |

The key features of the new filter are:

  1. Decimal points are aligned (duh).
  2. Numbers without decimal points are right-aligned just before the decimal point.
  3. Decimal-aligned lists of figures are centered in their columns.
  4. Entries that aren’t numbers are also centered in their columns.

The decimal points at either end of the format line (.----.) tell the filter to use decimal alignment on that column. I should point out that the extra spaces the filter adds have no effect on the processed output.

HTML doesn’t have a good and widely supported way to align table columns on decimal points, but that’s OK, because my reports get transformed into LaTeX, not HTML, and LaTeX has a nice dcolumn package for decimal alignment.

Here’s the new version of the filter:

python:
  1:  #!/usr/bin/python
  2:  
  3:  import sys
  4:  import re
  5:  
  6:  decimal = re.compile(r'^( -?[0-9,]*)\.(\d* )$')
  7:  integer = re.compile(r'^( -?[0-9,]+) $')
  8:  
  9:  def just(string, type, n, b, d):
 10:    "Justify a string to length n according to type."
 11:    
 12:    if type == '::':
 13:      return string.center(n)
 14:    elif type == '-:':
 15:      return string.rjust(n)
 16:    elif type == ':-':
 17:      return string.ljust(n)
 18:    elif type == '..':
 19:      isdec = decimal.search(string)
 20:      isint = integer.search(string)
 21:      if isdec:
 22:        before = len(isdec.group(1))
 23:        after = len(isdec.group(2))
 24:        string = ' ' * (b - before) + string + ' ' * (d - after)
 25:      elif isint:
 26:        before = len(isint.group(1))
 27:        string = ' ' * (b - before) + string + ' ' * d
 28:      return string.center(n)
 29:    else:
 30:      return string
 31:  
 32:  
 33:  def normtable(text):
 34:    "Aligns the vertical bars in a text table."
 35:    
 36:    # Start by turning the text into a list of lines.
 37:    lines = text.splitlines()
 38:    rows = len(lines)
 39:    
 40:    # Figure out the cell formatting.
 41:    # First, find the separator line.
 42:    for i in range(rows):
 43:      if set(lines[i]).issubset('|:.- '):
 44:        formatline = lines[i]
 45:        formatrow = i
 46:        break
 47:    
 48:    # Delete the separator line from the content.
 49:    del lines[formatrow]
 50:    
 51:    # Determine how each column is to be justified.
 52:    formatline = formatline.strip(' ')
 53:    if formatline[0] == '|': formatline = formatline[1:]
 54:    if formatline[-1] == '|': formatline = formatline[:-1]
 55:    fstrings = formatline.split('|')
 56:    justify = []
 57:    for cell in fstrings:
 58:      ends = cell[0] + cell[-1]
 59:      if ends in ['::', ':-', '-:', '..']:
 60:        justify.append(ends)
 61:      else:
 62:        justify.append(':-')
 63:    
 64:    # Assume the number of columns in the separator line is the number
 65:    # for the entire table.
 66:    columns = len(justify)
 67:    
 68:    # Extract the content into a matrix.
 69:    content = []
 70:    for line in lines:
 71:      line = line.strip(' ')
 72:      if line[0] == '|': line = line[1:]
 73:      if line[-1] == '|': line = line[:-1]
 74:      cells = line.split('|')
 75:      # Put exactly one space at each end as "bumpers."
 76:      linecontent = [ ' ' + x.strip() + ' ' for x in cells ]
 77:      content.append(linecontent)
 78:    
 79:    # Append cells to rows that don't have enough.
 80:    rows = len(content)
 81:    for i in range(rows):
 82:      while len(content[i]) < columns:
 83:        content[i].append('')
 84:    
 85:    # Get the width of the content in each column. The minimum width will
 86:    # be 2, because that's the shortest length of a formatting string and
 87:    # because that matches an empty column with "bumper" spaces.
 88:    widths = [2] * columns
 89:    beforedots = [0] * columns
 90:    afterdots = [0] * columns
 91:    for row in content:
 92:      for i in range(columns):
 93:        isdec = decimal.search(row[i])
 94:        isint = integer.search(row[i])
 95:        if isdec:
 96:          beforedots[i] = max(len(isdec.group(1)), beforedots[i])
 97:          afterdots[i] = max(len(isdec.group(2)), afterdots[i])
 98:        elif isint:
 99:          beforedots[i] = max(len(isint.group(1)), beforedots[i])
100:        widths[i] = max(len(row[i]), beforedots[i] + afterdots[i] + 1, widths[i])
101:        
102:    # Add whitespace to make all the columns the same width. 
103:    formatted = []
104:    for row in content:
105:      formatted.append('|' + '|'.join([ just(s, t, n, b, d) for (s, t, n, b, d) in zip(row, justify, widths, beforedots, afterdots) ]) + '|')
106:    
107:    # Recreate the format line with the appropriate column widths.
108:    formatline = '|' + '|'.join([ s[0] + '-'*(n-2) + s[-1] for (s, n) in zip(justify, widths) ]) + '|'
109:    
110:    # Insert the formatline back into the table.
111:    formatted.insert(formatrow, formatline)
112:    
113:    # Return the formatted table.
114:    return '\n'.join(formatted)
115:  
116:          
117:  # Read the input, process, and print.
118:  unformatted = unicode(sys.stdin.read(), "utf-8")
119:  print normtable(unformatted)

The main additions come in Lines 88–100 and Lines 18–28. Instead of just tracking the length of each cell, the filter now also tracks the number of characters before and after the decimal point.1 The overall width of the column is then the greater of the longest text entry (including the header entry) and the sum of the longest lengths before and after the decimal point (plus one for the point itself). If necessary, extra spaces are added before and after the number to create a new string that’s then centered in the column.
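
Here’s a small illustration of that bookkeeping, using a few cells from the “Short” column above, each with its bumper spaces already attached. The snippet reuses the filter’s two regexes; the cell list is just a sample, not the whole column.

import re

decimal = re.compile(r'^( -?[0-9,]*)\.(\d* )$')
integer = re.compile(r'^( -?[0-9,]+) $')

cells = [' Short ', ' -4,756.6686 ', ' 20,853 ', ' 355.3 ']
width, before, after = 2, 0, 0
for c in cells:
  isdec = decimal.search(c)
  isint = integer.search(c)
  if isdec:
    before = max(before, len(isdec.group(1)))  # ' -4,756' has 7 characters
    after = max(after, len(isdec.group(2)))    # '6686 ' has 5
  elif isint:
    before = max(before, len(isint.group(1)))
  width = max(width, len(c), before + after + 1)

print before, after, width   # prints 7 5 13; the Short column above is 13 characters wide

An integer like 20,853 then gets five trailing spaces (the after count) before centering, which is what pushes its last digit just to the left of the column of decimal points.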

Although I use this in BBEdit, there’s no reason it couldn’t be adapted to other intelligent text editors. It uses standard input and standard output, so it ought to be easy to incorporate into a Sublime Text, TextMate, Vim, or Emacs workflow.


  1. If you use commas as your decimal indicator, you’ll have to edit the filter a bit to get it to work for you. Be aware, though, that you’ll still need to use periods in the formatting line—MultiMarkdown doesn’t allow commas there. 


Sitemap evolution

I don’t know how valuable a sitemap really is, but I decided I might as well have one. When I was running ANIAT on WordPress, the WP system (or maybe it was a plugin) made one for me, but now I need to build one on my own and keep it updated with each new post. I put together a simple system today and figured it was worth sharing how I did it.

A sitemap is an XML file that is, at its heart, just a list of URLs to all the pages on your site. ANIAT is built from a directory structure that looks like this:

[Image: Local copy of ANIAT]

where source is the directory of Markdown source files, site is the directory of HTML and support files that’s mirrored to the server, and bin is the directory of scripts that build the site. The Makefile runs the scripts that build and upload the site.

Starting in the bin directory, I got a list of all the posts with a simple ls command:

ls ../site/*/*/*/index.html

This returned a long list of file paths that looked like this:

../site/2014/11/more-rss-mess/index.html
../site/2014/11/post-post-post/index.html
../site/2014/11/snell-scripting/index.html
../site/2014/11/the-rss-mess/index.html
../site/2014/11/three-prices/index.html

The simplest way to turn these into URLs was by piping them through awk,

ls ../site/*/*/*/index.html | awk -F/ '{printf "http://www.leancrew.com/all-this/%s/%s/%s/\n", $3, $4, $5}'

which returned lines like this:

http://www.leancrew.com/all-this/2014/11/more-rss-mess/
http://www.leancrew.com/all-this/2014/11/post-post-post/
http://www.leancrew.com/all-this/2014/11/snell-scripting/
http://www.leancrew.com/all-this/2014/11/the-rss-mess/
http://www.leancrew.com/all-this/2014/11/three-prices/

The -F/ option told awk to split each line on the slashes, and the $3, $4, and $5 are the third, fourth, and fifth fields of the split line.

With this working, it was easy to beef up the awk command to include the surrounding XML tags:

ls ../site/*/*/*/index.html | awk -F/ '{printf "<url>\n  <loc>http://www.leancrew.com/all-this/%s/%s/%s/</loc>\n</url>\n", $3, $4, $5}'

This gave me lines like this:

<url>
  <loc>http://www.leancrew.com/all-this/2014/11/more-rss-mess/</loc>
</url>
<url>
  <loc>http://www.leancrew.com/all-this/2014/11/post-post-post/</loc>
</url>
<url>
  <loc>http://www.leancrew.com/all-this/2014/11/snell-scripting/</loc>
</url>
<url>
  <loc>http://www.leancrew.com/all-this/2014/11/the-rss-mess/</loc>
</url>
<url>
  <loc>http://www.leancrew.com/all-this/2014/11/three-prices/</loc>
</url>

Now all I needed were a few opening and closing lines, and I’d have a sitemap file with all the required elements.

The awk command was growing too unwieldy to maintain as a one-liner, so I put it in a script file and added the necessary parts to the beginning and the end.

 1:  #!/usr/bin/awk -F/ -f
 2:  
 3:  BEGIN {
 4:    print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n<url>\n  <loc>http://www.leancrew.com/all-this/</loc>\n</url>"
 5:  }
 6:  {
 7:    printf "<url>\n  <loc>http://www.leancrew.com/all-this/%s/%s/%s/</loc>\n</url>\n", $3, $4, $5
 8:  }
 9:  END {
10:    print "</urlset>"
11:  }

With this script, called buildSitemap, I could run

ls ../site/*/*/*/index.html | ./buildSitemap > ../site/sitemap.xml

and generate a sitemap.xml file in the site directory, ready to be uploaded to the server. The file looked like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
  <loc>http://www.leancrew.com/all-this/</loc>
</url>
.
.
.
<url>
  <loc>http://www.leancrew.com/all-this/2014/11/the-rss-mess/</loc>
</url>
<url>
  <loc>http://www.leancrew.com/all-this/2014/11/three-prices/</loc>
</url>
</urlset>

This was the output I wanted, but the awk script looked ridiculous, and if I wanted to update the script to add <lastmod> elements to each URL, it was just going to get worse. I’m not good enough at awk to make a clean script that’s easy to maintain and improve. So I rewrote buildSitemap in Python:

python:
 1:  #!/usr/bin/python
 2:  
 3:  import sys
 4:  
 5:  print '''<?xml version="1.0" encoding="UTF-8"?>
 6:  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 7:  <url>
 8:    <loc>http://www.leancrew.com/all-this/</loc>
 9:  </url>'''
10:  
11:  for f in sys.stdin:
12:    parts = f.split('/')
13:    print '''<url>
14:    <loc>http://www.leancrew.com/all-this/{}/{}/{}/</loc>
15:  </url>'''.format(*parts[2:5])
16:    
17:  print '</urlset>'

It’s a little longer, but for me it’s much easier to understand at a glance. In fact, it was fairly easy to figure out how to add <lastmod> elements:

python:
 1:  #!/usr/bin/python
 2:  
 3:  import sys
 4:  import os.path
 5:  import time
 6:  
 7:  lastmod = time.strftime('%Y-%m-%d', time.localtime())
 8:  print '''<?xml version="1.0" encoding="UTF-8"?>
 9:  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
10:  <url>
11:    <loc>http://www.leancrew.com/all-this/</loc>
12:    <lastmod>{}</lastmod>
13:  </url>'''.format(lastmod)
14:  
15:  for f in sys.stdin:
16:    parts = f.split('/')
17:    mtime = time.localtime(os.path.getmtime(f.rstrip()))
18:    lastmod = time.strftime('%Y-%m-%d', mtime)
19:    print '''<url>
20:    <loc>http://www.leancrew.com/all-this/{1}/{2}/{3}/</loc>
21:    <lastmod>{0}</lastmod>
22:  </url>'''.format(lastmod, *parts[2:5])
23:    
24:  print '</urlset>'

The rstrip() in Line 17 gets rid of the trailing newline of f, and os.path.getmtime returns the modification time of the file in seconds since the Unix epoch. This is converted to a struct_time by time.localtime() and to a formatted date string in Line 18 by time.strftime(). The same idea is used in Line 7 to get the datestamp for the home page, but I cheat a bit by just using the current time, which is what time.localtime() returns when it’s given no argument.
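
If you want to see that conversion chain in isolation, here’s a quick sketch using one of the file paths from the ls output earlier. Like the script, it assumes you’re running from the bin directory.

import os.path
import time

seconds = os.path.getmtime('../site/2014/11/three-prices/index.html')  # seconds since the epoch
print time.strftime('%Y-%m-%d', time.localtime(seconds))               # e.g. 2014-11-21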

Now the same pipeline,

ls ../site/*/*/*/index.html | ./buildSitemap > ../site/sitemap.xml

gives me a sitemap with more information, built from a script that’s easier to read. I’ve added this line to one of the Makefile recipes in bin. When I add or update a post, the sitemap gets built and is then mirrored to the server.

You could argue that the time I spent using awk was wasted, but I don’t see it that way. It was a quick way to get preliminary results directly from the command line, and in doing so I learned what I needed in the final script. If I were to do it over again, I’d still use awk in the early stages, but I’d shift to Python one step earlier. The awk script was a mistake, and I should have recognized that as soon as the BEGIN clause got unwieldy.

Update 11/21/14 10:29 PM
As Conrad O’Connell pointed out on Twitter,

@drdrang the sitemap file doesn’t do you much good unless you add it to your robots.txt file

Re: leancrew.com/all-this/2014/…

Conrad O’Connell (@conradoconnell) Nov 21 2014 5:13 PM

you do need to let the search engines know that you have a sitemap and where it is. The various ways you can do that are given on the sitemaps.org protocol page. I used this site to ping the search engines after validating my sitemap, but I agree with Conrad that adding a

Sitemap: http://path/to/sitemap.xml

line to the robots.txt file is probably a better idea because it doesn’t require you to know which search tools use sitemaps.
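
For completeness, the ping option is nothing more than an HTTP GET of <searchengine_URL>/ping?sitemap=sitemap_url, the form given on the sitemaps.org protocol page. Here’s a minimal sketch in the same Python 2 style as the scripts above, using Google’s ping address; the sitemap URL is where mine ends up after mirroring.

import urllib

sitemap = 'http://www.leancrew.com/all-this/sitemap.xml'
urllib.urlopen('http://www.google.com/ping?sitemap=' + urllib.quote_plus(sitemap))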


Three prices

When I saw (on Twitter, of course) that Mike Nichols had died, I thought of this sketch. Which isn’t surprising—I think of it often.

Jack Paar introduces it as an indictment of the funeral business—his mention of Jessica Mitford was a reference to the first edition of her The American Way of Death—and it certainly is that, but it’s also a sly poke at the three-tier pricing model lots of retailers use to get you to “buy up.”

The low end of Apple’s current 16/64/128 GB storage lineup for the iPhone and iPad is the equivalent of “two men who… do God knows what with him.” Like Mike Nichols, we go for the one in the middle.


More RSS mess

When last we spoke, I was saying that although there have been some benefits to the more varied feed reading environment that grew after Google shut down Reader, we’ve also lost some valuable services. The one I want to talk about today is the near-instantaneous updating of cached feeds. This is primarily a loss for publishers, but readers have suffered, too.

When a feed-syncing service like Feedbin, Feedly, Feed Wrangler, etc. reads your RSS feed,1 it takes note of the new articles that have appeared since the last time it looked and saves a copy of the new stuff on its server. When one of its users asks to see unread posts, the service doesn’t go back to your site for the feed, it sends the user what it copied. This is what distinguishes subscribing through a service from subscribing directly, which is what we all did in the old days. Before there were syncing services, an app like NetNewsWire would poll all your subscribed sites, looking for new articles. It would collect whatever was unread at that time and present it to you.

The old way became annoying as we started using more devices. If we read an article on Device A and then moved to Device B, the feed reader on B wouldn’t know about the history on A and would show that same article as unread again. We could mark it as read, but when we later switched to Device C, it would pop up as unread yet again. The advantage of using a service is that the service knows what you have and haven’t read and knows what to deliver no matter which device you’re using.

The disadvantage of using a service arises when articles get updated after publication. The service delivers whatever it copied at the time it polled the site, which, by the time we launch our feed reader, could be hours or days out of date. We as readers are at the mercy of our service and the schedule it uses to update its copies.

In fact, the service may never update its copy. Each item in an RSS feed has a globally unique identifier, a string that’s not shared with any other RSS item from that site or any other site. The guid is typically not changed when the associated article is updated because it’s still the same article. Each item also has a publication date, which is also typically not changed when the article is updated.2 So the only way a service can be sure to have the most recent version of an item is to redownload it and change the copy on its server every time it polls the feed. This kind of defeats the purpose of making a cached copy, and some services don’t do it. They continue to serve up whatever version of the article was current when they first polled the feed.
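
To make that concrete, here’s a toy sketch of the caching logic I’m describing, written with the feedparser module. A service that keys its cache on the guid and skips anything it has already seen will never pick up a revision. The feed URL is a placeholder, and a real service’s cache is obviously not a Python dict.

import feedparser

cache = {}                           # guid -> stored copy of the item
feed = feedparser.parse('http://example.com/feed/')
for e in feed.entries:
  guid = e.get('id')                 # feedparser maps an item's <guid> to 'id'
  if guid in cache:
    continue                         # same guid as before, so the stale cached copy keeps getting served
  cache[guid] = e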

How did Google handle updates when it was running Reader? Being Google, it solved the problem two ways: one through brute force and the other through cleverness.

The brute force solution was exactly what you’d expect of a company that indexes the entire internet. It rechecked every feed several times a day and updated all of its cached feed entries. Generally speaking, this meant that the articles it delivered through Reader were seldom more than an hour out of date. This is pretty good, but Google wanted more.

The clever solution was to develop a new protocol, PubSubHubbub, for handling feed updates. With this protocol, a PubSubHubbub server, called a hub, sat between publishers and Google Reader. Whenever an article was published or updated, the publisher could ping the hub to let it know there was something new. The hub, which Google controlled, would then tell Google Reader about the update, and Reader’s cache of the item would be changed right then and there. With PubSubHubbub, Google Reader was seldom more than a few seconds out of date.

The downside to PubSubHubbub was that it required the cooperation of the publisher. The publisher had to add an element to its feed and had to ping the hub whenever something changed. This wasn’t hard—for my money, Nathan Grigg provided the best explanation of what to do—but many publishers didn’t do it, which is why Google continued to use the brute force method as a backstop.
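
The ping itself is tiny. Here’s a minimal sketch of the publisher side using Python 2’s urllib; the hub address is Google’s public hub, the feed URL is a placeholder, and a real setup also needs the hub link element added to the feed, as mentioned above.

import urllib

hub = 'http://pubsubhubbub.appspot.com/'    # Google's public hub
feed = 'http://example.com/feed/'           # stand-in for your feed's URL
params = urllib.urlencode({'hub.mode': 'publish', 'hub.url': feed})
urllib.urlopen(hub, params)                 # POSTs the params; the hub answers 204 No Content on success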

From a publisher’s point of view, Google Reader’s updating system was great, especially if you made the additions needed for PubSubHubbub. You knew that when you corrected a mistake in a blog post, it was the corrected version that people would see in their feed readers from that point on. This is, unfortunately, no longer the case. I’ve had people point out errors on Twitter that I had corrected a day earlier. Their syncing service was providing them with an old cached version that had never been updated.

It looks like Google’s hub is still operating, but I have no idea whether any of the feed syncing services use it. But because PubSubHubbub is an open protocol, anyone can provide hubs. The best known hub outside of Google is Superfeedr, and some of the syncing services do use it. Last week, I ran a small experiment to see which services update the articles they deliver and how quickly they respond to changes.

In hindsight, I would say that the experiment wasn’t very well run. Because it was going on during work hours, I had other things to do and couldn’t give it my fullest attention. Sometimes hours would go by between my checks on what the services were delivering, and I often forgot to clear my browser cache before checking. So my notes (which are in the post itself) on how quickly feeds were updated are not reliable. But I did learn some things:

  1. Feed Wrangler doesn’t use Superfeedr, and I see no evidence that it updates items after they’re cached. When I visit Feed Wrangler’s web site today, it’s still showing the original version of my test post, with no updates at all.
  2. When I started the test, Feedbin was not using Superfeedr to monitor my feed, but a day later it was. Despite that, I never saw any update to the post in Feedbin during the two days of the test. But if I check Feedbin now, it’ll show the fully updated post. To me, Feedbin is the most frustrating of the services. More people read ANIAT through Feedbin than any of the other services, but I have no idea of the best way to ensure its users get the most recent versions of my posts. (Update 2014/11/19 11:59 am Feedly is what most of my readers use, not Feedbin. Too many similar names.)
  3. Feedly and BazQux both use Superfeedr, and they were equally good at staying up to date. I can’t say how good that was because of my own experimental errors, but I do know that they update via both Superfeedr and brute force. I’ve configured both ReadKit and Reeder to use BazQux as my syncing service.
  4. NewsBlur is sui generis. It does not use Superfeedr, opting instead for a feed and site polling system of its own devising that seems to account for both the popularity of a site and how frequently it gets updated. You may recall from my last post that my server logs showed a NewsBlur hit on my DST post that included an absurdly high subscriber count of 15,879 in the user agent string. I found that by grepping the Apache log file and piping the lines through sed to filter out everything except the subscriber count (a rough Python equivalent is sketched just after this list). Looking at the full log entry,

    198.211.103.214 - - [10/Nov/2014:23:49:59 -0500] "GET /all-this/2013/03/why-i-like-dst/ HTTP/1.1" 200 6451 "-" "NewsBlur Page Fetcher - 15879 subscribers - http://www.newsblur.com/site/5719506/daring-fireball (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)"
    

    it’s clear that that’s the number of people who use NewsBlur to subscribe to Daring Fireball. Presumably, NewsBlur visited my DST post because John Gruber linked to it last week, and following links from popular blogs is part of NewsBlur’s updating system. I don’t pretend to understand why.

    Generally speaking, NewsBlur was behind Feedly and BazQux in updating its version of my test post.

  5. I didn’t test FeedHQ, mainly because I’d never heard of it, but it does use Superfeedr, so it ought to be fairly good at keeping up to date.
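
Here’s the rough Python stand-in for the grep/sed pipeline mentioned in Item 4. It scans an Apache access log and prints just the subscriber counts reported by NewsBlur’s fetcher; the log file path is a placeholder.

import re

count = re.compile(r'NewsBlur.*Fetcher - (\d+) subscribers')
for line in open('access.log'):   # stand-in path for the Apache log
  m = count.search(line)
  if m:
    print m.group(1)              # e.g. 15879 for the entry above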

If I were to run this experiment again—which I’d only do if I were stuck in bed for a few days with the flu—I’d fix the mistakes I made this time, and I’d add a few more services, like FeedHQ, The Old Reader, and Inoreader. I might even try the self-hosted service, Stringer.


In calling the RSS situation a mess, I’m trying to be more descriptive than derogatory. Before Google shut down Reader, there was order imposed through a popular, centralized system. When that order was lost, what was left is a mess. We may prefer the mess because we don’t want a centralized system,3 but we shouldn’t pretend that it’s not a mess. Some of the advantages of Reader vanished along with it. If more of the syncing services recognized the value Google provided to both readers and publishers, I’m sure they could offer features that replace what we’ve lost.


  1. At the risk of offending Atom partisans, I’m going to lump all feed formats together and call them RSS. I know this isn’t technically correct—and the mix of formats is another part of the mess—but it’s common parlance and it’s a convenient shorthand. 

  2. Atom feeds don’t have a pubDate element; they have an updated element, which should tell the service when an item has been changed. RSS 2 feeds have pubDate but not updated. I told you it was a mess. (Updated 2014/11/19 11:59 am Atom has an optional published element for the initial publication date and time. Thanks to ttepasse for the tip.) 

  3. When Dave Winer created RSS on the fourth day, he didn’t intend it to be centralized.