Improved Markdown table commands for TextMate

A few years ago, I made a small TextMate bundle with commands for manipulating MultiMarkdown-style tables[1]. One of the commands, Normalize Markdown Table, didn’t work too well if the table included non-ASCII characters. As I said in the post:

I’m sure there’s some clever Python module I can use to get around this problem, but I don’t know what it is yet. Suggestions are welcome.

It’s taken three years and five months, but someone has finally stepped up. I got an email today from Christoph Kepper with a new Normalize Markdown Table script that fixes the problem.

To review briefly, the Normalize Markdown Table command takes a table that looks like this,

|Left align|Right align|Center align|
|:---------|----------:|:----------:|
|This|This|This|
|column|column|column|
|will|will|will|
|be|be|be|
|left|right|center|
|aligned|aligned|aligned|

and turns it into one that looks like this,

| Left align | Right align | Center align |
|:-----------|------------:|:------------:|
| This       |        This |     This     |
| column     |      column |    column    |
| will       |        will |     will     |
| be         |          be |      be      |
| left       |       right |    center    |
| aligned    |     aligned |   aligned    |

The idea is that rather than trying to get the columns to align as you type, you just create the table initially with misaligned separators. Then you select the table, choose the Normalize Markdown Table command, and it turns into the nicely formatted one.

Two things to note:

  1. Both forms are equally valid tables, and your Markdown processor will produce the same output whichever one you use. The Normalize Markdown Table command is purely for the writer’s benefit; it makes the table easier to read before it’s processed.
  2. The column alignment relies on the use of monospaced fonts, which is what all right-thinking people use in TextMate.

A problem with the Normalize Markdown Table command arose when the table included non-ASCII characters, like this:

|Author|Book|
|--|--|
|Theodore von Kármán|Mathematical Methods in Engineering|
|Stephen Timoshenko|Theory of Elasticity|
|Jacob Pieter Den Hartog|Mechanical Vibrations|

It would normalize to this:

| Author                  | Book                                |
|:------------------------|:------------------------------------|
| Theodore von Kármán   | Mathematical Methods in Engineering |
| Stephen Timoshenko      | Theory of Elasticity                |
| Jacob Pieter Den Hartog | Mechanical Vibrations               |

The column separators didn’t align because the á is two bytes long in UTF-8 but takes up only one character space, so padding computed from the byte count came up short. Christoph’s improved code results in

| Author                  | Book                                |
|:------------------------|:------------------------------------|
| Theodore von Kármán     | Mathematical Methods in Engineering |
| Stephen Timoshenko      | Theory of Elasticity                |
| Jacob Pieter Den Hartog | Mechanical Vibrations               |

which is just what we want.
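
The byte-versus-character mismatch is easy to see in isolation. What follows is a minimal sketch, written for Python 3 as an assumption (the script in this post is Python 2), where text strings are already Unicode and the byte string has to be made explicitly. The variable names are only for illustration.

```python
# The precomposed á (U+00E1) is written as an escape so the example
# doesn't depend on how this file itself is encoded.
name = "Theodore von K\u00e1rm\u00e1n"

# Roughly what a Python 2 str read from stdin would hold.
as_bytes = name.encode("utf-8")

byte_count = len(as_bytes)   # each á counts twice here
char_count = len(name)       # each á counts once here
print(byte_count, char_count)

# Padding based on the byte count is short by this many spaces.
shortfall = byte_count - char_count
print(shortfall)
```

The shortfall is one space per á, which matches the gap in the misaligned row above.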

Here’s the new code:

python:
 1:  #!/usr/bin/python
 2:  
 3:  import sys
 4:  
 5:  def just(string, type, n):
 6:      "Justify a string to length n according to type."
 7:      
 8:      string = unicode(string, 'utf-8')
 9:      if type == '::':
10:          return string.center(n)
11:      elif type == '-:':
12:          return string.rjust(n)
13:      elif type == ':-':
14:          return string.ljust(n)
15:      else:
16:          return string
17:  
18:  
19:  def normtable(text):
20:      "Aligns the vertical bars in a text table."
21:      
22:      # Start by turning the text into a list of lines.
23:      lines = text.splitlines()
24:      rows = len(lines)
25:      
26:      # Figure out the cell formatting.
27:      # First, find the formatting line.
28:      for i in range(rows):
29:          if set(lines[i]).issubset('|:.-'):
30:              formatline = lines[i]
31:              formatrow = i
32:              break
33:      
34:      # Delete the formatting line from the content.
35:      del lines[formatrow]
36:      
37:      # Determine how each column is to be justified. 
38:      formatline = formatline.strip('| ')
39:      fstrings = formatline.split('|')
40:      justify = []
41:      for cell in fstrings:
42:          ends = cell[0] + cell[-1]
43:          if ends == '::':
44:              justify.append('::')
45:          elif ends == '-:':
46:              justify.append('-:')
47:          else:
48:              justify.append(':-')
49:      
50:      # Assume the number of columns in the format line is the number
51:      # for the entire table.
52:      columns = len(justify)
53:      
54:      # Extract the content into a list of rows.
55:      content = []
56:      for line in lines:
57:          line = line.strip('| ')
58:          cells = line.split('|')
59:          # Put exactly one space at each end as "bumpers."
60:          linecontent = [ ' ' + x.strip() + ' ' for x in cells ]
61:          content.append(linecontent)
62:      
63:      # Append cells to rows that don't have enough.
64:      rows = len(content)
65:      for i in range(rows):
66:          while len(content[i]) < columns:
67:              content[i].append('')
68:      
69:      # Get the width of the content in each column. The minimum width will
70:      # be 2, because that's the shortest length of a formatting string and
71:      # because that matches an empty column with "bumper" spaces.
72:      widths = [2] * columns
73:      for row in content:
74:          for i in range(columns):
75:              widths[i] = max(len(unicode(row[i], 'utf-8')), widths[i])
76:      
77:      # Add whitespace to make all the columns the same width and justify the cells.
78:      formatted = []
79:      for row in content:
80:          formatted.append('|' + '|'.join([ just(s, t, n) for (s, t, n) in zip(row, justify, widths) ]) + '|')
81:      
82:      # Recreate the format line with the appropriate column widths.
83:      formatline = '|' + '|'.join([ s[0] + '-'*(n-2) + s[-1] for (s, n) in zip(justify, widths) ]) + '|'
84:      
85:      # Insert the formatline back into the table.
86:      formatted.insert(formatrow, formatline)
87:      
88:      # Return the formatted table.
89:      return '\n'.join(formatted)
90:  
91:  
92:  # Read the input, process, and print.
93:  unformatted = sys.stdin.read()   
94:  print normtable(unformatted).encode('utf-8')

Christoph’s improvements boil down to just three changes:

  1. He added Line 8, which decodes the incoming UTF-8 byte string into a Unicode string, so center, rjust, and ljust pad by characters rather than bytes. The return value is a Unicode string, too.
  2. He did the same thing in Line 75, which makes the len function return the number of characters rather than bytes.
  3. He added the encode method to Line 94, so the Unicode result is converted back to UTF-8 before it’s printed.
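
For comparison, here is a rough sketch of how the just function might look in Python 3, where a str is already Unicode and the explicit decode of Line 8 shouldn’t be needed. This is an assumption for illustration; the bundle itself uses the Python 2 code above.

```python
def just(string, type, n):
    "Justify a string to length n according to type."
    # In Python 3 a str is a sequence of code points, so center, rjust,
    # and ljust pad by characters with no decoding step.
    if type == '::':
        return string.center(n)
    elif type == '-:':
        return string.rjust(n)
    elif type == ':-':
        return string.ljust(n)
    else:
        return string

# 'Kármán' with one bumper space on each side, left-justified to 12.
cell = ' K\u00e1rm\u00e1n '
padded = just(cell, ':-', 12)
print('|' + padded + '|')
```

The same reasoning would apply to Lines 75 and 94: len would count characters directly, and print would take care of encoding the output.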

I’ve put the new Normalize Markdown Table command into my Text Tables bundle, which you can download as a zip file. After you unzip the file, double-click on the resulting TextMate bundle, and it will install itself into your TextMate system. The bundle includes two other commands: one for turning tab-separated tabular data (like what you’d get if you copied a set of cells from a spreadsheet and pasted them into TextMate) into a nearly complete Markdown table, and one for turning a tab-separated table into a neatly aligned space-separated table (with no pipe characters as column separators). These two commands are described here and here.

One last thing. To show how awesome he is, Christoph didn’t send me the full source code of his improved command; he just sent me the diff between it and my original. To show how awesome I am, I decided to use patch to apply his update straight from the email. Unfortunately, my awesomeness wasn’t up to the challenge; for reasons I can’t explain, patch refused to apply the second of his three changes, and I had to do that one by hand. Time to turn in my Unix merit badge.


  1. It’s also the style used in PHP Markdown Extra and some other Markdown processors that implement tables. 


13 Responses to “Improved Markdown table commands for TextMate”

  1. Lri says:

    I’ve put the new Normalize Markdown Table command into my Text Tables bundle, which you can download as a zip file.

    The linked zip file still contains the old version…

  2. Carl says:

    This doesn’t really work. It solves the simple issue of bytes vs. codepoints, but a Unicode codepoint is not necessarily a “character.” The simplest illustration is á (a + U+0301) vs. á (U+00E1). These look the same, but the Unicode behind them is different. Both are just one character, but the first is made using a normal ‘a’ plus a combining character (U+0301), and the second is a precomposed character taking up just a single codepoint (U+00E1).

    On top of that, some characters don’t fit into a single column of space, even using precomposed characters. For example, あ and 愛 are Japanese and Chinese characters that are “double width.”

    There are ways to hack around this by using the Unicode normalization tools, but it’s a pain in the butt.

  3. Carl says:

    OK, if you want to fix it for real the relevant module is unicodedata. See http://docs.python.org/library/unicodedata.html

    unicodedata.normalize("NFKC", s) will reduce the combining characters into precomposed characters where possible (it’s not always possible), and unicodedata.east_asian_width(char) will return "W" when the character is double width. Using that you can make a more informed table fixer.

    It probably still won’t work for Arabic, and it definitely won’t work for letters with multiple accent marks, but it’s an improvement.
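
What Carl describes could be sketched roughly as follows. This assumes Python 3, and display_width is a hypothetical helper, not part of the bundle. It also uses NFC rather than the NFKC Carl mentions, because NFKC rewrites compatibility characters such as fullwidth letters, which would change the very width being measured. As Carl says, this is unlikely to be right for Arabic and similar scripts.

```python
import unicodedata

def display_width(text):
    """Estimate how many monospaced columns text occupies.

    Hypothetical helper: composes where possible, skips any combining
    marks that remain, and counts wide or fullwidth East Asian
    characters as two columns.
    """
    width = 0
    for char in unicodedata.normalize('NFC', text):
        if unicodedata.combining(char):
            continue    # a combining mark sits on the previous character
        if unicodedata.east_asian_width(char) in ('W', 'F'):
            width += 2
        else:
            width += 1
    return width

decomposed = 'a\u0301'    # a + combining acute accent
precomposed = '\u00e1'    # á as a single code point
print(len(decomposed), len(precomposed))
print(display_width(decomposed), display_width(precomposed))
print(display_width('\u3042\u611b'))    # あ and 愛
```

A table fixer could then pad each cell by the difference between the column width and display_width(cell) instead of relying on len and ljust.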

  4. Dr. Drang says:

    Lri, the bungling that resulted in me reuploading the old version (after properly uploading the new version) is painfully embarrassing to revisit. Thanks for finding the mistake quickly; I’m pretty sure it’s fixed now.

  5. tim says:

    Do you mind if I submit this to the Avian bundle?

  6. Dr. Drang says:

    There is, Carl, a definite Eurocentrism to the new version. In his email, Christoph said his goal was to get the command to work with the sort of accented characters that come up in his writing. He didn’t present it as a universal Unicode solution; that was me overselling it in my description.

    I don’t see a practical problem with the point you raised about accented characters. With a US keyboard layout, there are currently three common ways to insert an accented character:

    1. Using the traditional dead key approach.
    2. Using the new press-and-hold approach that lets you choose an accented key from a popup.
    3. Using the Character Viewer.

    I’ve used all three of these methods, and Christoph’s code formats the table correctly for all of them. Maybe there’s a problem if you use a different keyboard, but I don’t have a convenient way to test that. I can only say that whatever keyboard Christoph is using seems to work for him.

    Maybe pasting characters copied from another source would lead to misalignment?

    I can’t say that I’m especially concerned with handling multiply accented characters—that seems like too much of an edge case to bother with—but it would be nice to handle a broader range of characters. Does that seem reasonably doable in Python 2.x, or is it really a job for 3.x?

  7. Dr. Drang says:

    Submit away, tim, as long as it’s properly attributed. See the CC notice in the sidebar. Note also Carl’s comments.

  8. Christoph Kepper says:

    My focus was indeed on German Umlauts and the Euro (€) sign. AFAIK TextMate has no support for variable width or wide fonts and also not for right-to-left languages like Arabic or Hebrew script. At least that’s what Wikipedia says[1]. Therefore, I am not sure if Carl’s objections, as valid as they are, can ever be solved with TextMate.

    [1] http://en.wikipedia.org/wiki/TextMate

  9. Carl says:

    1. I’m not really thinking of this from a TextMate perspective. I don’t have TextMate installed on my Mac, and I just run the script out of my Python directory.

    2. Dr. Drang is right that OS X seems to prefer precomposed characters. Even when I read the Wikipedia page about tiếng Việt, the characters were precomposed. So, for most purposes, as long as you stick to the Latin alphabet, you’ll be fine. Still, if you want to wear a belt with your suspenders, add this to line 93:

      import unicodedata unformatted = unicodedata.normalize(“NFKC”, unformatted)

    That still leaves problems with East Asian characters (ones which have bitten me personally), but most users can just ignore them.

  10. Carl says:

    Ah, Markdown edge cases! I indented a code block, but Markdown thought I was continuing a list item. Here it is again in monospace:

    import unicodedata
    unformatted = unicodedata.normalize("NFKC", unformatted)
    
  11. BZH Geek says:

    This is really interesting, I missed your previous bundle/script so I just experimented with your edited version. Yet I face some problems I prefer to ask about as I am not a python expert (I am an average guy trying to use stuff some more than average guy coded).

    So I tried to use your bundle on a table with empty first cell in some rows and I discovered that the “normalized” output somehow shifts the content from the next cell to the left.

    Here is an example I have been experimenting with :

    Left align Right align Center align 1 2 3 1 2 1

    And here is the output :

    Left align Right align Center align 1 2 3 1 2 1

    It looks very different to me and HTML output is really different for the last 2 lines. As a simple test, I tried not to leave cells empty and then it works as intended (adding a space doesn’t solve it). The MMD manual says cells can be empty so I think I am missing something or maybe it is an edge case ? Empty cells on column 2 and up behave as expected.

    Any clue ?

    Thanks.

  12. BZH Geek says:

    Sorry I messed up my code blocks.

    Here is the raw table :

    |Left align|Right align|Center align|
    |:-|-:|:-:|
    |1|2|3|
    ||1|2|
    |||1|
    

    And what the script makes out of it :

    | Left align | Right align | Center align |
    |:-----------|------------:|:------------:|
    | 1          |           2 |      3       |
    | 1          |           2 |              |
    | 1          |             |              |
    
  13. Dr. Drang says:

    BZH, you’ve found an important bug. I’ve never tested for empty cells and apparently have never had a table with empty cells in the years I’ve been using this bundle. But empty cells are allowed, as are multicolumn spans. The Normalize Table command doesn’t handle either of these correctly.

    (I said back in the original post that column spans have to be formatted by hand after applying Normalize Table, but I’d like to be able to eliminate that requirement. If I am able to implement multicolumn spans, I’ll probably have to require at least one space between the pipes to designate an empty cell.)

    The problem is the strip in Lines 38 and 57 (57 is the real culprit, but both lines are poorly written). It was intended to remove just the “outside” pipe characters (and any leading or trailing spaces), but it’s actually removing all the leading and trailing pipes.
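
The behavior of strip, and one possible way around it, can be sketched like this. The split_row function is hypothetical and not necessarily the fix that will go into the bundle; the snippet should run under either Python 2 or 3.

```python
line = '||1|2|'    # a row whose first cell is empty

# str.strip treats its argument as a set of characters and removes every
# leading and trailing one, so the empty first cell disappears.
greedy = line.strip('| ').split('|')

def split_row(line):
    "Split a table row into cells, removing only the outermost pipes."
    line = line.strip()           # surrounding whitespace only
    if line.startswith('|'):
        line = line[1:]           # drop one leading pipe
    if line.endswith('|'):
        line = line[:-1]          # drop one trailing pipe
    return line.split('|')

careful = split_row(line)
print(greedy)
print(careful)
```

With the outer pipes removed one at a time, the empty cells survive the split and the remaining cells stay in their own columns.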

    I’ll get this fixed and issue an updated bundle in a day or two. Thanks for finding the bug.