Improved Markdown table commands for TextMate
March 13th, 2012 at 10:31 pm by Dr. Drang
A few years ago, I made a small TextMate bundle with commands for manipulating MultiMarkdown-style tables¹. One of the commands, Normalize Markdown Table, didn’t work too well if the table included non-ASCII characters. As I said in the post:
I’m sure there’s some clever Python module I can use to get around this problem, but I don’t know what it is yet. Suggestions are welcome.
It’s taken three years and five months, but someone has finally stepped up. I got an email today from Christoph Kepper with a new Normalize Markdown Table script that fixes the problem.
To review briefly, the Normalize Markdown Table command takes a table that looks like this,
|Left align|Right align|Center align|
|:---------|----------:|:----------:|
|This|This|This|
|column|column|column|
|will|will|will|
|be|be|be|
|left|right|center|
|aligned|aligned|aligned|
and turns it into one that looks like this,
| Left align | Right align | Center align |
|:-----------|------------:|:------------:|
| This       |        This |     This     |
| column     |      column |    column    |
| will       |        will |     will     |
| be         |          be |      be      |
| left       |       right |    center    |
| aligned    |     aligned |   aligned    |
The idea is that rather than trying to get the columns to align as you type, you just create the table initially with misaligned separators. Then you select the table, choose the Normalize Markdown Table command, and it turns into the nicely formatted one.
Two things to note:
- Both forms are equally valid tables, and your Markdown processor will produce the same output whichever one you use. The Normalize Markdown Table command is purely for the writer’s benefit; it makes the table easier to read before it’s processed.
- The column alignment relies on the use of monospaced fonts, which is what all right-thinking people use in TextMate.
A problem with the Normalize Markdown Table command arose when the table included non-ASCII characters, like this:
|Author|Book|
|--|--|
|Theodore von Kármán|Mathematical Methods in Engineering|
|Stephen Timoshenko|Theory of Elasticity|
|Jacob Pieter Den Hartog|Mechanical Vibrations|
It would normalize to this:
| Author                  | Book                                |
|:------------------------|:------------------------------------|
| Theodore von Kármán   | Mathematical Methods in Engineering |
| Stephen Timoshenko      | Theory of Elasticity                |
| Jacob Pieter Den Hartog | Mechanical Vibrations               |
The column separators didn’t align because the á is two bytes long but takes up only one character space. Christoph’s improved code results in
| Author                  | Book                                |
|:------------------------|:------------------------------------|
| Theodore von Kármán     | Mathematical Methods in Engineering |
| Stephen Timoshenko      | Theory of Elasticity                |
| Jacob Pieter Den Hartog | Mechanical Vibrations               |
which is just what we want.
Here’s the new code:
python:
 1:  #!/usr/bin/python
 2:
 3:  import sys
 4:
 5:  def just(string, type, n):
 6:      "Justify a string to length n according to type."
 7:
 8:      string = unicode(string, 'utf-8')
 9:      if type == '::':
10:          return string.center(n)
11:      elif type == '-:':
12:          return string.rjust(n)
13:      elif type == ':-':
14:          return string.ljust(n)
15:      else:
16:          return string
17:
18:
19:  def normtable(text):
20:      "Aligns the vertical bars in a text table."
21:
22:      # Start by turning the text into a list of lines.
23:      lines = text.splitlines()
24:      rows = len(lines)
25:
26:      # Figure out the cell formatting.
27:      # First, find the formatting line.
28:      for i in range(rows):
29:          if set(lines[i]).issubset('|:.-'):
30:              formatline = lines[i]
31:              formatrow = i
32:              break
33:
34:      # Delete the formatting line from the content.
35:      del lines[formatrow]
36:
37:      # Determine how each column is to be justified.
38:      formatline = formatline.strip('| ')
39:      fstrings = formatline.split('|')
40:      justify = []
41:      for cell in fstrings:
42:          ends = cell[0] + cell[-1]
43:          if ends == '::':
44:              justify.append('::')
45:          elif ends == '-:':
46:              justify.append('-:')
47:          else:
48:              justify.append(':-')
49:
50:      # Assume the number of columns in the format line is the number
51:      # for the entire table.
52:      columns = len(justify)
53:
54:      # Extract the content into a list of rows.
55:      content = []
56:      for line in lines:
57:          line = line.strip('| ')
58:          cells = line.split('|')
59:          # Put exactly one space at each end as "bumpers."
60:          linecontent = [ ' ' + x.strip() + ' ' for x in cells ]
61:          content.append(linecontent)
62:
63:      # Append cells to rows that don't have enough.
64:      rows = len(content)
65:      for i in range(rows):
66:          while len(content[i]) < columns:
67:              content[i].append('')
68:
69:      # Get the width of the content in each column. The minimum width will
70:      # be 2, because that's the shortest length of a formatting string and
71:      # because that matches an empty column with "bumper" spaces.
72:      widths = [2] * columns
73:      for row in content:
74:          for i in range(columns):
75:              widths[i] = max(len(unicode(row[i], 'utf-8')), widths[i])
76:
77:      # Add whitespace to make all the columns the same width.
78:      formatted = []
79:      for row in content:
80:          formatted.append('|' + '|'.join([ just(s, t, n) for (s, t, n) in zip(row, justify, widths) ]) + '|')
81:
82:      # Recreate the format line with the appropriate column widths.
83:      formatline = '|' + '|'.join([ s[0] + '-'*(n-2) + s[-1] for (s, n) in zip(justify, widths) ]) + '|'
84:
85:      # Insert the formatline back into the table.
86:      formatted.insert(formatrow, formatline)
87:
88:      # Return the formatted table.
89:      return '\n'.join(formatted)
90:
91:
92:  # Read the input, process, and print.
93:  unformatted = sys.stdin.read()
94:  print normtable(unformatted).encode('utf-8')
Christoph’s improvements boil down to just three changes:
- He added Line 8, which treats the string as UTF-8, both in the argument and the return value.
- He did the same thing in Line 75, which makes the len command return the number of characters rather than bytes.
- He added the encode method to Line 94, so the output would be handled properly.
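The byte-versus-character difference behind Lines 8 and 75 can be sketched in a few lines. This is written for Python 3, where strings are already Unicode, so it illustrates the idea rather than reproducing the Python 2 script; the names pad_by_chars and pad_by_bytes are made up for the example.

```python
# Python 3 sketch of why byte counts misalign a table cell.
name = "Theodore von K\u00e1rm\u00e1n"   # each á is one character, two UTF-8 bytes

char_count = len(name)                   # characters, as the new script counts
byte_count = len(name.encode("utf-8"))   # bytes, as the old script counted


def pad_by_chars(cell, width):
    """Left-justify so the cell is `width` characters long."""
    return cell.ljust(width)


def pad_by_bytes(cell, width):
    """Mimic the old behavior: pad until the UTF-8 encoding is `width` bytes."""
    return cell.encode("utf-8").ljust(width).decode("utf-8")


if __name__ == "__main__":
    print(char_count, byte_count)
    print("|" + pad_by_chars(name, 25) + "|")
    print("|" + pad_by_bytes(name, 25) + "|")  # comes out two columns too narrow
```

Padding by bytes leaves the cell two columns short, one for each á, which is why the column separators drifted in the old output.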
I’ve put the new Normalize Markdown Table command into my Text Tables bundle, which you can download as a zip file. After you unzip the file, double-click on the resulting TextMate bundle, and it will install itself into your TextMate system. The bundle includes two other commands: one for turning tab-separated tabular data (like what you’d get if you copied a set of cells from a spreadsheet and pasted them into TextMate) into a nearly complete Markdown table, and one for turning a tab-separated table into a neatly aligned space-separated table (with no pipe characters as column separators). These two commands are described here and here.
One last thing. To show how awesome he is, Christoph didn’t send me the full source code of his improved command; he just sent me the diff between it and my original. To show how awesome I am, I decided to use patch to apply his update straight from the email. Unfortunately, my awesomeness wasn’t up to the challenge; for reasons I can’t explain, patch refused to apply the second of his three changes, and I had to do that one by hand. Time to turn in my Unix merit badge.
¹ It’s also the style used in PHP Markdown Extra and some other Markdown processors that implement tables.



March 14th, 2012 at 1:06 am
The linked zip file still contains the old version…
March 14th, 2012 at 1:24 am
This doesn’t really work. It solves the simple issue of bytes vs. codepoints, but a Unicode codepoint is not necessarily a “character.” The simplest illustration is á (a + U+0301) vs. á (U+00E1). These look the same, but the Unicode behind them is different. Both are just one character, but the first is made using a normal ‘a’ plus a combining character (U+0301), and the second is a precomposed character taking up just a single codepoint (U+00E1).
On top of that, some characters don’t fit into a single column of space, even using precomposed characters. For example, あ and 愛 are Japanese and Chinese characters that are “double width.”
There are ways to hack around this by using the Unicode normalization tools, but it’s a pain in the butt.
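To make the codepoint distinction concrete, here is a small Python 3 sketch (Python 3 is an assumption; the script in the post is Python 2) that lists the codepoints behind each form of á. The helper name codepoints is invented for the example.

```python
import unicodedata

precomposed = "\u00e1"    # á as one codepoint
decomposed = "a\u0301"    # a followed by a combining acute accent


def codepoints(text):
    """Return the Unicode name of every codepoint in text."""
    return [unicodedata.name(ch) for ch in text]


if __name__ == "__main__":
    for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
        print(label, len(text), codepoints(text))
```

Both strings draw as a single character, but len reports 1 for the first and 2 for the second, so a table padded by codepoint count comes out one column short for the decomposed form.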
March 14th, 2012 at 1:34 am
OK, if you want to fix it for real, the relevant module is unicodedata. See http://docs.python.org/library/unicodedata.html
unicodedata.normalize("NFKC", s) will reduce the combining characters into precomposed characters where possible (it’s not always possible), and unicodedata.east_asian_width(char) will return "W" when the character is double width. Using that you can make a more informed table fixer.
It probably still won’t work for Arabic, and it definitely won’t work for letters with multiple accent marks, but it’s an improvement.
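A width function along these lines might look like the following Python 3 sketch. It is an illustration under stated assumptions, not a tested replacement for the script: display_width and ljust_display are invented names, and instead of normalizing, the sketch simply skips combining marks, so the text being measured is exactly the text being displayed.

```python
import unicodedata


def display_width(text):
    """Estimate how many monospaced columns text occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two columns
        else:
            width += 1
    return width


def ljust_display(text, width):
    """Left-justify text to `width` columns rather than `width` codepoints."""
    return text + " " * max(0, width - display_width(text))


if __name__ == "__main__":
    for sample in ("K\u00e1rm\u00e1n", "Ka\u0301rma\u0301n", "\u3042\u611b"):
        print("|" + ljust_display(sample, 8) + "|")
```

The sketch only treats the W and F width classes as double width and ignores other zero-width characters, and it makes no attempt at right-to-left or contextual scripts.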
March 14th, 2012 at 7:03 am
Lri, the bungling that resulted in me reuploading the old version (after properly uploading the new version) is painfully embarrassing to revisit. Thanks for finding the mistake quickly; I’m pretty sure it’s fixed now.
March 14th, 2012 at 7:15 am
Do you mind if I submit this to the Avian bundle?
March 14th, 2012 at 7:39 am
There is, Carl, a definite Eurocentrism to the new version. In his email, Christoph said his goal was to get the command to work with the sort of accented characters that come up in his writing. He didn’t present it as a universal Unicode solution; that was me overselling it in my description.
I don’t see a practical problem with the point you raised about accented characters. With a US keyboard layout, there are currently three common ways to insert an accented character:
I’ve used all three of these methods, and Christoph’s code formats the table correctly for all of them. Maybe there’s a problem if you use a different keyboard, but I don’t have a convenient way to test that. I can only say that whatever keyboard Christoph is using seems to work for him.
Maybe pasting characters copied from another source would lead to misalignment?
I can’t say that I’m especially concerned with handling multiply accented characters—that seems like too much of an edge case to bother with—but it would be nice to handle a broader range of characters. Does that seem reasonably doable in Python 2.x, or is it really a job for 3.x?
March 14th, 2012 at 7:47 am
Submit away, tim, as long as it’s properly attributed. See the CC notice in the sidebar. Note also Carl’s comments.
March 14th, 2012 at 10:13 am
My focus was indeed on German umlauts and the Euro (€) sign. AFAIK, TextMate has no support for variable-width or wide fonts, nor for right-to-left scripts like Arabic or Hebrew. At least that’s what Wikipedia says[1]. Therefore, I am not sure whether Carl’s objections, valid as they are, can ever be solved within TextMate.
[1] http://en.wikipedia.org/wiki/TextMate
March 14th, 2012 at 10:34 pm
I’m not really thinking of this from a TextMate perspective. I don’t have TextMate installed on my Mac, and I just run the script out of my Python directory.
Dr. Drang is right that OS X seems to prefer precomposed characters. Even when I read the Wikipedia page about tiếng Việt, the characters were precomposed. So, for most purposes, as long as you stick to the Latin alphabet, you’ll be fine. Still, if you want to wear a belt with your suspenders, add this to line 93:
    import unicodedata
    unformatted = unicodedata.normalize("NFKC", unformatted)
That still leaves problems with East Asian characters (ones which have bitten me personally), but most users can just ignore them.
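In Python 3 terms, the normalization step could be sketched like this. Python 3 is an assumption here; the script in the post is Python 2, where the text read from stdin is a byte string and would have to be decoded before unicodedata.normalize will accept it. The function name normalize_table_text is invented for the example.

```python
import unicodedata


def normalize_table_text(text, form="NFKC"):
    """Compose base letters and combining marks into precomposed characters."""
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    decomposed = "Ka\u0301rma\u0301n"            # 8 codepoints
    composed = normalize_table_text(decomposed)  # 6 codepoints
    print(len(decomposed), len(composed))
```

NFKC also rewrites compatibility characters (the ﬁ ligature becomes the two letters fi, for instance), so passing form="NFC" is the more conservative choice if the table text should otherwise be left alone.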
March 14th, 2012 at 10:36 pm
Ah, Markdown edge cases! I indented a code block, but Markdown thought I was continuing a list item. Here it is again in monospace:
    import unicodedata
    unformatted = unicodedata.normalize("NFKC", unformatted)
April 12th, 2012 at 10:45 am
This is really interesting. I missed your previous bundle/script, so I just experimented with your edited version. I ran into some problems that I’d rather ask about, as I am not a Python expert (I am an average guy trying to use stuff that a better-than-average guy coded).
So I tried to use your bundle on a table with an empty first cell in some rows, and I discovered that the “normalized” output somehow shifts the content from the next cell to the left.
Here is an example I have been experimenting with:
And here is the output:
It looks very different to me, and the HTML output is really different for the last 2 lines. As a simple test, I tried not leaving cells empty, and then it works as intended (adding a space doesn’t solve it). The MMD manual says cells can be empty, so I think I am missing something, or maybe it is an edge case? Empty cells in column 2 and up behave as expected.
Any clue?
Thanks.
April 12th, 2012 at 10:48 am
Sorry I messed up my code blocks.
Here is the raw table:
And what the script makes out of it:
April 12th, 2012 at 1:58 pm
BZH, you’ve found an important bug. I’ve never tested for empty cells and apparently have never had a table with empty cells in the years I’ve been using this bundle. But empty cells are allowed, as are multicolumn spans. The Normalize Table command doesn’t handle either of these correctly.
(I said back in the original post that column spans have to be formatted by hand after applying Normalize Table, but I’d like to be able to eliminate that requirement. If I am able to implement multicolumn spans, I’ll probably have to require at least one space between the pipes to designate an empty cell.)
The problem is the strip in Lines 38 and 57 (57 is the real culprit, but both lines are poorly written). It was intended to remove just the “outside” pipe characters (and any leading or trailing spaces), but it’s actually removing all the leading and trailing pipes.
I’ll get this fixed and issue an updated bundle in a day or two. Thanks for finding the bug.
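The difference is easy to reproduce outside the script. The following is only a sketch of one possible repair, not the fix that went into the bundle; split_row_buggy and split_row are invented names, and column spans are ignored.

```python
def split_row_buggy(line):
    """Split the way Lines 38 and 57 do: strip every outer pipe and space."""
    return [cell.strip() for cell in line.strip('| ').split('|')]


def split_row(line):
    """Remove at most one pipe from each end, then split into cells."""
    line = line.strip()
    if line.startswith('|'):
        line = line[1:]
    if line.endswith('|'):
        line = line[:-1]
    return [cell.strip() for cell in line.split('|')]


if __name__ == "__main__":
    row = "| | b | c |"
    print(split_row_buggy(row))  # the empty first cell disappears
    print(split_row(row))        # the empty first cell is kept
```

Because strip('| ') removes every leading pipe and space, a row that starts with an empty cell loses that cell and its neighbors shift left, which matches the behavior reported above.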