Parsing comments

Frequent commenter Aristotle Pagaltzis and I had a brief discussion today about comments in baked blogs. We agree that Disqus, which seems to be the go-to system for people who are moving to a baked blog and want to maintain comments, is out of step with the general idea behind static blogging. It’s a service that maintains a blog’s comments in a database outside the control of the blogger, and if it goes out of business, there go your comments.[1] On the other hand, it does seem dead simple to implement, and with all the other mishegoss surrounding the conversion to a static blog, it’s no wonder so many have just gone with Disqus.

But since I’m not actually implementing a static blog (yet) and have no timetable for doing so, I’m free to explore other ideas for comments. And for an elderly blog like this one, there are two problems that must be solved:

  1. Dealing with new comments as they come in.
  2. Dealing with the old comments that were written before the blog went static.

Matt Palmer has a Jekyll plugin that might be a good source of ideas for dealing with the former;[2] in this post, I’m going to talk about the latter.

The first hurdle to get over is that WordPress’s MetaWeblog API has no commands for accessing comments. Thus, the techniques I used to download all my posts into a nicely organized set of Markdown source files won’t work for comments. There are a couple of plugins for downloading comments, but they put the results in CSV format, which I’d rather not deal with because I’m not sure how they handle line breaks. That leaves me with exporting the entire blog as an XML file and parsing it to pull out the comments.

The standard Python library for parsing XML is ElementTree. My experience with ElementTree in the past hasn’t been pleasant, but I followed a link in this StackOverflow discussion to Mark Pilgrim’s Dive into Python chapter on XML and everything went amazingly smoothly. Mark’s explanation—despite it being for Python 3 when I’m using Python 2.7—is much better than anything I’ve read before. To see how easy it would be to pull out the comments and put them into a simple plain text form, I wrote this little script:

python:
 1:  #!/usr/bin/python
 2:  
 3:  import xml.etree.ElementTree as etree
 4:  from html2text import html2text
 5:  
 6:  wp = '{http://wordpress.org/export/1.2/}'
 7:  csep = '\n' + ('-' * 50) + '\n\n'
 8:  
 9:  tree = etree.parse('andnowitsallthis.wordpress.2013-01-16.xml')
10:  root = tree.getroot()
11:  
12:  posts = root[0].findall('item')
13:  
14:  for p in posts:
15:    comments = p.findall(wp + 'comment')
16:    if comments:
17:      pid = int(p.find(wp + 'post_id').text)
18:      f = open('comments-%04d.txt' % pid, 'w')
19:      clist = []
20:      for c in comments:
21:        cid = int(c.find(wp + 'comment_id').text)
22:        commenter = c.find(wp + 'comment_author').text
23:        cdate = c.find(wp + 'comment_date_gmt').text
24:        comment = html2text(c.find(wp + 'comment_content').text)
25:        cstr = '''Post: %d
26:  Comment: %d
27:  Commenter: %s
28:  Date: %s
29:  
30:  %s''' % (pid, cid, commenter, cdate, comment)
31:        clist.append((cid, cstr))
32:  
33:      clist.sort()
34:      f.write((csep.join(x[1] for x in clist)).encode('utf-8'))
35:      f.close()

The exported XML file is named andnowitsallthis.wordpress.2013-01-16.xml, and I saved it to my local hard disk in the same directory as the script. Before writing the script, I played around with the file and the ElementTree library interactively in IPython. I learned immediately that the parse command (Line 9) wouldn’t work because of funny characters in the XML file. Cleaning the file was an iterative process, because the error message generated by parse reported only the first bad character it encountered, not all of them. But by moving between IPython running in a Terminal window and the XML file open in BBEdit, I soon got the file shipshape.
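
The hunt for bad characters can be partly automated. What follows is only a sketch under a loud assumption: that the "funny characters" are control characters that XML 1.0 forbids, which the post does not actually say. The helper names (first_parse_error, strip_illegal_chars) and the sample string are made up for illustration; ParseError and its position attribute are part of the standard ElementTree module.

```python
import re
import xml.etree.ElementTree as etree

# ASSUMPTION: the troublesome characters are control characters that
# XML 1.0 does not allow (everything below 0x20 except tab, LF and CR).
ILLEGAL_XML_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def first_parse_error(xml_text):
    """Return the (line, column) of the first parse error, or None if it parses."""
    try:
        etree.fromstring(xml_text)
    except etree.ParseError as err:
        return err.position
    return None


def strip_illegal_chars(xml_text):
    """Remove every character matched by ILLEGAL_XML_CHARS."""
    return ILLEGAL_XML_CHARS.sub('', xml_text)


# Made-up sample with a vertical tab hiding in the title.
sample = '<rss><channel><title>bad\x0bchar</title></channel></rss>'

print(first_parse_error(sample))   # where the parser gave up
cleaned = strip_illegal_chars(sample)
print(first_parse_error(cleaned))  # None once the text parses cleanly
```

Like parse itself, the position only ever points at the first problem, so for characters that fall outside this assumption the loop of parse, look, fix, repeat is still needed.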

(Let me interrupt myself here and tell you what a pleasure it was to open a 12-megabyte file in a text editor and have it act as if that were the most normal thing in the world. I could never have done this editing in TextMate.)

Once the parsing worked, everything else fell into place. The basic structure of the XML file is

xml:
<rss>
  <channel>
    <item>… </item>
    <item>… </item>
    .
    .
    .
    <item>
      <wp:comment>… </wp:comment>
      <wp:comment>… </wp:comment>
      .
      .
      .
    </item>
    <item>… </item>
  </channel>
</rss>
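
To make the layout concrete, here is a minimal sketch that walks a made-up miniature export with the same shape. The sample XML and the counts dictionary are invented for illustration; the wp prefix string is the one the script defines on Line 6, and it is how ElementTree expects namespaced tags such as wp:comment to be spelled when searching.

```python
import xml.etree.ElementTree as etree

# Same namespace prefix string the script uses for WordPress elements.
wp = '{http://wordpress.org/export/1.2/}'

# Made-up miniature export: one post without comments, one with two.
sample = '''<rss xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item><wp:post_id>1</wp:post_id></item>
    <item>
      <wp:post_id>2</wp:post_id>
      <wp:comment><wp:comment_id>11</wp:comment_id></wp:comment>
      <wp:comment><wp:comment_id>10</wp:comment_id></wp:comment>
    </item>
  </channel>
</rss>'''

root = etree.fromstring(sample)
channel = root[0]  # the first (and only) child of <rss>

# Count the comments attached to each post.
counts = {}
for item in channel.findall('item'):
    pid = int(item.find(wp + 'post_id').text)
    counts[pid] = len(item.findall(wp + 'comment'))

print(counts)
```

Plain tags like item are found by their bare name, while anything in the wp namespace needs the full URI in braces in front of it.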

The <item>s are the posts, some of which have comments and some (most) of which don’t. Line 12 gets all the <item>s and puts them into a list. We then loop through that list, pulling out the comments for those posts that have them, and writing the comments out into files with names like comments-2018.txt, where the number is the unique post ID number maintained by WordPress. Here’s an example of one of the files:

Post: 2018
Comment: 29665
Commenter: Peter
Date: 2013-01-15 10:38:41

Good move. In case you're looking for a Python based solution, several are
available but one that caught my attention is Acrylamid, mainly because it
supports incremental builds (a rarity amongst static blogging generators) See
: http://posativ.org/acrylamid/


--------------------------------------------------

Post: 2018
Comment: 29702
Commenter: Aristotle Pagaltzis
Date: 2013-01-16 11:30:15

What would you do with your comments? Baking is a good idea for a blog –
except, most everyone who does it that way seems to then end up using Disqus.

The apostrophe in “I `cd`‘d into the `source` directory” is the wrong way
around, btw. (SmartyPants?)


--------------------------------------------------

Post: 2018
Comment: 29730
Commenter: Dr. Drang
Date: 2013-01-16 15:30:29

Comment are a big problem, Aristotle. Because they're dynamic, they're
antithetical to the very idea of a static blog, but I don't want to eliminate
them. Disqus is the easy and popular answer; I'm just not sure I want to
depend on an outside service.

Alternatives are [Juvia](https://github.com/phusion/juvia), which has the
disadvantage of using a database instead of text files, and [Matt Palmer's
Jekyll plugin](http://theshed.hezmatt.org/jekyll-static-comments/), which I
might be able to raid for ideas. I'm still studying.

And yes, the backwards apostrophe was a SmartyPants error. The Markdown source
was


    I `cd`'d into the `source` directory


and SmartyPants decided that the backtick after "cd" meant that the apostrophe
was an opening single quote. Thanks for picking that up.


--------------------------------------------------

Post: 2018
Comment: 29752
Commenter: Aristotle Pagaltzis
Date: 2013-01-16 17:49:49

Ah. Matt Palmer’s approach is identical in spirit to how I decided I might
implement comments (if ever I do). My hand-wavy idea was to save them as files
in a spool directory on the server, periodically cleaned out after pulling
them into the master copy of the weblog data so they can then be incorporated
statically in the baked pages.

Obviously that means I have to view and approve every comment before it
appears on the site – which was fine for my preferences. But now it occurs to
me that exposing the comment spool in an Ajax-friendly form would also allow
closing the gap to real-time while keeping with the spirit of a baked weblog.

The reason Disqus bothers me so much is that their wide popularity puts them
in a position to build profiles of individual weblog readers, even those who
never even leave a comment. It’s the same privacy hole which DoubleClick first
brought to public consciousness.

As you can see, the comments are separated by a line of hyphens and appear in the order in which they were written. The sorting is done in Line 33, which uses the WordPress comment ID as the key. This is not really how I’d want the comments to be saved, but it was a useful exercise to see how easy it was to pull the comments out of the XML. Very easy, as it turned out.
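
The sort-and-join step is small enough to try in isolation. This is a sketch with made-up comment bodies (the IDs are borrowed from the example file above); it relies on the fact that Python compares tuples element by element, so a list of (ID, text) pairs sorts by ID. Treating ID order as chronological order is the script's own assumption, namely that WordPress hands out comment IDs in increasing order.

```python
# Separator built the same way as on Line 7 of the script.
csep = '\n' + ('-' * 50) + '\n\n'

# Made-up (comment ID, formatted text) pairs, deliberately out of order.
clist = [
    (29730, 'Comment: 29730\n'),
    (29665, 'Comment: 29665\n'),
    (29702, 'Comment: 29702\n'),
]

# Tuples compare element by element, so the ID decides the order.
clist.sort()

# Drop the IDs and glue the texts together with the hyphen separator.
output = csep.join(text for cid, text in clist)
print(output)
```

If the IDs ever turned out not to be chronological, sorting on the comment_date_gmt string instead would be one alternative, since that timestamp format sorts correctly as plain text.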

A couple of other things in the code worth mentioning:

The upshot is that getting the old comments out of the exported XML and into a set of source files for generating static HTML seems very doable.

For someone who claims not to be ready to switch to a baked blog, I sure am acting like someone who’s ready to switch to a baked blog.


  1. Disqus provides a way for you to download and save a copy of all your blog’s comments, which means they won’t be lost if Disqus goes under, but you’ll be left without a way to display them. 

  2. I don’t intend to use Jekyll, but I’m sure I could raid Matt’s plugin for ideas. 


5 Responses to “Parsing comments”

  1. Aristotle Pagaltzis says:

    Are you Fake Steve Jobs?

  2. Clark says:

    It’s funny everyone is going to static blogs. I started with a static blog way back in the late 90’s using some custom Python code. Then I wrote a new program that would handle comments and a bunch more for a technical philosophy blog around ‘94. It worked great until the day some spam-bot got through my security screwing up hundreds of posts. I decided it just wasn’t worth it so I switched to WordPress. While I get the reason for a static blog, honestly it’s not hard to get data out of mySQL and there are just so many benefits to using a database.

    Cool script though. Would be less of an issue for me since I don’t use (or really like) Markdown. I’m saving this for the day I ever want to export things.

  3. Ryan Gray says:

    You might try Pandoc to convert the HTML to Markdown.

  4. Peter says:

    While you’ve solved #2, why not take a look here for a possible solution to #1: http://tildehash.com/?article=why-im-reinventing-disqus

    Bake your comments ! It’s a PHP based solution and the source code (and other dependencies) are freely available. Please note the comments below the article are the actual demo and it’s looking pretty good IMHO.

  5. Dr. Drang says:

    Pandoc strikes me as serious overkill for converting comments, Ryan. Plus, I’d have to shell out to run it. Html2text has the advantage of being a Python library.

    Peter, I’d rather run a hot poker up my urethra than use PHP for anything significant.