Parsing comments

January 16, 2013 at 9:34 PM by Dr. Drang

Frequent commenter Aristotle Pagaltzis and I had a brief discussion today about comments in baked blogs. We agree that Disqus, which seems to be the go-to system for people who are moving to a baked blog and want to maintain comments, is out of step with the general idea behind static blogging. It’s a service that maintains a blog’s comments in a database outside the control of the blogger, and if it goes out of business, there go your comments.¹ On the other hand, it does seem dead simple to implement, and with all the other mishegoss surrounding the conversion to a static blog, it’s no wonder so many have just gone with Disqus.

But since I’m not actually implementing a static blog (yet) and have no timetable for doing so, I’m free to explore other ideas for comments. And for an elderly blog like this one, there are two problems that must be solved:

Dealing with new comments as they come in.
Dealing with the old comments that were written before the blog went static.

Matt Palmer has a Jekyll plugin that might be a good source for dealing with the former;² in this post, I’m going to talk about the latter.

The first hurdle to get over is that WordPress’s MetaWeblog API has no commands for accessing comments. Thus, the techniques I used to download all my posts into a nicely organized set of Markdown source files won’t work for comments. There are a couple of plugins for downloading comments, but they put the results in CSV format, which I’d rather not deal with because I’m not sure how they handle line breaks. That leaves me with exporting the entire blog as an XML file and parsing it to pull out the comments.

The standard Python library for parsing XML is ElementTree. My experience with ElementTree in the past hasn’t been pleasant, but I followed a link in this StackOverflow discussion to Mark Pilgrim’s Dive into Python chapter on XML and everything went amazingly smoothly. Mark’s explanation—despite it being for Python 3 when I’m using Python 2.7—is much better than anything I’ve read before. To see how easy it would be to pull out the comments and put them into a simple plain text form, I wrote this little script:

python:
 1:  #!/usr/bin/python
 2:  
 3:  import xml.etree.ElementTree as etree
 4:  from html2text import html2text
 5:  
 6:  wp = '{http://wordpress.org/export/1.2/}'
 7:  csep = '\n' + ('-' * 50) + '\n\n'
 8:  
 9:  tree = etree.parse('andnowitsallthis.wordpress.2013-01-16.xml')
10:  root = tree.getroot()
11:  
12:  posts = root[0].findall('item')
13:  
14:  for p in posts:
15:    comments = p.findall(wp + 'comment')
16:    if comments:
17:      pid = int(p.find(wp + 'post_id').text)
18:      f = open('comments-%04d.txt' % pid, 'w')
19:      clist = []
20:      for c in comments:
21:        cid = int(c.find(wp + 'comment_id').text)
22:        commenter = c.find(wp + 'comment_author').text
23:        cdate = c.find(wp + 'comment_date_gmt').text
24:        comment = html2text(c.find(wp + 'comment_content').text)
25:        cstr = '''Post: %d
26:  Comment: %d
27:  Commenter: %s
28:  Date: %s
29:  
30:  %s''' % (pid, cid, commenter, cdate, comment)
31:        clist.append((cid, cstr))
32:  
33:      clist.sort()
34:      f.write((csep.join(x[1] for x in clist)).encode('utf-8'))
35:      f.close()
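
The listing targets Python 2.7, where writing an encoded byte string to a file opened with 'w' is fine. Under Python 3 that last write would raise a TypeError, so rerunning the script there needs a small change along the lines of the sketch below. The helper name write_comments and the sample data are invented for illustration and are not part of the original script.

```python
import os
import tempfile

csep = '\n' + ('-' * 50) + '\n\n'

def write_comments(path, clist):
    """Write (comment_id, text) pairs to path, sorted by ID, as UTF-8."""
    ordered = sorted(clist, key=lambda pair: pair[0])
    # In Python 3, open the file in text mode with an explicit encoding
    # and write str objects directly; no .encode() call is needed.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(csep.join(text for _, text in ordered))

# Hypothetical data, just to exercise the function.
sample = [(2, 'Second \u2013 comment'), (1, 'First comment')]
out_path = os.path.join(tempfile.mkdtemp(), 'comments-0001.txt')
write_comments(out_path, sample)
```

The with block also closes the file if the write fails, which the explicit f.close() in the listing would not do.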

The exported XML file is named andnowitsallthis.wordpress.2013-01-16.xml, and I had it saved to my local hard disk in the same directory as the script. Before writing the script, I played around with it and the ElementTree library interactively in IPython. I learned immediately that the parse command (Line 9) wouldn’t work because of funny characters in the XML file. Cleaning the file was an iterative process, because the error message generated by parse wouldn’t tell me all the bad characters, only the first one it encountered. But by moving between IPython running in a Terminal window and the XML file open in BBEdit, I soon got the file shipshape.

(Let me interrupt myself here and tell you what a pleasure it was to open a 12 megabyte file in a text editor and have it act as if that were the most normal thing in the world. I could never have done this editing in TextMate.)
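
The one-bad-character-at-a-time cleanup described above could probably be shortened by listing every suspect character in a single pass. The post doesn't say what the funny characters were, so this sketch assumes they were characters that XML 1.0 doesn't allow (mostly control characters). The function name and sample text are made up, and the code is Python 3.

```python
import re

# Matches any character NOT allowed by the XML 1.0 "Char" production:
# tab, newline, carriage return, and the ranges below are the legal ones.
INVALID_XML = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)

def find_invalid_xml_chars(text):
    """Return (line, column, repr(char)) for every disallowed character."""
    hits = []
    # split('\n') rather than splitlines(), because splitlines() would
    # also break lines on some of the control characters being hunted.
    for lineno, line in enumerate(text.split('\n'), start=1):
        for m in INVALID_XML.finditer(line):
            hits.append((lineno, m.start() + 1, repr(m.group())))
    return hits

# Hypothetical sample: a vertical tab on line 2, a NUL on line 3.
sample = '<a>ok</a>\n<b>bad\x0bchar</b>\n<c>also\x00bad</c>'
problems = find_invalid_xml_chars(sample)
```

Run against the text of the export file, this would give a list of line and column positions to jump to in the editor. It won't catch other reasons a parse can fail, such as malformed tags or bytes that aren't valid in the file's encoding.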

Once the parsing worked, everything else fell into place. The basic structure of the XML file is

xml:
<rss>
  <channel>
    <item>… </item>
    <item>… </item>
    .
    .
    .
    <item>
      <wp:comment>… </wp:comment>
      <wp:comment>… </wp:comment>
      .
      .
      .
    </item>
    <item>… </item>
  </channel>
</rss>
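
To make that outline concrete, here is a small Python 3 sketch that walks a made-up miniature export with the same shape. The sample XML is invented and far simpler than a real WordPress export; only the element names and the namespace URI are taken from the script.

```python
import xml.etree.ElementTree as etree

wp = '{http://wordpress.org/export/1.2/}'

# A made-up miniature export with the same shape as the outline above.
sample = '''<rss xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item><wp:post_id>1</wp:post_id></item>
    <item>
      <wp:post_id>2</wp:post_id>
      <wp:comment><wp:comment_id>11</wp:comment_id></wp:comment>
      <wp:comment><wp:comment_id>10</wp:comment_id></wp:comment>
    </item>
  </channel>
</rss>'''

root = etree.fromstring(sample)      # the <rss> element
channel = root[0]                    # its first child, <channel>
posts = channel.findall('item')      # direct <item> children only

# Map post ID -> number of comments, skipping posts with none.
counts = {}
for p in posts:
    comments = p.findall(wp + 'comment')
    if comments:                     # an empty list is false
        counts[int(p.find(wp + 'post_id').text)] = len(comments)
```

Because findall with a bare tag name looks only at direct children, calling it on the channel returns the posts and calling it on a post returns only that post's comments.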

The <item>s are the posts, some of which have comments and some (most) of which don’t. Line 12 gets all the <item>s and puts them into a list. We then loop through that list, pulling out the comments for those posts that have them, and writing the comments out into files with names like comments-2018.txt, where the number is the unique post ID number maintained by WordPress. Here’s an example of one of the files:

Post: 2018
Comment: 29665
Commenter: Peter
Date: 2013-01-15 10:38:41

Good move. In case you're looking for a Python based solution, several are
available but one that caught my attention is Acrylamid, mainly because it
supports incremental builds (a rarity amongst static blogging generators) See
: http://posativ.org/acrylamid/


--------------------------------------------------

Post: 2018
Comment: 29702
Commenter: Aristotle Pagaltzis
Date: 2013-01-16 11:30:15

What would you do with your comments? Baking is a good idea for a blog –
except, most everyone who does it that way seems to then end up using Disqus.

The apostrophe in “I `cd`‘d into the `source` directory” is the wrong way
around, btw. (SmartyPants?)


--------------------------------------------------

Post: 2018
Comment: 29730
Commenter: Dr. Drang
Date: 2013-01-16 15:30:29

Comments are a big problem, Aristotle. Because they're dynamic, they're
antithetical to the very idea of a static blog, but I don't want to eliminate
them. Disqus is the easy and popular answer; I'm just not sure I want to
depend on an outside service.

Alternatives are [Juvia](https://github.com/phusion/juvia), which has the
disadvantage of using a database instead of text files, and [Matt Palmer's
Jekyll plugin](http://theshed.hezmatt.org/jekyll-static-comments/), which I
might be able to raid for ideas. I'm still studying.

And yes, the backwards apostrophe was a SmartyPants error. The Markdown source
was


    I `cd`'d into the `source` directory


and SmartyPants decided that the backtick after "cd" meant that the apostrophe
was an opening single quote. Thanks for picking that up.


--------------------------------------------------

Post: 2018
Comment: 29752
Commenter: Aristotle Pagaltzis
Date: 2013-01-16 17:49:49

Ah. Matt Palmer’s approach is identical in spirit to how I decided I might
implement comments (if ever I do). My hand-wavy idea was to save them as files
in a spool directory on the server, periodically cleaned out after pulling
them into the master copy of the weblog data so they can then be incorporated
statically in the baked pages.

Obviously that means I have to view and approve every comment before it
appears on the site – which was fine for my preferences. But now it occurs to
me that exposing the comment spool in an Ajax-friendly form would also allow
closing the gap to real-time while keeping with the spirit of a baked weblog.

The reason Disqus bothers me so much is that their wide popularity puts them
in a position to build profiles of individual weblog readers, even those who
never even leave a comment. It’s the same privacy hole which DoubleClick first
brought to public consciousness.

As you can see, Line 33 sorts the comments into the order in which they were written, using the WordPress comment ID as the sort key, and Line 34 separates them with a line of hyphens. This is not really how I’d want the comments to be saved, but it was a useful exercise to see how easy it was to pull the comments out of the XML. Very easy, as it turned out.
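
The sort works because Python compares tuples element by element, so a list of (ID, text) pairs is ordered by ID first. A minimal sketch with invented text, assuming (as the script does) that WordPress hands out comment IDs in increasing order as comments arrive:

```python
# Hypothetical (comment_id, text) pairs in arbitrary order.
clist = [(29730, 'third'), (29665, 'first'), (29702, 'second')]

# Tuples compare element by element, so a plain sort orders by ID first.
by_tuple = sorted(clist)

# Naming the key makes the intent explicit and never compares the text.
by_key = sorted(clist, key=lambda pair: pair[0])

ordered_text = [text for _, text in by_key]
```

If the IDs ever turned out not to track posting order, the comment_date_gmt strings would be a reasonable alternative key, since timestamps in that year-first format sort correctly as plain text.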

A couple of other things in the code worth mentioning:

The wp string defined in Line 6 is the prefix to the WordPress-specific tag names and needs to be included in the find and findall commands that search for those tags. Putting it in a variable made the later commands shorter.
Unlike the posts themselves, which are saved in Markdown, the comments are saved as HTML. Don’t ask why; I don’t know. But I wanted them in Markdown. Given this week’s news, it won’t surprise you to hear that I immediately thought of Aaron Swartz’s html2text script and library. I’ve known of it for almost as long as I’ve known of Markdown, but this was the first time I needed to use it. Obviously, I haven’t gone through all the comments to see what kind of job it did, but it worked well on everything I’ve looked at so far. Not surprising.
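
On the first point, the reason the wp string is needed is that ElementTree stores every namespaced tag as the full URI in braces followed by the local name, so searching for a bare 'comment' finds nothing. The sketch below, using an invented one-post fragment, shows the style the script uses next to an alternative that passes a prefix mapping to findall. The mapping approach is offered only as an option, not as what the script does.

```python
import xml.etree.ElementTree as etree

WP_URI = 'http://wordpress.org/export/1.2/'
wp = '{' + WP_URI + '}'

# A made-up single post, just big enough to search.
item = etree.fromstring(
    '<item xmlns:wp="' + WP_URI + '">'
    '<wp:post_id>2018</wp:post_id>'
    '<wp:comment><wp:comment_id>29665</wp:comment_id></wp:comment>'
    '</item>'
)

# The parser stores each tag as "{namespace-uri}localname".
first_tag = item[0].tag

# Without the namespace, the search comes back empty.
bare = item.findall('comment')

# Style used in the script: spell out the URI via the wp string.
with_prefix_string = item.findall(wp + 'comment')

# Alternative: pass a prefix-to-URI mapping and use the short "wp:" form.
ns = {'wp': WP_URI}
with_mapping = item.findall('wp:comment', ns)
```

Both searches return the same elements; the choice is mostly a matter of which reads better in the surrounding code.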

The upshot is that getting the old comments out of the exported XML and into a set of source files for generating static HTML seems very doable.

For someone who claims not to be ready to switch to a baked blog, I sure am acting like someone who’s ready to switch to a baked blog.

¹ Disqus provides a way for you to download and save a copy of all your blog’s comments, which means they won’t be lost if Disqus goes under, but you’ll be left without a way to display them.
² I don’t intend to use Jekyll, but I’m sure I could raid Matt’s plugin for ideas.

And now it’s all this

I just said what I said and it was wrong
Or was taken wrong
