Local archive of WordPress posts

I wouldn’t be surprised to find myself jumping on the baked blog bandwagon. Everybody’s doing it. If I do, the first thing I’ll need is a collection of all my previous posts, downloaded to my local machine and saved in a directory structure that matches the URL structure WordPress has been presenting to the outside world. This evening, I did just that.

There were two main reasons it went so quickly and so smoothly:

  1. My posts have always been saved as Markdown on the server, so I didn’t have to go through the pain that Gabe went through when he converted from WordPress to Pelican last year. When ANIAT was served by Blosxom, the Markdown was saved in files on the server and was converted to HTML on the fly by a CGI script. When it was served by Movable Type, the Markdown was saved in a database on the server and converted into static HTML pages. Currently, the Markdown is saved in a database on the server and converted to HTML on the fly by WordPress’s PHP engine (and the first time a page is accessed, it’s cached as a static file in some compressed format).
  2. I’d already written a script, get-post, that got the saved content of a post from the server through the XML-RPC MetaWeblog API. All I needed to do was extend it to collect a set of posts and to save each one in the appropriate directory.

The semi-new script that does the collection and saving is called save-posts. It’s run this way:

save-posts 1 1000

This command downloads posts 1 through 1000 in turn and saves each one in a folder with a name like all-this/source/2013/01 in my Dropbox folder. Within each month folder, the individual Markdown source files are saved with names like geofencing-in-flickr.md, where the basename is the slug of the post as it was created and served by WordPress. This hierarchy and naming scheme will ensure that the individual posts created by the static blog have the same URLs as their WP counterparts—old links should continue to work.
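
To make the naming scheme concrete, here’s a quick sketch of how a post’s link turns into a file path. The link is made up, but the regex is the same one the script below uses to pull out the year and month:

import os
import re

# Same pattern the script uses to grab the /yyyy/mm/ piece of a permalink.
pathRE = re.compile(r'(/\d\d\d\d/\d\d/).+/$')

# A hypothetical link and slug, just for illustration.
link = 'http://example.com/all-this/2013/01/geofencing-in-flickr/'
slug = 'geofencing-in-flickr'

partialPath = pathRE.search(link).group(1)   # '/2013/01/'
print os.environ['HOME'] + '/Dropbox/all-this/source' + partialPath + slug + '.md'

That prints the full path, ending in 2013/01/geofencing-in-flickr.md.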

Here’s the Python source of save-posts:

python:
 1:  #!/usr/bin/python
 2:  
 3:  import xmlrpclib
 4:  import sys
 5:  import os
 6:  from datetime import datetime
 7:  import pytz
 8:  import re
 9:  
10:  # Blog parameters (url, user, pw) are stored in ~/.blogrc.
11:  # One parameter per line, with name and value separated by colon-space.
12:  p = {}
13:  with open(os.environ['HOME'] + '/.blogrc') as bloginfo:
14:    for line in bloginfo:
15:      k, v = line.split(': ')
16:      p[k] = v.strip()
17:  
18:  # The header fields and their metaWeblog synonyms.
19:  hFields = [ 'Title', 'Keywords', 'Date', 'Post',
20:              'Slug', 'Link', 'Status', 'Comments' ]
21:  wpFields = [ 'title', 'mt_keywords', 'date_created_gmt',  'postid', 
22:               'wp_slug', 'link', 'post_status', 'mt_allow_comments' ]
23:  h2wp = dict(zip(hFields, wpFields))
24:  
 25:  # Regex for the path to save the post in, taken from the post's link.
26:  pathRE = re.compile(r'(/\d\d\d\d/\d\d/).+/$')
27:  
28:  # Get the post ID from the command line.
29:  try:
30:    startID = int(sys.argv[1])
31:    endID = int(sys.argv[2])
32:  except:
33:    sys.exit()
34:  
35:  # Time zones. WP is trustworthy only in UTC.
36:  utc = pytz.utc
37:  myTZ = pytz.timezone('US/Central')
38:  
39:  # Connect.
40:  blog = xmlrpclib.Server(p['url'])
41:  
42:  # The posts in header/body format.
43:  for i in range(startID, endID + 1):
44:    try:
45:      post = blog.metaWeblog.getPost(i, p['user'], p['pw'])
46:      header = ''
47:      for f in hFields:
48:        if f == 'Date':
49:          # Change the date from UTC to local and from DateTime to string.
50:          dt = datetime.strptime(post[h2wp[f]].value, "%Y%m%dT%H:%M:%S")
51:          dt = utc.localize(dt).astimezone(myTZ)
52:          header += "%s: %s\n" % (f, dt.strftime("%Y-%m-%d %H:%M:%S"))
53:        else:
54:          header += "%s: %s\n" % (f, post[h2wp[f]])
55:  
56:      contents = header.encode('utf8') + '\n' 
57:      contents += post['description'].encode('utf8') + '\n'
58:  
59:      print post['link']
60:      partialPath = pathRE.search(post['link']).group(1)
61:      fullPath = os.environ['HOME'] + '/Dropbox/all-this/source' + partialPath
62:      try:
63:        os.makedirs(fullPath)
64:      except OSError:
65:        pass
66:      postFile = open(fullPath + post['wp_slug'] + '.md', 'w')
67:      postFile.write(contents)
68:      postFile.close()
69:    except:
70:      pass

As I said, most of this is explained in the post that describes the get-post script. The new stuff is the code on Lines 29–33, which reads the starting and ending post IDs from the command line; the loop on Line 43, which steps through every ID in that range; and Lines 59–68, which use the regex defined on Line 26 to pull the year and month out of each post’s link, create the corresponding directory if it isn’t already there, and save the Markdown source in it.

As the files were saved, I noticed that several posts with names like 666-revision and 999-autosave scrolled past. These are obviously weird partial posts that shouldn’t be in the archive. After testing to make sure no real posts would be deleted, I cd‘d into the source directory and issued these commands:

find ./ -name "*-revision*" -exec rm {} \;
find ./ -name "*-autosave*" -exec rm {} \;

That got rid of the oddballs.
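
If you want to see what those patterns are going to hit before deleting anything, a sketch like this lists the matching files without touching them (the find commands above, minus the -exec clause, would do the same job):

import fnmatch
import os

# Walk the source tree and list, but don't delete, anything that
# matches the revision and autosave patterns.
for root, dirs, files in os.walk('.'):
  for name in files:
    if fnmatch.fnmatch(name, '*-revision*') or fnmatch.fnmatch(name, '*-autosave*'):
      print os.path.join(root, name)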

Mind you, I’m still not ready to switch to a baked blog. But I’m readier than I was yesterday.

Update 1/15/13
While I appreciate the suggestions (here and on Twitter) for a static blogging engine, I’ll probably write my own. The attraction to me of a baked blog is less the speed of web serving—I don’t get that much traffic, and caching has worked well for me—than the complete control over the site. When you use someone else’s system, you have to work with, or around, their decisions, and if I’m going to do that I’ll just stick with WordPress.

Do not, by the way, expect the switch, if it happens at all, to happen any time soon. Even guys as talented as Gabe, Brett, and Justin took a long time to switch their sites to a static system. And if I never switch, I’ll still consider this exercise to have been worth the modest time I put into it. Having a local archive of the blog will still be useful for searching and other offline work.


6 Responses to “Local archive of WordPress posts”

  1. Peter says:

    Good move. In case you’re looking for a Python-based solution, several are available, but one that caught my attention is Acrylamid, mainly because it supports incremental builds (a rarity amongst static blogging generators). See: http://posativ.org/acrylamid/

  2. Aristotle Pagaltzis says:

    What would you do with your comments? Baking is a good idea for a blog – except, most everyone who does it that way seems to then end up using Disqus.

    The apostrophe in “I cd‘d into the source directory” is the wrong way around, btw. (SmartyPants?)

  3. Dr. Drang says:

    Comments are a big problem, Aristotle. Because they’re dynamic, they’re antithetical to the very idea of a static blog, but I don’t want to eliminate them. Disqus is the easy and popular answer; I’m just not sure I want to depend on an outside service.

    Alternatives are Juvia, which has the disadvantage of using a database instead of text files, and Matt Palmer’s Jekyll plugin, which I might be able to raid for ideas. I’m still studying.

    And yes, the backwards apostrophe was a SmartyPants error. The Markdown source was

    I `cd`'d into the `source` directory
    

    and SmartyPants decided that the backtick after “cd” meant that the apostrophe was an opening single quote. Thanks for picking that up.

  4. Aristotle Pagaltzis says:

    Ah. Matt Palmer’s approach is identical in spirit to how I decided I might implement comments (if ever I do). My hand-wavy idea was to save them as files in a spool directory on the server, periodically cleaned out after pulling them into the master copy of the weblog data so they can then be incorporated statically in the baked pages.

    Obviously that means I have to view and approve every comment before it appears on the site – which was fine for my preferences. But now it occurs to me that exposing the comment spool in an Ajax-friendly form would also allow closing the gap to real-time while keeping with the spirit of a baked weblog.

    The reason Disqus bothers me so much is that their wide popularity puts them in a position to build profiles of individual weblog readers, even those who never even leave a comment. It’s the same privacy hole which DoubleClick first brought to public consciousness.

  5. Jay says:

    I added categories to the script and the output comes back as

    ['Cat1', 'Cat2']
    

    Is there a simple way to strip out the junk and have it save as a simple list?

  6. Dr. Drang says:

    Jay,
    First check the type to make sure that post['categories'] is a list and not a string that looks like a list. The documentation says it’s a list, but you never know.

    Assuming it’s a list, you’ll have to make a special case for categories and have it print out

    ", ".join(post['categories'])
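
    Something like this inside the header loop would do it, assuming you’ve added 'Categories' to hFields and 'categories' to wpFields so the mapping works:

    elif f == 'Categories':
      header += "%s: %s\n" % (f, ", ".join(post[h2wp[f]]))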