Again with the checkcards

Sometimes it seems like all I do is rejigger my checkcards script in response to changes my local library makes to its website. When the script stopped working last week, I put it on my to-do list but was too busy to get around to it until today.

Quick recap: checkcards is a script (actually a Python script, a shell script, and a LaunchAgent plist, but it’s the Python script that does the heavy lifting) that logs into the library’s website, gathers up all the books my family has checked out or on hold, and sends my wife and me a nicely formatted email with that information every morning. Here’s what the email looks like on an iPhone: one table with the checked-out items,

Checked out

and one with the on-hold items.

On hold

For checked-out items, a light red background indicates an item that’s due or overdue, and a light yellow background indicates an item that’s due within two days. For on-hold items, a light red background indicates an item that’s ready to be picked up.
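The highlighting rules are simple enough to sketch in a few lines. This is only an illustration, not the code checkcards uses: the function names and the color values are made up, and the due dates are assumed to have been parsed into `date` objects already.

```python
from datetime import date, timedelta

# Hypothetical color values; the rules above only say "light red" and "light yellow".
LIGHT_RED = "#ffdddd"
LIGHT_YELLOW = "#ffffcc"


def checked_out_color(due, today):
    """Return a background color for a checked-out item, or None for no highlight."""
    if due <= today:
        return LIGHT_RED      # due today or overdue
    if due <= today + timedelta(days=2):
        return LIGHT_YELLOW   # due within two days
    return None


def on_hold_color(ready):
    """Return a background color for an on-hold item, or None for no highlight."""
    return LIGHT_RED if ready else None
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, makes the rules easy to check against fixed dates.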

The last time I had to update the scripts, I made a big change, switching from the mechanize module to the twill module. These are both modules that act sort of like a web browser, allowing you to automate the interaction with a site. Twill seemed easier and cleaner, but today I had to switch back to mechanize.

Here’s what happened. The library revamped its entire website. Most of the changes seem to be cosmetic, but the login page—which you need to go through to get at your lists of checked-out and on-hold items—has changed more than just its CSS file.

New login page

The form on this page confused twill, making it think there were actually two <form> sections, one with the form items you see and another with a couple of hidden fields. A quick look at the page’s HTML confirmed that the hidden fields were there, but they were part of the form with the visible fields. Twill’s confusion on this matter caused it to send incomplete information to the server, so the script couldn’t log in. I tried what I could to get twill to interpret the form correctly, but nothing worked.
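One way to double-check which fields belong to which form, independent of twill or mechanize, is to walk the page with the standard library's HTML parser. The sketch below is written for Python 3 and runs on made-up markup, not the library's real login page: the `code` and `pin` field names come from my script, while the hidden `token` field and the `FormLister` and `list_forms` names are invented for the example.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect, for each <form>, the (type, name) pairs of its <input> fields."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one list of (type, name) tuples per form
        self._current = None   # fields of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            self._current.append((attrs.get("type", "text"), attrs.get("name")))

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def list_forms(html):
    """Return a list with one entry per form, each a list of (type, name) tuples."""
    parser = FormLister()
    parser.feed(html)
    return parser.forms


# Made-up markup standing in for the library's login page.
SAMPLE = """
<form method="post" action="/login">
  <input type="text" name="code">
  <input type="password" name="pin">
  <input type="hidden" name="token" value="abc">
  <input type="submit" value="Log in">
</form>
"""

for number, fields in enumerate(list_forms(SAMPLE)):
    print(number, fields)
```

On this sample the listing shows a single form that includes the hidden field alongside the visible ones, which is the structure I saw in the real page's HTML.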

I then experimented with Ruby’s mechanize module. I don’t really know Ruby, but I figured it wouldn’t be too hard to put together a script if the module worked. It didn’t. In fact, it didn’t recognize the hidden fields at all.

So I went back to the Python mechanize module and tried again. My suspicion was that the problem I’d had with mechanize a couple of months ago wasn’t due to a deficiency in mechanize itself[1] but to its crappy documentation.

Don’t get me wrong. There’s plenty of documentation for mechanize, but a lot of it is out of date and almost all of it is focused on the lower-level classes and functions. Mechanize has a high-level class, called Browser, which I figured was the proper class to use, but there’s no real documentation on its methods. All the mechanize site has on Browser are some simple, incomplete examples on the home page.

I have to think mechanize’s programmers want us to use Browser, because it provides the most straightforward interface to mechanize’s capabilities. But by failing to provide complete documentation on its methods, they’re limiting the number of people who can use their library.

Two things put me on the right path for learning Browser’s methods:

  1. The source code itself.
  2. This blog post by Rogério Carvalho Schneider. Yes, it’s mostly a set of examples, but Schneider’s examples are more complete and cover more methods.
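When the source code is the documentation, a quick way to get an overview of a class is to have Python list its public methods with their signatures and the first line of each docstring. The sketch below is one way to do that in Python 3. The `public_methods` name is made up, and the demonstration uses a standard-library class as a stand-in so it runs anywhere. With mechanize installed, passing `mechanize.Browser` instead should give the same kind of listing for Browser.

```python
import inspect
from html.parser import HTMLParser  # stand-in class from the standard library


def public_methods(cls):
    """Return (name, signature, first docstring line) for each public method of cls."""
    rows = []
    for name, member in inspect.getmembers(cls, callable):
        if name.startswith("_"):
            continue
        try:
            signature = str(inspect.signature(member))
        except (TypeError, ValueError):
            signature = "(...)"   # signature not introspectable
        doc = (inspect.getdoc(member) or "").split("\n")[0]
        rows.append((name, signature, doc))
    return rows


for name, signature, doc in public_methods(HTMLParser):
    print(name + signature, "-", doc)
```

This is no substitute for reading the source, but it gives you a list of method names to search for.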

The relevant portion of my script is now this:

    # Go through each card, collecting the lists of items.
    for card in cardList:
      # Open a browser and log in.
      # loginURL, checkedOutURL, and onHoldURL are placeholders for the
      # library's page addresses.
      br = mechanize.Browser()
      br.set_handle_robots(False)
      br.open(loginURL)
      br.select_form(nr=0)
      br.form['code'] = card['code']
      br.form['pin'] = card['pin']
      br.submit()

      # Go to the page for items checked out and get the HTML.
      br.open(checkedOutURL)
      cHtml = br.response().read()

      # Go to the page for items on hold and get the HTML.
      br.open(onHoldURL)
      hHtml = br.response().read()

Looking back through the history of this script, I see that this isn’t terribly different from what this section looked like before the switch to twill, but it’s different enough to make this version work. And I have a better understanding of why it works.

That understanding will come in handy the next time the library’s webmaster decides to freshen up the site and break my script.

  1. It really couldn’t be, as twill is built on mechanize.

3 Responses to “Again with the checkcards”

  1. Ryan Gray says:

    Have you tried Library Books App? It’s for the Mac and iPhone. It works with the various systems libraries use.

  2. James says:

    This is exactly why the library website needs a public API! This sort of data should be retrievable via logon and fed out via XML. The library would then use the same API in its own web app. And you can take that API and put it into an actual iOS app, etc.

    This is why Sherlock and other similar screen scraping solutions failed. If a third party doesn’t have an API and the first party changes their site, then the third party tool breaks and has to be re-written. This can take place so frequently it’s hardly worth the effort.

    Moving to a Semantic Net requires APIs and cooperation so that data flows automagically, which can be extraordinarily useful to all parties.

  3. Dr. Drang says:

    Library Books App looks great, Ryan, and it supports my library. I still like the idea of getting an email every day because I always check email, but it looks like the notification system is flexible enough to do what I need. And since my wife now has an iPhone, too, I may switch to it the next time checkcards breaks.

    James, you’re preaching to the choir. I did look to see if my library had a public API before I wrote checkcards, but didn’t find anything, which is why I had to go with screenscraping. That Library Books App works with my library makes me think there really is an API, so maybe I should look again.