November 17th, 2011 at 10:15 pm by Dr. Drang
Sometimes it seems like all I do is rejigger my checkcards script in response to changes my local library makes to its website. When the script stopped working last week, I put it on my to-do list but was too busy to get around to it until today.
checkcards is a script (actually a Python script, a shell script, and a LaunchAgent plist, but it’s the Python script that does the heavy lifting) that logs into the library’s website, gathers up all the books my family has checked out or on hold, and sends my wife and me a nicely formatted email with that information every morning. Here’s what the email looks like on an iPhone: one table with the checked-out items, and one with the on-hold items.
For checked-out items, a light red background indicates an item that’s due or overdue, and a light yellow background indicates an item that’s due within two days. For on-hold items, a light red background indicates an item that’s ready to be picked up.
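The highlighting rules are simple enough to sketch in a few lines. This is not the code from checkcards, just a minimal illustration of the logic described above; the function names and the "red"/"yellow" labels are made up for the example.

```python
from datetime import date, timedelta


def checked_out_highlight(due, today):
    """Return a highlight label for a checked-out item.

    Hypothetical helper: "red" means due or overdue, "yellow" means
    due within two days, and "" means no highlight.
    """
    if due <= today:
        return "red"
    if due <= today + timedelta(days=2):
        return "yellow"
    return ""


def on_hold_highlight(ready):
    """Return "red" for an on-hold item that is ready for pickup."""
    return "red" if ready else ""


if __name__ == "__main__":
    today = date(2011, 11, 17)
    for due in (date(2011, 11, 16), date(2011, 11, 19), date(2011, 11, 30)):
        print(due, repr(checked_out_highlight(due, today)))
```

Passing today in as an argument, rather than calling date.today() inside the function, keeps the rule easy to check against fixed dates.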
The last time I had to update the scripts, I made a big change, switching from the mechanize module to the twill module. These are both modules that act sort of like a web browser, allowing you to automate the interaction with a site. Twill seemed easier and cleaner, but today I had to switch back to mechanize.
Here’s what happened. The library revamped its entire website. Most of the changes seem to be cosmetic, but the login page—which you need to go through to get at your lists of checked-out and on-hold items—has changed more than just its CSS file.
The form on this page confused twill, making it think there were actually two <form> sections, one with the form items you see and another with a couple of hidden fields. A quick look at the page’s HTML confirmed that the hidden fields were there, but they were part of the form with the visible fields.
Twill’s confusion on this matter caused it to send incomplete information to the server, so the script couldn’t log in. I tried what I could to get twill to interpret the form correctly, but nothing worked.
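A check like the "quick look at the page’s HTML" can also be scripted. The sketch below uses only Python’s standard html.parser to list which input fields sit inside which <form>. The sample page is invented: the code and pin field names match the ones my script fills in, but the hidden field names and the form action are placeholders, not the library’s real markup.

```python
from html.parser import HTMLParser


class FormFieldLister(HTMLParser):
    """Collect (name, type) pairs for the inputs inside each <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one list of (name, type) per <form>
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append([])
            self._in_form = True
        elif tag == "input" and self._in_form:
            field = (attrs.get("name"), attrs.get("type", "text"))
            self.forms[-1].append(field)

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def list_form_fields(html):
    """Return a list of forms, each a list of (name, type) tuples."""
    parser = FormFieldLister()
    parser.feed(html)
    return parser.forms


# Invented login page: two visible fields and two hidden ones,
# all inside a single <form>.
SAMPLE = """
<form action="/login" method="post">
  <input type="text" name="code">
  <input type="password" name="pin">
  <input type="hidden" name="token" value="abc">
  <input type="hidden" name="dest" value="/items">
</form>
"""

if __name__ == "__main__":
    for i, fields in enumerate(list_form_fields(SAMPLE)):
        print("form", i, fields)
```

If a page like this comes back as a single form containing all four fields, the HTML is fine, and a tool that reports two forms is parsing it differently.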
I then experimented with Ruby’s mechanize module. I don’t really know Ruby, but I figured it wouldn’t be too hard to put together a script if the module worked. It didn’t. In fact, it didn’t recognize the hidden fields at all.
So I went back to the Python mechanize module and tried again. My suspicion was that the problem I’d had with mechanize a couple of months ago wasn’t due to a deficiency in mechanize itself[1] but to its crappy documentation.
Don’t get me wrong. There’s plenty of documentation for mechanize, but a lot of it is out of date and almost all of it is focused on the lower-level classes and functions.
Mechanize has a high-level class, called Browser, which I figured was the proper class to use, but there’s no real documentation on its methods. All the mechanize site has on Browser are some simple, incomplete examples on the home page.
I have to think mechanize’s programmers want us to use Browser, because it provides the most straightforward interface to mechanize’s capabilities. But by failing to provide complete documentation on its methods, they’re limiting the number of people who can use their library.
Two things put me on the right path for learning Browser:
- The source code itself.
- This blog post by Rogério Carvalho Schneider. Yes, it’s mostly a set of examples, but Schneider’s examples are more complete and cover more methods.
The relevant portion of my script is now this:
```python
# Go through each card, collecting the lists of items.
for card in cardList:
    # Open a browser and login
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open(lURL)
    br.select_form(nr=0)
    br.form['code'] = card['code']
    br.form['pin'] = card['pin']
    br.submit()

    # Go to the page for items checked out and get the HTML.
    br.open(cURL)
    cHtml = br.response().read()

    # Go to the page for items on hold and get the HTML.
    br.open(hURL)
    hHtml = br.response().read()
```
Looking back through the history of this script, I see that this isn’t terribly different from what this section looked like before the switch to twill, but it’s different enough to make this version work. And I have a better understanding of why it works.
That understanding will come in handy the next time the library’s webmaster decides to freshen up the site and break my script.
1. It really couldn’t be, as twill is built on mechanize.