Archiving tweets without IFTTT

Here’s a final bit of follow-up on the flurry of posts about Twitter and archiving tweets that I wrote earlier this month. The way I left it, I was using the IFTTT web service to append new tweets to a text file in my Dropbox folder. Since I have no control over IFTTT, my preference would be to use my own archiving script sitting on my own hard disk. Today I wrote that script.

Plenty of people have written scripts to do this. I wrote my own because I wanted one written in a language (Python) and using a Twitter library (tweepy) with which I’m familiar. I used tweepy a while ago in a script that posts tweets with images. Tweepy is nice in that it automatically handles conversions between the pure text that the Twitter API deals in and the objects—integers and datetime objects, in particular—that Python understands and can manipulate directly.
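
Here’s a quick illustration of what that buys you. This snippet isn’t part of the archiving script; it assumes api is the authorized tweepy API object the script sets up below.

python:
# Hypothetical check, assuming api is the tweepy.API object
# created in archive-tweets.py below.
latest = api.user_timeline('drdrang')[0]      # most recent tweet
print type(latest.created_at)                 # <type 'datetime.datetime'>
print type(latest.id)                         # an integer, not a string
print latest.created_at.year                  # no parsing required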

Here’s the script, called archive-tweets.py:

python:
 1:  #!/usr/bin/python
 2:  
 3:  import tweepy
 4:  import pytz
 5:  import os
 6:  
 7:  # Twitter parameters.
 8:  me = 'drdrang'
 9:  consumerKey = 'abc'
10:  consumerKeySecret = 'def'
11:  accessToken = 'ghi'
12:  accessTokenSecret = 'jkl'
13:  
14:  # Local file parameters.
15:  urlprefix = 'http://twitter.com/%s/status/' % me
16:  tweetdir = os.environ['HOME'] + '/Dropbox/twitter/'
17:  tweetfile = tweetdir + 'twitter.txt'
18:  idfile = tweetdir + 'lastID.txt'
19:  
20:  # Date/time parameters.
21:  datefmt = '%B %-d, %Y at %-I:%M %p'
22:  homeTZ = pytz.timezone('US/Central')
23:  utc = pytz.utc
24:  
25:  # This function pretty much taken directly from a tweepy example.
26:  def setup_api():
27:    auth = tweepy.OAuthHandler(consumerKey, consumerKeySecret)
28:    auth.set_access_token(accessToken, accessTokenSecret)
29:    return tweepy.API(auth)
30:  
31:  # Authorize.
32:  api = setup_api()
33:  
34:  # Get the ID of the last downloaded tweet.
35:  with open(idfile, 'r') as f:
36:    lastID = f.read().rstrip()
37:  
38:  # Collect all the tweets since the last one.
39:  tweets = api.user_timeline(me, since_id=lastID)
40:  
41:  # Write them out to the twitter.txt file.
42:  with open(tweetfile, 'a') as f:
43:    for t in reversed(tweets):
44:      ts = utc.localize(t.created_at).astimezone(homeTZ)
45:      lines = ['',
46:               t.text,
47:               ts.strftime(datefmt),
48:               urlprefix + t.id_str,
49:               '- - - - -',
50:               '']
51:      f.write('\n'.join(lines).encode('utf8'))
52:      lastID = t.id_str
53:  
54:  # Update the ID of the last downloaded tweet.
55:  with open(idfile, 'w') as f:
56:    f.write(lastID)

The first few sections define the parameters that control the way the script works. The Twitter parameters are my user name and the set of keys and secrets needed to interact with the Twitter API. The keys and secrets shown are obviously not the ones I really use. If you want to run your own version of this program, you’ll have to get your own keys and secrets from Twitter’s developer site. It’s not a big deal; I explained the steps in an earlier post.1

The local file parameters define where the tweet archive is, and the date/time parameters define how the dates will be formatted in the archive.2
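
As a quick check, here’s what that format string produces for a made-up timestamp. (The %-d and %-I codes suppress leading zeros; they work with the strftime on OS X and Linux but aren’t universal.)

python:
from datetime import datetime

# A made-up timestamp, just to show what datefmt produces.
print datetime(2012, 7, 22, 21, 5).strftime('%B %-d, %Y at %-I:%M %p')
# July 22, 2012 at 9:05 PM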

A key feature of the script is knowing how far back in my user timeline to look for new tweets. This is handled by storing the ID number of the last downloaded tweet in a file named lastID.txt, kept in the same folder as the archive file itself. When the script is run, Lines 35-36 open that file and read the number in it. This information is then used as the since_id parameter in the call to collect tweets from the user timeline in Line 39.

Update 7/24/12
As pointed out in the comments, I forgot to mention here that I took the ID of the last tweet in my archive—an archive I’d already built from tweets saved in ThinkUp—and used it to “seed” lastID.txt. In so doing, I avoided “file not found” errors.
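
If you’re doing the same thing, the seeding is just a one-time write to that file. Here’s a sketch; the ID shown is a placeholder, and you’d substitute the ID of the last tweet already in your archive.

python:
import os

# One-time seeding of lastID.txt. The ID below is made up;
# use the ID of the last tweet already in your archive.
tweetdir = os.environ['HOME'] + '/Dropbox/twitter/'
with open(tweetdir + 'lastID.txt', 'w') as f:
  f.write('224000000000000000')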

Since the user timeline returns the tweets in reverse chronological order, and I want to add them to the archive in chronological order, I reverse the list of tweets at the top of the loop in Line 43. For each tweet, I get the time stamp, which is in UTC, and convert it to my local time. I then write out the tweet’s text, its time stamp, and its URL, followed by a separator line (an entry like the one shown below), to the end of the archive file, which was opened for appending ('a') in Line 42.
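
Each entry in the archive ends up looking something like this made-up example; the format follows Lines 45-50 of the script.

    Testing the new tweet-archiving script.
    July 22, 2012 at 9:05 PM
    http://twitter.com/drdrang/status/227312345678901248
    - - - - -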

As the script loops through the list of tweets, it updates the lastID variable on Line 52. This is then used to overwrite the old value in lastID.txt in Lines 55-56.

(As an aside, this is the first script I’ve written to use the with open() as idiom, first introduced in Python 2.5. I’m late to the party, but I like it.)
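
For anyone else who hasn’t adopted it, the attraction is that the file gets closed automatically, even when something inside the block raises an exception. Lines 35-36, for example, are roughly equivalent to the older idiom:

python:
# Roughly what the with statement in Lines 35-36 does behind the scenes.
f = open(idfile, 'r')
try:
  lastID = f.read().rstrip()
finally:
  f.close()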

I certainly don’t want to run this script by hand, so I used Lingon 3 to create a Launch Agent that runs the script every hour on my work computer, a machine that’s always on.

Archive Tweets Launch Agent

Every hour is, I admit, extreme overkill. I’m only running it that often initially to test for bugs. In a day or two I’ll redefine the Launch Agent to run once a day.

The Launch Agent setup works only for Macs, but archive-tweets.py is a script that could run on any platform. If I were working on a Linux machine, I’d have its run times scheduled via cron. If I were working on Windows, I’d slit my throat and wouldn’t have to think about archiving tweets anymore.

Actually, unless I’ve screwed this script up, I can stay on OS X and still won’t have to think about archiving tweets anymore.

Update 7/22/12
The script seems to be working fine—it should, I did a fair amount of debugging and testing—but the Launch Agent started out wonky. First, using a script path of ~/Dropbox/twitter/archive-tweets.py caused problems. I should’ve known better than to use the tilde as an alias for my home directory. Going with a full, explicit path, /Users/drdrang/Dropbox/twitter/archive-tweets.py, eliminated some of the error messages.

Second, Lingon 3 added some self-identifying information to the plist file that launchd doesn’t like. The source code of com.leancrew.archivetweets.plist in my LaunchAgents folder was

xml:
 1:  <?xml version="1.0" encoding="UTF-8"?>
 2:  <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
 3:  <plist version="1.0">
 4:  <dict>
 5:    <key>Label</key>
 6:    <string>com.leancrew.archivetweets</string>
 7:    <key>LingonWhat</key>
 8:    <string>python /Users/drdrang/Dropbox/twitter/archive-tweets.py</string>
 9:    <key>ProgramArguments</key>
10:    <array>
11:      <string>python</string>
12:      <string>/Users/drdrang/Dropbox/twitter/archive-tweets.py</string>
13:    </array>
14:    <key>StartInterval</key>
15:    <integer>3600</integer>
16:  </dict>
17:  </plist>

The LingonWhat key in Line 7 was the source of this error:

7/22/12 9:43:52.900 PM com.apple.launchd.peruser.501: (com.leancrew.archivetweets) Unknown key: LingonWhat

Because that key/string pair doesn’t do anything, I just edited them out in TextMate, leading to

xml:
 1:  <?xml version="1.0" encoding="UTF-8"?>
 2:  <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
 3:  <plist version="1.0">
 4:  <dict>
 5:    <key>Label</key>
 6:    <string>com.leancrew.archivetweets</string>
 7:    <key>ProgramArguments</key>
 8:    <array>
 9:      <string>python</string>
10:      <string>/Users/drdrang/Dropbox/twitter/archive-tweets.py</string>
11:    </array>
12:    <key>StartInterval</key>
13:    <integer>3600</integer>
14:  </dict>
15:  </plist>

With that change, launchd began running the script with no errors and new tweets started accumulating in my archive file.

I’m going to shoot an email to Lingon 3’s developer, Peter Borg, to let him know about this error and ask why he puts that extraneous key/string pair in there.


  1. Because the script does nothing more than download publicly available tweets from the user’s timeline, I’m not certain that the keys and secrets are really necessary (as they are in a script that, for example, posts tweets). Because I already had a set of keys and secrets lying around from an earlier project, I just used those and didn’t explore any other possibilities. 

  2. The date/time format is slightly different from the one used by the IFTTT recipe, which I had no control over, and my ThinkUp-to-archive script, which I wrote to match the IFTTT recipe’s output. Rather than have two different formats in my archive, I did some regex find-and-replaces to change the old formats to this new one. 


18 Responses to “Archiving tweets without IFTTT”

  1. Jan Marcel says:

    Thanks a lot for yet another gem, doctor! Is there any value of lastID that could be used to fetch every tweet from a certain user, assuming such user has fewer than 3200 tweets? If not, do you believe the script could be easily adapted to accomplish that? Thanks once again!

  2. Dr. Drang says:

    Apart from the limit of 3200 total tweets, Twitter also restricts you to gathering no more than 200 tweets per request. Since I expect to be running this script at least once a day, that limit won’t be a problem. It would be a problem, though, if you tried to use it to grab a user’s entire history.

    If I needed to get a user’s full tweet history, I’d try this script by Chris Kinniburgh.

  3. Giovanni Lanzani says:

    Hi Doc,

    I’d change lines 35—39 of the script with

    try:
        with open(idfile, 'r') as f:
            lastID = f.read().rstrip()
        # Collect all the tweets since the last one.
        tweets = api.user_timeline(me, since_id=lastID)
    except IOError:
        # Collect all the tweets since the last one.
        tweets = api.user_timeline(me, count = 3200)

    because the first time you run the script you have no lastID.

  4. Lauri Ranta says:

    Doesn’t the API allow getting timelines without authentication? Or does tweepy require it for some reason?

    I’ve updated my Ruby scripts for getting full timelines and archiving tweets. The first one also replaces t.co links with the URLs returned by curl -L.

  5. Jon says:

    It might be worth a note that lastID.txt needs to be seeded with the ID of the tweet you want to start with before first run, or the script dies.

  6. Jon says:

    Looks like Giovanni beat me to the punch!

  7. Tim Bueno says:

    I just thought I would drop a link to the script I wrote that accomplishes this same task. My script, on first run, will grab all tweets a user has posted (up to the 3200 limit) and on subsequent runs will only grab yet-to-be-archived tweets.

    The link to my post about my script can be found here

  8. Dr. Drang says:

    Everyone,
    You’re all right; I forgot to mention that before I started running the script regularly, I put the ID of the archive’s last tweet in lastID.txt. I’ll make a note of that in the body of the post.

    Lauri,
    I mentioned the authentication issue in Footnote 1. If you follow Tim Bueno’s link, you’ll see that tweepy, like the Twitter API itself, doesn’t need any credentials to call user_timeline. I’ll probably rewrite that part of the code tonight.

    I stuck with the t.co links because that’s what was already in my archive from ThinkUp. I do have plans to use tweet entities to exchange them for the original URLs. Should be easy—I’ve already done something similar in Dr. Twoot.

    Giovanni,
    Have you tested your code? I was under the impression that count was limited to 200 per call to user_timeline, and 3200 is the limit over multiple calls.

    Tim,
    Thanks for linking to your script and answering the question I had about authentication. Your script has more functionality than I need, but would be a good choice for people who don’t already have a tweet archive file.

  9. Jan Marcel says:

    Thanks for the reply, doctor! I’ve successfully used Tim Bueno’s script to create my first tweet archive file. Thanks, Tim! ;-)

    However, I’ve run into a small glitch while using Tim’s script as is. Probably because he didn’t use any credentials to call user_timeline, I got the following error message when I executed the script for the first time:

    Gathering unarchived tweets... 
    Traceback (most recent call last):
      File "./autotweetArchiver.py", line 61, in <module>
        statuses = api.user_timeline(count=200, include_rts=True, since_id=idValue, screen_name=theUserName)
      File "/Library/Python/2.7/site-packages/tweepy-1.10-py2.7.egg/tweepy/binder.py", line 185, in _call
      File "/Library/Python/2.7/site-packages/tweepy-1.10-py2.7.egg/tweepy/binder.py", line 168, in execute
    tweepy.error.TweepError: Rate limit exceeded. Clients may not make more than 150 requests per hour.
    

    Since I also had a set of keys lying around, I simply changed the Twitter API initialization to make use of those (copying from Dr. Drang’s script) instead of delving into some debugging. Fortunately, that workaround did the trick.

    Doctor, since you are planning to rewrite that part of your code in order to get rid of the necessity of Twitter API keys and secrets, maybe you’d like to double check that requirement before proceeding. As of now, your script is working flawlessly.

    Should someone from this blog’s astute audience run into the same error message as I did, my modification of Tim’s script can be found on Github.

    Once again, thank you very much to Dr. Drang and Tim Bueno for their scripts.

  10. Tim Bueno says:

    Hmmmm…. I cannot seem to reproduce the Rate Limit error you showed. I’ll look into it if I have time later tonight. Looks like you’ve got it working in your version though.

    For what it’s worth, I’ve been using my script for the past few weeks without error. I have it automated with cron right now but you could use Lingon and launchd like Dr. Drang and accomplish the same thing.

    Thanks for checking out my script!

  11. Dr. Drang says:

    Incredibly, tweepy doesn’t seem to support tweet entities, so changing t.co links to the original URLs won’t be as easy as I thought. Lauri does it by forking a curl process, but I’d rather not do it that way. I may look into adding entity support to my fork of tweepy, but don’t hold your breath.

  12. Jan Marcel says:

    The error message is intermittent for me, but I’ve been able to reproduce it a couple of times. Please share with the rest of us if you find out what may be causing it, but there’s absolutely no need to sweat over it.

    Thank you for sharing your work! :-)

  13. Giovanni Lanzani says:

    Yes, my initial script is wrong, as count cannot exceed 200. I had to use Tim’s script too to get the initial archive.

  14. Lauri Ranta says:

    The Ruby library has a method for converting t.co URLs, but it broke a few months ago.

    I don’t like the curl method either, but it’s surprisingly fast since you only need to download the headers. It also expands double-shortened URLs and doesn’t use up API calls.

  15. Dr. Drang says:

    Lauri,
    Getting the original URLs from tweet entities (assuming I had a library that could do that) wouldn’t use any extra API calls. The entities are included in the response to the user_timeline call.

    But I hadn’t considered that curl keeps following redirects. In a world where people are still using bit.ly and other shorteners—that then get “shortened” again by t.co—that’s a real advantage.

  16. Ganter Ludwig says:

    I have just released an iPhone app for easily searching and archiving the tweets of specific accounts. It works with both public and private accounts. It can export all tweet archives as plain text or raw JSON, and soon as a spreadsheet. Storing a copy of the raw JSON response from Twitter captures all the data in a tweet, not just the main text and date.

    http://tweetkeeperapp.com

  17. Dr. Drang says:

    I wish Ganter well, but I don’t understand why one would want such an app on his phone. The whole point of this script, and of the IFTTT recipes that do the same thing, is to make the archiving automatic—an iPhone app can’t do that, can it?

  18. Pascal says:

    Hi Dr Drang, came across this from Ben Brookes and wondered if you’d checked out SocialSafe ( http://socialsafe.net )? It schedules backups for Twitter, FB, Instagram, Linkedin and others to create an offline searchable db and journal. It’s not perfect but it’s a pretty painless process ;) Would be interested in your thoughts. Happy to send you over a licence if you’d like to give it a spin