Completing my Twitter archive

My complete Twitter archive became available today (yes, I’ve been checking every day), and I just got done processing it to fit in with my existing archive.1 Here’s what I did.

After downloading and unzipping the archive, I had a new tweets directory in my Downloads folder. The tweets themselves are a few levels deep in the tweets directory and are provided in two formats: JSON and CSV. For reasons I hope will become clear soon, I decided to work with the JSON files. They're all in a single subdirectory, one file per month, with names formatted like this: yyyy_mm.js.

[Screenshot: Twitter archive hierarchy]

The nice thing about this file naming scheme is that it makes alphabetical and chronological order the same, something I exploited when running my script.
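
As a quick illustration (and assuming the files live in the data/js/tweets subdirectory of the unzipped archive, which is where mine were), a plain alphabetical sort of the names is also a chronological sort:

python:
import glob

# yyyy_mm.js names sort the same way alphabetically and chronologically,
# so a plain sorted() puts the months in date order.
files = sorted(glob.glob('data/js/tweets/*.js'))
# files[0] is the earliest month (2007_11.js for me),
# files[-1] is the most recent.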

My first month of tweets, 2007_11.js, looks like this:

Grailbird.data.tweets_2007_11 = 
 [ {
  "source" : "<a href=\"http://twitterrific.com\" rel=\"nofollow\">Twitterrific</a>",
  "entities" : {
    "user_mentions" : [ ],
    "media" : [ ],
    "hashtags" : [ ],
    "urls" : [ ]
  },
  "geo" : {
  },
  "id_str" : "459419062",
  "text" : "This NY Times headline is backward: http://xrl.us/bb3xe. Cart before horse, tail wagging dog.",
  "id" : 459419062,
  "created_at" : "Sat Dec 01 02:52:47 +0000 2007",
  "user" : {
    "name" : "Dr. Drang",
    "screen_name" : "drdrang",
    "protected" : false,
    "id_str" : "10697232",
    "profile_image_url_https" : "https://si0.twimg.com/profile_images/2892920204/2d46631044a7476ad8b355e205ad1e6d_normal.png",
    "id" : 10697232,
    "verified" : false
  }
}, {
  "source" : "<a href=\"http://twitterrific.com\" rel=\"nofollow\">Twitterrific</a>",
  "entities" : {
    "user_mentions" : [ ],
    "media" : [ ],
    "hashtags" : [ ],
    "urls" : [ ]
  },
  "geo" : {
  },
  "id_str" : "455042882",
  "text" : "Panera's wifi signon page used to crash Safari. Not so with Leopard and Safari 3.",
  "id" : 455042882,
  "created_at" : "Thu Nov 29 17:14:28 +0000 2007",
  "user" : {
    "name" : "Dr. Drang",
    "screen_name" : "drdrang",
    "protected" : false,
    "id_str" : "10697232",
    "profile_image_url_https" : "https://si0.twimg.com/profile_images/2892920204/2d46631044a7476ad8b355e205ad1e6d_normal.png",
    "id" : 10697232,
    "verified" : false
  }
}, {
  "source" : "<a href=\"http://twitterrific.com\" rel=\"nofollow\">Twitterrific</a>",
  "entities" : {
    "user_mentions" : [ ],
    "media" : [ ],
    "hashtags" : [ ],
    "urls" : [ ]
  },
  "geo" : {
  },
  "id_str" : "453535262",
  "text" : "Safari Stand now reinstalled on my Leopard machine and all's right with the world.",
  "id" : 453535262,
  "created_at" : "Thu Nov 29 06:04:55 +0000 2007",
  "user" : {
    "name" : "Dr. Drang",
    "screen_name" : "drdrang",
    "protected" : false,
    "id_str" : "10697232",
    "profile_image_url_https" : "https://si0.twimg.com/profile_images/2892920204/2d46631044a7476ad8b355e205ad1e6d_normal.png",
    "id" : 10697232,
    "verified" : false
  }
} ]

As you can see, it starts with a line that defines a variable. The rest of the file sets the variable to a list of objects. I haven’t looked at this especially carefully, but I assume this whole file is basically just one legal JavaScript assignment statement.

What I see when I look at this file, though, is something that’s damned close to a legal Python assignment. The only problems are

  1. the first line, where the dotted variable name, Grailbird.data.tweets_2007_11, isn’t a legal Python identifier; and
  2. the Boolean values, which JavaScript writes as true and false but Python writes as True and False.

These are minor problems and we’ll get around both of them.

My goal is to pull out all the tweets in the archive and write them to a single plain text file with the same format as my existing archive. As I write this, the last couple of entries in that archive look like this:

Complained yesterday about not being able to download my Twitter archive. Today it’s available.
January 17, 2013 at 3:35 PM
http://twitter.com/drdrang/status/292022255536439296
- - - - -

@gruber The reason everyone thinks Apple needs new hit products is because that’s how Apple got to where it is now from where it was in ’97.
January 17, 2013 at 4:33 PM
http://twitter.com/drdrang/status/292036995331522562
- - - - -

Each entry goes tweet, date/time, URL, dashed line, blank line.

After a bit of interactive experimenting in IPython (a tool I should have been using long ago), I came up with this script, which I call extract-tweets.py:

python:
 1:  #!/usr/bin/python
 2:  
 3:  from datetime import datetime
 4:  import pytz
 5:  import sys
 6:  
 7:  # Universal convenience variables
 8:  utc = pytz.utc
 9:  instrp = '%a %b %d %H:%M:%S +0000 %Y'
10:  false = False
11:  true = True
12:  
13:  # Convenience variables specific to me
14:  homeTZ = pytz.timezone('US/Central')
15:  urlprefix = 'http://twitter.com/drdrang/status/'
16:  outstrf = '%B %-d, %Y at %-I:%M %p'
17:  
18:  # The list of JSON files to process is assumed to be given as
19:  # the arguments to this script. They are also expected to be in
20:  # chronological order. This is the way they come when the 
21:  # Twitter archive is unzipped.
22:  for m in sys.argv[1:]:
23:    f = open(m)
24:    f.readline()            # discard the first line
25:    tweets = eval(f.read())
26:    tweets.reverse()        # we want them in chronological order
27:    for t in tweets:
28:      text = t['text']
29:      url = urlprefix + t['id_str']
30:      dt = utc.localize(datetime.strptime(t['created_at'], instrp))
31:      dt = dt.astimezone(homeTZ)
32:      date = dt.strftime(outstrf)
33:      print '''%s
34:  %s
35:  %s
36:  - - - - -
37:  ''' % (text, date, url)

In Lines 10 and 11, we define two variables, true and false, to have the Python values True and False. This turns all the trues and falses in the JSON into legal Python. On Line 24, you see that after opening a file, we read the first line but don’t do anything with it. The purpose of this line is to move the file’s position marker to the start of the next line, so that the read call in Line 25 doesn’t include the first line. Everything sucked up by that read is legal Python and can be eval’d to create a list of dictionaries, which is assigned to the variable tweets.
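
Pulled out of the script, the trick is just this (a bare-bones sketch of the idea, using the example file above):

python:
true, false = True, False      # make the JSON Booleans legal Python

f = open('2007_11.js')
f.readline()                   # throw away the Grailbird assignment line
tweets = eval(f.read())        # the rest evaluates to a list of dictionaries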

With this bit of trickery out of the way, the rest of the script is pretty straightforward. There’s some messing about with timezones because I want my timestamps to reflect my timezone, not UTC, which is what Twitter saves. The conversion is done through the pytz library and follows the same general outline I used in an earlier archiving script.
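
The conversion itself boils down to just a few lines. Here it is with the timestamp of my first tweet (a sketch with my Central time zone and output format hard-coded):

python:
from datetime import datetime
import pytz

created = 'Thu Nov 29 06:04:55 +0000 2007'     # Twitter stores UTC
dt = datetime.strptime(created, '%a %b %d %H:%M:%S +0000 %Y')
dt = pytz.utc.localize(dt).astimezone(pytz.timezone('US/Central'))
print(dt.strftime('%B %-d, %Y at %-I:%M %p'))  # November 29, 2007 at 12:04 AM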

The guts of the script is a loop that opens and processes each file given on the command line. With the script saved in the same directory as the JSON files, I ran

python extract-tweets.py *.js > full-twitter.txt

to get all my tweets in one file with the format described above.

As it happens, the Twitter archive I downloaded this evening didn’t include any of today’s tweets. So I copied today’s tweets from the archive I keep in Dropbox, which gets updated every hour, pasted them onto the end of full-twitter.txt, and saved it in Dropbox to be regularly updated from now on.

According to the archive, this was my first tweet:

Safari Stand now reinstalled on my Leopard machine and all’s right with the world.
Dr. Drang (@drdrang) Thu Nov 29 2007 12:04 AM CST

Remember Safari Stand? Those were the days.


  1. Why did I need to download Twitter’s version of my archive when I already had one of my own? Because of Twitter’s limitation on how many tweets you can retrieve through the API, my archive was missing my first thousand or so tweets. Now I have them and can sleep peacefully again. 


10 Responses to “Completing my Twitter archive”

  1. Kevin Lipe says:

    My Twitter archive became available today, too. I had given up on it, but when you tweeted that yours was available for download, I checked again, and there it was.

    I’ve found that my tweets are mostly useless now, though, five years later. They just remind me that five years ago I was a lot dumber. (I assume that five years from now, things I just posted today will remind me of how dumb I am now.)

  2. Dr. Drang says:

    In contrast, Kevin, I’m finding myself to have been much smarter five years ago than I am now. Witty, urbane, scintillating. Not like the senile husk I’ve become.

  3. Ricky says:

    Hi! I don’t know Python, so I may be confused, but at least in Perl, it would be really easy to just use JSON; and read the file directly, without having to do anything that would really need explanation. Surely Python has something similar — is there another reason not to use it?

    Thanks, Ricky

  4. David Avraamides says:

    Yes, as @Ricky states, there is a json module in Python (since 2.6) so you could have done:

    import json
    tweet = json.load(open(filename))
    

    There are also a number of options to the load method for doing custom conversions, if necessary: http://docs.python.org/2/library/json.html.

  5. Dr. Drang says:

    David, I tried json.load() last night and it just returned errors. Don’t know why, but because I could see that eval() would work for me (and because this is a one-time conversion) I took the easy way out.

    In general, though, you are absolutely correct, and if this were a script that would be used again and again, I would’ve taken the time to figure out what I was doing wrong with json.load().

    Later addition: Come to think of it, the JSON data I’ve dealt with in the past has never included an assignment—it’s always been just the data structure itself.
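
    If I were doing it again, something like this would probably work (assuming, as the files above suggest, that everything after the first line is plain JSON):

    import json

    with open('2007_11.js') as f:
        f.readline()                   # skip the Grailbird assignment line
        tweets = json.loads(f.read())  # the rest is ordinary JSON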

  6. Jan Marcel says:

    Hi, doctor!

    Thanks a lot for the script! It worked flawlessly for me. However, I just noticed that my Twitter archive is not complete. Comparing the output of your script with my existing archive via diff, there are a lot of differences. For example:

    2591,2595d2585
    < @junecloud I see… I wondered that maybe there was an API you could use. Well, thanks anyway for your response. You and Delivery Status rock!
    < August 1, 2011 at 1:24 PM
    < http://twitter.com/jgmarcel/status/98066552800284672
    < - - - - -
    < 
    

    As you can easily check, that tweet indeed exists. However, it’s nowhere to be found in my exported Twitter archive:

    $ grep -lr 98066552800284672 data/js/tweets
    $ 
    

    I know this can only be a bug on Twitter’s part, but given that this post’s title is “Completing my Twitter archive”, be aware that your tweet archive may still be incomplete. :-/

  7. Dr. Drang says:

    Thanks for the tip, Jan. I didn’t delete my previous archive, so I can check if Twitter shortchanged me, too.

  8. Jan Marcel says:

    My pleasure, doc. Please let us know if that’s indeed the case.

  9. Gil Creque says:

    I have to say I really geeked out when I saw your solution for handling the difference in Boolean constants because the solution I came up with in my head was nowhere near as simple and elegant as a simple variable assignment. Thanks for sharing.

  10. roland says:

    very good article on twitter…