A few tweet archive utilities

You’re probably sick of my posts about making a local tweet archive. So this one’s about querying that archive and finding the tweets that are in it. You can, of course, open your archive file up in a text editor and use its search tools, but often it’s more convenient to do these things from the command line.

Tweet count

Let’s start with a simple one. This shell script, which I call tweet-count, returns the number of tweets in the archive.

bash:
#!/bin/bash

fgrep -- '- - - - -' ~/Dropbox/twitter/twitter.txt | wc -l | tr -d ' '

It’s just a pipeline using some classic Unix tools. fgrep is just like grep, only it treats the search string literally, not as a regular expression. If you recall from my earlier post, my tweet archive uses lines with five hyphens separated by spaces as the tweet separator. Because the leading hyphen makes the search string look like a command-line switch, I needed to put that double hyphen before it to tell fgrep that it was done processing switches.
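
Another way to handle a pattern that starts with a hyphen is the standard -e switch, which the greps I know of all accept; this should give the same count:

fgrep -e '- - - - -' ~/Dropbox/twitter/twitter.txt | wc -l | tr -d ' '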

The fgrep command returns all the separator lines, one for each tweet. That output gets piped to wc, which normally counts lines, words, and characters. The -l option tells wc to count only lines and return that number. This will be the number of tweets in the archive.

For reasons not entirely clear to me, wc indents its output. I didn’t want the tweet count to be indented, so I passed the output through one more command: tr. With the -d option, tr deletes from its input every character that appears in the given set. Here, I’m telling it to delete spaces.
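
To see the padding, and tr stripping it out, try something like this (the quoted output in the comments is what the BSD wc on a Mac prints; GNU wc doesn’t pad):

printf 'a\nb\nc\n' | wc -l              # "       3"
printf 'a\nb\nc\n' | wc -l | tr -d ' '  # "3"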

Update 8/9/12
As Aristotle Pagaltzis pointed out in the comments, this version of tweet-count is unnecessarily complicated. There’s a -c switch for fgrep that returns the number of matches directly:

fgrep -c -- '- - - - -' ~/Dropbox/twitter/twitter.txt

Last tweet

The next command, last-tweet, returns the last tweet in the archive. I’m not sure I’ll be using this much in the future, but I used it a lot while I was debugging my tweet archiving script and making sure it was run periodically.

bash:
#!/bin/bash

tail ~/Dropbox/twitter/twitter.txt | perl -n0777e '@a=split("- - - - -"); $a[-2] =~ s/^\s+//; print $a[-2]'

It starts by using the tail command to get the last ten lines of the archive, then passes that to a Perl one-liner that does the following (the same command is written out more readably after the list):

  1. Slurps in tail’s output as a single string via the -n and -0777 options.
  2. Splits that string at the separator lines into a list of strings, @a. Because the archive has a separator line at the end, the last tweet will be the second to last item in the list, $a[-2].
  3. Strips the leading whitespace from the last tweet.
  4. Prints the cleaned-up last tweet.
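
For anyone who finds the clustered switches hard to read, here’s the same command spread over a few lines with comments; it should behave exactly like the one-liner above:

tail ~/Dropbox/twitter/twitter.txt |
  perl -0777 -n -e '
    @a = split("- - - - -");   # break the slurped text at the separator lines
    $a[-2] =~ s/^\s+//;        # the next-to-last piece is the last tweet
    print $a[-2];              # print it, stripped of leading whitespace
  '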

Running last-tweet now returns

Just got an email from Jason Quackenbush. Surprisingly, it’s neither spam nor Groucho Marx.
August 8, 2012 at 2:14 PM
http://twitter.com/drdrang/status/233279936209768448

Finding tweets

This script, called find-tweet, is the most useful. It’s a Perl script that runs through the archive, printing out the tweets that match the search string.

perl:
 1:  #!/usr/bin/perl
 2:  
 3:  $/ = "- - - - -\n";
 4:  
 5:  open TWEETS, '<', "$ENV{'HOME'}/Dropbox/twitter/twitter.txt" or die $!;
 6:  while (<TWEETS>) {
 7:    if (/$ARGV[0]/i) {
 8:      print $_;
 9:    }
10:  }

Line 3 sets Perl’s input record separator to the tweet separator string (with the trailing line break). This causes the while loop starting on Line 6 to read the file one tweet at a time. Line 7 tests whether the tweet contains the search string, and Line 8 prints it out (including the separator line) if it does. The i after the final slash in Line 7 specifies that the search should be case-insensitive, which is pretty much always what I want. It’s been so long since I was deeply into Perl regular expressions that I can’t remember whether an m or s would be helpful here, too. So far, I haven’t needed them.

Update 8/9/12
In the comments, Ricky showed me the three-argument version of open, which I’ve incorporated in the code above. I don’t remember running across it before; is it possible that version didn’t exist when I was learning Perl in the ’90s?

The search string passed to find-tweet will be treated as a regular expression, so care must be taken if periods, parentheses, or other special characters are part of the search. Because the shell also treats certain characters as special (semicolons and spaces, in particular), it’s probably best to wrap all but the simplest search strings in single quotes.
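
A multi-word search, for example, needs the quotes just to get past the shell:

find-tweet 'hit by a pickup'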

Most of the time, though, I find that simple search strings are what I use. For example,

find-tweet pickup

returns

Hit by a pickup while on my bike. Scraped up, bruised, and sore, but otherwise OK. Bike is OK, too. Wife, who saw it all, is a bit shaken.
April 25, 2009 at 9:09 AM
http://twitter.com/drdrang/status/1613159792
- - - - -

The pickup driver did stop and was ticketed. I rode home and am now at one of our endless Little League games.
April 25, 2009 at 10:39 AM
http://twitter.com/drdrang/status/1613747648
- - - - -

My one bit of satisfaction: the pickup that hit me was new, and I put a dent in its fender.
April 25, 2009 at 11:42 AM
http://twitter.com/drdrang/status/1614185190
- - - - -

You’ll note that the tweets themselves aren’t word-wrapped to make them easier to read. I thought about doing that but decided it wasn’t worth the effort. The Terminal wraps the results well enough.


4 Responses to “A few tweet archive utilities”

  1. Ricky says:

    Hi! I know that this is somewhat irrelevant to the topic of this post, but there are a few changes you should make to your Perl script. In particular, your use of the (insecure) 2-argument form of the open function should be remedied. If something should, either by accident or by malicious intent, manage to change the value of the HOME environment variable to start with a ‘>’, you will lose your twitter archive. A better form would be:

        open TWEETS, '<', "$ENV{'HOME'}/Dropbox..."
    

    I also prefer using lexical file handles (as open my $in, '<', "..."), but for a simple program like this, I will admit it is not as important.

  2. Aristotle Pagaltzis says:

    You can write tweet-count much more simply:

    fgrep -c -- '- - - - -' ~/Dropbox/twitter/twitter.txt
    

    You can also rather shorten last-tweet:

    tail -r ~/Dropbox/twitter/twitter.txt | sed -n '1d; /- - - - -/q; /./p' | tail -r
    

    I also concur with Ricky about find-tweet.

  3. Dr. Drang says:

    Thanks to both Ricky and Aristotle for their improvements. I’ll be making changes to the scripts shortly.

    Although I think it very unlikely that $ENV{'HOME'} will ever get a greater-than sign prepended to it (and if it does, I’ll have more trouble than just an overwritten tweet archive), forcing the file to be read-only is the right thing to do.

    As for Aristotle’s suggestions, using fgrep’s -c option is a no-brainer that I managed not to have the brains to learn and use. Shows how shallow my grep knowledge really is.

    While Aristotle’s rewrite of last-tweet is smarter and shorter than mine, I’ll stick with my Perl monstrosity. Although I have been able to write in sed by continually referring to a manual, I’ve never been able to look at a sed script more complicated than a simple substitution and immediately know what it does. I prefer to use scripts I understand.

  4. Charles Pergiel says:

    This strikes me as a make-work project, interesting only as an exercise in programming. But maybe not. I don’t use Twitter, I don’t tweet, and I don’t follow, but I occasionally see other people’s tweets posted here and there on the net, and some of them are rather, er, what’s the word? “Spot on” isn’t quite it, but that’s sort of what I meant. So maybe there is some use for this.

    On the other hand, it strikes me as kind of like the current video surveillance fad: let’s record video of everything all the time and keep it forever. Who is ever going to watch it? If you have an infinite quantity of data, how can you ever find anything useful?