Improved RSS subscriber count script

When I got my RSS subscriber count email this morning, I knew something was wrong because the count was about twice what it was the day before. I’ve found and fixed the bug that can overcount subscribers who come through Google Reader, but there’s still one more bug that needs to be fixed.

The script I run to calculate the subscriber count is a slight variation on Marco Arment’s original. Surprisingly, the overcounting bug has nothing to do with my changes; it’s in Marco’s code.

Here’s the problem code.

bash:
# Google Reader subscribers and other user-agents reporting "subscribers" and using the "feed-id" parameter for uniqueness:
GRSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | egrep -o '[0-9]+ subscribers; feed-id=[0-9]+' | sort | uniq | cut -d' ' -f 1 | awk '{s+=$1} END {print s}'`

That’s a really long pipeline, so let’s look at each part individually:

1. fgrep "$LOG_FDATE" "$LOG_FILE" pulls out of the log file only the entries for today’s date.
2. fgrep " $RSS_URI" then keeps only the hits on the RSS feed.
3. egrep -o '[0-9]+ subscribers; feed-id=[0-9]+' extracts just the subscriber count and feed-id from each Google Reader request.
4. sort | uniq reduces those lines to one for each distinct count/feed-id combination.
5. cut -d' ' -f 1 keeps only the subscriber count at the front of each line.
6. awk '{s+=$1} END {print s}' adds up the counts and prints the total.

The problem is in the uniq command. If your subscriber count changes during the course of the day—not an uncommon occurrence—you’ll get lines that look like this after the sort:

2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551

The uniq command will not treat these as duplicates because they aren’t. It’ll convert these lines into

2735 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551

You can see the problem. Because uniq keeps two lines associated with the same feed-id, we’re counting most of those subscriptions twice. That’s why my subscriber count this morning was nearly twice what it should have been.

What we need is a way to tell uniq to return just one line for each feed-id. Ideally, we’d like the line from the end of the day, because that’s the most up-to-date count.

Here’s my solution. Instead of a simple sort | uniq, I do this:

sort -t= -k2 -s | tac | uniq -f2

The -t= -k2 options tell sort to reorder the lines on the basis of what comes after the equals sign, which is the feed-id. The -s option ensures that the sort is stable, that is, that lines with the same feed-id appear in their original chronological order after the sort.

The tac command then reverses all the lines, so for each feed-id the top line will be the one associated with the last inquiry of the day. This will be important after our next step.

The -f2 option tells uniq to ignore the first two “fields” of each line, where fields are separated by white space. In other words, it decides on the uniqueness of a line by looking at the feed-id=1234567890 part only. This will turn a section like

2737 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551

into

2737 subscribers; feed-id=9141626367700991551

which is just what we want.
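
To convince yourself, you can run a few fabricated lines through the new commands. The feed-ids and counts here are made up, and the lines are in the order they’d appear in a day’s log:

printf '%s\n' \
'2735 subscribers; feed-id=9141626367700991551' \
'12 subscribers; feed-id=1234500000000000000' \
'2737 subscribers; feed-id=9141626367700991551' \
'15 subscribers; feed-id=1234500000000000000' \
| sort -t= -k2 -s | tac | uniq -f2

The output is one line per feed-id, each with the last count of the day:

2737 subscribers; feed-id=9141626367700991551
15 subscribers; feed-id=1234500000000000000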

You can see now why I reversed the lines after sorting. When uniq is used without options, the line it chooses to retain doesn’t matter because they’re all the same. But when we use the -f option, it’s important to know which of the “duplicate” lines (which are duplicates over only a portion of their length) is returned. As it happens, it’s the first “duplicate” that’s returned, so by reversing the stable sort in the previous step, uniq -f2 returns the line from the last hit of the day for each feed-id.

The remainder of the pipeline is unchanged, so my new version of Marco’s script gets the Google Reader subscriber count like this:

bash:
# Google Reader subscribers and other user-agents reporting "subscribers" and using the "feed-id" parameter for uniqueness:
GRSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | egrep -o '[0-9]+ subscribers; feed-id=[0-9]+' | sort -t= -k2 -s | tac | uniq -f2 | awk '{s+=$1} END {print s}'`

Update 9/28/12
When I first wrote this, I didn’t understand how the -s and -r options worked in sort. I thought the -r would reverse the lines after the stable sort, but that’s not what it does. To do what I wanted, I needed the line reversal as a separate command. tac (so named because it acts like cat in reverse) filled the bill.
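
A toy example shows the difference. A reversed stable sort,

printf 'a 1\na 2\nb 1\n' | sort -k1,1 -sr

produces

b 1
a 1
a 2

because -r reverses only the key comparisons; the two “a” lines compare equal, so the stable sort keeps them in their original order. Running the stable sort through tac,

printf 'a 1\na 2\nb 1\n' | sort -k1,1 -s | tac

produces

b 1
a 2
a 1

which puts the last line of each group first. That’s the behavior the pipeline needs.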

Also, as one of the commenters on Marco’s script pointed out, there’s no need to do the cut when awk can sum over the first field directly.
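
A quick check with a couple of fabricated lines shows awk summing the first field without any help from cut:

printf '2735 subscribers; feed-id=1\n12 subscribers; feed-id=2\n' | awk '{s+=$1} END {print s}'

This prints 2747.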

There’s a similar bug in the pipeline for other aggregators that provide a subscriber count in their access log entries:

bash:
# Other user-agents reporting "subscribers", for which we'll use the entire user-agent string for uniqueness:
OTHERSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | fgrep -v 'subscribers; feed-id=' | egrep '[0-9]+ subscribers' | egrep -o '"[^"]+"$' | sort | uniq | egrep -o '[0-9]+ subscribers' | awk '{s+=$1} END {print s}'`

As you can see, this pipeline does the same sort | uniq thing and will double count subscribers to an aggregator if that aggregator’s subscriber figure changes during the course of the day. Unfortunately, I’m not sure how to fix this problem. Because the identifier that distinguishes one aggregator from another doesn’t necessarily come after the subscriber count in these log lines, I don’t know how to trick uniq into behaving the way I want.

For example, if I run only the part of the pipeline up through uniq, I get these lines from the NewsGator aggregator:

"NewsGatorOnline/2.0 (http://www.newsgator.com; 1 subscribers)"
"NewsGatorOnline/2.0 (http://www.newsgator.com; 3 subscribers)"
"NewsGatorOnline/2.0 (http://www.newsgator.com; 4 subscribers)"

These shouldn’t be added together, but I can’t tell uniq to consider only the first field of each line—it doesn’t have an option for that. I’m pretty sure a Perl one-liner could do it, but my Perl is a little rusty at the moment. If you can whip one up, or if you have a better idea, I’d like to hear about it.

As a practical matter, aggregators that report subscribers but aren’t Google Reader make up such a small part of my total subscriber base that double counting them has little effect. Even if there’s no solution to this problem, it won’t make much difference.

Double-counting Google Reader subscribers, though, is a big deal. If you’re using Marco’s script, you should change that pipeline.

Update 9/28/12
Well, there is a solution, and Marco provided it (mostly) in the comments. The trick lies in using awk arrays and the post-increment (++) operator. Here’s the improved code:

bash:
# Other user-agents reporting "subscribers", for which we'll use the entire user-agent string for uniqueness:
OTHERSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | fgrep -v 'subscribers; feed-id=' | egrep '[0-9]+ subscribers' | egrep -o '"[^"]+"$' | tac | awk -F\( '!x[$1]++' | egrep -o '[0-9]+ subscribers' | awk '{s+=$1} END {print s}'`

Instead of sort | uniq, this uses

tac | awk -F\( '!x[$1]++'

The tac command reverses the lines, which puts them in reverse chronological order.

The clever part—due to Marco—is the awk one-liner that returns just the first line of each aggregator. Note first that it’s all pattern: if the value of the pattern is true, the line is printed (print being awk’s default command); if it’s false, nothing is printed. So the script acts as a filter, printing only those lines for which the pattern evaluates to true.

Truth is determined through the value of an associative array, x. As awk reads through the lines of the file, items of x are created with keys that come from the first field, $1, and values that are incremented for every line with a matching first field. The first field is everything before the first open parenthesis, which—in my logs, anyway—corresponds to the name of the aggregator.

The trick is that the ++ incrementing operator acts after x[$1] is evaluated. The first time a new value of $1 is encountered, the value of x[$1] is zero. In the context of an awk pattern, this is false. The not operator, !, flips that to true, and the line is printed. For subsequent lines with that same $1, the value of x[$1] will be a positive number—a true value which the ! will flip to false. Thus, the subsequent lines with the same $1 aren’t printed.

You’ll note that there’s no sort in this pipeline. Because we’re using an awk filter instead of uniq, we don’t have to have the lines grouped by aggregator before running the filter.
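
To see the filter in action, feed it a few user-agent strings. These are fabricated (the SomeOtherReader agent is invented), and they’re already in reverse chronological order, as they would be after the tac:

printf '%s\n' \
'"NewsGatorOnline/2.0 (http://www.newsgator.com; 4 subscribers)"' \
'"NewsGatorOnline/2.0 (http://www.newsgator.com; 3 subscribers)"' \
'"SomeOtherReader/1.1 (2 subscribers)"' \
| awk -F\( '!x[$1]++'

Only the first line from each aggregator survives:

"NewsGatorOnline/2.0 (http://www.newsgator.com; 4 subscribers)"
"SomeOtherReader/1.1 (2 subscribers)"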

I won’t pretend I understood the awk one-liner as soon as I saw it. I seldom use awk and have never written a script that used arrays in the pattern. But once I started to catch on, I realized it was very much like some Perl programs I’ve seen that build associative arrays to count word occurrences.

Update 9/28/12
I see that Marco has updated his script since I first posted. My changes to the pipeline differ a little from his, so I set up my own fork of his script.


10 Responses to “Improved RSS subscriber count script”

  1. Marco says:

    Nice catch. Thanks. How about this modification for OTHERSUBS:

    OTHERSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | fgrep -v 'subscribers; feed-id=' | egrep '[0-9]+ subscribers' | egrep -o '"[^"]+"$' | sort -t\( -k2 -sr | awk '!x[$1]++' | egrep -o '[0-9]+ subscribers' | awk '{s+=$1} END {print s}'`
    

    (I wish I could preview this to check for markup misinterpretations.)

  2. Dr. Drang says:

    The awk filter is very clever, Marco. Amazing how terse you can be with auto-initializing variables.

    I don’t understand why you’re keying the sort to the portion of the line after the first opening parenthesis. That portion includes the subscription count, so the sort would put the line with the highest count at the top, regardless of which line came last in the day. I’ve been trying out sort -k1,1 -r | awk '!x[$1]++', which seems to work well.

    Also, I think what I wrote in the post about the -s option is wrong, at least when it’s used in combination with -r. I need to understand that better.

  3. Dr. Drang says:

OK, I think this is what I want:

    sort -k1,1 -s | tac | awk '!x[$1]++'
    

    And I think I’ll be using tac (the reverse of cat) in the Google Reader pipeline, too, but now it’s time to put this away and get back to work.

  4. Aristotle Pagaltzis says:

    Perl one-liner as requested:

    OTHERSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | perl -MList::Util=sum -nE'$subs{$2} = $1 if /([0-9]+) subscribers; feed-id=([0-9]+)/; END { say sum values %subs }'`
    

    I tried getting rid of those fgreps but a number of reasons conspire to make any solution in either Awk or Perl less succinct and a lot less readable than just piping fgreps together.

  5. Dr. Drang says:

    Aristotle,
    Your script fails on my server with an “unrecognized switch” error on the -E. Did you mean -e?

    And because you have a feed-id in your regex, you’re actually changing GRSUBS, not OTHERSUBS. The idea for OTHERSUBS would be the same, you’d just need to use another key for your hash.
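
    Something along these lines might work for OTHERSUBS. It’s an untested sketch: it keys the hash on the user-agent name before the first open parenthesis, assumes the user-agent string ends the log line, and sticks with print so it runs under an old Perl. Since later lines overwrite earlier ones, each agent ends up with its last reported count of the day:

    OTHERSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | fgrep -v 'subscribers; feed-id=' | perl -MList::Util=sum -ne'$subs{$1} = $2 if /"([^"(]+)\([^"]*?([0-9]+) subscribers[^"]*"$/; END { print sum(values %subs), "\n" }'`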

  6. Marco says:

    First, I can’t take credit for that clever awk expression — I found it in a forum somewhere while searching for solutions to various parts of this.

    Second, I noticed a weird problem with my GR subscriber count from yesterday: the normally consistent total was 60% lower than the day before. Mine’s split into about 20 feed-ids with mostly tiny subscriber counts, but the one with the highest subscriber count didn’t hit my server at all yesterday. It did, however, hit it the day before.

    Given the unique feed-ids and your last-request counting, I figure it doesn’t necessarily need to be limited to just that day: if Google Reader crawls one of my feed-ids yesterday but not today, those are still subscribers that should count.

    I modified the script to do a basic 1-day lookback for the Google Reader fetch, and it works well, returning results consistent with what I’ve seen in the past with Feedburner.

    New version: https://gist.github.com/3783146

    As far as I know, this incorporates your other changes except the email removal.

  7. telemachus says:

    @Dr. Drang,

    The -E flag enables all of the newer features of Perl automatically. (That allows Aristotle, for example, to use say rather than print with a \n.)

    I don’t know exactly what version of Perl added this feature, but your server’s version must be older than that addition.

  8. telemachus says:

    And of course, as soon as I post my first comment, I find it. The -E flag was added in Perl 5.10. I would say your server’s Perl is due for an upgrade since 5.10 came out in December of 2007.

  9. Dr. Drang says:

    Thanks, telemachus. Yes, my server is running an ancient version of Perl:

    perl --version
    
    This is perl, v5.8.8 built for x86_64-linux
    
    Copyright 1987-2006, Larry Wall
    

    I suspect there’s a more recent version somewhere on the machine, but that’s the default.

  10. Dr. Drang says:

    Unsurprisingly, this script can give inaccurate results if it’s run shortly after the log file is archived and restarted from scratch. On my server, that happens at or near the end of the month, so the counts are low until the running log file gets replenished with a full day of hits.