Improved RSS subscriber count script
September 27th, 2012 at 10:12 pm by Dr. Drang
When I got my RSS subscriber count email this morning, I knew something was wrong because the count was about twice what it was the day before. I’ve found and fixed the bug that can overcount the subscribers through Google Reader, but there’s still one more bug that needs to be fixed.
The script I run to calculate the subscriber count is a slight variation on Marco Arment’s original. Surprisingly, the overcounting bug has nothing to do with my changes; it’s in Marco’s code.
Here’s the problem code.
bash:
# Google Reader subscribers and other user-agents reporting "subscribers" and using the "feed-id" parameter for uniqueness:
GRSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | egrep -o '[0-9]+ subscribers; feed-id=[0-9]+' | sort | uniq | cut -d' ' -f 1 | awk '{s+=$1} END {print s}'`
That’s a really long pipeline, so let’s look at each part individually:
fgrep "$LOG_FDATE" "$LOG_FILE" reads the site’s access log file ($LOG_FILE is defined as the path to the log file earlier in the script) and returns only those lines associated with yesterday (again, $LOG_FDATE is defined as yesterday’s date earlier in the script).
fgrep " $RSS_URI" returns only those lines accessing the site’s feed URL (yes, $RSS_URI is defined earlier in the script).
egrep -o '[0-9]+ subscribers; feed-id=[0-9]+' returns only those lines that have both a subscriber count and a feed-id definition. This is characteristic of hits from Google’s FeedFetcher for Reader. The -o option tells egrep to return only the portion of the line that matches the regular expression, so we’re left with lines that look like this:
2735 subscribers; feed-id=9141626367700991551
sort sorts the lines alphabetically.
uniq eliminates duplicate lines that are adjacent to one another. This sort | uniq construct is common in shell scripts. Because uniq only eliminates duplicates if they are adjacent, the sort is needed to make them adjacent.
cut -d' ' -f 1 returns just the subscriber count for each line, which is before the first space character.
awk '{s+=$1} END {print s}' adds up all the counts and returns the sum.
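To make the extraction step concrete, here is a small sketch that runs it on a single made-up log line. The IP address, timestamp, path, and byte count are invented for illustration, and the user-agent string is only assumed to resemble what FeedFetcher sends. grep -E stands in for egrep.

```bash
#!/bin/sh
# Hypothetical access log line; every value in it is made up for illustration.
line='192.0.2.10 - - [26/Sep/2012:08:15:02 -0500] "GET /feed/ HTTP/1.1" 200 48211 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 2735 subscribers; feed-id=9141626367700991551)"'

# -o keeps only the part of the line that matches the regular expression.
match=$(printf '%s\n' "$line" | grep -E -o '[0-9]+ subscribers; feed-id=[0-9]+')
echo "$match"

# The subscriber count is the text before the first space.
count=$(printf '%s\n' "$match" | cut -d' ' -f 1)
echo "$count"
```

Everything else on the log line is discarded, which is why the later steps can treat each line as a count followed by a feed-id.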
The problem is in the uniq command. If your subscriber count changes during the course of the day—not an uncommon occurrence—you’ll get lines that look like this after the sort:
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
The uniq command will not treat these as duplicates because they aren’t. It’ll convert these lines into
2735 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
You can see the problem. Because uniq keeps two lines associated with the same feed-id, we’re counting most of those subscriptions twice. That’s why my subscriber count this morning was nearly twice what it should have been.
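The overcount is easy to reproduce outside the script. This sketch feeds the six sample lines above through the tail end of the original pipeline; the variable names are mine, not from Marco's script.

```bash
#!/bin/sh
# The six matches for a single feed-id, as they appear after extraction.
matches='2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551'

# Tail of the original pipeline: uniq keeps one line per distinct count,
# so both the 2735 line and the 2737 line survive and are added together.
total=$(printf '%s\n' "$matches" | sort | uniq | cut -d' ' -f 1 | awk '{s+=$1} END {print s}')
echo "$total"   # 5472, i.e. 2735 + 2737
```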
What we need is a way to tell uniq to return just one line for each feed-id. Ideally, we’d like the line from the end of the day, because that’s the most up-to-date count.
Here’s my solution. Instead of a simple sort | uniq, I do this:
sort -t= -k2 -s | tac | uniq -f2
The -t= -k2 options tell sort to reorder the lines on the basis of what comes after the equals sign, which is the feed-id. The -s option ensures that the sort is stable, that is, that lines with the same feed-id appear in their original order after the sort.
The tac command then reverses all the lines, so for each feed-id the top line will be the one associated with the last inquiry of the day. This will be important after our next step.
The -f2 option tells uniq to ignore the first two “fields” of each line, where fields are separated by white space. In other words, it decides on the uniqueness of a line by looking at the feed-id=1234567890 part only. This will turn a section like
2737 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
2735 subscribers; feed-id=9141626367700991551
into
2737 subscribers; feed-id=9141626367700991551
which is just what we want.
You can see now why I reversed the lines after sorting. When uniq is used without options, the line it chooses to retain doesn’t matter because they’re all the same. But when we use the -f option, it’s important to know which of the “duplicate” lines (which are duplicates over only a portion of their length) is returned. As it happens, it’s the first “duplicate” that’s returned, so by reversing the stable sort in the previous step, uniq -f2 returns the line from the last hit of the day for each feed-id.
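Here is a minimal sketch of the sort, tac, uniq sequence on a few lines in chronological order. The second feed-id and its counts are hypothetical, added only to show that lines from different feeds can be interleaved in the log.

```bash
#!/bin/sh
# Extracted matches in the order they appeared in the log.
# The feed-id 1111111111111111111 and its counts are made up.
matches='2735 subscribers; feed-id=9141626367700991551
12 subscribers; feed-id=1111111111111111111
2735 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
13 subscribers; feed-id=1111111111111111111'

# Stable sort by feed-id, reverse all lines, keep the first line per feed-id.
kept=$(printf '%s\n' "$matches" | sort -t= -k2 -s | tac | uniq -f2)
echo "$kept"

# Sum the counts that survived.
total=$(printf '%s\n' "$kept" | awk '{s+=$1} END {print s}')
echo "$total"   # 2750, i.e. 2737 + 13
```

Only the last count of the day for each feed-id survives, so a count that changes during the day is no longer added twice.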
The remainder of the pipeline is unchanged apart from dropping the cut step, so my new version of Marco’s script gets the Google Reader subscriber count like this:
bash:
# Google Reader subscribers and other user-agents reporting "subscribers" and using the "feed-id" parameter for uniqueness:
GRSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | egrep -o '[0-9]+ subscribers; feed-id=[0-9]+' | sort -t= -k2 -s | tac | uniq -f2 | awk '{s+=$1} END {print s}'`
Update 9/28/12
When I first wrote this, I didn’t understand how the -s and -r options worked in sort. I thought the -r would reverse the lines after the stable sort, but that’s not what it does. To do what I wanted, I needed the line reversal as a separate command. tac (so named because it acts like cat in reverse) filled the bill.
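The difference between the two approaches shows up on a small sample. This sketch assumes GNU sort, where -r reverses the key comparison but a stable sort still leaves lines with equal keys in their input order; the second feed-id is hypothetical.

```bash
#!/bin/sh
# Extracted matches in chronological order.
# The feed-id 1111111111111111111 and its counts are made up.
matches='2735 subscribers; feed-id=9141626367700991551
12 subscribers; feed-id=1111111111111111111
2735 subscribers; feed-id=9141626367700991551
2737 subscribers; feed-id=9141626367700991551
13 subscribers; feed-id=1111111111111111111'

# With -r, the feed-ids come out in descending order, but within each
# feed-id the earliest line is still first, so uniq keeps the stale counts.
with_r=$(printf '%s\n' "$matches" | sort -t= -k2 -s -r | uniq -f2 | awk '{s+=$1} END {print s}')

# With tac, every line is reversed, so the latest line per feed-id is first.
with_tac=$(printf '%s\n' "$matches" | sort -t= -k2 -s | tac | uniq -f2 | awk '{s+=$1} END {print s}')

echo "sort -r:    $with_r"     # 2747 (2735 + 12, the first counts of the day)
echo "sort | tac: $with_tac"   # 2750 (2737 + 13, the last counts of the day)
```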
Also, as one of the commenters on Marco’s script pointed out, there’s no need to do the cut when awk can sum over the first field directly.
There’s a similar bug in the pipeline for other aggregators that provide a subscriber count in their access log entries:
bash:
# Other user-agents reporting "subscribers", for which we'll use the entire user-agent string for uniqueness:
OTHERSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | fgrep -v 'subscribers; feed-id=' | egrep '[0-9]+ subscribers' | egrep -o '"[^"]+"$' | sort | uniq | egrep -o '[0-9]+ subscribers' | awk '{s+=$1} END {print s}'`
As you can see, this pipeline does the same sort | uniq thing and will double count subscribers to an aggregator if that aggregator’s subscriber figure changes during the course of the day. Unfortunately, I’m not sure how to fix this problem. Because the identifier that distinguishes one aggregator from another doesn’t necessarily come after the subscriber count in these log lines, I don’t know how to trick uniq into behaving the way I want.
For example, if I run just the part of the pipeline up through uniq, I get these lines from the NewsGator aggregator:
"NewsGatorOnline/2.0 (http://www.newsgator.com; 1 subscribers)"
"NewsGatorOnline/2.0 (http://www.newsgator.com; 3 subscribers)"
"NewsGatorOnline/2.0 (http://www.newsgator.com; 4 subscribers)"
These shouldn’t be added together, but I can’t tell uniq to consider only the first field of each line—it doesn’t have an option for that. I’m pretty sure a Perl one-liner could do it, but my Perl is a little rusty at the moment. If you can whip one up, or if you have a better idea, I’d like to hear about it.
As a practical matter, aggregators that report subscribers but aren’t Google Reader make up such a small part of my total subscriber base that double counting them has little effect. Even if there’s no solution to this problem, it won’t make much difference.
Update 9/28/12
Well, there is a solution, and Marco provided it (mostly) in the comments. The trick lies in using awk arrays and the post-increment (++) operator. Here’s the improved code:
bash:
# Other user-agents reporting "subscribers", for which we'll use the entire user-agent string for uniqueness:
OTHERSUBS=`fgrep "$LOG_FDATE" "$LOG_FILE" | fgrep " $RSS_URI" | fgrep -v 'subscribers; feed-id=' | egrep '[0-9]+ subscribers' | egrep -o '"[^"]+"$' | tac | awk -F\( '!x[$1]++' | egrep -o '[0-9]+ subscribers' | awk '{s+=$1} END {print s}'`
Instead of sort | uniq, this uses
tac | awk -F\( '!x[$1]++'
The tac command reverses the lines, which puts them in reverse chronological order.
The clever part—due to Marco—is the awk one-liner that returns just the first line of each aggregator. Note first that it’s all pattern: if the value of the pattern is true, the line is printed (print being awk’s default command); if it’s false, nothing is printed. So the script acts as a filter, printing only those lines for which the pattern evaluates to true.
Truth is determined through the value of an associative array, x. As awk reads through the lines of the file, items of x are created with keys that come from the first field, $1, and values that are incremented for every line with a matching first field. The first field is everything before the first open parenthesis, which—in my logs, anyway—corresponds to the name of the aggregator.
The trick is that the ++ incrementing operator acts after x[$1] is evaluated. The first time a new value of $1 is encountered, the value of x[$1] is zero. In the context of an awk pattern, this is false. The not operator, !, flips that to true, and the line is printed. For subsequent lines with that same $1, the value of x[$1] will be a positive number—a true value which the ! will flip to false. Thus, the subsequent lines with the same $1 aren’t printed.
You’ll note that there’s no sort in this pipeline. Because we’re using an awk filter instead of uniq, we don’t have to have the lines grouped by aggregator before running the filter.
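A short sketch shows the filter working on unsorted input. The NewsGator lines are the ones from my logs; "ExampleReader" is a hypothetical second aggregator, and grep -E stands in for egrep.

```bash
#!/bin/sh
# User-agent strings in chronological order. "ExampleReader" is made up
# and is interleaved with the NewsGator lines on purpose.
agents='"NewsGatorOnline/2.0 (http://www.newsgator.com; 1 subscribers)"
"ExampleReader/1.0 (http://reader.example.com; 7 subscribers)"
"NewsGatorOnline/2.0 (http://www.newsgator.com; 3 subscribers)"
"NewsGatorOnline/2.0 (http://www.newsgator.com; 4 subscribers)"'

# Reverse the lines, then print a line only the first time its
# first field (the text before the open parenthesis) is seen.
kept=$(printf '%s\n' "$agents" | tac | awk -F\( '!x[$1]++')
echo "$kept"

# Extract the counts from the surviving lines and add them up.
total=$(printf '%s\n' "$kept" | grep -E -o '[0-9]+ subscribers' | awk '{s+=$1} END {print s}')
echo "$total"   # 11, i.e. 4 + 7
```

The NewsGator lines with 1 and 3 subscribers are dropped even though another aggregator's line sits between them, which is something uniq could not do without a sort first.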
I won’t pretend I understood the awk one-liner as soon as I saw it. I seldom use awk and have never written a script that used arrays in the pattern. But once I started to catch on, I realized it was very much like some Perl programs I’ve seen that build associative arrays to count word occurrences.
Double-counting Google Reader subscribers, though, is a big deal. If you’re using Marco’s script, you should change that pipeline.
Update 9/28/12
I see that Marco has updated his script since I first posted. My changes to the pipeline differ a little from his, so I set up my own fork of his script.



September 27th, 2012 at 11:34 pm
Nice catch. Thanks. How about this modification for OTHERSUBS:
(I wish I could preview this to check for markup misinterpretations.)
September 28th, 2012 at 9:20 am
The awk filter is very clever, Marco. Amazing how terse you can be with auto-initializing variables.
I don’t understand why you’re keying the sort to the portion of the line after the first opening parenthesis. That portion includes the subscription count, so the sort would put the line with the highest count at the top, regardless of which line came last in the day. I’ve been trying out sort -k1,1 -r | awk '!x[$1]++', which seems to work well.
Also, I think what I wrote in the post about the -s option is wrong, at least when it’s used in combination with -r. I need to understand that better.
September 28th, 2012 at 9:37 am
OK, I think this is what I want
And I think I’ll be using tac (the reverse of cat) in the Google Reader pipeline, too, but now it’s time to put this away and get back to work.
September 28th, 2012 at 11:33 pm
Perl one-liner as requested:
I tried getting rid of those fgreps but a number of reasons conspire to make any solution in either Awk or Perl less succinct and a lot less readable than just piping fgreps together.
September 29th, 2012 at 12:15 am
Aristotle,
Your script fails on my server with an “unrecognized switch” error on the -E. Did you mean -e?
And because you have a feed-id in your regex, you’re actually changing GRSUBS, not OTHERSUBS. The idea for OTHERSUBS would be the same, you’d just need to use another key for your hash.
September 29th, 2012 at 10:55 am
First, I can’t take credit for that clever awk expression: I found it in a forum somewhere while searching for solutions to various parts of this.
Second, I noticed a weird problem with my GR subscriber count from yesterday: the normally consistent total was 60% lower than the day before. Mine’s split into about 20 feed-ids with mostly tiny subscriber counts, but the one with the highest subscriber count didn’t hit my server at all yesterday. It did, however, hit it the day before.
Given the unique feed-ids and your last-request counting, I figure it doesn’t necessarily need to be limited to just that day: if Google Reader crawls one of my feed-ids yesterday but not today, those are still subscribers that should count.
I modified the script to do a basic 1-day lookback for the Google Reader fetch, and it works well, returning results consistent with what I’ve seen in the past with Feedburner.
New version: https://gist.github.com/3783146
As far as I know, this incorporates your other changes except the email removal.
September 29th, 2012 at 5:18 pm
@Dr. Drang,
The -E flag enables all of the newer features of Perl automatically. (That allows Aristotle to use say for example rather than print and a \n.)
I don’t know exactly what version of Perl added this feature, but your server’s version must be older than that addition.
September 29th, 2012 at 5:21 pm
And of course, as soon as I post my first comment, I find it. The -E flag was added in Perl 5.10. I would say your server’s Perl is due for an upgrade since 5.10 came out in December of 2007.
September 29th, 2012 at 6:43 pm
Thanks, telemachus. Yes, my server is running an ancient version of Perl:
I suspect there’s a more recent version somewhere on the machine, but that’s the default.
September 30th, 2012 at 9:41 am
Unsurprisingly, this script can give inaccurate results if it’s run shortly after the log file is archived and restarted from scratch. On my server, that happens at or near the end of the month, so the counts are low until the running log file gets replenished with a full day of hits.