# Some file exploration with Unix tools

In Sunday’s post, I mentioned I’ve been having trouble with intermittent Internal Server Errors here for the past couple of months. On Tuesday, Seth Brown sent me this tweet:

@drdrang You may be aware of this, but your site appears to be down.
Seth Brown (@DrBunsen) Sep 23 2014 2:22 PM

He wasn’t kidding. I logged into the server’s control panel and saw that the virtual memory had been pegged for about half an hour. After a little digging, I found that one IP number, 144.76.32.147, was continually making requests. After checking with whois to make sure that number didn’t belong to something like Google—which I wouldn’t want to block—I added this line to my .htaccess file:

deny from 144.76.32.147


The virtual memory use backed off and the site was soon accessible again. Crisis averted, but it did give me more impetus to shift from WordPress to a static blog and possibly to a new web host.

Today I decided to investigate Tuesday’s problem a little more. First, I found that the top ten pages for the day were goofy.

Sep 23, 2014 pages
1033  /all-this/
469  /all-this/tag/programming/
460  /all-this/tag/blogging/
413  /all-this/tag/python/
324  /all-this/tag/mac/
312  /all-this/tag/wordpress/
294  /all-this/2014/09/end-of-summer-miscellany/
207  /all-this/tag/text-editing/
18139  total


I don’t think a single “tag” page has ever been in a day’s top ten, but on Tuesday there were seven. It looked like the hits from 144.76.32.147 were the result of a web spider run amok. I downloaded my Apache log file for the month and pulled out all the entries for that IP number.

awk '/144\.76\.32\.147/ {print}' < leancrew.com > spider.txt


Before opening the file in BBEdit, I checked how many entries there were with

wc -l spider.txt


and learned that 144.76.32.147 was responsible for over 25,000 hits. Since 25,000-line files are just an appetizer for BBEdit, I opened up spider.txt and looked at the details.
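When the offending address isn’t known ahead of time, a per-address tally of the log will usually surface it. The following is a minimal sketch, not the digging described above: the sample log lines are made up (two of the addresses come from the ranges reserved for documentation), and it assumes only that the client address is the first field on each line, as it is in Apache’s standard log formats.

```shell
# Made-up sample log; a real Apache log file would take its place.
log=$(mktemp)
cat > "$log" <<'EOF'
144.76.32.147 - - [23/Sep/2014:08:52:10 -0500] "GET /all-this/ HTTP/1.1" 200 5120
203.0.113.5 - - [23/Sep/2014:08:53:00 -0500] "GET /all-this/ HTTP/1.1" 200 5120
144.76.32.147 - - [23/Sep/2014:08:53:12 -0500] "GET /all-this/tag/python/ HTTP/1.1" 200 4096
198.51.100.7 - - [23/Sep/2014:08:54:30 -0500] "GET /all-this/ HTTP/1.1" 200 5120
144.76.32.147 - - [23/Sep/2014:08:54:31 -0500] "GET /all-this/tag/mac/ HTTP/1.1" 200 4096
203.0.113.5 - - [23/Sep/2014:08:55:00 -0500] "GET /all-this/tag/mac/ HTTP/1.1" 200 4096
EOF

# Count the lines for each address, then list the busiest addresses first.
top=$(awk '{n[$1]++} END {for (ip in n) print n[ip], ip}' "$log" |
  sort -rn | head -n 3)
echo "$top"
rm -f "$log"
```

An address that is out of control will sit at the top of the list with a count far larger than anything below it.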

I learned that all of the hits from 144.76.32.147 had come on Tuesday. They started before 9:00 that morning and kept going until a little before 2:30, when I blocked access. There were intermittent 404 responses throughout the day, but they turned into mostly 404s shortly before 2:00. That matched up pretty well with what I’d learned about the site hitting the virtual memory limit.
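The same sort of timeline can be pulled out without scrolling through the file. This sketch counts hits by hour and status code. It assumes the entries are in Apache’s combined log format, where the fourth whitespace-separated field holds the timestamp and the ninth holds the status code; the sample lines are invented to stand in for spider.txt.

```shell
# Made-up sample entries; the real spider.txt would take the place of this file.
log=$(mktemp)
cat > "$log" <<'EOF'
144.76.32.147 - - [23/Sep/2014:08:52:10 -0500] "GET /all-this/ HTTP/1.1" 200 5120 "-" "crawler"
144.76.32.147 - - [23/Sep/2014:08:59:41 -0500] "GET /all-this/tag/python/ HTTP/1.1" 200 4096 "-" "crawler"
144.76.32.147 - - [23/Sep/2014:13:55:02 -0500] "GET /all-this/tag/mac/ HTTP/1.1" 404 312 "-" "crawler"
144.76.32.147 - - [23/Sep/2014:13:58:30 -0500] "GET /all-this/tag/blogging/ HTTP/1.1" 404 312 "-" "crawler"
EOF

# $4 looks like [23/Sep/2014:08:52:10, so splitting it on ":" puts the
# hour in t[2]. $9 is the status code. Count each hour/status pair.
tally=$(awk '{split($4, t, ":"); n[t[2] " " $9]++}
             END {for (k in n) print k, n[k]}' "$log" | sort)
echo "$tally"
rm -f "$log"
```

Each output line is an hour, a status code, and a count, so a shift from successful responses to errors shows up as a change in the middle column as the hours go by.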

The user agent string for all of these hits was the same:

Mozilla/5.0 (X11; compatible; semantic-visions.com crawler; HTTPClient 3.1)


That confirmed that the hits were coming from a web spider, one apparently written by the company named in the string, Semantic Visions.

Well, that certainly looks like a group that crawls the web, but maybe what I was seeing was someone else misusing Semantic Visions’ software. I ran

dig semantic-visions.com +short


to get their IP number. It was 144.76.32.142. Not exactly the number that had been hitting me, but only five away in the last octet. Certainly part of Semantic Visions’ block of addresses. So Semantic Visions was the culprit.

In their own words:

We run an unrivaled web-mining system that provides an unprecedented view of what is going on in the world, at any moment.

and

We run an unrivaled web-mining system around a powerful Ontology which is unparalleled in its flexibility, reach and sophistication.

They certainly did an unprecedented mining of my site, hitting every page dozens or hundreds of times over the course of five to six hours. The runaway growth of their process, though, put me more in the mind of oncology than ontology.

Knowing that Semantic Visions is likely to be using a block of IP numbers, I rejiggered my search through the Apache log file,

awk '/144\.76\.32\./ {print}' < leancrew.com > semantic.txt


and found that they started crawling my site again today, using IP numbers from 144.76.32.132 through 144.76.32.148. Although their spider seemed to be well behaved today, I’ve blocked that whole range of numbers anyway. Fuck them and the ontology they rode in on.
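Both the address range and the block rules can come out of short pipelines. This is a sketch rather than a record of what was actually done: the sample log lines are made up, and writing one `deny from` line per address is just one way to cover 144.76.32.132 through 144.76.32.148 with the same directive used for the single-address block.

```shell
# Made-up stand-in for semantic.txt; only the first field (the client
# address) matters here.
log=$(mktemp)
cat > "$log" <<'EOF'
144.76.32.148 - - [25/Sep/2014:10:01:00 -0500] "GET /all-this/ HTTP/1.1" 200 5120
144.76.32.132 - - [25/Sep/2014:10:02:00 -0500] "GET /all-this/ HTTP/1.1" 200 5120
144.76.32.147 - - [23/Sep/2014:08:52:10 -0500] "GET /all-this/ HTTP/1.1" 200 5120
144.76.32.132 - - [25/Sep/2014:10:03:00 -0500] "GET /all-this/ HTTP/1.1" 200 5120
EOF

# Distinct addresses, ordered numerically by the last octet.
addrs=$(awk '{print $1}' "$log" | sort -u | sort -t . -k 4,4n)
echo "$addrs"

# One deny line for every address from the first to the last in the range.
first=132
last=148
rules=$(awk -v a="$first" -v b="$last" \
  'BEGIN {for (i = a; i <= b; i++) print "deny from 144.76.32." i}')
echo "$rules"
rm -f "$log"
```

The first and last lines of the address list give the ends of the range, and the generated rules can be pasted into .htaccess as they are.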

Blocking Semantic Visions was fun, but it isn’t a long-term solution. As I thought today about building a static blogging system, I figured it’d be worthwhile to have a list of all my posts in chronological order. I already have the Markdown input for each post saved locally in a directory structure that looks like this:

yyyy/mm/slug.md


(The slug is a transformation of the post’s title that WordPress uses to generate the URL.)

Unfortunately, this doesn’t allow me to figure out the order of the posts within each month. But that information is in the header of each file. Here’s an example header:

Title: Oh, Domino!
Keywords: news
Date: 2005-09-02 23:05:12
Post: 32
Slug: oh-domino
Status: publish


The Date and Slug lines are what I need to make the ordered list. In Terminal, after cd’ing to the directory that contains all the year folders, I ran this pipeline

cat */*/*.md | awk '/^Date:/ {printf "%s %s ", $2,$3; split($2, d, "-")}; /^Slug:/ {printf "%s/%s/%s.md\n", d[1], d[2],$2}' | sort > all-posts.txt


which creates a file called all-posts.txt that’s filled with lines that look like this:

2014-09-12 00:04:03 2014/09/terminal-velocity.md
2014-09-13 21:25:40 2014/09/sgml-nostalgia.md
2014-09-16 08:31:54 2014/09/pcalc-construction-set.md
2014-09-16 20:34:06 2014/09/scipy-and-image-analysis.md
2014-09-19 09:14:45 2014/09/an-unexpected-decision.md
2014-09-21 21:39:02 2014/09/end-of-summer-miscellany.md
2014-09-24 00:47:09 2014/09/engineering-language.md
2014-09-24 09:37:28 2014/09/shortcut-stupidity.md


The cat command concatenates all 2,000-plus articles, which is about eight megabytes. Old-time Unix users may blanch at this, but we’re not living in the ’80s anymore—eight megabytes is nothing. This chunk of text is then fed to awk, which does the following:

• Extracts the date and time from the Date line.
• Prints them to stdout.
• Splits the components of the date and stores them in an array called d.
• Extracts the filename (sans extension) from the Slug line.
• Prints the year, month, and filename (with extension) separated by slashes to stdout. This is the path to the Markdown source file.

At this point I have a chunk of text with a line for every post, though not yet in chronological order. Because of the way the date and time are formatted, lexical order is the same as chronological order, so a simple pass through sort does what I need. Finally, the text is saved to a file.
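To watch the steps work on something small, here is the same pipeline run against a throwaway directory. The two sample files are stand-ins: their Date and Slug lines are copied from the listing above, but the Title lines and body text are placeholders, and the temporary directory is just scaffolding for the demonstration.

```shell
# Throwaway tree with two stand-in posts, laid out as yyyy/mm/slug.md.
root=$(mktemp -d)
mkdir -p "$root/2014/09"
cat > "$root/2014/09/terminal-velocity.md" <<'EOF'
Title: Placeholder title
Date: 2014-09-12 00:04:03
Slug: terminal-velocity

Placeholder body text.
EOF
cat > "$root/2014/09/sgml-nostalgia.md" <<'EOF'
Title: Placeholder title
Date: 2014-09-13 21:25:40
Slug: sgml-nostalgia

Placeholder body text.
EOF

# The glob feeds sgml-nostalgia.md to cat first (alphabetical order),
# so it is the final sort that puts the lines in chronological order.
listing=$(cd "$root" && cat */*/*.md |
  awk '/^Date:/ {printf "%s %s ", $2, $3; split($2, d, "-")}
       /^Slug:/ {printf "%s/%s/%s.md\n", d[1], d[2], $2}' | sort)
echo "$listing"
rm -rf "$root"
```

Note that the pattern matching isn’t limited to the header. A line in the body of a post that happened to start with Date: or Slug: would be picked up too, so the output is worth a quick look before relying on it.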

As I get further down the road on building my system, it may turn out that I don’t need this file. But at least I didn’t expend much effort in making it.

Sunday’s post prompted a few suggestions for a new web host. After the downtime on Tuesday, Seth suggested Amazon’s S3. I spent a little time today looking into it, and it looks like a good choice. The upside is reliability and scale; the downside is unfamiliarity. I’m used to configuring Apache and will have to learn some new skills to tweak an S3 site. But I think it’ll be worth it. I can’t imagine Semantic Visions taking down S3.