This one time, at band camp, Earle Martin wrote:
- Somebody at 80.68.93.162 has been slurping the whole site: indexes,
pages, and all the format varieties (rdf, raw), using WWW::Mechanize at a rate of one request per second.
This is something we can fix! That IP resolves to ivorw.vm.bytemark.co.uk. Ivor, care to back off the slurping for a bit?
a) and b) are going to happen this afternoon; c), d) and e) hopefully within a day or two; f) as soon as e) is complete. If load *still* goes too high after the spamwall has been raised, it may indicate a deeper problem meriting heavy investigation, and I'll put the site back into maintenance mode again.
I think a rudimentary publishing mode would be helpful. The front page doesn't need to be rebuilt from the database on every request, so publishing it as a static page might be a good idea. I suspect rendering the latest changes there on the fly causes a fair bit of the pain.
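A minimal sketch of what such a publishing mode could look like (in Python for illustration; `render_front_page` is a hypothetical stand-in for the expensive database-backed build): regenerate the static copy on a timer or on edit, instead of per request.

```python
import os
import tempfile
import time


def render_front_page():
    # Hypothetical stand-in for the expensive render; on the real wiki
    # this would query the database for recent changes and so on.
    return "<html><body>Latest changes as of %s</body></html>" % (
        time.strftime("%Y-%m-%d %H:%M"),
    )


def publish_front_page(path):
    # Write to a temp file and rename, so the web server never
    # serves a half-written page.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(render_front_page())
    os.replace(tmp, path)  # atomic rename on POSIX


if __name__ == "__main__":
    # Run from cron (or triggered by a page save), not per request.
    publish_front_page("front_page.html")
```

The web server then serves `front_page.html` directly, and the database only gets hit once per regeneration rather than once per visitor.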
Incidental thought: since the original robots.txt standard doesn't allow wildcards, perhaps we should put 'rel="nofollow"' on all our links to resources that are useless to search engines. I believe the better-behaved robots respect this.
Indeed, this should most certainly be done for all revisions of a page except the current one, so that if someone reverts spam without the admin password, the spam revision isn't indexed by the crawlers.
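Concretely, it would look something like this (the URL scheme here is just illustrative, not the wiki's actual one):

```html
<!-- current revision: crawlers may follow and index it -->
<a href="/wiki/SomePage">SomePage</a>

<!-- old revision: hint to crawlers not to follow the link -->
<a href="/wiki/SomePage?rev=3" rel="nofollow">revision 3</a>
```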
PS: I strongly doubt it's Google causing the problems; Googlebot is a very well-behaved bot. Others, like MSN's, are much less well behaved.