On Thu, Jul 26, 2007 at 01:24:50AM +0100, Rev Simon Rumble wrote:
> Indeed, this should most certainly be done for all revisions of a page except the current, so that if someone reverts spam without the admin password, it's not indexed by the crawlers.
> PS: I strongly doubt it's Google causing problems. Google is a very well behaved bot. Others like the MSN one are much less well behaved.
I'm not convinced of that.
Google routinely and regularly fetches *large* pages on the Open Guide to Boston that almost never change. Think Category Restaurant page -- 1MB page, changes maybe once a week, Google fetches it daily.
Granted, OG Boston is particularly poorly optimized for this because we use index_list in our category pages. The actual index_value, etc. mode in wiki.cgi is significantly more lightweight. (Bad decision on my part.) But you don't have to have someone fetching much data to hurt a site, and even if Google is only requesting things slowly, it can still ask for pages faster than the server can return them.
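To make that last point concrete, here is a rough sketch in Python. The numbers are made up for illustration, not measured on OG Boston, and it assumes a single worker that handles one request at a time. It counts how many requests are still unanswered after a given period when a crawler asks at a fixed interval and each page takes a fixed time to generate.

```python
def backlog_after(duration_s, request_interval_s, service_time_s):
    """Count requests not yet answered after duration_s seconds.

    Hypothetical model: one worker, one request at a time, requests
    arriving at t = 0, interval, 2 * interval, and so on.
    """
    free_at = 0        # time at which the worker next becomes idle
    unfinished = 0
    k = 0
    while k * request_interval_s <= duration_s:
        arrival = k * request_interval_s
        start = max(arrival, free_at)      # wait if the worker is busy
        free_at = start + service_time_s   # time this request completes
        if free_at > duration_s:
            unfinished += 1                # still queued or in progress
        k += 1
    return unfinished


if __name__ == "__main__":
    # One request every 10 seconds is a slow crawl, but if a heavy
    # page takes 30 seconds to build, the queue keeps growing.
    print(backlog_after(3599, 10, 30))
    # With a 5 second page the same crawl rate is harmless.
    print(backlog_after(3599, 10, 5))
```

Under these assumptions the crawl rate alone says little; what matters is whether the interval between requests is shorter than the time each page takes to serve.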
Regards,