On 7/26/07, Christopher Schmidt <crschmidt(a)crschmidt.net> wrote:
On Thu, Jul 26, 2007 at 01:24:50AM +0100, Rev Simon
Indeed, this should most certainly be done for all revisions of a
page except the current one, so that if someone reverts spam without
the admin password, the spam is not indexed by the crawlers.
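One way that could work (a sketch of the idea, not the actual
OpenGuides implementation; the function and parameter names here are
hypothetical) is to emit a robots meta tag whenever the requested
revision is not the live one:

```python
def robots_meta(requested_version, current_version):
    """Return a robots meta tag for a wiki page view.

    Non-current revisions are marked noindex/nofollow, so even if
    spam is reverted by someone without the admin password, the old
    spammy revision never makes it into a crawler's index; the
    current page stays indexable.
    """
    if requested_version != current_version:
        return '<meta name="robots" content="noindex,nofollow">'
    return '<meta name="robots" content="index,follow">'
```

The tag would go in the <head> of the rendered page; well-behaved
crawlers honour it per page, which is finer-grained than robots.txt.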
PS: I strongly doubt it's Google causing problems. Google is a very
well-behaved bot. Others, like the MSN one, are much less well behaved.
I'm not convinced of that.
Google routinely and regularly fetches *large* pages on the Open Guide
to Boston that almost never change. Think Category Restaurant page --
1MB page, changes maybe once a week, Google fetches it daily.
I spoke with the crawler guys here, and your site changes more often
than you seem to think. That page also has a high PageRank, which
affects how often it gets crawled.
Your HTTP headers could help more: try supporting If-Modified-Since,
so unchanged pages can be answered with a 304 instead of the full body.
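The If-Modified-Since logic is simple enough to sketch standalone (a
minimal illustration, assuming the page's last-edit time is available
as a Unix timestamp; `conditional_get` is a hypothetical name, not an
OpenGuides function):

```python
from email.utils import formatdate, parsedate_to_datetime

def conditional_get(if_modified_since, last_modified_ts):
    """Decide between a 304 and a full 200 response.

    if_modified_since -- the If-Modified-Since request header value,
                         or None if the client did not send one
    last_modified_ts  -- Unix timestamp of the page's last edit
    """
    last_modified = formatdate(last_modified_ts, usegmt=True)
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since).timestamp()
        except (TypeError, ValueError):
            since = None
        if since is not None and last_modified_ts <= since:
            # Unchanged since the crawler's copy: no body needed.
            return 304, {"Last-Modified": last_modified}
    return 200, {"Last-Modified": last_modified}
```

A 1MB category page that changes weekly but is fetched daily would
then cost almost nothing on six of the seven fetches.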
Consider also a reverse caching proxy to reduce load. You can reduce
the crawl frequency with the webmaster console if you still think it's
too high.
You could restructure the page to not be a megabyte too, of course ;-)
Paul (not speaking as a representative of his employer, just trying to help out)
Granted, OG Boston is particularly poorly optimized for this because we
use index_list in our category pages. The actual index_value, etc. mode
in wiki.cgi is significantly more lightweight. (Bad decision on my
part.) But it doesn't take someone fetching much data to hurt a site,
and even if Google is only requesting things slowly, the requests can
still arrive faster than the server can answer them.
OpenGuides-Dev mailing list - OpenGuides-Dev(a)lists.openguides.org