On 7/26/07, Christopher Schmidt <crschmidt(a)crschmidt.net> wrote:
On Thu, Jul 26, 2007 at 01:24:50AM +0100, Rev Simon Rumble wrote:
Indeed, this should most certainly be done for all revisions of a page
except the current, so that if someone reverts spam without the admin
password, it's not indexed by the crawlers.
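A minimal sketch of that check in Perl (the function and argument names
are hypothetical, not OpenGuides' real interface):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Emit a robots meta tag for any revision other than the current
    # one, so reverted spam drops out of the crawlers' indexes.
    sub robots_meta_for {
        my ( $requested_version, $current_version ) = @_;
        return q{} if !defined $requested_version
                   || $requested_version == $current_version;
        return qq{<meta name="robots" content="noindex,nofollow">\n};
    }

    print robots_meta_for( 3, 7 );   # old revision: emits the tag
    print robots_meta_for( 7, 7 );   # current revision: emits nothing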
PS: I strongly doubt it's Google causing problems. Google is a very
well-behaved bot. Others, like the MSN one, are much less well behaved.
I'm not convinced of that.
Google routinely fetches *large* pages on the Open Guide to Boston that
almost never change. Take the Category Restaurant page -- a 1MB page
that changes maybe once a week, yet Google fetches it daily.
I spoke with the crawler guys here, and your site changes more often
than you seem to think. That page also has a high PageRank, which
affects crawl frequency.
Your HTTP headers could help more: send Last-Modified and honour
If-Modified-Since, so unchanged pages can be answered with a bodyless
304 Not Modified.
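Something like this in a CGI script would do it (a sketch only -- the
node filename is made up, and I haven't run it against OpenGuides):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CGI;
    use HTTP::Date qw(str2time time2str);

    my $q = CGI->new;

    # Hypothetical source of the page's last-change time; fall back
    # to "now" if the file isn't there.
    my $last_modified = (stat 'nodes/Category_Restaurant')[9] || time;

    my $since = str2time( $q->http('If-Modified-Since') || '' );
    if ( defined $since && $since >= $last_modified ) {
        # The crawler's copy is current: send 304 and skip the body.
        print $q->header( -status => '304 Not Modified' );
        exit;
    }

    # Otherwise serve the page, advertising Last-Modified so the
    # next request can be conditional.
    print $q->header(
        -type          => 'text/html; charset=utf-8',
        -Last_Modified => time2str($last_modified),
    );
    print "...page body...\n";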
Consider also putting a reverse caching proxy in front of the site to
reduce load. And if you still think the crawl rate is too high, you can
lower it in the webmaster console.
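For instance, a minimal accelerator setup in Squid (2.6-style syntax;
the hostname and origin port here are made up for illustration):

    # Listen on port 80 and forward cache misses to the wiki's
    # real server on port 8080.
    http_port 80 accel defaultsite=boston.example.org
    cache_peer 127.0.0.1 parent 8080 0 no-query originserver name=wiki

    # Only accelerate our own site.
    acl our_site dstdomain boston.example.org
    http_access allow our_site
    cache_peer_access wiki allow our_site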
You could restructure the page to not be a megabyte, too, of course ;-)
HTH,
Paul (not speaking as a representative of his employer, just trying to help out)
Granted, OG Boston is particularly poorly optimized for this because we
use index_list in our category pages; the index_value mode (etc.) in
wiki.cgi is significantly more lightweight. (Bad decision on my part.)
But a client doesn't have to fetch much data to hurt a site: even if
Google requests pages slowly, those requests can still arrive faster
than the server can answer them.
Regards,
--
Christopher Schmidt
Web Developer
--
OpenGuides-Dev mailing list - OpenGuides-Dev(a)lists.openguides.org
http://lists.openguides.org/cgi-bin/mailman/listinfo/openguides-dev