On 7/26/07, Christopher Schmidt crschmidt@crschmidt.net wrote:
On Thu, Jul 26, 2007 at 01:24:50AM +0100, Rev Simon Rumble wrote:
Indeed, this should most certainly be done for all revisions of a page except the current one, so that if someone reverts spam without the admin password, the spam revision isn't indexed by the crawlers.
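Roughly what I have in mind, as a sketch (Python purely for illustration; OpenGuides itself is Perl, and the version arguments here are made up): any non-current revision gets a robots noindex tag, so reverted spam drops out of the indexes.

    # Hypothetical sketch: noindex every revision of a page except the current one.
    def robots_meta(requested_version, current_version):
        """Return the robots meta tag to emit in the page <head>."""
        if requested_version != current_version:
            # Old revision (possibly reverted spam): keep it out of search indexes.
            return '<meta name="robots" content="noindex,nofollow">'
        # Current revision: let crawlers index it normally.
        return '<meta name="robots" content="index,follow">'

    # Usage; the version numbers would come from the wiki's store:
    print(robots_meta(requested_version=3, current_version=7))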
PS: I strongly doubt it's Google causing the problems. Google is a very well-behaved bot. Others, like the MSN one, are much less well behaved.
I'm not convinced of that.
Google routinely fetches *large* pages on the Open Guide to Boston that almost never change. Take the Category Restaurant page: a 1 MB page that changes maybe once a week, yet Google fetches it daily.
I spoke with the crawler guys here, and your site changes more often than you seem to think. That page also has a high PageRank, which increases crawl frequency.
Your HTTP headers could help more: support If-Modified-Since and answer with 304 Not Modified when nothing has changed. Consider also a reverse caching proxy in front of the site to reduce load. If you still think the crawl is too frequent, you can reduce it in the Webmaster console.
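A minimal sketch of the conditional-GET side, in case it helps (Python purely for illustration; wiki.cgi is Perl, and last_modified/render_page are assumed inputs):

    import os
    import sys
    from email.utils import formatdate, parsedate_to_datetime

    def respond(last_modified, render_page):
        """CGI-style conditional GET. last_modified is a Unix timestamp for
        the page; render_page is a function returning the full HTML."""
        ims = os.environ.get('HTTP_IF_MODIFIED_SINCE')
        if ims:
            try:
                if last_modified <= parsedate_to_datetime(ims).timestamp():
                    # Unchanged since the crawler's copy: send a bodyless 304
                    # instead of regenerating and shipping a 1 MB page.
                    sys.stdout.write('Status: 304 Not Modified\r\n\r\n')
                    return
            except (TypeError, ValueError):
                pass  # Unparseable date header: fall through to a full reply.
        sys.stdout.write('Status: 200 OK\r\n')
        sys.stdout.write('Last-Modified: %s\r\n'
                         % formatdate(last_modified, usegmt=True))
        sys.stdout.write('Content-Type: text/html; charset=utf-8\r\n\r\n')
        sys.stdout.write(render_page())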
You could restructure the page to not be a megabyte too, of course ;-)
HTH, Paul (not speaking as a representative of his employer, just trying to help out)
Granted, OG Boston is particularly poorly optimized for this because we use index_list in our category pages (a bad decision on my part); the index_value etc. modes in wiki.cgi are significantly more lightweight. But a client doesn't have to fetch much data to hurt a site: even if Google only requests things slowly, each request can take longer to generate than the gap before the next one, so the crawler can still exceed the rate at which the server can return pages.
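For what it's worth, letting a reverse proxy (Squid or the like, as Paul suggests) absorb the crawler hits on those heavy category pages would go a long way. A rough sketch of the headers that make a page cacheable (Python for illustration only, and the one-hour TTL is an arbitrary pick):

    from email.utils import formatdate

    # Assumed TTL: the category pages change roughly weekly, so even a short
    # shared-cache lifetime soaks up nearly all of the crawler traffic.
    CACHE_TTL = 3600  # seconds

    def cache_headers(last_modified):
        """Response headers that let a reverse proxy cache the page."""
        return ('Cache-Control: public, max-age=%d\r\n' % CACHE_TTL
                + 'Last-Modified: %s\r\n'
                % formatdate(last_modified, usegmt=True))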
Regards,
Christopher Schmidt
Web Developer
--
OpenGuides-Dev mailing list - OpenGuides-Dev@lists.openguides.org
http://lists.openguides.org/cgi-bin/mailman/listinfo/openguides-dev