There are a number of OpenGuides page types that web spiders don't really need to index, and we have code to stop them doing it, e.g.
http://dev.openguides.org/changeset/573 http://dev.openguides.org/changeset/1132
However, it doesn't seem to be working. See for instance: http://london.randomness.org.uk/wiki.cgi?action=list_all_versions;id=Locale%...
which if you view the source does indeed have <meta name="robots" content="noindex,nofollow" /> in the <head>.
But from the Apache logs: 66.249.67.153 - - [25/Jun/2008:14:59:00 +0100] "GET /wiki.cgi?action=list_all_versions;id=Locale%20IG9 HTTP/1.1" 200 3151 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Am I missing something obvious?
Kake
On Wed, Jun 25, 2008 at 03:07:59PM +0100, Kake L Pugh wrote:
> There are a number of OpenGuides page types that web spiders don't really need to index, and we have code to stop them doing it, e.g.
> http://dev.openguides.org/changeset/573 http://dev.openguides.org/changeset/1132
> However, it doesn't seem to be working. See for instance: http://london.randomness.org.uk/wiki.cgi?action=list_all_versions;id=Locale%...
> which if you view the source does indeed have
> <meta name="robots" content="noindex,nofollow" /> in the <head>.
> But from the Apache logs: 66.249.67.153 - - [25/Jun/2008:14:59:00 +0100] "GET /wiki.cgi?action=list_all_versions;id=Locale%20IG9 HTTP/1.1" 200 3151 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
> Am I missing something obvious?
In order to read the meta-headers the bot needs to make a request to the page. It won't index that page, or follow links from it, though.
It might be worth putting some stuff in robots.txt to stop it, but I don't know if robots.txt can use the request parameters in patterns.
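One way to test this locally, at least against one parser: the sketch below feeds a made-up robots.txt to Python's standard-library `urllib.robotparser` and asks whether two URLs may be fetched. The rule and the second URL are hypothetical, chosen only for illustration, and this shows how that one parser matches rules; it says nothing about what Googlebot or any other crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; not taken from any real guide.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wiki.cgi?action=list_all_versions
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The version-history URL from the Apache log line quoted in this thread.
history_url = (
    "http://london.randomness.org.uk/"
    "wiki.cgi?action=list_all_versions;id=Locale%20IG9"
)
# Hypothetical URL with a different query string, for comparison.
node_url = "http://london.randomness.org.uk/wiki.cgi?id=Locale%20IG9"

# can_fetch() does a prefix match of each rule against path plus query.
history_allowed = parser.can_fetch("Googlebot", history_url)
node_allowed = parser.can_fetch("Googlebot", node_url)
print(history_allowed, node_allowed)
```

The match is a plain prefix comparison, so the rule only helps if the parameter you care about comes first in the query string.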
David
On Wed 25 Jun 2008, David Sheldon dave@earth.li wrote:
> In order to read the meta-headers the bot needs to make a request to the page. It won't index that page, or follow links from it, though.
Hm, right, I see. I'd sort of assumed that it would only do that once, and then realise "OK, I don't need to check this page again". It's grabbed that particular one three times in the past week.
> It might be worth putting some stuff in robots.txt to stop it, but I don't know if robots.txt can use the request parameters in patterns.
Bob says it can't.
Kake
On Wed, Jun 25, 2008 at 03:07:59PM +0100, Kake L Pugh wrote:
> There are a number of OpenGuides page types that web spiders don't really need to index, and we have code to stop them doing it, e.g.
> http://dev.openguides.org/changeset/573 http://dev.openguides.org/changeset/1132
> However, it doesn't seem to be working. See for instance: http://london.randomness.org.uk/wiki.cgi?action=list_all_versions;id=Locale%...
> which if you view the source does indeed have
> <meta name="robots" content="noindex,nofollow" /> in the <head>.
> But from the Apache logs: 66.249.67.153 - - [25/Jun/2008:14:59:00 +0100] "GET /wiki.cgi?action=list_all_versions;id=Locale%20IG9 HTTP/1.1" 200 3151 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
> Am I missing something obvious?
'noindex,nofollow' says: "Don't put my contents in the Google Searches, and don't follow links from this page."
It can't know that the page has those tags without crawling it.
"Crawling" and "Indexing" are two different things: the only ways to have a page not be crawled are to:
* Not have any links pointing to it anywhere that Google can get to
* Include it in robots.txt.
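To make the crawl/index distinction concrete, here is a minimal sketch of the order of operations, using only Python's standard library. The class and function names are made up for illustration and this is not how any real search engine is implemented; the point is only that the directives live inside the response body, so the request has already happened (and been logged) by the time they can be read.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def robots_directives(html_body):
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html_body)
    return parser.directives


# Stand-in for a response body the crawler has ALREADY downloaded.
page = (
    '<html><head><meta name="robots" content="noindex,nofollow" />'
    "</head><body></body></html>"
)

directives = robots_directives(page)
may_index = "noindex" not in directives    # keep out of search results
may_follow = "nofollow" not in directives  # do not queue this page's links
print(sorted(directives), may_index, may_follow)
```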
Regards,
On Wed, Jun 25, 2008 at 03:07:59PM +0100, Kake L Pugh wrote:
> There are a number of OpenGuides page types that web spiders don't really need to index, and we have code to stop them doing it
> However, it doesn't seem to be working. See for instance: http://london.randomness.org.uk/wiki.cgi?action=list_all_versions;id=Locale%...
> which if you view the source does indeed have
> <meta name="robots" content="noindex,nofollow" /> in the <head>.
> But from the Apache logs: 66.249.67.153 - - [25/Jun/2008:14:59:00 +0100] "GET /wiki.cgi?action=list_all_versions;id=Locale%20IG9 HTTP/1.1" 200 3151 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
> Am I missing something obvious?
Yes, what you're missing is that google don't pay attention to robots.txt or the meta thingy. I expect that they cache it and then ignore changes for some time.
Yahoo do the same.
And yes, I do have the logs to prove that. I've had to ban their IP ranges from cpandeps and from wikiproxy.cantrell.org.uk.
On Wed, Jun 25, 2008 at 04:06:28PM +0100, David Cantrell wrote:
> Yes, what you're missing is that google don't pay attention to robots.txt or the meta thingy. I expect that they cache it and then ignore changes for some time.
> Yahoo do the same.
> And yes, I do have the logs to prove that.
Er, you seem to be misunderstanding how meta tags work: they have to *Crawl* the page to see the tags... and there is no tag that says "never crawl this page again."
I've never had Google violate robots.txt.
Logs don't really matter here: the key thing to point to would be an instance of Google search results containing a piece of HTML that is blocked by noindex. If you can find one of those, I bet that Google would be interested in seeing it. (Cheap tricks like modifying the HTML after Google crawls by don't count.)
Regards,
Christopher Schmidt wrote:
> On Wed, Jun 25, 2008 at 04:06:28PM +0100, David Cantrell wrote:
>> Yes, what you're missing is that google don't pay attention to robots.txt or the meta thingy. I expect that they cache it and then ignore changes for some time. Yahoo do the same.
> Er, you seem to be misunderstanding how meta tags work: they have to *Crawl* the page to see the tags... and there is no tag that says "never crawl this page again."
You seem to be misunderstanding the concept of a cache. If they read the meta tag once, they should remember what it said for a while, AND OBEY IT without asking for that page again. Likewise robots.txt.
> I've never had Google violate robots.txt.
Lucky you. I had them start crawling one of my sites, so I added a robots.txt, but they kept coming. I can understand them keeping going for a day or so cos they cached the fact that I didn't have a robots.txt file, but they were still requesting files other than robots.txt well over a month later. I hope all their programmers' children die in a fire.
> the key thing to point to would be an instance of Google search results containing a piece of HTML that is blocked by noindex. If you can find one of those, I bet that Google would be interested in seeing it. (Cheap tricks like modifying the HTML after Google crawls by don't count.)
I have no interest in helping google. My time is better spent by taking a few seconds to block their abusive bot than it is in figuring out how to contact the right person and convincing him that he's fucked up and then waiting months while he gets manglement approval to deploy a bugfix, while all the time the bot continues to prevent real users from having access to the service.
On Wed, Jun 25, 2008 at 09:59:47PM +0100, David Cantrell wrote:
> Christopher Schmidt wrote:
>> On Wed, Jun 25, 2008 at 04:06:28PM +0100, David Cantrell wrote:
>>> Yes, what you're missing is that google don't pay attention to robots.txt or the meta thingy. I expect that they cache it and then ignore changes for some time. Yahoo do the same.
>> Er, you seem to be misunderstanding how meta tags work: they have to *Crawl* the page to see the tags... and there is no tag that says "never crawl this page again."
> You seem to be misunderstanding the concept of a cache. If they read the meta tag once, they should remember what it said for a while, AND OBEY IT without asking for that page again. Likewise robots.txt.
Sure. And they do, in my experience. The definition of 'a while' may not be what you want it to be: In general, this can be helped by accurately describing how long 'a while' should be.
>> the key thing to point to would be an instance of Google search results containing a piece of HTML that is blocked by noindex. If you can find one of those, I bet that Google would be interested in seeing it. (Cheap tricks like modifying the HTML after Google crawls by don't count.)
> I have no interest in helping google.
Then allow me to put it differently: if you were to present that, instead of half-baked vitriol filled with hateful comments, I would be much less likely to mentally dismiss you as nothing more than a troll. It seems clear that this doesn't bother you one way or another, so I'm sorry that you have found your experience with running a webserver so painful, and hope that one day, you find a teddy bear that you can hug and make yourself feel better.
Regards,
On Wed 25 Jun 2008, Christopher Schmidt and Dave Cantrell wrote Some Stuff.
I'm not sure this discussion is going anywhere useful. Can we leave it at this, please?
Kake
openguides-dev@lists.openguides.org