Robot deterrence


Kake L Pugh

25 Jun 2008

3:07 p.m.

There are a number of OpenGuides page types that web spiders don't really need to index, and we have code to stop them doing it, e.g.

http://dev.openguides.org/changeset/573 http://dev.openguides.org/changeset/1132

However, it doesn't seem to be working. See for instance: http://london.randomness.org.uk/wiki.cgi?action=list_all_versions;id=Locale%...

which if you view the source does indeed have <meta name="robots" content="noindex,nofollow" /> in the <head>.

But from the Apache logs: 66.249.67.153 - - [25/Jun/2008:14:59:00 +0100] "GET /wiki.cgi?action=list_all_versions;id=Locale%20IG9 HTTP/1.1" 200 3151 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Am I missing something obvious?

Kake
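For anyone wanting to check how often a crawler fetches a given URL, the Apache log excerpt above can be picked apart with a short script. This is a minimal sketch, assuming the server writes the standard Combined Log Format; the names parse_log_line and count_bot_fetches are made up for illustration and are not part of OpenGuides.

```python
import re

# Combined Log Format:
# host ident user [time] "method path protocol" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def count_bot_fetches(lines, path, agent_substring="Googlebot"):
    """Count requests for `path` whose user-agent contains `agent_substring`."""
    count = 0
    for line in lines:
        entry = parse_log_line(line)
        if entry and entry["path"] == path and agent_substring in entry["agent"]:
            count += 1
    return count


# The log line quoted in the message above.
SAMPLE = (
    '66.249.67.153 - - [25/Jun/2008:14:59:00 +0100] '
    '"GET /wiki.cgi?action=list_all_versions;id=Locale%20IG9 HTTP/1.1" '
    '200 3151 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

if __name__ == "__main__":
    target = "/wiki.cgi?action=list_all_versions;id=Locale%20IG9"
    print(count_bot_fetches([SAMPLE], target))
```

Matching on the user-agent string only shows what the client claims to be; it does not verify that the request really came from Google.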


David Sheldon

25 Jun

3:14 p.m.

On Wed, Jun 25, 2008 at 03:07:59PM +0100, Kake L Pugh wrote:

...

There are a number of OpenGuides page types that web spiders don't really need to index, and we have code to stop them doing it, e.g.

http://dev.openguides.org/changeset/573 http://dev.openguides.org/changeset/1132

However, it doesn't seem to be working. See for instance: http://london.randomness.org.uk/wiki.cgi?action=list_all_versions;id=Locale%...

which if you view the source does indeed have

<meta name="robots" content="noindex,nofollow" /> in the <head>.

But from the Apache logs: 66.249.67.153 - - [25/Jun/2008:14:59:00 +0100] "GET /wiki.cgi?action=list_all_versions;id=Locale%20IG9 HTTP/1.1" 200 3151 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Am I missing something obvious?

In order to read the meta-headers the bot needs to make a request to the page. It won't index that page, or follow links from it though.

It might be worth putting some stuff in robots.txt to stop it, but I don't know if robots.txt can use the request parameters in patterns.

David

-- "I think 'small and fluffy' is a good term, which should be used more often" --Andie

Kake L Pugh

3:19 p.m.

On Wed 25 Jun 2008, David Sheldon dave@earth.li wrote:

...

In order to read the meta-headers the bot needs to make a request to the page. It won't index that page, or follow links from it though.

Hm, right, I see. I'd sort of assumed that it would only do that once, and then realise "OK, I don't need to check this page again". It's grabbed that particular one three times in the past week.

...

It might be worth putting some stuff in robots.txt to stop it, but I don't know if robots.txt can use the request parameters in patterns.

Bob says it can't.

Kake
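The question of whether a robots.txt rule can cover request parameters can be probed with a small experiment. Under the original prefix-matching convention, a Disallow value is compared against the start of the requested URL, and Python's standard-library urllib.robotparser includes the query string in that comparison. The sketch below shows how that one parser behaves; the Disallow rule and the second URL form (node_url) are assumptions made up for illustration, and nothing here says how any particular crawler treats such a rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: block every URL that starts with this path and query prefix.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wiki.cgi?action=list_all_versions
"""


def is_allowed(robots_txt, user_agent, url):
    """Ask the stdlib parser whether `user_agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


BASE = "http://london.randomness.org.uk"
# The history page from the log excerpt in this thread.
history_url = BASE + "/wiki.cgi?action=list_all_versions;id=Locale%20IG9"
# Hypothetical URL for an ordinary page, used only as a contrast case.
node_url = BASE + "/wiki.cgi?id=Locale%20IG9"

if __name__ == "__main__":
    for url in (history_url, node_url):
        print(url, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

Because the match is a plain prefix, a rule like this only helps when the parameter to be blocked comes first in the query string.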

Christopher Schmidt

3:18 p.m.

On Wed, Jun 25, 2008 at 03:07:59PM +0100, Kake L Pugh wrote:

...

There are a number of OpenGuides page types that web spiders don't really need to index, and we have code to stop them doing it, e.g.

http://dev.openguides.org/changeset/573 http://dev.openguides.org/changeset/1132

However, it doesn't seem to be working. See for instance: http://london.randomness.org.uk/wiki.cgi?action=list_all_versions;id=Locale%...

which if you view the source does indeed have

<meta name="robots" content="noindex,nofollow" /> in the <head>.

But from the Apache logs: 66.249.67.153 - - [25/Jun/2008:14:59:00 +0100] "GET /wiki.cgi?action=list_all_versions;id=Locale%20IG9 HTTP/1.1" 200 3151 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Am I missing something obvious?

'noindex,nofollow' says: "Don't put my contents in the Google Searches, and don't follow links from this page."

It can't know that the page has those tags without crawling it.

"Crawling" and "Indexing" are two different things: the only way to have a page not be crawled is to:

* Not have any links pointing to it anywhere that Google can get to
* Include it in robots.txt.

Regards,

-- Christopher Schmidt Web Developer
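The crawl-versus-index distinction is easier to see in code: the robots directive lives inside the page's own HTML, so a client has to download the page before it can learn what the directive says. This is a minimal sketch using Python's standard html.parser; the class name RobotsMetaParser and the helper robots_directives are hypothetical, and a real crawler will do considerably more than this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html_text):
    """Return the set of robots directives found in already-fetched HTML."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


# Stand-in for a page body that has already been requested from the server.
PAGE = (
    '<html><head><meta name="robots" content="noindex,nofollow" />'
    "</head><body>History</body></html>"
)

if __name__ == "__main__":
    found = robots_directives(PAGE)
    print("index this page:", "noindex" not in found)
    print("follow its links:", "nofollow" not in found)
```

The function only ever sees HTML that has already been fetched, which is why a request for the page still shows up in the server log even when the page is never indexed.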

David Cantrell

4:06 p.m.

On Wed, Jun 25, 2008 at 03:07:59PM +0100, Kake L Pugh wrote:

...

There are a number of OpenGuides page types that web spiders don't really need to index, and we have code to stop them doing it

However, it doesn't seem to be working. See for instance: http://london.randomness.org.uk/wiki.cgi?action=list_all_versions;id=Locale%...

which if you view the source does indeed have

<meta name="robots" content="noindex,nofollow" /> in the <head>.

But from the Apache logs: 66.249.67.153 - - [25/Jun/2008:14:59:00 +0100] "GET /wiki.cgi?action=list_all_versions;id=Locale%20IG9 HTTP/1.1" 200 3151 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Am I missing something obvious?

Yes, what you're missing is that google don't pay attention to robots.txt or the meta thingy. I expect that they cache it and then ignore changes for some time.

Yahoo do the same.

And yes, I do have the logs to prove that. I've had to ban their IP ranges from cpandeps and from wikiproxy.cantrell.org.uk.

-- David Cantrell | top google result for "topless karaoke murders" Godliness is next to Englishness
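Banning a crawler by IP range comes down to testing each client address against a list of networks. This is a rough sketch with Python's standard ipaddress module. The network below is an assumption chosen only because it contains the address from the log excerpt earlier in the thread; it is not an authoritative list of any crawler's ranges, and is_blocked is a made-up helper name.

```python
import ipaddress

# Hypothetical block list; substitute ranges taken from your own logs.
BLOCKED_NETWORKS = [ipaddress.ip_network("66.249.64.0/19")]


def is_blocked(address, networks=BLOCKED_NETWORKS):
    """Return True if `address` falls inside any of the blocked networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)


if __name__ == "__main__":
    for client in ("66.249.67.153", "192.0.2.1"):
        print(client, is_blocked(client))
```

In practice this check usually lives in the web server or firewall configuration rather than in application code, so that blocked requests never reach the CGI script.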

Christopher Schmidt

4:24 p.m.

On Wed, Jun 25, 2008 at 04:06:28PM +0100, David Cantrell wrote:

...

Yes, what you're missing is that google don't pay attention to robots.txt or the meta thingy. I expect that they cache it and then ignore changes for some time.

Yahoo do the same.

And yes, I do have the logs to prove that.

Er, you seem to be misunderstanding how meta tags work: they have to *Crawl* the page to see the tags... and there is no tag that says "never crawl this page again."

I've never had Google violate robots.txt.

Logs don't really matter here: the key thing to point to would be an instance of Google search results containing a piece of HTML that is blocked by noindex. If you can find one of those, I bet that Google would be interested in seeing it. (Cheap tricks like modifying the HTML after Google crawls by don't count.)

Regards,

-- Christopher Schmidt Web Developer

David Cantrell

9:59 p.m.

Christopher Schmidt wrote:

...

On Wed, Jun 25, 2008 at 04:06:28PM +0100, David Cantrell wrote:

...
Yes, what you're missing is that google don't pay attention to robots.txt or the meta thingy. I expect that they cache it and then ignore changes for some time. Yahoo do the same.

Er, you seem to be misunderstanding how meta tags work: they have to *Crawl* the page to see the tags... and there is no tag that says "never crawl this page again."

You seem to be misunderstanding the concept of a cache. If they read the meta tag once, they should remember what it said for a while, AND OBEY IT without asking for that page again. Likewise robots.txt.

...

I've never had Google violate robots.txt.

Lucky you. I had them start crawling one of my sites, so I added a robots.txt, but they kept coming. I can understand them keeping going for a day or so cos they cached the fact that I didn't have a robots.txt file, but they were still requesting files other than robots.txt well over a month later. I hope all their programmers' children die in a fire.

...

the key thing to point to would be an instance of Google search results containing a piece of HTML that is blocked by noindex. If you can find one of those, I bet that Google would be interested in seeing it. (Cheap tricks like modifying the HTML after Google crawls by don't count.)

I have no interest in helping google. My time is better spent by taking a few seconds to block their abusive bot than it is in figuring out how to contact the right person and convincing him that he's fucked up and then waiting months while he gets manglement approval to deploy a bugfix, while all the time the bot continues to prevent real users from having access to the service.

-- header FROM_DAVID_CANTRELL From =~ /david.cantrell/i describe FROM_DAVID_CANTRELL Message is from David Cantrell score FROM_DAVID_CANTRELL 15.72 # This figure from experimentation

Christopher Schmidt

10:18 p.m.

On Wed, Jun 25, 2008 at 09:59:47PM +0100, David Cantrell wrote:

...

Christopher Schmidt wrote:

...
On Wed, Jun 25, 2008 at 04:06:28PM +0100, David Cantrell wrote:

...
Yes, what you're missing is that google don't pay attention to robots.txt or the meta thingy. I expect that they cache it and then ignore changes for some time. Yahoo do the same.

Er, you seem to be misunderstanding how meta tags work: they have to *Crawl* the page to see the tags... and there is no tag that says "never crawl this page again."

You seem to be misunderstanding the concept of a cache. If they read the meta tag once, they should remember what it said for a while, AND OBEY IT without asking for that page again. Likewise robots.txt.

Sure. And they do, in my experience. The definition of 'a while' may not be what you want it to be: In general, this can be helped by accurately describing how long 'a while' should be.

...

...
the key thing to point to would be an instance of Google search results containing a piece of HTML that is blocked by noindex. If you can find one of those, I bet that Google would be interested in seeing it. (Cheap tricks like modifying the HTML after Google crawls by don't count.)

I have no interest in helping google.

Then allow me to put it differently: if you were to present that, instead of half-baked vitriol filled with hateful comments, I would be much less likely to mentally dismiss you as nothing more than a troll. It seems clear that this doesn't bother you one way or another, so I'm sorry that you have found your experience with running a webserver so painful, and hope that one day, you find a teddy bear that you can hug and make yourself feel better.

Regards,

-- Christopher Schmidt Web Developer

Kake L Pugh

10:21 p.m.

On Wed 25 Jun 2008, Christopher Schmidt and Dave Cantrell wrote Some Stuff.

I'm not sure this discussion is going anywhere useful. Can we leave it at this, please?

Kake

openguides-dev@lists.openguides.org
