it seems like i am having quite a few frustrating spam-related
conversations recently.
On Wed 15 Dec 2004, Rev Simon Rumble
<simon(a)rumble.net> wrote:
> There's a helluva lot of wiki spam on that site. :(
looking for ways to hold off wiki spam without requiring user login,
i tried patching CGI::Wiki(::Kwiki) to set a 'human=probably' cookie,
without which you can't edit pages.
On heavily-spammed sites, though, it seems only to have slowed the
flood rather than stopped it.
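roughly, the shape of the check (a minimal sketch in plain CGI.pm
terms; the 'action' param and the fall-throughs are stand-ins, not
how the actual CGI::Wiki::Kwiki patch hooks in):

    use CGI;
    my $q = CGI->new;

    if ( ( $q->param('action') || '' ) eq 'edit' ) {
        # the edit action refuses to run unless the cookie came back
        unless ( ( $q->cookie('human') || '' ) eq 'probably' ) {
            print $q->redirect('/');   # bounce to a normal page view
            exit;
        }
        # ...fall through to the usual edit form from here
    }
    else {
        # ordinary page views hand the cookie out; real browsers send
        # it back, crawlers that ignore Set-Cookie never do
        my $cookie = $q->cookie(
            -name    => 'human',
            -value   => 'probably',
            -expires => '+1d',
        );
        print $q->header( -cookie => $cookie );
        # ...render the page as normal
    }
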
I also changed the text 'Edit this page' to 'alter this page' on a
couple of sites, to help avert brute-force google-based attacks.
I fear that, the way this is going, there will have to be a 'proof
of humanity' test, e.g. recognising numbers heavily distorted over a
noisy background, like all the free web/mail services now do.
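if it ever does come to that, even something this crude might do (a
rough sketch using GD; how the digits get stashed and checked
server-side is hand-waved):

    use GD;

    # four random digits drawn over speckle, so a dumb OCR pass struggles
    my $code  = sprintf '%04d', int rand 10_000;
    my $img   = GD::Image->new( 90, 30 );
    my $white = $img->colorAllocate( 255, 255, 255 );  # first colour = background
    my $black = $img->colorAllocate( 0, 0, 0 );
    my $grey  = $img->colorAllocate( 160, 160, 160 );

    $img->setPixel( int rand 90, int rand 30, $grey ) for 1 .. 400;
    $img->string( gdLargeFont, 18, 8, $code, $black );

    # the wiki stashes $code in the session and refuses the edit unless
    # the poster types it back correctly
    binmode STDOUT;
    print "Content-type: image/png\n\n", $img->png;
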
i don't want to have to go there yet! but what else is the plan?
requiring user login without email confirmation won't hold for long;
once that becomes the rule, will spamcrawlers that already fake
headers and cookies be slowed for long? it's like those people in
amsterdam who put 3 locks on their bike because 2 locks aren't
secure enough any more.
i am not crazy about having to log into websites, especially in a
wiki, where it raises the bar to editing. e.g. i am working on a
wireless project which aggregates information from the london OG
into a simple locative service; i don't want a login barrier there,
as it is certain to put casual users off making ad-hoc additions to
the guides from their local free network node. some kind of referer
mask to allow free edits from certain domains, maybe? it's all
starting to get a bit complex.
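something along these lines, perhaps (a rough sketch; the domain
names are made up, and a referer header is of course exactly the
sort of thing those crawlers already fake):

    # hypothetical allow-list: edits referred from these hosts skip the login
    my @friendly = qw( node.example-freenetwork.org openguides.example.org );

    my $referer = $ENV{HTTP_REFERER} || '';
    my ($host)  = $referer =~ m{^https?://([^/:]+)}i;

    my $trusted = 0;
    if ( defined $host ) {
        for my $domain (@friendly) {
            $trusted = 1 if $host =~ /(?:^|\.)\Q$domain\E$/i;
        }
    }
    # $trusted edits bypass the login/cookie check; everyone else gets
    # the full treatment
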
i *do* want to sort this out.
recently i've noticed a lot of pages like this one, for example:
http://joyce.eng.yale.edu/~bt/school/report.cgi/draft/kennecott.htm
web pages full of totally dissociated text, with a selection of
mailto: links. they rank highly for what used to be known as a
googlewhack (though this is me using teoma, too): typing in two
unrelated words that are unlikely to co-occur in a document. are
they designed to wreak search-based web-email-crawler revenge on
certain addresses?
i'm sort of astounded by the volume of junk information relative to
'real' information, a kind of episodic insane subtext to the web. at
least when email suffered from it this badly, we had the excuse of
"oh, they're trying to subvert/corrupt bayesian filters". webspam of
this kind makes less sense.
the mega search engines have a fuck of a lot to answer for. but how
can they avoid indexing this stuff when there's *enough* of it to
look plausible to some kind of big-brain engine trying to infer
sense and relation a priori from information mulch?
perhaps a content-based wiki spam filter would be useful; it
wouldn't hold off dissociated-text attacks, but it would certainly
deal with the 'long list of medication links' style of wikispam we
are seeing now. or a blacklist service?
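even something this crude would catch most of what's hitting us now
(a rough sketch; the keyword list is invented and would really come
from that shared blacklist):

    # very rough content check for the 'pile of medication links' pattern
    sub looks_like_wikispam {
        my ($text) = @_;

        # count external links in the submitted page body
        my $links = () = $text =~ m{https?://}gi;

        # hypothetical keyword list; a real one belongs in a shared blacklist
        my @words = qw( viagra cialis phentermine casino );
        my $hits  = grep { $text =~ /\b\Q$_\E\b/i } @words;

        # lots of links plus a few of the usual words: bounce the edit
        return ( $links > 10 && $hits > 0 ) || $hits >= 3;
    }
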
zx