it seems like i am having quite a few frustrating spam-related conversations recently.
On Wed 15 Dec 2004, Rev Simon Rumble simon@rumble.net wrote:
There's a helluva lot of wiki spam on that site. :(
looking for ways to hold off wiki spam without requiring user login, i tried patching CGI::Wiki(::Kwiki) to set a 'human=probably' cookie, without which you can't edit pages. on heavily-spammed sites it seems to have slowed the flood rather than stopped it, though.
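the shape of the idea is roughly this (a minimal sketch with plain CGI.pm, not the actual CGI::Wiki::Kwiki patch; where it hooks into the wiki dispatch is assumed):

    #!/usr/bin/perl
    # minimal sketch of the 'human=probably' cookie gate
    use strict;
    use warnings;
    use CGI;

    my $q = CGI->new;

    if ( ( $q->param('action') || '' ) eq 'edit' ) {
        my $human = $q->cookie('human') || '';
        if ( $human ne 'probably' ) {
            # no cookie yet: hand one out and bounce, so a person can just
            # follow the link again, while a crawler that ignores Set-Cookie
            # never reaches the edit form
            my $cookie = $q->cookie(
                -name  => 'human',
                -value => 'probably',
                -path  => '/',
            );
            print $q->header( -cookie => $cookie );
            print "please enable cookies and hit the edit link again.\n";
            exit;
        }
    }

    # ... otherwise fall through to the normal wiki dispatch ...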
i also changed the text 'Edit this page' to 'alter this page' on a couple of sites, to try to avert brute-force google-based attacks.
I fear that the way this is going, there is going to have to be a 'proof of humanity' test, e.g. recognising numbers that are heavily distorted on a noisy background like all the free web/mail services now do.
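just to make that concrete, such a test could be generated with something like GD::SecurityImage; the module and options here are an assumption on my part, a sketch rather than a plan:

    use strict;
    use warnings;
    use GD::SecurityImage;   # assumed; any similar captcha generator would do

    my $captcha = GD::SecurityImage->new(
        width   => 80,
        height  => 30,
        lines   => 10,        # noise lines scribbled over the digits
        gd_font => 'giant',
    );
    $captcha->random;                       # pick a random digit string
    $captcha->create( normal => 'rect' );   # render it distorted on a noisy background
    my ( $image_data, $mime_type, $code ) = $captcha->out;

    # $code would have to be stashed server-side and compared against
    # whatever the would-be editor types back in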
i don't want to have to go there yet! but what else is the plan? requiring user login without email confirmation won't hold out for long; and once that becomes the rule, will these spamcrawlers, which already fake headers and cookies, be slowed for long? it's like those people in amsterdam who put three locks on their bikes because two locks are not secure enough any more.
i am not crazy about having to log into websites, especially wikis, where it raises the bar to editing. for example, i am working on a wireless project which aggregates information from london OG into a simple locative service; i don't want a login barrier there, as it is certain to put casual users off making ad-hoc additions to the guides from their local free network node. some kind of referer mask to allow free edits from certain domains? it's all starting to get a bit complex.
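the referer mask might amount to little more than this (the domains below are placeholders, and of course referers are trivially forged, which is part of why it all feels complex):

    use strict;
    use warnings;

    # placeholder patterns -- in reality, the domains of the local free networks
    my @trusted = (
        qr/\.wireless-node\.example\.org$/i,
        qr/\.localfreenet\.example\.net$/i,
    );

    sub referer_is_trusted {
        my $referer = $ENV{HTTP_REFERER} || '';
        my ($host) = $referer =~ m{^https?://([^/:]+)}i;
        return 0 unless defined $host;
        return scalar grep { $host =~ $_ } @trusted;
    }

    # referer_is_trusted() ? skip the cookie check : apply it as usual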
i *do* want to sort this out.
recently i've noticed a lot of pages like this one: http://joyce.eng.yale.edu/~bt/school/report.cgi/draft/kennecott.htm
web pages full of totally dissociated text, scattered with mailto: links. they turn up high in the results for what used to be known as a googlewhack (and this is me using teoma, too): typing in two unrelated words that are unlikely to co-occur in a document. are they designed to wreak search-based web-email-crawler revenge on certain addresses?
i'm sort of astounded by the volume of junk information relative to 'real' information, a kind of episodic insane subtext to the web. at least when email suffered from it so badly we had the excuse of "oh, they're trying to subvert / corrupt bayesian filters". webspam of this kind makes less sense.
the mega search engines have a fuck of a lot to answer for. but how can they avoid indexing this stuff when there's *enough* of it to look plausible to some kind of big-brain engine trying to infer sense and relation a priori from information mulch?
perhaps a content-based wiki spam filter would be useful; it wouldn't hold off dissociated-text attacks, but it would certainly deal with the 'long list of medication spam links' style of wikispam we are seeing now. a blacklist service?
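as a rough cut, that filter might just be a pile of regexes plus a link-count sanity check, with a blacklist service being nothing more than a shared copy of the pile (the patterns and threshold here are invented for illustration):

    use strict;
    use warnings;

    # a tiny made-up blacklist; a shared service would really just be a way
    # of distributing something like this between wikis
    my @blacklist = (
        qr/\b(?:viagra|cialis|phentermine)\b/i,
        qr/casino-?online/i,
    );

    sub edit_looks_like_spam {
        my ($content) = @_;

        # an edit that is mostly a wall of external links is suspicious on its own
        my $link_count = () = $content =~ m{https?://}gi;
        return 1 if $link_count > 20;

        for my $pattern (@blacklist) {
            return 1 if $content =~ $pattern;
        }
        return 0;
    }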
zx