There's a helluva lot of wiki spam on that site. :(
On Wed 15 Dec 2004, Rev Simon Rumble simon@rumble.net wrote:
There's a helluva lot of wiki spam on that site. :(
Fixed, thanks. Yes, it gets spammed much more than London does, probably because it's been around for a long time. I try to remember to check it every day because it *does* get spammed *every* day. Sigh.
Is the fact that you posted here instead of mailing me directly a sign that we need to make the email addresses of site admins more obvious?
Kake
This one time, at band camp, Kake L Pugh wrote:
Fixed, thanks. Yes, it gets spammed much more than London does, probably because it's been around for a long time. I try to remember to check it every day because it *does* get spammed *every* day. Sigh.
Ouch. You might like to check out www.bloglines.com, which is the only RSS aggregator I've found to be useful. It tells me about all changes to my Guide and to London. Just wish Wikipedia had RSS for my watchlist. (But I understand why they wouldn't: RSS would bring the server to its knees.)
Is the fact that you posted here instead of mailing me directly a sign that we need to make the email addresses of site admins more obvious?
More that I was hoping for a discussion on how to deal with this.
This one time, at band camp, Kake L Pugh wrote:
Is the fact that you posted here instead of mailing me directly a sign that we need to make the email addresses of site admins more obvious?
On Wed 15 Dec 2004, Rev Simon Rumble simon@rumble.net wrote:
More that I was hoping for a discussion on how to deal with this.
I've changed the subject line to make this more apparent.
People: Simon would like a discussion. Please discuss.
Kake
it seems like i am having quite a few frustrating spam-related conversations recently.
On Wed 15 Dec 2004, Rev Simon Rumble simon@rumble.net wrote:
There's a helluva lot of wiki spam on that site. :(
looking for ways to hold off wiki spam without having to require user login, i tried patching CGI::Wiki(::Kwiki) to set a 'human=probably' cookie that you couldn't edit pages without.
On heavily-spammed sites, though, it seems to have slowed the flood rather than stopped it.
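For anyone who wants to try the same trick, here is a minimal sketch of that kind of cookie gate in Perl, assuming a CGI.pm-based edit handler; the cookie name follows the 'human=probably' idea above, and everything else is made up for illustration:

    use CGI;
    my $q = CGI->new;

    if ( ($q->cookie('human') || '') ne 'probably' ) {
        # Hand the browser the cookie and ask it to retry; a scripted
        # poster that ignores Set-Cookie never gets past this point.
        my $cookie = $q->cookie(
            -name    => 'human',
            -value   => 'probably',
            -expires => '+1d',
        );
        print $q->header( -cookie => $cookie, -status => '403 Forbidden' );
        print "Please reload the page before editing.\n";
        exit;
    }
    # ...otherwise fall through to the normal edit/preview/commit flow...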
I also changed the text 'Edit this page' to 'alter this page' on a couple of sites, to help avert brute-force Google-based attacks.
I fear that the way this is going, there is going to have to be a 'proof of humanity' test, e.g. recognising numbers that are heavily distorted on a noisy background like all the free web/mail services now do.
i don't want to have to go there yet! but what else is the plan? requiring user login without email confirmation won't last for long; when that becomes the rule, will these spamcrawlers, which already fake headers and cookies, slow down for long? it is like those people in amsterdam who have 3 locks on their bike because 2 locks are not secure enough any more.
i am not crazy about having to log into websites, especially in wiki; it raises the edit bar. e.g. i am working on a wireless project which aggregates information from london OG into a simple locative service; i don't want the barrier to entry of a login there, as it is certain to put casual users off making ad-hoc additions to the guides from their local free network node. some kind of referer mask to allow free edits from certain domains? it's all starting to get a bit complex.
i *do* want to sort this out.
recently i've noticed a lot of pages like this, for example: http://joyce.eng.yale.edu/~bt/school/report.cgi/draft/kennecott.htm
web pages full of totally dissociated text, with a selection of mailto: links. they pop up high in what used to be known as a googlewhack (though this is me using teoma, too): typing in two unrelated kinds of words that are unlikely to co-occur in a document. designed to wreak search-based web-email-crawler revenge on certain addresses?
i'm sort of astounded by the volume of junk information relative to 'real' information, a kind of episodic insane subtext to the web. at least when email suffered from it so badly we had the excuse of "oh, they're trying to subvert / corrupt bayesian filters". webspam of this kind makes less sense.
the mega search engines have a fuck of a lot to answer for. but how can they avoid indexing this stuff when there's *enough* of it to make it look plausible, to some kind of big-brain engine trying to infer sense and relation a priori from information mulch.
perhaps a content-based wiki spam filter would be useful; it wouldn't hold off dissociated-text attacks but would certainly deal with the 'long list of medication spam links' style wikispam we are seeing now. a blacklist service?
zx
On Thu, Dec 16, 2004 at 06:27:30AM -0800, Jo Walsh wrote:
perhaps a content-based wiki spam filter would be useful; it wouldn't hold off dissociated-text attacks but would certainly deal with the 'long list of medication spam links' style wikispam we are seeing now. a blacklist service?
Have a framework to enable certain classes of edit to be moderated (for example, a regexp run against each node update).
Then you can trivially require moderation for any edit that contains an external URL ref. This would block the vast majority of spam that I've seen on the OpenGuides so far, and would not be an unreasonable inconvenience to wiki users.
We already have the concept of admin users for node deletion, so simply set a flag in the database that a node version needs moderation and refrain from displaying such data on the public web site.
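A rough sketch of what that might look like, assuming a DBI handle; the table and column names here are invented for illustration and are not the real CGI::Wiki schema:

    use DBI;

    # True if the new page text contains an external URL reference.
    sub needs_moderation {
        my ($text) = @_;
        return $text =~ m{https?://}i;
    }

    sub save_version {
        my ($dbh, $node, $version, $text) = @_;
        my $moderate = needs_moderation($text) ? 1 : 0;
        $dbh->do(
            'INSERT INTO content (node, version, text, needs_moderation)
             VALUES (?, ?, ?, ?)',
            undef, $node, $version, $text, $moderate,
        );
        # The display code would then skip any version with
        # needs_moderation = 1 until an admin clears the flag.
    }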
(feel free to condense the results of this discussion into a wishlist bug(s) on rt.cpan.org for OpenGuides).
Dominic.
This one time, at band camp, Dominic Hargreaves wrote:
Have a framework to enable certain classes of edit to be moderated (for example, a regexp run against each node update).
Then you can trivially require moderation for any edit that contains an external URL ref. This would block the vast majority of spam that I've seen on the OpenGuides so far, and would not be an unreasonable inconvenience to wiki users.
Or, alternatively, look up the provided URLs in one of the spam databases that tracks such things. Or perhaps start a Wiki-specific one for use on lots of Wiki systems?
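As a sketch of what an RBL-style lookup could look like, here's a check against a URI blacklist using Net::DNS; multi.surbl.org is just one example of such a service, and a real implementation would want to reduce the host to its registered domain first:

    use Net::DNS;
    use URI;

    sub url_is_blacklisted {
        my ($url) = @_;
        my $host = eval { URI->new($url)->host } or return 0;
        my $res  = Net::DNS::Resolver->new;
        # A listed domain resolves (to 127.0.0.x); an unlisted one doesn't.
        my $answer = $res->query("$host.multi.surbl.org", 'A');
        return defined $answer ? 1 : 0;
    }

    print url_is_blacklisted('http://example.com/') ? "listed\n" : "clean\n";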
On Thu, Dec 16, 2004 at 02:48:54PM +0000, Rev Simon Rumble wrote:
This one time, at band camp, Dominic Hargreaves wrote:
Have a framework to enable certain classes of edit to be moderated (for example, a regexp run against each node update).
Then you can trivially require moderation for any edit that contains an external URL ref. This would block the vast majority of spam that I've seen on the OpenGuides so far, and would not be an unreasonable inconvenience to wiki users.
Or, alternatively, look up the provided URLs in one of the spam databases that tracks such things. Or perhaps start a Wiki-specific one for use on lots of Wiki systems?
MoinMoin already has one; see http://moinmoin.wikiwkiweb.de/AntiSpamGlobalSolution for information on how it works (or how to turn it on, but it's only got a plugin for MoinMoin listed there.)
On Thu, Dec 16, 2004 at 06:27:30AM -0800, Jo Walsh wrote:
perhaps a content-based wiki spam filter would be useful; it wouldn't hold off dissociated-text attacks but would certainly deal with the 'long list of medication spam links' style wikispam we are seeing now. a blacklist service?
Would using some metrics of sentence complexity and structure be useful? There was some talk recently on the london.pm list about this, in which Lingua::EN::Fathom, diction(1) and style(1) were recommended.
http://london.pm.org/pipermail/london.pm/Week-of-Mon-20041011/029431.html and follow-ups.
This is working on the assumption that typical comment spam will get outlandish scores. Not that I know, cos I've never seen any, let alone measured it.
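If anyone wants to experiment, here is a tiny sketch using Lingua::EN::Fathom's analyse_block/fog interface; the thresholds are pure guesswork and would need tuning against real wikispam:

    use Lingua::EN::Fathom;

    sub looks_like_word_salad {
        my ($text) = @_;
        my $f = Lingua::EN::Fathom->new;
        $f->analyse_block($text);
        # Genuine prose tends to land in a sane range; dissociated text
        # and bare link lists tend to score suspiciously low or high.
        my $fog = $f->fog;
        return ($fog < 4 or $fog > 25) ? 1 : 0;
    }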
I'm going to pop over to spam-l now and see if anyone there has any ideas ...
This one time, at band camp, Christopher Schmidt wrote:
MoinMoin already has one; see http://moinmoin.wikiwkiweb.de/AntiSpamGlobalSolution for information on how it works (or how to turn it on, but it's only got a plugin for MoinMoin listed there.)
That doesn't seem to be working. Google cache: http://tinyurl.com/66gek
This is a nice blacklist of regexes for Wiki spammers: http://blacklist.chongqed.org/
Of course, it's the wrong approach. It has the same scalability problem as RSS: if you add a delay between refreshes to ease the load, an already-known spammer can still hit you in the meantime.
What we want is a query/response database where your Wiki submits all the URLs and gets a boolean reply. An RBL-style approach.
Some very basic, quick-and-dirty regexes could possibly solve much of the problem. Look for the drugs, gambling games and watch brands they're pushing. Instead of giving a "this is spam, rejected" message, give a cryptic technical-sounding error like "server timeout" or "internal server error".
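Something like this, say; the pattern list is purely illustrative, and a real deployment would pull its patterns from a maintained list such as the chongqed.org one above:

    my @spam_patterns = (
        qr/\b(?:viagra|cialis|phentermine)\b/i,
        qr/\bonline\s+casino\b/i,
        qr/\brolex\b/i,
    );

    sub reject_if_spammy {
        my ($q, $text) = @_;    # $q is a CGI object
        for my $re (@spam_patterns) {
            next unless $text =~ $re;
            # Deliberately vague, so the spammer can't tell what tripped it.
            print $q->header( -status => '500 Internal Server Error' );
            print "Internal server error. Please try again later.\n";
            exit;
        }
    }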
On Thu, Dec 16, 2004 at 02:48:54PM +0000, Rev Simon Rumble wrote:
Or, alternatively, look up the provided URLs in one of the spam databases that tracks such things.
And use the Spamassassin modules to do it!
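A hedged sketch of that, assuming the Mail::SpamAssassin 3.x parse/check interface; SpamAssassin expects a mail message, so the page text gets wrapped in minimal fake headers here:

    use Mail::SpamAssassin;

    sub spamassassin_score {
        my ($text) = @_;
        my $sa     = Mail::SpamAssassin->new();
        my $mail   = $sa->parse("From: wiki\@localhost\nSubject: edit\n\n$text");
        my $status = $sa->check($mail);
        my $score  = $status->get_hits;    # reject the edit above some threshold
        $status->finish;
        return $score;
    }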
On Thu, 16 Dec 2004, Jo Walsh wrote:
it seems like i am having quite a few frustrating spam-related conversations recently.
On Wed 15 Dec 2004, Rev Simon Rumble simon@rumble.net wrote:
There's a helluva lot of wiki spam on that site. :(
looking for ways to hold off wiki spam without having to require user login, i tried patching CGI::Wiki(::Kwiki) to set a 'human=probably' cookie that you couldn't edit pages without.
On heavily-spammed sites, though, it seems to have slowed the flood rather than stopped it.
I also changed the text 'Edit this page' to 'alter this page' on a couple of sites, to help avert brute-force Google-based attacks.
I fear that the way this is going, there is going to have to be a 'proof of humanity' test, e.g. recognising numbers that are heavily distorted on a noisy background like all the free web/mail services now do.
It's my impression that a lot of spam comes not from bots, but from people in China or elsewhere where labor is cheap. Seems like antibot techniques won't help there -- you'd need smarter software...
Justin
It's my impression that a lot of spam comes not from bots, but from people in China or elsewhere where labor is cheap. Seems like antibot techniques won't help there -- you'd need smarter software...
I can confirm that. My (Best Practical's) wiki has been getting pretty hammered. Adding captchas didn't even slow them down. Turning every external link into "This link blocked due to wikispam" didn't slow them down. And the sad thing is that since kwiki has a regexp-based parser, when they paste in 20k of Chinese hyperlinks, the page won't even _render_ any more. So it's not like they're getting google juice from me :/ A couple even left nasty notes saying "you keep fighting the spam. We're just going to keep hitting you. Give up."
At this point, I'm leaning strongly to requiring email verification for editors.
Justin
On Thu, Dec 16, 2004 at 08:30:39AM +0000, Kake L Pugh wrote:
On Wed 15 Dec 2004, Rev Simon Rumble simon@rumble.net wrote:
More that I was hoping for a discussion on how to deal with this.
People: Simon would like a discussion. Please discuss.
Dave Cantrell and I were talking about this last night, and this is the result:
http://openguides.org/dev/?node=Wiki%20Greylisting
Dave, perhaps you might like to elucidate on the basics of BGP for those of us who aren't familiar with it.
On Thu, Dec 16, 2004 at 06:27:30AM -0800, Jo Walsh wrote:
I fear that the way this is going, there is going to have to be a 'proof of humanity' test, e.g. recognising numbers that are heavily distorted on a noisy background like all the free web/mail services now do.
Unfortunately, even these are doomed. Firstly, because there are always more black-hat programmers who are willing to try and beat them (for a white-hat example, see: http://www.cs.berkeley.edu/~mori/gimpy/gimpy.html ); and secondly, because "they'll hire child workers to read your images and manually register/post/ping/trackback/whatever. (Already happening.)" [I don't know if it /is/ already happening, but it sounds feasible.] - http://diveintomark.org/archives/2003/11/15/more-spam
i am not crazy about having to log into websites, especially in wiki,
If it requires a login, it is not wiki. It may seem a lot like wiki, but it is not wiki. It is almost-wiki.
recently i've noticed a lot of pages like this, for example: http://joyce.eng.yale.edu/~bt/school/report.cgi/draft/kennecott.htm
Or this? http://downlode.org/perl/spamtrap/spamtrap.cgi
designed to wreak search-based web-email-crawler revenge on certain addresses?
Designed to be email-spider honey traps. Any spam poisoner should use robots.txt or a meta robots header to block all robots (mine does); that way the legitimate search engine robots will ignore their recursive black holes of gibberish and only black-hat spiders will get sucked in. In theory.
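For reference, the blanket ban is just this in robots.txt at the poisoner's document root:

    User-agent: *
    Disallow: /

or, per page, a <meta name="robots" content="noindex,nofollow"> tag in the head.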
webspam of this kind makes less sense.
If it shows up on the web it's because the programmer hasn't robot-banned it. More pernicious are the scads of websites made of random words at random URLs that do nothing but redirect you to some wanker trying to sell you something. Or those people who go buying up expired domains (like earlemartin.com! Yay!).
On Thu, 23 Dec 2004, Earle Martin wrote:
On Thu, Dec 16, 2004 at 08:30:39AM +0000, Kake L Pugh wrote:
On Wed 15 Dec 2004, Rev Simon Rumble simon@rumble.net wrote:
More that I was hoping for a discussion on how to deal with this.
People: Simon would like a discussion. Please discuss.
Dave Cantrell and I were talking about this last night, and this is the result:
http://openguides.org/dev/?node=Wiki%20Greylisting
Dave, perhaps you might like to elucidate on the basics of BGP for those of us who aren't familiar with it.
So what happens if a person who is greylisted makes a valid change? Do they get added to the whitelist?
On Thu, Dec 16, 2004 at 02:41:33PM +0000, Dominic Hargreaves wrote:
Then you can trivially require moderation for any edit that contains an external URL ref. This would block the vast majority of spam that I've seen on the OpenGuides so far, and would not be an unreasonable inconvenience to wiki users.
It would be if the threshold was "any external URL". Make the threshold something like 5 URLs and then you're talking.
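That would be a one-line change to the needs_moderation check sketched earlier; the count-of-matches idiom does the work, and 5 is just the threshold suggested above:

    sub needs_moderation {
        my ($text) = @_;
        # Count external URL references; only several of them trip moderation.
        my $url_count = () = $text =~ m{https?://}gi;
        return $url_count >= 5;
    }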
On Thu, Dec 16, 2004 at 08:30:39AM +0000, Kake L Pugh wrote:
This one time, at band camp, Kake L Pugh wrote:
Is the fact that you posted here instead of mailing me directly a sign that we need to make the email addresses of site admins more obvious?
On Wed 15 Dec 2004, Rev Simon Rumble simon@rumble.net wrote:
More that I was hoping for a discussion on how to deal with this.
I've changed the subject line to make this more apparent.
It's like war: the decision made soonest is the best. (Less artsy, more fartsy)
Stephen
On Dec 23, 2004, at 16:57, Earle Martin wrote:
On Thu, Dec 16, 2004 at 08:30:39AM +0000, Kake L Pugh wrote:
On Wed 15 Dec 2004, Rev Simon Rumble simon@rumble.net wrote:
More that I was hoping for a discussion on how to deal with this.
People: Simon would like a discussion. Please discuss.
Dave Cantrell and I were talking about this last night, and this is the result:
What would be the policy for getting added / removed from the grey/black lists?
tom
Bob wrote:
So what happens if a person who is greylisted makes a valid change? Do they get added to the whitelist?
And Tom wrote:
What would be the policy for getting added / removed from the grey/black lists?
I would say that:
0) Firstly, the grey/blacklist only takes effect when you try to save a page, unlike a .htaccess ban.
1) Grey/blacklist control is entirely manual, done by the admins. No auto-whitelisting.
2) If you're on one of the lists, the warning message that you receive when you try to save an edit should include a contact form or link to a contact form where you can say "hey, I'm a white hat!" - the admin would then explicitly whitelist their /24 block. Of course, if they were lying, splat.
I forgot to mention another thing that came up in discussion - there should probably be a "flag this page for review" link somewhere on every page. That would take you to a form that says "You have indicated that you think this page should be checked by a site administrator. Please enter the reason:" and have three boxes: reason for flassing (mandatory), name (optional) and email (optional). The submitted page would then get added to a simple queue for inspection by the site admins.
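A quick sketch of what the flagging handler might look like, assuming CGI.pm and a DBI handle; the review_queue table and its columns are invented for illustration:

    use CGI;
    use DBI;

    my $q      = CGI->new;
    my $reason = $q->param('reason')
        or die "A reason is mandatory\n";

    my $dbh = DBI->connect('dbi:SQLite:dbname=moderation.db', '', '',
                           { RaiseError => 1 });
    $dbh->do(
        'INSERT INTO review_queue (node, reason, name, email)
         VALUES (?, ?, ?, ?)',
        undef,
        $q->param('node'), $reason, $q->param('name'), $q->param('email'),
    );
    print $q->header, "Thanks - a site admin will take a look.\n";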
Right, I'm off out. Merry Christmas one and all!
On Fri, Dec 24, 2004 at 01:41:22PM +0000, Earle Martin wrote:
reason for flassing (mandatory)
Er... flagging, not flassing (which is something between flashing and flossing).
Earle Martin wrote:
On Thu, Dec 16, 2004 at 06:27:30AM -0800, Jo Walsh wrote:
I fear that the way this is going, there is going to have to be a 'proof of humanity' test, e.g. recognising numbers that are heavily distorted on a noisy background like all the free web/mail services now do.
Unfortunately, even these are doomed. Firstly, because there are always more black-hat programmers who are willing to try and beat them (for a white-hat example, see: http://www.cs.berkeley.edu/~mori/gimpy/gimpy.html ); and secondly, because "they'll hire child workers to read your images and manually register/post/ping/trackback/whatever. (Already happening.)"
No they won't, they'll re-serve the images onto porn sites and use sad lonely geeks to solve the problem for free.
On Thu, Dec 23, 2004 at 03:57:56PM +0000, Earle Martin wrote:
http://openguides.org/dev/?node=Wiki%20Greylisting
Dave, perhaps you might like to elucidate on the basics of BGP for those of us who aren't familiar with it.
Yes, and a correction to your image too :-)
BGP (Border Gateway Protocol) is what ISPs use to figure out how to route a packet from one host to another half way round the world. It works by ISPs "announcing" that they can route to (eg) all hosts in NETMASK/BITS. eg, 212.58.224.111 is a BBC web server. Auntie announces a route to 212.58.224.0/19. That notation means "the netblock which contains IP 212.58.224.0 and all other IPs which have the same first 19 bits". Obviously, the smaller the number of leading bits a block shares, the larger the block.
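To make the /BITS notation concrete, here is a pure-Perl check of whether an IP falls inside a given netblock (Net::Netmask does the same job more robustly):

    sub ip_to_int {
        my ($ip) = @_;
        return unpack 'N', pack 'C4', split /\./, $ip;
    }

    sub in_block {
        my ($ip, $block) = @_;
        my ($net, $bits) = split m{/}, $block;
        my $mask = $bits == 0 ? 0 : (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF;
        return (ip_to_int($ip) & $mask) == (ip_to_int($net) & $mask) ? 1 : 0;
    }

    print in_block('212.58.224.111', '212.58.224.0/19') ? "in block\n" : "not\n";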
The idea is that when a wikiadmin sees some spam, they would look at a view of the routing table (I have a script which does this using route-views.org's very nice DNS-ish view of the routing table at the University of Oregon) to see what netblock was being announced that contained the spammer's IP, and would greylist the entire block [Earle - it might be bigger or smaller than a /20]. They would also BLACKlist the spammer's /24 or, if the netblock is smaller than /24, blacklist just that smaller block.
We blacklist a /24 as well as greylisting a larger block for reasons of route aggregation, ISPs being spam-friendly, and so on. I can explain in great detail and at great length over a beer ;-)
I have a script which I use for this sort of stuff, contact me off-list if you want a copy.
It's very important to note that no blacklisting or greylisting should happen without an admin's say-so. Although I would very strongly recommend blacklisting all the networks at: http://www.spamhaus.org/drop/drop.lasso (explanation at http://www.spamhaus.org/drop/) and at least greylisting (but preferably blacklisting) all of: http://www.okean.com/sinokoreacidr.txt (explanation at http://www.okean.com/asianspamblocks.html)
FWIW, I have all of those netblocks blacklisted (plus several others) in my mail sewer config, and there is no noticeable performance hit.
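A sketch of pulling the DROP list and checking an editor's IP against it, reusing the in_block() helper above; the URL is the one Dave gives, but the file format (one CIDR per line, ';' comments) is an assumption worth re-checking:

    use LWP::Simple qw(get);

    sub ip_on_drop_list {
        my ($ip) = @_;
        # In real use, cache the fetched list rather than hitting
        # spamhaus.org on every edit.
        my $list = get('http://www.spamhaus.org/drop/drop.lasso') or return 0;
        for my $line (split /\n/, $list) {
            $line =~ s/;.*//;    # strip comments
            next unless $line =~ m{(\d+\.\d+\.\d+\.\d+/\d+)};
            return 1 if in_block($ip, $1);
        }
        return 0;
    }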