hello,
i've been crawling the OG pages' rdf to use in our wireless portal application. i have a longish list of pages that gave me back toxic RDF/XML; it follows here.
most of this i think is either URIs included as regular string data that have '&' characters in them, or the odd windows or unicode special character in the URI name.
better detox on the RDF pages would be hella cool; i would love to have summaries or full page content attached too, and detoxifying those is maybe a bit harder...
zx
--
http://london.openguides.org/index.cgi?id=Boathouse,_SW15_2JX;format=rdf#obj
http://london.openguides.org/index.cgi?id=Britannia,_W8_6UX;format=rdf#obj
http://london.openguides.org/index.cgi?id=British_Museum;format=rdf#obj
http://london.openguides.org/index.cgi?id=Chinese_Dinner,_SW16_5JF;format=rd...
http://london.openguides.org/index.cgi?id=Coach_And_Horses,_SW13_9LW;format=...
http://london.openguides.org/index.cgi?id=Common_Cafe,_SW16 5NP;format=rdf#obj
http://london.openguides.org/index.cgi?id=Dirty_Dick%27s,_EC2M_4NR;format=rd...
http://london.openguides.org/index.cgi?id=Eel_Pie,_TW1_3NJ;format=rdf#obj
http://london.openguides.org/index.cgi?id=Food_Bazaar,_WC1X_8TL;format=rdf#o...
http://london.openguides.org/index.cgi?id=Gili_Gulu,_WC2H_9EP;format=rdf#obj
http://london.openguides.org/index.cgi?id=Honglee,_CR0_1NG;format=rdf#obj
http://london.openguides.org/index.cgi?id=Jerusalem_Tavern,_EC1M_5NA;format=...
http://london.openguides.org/index.cgi?id=John_Lewis,_Oxford_Street;format=r...
http://london.openguides.org/index.cgi?id=Kelong,_CR0_4RF;format=rdf#obj
http://london.openguides.org/index.cgi?id=Market_Porter,_SE1_9AA;format=rdf#...
http://london.openguides.org/index.cgi?id=Morgan_M;format=rdf#obj
http://london.openguides.org/index.cgi?id=National_Gallery;format=rdf#obj
http://london.openguides.org/index.cgi?id=Okawari,_W5_5AP;format=rdf#obj
http://london.openguides.org/index.cgi?id=Osushi,_CRO_1BF;format=rdf#obj
http://london.openguides.org/index.cgi?id=Peter_Jones;format=rdf#obj
http://london.openguides.org/index.cgi?id=Pied_Bull,_SW16_3QB;format=rdf#obj
http://london.openguides.org/index.cgi?id=Rudi%27s_Sandwich_Bar,_WC1X_8TP;fo...
http://london.openguides.org/index.cgi?id=Sesame,_WC1N_3NG;format=rdf#obj
http://london.openguides.org/index.cgi?id=St_John,_EC1M_4AY;format=rdf#obj
http://london.openguides.org/index.cgi?id=Surrey_Street_Market,_Croydon;form...
http://london.openguides.org/index.cgi?id=Tagine,_SW12_9RT;format=rdf#obj
http://london.openguides.org/index.cgi?id=Tandoori_Garden,_SW6_7SR;format=rd...
http://london.openguides.org/index.cgi?id=Tesco,_E11_1HT;format=rdf#obj
http://london.openguides.org/index.cgi?id=Tesco,_E9_6ND;format=rdf#obj
http://london.openguides.org/index.cgi?id=The_Place_Below,_EC2V_6AU;format=r...
http://london.openguides.org/index.cgi?id=Victoria_Station;format=rdf#obj
http://london.openguides.org/index.cgi?id=Wonderful,_CR0_1AE;format=rdf#obj
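To illustrate the '&' problem described in this message, here is a minimal sketch in Python (OpenGuides itself is Perl, so this shows the idea only and is not the project's code). The element name `link` and the example URL are made up for the illustration. A raw '&' copied into character data makes the document ill-formed, while escaping it as `&amp;` lets it parse.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError
from xml.sax.saxutils import escape


def xml_element(tag: str, text: str) -> str:
    """Wrap text in <tag>...</tag>, escaping &, < and > in the text."""
    return "<{0}>{1}</{0}>".format(tag, escape(text))


def parses(xml: str) -> bool:
    """Return True if the string is well-formed XML."""
    try:
        minidom.parseString(xml)
        return True
    except ExpatError:
        return False


# Hypothetical URL with a raw '&'; it is not one of the pages listed above.
url = "http://example.org/index.cgi?id=Some_Page&format=rdf"

raw = "<link>" + url + "</link>"   # '&' copied in verbatim
safe = xml_element("link", url)    # '&' written as '&amp;'
```

Here `parses(raw)` is false and `parses(safe)` is true. Attribute values would need quote characters escaped as well, which this sketch does not cover.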
On Mon, May 16, 2005 at 10:40:19AM -0700, Jo Walsh wrote:
> the odd windows or unicode special character
Can anyone advise on the best way to strip these characters out at editing time? And is it better to just throw them away, or is there a reliable way of de-weirding them?
On Tue, May 17, 2005 at 12:55:17AM +0100, Earle Martin wrote:
> Can anyone advise on the best way to strip these characters out at editing time? And is it better to just throw them away, or is there a reliable way of de-weirding them?
I should add that newpage.cgi needs to be patched to strip them out, and index.cgi should probably bring up an error if you try to edit a page with a bad character (like '%A0', which I just found in a node name) in the name of the node.
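One possible shape for the "de-weirding" step, sketched in Python rather than in the Perl of newpage.cgi. It rests on several assumptions: that '%A0' is a URL-encoded Latin-1 no-break space, that the "windows" characters are mostly curly quotes and dashes, and that mapping them to plain ASCII is acceptable. The replacement table and the function name `deweird` are hypothetical.

```python
import unicodedata

# Assumed mapping from common "weird" characters to plain equivalents.
REPLACEMENTS = {
    "\u00a0": " ",   # no-break space (the '%A0' case, assuming Latin-1)
    "\u2018": "'",   # left single curly quote
    "\u2019": "'",   # right single curly quote
    "\u201c": '"',   # left double curly quote
    "\u201d": '"',   # right double curly quote
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
}


def deweird(name: str) -> str:
    """Map known odd characters to ASCII, drop control characters,
    and collapse runs of whitespace into single spaces."""
    name = "".join(REPLACEMENTS.get(ch, ch) for ch in name)
    name = "".join(ch for ch in name if unicodedata.category(ch) != "Cc")
    return " ".join(name.split())
```

For example, `deweird("Dirty\u00a0Dick\u2019s")` gives `"Dirty Dick's"`. Characters not in the table are passed through unchanged, so a stricter check would still be needed to reject anything unexpected.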
----- Original Message -----
From: "Earle Martin" openguides@downlode.org
To: "OpenGuides software developers" openguides-dev@openguides.org
Sent: 17 May 2005 01:06
Subject: Re: [OpenGuides-Dev] toxic XML URIs
> On Tue, May 17, 2005 at 12:55:17AM +0100, Earle Martin wrote:
> > Can anyone advise on the best way to strip these characters out at editing time? And is it better to just throw them away, or is there a reliable way of de-weirding them?
> I should add that newpage.cgi needs to be patched to strip them out, and index.cgi should probably bring up an error if you try to edit a page with a bad character (like '%A0', which I just found in a node name) in the name of the node.
Sounds like a job to do in the untaint routine inside newpage.cgi. Presuming that newpage.cgi works with taint mode. ???
On Tue, May 17, 2005 at 06:27:32AM +0100, IvorW wrote:
> > I should add that newpage.cgi needs to be patched to strip them out, and index.cgi should probably bring up an error if you try to edit a page with a bad character (like '%A0', which I just found in a node name) in the name of the node.
> Sounds like a job to do in the untaint routine inside newpage.cgi. Presuming that newpage.cgi works with taint mode. ???
Actually, it doesn't at the moment. I've made a start however at stripping out badness; it's actually running live as http://london.openguides.org/newpage.cgi (code: newpage.txt in same dir).
----- Original Message -----
From: "Earle Martin" openguides@downlode.org
To: "OpenGuides software developers" openguides-dev@openguides.org
Sent: 19 May 2005 01:40
Subject: Re: [OpenGuides-Dev] toxic XML URIs
> On Tue, May 17, 2005 at 06:27:32AM +0100, IvorW wrote:
> > > I should add that newpage.cgi needs to be patched to strip them out, and index.cgi should probably bring up an error if you try to edit a page with a bad character (like '%A0', which I just found in a node name) in the name of the node.
> > Sounds like a job to do in the untaint routine inside newpage.cgi. Presuming that newpage.cgi works with taint mode. ???
> Actually, it doesn't at the moment. I've made a start however at stripping out badness; it's actually running live as http://london.openguides.org/newpage.cgi (code: newpage.txt in same dir).
There is a nice example using a regex to untaint in Ovid's CGI course: http://users.easystreet.com/ovid/cgi_course/lessons/lesson_three.html
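The regex approach can be sketched as a whitelist: match the whole input against a pattern of allowed characters, use only what matched, and refuse everything else. The sketch below is in Python, which has no taint mode, so it mirrors only the pattern and is not copied from the linked lesson or from newpage.cgi. The allowed character set and the name `checked_node_name` are assumptions.

```python
import re

# Assumed whitelist: ASCII letters, digits, space, and a little punctuation.
NODE_NAME_RE = re.compile(r"[A-Za-z0-9 _,.'()-]+")


def checked_node_name(raw: str) -> str:
    """Return the node name if every character is on the whitelist,
    otherwise raise ValueError so the caller can show an error page."""
    match = NODE_NAME_RE.fullmatch(raw)
    if match is None:
        raise ValueError("node name contains a disallowed character")
    return match.group(0)
```

With this pattern `checked_node_name("Eel Pie, TW1 3NJ")` returns the name unchanged, while a name containing a no-break space or an '&' raises `ValueError`. Rejecting rather than silently stripping matches the suggestion earlier in the thread that index.cgi should raise an error on a bad character.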
openguides-dev@lists.openguides.org