(Quoting the whole thing since most people won't have read the RT thingy.)
On Sun 09 Jan 2005, via RT comment-OpenGuides@rt.cpan.org wrote:
This message about OpenGuides was sent to you by DOM via rt.cpan.org
Full context and any attached attachments can be found at: <URL: https://rt.cpan.org/Ticket/Display.html?id=6386 >
This is a comment. It is not sent to the Requestor(s):
[IVORW - Sat May 22 17:14:21 2004]:
[MRAMBERG - Sat May 22 17:01:40 2004]:
Namely, it doesn't support them, so searches for things like grünerløkka and grønland are moot.
I will look at expanding the allowed character set for search strings to take into account internationalization. We need to see if there are issues with Search::InvertedIndex, and make sure that it tokenizes in an i18n way.
Work on this should probably be concentrated on Plucene - see other utf8 bugs in OpenGuides though.
If we want Unicode searches to work we may need to canonicalise characters before indexing (and also before searching the indexes, of course), since plenty of characters can be written as more than one sequence of Unicode code points (eg e-acute can be the single code point "e-acute" or "e" + "combining acute"). I'm not sure how Perl copes with this issue - anyone?
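To make the canonicalisation point concrete, here is a minimal sketch in Python (chosen only for brevity; OpenGuides itself is Perl, where the Unicode::Normalize module offers equivalent NFC and NFD functions). The function name canonicalise and the sample word are made up for illustration, and using NFC rather than another normalisation form is an assumption, not something the ticket settles.

```python
import unicodedata

def canonicalise(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed form).

    Hypothetical helper: the idea is to call it on page text before
    indexing and on the query string before searching.
    """
    return unicodedata.normalize("NFC", text)

precomposed = "caf\u00e9"   # e-acute as the single code point U+00E9
combining = "cafe\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The raw strings differ, so a naive index lookup would miss one of them.
assert precomposed != combining

# After canonicalisation both spellings compare equal.
assert canonicalise(precomposed) == canonicalise(combining)
```

Whichever form is picked, the important part is that the indexer and the query path both use the same one.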
I was thinking if we're taking the trouble to do this it might also be a plan to canonicalise further, eg store and search for "e-acute" as simply "e". Not everyone can type accented characters so either the data or the search term could be typed as plain "e", and it would be nice to Do The Right Thing.
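One possible way to do this further folding, sketched in Python rather than Perl purely for illustration: decompose each character, then drop the combining marks. The name fold_accents is hypothetical. Note the limitation: letters with no canonical decomposition, such as "ø", pass through unchanged, so they would need an explicit mapping table if we wanted them to match plain "o".

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Strip combining marks so that e-acute is stored and searched as "e".

    Hypothetical helper, assumed to run on both the indexed text and the
    search term.
    """
    # NFD splits e-acute into "e" + U+0301 COMBINING ACUTE ACCENT.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero combining class for marks such as U+0301.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

assert fold_accents("caf\u00e9") == "cafe"
# u-umlaut folds to "u", but o-with-stroke has no decomposition and stays.
assert fold_accents("gr\u00fcnerl\u00f8kka") == "grunerl\u00f8kka"
```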
I think the right place to do this could be in a new CGI::Wiki::Search::* that could sit on top of either Search::InvertedIndex or Plucene or even a simple table in the database.
I'm not working on this at the moment, just occasionally thinking about the issues.
Kake