[OpenGuides-Dev] Re: [cpan #6386] [Comment] Search Engine has an issue with international characters - OpenGuides-Dev

9 Jan 2005


      (Quoting the whole thing since most people won't have read the RT thingy.)
On Sun 09 Jan 2005,  via RT comment-OpenGuides@rt.cpan.org wrote:
...
This message about OpenGuides was sent to you by DOM via rt.cpan.org
Full context and any attached attachments can be found at:
<URL: https://rt.cpan.org/Ticket/Display.html?id=6386 >
This is a comment.  It is not sent to the Requestor(s):
[IVORW - Sat May 22 17:14:21 2004]:
...
[MRAMBERG - Sat May 22 17:01:40 2004]:
...
Namely, it doesn't support them, so searches for things like
   grünerløkka
and grønland are moot.
I will look at expanding the allowed character set for search strings 
to take into account internationalization. We need to see if there are 
issues with Search::InvertedIndex, and make sure that it tokenizes in 
an i18n way.
Work on this should probably be concentrated on Plucene - see other utf8
bugs in OpenGuides though.
If we want Unicode searches to work we may need to canonicalise
characters before indexing (and also before searching the indexes of
course) since plenty of characters have more than one Unicode
representation (eg e-acute can be a single "e-acute" code point or "e" +
"combining acute").  I'm not sure how Perl copes with this issue - anyone?
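As an illustration of the canonicalisation step, here is a minimal sketch in Python (not the OpenGuides code, and the function name canonicalise is made up for the example).  It uses the standard unicodedata module to bring both spellings of e-acute to Normalization Form C.  In Perl, the Unicode::Normalize module provides NFC and NFD functions that should serve the same purpose.

```python
import unicodedata


def canonicalise(text: str) -> str:
    """Return text in Unicode Normalization Form C (precomposed form)."""
    return unicodedata.normalize("NFC", text)


precomposed = "\u00e9"   # e-acute as a single code point
combining = "e\u0301"    # "e" followed by a combining acute accent

# The raw strings differ, but they compare equal once normalised.
assert precomposed != combining
assert canonicalise(precomposed) == canonicalise(combining)
```

The same function would have to be applied both when indexing and when handling a search term, otherwise the two sides could still disagree.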
I was thinking if we're taking the trouble to do this it might also be
a plan to canonicalise further, eg store and search for "e-acute" as
simply "e".  Not everyone can type accented characters so either the
data or the search term could be typed as plain "e", and it would be
nice to Do The Right Thing.
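A rough sketch of that further folding, again in Python and with a hypothetical function name (fold_accents): decompose the text, drop the combining marks, and lower-case what is left.  This is one possible approach under those assumptions, not a description of anything OpenGuides does.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Decompose text (NFD), drop combining marks, and case-fold the result."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Accented and unaccented spellings end up as the same key.
assert fold_accents("Caf\u00e9") == fold_accents("cafe")
```

One caveat: this only helps for characters that decompose into a base letter plus a combining mark.  Letters such as ø have no such decomposition, so they would pass through unchanged and would need an explicit mapping if they are to match a plain "o".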
I think the right place to do this could be in a new CGI::Wiki::Search::*
module that could sit on top of either Search::InvertedIndex or Plucene or
even a simple table in the database.
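To make the layering idea concrete, here is a small Python sketch.  All names in it (fold, DictBackend, FoldingSearch, index_node) are hypothetical and do not correspond to any CGI::Wiki API; the in-memory backend merely stands in for Search::InvertedIndex, Plucene or a database table.  The point is only that folding happens in one place, before text reaches whichever backend is plugged in.

```python
import unicodedata
from collections import defaultdict


def fold(text):
    """Strip combining marks and case-fold, so "café" and "CAFE" agree."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


class DictBackend:
    """Stand-in backend: a plain in-memory inverted index."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, token, doc_id):
        self._postings[token].add(doc_id)

    def lookup(self, token):
        return set(self._postings.get(token, set()))


class FoldingSearch:
    """Layer that folds text before handing it to any backend."""

    def __init__(self, backend):
        self.backend = backend

    def index_node(self, name, content):
        # Fold at index time, then split on whitespace into tokens.
        for token in fold(content).split():
            self.backend.add(token, name)

    def search(self, query):
        # Fold the query the same way; every query word must match.
        tokens = fold(query).split()
        if not tokens:
            return set()
        results = self.backend.lookup(tokens[0])
        for token in tokens[1:]:
            results &= self.backend.lookup(token)
        return results


search = FoldingSearch(DictBackend())
search.index_node("Cafe Page", "A nice caf\u00e9 in town")
assert search.search("cafe") == {"Cafe Page"}
```

Because the backend only ever sees folded tokens, swapping it for another one would not change how accented input is treated.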
I'm not working on this at the moment, just occasionally thinking about
the issues.
Kake