Here's a log of a conversation Ivor and I had on IRC the other week about searching - meant to post it before, but forgot. Comments?
Cheers,
Earle.
00:09 -!- Irssi: Starting query in perl with ivorw
00:09 <ivorw> Hi, I've been thinking about how to restructure the search on OG
00:12 <hex> hi.
00:12 <hex> oho? do go on...
00:13 <ivorw> I'm thinking that we keep the existing idea of priming a page cache that is then searched
00:14 * hex nods
00:14 <ivorw> However, the cache can be primed with pages resulting from an SQL query, in addition to those from a keyword inverted index search
00:15 <ivorw> The idea is that including locale=camden causes all locale camden pages to be loaded into the cache
00:16 <ivorw> But a full and & or syntax is available for applying to the cache
00:17 <ivorw> I'm now looking for a syntax for metadata qualifiers on a search
00:19 <hex> hmm... so the search would be faster because locale camden would be cached?
00:20 <ivorw> hex, not quite - the search tree only works over a single hash (cache). The idea is that everything that could possibly match the search is pre-loaded into the cache.
00:21 <ivorw> Prior to O/G, the usemod search worked by slurping the whole wiki into the hash every time.
00:22 <hex> blimey
00:22 <ivorw> Although this works, SII et al provide a better mechanism for subsetting the data, and something that will not blow the server up with a substantial query and dataset
00:23 <hex> right, yes.
00:23 <ivorw> My bodge (which works prety well) is to 'prime' the input hash with the results of an inverted index search on all of the keywords supplied - regardless of and/or syntax
00:24 <ivorw> However, only the SII is used, so the search will only find words in the body text of the page, not in the metadata
00:25 <hex> so we need a mechanism for searching metadata?
00:26 <ivorw> I want to keep this idea of a primed cache, and load it with SQL query results, to provide just that
00:28 <ivorw> How about: King's Head&locale=acton&category=real ale
00:29 <ivorw> Or, how about: locale=west end&category=pubs
00:30 <hex> ah, you mean search syntax
00:30 <hex> I quite like Google style
00:30 <hex> King's Head locale:"West End" category:Pubs
00:31 <ivorw> Ah, but are spaces allowed in metadata field names?
00:32 <hex> I dunno, but surely that could be handled without the user being involved...
00:32 <hex> s/ /_/g on the fly sort of thing and vice versa
00:33 <ivorw> Just thinking that some syntax might be tricky or ambiguous
00:34 <ivorw> the last post code:foo #Is this matching on post code or code?
00:35 <hex> oh, right!
00:35 <hex> no, I think all metadata is one word
00:36 <hex> there could always be a pidgin syntax for it anyway
00:36 <hex> phone:12345
00:36 <ivorw> taken from google again (dig them pidgeons :)
00:37 <ivorw> If we can name every meta field with \w chars, this will be OK
00:38 <ivorw> We could have aliases, phone, telephone, tel, etc.
00:39 <hex> yup!
00:39 <ivorw> Given my previous suggestion, I would quite like to be able to do regexp matches
00:40 <ivorw> e.g. king's head&post_code=~W3
00:41 <hex> must the search terms be joined by '&'?
00:42 <hex> I would have thought a magic word would suffice, or the =
00:42 <ivorw> that was just my previous syntax.
00:42 <hex> ah, I follow.
00:42 <hex> but yes, regexen++
00:44 <ivorw> How about dropping the &s: locale=west end category=pubs distance(530546,181503)<200
00:45 <ivorw> Note, that's the grid ref of Holborn tube
00:46 <hex> I prefer colons to equals, simply because I think people are more used to Google style
00:46 <hex> but that may just be me
00:47 <ivorw> How about colons for a straight match, and =~ for a regexp
00:47 <hex> "west end" would need quotes so you don't search for "end" in "locale:west"
00:47 <hex> yes, that sounds great.
00:48 <ivorw> might want to delimit the regexp
00:48 <hex> actually, could it be :~ for a regexp, for consistency?
00:48 <hex> I know that's not very perlish....
00:48 <ivorw> yup, why not tho
00:48 <hex> cool.
00:49 <ivorw> what do you think of my idea of a distance 'function'?
00:51 <hex> how about:
00:51 <hex> near:530546,181503 range:200m
00:51 <hex> a little easier to read
00:51 <hex> (and write)
00:51 <hex> I love the idea
00:52 <ivorw> with a default range presumably
00:52 <hex> hmm, yes
00:52 <hex> units: m, ft, yds, mi, km
00:52 <hex> (maybe)
00:53 <ivorw> Didn't someone give a talk on a module to handle dimensions?
00:53 <hex> dunno :)
00:53 <ivorw> I recall one last year in State51
00:54 <ivorw> Alex Gough: - Meaningful Strong Typing with Data::Dimensions
00:55 <ivorw> unfortunately, the link for the slides is broken :(
00:55 <hex> bug him on the list:)
00:55 <hex> listen, I must run, or rather sleep, my eyes are closing
00:56 <hex> this is very promising stuff
00:56 <ivorw> OK, noe wurriz - Thanks for the braindump receptacle
00:56 <hex> no problem!
00:56 <hex> seeya...
00:56 <ivorw> nn
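For anyone wanting to play with the near:/range: idea from the end of the log, here is a rough sketch of how the check might work. It is not OpenGuides code: the names range_in_metres and is_near are made up for illustration, and it assumes both points are grid references in metres, so that straight-line distance is plain Pythagoras. The unit list is the one floated in the log.

```perl
use strict;
use warnings;

# Conversion factors to metres for the units suggested in the log.
my %METRES_PER = (
    m   => 1,
    ft  => 0.3048,
    yds => 0.9144,
    mi  => 1609.344,
    km  => 1000,
);

# Turn "200m" or "0.5mi" into metres; a bare number is taken as metres.
# (Hypothetical helper, not part of OpenGuides.)
sub range_in_metres {
    my ($range) = @_;
    my ( $n, $unit ) = $range =~ /^(\d+(?:\.\d+)?)\s*([a-z]*)$/i
        or die "Bad range: $range\n";
    $unit = lc($unit) || 'm';
    exists $METRES_PER{$unit} or die "Unknown unit: $unit\n";
    return $n * $METRES_PER{$unit};
}

# True if the point (x, y) lies within $range of "easting,northing".
# Assumes all coordinates are grid references measured in metres.
sub is_near {
    my ( $x, $y, $near, $range ) = @_;
    my ( $nx, $ny ) = split /,/, $near;
    my $dist = sqrt( ( $x - $nx )**2 + ( $y - $ny )**2 );
    return $dist <= range_in_metres($range) ? 1 : 0;
}

# A point 100m east of the grid ref quoted in the log.
print is_near( 530646, 181503, '530546,181503', '200m' ) ? "near\n" : "far\n";
```

A default range, as Ivor suggests, would just be a fallback value passed in when the user gives near: without range:.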
On Tue 23 Sep 2003, Earle Martin openguides@downlode.org wrote:
Here's a log of a conversation Ivor and I had on IRC the other week about searching - meant to post it before, but forgot. Comments?
First of all, here's an update on where we are with the search, as of OpenGuides 0.25 (just released).
The search box now searches for your terms not only in the titles and bodies of the nodes, but also in the locales and categories. I feel this is much more likely to provide the search results that a user expects. We now require CGI::Wiki 0.49 (also just released) in order that the locale and category searches can be made case-insensitive.
00:13 <ivorw> I'm thinking that we keep the existing idea of priming a page cache that is then searched
00:14 * hex nods
00:14 <ivorw> However, the cache can be primed with pages resulting from an SQL query, in addition to those from a keyword inverted index search
You don't need to do an SQL query, as CGI::Wiki's ->list_nodes_by_metadata method will do this for you.
00:28 <ivorw> How about: King's Head&locale=acton&category=real ale
00:29 <ivorw> Or, how about: locale=west end&category=pubs
00:30 <hex> ah, you mean search syntax
00:30 <hex> I quite like Google style
00:30 <hex> King's Head locale:"West End" category:Pubs
I'm not always clear, reading this conversation, on whether the thing being discussed at any one time is the thing that the user types into the search box, or the query string that will appear in the URL, or the API of OpenGuides::SuperSearch, or what.
00:31 <ivorw> Ah, but are spaces allowed in metadata field names?
Metadata type and value can be whatever you wish.
00:39 <ivorw> Given my previous suggestion, I would quite like to be able to do regexp matches
00:40 <ivorw> e.g. king's head&post_code=~W3
Would anyone actually use this, practically speaking? Perhaps a fuzzy match would be more useful. We do in fact already have a method for fuzzy matching on titles; it's just that nobody has written a user interface for it yet.
In general, I really don't want to have to remember yet another syntax for writing search queries. How about we have an "advanced search" page, with little boxes or even dropdowns so it's absolutely clear which field will be searched for the things you type in.
Search for: _______________________
Locale(s): [ -- select -- ]
Category(ies): [ -- select -- ]
Kind of along the lines of the old Grub and Pub searches, which I hope to find time to resurrect this week.
Kake
On Tue, Sep 23, 2003 at 05:32:57PM +0100, Kate L Pugh wrote:
OpenGuides 0.25 (just released).
Cool.
00:29 <ivorw> Or, how about: locale=west end&category=pubs
00:30 <hex> King's Head locale:"West End" category:Pubs
I'm not always clear, reading this conversation, on whether the thing being discussed at any one time is the thing that the user types into the search box, or the query string that will appear in the URL, or the API of OpenGuides::SuperSearch, or what.
We were talking about what the user would be typing into the search box at this point.
00:40 <ivorw> e.g. king's head&post_code=~W3
Would anyone actually use this, practically speaking? Perhaps a fuzzy match would be more useful.
I would think it might, yes. Regexen are very specific; I don't think I'd have much use for them.
In general, I really don't want to have to remember yet another syntax for writing search queries. How about we have an "advanced search" page
I've been wanting this for quite a while. However, your point about yet another syntax is why I mentioned Google style in the conversation - Google lets you do advanced queries from the generic box, but gives you an advanced search page as well. I'd like this format for search terms (where metadata1 and metadata2 are two arbitrary metadata types like locale and category):
term1 [metadata1]:term2 [metadata2]:"term3 term4"
This would search for a node with the word term1, and the word term2 in its metadata1 and the phrase "term3 term4" in its metadata2. A less generic example:
fish category:Restaurants locale:"West London"
So that would search for a restaurant in West London that mentions the word "fish". Very straightforward, I think. However, since the new OG release "searches for your terms not only in the titles and bodies of the nodes, but also in the locales and categories", I guess that makes this redundant.
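To make the proposed format concrete, here is a minimal sketch of how a search box string in that style could be split into plain terms and metadata constraints. This is only an illustration, not anything in OpenGuides::SuperSearch: parse_query and the %ALIAS table are hypothetical names, and it assumes metadata field names are single \w words, as discussed in the log.

```perl
use strict;
use warnings;

# Hypothetical aliases, following the "phone, telephone, tel" idea.
my %ALIAS = ( tel => 'phone', telephone => 'phone' );

# Split a query such as
#   fish category:Restaurants locale:"West London"
# into a list of plain terms and a hash of metadata constraints.
sub parse_query {
    my ($query) = @_;
    my ( @terms, %meta );

    # Each token is an optional "field:" prefix followed by either a
    # double-quoted phrase or a run of non-space characters.
    while ( $query =~ /\G\s*(?:(\w+):)?(?:"([^"]*)"|(\S+))/gc ) {
        my ( $field, $value ) = ( $1, defined $2 ? $2 : $3 );
        if ( defined $field ) {
            $field = lc $field;
            $field = $ALIAS{$field} if exists $ALIAS{$field};
            push @{ $meta{$field} }, $value;
        }
        else {
            push @terms, $value;
        }
    }
    return ( \@terms, \%meta );
}

my ( $terms, $meta ) = parse_query('fish category:Restaurants locale:"West London"');
print "terms: @$terms\n";
print "$_: @{ $meta->{$_} }\n" for sort keys %$meta;
```

The same structure could equally be filled in from the separate boxes of an advanced search form, which is one way to have both entry points feed one search.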
On Tue 23 Sep 2003, Earle Martin openguides@downlode.org wrote:
I've been wanting this for quite a while. However, your point about yet another syntax is why I mentioned Google style in the conversation - Google lets you do advanced queries from the generic box, but gives you an advanced search page as well. I'd like this format for search terms (where metadata1 and metadata2 are two arbitrary metadata types like locale and category):
term1 [metadata1]:term2 [metadata2]:"term3 term4"
Ah, yes, so like google you could search either by typing in the individual boxes or by shortcutting to the syntax above.
However, since the new OG release "searches for your terms not only in the titles and bodies of the nodes, but also in the locales and categories", I guess that makes this redundant.
I think the "one box" simple search approach is quite adequate at the moment - does anyone have any evidence otherwise? ie, have you found anomalous search results on any of the openguides installs that you believe a more complicated search syntax would fix?
Regarding search more generally, I think that words typed into the search box should default to an "AND" search, which I don't believe they do at the moment.
Kake
On Thu, Oct 09, 2003 at 06:00:04PM +0100, Kate L Pugh wrote:
term1 [metadata1]:term2 [metadata2]:"term3 term4"
Ah, yes, so like google you could search either by typing in the individual boxes or by shortcutting to the syntax above.
I think the "one box" simple search approach is quite adequate at the moment - does anyone have any evidence otherwise?
My preferred approach is "one box" as well, but *also* to have an "advanced search" page, much like this one:
http://www.google.com/advanced_search?hl=en
- basically, Google does it right.
Regarding search more generally, I think that words typed into the search box should default to an "AND" search, which I don't believe they do at the moment.
No, they don't, which is definitely not right. This should return the Windsor Castle as the first hit:
http://openguides.org/london/supersearch.cgi?search=sausages+oysters
You currently have to search for "sausages & oysters", which feels rather old-fashioned.
On Thu 09 Oct 2003, Earle Martin openguides@downlode.org wrote:
My preferred approach is "one box" as well, but *also* to have an "advanced search" page, much like this one:
I'm just reluctant to expend effort on creating a more advanced search thing when (a) nobody's expressed a real need for it, and (b) the simple search we have still needs work.
Kake wrote:
Regarding search more generally, I think that words typed into the search box should default to an "AND" search, which I don't believe they do at the moment.
Earle wrote:
No, they don't, which is definitely not right. This should return the Windsor Castle as the first hit:
http://openguides.org/london/supersearch.cgi?search=sausages+oysters
I've had a bash about at it, but I'm hampered by not having learnt Parse::RecDescent yet. From my reading of the code, I'd have thought that searches should default to OR, but they're not, so I'm clearly misunderstanding the code.
Ivor: Could you take the time to comment the RecDescent grammar in OpenGuides::SuperSearch, so I have something to go on? If you won't find time to do it soon then we may have to put RecDescent aside for now and use something simpler.
Or in fact if anyone else here understands Parse::RecDescent, could you put me straight? The code is here:
http://search.cpan.org/src/KAKE/OpenGuides-0.25/lib/OpenGuides/SuperSearch.p...
in the _perform_search method.
Thanks,
Kake
----- Original Message -----
From: "Kate L Pugh" kake@earth.li
To: "Discussion of development on the OpenGuides software." openguides-dev@openguides.org
Sent: 09 October 2003 20:25
Subject: Re: [OpenGuides-Dev] Search and search syntax
On Thu 09 Oct 2003, Earle Martin openguides@downlode.org wrote:
My preferred approach is "one box" as well, but *also* to have an "advanced search" page, much like this one:
I'm just reluctant to expend effort on creating a more advanced search thing when (a) nobody's expressed a real need for it, and (b) the simple search we have still needs work.
Fine, I agree that the search needs work, but first I think we need to agree what direction we are working towards.
What I had in mind was a common search cgi script, which could be passed a single text string, or could be passed a form with multiple fields in it, corresponding to the "advanced" search. Either way, it's the same cgi script being invoked.
Kake wrote:
Regarding search more generally, I think that words typed into the search box should default to an "AND" search, which I don't believe they do at the moment.
Earle wrote:
No, they don't, which is definitely not right. This should return the Windsor Castle as the first hit:
http://openguides.org/london/supersearch.cgi?search=sausages+oysters
I've had a bash about at it, but I'm hampered by not having learnt Parse::RecDescent yet. From my reading of the code, I'd have thought that searches should default to OR, but they're not, so I'm clearly misunderstanding the code.
In fact, it's not AND or OR by default, but a phrase search. In this case looking for "sausages oysters", which doesn't make much sense.
Ivor: Could you take the time to comment the RecDescent grammar in OpenGuides::SuperSearch, so I have something to go on? If you won't find time to do it soon then we may have to put RecDescent aside for now and use something simpler.
I'm wondering what level of commenting you need, as I didn't think that the grammar was _that_ difficult to understand - apologies.
Anyway, here is an alternative grammar, a drop-in replacement, which does a Google-style AND by default and gives you OR if you separate the words with commas. Phrase search is still available, by passing a string bounded by "". Note that this has two new treenode types, "meta" and "regexp", but these are commented out, as there is no code to handle these node types.
my $parse = Parse::RecDescent->new(q{

    search: list eostring {$return = $item[1]}
        #A search is a list followed by an eostring

    list: comby(s)
        #A list is a list of one or more combys
        {$return = (@{$item[1]}>1) ? ['AND', @{$item[1]}] : $item[1][0]}

    comby: <leftop: term ',' term>
        #A comby is a list of terms separated by commas
        {$return = (@{$item[1]}>1) ? ['OR', @{$item[1]}] : $item[1][0]}

    # A term can be one of a number of things

    term: '(' list ')' {$return = $item[2]}
            #A subexpression in parentheses
        | '!' term {$return = ['NOT', @{$item[2]}]}
            #A '!' (NOT) expression
    #   | word ':' term {$return = ['meta', $item[1], $item[3]];}
    #       #A metadata lookup
        | '"' word(s) '"' {$return = ['word', @{$item[2]}]}
            #A phrase in ""
        | word {$return = ['word', $item[1]]}
            # A word
    #   | m([/|\]) m([^$item[1]]+) $item[1]
    #       # A regexp bounded by '/' '|' or ''
    #       { $return = ['regexp', qr($item[2])] }

    word: /[\w'*%]+/ {$return = $item[1]}
        #A word can contain wildcards * and % also '

    eostring: /^\Z/

});
Hope this helps,
Ivor.
On Sat 11 Oct 2003, Ivor Williams ivor.williams@tiscali.co.uk wrote:
What I had in mind was a common search cgi script, which could be passed a single text string, or could be passed a form with multiple fields in it, corresponding to the "advanced" search. Either way, it's the same cgi script being invoked.
Yes.
In fact, it's not AND or OR by default, but a phrase search. In this case looking for "sausages oysters", which doesn't make much sense.
I think I'm going to have to have another look at the code when I've not just got up, because I can't see why this is. Can someone explain it to me?
I'm wondering what level of commenting you need, as I didn't think that the grammar was _that_ difficult to understand - apologies.
The issue is not so much "a search is a list followed by an eostring, a list is a set of combys separated by '|'", "a comby is a set of terms separated by '&'" - which I agree is fairly clear once you've read the Parse::RecDescent perldoc - but the thinking behind the way the tree is built up. You need to say *why* you're doing what you're doing.
Anyway, here is an alternative grammar which is a drop-in replacement, which does a google style AND by default, and gives you OR if you separate the words with commas. Phrase search is still available, by passing a string bounded by "".
Thanks - can you write some tests for this? Don't worry if you can't, it'll just mean a short delay while I find time to do it.
Kake
----- Original Message -----
From: "Kate L Pugh" kake@earth.li
To: "Discussion of development on the OpenGuides software." openguides-dev@openguides.org
Sent: 12 October 2003 07:49
Subject: Re: [OpenGuides-Dev] Search and search syntax
In fact, it's not AND or OR by default, but a phrase search. In this case looking for "sausages oysters", which doesn't make much sense.
I think I'm going to have to have another look at the code when I've not just got up, because I can't see why this is. Can someone explain it to me?
The relevant line of the RecDescent is as follows:
| word(s) {$return = ['word', @{$item[1]}]}
This results in a series of words being turned into a list with 'word' at the head, i.e. ['word','sausages','oysters']
[snip...] - but the thinking behind the way the tree is built up. You need to say *why* you're doing what you're doing.
Fine. What the code is doing is constructing a tree of nodes representing the query. Each node is an arrayref, and the first item of the array is a node type. Here are some examples.
['word', 'pub']
['word', 'the', 'green', 'man']
['AND', ['word', 'restaurant'], ['word', 'vegan'] ]
['OR', ['word', 'cheap'], ['word', 'value'] ]
The original CGI version of SuperSearch had a debug line, commented out:
# print $outstr,pre(Dumper($tree));
Although we don't want the <pre> tag necessarily, this is a way to see the output of the parse, and fix problems with the grammar.
-_-_-
What happens to this tree is that it is walked recursively. This is what _matched_items does. This results in calls to matched_word, matched_AND etc.
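To illustrate the recursive walk, here is a cut-down sketch. It is not the code in SuperSearch.pm: the single matched_items function below stands in for the separate per-node methods, the pages and their text are invented test data, and it assumes a NOT node wraps one subtree as ['NOT', [...]]. A 'word' node holding several words is treated as a phrase, which is the behaviour described above for ['word','sausages','oysters'].

```perl
use strict;
use warnings;

# Walk a parse tree against a hash of { page name => page text } and
# return a hashref whose keys are the names of the matching pages.
sub matched_items {
    my ( $tree, $pages ) = @_;
    my ( $type, @args ) = @$tree;
    my %hits;

    if ( $type eq 'word' ) {
        # Several words in one node are matched as a single phrase.
        my $phrase = join ' ', @args;
        for my $name ( keys %$pages ) {
            $hits{$name} = 1 if $pages->{$name} =~ /\b\Q$phrase\E\b/i;
        }
    }
    elsif ( $type eq 'AND' ) {
        # Keep pages that appear in every child's result set.
        my @sets = map { matched_items( $_, $pages ) } @args;
        for my $name ( keys %$pages ) {
            $hits{$name} = 1 unless grep { !$_->{$name} } @sets;
        }
    }
    elsif ( $type eq 'OR' ) {
        # Union of the children's result sets.
        for my $set ( map { matched_items( $_, $pages ) } @args ) {
            $hits{$_} = 1 for keys %$set;
        }
    }
    elsif ( $type eq 'NOT' ) {
        # Everything in the cache that the child does not match.
        my $set = matched_items( $args[0], $pages );
        for my $name ( keys %$pages ) {
            $hits{$name} = 1 unless $set->{$name};
        }
    }
    else {
        die "Unknown node type: $type\n";
    }
    return \%hits;
}

# Invented page text, purely for demonstration.
my %pages = (
    'Windsor Castle' => 'Serves sausages and oysters',
    'Green Man'      => 'Serves sausages only',
    'Oyster Bar'     => 'Fresh oysters daily',
);
my $tree = [ 'AND', [ 'word', 'sausages' ], [ 'word', 'oysters' ] ];
print join( ', ', sort keys %{ matched_items( $tree, \%pages ) } ), "\n";
```

With this data the AND tree matches only the first page, while the phrase node ['word', 'sausages', 'oysters'] matches nothing, since no page contains those two words side by side.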
Also of note is that the word nodes trigger a call to _prime_wikitext to load up the base text. I think that there is a potential bug here, as the original intention here was to prime the wikitext once based on a complete OR of all word searches in the inverted index, then applying boolean logic via the parse tree. I wouldn't be surprised if some of the ANDs, ORs and NOTs don't work properly.
The solution is to do a pre-pass of the tree, priming on each word node. Also, to do this, _prime_wikitext should not empty out the hash every time, as it's being called more than once.
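A pre-pass of that kind could look roughly like the sketch below. It is an illustration only and not the attached code: words_in_tree and prime_cache are hypothetical names, and the inverted index is modelled as a plain hash of word => [page names] rather than a real SII lookup. The point it shows is that the cache is built up across all word nodes in one pass instead of being emptied for each word.

```perl
use strict;
use warnings;

# Collect every word held in a 'word' node anywhere in the tree.
sub words_in_tree {
    my ($tree) = @_;
    my ( $type, @args ) = @$tree;
    return @args if $type eq 'word';
    return map { words_in_tree($_) } grep { ref($_) eq 'ARRAY' } @args;
}

# Prime the cache once for the whole query. $index is a stand-in for
# an inverted index: { word => [ page names ] }. $pages maps page
# names to their text. Nothing is cleared between words.
sub prime_cache {
    my ( $tree, $index, $pages ) = @_;
    my %cache;
    for my $word ( words_in_tree($tree) ) {
        for my $name ( @{ $index->{ lc($word) } || [] } ) {
            $cache{$name} = $pages->{$name};
        }
    }
    return \%cache;
}

my $tree = [ 'AND', [ 'word', 'restaurant' ],
                    [ 'OR', [ 'word', 'cheap' ], [ 'word', 'value' ] ] ];
print join( ' ', words_in_tree($tree) ), "\n";
```

The boolean logic would then be applied to the primed cache by walking the same tree a second time.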
Looking at _prime_wikitext, I see you have incorporated category and locale searches in it. I had started to do this, but didn't know about your call list_nodes_by_metadata.
Anyway, here is an alternative grammar which is a drop-in replacement, which does a google style AND by default, and gives you OR if you separate the words with commas. Phrase search is still available, by passing a string bounded by "".
Thanks - can you write some tests for this? Don't worry if you can't, it'll just mean a short delay while I find time to do it.
I've not got my head round where the test database is and what it's got in it.
I also have a version of SuperSearch.pm where the functionality of _perform_search was split down, with other methods _build_parser and _apply_parser.
This is attached. Beware, this code has seriously branched from the SuperSearch.pm in the latest release. I am providing it for ideas, and for a resolution of the _prime_wikitext issue above (which is solved by making _prime_wikitext recursive and giving it the whole parse tree). Enjoy.
Also, if you have any further questions or issues on this, I will be willing to help.
Ivor.