So you’ve decided to switch to SphinxQL (which is SQL syntax now supported by Sphinx). However, its MATCH() statement uses query syntax that was previously available from “extended” matching mode, and you’re using some other mode. What do you do?
Legacy modes had two major ingredients in them. First, the way they pick matches (well, we didn’t call them matching modes for nothing). Second, the way they rank them. You can have both in the new scheme of things. For matching, you rewrite the query using full-text query syntax. For ranking, you pick a proper ranker using the OPTION clause. Moreover, the two are no longer tied together, so it’s more flexible now.
There were 4 modes before the query language, namely ALL, ANY, PHRASE, and BOOLEAN (standard SPH_MATCH_ prefix omitted for brevity). How do you repeat their matching behavior now?
Query language defaults to matching all keywords, so you don’t really need to do anything to repeat ALL behavior (except filter out special characters; see below). BOOLEAN was a limited subset of query syntax, so you don’t need to change the query either, you only need to change the ranker (also discussed below). PHRASE is pretty simple: you just wrap the query in quotes to use the phrase matching operator, and that’s it. Pretty intuitive, thanks to Google. Finally, ANY is essentially equivalent to OR-ing all the keywords together, and you can achieve that either by inserting | between keywords, or by using the quorum operator with a minimum threshold of 1. Example:
// original query in SPH_MATCH_ANY
$q1 = ' one two three ';
// rewritten using OR operator
$q2 = ' one | two | three ';
// rewritten using quorum operator
$q3 = ' "one two three"/1 ';
Now that we match the same documents as before, we also want them in the same relevance rank order as before. The way Sphinx computes rank is now controlled by a knob unsurprisingly called “ranker”, and OPTION ranker=xxx turns that knob in SphinxQL (for reference, the SetRankingMode() call is what you use in SphinxAPI). ALL and PHRASE modes both correspond to ranker=proximity, ANY to ranker=matchany, and BOOLEAN to ranker=none. So, for instance, the final SphinxQL query that emulates SPH_MATCH_ANY mode would look like this:
SELECT * FROM myindex WHERE MATCH(' "one two three"/1 ') OPTION ranker=matchany
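The mode-to-query and mode-to-ranker mapping described above can be collected in one place. The following is only a minimal sketch in Python: the `LEGACY_MODES` table and the `emulate_legacy_mode()` helper are hypothetical names, not part of Sphinx, and the `%s` placeholder assumes a MySQL-protocol client library that substitutes parameters on the client side. For ALL, ANY, and PHRASE the query text is assumed to have its special characters escaped already (that detail is discussed further down).

```python
# Legacy mode -> (query template, ranker), following the mapping in the text.
# All names here are illustrative; nothing below is a Sphinx API.
LEGACY_MODES = {
    "all":     ("{q}",     "proximity"),  # query language already matches all words
    "phrase":  ('"{q}"',   "proximity"),  # wrap in quotes for the phrase operator
    "any":     ('"{q}"/1', "matchany"),   # quorum operator with threshold 1
    "boolean": ("{q}",     "none"),       # boolean syntax is a subset, keep as-is
}


def emulate_legacy_mode(mode, query, index="myindex"):
    """Return (sql, params) emulating one legacy matching mode.

    `index` is assumed to be a trusted identifier; only the full-text
    query travels as a bound parameter.
    """
    template, ranker = LEGACY_MODES[mode]
    match_text = template.format(q=query.strip())
    sql = "SELECT * FROM {} WHERE MATCH(%s) OPTION ranker={}".format(index, ranker)
    return sql, (match_text,)


sql, params = emulate_legacy_mode("any", " one two three ")
print(sql)     # statement with a placeholder for the query text
print(params)  # the rewritten full-text query
```

Keeping the ranker in a fixed table rather than taking it from user input means the only variable part of the statement is the query text itself.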
Note that now you can also match any word but, say, skip the pretty expensive ranking part because you sort by price anyway. That’s something you could not do using those legacy matching modes before!
SELECT * FROM myindex WHERE MATCH('one|two|three') ORDER BY price OPTION ranker=NONE
That covers the most important part but, as usual, the devil’s in the details, and there are a few other minor bits to cover.
First, the special characters have to be escaped (or stripped away). For instance, ‘one|two’ would be handled as ( one AND two ) when using SPH_MATCH_ALL mode, but running that query through MATCH() verbatim would result in ( one OR two ) behavior. You can escape all the specials using the EscapeString() SphinxAPI call. The call is a simple client-side one-liner, so you can easily re-implement it if that’s convenient, but then you’d have to watch out for updates to it.
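For illustration, a client-side re-implementation of that escaping could look roughly like the sketch below. The `escape_query()` name is hypothetical, and the exact set of characters is an assumption modeled on what the bundled API clients have escaped; check it against the EscapeString() in your own Sphinx version, since the list can change between releases.

```python
import re

# Characters assumed to be special in the full-text query syntax.
# This set is an assumption; verify it against your client library.
_SPECIALS = re.compile(r'([\\()|\-!@~"&/^$=<])')


def escape_query(text):
    """Prefix every assumed-special character with a backslash."""
    return _SPECIALS.sub(r"\\\1", text)


# 'one|two' keeps its ALL-mode meaning because the pipe is no longer an operator.
print(escape_query("one|two"))
```

Note that this only handles the query-syntax layer. Embedding the result in a SQL string literal needs the usual SQL quoting on top, which bound parameters in a client library normally take care of.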
Second, if you choose to emulate ANY by inserting | between the words, that’s not as straightforward a replace-all-whitespace-with-pipes thing as it might seem. (Escaping specials and wrapping the query in “…”/1, on the other hand, is that straightforward.) For instance, ‘one;two’ is still two keywords, despite no whitespace. The BuildKeywords() API call can help with that. It will break the query down into keywords the Sphinx way. Be aware, however, that unlike EscapeString(), this one does result in an extra network roundtrip to searchd.
Third, the previous “simple” query parser had a 10-keyword limit, and simply ignored the keywords beyond that limit. (I only vaguely remember the rationale behind that 2001-made decision; it should have been something related to limiting query performance impact.) Anyway, there’s no such limit in our current query language, so queries that throw more than 10 keywords at the searcher can return different results. BuildKeywords() should be your friend here too if you’re emulating the old behavior.
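Once the keywords are available as a list (for instance, extracted from a BuildKeywords() response, which is assumed here rather than shown), both the pipe-joining and the old 10-keyword cutoff are a few lines of client code. This is a sketch only: `any_query_from_keywords()` is a hypothetical helper, and it assumes the keywords are already tokenized by Sphinx and therefore contain no query operators.

```python
LEGACY_KEYWORD_LIMIT = 10  # the old "simple" parser ignored keywords past this


def any_query_from_keywords(keywords, limit=LEGACY_KEYWORD_LIMIT):
    """Build an OR query from pre-tokenized keywords.

    `keywords` is assumed to be a list of strings already split the
    Sphinx way; anything beyond `limit` is dropped to mimic the legacy
    parser.
    """
    kept = [kw for kw in keywords if kw][:limit]
    return " | ".join(kept)


# 'one;two three' would tokenize to three keywords despite the missing space.
print(any_query_from_keywords(["one", "two", "three"]))
```

Pass `limit=None` to keep every keyword if reproducing the old cutoff is not needed.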
Fourth and last, is all that complexity going to be any slower than just using the API and SetMatchMode()? Nope, it should perform perfectly on par. That’s because legacy matching modes no longer exist internally. (That, in turn, is why they are legacy.) When you use SetMatchMode(), they are in fact emulated internally using exactly the same set of tricks. So, no extra performance impact involved.