Mar 23, 2010. Porting legacy matching modes to SphinxQL

So you’ve decided to switch to SphinxQL (which is SQL syntax now supported by Sphinx). However, its MATCH() statement uses query syntax that was previously available from “extended” matching mode, and you’re using some other mode. What do you do?

Legacy modes had two major ingredients. First, the way they pick matches (well, we didn’t call them matching modes for nothing). Second, the way they rank them. You can have both in the new scheme of things. For matching, you rewrite the query using full-text query syntax. For ranking, you pick the proper ranker using the OPTION clause. Moreover, the two are no longer tied together, so it’s more flexible now.

There were 4 modes before the query language, namely ALL, ANY, PHRASE, and BOOLEAN, standard prefix omitted for brevity. How do you repeat their matching behavior now?

The query language defaults to matching all keywords, so you don’t really need to do anything to repeat ALL behavior (except filter out special characters; see below). BOOLEAN was a limited subset of query syntax, so you don’t need to change the query either, you only need to change the ranker (also discussed below). PHRASE is pretty simple: you just wrap the query in quotes to use the phrase matching operator, and that’s it. Pretty intuitive, thanks to Google. Finally, ANY is essentially equivalent to OR-ing all the keywords together, and you can achieve that either by inserting | between keywords, or by using the quorum operator with a minimum threshold of 1. Example:

// original query in SPH_MATCH_ANY
$q1 = ' one two three ';
 
// rewritten using OR operator
$q2 = ' one | two | three ';
 
// rewritten using quorum operator
$q3 = ' "one two three"/1 ';
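If you build these strings in code, the two rewrites can be sketched as a pair of tiny helpers. This is a minimal sketch in Python rather than PHP; the function names are made up for illustration, and it assumes the keywords are already split and already escaped (both points are covered further down).

```python
def any_as_or(keywords):
    """Emulate ANY matching by joining keywords with the OR operator."""
    return " | ".join(keywords)


def any_as_quorum(keywords):
    """Emulate ANY matching with the quorum operator and a threshold of 1."""
    return '"' + " ".join(keywords) + '"/1'


words = "one two three".split()
print(any_as_or(words))      # one | two | three
print(any_as_quorum(words))  # "one two three"/1
```

Both strings go into MATCH() as is; the quorum form has the advantage that you never need to know where one keyword ends and the next begins.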

Now that we match the same documents as before, we also want them in the same relevance rank order as before. The way Sphinx computes rank is now controlled by a knob surprisingly called “ranker”, and OPTION ranker=xxx is what turns that knob in SphinxQL (for reference, the SetRankingMode() call is what you use in SphinxAPI). ALL and PHRASE modes both correspond to ranker=proximity, ANY to ranker=matchany, and BOOLEAN to ranker=none. So, for instance, the final SphinxQL query that emulates SPH_MATCH_ANY mode would look like this:

SELECT * FROM myindex WHERE MATCH(' "one two three"/1 ') OPTION ranker=matchany
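Put together, the mode-to-query and mode-to-ranker mappings fit in a small lookup. The sketch below is in Python, and its helper names are hypothetical. It assumes the query text is already escaped, contains no single quotes, and that ANY is emulated with the quorum form.

```python
# Legacy matching mode -> ranker, as listed in the paragraph above.
LEGACY_RANKERS = {
    "SPH_MATCH_ALL": "proximity",
    "SPH_MATCH_PHRASE": "proximity",
    "SPH_MATCH_ANY": "matchany",
    "SPH_MATCH_BOOLEAN": "none",
}


def rewrite_query(mode, query):
    """Rewrite a legacy query string into extended query syntax."""
    query = query.strip()
    if mode == "SPH_MATCH_PHRASE":
        return '"' + query + '"'
    if mode == "SPH_MATCH_ANY":
        return '"' + query + '"/1'
    # ALL and BOOLEAN queries are passed through unchanged.
    return query


def legacy_to_sphinxql(index, mode, query):
    """Build a SphinxQL statement that emulates the given legacy mode.

    Assumption: neither the index name nor the query needs SQL quoting.
    """
    return "SELECT * FROM %s WHERE MATCH('%s') OPTION ranker=%s" % (
        index, rewrite_query(mode, query), LEGACY_RANKERS[mode])


print(legacy_to_sphinxql("myindex", "SPH_MATCH_ANY", " one two three "))
```

In real code you would hand the MATCH() argument to your SQL client's own quoting routine instead of pasting it into the statement.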

Note that now you can also match any word but, say, skip the pretty expensive ranking part because you sort by price anyway. That’s something you could not do using those legacy matching modes before!

SELECT * FROM myindex WHERE MATCH('one|two|three') ORDER BY price OPTION ranker=NONE

That covers the most important part but, as usual, the devil’s in the details, and there are a few other minor bits to cover.

First, the special characters have to be escaped (or stripped away). For instance, ‘one|two’ would be handled as ( one AND two ) when using SPH_MATCH_ALL mode, but running that query through MATCH() verbatim would result in ( one OR two ) behavior. You can escape all the specials using the EscapeString() SphinxAPI call. The call’s a simple client-side one-liner, so you can easily re-implement it if that’s convenient, but you’d then have to watch out for updates to it.
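For illustration, such a re-implementation could look roughly like this in Python. The list of special characters here is an assumption made for the sketch, not an authoritative copy of what EscapeString() handles in any given release, so check it against the client library you actually ship.

```python
import re

# ASSUMED set of query-syntax specials; verify it against your client library.
SPECIALS = re.compile(r'([\\()|\-!@~"&/^$=])')


def escape_query(text):
    """Put a backslash in front of every character listed in SPECIALS."""
    return SPECIALS.sub(r"\\\1", text)


print(escape_query("one|two"))  # one\|two
```

With the pipe escaped, the example query is back to being two plain keywords instead of an OR expression.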

Second, if you choose to emulate ANY by inserting | between the words, that’s not as straightforward a replace-all-whitespace-with-pipes thing as it might seem. (Escaping specials and wrapping the query in “…”/1, on the other hand, is that straightforward.) For instance, ‘one;two’ is still two keywords, despite no whitespace. The BuildKeywords() API call can help with that. It will break the query down into keywords the Sphinx way. Be aware, however, that, unlike EscapeString(), this one does result in an extra network roundtrip to searchd.
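As a rough sketch of that approach, the helper below turns a BuildKeywords()-style result into an OR query. It assumes the result is a list of dictionaries with a 'tokenized' entry per keyword; the sample data is a hand-written stand-in, so no searchd is contacted here, and the function name is hypothetical.

```python
def or_query_from_keywords(keyword_info):
    """Join keywords, as split by Sphinx, with the OR operator.

    keyword_info is ASSUMED to be a list of dicts with a 'tokenized' key,
    which is the shape this sketch expects from BuildKeywords().
    """
    return " | ".join(entry["tokenized"] for entry in keyword_info)


# Stand-in for what a BuildKeywords() call might return for 'one;two'.
sample = [
    {"tokenized": "one", "normalized": "one"},
    {"tokenized": "two", "normalized": "two"},
]
print(or_query_from_keywords(sample))  # one | two
```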

Third, the previous “simple” query parser had a 10-keyword limit, and simply ignored the keywords beyond that limit. (I only vaguely remember the rationale behind that 2001-made decision; it must have been something related to limiting query performance impact.) Anyway, there’s no such limit in our current query language, so queries that throw more than 10 keywords at the searcher can return different results. BuildKeywords() should be your friend here too if you’re emulating the old behavior.
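If you do need the old cutoff, it can be applied to the keyword list before the query is rebuilt. A minimal sketch follows; it assumes that “ignored the keywords beyond that limit” means keeping the first 10 in query order, and the names are made up for the example.

```python
LEGACY_KEYWORD_LIMIT = 10  # limit of the old "simple" parser, per the text above


def truncate_keywords(keywords, limit=LEGACY_KEYWORD_LIMIT):
    """Keep only the first `limit` keywords and drop the rest."""
    return keywords[:limit]


words = ["w%d" % i for i in range(1, 13)]  # 12 keywords: w1 .. w12
kept = truncate_keywords(words)
print(len(kept), kept[-1])  # 10 w10
```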

Fourth and last, is all that complexity going to be any slower than just using the API and SetMatchMode()? Nope, it should perform perfectly on par. That’s because the legacy matching modes no longer exist internally. (Which, in turn, is why they are legacy.) And when you’re using SetMatchMode(), they are in fact internally emulated using exactly the same set of tricks. So, no extra performance impact involved.

Happy porting.



3 Responses to “Porting legacy matching modes to SphinxQL”

  1. Niklas E says:

    MATCH statement is good knowing and important meaning I follow and try understand and compatibilize legacy LAMP sql (montao.com.br), SphinxQL and gql (web.montao.com.br) with kind thanks and regards for all very impressive work

  2. Miro says:

    Hi,

    We are currently developing a search engine for our client’s web site.

    The biggest problem now is ranking so we hope you might be able to help us here.

    We have 2 text fields and when you are searching for words “one & two” [quote marks are not used in query just here] in SPH_MATCH_EXTENDED2 mode we want to get relevance based on given weights matching all words in each of fields. Currently it works that it gives full weights for each field if sphinx finds any of searched words inside.

    Is there any solution to this? We are having problems for few weeks now because of this and we would prefer not to be in need of doing 3 queries just to get ranking sorted out :(

    Tnx,
    Miro

  3. shodan says:

    Miro, I’m not sure if I understood your question correctly, but it seems (seems) that you need a custom ranker developed. We can do that under our commercial services, call or email for details.

    BTW blog comments aren’t the best place to ask.. your comment got tagged as spam for some reason so I saw it just now.
