A number of Sphinx features are frequently overlooked, and multi-queries is one of them. I’ve heard a myth that Sphinx does not have faceted searching. In fact, it does.. Moreover, the multi-query mechanism is more powerful and flexible than just that, it lets you do more than just facets. But the documentation does not ever mention the word “facets” indeed, and voila, a myth is born. Well, time to debunk it!
So what are those multi-queries and how they can make your searches 3x faster?
Multi-queries are just a mechanism that lets you send several search queries to searchd in one batch. That, in turn, lets searchd internally optimize common parts between the queries. And that’s where the savings come from.
Note that you don’t need to do anything special about those queries to batch them. Any queries can be batched. Any combination that you send is perfectly valid. There is a limit on the number of queries (controlled by max_batch_queries directive since 1.10 and fixed at 32 before that) but that’s just a sanity check.
Multi-queries are currently only accessible via SphinxAPI. Support for multi-queries via SphinxQL might also be added later, provided that MySQL server protocol client libraries allow that; at a glance, they do. The relevant methods are AddQuery() and RunQueries() respectively. (Fun fact: regular Query() API method actually uses these two internally.) AddQuery() saves a snapshot of the current query settings resulting from any other previous API calls plus saves the query itself. So once you call AddQuery(), settings attached to query will no longer be changed. No API calls will affect the saved settings any more. And thus, you can use any other settings for your next queries (different sorting modes, different filters, etc). Then RunQueries() takes all the saved snapshots, sends them to searchd in one batch, receives the results from searchd, in one batch as well, and returns an array of multiple result sets.
Why use multi-queries? Generally, it’s all about performance. First, by sending queries and fetching results in one go, we always save a tiny bit on network roundtrips. Second, and much more importantly, searchd gets a chances to find common query parts across the batch, and optimize for those. We add optimizations over time, so it makes sense to batch the queries where you can: that way, once we implement new batch optimizations, they will automatically apply as you upgrade. In case when no known optimizations can be applied, queries are simply processed one by one, as usual. But in any case, optimizations or not, the results are guaranteed to be exactly the same as if you ran the queries one by one. So that’s entirely transparent for the application.
Why (or rather when) not use multi-queries? Query in a batch naturally have to be independent. That is, query B can not depend on the results of query A. However that’s not always the case. For instance, you might want to display results from ancillary index if and only if no search results were returned from the main one. Or just use a different offset into 2nd result set depending on the amount of matches in the 1st result set. Separate queries (or separate batches) would be required in such cases.
As of time of this writing, there are two major batch optimizations that you should be aware of: common query optimization (available since 0.9.8) and common subquery optimizations (available since 1.10).
Common query optimization is exactly how you implement facets, and it works as follows. In a batch, searchd identifies those queries where only sorting and grouping settings are different, while full-text part, filters and other settings coincide. For such subsets of queries, it runs the expensive full-text searching operation only once, and builds multiple, differently sorted (or grouped) result sets off the found matches. For instance, if there are 3 queries for “ipod nano” in a batch, 1st query picks 10 cheapest matches, 2nd one groups the matches by store ID and then sorts the stores by rating, and 3rd query picks the max price, full-text search for “ipod nano” will only be performed once, but 3 differently sorted nd grouped result sets will be built.
Now, so called faceted search is just a particular case where common query optimization applies. Indeed, you can implement that by sending a number of queries in one batch: one for main search results, a few others with different grouping and sorting settings (top-3 authors, top-5 vendors, top-20 sub-category counts, etc). And, because only sorting and grouping settings would be different, it gets optimized.
Common subquery optimization is even more interesting. It lets searchd identify the common full-text parts within queries. Even if the queries are different, but some parts are common, intermediate search results for those parts can be cached and shared between the queries. For instance, in this batch
barack obama president barack obama john mccain barack obama speech
there’s a common 2-word part (“barack obama”) that can only be evaluated once and cached. That’s exactly what common subquery (aka subtree) optimization does. Max cache size for every batch is strictly controlled by subtree_docs_cache and subtree_hits_cache directive, so that even if a common part “i am” matches 100 million documents, server does not unexpectedly run out of RAM.
But let’s go back to the common query optimization. Here’s a sample code which runs the same query with three different sorting modes:
require ( "sphinxapi.php" ); $cl = new SphinxClient (); $cl->SetMatchMode ( SPH_MATCH_EXTENDED2 ); $cl->SetSortMode ( SPH_SORT_RELEVANCE ); $cl->AddQuery ( "the", "lj" ); $cl->SetSortMode ( SPH_SORT_EXTENDED, "published desc" ); $cl->AddQuery ( "the", "lj" ); $cl->SetSortMode ( SPH_SORT_EXTENDED, "published asc" ); $cl->AddQuery ( "the", "lj" ); $res = $cl->RunQueries();
How does one tell whether the optimization actually triggers and works? If it does, the respective log entries would have an extra “multiplier” field that tells how many queries were combined and processed together:
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext2/0/rel 747541 (0,20)] [lj] the [Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext2/0/ext 747541 (0,20)] [lj] the [Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext2/0/ext 747541 (0,20)] [lj] the
Notice that “x3” part? It’s the multipler field and it means that this query was optimized and processed as a part of a 3-query subset (including this one). For the sake comparison, here’s the log in which the queries were sent one by one:
[Sun Jul 12 15:18:17.062 2009] 0.059 sec [ext2/0/rel 747541 (0,20)] [lj] the [Sun Jul 12 15:18:17.156 2009] 0.091 sec [ext2/0/ext 747541 (0,20)] [lj] the [Sun Jul 12 15:18:17.250 2009] 0.092 sec [ext2/0/ext 747541 (0,20)] [lj] the
We can see that with a batch per-query time improved 1.5x to 2.3x depending on the sorting mode. And that level of sample improvement is definitely not a limit, Maybe even an understament. Cases are known for both optimizations where query speed improved over 3x times, and that’s production cases and not synthetic tests! Common query optimzation maps very well to vertical product search engines and online shops, common subquery cache maps nicely to data mining queries, but, of course, those aren’t the only eligible search types. You can, for one, eliminate the full-text searching at all and compute mulitple reports (with different sorting and grouping).
What kind of batch optimizations to expect in the future? That depends upon you. In our current long-term plan, we have an optimization for identical queries with different set of filters. Know another frequent querying pattern that just might be subject to a clever optimization? Let us know!
|« March 31, 2010. Full-text search BOF session at MySQL Conference||April 9, 2010. 2 cents on 2% optimizations »|