Jun 27, 2010. Doing time segments, geodistance searches, and overrides in SphinxQL

SphinxQL now lets you do everything querying-related that SphinxAPI did, in a simpler, faster, and more convenient way. For most of the features, the mapping of API calls to SphinxQL syntax is straightforwards (either via SQL syntax or using our OPTION clause). However, a few things, namely time segments, geosearches, overrides, index and weight fields, etc might now be less obvious. So let’s discuss them.

Time segments sorting mode can now be easily repeated using INTERVAL and NOW functions. Moreover, it’s so much more flexible in SphinxQL. In custom SphinxAPI mode, turn-points were hardcoded, and limited to sorting. With SphinxQL, you now can control those turn-points, and also use the result for something else, not just sorting.

Specifically, legacy SphinxAPI mode is completely equivalent to computing this INTERVAL expression and sorting by it:

SELECT *, INTERVAL(time_stamp_attr, NOW()-90*86400, NOW()-30*86400,
    NOW()-7*86400, NOW()-86400, NOW()-3600) AS time_seg
...
ORDER BY time_seg DESC, @weight DESC

86400 is the number of seconds per 24 hours so the turn-points are at 90 days, 30 days, 7 days, 1 day, and 1 hour respectively. That’s exactly how they are hard-coded in SPH_SORT_TIME_SEGMENTS mode.

Note, by the way, that you need not worry about NOW() function used too much or constant multiplication. All that stuff gets optimized by our expression engine and this INTERVAL() expression is simplified down to an efficient search of attribute values in a list of constants.

Geodistance searches are supported through GEODIST(lat1,lon1,lat2,lon2) function. That is similar to SetGeoAnchor() API call but more flexible again. API required that lat1/lon1 are attribute names, and lat2/lon2 are constant floats. SphinxQL does not have that restriction any more, all latitudes and longitudes can be arbitrary expressions!

At the same time we are not sacrificing any speed by making GEODIST() that generic, because we check whether any (or both) of the points are constant expressions, and optimize accordingly. For instance, GEODIST() between two constant points will be computed only once. And GEODIST() with a pair of attributes and a constant anchor will be computed exactly as before, without any extra computational overheads.

Attribute value overrides, depending on how did you use them, can now be implemented using either IF() or IN() functions. For instance, to boost matching posts written by user’s friends, you could use something like:

SELECT *, @weight + IN(author_id,2,3,5,8,13,21)*1000 ...
ORDER BY @weight DESC

And, say, giving a 10% discount on a given category of goods to a current user (because he punched in a valid promo code) can be accomplished like:

SELECT *, IF(category_id=123,0.9,1)*price AS adjusted_price ...
ORDER BY adjusted_price ASC

In this particular case just cutting 10% off could be expressed without an IF(), for example as price*(1-0.1*(category_id=123)), exploiting the fact that boolean expressions in scalar context evaluate to 0 or 1. However IF() might be more efficient when more complex expressions are involved, because it short-circuits. That is, either 2nd or 3rd argument gets evaluated, but never the both of them.

Index and field weights were, frankly speaking, overlooked a while ago and stayed unimplemented for some time. But as of bleeding edge trunk, they are available, through an OPTION clause:

SELECT ... OPTION field_weights=(title=5,abstract=2,content=1)

To summarize, all the searching things you were able to do with the API are now available from SphinxQL as well, and sometimes in even more flexible and optimal way than before. So be sure to give it a try.


« »

8 Responses to “Doing time segments, geodistance searches, and overrides in SphinxQL”

  1. Hi Andrew,
    nice explanation, but I don’t get it:
    SELECT *, IF(category_id=123,0.9,1)*price AS adjusted_price

    Maybe it is a misprint here.

    Thanks,
    Yaroslav

  2. shodan says:

    Yaroslav, exactly what seems a misprint? That’s valid SphinxQL syntax :-)

  3. Very cool. Great job.

  4. EllisGL says:

    @Yaroslav:
    Think of it as this:

    $adjusted_ price = $price;

    if($category_id === 123)
    {
    $adjusted_ price *= 0.9;
    }

  5. kmike says:

    Will SPH_GROUPBY_DAY and other SetGroupBy() date built-in functions be supported in SphinxQL? I know they could be emulated by INTERVAL(), too, but it’s very clumsy when you have to search through the large time spans.

    It’s being discussed here:
    http://www.sphinxsearch.com/forum/view.html?id=6258

  6. shodan says:

    kmike, yes, that’s on the todo list.

  7. Tim Hawkins says:

    Its not something I can directly put my finger on, but we had a lot of trouble using the sphinxQL interface on port 9306, we kept getting exceptions in searchd when the query returned a large amount of data. I think that the mysql protocol emulation has some buffer overflow or bounds checking issues when result sets are large, we store entire articles as string attributes to avoid having to hit the database again on a search listing.

    We had no problem writing individual records to the RT indices via the MySQL emulation protocol, but could not reliably read back the results of queries. and had to resort to using the old protocol on 9312 for queries.

  8. shodan says:

    Tim, that might have been since fixed in current trunk.

Leave a Reply