News
Downloads
Services
Community
Partners
About

Ranking overview

Ranking (aka weighting) of the search results can be defined as a process of computing a so-called relevance (aka weight) for every given matched document with regards to a given query that matched it. So relevance is in the end just a number attached to every document that estimates how relevant the document is to the query. Search results can then be sorted based on this number and/or some additional parameters, so that the most sought after results would come up higher on the results page.

There is no single standard one-size-fits-all way to rank any document in any scenario. Moreover, there can not ever be such a way, because relevance is subjective. As in, what seems relevant to you might not seem relevant to me. Hence, in general case it's not just hard to compute, it's theoretically impossible.

So ranking in Sphinx is configurable. It has a notion of a so-called ranker. A ranker can formally be defined as a function that takes document and query as its input and produces a relevance value as output. In layman's terms, a ranker controls exactly how (using which specific algorithm) will Sphinx assign weights to the document.

Previously, this ranking function was rigidly bound to the matching mode. So in the legacy matching modes (that is, SPH_MATCH_ALL, SPH_MATCH_ANY, SPH_MATCH_PHRASE, and SPH_MATCH_BOOLEAN) you can not choose the ranker. You can only do that in the SPH_MATCH_EXTENDED mode. (Which is the only mode in SphinxQL and the suggested mode in SphinxAPI anyway.) To choose a non-default ranker you can either use SetRankingMode() with SphinxAPI, or OPTION ranker clause in SELECT statement when using SphinxQL.

As a sidenote, legacy matching modes are internally implemented via the unified syntax anyway. When you use one of those modes, Sphinx just internally adjusts the query and sets the associated ranker, then executes the query using the very same unified code path.

Available rankers

Sphinx ships with a number of built-in rankers suited for different purposes. A number of them uses two factors, phrase proximity (aka LCS) and BM25. Phrase proximity works on the keyword positions, while BM25 works on the keyword frequencies. Basically, the better the degree of the phrase match between the document body and the query, the higher is the phrase proximity (it maxes out when the document contains the entire query as a verbatim quote). And BM25 is higher when the document contains more rare words. We'll save the detailed discussion for later.

Currently implemented rankers are:

  • SPH_RANK_PROXIMITY_BM25, the default ranking mode that uses and combines both phrase proximity and BM25 ranking.

  • SPH_RANK_BM25, statistical ranking mode which uses BM25 ranking only (similar to most other full-text engines). This mode is faster but may result in worse quality on queries which contain more than 1 keyword.

  • SPH_RANK_NONE, no ranking mode. This mode is obviously the fastest. A weight of 1 is assigned to all matches. This is sometimes called boolean searching that just matches the documents but does not rank them.

  • SPH_RANK_WORDCOUNT, ranking by the keyword occurrences count. This ranker computes the per-field keyword occurrence counts, then multiplies them by field weights, and sums the resulting values.

  • SPH_RANK_PROXIMITY, added in version 0.9.9-rc1, returns raw phrase proximity value as a result. This mode is internally used to emulate SPH_MATCH_ALL queries.

  • SPH_RANK_MATCHANY, added in version 0.9.9-rc1, returns rank as it was computed in SPH_MATCH_ANY mode ealier, and is internally used to emulate SPH_MATCH_ANY queries.

  • SPH_RANK_FIELDMASK, added in version 0.9.9-rc2, returns a 32-bit mask with N-th bit corresponding to N-th fulltext field, numbering from 0. The bit will only be set when the respective field has any keyword occurences satisfiying the query.

  • SPH_RANK_SPH04, added in version 1.10-beta, is generally based on the default SPH_RANK_PROXIMITY_BM25 ranker, but additionally boosts the matches when they occur in the very beginning or the very end of a text field. Thus, if a field equals the exact query, SPH04 should rank it higher than a field that contains the exact query but is not equal to it. (For instance, when the query is "Hyde Park", a document entitled "Hyde Park" should be ranked higher than a one entitled "Hyde Park, London" or "The Hyde Park Cafe".)

  • SPH_RANK_EXPR, added in version 2.0.2-beta, lets you specify the ranking formula in run time. It exposes a number of internal text factors and lets you define how the final weight should be computed from those factors. You can find more details about its syntax and a reference available factors in a subsection below.

You should specify the SPH_RANK_ prefix and use capital letters only when using the SetRankingMode() call from the SphinxAPI. The API ports expose these as global constants. Using SphinxQL syntax, the prefix should be omitted and the ranker name is case insensitive. Example:

// SphinxAPI
$client->SetRankingMode ( SPH_RANK_SPH04 );

// SphinxQL
mysql_query ( "SELECT ... OPTION ranker=sph04" );

Legacy matching modes rankers

Legacy matching modes automatically select a ranker as follows:

  • SPH_MATCH_ALL uses SPH_RANK_PROXIMITY ranker;

  • SPH_MATCH_ANY uses SPH_RANK_MATCHANY ranker;

  • SPH_MATCH_PHRASE uses SPH_RANK_PROXIMITY ranker;

  • SPH_MATCH_BOOLEAN uses SPH_RANK_NONE ranker.

Expression based ranker (SPH_RANK_EXPR)

Expression ranker, added in version 2.0.2-beta, lets you change the ranking formula on the fly, on a per-query basis. For a quick kickoff, this is how you emulate PROXIMITY_BM25 ranker using the expression based one:

SELECT *, WEIGHT() FROM myindex WHERE MATCH('hello world')
OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25')

The output of this query must not change if you omit the OPTION clause, because the default ranker (PROXIMITY_BM25) behaves exactly like specified in the ranker formula above. But the expression ranker is somewhat more flexible than just that and provides access to many more factors.

The ranking formula is an arbitrary arithmetic expression that can use constants, document attributes, built-in functions and operators (described in Section 5.5, “Expressions, functions, and operators”), and also a few ranking-specific things that are only accessible in a ranking formula. Namely, those are field aggregation functions, field-level, and document-level ranking factors.

A document-level factor is a numeric value computed by the ranking engine for every matched document with regards to the current query. (So it differs from a plain document attribute in that the attribute do not depend on the full text query, while factors might.) Those factors can be used anywhere in the ranking expression. Currently implemented document-level factors are:

  • bm25 (integer), a document-level BM25 estimate (computed without keyword occurrence filtering).

  • max_lcs (integer), a query-level maximum possible value that the sum(lcs*user_weight) expression can ever take. This can be useful for weight boost scaling. For instance, MATCHANY ranker formula uses this to guarantee that a full phrase match in any field rankes higher than any combination of partial matches in all fields.

  • field_mask (integer), a document-level 32-bit mask of matched fields.

  • query_word_count (integer), the number of unique keywords in a query, adjusted for a number of excluded keywords. For instance, both (one one one one) and (one !two) queries should assign a value of 1 to this factor, because there is just one unique non-excluded keyword.

  • doc_word_count (integer), the number of unique keywords matched in the entire document.

A field-level factor is a numeric value computed by the ranking engine for every matched in-document text field with regards to the current query. As more than one field can be matched by a query, but the final weight needs to be a single integer value, these values need to be folded into a single one. To achieve that, field-level factors can only be used within a field aggregation function, they can not be used anywhere in the expression. For example, you can not use (lcs+bm25) as your ranking expression, as lcs takes multiple values (one in every matched field). You should use (sum(lcs)+bm25) instead, that expression sums lcs over all matching fields, and then adds bm25 to that per-field sum. Currently implemented field-level factors are:

  • lcs (integer), the length of a maximum verbatim match between the document and the query, coutned in words. LCS stands for Longest Common Subsequence (or Subset). Takes a minimum value of 1 when only stray keywords were matched in a field, and a maximum value of query keywords count when the entire query was matched in a field verbatim (in the exact query keywords order). For example, if the query is 'hello world' and the field contains these two words quoted from the query (that is, adjacent to each other, and exaclty in the query order), lcs will be 2. For example, if the query is 'hello world program' and the field contains 'hello world', lcs will be 2. Note that any subset of the query keyword works, not just a subset of adjacent keywords. For example, if the query is 'hello world program' and the field contains 'hello (test program)', lcs will be 2 just as well, because both 'hello' and 'program' matched in the same respective positions as they were in the query. Finally, if the query is 'hello world program' and the field contains 'hello world program', lcs will be 3. (Hopefully that is unsurpising at this point.)

  • user_weight (integer), the user specified per-field weight (refer to SetFieldWeights() in SphinxAPI and OPTION field_weights in SphinxQL respectively). The weights default to 1 if not specified explicitly.

  • hit_count (integer), the number of keyword occurrences that matched in the field. Note that a single keyword may occur multiple times. For example, if 'hello' occurs 3 times in a field and 'world' occurs 5 times, hit_count will be 8.

  • word_count (integer), the number of unique keywords matched in the field. For example, if 'hello' and 'world' occur anywhere in a field, word_count will be 2, irregardless of how many times do both keywords occur.

  • tf_idf (float), the sum of TF*IDF over all the keywords matched in the field. IDF is the Inverse Document Frequency, a floating point value between 0 and 1 that describes how frequent is the keywords (basically, 0 for a keyword that occurs in every document indexed, and 1 for a unique keyword that occurs in just a single document). TF is the Term Frequency, the number of matched keyword occurrences in the field. As a side note, tf_idf is actually computed by summing IDF over all matched occurences. That's by construction equivalent to summing TF*IDF over all matched keywords.

  • min_hit_pos (integer), the position of the first matched keyword occurrence, counted in words. Indexing begins from position 1.

  • min_best_span_pos (integer), the position of the first maximum LCS occurrences span. For example, assume that our query was 'hello world program' and 'hello world' subphrase was matched twice in the field, in positions 13 and 21. Assume that 'hello' and 'world' additionally occurred elsewhere in the field, but never next to each other and thus never as a subphrase match. In that case, min_best_span_pos will be 13. Note how for the single keyword queries min_best_span_pos will always equal min_hit_pos.

  • exact_hit (boolean), whether a query was an exact match of the entire current field. Used in the SPH04 ranker.

A field aggregation function is a single argument function that takes an expression with field-level factors, iterates it over all the matched fields, and computes the final results. Currently implemented field aggregation functions are:

  • sum, sums the argument expression over all matched fields. For instance, sum(1) should return a number of matched fields.

Expressions for the built-in rankers

Most of the other rankers can actually be emulated with the expression based ranker. You just need to pass a proper expression. Such emulation is, of course, going to be slower than using the built-in, compiled ranker but still might be of interest if you want to fine-tune your ranking formula starting with one of the existing ones. Also, the formulas define the nitty gritty ranker details in a nicely readable fashion.

  • SPH_RANK_PROXIMITY_BM25 = sum(lcs*user_weight)*1000+bm25

  • SPH_RANK_BM25 = bm25

  • SPH_RANK_NONE = 1

  • SPH_RANK_WORDCOUNT = sum(hit_count*user_weight)

  • SPH_RANK_PROXIMITY = sum(lcs*user_weight)

  • SPH_RANK_MATCHANY = sum((word_count+(lcs-1)*max_lcs)*user_weight)

  • SPH_RANK_FIELDMASK = field_mask

  • SPH_RANK_SPH04 = sum((4*lcs+2*(min_hit_pos==1)+exact_hit)*user_weight)*1000+bm25