Aug 17, 2010. How Sphinx relevance ranking works

Over time, we have added quite a bunch of matching and ranking modes to Sphinx, and will be adding more. A number of questions that regularly pop up, ranging from “how do I force this document to rank first” to “how do I draw 1 to 5 stars depending on match quality”, do in fact boil down to matching and ranking internals. So let’s cover that: just how do matching and ranking modes work, what weighting factors contribute to the final weight and how, how does one tweak stuff, and so on. And, of course, the stars, our destination.

What are matching modes?

First of all, let’s sort out those confusing modes. SphinxAPI exposes two different methods, SetMatchMode() and SetRankingMode() respectively. SphinxQL, for reference, does not; it only exposes OPTION ranker, which maps to the ranking mode but not the matching mode. What is all that about?

Matching modes are all about legacy and compatibility. And ranking modes are about how Sphinx computes relevance.

Previously, up to version 0.9.8, Sphinx only used to have matching modes and every matching mode was implemented with a different code path. Each code path implemented a different kind of both matching and ranking. For instance, SPH_MATCH_ALL required all keywords to be present, and computed document weight as phrase proximity alone. SPH_MATCH_ANY required any of the keywords, and computed weight differently. And so on.

In 0.9.8 we started a new, unified matching engine. To avoid breaking compatibility while we worked on it, in 0.9.8 it was only exposed under a separate matching mode, called SPH_MATCH_EXTENDED2. By version 0.9.9 it became apparent that the new engine was stable and performant enough, and we removed all the legacy code paths in favor of the new engine. So starting from 0.9.9 all queries are handled with a unified engine, which was not the case before and had created maintenance difficulties. All the matching modes are now, in fact, just legacy.

Sphinx stays compatible with those legacy modes, of course, and when you use one of those, it automatically switches to a simplified query parsing code (that completely ignores query syntax) and automatically picks a proper ranker. But that is it. Everything else is always handled with our unified matching engine. And thus, document weight (aka, @weight) only depends on the chosen ranking mode (aka, ranker). For instance, the following two queries will result in exactly the same weight (and exactly the same processing time, too):

// 1st route
$cl->SetMatchMode ( SPH_MATCH_ALL );
$cl->Query ( "hello world" );
 
// 2nd route
$cl->SetMatchMode ( SPH_MATCH_EXTENDED2 );
$cl->SetRankingMode ( SPH_RANK_PROXIMITY );
$cl->Query ( "hello world" );

Note that the second route allows you to use, say, (@title hello world) syntax, because the matching mode allows for it. The first one does not, because in that matching mode all special operators are ignored and @title will be interpreted as a keyword.

So SetMatchMode() does nothing more than escape the query and pick a proper ranker. It’s rather a legacy call, as there won’t be any new matching modes (they were a temporary solution until we had a full-blown query syntax in the first place, but temporary solutions are notorious for their tendency to last), and query-syntax enabled matching lets you do everything the older modes used to provide, and much more. SetRankingMode() does even less; it just lets you explicitly pick a ranker. Which brings us to the question…

What are rankers?

Ranking modes, or rankers for short, can be formally defined as functions that compute a relevance value (weight) for a given query and document arguments.

Relevance is ultimately subjective, so there’s no single one-size-fits-all ranker, and there never will be. There are many different factors that can be used to compute a final weight, and a myriad of ways to combine those factors into a weight; discussing those is a subject for a separate posting.

The two most important weighting factors that Sphinx computes and uses as of 1.10 are 1) the classic statistical BM25 factor, used by most if not all search engines since the 1980s, and 2) the Sphinx-specific phrase proximity factor.

BM25 factor

BM25 is a floating point value that depends only on the frequencies of the matched keywords. The frequencies in question are in-document and in-collection frequencies. Basically, keywords that are rarer and/or occur many times in the document yield more weight to that document.

The standard BM25 implementation is nicely covered in the Wikipedia article on BM25, but Sphinx uses a slightly modified variant. First, for performance reasons we account for all the keyword occurrences in the document, and not just the matched ones. For instance, a (@title “hello world”) query that only matches a single instance of the “hello world” phrase in the title will result in the same BM25 as a (hello world) query that matches all the instances of both keywords everywhere in the document. Second, we don’t enforce any document attributes and therefore don’t necessarily have a document length, so we ignore document length too (equivalent to plugging b=0 into the original BM25). Both changes were intentional, as in our testing the original BM25 did not result in enough ranking improvement to justify the associated performance impact. The exact BM25 computation that Sphinx uses is, in pseudo-code, as follows:

BM25 = 0
foreach ( keyword in matching_keywords )
{
    n = total_matching_documents ( keyword )
    N = total_documents_in_collection
    k1 = 1.2
 
    TF = current_document_occurrence_count ( keyword )
    IDF = log((N-n+1)/n) / log(1+N)
    BM25 = BM25 + TF*IDF/(TF+k1)
}
 
// normalize to 0..1 range
BM25 = 0.5 + BM25 / ( 2*num_keywords ( query ) )

TF means Term Frequency in a document being ranked. It’s based on a number of occurrences within a document but smoothed with a logarithm function, so that 1000 occurrences don’t result in 1000x improvement over just 1. TF can generally vary from 0 to 1 but, with the chosen k1=1.2, it actually varies from 0.4545… to 1.

IDF means Inverse Document Frequency in the entire document set. IDF has smaller values for frequent words (such as “the” or “to”) and greater values for rare ones, with the peak values being IDF=1 when a keyword occurs in exactly one document, and IDF=-1 when it occurs in every indexed document.

So, as you can see from the code above, BM25 increases when the keywords are rare and occur many times in the document, and decreases when the keywords are frequent. It should be noted that overly frequent keywords that match more than half of the indexed documents actually decrease BM25! Indeed, when a keyword occurs in 90% of the documents, the documents without it are rarer gems, probably more interesting as such, and deserve more weight.
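For illustration, the pseudo-code above translates almost directly into runnable Python. This is a hedged sketch of the formula as described here, not Sphinx’s actual implementation; the frequency dictionaries are assumed inputs that a real engine would pull from its index.

```python
import math

def sphinx_bm25(doc_tf, doc_freqs, total_docs, num_query_keywords, k1=1.2):
    """Sphinx-flavored BM25: per-keyword dampened TF times IDF,
    normalized into the 0..1 range. Document length is ignored
    (equivalent to b=0 in the original BM25).

    doc_tf    -- {keyword: occurrences in the document being ranked}
    doc_freqs -- {keyword: number of documents containing the keyword}
    """
    bm25 = 0.0
    for keyword, tf in doc_tf.items():
        n = doc_freqs[keyword]                          # matching documents
        N = total_docs                                  # collection size
        idf = math.log((N - n + 1) / n) / math.log(1 + N)
        bm25 += tf * idf / (tf + k1)
    return 0.5 + bm25 / (2 * num_query_keywords)        # normalize to 0..1
```

With this variant, a rare keyword (one occurring in, say, 1 of 1000 documents) scores noticeably higher than a stop-word-like keyword that occurs in 90% of them.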

Phrase proximity factor

The phrase proximity factor, on the contrary, does not care about keyword frequencies at all and accounts for the relative positions of the query keywords in the document. Instead of the keyword frequencies used for BM25, Sphinx analyzes keyword positions in every field and computes the phrase proximity value as the longest common subsequence (LCS) length between the query and the document. Basically, per-field phrase proximity is the number of keywords that occurred in the document in exactly the same order as they did in the query. Here are a few examples:

1) query = one two three, field = one and two three
field_phrase_weight = 2 (because 2-keyword long "two three" subphrase matched)
 
2) query = one two three, field = one and two and three
field_phrase_weight = 1 (because single keywords matched but no subphrase did)
 
3) query = one two three, field = nothing matches at all
field_phrase_weight = 0
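For illustration, the per-field phrase weight can be sketched in Python as a word-level longest-common-substring DP over the query and field token lists. This naive version works on raw strings; the real engine, of course, works off indexed keyword positions:

```python
def field_phrase_weight(query, field):
    """Per-field phrase proximity: the length of the longest run of
    query keywords that appears in the field in the same order and
    adjacency as in the query (word-level longest common substring)."""
    q, f = query.split(), field.split()
    best = 0
    prev = [0] * (len(f) + 1)               # DP row for the previous query word
    for qword in q:
        cur = [0] * (len(f) + 1)
        for j, fword in enumerate(f, start=1):
            if qword == fword:
                # extend the run that ended at the previous positions
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

Running it against the three examples above yields 2, 1, and 0 respectively.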

Per-field phrase weights are then multiplied by the field weights specified with the SetFieldWeights() API call (or OPTION field_weights in SphinxQL) and added together to produce the per-document phrase weight. Field weights default to 1, and cannot be set lower than 1. In pseudo-code, the entire phrase proximity calculation looks as follows:

doc_phrase_weight = 0
foreach ( field in matching_fields )
{
   field_phrase_weight = max_common_subsequence_length ( query, field )
   doc_phrase_weight += user_weight ( field ) * field_phrase_weight
}

Example:

doc_title = hello world
doc_body = the world is a wonderful place
 
query = hello world
query_title_weight = 5
query_body_weight = 3
 
title_phrase_weight = 2
body_phrase_weight = 1
doc_phrase_weight = 2*5+3*1 = 13

It’s the phrase proximity factor that guarantees that closer phrase matches will be ranked higher, and exact phrase matches will be ranked at the very top. One can use field weights to tweak and tune that behavior. For instance, in the example just above, a single-keyword match in the title (1*5=5) is made worth almost as much as a two-keyword phrase match in the body (2*3=6).
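The combination step from the example is trivial to sketch in Python; the per-field phrase weights (the LCS values) are assumed to be computed already:

```python
def doc_phrase_weight(field_phrase_weights, user_weights):
    """Combine per-field phrase weights with user field weights."""
    return sum(user_weights[field] * w
               for field, w in field_phrase_weights.items())

# the "hello world" example above: LCS=2 in title, LCS=1 in body
weight = doc_phrase_weight({"title": 2, "body": 1},
                           {"title": 5, "body": 3})
# weight == 2*5 + 1*3 == 13
```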

Phrase proximity is by design somewhat more computationally intensive than BM25, because it needs to work through all the keyword occurrences in the matched documents, and not just the document lists. Sphinx defaults to using proximity because we believe this yields better search quality. You can however choose a more lightweight ranker that omits the expensive proximity calculations.

Orbital view of the rankers

Phrase proximity and BM25 are the two most important factors, that is, values that contribute to the final document weight. However, the final weight is determined by the ranker, that is, the specific function that crunches one or more factors into a single number. (Also, there are other factors besides phrase weight and BM25 that Sphinx can compute and use.)

As of 1.10-beta, Sphinx has 8 different rankers, and will definitely add more in the future. Every ranker computes weight differently and thus might or might not be suitable for a particular scenario.

There are 3 simple rankers (NONE, WORDCOUNT, FIELDMASK) that, respectively, do nothing, count keyword occurrences, and return a bitmask of matching fields. Those are useful when ranking is not needed at all, or is computed on the application side.

There are 2 legacy rankers (PROXIMITY, MATCHANY) that rely on phrase proximity alone and are used to emulate MATCH_ALL and MATCH_ANY legacy modes respectively.

There are 3 more rankers (BM25, PROXIMITY_BM25, SPH04) that can combine phrase proximity, BM25, and other bits. Query-syntax enabled modes and SphinxQL default to PROXIMITY_BM25 now, and PROXIMITY_BM25 is strongly suggested as a drop-in replacement for PROXIMITY, too. BM25 is recommended as a reasonably good quick ranker, and also for comparison with other systems. SPH04 builds upon PROXIMITY_BM25 but additionally ranks exact field matches and field start matches higher than “just” matches.

PROXIMITY_BM25 and SPH04 are expected to yield the best quality, but your particular results may vary.

Choice of ranker can severely affect search query performance. NONE is obviously the quickest ranker, but what about the other ones? Processing the keyword positions (occurrences) is typically the most expensive part, so rankers that don’t need to do that (FIELDMASK, BM25) are always quicker than all the others. They utilize less disk IO too (no need to read in the positions). Rankers that process keyword positions (WORDCOUNT, PROXIMITY, MATCHANY, PROXIMITY_BM25, SPH04) only differ in CPU impact.

Nitty gritty ranker details

This section describes the exact algorithms Sphinx rankers use and provides pseudo-code. You can skip it freely unless you want to tweak ranking, tune field weights, etc.

While the factors might be integer, boolean, floating point or whatever else, the weight has to be a single scalar value. In Sphinx, the weight is not just scalar but an integer value. This isn’t a real constraint; floating point weight values can be mapped to integers in a variety of ways anyway.

Let’s begin with the three simplest rankers.

1) The SPH_RANK_NONE ranker just assigns a weight of 1 to every document.

weight = 1

Why use this and effectively skip ranking at all? The answer is performance. If you’re sorting search results by price, why spend CPU cycles doing expensive ranking you’re going to throw away anyway?

2) SPH_RANK_WORDCOUNT ranker counts all the keyword occurrences and multiplies them by user field weights.

weight = 0
foreach ( field in matching_fields )
    weight += user_weight ( field ) * num_keyword_occurrences ( field )

Note that it counts all occurrences, and not the unique keywords. Therefore 3 occurrences of just 1 matching keyword will contribute exactly as much as 1 occurrence of 3 different keywords.

3) SPH_RANK_FIELDMASK ranker returns a bit mask of matched fields.

weight = 0
foreach ( field in matching_fields )
    set_bit ( weight, index_of ( field ) )
    // or in other words, weight |= ( 1 << index_of ( field ) )
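The same bitmask logic as a quick Python illustration (0-based field indexes assumed, as in the pseudo-code):

```python
def fieldmask_weight(matched_field_indexes):
    """SPH_RANK_FIELDMASK sketch: set one bit per matched field."""
    weight = 0
    for idx in matched_field_indexes:
        weight |= 1 << idx
    return weight
```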

The other five rankers are somewhat more complicated and mostly rely on phrase proximity.

4) SPH_RANK_PROXIMITY, the default ranker in SPH_MATCH_ALL legacy mode, simply passes the phrase proximity for a weight:

weight = doc_phrase_weight

By the definition of phrase weight, when documents match the query but no sequence of two keywords matches, all such documents receive a weight of 1. That, clearly, doesn’t differentiate the results much, so using the PROXIMITY_BM25 ranker instead is advised. The associated searching performance impact should be negligible.

5) SPH_RANK_MATCHANY ranker, used to emulate legacy MATCH_ANY mode, combines phrase proximity and the number of matched keywords so that, with default per-field weights, a) longer sub-phrase match (aka bigger phrase proximity) in any field would rank higher, and b) in case of agreeing phrase proximity, document with more matched unique keywords would rank higher. In other words, we look at max sub-phrase match length first, and a number of unique matched keywords second. In pseudo-code,

k = 0
foreach ( field in all_fields )
    k += user_weight ( field ) * num_keywords ( query )
 
weight = 0
foreach ( field in matching_fields )
{
   field_phrase_weight = max_common_subsequence_length ( query, field )
   field_rank = ( field_phrase_weight * k + num_matching_keywords ( field ) )
   weight += user_weight ( field ) * field_rank
}

It does not use BM25 at all because legacy mode did not use it and we need to stay compatible.
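In Python, the same computation can be sketched as follows. The per-field phrase weights and matched-keyword counts are assumed to be precomputed inputs here, whereas the real engine derives them from keyword position lists:

```python
def matchany_weight(num_query_keywords, field_data, user_weights):
    """SPH_RANK_MATCHANY sketch.

    field_data   -- {matched_field: (field_phrase_weight, num_matching_keywords)}
    user_weights -- {field: user weight}, for all fields
    """
    # k is picked so that a longer sub-phrase match in any field
    # always outweighs any number of single-keyword matches
    k = sum(w * num_query_keywords for w in user_weights.values())
    weight = 0
    for field, (phrase_weight, matched_keywords) in field_data.items():
        field_rank = phrase_weight * k + matched_keywords
        weight += user_weights[field] * field_rank
    return weight
```

For example, with a 3-keyword query and default field weights, a document with a 2-keyword sub-phrase match outranks one that matches all 3 keywords scattered around.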

6) SPH_RANK_PROXIMITY_BM25, the default SphinxQL ranker and also the default ranker when “extended” matching mode is used with SphinxAPI, computes weight as

weight = doc_phrase_weight*1000 + integer(doc_bm25*999)

So document phrase proximity is the primary factor and BM25 is an auxiliary one that additionally sorts documents sharing the same phrase proximity. BM25 belongs to the 0..1 range, so the last 3 decimal digits of the final weight contain the scaled BM25, and all the other digits are used for the phrase weight.
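That digit packing is simple enough to sketch in Python; doc_phrase_weight and doc_bm25 are assumed to be computed as described in the previous sections:

```python
def proximity_bm25_weight(doc_phrase_weight, doc_bm25):
    """SPH_RANK_PROXIMITY_BM25 sketch: proximity is the primary factor,
    BM25 (a 0..1 float) is packed into the last three decimal digits."""
    return doc_phrase_weight * 1000 + int(doc_bm25 * 999)
```

So a document with phrase proximity 3 always outranks one with proximity 2, whatever their BM25 values.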

7) SPH_RANK_BM25 ranker sums user weights of the matched fields and BM25.

field_weights = 0
foreach ( field in matching_fields )
    field_weights += user_weight ( field )
weight = field_weights*1000 + integer(doc_bm25*999)

Almost like PROXIMITY_BM25 ranker, except that user weights are not multiplied by per-field phrase proximities. Not using phrase proximity allows the engine to evaluate the query using document lists only, and skip the processing of keyword occurrences lists. Unless your documents are extremely short (think tweets, titles, etc), occurrence lists are somewhat bigger than the document lists and take somewhat more time to process. So BM25 is a faster ranker than any of the proximity-aware ones.

Also, many other search systems either default to BM25 ranking, or even provide it as the only option. So it might make sense to use BM25 ranker when doing performance testing to make the comparison fair.

8) SPH_RANK_SPH04 ranker further improves on PROXIMITY_BM25 ranker (and introduces numbers instead of meaningful names, too, because a name would be way too complicated). Phrase proximity is still the leading factor, but, within a given phrase proximity, matches in the beginning of the field are ranked higher, and exact matches of the entire field are ranked highest. In pseudo-code,

field_weights = 0
foreach ( field in matching_fields )
{
    f = 4*max_common_subsequence_length ( query, field )
    if ( exact_field_match ( query, field ) )
        f += 3
    else if ( first_keyword_matches ( query, field ) )
        f += 2
    field_weights += f * user_weight ( field )
}
weight = field_weights*1000 + integer(doc_bm25*999)

Thus, when querying for “Market Street”, SPH04 will basically rank a document with exact “Market Street” match in one of the fields the highest, followed by “Market Street Grocery” that begins the field with a matching keyword, then followed by “West Market Street” that has a phrase match somewhere, and then followed by all the documents that do mention both keywords but not as a phrase (such as “Flea Market on 26th Street”).
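An illustrative Python sketch of SPH04, with the LCS, exact-field-match, and field-start-match flags assumed to be precomputed per field:

```python
def sph04_weight(field_data, user_weights, doc_bm25):
    """SPH_RANK_SPH04 sketch.

    field_data -- {matched_field: (lcs, exact_field_match, first_keyword_matches)}
    """
    field_weights = 0
    for field, (lcs, exact, starts) in field_data.items():
        f = 4 * lcs
        if exact:                  # the whole field equals the query
            f += 3
        elif starts:               # the field begins with a matching keyword
            f += 2
        field_weights += f * user_weights[field]
    return field_weights * 1000 + int(doc_bm25 * 999)
```

Plugging in the “Market Street” scenarios above (exact field, field-start phrase, mid-field phrase, scattered keywords) reproduces exactly that ranking order.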

So how do I draw those stars?

Or, more formally, how do I compute the maximum possible weight and scale the returned weights to an A-F scale, or percentages, or whatever else?

As you can see from the previous section, there’s no simple way to tell. The maximum weight depends both on the chosen ranker and the particular query. For example, an upper weight bound with the PROXIMITY_BM25 ranker would be

max_weight = num_keywords * sum ( user_field_weights ) * 1000 + 999

But can this upper bound ever be reached? Hardly, in practice, because that would require a) exact phrase matches b) in all the fields c) plus BM25 peaking at 999, which roughly translates to only using one-in-a-million keywords. Moreover, what if the query uses field limit operators, e.g. @title hello world? In that case our upper bound can never be reached, because we would never match any field except the title. For this particular query the practical upper bound, which could possibly be reached by an “ideal” document, is different.

Therefore, computing the “ideal” maximum weight (one that can actually be reached) is really, really complicated. We could possibly do that on the Sphinx side, but that’s a lengthy R&D project with a questionable outcome. So if you can’t live without percentages (or stars!), you can either use an “absolute” upper bound estimate like the one given above (which might never be practically reached, so nothing ever scores a “100% match”), or just use the maximum weight from your particular query, and rescale everything to that weight. Using multi-queries, the latter can be made pretty cheap.
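As an illustrative sketch of the two options in Python (the absolute bound from the formula above, and rescaling against the best weight actually returned by your query; the exact scaling scheme is up to the application):

```python
def proximity_bm25_upper_bound(num_keywords, user_field_weights):
    """'Absolute' PROXIMITY_BM25 upper bound; rarely reachable in practice."""
    return num_keywords * sum(user_field_weights) * 1000 + 999

def stars(weight, best_weight, levels=5):
    """Rescale a raw weight against a reference weight into 1..levels stars."""
    if best_weight <= 0:
        return 1
    return max(1, round(levels * weight / best_weight))
```

With multi-queries, fetching best_weight from the same search pass costs next to nothing.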

So how do I rank exact field matches higher?

You use a ranker that does that.

Both the SphinxAPI-default PROXIMITY and the SphinxQL-default PROXIMITY_BM25 rankers do not. They only rank a longer sub-phrase match higher, but do not care where in the field that match occurred, or whether it matched the entire field.

SPH04 ranker, added in 1.10-beta, does.

So how do I force document D to rank first?

Depending on why document D needs to be ranked higher, you either use a ranker that fits your requirements better, or use Sphinx run-time expressions to compute what you need and sort the result set differently.

For example, boosting exact field-equals-query matches could be emulated by sorting by an expression:

SELECT *, @weight+IF(fieldcrc==$querycrc,1000,0) AS myweight ...
ORDER BY myweight DESC

where fieldcrc is a CRC(field) attribute computed at indexing time and stored in the index, and querycrc is CRC(query) computed at searching time.

For example, instead of checking for a strict CRC match, you could index and store field lengths, and rank shorter fields higher by using an expression like

SELECT *, @weight+ln(len+1)*1000 AS myweight ...

For example, to force a document to rank higher when a given keyword is searched for, you create a separate field with super-important keywords, put them there, and assign a high weight to that field. (Don’t set the weight over a million though, or the 32-bit weight would overflow!)

So how does Sphinx ranking compare to system XYZ?

Major Web search engines (think Google) are an entirely different story, subject to another posting, perhaps. Web-scale ranking (and spam fighting) forces them to account for hundreds and thousands of factors in their ranking. Many of those factors (PageRank, page and domain age, incoming link count, ratio of code to text, etc) are not text related, however, and can also be employed with Sphinx in a particular application by using expressions. Sphinx itself is generic and its rankers only handle some text related factors, everything else has to be explicitly added on top.

However, most other full-text search systems still either default to plain old BM25 for text-related factors, or even limit you to it. Don’t get me wrong, BM25 is great; it’s a great weighting factor. But using it as the only ranking factor is, ahem, really last century. Sphinx proximity-based rankers take a step towards improving on that, and we plan to take a few more steps, too. Stay tuned, interesting things are going to happen. ;)



28 Responses to “How Sphinx relevance ranking works”

  1. Excellent article, clarified a lot for me. :)

  2. Dmitry Nesterov says:

    What ranking settings are used for console client “search”?

  3. Thanx greatly!

    Is there a contradiction between this article saying “SphinxAPI-default PROXIMITY and SphinxQL-default PROXIMITY_BM25” and the main doc at “8.3.2. SetRankingMode” saying “SPH_RANK_PROXIMITY_BM25, default ranking mode”?

  4. shodan says:

    Dmitry, CLI search only lets you specify the legacy matching mode which in turn selects the ranker. Default is PROXIMITY. Using -e2 would result in there-default PROXIMITY_BM25.

    Denis, it’s a bit more complicated. SphinxAPI historically defaults to MATCH_ALL mode that is now equivalent to using MATCH_EXTENDED2 mode and PROXIMITY ranker. SetRankingMode() defaults to PROXIMITY_BM25 indeed. But it does not take effect unless you switch from the legacy mode.

    Apologies for delayed followups, WordPress spam filter we’re using isn’t perfect, apparently.

  5. Remdex says:

    While SPH_RANK_WORDCOUNT is the simplest, it does the best job when we are searching by color (search by color including keyword can be done just using sphinx), when just the number of custom color occurrences in an image is the most important and order doesn’t matter.

  6. m says:

    num_keywords * sum ( user_field_weights ) * 1000 + 999

    this formula does not work for me. i am seeing weights in the results which exceed this maximum. this is with proximity_bm25

  7. m says:

    how do we determine the formula for any of these ranking modes when multiple indexes are in use? for example let’s say we’re using two indexes a and b and both indexes have fields description and title, with default user weights. then we send query “foo” against both indexes with proximity_bm25 ranking. what is the max weight in this case?

  8. shodan says:

    m, weights over the maximum might be either a bug, or because of using some of the advanced operators (such as proximity) that do additional weighting tricks internally, or because of using multiple indexes and summing the weights using index_weights feature.

    If you’re querying for just a few words in a single index, without any query syntax involved, then that’s definitely a bug. Please submit your query and data to our bugtracker and we’ll take a look.

    When querying through more than one index, a single match from just one of the indexes is going to be returned by default, so the maximum is the same.

  9. m says:

    hi shodan,

    ok thanks. i’m testing with queries of a single word against two indexes which have exactly the same config but different data. it crossed my mind that maybe targetting two indexes adds a factor of two to the max_weight formula, but it sounds like you’re saying that’s not the case. what exactly do i need to submit to the bugtracker, as far as data?

  10. shodan says:

    Nope, searching through 2 indexes can only (only) result in adding the weights together if you’re explicitly asking Sphinx to do so using either SetIndexWeights() call in the API or OPTION index_weights in SphinxQL respectively.

    Regarding the data, submit a few SQL rows, a respective snippet from a configuration file, a query you’re using, expected result, and a result you’re actually getting.

  11. myltik says:

    Brilliant! Now I know how sphinx ranking methods works :-) thanks a lot

  12. m says:

    i have a couple of questions:

    1. which of the ranking modes is the best measure of “coverage”. in other words, i want to know which document had the greatest number of keyphrases in the query, and i don’t care about their individual counts.

    2. to clarify the meaning of “phrase” as used in the article above, how many phrases are in this query: (“sphinx search” | “foo bar”)? Also will sphinx always treat phrases containing spaces as a single atomic unit for matching?

    Thanks.

  13. Ping Yin says:

    From the description, i think in sphinx LCS stands for Longest common substring instead of longest common subsequence (See http://en.wikipedia.org/wiki/Longest_common_subsequence_problem).

    LCS is a classic algorithm, from the wiki, LCS of string “ABCD” and “AXCX” should be 2, while in sphinx, it is 1.

  14. shodan says:

    @m, re (1), there is no easy way to compute coverage as of 2.0.1, but there might be as of 2.0.2 via the expression based ranker. Re (2), the entire query is used as a huge phrase for ranking purposes, so the possible LCS values should be either 2 (just one two-keyword phrase operator) or 4 (both phrase operators match and are adjacent to each other, ie. document says “sphinx search foo bar” somewhere)

    @ping, it stands for Subsequence, no mistake here. The query is our first sequence and the document is the second one, and the sequences’ elements are words.

  15. Ping Yin says:

    @shodan,
    > 2) query = one two three, field = one and two and three
    > field_phrase_weight = 1

    From my understanding, by considering a word as a single unit, the standard LCS of the query and field should be 3 instead of 1 since i can change field to “one two three” by removing some words from field. And the longest common substring is 1.

  16. shodan says:

    @ping, I see your point. Yes, formally speaking, if you define keywords (!) as alphabet elements, and treat the query and the document as “strings” in that alphabet, then we’re indeed looking at substrings in such “strings”.

    While formally correct, that’d be very confusing and hard to explain though. I would rather suggest talking of contiguous subsequences, i.e. ones that only allow picking contiguous ranges of elements, but disallow intermediate deletions.

  17. re: “TF means Term Frequency in a document being ranked. It’s based on a number of occurrences within a document but smoothed with a logarithm function, so that 1000 occurrences don’t result in 1000x improvement over just 1.”

    I see a log on the IDF part, probably to control just how much IDF affects things.

    But is term document frequency contribution logarithmic, or just sublinear? Most rankers go with sublinear, which logarithmic is of course. But this looks sublinear with a k1 parameter to control how sublinear. From what I know, Lucene/Solr chooses square root. The only ranker that treats document frequency contribution logarithmically is MySQL’s MyISAM FTS ranker. That means it really basically doesn’t care how many times the document repeats itself. Hm, logarithm sounds too aggressive.

    If it were logarithmic, BM25F basically wouldn’t work because BM25F supposes that a pseudo-document with the terms repeated to effect changes in weight will weigh that document section more highly without making a word repeated across 2 sections worth too much. Well, I guess you could just “repeat” the sections 1000x, and you’re not actually repeating, so it’s OK, but that doesn’t seem right.

    Off the top of my head, Solr sounds the most sensible, but who knows?

    This stuff always looks ad-hoc to me.

  18. Edit to the above:

    “doesn’t care” => “doesn’t care very much”

    “But this looks sublinear” => “But this looks linear”

    Sorry, should have edited =)

  19. shodan says:

    @Jaimie, you are correct, TF is actually dampened using a hyperbolic function rather than a logarithm, DampenedTF = TF/(k1+TF).

    And, by the way, re-reading that sentence now I see that I should have used “dampen” instead of “smooth” too.

    However off the top of my head, with proper parameter tweaking, any of the hyperbolic, or sqrt(), or log() functions can all work. They will all nicely dampen too high TFs, which is the point.

  20. Yes, it’s hyperbolic I guess.

    That may be true, but a logarithm would have some consequences for the “F” in BM25F. You’d have to use ridiculously big factors and get very inflated pseudo-document lengths, no? You’d have to repeat the title 1000x to get anything useful in terms of boosting that section.

    I think there are forms of BM25F where you also tweak the “free” parameters, but at the core BM25F can be implemented by repeating the field you want weighed 5x 5 times.

    That includes inflating the document length for those 4 extra times.

    Honestly, I have mixed feelings about BM25F to begin with on this point. I get what it’s trying to do. It doesn’t want a word across fields to count for more than it should, or worse report documents with all keywords (full coordination) below one with some repeated across sections. You can fix this other ways, though. Simply rank coordination first.

    Also, all you need is a long body to screw up the concept of per-title document length. You’re normalizing it for length, but losing the implied information in the fact that a really short title with all your words probably means something even if the body has lots of blah-blah.

    As usual, it all looks like handwaving to me. I see the papers sometimes I shake my head at how scientific they’re trying to be while violating common sense.

  21. shodan says:

    Nope, it would not.

    All that BM25F does is basically pre-scaling the raw TFs and field lengths, indeed. So from there it doesn’t really matter too much what *kind* of a dampening function applied against those scaled TFs you use, as long as the parameters are tuned so that the dampened result does not grow too quickly.

    And yes, BMxx family of functions is far from being THE solution. To get nicer search quality, you gotta account for much more signals than just this or that variant of BM25. But this post is aged :) Nowadays, Sphinx expression ranker lets you do that, a bunch of signals are exposed, and even more are coming in 2.2.

  22. It would not what? It would not require 1000x? Yes, I believe it possibly would _if_ you applied a logarithm to frequency. For BM25 it’s fine because it’s hyperbolic as you say.

    But TF/(k1+TF) & sqrt(TF) look very different than log(TF). Both slowly reward repetition, i.e. “dampen.” LogE(TF) grants very little reward for repetition after a certain point. And if you repeat that much it messes things up by inflating document length, but I could be wrong. Am I wrong on that? Maybe it’s OK because it’s fair for all documents in the collection.

    BM25F is “state of the art.” So I often look at it for advice. Don’t use linear combinations of scores across fields. I get why. But that’s at the expense of resolution of document length on fields such as title, though. You don’t lose that with linear combinations. And then you can make coordination a more important factor (i.e. multiply by 999 or 1000).

    Not perfect. I guess IR never is.

  23. shodan says:

    Well. Logarithmic function is not necessarily just log(tf).

    http://www.wolframalpha.com/input/?i=plot+x%2F%281.2%2Bx%29+and+log%28x%29%2Flog%282%29%2F10%2B0.45+from+1+to+32

    And document length inflation is absolutely irrelevant here. We’re only discussing the dampening function. I’m pointing out that you could get pretty close results with any *kind* of the dampening function, that is it.

  24. Not trying to argue, sir. You probably know this stuff more deeply than I ever will :)

    And how did I not know that Wolfram lets you do that? That’s awesome. They behave differently for larger values of X:

    http://www.wolframalpha.com/input/?i=plot+x%2F%281.2%2Bx%29+and+log%28x%29%2Flog%282%29%2F10%2B0.45+from+1+to+500

    I suppose anything works as long as it works for TF = 1..REASONABLE_MAX, so even if the characteristics later start to degenerate, I’m not sure that matters.

    So yes, if you ‘fix’ the logarithmic function so you don’t have to make the F part of BM25F (I mean the idea of “repeating” terms to form a pseudo-document affecting both TF and DOCLENGTH) such a high number to get a real effect, then your adjusted log would also work very well. I stand corrected.

    The other issue relates not to this, but to the other blog post you wrote. You implemented BM25F which requires all sorts of things including the average length of all fields in a corpus. I just think BM25F isn’t all it’s cracked up to be. It may not do well (I’ve been playing but nothing as rigorous as most research papers) in an eCommerce setting, for example. I believe this is because:

    1. Descriptions vary widely in length.
    2. Titles do too, to an extent.
    3. When you glue them together, even with the repetition that makes the title terms weigh more, you no longer have the quasi-proximity effect of a bunch of keywords satisfying the query in a short field, especially if the description is very long.

    If you add in a proximity factor or various other factors like you have in SPH04, maybe it’s mitigated, but I think that’s a flaw in BM25F that everyone writing research papers is ignoring. The worst part is that if you read some of their research, they say things like:

    “Second, it should be noted that taking into account the structure of a document (e.g., in BM25F) implicitly rewards proximity within important short streams, such as the title.”
    http://www.soi.city.ac.uk/~ser/papers/foundations_bm25_review.pdf

    Am I missing something here, or is document length lost when you manufacture the pseudo-document?

    Thanks for the input.

  25. shodan says:

    Why, I invite you to argument at will. It’s a very refreshing change to have *this* kind of a discussion that we’re currently having, you know :)

    On dampening functions, for any given range of interest one could always tweak the coefficients. For instance, to handle the 1 to 500 range, we can just change the log divisor from 10 to 16, and voila, the lines are not that far away from each other again:

    http://www.wolframalpha.com/input/?i=plot+x%2F%281.2%2Bx%29+and+log%28x%29%2Flog%282%29%2F16%2B0.45+from+1+to+500

    Of course, the slopes will indeed still look very different between log() and sqrt() and hyperbolic functions. After all, log() and sqrt() are unbounded while the hyperbola is bounded.

    But that’s the entire point! For instance, on the graph just above log() is climbing up somewhat slower than the hyperbola does. And theoretically, either might be useful for different scenarios.

    Now, my personal and unfounded (!) bias is that hyperbola for a dampening function is a fine 1st step, better than log() or sqrt(), because it a) differentiates between small values of TF well, and b) saturates to almost 1.0 very quickly.

    Intuitively, the 2nd step would be to not just dampen, but penalize (!) higher raw TFs, on the grounds those are likely to indicate spam. That in turn can be achieved with many different classes of dampening functions, for instance, by multiplying a couple hyperbolas:

    http://www.wolframalpha.com/input/?i=plot+x%2F%281.2%2Bx%29*101%2F%28x%2B100%29+from+1+to+32

    Unfortunately I haven’t yet had the time to check whether changing a dampening function this way would improve search quality. Cf. unfounded :-)
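    That dampen-then-penalize shape is easy to verify numerically: the product x/(1.2+x) * 101/(x+100) rises with small TFs, peaks, and then falls back as TF keeps growing (just a sanity check of the curve above, nothing Sphinx-specific).

```python
def dampen_penalize(x):
    # hyperbolic saturation times a second, slowly-decaying hyperbola;
    # the 101/(x+100) factor starts near 1.0 and pulls high TFs back down
    return x / (1.2 + x) * 101.0 / (x + 100.0)

low, mid, high = dampen_penalize(1), dampen_penalize(10), dampen_penalize(100)
```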

    On BM25F, our implementation was intended as a method to compute more individual ranking signals, rather than one Rule-them-All Uber-Signal. It’s now fully tweakable on the fly, so it’s completely up to you as to how to use it. For instance, using BM25F you now can specify *zero* weights for most fields and, say, compute BM25 over title only with some title_k1 and title_b. And then also compute BM25 over the rest of the document but not the title using a different pair of doc_k1 and doc_b parameters. And then combine them both and add a pinch of LCS or keyword density. That’s the intended use.

    And this way, you can actually avoid the “gluing” issue you’d have with just one instance of BM25F, because you can compute and combine per-field values. That’d naturally favor proximity and density.

    The document length is not really lost but just biased (towards the higher-weighted fields) when you create a pseudo-document. title_len=10 scaled by 10 would still be offset by a huge body_len=100000 and that kind of a document would always sink. Now, that can never be fixed with just 1 instance of BMxx but can easily be solved by combining BM25_title + alpha*BM25_body. Having BM25F support lets you compute just that.
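    A sketch of that per-field combination in Python (all the tf/idf numbers, lengths, weights, and the alpha below are made up for illustration): compute an independent BM25 score for title and body, each with its own k1 and b, then blend them linearly.

```python
# One independent BM25 per field instead of one glued pseudo-document:
# BM25_title + alpha * BM25_body, each with its own k1 and b.
def bm25_term(tf, doc_len, avg_len, idf, k1, b):
    norm = k1 * (1.0 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)

def field_score(tf_idf_pairs, doc_len, avg_len, k1, b):
    return sum(bm25_term(tf, doc_len, avg_len, idf, k1, b)
               for tf, idf in tf_idf_pairs)

title = [(1, 2.0), (1, 1.5)]   # (tf, idf) per query keyword, made up
body  = [(4, 2.0), (2, 1.5)]

alpha = 0.25                   # body discount, also made up
score = (field_score(title, 8, 10.0, k1=1.2, b=0.5)
         + alpha * field_score(body, 900, 1000.0, k1=1.2, b=0.75))
```

    With this shape, a short keyword-rich title can dominate even when the body is huge, which is exactly what one instance of BM25F over a glued pseudo-document can’t guarantee.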

    Last but not least, Sphinx was initially built with a key idea that BMxx is useful, but not that important, and there should be a different *major* ranking signal. For which I chose LCS a while ago. Both the PROXIMITY_BM25 and SPH04 rankers do in fact use a length-unadjusted (!) variant that degenerates into BM15, and even that works, because the rankers essentially perform an ORDER BY lcs DESC, bm15 DESC kind of sorting.

    Now, choosing LCS and making it *the* major factor was just an ad hoc decision that dates back 12 years, but we actually tested the quality of PROXIMITY_BM25 (should have been called PROXIMITY_BM15… sins of the youth) vs just BMxx a couple years ago. I can’t remember whether we used real BM25 or our simplified BM15. I do remember that even PROXIMITY_BM25 performed significantly better than just BM25, though.
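    That ORDER BY lcs DESC, bm15 DESC effect comes from packing both factors into a single integer weight. A sketch of the idea (the exact scaling constants are simplified for illustration):

```python
# Pack the major factor (LCS) and the minor one (BM15, clamped into
# 0..999) into a single integer, so sorting by weight behaves like
# ORDER BY lcs DESC, bm15 DESC.
def proximity_bm25_weight(lcs_per_field, user_weights, bm15):
    major = sum(l * w for l, w in zip(lcs_per_field, user_weights))
    minor = max(0, min(999, int(bm15 * 999)))  # bm15 assumed in 0..1
    return major * 1000 + minor

a = proximity_bm25_weight([2, 1], [10, 1], bm15=0.10)  # better proximity
b = proximity_bm25_weight([1, 1], [10, 1], bm15=0.99)  # better BM15 only
```

    Any proximity advantage at all outranks any possible BM15 advantage, which is exactly the “major signal first, BMxx as a tie-breaker” design.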

    And with all the new recently added and forthcoming factors it’s time to redo that search quality test again, and come up with an even more elaborate SPH05, SPH06 etc. If only my days had 48 hours in them… :)

  26. I thought about this a little more before replying. I’ll suppose a few things and you can tell me if I’m wrong. A little history =)

    When IR started, every text retrieval system used the boolean model, but almost all non-computer-geeks [most everyone] are bad at forming a boolean query.

    In response, the probabilistic/VSM model was introduced, which lets people like lawyers searching cases type in stuff like “workplace violence laws 1996 1997 1998 cuomo” without any thought of that nerdy boolean stuff. It will just *know*, via some (nerdy) math, which of those tokens are worthwhile. Imagine writing that in boolean form. These systems all use boolean OR semantics, so you have to think even less.

    The problem is that sometimes OR semantics are problematic. A query for “hair dryer” will show the products from the category “clothes dryers.” Ouch.

    OK, so why does all this matter? Well, what exactly does BM25 do with AND semantics? Pretty much nothing. It will favor documents covering all the words (because we’re using AND anyway!) with higher frequencies of words that are statistically less likely to appear in a document.

    Yawn. I’m personally more interested in what’s in the title.

    Am I wrong? So what’s the point of BM25F with AND semantics? It’s actually *worse* than a scaled linear sum for sure, and this is exactly what the authors of BM25F warn against. What am I missing?

    I think you agree. I’m just baffled by the non-real-world-tests these academia people must be using.

  27. Here is the paper that I’m pretty sure you’ve seen:
    http://www.hugo-zaragoza.net/academic/pdf/robertson_cikm04.pdf

    They use the phrase “dangers of score combination” several times. They criticize the idea of score combination strongly, which, of course, we’re both proposing in various ways above, no?

  28. Hi,
    Interesting discussion. I am currently doing research on large document clustering.

    Is there any research that has already been done on the Sphinx SPH04 ranker [papers/proceedings/journal]?…

    Thanks for your reply.
