Mar 22, 2013. New BM25 functions and IDF operators in custom rankers

Until 2.1.1-beta, the functions exposed in custom rankers for handling relevance based on term frequency and Inverse Document Frequency (IDF) did not take field or document lengths into account. In 2.1.1-beta, Sphinx includes functions that take relevance ranking to the next level.

New IDF functions

mysql> SELECT * FROM myindex WHERE MATCH('less_common more_common') OPTION ranker=expr('sum(1000*tf_idf)');

This simple ranking should return the documents that contain both less_common and more_common first, but beyond that the order really depends on the frequencies at which the terms appear in each document, since the tf_idf calculated for less_common could end up lower than the one for more_common.

Relevance is ultimately subjective and so we consider flexibility to be very important when it comes to relevance ranking. To provide maximum flexibility, Sphinx 2.1.1-beta includes a triad of functions that can provide several IDF values:

  • min_idf - returns the lowest IDF among the search terms (the IDF of the most frequent word)
  • max_idf - returns the highest IDF among the search terms (the IDF of the least frequent word)
  • sum_idf - sums the IDFs of the unique search terms (very useful if you want to order by the inverse frequency of the words across the whole index)

Please note that all three return float values between 0 and 1, and that they work only inside sum(), just as tf_idf does.

mysql> SELECT * FROM myindex WHERE MATCH('less_common more_common') OPTION ranker=expr('sum(1000*sum_idf)');

In this example, assuming less_common has an IDF of 0.6 and more_common has an IDF of 0.2, sum_idf will be 0.8 for documents that contain both keywords, 0.6 for those that contain only less_common, and 0.2 for those that contain only more_common.
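
Similarly, if you wanted to rank documents purely by the rarest keyword they contain, a ranker along these lines could be used (a sketch, reusing the hypothetical index and the arbitrary 1000 multiplier from the examples above):

mysql> SELECT * FROM myindex WHERE MATCH('less_common more_common') OPTION ranker=expr('sum(1000*max_idf)');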

IDF mode

It's now possible to choose between two different IDF calculations. The default IDF in 2.1.1 (the only IDF available in older versions) penalizes very common words. This default IDF yields values between -1 and 1 per keyword, with -1 for a keyword present in all documents and 1 for a keyword present in only one document.

To give an example, let's consider 'the | something' as the search query:

Our index will match three types of documents: 1) documents containing only 'the', 2) documents containing only 'something', and 3) documents containing both 'the' and 'something'.

The calculated IDF will be highest for the documents that contain only 'something'. Those with both 'the' and 'something' will have a lower IDF because of the penalty applied to the very common term 'the'.

New Option, Two Values

We've added a new option for the SELECT command called 'idf', which can take two values:

  • normalized - the default; the IDF formula is log((N-n+1)/n), where N is the collection size and n is the number of matched documents. The IDF varies between [-log(N), log(N)], and very frequent words get penalized
  • plain - uses the formula log(N/n); the IDF varies between [0, log(N)] (more exactly, between 0 and 1), and keywords are never penalized

Unlike with the default (normalized) IDF, when using plain mode the first documents from our example will be the ones containing both 'the' and 'something', followed by those with only 'something', and lastly those with only 'the'.
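
To make the difference concrete, here is a rough worked example (the numbers are purely illustrative and use natural logarithms): in a collection of N = 1,000,000 documents, a very common word matched by n = 900,000 documents gets a normalized IDF of log((1,000,000 - 900,000 + 1) / 900,000) ≈ -2.2, an actual penalty, while a rare word matched by only 100 documents gets log((1,000,000 - 100 + 1) / 100) ≈ 9.2. Under plain mode the same two words get log(1,000,000 / 900,000) ≈ 0.1 and log(1,000,000 / 100) ≈ 9.2 respectively, so the common word is merely weighted low instead of being pushed negative. These raw values are then normalized into the ranges mentioned above.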

A sample query would look like this:

mysql> SELECT * FROM myindex WHERE MATCH('the|something') OPTION idf='plain';

Global IDF

A problem with ranking in distributed searches is that BM25 and tf_idf values are calculated locally on each node. If you have an index distributed across several shards, each shard computes term frequencies based only on the data it holds. This can be a problem, as some words may be more common in some shards and less common in others. To overcome this, we've added an option that allows an external IDF file to be used.

Now, in the case of a distributed index, you can take the per-shard IDF data, merge it together and create a global IDF file, which is then used by every shard, giving you global term frequencies across the whole index.

To create the global IDF file, the first thing to do is dump the dictionary with statistics included for each shard:

$ indextool --dumpdict shard1 --stats >  shard1.dict

Next we create the IDF for that shard:

$ indextool --buildidf shard1.dict --out shard1.idf

And finally we grab the IDF files from all shards and create a single one:

$ indextool --mergeidf shard1.idf shard2.idf ... --out global.idf

Now all we have to do is to copy the global.idf to all shards and set the indexes to use this global.idf:

index shardx
{
    ...
    global_idf = /path/to/global.idf
    ...
}
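
For completeness, the distributed index that fronts these shards needs nothing extra; a minimal definition might look like the sketch below (the index name, shard names and agent address are hypothetical):

index mydistributed
{
    type  = distributed
    local = shard1
    agent = remotehost:9312:shard2
}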

By default, the local IDF is still used, so we need to set global_idf=1 in the OPTION clause so that the query uses the global IDF instead of the local one.

mysql> SELECT * FROM myindex WHERE ...  OPTION global_idf=1;

New BM25 functions

The existing bm25 operator does not compute a genuine BM25 function, as it doesn't take document length into account. Starting with this version, it's possible to use full BM25 functions in custom ranking expressions. Just enable the index option index_field_lengths so that field lengths are recorded at indexing time:

index myindex
{
    ...
    index_field_lengths = 1
    ...
}
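
Keep in mind that field lengths are collected at indexing time, so the index has to be rebuilt after enabling the option; for a plain index built with indexer and a running searchd, that would be something like:

$ indexer --rotate myindex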

BM25A(k1,b)

Computes a complete BM25 function. It takes two parameters, equivalents of the k1 and b parameters found in the BM25 algorithm:

mysql> SELECT id,source_id,added,weight() FROM static WHERE MATCH('mysql query') LIMIT 10 OPTION ranker=expr('1000*bm25a(2.0,0.75)');
+------+-----------+------------+----------+
| id   | source_id | added      | weight() |
+------+-----------+------------+----------+
|   63 |        62 | 1351773991 |      535 |
|   84 |        83 | 1351773991 |      534 |
|  162 |       161 | 1351773991 |      534 |
|   95 |        94 | 1351773991 |      533 |
|  163 |       162 | 1351773991 |      533 |
|  166 |       165 | 1351773991 |      532 |
|  113 |       112 | 1351773991 |      531 |
|  168 |       167 | 1351773991 |      531 |
|  169 |       168 | 1351773991 |      530 |
|  175 |       174 | 1351773991 |      530 |
+------+-----------+------------+----------+
10 rows in set (0.00 sec)

Both parameters are float values, and the value returned by BM25A() is a float between 0 and 1. The first parameter, k1, calibrates the document term frequency scaling: a k1 of 0 corresponds to a binary model (no term frequency), while increasing k1 gives rare words more of a boost. The second parameter, b, calibrates the scaling by document length and can take values from 0 to 1, where 0 means no length normalization and 1 corresponds to fully scaling the term weight by the document length.
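
To get a feel for the parameters, you can simply vary them in the expression. The query below, for instance, is a purely illustrative variant of the one above that sets b to 0 and thus disables length normalization entirely:

mysql> SELECT id,source_id,added,weight() FROM static WHERE MATCH('mysql query') LIMIT 10 OPTION ranker=expr('1000*bm25a(2.0,0)');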

BM25F(k1,b,{field_weights_list})

BM25F extends BM25A to multiple fields, which can be given different weights. Note that OPTION field_weights still applies when calculating LCS values (see the combined example after the tables below).

Simulating BM25A with BM25F:

mysql> SELECT id,source_id,added,weight() FROM static WHERE MATCH('mysql query') LIMIT 10 OPTION ranker=expr('1000*bm25f(2.0,0.75,{title=1,description=1})');
+------+-----------+------------+----------+
| id   | source_id | added      | weight() |
+------+-----------+------------+----------+
|   63 |        62 | 1351773991 |      535 |
|   84 |        83 | 1351773991 |      534 |
|  162 |       161 | 1351773991 |      534 |
|   95 |        94 | 1351773991 |      533 |
|  163 |       162 | 1351773991 |      533 |
|  166 |       165 | 1351773991 |      532 |
|  113 |       112 | 1351773991 |      531 |
|  168 |       167 | 1351773991 |      531 |
|  169 |       168 | 1351773991 |      530 |
|  175 |       174 | 1351773991 |      530 |
+------+-----------+------------+----------+
10 rows in set (0.00 sec)

Increasing the weight for the title will change the results:

mysql> SELECT id,source_id,added,weight() FROM static WHERE MATCH('mysql query') LIMIT 10 OPTION ranker=expr('1000*bm25f(2.0,0.75,{title=10,description=1})');
+------+-----------+------------+----------+
| id   | source_id | added      | weight() |
+------+-----------+------------+----------+
|   57 |        56 | 1351773991 |      536 |
|   63 |        62 | 1351773991 |      535 |
|   84 |        83 | 1351773991 |      535 |
|  162 |       161 | 1351773991 |      534 |
|  163 |       162 | 1351773991 |      533 |
|  168 |       167 | 1351773991 |      533 |
|   95 |        94 | 1351773991 |      532 |
|  113 |       112 | 1351773991 |      531 |
|  166 |       165 | 1351773991 |      531 |
|  169 |       168 | 1351773991 |      531 |
+------+-----------+------------+----------+
10 rows in set (0.00 sec)
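
Since OPTION field_weights still drives proximity-based factors such as lcs, both signals can be combined in one expression. The query below is only a sketch of that idea (the exact expression and weights are illustrative, not a recommendation):

mysql> SELECT id,source_id,added,weight() FROM static WHERE MATCH('mysql query') LIMIT 10 OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25f(2.0,0.75,{title=10,description=1})'), field_weights=(title=10, description=1);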

Well, that’s it for now — any questions? Need help implementing these new options? Don’t hesitate to contact us.


