Mar 15, 2010. The clone wars (how Sphinx handles duplicates)

Documents duplicated across different indexes happen not so infrequently in Sphinx instances, chiefly because of quirks with main/delta setup. Guilty as charged, our own samples on main/delta used to differentiate the rows based on timestamp, which is naturally prone to duplicates. (That’s where a good intention of keeping it simple will get you..)

So how does Sphinx handle those duplicates when indexing? When searching? When merging indexes? What happens to totals and aggregate functions, and why are those odd numbers you’re getting not necessarily a bug?

Let’s begin with duplicate document indexing. In default “extern” attribute storage mode Sphinx is able to notice duplicate IDs and warn about them. It will also automatically throw away all of the duplicated attribute rows except just one. However, that happens at a pretty late stage of indexing, when keyword occurrences are already fully processed and not going anywhere! So even though the index will only contain one set of attribute values, it will still contain all the duplicated keywords, potentially resulting in incorrect matches. Like this:

$ cat test.sql
INSERT INTO documents ( id, gid, data ) VALUES
	( 1, 123, 'this is my rifle' ),
	( 1, 456, 'this is my gun' );
 
$ indexer test
...
WARNING: duplicate document ids found
total 2 docs, 30 bytes
 
$ search -i test rifle gun
...
index 'test1': query 'rifle gun ': returned 1 matches of 1 total in 0.003 sec
 
displaying matches:
1. document=1, weight=1643, gid=456

You can also see how just one value of group ID attribute (gid=456) got into the final index — and, what’s even more important, which one made it into the index. Clearly not the first in the order of appearance. But it’s not necessarily the last one either. Essentially it’s sort of randomly selected between all the duplicates.

As a side note for meticulous readers, the selection is, of course, deterministic and not random. That is, for any given particular incoming order of rows, it will always pick up the same row. But you can’t really predict which. And, unless you’re using sorting in your database query (which you really, really should not!) you can’t predict the database-outgoing, Sphinx-incoming order either. Hence, it’s random for practical purposes, as in you can’t rely on a particular row being chosen.

The bottom line is simple, indexer will handle duplicate documents within a single index (and even warn you) and will not crash or anything, but to avoid junk attributes and search results, you want to avoid duplicates anyway.

OK, so let’s assume that every single index is “clean”, now what happens in searching time when there are duplicates across indexes? In other words, how are the duplicate documents from multiple indexes handled when combining multiple search result sets?

Short story: they are removed, the final result set is guaranteed to be dupe-free. Unlike indexing, removal order is also guaranteed, matches from indexes specified last win over matches from those specified first. Totals also get adjusted, but only to the extent possible by processing result sets themselves, and not the full match sets. And aggregates are not adjusted at all.

Duplicates are removed based on ID that is required to be unique for every document. Hence, if we search through two indexes (think main/delta) at the same time, and two documents share the same ID, one of the matches must be for an older version of the document and needs to be removed from the result set.

For the very same reason, index order matters. Sphinx expects more recent indexes to be specified later in the search query; older indexes to the left, newer ones to the right:

$client->Query ( "rifle", "main delta" );

and the matches from newer indexes are going to win. So if “rifle” query finds a document with id=1 and gid=123 in main index, and also finds a document with id=1 but gid=456 in delta, only the latter one is going to make it into final result set.

But that means we’ve just thrown away a document from combined search results. So the “total_found” values will be adjusted accordingly. And “total” value actually gets recomputed from scratch after searchd is done combining the final result set, so it gets naturally adjusted too.

Still, the totals won’t be precisely compensated for all the duplicates that might be there. Here’s a made-up example: assume that, for a given query, index A matches 10K rows dated year 2008 and 20K rows dated year 2009, and index B matches the very same 20K rows from 2009, plus 30K more from 2010. Assume that we’re sorting by date. 1K-row result set from A will be filled with rows from 2009, 1K-row set from B with rows from 2010, none of these overlap — and, even though actual “total_found” equals 60K, the value of 80K will be returned, with 20K dupes honestly counted twice. Moreover, even if the result sets would entirely overlap, it would only be adjusted by the size of the result set, and return 79K matches.

So the totals correction works fine on smaller sets but once the result set starts to grow it’s going to be off. That can, of course, be alleviated by respectively raising max_matches setting but that’s a performance impact which might or might not be justifiable.

On the other hand, when using GROUP BY, aggregate functions (SUM, AVG, MIN, MAX) will not be adjusted for duplicates, at all. Grouped result sets have only one match for every group anyway, and compensating just for that single match seems rather pointless (chances are there are (much) more dupes within the group that we never see and adjust for). So, for instance, in the extreme case when the index is fully duplicated, SUM is going to be 2x the “actual” value.

The general issue with totals and aggregates is that we’d need to somehow identify all the duplicates to be able to compute them precisely. And that, in the worst case, might take an unexpectedly high amount of CPU time, RAM, and, worst of all, precious network bandwidth in case of distributed queries. Imagine we’ve got 100K matches from local index A plus 200K more from local index B and want to compute precise total_found. Essentially we need to keep 300K document IDs for that, with 8 bytes for ID that’s extra 2.4 MB to be processed. That’s not really much but at least 0.005 to 0.01 seconds of CPU time will be spent processing these 2 megs anyway. Now imagine that all these document IDs have to be sent over the wire. Oops, 0.03 more seconds gone transmitting the data at 75 MB/sec aka 600 Mbps (which I hear is a realistic limit for most 1 Gbps gear) and that’s not counting extra copying and other OS overheads. Overall we’re talking extra 0.05 plus seconds in distributed case. Translating to 20 queries/sec/node or less. I take it that precise total_found is usually not that much important to introduce this kind of a limit.

Admittedly, the numbers are exaggerated to make a point. Not all queries are going to end up sending 300K IDs, and that data can be compressed pretty efficiently, and so on, and so forth. The point’s that it takes noticeable time to precisely compute one not super much important value in a case that’s best avoided after all. So we’re keeping the computations quick but not always precise.

We’re finally taking that time when doing physical index merging though. Starting with 0.9.9, index merge has a so-called “phantom killer” built into it. Meaning that after you’ve merged two indexes with duplicates, those duplicates should automagically go away. That wasn’ t the case with previous versions and also still isn’t the case if you use “inline” attribute storage mode, because in the latter case there’s no reasonably efficient way for indexer to identify the duplicates. Technically that can be fixed too, but I’m not really aware of anyone using inline attributes much. Index order matters when merging as well, “source” index you’re merging into “destination” index is expected to be more recent and takes precedence.

It isn’t an option to be merging all the time which brings us to the final question, how else do we avoid them dupes?

Sometimes you can just configure indexes in a way that you don’t make any. For example, if document ID is ever-growing, and data never gets updated once it’s in a system, you’d pick the rows for a delta index based on the ID precisely. That’s how examples in current documentation work; all IDs less or equal that the threshold go to one index, all IDs strictly greater than the threshold go to another one, and no ID ever happens in both indexes.

When that’s a non-option, you can use kill-lists to suppress previous versions of the documents. The documentation’s at http://sphinxsearch.com/docs/current.html#conf-sql-query-killlist (and if you find it way terse, let me know that I should consider another post explaining kill lists in detail). K-lists were also introduced in 0.9.9 (and, you bet, rely on index order as well).

So if you’re seeing odd counts from 0.9.8 indexes with dupes, it might be a) not really a bug, and b) a good idea to upgrade.


« »

3 Responses to “The clone wars (how Sphinx handles duplicates)”

  1. Ashwin says:

    Hi,

    Please do post a post on Kill-list. The documentation doesn’t explain too much. Updates & Deletes are vital to many apps and kill-lists will be very helpful.

    Thanks,
    Ashwin

  2. Sebastian says:

    A somewhat related question on aggregates:

    Let’s consider a main+delta schema and the data structure example in your post.

    The desired effect is to get count(distinct gid) for the whole data (including main and delta).

    Now:
    select count(distinct gid) from main
    as well as
    select count(distinct gid) from delta
    both work as expected.

    However
    select count(distinct gid) from main + delta
    will actually simply run the count on each index, then add the results, which will actually not result in the correct number.

    At the moment, we’ve setup a schema in which we merge delta into main every hour, and only query main. This will of course not hold too well as the indexes grow.

    Is there a better way to obtain that number at the moment?

  3. shodan says:

    @Sebastian, the only alternative I could immediately think of would be to check all GIDs from delta for presence in main. And note that COUNT(DISTINCT) even against a single index might be imprecise in case there are a lot of groups. Essentially we chose fast approximate path with strict RAM limits for GROUP BY rather than the exact path.

Leave a Reply