Fixing a subtle bug in a major new feature put one tiny indexing optimization on the radar. We copy hits (keyword occurrences) from the data source buffer to the indexing buffer, and currently the per-document data sometimes has to be split on an indexing buffer boundary. That was previously fine but started to cause issues in the newly written code.
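To make the splitting concrete, here is a minimal sketch of the idea, not the actual Sphinx code. The hit layout `(doc_id, position)`, the function names `fill_buffers` and `split_documents`, and the tiny buffer capacity are all assumptions made up for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical hit layout: (doc_id, keyword_position).
Hit = Tuple[int, int]


def fill_buffers(docs: Iterable[List[Hit]], capacity: int) -> List[List[Hit]]:
    """Copy per-document hits into fixed-size buffers, flushing when full.

    A document whose hits do not fit in the remaining room gets split
    across two (or more) flushed buffers.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    flushed: List[List[Hit]] = []
    current: List[Hit] = []
    for hits in docs:
        pos = 0
        while pos < len(hits):
            if len(current) == capacity:
                # Buffer is full: flush it and start a new one.
                flushed.append(current)
                current = []
            take = min(capacity - len(current), len(hits) - pos)
            current.extend(hits[pos:pos + take])
            pos += take
    if current:
        flushed.append(current)
    return flushed


def split_documents(buffers: List[List[Hit]]) -> List[int]:
    """Return the ids of documents whose hits landed in more than one buffer."""
    seen: Dict[int, Set[int]] = {}
    for index, buf in enumerate(buffers):
        for doc_id, _ in buf:
            seen.setdefault(doc_id, set()).add(index)
    return sorted(doc_id for doc_id, idxs in seen.items() if len(idxs) > 1)


docs = [
    [(1, 0), (1, 1), (1, 2)],
    [(2, 0), (2, 1), (2, 2)],
    [(3, 0)],
]
buffers = fill_buffers(docs, capacity=4)
print(split_documents(buffers))  # prints [2]
```

With a capacity of 4, document 2 ends up with one hit in the first buffer and two in the second. Any code that assumes a document's hits arrive in one piece has to cope with that case.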
Removing that redundant copy is a pretty bulky chunk of work. At the same time, the benefits are really low, if measurable at all. On a quick test the difference was at most 1%, and in fact it might just have been timer jitter. So I will be sticking with a quick and dirty fix for a while.
But what’s interesting here is that 1) a quick fix is, well, possible but pretty dirty, and the only real solution seems to be making that bulky change (at least I can’t come up with any other “clean enough” way to fix it yet); 2) in the end, removing that redundancy looks like it will simplify the indexing loop, even if it brings no measurable performance difference; and 3) that quick test yields the counter-intuitive result that copying data around isn’t actually as slow as one might expect.
So the code sort of said “rewrite me”. What started as a seemingly simple bug fix is going to cause a reasonably sized internal change at some point. On days like this, code bases seem to evolve in mysterious ways, almost intelligently.
Ah, and that major feature I mentioned is a new dictionary type that stores keywords and eliminates the need for slow prefix/infix indexing. Much less mysterious and much more predictable, that optimization is going to speed up substring indexing a lot. And it was conceived by a human, which is reassuring.

