Archive for March, 2010

Full-text search BOF session at MySQL Conference

Wednesday, March 31st, 2010

Just wanted to post a quick update that our Birds-of-a-Feather session proposal for MySQL Conference got approved too, so come join us on Tuesday, April 13th, 7:00 PM PST if you are around the Santa Clara Convention Center. We’ll be discussing all things full-text search, including Sphinx, of course, but not really limited to it. Evening BOF sessions were free to attend in previous years, so consider it even if you aren’t attending the conference itself.

Now if you are, I can’t help but plug a reminder about the “Sphinx: full-text search in 2010” talk the next evening, that is, Wednesday, April 14th, 5:15 PM PST. Note that the talk will be an overview of newly added features rather than a bird’s-eye view of the entire system. So if you haven’t used Sphinx before, I suggest fetching a copy and going through some tutorial before the talk. Even though we aren’t pushing our install as “the famous 5-minute” one, it still doesn’t take more than that. (So, too bad this punch line is taken.)

Sphinx vs MySQL expression benchmarks

Monday, March 29th, 2010

Curiously, even though we have supported arbitrary arithmetic expressions for a while, I’ve never actually benchmarked how our implementation compares to other ones, for instance, to MySQL. Time to fill that gap.

I used the current trunk version of Sphinx (namely r2265) and MySQL 5.0.37, both running on Windows. Test data was 1 million random rows generated with the following PHP script. Table type defaults to InnoDB.
(more…)

Sphinx powers Mozilla’s addons search

Friday, March 26th, 2010

Are you familiar with Mozilla Addons? Those nifty little plugins/widgets for either your Thunderbird email client or Firefox browser that let you block those annoying ads, change the theme, or even manage your mp3 playlist while browsing the internet (my favorite). In fact, according to Alexa.com, the Mozilla Addons site represents 72.8% of the total traffic traveling to mozilla.org. As a result of a conversation I had with Dave Dash, Senior Web Developer at Mozilla, earlier this week, and of reading through his blog, I just learned that Sphinx is actually used to power the addons site.

On the surface, addon search is not all that complicated. They have roughly 5,000 addons. That’s pretty small, so it should be easy to index with MySQL and get the job done, right? Not so fast. Looking beneath the covers reveals that loads peaked at ~10 queries/second, and that they needed to index not only the addons themselves but also secondary metadata about multiple locales (US, FR, DE, etc), platforms, and product lines (Firefox, Thunderbird, Seamonkey, Sunbird or Fennec). Last but not least, Mozilla also needed to index every translation of those addons, which involved a number of joins and queries in the end. So with MySQL, that was a strain on their system resources. With Sphinx, their most complicated queries can be run every 5 minutes, as opposed to 180,000 times per day on demand. It was also pleasant to read that the “API was a breeze to work with, and was easy to drop into our own codebase.”
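To put the two figures quoted above side by side, here is a quick back-of-the-envelope calculation. It uses only the numbers from this post (a 5-minute interval and 180,000 on-demand runs per day); the variable names are made up for illustration.

```python
# Back-of-the-envelope comparison using only the figures quoted in the post.
MINUTES_PER_DAY = 24 * 60

scheduled_interval_min = 5        # the heavy queries now run every 5 minutes
on_demand_runs_per_day = 180_000  # previously run on demand, per the post

scheduled_runs_per_day = MINUTES_PER_DAY // scheduled_interval_min
reduction_factor = on_demand_runs_per_day / scheduled_runs_per_day

print(scheduled_runs_per_day)  # 288
print(reduction_factor)        # 625.0
```

In other words, running on a schedule means the expensive queries execute a few hundred times a day rather than a few hundred thousand.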

You can find the details in Dave’s entry on Sphinx. I hope you all found this as interesting as I did. Oh, and this was Richard Kelm, your friendly Sphinx sales guy, so (obligatory plug) if that wasn’t technical enough and you’ve got any other questions on using, tuning, or scaling Sphinx, feel free to email me or give me a call. Thanks!

Porting legacy matching modes to SphinxQL

Tuesday, March 23rd, 2010

So you’ve decided to switch to SphinxQL (which is SQL syntax now supported by Sphinx). However, its MATCH() statement uses query syntax that was previously available from “extended” matching mode, and you’re using some other mode. What do you do?

Legacy modes had two major ingredients in them. First, the way they pick matches (well, we didn’t call them matching modes for nothing). Second, the way they rank them. You can have both perfectly well in the new scheme of things. For matching, you rewrite the query using full-text query syntax. For ranking, you pick a proper ranker using the OPTION clause. Moreover, the two are no longer tied together, so it’s more flexible now.

There were 4 modes before the query language, namely ALL, ANY, PHRASE, and BOOLEAN, standard prefix omitted for brevity. How do you reproduce their matching behavior now?
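As a rough illustration of the matching half, here is a minimal Python sketch that turns a plain keyword string into an extended-syntax query for each of the four modes. This is our own guess at a reasonable mapping, not the recipe from the full post: the function names are hypothetical, the list of characters to escape is an assumption you should check against your Sphinx version, and ANY is emulated here with the OR operator.

```python
import re

def escape_keyword(word):
    """Backslash-escape characters that extended syntax may treat as operators.

    The character list is an assumption for illustration only.
    """
    return re.sub(r'([\\()|\-!@~"&/^$=<])', r'\\\1', word)

def rewrite_legacy_query(text, mode):
    """Return an extended-syntax query that mimics a legacy matching mode."""
    words = [escape_keyword(w) for w in text.split()]
    if mode == "ALL":
        # Extended syntax ANDs keywords implicitly.
        return " ".join(words)
    if mode == "ANY":
        # Match documents containing at least one keyword.
        return " | ".join(words)
    if mode == "PHRASE":
        # Double quotes request an exact phrase match.
        return '"' + " ".join(words) + '"'
    if mode == "BOOLEAN":
        # Assumed to be passed through unchanged, operators included.
        return text
    raise ValueError("unknown legacy mode: " + mode)

print(rewrite_legacy_query("hello world", "ANY"))     # hello | world
print(rewrite_legacy_query("hello world", "PHRASE"))  # "hello world"
```

The resulting string would go inside MATCH(), and the ranking half is then chosen separately through the OPTION clause.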

(more…)

Mysterious ways

Wednesday, March 17th, 2010

Fixing a subtle bug in a major new feature put one tiny indexing optimization possibility on the radar. We’re copying hits (keyword occurrences) from the data source buffer to the indexing buffer, and currently, the per-document data sometimes has to be split on an indexing buffer boundary. That was previously fine but started to cause issues in newly written code.

Removing that redundancy is a pretty bulky chunk of work. At the same time the benefits are really low, if measurable at all. On a quick test the difference was at most 1% and, in fact, it just might have been timer jitter. So I will be sticking with a quick and dirty fix for some time (or a while).

But what’s interesting here is how 1) a quick fix is, well, possible but pretty dirty, and the only real solution seems to be making that bulky change (at least I can’t come up with any other “clean enough” way to fix it yet), and 2) in the end, removing that redundancy looks like it will simplify the indexing loop, even if it makes no measurable performance difference. Plus that quick test yields a counter-intuitive result that 3) copying data around isn’t actually as super slow as one might expect.

So the code sort of said “rewrite me” in a sense. What started as a seemingly simple bug fix is going to cause a reasonably sized internal change at some point. On days like this, the code base seems to evolve in mysterious ways, almost intelligently.

Ah, and that major feature I mentioned is a new dictionary type that stores keywords and eliminates the need for slow prefix/infix indexing. Much less mysterious, much more predictable, that optimization is going to speed up sub-string indexing a lot. And was conceived by a human, which is relaxing.

The clone wars (how Sphinx handles duplicates)

Monday, March 15th, 2010

Documents duplicated across different indexes happen not so infrequently in Sphinx instances, chiefly because of quirks with main/delta setup. Guilty as charged, our own samples on main/delta used to differentiate the rows based on timestamp, which is naturally prone to duplicates. (That’s where a good intention of keeping it simple takes one.)
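To see why a timestamp-based split is prone to duplicates, here is a small Python sketch of one plausible scenario. It is an assumption for illustration, not the actual sample configuration: the table contents, the cutoff, and the variable names are all made up.

```python
from datetime import datetime

# Hypothetical source table: document id -> last-modified timestamp.
rows = {
    1: datetime(2010, 3, 1),
    2: datetime(2010, 3, 2),
    3: datetime(2010, 3, 3),
}

# The main index is built and snapshots everything up to the cutoff.
cutoff = datetime(2010, 3, 10)
main_ids = {doc_id for doc_id, ts in rows.items() if ts <= cutoff}

# Later on, document 2 gets edited and document 4 gets added.
rows[2] = datetime(2010, 3, 12)
rows[4] = datetime(2010, 3, 13)

# The delta index picks up everything newer than the cutoff.
delta_ids = {doc_id for doc_id, ts in rows.items() if ts > cutoff}

# Document 2 now lives in both indexes: stale in main, fresh in delta.
duplicates = main_ids & delta_ids
print(sorted(duplicates))  # [2]
```

Any row that changes after the main index was built ends up in both places, which is exactly the kind of duplicate discussed here.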

So how does Sphinx handle those duplicates when indexing? When searching? When merging the indexes? What happens to totals and aggregate functions, and why are those odd numbers you’re getting not necessarily a bug?

(more…)

Hello world!

Sunday, March 14th, 2010

It’s the year 2010, so we’re catching up and making ourselves a blog too. I gather that’s just a mere decade later than everyone else.

What am I going (well, hoping) to post here? Development diary notes, benchmarks, educational articles, personal rants, and generally everything that comes to mind but isn’t big enough for an official company news entry nor tiny enough for a Twitter entry. (Too rare occasions, those.)

It’s obviously going to be dedicated to searching in general, with some bias to Sphinx in particular, but not necessarily always about it. This being declared a devs’ blog and not a Sphinx-only blog, we do reserve the right to make a post or three on general programming topics. Or politics. Or beautiful women. And margaritas (frozen classic one, please, and do not salt the glass).

First post!