Curiously, even though we support arbitary arithmetic expressions for a while, I’ve never actually benchmarked how does our implementation compare to other ones, for instance, to MySQL. Time to fill that gap.
I used current trunk version of Sphinx (namely r2265) and MySQL 5.0.37, both running on Windows. Test data was 1 million random rows generated with following PHP script. Table type defaults to InnoDB.
mysql_connect ( "localhost", "root", "" ) or die ( "connect" ); mysql_select_db ( "test" ) or die ( "select_db" ); mysql_query ( "drop table if exists benchexpr" ); mysql_query ( "create table benchexpr ( a int not null, b int not null, c int not null )" ); mt_srand ( 0xdeadbeef ); for ( $i=0; $i<1000000; $i++ ) mysql_query ( sprintf ( "insert into benchexpr values ( %d, %d, %d )", mt_rand(), mt_rand(), mt_rand() ) );
The queries were in the form of SELECT MIN(a) AS result, 1 AS groupkey FROM bench GROUP BY groupkey, because of SphinxQL syntax being too tight (that’s to be fixed as time permits). Exactly the same queries were run against Sphinx and MySQL. Every query was repeated 3 times and minimum run time was picked, as it jittered for 0.01 or 0.02 sec across runs. Without further ado, there go the results!
|N||expression||MySQL result||time||Sphinx result||time||k|
So there’s a noticeable difference in performance (and now I know that our expression calculator performs pretty well actually). However there are also some discrepancies in the results. Why?
Generally the reason is that Sphinx expression type system is (slightly) different. Let’s dissect specific examples.
All arguments to (a+b*c) expression are unsigned 32-bit integers (uint32), so Sphinx computed the expression as uint32. The result is however too big for uint32 and wraps around. This is how C/C++ compilers behave, actually, so you get have exactly the same result from uint32 res = a*b+c line in a C program. Databases are cleary different and auto-switch to a wider type. Now, when we enforce BIGINT type on one of the arguments (which is signed 64-bit aka int64), results start to match!
As for other expression that involve floating-point operations, like division or SIN, in those cases Sphinx (again working more like C/C++ compiler than a DB) computes results in 32-bit floats, while MySQL (supposedly) uses 64-bit doubles. Hence the roundoff errors and different results. Those are especially visible on this randomized data. For one, results are very different in SIN() case, basically because the sine values are small and 1 million times the 32-bit float roundoff error ends up being pretty big.
Another interesting finding is that MySQL does not optimize expressions really well, if at all. See for instance how expressions 5-6 and expressions 9-10 resulted in upto almost 2x different times from MySQL. Sphinx expression optimizer folds the constants and shows exactly the same timings. Of course examples like 5+6+7 are exaggerated but c/3*4 on the other hand is not, and it results in 20% difference.
Last but not least, despite being faster, we do still have plenty of room for optimizations. The expression calculations could be batched. GROUP BY 1 trick can not only fixed in syntax but also optimized for. I’ve made a very simple JIT expression compiler prototype once, boosting expressions like a+b*c to almost native performance levels. All that combined, I think there’s a solid chance to beat MySQL not just 20x but over 100x at complex enough expressions. Anybody out there doing lots of math in MySQL who can share samples of production queries?
The bottom line? Sphinx expressions can perform 3x to 20x better, and our optimizer also seems better. On int and bigint paths, there are no precision issues, but you have to be careful with C-style typing. On float path, there is a natural precision issue with 32-bit single precision type, C-style float once again. So I’ll be adding double precision support and redoing these benchmarks.
|« March 26, 2010. Sphinx powers Mozilla’s addons search||March 31, 2010. Full-text search BOF session at MySQL Conference »|