Indexing with Sphinx is pretty straightforward and a simple 3-line setup will normally work pretty well for you. But there always is room for improvement. So today, we have a collection of tips for you that cover a few dark corners in the indexing process that can be subject to such improvements: (better) ranged queries, throttling, data security when indexing over the network, and disk usage.
Two-faced indexing performance
Performance questions generally fall in two different categories: 1) increasing indexing speed, 2) otherwise improving resource use during indexing, or throttling. Extremely quick indexing is always a nice thing to have. Except when it’s not. What if the price is too high? What if the indexer completes its work in 1 minute, but the system is entirely unusable during that time? It might be better to have it work for 5 minutes, but keeping the system steady. That’s what throttling is all about.
Also, if we’re talking about indexing from a database source, we always need to take in the consideration that the database server iteslf can affect the performance a lot, and that both indexing performance and general database server performance can be impacted.
Tip #1: indexing performance is a two-faced problem, and, counter-intuitively sometimes you might want to reduce search indexing speed in order to improve performance of the entire system (application + database + search).
Optimizing and/or throttling SQL data sources
Sphinx has a number of interesting directives to improve the speed and leverage the load when working with SQL sources. The two directives you are probably already comfortable with are sql_query_range and sql_range_step. Basically, they let you slice and dice the query, greatly reduce locking with MyISAM, minimize locking with InnoDB too, and are the only reasonable way to go in the Postgres world, because PG client library always fetches the entire result set into memory first. What else is there to them?
Tip #2: be sure to have an index on the column you’re using for your ranged queries. Ideally, just use PK for this. For one, InnoDB tables are usually clustered on PK, so that way sql_query fetches will be the quickest.
Tip #3: do not overslice the data that you fetch. In other words, use a high enough sql_range_step. Real world example: a table with just 1,000,000 rows was taking forever to index. Why? Despite just 1M rows, the ID column values in that table ranged from 1 to a few billion, so with a default setting of sql_range_step = 1000 Sphinx indexer ended up issuing a few million queries, most of which were returning no results. Bumping up the step resolved the problem immediately.
Tip #4: do not forget to examine the main sql_query fetch query, with range conditions attached. Running a simple EXPLAIN might sometimes very quickly reveal that, say, you need an extra index.
Tip #5: consider using sql_ranged_throttle if your DB box maxes out during indexing. That directive lets you insert a small delay between each ranged query fetch, thus putting less load on the DB.
And one more tip for SQL sources in general, not necessarily for ranged queries:
Tip #6: consider using sql_joined_field directive if you have expensive JOINs and GROUP_CONCATs. That one essentially emulates both JOIN and GROUP_CONCAT on the Sphinx side, and does that much more efficiently that MySQL.
Sphinx indexer also comes with a few throttling settings which can be used to “tame the beast”, that is, control indexer RAM an IO use.
indexer RAM consumption is strictly limited with mem_limit. It defaults to 32 MB only, so that everything would works on a small system too, brining us to the next tip.
Tip #7: get your mem_limit right for your data size and RAM. The default value of 32M can work alright for smaller collections, but a bigger buffer helps indexing a lot. Note that the max possible value of mem_limit is 2047M, attempting to set it higher might overflow (indexer should however warn about that). Curiously, that is not a big restriction, because even for the biggest collections, mem_limit between 1024M and 2047M should suffice. 32M vs 512M usally makes a big difference, but it slows down quickly, and 1024M vs 1536M might make no difference at all.
In addition to mem_limit, indexer will use several (up to 4) additional write buffers, sized as per write buffer directive. They default to 1M each. Increasing the write buffers leads to fewer write IOs from Sphinx to OS, but both OS and even the disk drives buffer the writes too, and so:
Tip #8: do NOT set the write_buffer too high. YMMV, but anything within 1M to 16M should suffice, and these buffers will rarely get you much indexing improvement anyway. Setting them to something big like 128M makes no sense; that will only consume precious memory; but that will not improve indexing speed.
Jumping from RAM to disk, if indexer stress the disk too much, you can put a limit on that, too.
Tip #9: use max_iops and/or max_iosize directives to limit the impact on the disk subsystem. Use max_iops if indexer generates too much IO operations for the disks to handled. Additionally use max_iosize if indexer generates way too much IO traffic (even if with only a few operations).
max_iops (maximum I/O operations per second) is perhaps one of the most effective “damage control” barriers in the indexer. Disk is both a frequent bottleneck, and something that makes your system unresponsible when it’s clogged. Why is that? Most spinning hard disks can only do 70-100 ios/sec, maybe upto 200 ios/sec for expensive SAS drives. RAID setups can improve that, but only a few more times. SSD disks are much better, they range anywhere 400 ios/sec to 50,000 to 100,000 ios/sec in case of enterprise SSDs. But there’s a catch: those high IOPS figures are achieved using small IO size, say, using 4 KB blocks. So if indexer writes a lot of data and consumes all the bandwidth, the entire server will still be affected, despite that we’re nowhere near the IOPS limit. This is exactly the scenario in which you need to limit max_iosize too.
Checking how much does indexer consume exactly can be done running iostat on linux (on some distributions the package is called sysstat) while indexing. Last but not least, note that these indexer throttling options are applicable to xmlpipe sources just as well. They are about the index data the indexer itself reads and writes to disk; irregardless of the data source.
MySQL networking: speed and security
And now for something completely different. When Sphinx indexer and the DB are on different physical servers, transfering the data necessary for indexing over the network can be a problem. Servers in the same datacenter with a fast enough (1Gbps) link between should not be a problem. However, if the connection is 100Mbps, or even worse (when the servers are geographically remote), or, by the way, when the servers are hosted in the cloud and the traffic between instances is taxed, this setup may have some issues… either performance or financial ones.
Tip #10: consider MySQL compression on a slower (100 Mbps and under) or expensive network connection. To enable compression, set mysql_connect_flags = 32. That is a specific bit that enables compression in the client library. (You can find a reference on the other flags in the MySQL protocol internal description.) In our experience, you can expect savings out of this even on 100 Mbps LAN connection. Be advised that this puts a little more CPU strain to compress and decompress the data on both servers.
Tip #11: on an untrusted network, use SSL protected connections to MySQL for data security. indexer provides the option to setup a SSL certificate with mysql_ssl_cert, mysql_ssl_key, mysql_ssl_ca directives.
Finally, for something yet different, indexing a huge collection can consume way too much disk. Indeed, the old index files are kept so that we could run the searches. In addition to that, indexer writes an intermediate temporary files, and in addition to that, the final newly constructed index. So peak disk use could be as big as 3x the normal use.
Tip #12: use inplace inversion (inplace_enable directive) to save disk space when rebuilding huge collections. This is an alternative indexing implementation that reuses those temporary to store the final new index data, and thuse reduces the disk requirements from ~3x to ~2x. On our testing it was only marginally (about 5%) slower than the regular indexing.
The slowdown comes from the fact that if we reuse the temporary files, we might need to move around the data. Very basically, the final index is a fully sorted list of (word,docid) pairs and the temporary files are a partially sorted chunks of that final list, sized according to mem_limit. So during the “sorting hits” phase indexer reads those partially sorted chunks, merges them, and writes the final index. Apparently, we can not simply write the final sorted data over the temporary data: we might overwrite something useful! In that case, Sphinx stops for a moment, relocates the data to be overwritten, and resumes sorting. There are a few inplace inversion fine-tuning directives that control that process: inplace_hit_gap and inplace_docinfo_gap directives control how much space will be preallocated in the beginning of the temporary file for the final index hits and docinfos, inplace_reloc_factor controls what fraction of mem_limit pool will be used for the relocation buffer, and inplace_write_factor controls what fraction of memory will be used for the write buffer. All these come with reasonable defaults, but sometimes you might still need to fine tune these, as having more memory allocated can speed up the process (by allowing larger portions to be processed) and also reduce the IO operations.
It should be mentioned that using inplace inversion does not affect searching in any way, since the index format does not change. The resulting indexes must be identical, whether you use inplace inversion or now, it’s all about optimizing the index creation process.
When Sphinx indexing does not work exactly as you want it to out of the box, there are a bunch of performance and/or behaviour tweaking options available that let you optimize the impact on the SQL database, or throttle the RAM, disk, or network consumption in different ways.
|« August 7, 2012. A neat optimization for facets tabs using cutoff||September 16, 2012. Sphinx coming to San Francisco: Sept 29-Oct 5th »|