Oct 19, 2011. dist_threads, the New Right Way to use many cores

One of the applications of distributed indexes in Sphinx is parallelizing queries across many CPU cores even when running on a single server. There’s a well-known trick of having an agent line (or three) point back to the very same master searchd instance. The only problem with that approach is that every query entails a bunch of one-off TCP connections, extra forks, and other redundant internal work. Which is okay when you’re serving a few heavy queries, but can burn over 50% of your CPU in system time doing that work when you’re serving many quick ones.

Now that’s a problem, but starting with 1.10-beta, there is a solution: the dist_threads directive. So if you’re still doing that agent=localhost trick, and suffering from TCP stack pressure and/or seeing way too much system time in top(1) or vmstat(8), do read on, you are eligible. (As a side note, if you’re still on anything pre-2.0.1, you should seriously consider upgrading, too.)
First of all, how do you use it? Let’s learn by example. Assume you’re running on a 4-way machine and you’ve got 4 index shards set up like this, using the good old agent=localhost trick:

# old way to use 4 cores
index dist1
{
    local = chunk1
    agent = localhost:9312:chunk2
    agent = localhost:9312:chunk3
    agent = localhost:9312:chunk4
}

How to convert? That’s very simple. Specify all local indexes as local in your distributed index, and set dist_threads to a value higher than one in the searchd section. Be sure to be running 1.10-beta or higher, of course, and don’t forget to restart searchd. That’s it, you’re all set:

# new right way to use 4 cores
index dist1
{
    local = chunk1
    local = chunk2
    local = chunk3
    local = chunk4
}
 
searchd
{
    dist_threads = 4
    # ...
}

From here, searchd will internally create up to dist_threads threads to search through local indexes, and round-robin the indexes across those threads. In the example just above, there are 4 chunks and up to 4 threads, so every chunk gets a separate thread. Values of dist_threads above 4 would be handled just the same: every chunk still gets its own thread. If dist_threads was set to 2, there would be 2 threads: the 1st thread would search through chunk1 and then chunk3, and the 2nd one would handle chunk2 and then chunk4. If dist_threads was set to 3, there would be 3 threads: the 1st thread would search through chunk1 and chunk4, the 2nd through chunk2, and the 3rd through chunk3. Hope you get the idea.
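That round-robin assignment can be sketched in a few lines of Python. This is a hypothetical illustration of the scheduling described above, not actual searchd source; the helper name assign_chunks is made up:

```python
def assign_chunks(chunks, dist_threads):
    """Round-robin a list of local index names across worker threads.

    Thread i ends up with chunks i, i+T, i+2T, ... (0-based),
    where T = min(dist_threads, number of chunks).
    """
    threads = min(dist_threads, len(chunks))
    buckets = [[] for _ in range(threads)]
    for i, chunk in enumerate(chunks):
        buckets[i % threads].append(chunk)
    return buckets

chunks = ["chunk1", "chunk2", "chunk3", "chunk4"]
print(assign_chunks(chunks, 2))  # [['chunk1', 'chunk3'], ['chunk2', 'chunk4']]
print(assign_chunks(chunks, 3))  # [['chunk1', 'chunk4'], ['chunk2'], ['chunk3']]
print(assign_chunks(chunks, 8))  # still one chunk per thread, 4 threads total
```

With more chunks than threads, each thread simply works through its bucket sequentially; with dist_threads at or above the chunk count, every chunk gets its own thread.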

Despite the name of the directive, dist_threads does NOT require you to use workers=threads. That’s right, you can still use workers=fork, so that every search request spawns a separate process; dist_threads would then just create more threads within that process. That combines the best of both worlds: if a query crashes, it only takes down the separate process handling the current search request, so nothing else is affected. At the same time, a single query is able to span many cores. Pretty neat.
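For the avoidance of doubt, a searchd section combining the two could look like this (a sketch, assuming the 4-chunk setup from above):

```ini
# fork-per-request isolation, plus multi-threaded local searching
searchd
{
    workers      = fork  # each request handled in its own process
    dist_threads = 4     # ...which then runs up to 4 search threads
    # ...
}
```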

Moreover, dist_threads does not kick in for purely local distributed indexes only; any search that involves more than 1 local index is eligible. Hence, if you have your agents on remote machines configured like this:

# master config, old way of parallelizing remote queries
index dist1
{
    local = chunk1
    local = chunk2
    agent = box2:9312:chunk1
    agent = box2:9312:chunk2
}

Then you should reconfigure both the master and the agent. On the agent, just add dist_threads and that’s it. The final master configuration should look like this:

# master config, new right way of parallelizing remote queries
index dist1
{
    local = chunk1
    local = chunk2
 
    # new right way, ask agent once, let it parallelize itself
    # but don't forget to bump dist_threads on agent!
    agent = box2:9312:chunk1,chunk2
}
 
searchd
{
    # this will affect (parallelize) local searching only
    # that is, chunk1 and chunk2
    dist_threads = 4
 
    # ...
}
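The matching agent-side change is just the one directive. A sketch of the relevant part of the box2 config, assuming chunk1 and chunk2 are defined there as plain local indexes:

```ini
# agent (box2) config sketch
searchd
{
    # parallelize the combined chunk1,chunk2 search sent by master
    dist_threads = 2

    # ...
}
```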

This not only reduces CPU time previously spent in forks (which are more expensive than threads), but also saves on network traffic. Indeed, the master now only needs to send a query to the agent(s) once, and the agents parallelize the work on their side themselves. No need to repeat a query two times, two times.

Are the search results going to be the same? They absolutely should be. Even though the assignment of indexes to threads can technically happen in a, well, arbitrary fashion, searchd still combines the search results in the order of appearance in the configuration file. So the duplicate elimination order is preserved, and the kill-lists are applied in the right order too.

What is the optimal value for dist_threads; what should you use? That depends greatly on your workload, how you shard your indexes, whether you’re running in a virtualized environment, what your hypervisor is, etc. Here are a few rules of thumb. On a CPU-bound workload, you probably want to utilize as many cores as possible, and thus should set dist_threads to the number of physical cores (HyperThreading is typically, well, not great nowadays). That can, however, induce some thread stomping in a VM environment, and you might get better latency there by setting dist_threads lower. Last but not least, with an HDD-bound (or HDD-hitting) workload it’s not CPUs but HDDs that are the bottleneck, and, therefore, you should tie dist_threads to the number of physical HDDs in your disk subsystem (though the specific optimal value may vary depending on RAID type, cluster size, etc). Be sure to run a few experiments and benchmarks. (Or perhaps consider our consulting services if you’d like us to run those for you.)

Bottom line: if you are using the agent=localhost trick, you should definitely consider dist_threads, now. It strips a bunch of redundant work (forks, network time, request parsing, etc) from searchd while doing exactly the same job and delivering exactly the same results. In extreme cases, we’ve seen about a 2x overall CPU time improvement (1 fork per request instead of 4 forks per request makes a lot of difference, and then again you can use prefork), and it always means (much) less pressure on the TCP stack.

