May 7, 2014. Sphinx at Divendo

Many Sphinx users utilize MySQL as a data source. Divendo is one of them. This blog post will relay the story of how Divendo got going with Sphinx after MySQL fulltext search became unwieldy. Enjoy!

Divendo

Divendo is a meta-search engine for classified ads serving the following continents and countries:

  • Europe: Spain, Italy, Portugal, United Kingdom
  • Americas: Mexico, Argentina, Brazil, Chile, Colombia, Peru, Venezuela
  • Asia: India
  • Australia

Divendo allows users in all these countries to search classified ads for cars, real estate, job listings and articles. Users can also define custom alerts by email or RSS and Divendo will notify them when new listings related to their interests become available.

For an example of Sphinx in action at Divendo, take a look at this Sphinx-powered search for jobs related to Sphinx in Spain.

Sphinx implementation history:

Here’s what the Divendo team had to say about their Sphinx implementation:

In the beginning we used MySQL fulltext. But over time, we saw that it was not enough, because the system became too slow with many concurrent searches. So, we decided to try Sphinx (version 0.9.9) on a single instance with 2x CPU and 14GB RAM (we didn’t know much about the real resource requirements).

After a couple of months of testing, we successfully implemented Sphinx in our services. We saw that the index was growing too fast, so we decided to split the indexes by country and type, and also to adopt the so-called “delta + main” scheme. We also upgraded to a more powerful server with 8x CPU and 8GB of RAM.

Because our searches are complex, we try to give the user the best results by using what Sphinx calls “Multi-Queries” to perform the following searches: premium, exact, approximate (if we cannot give the user an exact result, we give them the best approximate result) and diversified (we give the user the best ad of each partner). The ads in the premium search belong to partners who pay per click, while the ads in the exact search are organic search results. Each type of query uses a different index.

Once again we needed more processing resources. The next step was to divide the indexes horizontally, moving to a distributed setup of 4 servers (6x CPU and 4GB RAM each). We also took the opportunity to upgrade the system to 2.0.4. Those changes, made at the beginning of May 2013, allowed us to improve the speed of the site as well as give users more accurate results.

At the end of October 2013, after we added 2 more nodes to the cluster, we discovered that we were not taking advantage of the cache system of Sphinx. We have a lot of searches, but they are usually completely different from each other. So we added a “Memcache” layer, where we store each result for at least 2 hours, in order to reduce CPU usage.
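To make the caching idea concrete, here is a minimal Python sketch of a cache layer in front of search. It is not Divendo's code: the `TTLCache` class is an in-memory stand-in for a memcached client, and the key format, `cached_search` and the `search_fn` callable are all hypothetical names chosen for illustration. Only the 2-hour lifetime comes from the story above.

```python
import hashlib
import json
import time


class TTLCache:
    """In-memory stand-in for a memcached client (hypothetical, for illustration)."""

    def __init__(self, clock=time.time):
        self._store = {}
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired entries behave like a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, self._clock() + ttl_seconds)


# The post says results are kept for at least 2 hours; exactly 2 hours is assumed here.
CACHE_TTL_SECONDS = 2 * 60 * 60


def cache_key(index, query, filters):
    """Build a stable key from the search parameters (the key format is an assumption)."""
    payload = json.dumps(
        {"index": index, "query": query, "filters": filters}, sort_keys=True
    )
    return "search:" + hashlib.sha1(payload.encode("utf-8")).hexdigest()


def cached_search(cache, search_fn, index, query, filters=None):
    """Return the cached result if present; otherwise search and store the result."""
    filters = filters or {}
    key = cache_key(index, query, filters)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = search_fn(index, query, filters)
    cache.set(key, result, CACHE_TTL_SECONDS)
    return result
```

Hashing the sorted parameters means two requests with the same filters in a different order share one cache entry, so only the first of them reaches the search servers.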

Multi-Queries:

Divendo mentioned multi-queries. They also mentioned discovering that they “were not taking advantage of the cache system of Sphinx.” To understand what they’re talking about, it helps to understand what multi-queries are.

Multi-queries, or query batches, let you send multiple queries to Sphinx in one go (more formally, in one network request), which lets Sphinx optimize across them. The two main optimizations to know about are common query optimization and common subtree optimization. To learn more about them, read the documentation (some of which is repeated below).

From the docs:

Common query optimization means that searchd will identify all those queries in a batch where only the sorting and group-by settings differ, and only perform searching once. Common subtree optimization lets searchd exploit similarities between batched full-text queries. It identifies common full-text query parts (subtrees) in all queries, and caches them between queries.

Why use multi-queries? Generally, it all boils down to performance. First, by sending requests to searchd in a batch instead of one by one, you always save a bit by doing less network roundtrips. Second, sending queries in a batch enables searchd to perform certain internal optimizations. As new types of optimizations are being added over time, it makes sense to pack all the queries into batches where possible, so that simply upgrading Sphinx to a new version would automatically enable new optimizations.

Why (or rather when) not use multi-queries? Multi-queries require all the queries in a batch to be independent, and sometimes they aren’t. That is, sometimes query B is based on query A results, and so can only be set up after executing query A. For instance, you might want to display results from a secondary index if and only if there were no results found in a primary index (as is the case for Divendo), or maybe just specify offset into 2nd result set based on the amount of matches in the 1st result set. In that case, you will have to use separate queries (or separate batches).
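Here is a small Python sketch of that distinction between independent and dependent queries. It is an illustration under assumptions, not Divendo's implementation: `backend` is a hypothetical callable standing in for one network request to searchd, and the index names (`premium_es`, `exact_es` and so on) are made up. With the official Sphinx API clients, the batch step would roughly correspond to several `AddQuery()` calls followed by one `RunQueries()` call.

```python
def run_batch(backend, queries):
    """Send a list of (name, index, query) tuples as one batch.

    `backend` is a hypothetical callable that receives the whole batch as a list
    of (index, query) pairs and returns one result list per query, in order.
    """
    results = backend([(index, query) for _, index, query in queries])
    return {name: rows for (name, _, _), rows in zip(queries, results)}


def search_ads(backend, text, country="es"):
    # Batch 1: these queries do not depend on each other,
    # so they can share a single round trip.
    found = run_batch(backend, [
        ("premium", f"premium_{country}", text),
        ("exact", f"exact_{country}", text),
        ("diversified", f"diversified_{country}", text),
    ])
    # The approximate query depends on the outcome of the exact query,
    # so it has to go in a separate, later batch.
    if not found["exact"]:
        found.update(
            run_batch(backend, [("approximate", f"approx_{country}", text)])
        )
    return found
```

When the exact search finds something, the whole page costs one round trip; the second request is only paid for when the fallback is actually needed.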

Divendo Resource Structure:

4 Web Servers:

  • - CPU: 4 x
    - RAM: 5 GB
    - HD: 32 GB Local

Static Content Server:

  • - CPU: 6 x
    - RAM: 6 GB
    - HD: 32 GB Local
    + 1200 GB Cabina (storage array)

Parser Servers:

  1. - CPU: 4 x
    - RAM: 4 GB
    - HD: 15 GB Local
  2. - CPU: 8 x
    - RAM: 2 GB
    - HD: 15 GB Local
  3. - CPU: 5 x
    - RAM: 2 GB
    - HD: 15 GB Local
  4. - CPU: 6 x
    - RAM: 4 GB
    - HD: 15 GB Local

Sphinx Servers:

  1. Master Node:
    - CPU: 2 x
    - RAM: 4 GB
    - HD: 10 GB Local
  2. Distributed:
    - CPU: 8 x
    - RAM: 12 GB
    - HD: 32 GB Local
  3. Distributed:
    - CPU: 8 x
    - RAM: 12 GB
    - HD: 32 GB Local
  4. Distributed:
    - CPU: 8 x
    - RAM: 12 GB
    - HD: 32 GB Local
  5. Distributed:
    - CPU: 8 x
    - RAM: 12 GB
    - HD: 32 GB Local
  6. Distributed:
    - CPU: 8 x
    - RAM: 4 GB
    - HD: 32 GB Local
  7. Distributed:
    - CPU: 8 x
    - RAM: 4 GB
    - HD: 32 GB Local

Database Servers:

  1. MySQL Slave: Web Requests
    - CPU: 12 x
    - RAM: 7 GB
    - HD: 232 GB Local
  2. MySQL Master
    - CPU: 12 x
    - RAM: 9 GB
    - HD: 350 GB Local
  3. Sphinx Generate Index:
    - CPU: 12 x
    - RAM: 4 GB
    - HD: 232 GB Local

Memcache Servers:

  1. Admin Sessions:
    - CPU: 2 x
    - RAM: 2.5 GB
    - HD: 16 GB Local
  2. Parsers:
    - CPU: 2 x
    - RAM: 3 GB
    - HD: 6 GB Local
  3. Web:
    - CPU: 2 x
    - RAM: 16 GB
    - HD: 10 GB Cabina (storage array)
  4. Web:
    - CPU: 2 x
    - RAM: 16 GB
    - HD: 10 GB Local
  5. Web:
    - CPU: 2 x
    - RAM: 16 GB
    - HD: 10 GB Local
  6. Searches:
    - CPU: 2 x
    - RAM: 16 GB
    - HD: 10 GB Local
  7. Searches:
    - CPU: 2 x
    - RAM: 16 GB

Divendo Search Stats

Queries per day. Search pages => 32,000 pageviews * 2 sub-queries (ads + filter) * 4 types (premium, exact, diversified and approximate) = 256,000

Total index size. 3 GB * 6 servers = 18 GB with approximately 84,000,000 documents.
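For a rough sense of scale, the index figures can be broken down per server and per document. This is back-of-the-envelope arithmetic only: it assumes documents are spread evenly across the 6 servers and that a GB is 10^9 bytes, neither of which the Divendo team stated.

```python
GB = 10**9  # decimal gigabyte (an assumption)

index_per_server_gb = 3
servers = 6
documents = 84_000_000

total_index_gb = index_per_server_gb * servers    # 18
docs_per_server = documents // servers            # 14,000,000 if evenly spread
bytes_per_doc = total_index_gb * GB / documents   # about 214 bytes of index per document

print(total_index_gb, docs_per_server, round(bytes_per_doc))
```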

Bye bye

Thanks for reading about Sphinx at Divendo. Another happy Sphinx + MySQL user. If you have questions for the Divendo team about what they’ve learned, I’m sure they would be happy to answer them. Just ask in the comments below.

Happy Sphinxing!



2 Responses to “Sphinx at Divendo”

  1. danblack says:

    I’m interested how the master/distributed sphinx servers work? Is it as simple as the master generating an index and this being pushed atomically to the slaves?

  2. steve says:

    Hi Dan, I asked the Divendo team, they say that each distributed server generates its own portion of the index.
