Jul 18, 2014. String filtering and a simple use case

Sphinx 2.2.3 added the ability to filter on string attributes. This can be quite useful; you may even be able to replace some of your fulltext queries with it.

Consider this:

Let’s pretend that we have an application that stores some metrics in a database. In our case, the application stores the metrics reported by the SHOW STATUS command (or searchd --status). We store the data in a MySQL table with a simple structure: a server id, the name of the metric, the value of the metric, the date when it was added (and, of course, an autoincrementing id).
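
For reference, the table could look roughly like this (the column names come from the queries below; the types and the secondary index are assumptions):

    CREATE TABLE mon_status (
        id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        serverid VARCHAR(32)  NOT NULL,  -- which server the metric came from
        metric   VARCHAR(64)  NOT NULL,  -- metric name, e.g. 'queries'
        value    VARCHAR(64)  NOT NULL,  -- metric value as reported by SHOW STATUS
        added    INT UNSIGNED NOT NULL,  -- unix timestamp of the sample
        KEY idx_server_metric_added (serverid, metric, added)
    );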

In our frontend, we want to display charts of this logged data over several periods of time. To do that, we need to extract a series of data points spanning an interval (i.e., the last 3 days, or the last day) using a stepping interval.

Translated into a SQL query, that’s a GROUP BY on an expression of the date divided by the step.

mysql> SELECT ROUND(added/($interval)) AS crunch_date, GROUP_CONCAT(value ORDER BY added ASC) AS crunch_value FROM mon_status WHERE serverid='$serverid' AND metric='$name' AND added > $date_interval GROUP BY crunch_date

Switching the backend to Sphinx is quite easy. We need to create an index that gets the data from this table. The SphinxQL query will be similar to the MySQL statement, so it will be easy to port. The question now is: how do we search the index? We have two strings in our index: the serverid and the metric name.
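
For illustration, the source and index definitions could look roughly like this (names, paths and credentials are made up; declaring serverid and metric with sql_field_string makes them available both as fulltext fields and as string attributes, so both approaches discussed below can be tried against the same index):

    source mon_status_src
    {
        type      = mysql
        sql_host  = localhost
        sql_user  = monitor
        sql_pass  = secret
        sql_db    = metrics

        sql_query = SELECT id, serverid, metric, value, added FROM mon_status

        # indexed as fulltext fields AND stored as string attributes
        sql_field_string = serverid
        sql_field_string = metric

        # value is kept as an attribute so GROUP_CONCAT(value) can read it
        sql_attr_string  = value
        sql_attr_uint    = added
    }

    index monstatus
    {
        source = mon_status_src
        path   = /var/lib/sphinx/monstatus
    }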

One approach would be to make both serverid and metric fulltext fields and do fulltext matching. However, we may be surprised to find that the Sphinx query is slower than the MySQL one (considering that we’re using an optimized MySQL install with indexes on the columns, etc.).

To give more detail: our table contains 11M records, all coming from a single server.

mysql> SELECT ceil( added / ( 300 ) ) AS crunch, GROUP_CONCAT( value) as value1   FROM monstatus  WHERE MATCH ('@serverid xyz @metric queries') AND added > 1405386678 GROUP BY crunch WITHIN GROUP  ORDER BY added ASC  LIMIT 0,20  OPTION ranker=none,max_matches=200

Query time: 0.208 sec

This query gets data from the last 24 hours. The counterpart in MySQL takes 0.10 sec.

Why?

If we run SHOW META after the query, we’ll find that our ‘serverid’ keyword matches all documents, which means 11M hits. If we take the serverid out of the query, it takes only 0.016 sec. This shows the impact of an extremely common keyword.
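
To check the per-keyword statistics yourself, run SHOW META right after the SELECT in the same SphinxQL session; the keyword[N], docs[N] and hits[N] rows show how many documents and occurrences each keyword matched:

    mysql> SELECT id FROM monstatus WHERE MATCH('@serverid xyz @metric queries') LIMIT 1;
    mysql> SHOW META;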

So, let’s store both ‘serverid’ and ‘metric’ as string attributes. Let’s change the query: instead of fulltext matching, we’ll do string filtering:

mysql> SELECT ceil( added / ( 300 ) ) AS crunch, GROUP_CONCAT( value) as value1  FROM monstatus  WHERE serverid = 'xyz' AND metric = 'queries' AND added >1405386678 GROUP BY crunch WITHIN GROUP  ORDER BY added ASC  LIMIT 0,20

Query time: 0.006 sec

That’s fast, right? So we should just use string filtering? Not exactly. Let’s take a look at a bigger interval, like the last month:

In MySQL, the time increases to 0.16 sec.

Using fulltext matching on both ‘serverid’ and ‘metric’ we get 0.216 sec. The increase in query time is smaller in Sphinx than in MySQL: the fulltext match is the same as in the first case; what grows is the number of records passing the date filter (and therefore the number of records processed by the grouping).

Using string filters, we get 0.222 sec. Because there is no fulltext matching, the query is a fullscan: the engine needs to iterate over the attributes and compare the values for every document.

The order of the filters in the WHERE clause matters. If we put ‘metric’ first, we can expect fewer matches, so it should improve the search time:

mysql>  SELECT ceil( added / ( 3600) ) AS crunch, GROUP_CONCAT( value) as value1 FROM monstatus WHERE metric='queries' AND  serverid='xyz' AND added>1402840520 GROUP BY crunch WITHIN GROUP ORDER BY added ASC LIMIT 90,30 OPTION max_matches=200;

Query time: 0.17 sec

It’s an improvement, but it may take some effort to figure out which filter should come first. In the case above, if we filter first by date, we’ll get all the metrics (SHOW STATUS reports more than 20 of them). If we had 30 servers instead of just one, and the search covered a small period of time, filtering first by date might well be faster.

So how do we make it even faster?

We saw earlier that if we do a fulltext search just on the metric, the query is fast (the metric had only ~110,000 hits). To get only the data for the server we are looking for, we add a string filter on the serverid.

The query will look like:

SELECT ceil( added / ( 300 ) ) AS crunch, GROUP_CONCAT( value) as value1  FROM monstatus  WHERE MATCH ('@metric queries') AND serverid = 'xyz'  AND added >1405386678 GROUP BY crunch WITHIN GROUP  ORDER BY added ASC  LIMIT 0,20

Query time: 0.043 sec

That’s a lot faster than the 0.163 sec in MySQL.

The string filtering doesn’t add much to the query time, because the fulltext matching leaves only a limited number of records to be checked by the string filter. The string filter is helpful in cases where you need an exact match on a value that is very common across the entire collection.

How will this scale?

The data will grow over time, which will increase the number of hits for each metric. If we add more servers, the hits for each metric will increase as well, and the speed of the query depends on the number of hits for the metric. On the other hand, we’ve already seen that a fullscan (filtering only) is a good idea only when the number of matching records is small.

So it’s time for sharding.

Consider these things:

  • The metrics can end up with lots of hits, so we need to reduce the number of those hits in each shard. One way is to have one index per server (see the configuration sketch after this list).
  • Sphinx stores the min/max values for each numeric attribute. When a search includes a numeric filter, it checks whether the filtered value falls between the min and max; if it doesn’t, the query clearly cannot return any results, so Sphinx doesn’t even continue with the fulltext matching. So sharding by date makes a lot of sense over time: querying an archive index for last month’s records will finish very fast because of the date filter.
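
As a rough sketch of such a layout (index names, sources and paths are made up), the per-server shards could be tied together with a distributed index:

    index monstatus_srv1
    {
        source = mon_status_srv1
        path   = /var/lib/sphinx/monstatus_srv1
    }

    index monstatus_srv2
    {
        source = mon_status_srv2
        path   = /var/lib/sphinx/monstatus_srv2
    }

    # one entry point that searches all the per-server shards
    index monstatus_all
    {
        type  = distributed
        local = monstatus_srv1
        local = monstatus_srv2
    }

A chart for a single server can then query only its own shard (FROM monstatus_srv1), while an overview across servers can go through monstatus_all. The same idea applies to date-based shards, where the min/max check on ‘added’ lets a shard that cannot contain the requested period finish almost immediately.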

Conclusion

We hope this exploration was enlightening. If you have any questions, please ask them.

Happy Sphinxing!

