Feb 7, 2014. Use Sphinx with MySQL

In this blog post (which was inspired by Adrian Nuta’s recent talk about Sphinx at FOSDEM), we go through some more of the differences between MySQL fulltext search and Sphinx. People frequently ask us questions along these lines, so we decided another blog post on the subject would be worthwhile. Check it out!

Introduction

This blog post is for those of you who use MySQL, but are just getting to know Sphinx. We will do some quick comparing/contrasting and then dive into some Sphinx ranking examples (at this point, the comparison with MySQL will end.. these examples don’t really have a MySQL counterpart). Enjoy!

Fulltext Search in MySQL

Fulltext search has been available in MyISAM for a long time now. Recently, with MySQL 5.6, fulltext search became available in InnoDB.

MySQL is a powerful and popular database. But! Even with all the attention MySQL has given to fulltext search, when compared with Sphinx, MySQL’s fulltext search is much less flexible (less feature-rich) and much less powerful (less performant). So, in short, you should probably let MySQL do what it’s good at, and let Sphinx handle fulltext search.

Why Sphinx?

There are many reasons to use Sphinx with MySQL, here are some of them:

Simple fulltext search examples

Sphinx supports a subset of SQL (SphinxQL). You can query Sphinx using the MySQL command line client.

MySQL:

mysql> SELECT * FROM myindex
WHERE MATCH('title,content') AGAINST ('find me fast');

Sphinx:

mysql> SELECT * FROM myindex
WHERE MATCH('find me fast');

More complete Sphinx example

This is a more typical Sphinx example. It demonstrates quorum matching, ordering, and the expression based ranker.

mysql> SELECT * FROM myindex 
WHERE MATCH('"a quorum search is made here"/4')
ORDER BY WEIGHT() DESC, id ASC
OPTION ranker = expr('sum(exact_hit+10*(min_hit_pos==1)+lcs*(0.1*my_attr))*1000 +bm25');

Searching only selected fields

With Sphinx, it’s possible to limit your searches to specific fields using the field search operator. With MySQL, this task would be more complicated. You would have to declare a separate index for each field.

With Sphinx, limiting search to specific fields looks like this:

mysql> SELECT * FROM myindex
WHERE MATCH(‘@(title,content) find me fast’);

Indexing Options

As was mentioned earlier, Sphinx has more fulltext indexing options than MySQL. Here are some of those options:

There’s a lot to be said about each option. Luckily, they’re already described in the documentation (which is where the links above will take you).

Searching operators

Sphinx supports an advanced fulltext query syntax. There is an example of how to use the extended query syntax here. And, you can read about all the search operators here.

Search operators include (but are not limited to):

  • wildcard
  • proximity
  • phrase
  • start/end
  • quorum matching
  • strict order
  • sentence, paragraph, HTML zone limitation

Ranking

Ranking, or ‘weighting’, is described in more detail here. But, in short, ranking has to do with listing matched documents in order of relevance. When searching with Sphinx, each document gets a ‘weight’, which is a number used to describe how relevant that document is to the query.

Normally, the weight of each document is not returned in results, but if you would like to see the weights of all matched documents, you can just add WEIGHT() to your query. It would look like this:

mysql> SELECT *, WEIGHT() FROM yourtable WHERE MATCH(‘your query’);

Relevance is subjective (dependent on context) and so, to make relevance ranking easy, Sphinx supports a variety of built in rankers. But, to make relevance ranking really flexible, Sphinx also supports an expression based ranker where many ranking factors can be used together to create all sorts of relevance ranking formulas. With the expression based ranker, the sky is the limit.

This blog post is an overview of the relatively new IDF functions, different IDF modes, global IDF and the different BM25 functions that Sphinx supports. It’s a good read. Check it out.

That covers the main (obvious) differences. To sum it up: Sphinx brings powerful and flexible fulltext search to MySQL (or, to whatever database you prefer, but we’re talking about MySQL). Let’s finish this post with some Sphinx ranking examples.

Ranking with and without field weighting

The following two examples demonstrate how to use field weighting. They both use the same expression based ranking formula (hit_count*user_weight) but the first example has field_weights set to 1 for both fields and the second example sets the title’s weight to 100.

Notice how changing the field weighting for ‘title’ changes the document’s weight and (therefore) it’s position in the results.

Without field weighting:

mysql> SELECT id,title,weight() FROM wikipedia 
WHERE MATCH('inverted index') 
OPTION ranker=expr('sum(hit_count*user_weight)'), 
field_weights=(title=1,body=1);
+-----------+-----------------------+----------+
|     id    |         title         | weight() |
+-----------+-----------------------+----------+
| 221501516 | Index (search engine) |    125   |
| 221487412 | Inverted index        |    47    |

Doc. 221501516: 1 hit in ‘title’ x 100 + 124 hits in ‘body’ = 125
Doc. 221487412: 2 hits in ‘title’x 100 + 45 hits in ‘body’ = 47

With field weighting:

mysql> SELECT id,title,WEIGHT() FROM index 
WHERE MATCH('inverted index') 
OPTION ranker=expr('sum(hit_count*user_weight)'), 
field_weights=(title=100,body=1);
+-----------+-----------------------+----------+
|    id     |         title         | WEIGHT() |
+-----------+-----------------------+----------+
| 221487412 |     Inverted index    |    245   |
| 221501516 | Index (search engine) |    224   |

Doc. 221501516: 1 hit in ‘title’ x 100 + 124 hits in ‘body’ = 100+124 = 224
Doc. 221487412: 2 hits in ‘title’ x 100 + 45 hits in ‘body’ = 200 +45 = 245

Word proximity

Here, we’re searching the ‘title’ field and we’re using the expression based ranker. lcs (longest common subsequence) is a field-level ranking factor. It has a value of 1 when only stray keywords are matched. When keywords are matched in sequence, lcs takes a value equal to the amount of the keywords in sequence. You can see below that, with the query ‘@title list of football players’, the document containing ‘List of football players from Amsterdam’ was given a weight of 4. This is because the 4 keywords from the document match the sequence of keywords from the query.

mysql> SELECT id,title,WEIGHT() FROM index
WHERE MATCH('@title list of football players') OPTION ranker=expr('sum(lcs)');
+-----------+----------------------------------------------------+----------+
|    id     |                         title                      | weight() |
+-----------+----------------------------------------------------+----------+
| 207381464 | List of football players from Amsterdam            |    4     |
| 221196229 | List of Football Kingz F.C. players                |    3     |
| 210456301 | List of Florida State University football players  |    2     |
+-----------+----------------------------------------------------+----------+

Word and hit count

These two examples highlight the difference between word_count and hit_count. Notice that with ‘hit_count’ the document is given a weight of 3. This is because every occurrence (every hit) of the keyword in the document adds to its weight. With ‘word_count’, the same document is given a weight of 2. ‘word_count’ is the number of unique keywords that matched. Each unique word is only counted once.

mysql> SELECT id,title,WEIGHT() AS w FROM index 
WHERE MATCH('@title php | api') 
OPTION ranker=expr('sum(hit_count)');
+---------+------------------------------------+---+
|   id    |              title                 | w |
+---------+------------------------------------+---+
| 1000671 | PHP API gives PHP Warnings - tips? | 3 |

mysql> SELECT id,title,WEIGHT() AS w FROM index 
WHERE MATCH('@title php | api') 
OPTION ranker=expr('sum(word_count)');
+---------+------------------------------------+---+
|   id    |               title                | w |
+---------+------------------------------------+---+
| 1000671 | PHP API gives PHP Warnings - tips? | 2 |

Position

This example shows how to use ‘min_hit_pos’ in an expression based ranking formula. You can see that the weight of each document is equal to the position of the first matched keyword occurrence.

mysql> select id,title,weight() as w from forum where match('@title sphinx php api')
option ranker=expr('sum(min_hit_pos)');
+---------+-------------------------------------------------------+---+
|   id    |                      title                            | w |
+---------+-------------------------------------------------------+---+
| 1004955 | how can i do a sample search use sphinx php api       | 9 |
| 1004900 | How to update fulltext field using sphinx api of PHP? | 7 |
| 1008783 | Update MVA-Attributes with the PHP-API Sphinx 2.0.2   | 6 |
| 1000498 | Limits in sphinx when using PHP sphinx API            | 3 |

Conclusions

So, we pointed out some differences between MySQL and Sphinx, we pointed out how easy it is to use Sphinx with MySQL, and then we dove into some examples to demonstrate what we’re talking about. Some of our readers already know all about the reasons to use Sphinx with MySQL, but hopefully this blog post has been helpful to those of you who are just getting started with Sphinx.

Happy Sphinxing!


« »

Leave a Reply