We’ve been running an ongoing user survey and have received some interesting and useful feedback. Here, we’ll play around with the results from the ‘wishlist’ and ‘comments’ sections of that survey using Sphinx. Enjoy!
The indexer program includes a handy option that you can use from the command line (--buildstops). It builds a dictionary of the most frequent words in a collection of documents. A second option (--buildfreqs) adds each word’s frequency to that dictionary. We will be using these options to look at the words frequently mentioned by survey respondents.
The command looks like this (‘responses’ is the name of our index, freq_words.txt is the name of the file to be output, and 1000 is the number of words we want our list to contain):
indexer responses --buildstops freq_words.txt 1000 --buildfreqs
This will create a file called freq_words.txt in the bin folder of your Sphinx directory. This file contains the 1000 most frequent words and their frequencies (one entry per line). Normally, you’d use --buildstops to help with creating a stopwords list, but for our purposes here, it’s just interesting to see the most frequent words.
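If you want to poke at that file from a script, here is a minimal Python sketch. It assumes each line holds a word followed by its count, separated by whitespace; that layout is our assumption, so check your own output first. The helper name parse_freqs is hypothetical.

```python
def parse_freqs(text):
    """Parse 'word count' lines into a list of (word, count) tuples.

    Assumption: one whitespace-separated 'word count' pair per line.
    Lines that do not fit that shape are skipped.
    """
    pairs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            pairs.append((parts[0], int(parts[1])))
    # Sort by descending count so the most frequent words come first.
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs

# Possible usage (the path is an assumption; adjust it to your setup):
#     with open("freq_words.txt", encoding="utf-8") as handle:
#         top_ten = parse_freqs(handle.read())[:10]
```

Skipping malformed lines keeps the sketch forgiving if the file contains blank lines or anything unexpected.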
What’s a stopword?
Stopwords are words that you’d rather not index because they don’t add much value to search (and by getting rid of them you save precious resources). The top 10 words, produced by running --buildstops on our survey response data, were “to, and, the, for, a, i, sphinx, is, in, it”. In most cases, those words won’t be important to users’ information needs (unless you’re searching Shakespeare and need to find something like ‘to be or not to be’). You can avoid indexing them without making search difficult by using a stopwords list (you just need to enable this option in your configuration file). Read more about how to use a stopwords list here.
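Not every frequent word belongs on a stopwords list, though. In our top 10, “sphinx” is something people may well want to search for. The following Python sketch shows one way to turn a ranked word list into stopwords file contents while keeping a few hand-picked terms. The function names and the keep list are hypothetical, and writing one word per line is our choice for the sketch.

```python
# Top 10 words from our survey data, most frequent first.
TOP_WORDS = ["to", "and", "the", "for", "a", "i", "sphinx", "is", "in", "it"]

def pick_stopwords(ranked_words, top_n, keep=()):
    """Return the first top_n ranked words, minus any words in keep."""
    keep_set = {word.lower() for word in keep}
    return [w for w in ranked_words[:top_n] if w.lower() not in keep_set]

def format_stopwords(words):
    """Render the words as plain text, one word per line."""
    return "\n".join(words) + "\n"

stopwords = pick_stopwords(TOP_WORDS, 10, keep={"sphinx"})
stopwords_text = format_stopwords(stopwords)
# stopwords_text could then be saved to a file and that file referenced
# from the index's stopwords setting in the configuration file.
```

Reviewing the list by hand before using it is still a good idea, since the most frequent words depend entirely on your data.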
Sphinx word cloud
With a list of words and their frequencies, you might choose to build a word cloud. We did. Our word cloud looks like this (the super boring words such as “in”, “and” and “the” have been removed):
Briefly scanning through the list of frequently used words, we noticed a few things:
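If you’d like to try something similar, the core of a word cloud is just mapping counts to font sizes. Here is a rough Python sketch of that step. It is not how we built ours; the function name cloud_sizes, the linear scaling and the point sizes are all assumptions made for illustration.

```python
def cloud_sizes(pairs, boring, min_pt=10, max_pt=48):
    """Map (word, count) pairs to font sizes, skipping boring words.

    Sizes are scaled linearly between min_pt and max_pt.
    """
    boring_set = {word.lower() for word in boring}
    kept = [(w, c) for w, c in pairs if w.lower() not in boring_set]
    if not kept:
        return {}
    low = min(count for _, count in kept)
    high = max(count for _, count in kept)
    span = high - low
    sizes = {}
    for word, count in kept:
        # When every count is equal, fall back to the largest size.
        scale = (count - low) / span if span else 1.0
        sizes[word] = round(min_pt + scale * (max_pt - min_pt))
    return sizes
```

Linear scaling is the simplest choice. If a handful of words dominate, scaling by the logarithm of the count may give a more readable cloud.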
There’s a lot of interest around JSON. MySQL was frequently mentioned (not surprising). SphinxQL is more frequently mentioned than SphinxSE or SphinxAPI. There is interest around Sphinx configuration and sharding. And, last but not least, Linux is mentioned more frequently than Windows. Interesting.
We thought it would be fun to dig into these responses with some Sphinx queries. The graph below demonstrates some (not all) of the different kinds of queries supported by Sphinx’s extended query syntax. It shows the number of documents returned by each query.
There is a phrase query ("thank you"), a proximity query ("sphinxql real time"~5), a query with the implicit AND operator (bad worse horrible, which returned no results), a query with the OR operator (bad|worse|horrible, which returned one result), a query with the NOT operator (sphinx -search) and a query using the quorum operator ("thank you so very much"/3). To learn more about these and the other types of full-text search queries supported by Sphinx, go here.
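To make the difference between these operators concrete, here is a toy Python illustration of what each one asks of a single document. To be clear, this is not how Sphinx evaluates queries; it only mimics the matching rules on a lowercased word list, and every function name in it is hypothetical. The proximity operator is left out to keep the sketch short.

```python
import re

def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def match_all(doc, words):
    """Implicit AND: every query word must appear."""
    return set(tokens(words)) <= set(tokens(doc))

def match_any(doc, words):
    """OR: at least one query word must appear."""
    return bool(set(tokens(words)) & set(tokens(doc)))

def match_not(doc, include, exclude):
    """NOT: the include words appear and the exclude words do not."""
    return match_all(doc, include) and not match_any(doc, exclude)

def match_quorum(doc, words, threshold):
    """Quorum: at least threshold of the query words appear."""
    found = set(tokens(words)) & set(tokens(doc))
    return len(found) >= threshold

def match_phrase(doc, phrase):
    """Phrase: the query words appear next to each other, in order."""
    doc_tokens, wanted = tokens(doc), tokens(phrase)
    n = len(wanted)
    return any(doc_tokens[i:i + n] == wanted
               for i in range(len(doc_tokens) - n + 1))
```

Seen this way, it makes sense that the AND query came back empty while the OR query found a match: a response needs all three words for the first but only one of them for the second.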
And, that’s it for now. This wasn’t exactly an in-depth investigation, but hopefully it was enjoyable. Thanks for reading (and thanks for your interest in the Sphinx community).