Jan 29, 2013. A new tool in the trunk: wordbreaker

In 2.1.1, Sphinx’s main application brings not only new features, but also a handy new tool: wordbreaker. As its name suggests, wordbreaker splits long words into several smaller words. This tool has many applications, but in this post we’ll focus on breaking apart URLs.

The Idea

The initial motivation behind the creation of wordbreaker was to have a tool that made indexing URL names faster and easier. Imagine that we have a field that holds the URL for a document. In a default setup, searching for “sphinx” will not match a field containing “sphinxsearch.com”, on the grounds that words “sphinx” and “sphinxsearch” do differ. We could use substrings, either prefixes or infixes, but those would somewhat bloat our index by storing things like “sphin”, “sphinxs”, “sphinxse” and so on. And those we aren’t really interested in. So how can we find sphinxsearch.com but avoid indexing all those non-words at the same time? We can break the URL into parts and search through the broken pieces (i.e., the now easily identifiable words).

Dictionary based word breaker

Wordbreaker needs a dictionary to recognize individual substrings within a string. Also, to differentiate between different guesses, we need to attach a frequency to each word in the dictionary, and then compute a combined probability (higher keyword frequency = higher split probability of being a candidate). Sound familiar? It should be if you’ve ever seen the dictionary file with frequencies generated by the indexer program. Attaching frequencies to words in the wordbreaker dictionary requires that we have the output of the indexer command when ran with –buildstops –buildfreq.

Let’s recap:

indexer --buildstops dict.txt 100000 --buildfreqs myindex -c /path/to/sphinx.conf

This will write the first 100000 most frequent words along with their count from myindex index into dict.txt.

And you are set! Wordbreaker does not need anything else. Use indexer, generate a frequency dictionary, and that is it. The file is really a text file, so you can edit it manually, if need be. (Say, if some non-word is erroneously in your index, and too frequent, too, causing wordbreaker to generate wrong splits, you can just remove it from dict.txt with any text editor.)

Preparing the dictionary

Depending on your use case, you might need to index different data when generating a dictionary, and/or clean up either the source data, or the resulting dictionary.

For one, according to the original research paper that wordbreaker is based upon, when breaking down the domains and URLs, you get the best results when using Web document titles only, and not the contents. We also found that cleaning up those titles a little, removing the URLs and a few more things from them, reduces noise and improves quality significantly.

Second, the languages have to match. Apparently, having a dictionary built with Russian data won’t get you far when breaking down German compounds. (And, by the way, “compound” is the official term for multi-word creations like Volkswagen or Acetylmethylcarbinol.)

Last but not least, you might need to filter out the unwanted compounds themselves from the resulting dictionary. That can be done either manually or semi-manually, by passing each compound suspect through wordbreaker itself, but using a dictionary without that compound. (Otherwise it would just match.)

A simple wordbreaker usage example

$ echo sphinxsearch.com | bin/wordbreaker --dict dict.txt split
sphinx search com
$ echo iamlegend | bin/wordbreaker --dict dict.txt split
i am legend
$ echo lordoftherings | bin/wordbreaker --dict dict.txt split
lord of the rings

By default wordbreaker expects wordbreaker-dict.txt to be in the working folder. Alternatively, you can specify the dictionary file with the –dict option. You must have a dictionary file, Sphinx does not yet have a built-in dictionary.

The split command is used to break words from standard input and return the result at output. There are two more, test and bench, that let you test the splitting quality and benchmark splitting, respectively.

How precise is splitting? Well, it depends on the dictionary size. The bigger the dictionary, the more precise the split will be. We managed to get a 0.81 precision using a dictionary of 500k words.

How fast is splitting? In general, a single split takes about 1ms, but the tool is still in its infancy and will be getting more of our attention in the future. (Read: we might further optimize this.)

What also matters is the source of the dictionary. Let’s take the following example:

$ echo manofsteel | bin/wordbreaker --dict dictit.txt split
mano f steel

The output is not exactly what we expected. Why? In this example the dictionary file is from a source with mostly Italian content.  Why ‘mano’ and not ‘man’? When parsing the input word, wordbreaker considers the words with a bigger frequency (in the dictionary file) to be the more probable result. The ‘mano’ term had a higher frequency than ‘man’ in the dictionary we used.

Let’s try again, but now we’ll use the dictionary from previous tests (from an English source):

$ echo manofsteel | bin/wordbreaker --dict dict.txt split
man of steel

Now we’ve successfully broken up this string and have produced the words an English speaker would expect to see.

Lesson learned: remain mindful of the context in which you’re breaking a word. This is very important! Using the same dictionary under all circumstances will not always produce accurate splits.

A simple integration example

…that shows how to extract names from domain URLs from your PHP script. Just open a pipe to wordbreaker:

<?php
if ( !is_array($_SERVER["argv"]) || !isset($_SERVER["argv"][1]) || trim($_SERVER["argv"][1]) == "")
{
        print("No input URL found.\n");
        exit(1);
}
$url = trim($_SERVER["argv"][1]);
$host = parse_url($url,PHP_URL_HOST);
if($host == NULL) {
        print("Please use a valid URL.\n");
        exit(1);
}
$domain_name = preg_replace('/(.*?)((\.co)?.[a-z]{2,4})$/i','$1',$host);
if (substr($domain_name, 0, 4) == "www.") {
        $domain_name = substr($domain_name,4);
}
$words = exec("echo $domain_name | bin/wordbreaker split");
print($words."\n");

And here’s how that script works:

$ php urlbreaker.php http://www.sphinxsearch.com
sphinx search
$ php urlbreaker.php http://thedarkknightrises.com
the dark knight rises
$ php urlbreaker.php http://manofsteel.warnerbros.com
man of steel warner bros

Extracted words could either be recorded in the DB in a separate field for indexing, or passed to indexer via XMLpipe on the fly, or just INSERT-ed into a RT index. That would allow users to search for some or all of the words, and receive their expected result.

Of course, you can find many other uses for the tool. Breaking down URLs and indexing compound words spring to mind immediately. Or, for example, you could use wordbreaker to improve suggestion/correction. Any long term would be a good candidate for wordbreaker. We have a suggest tool that uses trigrams to correct typos, but it would not be able to handle multi-word terms caused by a missing space, as it’s only looking for a single terms.

So, there’s our run-down of the new Sphinx tool. We gave a quick description and a simple example of how you can use wordbreaker. You’ll soon be able to try Sphinx 2.1.1 packages yourself (and you can grab the trunk already), so keep an eye out for more posts on what you can expect.


« »

One Response to “A new tool in the trunk: wordbreaker”

  1. Oleg says:

    Any chance you could publish the English dict.txt file you used in your examples? The one which you have made from web titles and post-processed.

Leave a Reply