The concept is simple: a single word can appear in many forms when used in different contexts. When it comes to search, or (more formally) information retrieval, this is a big deal. There are many situations where, in order to satisfy the information needs of users, we want to match not only the exact form of the term that was entered, but also related forms.
If, for example, a user were searching for ‘run’, the terms ‘ran’, ‘runs’, and ‘running’ may also be relevant to her information need. To catch these potentially relevant results, your search solution should be able to map these alternate forms back to ‘run’. Stemming and lemmatization are different approaches to mapping multiple forms of a word back to a single root form.
Basically, stemming is all about chopping up an individual word (removing affixes) according to an algorithm that does not necessarily take the meaning of the word (context, part of speech) into account. To accomplish this, Sphinx uses the Porter stemmer. A limitation of the Porter stemmer can be seen in how it handles the words ‘business’ and ‘busy’: both reduce to ‘busi’. These words have very different meanings, and we’d probably like to keep them separate.
On the other hand, lemmatizers work toward capturing overall meaning and then reduce a group of words to their logical root. The set of all the word forms that share the same meaning is called a ‘lexeme’, and the ‘lemma’ is the canonical form that stands for the whole lexeme. In our example above, ‘run’, ‘ran’, ‘runs’, and ‘running’ would form the lexeme, and ‘run’ would be the lemma.
Stemmers are easier to implement than lemmatizers: lemmatization requires understanding context and part of speech, while a stemmer ‘blindly’ chops at a single word with no knowledge of its context. For most cases, stemmers work great. Sphinx has supported stemming for some time already, and now we’re kicking it up a notch with lemmatization support.
Lemmatization with Sphinx
Some languages are more complex than others with regard to the forms that words can take. This is why, in version 2.1.1-beta, Sphinx begins lemmatization support with the Russian language. Russian words take on many different forms depending on their context, so lemmatization support will greatly improve search quality for Russian. Russian is not the only language that lemmatization will help with, though; we will soon be adding more languages to the list.
Lemmatization with Sphinx requires that a dictionary be downloaded from the Sphinx website and installed in a directory specified by the lemmatizer_base directive. There is also a lemmatizer_cache directive that lets you speed up lemmatization (and therefore indexing) by spending more RAM on what is, essentially, an uncompressed cache of the dictionary.
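As a rough sketch, the relevant pieces of sphinx.conf might look like the following (the directory path and cache size here are illustrative, not prescriptive):

```
# common section: where the downloaded dictionary files live
common
{
    lemmatizer_base = /usr/local/share/sphinx/dicts
}

# indexer section: optionally trade RAM for faster lemmatization
# while indexing
indexer
{
    lemmatizer_cache = 256M
}
```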
General morphology tools
The morphology processors implemented in Sphinx itself include the Russian lemmatizer; the English, Russian (supporting both UTF-8 and Windows-1251 encodings), Czech, and Arabic stemmers; and the Soundex and Metaphone phonetic algorithms. You can also link against the Snowball project’s libstemmer library for even more stemmers; these are enabled at compile time using the --with-libstemmer configure option. The built-in English and Russian stemmers should be faster than their libstemmer counterparts, but they can produce slightly different results because they are based on an older version of the algorithms.
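To illustrate, building with libstemmer and then selecting one of its Snowball stemmers could look roughly like this; the libstemmer_XXX value syntax and the French example are assumptions on my part, so check the documentation for your Sphinx version:

```
# at build time:
#   ./configure --with-libstemmer
#   make && make install

# then, in an index definition, a Snowball stemmer can be
# selected via a libstemmer_XXX value, e.g. for French:
index docs_fr
{
    # source/path omitted for brevity
    morphology = libstemmer_fr
}
```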
Built-in values that are available for use in the morphology feature are as follows:
none – do not perform any morphology processing;
lemmatize_ru – apply Russian lemmatizer and pick a single root form (added in 2.1.1-beta);
lemmatize_ru_all – apply Russian lemmatizer and index all possible root forms (added in 2.1.1-beta);
stem_en – apply Porter’s English stemmer;
stem_ru – apply Porter’s Russian stemmer;
stem_enru – apply Porter’s English and Russian stemmers;
stem_cz – apply Czech stemmer;
stem_ar – apply Arabic stemmer (added in 2.1.1-beta);
soundex – replace keywords with their SOUNDEX code;
metaphone – replace keywords with their METAPHONE code.
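To sketch how these values are used (the index name and sources below are made up): several processors can be listed in a single morphology setting, comma-separated, and they are applied in the order given.

```
index blogposts
{
    source = blogposts_src
    path   = /var/data/blogposts

    # pick a single root form per word:
    morphology = lemmatize_ru

    # or index every possible root form instead (bigger index,
    # potentially better recall):
    # morphology = lemmatize_ru_all

    # processors can also be combined:
    # morphology = lemmatize_ru, stem_en
}
```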
The wordforms feature adds even more control over morphology processing by allowing you to replace one specific word with another. When wordforms are enabled, each word is looked up in the word forms dictionary first; if there is a matching entry, stemmers are not applied at all. In other words, wordforms can be used to implement stemming exceptions.
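For instance, a small word forms file can hold both explicit replacements and stemming exceptions; since a matching entry bypasses the stemmers, mapping a word to itself effectively protects it from stemming. The file contents and paths below are illustrative:

```
# wordforms.txt -- one 'source > destination' mapping per line
walks > walk
walked > walk
# mapping a word to itself keeps the stemmer away from it:
business > business
```

```
# and in sphinx.conf:
index myindex
{
    # source/path omitted for brevity
    morphology = stem_en
    wordforms  = /usr/local/sphinx/data/wordforms.txt
}
```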
Some notes regarding morphology and other tokenizing features
Having morphology and infixes/prefixes enabled at the same time requires enabling enable_star. When morphology and infixes/prefixes are enabled with a ‘keywords’ type dictionary, index_exact_words is automatically enabled, because in the keywords case the original keyword is needed to perform the starred expansion. In the case of wordforms, starring for infixing/prefixing is applied to the wordform and not to the original word. So if you search for a starred version of a word that is the source of a wordform, it won’t be found: only the wordform is stored in the index, and starring can be applied only to it. A keyword* > wordform* starred transformation is likewise not possible. Exceptions, on the other hand, are applied to the raw text, even before morphology.
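Putting those notes together, an index combining morphology with infix searching might be configured along these lines (a sketch only; the index name and directive values are illustrative):

```
index substring_search
{
    # source/path omitted for brevity
    dict          = keywords
    min_infix_len = 3
    enable_star   = 1
    morphology    = stem_en

    # with dict=keywords plus infixing, Sphinx enables this
    # automatically; stating it explicitly just documents intent:
    index_exact_words = 1
}
```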
We hope this article has provided you with a basic understanding of Sphinx morphology processing. If there are any questions, please feel free to comment. And, a friendly reminder: if you need help implementing these tools, or if you need to fine-tune your search in general, we are here to help.