Dec 5, 2013. Working with the English Lemmatizer

If you watch the fulltext diary closely, you may have already heard that Sphinx 2.2.1-beta supports English and German lemmatization. In a previous post we announced support for Russian lemmatization and gave a general outline of stemming and lemmatization; see that post if you want some background information. Here, we show some of the differences between English stemming and lemmatization.

“Will”

To highlight the differences between English stemming and lemmatization, it’s best to use an example. A simple one is the word ‘will’, which has several meanings: it can be a modal verb (as in ‘will be’), a noun (‘Will’, as philosophy uses the term), or a short form of the name William. On top of that, ‘will’ is also the root of other words. That makes it a good candidate for our test.

Let’s run a quick test searching ‘will’ on the same data indexed with the English stemmer and with the English lemmatizer:

Stemmer:

mysql> select count(*) from wikipedia where match('@title will');
+----------+
| count(*) |
+----------+
|     1202 |
+----------+
1 row in set (0.01 sec)

Lemmatizer:

mysql> select count(*) from wikipedia2 where match('@title will');
+----------+
| count(*) |
+----------+
|     1149 |
+----------+
1 row in set (0.01 sec)

So, the lemmatizer gave us fewer results. Why? Because the stemmer reduces a number of unrelated terms to ‘will’. To see this difference more clearly, we did a simple trick: we added an integer attribute that holds a number for each index (“1” for the stemmer index, “2” for the lemmatizer index) and then ran one search over both indexes. This makes it easy to tell which index each row came from. For rows that match in both indexes, we get the result from the lemmatized index, since it’s the last in the index list.

mysql> select title,idx from wikipedia,wikipedia2 where match('@title will') order by idx asc limit 20,40;
+--------------------------------------------+------+
| title                                      | idx  |
+--------------------------------------------+------+
| Willful blindness                          |    1 |
| Ready, Willing, and Able                   |    1 |
| G. Willing Pepper                          |    1 |
| John Willes (cricketer)                    |    1 |
| William S. S. Willes                       |    1 |
| Christine Willes                           |    1 |
| Ready, Willing, and Able (film)            |    1 |
| Mary Willing Byrd                          |    1 |
| Ann Willing Bingham                        |    1 |
| Stout Hearts and Willing Hands             |    1 |
| The Willing Well IV: The Final Cut         |    1 |
| Charles Willing                            |    1 |
| Thomas Willing                             |    1 |
| Clare Wille                                |    1 |
| Edward Willes                              |    1 |
| George M. Willing                          |    1 |
| Oscar F. Willing                           |    1 |
| Willing Suspension Productions             |    1 |
| John Willes (judge)                        |    1 |
| The Wild, the Willing and the Innocent     |    1 |
| Coalition of the Willing (Jericho episode) |    1 |
| Jodi Wille                                 |    1 |
| Coalition of the willing                   |    1 |
| Take the Willing                           |    1 |
| Ready an' Willing                          |    1 |
| Hitler's Willing Executioners              |    1 |
| Hilda M. Willing (skipjack)                |    1 |
| Nick Willing                               |    1 |
| Ava Lowle Willing                          |    1 |
| The Coalition of the Willing (album)       |    1 |
| Will the Shill                             |    2 |
| Hope   Will                                |    2 |
| Helen Wills                                |    2 |
| For You I Will                             |    2 |
| Donald Wills Douglas                       |    2 |
| Will Skinner                               |    2 |
| Will and Perception (album)                |    2 |
| Brush Creek (Wills Creek)                  |    2 |
| Wills Creek                                |    2 |
| Will (sociology)                           |    2 |
+--------------------------------------------+------+
40 rows in set (0.01 sec)
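
The idx values above follow from the merge rule described earlier. As a rough illustration only (this is a toy model, not Sphinx's code), the Python sketch below merges two per-index result lists so that, for a document found in both, the row from the index listed last wins. The function name `merge_results` and the document IDs are made up for the example; the titles come from the output above.

```python
# Toy model of one query over several indexes (not Sphinx's actual code).
# Assumption, taken from the text above: when the same document ID matches in
# more than one index, the row from the index listed last is the one returned.

def merge_results(results_in_index_order):
    """Merge per-index result lists; later indexes overwrite earlier ones."""
    merged = {}
    for rows in results_in_index_order:
        for row in rows:
            merged[row["id"]] = row
    # Equivalent of "order by idx asc" (ties broken by ID for stable output).
    return sorted(merged.values(), key=lambda r: (r["idx"], r["id"]))

# Hypothetical document IDs; titles are taken from the output above.
stemmed = [
    {"id": 1, "title": "Charles Willing", "idx": 1},
    {"id": 2, "title": "Jodi Wille", "idx": 1},
    {"id": 3, "title": "Will Skinner", "idx": 1},
    {"id": 4, "title": "Wills Creek", "idx": 1},
]
lemmatized = [
    {"id": 3, "title": "Will Skinner", "idx": 2},
    {"id": 4, "title": "Wills Creek", "idx": 2},
]

merged = merge_results([stemmed, lemmatized])
for row in merged:
    print(f'{row["title"]:18} {row["idx"]}')
```

Under this rule, any row still tagged 1 is one that only the stemmed index matched, which is what makes those rows the interesting ones to inspect.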

So, with the stemmed index, ‘Willing’ (which is an adjective as well as a name), ‘Wille’, ‘Willes’ (both names that have nothing to do with ‘will’) were returned as results. You may note that even the lemmatized index produces “Helen Wills” and “Wills Creek” as results, even though this isn’t ideal. This is because the lemmatizer produced ‘will’ as the lemma of ‘wills’ (which makes sense).

Another failure for the stemmer: when we search for ‘willing’ (as in “Coalition of the Willing”, “Ready an’ Willing”, etc.), the stemmer reduces it to ‘will’, so we get all the results for ‘will’ as well. Although the two words are related, they have different meanings, and someone searching for ‘willing’ usually isn’t looking for anything about ‘will’. When Schopenhauer talks about ‘will’, he means something different from ‘the willing’ (those who have given consent). With the lemmatizer, ‘willing’ is indexed separately from ‘will’. Problem solved.

The difference is even clearer with a family of words that differ only by their ending.

For example:

operate/operated/operator/operating/operation/operational:

Keyword       Stemmer   Lemmatizer
operate       oper      operate
operated      oper      operate
operator      oper      operator
operating     oper      operate
operation     oper      operation
operational   oper      operational
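
To make the contrast in the table concrete, here is a small Python sketch. It is only a toy: `toy_stem` is a made-up suffix stripper (not the real English stemmer), and `LEMMAS` is a hand-written stand-in for the lemmatizer's dictionary. Both are assumptions for illustration, tuned to this one word family.

```python
# Toy illustration only: neither function is Sphinx's real algorithm.

# Hypothetical suffix list, longest first, chosen just for this word family.
SUFFIXES = ["ational", "ation", "ating", "ator", "ated", "ate"]

# Hypothetical dictionary: word form -> lemma (dictionary form).
LEMMAS = {
    "operated": "operate",
    "operating": "operate",
}

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, with no knowledge of the word itself."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Look the form up in the dictionary; unknown forms stay unchanged."""
    return LEMMAS.get(word, word)

WORDS = ["operate", "operated", "operator", "operating", "operation", "operational"]

for w in WORDS:
    print(f"{w:12} {toy_stem(w):6} {toy_lemmatize(w)}")
```

The stemmer works on characters alone, so every form collapses to the same stem; the dictionary lookup only merges forms that are inflections of the same word.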

 

lemmatize_en_all

What about lemmatize_en_all? Without it, the lemmatizer picks only one root form of the word, the one that is closest to the original word. However, a word form can sometimes have several corresponding root words. For instance, looking at “dove” alone, it is not possible to tell whether it is the past tense of the verb “dive”, as in “He dove into a pool”, or the noun “dove”, as in “white dove”. In this case, lemmatize_en_all makes the lemmatizer generate all the possible root forms. This can increase the number of returned results, because with lemmatize_en_all ‘dove’ will match both ‘dive’ and ‘dove’. In short, lemmatize_en_all brings more results because it allows searching on all possible root forms.
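
The effect on result counts can be sketched with a toy inverted index in Python. Everything here is an assumption for illustration: the `LEMMAS` dictionary, the three sample documents, and the choice to expand to every candidate lemma both when indexing and when searching. Sphinx's internal handling may differ.

```python
from collections import defaultdict

# Hypothetical dictionary: a word form may have several candidate lemmas.
# The first entry plays the role of "the one closest to the original word".
LEMMAS = {
    "dove": ["dove", "dive"],  # the noun "dove", or the past tense of "dive"
}

# Hypothetical sample documents.
DOCS = {
    1: "he dove into a pool",
    2: "white dove",
    3: "dive bar",
}

def lemmas(word, all_forms):
    """Return every candidate lemma, or only the first one."""
    candidates = LEMMAS.get(word, [word])
    return candidates if all_forms else candidates[:1]

def build_index(docs, all_forms):
    """Map each lemma to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for lemma in lemmas(word, all_forms):
                index[lemma].add(doc_id)
    return index

def search(index, word, all_forms):
    """Collect documents matching any lemma of the query word."""
    hits = set()
    for lemma in lemmas(word, all_forms):
        hits |= index.get(lemma, set())
    return hits

single = build_index(DOCS, all_forms=False)
expanded = build_index(DOCS, all_forms=True)
print(sorted(search(single, "dove", all_forms=False)))
print(sorted(search(expanded, "dove", all_forms=True)))
```

In this toy setup the single-lemma search for ‘dove’ finds only the two documents containing that form, while the expanded search also picks up the document that contains ‘dive’.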

Indexing speed

On the data we used, we got the following times:

Stemmer       Lemmatizer     Lemmatizer (lemmatize_en_all)
908.088 sec   1212.385 sec   1384.710 sec

Lemmatizing is slower (about 34% for the lemmatizer and about 52% with lemmatize_en_all on this data), which is expected, since it has to look each word up in a dictionary, while stemming simply performs some basic operations. For this test we used lemmatizer_cache = 32M, which is enough for the English lemmatizer, as its current unpacked size is less than 20M. For other lemmatizers, the cache size should be adjusted accordingly. For example, the Russian lemmatizer is about 110M, so for best performance you should set a cache large enough to hold the whole lemmatizer dictionary in memory.

Search speed

In terms of search speed, there is no inherent difference between stemming and lemmatizing; it really just depends on how many documents match the tokenized keyword. For example, with stemming, the word ‘running’ is tokenized as ‘run’, which hits 267,196 documents in 16 ms. With the lemmatizer, ‘running’ stays as it is and hits only 94,532 documents in 6 ms. With lemmatize_en_all, however, the same search hits 322,623 documents in 24 ms. Choose what works best for you!

Conclusion

The lemmatizer is one step ahead of stemming. It not only improves relevance, but it can also reduce search times. Check it out. Download Sphinx 2.2.1-beta now.

Happy Sphinxing!



2 Responses to “Working with the English Lemmatizer”

  1. conchis says:

    happy to see same basic benchmark results for a new feature

  2. Marc says:

    Can I hard code some common acronyms that searchers might use to be replaced with terms that are indexed? For instance, users might commonly type “D2” or “Div 2”. But it needs to actually be “Division II” which is what is in the database. Can I custom map common phrases to their actual meaning?
