07.07.2015 | Blog The difference between stemming and lemmatization
The inverted index
Just as a quick reminder: the basis of a search engine is an index, called an inverted index, a data structure that consists of a list of all unique words (index terms) occurring in any document of a document silo.
Additionally, for each unique term, the index stores a list of the documents in which the term appears.
As an example, let's assume that we have to index two documents:
First document (doc1):
In my house lives a mouse.
Second document (doc2):
Mice are living in my houses.
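The (lowercased) inverted index will look like this:
- a: doc1
- are: doc2
- house: doc1
- houses: doc2
- in: doc1, doc2
- lives: doc1
- living: doc2
- mice: doc2
- mouse: doc1
- my: doc1, doc2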
Now imagine the search query "mouse" on the above index. Only the first document is returned as a search result, although the second document is also an expected result, since it contains "mice", the plural of "mouse".
A high-quality search engine must be able to handle such linguistic variations of index and search terms. Consequently, some kind of term normalization is indispensable.
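The following is a minimal sketch of such an index in Python, using the two example documents. The helper names (`tokenize`, `build_index`, `search`) and the naive tokenizer are assumptions made for illustration, not part of any particular search engine.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Naive tokenizer: lowercase the text and keep runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    # Map each unique term to the set of documents it occurs in.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # Exact term lookup, no normalization beyond lowercasing.
    return sorted(index.get(query.lower(), set()))

docs = {
    "doc1": "In my house lives a mouse.",
    "doc2": "Mice are living in my houses.",
}
index = build_index(docs)

print(search(index, "mouse"))  # ['doc1']
print(search(index, "my"))     # ['doc1', 'doc2']
```

Because the lookup is an exact match on the term, "mouse" and "mice" are unrelated entries as far as the index is concerned.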
Stemming and lemmatization are both natural language processing techniques to make sure that different word variants (inflectional and derivational word forms) are not left out.
So what exactly is the difference between these two methods? What are the advantages and disadvantages and which one should be preferred?
Stemming vs. lemmatization
Stemming is a procedure that strips inflectional and derivational suffixes from index and search terms with the aim of merging different word forms into one canonical form, called the stem or root.
The most common stemmer is the Porter stemmer (an implementation is also provided by the Lucene library), which works by heuristically identifying word suffixes with a set of rules and then chopping them off.
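To illustrate the rule-based idea, here is a deliberately tiny suffix stripper in Python. It is not the Porter algorithm: the rule list, the minimum stem length and the name `toy_stem` are assumptions chosen only to show how heuristic suffix removal works.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# Hypothetical rules, far simpler than those of the Porter stemmer.
RULES = [
    ("ss", "ss"),   # keep a double "s" (e.g. "glass") untouched
    ("ies", "y"),
    ("ing", ""),
    ("s", ""),
]
MIN_STEM = 3  # never strip a suffix if fewer than 3 characters would remain

def toy_stem(word):
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["houses", "house", "living", "mice", "mouse"]:
    print(w, "->", toy_stem(w))
# houses -> house
# house -> house
# living -> liv
# mice -> mice
# mouse -> mouse
```

The sketch merges "houses" with "house", but no suffix rule can relate "mice" to "mouse", and the stem "liv" is not a real word.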
By contrast, lemmatization means reducing an inflected or derivationally related word form to its base form (dictionary form) by looking it up in a word lexicon.
More precisely, this lexicon is a dictionary that provides a complete morphological analysis for each word of a specific language.
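A dictionary lookup of this kind can be sketched in a few lines of Python. The tiny `LEXICON` below is a hypothetical stand-in: a real lexicon covers a whole language and also stores morphological information such as part of speech, which this sketch omits.

```python
# Hypothetical mini lexicon: inflected form -> base form (lemma).
LEXICON = {
    "mice": "mouse",
    "houses": "house",
    "living": "live",
    "feet": "foot",
    "went": "go",
}

def lemmatize(word):
    # Look the word up; unknown words are returned unchanged.
    key = word.lower()
    return LEXICON.get(key, key)

print(lemmatize("Mice"))    # mouse
print(lemmatize("went"))    # go
print(lemmatize("mouse"))   # mouse
```

Irregular forms are handled just like regular ones, since the result depends only on what the lexicon contains, not on the shape of the word.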
The advantages of a stemmer are that implementations are freely available and that no lexicons need to be maintained. However, the quality of stemming is often poor.
Just have a look at some examples produced by the Porter stemmer:
- "organization" and "organs" are both stemmed to "organ" (over-stemming: the stemmer cuts off too much, with the result that words of different meaning are reduced to the same root).
- "absorption" and "absorbing", two terms of the same origin, are stemmed to "absorpt" and "absorb" respectively (under-stemming: related words are not reduced to the same root).
- There are many other examples where stemming algorithms fail, especially words with irregular inflection (foot [sing.] - feet [pl.], go [inf.] - went [past tense], ...).
The problem for a search index is obvious: the query "organization" will correctly match documents containing "organization" or "organizations", but will also erroneously match documents containing "organs".
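The effect on search results can be reproduced with a small Python sketch. The `STEMS` table hard-codes the stemmer output described in the text instead of running a real stemmer, and the three documents are made up for illustration.

```python
from collections import defaultdict

# Assumed stemmer output, taken from the examples in the text.
STEMS = {"organization": "organ", "organizations": "organ", "organs": "organ"}

def normalize(term):
    term = term.lower()
    return STEMS.get(term, term)

# Hypothetical documents.
docs = {
    "d1": "organization chart",
    "d2": "partner organizations",
    "d3": "donor organs",
}

# Index the documents by normalized (stemmed) term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[normalize(term)].add(doc_id)

# The query is normalized the same way as the index terms.
hits = sorted(index.get(normalize("organization"), set()))
print(hits)  # ['d1', 'd2', 'd3']
```

Document d3 is a false positive: it is returned only because "organs" and "organization" share the stem "organ".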
By contrast, if an efficient and sufficiently complete lexicon exists, a lemmatizer will mostly output correct base forms.
Thus, a search engine that normalizes terms with a lemmatization component rather than a stemming component provides considerably more accurate search results.
High recall and precision for enterprise search
In order to guarantee high recall and precision of iFinder's search results, IntraFind developed its own linguistic analyzer, including a high-quality lemmatizer. It offers a complete morphological analysis by using elaborately prepared dictionaries for currently more than 30 languages (German, English, Spanish, French, Dutch, Italian, Russian, Portuguese and many more).
Apart from lemmatization, the linguistic analyzer is also a word decomposer, which is capable of splitting compound words into their individual parts. Imagine, for example, an English search query like "toothpaste". Thanks to the analyzer, you will find documents that contain not only "toothpaste" but also "paste" and "tooth".
As German is morphologically more complex than English, the following example is more interesting:
The German search query "Glückskeks" (fortune cookie) is decomposed by the analyzer into two word parts, "glück" and "keks". As you can see, the analyzer even succeeds in recognizing the semantically empty part of the word ("s"), called epenthesis (in German "Fugenelement"), and therefore the query will return documents containing "glückskeks", "glück" and "keks".
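The idea of lexicon-based decomposition can be sketched in Python as follows. This is a simplified illustration, not IntraFind's analyzer: the lexicon, the list of linking elements and the function `decompose` are assumptions, and only splits into exactly two parts are attempted.

```python
# Hypothetical mini lexicon of known word parts.
LEXICON = {"glück", "keks", "tooth", "paste"}
# Linking elements allowed between the parts ("" means none).
LINKING = ("", "s")

def decompose(word):
    # Try every split position; accept the first split whose head and
    # tail are both in the lexicon, optionally separated by a linking element.
    word = word.lower()
    for i in range(1, len(word)):
        head, rest = word[:i], word[i:]
        if head not in LEXICON:
            continue
        for link in LINKING:
            tail = rest[len(link):]
            if rest.startswith(link) and tail in LEXICON:
                return [head, tail]
    return [word]

print(decompose("Glückskeks"))  # ['glück', 'keks']
print(decompose("toothpaste"))  # ['tooth', 'paste']
print(decompose("keks"))        # ['keks']
```

Searching for the full word as well as each returned part would then yield the result set described above.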
For enterprise search, using high performance linguistics means that the user gets highly relevant search results and no information is lost.
Ursula has worked as a software architect for IntraFind since 2013 and focuses on text analysis methods.