A lemma is the canonical, or dictionary, form of a set of related words—the
lemma of paying
, paid
, and pays
is pay
. Usually the lemma resembles
the words it is related to but sometimes it doesn’t — the lemma of is
,
was
, am
, and being
is be
.
Lemmatization, like stemming, tries to group related words, but it goes one step further than stemming in that it tries to group words by their word sense, or meaning. The same word may represent two meanings—for example,wake can mean to wake up or a funeral. While lemmatization would try to distinguish these two word senses, stemming would incorrectly conflate them.
Lemmatization is a much more complicated and expensive process that needs to understand the context in which words appear in order to make decisions about what they mean. In practice, stemming appears to be just as effective as lemmatization, but with a much lower cost.