Applications of stemming:
- Stemming is used in information retrieval systems like search engines.
- It is used to determine domain vocabularies in domain analysis.
Fun fact: Google Search adopted word stemming in 2003. Previously, a search for “fish” would not have returned “fishing” or “fishes”.
Some stemming algorithms are:
Porter’s Stemmer algorithm
It is one of the most popular stemming methods and was proposed by Martin Porter in 1980. It is based on the idea that suffixes in the English language are built from combinations of smaller, simpler suffixes. The stemmer is known for its speed and simplicity, and its main applications include data mining and information retrieval. However, it is limited to English words. In addition, a group of related words is mapped onto the same stem, and the output stem is not necessarily a meaningful word. The algorithm is fairly lengthy and is one of the oldest stemmers still in common use.
Example: EED -> EE means “if the part of the word before the EED ending contains at least one vowel followed by a consonant, change the ending to EE”, so ‘agreed’ becomes ‘agree’.
Advantage: It produces good output compared with other stemmers and has a low error rate. Limitation: The morphological variants it produces are not always real words.
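The EED -> EE rule can be sketched in a few lines of Python. This is a minimal illustration of that single rule only, not the full Porter algorithm; the function names (`is_consonant`, `measure`, `apply_eed_rule`) are hypothetical and chosen for this example.

```python
VOWELS = set("aeiou")


def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' counts as a vowel only when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count how many times a vowel run is followed by a consonant run."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1
        prev_was_vowel = not cons
    return m


def apply_eed_rule(word: str) -> str:
    """Replace a trailing 'eed' with 'ee' when the remaining stem has measure > 0."""
    if word.endswith("eed"):
        stem = word[:-3]
        if measure(stem) > 0:
            return stem + "ee"
    return word


print(apply_eed_rule("agreed"))  # stem 'agr' has a vowel-consonant pair
print(apply_eed_rule("feed"))    # stem 'f' has none, so the word is kept
```

The condition on the stem is what stops short words such as ‘feed’ from being cut down to ‘fee’.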
Lovins Stemmer
It was proposed by Lovins in 1968. It removes the longest suffix from a word, and then the word is recoded to convert the stem into a valid word.
Example: sitting -> sitt -> sit
Advantage: It is fast, since it strips a suffix in a single pass, and it handles irregular plurals like ‘teeth’ and ‘tooth’. Limitation: Its long suffix list can be costly to process, and it frequently fails to form valid words from the stem.
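The two phases described above (longest-suffix removal, then recoding) can be sketched as follows. The suffix list is a tiny hypothetical subset and the single recoding rule (undoubling a final consonant) is only loosely modelled on the original; the real Lovins stemmer uses far larger tables with extra conditions.

```python
# Hypothetical, heavily reduced suffix list for illustration only.
SUFFIXES = ["ational", "ation", "ings", "ing", "ed", "es", "s"]
MIN_STEM_LENGTH = 2
UNDOUBLE = set("bdglmnprst")


def strip_longest_suffix(word: str) -> str:
    """Remove the longest listed suffix that leaves a long enough stem."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: len(word) - len(suffix)]
    return word


def recode(stem: str) -> str:
    """Apply one recoding rule: drop one letter of a doubled final consonant."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in UNDOUBLE:
        return stem[:-1]
    return stem


def lovins_like_stem(word: str) -> str:
    return recode(strip_longest_suffix(word.lower()))


print(strip_longest_suffix("sitting"))  # sitt
print(lovins_like_stem("sitting"))      # sit
```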
Dawson Stemmer
It is an extension of the Lovins stemmer in which suffixes are stored in reversed order, indexed by their length and last letter.
Advantage: It is fast in execution and covers more suffixes. Limitation: It is very complex to implement.
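The storage scheme can be illustrated with a small sketch: each suffix is reversed and filed under a key made of its length and its last letter, so a lookup only has to inspect one small bucket. The suffix list and the names `build_index` and `strip_suffix` are assumptions made for this example, not part of any published implementation.

```python
from collections import defaultdict

# Hypothetical, heavily reduced suffix list for illustration only.
SUFFIXES = ["ational", "ation", "ings", "ing", "ness", "ed", "es", "s"]


def build_index(suffixes):
    """Map (suffix length, last letter) -> set of reversed suffixes."""
    index = defaultdict(set)
    for suffix in suffixes:
        index[(len(suffix), suffix[-1])].add(suffix[::-1])
    return index


INDEX = build_index(SUFFIXES)
MAX_SUFFIX_LENGTH = max(len(s) for s in SUFFIXES)


def strip_suffix(word: str, min_stem: int = 2) -> str:
    """Strip the longest indexed suffix, trying longer lengths first."""
    last_letter = word[-1] if word else ""
    longest = min(MAX_SUFFIX_LENGTH, len(word) - min_stem)
    for length in range(longest, 0, -1):
        reversed_tail = word[-length:][::-1]
        if reversed_tail in INDEX.get((length, last_letter), ()):
            return word[: len(word) - length]
    return word


print(strip_suffix("kindness"))  # kind
```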
Krovetz Stemmer
It was proposed in 1993 by Robert Krovetz. The steps are as follows:
- Convert the plural form of a word to its singular form.
- Convert the past tense of a word to its present tense, and remove the suffix ‘ing’.
Example: ‘children’ -> ‘child’
Advantage: It is lightweight and can be used as a pre-stemmer for other stemmers. Limitation: It is inefficient for large documents.
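The steps above can be sketched as a toy function. This sketch assumes that each candidate stem is accepted only if it appears in a word list; the `LEXICON`, the `IRREGULAR` table, and the suffix rules are hypothetical stand-ins and far smaller than anything a real stemmer would use.

```python
# Hypothetical word list and irregular-form table for illustration only.
LEXICON = {"child", "city", "cat", "walk", "jump"}
IRREGULAR = {"children": "child"}

PLURAL_RULES = (("ies", "y"), ("es", ""), ("s", ""))
PAST_RULES = (("ed", ""),)
ING_RULES = (("ing", ""),)


def _try_rules(word, rules):
    """Return the first rewritten candidate found in the lexicon, else None."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return None


def krovetz_like_stem(word: str) -> str:
    word = word.lower()
    if word in LEXICON:
        return word
    if word in IRREGULAR:
        return IRREGULAR[word]
    # Plural -> singular, then past -> present, then strip 'ing'.
    for rules in (PLURAL_RULES, PAST_RULES, ING_RULES):
        candidate = _try_rules(word, rules)
        if candidate is not None:
            return candidate
    return word  # leave unknown words untouched


for w in ("children", "cities", "walked", "jumping"):
    print(w, "->", krovetz_like_stem(w))
```

Because unknown words are returned unchanged, a sketch like this is conservative, which fits its use as a pre-stemmer ahead of a more aggressive algorithm.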
Xerox Stemmer
Examples:
- ‘children’ -> ‘child’
- ‘understood’ -> ‘understand’
- ‘whom’ -> ‘who’
- ‘best’ -> ‘good’
N-Gram Stemmer
An n-gram is a sequence of n consecutive characters extracted from a word. Similar words will have a high proportion of n-grams in common.
Example: ‘INTRODUCTIONS’ for n=2 becomes: *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S* (where * denotes a padding character at the word boundary)
Advantage: It is based on string comparisons and is language independent. Limitation: It requires space to create and index the n-grams, and it is not time efficient.
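A short sketch shows how the n-grams are produced and how shared n-grams can be turned into a similarity score. The padding character `*` and the choice of Dice's coefficient as the similarity measure are assumptions made for this example; the function names are hypothetical.

```python
def char_ngrams(word: str, n: int = 2, pad: str = "*"):
    """Return the character n-grams of a word, padded at both boundaries."""
    padded = pad * (n - 1) + word.upper() + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice's coefficient over the sets of unique n-grams of two words."""
    grams_a = set(char_ngrams(a, n))
    grams_b = set(char_ngrams(b, n))
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(char_ngrams("INTRODUCTIONS"))
print(dice_similarity("introduction", "introductions"))
print(dice_similarity("introduction", "statistics"))
```

Words whose similarity exceeds a chosen threshold can then be grouped together and treated as variants of one term, without any language-specific suffix rules.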
- Snowball Stemmer:
Unlike the Porter Stemmer, the Snowball Stemmer can also handle non-English words. Because it supports other languages, it can be called a multilingual stemmer. The Snowball stemmers can be imported from the nltk package. They are written in a small string-processing programming language called ‘Snowball’ and are among the most widely used stemmers. The English Snowball stemmer is more aggressive than the Porter Stemmer and is also referred to as the Porter2 Stemmer. Because of its improvements over the Porter Stemmer, the Snowball stemmer also has greater computational speed.
- Lancaster Stemmer:
The Lancaster stemmer is more aggressive and dynamic than the other two stemmers. It is fast, but its output can be confusing for short words, and it is not as efficient as the Snowball stemmers. The Lancaster stemmer stores its rules externally and uses an iterative algorithm.
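The idea of an external rule table driven by an iterative loop can be sketched as follows. The `RULES` table here is a hypothetical toy, not the actual Lancaster rule set: each entry gives an ending, its replacement, and a flag saying whether stemming should continue after the rule fires.

```python
# Hypothetical toy rule table: (ending, replacement, keep_going).
RULES = [
    ("ness", "", True),
    ("less", "", True),
    ("ful", "", True),
    ("ies", "y", False),
    ("ing", "", True),
    ("ly", "", True),
    ("ed", "", True),
    ("s", "", True),
]
MIN_STEM_LENGTH = 3


def apply_one_rule(word: str):
    """Apply the first matching rule; return (new_word, keep_going)."""
    for ending, replacement, keep_going in RULES:
        if word.endswith(ending):
            candidate = word[: len(word) - len(ending)] + replacement
            if len(candidate) >= MIN_STEM_LENGTH:
                return candidate, keep_going
    return word, False


def iterative_stem(word: str) -> str:
    """Keep applying rules until none fires or a rule says to stop."""
    word = word.lower()
    keep_going = True
    while keep_going:
        word, keep_going = apply_one_rule(word)
    return word


for w in ("hopefulness", "carelessly", "ponies", "sings"):
    print(w, "->", iterative_stem(w))
```

Because the loop strips one ending after another, a long word can lose several suffixes in a row, which is one way to picture why an iterative stemmer tends to be aggressive.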