Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and “and”, would easily qualify as stop words. In NLP and text mining applications, stop words are used to eliminate unimportant words, allowing applications to focus on the important words instead.
Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
We would not want these words to take up space in our database, or taking up valuable processing time. For this, we can remove them easily, by storing a list of words that you consider to stop words. NLTK(Natural Language Toolkit) in python has a list of stopwords stored in 16 different languages. You can find them in the nltk_data directory.
To check the list of stopwords you can type the following commands in the python shell.
import nltk from nltk.corpus import stopwords print(stopwords.words('english'))
Removing stop words with NLTK
The following program removes stop words from a piece of text:
from nltk.corpus import stopwords from nltk.tokenize import word_tokenize example_sent = """This is a sample sentence, showing off the stop words filtration.""" stop_words = set(stopwords.words('english')) word_tokens = word_tokenize(example_sent) filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words] filtered_sentence =  for w in word_tokens: if w not in stop_words: filtered_sentence.append(w) print(word_tokens) print(filtered_sentence)
['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.'] ['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']