Building an NLP pipeline involves the following steps:
Step 1: Sentence Segmentation
Sentence segmentation is the first step in building an NLP pipeline. It breaks a paragraph into separate sentences.
Example: Consider the following paragraph -
Independence Day is one of the important festivals for every Indian citizen. It is celebrated on the 15th of August each year ever since India got independence from the British rule. The day celebrates independence in the true sense.
Sentence segmentation produces the following result:
- “Independence Day is one of the important festivals for every Indian citizen.”
- “It is celebrated on the 15th of August each year ever since India got independence from the British rule.”
- “The day celebrates independence in the true sense.”
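The segmentation step can be sketched with a simple regular expression. This is only a minimal illustration, assuming that a sentence ends with ".", "!" or "?" followed by whitespace and a capital letter; the function name split_sentences is hypothetical, and real segmenters also handle abbreviations, quotes, and similar edge cases.

```python
import re

def split_sentences(paragraph):
    """Naive splitter: break at whitespace that follows . ! or ?
    and precedes an uppercase letter (an assumed heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph.strip())
    return [p for p in parts if p]

paragraph = (
    "Independence Day is one of the important festivals for every Indian citizen. "
    "It is celebrated on the 15th of August each year ever since India got "
    "independence from the British rule. "
    "The day celebrates independence in the true sense."
)

sentences = split_sentences(paragraph)
for sentence in sentences:
    print("-", sentence)
```

A heuristic like this would wrongly split after abbreviations such as "Dr." or "U.S.", which is one reason production pipelines rely on trained or rule-rich segmenters.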
Step 2: Word Tokenization
A word tokenizer breaks a sentence into separate words or tokens.
Example:
JavaTpoint offers Corporate Training, Summer Training, Online Training, and Winter Training.
The word tokenizer generates the following result:
“JavaTpoint”, “offers”, “Corporate”, “Training”, “,”, “Summer”, “Training”, “,”, “Online”, “Training”, “,”, “and”, “Winter”, “Training”, “.”
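A rough sketch of word tokenization, assuming that a token is either a run of word characters or a single punctuation mark. The function name tokenize is hypothetical; library tokenizers apply many more rules (contractions, hyphens, URLs, and so on).

```python
import re

def tokenize(sentence):
    # \w+ matches a run of letters/digits/underscores;
    # [^\w\s] matches one punctuation character as its own token
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize(
    "JavaTpoint offers Corporate Training, Summer Training, "
    "Online Training, and Winter Training."
)
print(tokens)
```

With this pattern, punctuation marks such as commas and the final period come out as separate tokens rather than staying attached to the preceding word.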
Step 3: Stemming
Stemming normalizes words to their base or root form, typically by stripping suffixes with simple rules. For example, celebrates, celebrated, and celebrating all derive from the single root word “celebrate.” The main problem with stemming is that it sometimes produces a root that is not a real word.
For example, intelligence, intelligent, and intelligently all reduce to the single root “intelligen.” In English, the word “intelligen” has no meaning.
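To make the idea concrete, here is a toy suffix-stripping stemmer. It is not the Porter algorithm or any library's implementation; the suffix list and the minimum stem length of three characters are assumptions chosen for illustration.

```python
# Assumed suffix list, checked in this order (longest first)
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ("celebrates", "celebrated", "celebrating")]
print(stems)
```

All three words collapse to the same stem here, but that stem is a truncated string rather than a dictionary word, which illustrates the weakness of stemming described above.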
Step 4: Lemmatization
Lemmatization is quite similar to stemming. It groups the different inflected forms of a word under a single base form, called the lemma. The main difference between stemming and lemmatization is that lemmatization produces a root word that has a meaning.
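For example: In lemmatization, each of the words intelligence, intelligent, and intelligently is mapped to a valid dictionary word rather than to a truncated form such as “intelligen.” Inflected forms such as celebrates, celebrated, and celebrating share the lemma “celebrate.”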
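Lemmatization can be sketched as a dictionary lookup. The table below is a tiny hand-written, hypothetical example; real lemmatizers rely on a full vocabulary and usually on the word's part of speech as well.

```python
# Hypothetical lemma table, written by hand for illustration only
LEMMAS = {
    "celebrates": "celebrate",
    "celebrated": "celebrate",
    "celebrating": "celebrate",
    "was": "be",
    "were": "be",
    "mice": "mouse",
}

def lemmatize(word):
    key = word.lower()
    # Unknown words are returned unchanged (lowercased)
    return LEMMAS.get(key, key)

lemmas = [lemmatize(w) for w in ("celebrates", "celebrated", "celebrating")]
print(lemmas)
```

Because the result always comes from a vocabulary, the output is a real word, at the cost of needing that vocabulary in the first place.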
Step 5: Identifying Stop Words
In English, many words appear very frequently, such as “is”, “and”, “the”, and “a”. NLP pipelines flag these words as stop words. Because they carry little meaning on their own, stop words are often filtered out before any statistical analysis.
Example: He is a good boy. In this sentence, “He”, “is”, and “a” would typically be treated as stop words.
Note: Stop word lists depend on the application. If you are building a rock band search engine, you should not ignore the word “The”, because it is part of band names such as “The Who.”
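Stop word filtering can be sketched as a set lookup. The stop word set below is a tiny assumed list, and the keep parameter is a hypothetical way to exempt words for a particular application, such as the band search case in the note.

```python
import re

# Tiny illustrative stop word set; real lists contain many more entries
STOP_WORDS = {"a", "an", "and", "he", "is", "the"}

def remove_stop_words(sentence, keep=()):
    words = re.findall(r"\w+", sentence)
    keep_lower = {k.lower() for k in keep}
    return [
        w for w in words
        if w.lower() not in STOP_WORDS or w.lower() in keep_lower
    ]

content_words = remove_stop_words("He is a good boy.")
band_query = remove_stop_words("The Who is a band", keep=("the",))
print(content_words)
print(band_query)
```

The second call shows why a fixed list is risky: without the exemption, the band name "The Who" would lose its first word.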
Step 6: Dependency Parsing
Dependency parsing finds how the words in a sentence are related to each other, for example which word is the subject of which verb.
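The output of a dependency parser is usually a tree in which every word points to its head word. The snippet below does not parse anything; it only shows one possible way to store and query a parse. The parse of "He is a good boy" is annotated by hand, assuming Universal Dependencies-style relation labels.

```python
# Hand-annotated parse (an assumption, not parser output).
# Each entry: (word, index of its head word, relation); head -1 marks the root.
PARSE = [
    ("He", 4, "nsubj"),    # subject of the predicate "boy"
    ("is", 4, "cop"),      # copula linking subject and predicate
    ("a", 4, "det"),       # determiner of "boy"
    ("good", 4, "amod"),   # adjective modifying "boy"
    ("boy", -1, "root"),   # root of the sentence
]

def dependents(parse, head_index):
    """Return (word, relation) pairs for words attached to the given head."""
    return [(word, rel) for word, head, rel in parse if head == head_index]

root_index = next(i for i, entry in enumerate(PARSE) if entry[1] == -1)
print(PARSE[root_index][0], "<-", dependents(PARSE, root_index))
```

Storing only the head index and relation per word is enough to recover the whole tree, which is why many parsers expose their results in a similar shape.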
Step 7: POS Tags
POS stands for parts of speech, which include noun, verb, adverb, and adjective. A POS tag indicates how a word functions in a sentence, both in meaning and grammatically. A word can have more than one part of speech, depending on the context in which it is used.
Example: “Google” something on the Internet.
In the above example, “Google” is used as a verb, although it is originally a proper noun.
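The idea that context decides the tag can be sketched with a toy tagger. The lexicon, the tag names, and the single contextual rule (a word that can be a verb is tagged as a verb right after "to") are all assumptions for illustration; real taggers are trained statistical models.

```python
# Hypothetical mini-lexicon: possible tags per word, default tag first
LEXICON = {
    "google": ["PROPN", "VERB"],
    "i": ["PRON"],
    "it": ["PRON"],
    "like": ["VERB"],
    "to": ["PART"],
    "is": ["AUX"],
    "a": ["DET"],
    "company": ["NOUN"],
}

def pos_tag(words):
    tagged = []
    for i, word in enumerate(words):
        options = LEXICON.get(word.lower(), ["NOUN"])  # unknown words default to NOUN
        choice = options[0]
        # Toy contextual rule: prefer VERB directly after "to"
        if "VERB" in options and i > 0 and words[i - 1].lower() == "to":
            choice = "VERB"
        tagged.append((word, choice))
    return tagged

as_noun = pos_tag(["Google", "is", "a", "company"])
as_verb = pos_tag(["I", "like", "to", "google", "it"])
print(as_noun)
print(as_verb)
```

The same word receives a different tag in the two sentences, which is the behaviour the "Google" example describes.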
Step 8: Named Entity Recognition (NER)
Named Entity Recognition (NER) is the process of detecting named entities such as a person name, movie name, organization name, or location.
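Example: Steve Jobs introduced iPhone at the Macworld Conference in San Francisco, California. Here, “Steve Jobs” is a person, “iPhone” is a product, “Macworld Conference” is an event, and “San Francisco” and “California” are locations.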
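A very simple baseline for NER is a lookup against a list of known names (a gazetteer). The gazetteer and its labels below are hypothetical and written by hand; practical NER systems use trained models so that they can also recognize names they have never seen.

```python
# Hypothetical gazetteer: known entity names and assumed type labels
GAZETTEER = {
    "Steve Jobs": "PERSON",
    "iPhone": "PRODUCT",
    "Macworld Conference": "EVENT",
    "San Francisco": "LOCATION",
    "California": "LOCATION",
}

def find_entities(text):
    found = []
    for name, label in GAZETTEER.items():
        start = text.find(name)  # only the first occurrence is reported
        if start != -1:
            found.append((start, name, label))
    # Report entities in the order they appear in the text
    return [(name, label) for _, name, label in sorted(found)]

sentence = (
    "Steve Jobs introduced iPhone at the Macworld Conference "
    "in San Francisco, California."
)
entities = find_entities(sentence)
for name, label in entities:
    print(name, "->", label)
```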
Step 9: Chunking
Chunking collects individual pieces of information, such as tagged words, and groups them into larger pieces of a sentence, such as noun phrases.
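As an illustration, the sketch below groups POS-tagged words into noun phrase chunks. It assumes the input is already tagged and uses one assumed pattern: an optional determiner, any number of adjectives, then one or more nouns. The function name and tag names are hypothetical.

```python
def noun_phrases(tagged):
    """Group (word, tag) pairs into chunks matching DET? ADJ* NOUN+."""
    chunks, current, has_noun = [], [], False

    def flush():
        nonlocal current, has_noun
        if has_noun:  # only keep groups that actually contain a noun
            chunks.append(" ".join(current))
        current, has_noun = [], False

    for word, tag in tagged:
        if tag == "NOUN":
            current.append(word)
            has_noun = True
        elif tag in ("DET", "ADJ"):
            # A determiner, or an adjective after a noun, begins a new chunk
            if tag == "DET" or has_noun:
                flush()
            current.append(word)
        else:
            flush()  # any other tag ends the current chunk
    flush()
    return chunks

tagged_sentence = [
    ("The", "DET"), ("little", "ADJ"), ("dog", "NOUN"),
    ("chased", "VERB"), ("a", "DET"), ("ball", "NOUN"),
]
chunks = noun_phrases(tagged_sentence)
print(chunks)
```

Chunking of this kind sits between POS tagging and full parsing: it finds flat phrases without building a complete tree.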