Tokenization is the process of splitting a string of text into a list of tokens. A token is a meaningful unit of the text: a word is a token in a sentence, and a sentence is a token in a paragraph.
Key points of the session:
- Tokenizing text into sentences
- Tokenizing sentences into words (sketched after the example below)
- Tokenizing sentences using regular expressions (sketched after the example below)
Example: Sentence Tokenization – splitting a paragraph into its sentences.
from nltk.tokenize import sent_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text)
Output: ['Hello everyone.', 'Welcome to GeeksforGeeks.', 'You are studying NLP article']
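The other two items from the list can be sketched the same way. The snippets below are minimal illustrations, assuming NLTK is installed and the Punkt tokenizer models have been downloaded with nltk.download('punkt'); the sample strings are chosen here for illustration.

Example: Word Tokenization – splitting a sentence into words.

from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks."
word_tokenize(text)

Output: ['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']

Note that word_tokenize treats punctuation marks as separate tokens.

Example: Regular-Expression Tokenization – extracting tokens that match a pattern.

from nltk.tokenize import RegexpTokenizer

# One token per run of word characters; punctuation is dropped
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize("Hello everyone. Welcome to GeeksforGeeks.")

Output: ['Hello', 'everyone', 'Welcome', 'to', 'GeeksforGeeks']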