Tokenization is the process of splitting a string of text into a list of tokens. A token is a meaningful unit of the text: a word is a token in a sentence, and a sentence is a token in a paragraph.
Key points of the session:
- Tokenizing text into sentences
- Tokenizing sentences into words (sketched after the example below)
- Tokenizing sentences using regular expressions (sketched after the example below)
Example: Sentence Tokenization – splitting a paragraph into its sentences.
from nltk.tokenize import sent_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text)
Output: ['Hello everyone.', 'Welcome to GeeksforGeeks.', 'You are studying NLP article']
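The other two items from the list can be sketched the same way. The snippets below are minimal illustrations, assuming NLTK is installed and the Punkt tokenizer models have been downloaded with nltk.download('punkt'); the sample strings are chosen here for illustration.

Example: Word Tokenization – splitting a sentence into words.

from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks."
word_tokenize(text)

Output: ['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']

Note that word_tokenize treats punctuation marks as separate tokens.

Example: Regular-Expression Tokenization – extracting tokens that match a pattern.

from nltk.tokenize import RegexpTokenizer

# One token per run of word characters; punctuation is dropped
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize("Hello everyone. Welcome to GeeksforGeeks.")

Output: ['Hello', 'everyone', 'Welcome', 'to', 'GeeksforGeeks']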