What is Tokenization in NLP?

vishrut-singhal · 12 May 2021 08:58

Tokenization is one of the most common tasks when it comes to working with text data. But what does the term ‘tokenization’ actually mean?

Tokenization is the process of splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

The tokens could be words, numbers or punctuation marks. In tokenization, smaller units are created by locating word boundaries. Wait – what are word boundaries?

A word boundary is the point where one word ends and the next one begins. Producing these tokens is the first step toward stemming and lemmatization (the next stage in text preprocessing, which we will cover in the next article).
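To make the idea of tokens and word boundaries concrete, here is a minimal sketch in Python using only the standard `re` module. The helper names `tokenize` and `word_boundaries` are made up for this illustration, and the rule used (a word is a run of letters, digits, or underscores; every other non-space character is its own token) is a simplifying assumption, not how any particular NLP library does it.

```python
import re

# Assumed rule for this sketch: \w+ matches a run of word characters
# (letters, digits, underscore); [^\w\s] matches a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

def word_boundaries(text):
    """Return the character offsets where the regex engine sees a word boundary."""
    return [match.start() for match in re.finditer(r"\b", text)]

sentence = "Tokenization splits text into 3 parts, right?"
print(tokenize(sentence))
# ['Tokenization', 'splits', 'text', 'into', '3', 'parts', ',', 'right', '?']

print(word_boundaries("Hi there"))
# [0, 2, 3, 8]
```

Note that this simple rule breaks a contraction such as "don't" into three tokens. Dedicated tokenizers in libraries like NLTK or spaCy apply more careful rules for cases like that.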