Machine learning algorithms learn from a predefined set of features in the training data and use what they learn to produce output for the test data. The main difficulty in language processing, however, is that these algorithms cannot work directly on raw text. We therefore need feature extraction techniques to transform the text into a matrix (or vector) of features.
The following are some of the most prominent feature extraction techniques:
- Bag-of-Words –
- Bag-of-Words (BoW) is one of the most basic approaches for transforming tokens into a set of features.
- The BoW model is used in document classification, where each word serves as a feature for training the classifier.
- TF-IDF –
- TF-IDF stands for term frequency-inverse document frequency.
- It highlights a term that, while not frequent across our corpus, is highly important.
- The TF-IDF value increases with the number of times a word appears in a document and decreases as the number of documents in the corpus containing that word grows.
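
To make the two techniques concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a reference implementation: the whitespace tokenizer, the toy corpus, the helper names (`tokenize`, `bag_of_words`, `tf_idf`), and the unsmoothed weighting `count * log(N / df)` are all choices made for this example. TF-IDF has several common variants, and this is only one of them.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Weight each raw count by log(N / df), one common TF-IDF variant."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weights = [[row[j] * idf[j] for j in range(len(vocab))] for row in counts]
    return vocab, weights


# Toy corpus, invented for this example.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

vocab, bow = bag_of_words(corpus)
_, weights = tf_idf(corpus)

print(vocab)
print(bow[0])
print([round(w, 3) for w in weights[0]])
```

In the first document, "the" has the highest raw count, but because it occurs in every document its TF-IDF weight is zero. "mat" appears only once, yet it occurs in a single document and so receives the largest weight. Library implementations typically apply smoothing or normalization, so their values will differ from this sketch.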