Explain NLP Python code?

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that enables machines to understand human language. Its goal is to build systems that can make sense of text and automatically perform tasks such as translation, spell checking, or topic classification, so that routine text-handling work can be automated.

Step 1: Import the dataset, setting the delimiter to ‘\t’ because the columns are separated by tabs. Each review and its category (0 or 1) are separated by a tab rather than by any other symbol, because most other symbols can appear inside a review (such as $ in a price, or ... and !). If one of those were used as the delimiter, reviews would be split in the wrong places, leading to errors or strange output.

# Importing Libraries
import numpy as np 
import pandas as pd
# Import dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t')
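
To see why the tab delimiter matters, here is a minimal sketch that parses two made-up rows with Python's standard csv module, standing in for pandas. The sample rows and the column name "Liked" are assumptions for illustration; they are not taken from the actual file.

```python
import csv
import io

# Two made-up rows in the assumed layout: review text, a tab, then the label.
sample = (
    "Review\tLiked\n"
    "Great food, fair price ($12)!\t1\n"
    "Slow service... never again.\t0\n"
)

# Splitting on tabs keeps each review in one piece.
# QUOTE_NONE treats quote characters inside a review as ordinary text.
rows_tab = list(csv.reader(io.StringIO(sample), delimiter="\t",
                           quoting=csv.QUOTE_NONE))

# Splitting on commas cuts the first review in two and glues the label
# onto the second half.
rows_comma = list(csv.reader(io.StringIO(sample), delimiter=",",
                             quoting=csv.QUOTE_NONE))

print(rows_tab[1])
print(rows_comma[1])
```

pandas.read_csv also accepts a quoting argument that takes the same csv.QUOTE_NONE constant, which may help if reviews contain stray double quotes.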

Step 2: Text Cleaning or Preprocessing

Remove punctuation and numbers: Punctuation and numbers do not help much in processing the text. If included, they only increase the size of the bag of words created in a later step and reduce the efficiency of the algorithm.
Remove stop words: Very common words (such as ‘the’ or ‘is’) carry little information about the category of a review.
Stemming: Reduce each word to its root (for example, ‘loved’ becomes ‘love’).
Convert each word to lower case: It is useless to keep the same word in different cases (e.g. ‘good’ and ‘GOOD’).

# library to clean data
import re
# Natural Language Toolkit
import nltk
# to remove stopwords
from nltk.corpus import stopwords
# for stemming purposes
from nltk.stem.porter import PorterStemmer
# download the stopword list (only needed once)
nltk.download('stopwords')
# build the stopword set and the stemmer once,
# instead of once per review
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
# Initialize empty list
# to append clean text
corpus = []
# loop over every review (1000 rows in this dataset)
for i in range(len(dataset)):
    # column : "Review", row ith;
    # replace everything except letters with a space
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    # convert all cases to lower cases
    review = review.lower()
    # split to list (default delimiter is whitespace)
    review = review.split()
    # stem each word that is not a stopword
    review = [ps.stem(word) for word in review
                if word not in stop_words]
    # rejoin all list elements
    # to create back into a string
    review = ' '.join(review)
    # append each string to create
    # list of clean text
    corpus.append(review)

Step 3: Tokenization, which involves splitting the body of the text into sentences and words.
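
As a rough illustration of tokenization, the sketch below splits a made-up paragraph into sentences and then into words using only regular expressions. This is a simplified stand-in: NLTK provides sent_tokenize and word_tokenize for this job, and they handle abbreviations and other edge cases that these patterns do not.

```python
import re

# Made-up text for illustration.
text = "The food was great. Service was slow, though! Would I return? Maybe."

# Sentence tokenization: split after ., ! or ? when followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word tokenization: lowercase each sentence and keep runs of letters
# (apostrophes included so contractions stay in one piece).
words = [re.findall(r"[a-z']+", sentence.lower()) for sentence in sentences]

for sentence, tokens in zip(sentences, words):
    print(sentence, "->", tokens)
```

In the cleaning code of Step 2, the call to review.split() already plays the role of word tokenization for each review.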

Step 4: Making the bag of words via sparse matrix

Take all the distinct words from the reviews in the dataset, without repetition.
Create one column for each word, so there will be many columns.
Each row is a review.
If a word appears in a review, its count goes in the row of that review, under the column of that word. Most cells are 0, which is why the matrix is sparse.
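
The construction described above can be sketched in plain Python. The three cleaned reviews below are hypothetical stand-ins for the corpus list built in Step 2. In practice this step is usually done with scikit-learn's CountVectorizer; the hand-rolled version here is only meant to show the mechanics.

```python
from collections import Counter

# Hypothetical cleaned reviews, standing in for the `corpus` list from Step 2.
corpus = ["love place", "crust good", "good food good price"]

# Vocabulary: every distinct word, sorted so each word gets a stable column.
vocabulary = sorted({word for review in corpus for word in review.split()})

# One row per review, one column per vocabulary word, each cell a word count.
X = []
for review in corpus:
    counts = Counter(review.split())
    X.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in X:
    print(row)
```

The label vector y used in the following steps would come from the second column of the dataset (the 0 or 1 category of each review).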

Step 5: Splitting the corpus into training and test sets. For this, we need the train_test_split function from sklearn.model_selection (older scikit-learn releases exposed it in sklearn.cross_validation, which has since been removed). The split can be 70/30, 80/20, 85/15, or 75/25; here I choose 75/25 via “test_size”.

X is the bag of words, y is the label of each review: 0 or 1 (negative or positive).

# Splitting the dataset into
# the Training set and Test set
from sklearn.model_selection import train_test_split
# experiment with "test_size"
# to get better results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
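
To make the effect of test_size concrete, here is a small sketch of a shuffled 75/25 split over row indices in plain Python. It illustrates the idea only and is not scikit-learn's implementation; the seed value 0 is an arbitrary choice for the example.

```python
import random

n_reviews = 1000   # number of reviews in this walkthrough
test_size = 0.25   # fraction of rows held out for testing

# Shuffle the row indices with a fixed seed so the split is repeatable.
indices = list(range(n_reviews))
random.Random(0).shuffle(indices)

# The first 25% of the shuffled indices form the test set, the rest the training set.
n_test = int(n_reviews * test_size)
test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(len(train_idx), len(test_idx))
```

Without a fixed seed the split changes on every run, so the reported accuracy changes too. train_test_split accepts a random_state argument for the same purpose.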

Step 6: Fitting a Predictive Model (here random forest)

Random forest is an ensemble model (made of many trees), so import the RandomForestClassifier class from sklearn.ensemble.
Use 501 trees (“n_estimators”) and ‘entropy’ as the criterion.
Fit the model with the .fit() method, passing X_train and y_train as arguments.

# Fitting Random Forest Classification
# to the Training set
from sklearn.ensemble import RandomForestClassifier
# n_estimators can be said as the number of
# trees, experiment with n_estimators
# to get better results
model = RandomForestClassifier(n_estimators = 501,
                            criterion = 'entropy')
model.fit(X_train, y_train)

Step 7: Predicting the final results using the .predict() method with X_test as its argument.

# Predicting the Test set results
y_pred = model.predict(X_test)

Note: Accuracy with the random forest was 72%. (It may differ when the experiment is run with a different test size; here it is 0.25.)

Step 8: To evaluate the accuracy, build a confusion matrix.

Since there are two classes, the confusion matrix is a 2×2 matrix: rows are the actual labels and columns are the predicted labels.

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
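
The link between the confusion matrix and accuracy can be worked through by hand. The sketch below uses made-up labels, not results from this walkthrough, and builds the matrix in plain Python with the same layout that scikit-learn's confusion_matrix uses (rows are actual labels, columns are predicted labels).

```python
# Made-up labels for illustration only; these are not the walkthrough's results.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# cm[0][0] = true negatives,  cm[0][1] = false positives
# cm[1][0] = false negatives, cm[1][1] = true positives
cm = [[0, 0], [0, 0]]
for actual, predicted in zip(y_test, y_pred):
    cm[actual][predicted] += 1

# Accuracy is the share of predictions on the diagonal (the correct ones).
correct = cm[0][0] + cm[1][1]
total = sum(sum(row) for row in cm)
accuracy = correct / total

print(cm)        # [[3, 1], [1, 3]]
print(accuracy)  # 0.75
```

scikit-learn also offers accuracy_score in sklearn.metrics, which returns this value directly from y_test and y_pred.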