In our previous blogs, we have seen the Introduction to Natural Language Processing and the Text Mining processes in detail.
Readers are recommended to go through those blogs first, as it will make this one easier to follow.
In this blog, we will see the implementation of Natural Language Processing with the ‘Restaurant Reviews’ dataset. The aim is to predict whether the review given by a customer is positive or negative.
To achieve this, we will go through all the steps listed below.
Exploratory Data Analysis
Data Visualization
Text Pre-Processing
Model Building
Model Evaluation
The code in this blog has been implemented in Spyder IDE.
So let us begin our coding in Python.
We will begin by importing the required libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
Loading the dataset
dataset = pd.read_csv(r'Restaurant_Reviews.tsv', delimiter = '\t')
Since the dataset is in TSV format, we specify the delimiter as ‘\t’.
Our dataset looks something like this.
Here the Review column contains the reviews given by the customers, and the Liked column states whether the customer liked the place (1) or not (0).
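If you want to take a quick look at the raw data yourself, a minimal sketch like the one below (assuming the dataset has been loaded as above) prints the first few rows and the overall shape.

# preview the first five reviews and their labels
print(dataset.head())

# the dataset has 1000 rows and 2 columns: Review and Liked
print(dataset.shape)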
Exploring the dataset
dataset.groupby('Liked').describe()
We can see in the output that there are equal review counts for likes and dislikes, with 497 and 499 unique reviews for dislikes and likes respectively.
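The same class balance can be confirmed with a quick value count (a small sketch using the Liked column shown above).

# count how many reviews fall into each class (0 = not liked, 1 = liked)
print(dataset['Liked'].value_counts())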
We create one more column that stores the length of each review.
dataset['Length'] = dataset['Review'].apply(len)
Here, as we can see, a new column ‘Length’ has been created.
Checking the statistical summary of the newly added column.
dataset.Length.describe()
The maximum length of a review submitted by a user is 149 characters. Let us now check the exact review that is 149 characters long.
dataset[dataset['Length'] == 149]['Review'].iloc[0]

Well, this review doesn’t seem to be a positive one.
Data Visualization
We will now do some visualization of the dataset. Beginning with plotting the histogram for the Length column.
dataset['Length'].plot(bins=70, kind='hist')
From the output, we can see that the most common length range for the reviews given by customers is 20 – 80 characters.
We will now visualize the length of the review separately for likes and dislikes.
dataset.hist(column='Length', by='Liked', bins=40, figsize=(12, 4))
The length distribution is quite similar for both categories.
We will now begin to process the data.
Text Pre-Processing
We will begin by removing punctuation and numbers, converting each word to lower case, stemming each word to its root form, and removing the stopwords.
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# initializing empty list to append clean text
corpus = []

# 1000 reviews/rows to clean
for i in range(0, 1000):
    # column: "Review", ith row; keep letters only
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    # converting all characters to lowercase
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    # stemming each word in the ith review and dropping stopwords
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    # rejoin all elements back into a single string
    review = ' '.join(review)
    # append each cleaned string to the corpus
    corpus.append(review)
Checking the output after pre-processing the text, we can see that the words have been converted into their basic form.
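To see the effect of pre-processing for yourself, you can print an original review next to its cleaned version (a minimal sketch, assuming the corpus list built above).

# compare an original review with its pre-processed version
print(dataset['Review'][0])
print(corpus[0])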
Next, we build the bag-of-words model as a sparse matrix. For this we need the CountVectorizer class from sklearn.feature_extraction.text. We fit it on the corpus and apply the transformation in one step with “.fit_transform(corpus)”, and then convert the result into an array. The label telling us whether a review is positive or negative is in the second column of the dataset, so we take [:, 1], i.e. all rows and the column at index 1.
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
Here X is the bag-of-words matrix, which is shown below:
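To inspect this matrix yourself, a small sketch like the following checks its shape and the learned vocabulary (get_feature_names_out assumes a recent scikit-learn version; older versions use get_feature_names instead).

# one row per review, one column per token (at most 1500 as set above)
print(X.shape)

# first few tokens in the learned vocabulary
print(cv.get_feature_names_out()[:10])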
Splitting the corpus into training and test data.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
Training the model
For training the model we can use any classification algorithm; however, the Naive Bayes classifier is a good choice for word-count data like this.
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
Evaluating the Model
To find the accuracy, we will first create a confusion matrix.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
cm
Finding the accuracy with the formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
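Applied to the 2x2 confusion matrix above, the formula can be computed directly from the diagonal, or equivalently with scikit-learn’s accuracy_score (a minimal sketch).

from sklearn.metrics import accuracy_score

# diagonal entries of the confusion matrix are the correct predictions (TN and TP)
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
print(accuracy)

# the same value via scikit-learn
print(accuracy_score(y_test, y_pred))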
The accuracy is 76%, which is good for a start.
Furthermore, the accuracy can be improved with more detailed EDA and hyperparameter optimization.
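As one possible direction, the smoothing parameter alpha of MultinomialNB could be tuned with a grid search. This is only a hedged sketch, not part of the walkthrough above, and the alpha grid shown is arbitrary.

from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# try a few smoothing values and pick the one with the best cross-validated accuracy
param_grid = {'alpha': [0.1, 0.5, 1.0, 2.0]}
grid = GridSearchCV(MultinomialNB(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)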
This brings us to the end of the blog. Hope you find this blog helpful.
Keep visiting our website for more blogs on Data Science and Data Analytics.