Data Science and Artificial Intelligence

Analysis of restaurant reviews with NLP

In our previous blogs, we covered the Introduction to Natural Language Processing and the Text Mining process in detail.

Readers are encouraged to go through those blogs first, as they make this one easier to follow.

In this blog, we will see the implementation of Natural Language Processing with the ‘Restaurant Reviews’ dataset. The aim is to predict whether the review given by a customer is positive or negative.

To achieve this, we will go through the following steps:

Exploratory Data Analysis

Data Visualisation

Text Pre-Processing

Building Model

Evaluating Model Performance

The code in this blog has been implemented in Spyder IDE.

So let us begin our coding in Python.

We will begin by importing the required libraries.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Loading the dataset

dataset = pd.read_csv(r'Restaurant_Reviews.tsv', delimiter = '\t')

Since the dataset is a TSV (tab-separated values) file, we set the delimiter to ‘\t’.
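To make the role of the tab delimiter concrete, here is a minimal sketch that parses a tiny sample with the standard library's csv module instead of pandas. The two sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical two-row sample in the same layout as the dataset:
# a header line, then "review text<TAB>label".
sample = "Review\tLiked\nGreat food, friendly staff.\t1\nThe soup was cold.\t0\n"

# delimiter="\t" splits on tabs only, so commas inside a review stay intact.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = [(row["Review"], int(row["Liked"])) for row in reader]

for text, liked in rows:
    print(liked, text)
```

Splitting on commas instead would break the first review in two, which is why a tab-separated file suits free-text reviews.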

Our dataset looks something like this.

Here the Review column contains the review text written by the customer, and the Liked column states whether the customer liked the place (1) or not (0).

Exploring the dataset

dataset.groupby('Liked').describe()

The output shows equal review counts for likes and dislikes, with 497 unique reviews for dislikes and 499 for likes.
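For a text column, describe() reports count, unique, top and freq for each group. The sketch below reproduces those four figures by hand on a few made-up reviews, to show how "count" and "unique" can differ when a review text is repeated. The sample data and the summary dictionary are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical (review, liked) pairs; the duplicate is deliberate.
reviews = [
    ("Loved it", 1),
    ("Loved it", 1),
    ("Great service", 1),
    ("Too salty", 0),
    ("Never again", 0),
]

summary = {}
for label in (0, 1):
    texts = [text for text, liked in reviews if liked == label]
    top, freq = Counter(texts).most_common(1)[0]
    summary[label] = {
        "count": len(texts),        # rows in this class
        "unique": len(set(texts)),  # distinct review texts
        "top": top,                 # most frequent text
        "freq": freq,               # how often it occurs
    }

print(summary)
```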

Next, we create one more column that stores the length of each review in characters.

dataset['Length'] = dataset['Review'].apply(len)

As we can see, a new column ‘Length’ has been created.

Checking the summary statistics of the newly added column.

dataset.Length.describe()

The longest review submitted by a user is 149 characters long. Let us now look at that exact review.

dataset[dataset['Length'] == 149]['Review'].iloc[0]
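Because apply(len) is called on strings, the Length column counts characters rather than words. The short sketch below shows the difference on an invented sentence. If a word count is wanted instead, one option in pandas is dataset['Review'].str.split().str.len().

```python
# Invented sentence, used only to contrast the two measures.
review = "The food was great and the service was quick."

n_chars = len(review)          # what .apply(len) measures on a string
n_words = len(review.split())  # number of whitespace-separated tokens

print(n_chars, n_words)
```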

Well, this review doesn’t seem to be a positive one.

Data Visualization

We will now visualize the dataset, beginning with a histogram of the Length column.

dataset['Length'].plot(bins=70, kind='hist') 

From the output, we can see that most reviews are between 20 and 80 characters long.

We will now visualize the length of the review separately for likes and dislikes.

dataset.hist(column='Length', by='Liked', bins=40,figsize=(12,4))

The range of lengths is almost the same for both categories.

We will now begin to process the data. 

Text Pre-Processing

We will begin by removing punctuation and numbers, converting each word to lower case, removing the stopwords, and reducing the remaining words to their root form (stemming).

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# list that will hold the cleaned reviews
corpus = []

# build these once instead of once per review (or per word)
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

# clean every review/row in the dataset (1000 here)
for i in range(len(dataset)):

    # column "Review", ith row: replace everything except letters with a space
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])

    # convert all characters to lowercase and split into words
    review = review.lower()
    review = review.split()

    # stem each word that is not a stopword
    review = [ps.stem(word) for word in review if word not in stop_words]

    # rejoin the words into a single string
    review = ' '.join(review)

    # append the cleaned string to the corpus
    corpus.append(review)

Checking the output after pre-processing, we can see that the words have been reduced to their root form.
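To see what the cleaning steps do to a single review, here is a simplified sketch that uses only the standard library. It is not the exact pipeline above: the STOPWORDS set is a tiny hand-picked stand-in for NLTK's English list, the clean function is a hypothetical helper, and stemming is left out.

```python
import re

# Tiny illustrative stopword set; NLTK's English list is much longer.
STOPWORDS = {"the", "was", "and", "a", "is", "it", "this", "but"}

def clean(text):
    """Keep letters only, lowercase, split, and drop stopwords (no stemming)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in STOPWORDS)

cleaned = clean("Wow... The crust was AMAZING, and it cost $5!")
print(cleaned)
```

One point worth keeping in mind: NLTK's English stopword list includes negations such as "not", so a review saying "not good" can end up looking like one saying "good" after cleaning.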

Next, we build a bag-of-words model, which represents the corpus as a matrix of word counts. For this, we need the CountVectorizer class from sklearn.feature_extraction.text. Calling “.fit_transform(corpus)” learns the vocabulary from the corpus and then transforms the same corpus into a sparse matrix of counts, which we convert into a dense array with “.toarray()”. Setting max_features = 1500 keeps only the 1500 most frequent words. The label that says whether a review is positive or negative is in the second column of the dataset, so we select it with [:, 1], that is, all rows and the column at index 1.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

Here X is the bag-of-words matrix, with one row per review and one column per vocabulary word.
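As a toy illustration of what such a matrix contains, the sketch below builds one by hand for three invented, already-cleaned reviews. It only loosely mirrors CountVectorizer: the vocabulary is the max_features most frequent words in alphabetical order, and each cell holds a word count.

```python
from collections import Counter

# Invented, already-cleaned reviews.
corpus = ["wow love place", "crust not good", "love crust place"]

# Vocabulary: the max_features most frequent words, sorted alphabetically.
max_features = 3
totals = Counter(word for doc in corpus for word in doc.split())
vocab = sorted(word for word, _ in totals.most_common(max_features))

# One row per document, one column per vocabulary word, cells hold counts.
X = [[doc.split().count(word) for word in vocab] for doc in corpus]

print(vocab)
for row in X:
    print(row)
```

Words that fall outside the vocabulary ("wow", "not", "good" here) simply have no column, which is the effect of limiting max_features.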

Splitting corpus into training and test data.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
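With test_size = 0.20, 200 of the 1000 reviews are held out for testing and 800 are used for training. The sketch below shows the general idea of a shuffled split using only the standard library. The split_indices helper is hypothetical and does not reproduce the exact rows that train_test_split selects.

```python
import random

def split_indices(n_rows, test_size=0.20, seed=0):
    """Shuffle row indices and cut them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as random_state = 0 above: running the code again gives the same split.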

Training the model

Any classification algorithm can be used to train the model; the Naive Bayes classifier is a good choice for word-count features such as these.

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
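To give a feel for what a multinomial Naive Bayes classifier computes, here is a small from-scratch sketch on four invented reviews. It is a simplified illustration, not scikit-learn's implementation: train_nb and predict_nb are hypothetical helpers, and alpha = 1.0 is add-one (Laplace) smoothing.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit log-priors and Laplace-smoothed log-likelihoods per class."""
    vocab = sorted({w for doc in docs for w in doc.split()})
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in Counter(labels).items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "prior": math.log(n_docs / len(labels)),
            "likelihood": {
                w: math.log((word_counts[label][w] + alpha) / total) for w in vocab
            },
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log-score for the document."""
    scores = {}
    for label, params in model.items():
        # Words never seen in training are skipped.
        scores[label] = params["prior"] + sum(
            params["likelihood"][w] for w in doc.split() if w in params["likelihood"]
        )
    return max(scores, key=scores.get)

# Hypothetical pre-cleaned reviews with 1 = liked, 0 = not liked.
docs = ["love place", "great food love", "bad service", "food bad cold"]
labels = [1, 1, 0, 0]

model = train_nb(docs, labels)
print(predict_nb(model, "love food"), predict_nb(model, "cold service"))
```

Each class gets a score made of its prior plus the log-probabilities of the words in the review, and the class with the higher score wins.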

Evaluating the Model

To find the accuracy we will create a confusion matrix.

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

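We find the accuracy with the formula: Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP and FN are the counts of true positives, true negatives, false positives and false negatives.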
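The formula can be worked through by hand on a small example. The labels below are made up for illustration and are not the model's actual predictions; confusion_counts is a hypothetical helper. For binary labels 0 and 1, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], which the sketch follows.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp

# Made-up labels for illustration only; these are not the model's results.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print([[tn, fp], [fn, tp]], accuracy)
```

With the real matrix, the same four counts can be unpacked with cm.ravel(), or the accuracy can be obtained directly from accuracy_score in sklearn.metrics.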

The accuracy is 76%, which is good for a start.

Furthermore, the accuracy may be improved by more detailed EDA and hyperparameter optimization.

This brings us to the end of the blog. Hope you find this blog helpful.

Keep visiting our website for more blogs on Data Science and Data Analytics.

Mitali Singh

Python|| Machine Learning|| Statistics|| Data Science
