Introduction

The objective of this project is to perform binary sentiment analysis (positive vs. negative) on an imbalanced hotel review dataset.

This project covers:

  • TF-IDF
  • count features
  • logistic regression
  • Naive Bayes
  • SVM
  • XGBoost
  • grid search
  • word vectors (Universal Sentence Encoder model from TensorFlow Hub)
  • LSTM

The final LSTM model achieved an accuracy of ~81% on the test dataset (75:25 train/test split).

Dataset Source

Text Embedding Model

NB: This project also serves as my assignment for the course below -

Libraries & Configuration

Check GPU

In [ ]:
!nvidia-smi
Tue Jan  4 11:28:52 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P8    33W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Libraries

In [ ]:
%%capture
!pip install tensorflow_text
!pip install tqdm
In [ ]:
import os 

# Work around session crashes by letting TensorFlow allocate GPU memory on demand
# https://stackoverflow.com/a/54927279/11105356

os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
In [ ]:
from tqdm import tqdm
import numpy as np
import pandas as pd 
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from numpy import newaxis
from wordcloud import WordCloud, STOPWORDS

from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

import xgboost as xgb
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text

from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Activation, GRU, BatchNormalization
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.layers.embeddings import Embedding
from keras.utils import np_utils
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping

from tensorflow.keras.optimizers import Adam

%matplotlib inline
sns.set(style='whitegrid', palette='muted', font_scale=1.2)

plt.rcParams['figure.figsize'] = 12, 8

RANDOM_SEED = 42

nltk.download('stopwords')
stop_words = stopwords.words('english')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
In [ ]:
tf.test.gpu_device_name()
Out[ ]:
'/device:GPU:0'
In [ ]:
tf.__version__, hub.__version__, tensorflow_text.__version__
Out[ ]:
('2.7.0', '0.12.0', '2.7.3')
In [ ]:
!pip freeze | grep hub
!pip freeze | grep tensorflow_text
!pip freeze | grep keras
!pip freeze | grep scikit-learn
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz
tensorflow-hub==0.12.0
keras==2.7.0
keras-vis==0.4.1
scikit-learn==0.22.2.post1

Load HUB Model


USE (Universal Sentence Encoder)

In [ ]:
module_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3'
use = hub.load(module_url)

Sample Example

In [ ]:
txt_1 = ["the bedroom is cozy"]
txt_2 = ["comfortable bedroom"]
In [ ]:
emb_1 = use(txt_1)
emb_2 = use(txt_2)
In [ ]:
print(emb_1.shape)
(1, 512)

Correlation

USE is trained on several tasks, and one of the main ones is identifying the similarity between pairs of sentences. The authors describe this task as "semantic textual similarity (STS) between sentence pairs scored by Pearson correlation with human judgments".

In [ ]:
print(np.inner(emb_1, emb_2).flatten()[0])
0.8467271
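
The score above is a raw inner product. It matches cosine similarity only when both vectors have unit length, which USE embeddings are reported to be only approximately. The following is a minimal sketch of an explicitly normalized similarity, using small made-up vectors in place of the 512-dimensional embeddings; the helper name `cosine_similarity` and the toy values are assumptions for illustration, not part of the notebook.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (flattened to 1-D)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 3-D vectors standing in for the 512-D USE embeddings (hypothetical values).
v1 = np.array([1.0, 2.0, 2.0])
v2 = np.array([2.0, 4.0, 4.0])   # same direction as v1, twice the length
v3 = np.array([2.0, -1.0, 0.0])  # orthogonal to v1

sim_parallel = cosine_similarity(v1, v2)    # direction matches, length ignored
sim_orthogonal = cosine_similarity(v1, v3)  # no shared direction
raw_inner = float(np.inner(v1, v2))         # grows with vector length

print(sim_parallel, sim_orthogonal, raw_inner)
```

With the notebook's variables, the same helper could be called as `cosine_similarity(emb_1, emb_2)`; if the embeddings are close to unit length, the result should be close to the inner product printed above.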

Load Dataset


Dataset Source:

jiashenliu/515k-hotel-reviews-data-in-europe

In [ ]:
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

!kaggle datasets download -d jiashenliu/515k-hotel-reviews-data-in-europe
mkdir: cannot create directory ‘/root/.kaggle’: File exists
Downloading 515k-hotel-reviews-data-in-europe.zip to /content
 73% 33.0M/45.1M [00:02<00:00, 16.0MB/s]
100% 45.1M/45.1M [00:02<00:00, 22.9MB/s]
In [ ]:
!unzip /content/515k-hotel-reviews-data-in-europe.zip
Archive:  /content/515k-hotel-reviews-data-in-europe.zip
  inflating: Hotel_Reviews.csv       

EDA

In [ ]:
df_hotel_reviews = pd.read_csv("/content/Hotel_Reviews.csv")
df_hotel_reviews.head()
Out[ ]:
  | Hotel_Address | Additional_Number_of_Scoring | Review_Date | Average_Score | Hotel_Name | Reviewer_Nationality | Negative_Review | Review_Total_Negative_Word_Counts | Total_Number_of_Reviews | Positive_Review | Review_Total_Positive_Word_Counts | Total_Number_of_Reviews_Reviewer_Has_Given | Reviewer_Score | Tags | days_since_review | lat | lng
0 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 8/3/2017 | 7.7 | Hotel Arena | Russia | I am so angry that i made this post available... | 397 | 1403 | Only the park outside of the hotel was beauti... | 11 | 7 | 2.9 | [' Leisure trip ', ' Couple ', ' Duplex Double... | 0 days | 52.360576 | 4.915968
1 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 8/3/2017 | 7.7 | Hotel Arena | Ireland | No Negative | 0 | 1403 | No real complaints the hotel was great great ... | 105 | 7 | 7.5 | [' Leisure trip ', ' Couple ', ' Duplex Double... | 0 days | 52.360576 | 4.915968
2 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 7/31/2017 | 7.7 | Hotel Arena | Australia | Rooms are nice but for elderly a bit difficul... | 42 | 1403 | Location was good and staff were ok It is cut... | 21 | 9 | 7.1 | [' Leisure trip ', ' Family with young childre... | 3 days | 52.360576 | 4.915968
3 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 7/31/2017 | 7.7 | Hotel Arena | United Kingdom | My room was dirty and I was afraid to walk ba... | 210 | 1403 | Great location in nice surroundings the bar a... | 26 | 1 | 3.8 | [' Leisure trip ', ' Solo traveler ', ' Duplex... | 3 days | 52.360576 | 4.915968
4 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 7/24/2017 | 7.7 | Hotel Arena | New Zealand | You When I booked with your company on line y... | 140 | 1403 | Amazing location and building Romantic setting | 8 | 3 | 6.7 | [' Leisure trip ', ' Couple ', ' Suite ', ' St... | 10 days | 52.360576 | 4.915968
In [ ]:
f"{df_hotel_reviews.shape[0]} rows, {df_hotel_reviews.shape[1]} columns"
Out[ ]:
'515738 rows, 17 columns'
In [ ]:
df_hotel_reviews.columns
Out[ ]:
Index(['Hotel_Address', 'Additional_Number_of_Scoring', 'Review_Date',
       'Average_Score', 'Hotel_Name', 'Reviewer_Nationality',
       'Negative_Review', 'Review_Total_Negative_Word_Counts',
       'Total_Number_of_Reviews', 'Positive_Review',
       'Review_Total_Positive_Word_Counts',
       'Total_Number_of_Reviews_Reviewer_Has_Given', 'Reviewer_Score', 'Tags',
       'days_since_review', 'lat', 'lng'],
      dtype='object')
In [ ]:
df_hotel_reviews.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 515738 entries, 0 to 515737
Data columns (total 17 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   Hotel_Address                               515738 non-null  object 
 1   Additional_Number_of_Scoring                515738 non-null  int64  
 2   Review_Date                                 515738 non-null  object 
 3   Average_Score                               515738 non-null  float64
 4   Hotel_Name                                  515738 non-null  object 
 5   Reviewer_Nationality                        515738 non-null  object 
 6   Negative_Review                             515738 non-null  object 
 7   Review_Total_Negative_Word_Counts           515738 non-null  int64  
 8   Total_Number_of_Reviews                     515738 non-null  int64  
 9   Positive_Review                             515738 non-null  object 
 10  Review_Total_Positive_Word_Counts           515738 non-null  int64  
 11  Total_Number_of_Reviews_Reviewer_Has_Given  515738 non-null  int64  
 12  Reviewer_Score                              515738 non-null  float64
 13  Tags                                        515738 non-null  object 
 14  days_since_review                           515738 non-null  object 
 15  lat                                         512470 non-null  float64
 16  lng                                         512470 non-null  float64
dtypes: float64(4), int64(5), object(8)
memory usage: 66.9+ MB
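
In the summary above, `lat` and `lng` are the only columns whose non-null count falls below the row count. A quick way to list such columns directly is sketched below on a small made-up frame; the toy values are assumptions, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the hotel table: two rows lack coordinates.
toy = pd.DataFrame({
    "Hotel_Name": ["A", "B", "C", "D"],
    "lat": [52.36, np.nan, 48.21, np.nan],
    "lng": [4.92, np.nan, 16.37, np.nan],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy.isna().sum()
missing_only = missing_counts[missing_counts > 0]

print(missing_only)
```

Applied to the real data, the equivalent call would be `df_hotel_reviews.isna().sum()`.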
In [ ]:
df_hotel_reviews.describe().T
Out[ ]:
  | count | mean | std | min | 25% | 50% | 75% | max
Additional_Number_of_Scoring | 515738.0 | 498.081836 | 500.538467 | 1.000000 | 169.000000 | 341.000000 | 660.000000 | 2682.000000
Average_Score | 515738.0 | 8.397487 | 0.548048 | 5.200000 | 8.100000 | 8.400000 | 8.800000 | 9.800000
Review_Total_Negative_Word_Counts | 515738.0 | 18.539450 | 29.690831 | 0.000000 | 2.000000 | 9.000000 | 23.000000 | 408.000000
Total_Number_of_Reviews | 515738.0 | 2743.743944 | 2317.464868 | 43.000000 | 1161.000000 | 2134.000000 | 3613.000000 | 16670.000000
Review_Total_Positive_Word_Counts | 515738.0 | 17.776458 | 21.804185 | 0.000000 | 5.000000 | 11.000000 | 22.000000 | 395.000000
Total_Number_of_Reviews_Reviewer_Has_Given | 515738.0 | 7.166001 | 11.040228 | 1.000000 | 1.000000 | 3.000000 | 8.000000 | 355.000000
Reviewer_Score | 515738.0 | 8.395077 | 1.637856 | 2.500000 | 7.500000 | 8.800000 | 9.600000 | 10.000000
lat | 512470.0 | 49.442439 | 3.466325 | 41.328376 | 48.214662 | 51.499981 | 51.516288 | 52.400181
lng | 512470.0 | 2.823803 | 4.579425 | -0.369758 | -0.143372 | 0.010607 | 4.834443 | 16.429233
In [ ]:
df_hotel_reviews.describe(include='object').T
Out[ ]:
  | count | unique | top | freq
Hotel_Address | 515738 | 1493 | 163 Marsh Wall Docklands Tower Hamlets London ... | 4789
Review_Date | 515738 | 731 | 8/2/2017 | 2585
Hotel_Name | 515738 | 1492 | Britannia International Hotel Canary Wharf | 4789
Reviewer_Nationality | 515738 | 227 | United Kingdom | 245246
Negative_Review | 515738 | 330011 | No Negative | 127890
Positive_Review | 515738 | 412601 | No Positive | 35946
Tags | 515738 | 55242 | [' Leisure trip ', ' Couple ', ' Double Room '... | 5101
days_since_review | 515738 | 731 | 1 days | 2585
In [ ]:
df_hotel_reviews.Reviewer_Score.describe().T
Out[ ]:
count    515738.000000
mean          8.395077
std           1.637856
min           2.500000
25%           7.500000
50%           8.800000
75%           9.600000
max          10.000000
Name: Reviewer_Score, dtype: float64
In [ ]:
df_hotel_reviews.Reviewer_Score.hist()
plt.title('Review Score Distribution');
In [ ]:
df_hotel_reviews.plot(kind='scatter', 
                      x='Review_Total_Positive_Word_Counts', 
                      y='Review_Total_Negative_Word_Counts', 
                      label='Total reviews',
             s=df_hotel_reviews.Total_Number_of_Reviews/100,
             c='Reviewer_Score',
             cmap=plt.get_cmap('jet'), 
             colorbar=True, 
             alpha=0.4, figsize=(15,12),
             sharex=False, # label not showing up 
             # https://stackoverflow.com/a/69661993/11105356 
             )
font_size = 15
plt.title("Review Sentiment Distribution",  fontsize=font_size)
plt.xlabel("Total Positive Word Counts", fontsize=font_size)
plt.ylabel("Total Negative Word Counts",  fontsize=font_size)
plt.legend()
plt.show()
In [ ]:
df_hotel_reviews.Reviewer_Nationality.value_counts()[:20]
Out[ ]:
 United Kingdom               245246
 United States of America      35437
 Australia                     21686
 Ireland                       14827
 United Arab Emirates          10235
 Saudi Arabia                   8951
 Netherlands                    8772
 Switzerland                    8678
 Germany                        7941
 Canada                         7894
 France                         7296
 Israel                         6610
 Italy                          6114
 Belgium                        6031
 Turkey                         5444
 Kuwait                         4920
 Spain                          4737
 Romania                        4552
 Russia                         3900
 South Africa                   3821
Name: Reviewer_Nationality, dtype: int64
In [ ]:
df_hotel_reviews.Average_Score.hist()
plt.title('Review Average Score Distribution');
In [ ]:
abs(df_hotel_reviews.Review_Total_Positive_Word_Counts - df_hotel_reviews.Review_Total_Negative_Word_Counts).hist()
plt.title('Difference Between Total Positive and Negative Word Count Among Hotel Reviews');

Cleaning Review Text

In [ ]:
df_hotel_reviews['Negative_Review'][1]
Out[ ]:
'No Negative'
In [ ]:
df_hotel_reviews.loc[:, 'Positive_Review'] = df_hotel_reviews.Positive_Review.apply(lambda x: x.replace('No Positive', ''))
df_hotel_reviews.loc[:, 'Negative_Review'] = df_hotel_reviews.Negative_Review.apply(lambda x: x.replace('No Negative', ''))
In [ ]:
df_hotel_reviews['Negative_Review'][1]
Out[ ]:
''

Merged Feature (Both Review Text)

In [ ]:
df_hotel_reviews['review'] = df_hotel_reviews.Positive_Review + df_hotel_reviews.Negative_Review
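
Plain concatenation relies on each text already carrying its own surrounding whitespace, which appears to be the case in this dataset. If that assumption did not hold, the last word of the positive text would fuse with the first word of the negative text. A more defensive variant is sketched below on made-up rows; the toy data is hypothetical and only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical rows without any padding whitespace around the texts.
toy = pd.DataFrame({
    "Positive_Review": ["Great location", "", "Friendly staff"],
    "Negative_Review": ["Small room", "Noisy at night", ""],
})

# Join with an explicit space, then strip so empty halves leave no stray spaces.
toy["review"] = (
    toy["Positive_Review"].str.strip() + " " + toy["Negative_Review"].str.strip()
).str.strip()

print(toy["review"].tolist())
```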

Set Sentiment Type

In [ ]:
df_hotel_reviews["review_type"] = df_hotel_reviews["Reviewer_Score"].apply(
    lambda x: "bad" if x < 7 else "good")
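
The labelling rule is a single threshold: scores strictly below 7 become "bad" and everything else, including exactly 7, becomes "good". The sketch below shows the same rule in vectorized form on a few made-up scores, to make the boundary behaviour explicit; `THRESHOLD` and the sample values are illustrative names and data, not part of the notebook.

```python
import numpy as np
import pandas as pd

THRESHOLD = 7  # same cut-off as the lambda above

# Hypothetical scores, chosen to sit on both sides of the boundary.
scores = pd.Series([2.5, 6.9, 7.0, 9.6], name="Reviewer_Score")

# np.where evaluates the condition for the whole column at once.
review_type = pd.Series(
    np.where(scores < THRESHOLD, "bad", "good"), name="review_type"
)

print(review_type.tolist())
```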
In [ ]:
df_reviews = df_hotel_reviews[["review", "review_type"]]
In [ ]:
df_reviews
Out[ ]:
  | review | review_type
0 | Only the park outside of the hotel was beauti... | bad
1 | No real complaints the hotel was great great ... | good
2 | Location was good and staff were ok It is cut... | good
3 | Great location in nice surroundings the bar a... | bad
4 | Amazing location and building Romantic settin... | bad
... | ... | ...
515733 | location no trolly or staff to help you take ... | good
515734 | Breakfast was ok and we got earlier check in ... | bad
515735 | The ac was useless It was a hot week in vienn... | bad
515736 | The rooms are enormous and really comfortable... | good
515737 | staff was very kind I was in 3rd floor It di... | good

515738 rows × 2 columns

In [ ]:
df_reviews.review_type.hist();
# imbalanced distribution
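
To put a number on the imbalance rather than read it off the histogram, class shares can be computed directly. The sketch below also derives inverse-frequency class weights, one common way to compensate for imbalance; it uses a made-up 8:2 label series, and the weighting is an illustration rather than something this notebook is known to apply.

```python
import pandas as pd

# Hypothetical labels with an 8:2 imbalance, standing in for review_type.
labels = pd.Series(["good"] * 8 + ["bad"] * 2, name="review_type")

# Share of each class among all samples.
class_share = labels.value_counts(normalize=True)

# Inverse-frequency weights: n_samples / (n_classes * class_count).
counts = labels.value_counts()
class_weight = (len(labels) / (len(counts) * counts)).to_dict()

print(class_share.to_dict(), class_weight)
```

On the real data, `df_reviews.review_type.value_counts(normalize=True)` would give the corresponding shares.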
In [ ]:
df_reviews[df_reviews.review_type == 'good'].review.value_counts()
Out[ ]:
 Location                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  947
 Everything Nothing                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        938
 Everything                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                599
 Great location                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            253
 Everything                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                219
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ... 
 The hotel is in the center of the Navigli Area which is a nice quite area with lots of restaurants The hotel is clean modern and well maintained The breakfast is rich and even offers the option to prepare your own carrot juice And the hotel even offers a welcome drink not much really                                                                                                                                                                                                                                1
 Booked a standard double and was upgraded to a suite in the Taj51 which was just utterly fantastic Staff were really helpful room was amazing location was great could see Buck House from my window  A bit more info on how to use room service would ve helped I didn t find the hotel info books until I was leaving                                                                                                                                                                                                     1
 I likeed the Confort of the hotel and the location  The restaurant                                                                                                                                                                                                                                                                                                                                                                                                                                                          1
 Location excellent for Westfield beds really comfortable Nothing                                                                                                                                                                                                                                                                                                                                                                                                                                                            1
 Very clean and very nice staff They have a can do attitude which I really like I ve stayed here twice already and will be doing so more often  Breakfast is a little pricey for what you are getting I mostly chose continental and most days there were only apples available as fruit the two cold meat options were always dry which made me think they weren t fresh and the cheese were in awkward packages instead of slices Wish there was a bigger selection for a cheaper price Apart from this all was good       1
Name: review, Length: 413884, dtype: int64
In [ ]:
df_reviews[df_reviews.review_type == 'bad'].review.value_counts()
Out[ ]:
 Nothing Everything                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   124
 Location                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             107
 Nothing                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               36
 location                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              26
 Staff                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 23
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ... 
 Good spot for o2 arena Took ober 900 pounds from my account on top if the 600 already taken to authorise 3 rooms not refunded until 10 days after stay                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 1
 Good location and insonorised room Staff was not nice We did not have any towels in the room when I asked staff he did not want to give me ones No pillows either Bathroom was dirty Water was brown And cherry on the top we had cockroach in the bathroom                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1
 Location Nice rooms  Good afternoon We have just returned from a stay at your hotel last night I have to say I m somewhat shocked and disappointed at the service we received whilst staying with you Firstly a member of your staff knocked and then proceeded to come straight into our room yesterday afternoon They quickly left shouting sorry once they realised we were there We thought this to be somewhat odd but quickly dismissed it as an obvious mistake This morning the same thing happened again This was a different member of staff and presumingly they were waiting to potentially clean our room on check out but even so I feel it s unacceptable and inappropriate to knock once and then use their own key to enter uninvited In all the years of hotel breaks I have never experienced something like this let alone it happening twice On check out this morning we explained the situation to Nelson who simply said sorry about that it can happen sometimes I didn t feel this was an acceptable apology it sounds like it s a common occurrence at your hotel Maybe it s a policy that needs revisiting Finally there was a discrepancy with our final bill when I politely questioned this with Stephanie she came across extremely rude and condescending The bill was in fact correct but a simple explanation with less attitude would have been greatly appreciated She made us feel very uneasy and it was very unnecessary and disappointing from a hotel of your caliber We paid 200 for our room last night our experience with you did not justify the price we paid and we feel very disappointed with our stay I have felt the need to add this post as our concerns fell on deaf ears at check out this morning I hope you will address these issues so that other guests are not in the same situation Kind Regards       1
 Good location Room was smallish with mould in corner but facilities in room were good Window was a high velux in ceiling which cannot open which was loud with heavy rain Service was slow and when i asked for a mocha i was told they did not know what was in it and asked me did i know how to make it which i thought was a bit smart The poshness shined tru though                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              1
 There is paid parking next to the hotel for around 12 US dollars a night There is a bus and trolley stop a block away which makes it easy to get to central Amsterdam Take the number 21 bus as I recall the tram is less frequent Breakfast was good  If you intend to park your car in the parking garage when you check in go to the left of the check in counter and open the door to the garage just to note it s location and appearance otherwise you may never find your way out of the garage The door says emergency only which of course is not true The door from the garage to the hotel is very poorly marked with a sign hanging from the ceiling about 50 ft in front of the door which you ll never see After hours don t go into the automatic door to the shops in the garage which might be closed You will get trapped like we did and couldn t get out This was during our search for the door to the hotel Front desk should have been more helpful they didn t tell us about the garage door didn t mention the 21 bus only the tram Park on the ground floor of the parking garage and when you enter you have two choices for parking ticket take either one Front desk didn t tell us this either Room had four beds in it and there were only two of us So it was cramped Room safe was locked and on the floor so not usable                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              1
Name: review, Length: 85648, dtype: int64
In [ ]:
good_reviews = df_reviews[df_reviews.review_type == "good"]
bad_reviews = df_reviews[df_reviews.review_type == "bad"]
In [ ]:
good_reviews_text = " ".join(good_reviews.review.tolist())
bad_reviews_text = " ".join(bad_reviews.review.tolist())

Word Cloud

In [ ]:
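# Generate and display a word cloud for a single string of text.
# Expects WordCloud, STOPWORDS (wordcloud) and plt (matplotlib.pyplot) to be imported already.
def gen_wc(txt):
  # Common English words to exclude from the cloud
  stopwords = set(STOPWORDS)

  # crisp wordcloud : https://stackoverflow.com/a/28795577/11105356
  wc = WordCloud(
      width=800,
      height=400,
      background_color="white",
      max_font_size=300,
      stopwords=stopwords,
  ).generate(txt)

  # Draw the cloud on a large figure and hide the axes
  plt.figure(figsize=(14, 10))
  plt.imshow(wc, interpolation="bilinear")
  plt.axis("off")
  plt.show()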
In [ ]:
gen_wc(good_reviews_text)
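
A word cloud shows relative frequency only through font size, so a plain count of the most frequent words can be a useful cross-check. The cell below is a minimal sketch of that idea and is not part of the original notebook: `top_words` and `BASIC_STOPWORDS` are hypothetical names, the stopword set is a deliberately tiny stand-in for `wordcloud.STOPWORDS`, and `sample_text` is a made-up string used in place of `good_reviews_text`.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stopword set for illustration only;
# the notebook itself relies on wordcloud.STOPWORDS.
BASIC_STOPWORDS = {
    "the", "a", "an", "and", "was", "is", "in", "to", "of",
    "we", "i", "it", "for", "not", "were", "with", "but", "very",
}


def top_words(txt, n=10, stopwords=BASIC_STOPWORDS):
    """Return the n most frequent lowercase tokens in txt, skipping stopwords."""
    # Lowercase the text and keep alphabetic runs only
    tokens = re.findall(r"[a-z]+", txt.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# Made-up sample; in the notebook this could be good_reviews_text instead
sample_text = "Good location Room was small Good location Staff was not nice"
print(top_words(sample_text, n=3))
```

Calling `top_words(good_reviews_text)` and `top_words(bad_reviews_text)` would give a list of `(word, count)` pairs for each class, which could be compared against the largest words in the corresponding clouds.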