It's pretty important to find the right mattress, since we sleep on one every night. I read somewhere that a large proportion of people trust online reviews as much as personal recommendations. And since the whole COVID disaster shut down businesses, it's really important to be able to judge a product's quality without trying it out first. I found a website that summarizes the "best" bed-in-a-box companies, but I still wanted to look at reviews from other customers. The Trustpilot website states that they have 1.1 trillion ratings across 300,000 businesses. That's impressive, but unfortunately I don't have that much time, even in lockdown.
So here we're going to look at reviews of those best companies and try to figure out if there's a stand-out winner. We begin by scraping the Trustpilot website. I followed this tutorial here, which was a huge help in writing the code.
Let's start by figuring out how to access the right information. I went to the Trustpilot website and looked at the HTML of the parts I'm interested in. We'll begin with Emma.
from requests import get
from bs4 import BeautifulSoup
url = 'https://www.trustpilot.com/review/www.emma-mattress.co.uk?languages=all&page=1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
#we start with the user details
user_containers = html_soup.find_all('div', class_ = 'consumer-information__name')
print(type(user_containers))
print(len(user_containers))
<class 'bs4.element.ResultSet'> 20
first = user_containers[0]
first.text
'\n Sharon Druce\n '
# let's look for the review content
review_containers = html_soup.find_all('div', class_ = 'review-content__body')
print(type(review_containers))
print(len(review_containers))
<class 'bs4.element.ResultSet'> 20
first_rev= review_containers[0]
first_rev
<div class="review-content__body" v-pre=""> <h2 class="review-content__title"> <a class="link link--large link--dark" data-track-link="{'target': 'Single review', 'name': 'review-title'}" href="/reviews/5f52205b02e85708c8dfc5eb">Ok I’ve now had my Emma mattress for a…</a> </h2> <p class="review-content__text"> Ok I’ve now had my Emma mattress for a few months, it’s ok apart from the materials in it causes me to sweat , I thought it had a cooling layer to regulate body temperature, but if I get out during the night the matress actually feels damp and sweaty when I get back in bed,I’ve now removed the mattress protector but there’s no difference . I’ve been putting a bath sheet towel on to lay on , I’ve not even attempted to use the pillow it’s about an inch thick and rock hard so it’s a no no for me, I think I will be returning after the 200 days are up , I’m off to Dreams . I haven’t actually had a good solid nights sleep throughout . </p> </div>
Looks like we can collect the title and review from here.
title = first_rev.h2.a.text
review = first_rev.p.text
print(title)
print(review)
Ok I’ve now had my Emma mattress for a… Ok I’ve now had my Emma mattress for a few months, it’s ok apart from the materials in it causes me to sweat , I thought it had a cooling layer to regulate body temperature, but if I get out during the night the matress actually feels damp and sweaty when I get back in bed,I’ve now removed the mattress protector but there’s no difference . I’ve been putting a bath sheet towel on to lay on , I’ve not even attempted to use the pillow it’s about an inch thick and rock hard so it’s a no no for me, I think I will be returning after the 200 days are up , I’m off to Dreams . I haven’t actually had a good solid nights sleep throughout .
We'd also like to collect the rating, so let's sort that out.
rating_container = html_soup.find_all('div',class_ = "star-rating star-rating--medium")
first_rating = rating_container[0]
first_rating
<div class="star-rating star-rating--medium"> <img alt="1 star: Bad" src="https://cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-1.svg"/> </div>
stars = first_rating.img['alt'][0]
stars
'1'
Cool, we can just take the first character of the string, since it's in the format '1 star: Bad'. Now onto the dates.
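As an aside, taking the first character works as long as the alt text starts with a digit. A slightly more defensive parse (a sketch; the helper name `parse_stars` is mine) returns None instead of a junk character when it doesn't:

```python
import re

def parse_stars(alt_text):
    """Pull the leading star count out of strings like '1 star: Bad'.
    Returns None when the alt text doesn't start with a digit."""
    m = re.match(r'(\d+)', alt_text)
    return int(m.group(1)) if m else None

print(parse_stars('1 star: Bad'))  # 1
```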
dates_container = html_soup.find_all('div',class_ = "review-content-header__dates")
first_date = dates_container[0]
first_date
<div class="review-content-header__dates"> <script data-initial-state="review-dates" type="application/json"> {"publishedDate":"2020-09-04T11:09:15Z","updatedDate":null,"reportedDate":null} </script> <review-dates :published-date="publishedDate" :reported-date="reportedDate" :updated-date="updatedDate"></review-dates> </div>
date = first_date.script
date
<script data-initial-state="review-dates" type="application/json"> {"publishedDate":"2020-09-04T11:09:15Z","updatedDate":null,"reportedDate":null} </script>
import re
date = str(date)
print(date)
result = re.search('publishedDate":(.*),"updatedDate"', date)
print(result.group(1)[1:11])
<script data-initial-state="review-dates" type="application/json"> {"publishedDate":"2020-09-04T11:09:15Z","updatedDate":null,"reportedDate":null} </script> 2020-09-04
There might be a better way to get the date, but this seems like an OK workaround. Let's see if it scales (update: it did).
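For the record, since the `<script>` tag holds valid JSON, `json.loads` would be a cleaner route than the regex. A sketch, where `script_text` stands in for the tag's inner text (with BeautifulSoup that would be `date.string`):

```python
import json

# inner text of the <script> tag, copied from the output above
script_text = ' {"publishedDate":"2020-09-04T11:09:15Z","updatedDate":null,"reportedDate":null} '
payload = json.loads(script_text)
print(payload['publishedDate'][:10])  # 2020-09-04
```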
One last thing is to get the link to each reviewer's profile so we can see where people are from. We'll collect the user ID and then append it to the base address.
profile_link_containers = html_soup.find_all('div', class_ = "review-card")
first_profile = profile_link_containers[0].aside.a['href']
first_profile
'/users/5b22bbab4de5666d34749b69'
Alright, I think that's everything we need so I'm going to define a function tying it all together.
from time import sleep, time
from random import randint
import re
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
from IPython.core.display import clear_output

# we'll do this with a function in case we want to add different companies to the analysis later
def scrape_reviews(PATH, n_pages):
    requests = 0
    # data we collect will be stored in lists
    names = []
    ratings = []
    headers = []
    reviews = []
    dates = []
    locations = []
    for p in range(n_pages):
        start_time = time()
        # sleep to pause the loop between page collections; I had this shorter,
        # but the host disrupted the collection, I guess it looked like a bot
        sleep(randint(8, 13))
        # make a get request
        response = get(f'{PATH}{p}')
        # monitor the requests
        requests += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
        clear_output(wait=True)
        # parse the content with BSoup
        page_html = BeautifulSoup(response.text, 'html.parser')
        # select all of the relevant containers from a page
        review_containers = page_html.find_all('div', class_='review-content__body')
        user_containers = page_html.find_all('div', class_='consumer-information__name')
        rating_container = page_html.find_all('div', class_='star-rating star-rating--medium')
        date_container = page_html.find_all('div', class_='review-content-header__dates')
        profile_link_containers = page_html.find_all('div', class_='review-card')
        # now we start populating our lists
        for x in range(len(review_containers)):
            review_c = review_containers[x]
            # account for reported reviews that return nothing, and for reviews
            # that have a title but no text in the body
            if review_c is None:
                headers.append('None')
                reviews.append('None')
            else:
                if review_c.p is None:
                    reviews.append('None')
                else:
                    reviews.append(review_c.p.text)
                headers.append(review_c.h2.a.text)
            reviewer = user_containers[x]
            names.append(reviewer.text)
            rating = rating_container[x]
            ratings.append(rating.img['alt'][0])
            date = str(date_container[x].script)
            result = re.search('publishedDate":(.*),"updatedDate"', date)
            dates.append(result.group(1)[1:11])
            prof = profile_link_containers[x]
            link = 'https://www.trustpilot.com' + prof.aside.a['href']
            c_profile = get(f'{link}')
            csoup = BeautifulSoup(c_profile.text, 'html.parser')
            cust_container = csoup.find('div', class_='user-summary-location')
            # some people don't specify a location; an error was raised in testing,
            # so we make sure it still works in that case
            if cust_container is None:
                locations.append('None')
            else:
                locations.append(cust_container.text)
    rev_df = pd.DataFrame(list(zip(headers, reviews, ratings, names, locations, dates)),
                          columns=['Header', 'Review', 'Rating', 'Name', 'Location', 'Date'])
    rev_df['Review'] = rev_df['Review'].str.replace('\n', '')
    rev_df['Name'] = rev_df['Name'].str.replace('\n', '')
    rev_df['Location'] = rev_df['Location'].str.replace('\n', '')
    rev_df['Date'] = pd.to_datetime(rev_df['Date'])
    return rev_df
Trustpilot shows 20 reviews per page, and the emma-mattress.co.uk site has 11,152 reviews (and climbing). This equates to approximately 558 pages. To begin, let's take 5 pages and make sure this works.
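The page arithmetic is just ceiling division, which is easy to sanity-check:

```python
import math

# 20 reviews per page, 11,152 reviews at the time of scraping
reviews_total = 11152
n_pages = math.ceil(reviews_total / 20)
print(n_pages)  # 558
```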
df = scrape_reviews(PATH = 'https://www.trustpilot.com/review/www.emma-mattress.co.uk?languages=all&page=',
n_pages = 5)
Request:5; Frequency: 0.36804427864250117 requests/s
df
Header | Review | Rating | Name | Location | Date | |
---|---|---|---|---|---|---|
0 | I ordered the hybrid mattress | I ordered the hybrid mattress.... | 5 | Ramiro Cali-Corleo | Malta | 2020-09-05 |
1 | I much enjoy my Emma mattress together… | I much enjoy my Emma mattress ... | 5 | Margrit Dahm | United Kingdom | 2020-09-05 |
2 | Amazing mattress and impeccable customer service | This mattress is amazing and t... | 5 | Emma | United Kingdom | 2020-09-05 |
3 | My son loves his Mattress | My son loves his Mattress can ... | 5 | Annie | United Kingdom | 2020-09-05 |
4 | Best nights sleep ever | Having bought a double and a s... | 5 | Lisa Silver | United Kingdom | 2020-09-05 |
... | ... | ... | ... | ... | ... | ... |
88 | Mattress is comfy | Quick delivery and the best ma... | 5 | Carla Russell | United Kingdom | 2020-09-04 |
89 | Love my mattress and my pillows !! | With the added bit that came f... | 5 | Bill Compton | United Kingdom | 2020-09-04 |
90 | Excellent service and professional | Excellent mattress almost inst... | 5 | Rae Lloyd | United Kingdom | 2020-09-04 |
91 | Great mattress | Very impressed with the servic... | 5 | Shelley | United Kingdom | 2020-09-04 |
92 | Excellent customer service | Sadly tried both the types of ... | 2 | Jan Brooks | United Kingdom | 2020-09-04 |
93 rows × 6 columns
It's not the full 100 reviews, but for the purposes here I think 93% of reviews is fine.
df.tail(10)
Header | Review | Rating | Name | Location | Date | |
---|---|---|---|---|---|---|
85 | Impressed | None | 5 | Ben Gardner | United Kingdom | 2020-08-31 |
86 | Best mattress we have ever had | I think the Emma matteress is ... | 3 | Amy Newman | United Kingdom | 2020-08-31 |
87 | Fabulous mattress .. | Fantastic mattress better nigh... | 5 | Cathy | United States | 2020-08-31 |
88 | Best nights sleep after purchase | Comfortable mattress but find ... | 2 | Carol | United Kingdom | 2020-08-31 |
89 | Thank you so much Emma. | I am totally satisfied with my... | 5 | Nick Archer | United Kingdom | 2020-08-31 |
90 | My wife has a back problem | I had enjoyed my Emma mattress... | 5 | Lesley Weeks | United Kingdom | 2020-08-31 |
91 | My Emma experience | Best mattress we have ever had... | 5 | Robinson | United Kingdom | 2020-08-31 |
92 | Best mattress EVER | Fabulous mattress ... comfy fr... | 5 | Lesley Wells | United Kingdom | 2020-08-31 |
93 | Got the Emma mattress for my son | Best nights sleep after purcha... | 5 | Ste P | United Kingdom | 2020-08-31 |
94 | Very happy with my mattress! | After having back pain for mon... | 5 | Jenny Parker | Netherlands | 2020-08-31 |
emma_data = scrape_reviews(PATH = 'https://www.trustpilot.com/review/www.emma-mattress.co.uk?languages=all&page=',
n_pages = 558)
Request:558; Frequency: 53.68421449576918 requests/s
emma_data
Header | Review | Rating | Name | Location | Date | |
---|---|---|---|---|---|---|
0 | I ordered the hybrid mattress | I ordered the hybrid mattress.... | 5 | Ramiro Cali-Corleo | Malta | 2020-09-05 |
1 | I much enjoy my Emma mattress together… | I much enjoy my Emma mattress ... | 5 | Margrit Dahm | United Kingdom | 2020-09-05 |
2 | Amazing mattress and impeccable customer service | This mattress is amazing and t... | 5 | Emma | United Kingdom | 2020-09-05 |
3 | My son loves his Mattress | My son loves his Mattress can ... | 5 | Annie | United Kingdom | 2020-09-05 |
4 | Best nights sleep ever | Having bought a double and a s... | 5 | Lisa Silver | United Kingdom | 2020-09-05 |
... | ... | ... | ... | ... | ... | ... |
10378 | Good products, prompt deliveries and good cust... | So pleased we trusted the revi... | 5 | Helen | United Kingdom | 2018-02-20 |
10379 | Sadly after 6 weeks on this mattress it… | Excellent company to deal with... | 4 | Berni Mackinnon | United Kingdom | 2018-02-20 |
10380 | I am a convert! | Incredible comfort, my sore ba... | 5 | Customer Esstee. | United Kingdom | 2018-02-20 |
10381 | Happy customer | This is a comfortable mattress... | 4 | jjgirl | United Kingdom | 2018-02-20 |
10382 | My wife has had both hips replaced and… | Best bed we have had, we both ... | 5 | Ian Newsham | United Kingdom | 2018-02-20 |
10383 rows × 6 columns
Pretty happy with this: 10,383 reviews in total, plenty to work with. I'm going to save these to disk so I don't have to scrape again; I can just read them back in.
emma_data.to_csv('Desktop/emma_data.csv')
Casper has 5,977 reviews, so 299 pages
casper_data = scrape_reviews(PATH = 'https://www.trustpilot.com/review/casper.com?languages=all&page=',
n_pages = 299)
Request:299; Frequency: 26.436810165083017 requests/s
casper_data
Header | Review | Rating | Name | Location | Date | |
---|---|---|---|---|---|---|
0 | Casper isn’t good at all for side sleepers | I starting by reading online t... | 2 | shaun | United Kingdom | 2020-09-03 |
1 | Thank you | None | 5 | MArk | United States | 2020-09-02 |
2 | I have had NOTHING but negative… | Arrived quickly and quality wa... | 5 | Sonia Ewell | United States | 2020-08-25 |
3 | arrived in excellent time and is a… | I have had NOTHING but negativ... | 1 | Michael E. | United States | 2020-08-21 |
4 | It broke but cos they are not in uk… | arrived in excellent time and ... | 5 | S Childs | United Kingdom | 2020-08-17 |
... | ... | ... | ... | ... | ... | ... |
5421 | The mattress is comfortable and… | None | 2 | Anon | United Kingdom | 2019-05-14 |
5422 | Fantastic | The mattress is fantastic, had... | 5 | Mr Colin Cassels | United Kingdom | 2019-05-14 |
5423 | Best sleep ever | Even though the pillow was muc... | 5 | Sheila Metzger | United Kingdom | 2019-05-14 |
5424 | Dog bed is cheap regular foam. Not worth £100+ | Absolutely love this mattress.... | 5 | Brutus | United Kingdom | 2019-05-14 |
5425 | The pillow was far too soft with no… | Excellent service from start t... | 5 | Dani | United Kingdom | 2019-05-14 |
5426 rows × 6 columns
casper_data.to_csv('Desktop/casper_data.csv')
Eve has 5,797 reviews, so about 290 pages; we'll scrape 299 again to be safe
eve_data = scrape_reviews(PATH = 'https://www.trustpilot.com/review/www.evesleep.co.uk?languages=all&page=',
n_pages = 299)
Request:299; Frequency: 31.688304164405828 requests/s
eve_data.to_csv('Desktop/eve_data.csv')
Leesa has 1,494 reviews = 75 pages
leesa_data = scrape_reviews(PATH = 'https://www.trustpilot.com/review/leesa.co.uk?languages=all&page=',
n_pages = 75)
Request:75; Frequency: 8.762897036945937 requests/s
leesa_data.to_csv('Desktop/leesa_data.csv')
Otty has 5,895 reviews = 295 pages
otty_data = scrape_reviews(PATH = 'https://www.trustpilot.com/review/otty.com?languages=all&page=',
n_pages = 295)
Request:295; Frequency: 28.610990529202876 requests/s
otty_data.to_csv('Desktop/otty_data.csv')
SilentNight has 4,074 reviews = 204 pages
silentnight_data = scrape_reviews(PATH = 'https://www.trustpilot.com/review/shop.silentnight.co.uk?languages=all&page=',
n_pages = 204)
Request:204; Frequency: 17.83837362789943 requests/s
silentnight_data.to_csv('Desktop/silentnight_data.csv')
Simba sleep has 15,617 reviews = 781 pages, yikes! I'd actually never even heard of this company.
simba_data = scrape_reviews(PATH = 'https://www.trustpilot.com/review/simbasleep.com?languages=all&page=',
n_pages = 781)
Request:781; Frequency: 68.55865285167769 requests/s
simba_data.to_csv('Desktop/simba_data.csv')
simba_data
Header | Review | Rating | Name | Location | Date | |
---|---|---|---|---|---|---|
0 | Great company | Great company, great product, ... | 5 | Robert Kendall | United Kingdom | 2020-09-06 |
1 | Noticeably better sleep straight away | This is an excellent mattress!... | 5 | J. M. L. | United Kingdom | 2020-09-06 |
2 | Not great at the moment 4 days in.... | Received my order at the plann... | 2 | Tami Shepherd | United Kingdom | 2020-09-06 |
3 | Excellent matteress | Excellent matteress, duvet and... | 5 | mr andrew dobson | United Kingdom | 2020-09-05 |
4 | Still waiting........ | Ordered the mattress on 25th A... | 1 | Su | United Kingdom | 2020-09-05 |
... | ... | ... | ... | ... | ... | ... |
14570 | Ordering quick and easy Delivery Not so! | No question about it. AND the... | 5 | Linda B. | United Kingdom | 2017-03-04 |
14571 | Amazing nights sleep | The mattress has reduced my ba... | 5 | Micaela A. | United Kingdom | 2017-03-04 |
14572 | Simba | Matttess was great. Exactly as... | 5 | Customer | United Kingdom | 2017-03-04 |
14573 | Easy to order | Best mattress around, fantasti... | 5 | Ratpig | United Kingdom | 2017-03-04 |
14574 | No support!! | Good service and quick deliver... | 5 | Anthony C | United Kingdom | 2017-03-04 |
14575 rows × 6 columns
I'm going to add a column for the company in each file and then combine the data frames into a single table.
emma_data['Company'] = 'Emma'
casper_data['Company'] = 'Casper'
eve_data['Company'] = 'Eve'
leesa_data['Company'] = 'Leesa'
otty_data['Company'] = 'Otty'
silentnight_data['Company'] = 'SilentNight'
simba_data['Company'] = 'Simba'
Companies = [emma_data,casper_data,eve_data,leesa_data,otty_data,silentnight_data,simba_data]
emma_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10383 entries, 0 to 10382 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Header 10383 non-null object 1 Review 10383 non-null object 2 Rating 10383 non-null object 3 Name 10383 non-null object 4 Location 10383 non-null object 5 Date 10383 non-null datetime64[ns] 6 Company 10383 non-null object dtypes: datetime64[ns](1), object(6) memory usage: 567.9+ KB
all_reviews = pd.concat(Companies, axis=0, ignore_index=True)
all_reviews
Header | Review | Rating | Name | Location | Date | Company | |
---|---|---|---|---|---|---|---|
0 | I ordered the hybrid mattress | I ordered the hybrid mattress.... | 5 | Ramiro Cali-Corleo | Malta | 2020-09-05 | Emma |
1 | I much enjoy my Emma mattress together… | I much enjoy my Emma mattress ... | 5 | Margrit Dahm | United Kingdom | 2020-09-05 | Emma |
2 | Amazing mattress and impeccable customer service | This mattress is amazing and t... | 5 | Emma | United Kingdom | 2020-09-05 | Emma |
3 | My son loves his Mattress | My son loves his Mattress can ... | 5 | Annie | United Kingdom | 2020-09-05 | Emma |
4 | Best nights sleep ever | Having bought a double and a s... | 5 | Lisa Silver | United Kingdom | 2020-09-05 | Emma |
... | ... | ... | ... | ... | ... | ... | ... |
46714 | Ordering quick and easy Delivery Not so! | No question about it. AND the... | 5 | Linda B. | United Kingdom | 2017-03-04 | Simba |
46715 | Amazing nights sleep | The mattress has reduced my ba... | 5 | Micaela A. | United Kingdom | 2017-03-04 | Simba |
46716 | Simba | Matttess was great. Exactly as... | 5 | Customer | United Kingdom | 2017-03-04 | Simba |
46717 | Easy to order | Best mattress around, fantasti... | 5 | Ratpig | United Kingdom | 2017-03-04 | Simba |
46718 | No support!! | Good service and quick deliver... | 5 | Anthony C | United Kingdom | 2017-03-04 | Simba |
46719 rows × 7 columns
It looks like we have 46,719 reviews from 7 companies to work with. I think this is a good amount to continue on with.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
all_reviews.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 46719 entries, 0 to 46718 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Header 46719 non-null object 1 Review 46719 non-null object 2 Rating 46719 non-null object 3 Name 46719 non-null object 4 Location 46719 non-null object 5 Date 46719 non-null datetime64[ns] 6 Company 46719 non-null object dtypes: datetime64[ns](1), object(6) memory usage: 2.5+ MB
all_reviews
Header | Review | Rating | Name | Location | Date | Company | |
---|---|---|---|---|---|---|---|
0 | I ordered the hybrid mattress | I ordered the hybrid mattress.... | 5 | Ramiro Cali-Corleo | Malta | 2020-09-05 | Emma |
1 | I much enjoy my Emma mattress together… | I much enjoy my Emma mattress ... | 5 | Margrit Dahm | United Kingdom | 2020-09-05 | Emma |
2 | Amazing mattress and impeccable customer service | This mattress is amazing and t... | 5 | Emma | United Kingdom | 2020-09-05 | Emma |
3 | My son loves his Mattress | My son loves his Mattress can ... | 5 | Annie | United Kingdom | 2020-09-05 | Emma |
4 | Best nights sleep ever | Having bought a double and a s... | 5 | Lisa Silver | United Kingdom | 2020-09-05 | Emma |
... | ... | ... | ... | ... | ... | ... | ... |
46714 | Ordering quick and easy Delivery Not so! | No question about it. AND the... | 5 | Linda B. | United Kingdom | 2017-03-04 | Simba |
46715 | Amazing nights sleep | The mattress has reduced my ba... | 5 | Micaela A. | United Kingdom | 2017-03-04 | Simba |
46716 | Simba | Matttess was great. Exactly as... | 5 | Customer | United Kingdom | 2017-03-04 | Simba |
46717 | Easy to order | Best mattress around, fantasti... | 5 | Ratpig | United Kingdom | 2017-03-04 | Simba |
46718 | No support!! | Good service and quick deliver... | 5 | Anthony C | United Kingdom | 2017-03-04 | Simba |
46719 rows × 7 columns
all_reviews.describe()
Header | Review | Rating | Name | Location | Date | Company | |
---|---|---|---|---|---|---|---|
count | 46719 | 46719 | 46719 | 46719 | 46719 | 46719 | 46719 |
unique | 33915 | 43369 | 6 | 35963 | 96 | 1313 | 7 |
top | Great mattress | None | 5 | customer | United Kingdom | 2020-02-13 00:00:00 | Simba |
freq | 540 | 3160 | 36899 | 944 | 42666 | 220 | 14575 |
first | NaN | NaN | NaN | NaN | NaN | 2017-01-24 00:00:00 | NaN |
last | NaN | NaN | NaN | NaN | NaN | 2020-09-06 00:00:00 | NaN |
Looking at the full dataframe, it seems we have 6 unique ratings, which is weird. Since 0 is not an option, there should just be 1-5. Let's investigate.
all_reviews['Rating'].unique()
array(['5', '2', '1', '4', '3', '#'], dtype=object)
It seems a few reviews came through without a parseable star rating. Let's drop the rows with '#' as a rating.
all_reviews = all_reviews[all_reviews['Rating'] != '#']
all_reviews
Header | Review | Rating | Name | Location | Date | Company | |
---|---|---|---|---|---|---|---|
0 | I ordered the hybrid mattress | I ordered the hybrid mattress.... | 5 | Ramiro Cali-Corleo | Malta | 2020-09-05 | Emma |
1 | I much enjoy my Emma mattress together… | I much enjoy my Emma mattress ... | 5 | Margrit Dahm | United Kingdom | 2020-09-05 | Emma |
2 | Amazing mattress and impeccable customer service | This mattress is amazing and t... | 5 | Emma | United Kingdom | 2020-09-05 | Emma |
3 | My son loves his Mattress | My son loves his Mattress can ... | 5 | Annie | United Kingdom | 2020-09-05 | Emma |
4 | Best nights sleep ever | Having bought a double and a s... | 5 | Lisa Silver | United Kingdom | 2020-09-05 | Emma |
... | ... | ... | ... | ... | ... | ... | ... |
46714 | Ordering quick and easy Delivery Not so! | No question about it. AND the... | 5 | Linda B. | United Kingdom | 2017-03-04 | Simba |
46715 | Amazing nights sleep | The mattress has reduced my ba... | 5 | Micaela A. | United Kingdom | 2017-03-04 | Simba |
46716 | Simba | Matttess was great. Exactly as... | 5 | Customer | United Kingdom | 2017-03-04 | Simba |
46717 | Easy to order | Best mattress around, fantasti... | 5 | Ratpig | United Kingdom | 2017-03-04 | Simba |
46718 | No support!! | Good service and quick deliver... | 5 | Anthony C | United Kingdom | 2017-03-04 | Simba |
46712 rows × 7 columns
That was just 7 rows, but it also looks like there are 3,160 reviews with no text. Since the idea is ultimately to use natural language processing on the review text, we'll drop these rows and move on.
all_reviews = all_reviews[all_reviews['Review'] != 'None']
#convert rating to an integer
all_reviews['Rating'] = all_reviews['Rating'].astype(int)
#convert Header, Review, Name, Location and Company to strings
all_reviews['Header'] = all_reviews['Header'].astype("string")
all_reviews['Review'] = all_reviews['Review'].astype("string")
all_reviews['Name'] = all_reviews['Name'].astype("string")
all_reviews['Location'] = all_reviews['Location'].astype("string")
all_reviews['Company'] = all_reviews['Company'].astype("string")
all_reviews.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 43559 entries, 0 to 46718 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Header 43559 non-null string 1 Review 43559 non-null string 2 Rating 43559 non-null int64 3 Name 43559 non-null string 4 Location 43559 non-null string 5 Date 43559 non-null datetime64[ns] 6 Company 43559 non-null string dtypes: datetime64[ns](1), int64(1), string(5) memory usage: 2.7 MB
mat_ratings = all_reviews.pivot_table(values='Rating', index='Company', aggfunc=np.mean, margins=True)
mat_ratings.plot(kind='barh', xlim=(4,5), title='Mean Ratings by Company', legend=False)
<matplotlib.axes._subplots.AxesSubplot at 0x12decff70>
So it looks like Simba, Otty and Eve are above the average, with the others below. Let's look closer at Emma, which has the worst average rating.
all_reviews['Rating'][all_reviews['Company'] == 'Emma'].value_counts(sort=False).plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x1490b5610>
Tough competition: they have overwhelmingly 5-star reviews! Since each company has a different number of reviews, we'll convert the ratings to proportions of each company's total and then plot them to see what's happening.
names = ['Emma', 'Eve', 'Casper', 'Simba', 'SilentNight', 'Otty', 'Leesa']
proportions = pd.DataFrame()
for name in names:
    subset = all_reviews[all_reviews['Company'] == name]
    calc = subset['Rating'].value_counts(normalize=True, sort=False)
    proportions[name] = calc
proportions
Emma | Eve | Casper | Simba | SilentNight | Otty | Leesa | |
---|---|---|---|---|---|---|---|
1 | 0.098049 | 0.053238 | 0.071545 | 0.045284 | 0.089217 | 0.038298 | 0.049628 |
2 | 0.028131 | 0.013970 | 0.029878 | 0.014556 | 0.034860 | 0.015911 | 0.026468 |
3 | 0.045483 | 0.014159 | 0.053252 | 0.022716 | 0.049335 | 0.032747 | 0.066170 |
4 | 0.129261 | 0.045497 | 0.147561 | 0.066823 | 0.122895 | 0.092877 | 0.143921 |
5 | 0.699076 | 0.873136 | 0.697764 | 0.850621 | 0.703693 | 0.820167 | 0.713813 |
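As an aside, the same table can be produced in one call with `pd.crosstab` and `normalize='index'`. A sketch on a toy frame standing in for `all_reviews`:

```python
import pandas as pd

# toy data with the same columns as all_reviews
toy = pd.DataFrame({'Company': ['Emma', 'Emma', 'Eve', 'Eve'],
                    'Rating':  [5, 1, 5, 5]})
# rows are companies, columns are ratings, values are per-company proportions
props = pd.crosstab(toy['Company'], toy['Rating'], normalize='index')
print(props)
```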
proportions.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x143670dc0>
It's a little bit hard to compare here, let's create a stacked plot.
import seaborn as sns
sns.set()
proportions.T.plot(kind='bar', stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0x150061bb0>
It looks like Emma has the highest proportion of 1-star ratings, which is obviously dragging their average down. Let's now see how the companies' ratings have changed over time. Since each company covers a different period, we'll handle them separately.
# define a function to subset by company
def subset(company):
    return all_reviews[all_reviews['Company'] == company]
emma_clean = subset('Emma')
emma_clean
Header | Review | Rating | Name | Location | Date | Company | |
---|---|---|---|---|---|---|---|
0 | I ordered the hybrid mattress | I ordered the hybrid mattress.... | 5 | Ramiro Cali-Corleo | Malta | 2020-09-05 | Emma |
1 | I much enjoy my Emma mattress together… | I much enjoy my Emma mattress ... | 5 | Margrit Dahm | United Kingdom | 2020-09-05 | Emma |
2 | Amazing mattress and impeccable customer service | This mattress is amazing and t... | 5 | Emma | United Kingdom | 2020-09-05 | Emma |
3 | My son loves his Mattress | My son loves his Mattress can ... | 5 | Annie | United Kingdom | 2020-09-05 | Emma |
4 | Best nights sleep ever | Having bought a double and a s... | 5 | Lisa Silver | United Kingdom | 2020-09-05 | Emma |
... | ... | ... | ... | ... | ... | ... | ... |
10378 | Good products, prompt deliveries and good cust... | So pleased we trusted the revi... | 5 | Helen | United Kingdom | 2018-02-20 | Emma |
10379 | Sadly after 6 weeks on this mattress it… | Excellent company to deal with... | 4 | Berni Mackinnon | United Kingdom | 2018-02-20 | Emma |
10380 | I am a convert! | Incredible comfort, my sore ba... | 5 | Customer Esstee. | United Kingdom | 2018-02-20 | Emma |
10381 | Happy customer | This is a comfortable mattress... | 4 | jjgirl | United Kingdom | 2018-02-20 | Emma |
10382 | My wife has had both hips replaced and… | Best bed we have had, we both ... | 5 | Ian Newsham | United Kingdom | 2018-02-20 | Emma |
9740 rows × 7 columns
# we calculate the mean after each review, so we expect more variation at the beginning
count = 0
rolling_total = 0
mean_values = []
emma_clean = emma_clean.sort_values(by='Date')
for rating in emma_clean['Rating']:
    rolling_total += rating
    count += 1
    mean_values.append(rolling_total / count)
emma_rolling_mean = emma_clean.copy()
emma_rolling_mean['Rolling_mean'] = mean_values
emma_rolling_mean
Header | Review | Rating | Name | Location | Date | Company | Rolling_mean | |
---|---|---|---|---|---|---|---|---|
10382 | My wife has had both hips replaced and… | Best bed we have had, we both ... | 5 | Ian Newsham | United Kingdom | 2018-02-20 | Emma | 5.000000 |
10380 | I am a convert! | Incredible comfort, my sore ba... | 5 | Customer Esstee. | United Kingdom | 2018-02-20 | Emma | 4.500000 |
10379 | Sadly after 6 weeks on this mattress it… | Excellent company to deal with... | 4 | Berni Mackinnon | United Kingdom | 2018-02-20 | Emma | 4.428571 |
10378 | Good products, prompt deliveries and good cust... | So pleased we trusted the revi... | 5 | Helen | United Kingdom | 2018-02-20 | Emma | 4.500000 |
10377 | The most comfortable mattress I have… | A brilliant product - a comfy ... | 5 | Alison | United Kingdom | 2018-02-20 | Emma | 4.555556 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
3 | My son loves his Mattress | My son loves his Mattress can ... | 5 | Annie | United Kingdom | 2020-09-05 | Emma | 4.302682 |
2 | Amazing mattress and impeccable customer service | This mattress is amazing and t... | 5 | Emma | United Kingdom | 2020-09-05 | Emma | 4.302610 |
1 | I much enjoy my Emma mattress together… | I much enjoy my Emma mattress ... | 5 | Margrit Dahm | United Kingdom | 2020-09-05 | Emma | 4.302538 |
4 | Best nights sleep ever | Having bought a double and a s... | 5 | Lisa Silver | United Kingdom | 2020-09-05 | Emma | 4.302753 |
0 | I ordered the hybrid mattress | I ordered the hybrid mattress.... | 5 | Ramiro Cali-Corleo | Malta | 2020-09-05 | Emma | 4.303183 |
9740 rows × 8 columns
This seemed to work, so we'll define a function below:
def rolling_mean(company_sub):
    count = 0
    rolling_total = 0
    reviews = []
    mean_values = []
    company_sub = company_sub.sort_values(by='Date')
    for rating in company_sub['Rating']:
        rolling_total += rating
        count += 1
        # calculate the mean and append it
        mean_values.append(rolling_total / count)
        # keep track of the number of reviews so far
        reviews.append(count)
    company_sub = company_sub.copy()
    company_sub['Rolling_mean'] = mean_values
    company_sub['Accumulated_reviews'] = reviews
    return company_sub
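What the loop computes is really a cumulative (expanding) mean, which pandas can also do directly with `Series.expanding().mean()`. A sketch on a toy frame standing in for a company subset:

```python
import pandas as pd

# toy data with the same columns as a company subset
toy = pd.DataFrame({'Date': pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03']),
                    'Rating': [5, 4, 3]}).sort_values('Date')
toy['Rolling_mean'] = toy['Rating'].expanding().mean()
toy['Accumulated_reviews'] = range(1, len(toy) + 1)
print(toy['Rolling_mean'].tolist())  # [5.0, 4.5, 4.0]
```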
emma_clean = subset('Emma')
eve_clean = subset('Eve')
casper_clean = subset('Casper')
simba_clean = subset('Simba')
silentnight_clean = subset('SilentNight')
leesa_clean = subset('Leesa')
otty_clean = subset('Otty')
emma_rolling_mean = rolling_mean(emma_clean)
eve_rolling_mean = rolling_mean(eve_clean)
casper_rolling_mean = rolling_mean(casper_clean)
simba_rolling_mean = rolling_mean(simba_clean)
silentnight_rolling_mean = rolling_mean(silentnight_clean)
leesa_rolling_mean = rolling_mean(leesa_clean)
otty_rolling_mean = rolling_mean(otty_clean)
all_rolling_mean = pd.concat([emma_rolling_mean, eve_rolling_mean, casper_rolling_mean, simba_rolling_mean, silentnight_rolling_mean, leesa_rolling_mean, otty_rolling_mean], axis=0, ignore_index=True)
import seaborn as sns
sns.lineplot(data=all_rolling_mean, x='Date', y='Rolling_mean', hue='Company')
<matplotlib.axes._subplots.AxesSubplot at 0x14f908dc0>
Ok, so as expected we see large variations when the companies first start receiving reviews, before the rolling average stabilizes. Eve and Simba have always had very good reviews, but there is a slight downward trend, which also seems to be the case for Silent Night. Leesa and Otty are relatively stable, while Casper has recently dipped. Emma, despite having the lowest overall average, appears to be the only company on an upward trend. Good news for Emma!
I wonder if there is a link with the rate at which new customer reviews arrive. I did notice, while perfecting the scraping code, that Emma's review count was continuously increasing. Let's check that out.
sns.lineplot(data=all_rolling_mean, x='Date', y='Accumulated_reviews', hue='Company')
<matplotlib.axes._subplots.AxesSubplot at 0x1436ac5b0>
So there's probably no link with the rate of reviews, since Emma is on par with Simba, although these two companies collect far more reviews than the others. Casper seems to have really leveled off though. I just checked their website and they've stopped selling in Europe to concentrate on the North American market. We'll keep them in the data set to see what people were saying about them; maybe we can see why they closed European operations.
One last thing I wanted to check was whether we can do a break down of customer location.
all_rolling_mean['Location'].describe()
count 43559 unique 95 top United Kingdom freq 39888 Name: Location, dtype: object
The UK really dominates the reviews, so I don't think this would add much. Maybe in the future we can gather non-English reviews for further insights.
Earlier we identified Emma as having the lowest average reviews, so let's process the reviews and see if we can identify why customers are so unhappy.
We have our data set already looking good for the purposes of examining the numbers, but now we want to break down the reviews for each company and that's going to require a bit more cleaning. Briefly, here are the steps we'll take:
Let's begin there and see how this progresses. We'll start with the Emma dataset, which has already been subsetted (emma_clean) from earlier.
import nltk
from nltk import FreqDist
#we quickly look at the frequency of words in the uncleaned dataset
def freq_words(x, terms = 30):
    all_words = ' '.join([text for text in x])
    all_words = all_words.split()
    fdist = FreqDist(all_words)
    words_df = pd.DataFrame({'word':list(fdist.keys()), 'count':list(fdist.values())})
    # selecting top 30 most frequent words
    d = words_df.nlargest(columns="count", n = terms)
    plt.figure(figsize=(20,5))
    ax = sns.barplot(data=d, x= "word", y = "count")
    ax.set(ylabel = 'Count')
    plt.show()
freq_words(emma_clean['Review'])
Although there are relevant words like 'mattress' and 'sleep', the majority are not at all informative. We're going to proceed with spacy to clean the data.
import re
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
emma_clean = subset('Emma')
#get punctuation list
punctuations = '’!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
#create a list of stop words
stop_words = spacy.lang.en.stop_words.STOP_WORDS
#create a function to remove stopwords and punctuation
def remove_stopword(review):
    review_new = ' '.join([word for word in review if word not in stop_words and word not in punctuations])
    return review_new
emma_clean_copy = emma_clean.copy()
#remove stopwords
reviews = [remove_stopword(r.split()) for r in emma_clean_copy['Review']]
#make lowercase and remove remaining punctuation
for punc in punctuations:
    reviews = [r.replace(punc, " ") for r in reviews]
reviews = [r.strip().lower() for r in reviews]
#remove extra spaces
reviews = [re.sub(' +', ' ', r) for r in reviews]
#remove words with 3 letters or less
reviews = [re.sub(r'\b\w{1,3}\b', '', r) for r in reviews]
freq_words(reviews)
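As a quick sanity check on those regex steps, here is the punctuation and short-word cleaning applied to a single made-up sentence (the sentence is illustrative, not from the data set; the stopword step is omitted here):

```python
import re

punctuations = '’!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
text = "Great mattress!! I sleep so, SO well -- no aches."
for punc in punctuations:
    text = text.replace(punc, " ")       # punctuation becomes spaces
text = text.strip().lower()
text = re.sub(' +', ' ', text)           # collapse repeated spaces
text = re.sub(r'\b\w{1,3}\b', '', text)  # drop words of 3 letters or fewer
text = re.sub(' +', ' ', text).strip()   # tidy up the gaps left behind
print(text)  # great mattress sleep well aches
```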
This is looking better. 'Mattress' is the most common word, which makes sense. We've removed a bunch of words that don't mean anything by themselves and now have words like 'delivery', 'service', 'order' and 'pain', which seem relevant. In these top 30 words we also see 'sleep' and 'sleeping', which we can fix by lemmatizing. This reduces words to their basic form, so 'sleep' and 'sleeping' will both become 'sleep'.
We can further reduce the word list to nouns, adjectives and verbs, removing things like 'very' and 'this'.
import en_core_web_sm
nlp = en_core_web_sm.load()
def lemmatization(texts, tags=['NOUN', 'ADJ', 'VERB']):
    lemma_out = []
    for i in texts:
        doc = nlp(" ".join(i))
        lemma_out.append([token.lemma_ for token in doc if token.pos_ in tags])
    return lemma_out
#tokenize each review
tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])
['enjoy', 'emma', 'mattress', 'emma', 'pillows', 'they', 'help', 'good', 'night', 'rest']
We get a good idea of what the first review is talking about still.
reviews_lemma = lemmatization(tokenized_reviews)
print(reviews_lemma[1])
['enjoy', 'pillow', 'help', 'good', 'night', 'rest']
emma_clean_copy_2 = emma_clean.copy()
#add the lemmatized and cleaned reviews back to the original tables
reviews_3 = []
for i in range(len(reviews_lemma)):
    reviews_3.append(' '.join(reviews_lemma[i]))
emma_clean_copy_2['Reviews'] = reviews_3
freq_words(emma_clean_copy_2['Reviews'])
Looks like we have some relevant words here. Let's build a Latent Dirichlet Allocation (LDA) model. LDA is a generative probabilistic model that tries to discover topics. Essentially, it takes a number of documents (in this case reviews) and assumes that each is a mixture of topics. It then backtracks and tries to figure out which topics could have generated those reviews in the first place.
import gensim
from gensim import corpora
# Create a dictionary of words from all reviews
dictionary = corpora.Dictionary(reviews_lemma)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_lemma]
# Creating the object for LDA model using gensim library
LDA = gensim.models.ldamodel.LdaModel
# Build LDA model
lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=10, random_state=100,
chunksize=200, passes=50)
lda_model.print_topics()
[(0, '0.104*"mattress" + 0.065*"comfortable" + 0.050*"great" + 0.047*"delivery" + 0.034*"good" + 0.029*"sleep" + 0.024*"recommend" + 0.024*"excellent" + 0.023*"easy" + 0.019*"service"'), (1, '0.155*"sleep" + 0.124*"night" + 0.067*"good" + 0.061*"mattress" + 0.053*"pain" + 0.040*"comfortable" + 0.027*"wake" + 0.025*"ache" + 0.019*"feel" + 0.017*"get"'), (2, '0.059*"delivery" + 0.054*"order" + 0.039*"email" + 0.037*"company" + 0.029*"day" + 0.029*"week" + 0.029*"refund" + 0.026*"receive" + 0.026*"return" + 0.025*"tell"'), (3, '0.107*"mattress" + 0.056*"recommend" + 0.040*"buy" + 0.035*"year" + 0.033*"comfortable" + 0.032*"emma" + 0.026*"good" + 0.022*"would" + 0.021*"sleep" + 0.019*"purchase"'), (4, '0.083*"customer" + 0.071*"service" + 0.031*"send" + 0.029*"emma" + 0.027*"mattress" + 0.021*"company" + 0.020*"say" + 0.019*"phone" + 0.017*"contact" + 0.015*"work"'), (5, '0.053*"deliver" + 0.052*"edge" + 0.036*"summer" + 0.035*"efficient" + 0.032*"happen" + 0.029*"daughter" + 0.029*"warm" + 0.026*"simple" + 0.023*"terrible" + 0.022*"resolve"'), (6, '0.070*"mattress" + 0.032*"take" + 0.030*"time" + 0.027*"week" + 0.022*"month" + 0.019*"day" + 0.019*"go" + 0.018*"find" + 0.016*"arrive" + 0.015*"trial"'), (7, '0.035*"change" + 0.024*"difficult" + 0.020*"turn" + 0.018*"move" + 0.016*"need" + 0.015*"stay" + 0.015*"sheet" + 0.015*"cover" + 0.015*"leave" + 0.015*"notice"'), (8, '0.100*"firm" + 0.085*"soft" + 0.063*"mattress" + 0.057*"foam" + 0.041*"find" + 0.040*"feel" + 0.039*"memory" + 0.031*"little" + 0.029*"perfect" + 0.025*"comfortable"'), (9, '0.081*"money" + 0.053*"value" + 0.044*"cover" + 0.032*"pick" + 0.030*"reason" + 0.025*"good" + 0.023*"keep" + 0.023*"complaint" + 0.023*"great" + 0.022*"promise"')]
import pyLDAvis
import pyLDAvis.gensim
# Visualize the different topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, doc_term_matrix, dictionary)
vis
Above we are visualizing our 10 topics. On the left the topics are represented by circles whose centres are placed by calculating the Jensen-Shannon divergence between topics and then using multidimensional scaling for inter-topic distances. In other words, the further apart the circles are, the more different the topics are likely to be. The size of the circle represents the prevalence of the topic.
On the right-hand side we have an overlay of the words and their frequency in the entire dataset (blue) against their prevalence in the selected topic. The λ slider lets you rank the terms by relevance; the LDAvis documentation suggests 0.6 as the optimal value.
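For intuition, the Jensen-Shannon divergence underlying those inter-topic distances can be computed in a few lines. A toy sketch with made-up distributions, not the actual model output:

```python
import numpy as np

def jensen_shannon(p, q):
    """JS divergence with base-2 logs, so the result lies in [0, 1]."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # treat 0 * log(0) as 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jensen_shannon([0.5, 0.5], [0.5, 0.5]))  # 0.0 -- identical topics
print(jensen_shannon([1, 0], [0, 1]))          # 1.0 -- completely disjoint topics
```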
The topics predicted seem relevant. For example, topic 9 appears to be about returns and refunds, topic 4 is perhaps more unhappy customers reporting aches and pains, while in topic 7 the top 3 terms are 'would', 'recommend', 'friend', so probably some happy customers there. This topic modelling might be more informative if we look at topics of unhappy customers, or those who rated the company as 1 or 2 stars. In this way we can look at the pain points from the customer perspective.
But first we should look at topic coherence, that is the average similarity between the top words in a given topic. We might be able to improve the topic coherence and the model.
from gensim.models import CoherenceModel
coherence_model_lda = CoherenceModel(model=lda_model, texts=reviews_lemma, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Coherence Score: 0.5006080048113015
Let's see if we can improve the coherence score by altering the number of topics. We can then plot this.
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    #our returned values
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                                                random_state=100, chunksize=200, passes=50)
        model_list.append(model)
        coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherence_model.get_coherence())
    return model_list, coherence_values
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus =doc_term_matrix, texts=reviews_lemma, limit=10, start=2, step=1)
#we can plot this
limit=10
start=2
step=1
x=range(start,limit,step)
plt.plot(x,coherence_values)
plt.xlabel('Topics')
plt.ylabel('Coherence score')
plt.legend(('coherence_values'), loc='best')
plt.show()
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(model_list[0], doc_term_matrix, dictionary)
vis
It looks like 2 is the optimal number of topics for the full dataset. At a glance, it looks like they're separated into good and bad reviews, which is probably appropriate. Let's look again at just the bad reviews.
emma_bad_reviews = emma_clean[emma_clean['Rating'].isin([1,2])]
emma_bad_reviews
 | Header | Review | Rating | Name | Location | Date | Company
---|---|---|---|---|---|---|---
5 | The first mattress we got was in April… | The first mattress we got was ... | 2 | Mrs A Mustard | United Kingdom | 2020-09-04 | Emma |
14 | I am a convert! | Sadly after 6 weeks on this ma... | 1 | joanne | United Kingdom | 2020-09-04 | Emma |
25 | Ordered the Emma king-size mattress… | The first mattress we got was ... | 2 | Mrs A Mustard | United Kingdom | 2020-09-04 | Emma |
34 | My wife has had both hips replaced and… | Sadly after 6 weeks on this ma... | 1 | joanne | United Kingdom | 2020-09-04 | Emma |
49 | Appalling customer service | Very disappointed with the cus... | 1 | kevin Miller | United Kingdom | 2020-09-04 | Emma |
... | ... | ... | ... | ... | ... | ... | ... |
10214 | Much softer after short time after purchase | I've only had this since Novem... | 2 | lyndsey | United Kingdom | 2018-02-26 | Emma |
10229 | Just OK | The delivery was the only good... | 1 | Mr David Sims | United Kingdom | 2018-02-25 | Emma |
10264 | Very Happy | I have had this now less than ... | 1 | Jan Cadogan | United Kingdom | 2018-02-23 | Emma |
10302 | Very comfortable initially but will not last. | I bought the emma mattress due... | 2 | Hayley | Ireland | 2018-02-23 | Emma |
10348 | I ordered the hybrid mattress | Ordered superking on 8th Jan. ... | 1 | Stan Owen | United Kingdom | 2018-02-21 | Emma |
1229 rows × 7 columns
#remove stopwords
reviews = [remove_stopword(r.split()) for r in emma_bad_reviews['Review']]
#make lowercase and remove remaining punctuation
for punc in punctuations:
    reviews = [r.replace(punc, " ") for r in reviews]
reviews = [r.strip().lower() for r in reviews]
#remove words with 3 letters or less
reviews = [re.sub(r'\b\w{1,3}\b', '', r) for r in reviews]
#remove extra spaces
reviews = [re.sub(' +', ' ', r) for r in reviews]
#tokenize each review
tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])
['sadly', 'weeks', 'mattress', 'wasn', 'soft', 'firm', 'topper', 'sent', 'free', 'made', 'difference', 'sadly', 'waking', 'loads', 'night', 'absolutely', 'boiling', 'sweating', 'however', 'fault', 'emma', 'customer', 'service', 'anyway', 'very', 'prompt', 'very', 'understanding', 'delivery', 'collection', 'fantastic']
reviews_lemma = lemmatization(tokenized_reviews)
print(reviews_lemma[1])
['week', 'soft', 'firm', 'topper', 'send', 'make', 'difference', 'wake', 'load', 'night', 'boil', 'sweat', 'fault', 'emma', 'customer', 'service', 'prompt', 'understanding', 'delivery', 'collection', 'fantastic']
#add the lemmatized and cleaned reviews back to the original tables
reviews_3 = []
for i in range(len(reviews_lemma)):
    reviews_3.append(' '.join(reviews_lemma[i]))
emma_bad_reviews = emma_bad_reviews.copy()
emma_bad_reviews['Reviews'] = reviews_3
freq_words(emma_bad_reviews['Reviews'])
Since these are the bad reviews, I think we can safely say that there are some issues with customer service and deliveries based on this plot. Let's look at the topics.
# Create a dictionary of words from all bad reviews
dictionary = corpora.Dictionary(reviews_lemma)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_lemma]
# Build LDA model with coherence values
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus =doc_term_matrix, texts=reviews_lemma, limit=15, start=2, step=1)
#we can plot this
limit=15
start=2
step=1
x=range(start,limit,step)
plt.plot(x,coherence_values)
plt.xlabel('Topics')
plt.ylabel('Coherence score')
plt.legend(('coherence_values'), loc='best')
plt.show()
It seems like 11 or 12 is giving us the best scores here, let's quickly check.
print("\nNumber of Topics: 11", "Coherence Value: ", coherence_values[9])
print("\nNumber of Topics: 12", "Coherence Value: ", coherence_values[10])
Number of Topics: 11 Coherence Value: 0.4883790515167501 Number of Topics: 12 Coherence Value: 0.48810269694729547
11 topics is the best by a narrow margin, so we'll proceed with that.
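Rather than eyeballing indices into coherence_values, a small helper can translate the best index back into a topic count. A hypothetical convenience function, matching the start/step arguments used above:

```python
def best_num_topics(coherence_values, start=2, step=1):
    # index of the highest coherence score, translated back to a topic count
    best_idx = max(range(len(coherence_values)), key=lambda i: coherence_values[i])
    return start + best_idx * step

print(best_num_topics([0.44, 0.50, 0.48]))  # 3 -- the second model, i.e. 3 topics
```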
# Visualize the different topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(model_list[9], doc_term_matrix, dictionary)
vis
Wow, this is where customers are clearly upset. It looks like the biggest pain points have to do with customer service and deliveries, as defined by topics 1 and 2. Topic 5 suggests there might be a strong smell as well, not ideal. To give a little more context, let's take a look at the bigrams, where we take pairs of consecutive words.
from nltk import bigrams
bigram_terms = [list(bigrams(review)) for review in tokenized_reviews]
import itertools
import collections
# Flatten list of bigrams across all cleaned reviews
# (using a new name so we don't shadow the imported bigrams function)
all_bigrams = list(itertools.chain(*bigram_terms))
# Create counter of words in clean bigrams
bigram_counts = collections.Counter(all_bigrams)
bigram_counts.most_common(20)
[(('customer', 'service'), 575), (('emma', 'mattress'), 320), (('ordered', 'mattress'), 128), (('working', 'days'), 96), (('customer', 'services'), 91), (('phone', 'number'), 91), (('delivery', 'date'), 88), (('order', 'number'), 83), (('mattress', 'delivered'), 74), (('received', 'email'), 65), (('customer', 'support'), 64), (('return', 'mattress'), 62), (('contacted', 'emma'), 60), (('mattress', 'arrived'), 58), (('mattress', 'collected'), 58), (('telephone', 'number'), 54), (('cancelled', 'order'), 53), (('placed', 'order'), 51), (('days', 'later'), 49), (('cancel', 'order'), 46)]
bigram_df = pd.DataFrame(bigram_counts.most_common(30), columns=['bigram', 'count'])
def freq_bigrams(x, terms = 30):
    d = x.nlargest(columns="count", n = terms)
    plt.figure(figsize=(20,5))
    ax = sns.barplot(data=d, x= "bigram", y = "count")
    ax.set(ylabel = 'Count')
    for item in ax.get_xticklabels():
        item.set_rotation(90)
    plt.show()
freq_bigrams(bigram_df)
This really just confirms what we saw in our topic models, with 'customer service' taking the top spot for negative reviews.
Now we'll continue on with other companies to compare. As a reminder, these are the other companies:
To begin with, we'll define some functions to simplify what we've done above.
def clean_reviews(company_subset):
    #remove stopwords
    reviews = [remove_stopword(r.split()) for r in company_subset['Review']]
    #make lowercase and remove remaining punctuation
    for punc in punctuations:
        reviews = [r.replace(punc, " ") for r in reviews]
    reviews = [r.strip().lower() for r in reviews]
    #remove words with 3 letters or less
    reviews = [re.sub(r'\b\w{1,3}\b', '', r) for r in reviews]
    #remove extra spaces
    reviews = [re.sub(' +', ' ', r) for r in reviews]
    return reviews
def token_lemma(reviews):
    #tokenize each review
    tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
    reviews_lemma = lemmatization(tokenized_reviews)
    return tokenized_reviews, reviews_lemma
def get_LDA_vis(reviews_lemma, num_topics):
    # Create a dictionary of words from the reviews
    dictionary = corpora.Dictionary(reviews_lemma)
    doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_lemma]
    # Build LDA model
    lda_model = gensim.models.ldamodel.LdaModel(corpus=doc_term_matrix, id2word=dictionary, num_topics=num_topics,
                                                random_state=100, chunksize=200, passes=50)
    # Visualize the different topics
    pyLDAvis.enable_notebook()
    vis = pyLDAvis.gensim.prepare(lda_model, doc_term_matrix, dictionary)
    return vis
def get_bigrams(tokenized_reviews):
    # materialize each review's bigrams up front
    bigram_terms = [list(bigrams(review)) for review in tokenized_reviews]
    # Flatten list of bigrams in clean review
    tok_bigrams = list(itertools.chain(*bigram_terms))
    # Create counter of words in clean bigrams
    bigram_counts = collections.Counter(tok_bigrams)
    bigram_df = pd.DataFrame(bigram_counts.most_common(30), columns=['bigram', 'count'])
    return bigram_df
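The coherence-plotting boilerplate gets repeated for every company below, so it could be wrapped up too. A sketch (the 'Agg' backend line just lets it run without a display; it's unnecessary inside a notebook):

```python
import matplotlib
matplotlib.use('Agg')  # off-screen rendering; not needed in a notebook
import matplotlib.pyplot as plt

def plot_coherence(coherence_values, start=2, limit=15, step=1):
    # x-axis: the topic counts the candidate models were built with
    x = list(range(start, limit, step))
    plt.figure()
    plt.plot(x, coherence_values[:len(x)])
    plt.xlabel('Topics')
    plt.ylabel('Coherence score')
    plt.show()
    return x
```

For example, `plot_coherence(casper_bad_coherence_values)` would replace the six-line plotting block used for each company.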
casper_sub = subset('Casper')
casper_clean = clean_reviews(casper_sub)
casper_token, casper_lemma = token_lemma(casper_clean)
casper_LDA = get_LDA_vis(casper_lemma, 10)
casper_LDA
There are some overwhelmingly positive reviews here; let's look at the bigrams.
casper_bigrams = get_bigrams(casper_token)
freq_bigrams(casper_bigrams)
Seems like we have a few German reviews mixed in, which is ok. Fast delivery ('schnelle lieferung') and very satisfied ('sehr zufrieden') are among the top terms, in contrast to Emma, whose reviews focused more on comfort. Let's switch to the negative reviews quickly.
casper_bad_reviews = casper_sub[casper_sub['Rating'].isin([1,2])]
casper_bad_clean = clean_reviews(casper_bad_reviews)
casper_bad_token, casper_bad_lemma = token_lemma(casper_bad_clean)
#define values for LDA model
dictionary = corpora.Dictionary(casper_bad_lemma)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in casper_bad_lemma]
# Build LDA model with coherence values
casper_bad_model_list, casper_bad_coherence_values = compute_coherence_values(dictionary=dictionary, corpus =doc_term_matrix, texts=casper_bad_lemma, limit=15, start=2, step=1)
limit=15
start=2
step=1
x=range(start,limit,step)
plt.plot(x,casper_bad_coherence_values)
plt.xlabel('Topics')
plt.ylabel('Coherence score')
plt.legend(('coherence_values'), loc='best')
plt.show()
# 5 topics is the winner for Casper
casper_bad_LDA = get_LDA_vis(casper_bad_lemma, 5)
casper_bad_LDA
Customer service is also a big pain point here.
casper_bad_bigrams = get_bigrams(casper_bad_token)
freq_bigrams(casper_bad_bigrams)
Again some patterns emerge with poor customer service and no answer (keine antwort), cancellations of orders and orders not arriving.
silentnight_sub = subset('SilentNight')
silentnight_clean = clean_reviews(silentnight_sub)
silentnight_token, silentnight_lemma = token_lemma(silentnight_clean)
silentnight_LDA = get_LDA_vis(silentnight_lemma, 10)
silentnight_LDA
silentnight_bigrams = get_bigrams(silentnight_token)
freq_bigrams(silentnight_bigrams)
silentnight_bad_reviews = silentnight_sub[silentnight_sub['Rating'].isin([1,2])]
silentnight_bad_clean = clean_reviews(silentnight_bad_reviews)
silentnight_bad_token, silentnight_bad_lemma = token_lemma(silentnight_bad_clean)
#define values for LDA model
dictionary = corpora.Dictionary(silentnight_bad_lemma)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in silentnight_bad_lemma]
# Build LDA model with coherence values
silentnight_bad_model_list, silentnight_bad_coherence_values = compute_coherence_values(dictionary=dictionary, corpus =doc_term_matrix, texts=silentnight_bad_lemma, limit=15, start=2, step=1)
# Plot results
limit=15
start=2
step=1
x=range(start,limit,step)
plt.plot(x,silentnight_bad_coherence_values)
plt.xlabel('Topics')
plt.ylabel('Coherence score')
plt.legend(('coherence_values'), loc='best')
plt.show()
silentnight_bad_LDA = get_LDA_vis(silentnight_bad_lemma, 3)
silentnight_bad_LDA
silentnight_bad_bigrams = get_bigrams(silentnight_bad_token)
freq_bigrams(silentnight_bad_bigrams)
Let's switch to the highest rated company and also the company with the most reviews.
simba_sub = subset('Simba')
simba_clean = clean_reviews(simba_sub)
simba_token, simba_lemma = token_lemma(simba_clean)
simba_LDA = get_LDA_vis(simba_lemma, 10)
simba_LDA
simba_bigrams = get_bigrams(simba_token)
freq_bigrams(simba_bigrams)
simba_bad_reviews = simba_sub[simba_sub['Rating'].isin([1,2])]
simba_bad_clean = clean_reviews(simba_bad_reviews)
simba_bad_token, simba_bad_lemma = token_lemma(simba_bad_clean)
#define values for LDA model
dictionary = corpora.Dictionary(simba_bad_lemma)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in simba_bad_lemma]
# Build LDA model with coherence values
simba_bad_model_list, simba_bad_coherence_values = compute_coherence_values(dictionary=dictionary, corpus =doc_term_matrix, texts=simba_bad_lemma, limit=15, start=2, step=1)
# Plot results
limit=15
start=2
step=1
x=range(start,limit,step)
plt.plot(x,simba_bad_coherence_values)
plt.xlabel('Topics')
plt.ylabel('Coherence score')
plt.legend(('coherence_values'), loc='best')
plt.show()
simba_bad_LDA = get_LDA_vis(simba_bad_lemma, 9)
simba_bad_LDA
simba_bad_bigrams = get_bigrams(simba_bad_token)
freq_bigrams(simba_bad_bigrams)
leesa_sub = subset('Leesa')
leesa_clean = clean_reviews(leesa_sub)
leesa_token, leesa_lemma = token_lemma(leesa_clean)
leesa_LDA = get_LDA_vis(leesa_lemma, 10)
leesa_LDA
leesa_bigrams = get_bigrams(leesa_token)
freq_bigrams(leesa_bigrams)
leesa_bad_reviews = leesa_sub[leesa_sub['Rating'].isin([1,2])]
leesa_bad_clean = clean_reviews(leesa_bad_reviews)
leesa_bad_token, leesa_bad_lemma = token_lemma(leesa_bad_clean)
#define values for LDA model
dictionary = corpora.Dictionary(leesa_bad_lemma)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in leesa_bad_lemma]
# Build LDA model with coherence values
leesa_bad_model_list, leesa_bad_coherence_values = compute_coherence_values(dictionary=dictionary, corpus =doc_term_matrix, texts=leesa_bad_lemma, limit=15, start=2, step=1)
# Plot results
limit=15
start=2
step=1
x=range(start,limit,step)
plt.plot(x,leesa_bad_coherence_values)
plt.xlabel('Topics')
plt.ylabel('Coherence score')
plt.legend(('coherence_values'), loc='best')
plt.show()
leesa_bad_LDA = get_LDA_vis(leesa_bad_lemma, 14)
leesa_bad_LDA
leesa_bad_bigrams = get_bigrams(leesa_bad_token)
freq_bigrams(leesa_bad_bigrams)
otty_sub = subset('Otty')
otty_clean = clean_reviews(otty_sub)
otty_token, otty_lemma = token_lemma(otty_clean)
otty_LDA = get_LDA_vis(otty_lemma, 10)
otty_LDA
otty_bigrams = get_bigrams(otty_token)
freq_bigrams(otty_bigrams)
otty_bad_reviews = otty_sub[otty_sub['Rating'].isin([1,2])]
otty_bad_clean = clean_reviews(otty_bad_reviews)
otty_bad_token, otty_bad_lemma = token_lemma(otty_bad_clean)
#define values for LDA model
dictionary = corpora.Dictionary(otty_bad_lemma)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in otty_bad_lemma]
# Build LDA model with coherence values
otty_bad_model_list, otty_bad_coherence_values = compute_coherence_values(dictionary=dictionary, corpus =doc_term_matrix, texts=otty_bad_lemma, limit=15, start=2, step=1)
# Plot results
limit=15
start=2
step=1
x=range(start,limit,step)
plt.plot(x,otty_bad_coherence_values)
plt.xlabel('Topics')
plt.ylabel('Coherence score')
plt.legend(('coherence_values'), loc='best')
plt.show()
otty_bad_LDA = get_LDA_vis(otty_bad_lemma, 14)
otty_bad_LDA
otty_bad_bigrams = get_bigrams(otty_bad_token)
freq_bigrams(otty_bad_bigrams)
eve_sub = subset('Eve')
eve_clean = clean_reviews(eve_sub)
eve_token, eve_lemma = token_lemma(eve_clean)
eve_LDA = get_LDA_vis(eve_lemma, 10)
eve_LDA
eve_bigrams = get_bigrams(eve_token)
freq_bigrams(eve_bigrams)
eve_bad_reviews = eve_sub[eve_sub['Rating'].isin([1,2])]
eve_bad_clean = clean_reviews(eve_bad_reviews)
eve_bad_token, eve_bad_lemma = token_lemma(eve_bad_clean)
#define values for LDA model
dictionary = corpora.Dictionary(eve_bad_lemma)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in eve_bad_lemma]
# Build LDA model with coherence values
eve_bad_model_list, eve_bad_coherence_values = compute_coherence_values(dictionary=dictionary, corpus =doc_term_matrix, texts=eve_bad_lemma, limit=15, start=2, step=1)
# Plot results
limit=15
start=2
step=1
x=range(start,limit,step)
plt.plot(x,eve_bad_coherence_values)
plt.xlabel('Topics')
plt.ylabel('Coherence score')
plt.legend(('coherence_values'), loc='best')
plt.show()
eve_bad_LDA = get_LDA_vis(eve_bad_lemma, 2)
eve_bad_LDA
eve_bad_bigrams = get_bigrams(eve_bad_token)
freq_bigrams(eve_bad_bigrams)
Here we set out to look at online mattress companies, and specifically the customer feedback on those companies. We started by scraping data from Trustpilot using BeautifulSoup and ended up with 46,719 reviews. We took a look at each company's average rating over time, and there seems to be a near-universal downward trend. Emma appears to currently have a slight upward trajectory, but this could be short-lived, given that the average hasn't really moved compared to early 2018.
We proceeded to analyze the reviews themselves and attempted to identify topics cropping up in the data. Here's a few key points from this analysis:
Simba and Eve have been around the longest, but also maintain the best average reviews;
All companies are struggling with negative reviews around their customer service and deliveries;
Otty, Casper and Emma seem to have mattresses with a strong smell - not ideal;
Other negative reviews mention mattresses being too soft or too hard; this will always be a problem if there's no try-before-you-buy, since customers will inevitably have different preferences.
This analysis is really just the beginning of what needs to be done for a full analysis, which brings us to other additions.
Some other minor points: