Bed-in-a-box: what's the consumer perspective?

It's pretty important to find the right mattress since we're sleeping on them nightly. I read somewhere that some large proportion of people trust reviews just as much as they trust personal recommendations. Since this whole COVID disaster shut down businesses, it's really important to make sure we're buying a quality product without trying it out. I found this website that summarizes the "best" bed-in-a-box companies, but I needed to look at some reviews from others. The Trustpilot website states that they have 1.1 trillion ratings across 300,000 businesses. This is impressive, but unfortunately I don't have that much time, even in lockdown.

So here we're going to look at reviews of those best companies and try and figure out if there's a stand-out winner. We begin with scraping the trustpilot website. I followed this tutorial here, which was amazing to help write the code.

Scraping trustpilot with BeautifulSoup

Let's start by figuring out how to access the right information. I went to the trustpilot website and looked at the html code of the parts I'm interested in. We'll begin with Emma.

In [140]:
from requests import get
from bs4 import BeautifulSoup
url = 'https://www.trustpilot.com/review/www.emma-mattress.co.uk?languages=all&page=1'
response = get(url)
In [2]:
html_soup = BeautifulSoup(response.text, 'html.parser')
#we start with the user details
user_containers = html_soup.find_all('div', class_ = 'consumer-information__name')
print(type(user_containers))
print(len(user_containers))
<class 'bs4.element.ResultSet'>
20
In [3]:
first = user_containers[0]
first.text
Out[3]:
'\n            Sharon Druce\n        '
In [4]:
# let's look for the review content
review_containers = html_soup.find_all('div', class_ = 'review-content__body')
print(type(review_containers))
print(len(review_containers))
<class 'bs4.element.ResultSet'>
20
In [5]:
first_rev= review_containers[0]
first_rev
Out[5]:
<div class="review-content__body" v-pre="">
<h2 class="review-content__title">
<a class="link link--large link--dark" data-track-link="{'target': 'Single review', 'name': 'review-title'}" href="/reviews/5f52205b02e85708c8dfc5eb">Ok I’ve now had my Emma mattress for a…</a>
</h2>
<p class="review-content__text">
                Ok I’ve now had my Emma mattress for a few months, it’s ok apart from the materials in it causes me to sweat , I thought it had a cooling layer to regulate body temperature, but if I get out during the night the matress actually feels damp and sweaty when I get back in bed,I’ve now removed the mattress protector but there’s no difference . I’ve been putting a bath sheet towel on to lay on , I’ve not even attempted to use the pillow it’s about an inch thick and rock hard so it’s a no no for me, I think I will be returning after the 200 days are up , I’m off to Dreams . I haven’t actually had a good solid nights sleep throughout .
            </p>
</div>

Looks like we can collect the title and review from here.

In [6]:
title = first_rev.h2.a.text
review = first_rev.p.text
print(title)
print(review)
Ok I’ve now had my Emma mattress for a…

                Ok I’ve now had my Emma mattress for a few months, it’s ok apart from the materials in it causes me to sweat , I thought it had a cooling layer to regulate body temperature, but if I get out during the night the matress actually feels damp and sweaty when I get back in bed,I’ve now removed the mattress protector but there’s no difference . I’ve been putting a bath sheet towel on to lay on , I’ve not even attempted to use the pillow it’s about an inch thick and rock hard so it’s a no no for me, I think I will be returning after the 200 days are up , I’m off to Dreams . I haven’t actually had a good solid nights sleep throughout .
            

Would also like to collect the rating, so let's sort that out.

In [7]:
rating_container = html_soup.find_all('div',class_ = "star-rating star-rating--medium")
first_rating = rating_container[0]
first_rating
Out[7]:
<div class="star-rating star-rating--medium">
<img alt="1 star: Bad" src="https://cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-1.svg"/>
</div>
In [8]:
stars = first_rating.img['alt'][0]
stars
Out[8]:
'1'

Cool, we can just select the first letter of the string since it's in the format '1 star: Bad'. Now onto the dates.

In [9]:
dates_container = html_soup.find_all('div',class_ = "review-content-header__dates")
first_date = dates_container[0]
first_date
Out[9]:
<div class="review-content-header__dates">
<script data-initial-state="review-dates" type="application/json">
{"publishedDate":"2020-09-04T11:09:15Z","updatedDate":null,"reportedDate":null}
</script>
<review-dates :published-date="publishedDate" :reported-date="reportedDate" :updated-date="updatedDate"></review-dates>
</div>
In [10]:
date = first_date.script
date
Out[10]:
<script data-initial-state="review-dates" type="application/json">
{"publishedDate":"2020-09-04T11:09:15Z","updatedDate":null,"reportedDate":null}
</script>
In [11]:
import re
date = str(date)
print(date)
result = re.search('publishedDate":(.*),"updatedDate"', date)
print(result.group(1)[1:11])
<script data-initial-state="review-dates" type="application/json">
{"publishedDate":"2020-09-04T11:09:15Z","updatedDate":null,"reportedDate":null}
</script>
2020-09-04

There might be a better way to get the date, but this seems to be an ok workaround, let's see if it scales (update: it did).

One last thing is to get the link to the poster location so we can see where people are from. We'll collect the user ID and then add this to the base address.

In [12]:
profile_link_containers = html_soup.find_all('div', class_ = "review-card")
first_profile = profile_link_containers[0].aside.a['href']
first_profile
Out[12]:
'/users/5b22bbab4de5666d34749b69'

Alright, I think that's everything we need so I'm going to define a function tying it all together.

In [120]:
from time import sleep
from random import randint
import re
from requests import get
import pandas as pd
from IPython.core.display import clear_output
from time import time

# we'll do this with a function in case we want to add different companies to the analysis later
def scrape_reviews(PATH, n_pages):
    
    requests = 0
    
    #data we collect will be stored in lists
    names = []
    ratings = []
    headers = []
    reviews = []
    dates = []
    locations = []
    
    for p in range(n_pages):
        start_time = time()
        # sleep time to pause the loop between page collections, I had this shorter, but the host disrupted the collection, I guess it looked like a bot.
        sleep(randint(8,13))
        
        # make a get request
        response = get(f'{PATH}{p}')
        
        # Monitor the requests
        requests += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
        clear_output(wait = True)

        
        # Parse the content with BSoup
        page_html = BeautifulSoup(response.text, 'html.parser')
        
        # Select all of the relevant containers from a page
        review_containers = page_html.find_all('div', class_ = 'review-content__body')
        user_containers = page_html.find_all('div', class_ = 'consumer-information__name')
        rating_container = page_html.find_all('div',class_ = "star-rating star-rating--medium")
        date_container = page_html.find_all('div',class_ = "review-content-header__dates")
        profile_link_containers = page_html.find_all('div', class_ = "review-card")
        
        # now we start populating our lists
        for x in range(len(review_containers)):

            review_c = review_containers[x]
            #here we need to account for cases where reviews have been reported and return nothing. This happens when reviews are reported. Also those reviews that have a title, but no text in the review need to be accounted for.
            if review_c is None:
                headers.append('None')
                reviews.append('None')
            if review_c.p is None:
                reviews.append('None')
            else:
                reviews.append(review_c.p.text)
                headers.append(review_c.h2.a.text)
                
            reviewer = user_containers[x]
            names.append(reviewer.text)
            rating = rating_container[x]
            ratings.append(rating.img['alt'][0])
            date = date_container[x].script
            date = str(date)
            result = re.search('publishedDate":(.*),"updatedDate"', date)
            dates.append(result.group(1)[1:11])


            prof = profile_link_containers[x]
            link = 'https://www.trustpilot.com'+ prof.aside.a['href']
            c_profile = get(f'{link}')
            csoup = BeautifulSoup(c_profile.text, 'html.parser')
            cust_container = csoup.find('div', class_ = 'user-summary-location')
            #some people apparently don't specify a location, an error was raised in testing, so we'll make sure it still works in that case
            if cust_container is None:
                locations.append('None')
            else:
                locations.append(cust_container.text)
            
    rev_df = pd.DataFrame(list(zip(headers, reviews, ratings, names, locations, dates)),
                  columns = ['Header','Review','Rating', 'Name', 'Location', 'Date'])
    
    rev_df['Review'] = rev_df['Review'].str.replace('\n', '')
    rev_df['Name'] = rev_df['Name'].str.replace('\n', '')
    rev_df['Location'] = rev_df['Location'].str.replace('\n', '')    
    rev_df['Date'] = pd.to_datetime(rev_df['Date'])
    
    return rev_df

Trustpilot shows 20 reviews per page and the emma-mattress.co.uk site has 11,152 reviews (and climbing). This equates to approximately 558 pages. To begin, lets take 5 pages and make sure this works

In [121]:
df = scrape_reviews(PATH = 'https://www.trustpilot.com/review/www.emma-mattress.co.uk?languages=all&page=',
                   n_pages = 5)
Request:5; Frequency: 0.36804427864250117 requests/s
In [122]:
df
Out[122]:
Header Review Rating Name Location Date
0 I ordered the hybrid mattress I ordered the hybrid mattress.... 5 Ramiro Cali-Corleo Malta 2020-09-05
1 I much enjoy my Emma mattress together… I much enjoy my Emma mattress ... 5 Margrit Dahm United Kingdom 2020-09-05
2 Amazing mattress and impeccable customer service This mattress is amazing and t... 5 Emma United Kingdom 2020-09-05
3 My son loves his Mattress My son loves his Mattress can ... 5 Annie United Kingdom 2020-09-05
4 Best nights sleep ever Having bought a double and a s... 5 Lisa Silver United Kingdom 2020-09-05
... ... ... ... ... ... ...
88 Mattress is comfy Quick delivery and the best ma... 5 Carla Russell United Kingdom 2020-09-04
89 Love my mattress and my pillows !! With the added bit that came f... 5 Bill Compton United Kingdom 2020-09-04
90 Excellent service and professional Excellent mattress almost inst... 5 Rae Lloyd United Kingdom 2020-09-04
91 Great mattress Very impressed with the servic... 5 Shelley United Kingdom 2020-09-04
92 Excellent customer service Sadly tried both the types of ... 2 Jan Brooks United Kingdom 2020-09-04

93 rows × 6 columns

It's not a 100 reviews, but for the purposes here I think 92% of reviews is fine.

In [108]:
df.tail(10)
Out[108]:
Header Review Rating Name Location Date
85 Impressed None 5 Ben Gardner United Kingdom 2020-08-31
86 Best mattress we have ever had I think the Emma matteress is ... 3 Amy Newman United Kingdom 2020-08-31
87 Fabulous mattress .. Fantastic mattress better nigh... 5 Cathy United States 2020-08-31
88 Best nights sleep after purchase Comfortable mattress but find ... 2 Carol United Kingdom 2020-08-31
89 Thank you so much Emma. I am totally satisfied with my... 5 Nick Archer United Kingdom 2020-08-31
90 My wife has a back problem I had enjoyed my Emma mattress... 5 Lesley Weeks United Kingdom 2020-08-31
91 My Emma experience Best mattress we have ever had... 5 Robinson United Kingdom 2020-08-31
92 Best mattress EVER Fabulous mattress ... comfy fr... 5 Lesley Wells United Kingdom 2020-08-31
93 Got the Emma mattress for my son Best nights sleep after purcha... 5 Ste P United Kingdom 2020-08-31
94 Very happy with my mattress! After having back pain for mon... 5 Jenny Parker Netherlands 2020-08-31
In [123]:
emma_data = scrape_reviews(PATH = 'https://www.trustpilot.com/review/www.emma-mattress.co.uk?languages=all&page=',
                   n_pages = 558)
Request:558; Frequency: 53.68421449576918 requests/s
In [124]:
emma_data
Out[124]:
Header Review Rating Name Location Date
0 I ordered the hybrid mattress I ordered the hybrid mattress.... 5 Ramiro Cali-Corleo Malta 2020-09-05
1 I much enjoy my Emma mattress together… I much enjoy my Emma mattress ... 5 Margrit Dahm United Kingdom 2020-09-05
2 Amazing mattress and impeccable customer service This mattress is amazing and t... 5 Emma United Kingdom 2020-09-05
3 My son loves his Mattress My son loves his Mattress can ... 5 Annie United Kingdom 2020-09-05
4 Best nights sleep ever Having bought a double and a s... 5 Lisa Silver United Kingdom 2020-09-05
... ... ... ... ... ... ...
10378 Good products, prompt deliveries and good cust... So pleased we trusted the revi... 5 Helen United Kingdom 2018-02-20
10379 Sadly after 6 weeks on this mattress it… Excellent company to deal with... 4 Berni Mackinnon United Kingdom 2018-02-20
10380 I am a convert! Incredible comfort, my sore ba... 5 Customer Esstee. United Kingdom 2018-02-20
10381 Happy customer This is a comfortable mattress... 4 jjgirl United Kingdom 2018-02-20
10382 My wife has had both hips replaced and… Best bed we have had, we both ... 5 Ian Newsham United Kingdom 2018-02-20

10383 rows × 6 columns

Pretty happy with this, 10,383 reviews in total. Plenty to work with. I'm going to save these to disk so I don't have to scrape again, I can just read them in.

In [125]:
emma_data.to_csv('Desktop/emma_data.csv')

Casper has 5,977 reviews, so 299 pages

In [126]:
casper_data = scrape_reviews(PATH = 'https://www.trustpilot.com/review/casper.com?languages=all&page=',
                   n_pages = 299)
Request:299; Frequency: 26.436810165083017 requests/s
In [127]:
casper_data
Out[127]:
Header Review Rating Name Location Date
0 Casper isn’t good at all for side sleepers I starting by reading online t... 2 shaun United Kingdom 2020-09-03
1 Thank you None 5 MArk United States 2020-09-02
2 I have had NOTHING but negative… Arrived quickly and quality wa... 5 Sonia Ewell United States 2020-08-25
3 arrived in excellent time and is a… I have had NOTHING but negativ... 1 Michael E. United States 2020-08-21
4 It broke but cos they are not in uk… arrived in excellent time and ... 5 S Childs United Kingdom 2020-08-17
... ... ... ... ... ... ...
5421 The mattress is comfortable and… None 2 Anon United Kingdom 2019-05-14
5422 Fantastic The mattress is fantastic, had... 5 Mr Colin Cassels United Kingdom 2019-05-14
5423 Best sleep ever Even though the pillow was muc... 5 Sheila Metzger United Kingdom 2019-05-14
5424 Dog bed is cheap regular foam. Not worth £100+ Absolutely love this mattress.... 5 Brutus United Kingdom 2019-05-14
5425 The pillow was far too soft with no… Excellent service from start t... 5 Dani United Kingdom 2019-05-14

5426 rows × 6 columns

In [128]:
casper_data.to_csv('Desktop/casper_data.csv')

Eve has 5,797 reviews, around 299 pages again

In [129]:
eve_data = scrape_reviews(PATH = 'https://www.trustpilot.com/review/www.evesleep.co.uk?languages=all&page=',
                   n_pages = 299)
Request:299; Frequency: 31.688304164405828 requests/s
In [130]:
eve_data.to_csv('Desktop/eve_data.csv')

Leesa has 1,494 reviews = 75 pages

In [131]:
leesa_data = scrape_reviews(PATH = 'https://www.trustpilot.com/review/leesa.co.uk?languages=all&page=',
                   n_pages = 75)
Request:75; Frequency: 8.762897036945937 requests/s
In [132]:
leesa_data.to_csv('Desktop/leesa_data.csv')

Otty has 5,895 reviews = 295 pages

In [133]:
otty_data = scrape_reviews(PATH = 'https://www.trustpilot.com/review/otty.com?languages=all&page=',
                   n_pages = 295)
Request:295; Frequency: 28.610990529202876 requests/s
In [134]:
otty_datata.to_csv('Desktop/otty_data.csv')

SilentNight has 4,074 reviews = 204 pages

In [135]:
silentnight_data = scrape_reviews(PATH = 'https://www.trustpilot.com/review/shop.silentnight.co.uk?languages=all&page=',
                   n_pages = 204)
Request:204; Frequency: 17.83837362789943 requests/s
In [136]:
silentnight_data.to_csv('Desktop/top/silentnight_data.csv')

Simba sleep has 15,617 reviews = 781 pages, yikes! I'd actually never even heard of this company.

In [137]:
simba_data = scrape_reviews(PATH = 'https://www.trustpilot.com/review/simbasleep.com?languages=all&page=',
                   n_pages = 781)
Request:781; Frequency: 68.55865285167769 requests/s
In [138]:
simba_data.to_csv('Desktop/simba_data.csv')
In [139]:
simba_data
Out[139]:
Header Review Rating Name Location Date
0 Great company Great company, great product, ... 5 Robert Kendall United Kingdom 2020-09-06
1 Noticeably better sleep straight away This is an excellent mattress!... 5 J. M. L. United Kingdom 2020-09-06
2 Not great at the moment 4 days in.... Received my order at the plann... 2 Tami Shepherd United Kingdom 2020-09-06
3 Excellent matteress Excellent matteress, duvet and... 5 mr andrew dobson United Kingdom 2020-09-05
4 Still waiting........ Ordered the mattress on 25th A... 1 Su United Kingdom 2020-09-05
... ... ... ... ... ... ...
14570 Ordering quick and easy Delivery Not so! No question about it. AND the... 5 Linda B. United Kingdom 2017-03-04
14571 Amazing nights sleep The mattress has reduced my ba... 5 Micaela A. United Kingdom 2017-03-04
14572 Simba Matttess was great. Exactly as... 5 Customer United Kingdom 2017-03-04
14573 Easy to order Best mattress around, fantasti... 5 Ratpig United Kingdom 2017-03-04
14574 No support!! Good service and quick deliver... 5 Anthony C United Kingdom 2017-03-04

14575 rows × 6 columns

I'm going to add a column for the company in each file and then combine the data frames into a single table.

In [146]:
emma_data['Company'] = 'Emma'
casper_data['Company'] = 'Casper'
eve_data['Company'] = 'Eve'
leesa_data['Company'] = 'Leesa'
otty_data['Company'] = 'Otty'
silentnight_data['Company'] = 'SilentNight'
simba_data['Company'] = 'Simba'
Companies = [emma_data,casper_data,eve_data,leesa_data,otty_data,silentnight_data,simba_data]
In [177]:
emma_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10383 entries, 0 to 10382
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Header    10383 non-null  object        
 1   Review    10383 non-null  object        
 2   Rating    10383 non-null  object        
 3   Name      10383 non-null  object        
 4   Location  10383 non-null  object        
 5   Date      10383 non-null  datetime64[ns]
 6   Company   10383 non-null  object        
dtypes: datetime64[ns](1), object(6)
memory usage: 567.9+ KB
In [200]:
all_reviews = pd.concat(Companies, axis=0, ignore_index=True)
all_reviews
Out[200]:
Header Review Rating Name Location Date Company
0 I ordered the hybrid mattress I ordered the hybrid mattress.... 5 Ramiro Cali-Corleo Malta 2020-09-05 Emma
1 I much enjoy my Emma mattress together… I much enjoy my Emma mattress ... 5 Margrit Dahm United Kingdom 2020-09-05 Emma
2 Amazing mattress and impeccable customer service This mattress is amazing and t... 5 Emma United Kingdom 2020-09-05 Emma
3 My son loves his Mattress My son loves his Mattress can ... 5 Annie United Kingdom 2020-09-05 Emma
4 Best nights sleep ever Having bought a double and a s... 5 Lisa Silver United Kingdom 2020-09-05 Emma
... ... ... ... ... ... ... ...
46714 Ordering quick and easy Delivery Not so! No question about it. AND the... 5 Linda B. United Kingdom 2017-03-04 Simba
46715 Amazing nights sleep The mattress has reduced my ba... 5 Micaela A. United Kingdom 2017-03-04 Simba
46716 Simba Matttess was great. Exactly as... 5 Customer United Kingdom 2017-03-04 Simba
46717 Easy to order Best mattress around, fantasti... 5 Ratpig United Kingdom 2017-03-04 Simba
46718 No support!! Good service and quick deliver... 5 Anthony C United Kingdom 2017-03-04 Simba

46719 rows × 7 columns

It looks like we have 46,719 reviews from 7 companies to work with. I think this is a good amount to continue on with.

Data exploration

In [149]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [201]:
all_reviews.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46719 entries, 0 to 46718
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Header    46719 non-null  object        
 1   Review    46719 non-null  object        
 2   Rating    46719 non-null  object        
 3   Name      46719 non-null  object        
 4   Location  46719 non-null  object        
 5   Date      46719 non-null  datetime64[ns]
 6   Company   46719 non-null  object        
dtypes: datetime64[ns](1), object(6)
memory usage: 2.5+ MB
In [202]:
all_reviews
Out[202]:
Header Review Rating Name Location Date Company
0 I ordered the hybrid mattress I ordered the hybrid mattress.... 5 Ramiro Cali-Corleo Malta 2020-09-05 Emma
1 I much enjoy my Emma mattress together… I much enjoy my Emma mattress ... 5 Margrit Dahm United Kingdom 2020-09-05 Emma
2 Amazing mattress and impeccable customer service This mattress is amazing and t... 5 Emma United Kingdom 2020-09-05 Emma
3 My son loves his Mattress My son loves his Mattress can ... 5 Annie United Kingdom 2020-09-05 Emma
4 Best nights sleep ever Having bought a double and a s... 5 Lisa Silver United Kingdom 2020-09-05 Emma
... ... ... ... ... ... ... ...
46714 Ordering quick and easy Delivery Not so! No question about it. AND the... 5 Linda B. United Kingdom 2017-03-04 Simba
46715 Amazing nights sleep The mattress has reduced my ba... 5 Micaela A. United Kingdom 2017-03-04 Simba
46716 Simba Matttess was great. Exactly as... 5 Customer United Kingdom 2017-03-04 Simba
46717 Easy to order Best mattress around, fantasti... 5 Ratpig United Kingdom 2017-03-04 Simba
46718 No support!! Good service and quick deliver... 5 Anthony C United Kingdom 2017-03-04 Simba

46719 rows × 7 columns

In [203]:
all_reviews.describe()
Out[203]:
Header Review Rating Name Location Date Company
count 46719 46719 46719 46719 46719 46719 46719
unique 33915 43369 6 35963 96 1313 7
top Great mattress None 5 customer United Kingdom 2020-02-13 00:00:00 Simba
freq 540 3160 36899 944 42666 220 14575
first NaN NaN NaN NaN NaN 2017-01-24 00:00:00 NaN
last NaN NaN NaN NaN NaN 2020-09-06 00:00:00 NaN

Looking at the total dataframe, it seems like we have 6 unique ratings - wierd. Since 0 is not an option, there should just be 1-5. Let's investigate.

In [207]:
all_reviews['Rating'].unique()
Out[207]:
array(['5', '2', '1', '4', '3', '#'], dtype=object)

It seems like some people have managed to not leave a review. Let's drop the rows with '#' as a rating.

In [210]:
all_reviews = all_reviews[all_reviews['Rating'] != '#']
In [211]:
all_reviews
Out[211]:
Header Review Rating Name Location Date Company
0 I ordered the hybrid mattress I ordered the hybrid mattress.... 5 Ramiro Cali-Corleo Malta 2020-09-05 Emma
1 I much enjoy my Emma mattress together… I much enjoy my Emma mattress ... 5 Margrit Dahm United Kingdom 2020-09-05 Emma
2 Amazing mattress and impeccable customer service This mattress is amazing and t... 5 Emma United Kingdom 2020-09-05 Emma
3 My son loves his Mattress My son loves his Mattress can ... 5 Annie United Kingdom 2020-09-05 Emma
4 Best nights sleep ever Having bought a double and a s... 5 Lisa Silver United Kingdom 2020-09-05 Emma
... ... ... ... ... ... ... ...
46714 Ordering quick and easy Delivery Not so! No question about it. AND the... 5 Linda B. United Kingdom 2017-03-04 Simba
46715 Amazing nights sleep The mattress has reduced my ba... 5 Micaela A. United Kingdom 2017-03-04 Simba
46716 Simba Matttess was great. Exactly as... 5 Customer United Kingdom 2017-03-04 Simba
46717 Easy to order Best mattress around, fantasti... 5 Ratpig United Kingdom 2017-03-04 Simba
46718 No support!! Good service and quick deliver... 5 Anthony C United Kingdom 2017-03-04 Simba

46712 rows × 7 columns

It was just 7 rows, but it also looks like there are 3160 reviews with no text. Since the idea is to ultimately use natural language processing with customer reviews, I think we'll drop these rows and move on.

In [213]:
all_reviews = all_reviews[all_reviews['Review'] != 'None']
In [236]:
#convert rating to an integer
all_reviews['Rating'] = all_reviews['Rating'].astype(int)

#convert Header, Review, Name, Location and Company to strings
all_reviews['Header'] = all_reviews['Header'].astype("string")
all_reviews['Review'] = all_reviews['Review'].astype("string")
all_reviews['Name'] = all_reviews['Name'].astype("string")
all_reviews['Location'] = all_reviews['Location'].astype("string")
all_reviews['Company'] = all_reviews['Company'].astype("string")
In [237]:
all_reviews.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43559 entries, 0 to 46718
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Header    43559 non-null  string        
 1   Review    43559 non-null  string        
 2   Rating    43559 non-null  int64         
 3   Name      43559 non-null  string        
 4   Location  43559 non-null  string        
 5   Date      43559 non-null  datetime64[ns]
 6   Company   43559 non-null  string        
dtypes: datetime64[ns](1), int64(1), string(5)
memory usage: 2.7 MB
In [285]:
mat_ratings = all_reviews.pivot_table(values='Rating', index='Company', aggfunc=np.mean, margins=True)
In [240]:
mat_ratings.plot(kind='barh', xlim=(4,5), title='Mean Ratings by Company', legend=False)
Out[240]:
<matplotlib.axes._subplots.AxesSubplot at 0x12decff70>

So it looks like Simba, Otty and Eve are above the average with the others below. Let's look closer at Emma, who has the worst average rating.

In [271]:
all_reviews['Rating'][all_reviews['Company'] == 'Emma'].value_counts(sort=False).plot.bar()
Out[271]:
<matplotlib.axes._subplots.AxesSubplot at 0x1490b5610>

Tough competition, it looks like they have overwhelmingly 5 star reviews! Since each company has a different number of reviews, we'll convert these ratings to a proportion of total reviews and then plot each company and look at what's happening.

In [313]:
names = ['Emma', 'Eve', 'Casper', 'Simba', 'SilentNight', 'Otty', 'Leesa']
proportions = []
proportions = pd.DataFrame(proportions)
for name in names:
    subset = all_reviews[all_reviews['Company']==name]
    calc = subset['Rating'].value_counts(normalize = True, sort = False)
    proportions[name] = calc
In [314]:
proportions
Out[314]:
Emma Eve Casper Simba SilentNight Otty Leesa
1 0.098049 0.053238 0.071545 0.045284 0.089217 0.038298 0.049628
2 0.028131 0.013970 0.029878 0.014556 0.034860 0.015911 0.026468
3 0.045483 0.014159 0.053252 0.022716 0.049335 0.032747 0.066170
4 0.129261 0.045497 0.147561 0.066823 0.122895 0.092877 0.143921
5 0.699076 0.873136 0.697764 0.850621 0.703693 0.820167 0.713813
In [315]:
proportions.plot()
Out[315]:
<matplotlib.axes._subplots.AxesSubplot at 0x143670dc0>

It's a little bit hard to compare here, let's create a stacked plot.

In [540]:
import seaborn as sns
sns.set()
proportions.T.plot(kind='bar', stacked=True)
Out[540]:
<matplotlib.axes._subplots.AxesSubplot at 0x150061bb0>

It looks like Emma has the higherst proportion of 1 star rating, which is obviously dragging their average down. Let's now see how the companies ratings have changed over time. Since each company has different periods, we'll do them separately.

In [357]:
#define function to subset
def subset(company):
    subset = all_reviews[all_reviews['Company'] == company]
    return subset
In [494]:
emma_clean = subset('Emma')
emma_clean
Out[494]:
Header Review Rating Name Location Date Company
0 I ordered the hybrid mattress I ordered the hybrid mattress.... 5 Ramiro Cali-Corleo Malta 2020-09-05 Emma
1 I much enjoy my Emma mattress together… I much enjoy my Emma mattress ... 5 Margrit Dahm United Kingdom 2020-09-05 Emma
2 Amazing mattress and impeccable customer service This mattress is amazing and t... 5 Emma United Kingdom 2020-09-05 Emma
3 My son loves his Mattress My son loves his Mattress can ... 5 Annie United Kingdom 2020-09-05 Emma
4 Best nights sleep ever Having bought a double and a s... 5 Lisa Silver United Kingdom 2020-09-05 Emma
... ... ... ... ... ... ... ...
10378 Good products, prompt deliveries and good cust... So pleased we trusted the revi... 5 Helen United Kingdom 2018-02-20 Emma
10379 Sadly after 6 weeks on this mattress it… Excellent company to deal with... 4 Berni Mackinnon United Kingdom 2018-02-20 Emma
10380 I am a convert! Incredible comfort, my sore ba... 5 Customer Esstee. United Kingdom 2018-02-20 Emma
10381 Happy customer This is a comfortable mattress... 4 jjgirl United Kingdom 2018-02-20 Emma
10382 My wife has had both hips replaced and… Best bed we have had, we both ... 5 Ian Newsham United Kingdom 2018-02-20 Emma

9740 rows × 7 columns

In [502]:
#we're going to calculate after each review, so we expect the beginning to have more variation.
count = 0
rolling_total = 0
mean_values = []
emma_clean = emma_clean.sort_values(by='Date')
for value in emma_clean['Rating']:
    rating = value
    rolling_total += rating
    count += 1
    mean = rolling_total/count
    mean_values.append(mean)

emma_rolling_mean = emma_clean.copy()
emma_rolling_mean['Rolling_mean'] = mean_values
In [503]:
emma_clean
Out[503]:
Header Review Rating Name Location Date Company Rolling_mean
10382 My wife has had both hips replaced and… Best bed we have had, we both ... 5 Ian Newsham United Kingdom 2018-02-20 Emma 5.000000
10380 I am a convert! Incredible comfort, my sore ba... 5 Customer Esstee. United Kingdom 2018-02-20 Emma 4.500000
10379 Sadly after 6 weeks on this mattress it… Excellent company to deal with... 4 Berni Mackinnon United Kingdom 2018-02-20 Emma 4.428571
10378 Good products, prompt deliveries and good cust... So pleased we trusted the revi... 5 Helen United Kingdom 2018-02-20 Emma 4.500000
10377 The most comfortable mattress I have… A brilliant product - a comfy ... 5 Alison United Kingdom 2018-02-20 Emma 4.555556
... ... ... ... ... ... ... ... ...
3 My son loves his Mattress My son loves his Mattress can ... 5 Annie United Kingdom 2020-09-05 Emma 4.302682
2 Amazing mattress and impeccable customer service This mattress is amazing and t... 5 Emma United Kingdom 2020-09-05 Emma 4.302610
1 I much enjoy my Emma mattress together… I much enjoy my Emma mattress ... 5 Margrit Dahm United Kingdom 2020-09-05 Emma 4.302538
4 Best nights sleep ever Having bought a double and a s... 5 Lisa Silver United Kingdom 2020-09-05 Emma 4.302753
0 I ordered the hybrid mattress I ordered the hybrid mattress.... 5 Ramiro Cali-Corleo Malta 2020-09-05 Emma 4.303183

9740 rows × 8 columns

This seemed to work, so we'll define a function below:

In [515]:
def rolling_mean(company_sub):
    count = 0
    rolling_total = 0
    reviews = []
    mean_values = []
    company_sub = company_sub.sort_values(by='Date')
    for value in company_sub['Rating']:
        rating = value
        rolling_total += rating
        count += 1
        #calculate the mean and append it
        mean = rolling_total/count
        mean_values.append(mean)
        #keep track of number of reviews 
        reviews.append(count)

    company_sub = company_sub.copy()
    company_sub['Rolling_mean'] = mean_values
    company_sub['Accumulated_reviews'] = reviews
    return company_sub


emma_clean = subset('Emma')
eve_clean = subset('Eve')
casper_clean = subset('Casper')
simba_clean = subset('Simba')
silentnight_clean = subset('SilentNight')
leesa_clean = subset('Leesa')
otty_clean = subset('Otty')

emma_rolling_mean = rolling_mean(emma_clean)
eve_rolling_mean = rolling_mean(eve_clean)
casper_rolling_mean = rolling_mean(casper_clean)
simba_rolling_mean = rolling_mean(simba_clean)
silentnight_rolling_mean = rolling_mean(silentnight_clean)
leesa_rolling_mean = rolling_mean(leesa_clean)
otty_rolling_mean = rolling_mean(otty_clean)

all_rolling_mean = pd.concat([emma_rolling_mean, eve_rolling_mean, casper_rolling_mean, simba_rolling_mean, silentnight_rolling_mean, leesa_rolling_mean, otty_rolling_mean], axis=0, ignore_index=True)
In [507]:
import seaborn as sns
sns.lineplot(data=all_rolling_mean, x='Date', y='Rolling_mean', hue='Company')
Out[507]:
<matplotlib.axes._subplots.AxesSubplot at 0x14f908dc0>

Ok, so as expected we're seeing the large variations when the companies started receiving reviews before the rolling average starts to stabilize. Eve and Simba have always had very good reviews, but there is a slight downward trend, this also seems to be the case for Silent Night. Leesa and Otty and relatively stable, while Casper has recently dipped. Emma, although having the lowest overall average, appear to be the only company on an upward trend. Good news for Emma!

I wonder if there is a link between the rate new reviews from customers. I did notice that while perfecting the code for scraping the data, Emma's reviews were continuously increasing. Let's check that out.

In [517]:
sns.lineplot(data=all_rolling_mean, x='Date', y='Accumulated_reviews', hue='Company')
Out[517]:
<matplotlib.axes._subplots.AxesSubplot at 0x1436ac5b0>

So probably no link between the rate of reviews since Emma is on par with Simba, although these two companies collect far more reviews than the others. Casper seems to have really leveled off though. I just checked their website and they've stopped selling in Europe to concentrate on the North American market. We'll keep them in the data set to see what people were saying about them, maybe we can see why they closed European operations.

One last thing I wanted to check was whether we can do a break down of customer location.

In [520]:
all_rolling_mean['Location'].describe()
Out[520]:
count                                  43559
unique                                    95
top                   United Kingdom        
freq                                   39888
Name: Location, dtype: object

The UK really dominates the reviews, so I don't think this would add much. Maybe in the future we can gather non-English reviews for further insights.

Natural Language Processing

Earlier we identified Emma as having the lowest average reviews, so let's process the reviews and see if we can identify why customers are so unhappy.

We have our data set already looking good for the purposes of examining the numbers, but now we want to break down the reviews for each company and that's going to require a bit more cleaning. Briefly, here are the steps we'll take:

  • Tokenize the text, which means we split each sentence into separate words;
  • Remove punctuation;
  • Remove stopwords. These are words that don't really add anything to the meaning of a sentence, things like 'he', 'she', 'because', 'something' etc.;
  • Lemmatize our words. This means that we'll convert our words to their root form. For example, 'walk', 'walks' and 'walking' all stem from 'walk'.

Let's begin there and see how this progresses. We'll start with the Emma dataset, which has already been subsetted (emma_clean) from earlier.

Emma the sleep company

In [542]:
import nltk
from nltk import FreqDist
In [543]:
#we quickly look at the frequency of words in the uncleaned dataset
def freq_words(x, terms = 30):
      all_words = ' '.join([text for text in x])
      all_words = all_words.split()

      fdist = FreqDist(all_words)
      words_df = pd.DataFrame({'word':list(fdist.keys()), 'count':list(fdist.values())})

      # selecting top 30 most frequent words
      d = words_df.nlargest(columns="count", n = terms) 
      plt.figure(figsize=(20,5))
      ax = sns.barplot(data=d, x= "word", y = "count")
      ax.set(ylabel = 'Count')
      plt.show()
    
freq_words(emma_clean['Review'])

Although there are relevant words like 'mattress' and 'sleep', the majority are not at all informative. We're going to proceed with spacy to clean the data.

In [546]:
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
In [861]:
import spacy

emma_clean = subset('Emma')
#get punctuation list
punctuations = '’!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

#create a list of stop words
stop_words = spacy.lang.en.stop_words.STOP_WORDS

#create a function to remove stopwords and punctuation
def remove_stopword(review):
    review_new = ' '.join([word for word in review if word not in stop_words and word not in punctuations])
    return review_new

            
emma_clean_copy = emma_clean.copy()

#remove stopwords
reviews = [remove_stopword(r.split()) for r in emma_clean_copy['Review']]

#make lowercase and remove remaining punctuation
for punc in punctuations:
    reviews = [r.replace(punc, " ") for r in reviews]
reviews = [r.strip().lower() for r in reviews]

#remove extra spaces
reviews = [re.sub(' +', ' ', r) for r in reviews]
#remove words with 3 letters or less
reviews = [re.sub(r'\b\w{1,3}\b', '', r) for r in reviews]
In [862]:
freq_words(reviews)

This is looking better. Mattress is the most common word which makes sense. We've removed a bunch of words that don't mean anything by themselves and now have words like 'delivery', 'service', ,'order' and 'pain', which seem like relevant words. In these top 30 words, we also see 'sleep' and 'sleeping', which we can fix now by lemmatizing. This means the words will be reduced to their basic form, so sleep and sleeping will just become 'sleep'.

We can further reduce the word list to include only nouns and adjectives to remove things like 'very' and 'this'.

In [863]:
import en_core_web_sm
nlp = en_core_web_sm.load()

def lemmatization(texts, tags=['NOUN', 'ADJ', 'VERB']):
    lemma_out = []
    for i in texts:
             doc = nlp(" ".join(i)) 
             lemma_out.append([token.lemma_ for token in doc if token.pos_ in tags])
    return lemma_out

#tokenize each review
tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])
['enjoy', 'emma', 'mattress', 'emma', 'pillows', 'they', 'help', 'good', 'night', 'rest']

We get a good idea of what the first review is talking about still.

In [864]:
reviews_lemma = lemmatization(tokenized_reviews)
print(reviews_lemma[1])
['enjoy', 'pillow', 'help', 'good', 'night', 'rest']
In [865]:
emma_clean_copy_2 = emma_clean.copy()

#add the lemmatized and cleaned reviews back to the original tables
reviews_3 = []
for i in range(len(reviews_lemma)):
    reviews_3.append(' '.join(reviews_lemma[i]))

emma_clean_copy_2['Reviews'] = reviews_3

freq_words(emma_clean_copy_2['Reviews'])

Looks like we have some relevant words here. Let's build a Latent Dirichlet Allocation (LDA) model. This is a matrix factorization technique that tries to predict topics. Essentially, LDA takes a number of documents (in this case reviews) and assumes that they are made up of different topics. It then backtracks and tries to figure out which topics could make those reviews in the first place.

In [638]:
import gensim
from gensim import corpora
In [866]:
# Create a dictionary of words from all reviews
dictionary = corpora.Dictionary(reviews_lemma)
In [867]:
doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_lemma]
In [868]:
# Creating the object for LDA model using gensim library
LDA = gensim.models.ldamodel.LdaModel

# Build LDA model
lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=10, random_state=100,
                chunksize=200, passes=50)
In [869]:
lda_model.print_topics()
Out[869]:
[(0,
  '0.104*"mattress" + 0.065*"comfortable" + 0.050*"great" + 0.047*"delivery" + 0.034*"good" + 0.029*"sleep" + 0.024*"recommend" + 0.024*"excellent" + 0.023*"easy" + 0.019*"service"'),
 (1,
  '0.155*"sleep" + 0.124*"night" + 0.067*"good" + 0.061*"mattress" + 0.053*"pain" + 0.040*"comfortable" + 0.027*"wake" + 0.025*"ache" + 0.019*"feel" + 0.017*"get"'),
 (2,
  '0.059*"delivery" + 0.054*"order" + 0.039*"email" + 0.037*"company" + 0.029*"day" + 0.029*"week" + 0.029*"refund" + 0.026*"receive" + 0.026*"return" + 0.025*"tell"'),
 (3,
  '0.107*"mattress" + 0.056*"recommend" + 0.040*"buy" + 0.035*"year" + 0.033*"comfortable" + 0.032*"emma" + 0.026*"good" + 0.022*"would" + 0.021*"sleep" + 0.019*"purchase"'),
 (4,
  '0.083*"customer" + 0.071*"service" + 0.031*"send" + 0.029*"emma" + 0.027*"mattress" + 0.021*"company" + 0.020*"say" + 0.019*"phone" + 0.017*"contact" + 0.015*"work"'),
 (5,
  '0.053*"deliver" + 0.052*"edge" + 0.036*"summer" + 0.035*"efficient" + 0.032*"happen" + 0.029*"daughter" + 0.029*"warm" + 0.026*"simple" + 0.023*"terrible" + 0.022*"resolve"'),
 (6,
  '0.070*"mattress" + 0.032*"take" + 0.030*"time" + 0.027*"week" + 0.022*"month" + 0.019*"day" + 0.019*"go" + 0.018*"find" + 0.016*"arrive" + 0.015*"trial"'),
 (7,
  '0.035*"change" + 0.024*"difficult" + 0.020*"turn" + 0.018*"move" + 0.016*"need" + 0.015*"stay" + 0.015*"sheet" + 0.015*"cover" + 0.015*"leave" + 0.015*"notice"'),
 (8,
  '0.100*"firm" + 0.085*"soft" + 0.063*"mattress" + 0.057*"foam" + 0.041*"find" + 0.040*"feel" + 0.039*"memory" + 0.031*"little" + 0.029*"perfect" + 0.025*"comfortable"'),
 (9,
  '0.081*"money" + 0.053*"value" + 0.044*"cover" + 0.032*"pick" + 0.030*"reason" + 0.025*"good" + 0.023*"keep" + 0.023*"complaint" + 0.023*"great" + 0.022*"promise"')]
In [870]:
import pyLDAvis
import pyLDAvis.gensim

# Visualize the different topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, doc_term_matrix, dictionary)
vis
Out[870]:

Above we are visualizing our 10 topics. On the left the topics are represented by circles whose centres are placed by calculating the Jensen-Shannon divergence between topics and then using multidimensional scaling for inter-topic differences. In other words, the further apart the circles are, the more different the topics are likely to be. The size of the circle represents the prevalance of the topic.

On the right hand side we have an overlap of the words and their frequency in the entire dataset (blue) and their prevalance in the selected topic. The λ slider allows to rank the terms according to term relevance. The optimal value in the LDAvis documentation is suggested to be 0.6.

The topics predicted seem relevant. For example, topic 9 appears to be about returns and refunds, topic 4 is perhaps more unhappy customers reporting aches and pains, while in topic 7 the top 3 terms are 'would', 'recommend', 'friend', so probably some happy customers there. This topic modelling might be more informative if we look at topics of unhappy customers, or those who rated the company as 1 or 2 stars. In this way we can look at the pain points from the customer perspective.

But first we should look at topic coherence, that is the average similarity between the top words in a given topic. We might be able to improve the topic coherence and the model.

In [871]:
from gensim.models import CoherenceModel
coherence_model_lda = CoherenceModel(model=lda_model, texts=reviews_lemma, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Coherence Score:  0.5006080048113015

Let's see if we can improve the coherence score by altering the number of topics. We can then plot this.

In [872]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step =1):
    #our returned values
    coherence_values = []
    model_list = []
    for num_topics in range(start,limit,step):
        model = gensim.models.ldamodel.LdaModel(corpus=doc_term_matrix, id2word=dictionary, num_topics=num_topics, random_state=100,
                chunksize=200, passes=50)
        model_list.append(model)
        coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherence_model.get_coherence())
    return model_list, coherence_values
In [873]:
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus =doc_term_matrix, texts=reviews_lemma, limit=10, start=2, step=1)
In [875]:
#we can plot this
limit=10 
start=2
step=1
x=range(start,limit,step)
plt.plot(x,coherence_values)
plt.xlabel('Topics')
plt.ylabel('Coherence score')
plt.legend(('coherence_values'), loc='best')
plt.show()
In [876]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(model_list[0], doc_term_matrix, dictionary)
vis
Out[876]:

It looks like 2 is the optimal number of topics for the full dataset. At a glance, it looks like they're separated into good and bad reviews, which is probably appropriate. Let's look again at just the bad reviews.

In [877]:
emma_bad_reviews = emma_clean[emma_clean['Rating'].isin([1,2])]
In [878]:
emma_bad_reviews
Out[878]:
Header Review Rating Name Location Date Company
5 The first mattress we got was in April… The first mattress we got was ... 2 Mrs A Mustard United Kingdom 2020-09-04 Emma
14 I am a convert! Sadly after 6 weeks on this ma... 1 joanne United Kingdom 2020-09-04 Emma
25 Ordered the Emma king-size mattress… The first mattress we got was ... 2 Mrs A Mustard United Kingdom 2020-09-04 Emma
34 My wife has had both hips replaced and… Sadly after 6 weeks on this ma... 1 joanne United Kingdom 2020-09-04 Emma
49 Appalling customer service Very disappointed with the cus... 1 kevin Miller United Kingdom 2020-09-04 Emma
... ... ... ... ... ... ... ...
10214 Much softer after short time after purchase I've only had this since Novem... 2 lyndsey United Kingdom 2018-02-26 Emma
10229 Just OK The delivery was the only good... 1 Mr David Sims United Kingdom 2018-02-25 Emma
10264 Very Happy I have had this now less than ... 1 Jan Cadogan United Kingdom 2018-02-23 Emma
10302 Very comfortable initially but will not last. I bought the emma mattress due... 2 Hayley Ireland 2018-02-23 Emma
10348 I ordered the hybrid mattress Ordered superking on 8th Jan. ... 1 Stan Owen United Kingdom 2018-02-21 Emma

1229 rows × 7 columns

In [879]:
#remove stopwords
reviews = [remove_stopword(r.split()) for r in emma_bad_reviews['Review']]

#make lowercase and remove remaining punctuation
for punc in punctuations:
    reviews = [r.replace(punc, " ") for r in reviews]
reviews = [r.strip().lower() for r in reviews]

#remove words with 3 letters or less
reviews = [re.sub(r'\b\w{1,3}\b', '', r) for r in reviews]
#remove extra spaces
reviews = [re.sub(' +', ' ', r) for r in reviews]

#tokenize each review
tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])
['sadly', 'weeks', 'mattress', 'wasn', 'soft', 'firm', 'topper', 'sent', 'free', 'made', 'difference', 'sadly', 'waking', 'loads', 'night', 'absolutely', 'boiling', 'sweating', 'however', 'fault', 'emma', 'customer', 'service', 'anyway', 'very', 'prompt', 'very', 'understanding', 'delivery', 'collection', 'fantastic']
In [880]:
reviews_lemma = lemmatization(tokenized_reviews)
print(reviews_lemma[1])
['week', 'soft', 'firm', 'topper', 'send', 'make', 'difference', 'wake', 'load', 'night', 'boil', 'sweat', 'fault', 'emma', 'customer', 'service', 'prompt', 'understanding', 'delivery', 'collection', 'fantastic']
In [881]:
#add the lemmatized and cleaned reviews back to the original tables
reviews_3 = []
for i in range(len(reviews_lemma)):
    reviews_3.append(' '.join(reviews_lemma[i]))

emma_bad_reviews = emma_bad_reviews.copy()
emma_bad_reviews['Reviews'] = reviews_3

freq_words(emma_bad_reviews['Reviews'])

Since these are the bad reviews, I think we can safely say that there are some issues with customer service and deliveries based on this plot. Let's look at the topics.

In [887]:
# Create a dictionary of words from all bad reviews
dictionary = corpora.Dictionary(reviews_lemma)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_lemma]

# Build LDA model with coherence values
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus =doc_term_matrix, texts=reviews_lemma, limit=15, start=2, step=1)
In [888]:
#we can plot this
limit=15
start=2
step=1
x=range(start,limit,step)
plt.plot(x,coherence_values)
plt.xlabel('Topics')
plt.ylabel('Coherence score')
plt.legend(('coherence_values'), loc='best')
plt.show()

It seems like 11 or 12 is giving us the best scores here, let's quickly check.

In [890]:
print("\nNumber of Topics: 11", "Coherence Value: ", coherence_values[9])
print("\nNumber of Topics: 12", "Coherence Value: ", coherence_values[10])
Number of Topics: 11 Coherence Value:  0.4883790515167501

Number of Topics: 12 Coherence Value:  0.48810269694729547

11 topics is the best by a marginal fraction, so we'll proceed with that.

In [891]:
# Visualize the different topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(model_list[9], doc_term_matrix, dictionary)
vis
Out[891]:

Wow, this is where customers are clearly upset. It looks like the biggest pain point has to do with customer service and deliveries, as defined by topic 1 and 2. Topic 5 suggest there might be a strong smell as well, not ideal. To give a little more context, let's take a look at the bigrams, where we take 2 consecutive words.

In [899]:
from nltk import bigrams
bigram_terms = [list(bigrams(review)) for review in tokenized_reviews]
In [893]:
import itertools
import collections
# Flatten list of bigrams in clean review
bigrams = list(itertools.chain(*bigram_terms))

# Create counter of words in clean bigrams
bigram_counts = collections.Counter(bigrams)

bigram_counts.most_common(20)
Out[893]:
[(('customer', 'service'), 575),
 (('emma', 'mattress'), 320),
 (('ordered', 'mattress'), 128),
 (('working', 'days'), 96),
 (('customer', 'services'), 91),
 (('phone', 'number'), 91),
 (('delivery', 'date'), 88),
 (('order', 'number'), 83),
 (('mattress', 'delivered'), 74),
 (('received', 'email'), 65),
 (('customer', 'support'), 64),
 (('return', 'mattress'), 62),
 (('contacted', 'emma'), 60),
 (('mattress', 'arrived'), 58),
 (('mattress', 'collected'), 58),
 (('telephone', 'number'), 54),
 (('cancelled', 'order'), 53),
 (('placed', 'order'), 51),
 (('days', 'later'), 49),
 (('cancel', 'order'), 46)]
In [894]:
bigram_df = pd.DataFrame(bigram_counts.most_common(30), columns=['bigram', 'count'])
In [895]:
def freq_bigrams(x, terms = 30):
    d = x.nlargest(columns="count", n = terms) 
    plt.figure(figsize=(20,5))
    ax = sns.barplot(data=d, x= "bigram", y = "count")
    ax.set(ylabel = 'Count')
    for item in ax.get_xticklabels():
        item.set_rotation(90)
    plt.show()
In [896]:
freq_bigrams(bigram_df)

This really just confirms what we saw in our topic models with 'customer service' taking out top spot for negative reviews.

Now we'll continue on with other companies to compare. As a reminder, these are the other companies:

  • Casper
  • SilentNight
  • Leesa
  • Otty
  • Simba
  • Eve

Casper

To begin with, we'll define some functions to simplify what we've done above.

In [906]:
def clean_reviews(company_subset):
    #remove stopwords
    reviews = [remove_stopword(r.split()) for r in company_subset['Review']]

    #make lowercase and remove remaining punctuation
    for punc in punctuations:
        reviews = [r.replace(punc, " ") for r in reviews]
    reviews = [r.strip().lower() for r in reviews]

    
    #remove words with 3 letters or less
    reviews = [re.sub(r'\b\w{1,3}\b', '', r) for r in reviews]
    #remove extra spaces
    reviews = [re.sub(' +', ' ', r) for r in reviews]
    return reviews

def token_lemma(reviews):
    #tokenize each review
    tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
    reviews_lemma = lemmatization(tokenized_reviews)
    return tokenized_reviews, reviews_lemma 

def get_LDA_vis(reviews_lemma, num_topics):
    #Create a dictionary of words from all bad reviews
    dictionary = corpora.Dictionary(reviews_lemma)
    doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_lemma]

    # Creating the object for LDA model using gensim library
    LDA = gensim.models.ldamodel.LdaModel

    # Build LDA model
    lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=num_topics, random_state=100,
                chunksize=200, passes=50)
    # Visualize the different topics
    pyLDAvis.enable_notebook()
    vis = pyLDAvis.gensim.prepare(lda_model, doc_term_matrix, dictionary)
    return vis

def get_bigrams(tokenized_reviews):
    bigram_terms = [bigrams(review) for review in tokenized_reviews]
    bigram_terms = list(bigram_terms)
    # Flatten list of bigrams in clean review
    tok_bigrams = list(itertools.chain(*bigram_terms))

    # Create counter of words in clean bigrams
    bigram_counts = collections.Counter(tok_bigrams)
    bigram_df = pd.DataFrame(bigram_counts.most_common(30), columns=['bigram', 'count'])
    return bigram_df
In [897]:
casper_sub = subset('Casper')
casper_clean = clean_reviews(casper_sub)
casper_token, casper_lemma = token_lemma(casper_clean)
casper_LDA = get_LDA_vis(casper_lemma)
casper_LDA
Out[897]:

There is some overwhelmingly positive reviews here, let's look at the bigrams.

In [900]:
casper_bigrams = get_bigrams(casper_token)
freq_bigrams(casper_bigrams)

Seems like we have a few German reviews mixed in, which is ok. Fast delivery (schnelle lieferung) and very satisfied (sehr zufrieden) are among the top terms, in contrast to Emma who had reviews more about the comfort. Let's switch to the negative reviews quickly.

In [901]:
casper_bad_reviews = casper_sub[casper_sub['Rating'].isin([1,2])]
In [902]:
casper_bad_clean = clean_reviews(casper_bad_reviews)
casper_bad_token, casper_bad_lemma = token_lemma(casper_bad_clean)

#define values for LDA model
dictionary = corpora.Dictionary(casper_bad_lemma)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in casper_bad_lemma]

# Build LDA model with coherence values
casper_bad_model_list, casper_bad_coherence_values = compute_coherence_values(dictionary=dictionary, corpus =doc_term_matrix, texts=casper_bad_lemma, limit=15, start=2, step=1)
In [904]:
limit=15
start=2
step=1
x=range(start,limit,step)
plt.plot(x,casper_bad_coherence_values)
plt.xlabel('Topics')
plt.ylabel('Coherence score')
plt.legend(('coherence_values'), loc='best')
plt.show()
In [907]:
# 5 topics is the winner for Casper
casper_bad_LDA = get_LDA_vis(casper_bad_lemma, 5)
casper_bad_LDA
Out[907]:

Customer service is also a big pain point here.

In [845]:
casper_bad_bigrams = get_bigrams(casper_bad_token)
freq_bigrams(casper_bad_bigrams)

Again some patterns emerge with poor customer service and no answer (keine antwort), cancellations of orders and orders not arriving.

SilentNight

In [846]:
silentnight_sub = subset('SilentNight')
silentnight_clean = clean_reviews(silentnight_sub)
silentnight_token, silentnight_lemma = token_lemma(silentnight_clean)
silentnight_LDA = get_LDA_vis(silentnight_lemma)
silentnight_LDA
Out[846]: