Retail Product Recommender Engine¶

Contents¶

I. Overview
II. Business Problem
III. Data Understanding
IV. Recommendation Systems
V. Results and Recommendations
VI. Further Research

I. Overview¶

A recommender engine is developed to increase revenue of clothing rental companies by predicting user preferences and recommending products for users to rent. I apply different algorithms to create personalized recommendations using content-based and collaborative filtering systems. The algorithm that attained the lowest mean absolute error of 0.5 is the Singular Value Decomposition.

II. Business Problem¶

The clothing rental industry grows as more companies follow suit of the retailer Rent the Runway, which pioneered online services and subscriptions for designer rentals. To help grow the revenue of clothing rental companies, I develop recommendation systems that predict a set of user preferences and recommend the top preferences for the user. Doing so will conveniently expose users to relevant products to rent that tailor to their preferences. Using data from Rent the Runway, I conduct an analysis of the product reviews, model the data to predict user ratings, and provide recommendations accordingly.

III. Data Understanding¶

The Rent the Runway reviews (data source) contain 200,000 ratings of 6,000 unique items rented between 2010 and 2018 by over 100,000 unique users.

A quick look at the data structure:

In [1]:

import pandas as pd

raw_data = pd.read_csv('data/data.csv')
raw_data.head(2)

Out[1]:

	fit	user_id	bust size	item_id	weight	rating	rented for	review_text	body type	review_summary	category	height	size	age	review_date
0	fit	420272	34d	2260466	137lbs	10.0	vacation	An adorable romper! Belt and zipper were a lit...	hourglass	So many compliments!	romper	5' 8"	14	28.0	April 20, 2016
1	fit	273551	34b	153475	132lbs	10.0	other	I rented this dress for a photo shoot. The the...	straight & narrow	I felt so glamourous!!!	gown	5' 6"	12	36.0	June 18, 2013

Missing values:

In [2]:

raw_data.isna().sum()

Out[2]:

fit                   0
user_id               0
bust size         18411
item_id               0
weight            29982
rating               82
rented for           10
review_text          62
body type         14637
review_summary      345
category              0
height              677
size                  0
age                 960
review_date           0
dtype: int64

Target variable:

In [3]:

raw_data['rating'].value_counts()

Out[3]:

10.0    124537
8.0      53391
6.0      10697
4.0       2791
2.0       1046
Name: rating, dtype: int64

Data Cleaning

To perform the pre-processing steps, I define the function preprocess_data:

Drop missing values for rating and change the scale for from 2-10 to 1-5.
Remove the units of measurement for weight and height to only keep the numerical values.
Impute missing values with the median for age and a few other features and with mode for rented_for.
Impute missing value for bust_size with the median value by body_type and vice versa.
Count the number of words in text_summary and text_review together to create new feature length.
Create new features review_month, review_season, and review_year based on review_date.

In [4]:

import numpy as np

def convert_height(x):
    '''
    Converts height from string format as feet and inches to integer in inches.
    '''
    height = [int(i) for i in x.replace('\'', '').replace('"', '').split()]
    return height[0]*12 + height[1]

def preprocess_data(df):
    '''
    Cleans the dataframe using imputation and feature engineering.
    '''
    df.columns = df.columns.str.replace(' ', '_')
    df = df.dropna(subset=['rating'])
    
    df['weight'] = df['weight'].str.replace('lbs', '')
    df['rating'] = df['rating']/2
    df['height'] = df['height'].apply(lambda x: convert_height(x) if pd.notnull(x) else x)
    
    to_num = ['rating', 'weight', 'age']
    df[to_num] = df[to_num].apply(pd.to_numeric, errors='coerce')
    
    for col in ['height', 'age']:
        df[col] = df[col].fillna(df[col].median())
        
    weight_map = dict(df.groupby('height')['weight'].median())
    df['weight'] = df['weight'].fillna(df['height'].map(weight_map))
    
    for col in ['review_text', 'review_summary']:
        df[col] = df[col].replace('-', np.nan)
    df['review'] = df['review_summary'] + ' ' + df['review_text']
    df['review'] = df['review'].fillna('')
    df['review_length'] = df['review_text'].fillna('').apply(lambda x: len(x.split()))
        
    age_limit = (df['age'] > 60) | (df['age'] < 13)
    df['age'] = np.where(age_limit==True, df['age'].median(), df['age'])
    
    for col in ['bust_size', 'body_type']:
        to_map = dict(df.groupby('size')[col].last())
        df[col] = df[col].fillna(df['size'].map(to_map))
    
    df['rented_for'] = df['rented_for'].fillna(df['rented_for'].value_counts().index[0])
    
    df['review_date'] = pd.to_datetime(df['review_date'])
    df['review_month'] = pd.DatetimeIndex(df['review_date']).month
    df['review_season'] = pd.cut(df['review_month'].replace(12, 0), [0, 3, 6, 9, 11], include_lowest=True, labels=['Winter', 'Spring', 'Summer', 'Fall'])
    df['review_year'] = pd.DatetimeIndex(df['review_date']).year
    return df

import warnings
warnings.filterwarnings('ignore')

In [5]:

# Create new dataframe for processed data
data = preprocess_data(raw_data)

In [6]:

pd.options.display.float_format = '{:.2f}'.format

# Summary statistics for numerical features
data.drop(columns=['user_id', 'item_id']).describe().T

Out[6]:

	count	mean	std	min	25%	50%	75%	max
weight	192462.00	136.92	20.43	50.00	125.00	135.00	145.00	300.00
rating	192462.00	4.55	0.72	1.00	4.00	5.00	5.00	5.00
height	192462.00	65.31	2.66	54.00	63.00	65.00	67.00	78.00
size	192462.00	12.25	8.50	0.00	8.00	12.00	16.00	58.00
age	192462.00	33.60	7.39	14.00	29.00	32.00	37.00	60.00
review_length	192462.00	58.40	43.04	0.00	27.00	50.00	79.00	398.00
review_month	192462.00	6.85	3.38	1.00	4.00	7.00	10.00	12.00
review_year	192462.00	2015.69	1.33	2010.00	2015.00	2016.00	2017.00	2018.00

Data Visualization¶

Let's explore and visualize the processed data!

User Data I create a separate table for user information by grouping the data by user_id and adding new features:

rating_count is the total number of items the user rated and reviewed.
rating_average is the average rating of the items reviewed by the user.
rented_for_top is the user's most common reason for renting an item.
category_top is the most common clothing category among the items reviewed by the user.
review_length_average is the average length of text review posted by the user.
review_month_top and review_season_top are the most common month and season the user posted the review.
rented_for_all is a list of all the user's reasons for renting the items.
category_for_all is a list of all the clothing categories of the items reviewed by the user.

In [7]:

def create_user_data(df):
    '''
    Groups the data by user and returns dataframe containing user information. 
    '''
    user_df = pd.DataFrame(df.groupby('user_id').count().reset_index()['user_id'])
    
    for col in df.columns:
        if col in ['bust_size', 'weight']:
            feature = df.sort_values('review_date', ascending=False).groupby('user_id')[col].first()
            user_df = user_df.merge(feature, on='user_id')
        if col == 'item_id':
            feature = df.groupby(df['user_id']).count()[col]
            user_df = user_df.merge(feature, on='user_id')
        if col == 'rating':
            feature = df.groupby(df['user_id']).mean()[col]
            user_df = user_df.merge(feature, on='user_id')
        if col in ['body_type', 'height', 'size', 'age']:
            feature = df.sort_values('review_date', ascending=False).groupby('user_id')[col].first()
            user_df = user_df.merge(feature, on='user_id')
        if col == 'review_length':
            feature = df.groupby(df['user_id']).mean()[col]
            user_df = user_df.merge(feature, on='user_id')
        if col in ['review_month', 'review_season']:
            feature = df.sort_values('review_date', ascending=False).groupby('user_id')[col].agg(lambda x: x.value_counts().index[0])
            user_df = user_df.merge(feature, on='user_id')
        if col in ['rented_for', 'category']:
            feature = df.sort_values('review_date', ascending=False).groupby('user_id')[col].agg(pd.Series.mode).apply(lambda x: x[0] if type(x)==np.ndarray else x)
            user_df = user_df.merge(feature, on='user_id')
        else:
            continue
        
    for col in ['rented_for', 'category']:
        feature = df.groupby('user_id')[col].apply(set).apply(lambda x: list(x))
        user_df = user_df.merge(feature, on='user_id')
    
    user_df.columns = ['user_id', 'bust_size', 'rating_count', 'weight', 'rating_average', 'rented_for_top', 
                       'body_type', 'category_top', 'height', 'size', 'age', 'review_length_average', 
                       'review_month_top', 'review_season_top', 'rented_for_all', 'category_all']
    
    return user_df

In [8]:

# Create new dataframe for user data
user_data = create_user_data(data)
user_data.head(2)

Out[8]:

	user_id	bust_size	rating_count	weight	rating_average	rented_for_top	body_type	category_top	height	size	age	review_length_average	review_month_top	review_season_top	rented_for_all	category_all
0	9	32c	2	145.00	5.00	formal affair	pear	dress	66.00	8	32.00	121.50	3	Winter	[formal affair, wedding]	[dress, gown]
1	25	34ddd/e	1	130.00	5.00	party	full bust	legging	67.00	8	40.00	14.00	12	Winter	[party]	[legging]

In [9]:

import seaborn as sns
import matplotlib.pyplot as plt

# Sort data by bust size
bust_size_sorted_data = user_data.loc[(user_data['bust_size']>='32a') & (user_data['bust_size']<='38ddd/e')].sort_values('bust_size')

sns.set_style('whitegrid')
fig, axes = plt.subplots(nrows=2, figsize=(20, 18))

# Plot the distribution of users by bust size
sns.countplot(x='bust_size', data=bust_size_sorted_data, palette='twilight', ax=axes[0])
axes[0].set_title('User Count by Bust Size', fontsize=16)
axes[0].set_xlabel('Bust Size')
axes[0].set_ylabel('User Count')
axes[0].set_xticklabels(bust_size_sorted_data['bust_size'].unique(), rotation=90, fontsize=12)

# Plot the distribution of users by size
sns.countplot(x='size', data=user_data, palette='PuBu_r', ax=axes[1])
axes[1].set_title('User Count by Size', fontsize=16)
axes[1].set_xlabel('Size')
axes[1].set_ylabel('User Count')
plt.savefig('data/images/fig0.png', dpi=200, transparent=True) 
plt.show()

In [10]:

weight_data = user_data.loc[(user_data['weight']>=90) & (user_data['weight']<=210)]

sns.set_style('whitegrid')
fig, axes = plt.subplots(ncols=3, figsize=(15, 5))

# Plot the distribution of users by weight
sns.histplot(x='weight', data=weight_data, bins=24, color='darksalmon', kde=True, ax=axes[0])
axes[0].set_title('User Count by Weight', fontsize=16)
axes[0].set_xlabel('Weight')
axes[0].set_ylabel('User Count')
axes[0].grid(axis='x')

# Plot the distribution of users by height
sns.histplot(x='height', data=user_data, bins=24, color='midnightblue', ax=axes[1])
axes[1].set_title('User Count by Height', fontsize=16)
axes[1].set_xlabel('Height')
axes[1].set(ylabel=None)
axes[1].grid(axis='x')

# Plot the distribution of users by age
sns.histplot(x='age', data=user_data, bins=24, color='rebeccapurple', kde=True, ax=axes[2])
axes[2].set_title('User Count by Age', fontsize=16)
axes[2].set_xlabel('Age')
axes[2].set(ylabel=None)
axes[2].grid(axis='x')

plt.savefig('data/images/fig1.png', dpi=200, transparent=True) 
plt.show()

Normally distributed and diverse ranges of weight, height, and age above.

In [11]:

body_type_values = user_data['body_type'].value_counts().values
body_type_names = user_data['body_type'].value_counts().index

body_type_circle = plt.Circle((0,0), 0.7, color='white')

plt.style.use('seaborn')
plt.figure(figsize=(8,8))
colors = ['#DC7F8E', '#E5A1AA', '#F4BFBE', '#FFE0DA', '#F4C4B2', '#E8B08D', '#C68C73']

# Plot a donut chart of user body type
plt.pie(body_type_values, labels=body_type_names, colors=colors, autopct='%1.0f%%', startangle=40, pctdistance=0.85)
p = plt.gcf()
p.gca().add_artist(body_type_circle)

plt.title('User Percentage by Body Type', fontsize=16)
plt.savefig('data/images/fig2.png', dpi=200, transparent=True) 
plt.show()

In [12]:

sns.set_style('whitegrid')
fig, axes = plt.subplots(ncols=2, figsize=(16, 6))

axes[0] = plt.subplot2grid((1, 5), (0, 0))
axes[1] = plt.subplot2grid((1, 5), (0, 1), colspan=4)

# Categorize rating count into binary classes
item_count_data = user_data.copy()
item_count_data['rating_count'] = item_count_data['rating_count'].apply(lambda x: 'Only one' if x==1 else 'More than one')

# Plot the distribution of binary classes
sns.countplot(x='rating_count', data=item_count_data, palette=['#5d2349', '#861d23'], order=['Only one', 'More than one'], ax=axes[0])
axes[0].set(xlabel=None)
axes[0].set_ylabel('User Count')

rating_count_data = user_data.loc[(user_data['rating_count']>=2) & (user_data['rating_count']<=10)]

# Show the distribution of the second class
sns.countplot(x='rating_count', data=rating_count_data, palette='Reds_r', ax=axes[1])
axes[1].set_title('User Count by Number of Items Rated', fontsize=16)
axes[1].set_xlabel('Item rating count', fontsize=12)
axes[1].set(ylabel=None)

plt.savefig('data/images/fig3.png', dpi=200, transparent=True) 
plt.show()

Overall, two thirds of users rented only one item and the remaining third rented more than one, on the left chart. Majority of those who rented more than once rented exactly two items, on the right chart.

In [13]:

sns.set_style('whitegrid')
fig, axes = plt.subplots(ncols=2, figsize=(15, 5))

# Plot the distribution of users by average rating
sns.histplot(x='rating_average', data=user_data, bins=10, color='thistle', ax=axes[0])
axes[0].set_title('User Count by Average Rating', fontsize=16)
axes[0].set_xlabel('Average Rating')
axes[0].set_ylabel('User Count')
axes[0].grid(axis='x')

# Plot the distribution of users by average review length
sns.histplot(x='review_length_average', data=user_data, bins=30, color='lightsteelblue', kde=True, ax=axes[1])
axes[1].set_title('User Count by Average Review Length', fontsize=16)
axes[1].set_xlabel('Average Review Length')
axes[1].set(ylabel=None)
axes[1].grid(axis='x')

plt.savefig('data/images/fig4.png', dpi=200, transparent=True) 
plt.show()

A left-skewed distribution for the average rating per user with most of them giving the highest rating, on the left chart. And a right-skewed distribution for the average length of text review per user, on the right chart.

In [14]:

sns.set_style('whitegrid')
fig, axes = plt.subplots(ncols=2, figsize=(15, 5))

# Plot the distribution of users by month of review posted
sns.countplot(x='review_month_top', data=user_data, palette='PuBuGn', ax=axes[0])
axes[0].set_title('User Count by Top Month Rated', fontsize=16)
axes[0].set_xlabel('Top Month Rated')
axes[0].set_ylabel('User Count')

# Plot the distribution of users by season of review posted
sns.countplot(x='review_season_top', data=user_data, order=['Spring', 'Summer', 'Fall', 'Winter'], palette='PuBuGn', ax=axes[1])
axes[1].set_title('User Count by Top Season Rated', fontsize=16)
axes[1].set_xlabel('Top Season Rated')
axes[1].set(ylabel=None)

plt.savefig('data/images/fig5.png', dpi=200, transparent=True) 
plt.show()

In [15]:

sns.set_style('whitegrid')
fig, axes = plt.subplots(ncols=2, figsize=(16, 8))

# Show top five clothing categories and top five reasons for rent
category_top_data = user_data.loc[user_data['category_top'].isin(user_data['category_top'].value_counts()[:5].index.tolist())]
rented_for_top_data = user_data.loc[user_data['rented_for_top'].isin(user_data['rented_for_top'].value_counts()[:5].index.tolist())]

category_top_values = (category_top_data['category_top'].value_counts(normalize=True).values*100).tolist()
category_top_labels = category_top_data['category_top'].value_counts().index.tolist()
category_colors = ['#E5E4F4', '#E8F1DE', '#FDF9F0', '#F2E6F0', '#D9E4FB']

# Plot the distribution of users by clothing category
axes[0].pie(category_top_values, labels=category_top_labels, colors=category_colors, autopct='%.0f%%')
axes[0].set_title('User Percentage by Top 5 Clothing Category', fontsize=16)

rented_for_top_values = (rented_for_top_data['rented_for_top'].value_counts(normalize=True).values*100).tolist()
rented_for_top_labels = rented_for_top_data['rented_for_top'].value_counts().index.tolist()
rented_for_colors = ['#FBFBFB', '#FAEDDA', '#D2C1CE', '#E1CEC9', '#F3C0A1']

# Plot the distribution of users by reason for rent
axes[1].pie(rented_for_top_values, labels=rented_for_top_labels, colors=rented_for_colors, autopct='%.0f%%')
axes[1].set_title('User Percentage by Top 5 Reason for Rent', fontsize=16)

plt.savefig('data/images/fig6.png', dpi=200, transparent=True) 
plt.show()

The most common clothing categories are dresses and gowns that align with the most common reasons for renting which are for wedding, formal affair, and party.

Item Data I create a separate table for item information by grouping the data by item_id and adding new features:

fit_small, fit_large, and fit are the count of users who rated the item as too small, too large, or right fit.
user_count is the total numbers of users who rated and reviewed the item.
bust_size_top and body_type_top are the most common bust size and body type of the users who rented the item.
mean and median of the weight, height, size, and age of all users who rented the item.
rating_average is the average of all user ratings of the item.

In [16]:

def create_item_data(df):
    '''
    Groups the data by item and returns dataframe containing item details. 
    '''
    item_df = pd.DataFrame(df.groupby('item_id').count().reset_index()['item_id'])
    
    for col in df.columns:
        if col == 'fit':
            feature_small = df.loc[df[col]=='small'].groupby('item_id').count()[col]
            feature_fit = df.loc[df[col]=='fit'].groupby('item_id').count()[col]
            feature_large = df.loc[df[col]=='large'].groupby('item_id').count()[col]
            for idx, feature in enumerate([feature_small, feature_fit, feature_large]):
                item_df = item_df.join(feature, on='item_id', rsuffix=idx).fillna(0)
        if col == 'user_id':
            feature = df.groupby(df['item_id']).count()[col]
            item_df = item_df.merge(feature, on='item_id')
        if col in ['bust_size', 'body_type']:
            feature = df.sort_values('review_date', ascending=False).groupby('item_id')[col].agg(pd.Series.mode).apply(lambda x: x[0] if type(x)==np.ndarray else x)
            item_df = item_df.merge(feature, on='item_id')
        if col in ['weight', 'height', 'size', 'age']:
            feature_mean = df.groupby(df['item_id']).mean()[col]
            feature_median = df.groupby(df['item_id']).median()[col]
            for feature in [feature_mean, feature_median]:
                item_df = item_df.merge(feature, on='item_id')
        if col in ['rating', 'review_length']:
            feature = df.groupby(df['item_id']).mean()[col]
            item_df = item_df.merge(feature, on='item_id')
        if col in ['rented_for', 'category']:
            feature = df.sort_values('review_date', ascending=False).groupby('item_id')[col].agg(pd.Series.mode).apply(lambda x: x[0] if type(x)==np.ndarray else x)
            item_df = item_df.merge(feature, on='item_id')
        if col == 'rented_for':
            feature = df.groupby('item_id')[col].apply(set).apply(lambda x: list(x))
            item_df = item_df.merge(feature, on='item_id')
        if col in ['review_month', 'review_season']:
            feature = df.sort_values('review_date', ascending=False).groupby('item_id')[col].agg(lambda x: x.value_counts().index[0])
            item_df = item_df.merge(feature, on='item_id')
        else:
            continue
            
    item_df.columns = ['item_id', 'fit_small', 'fit', 'fit_large', 'user_count', 'bust_size_top', 'weight_mean', 
                       'weight_median', 'rating_average', 'rented_for_top', 'rented_for_all', 'body_type_top', 
                       'category_top', 'height_mean', 'height_median', 'size_mean', 'size_median', 'age_mean', 
                       'age_median', 'review_length_average', 'review_month_top', 'review_season_top']
    
    return item_df

Table to use for item to item recommendations later:

In [17]:

# Create new dataframe for item data
item_data = create_item_data(data)
item_data.head(2)

Out[17]:

	item_id	fit_small	fit	fit_large	user_count	bust_size_top	weight_mean	weight_median	rating_average	rented_for_top	...	category_top	height_mean	height_median	size_mean	size_median	age_mean	age_median	review_length_average	review_month_top	review_season_top
0	123373	73.00	566.00	47.00	686	36d	140.67	135.00	4.40	formal affair	...	gown	65.39	65.00	15.12	13.00	34.36	33.00	66.08	12	Winter
1	123793	65.00	1497.00	152.00	1714	34b	132.98	130.00	4.77	formal affair	...	gown	65.06	65.00	9.72	8.00	31.31	31.00	74.41	5	Winter

2 rows × 22 columns

Time Series Analysis¶

In [18]:

plt.style.use('seaborn')

time_series_data = data.sort_values('review_date').set_index('review_date', drop=True).drop('2010-11-03')

# Resample data to yearly count of reviews
yearly_data = time_series_data.resample('Y').count()
yearly_data = yearly_data.drop(yearly_data.index[-1])

# Plot the aggregated yearly count of reviews
yearly_data['rating'].plot(figsize=(8,5), colormap='PRGn', xlabel='')

plt.title('Total Count of Reviews By Year', fontsize=16)
plt.savefig('data/images/fig7.png', dpi=200, transparent=True)
plt.show()

The count of reviews increased over the years from 10,000 in 2013 to almost 70,000 by 2018.

In [19]:

# Resample data to monthly count of reviews
monthly_data = time_series_data[~(time_series_data['review_year']==2011)].resample('MS').count()
monthly_data = monthly_data.drop(monthly_data.index[-1])

# Plot the aggregated monthly count of reviews
monthly_data['rating'].plot(figsize=(8,5), colormap='seismic', xlabel='')

plt.title('Total Count of Reviews By Month', fontsize=16)
plt.savefig('data/images/fig8.png', dpi=200, transparent=True)
plt.show()

The count of reviews peak during months of spring and fall with the highest spike of over 8,000 reviews in October of 2017.

In [20]:

# Resample data to yearly average rating of reviews
yearly_data = time_series_data.resample('Y').mean()
yearly_data = yearly_data.drop(yearly_data.index[-1])

# Plot the aggregated to yearly average rating of reviews
yearly_data['rating'].plot(figsize=(8,5), colormap='PRGn', xlabel='')

plt.title('Average Rating of Reviews By Year', fontsize=16)
plt.savefig('data/images/fig9.png', dpi=200, transparent=True)
plt.show()

The average ratings steadily increased from over 4.45 in 2013 to 4.575 in 2016 but went down by less then 0.025 in 2017.

In [21]:

# Resample data to monthly average rating of reviews
monthly_data = time_series_data[~(time_series_data['review_year']==2011)].resample('MS').mean()
monthly_data = monthly_data.drop(monthly_data.index[-1])

# Plot the aggregated to monthly average rating of reviews
monthly_data['rating'].plot(figsize=(8,5), colormap='seismic', xlabel='')

plt.title('Average Rating of Reviews By Month', fontsize=16)
plt.savefig('data/images/fig10.png', dpi=200, transparent=True)
plt.show()

The average ratings peak during the latter months of the year and aligned with the higher counts of rentals in the fall.

IV. Recommendation Systems¶

"Recommendation Systems are software agents that elicit the interests and preferences of individual consumers […] and make recommendations accordingly. They have the potential to support and improve the quality of the

decisions consumers make while searching for and selecting products online." (Bo Xiao and Izak Benbasat)

To start, I create a set of generalized recommendations that are based on all the data. For all the items, I calculate a weighted rating and return the top 10 highest-rated items across the board. To personalize the recommendations, I apply the different algorithms for Content-Based Recommenders and Collaborative Filtering Systems, which I implement using the surprise library later.

Popularity Recommendations¶

Bayesian Average¶

$$W = \left(\frac{v}{v + m} \right)R + \left(\frac{m}{v + m} \right)C$$

where:

$W$ = Weighted rating
$v$ = Number of ratings for the item
$m$ = Minimum number of ratings required to be listed on top chart
$R$ = Average rating of the item
$C$ = Mean rating across the entire data

In [22]:

m = item_data['user_count'].quantile(0.9)
C = item_data['rating_average'].mean()

def weighted_rating(x, m=m, C=C):
    '''
    Calculates weighted rating based on Bayesian Average.
    '''
    v = x['user_count']
    R = x['rating_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

def popular_recommendation(df=data, n=10):
    '''
    Returns the most popular items according to the highest weighted ratings.
    '''
    item_df = create_item_data(df)
    
    top_item_ratings = item_df.loc[(item_df['user_count']>=m)]
    top_item_ratings['score'] = top_item_ratings.apply(weighted_rating, axis=1)
    top_item_ratings = top_item_ratings.sort_values('score', ascending=False)
    
    return top_item_ratings.head(n)

Top 10 Popularity-Based Recommendations:

In [23]:

pd.set_option('display.max_columns', 30)

top10_overall = popular_recommendation()
top10_overall

Out[23]:

	item_id	fit_small	fit	fit_large	user_count	bust_size_top	weight_mean	weight_median	rating_average	rented_for_top	rented_for_all	body_type_top	category_top	height_mean	height_median	size_mean	size_median	age_mean	age_median	review_length_average	review_month_top	review_season_top	score
1948	1064397	12.00	257.00	15.00	284	34b	129.45	130.00	4.89	wedding	[formal affair, other, party, wedding, date]	athletic	gown	64.81	65.00	8.14	8.00	28.65	28.00	60.74	10	Winter	4.82
1223	709832	30.00	151.00	9.00	190	34b	134.62	135.00	4.86	formal affair	[formal affair, other, party, wedding, date]	hourglass	gown	65.11	65.00	12.14	12.00	33.95	32.00	72.49	3	Winter	4.77
2260	1213427	27.00	270.00	11.00	308	34b	132.22	130.00	4.82	wedding	[formal affair, everyday, other, party, weddin...	athletic	gown	66.23	66.00	9.27	8.00	29.32	29.00	61.10	4	Spring	4.77
1608	903647	13.00	126.00	4.00	143	34b	140.04	138.00	4.88	formal affair	[formal affair, other, party, wedding, date]	athletic	gown	65.94	66.00	12.67	12.00	34.43	32.00	60.87	10	Winter	4.76
2599	1378631	20.00	302.00	8.00	330	34b	139.89	135.00	4.81	wedding	[formal affair, other, party, wedding, date, v...	hourglass	maxi	65.93	66.00	12.42	11.00	32.98	32.00	65.14	6	Spring	4.76
1	123793	65.00	1497.00	152.00	1714	34b	132.98	130.00	4.77	formal affair	[formal affair, work, other, party, wedding, d...	hourglass	gown	65.06	65.00	9.72	8.00	31.31	31.00	74.41	5	Winter	4.76
1812	1003076	5.00	122.00	5.00	132	34b	137.98	135.00	4.89	wedding	[formal affair, other, party, wedding, date, v...	athletic	dress	65.96	66.00	11.55	12.00	32.80	32.00	55.21	10	Winter	4.76
2201	1186923	9.00	91.00	3.00	103	34c	137.48	135.00	4.92	party	[formal affair, other, party, wedding, date, v...	hourglass	dress	65.33	65.00	13.53	14.00	30.33	30.00	65.70	11	Winter	4.76
2351	1260666	25.00	126.00	3.00	154	34b	130.13	130.00	4.84	party	[formal affair, other, party, wedding, date, v...	hourglass	dress	64.82	65.00	10.66	8.00	32.77	32.50	54.05	7	Spring	4.74
1236	714374	23.00	112.00	2.00	137	34b	133.23	132.00	4.85	wedding	[formal affair, work, other, party, wedding, d...	athletic	dress	65.83	66.00	10.18	8.00	31.49	30.00	55.97	10	Fall	4.74

To simulate the online shopping experience, I can also filter the popularity-based recommendations on the data features such as dress for clothing category and wedding for reason to rent using the function I define as filter_popular_recommendation.

In [24]:

column_list = []
operator_list = []
condition_list = []

def append_condition(column, operation, condition):
    '''
    Appends a filter to column, operator, and condition lists.
    ''' 
    column_list.append(column)
    operator_list.append(operation)
    condition_list.append(condition)

def filter_popular_recommendation(df=data, n=10, bust_size=None, weight=None, rating=None, rented_for=None, 
                                  body_type=None, category=None, height=None, size=None, age=None, 
                                  review_month=None, review_season=None, review_year=None):
    '''
    Returns the most popular recommendations filtered by the features passed as arguments.
    '''
    if bust_size:
        append_condition('bust_size', '==', bust_size)
    if weight:
        append_condition('weight', '>=', weight-10)
        append_condition('weight', '<=', weight+10)
    if rented_for:
        append_condition('rented_for', '==', rented_for)
    if body_type:
        append_condition('body_type', '==', body_type)
    if category:
        append_condition('category', '==', category)
    if height:
        append_condition('height', '>=', height-2)
        append_condition('height', '>=', height+2)
    if size:
        append_condition('size', '==', size)
    if age:
        append_condition('age', '>=', age-4)
        append_condition('age', '<=', age+4)
    if review_month:
        append_condition('review_month', '==', review_month)
    if review_season:
        append_condition('review_season', '==', review_season)
    if review_year:
        append_condition('review_year', '==', review_year)
    
    condition = ' & '.join(f'{col} {op} {repr(cond)}' for col, op, cond in zip(column_list, operator_list, condition_list))
    filtered_df = df.query(condition)
    
    return popular_recommendation(filtered_df, n)

def reset_condition():
    '''
    Reinitializes lists for query for filtered popularity recommender.
    '''
    column_list = []
    operator_list = []
    condition_list = []
    return column_list, operator_list, condition_list

Top 10 Popular Recommendations for Dress:

In [25]:

top10_dress = filter_popular_recommendation(category='dress')
column_list, operator_list, condition_list = reset_condition()
top10_dress

Out[25]:

	item_id	fit_small	fit	fit_large	user_count	bust_size_top	weight_mean	weight_median	rating_average	rented_for_top	rented_for_all	body_type_top	category_top	height_mean	height_median	size_mean	size_median	age_mean	age_median	review_length_average	review_month_top	review_season_top	score
1223	1003076	5.00	122.00	5.00	132	34b	137.98	135.00	4.89	wedding	[formal affair, other, party, wedding, date, v...	athletic	dress	65.96	66.00	11.55	12.00	32.80	32.00	55.21	10	Winter	4.76
1483	1186923	9.00	91.00	3.00	103	34c	137.48	135.00	4.92	party	[formal affair, other, party, wedding, date, v...	hourglass	dress	65.33	65.00	13.53	14.00	30.33	30.00	65.70	11	Winter	4.76
1584	1260666	25.00	126.00	3.00	154	34b	130.13	130.00	4.84	party	[formal affair, other, party, wedding, date, v...	hourglass	dress	64.82	65.00	10.66	8.00	32.77	32.50	54.05	7	Spring	4.74
832	714374	23.00	112.00	2.00	137	34b	133.23	132.00	4.85	wedding	[formal affair, work, other, party, wedding, d...	athletic	dress	65.83	66.00	10.18	8.00	31.49	30.00	55.97	10	Fall	4.74
847	724319	5.00	99.00	14.00	118	32d	134.67	135.00	4.86	wedding	[formal affair, work, other, party, wedding, d...	hourglass	dress	65.27	65.00	9.83	8.00	33.51	33.00	51.08	10	Summer	4.73
1378	1106101	2.00	111.00	27.00	140	36b	142.86	140.00	4.84	wedding	[formal affair, everyday, other, party, weddin...	hourglass	dress	65.81	66.00	11.57	14.00	31.01	30.00	48.90	11	Summer	4.73
154	241461	10.00	227.00	4.00	241	34d	141.32	135.00	4.78	wedding	[formal affair, everyday, other, party, weddin...	hourglass	dress	66.10	66.00	14.15	12.00	31.70	30.00	67.82	6	Spring	4.72
2540	1940985	14.00	99.00	1.00	114	34d	139.75	138.00	4.83	wedding	[formal affair, work, other, party, wedding, d...	hourglass	dress	65.60	66.00	12.93	14.00	34.98	34.00	49.37	11	Winter	4.71
1262	1031440	71.00	160.00	1.00	232	36c	138.88	135.00	4.77	party	[formal affair, other, party, wedding, date, v...	hourglass	dress	65.31	65.00	15.09	14.00	31.23	31.00	60.11	12	Winter	4.71
181	263699	1.00	105.00	38.00	144	34b	136.67	135.00	4.81	formal affair	[formal affair, other, party, wedding, date]	hourglass	dress	65.45	65.00	9.37	8.00	34.01	32.00	66.34	6	Winter	4.71

Top 10 Popular Recommendations for Wedding:

In [26]:

top10_wedding = filter_popular_recommendation(rented_for='wedding')
column_list, operator_list, condition_list = reset_condition()
top10_wedding

Out[26]:

	item_id	fit_small	fit	fit_large	user_count	bust_size_top	weight_mean	weight_median	rating_average	rented_for_top	rented_for_all	body_type_top	category_top	height_mean	height_median	size_mean	size_median	age_mean	age_median	review_length_average	review_month_top	review_season_top	score
1397	1064397	7.00	153.00	8.00	168	34b	129.88	129.00	4.87	wedding	[wedding]	athletic	gown	64.68	64.00	8.08	8.00	29.57	29.00	63.77	10	Fall	4.77
1	123793	20.00	435.00	46.00	501	34b	133.64	130.00	4.78	wedding	[wedding]	athletic	gown	64.87	65.00	10.08	8.00	31.68	31.00	76.00	10	Summer	4.75
1867	1378631	15.00	226.00	7.00	248	34c	139.42	135.00	4.81	wedding	[wedding]	athletic	maxi	65.90	66.00	11.69	9.00	32.47	31.00	67.14	6	Summer	4.75
1623	1213427	15.00	150.00	7.00	172	34b	132.06	130.00	4.82	wedding	[wedding]	athletic	gown	65.90	66.00	9.26	8.00	29.80	29.00	59.12	9	Spring	4.73
10	127865	28.00	400.00	10.00	438	34b	134.43	135.00	4.76	wedding	[wedding]	hourglass	gown	64.72	65.00	11.68	12.00	35.64	34.00	68.75	10	Fall	4.72
1110	870184	1.00	88.00	10.00	99	34c	135.56	135.00	4.86	wedding	[wedding]	hourglass	dress	65.35	65.00	9.96	8.00	32.52	31.00	60.19	7	Summer	4.72
1295	1003076	3.00	75.00	3.00	81	34b	137.58	135.00	4.89	wedding	[wedding]	athletic	dress	65.83	66.00	11.31	12.00	32.28	31.00	55.79	10	Fall	4.72
220	241461	6.00	173.00	2.00	181	34d	141.25	135.00	4.79	wedding	[wedding]	hourglass	dress	66.17	66.00	13.56	12.00	32.01	30.00	71.35	6	Summer	4.71
1767	1309537	7.00	69.00	7.00	83	34c	138.06	135.00	4.88	wedding	[wedding]	hourglass	gown	64.88	65.00	13.87	13.00	31.48	30.00	58.58	10	Summer	4.71
2315	1687082	9.00	200.00	9.00	218	34c	131.89	132.50	4.78	wedding	[wedding]	athletic	gown	65.55	65.00	9.14	8.00	31.08	30.00	66.33	5	Spring	4.71

Content-Based Recommenders¶

Content-based recommendation systems are based on the idea that if a user likes an item, the user will also like items similar to it. To measure the similarity between the items, I calculate the Pearson correlation using numerical and categorical features from the table item_data created earlier. Then, I complete a similarity_matrix of all the items to use in the function content_based_similarity I define, which generates content-based recommendations for any item_id. Lastly, I use the text features later to create a text review-based recommender using Natural Language Processing.

In [27]:

def item_similarity(item_df):
    '''
    Measures pearson correlation of items from the table of item data and returns a similarity matrix.
    '''
    item_df = item_df.drop(['fit_small', 'fit_large', 'weight_mean', 'rented_for_all', 'height_mean', 'size_mean', 
                            'age_mean', 'review_month_top'], axis=1)
    
    similarity_features = item_df[['item_id', 'fit', 'user_count', 'weight_median', 'rating_average', 'rented_for_top', 
                                 'body_type_top', 'category_top', 'height_median', 'size_median', 'age_median', 
                                 'review_length_average', 'review_season_top']]
    similarity_features = similarity_features.set_index('item_id')
    similarity_features = pd.get_dummies(similarity_features, columns=['rented_for_top', 'body_type_top', 'category_top', 'review_season_top'])
    
    similarity_matrix = similarity_features.T
    similarity_matrix = similarity_matrix.corr(method='pearson')
    
    return similarity_features, similarity_matrix

pd.set_option('display.max_columns', 30)

def content_based_similarity(similarity_matrix, item_id, n=20):
    '''
    Returns the most similar item recommendations to the given item based on the similarity matrix.
    '''
    recommendations = similarity_matrix[item_id].sort_values(ascending=False)
    recommendations = recommendations.drop([item_id], axis=0).index
    
    recommendations_list = []
    for i in range(n):
        recommendations_list.append(recommendations[i])
    
    display(item_data.loc[item_data['item_id']==item_id])
    print(f'----------------------------------------\nTop {n} Recommendations for Item #{item_id}:')
                          
    recommendations_df = item_data.loc[item_data['item_id'].isin(recommendations_list)]
    return recommendations_df

In [28]:

similarity_features, similarity_matrix = item_similarity(item_data)

In [29]:

similarity_matrix

Out[29]:

item_id	123373	123793	124204	124553	125424	125465	125564	126335	127081	127495	127865	128730	128959	129831	130259	...	2959486	2959777	2960025	2960913	2960940	2960969	2961855	2962646	2963344	2963601	2963850	2964470	2965009	2965924	2966087
item_id
123373	1.00	0.99	1.00	0.98	0.99	1.00	1.00	0.99	0.96	1.00	0.99	0.98	1.00	0.93	1.00	...	0.26	0.16	0.25	0.17	0.19	0.28	0.17	0.24	0.24	0.16	0.25	0.18	0.19	0.16	0.19
123793	0.99	1.00	0.99	0.96	0.97	1.00	0.99	1.00	0.92	0.98	1.00	0.96	0.98	0.88	0.99	...	0.14	0.05	0.14	0.06	0.08	0.17	0.06	0.13	0.13	0.05	0.14	0.07	0.08	0.05	0.08
124204	1.00	0.99	1.00	0.98	0.99	1.00	1.00	0.99	0.95	0.99	0.99	0.98	1.00	0.92	1.00	...	0.23	0.13	0.22	0.15	0.16	0.25	0.15	0.22	0.21	0.14	0.23	0.16	0.17	0.14	0.17
124553	0.98	0.96	0.98	1.00	1.00	0.98	0.99	0.97	0.97	0.99	0.96	0.98	0.98	0.96	0.99	...	0.35	0.25	0.33	0.26	0.28	0.37	0.26	0.34	0.33	0.25	0.34	0.27	0.28	0.24	0.29
125424	0.99	0.97	0.99	1.00	1.00	0.98	1.00	0.97	0.98	1.00	0.96	0.99	0.99	0.96	0.99	...	0.36	0.26	0.34	0.27	0.29	0.38	0.27	0.35	0.34	0.26	0.35	0.28	0.29	0.25	0.29
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2963850	0.25	0.14	0.23	0.34	0.35	0.21	0.28	0.13	0.50	0.33	0.16	0.42	0.31	0.59	0.22	...	1.00	0.97	0.98	0.98	0.99	1.00	0.97	1.00	1.00	0.99	1.00	0.99	0.99	0.90	0.99
2964470	0.18	0.07	0.16	0.27	0.28	0.14	0.21	0.06	0.43	0.26	0.08	0.35	0.24	0.52	0.15	...	0.99	0.99	0.97	0.98	0.99	0.99	0.98	0.99	1.00	0.99	0.99	1.00	0.99	0.90	1.00
2965009	0.19	0.08	0.17	0.28	0.29	0.15	0.22	0.07	0.44	0.27	0.10	0.36	0.25	0.53	0.16	...	0.99	0.99	0.96	0.97	1.00	0.99	0.99	1.00	1.00	1.00	0.99	0.99	1.00	0.86	0.99
2965924	0.16	0.05	0.14	0.24	0.25	0.12	0.19	0.04	0.42	0.23	0.07	0.32	0.22	0.49	0.13	...	0.91	0.81	0.96	0.96	0.86	0.87	0.80	0.88	0.88	0.87	0.90	0.90	0.86	1.00	0.91
2966087	0.19	0.08	0.17	0.29	0.29	0.15	0.23	0.07	0.45	0.27	0.09	0.36	0.25	0.54	0.16	...	0.99	0.98	0.98	0.99	0.99	0.99	0.97	0.99	0.99	0.99	0.99	1.00	0.99	0.91	1.00

5850 rows × 5850 columns

In [30]:

# Example item
content_based_similarity(similarity_matrix, 123373)

	item_id	fit_small	fit	fit_large	user_count	bust_size_top	weight_mean	weight_median	rating_average	rented_for_top	rented_for_all	body_type_top	category_top	height_mean	height_median	size_mean	size_median	age_mean	age_median	review_length_average	review_month_top	review_season_top
0	123373	73.00	566.00	47.00	686	36d	140.67	135.00	4.40	formal affair	[formal affair, work, other, party, wedding, d...	hourglass	gown	65.39	65.00	15.12	13.00	34.36	33.00	66.08	12	Winter

----------------------------------------
Top 20 Recommendations for Item #123373:

Out[30]:

	item_id	fit_small	fit	fit_large	user_count	bust_size_top	weight_mean	weight_median	rating_average	rented_for_top	rented_for_all	body_type_top	category_top	height_mean	height_median	size_mean	size_median	age_mean	age_median	review_length_average	review_month_top	review_season_top
2	124204	42.00	630.00	123.00	795	34b	136.88	135.00	4.65	party	[formal affair, work, other, party, wedding, d...	hourglass	dress	65.15	65.00	10.97	8.00	33.33	33.00	62.13	12	Winter
5	125465	150.00	720.00	13.00	883	34b	143.82	138.00	4.69	formal affair	[formal affair, work, other, party, wedding, d...	hourglass	gown	65.83	66.00	16.92	13.00	32.70	31.00	64.22	4	Spring
6	125564	98.00	443.00	66.00	607	34b	142.99	138.00	4.43	formal affair	[formal affair, everyday, other, party, weddin...	hourglass	gown	65.48	65.00	17.05	16.00	38.16	36.00	61.73	11	Winter
12	128959	44.00	447.00	25.00	516	34c	140.82	138.00	4.65	formal affair	[formal affair, work, everyday, other, party, ...	hourglass	gown	65.29	65.00	14.20	13.00	35.57	34.00	70.76	11	Winter
14	130259	54.00	643.00	216.00	913	36c	148.22	140.00	4.38	wedding	[formal affair, work, everyday, other, party, ...	hourglass	dress	65.50	65.00	18.00	16.00	36.15	35.00	65.43	1	Winter
16	131117	71.00	798.00	112.00	981	34b	139.21	135.00	4.52	formal affair	[formal affair, work, other, party, wedding, d...	athletic	gown	65.91	66.00	12.54	12.00	31.31	31.00	70.12	4	Spring
17	131533	70.00	973.00	48.00	1091	34b	140.68	138.00	4.69	formal affair	[formal affair, work, other, party, wedding, d...	hourglass	gown	65.81	66.00	13.81	12.00	33.32	31.00	64.30	5	Spring
28	136860	66.00	618.00	24.00	708	34c	138.95	135.00	4.62	wedding	[formal affair, work, other, party, wedding, d...	hourglass	sheath	65.79	66.00	12.19	12.00	33.36	32.00	62.59	6	Spring
29	137585	74.00	932.00	94.00	1100	34b	129.47	129.00	4.63	wedding	[formal affair, other, party, wedding, date, v...	athletic	sheath	64.81	65.00	7.93	8.00	30.77	31.00	57.05	4	Winter
30	138431	87.00	395.00	6.00	488	34c	139.79	138.00	4.63	formal affair	[formal affair, work, other, party, wedding, d...	hourglass	gown	65.67	66.00	13.86	12.00	34.00	32.00	66.25	4	Winter
41	144051	59.00	416.00	25.00	500	36c	146.32	140.00	4.52	wedding	[formal affair, work, other, party, wedding, d...	hourglass	sheath	65.54	65.00	19.34	16.00	34.70	33.00	55.15	10	Winter
48	146684	31.00	425.00	38.00	494	34b	133.21	131.50	4.50	formal affair	[formal affair, other, party, wedding, date, v...	hourglass	gown	64.88	65.00	10.89	11.00	34.02	33.00	69.59	1	Winter
62	152662	39.00	404.00	42.00	485	34b	136.36	135.00	4.66	formal affair	[formal affair, work, other, party, wedding, d...	hourglass	gown	65.71	66.00	12.29	12.00	33.24	32.00	70.60	5	Spring
63	152836	174.00	585.00	35.00	794	34b	125.80	125.00	4.34	party	[formal affair, other, party, wedding, date, v...	petite	mini	63.97	64.00	7.89	8.00	31.92	32.00	53.74	5	Winter
65	153475	41.00	408.00	70.00	519	34c	139.58	135.00	4.55	formal affair	[formal affair, other, party, wedding, date]	hourglass	gown	65.66	66.00	13.68	12.00	34.06	33.00	66.26	4	Winter
66	154002	37.00	537.00	49.00	623	34c	135.25	135.00	4.56	formal affair	[formal affair, other, party, wedding, date]	hourglass	gown	65.43	65.00	11.22	9.00	33.78	33.00	78.80	5	Winter
88	166633	128.00	615.00	61.00	804	34b	126.02	125.00	4.33	wedding	[formal affair, work, other, party, wedding, d...	athletic	mini	64.09	64.00	7.45	8.00	30.84	31.00	59.31	6	Spring
91	168592	29.00	402.00	98.00	529	34c	135.33	135.00	4.65	formal affair	[formal affair, party, other, wedding]	hourglass	gown	65.50	66.00	9.92	8.00	32.64	32.00	61.09	4	Spring
92	168610	44.00	429.00	59.00	532	34b	135.79	135.00	4.55	wedding	[formal affair, other, party, wedding, date]	hourglass	gown	65.58	66.00	11.10	9.00	31.60	31.00	65.23	5	Spring
1972	1076484	76.00	437.00	42.00	555	36c	142.37	138.00	4.51	wedding	[formal affair, work, other, party, wedding, d...	hourglass	dress	65.23	65.00	15.47	12.00	34.40	33.00	59.17	10	Winter

Text Review-Based Recommender¶

To recommend items based on text reviews, Natural Language Processing (NLP) is used to:

Clean the text by removing stopwords and performing lemmatization.
Create the Term Frequency-Inverse Document Frequency (TF-IDF) vectors for the documents, which are the reviews.
Compute the pairwise cosine similarity from the constructed matrix of TF-IDF scores.

In [31]:

# Import the Natural Language Toolkit (nltk)
import re
import nltk

nltk.download('stopwords')
nltk.download('wordnet')

stopwords = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/czarinaluna/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/czarinaluna/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

In [32]:

from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = nltk.stem.WordNetLemmatizer()

def preprocess(text):
    '''
    Text preprocessing to standardize, remove special characters and stopwords, and lemmatize.
    '''
    text = text.apply(lambda x: x.lower())
    text = text.apply(lambda x: re.sub(r'([^A-Za-z0-9|\s|[:punct:]]*)', '', x))
    text = text.apply(lambda x: x.replace('[^a-zA-Z#]', ' '))
    text = text.apply(lambda x: ' '.join([i for i in x.split() if len(i)>3]))
    text = text.apply(lambda x: x.split())
    text = text.apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
    text = text.apply(lambda x: [word for word in x if word not in stopwords])
    text = text.apply(lambda x: ' '.join(x))
    
    return text

def create_text_df(df=data, item_df=item_data, text_review=True, category=False):
    '''
    Creates new feature combining review summary and review text, to add to item data.
    '''
    if item_df is None:
        item_df = create_item_df(df)
        
    text_df = df.copy()
    text_df['review'] = text_df['review_summary'] + ' ' + text_df['review_text']
    text_df['review'] = text_df['review'].fillna('')
    text_df['review'] = preprocess(text_df['review'])
    
    if text_review:
        text_df = text_df[['item_id', 'review']].groupby('item_id').agg(' '.join).reset_index()
        text_item_df = item_df.merge(text_df, on='item_id')
        
    if text_review == False and category == True:
        text_df = text_df[['item_id', 'rented_for']].groupby('item_id').agg(' '.join).reset_index()
        text_item_df = item_df.merge(text_df, on='item_id')
        
    return text_item_df

In [33]:

# Create new dataframe for text item data
text_item_data = create_text_df()

In [34]:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

count = CountVectorizer()
tfidf = TfidfVectorizer(ngram_range=(1,3))

"The TF-IDF score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs." (Aditya Sharma)

Finally, to compute the cosine similarity score between the text reviews, the dot product between each TF-IDF vector is calculated in the function text_based_recommendation below:

In [35]:

from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

def text_based_recommendation(text_item_df, item_id, n=10, text_review=True, category=False):
    '''
    Returns the most similar item recommendations to the given item based on text reviews.
    '''
    if text_review:
        tfidf_matrix = tfidf.fit_transform(text_item_df['review'])
        cosine_similarity_ = linear_kernel(tfidf_matrix, tfidf_matrix)

    if text_review == False and category == True:
        count_matrix = count.fit_transform(text_item_df['rented_for'])
        cosine_similarity_ = cosine_similarity(count_matrix, count_matrix)
        
    indices = pd.Series(text_item_df.index, index=text_item_df['item_id']).drop_duplicates()
    idx = indices[item_id]
    
    similarity_scores = list(enumerate(cosine_similarity_[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    top_similarity_scores = similarity_scores[1:n+1]
    item_indices = [i[0] for i in top_similarity_scores]
    top_text_based_recommendations = text_item_df['item_id'].iloc[item_indices]
    
    display(item_data.loc[item_data['item_id']==item_id])
    print(f'----------------------------------------\nTop {n} Recommendations for Item #{item_id}:')
                          
    recommendations_df = item_data.loc[item_data['item_id'].isin(top_text_based_recommendations)]
    return recommendations_df

In [36]:

# Same example item
text_based_recommendation(text_item_data, 123373, n=10)

	item_id	fit_small	fit	fit_large	user_count	bust_size_top	weight_mean	weight_median	rating_average	rented_for_top	rented_for_all	body_type_top	category_top	height_mean	height_median	size_mean	size_median	age_mean	age_median	review_length_average	review_month_top	review_season_top
0	123373	73.00	566.00	47.00	686	36d	140.67	135.00	4.40	formal affair	[formal affair, work, other, party, wedding, d...	hourglass	gown	65.39	65.00	15.12	13.00	34.36	33.00	66.08	12	Winter

----------------------------------------
Top 10 Recommendations for Item #123373:

Out[36]:

	item_id	fit_small	fit	fit_large	user_count	bust_size_top	weight_mean	weight_median	rating_average	rented_for_top	rented_for_all	body_type_top	category_top	height_mean	height_median	size_mean	size_median	age_mean	age_median	review_length_average	review_month_top	review_season_top
1	123793	65.00	1497.00	152.00	1714	34b	132.98	130.00	4.77	formal affair	[formal affair, work, other, party, wedding, d...	hourglass	gown	65.06	65.00	9.72	8.00	31.31	31.00	74.41	5	Winter
5	125465	150.00	720.00	13.00	883	34b	143.82	138.00	4.69	formal affair	[formal affair, work, other, party, wedding, d...	hourglass	gown	65.83	66.00	16.92	13.00	32.70	31.00	64.22	4	Spring
6	125564	98.00	443.00	66.00	607	34b	142.99	138.00	4.43	formal affair	[formal affair, everyday, other, party, weddin...	hourglass	gown	65.48	65.00	17.05	16.00	38.16	36.00	61.73	11	Winter
10	127865	78.00	1278.00	37.00	1393	34b	136.03	135.00	4.72	formal affair	[formal affair, work, other, party, wedding, d...	hourglass	gown	65.17	65.00	12.04	12.00	36.35	35.00	68.93	11	Winter
17	131533	70.00	973.00	48.00	1091	34b	140.68	138.00	4.69	formal affair	[formal affair, work, other, party, wedding, d...	hourglass	gown	65.81	66.00	13.81	12.00	33.32	31.00	64.30	5	Spring
20	132738	60.00	937.00	580.00	1577	34b	144.50	138.00	4.62	formal affair	[formal affair, work, other, party, wedding, d...	hourglass	gown	65.56	66.00	15.45	13.00	32.14	32.00	78.31	5	Winter
31	139086	7.00	331.00	178.00	516	34b	137.44	135.00	4.61	formal affair	[formal affair, other, party, wedding, date]	hourglass	gown	65.45	65.00	11.99	11.00	35.49	34.00	61.45	12	Winter
33	140321	6.00	288.00	156.00	450	34c	140.17	135.00	4.54	formal affair	[formal affair, work, other, party, wedding, d...	hourglass	gown	65.70	66.00	13.66	12.00	35.42	33.50	55.51	11	Winter
47	145906	76.00	1154.00	242.00	1472	34b	135.60	135.00	4.51	formal affair	[formal affair, other, party, wedding, date, v...	hourglass	gown	65.78	66.00	10.59	9.00	30.36	30.00	64.90	4	Winter
65	153475	41.00	408.00	70.00	519	34c	139.58	135.00	4.55	formal affair	[formal affair, other, party, wedding, date]	hourglass	gown	65.66	66.00	13.68	12.00	34.06	33.00	66.26	4	Winter

Key Differences between the text-based recommendations and the content-based recommendations to the same item:

Feature	Content-based	Text-based	Item
rating_average	4.38 - 4.69	4.43 - 4.77	4.40
rented_for_top	party, formal affair, wedding	formal affair (across the board)	formal affair
body_type_top	hourglass, athlete	hourglass (across the board)	hourglass
category_top	dress, gown, sheath	gown (across the board)	gown

Collaborative Filtering Systems¶

Collaborative filtering systems recommend items to a user based on the user's past ratings and on the past ratings and preferences of other similar users. I apply the different implementations of collaborative filtering recommendation systems using the Python library surprise:

Prediction Algorithm	Description
Normal Predictor	Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
Baseline Only	Algorithm predicting the baseline estimate for given user and item.
KNN Basic	A basic collaborative filtering algorithm.
KNN Baseline	A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
KNN with Means	A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.
KNN with Z-Score	A basic collaborative filtering algorithm taking into account a baseline rating.
Single Value Decomposition	The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. When baselines are not used, this is equivalent to Probabilistic Matrix Factorization.
Single Value Decomposition ++	The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
Non-Negative Matrix Factorization	A collaborative filtering algorithm based on Non-negative Matrix Factorization.
SlopeOne	A simple yet accurate collaborative filtering algorithm.
CoClustering	A collaborative filtering algorithm based on co-clustering.

In [37]:

data = data.rename(columns={'user_id': 'userID', 'item_id': 'itemID'})

df_columns = ['userID', 'itemID', 'rating']
df = data[df_columns]

# Only use items with more than 25 ratings
df['reviews'] = df.groupby(['itemID'])['rating'].transform('count')
df = df.loc[df['reviews']>25, df_columns]

In [38]:

from surprise import Reader, Dataset

reader = Reader(rating_scale=(1,5))
read_data = Dataset.load_from_df(df, reader)

Data Modeling:

In [39]:

from surprise import NormalPredictor, BaselineOnly, SVD, SVDpp, NMF, SlopeOne, CoClustering
from surprise.model_selection import cross_validate

from surprise.prediction_algorithms import knns
sim_cos = {'name':'cosine', 'user_based':False}

evaluation = []
recommendation_systems = [NormalPredictor(), BaselineOnly(), knns.KNNBasic(sim_options=sim_cos), knns.KNNBaseline(sim_options=sim_cos), knns.KNNWithMeans(sim_options=sim_cos), knns.KNNWithZScore(sim_options=sim_cos), SVD(), SVDpp(), NMF(), SlopeOne(), CoClustering()]

# Evaluate recommendation systems using Mean Absolute Error
for system in recommendation_systems:
    score = cross_validate(system, read_data, measures=['MAE'], cv=3, verbose=False)
    evaluation.append((str(system).split(' ')[0].split('.')[-1], score['test_mae'].mean()))

pd.options.display.float_format = '{:.4f}'.format    

evaluation = pd.DataFrame(evaluation, columns=['system', 'mae'])

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.

To evaluate, I use the mean absolute error which measures the difference between the rating predicted by the model and the actual rating by the user:

In [40]:

evaluation

Out[40]:

	system	mae
0	NormalPredictor	0.6566
1	BaselineOnly	0.5398
2	KNNBasic	0.5727
3	KNNBaseline	0.5457
4	KNNWithMeans	0.5624
5	KNNWithZScore	0.5635
6	SVD	0.5359
7	SVDpp	0.5396
8	NMF	0.6980
9	SlopeOne	0.5802
10	CoClustering	0.5673

In [41]:

# Switch similarity measure from cosine to pearson
sim_pearson = {'name':'pearson', 'user_based':False}

pearson_evaluation = []
pearson_knns = [knns.KNNBasic(sim_options=sim_pearson), 
                knns.KNNBaseline(sim_options=sim_pearson), 
                knns.KNNWithMeans(sim_options=sim_pearson), 
                knns.KNNWithZScore(sim_options=sim_pearson)]

for system in pearson_knns:
    pearson_score = cross_validate(system, read_data, measures=['MAE'], cv=3, verbose=False)
    pearson_evaluation.append((str(system).split(' ')[0].split('.')[-1], pearson_score['test_mae'].mean()))

pearson_evaluation = pd.DataFrame(pearson_evaluation, columns=['system', 'mae'])
pearson_evaluation

Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.

Out[41]:

	system	mae
0	KNNBasic	0.5729
1	KNNBaseline	0.5391
2	KNNWithMeans	0.5586
3	KNNWithZScore	0.5583

The mean absolute errors of KNNBaseline, KNNWithMeans, and KNNWithZScore decreased by 0.01 and KNNBaseline becomes second to SVD.

Hyperparamater Tuning¶

GridSearchCV is performed to optimize the Single Value Decomposition models:

In [42]:

from surprise.model_selection import GridSearchCV

def grid_search(system, params):
    '''
    Implements grid search and returns best cross validation scores and parameters.
    '''
    model = GridSearchCV(system, param_grid=params, n_jobs=-1)
    model.fit(read_data)
    
    print(model.best_score)
    print(model.best_params)

In [43]:

params_svd1 = {'n_factors': [10, 50, 100], 'n_epochs': [10, 20, 100], 'lr_all': [0.001, 0.005, 0.01], 'reg_all': [0.02, 0.05, 0.1]}
grid_search(SVD, params_svd1)

{'rmse': 0.6785759649206107, 'mae': 0.5288288608687417}
{'rmse': {'n_factors': 10, 'n_epochs': 100, 'lr_all': 0.001, 'reg_all': 0.1}, 'mae': {'n_factors': 10, 'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.02}}

In [44]:

params_svdpp1 = {'n_factors': [10, 50, 100], 'n_epochs': [10, 20, 100], 'lr_all': [0.001, 0.005, 0.01], 'reg_all': [0.02, 0.05, 0.1]}
grid_search(SVDpp, params_svdpp1)

{'rmse': 0.6785226031170102, 'mae': 0.5300882259833981}
{'rmse': {'n_factors': 10, 'n_epochs': 100, 'lr_all': 0.001, 'reg_all': 0.1}, 'mae': {'n_factors': 10, 'n_epochs': 100, 'lr_all': 0.005, 'reg_all': 0.1}}

In [45]:

svd_evaluation = []

# Evaluate tuned Singular Value Decomposition models
for system in [SVD(n_factors=10, n_epochs=20, lr_all=0.01, reg_all=0.02), 
               SVDpp(n_factors=10, n_epochs=100, lr_all=0.005, reg_all=0.1)]:
    svd_score = cross_validate(system, read_data, measures=['MAE'], cv=3, verbose=False)
    svd_evaluation.append((str(system).split(' ')[0].split('.')[-1], svd_score['test_mae'].mean()))

svd_evaluation = pd.DataFrame(svd_evaluation, columns=['system', 'mae'])
svd_evaluation

Out[45]:

	system	mae
0	SVD	0.5302
1	SVDpp	0.5329

V. Results and Recommendations¶

Systems Performance:¶

In [46]:

all_systems = pd.concat([evaluation, pearson_evaluation, svd_evaluation], ignore_index=True)
all_systems

Out[46]:

	system	mae
0	NormalPredictor	0.6566
1	BaselineOnly	0.5398
2	KNNBasic	0.5727
3	KNNBaseline	0.5457
4	KNNWithMeans	0.5624
5	KNNWithZScore	0.5635
6	SVD	0.5359
7	SVDpp	0.5396
8	NMF	0.6980
9	SlopeOne	0.5802
10	CoClustering	0.5673
11	KNNBasic	0.5729
12	KNNBaseline	0.5391
13	KNNWithMeans	0.5586
14	KNNWithZScore	0.5583
15	SVD	0.5302
16	SVDpp	0.5329

The results show that the tuned Singular Value Decomposition algorithm attains the lowest Mean Absolute Error of 0.5302 with a rating scale of 1 to 5.

In [47]:

sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(20,8))
plt.subplots_adjust(bottom=0.2)

# Plot the Mean Absolute Error of the models
sns.barplot(all_systems.index, all_systems['mae'], palette='tab20b') 
ax.set(xlim=[-0.5,16.5], xlabel='Recommendation System', ylabel='Mean Absolute Error')
ax.set_title('Collaborative Filtering and Recommender Systems Evaluation', fontsize=20)

labels = ['Normal Predictor', 'Baseline Only', 'KNN Basic Cosine', 'KNN Baseline Cosine', 'KNN Means Cosine', 
          'KNN Z-Score Cosine', 'Default SVD', 'Default SVD++', 'NMF', 'Slope One', 'Co-Clustering', 'KNN Basic Pearson', 
          'KNN Baseline Pearson', 'KNN Means Pearson', 'KNN Z-Score Pearson', 'Tuned SVD', 'Tuned SVD++']
plt.xticks(all_systems.index, labels, rotation=45)

plt.savefig('data/images/fig11.png', dpi=200, transparent=True)
plt.show()

Recommender Engine:¶

In [48]:

def svd_recommendation(user_id, n=10):
    '''
    Returns top item recommendations generated by the Single Value Decomposition model.
    '''
    unique_ids = df['itemID'].unique()
    item_user_id = df.loc[df['userID']==user_id, 'itemID']
    items_to_predict = np.setdiff1d(unique_ids, item_user_id)
    
    engine = SVD(n_factors=10, n_epochs=20, lr_all=0.01, reg_all=0.02)
    engine.fit(read_data.build_full_trainset())

    svd_recommendations = []
    for i in items_to_predict:
        svd_recommendations.append((i, engine.predict(uid=user_id, iid=i).est))

    display(user_data.loc[user_data['user_id']==user_id])
    print(f'----------------------------------------\nTop {n} Recommendations for User #{user_id}:')
    
    svd_recommendations = pd.DataFrame(svd_recommendations, columns=['item_id', 'predicted_rating'])
    svd_recommendations = svd_recommendations.sort_values('predicted_rating', ascending=False).head(n)
    svd_recommendations = svd_recommendations.merge(item_data, on='item_id')
    
    return svd_recommendations

In [49]:

# Sample user
sample = user_data.sort_values('rating_count', ascending=False)
sample.loc[((user_data['bust_size']=='32a') & (user_data['height']==62))].head(5)

Out[49]:

	user_id	bust_size	rating_count	weight	rating_average	rented_for_top	body_type	category_top	height	size	age	review_length_average	review_month_top	review_season_top	rented_for_all	category_all
52769	501485	32a	42	105.0000	4.9762	party	petite	dress	62.0000	4	27.0000	33.6190	4	Spring	[formal affair, work, everyday, other, party, ...	[sheath, dress, gown, romper, skirt, shirtdres...
76995	731517	32a	17	113.0000	5.0000	party	petite	dress	62.0000	1	25.0000	59.1176	9	Summer	[formal affair, work, everyday, party, date]	[down, dress, sheath, gown, suit, top, jacket,...
89653	849603	32a	17	100.0000	5.0000	everyday	petite	dress	62.0000	1	27.0000	91.8824	6	Summer	[formal affair, work, everyday, other, party, ...	[shift, dress, gown, pants, sweater, top]
15262	145029	32a	16	115.0000	4.5000	everyday	petite	dress	62.0000	4	43.0000	14.1875	8	Summer	[work, everyday, vacation, other]	[dress, romper, maxi, skirt, culottes, jumpsui...
50530	480611	32a	12	125.0000	4.5833	party	petite	dress	62.0000	1	35.0000	80.0000	4	Fall	[party, wedding]	[sheath, dress, skirt]

In [50]:

svd_recommendation(480611)

	user_id	bust_size	rating_count	weight	rating_average	rented_for_top	body_type	category_top	height	size	age	review_length_average	review_month_top	review_season_top	rented_for_all	category_all
50530	480611	32a	12	125.0000	4.5833	party	petite	dress	62.0000	1	35.0000	80.0000	4	Fall	[party, wedding]	[sheath, dress, skirt]

----------------------------------------
Top 10 Recommendations for User #480611:

Out[50]:

	item_id	predicted_rating	fit_small	fit	fit_large	user_count	bust_size_top	weight_mean	weight_median	rating_average	rented_for_top	rented_for_all	body_type_top	category_top	height_mean	height_median	size_mean	size_median	age_mean	age_median	review_length_average	review_month_top	review_season_top
0	1215281	4.8961	4.0000	62.0000	5.0000	71	34d	131.7606	130.0000	4.9577	wedding	[formal affair, party, other, wedding]	athletic	gown	64.9437	65.0000	10.0423	8.0000	32.8592	32.0000	55.4507	10	Fall
1	1451390	4.8904	0.0000	27.0000	2.0000	29	34b	136.5862	135.0000	4.9655	wedding	[formal affair, work, everyday, party, wedding]	hourglass	maxi	66.2069	66.0000	13.7931	12.0000	34.8276	35.0000	62.7241	7	Summer
2	2546911	4.8897	1.0000	28.0000	0.0000	29	34c	130.7931	130.0000	4.9310	everyday	[work, everyday, party, wedding, date, vacation]	hourglass	pant	64.8276	65.0000	9.4483	8.0000	35.0690	35.0000	35.7931	10	Summer
3	1186923	4.8640	9.0000	91.0000	3.0000	103	34c	137.4757	135.0000	4.9223	party	[formal affair, other, party, wedding, date, v...	hourglass	dress	65.3301	65.0000	13.5340	14.0000	30.3301	30.0000	65.6990	11	Winter
4	1547051	4.8606	1.0000	44.0000	3.0000	48	34c	137.2917	135.0000	4.9167	wedding	[formal affair, party, other, wedding]	hourglass	gown	65.2708	65.0000	11.6875	12.0000	34.0833	32.0000	55.8958	11	Spring
5	371022	4.8581	1.0000	53.0000	8.0000	62	34b	138.8065	137.5000	4.9194	formal affair	[formal affair, other, party, wedding, date]	athletic	dress	65.0645	65.0000	12.6613	12.0000	36.2419	33.5000	55.7097	12	Winter
6	1200223	4.8574	3.0000	37.0000	2.0000	42	34c	142.9286	140.0000	4.9048	wedding	[formal affair, work, everyday, other, party, ...	hourglass	sheath	65.7857	66.0000	15.9048	16.0000	43.1667	41.5000	51.5476	11	Winter
7	386314	4.8559	0.0000	28.0000	9.0000	37	38d	171.7568	150.0000	4.9189	wedding	[formal affair, work, other, party, wedding, d...	hourglass	sheath	65.0000	65.0000	33.0000	32.0000	39.0000	39.0000	55.0270	11	Winter
8	740349	4.8526	1.0000	21.0000	7.0000	29	34b	136.2069	135.0000	4.8621	wedding	[formal affair, work, everyday, other, party, ...	hourglass	shift	65.1034	65.0000	12.7586	12.0000	35.6897	35.0000	41.0345	5	Spring
9	1328898	4.8524	1.0000	42.0000	1.0000	44	34b	135.7500	135.0000	4.9091	formal affair	[formal affair, other, wedding]	hourglass	gown	66.0455	66.0000	9.6136	8.0000	33.8182	32.0000	63.2727	10	Winter

VI. Further Research¶

For further research, the data should be updated with more recent rentals, and more features should be added such as prices for the products as well as product description.

Contact¶

Feel free to contact me for any questions and connect with me on Linkedin.