#!/usr/bin/env python
# coding: utf-8

# # Who Reviews the Reviewers?

# # Introduction
# 
# This is an artificial intelligence learning project. The purpose of this project is to apply and explore A.I. concepts on a real-world problem. For our topic we have chosen to investigate Amazon user reviews and their helpfulness to customers. Specifically, we are going to apply basic procedures from the A.I. subfield of machine learning to see if we can create a model that will allow us to predict when a review is considered helpful to customers solely by looking at the content of the review. The main M.L. methods we will attempt are random forests and count vectors.
# 
# # Load Data
# 
# Our data comes from UCSD professor Julian McAuley's collection of Amazon review data. These reviews were written and submitted at some point between May 1996 and July 2014.
# 
# Each of the data files we have added is "5-core." That is, they have been filtered down from a larger dataset in such a way that the remaining reviews are on products that have at least 5 reviews, and each review is written by a reviewer who has written at least 5 reviews. Duplicate reviews have been removed to an extent. There may still be instances of duplicate reviews, as in the case where different user accounts post a copy of the same review, but such duplicates account for less than 1 percent of the data.
# 
# Each of the 4 data files we have included contains reviews from a distinct product category. In this code block we join the data from all four sources into one pandas dataframe. We also add a column to the dataset called "category" and label each of the reviews with a shorthand name for its category:
# * Hardware - Tools and home improvement, 134,476 reviews
# * Beauty - Beauty products, 198,502 reviews
# * Games - Toys and games, 167,597 reviews
# * Pets - Pet supplies, 157,836 reviews
# 
# In all, that is a total starting dataset of about 650k reviews.

# In[1]:


get_ipython().run_line_magic('matplotlib', 'inline')
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


# In[2]:


dfHardware = pd.read_json('data/reviews_Tools_and_Home_Improvement_5.json.gz', lines=True, compression='infer')
dfHardware["category"] = "Hardware"

dfBeauty = pd.read_json('data/reviews_Beauty_5.json.gz', lines=True, compression='infer')
dfBeauty["category"] = "Beauty"

dfGames = pd.read_json('data/reviews_Toys_and_Games_5.json.gz', lines=True, compression='infer')
dfGames["category"] = "Games"

dfPets = pd.read_json('data/reviews_Pet_Supplies_5.json.gz', lines=True, compression='infer')
dfPets["category"] = "Pets"

frames = [dfHardware, dfBeauty, dfGames, dfPets]
df = pd.concat(frames)


# In[3]:


df.head()


# In[4]:


len(df)


# At this point we create the "helpfulness" column. The value is simply the ratio of positive "helpful" votes to the total number of votes (both helpful and unhelpful) on a review.

# In[5]:


df["total_votes"] = df.helpful.str[1]
df["helpful_votes"] = df.helpful.str[0]
df["helpfulness"] = df.helpful.str[0] / df.helpful.str[1]
df.drop(columns=["helpful"], inplace=True)
df.reset_index(drop=True, inplace=True)


# # Cleaning
# 
# In cleaning the dataset, the most important filtering we want to do is to set a baseline number of total votes that each review must have. This is important because votes are our sole standard for classifying the reviews. A review with too few votes is likely to have a helpfulness score that is volatile and inaccurate compared to the reading it would have given more votes.
# And of course, reviews with zero votes are mathematically useless for our modeling. We will set our baseline at a minimum of 5 votes.

# In[6]:


df = df[df.total_votes >= 5]
df.reset_index(drop=True, inplace=True)
len(df)


# We can immediately see a stark reduction in the size of the dataset. In fact, it has been reduced to about a tenth of its original size. What this tells us is that the vast majority of the reviews in the dataset have few to no votes at all. That is an unfortunate fact for our project, because it means a great reduction in the amount of useful data we have to model on.
# 
# It will be useful to investigate what correlations may exist in the way people choose to vote on reviews. To gain some insight into this, we create a scatter plot of total votes vs helpfulness on the remaining data.

# In[7]:


fig, axc = plt.subplots(figsize=(10, 10))
axc.scatter(df.total_votes, df.helpfulness)
axc.set_title('Total Votes vs Helpfulness', fontsize=30)
axc.set_xlabel('Total Votes', fontsize=20)
axc.set_ylabel('Helpfulness', fontsize=20)


# There definitely seems to be a correlation. The more votes a review has, the higher the likelihood that the review is positively scored as helpful. It would not benefit our modeling to skew the dataset one way or the other, helpful or unhelpful, so we can cut away those reviews that have a very large number of votes.

# In[8]:


df = df[df.total_votes <= 100]
df.reset_index(drop=True, inplace=True)
len(df)


# Placing a ceiling of 100 votes does not significantly impact the amount of data we have left to work with, or at least not nearly as significantly as placing the floor value did.
# 
# If we make a scatter plot of this new, zoomed-in picture, we can make out more detail.

# In[9]:


fig, axc = plt.subplots(figsize=(10, 10))
axc.scatter(df.total_votes, df.helpfulness)
axc.set_title('Total Votes vs Helpfulness', fontsize=30)
axc.set_xlabel('Total Votes', fontsize=20)
axc.set_ylabel('Helpfulness', fontsize=20)


# Now we can clearly see patterns emerge and offer interpretations of them.
# 
# As we move along the total-votes axis, we see that review helpfulness ratings follow a kind of logarithmic curve. The curves tend high, center, or low depending on starting position. It would appear that reviews judged as helpful, unhelpful, or neutral early on tend toward those respective helpfulness ratings as more votes are amassed. However, only the reviews that tend toward positive helpfulness continue to gain total votes, while the non-helpful review curves fade off. There are some possible explanations for this.
# 
# * Review readers are less inclined to issue negative votes. When faced with an unhelpful review, readers are more hesitant or less motivated to pass judgment. This may also help explain the vast numbers of unrated reviews.
# * Amazon senses which reviews are trending as helpful and disproportionately gives them more exposure to customers in an effort to maximize their utility as helpful reviews. By the same token, reviews that begin trending unhelpful are hidden away or buried to the point that readers will barely see them, never mind vote on them.
# 
# The implication of these trends is that the majority of reviews in the usable dataset are skewed toward a high helpfulness rating. It may seem from the above graph that we could lower our total-votes ceiling to bring the spread of helpful to unhelpful reviews closer to the center, but the 2D histogram of the same variables shows that this would be futile.
# In[10]:


fig, ax = plt.subplots(figsize=(10, 10))
ax.hist2d(df.total_votes, df.helpfulness)
ax.set_title('Total Votes vs Helpfulness', fontsize=20)
ax.set_xlabel('Total Votes', fontsize=15)
ax.set_ylabel('Helpfulness', fontsize=15)
plt.show();


# The dataset is concentrated in the region of low vote counts and high helpfulness. We are not going to achieve an even helpfulness distribution by filtering away any more of our data, so we can stop here.
# 
# However, we are going to use a random forest to perform a binary classification on this data, so it will benefit us to find a labeling (helpful, unhelpful) for each review such that both classes are balanced in their number of entries. To achieve this we select a high helpfulness-rating cutoff. We found a rating of .87 or greater to be a good, balanced cutoff point. In the following code block we label our data with a new column "helpful" taking a value of either 0 or 1, indicating whether a review is helpful or not. The percentage distribution of the two cases is printed below.

# In[11]:


df["helpful"] = np.where(df["helpfulness"] >= .87, 1, 0)
df.helpful.value_counts(normalize=True)


# # Exploration
# 
# In this section we map the binary helpful value of the review against some high-level and meta traits to see if we can gain insight that may help guide our analysis. In the first plot we have a 2D histogram of review length vs helpful. The histogram is just two rows of blocks showing the density spread of the reviews along the given axis. Any discrepancy in colorization between the two rows will tell us whether there is a correlation between the length of the review and whether it is considered helpful.

# In[12]:


df['textLength'] = df['reviewText'].str.len()

fig, ax = plt.subplots(figsize=(10, 2))
ax.hist2d(df.textLength, df.helpful, bins=(300, 2))
ax.set_title('Review Length vs Helpful', fontsize=20)
ax.set_xlabel('Review Length (characters)', fontsize=15)
ax.set_ylabel('Helpful', fontsize=15)
plt.yticks(range(0, 2))
plt.axis([0, 2000, 0, 1])


# We can see that the majority of reviews are centered around the 300-character area for both helpful and non-helpful reviews. Reviews labeled helpful do, however, tend to be slightly longer than their unhelpful counterparts. The difference is not huge, but it is there.
# 
# In the next bit of exploration we introduce the concept of "readability" through the textstat library. Readability is a measure of how complex a piece of text is. The higher the readability score a text gets, the higher the level of education necessary for proper comprehension. There are many different models for measuring the readability of text, and textstat implements a wide array of them. For this exploration we used the Automated Readability Index (ARI). ARI uses an equation that relates the ratios of characters to words and words to sentences to assign a score that approximates a grade-school reading level; a rough sketch of the formula is shown below.
# 
# After evaluating all of the review texts based on ARI, we can see in the results that there is hardly any distinction in the histogram. Most reviews, helpful and unhelpful alike, fall within the same 6th- to 9th-grade readability level.
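# For reference, ARI is commonly given as a linear formula in the characters-per-word and words-per-sentence ratios. The minimal sketch below is only illustrative (the `ari_score` helper is our own and not part of textstat or the original notebook); the actual scores in this project come from textstat's implementation.

# In[ ]:


import re

def ari_score(text):
    # Rough ARI sketch: 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43.
    # Assumes sentences end in '.', '!' or '?'; textstat handles edge cases better.
    words = text.split()
    chars = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    return 4.71 * (chars / max(1, len(words))) + 0.5 * (len(words) / sentences) - 21.43

ari_score("This is a short example review. It is easy to read.")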
# In[13]:


import textstat

df['readability'] = df['reviewText'].apply(textstat.automated_readability_index)

fig, ax = plt.subplots(figsize=(10, 2))
ax.hist2d(df.readability, df.helpful, bins=(900, 2))
ax.set_title('Helpful vs Readability Score', fontsize=20)
ax.set_xlabel('Readability (grade level)', fontsize=15)
ax.set_ylabel('Helpful', fontsize=15)
plt.yticks(range(0, 2))
plt.axis([0, 20, 0, 1])


# One more bit of exploration we will conduct is to compare the "overall" field of the review with helpful. Overall represents the out-of-five score that the reviewer gave the product.
# 
# This histogram holds some interesting information. It shows that the vast majority of product reviews in our data are overwhelmingly positive toward their products. It seems that, as with the helpful votes, people are by and large much more inclined to pass positive judgment than negative. It may also indicate that Amazon customers are generally very satisfied with their shopping experience.
# 
# In any case, there is a discernible difference between the helpful and unhelpful spreads. Reviews considered unhelpful tilt more toward unsatisfied reviewers; it seems that negative reviews attract unhelpful votes. Without considering the psychological or business-analytics implications too deeply, one idea we can come away with is this: review text whose language or vocabulary leans positive or contented may correlate with a helpful label, and vice versa.

# In[14]:


fig, ax = plt.subplots(figsize=(10, 2))
ax.hist2d(df.overall, df.helpful, bins=(5, 2))
ax.set_title('Overall vs Helpful', fontsize=20)
ax.set_xlabel('Overall', fontsize=15)
ax.set_ylabel('Helpful', fontsize=15)
plt.yticks(range(0, 2))
plt.xticks(range(1, 6))
plt.show();


# One way we can investigate this conjecture is through sentiment analysis. Here "sentiment" refers to a measure of the positivity/negativity of a text. The TextBlob library offers a basic sentiment evaluation implementation that we can use to label the review data. The sentiment rating is on a scale from -1 to 1, with -1 signifying negative sentiment and 1 signifying positive.
# 
# The results seem to mimic those of the "overall" exploration. Reviews labeled helpful do skew a bit more into the positive range, and reviews labeled unhelpful skew more negative. Both, however, are centered in just about the same place, right in the neutral/mildly positive region of around .2.

# In[15]:


from textblob import TextBlob

df['sentiment'] = df['reviewText'].apply(lambda text: TextBlob(text).sentiment.polarity)

fig, ax = plt.subplots(figsize=(10, 2))
ax.hist2d(df.sentiment, df.helpful, bins=(15, 2))
ax.set_title('Sentiment vs Helpful', fontsize=20)
ax.set_xlabel('Sentiment (polarity)', fontsize=15)
ax.set_ylabel('Helpful', fontsize=15)
plt.yticks(range(0, 2))
plt.xticks(range(-1, 2))
plt.show();


# # Random Forest
# 
# As mentioned, we are attempting a binary classification with a random forest. We have already labeled our reviews as either helpful or not. The next step is to train a random forest on a portion of our data, called the training set, so that it can make accurate predictions on the leftover portion of the data, called the test set. We will split the data into the two sets with an 80-20 training-test split. In the following code block we perform the dataset split.
# In[16]:


from sklearn.model_selection import train_test_split

x = df['reviewText']
y = df['helpful']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=100)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))


# The other important part of random forest modeling is features. Features are what the random forest algorithm uses to classify the data. As we have committed to using only the review text itself, we will have to generate features from it. That is where count vectors come in.
# 
# ## Count Vectors
# 
# Count vectors are a specialized way of counting the frequencies of words and producing features for a text based on those word counts. One of the most important aspects of a count vector is the stop list. This is simply a list of words that we choose to ignore when creating our count vectors. Here we set up a stop list of common function words such as "and" and "the", along with punctuation.

# In[17]:


import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string

stop_list = stopwords.words('english')
stop_list += list(string.punctuation)
stop_list += ['br', '.<', '..', '...', '``', "''", '--']


# At this point we are ready to train our random forest on count-vector features. Here we create the random forest classifier with an n_estimators value of 100, meaning 100 decision trees, and an n_jobs argument of -1, which utilizes all available processors (very useful). The accuracy of the classifier on the test set comes out to about 60 percent, which is not amazing but definitely better than guessing classifications at random. Maybe there are ways we can do better.

# In[18]:


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

pipe_rf = Pipeline([('vectorizer', CountVectorizer(stop_words=stop_list, max_features=300)),
                    ('forest', RandomForestClassifier(n_estimators=100, n_jobs=-1))])
pipe_rf.fit(X_train, y_train)


# In[19]:


y_pred = pipe_rf.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, y_pred))
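# Accuracy alone does not show how the errors are distributed between the two classes. As an optional sanity check (not part of the original analysis), the confusion_matrix and classification_report functions already imported above can break the test-set predictions down per class:

# In[ ]:


# Optional per-class breakdown of the test-set predictions.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['unhelpful', 'helpful']))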
# ## Vector Count Split
# 
# Let's see if we can get better results by building the random forest classifiers separately on each of the four categories. After all, each review category may have its own lexicon and characteristic descriptions.

# In[20]:


def vector_forest(df):
    x = df['reviewText']
    y = df['helpful']
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=100)
    print(y_train.value_counts(normalize=True))
    print(y_test.value_counts(normalize=True))

    pipe_rf = Pipeline([('vectorizer', CountVectorizer(stop_words=stop_list, max_features=300)),
                        ('forest', RandomForestClassifier(n_estimators=100, n_jobs=-1))])
    pipe_rf.fit(X_train, y_train)

    y_pred = pipe_rf.predict(X_test)
    print('Test accuracy:', accuracy_score(y_test, y_pred))


# In[21]:


print("Hardware: ")
dfHardware = df[df.category == "Hardware"].copy()
dfHardware["helpful"] = np.where(dfHardware["helpfulness"] >= .89, 1, 0)
print(len(dfHardware))
print(dfHardware.helpful.value_counts(normalize=True))
print()

print("Beauty: ")
dfBeauty = df[df.category == "Beauty"].copy()
dfBeauty["helpful"] = np.where(dfBeauty["helpfulness"] >= .82, 1, 0)
print(len(dfBeauty))
print(dfBeauty.helpful.value_counts(normalize=True))
print()

print("Games: ")
dfGames = df[df.category == "Games"].copy()
dfGames["helpful"] = np.where(dfGames["helpfulness"] >= .88, 1, 0)
print(len(dfGames))
print(dfGames.helpful.value_counts(normalize=True))
print()

print("Pets: ")
dfPets = df[df.category == "Pets"].copy()
dfPets["helpful"] = np.where(dfPets["helpfulness"] >= .92, 1, 0)
print(len(dfPets))
print(dfPets.helpful.value_counts(normalize=True))
print()


# Here we run the four separate random forest classifiers. Unfortunately, the results trend downward in all four cases. The predictive power of the sum is better than that of its parts. There is one more idea we can try, though.

# In[22]:


print("Hardware:")
vector_forest(dfHardware)
print()

print("Beauty:")
vector_forest(dfBeauty)
print()

print("Games:")
vector_forest(dfGames)
print()

print("Pets:")
vector_forest(dfPets)
print()


# ## Count Vector Descriptive
# 
# In this final attempt we try to count-vectorize only words that are descriptive in nature. For our purposes that means words classified as either adjectives or adverbs. A good review may, after all, make good use of descriptive language. We had a couple of ideas about how to do this, but neither did very well. The first had a running time that was too long, and the second yielded another suboptimal result along with a large UserWarning having to do with word normalization (that is, with reducing words to a normalized form).
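# As a quick illustration of the part-of-speech tags involved (using a made-up sentence, not one from the dataset), TextBlob tags each word with Penn Treebank labels; adjective tags start with "JJ" and adverb tags with "RB":

# In[ ]:


# Illustrative only: TextBlob part-of-speech tags on a made-up sentence.
sample = TextBlob("This sturdy little wrench works remarkably well")
print(sample.tags)
print([word for word, tag in sample.tags if tag.startswith(("JJ", "RB"))])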
# In[23]:


# First idea: add an entry to each review containing all of its descriptive words.
# def get_adjectives(text):
#     blob = TextBlob(text)
#     return [word for (word, tag) in blob.tags if (tag.startswith("JJ") or tag.startswith("RB"))]
# dfDescriptors = df['reviewText'].apply(get_adjectives)


# In[24]:


import nltk
nltk.download('words')
from nltk.corpus import words

all_words = words.words()
non_descriptive_words = []
for word in all_words:
    blob = TextBlob(word)
    word, tag = blob.tags[0]
    if not (tag.startswith("JJ") or tag.startswith("RB")):
        non_descriptive_words.append(word)

len(non_descriptive_words)


# In[25]:


pipe_forest_nd = Pipeline([('vectorizer', CountVectorizer(stop_words=non_descriptive_words, max_features=300)),
                           ('forest', RandomForestClassifier(n_estimators=100, n_jobs=-1))])
pipe_forest_nd.fit(X_train, y_train)

y_pred_nd = pipe_forest_nd.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, y_pred_nd))


# # Conclusion
# 
# In the end, the best-performing model was the first attempt, with a test accuracy of about 60%. All other attempts trailed this value by a significant margin.