#!/usr/bin/env python
# coding: utf-8

# # Mozilla Feedback Analysis

# Mozilla Firefox releases various versions of its web browser over time. These versions run on different platforms and are used by people around the world. Some versions make users happy; others, not so much. A huge amount of feedback is generated by users across platforms and versions.

# #### How do we make sense of this feedback?
# How can Mozilla analyze this data to cater to users' growing needs and tackle issues swiftly?

# In this project, the aim is to analyze the feedback data and find interesting behaviours associated with it.

# The feedback is available at the following link:
# https://input.mozilla.org/en-US/?product=Firefox

# ## Part 1: Web Scraping

# ### Task: Scrape data from the site
# ### Completed: Scraped a month's worth of data (774 pages) using BeautifulSoup

# Since the data is not readily available in the form of a table, it is necessary to scrape the relevant information from the website. After scraping a sample dataset (1 week's data) earlier (results can be viewed at http://nbviewer.ipython.org/gist/priyankamandikal/18bd3c13a29d7397b883), I scraped a month's worth of data for October 2015 using BeautifulSoup, a Python library for pulling data out of HTML and XML files.

# The code for scraping is as follows:

# In[79]:

# I have commented out the scraping code here.
'''
import requests
from bs4 import BeautifulSoup
import traceback

# base url with GET dictionary
url = 'https://input.mozilla.org/en-US'
feedback1 = open('feedback1.txt', 'a')

for i in xrange(1, 774):  # feedback4: the first 774 pages are valid from 01-10-2015 to 31-10-2015
    try:
        params = dict(date_end='2015-10-31', selected='30d', product='Firefox',
                      date_start='2015-10-01', page=i)
        r = requests.get(url, params=params)
        bs = BeautifulSoup(r.text)
        for opinion in bs.findAll('li', 'opinion'):
            senti = opinion.find('span', 'sprite').contents[0]
            datetime = opinion.find('time')['datetime']
            time = opinion.find('time').string
            # not a very useful value; we want the absolute date and time.
            # To remove whitespace on either side of a string, use e.g. time.strip()
            version = time.find_next('a').find_next('a').contents[0]
            platform = version.find_next('a').contents[0]
            locale = platform.find_next('a').contents[0]
            feedback1.write(senti+'\t'+datetime+'\t'+version+'\t'+platform+'\t'+locale+'\n')
    except:
        # The try-except block was added because certain opinions in a page were giving errors,
        # as some characters couldn't be encoded by the ascii codec. E.g. in 'Norwegian Bokmal',
        # the 'a' was a different character.
        print "Error while retrieving page ", i
        print traceback.format_exc()
        continue
feedback1.close()
'''

# In[80]:

# Scraped data has been saved in feedback4.txt
get_ipython().system('head feedback4.txt')

# ## Part 2: Converting the data into a dataframe

# It is extremely important to convert the data into the right format to be able to perform operations on it.

# #### I'm using Pandas (a Python data analysis toolkit) for dataframe related operations.
# In[81]:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# In[82]:

names = ['sentiment', 'date', 'version', 'platform', 'locale']
data = pd.read_csv('feedback4.txt', delimiter='\t', names=names).dropna()
print "Number of rows: %i" % data.shape[0]
data.head()

# Now, we have the data in the right format.
# For the next part, I'm using an ad-hoc way of dealing with the datetime column: I am going to extract only the day. I'll switch to using the full date in later analysis. For now, this is a quick-and-dirty approach, but it suffices for the sample dataset that I have scraped.

# In[83]:

# Extracting just the day of the month from the date string.
# This is not good practice; ideally we should parse the entire date so that dates can be compared properly. Will do that later.
data.date = [int(d.split('-')[2]) for d in data.date]
data.head()

# In[84]:

data[['sentiment', 'version', 'platform', 'locale']].describe()

# So we can infer that the majority of the feedback has been negative.
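# As a side note, the cleaner date handling deferred above can be sketched with pandas' own datetime support. The sample strings below are made up for illustration; the real column would come from the scraped file:

```python
import pandas as pd

# Hypothetical ISO-format timestamps, standing in for the scraped 'date' column
sample = pd.DataFrame({'date': ['2015-10-01T08:30:00', '2015-10-31T22:15:00']})

# Parse full timestamps instead of slicing the string by hand
sample['date'] = pd.to_datetime(sample['date'])

# The day (or month, weekday, ...) is then available via the .dt accessor
print(sample['date'].dt.day.tolist())  # [1, 31]
```

# Parsing the full date keeps rows comparable across months, which the day-only shortcut does not.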
# The top version, platform and locale that feedback has been coming from can also be seen.

# In[85]:

data['date'].describe()

# ## Part 3: Analyze the data and make some basic plots

# We can now look at answering some questions that we may have regarding the given data.
# What exactly are we looking for? Some questions instantly come to mind.

# #### 1. Which version is causing the most problems?
# #### 2. Which platform is causing the most problems?
# #### 3. Which locales are users sending feedback from?
# #### 4. How is the positive feedback characterized?
# #### 5. Which day has had the most feedback coming in?

# Although the above are important questions, we may want to dig a little deeper into the data to find
# correlations between attributes.

# #### 6. Is there any correlation between specific versions causing problems when used on certain platforms?
# #### 7. Which versions and platforms go well together?
# I will be using Matplotlib, a Python graph-plotting library.

# I have written a custom function to be used with matplotlib to make certain modifications to the built-in graphs. It makes the graphs simple and pretty by removing unnecessary borders, etc.

# In[86]:

# Custom function to make graphs simple and pretty
get_ipython().run_line_magic('matplotlib', 'inline')

# tell pandas to display wide tables as pretty HTML tables
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

#-------------------- To remove borders from the matplotlib plots --------------------
def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks.

    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn.
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)

    # turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')

    # now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()

# ### Feedback over time

# In[87]:

# negative and positive feedback over time
# resources: http://matplotlib.org/examples/pylab_examples/histogram_demo_extended.html
#            http://matplotlib.org/examples/statistics/histogram_demo_multihist.html
# .values is needed here; the Happy series gives a KeyError without it
fig = plt.figure(1)
fig.set_size_inches(15, 5)
plt.hist([data[data.sentiment=='Sad'].date.values, data[data.sentiment=='Happy'].date.values],
         bins=np.arange(1, 32), color=['crimson', 'chartreuse'], label=['Negative', 'Positive'])
plt.xlabel('Date (day of October 2015)')
plt.title('Feedback over Time')
plt.legend(prop={'size': 15}, loc=2)
remove_border()
#fig.savefig('fig1.png', bbox_inches='tight')

# We observe that positive feedback has been low and almost constant on all days.
# Negative feedback has varied slightly with time, peaking on 31-10-2015.
# We can assume a bias in the result, because not everyone reports positive feedback, while problems are reported immediately by users.
# So we need to place more emphasis on negative feedback in our analysis in order to pinpoint the areas Mozilla needs to focus on to fix issues.

# ### Which versions are popular?

# In[88]:

# Finding the versions with the most positive and negative feedback. I am ignoring the ones
# with a count of <20. I am also ignoring the 'Unknown' column.
# Resources: http://pandas.pydata.org/pandas-docs/version/0.13.1/visualization.html
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
data[(data.sentiment=='Sad') & (data.version!='Unknown')].version.value_counts()[:9].plot(kind="bar", ax=axes[0], color='crimson')
axes[0].set_xlabel('Version')
axes[0].set_title('Versions with negative feedback')
remove_border(axes=axes[0])
data[(data.sentiment=='Happy') & (data.version!='Unknown')].version.value_counts()[:9].plot(kind="bar", ax=axes[1], color='chartreuse')
plt.ylim([0, 4500])
axes[1].set_xlabel('Version')
axes[1].set_title('Versions with positive feedback')
remove_border(axes=axes[1])
#fig.savefig('fig2.png', bbox_inches='tight')

# We find that the version causing the most problems is also the one with the most positive feedback.
# This is true for other versions as well. So it is safe to assume that Version 41.0.1 is the most popular version amongst users, as it generates the most feedback (positive and negative) overall. Version 41.0.2 is almost equally popular in terms of feedback. But as I already mentioned, the positive feedback doesn't help us much, as the numbers are small. We may have to look into the feedback text to see if there's any information in there.

# If we look at the previous two plots, something seems spooky about the last day (31st) of October. Also, versions 41.0.1 and 41.0.2 seem to be racing with each other in terms of feedback. So to investigate this unusual behaviour, I decided to plot a histogram of the two versions over time.

# ##### And the results were truly eye-opening!

# In[89]:

# negative feedback over time for Versions 41.0.1 and 41.0.2
fig = plt.figure(1)
fig.set_size_inches(15, 5)
plt.hist([data[(data.sentiment=='Sad') & (data.version=='41.0.1')].date.values,
          data[(data.sentiment=='Sad') & (data.version=='41.0.2')].date.values],
         bins=np.arange(1, 32), color=['purple', 'orange'], label=['41.0.1', '41.0.2'])
plt.xlabel('Date (day of October 2015)')
plt.title('Negative feedback of the two popular versions over time')
plt.legend(prop={'size': 15}, loc=2)
remove_border()
#fig.savefig('fig7.png', bbox_inches='tight')

# In[90]:

# positive feedback over time for Versions 41.0.1 and 41.0.2
fig = plt.figure(1)
fig.set_size_inches(15, 5)
plt.hist([data[(data.sentiment=='Happy') & (data.version=='41.0.1')].date.values,
          data[(data.sentiment=='Happy') & (data.version=='41.0.2')].date.values],
         bins=np.arange(1, 32), color=['purple', 'orange'], label=['41.0.1', '41.0.2'])
plt.xlabel('Date (day of October 2015)')
plt.title('Positive feedback of the two popular versions over time')
plt.legend(prop={'size': 15}, loc=2)
remove_border()
#fig.savefig('fig8.png', bbox_inches='tight')

# Version 41.0.1 seems to have plummeted in popularity during mid-October, just as Version 41.0.2 picked up. This suggests that an updated version of Firefox was released to users around that time, after which more feedback for it started flowing in.

# Below I have commented out some code which I used for creating csv files that I would later use in my D3.js plots.

# In[91]:

#version_count = data[(data.sentiment=='Happy') & (data.version!='Unknown')].version.value_counts()[:10]
#version_count

# In[92]:

#versions = pd.DataFrame({ 'version': version_count.keys(), 'count': version_count.values })
#versions

# In[93]:

#versions = versions.reindex_axis(sorted(versions.columns, reverse=True), axis=1)
#versions

# In[94]:

#versions.to_csv('versions_happy.csv', index=False)

# ### Which platforms are causing problems or are compatible?

# In[95]:

# Finding the platforms with the most positive and negative feedback. I am ignoring the ones
# with a low count value. I am also ignoring the 'Unknown' column.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
data[(data.sentiment=='Sad') & (data.platform!='Unknown')].platform.value_counts()[:8].plot(kind="bar", ax=axes[0], color='crimson')
axes[0].set_xlabel('Platform')
axes[0].set_title('Platforms with negative feedback')
remove_border(axes=axes[0])
data[(data.sentiment=='Happy') & (data.platform!='Unknown')].platform.value_counts()[:8].plot(kind="bar", ax=axes[1], color='chartreuse')
plt.ylim([0, 5000])
axes[1].set_xlabel('Platform')
axes[1].set_title('Platforms with positive feedback')
remove_border(axes=axes[1])
#fig.savefig('fig3.png', bbox_inches='tight')

# The Windows platforms seem to be generating the most problems. One thing to note here is that a platform's feedback count is not directly proportional to the percentage of that platform's users facing problems, because the overall number of users varies by platform.

# For example, Windows 7 generates more problem reports than Windows 8.1,
# but this does not imply that the fraction of Windows 7 users facing problems is larger than the fraction of Windows 8.1 users facing problems:
# the Windows 8.1 user base may simply be much smaller, leading to a lower feedback count.
# While dealing with the issues, Mozilla probably needs to assign equal importance to Windows 8.1 users so that it doesn't lose that user base.

# ### Where is feedback coming in from?

# In[96]:

# Finding the locales with the most positive and negative feedback. I am ignoring the ones
# with a low count value. I am also ignoring the 'Permalink' entries.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
data[(data.sentiment=='Sad') & (data.locale!='Permalink')].locale.value_counts()[:8].plot(kind="bar", ax=axes[0], color='crimson')
axes[0].set_xlabel('Locale')
axes[0].set_title('Locales with negative feedback')
remove_border(axes=axes[0])
data[(data.sentiment=='Happy') & (data.locale!='Permalink')].locale.value_counts()[:8].plot(kind="bar", ax=axes[1], color='chartreuse')
plt.ylim([0, 8000])
axes[1].set_xlabel('Locale')
axes[1].set_title('Locales with positive feedback')
remove_border(axes=axes[1])
#fig.savefig('fig4.png', bbox_inches='tight')

# Users in the USA have a huge share in providing feedback, both positive and negative.

# ## Part 4: Finding Correlations

# ### Task: Find which version/platform combinations are successful/unsuccessful
# ### Completed: Plotted a heatmap showing the distribution

# After dabbling with a lot of bar graphs and getting a sense of the general behaviour of the data, it would be interesting to see if there are any correlations between different versions and platforms. The most apt graph for this purpose is a heat map. Since the attributes are non-numerical, it is quite a task to get a heatmap out of them. But with a bit of effort (and a LOT of googling), I was successful in getting the desired output!

# In the following sections, I have explained in detail how I went about the entire process.

# In[97]:

# I'm using a dummy column to have a list of 1s as the count value for each instance. This is needed for the pivot table.
data.insert(len(data.columns), column='dummy', value=[1]*len(data))

# In[98]:

# Resources: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html
pt1 = pd.pivot_table(data[(data.sentiment=='Sad') & (data.version!='Unknown') & (data.platform!='Unknown')],
                     index='version', values='dummy', columns='platform', aggfunc=np.sum, fill_value=0)
pt1.head()  # use pt1 to look at the entire pivot table

# In[99]:

# Removing versions having a total negative feedback count of less than 100.
# Note that I haven't used a for loop, since the upper limit is fixed at the start of a for loop,
# whereas in our case nrows keeps decrementing every time a row is deleted. So I have used a
# do-while equivalent in Python.
nrows = len(pt1)
i = 0
while True:
    row = pt1.ix[i, ]
    if row.sum() < 100:
        pt1.drop(pt1.index[i], inplace=True)
        i = i - 1  # because of dropping, the row indexing changes, so read the next row from the same position
        nrows = nrows - 1
    i = i + 1
    if i >= nrows:
        break
pt1

# In[100]:

#pt1.to_csv('heatmap_sad.csv', index=False)

# In[101]:

# You can also plot a heatmap with normed values if needed. Replace pt1 with pt1_norm in the next
# code snippet, and uncomment the to_csv line below.
pt1_norm = (pt1 - pt1.mean()) / (pt1.max() - pt1.min())
pt1_norm.drop(['Maemo', 'OpenBSD amd64'], axis=1, inplace=True)
pt1_norm
#pt1_norm.to_csv('heatmap_sad.csv', index=False)

# In[102]:

# Resources:
# http://stackoverflow.com/questions/14391959/heatmap-in-matplotlib-with-pcolor
# https://plot.ly/python/heatmaps/
fig, ax = plt.subplots()
heatmap = ax.pcolor(pt1, cmap=plt.cm.Reds, alpha=0.89)

# Format
fig = plt.gcf()
fig.set_size_inches(15, 5)

# turn off the frame
ax.set_frame_on(False)

# put the major ticks at the middle of each cell
ax.set_yticks(np.arange(pt1.shape[0]) + 0.5, minor=False)
ax.set_xticks(np.arange(pt1.shape[1]) + 0.5, minor=False)

# we want a more natural, table-like display
ax.invert_yaxis()
ax.xaxis.tick_top()

# set the labels
ax.set_xticklabels(pt1.columns, minor=False)
ax.set_yticklabels(pt1.index, minor=False)

# rotate the x labels
plt.xticks(rotation=90)
ax.grid(False)

# turn off all the ticks
ax = plt.gca()
for t in ax.xaxis.get_major_ticks():
    t.tick1On = False
    t.tick2On = False
for t in ax.yaxis.get_major_ticks():
    t.tick1On = False
    t.tick2On = False

# name the axes and plot
ax.set_xlabel('Platform')
ax.set_ylabel('Version')
ax.xaxis.set_label_position('top')
plt.title('Heatmap of Version vs Platform having negative feedback', y=-0.08)
handles, labels = ax.get_legend_handles_labels()
#lgd = ax.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5,-0.1))
#fig.savefig('heatmap.png', bbox_extra_artists=(lgd,), bbox_inches='tight')
#fig.savefig('fig5.png', bbox_inches='tight')

# Voila! Here is our much-awaited heatmap!
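# As an aside, the do-while row-filtering used on the pivot tables can also be written as a single vectorized step. This is a sketch on a toy pivot table (not the scraped data), worth knowing since `.ix` indexing is deprecated in newer pandas:

```python
import pandas as pd

# Toy stand-in for the version x platform pivot table
pt = pd.DataFrame({'Windows 7': [90, 150], 'Linux': [5, 60]},
                  index=['40.0', '41.0.1'])

# Keep only the rows whose total feedback count is at least 100
pt_filtered = pt[pt.sum(axis=1) >= 100]
print(pt_filtered.index.tolist())  # ['41.0.1']
```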
# #### Some observations:

# 1. Version 41.0.1 is problematic on a lot of platforms, Windows 7 and 10 in particular.
# 2. Version 41.0.2 (which seems to be an updated version of 41.0.1) has comparatively fewer issues, but users are still facing problems, so it will have to be looked into.
# 3. Firefox probably works best with Linux, as it has a remarkably low negative feedback count.

# ### Repeating the heatmap process for positive feedback

# In[103]:

pt2 = pd.pivot_table(data[(data.sentiment=='Happy') & (data.version!='Unknown') & (data.platform!='Unknown')],
                     index='version', values='dummy', columns='platform', aggfunc=np.sum, fill_value=0)
pt2.head()  # use pt2 to look at the entire pivot table

# In[104]:

# Removing versions having a total positive feedback count of less than 50
nrows = len(pt2)
i = 0
while True:
    row = pt2.ix[i, ]
    if row.sum() < 50:
        pt2.drop(pt2.index[i], inplace=True)
        i = i - 1  # because of dropping, the row indexing changes, so read the next row from the same position
        nrows = nrows - 1
    i = i + 1
    if i >= nrows:
        break
pt2

# In[105]:

pt2.to_csv('heatmap_happy.csv', index=False)

# In[106]:

# Resources:
# http://stackoverflow.com/questions/14391959/heatmap-in-matplotlib-with-pcolor
# https://plot.ly/python/heatmaps/
fig, ax = plt.subplots()
heatmap = ax.pcolor(pt2, cmap=plt.cm.Greens, alpha=0.89)

# Format
fig = plt.gcf()
fig.set_size_inches(15, 5)

# turn off the frame
ax.set_frame_on(False)

# put the major ticks at the middle of each cell
ax.set_yticks(np.arange(pt2.shape[0]) + 0.5, minor=False)
ax.set_xticks(np.arange(pt2.shape[1]) + 0.5, minor=False)

# we want a more natural, table-like display
ax.invert_yaxis()
ax.xaxis.tick_top()

# set the labels
ax.set_xticklabels(pt2.columns, minor=False)
ax.set_yticklabels(pt2.index, minor=False)

# rotate the x labels
plt.xticks(rotation=90)
ax.grid(False)

# turn off all the ticks
ax = plt.gca()
for t in ax.xaxis.get_major_ticks():
    t.tick1On = False
    t.tick2On = False
for t in ax.yaxis.get_major_ticks():
    t.tick1On = False
    t.tick2On = False

# name the axes and plot
ax.set_xlabel('Platform')
ax.set_ylabel('Version')
ax.xaxis.set_label_position('top')
plt.title('Heatmap of Version vs Platform having positive feedback', y=-0.08)
handles, labels = ax.get_legend_handles_labels()
#lgd = ax.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5,-0.1))
#fig.savefig('heatmap.png', bbox_extra_artists=(lgd,), bbox_inches='tight')
#fig.savefig('fig6.png', bbox_inches='tight')

# We notice a similar trend in this heatmap as well, though it should be noted that the positive counts are much smaller than the negative ones; a heatmap is always relative to the data used to plot it.

# An interesting observation is that Linux users seem to be happy with Version 8! So that's a bonus! Version 8 also didn't have too many negative reviews coming from Linux users.

# Finally, we need to delete the dummy column that we created.

# In[107]:

# deleting the dummy column
data.drop('dummy', axis=1, inplace=True)
data.head()

# ## Final Thoughts

# In this analysis, Mozilla Firefox feedback data was analyzed and certain insights were drawn about user behaviour and version/platform dependency.

# #### Further Analysis:
# 1. I hope to scrape a larger amount of data (3 months' worth) and run the analysis again.
# 2. It might help to explore some group properties.
# 3. An extensive time-series analysis could be performed on data over a longer period to determine whether issues have been resolved or new ones have cropped up after a version update.
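# As a sketch of the time-series direction in point 3: once full timestamps are kept, feedback could be bucketed by week with pandas resampling. The counts below are invented for illustration:

```python
import pandas as pd

# Illustrative daily negative-feedback counts over two weeks of October 2015
idx = pd.date_range('2015-10-01', periods=14, freq='D')
daily = pd.Series(range(14), index=idx)

# Weekly totals smooth out day-to-day noise and expose the trend
weekly = daily.resample('W').sum()
print(weekly)
```

# The same resampling applied to real scraped data, split by sentiment or version, would show whether negative feedback subsides after an update.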