Mozilla Feedback Analysis

Mozilla Firefox releases various versions of its web browser over time. These versions are run on different platforms by users around the world. Some versions may make users happy, while some others, not so much. There is a huge amount of feedback that is generated by users across platforms and versions.

How do we make sense of this feedback?
How can Mozilla analyze this data in order to cater to users' growing needs and tackle issues swiftly?

In this project, the aim is to analyze the feedback data and find interesting behaviours associated with it.
The feedback is available at the following link.
https://input.mozilla.org/en-US/?product=Firefox

Part 1: Web Scraping

Task: Scrape data from the site

Completed: Scraped a month's worth of data (774 pages) using BeautifulSoup

Since the data is not readily available as a table, the relevant information has to be scraped from the website. I had earlier scraped a sample dataset of one week's data (results can be viewed here http://nbviewer.ipython.org/gist/priyankamandikal/18bd3c13a29d7397b883). For this analysis I scraped a month's worth of data, for October 2015, using BeautifulSoup, a Python library for pulling data out of HTML and XML files.

The code for scraping is as follows:

In [79]:
#I have commented out the scraping code here.

'''
import io
import traceback

import requests
from bs4 import BeautifulSoup

#base url; the query parameters are passed as a GET dictionary
url = 'https://input.mozilla.org/en-US'
#open the file as UTF-8 so that non-ASCII characters in locale names can be written
feedback1 = io.open('feedback1.txt', 'a', encoding='utf-8')
for i in range(1, 775):
#feedback4 - 1st 774 pages are valid from 01-10-2015 to 31-10-2015 (range end is exclusive, hence 775)
	try:
		params = dict(date_end='2015-10-31', selected='30d', product='Firefox', date_start='2015-10-01', page=i)
		r = requests.get(url, params=params)
		bs = BeautifulSoup(r.text, 'html.parser')
		for opinion in bs.find_all('li', 'opinion'):
			senti = opinion.find('span', 'sprite').contents[0]
			datetime = opinion.find('time')['datetime']
			time = opinion.find('time').string #not a very useful value. We want the absolute date and time
			#to remove whitespace on either side of a string, use e.g. time.strip()
			version = time.find_next('a').find_next('a').contents[0]
			platform = version.find_next('a').contents[0]
			locale = platform.find_next('a').contents[0]
			feedback1.write(senti+'\t'+datetime+'\t'+version+'\t'+platform+'\t'+locale+'\n')
	except Exception:
		#the try/except block was added because certain opinions on a page were giving errors:
		#some characters couldn't be encoded by the ascii codec. E.g. in (Norwegian Bokmal), the 'a' was a non-ASCII character
		print("Error while retrieving page %d" % i)
		print(traceback.format_exc())
		continue

feedback1.close()
'''
In [80]:
# Scraped data has been saved in feedback4.txt
!head feedback4.txt
Sad	2015-10-31-08:00	41.0.2	Windows 7	Spanish (Argentina)
Sad	2015-10-31-08:00	41.0.2	Windows 7	Dutch
Sad	2015-10-31-08:00	41.0.2	Windows 7	Russian
Sad	2015-10-31-08:00	41.0.2	Windows 7	English (US)
Sad	2015-10-31-08:00	41.0.2	Windows Vista	English (US)
Sad	2015-10-31-08:00	10.0	Windows XP	English (US)
Sad	2015-10-31-08:00	42.0	OS X	English (US)
Sad	2015-10-31-08:00	41.0.2	Windows 7	German
Sad	2015-10-31-08:00	41.0.2	Windows 8.1	English (US)
Sad	2015-10-31-08:00	40.0	Windows XP	English (US)

Part 2: Converting the data into a dataframe

It is extremely important to convert the data into the right format to be able to perform operations on it.

In [81]:
import matplotlib.pyplot as plt
import matplotlib.pylab as P
import pandas as pd
import numpy as np
In [82]:
names = ['sentiment', 'date', 'version', 'platform', 'locale']
data = pd.read_csv('feedback4.txt', delimiter='\t', names=names).dropna()
print("Number of rows: %i" % data.shape[0])
data.head()
Number of rows: 15031
Out[82]:
   sentiment              date  version       platform               locale
0        Sad  2015-10-31-08:00   41.0.2      Windows 7  Spanish (Argentina)
1        Sad  2015-10-31-08:00   41.0.2      Windows 7                Dutch
2        Sad  2015-10-31-08:00   41.0.2      Windows 7              Russian
3        Sad  2015-10-31-08:00   41.0.2      Windows 7         English (US)
4        Sad  2015-10-31-08:00   41.0.2  Windows Vista         English (US)

Now we have the data in the right format.
For the next part, I handle the date column in an ad-hoc way: I extract only the day of the month. This is a quick and dirty approach, but it suffices for the dataset that I have scraped, since all of it falls within a single month (October 2015). In later analysis I plan to use the full date.

In [83]:
# extracting the day of the month from date
# This is not good practice, since only the day is kept and the month and year are discarded
# Ideally we should parse the entire date so that dates can be compared. Will do that later.

data.date = [int(d.split('-')[2]) for d in data.date]
data.head()
Out[83]:
   sentiment  date  version       platform               locale
0        Sad    31   41.0.2      Windows 7  Spanish (Argentina)
1        Sad    31   41.0.2      Windows 7                Dutch
2        Sad    31   41.0.2      Windows 7              Russian
3        Sad    31   41.0.2      Windows 7         English (US)
4        Sad    31   41.0.2  Windows Vista         English (US)
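
The split above keeps only the day of the month. As a rough sketch of what parsing the full date could look like, the snippet below applies pd.to_datetime to a few made-up values in the same format as the scraped column. It assumes that the first 10 characters are the calendar date (YYYY-MM-DD) and ignores the trailing part; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Made-up values in the same format as the scraped 'date' column.
raw = pd.Series(['2015-10-31-08:00', '2015-10-01-08:00', '2015-10-15-08:00'])

# Assumption: the first 10 characters are the calendar date (YYYY-MM-DD);
# whatever follows is ignored in this sketch.
parsed = pd.to_datetime(raw.str[:10], format='%Y-%m-%d')

# The day of the month is still available, matching int(d.split('-')[2]).
day = parsed.dt.day

# Unlike bare day numbers, full timestamps sort correctly across months.
earliest = parsed.min()
```

With real timestamps in place, data spanning several months can be sorted and compared, which the day-only integers cannot support.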
In [84]:
data[['sentiment', 'version', 'platform', 'locale']].describe()
Out[84]:
        sentiment  version   platform        locale
count       15031    15031      15031         15031
unique          2       95         48            54
top           Sad   41.0.1  Windows 7  English (US)
freq        12442     4780       5765          8491

So we can infer that the majority of the feedback has been negative: 12,442 of the 15,031 rows (about 83%) are 'Sad'.
The most frequent version (41.0.1), platform (Windows 7) and locale (English (US)) can also be seen.
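
The share of negative feedback can also be computed directly rather than read off the describe() output. The sketch below rebuilds a stand-in sentiment column from the two counts reported above (12442 'Sad' out of 15031 rows); on the real data the same call would be applied to data.sentiment. The variable names are illustrative.

```python
import pandas as pd

# Stand-in for data.sentiment, rebuilt from the counts reported by describe().
sentiment = pd.Series(['Sad'] * 12442 + ['Happy'] * (15031 - 12442))

# normalize=True returns fractions of the total instead of raw counts.
share = sentiment.value_counts(normalize=True)
sad_pct = round(share['Sad'] * 100, 1)
```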

In [85]:
data['date'].describe()
Out[85]:
count    15031.000000
mean        15.303506
std          8.913861
min          1.000000
25%          7.000000
50%         15.000000
75%         23.000000
max         31.000000
Name: date, dtype: float64

Part 3: Analyze the data and make some Basic plots

We can now look at answering some questions that we may have regarding the given data.
What exactly are we looking for? Some questions instantly come to mind.

1. Which version is causing a lot of problems?

2. Which platform is causing the most problems?

3. Which locales are users sending in feedback from?

4. How is the positive feedback characterized?

5. Which day has had maximum feedback coming in?

Although the above are important questions, we may want to dig a little deeper into the data to find certain
correlations between attributes.

6. Is there any correlation between specific versions causing problems when used on certain platforms?

7. Which versions and platforms go well together?


I will be using Matplotlib, a Python graph-plotting library.

I have written a custom function that is used with matplotlib to modify its built-in graphs. It makes the graphs simple and pretty by removing unnecessary borders, etc.

In [86]:
# Custom function to make graphs simple and pretty

%matplotlib inline

#tell pandas to display wide tables in full, without wrapping or truncating columns
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

#--------------------------To remove borders from the matplotlib plots-------------------------
def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()

Feedback over time

In [87]:
#negative and positive feedback over time
#resources: http://matplotlib.org/examples/pylab_examples/histogram_demo_extended.html
#           http://matplotlib.org/examples/statistics/histogram_demo_multihist.html

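#Passing .values (plain NumPy arrays) avoids the KeyError that the 'Happy' Series raises
#Bin edges run from 1 to 32 so that every day, including the 30th and the 31st, gets its own bin

fig = plt.figure(1)
fig.set_size_inches(15, 5)
#figure(figsize=(8,6))
plt.hist([data[data.sentiment=='Sad'].date.values, data[data.sentiment=='Happy'].date.values],
         bins=np.arange(1,33),
         color=['crimson','chartreuse'],
         label=['Negative','Positive'])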
plt.xlabel('Date (day of October 2015)')
plt.title('Feedback over Time')
plt.legend(prop={'size': 15}, loc=2)
remove_border()

#handles, labels = ax.get_legend_handles_labels()
#fig.savefig('fig1.png', bbox_inches='tight')

We observe that positive feedback is low and almost constant across all days.
Negative feedback varies somewhat with time, and the tallest bar is at the end of the month (31-10-2015). This peak should be read with caution: when the last histogram bin edge is 31, the final bin is closed on both sides, so feedback from the 30th and the 31st is counted together.
We can expect a bias in the result, because users tend to report problems immediately, while not everyone who is satisfied sends positive feedback.
So the analysis needs to place more emphasis on negative feedback in order to pinpoint the areas Mozilla should focus on to fix issues.
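
Daily counts can also be tabulated directly, which avoids any dependence on histogram bin edges. The following is a minimal sketch on a toy frame with the same column names as the notebook's data; the rows are invented for illustration.

```python
import pandas as pd

# Toy rows with the notebook's column names; the values are invented.
toy = pd.DataFrame({
    'sentiment': ['Sad', 'Sad', 'Happy', 'Sad', 'Sad', 'Happy'],
    'date':      [29,    30,    30,      31,    31,    31],
})

# One row per day, one column per sentiment, each cell an exact count.
per_day = pd.crosstab(toy['date'], toy['sentiment'])
```

A table like per_day can be passed to per_day.plot(kind='bar') to draw one pair of bars per day, with each day guaranteed its own position.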

Which versions are popular?

In [88]:
#Finding the versions with the most positive and negative feedback. Only the 9 versions with the
#highest counts are plotted, and rows whose version is 'Unknown' are ignored.
#Resources: http://pandas.pydata.org/pandas-docs/version/0.13.1/visualization.html

fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(15,5))

data[(data.sentiment=='Sad') & (data.version!='Unknown')].version.value_counts()[:9].plot(kind="bar",
                                                                                        ax=axes[0],color='crimson')
axes[0].set_xlabel('Version')
axes[0].set_title('Versions with negative feedback')
remove_border(axes=axes[0])


data[(data.sentiment=='Happy') & (data.version!='Unknown')].version.value_counts()[:9].plot(kind="bar",
                                                                                    ax=axes[1],color='chartreuse')
axes[1].set_ylim([0,4500])  #set the y-range of the right-hand panel explicitly
axes[1].set_xlabel('Version')
axes[1].set_title('Versions with positive feedback')
remove_border(axes=axes[1])

#fig.savefig('fig2.png', bbox_inches='tight')

We find that the version with the most negative feedback is also the one with the most positive feedback.
This is true for other versions as well. So Version 41.0.1 appears to be the most widely used version among the users giving feedback, as it generates the most feedback (positive and negative) overall, with Version 41.0.2 close behind. But as I had already mentioned, the positive feedback doesn't really help us as the number is small. We may have to look into the feedback text to see if there's any information in there.
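
Because the busiest versions lead on both positive and negative counts, a per-version share of negative feedback may be easier to compare than raw counts. The sketch below shows one way to compute it on a toy frame; the rows are invented, and the helper column name is_sad is an assumption introduced for illustration.

```python
import pandas as pd

# Toy rows with the notebook's column names; the values are invented.
toy = pd.DataFrame({
    'version':   ['41.0.1'] * 4 + ['41.0.2'] * 4 + ['42.0'] * 2,
    'sentiment': ['Sad', 'Sad', 'Sad', 'Happy',
                  'Sad', 'Sad', 'Happy', 'Happy',
                  'Sad', 'Sad'],
})

# The mean of a boolean column is the fraction of True values,
# so 'sad_share' is the share of negative feedback per version.
summary = (toy.assign(is_sad=toy['sentiment'].eq('Sad'))
              .groupby('version')['is_sad']
              .agg(total='count', sad_share='mean'))
```

Keeping the total next to the share makes it clear when a share rests on only a handful of reports.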

If we look at the previous two plots, something seems to be spooky regarding the last day (31st) of October. Also, versions 41.0.1 and 41.0.2 seem to be really racing with each other in terms of feedback. So to investigate this unusual behaviour, I decided to plot a histogram of the two versions with time.

And the results were truly eye-opening!
In [89]:
#negative feedback over time for Versions 41.0.1 and 41.0.2

fig = plt.figure(1)
fig.set_size_inches(15, 5)
plt.hist([data[(data.sentiment=='Sad') & (data.version=='41.0.1')].date.values,
          data[(data.sentiment=='Sad') & (data.version=='41.0.2')].date.values],
         bins=np.arange(1,33),  #edges 1..32, so that the 30th and the 31st get separate bins
         color=['purple','orange'],
         label=['41.0.1','41.0.2'])
plt.xlabel('Date (day of October 2015)')
plt.title('Negative feedback of the two popular versions over time')
plt.legend(prop={'size': 15}, loc=2)
remove_border()

#fig.savefig('fig7.png', bbox_inches='tight')
In [90]:
#positive feedback over time for Versions 41.0.1 and 41.0.2

fig = plt.figure(1)
fig.set_size_inches(15, 5)
plt.hist([data[(data.sentiment=='Happy') & (data.version=='41.0.1')].date.values,
          data[(data.sentiment=='Happy') & (data.version=='41.0.2')].date.values],
         bins=np.arange(1,33),  #edges 1..32, so that the 30th and the 31st get separate bins
         color=['purple','orange'],
         label=['41.0.1','41.0.2'])
plt.xlabel('Date (day of October 2015)')
plt.title('Positive feedback of the two popular versions over time')
plt.legend(prop={'size': 15}, loc=2)
remove_border()

#fig.savefig('fig8.png', bbox_inches='tight')

Feedback for Version 41.0.1 seems to have plummeted in mid-October, which is when feedback for Version 41.0.2 picks up. This suggests that 41.0.2 was released to users as an update around that time, and that feedback then started flowing in for the new version instead.

Below I have commented out some code which I used for creating csv files that I would later on use in my D3.js plots.

In [91]:
#version_count = data[(data.sentiment=='Happy') & (data.version!='Unknown')].version.value_counts()[:10]
#version_count
In [92]:
#versions = pd.DataFrame({ 'version': version_count.keys(), 'count': version_count.values })
#versions
In [93]:
#versions = versions.reindex_axis(sorted(versions.columns, reverse=True), axis=1)
#versions
In [94]:
#versions.to_csv('versions_happy.csv', index=False)

Which platforms are causing problems or are compatible?

In [95]:
#Finding the platforms with the most positive and negative feedback. Only the 8 platforms with the
#highest counts are plotted, and rows whose platform is 'Unknown' are ignored.

fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(15,5))

data[(data.sentiment=='Sad') & (data.platform!='Unknown')].platform.value_counts()[:8].plot(kind="bar",
                                                                                        ax=axes[0],color='crimson')
axes[0].set_xlabel('Platform')
axes[0].set_title('Platforms with negative feedback')
remove_border(axes=axes[0])


data[(data.sentiment=='Happy') & (data.platform!='Unknown')].platform.value_counts()[:8].plot(kind="bar",
                                                                                    ax=axes[1],color='chartreuse')
axes[1].set_ylim([0,5000])  #set the y-range of the right-hand panel explicitly
axes[1].set_xlabel('Platform')
axes[1].set_title('Platforms with positive feedback')
remove_border(axes=axes[1])

#fig.savefig('fig3.png', bbox_inches='tight')

The Windows platforms seem to be generating the most problems. One thing to note here is that a platform's feedback count is not directly proportional to the percentage of that platform's users who face problems. This is because the overall number of users varies from platform to platform.

For example, Windows 7 problems > Windows 8.1 problems.
But this does not imply that %(Windows 7 users facing problems) > %(Windows 8.1 users facing problems),
because there may be far fewer Windows 8.1 users than Windows 7 users, which by itself leads to a lower feedback count.

While dealing with the issues, Mozilla probably needs to give equal importance to Windows 8.1 users so that it doesn't lose that user base.
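
The difference between a count and a rate can be made concrete with a small worked example. All numbers below are hypothetical: neither the report counts nor the user counts come from the scraped data, which contains no information about how many users each platform has.

```python
# Hypothetical numbers, chosen only to illustrate count versus rate.
reports = {'Windows 7': 5000, 'Windows 8.1': 1500}      # negative reports
users = {'Windows 7': 1000000, 'Windows 8.1': 200000}   # installed base

# Rate = negative reports per user on that platform.
rate = {p: reports[p] / float(users[p]) for p in reports}

# Windows 7 has more reports in total...
more_reports = max(reports, key=reports.get)
# ...but in this made-up case Windows 8.1 has the higher rate per user.
higher_rate = max(rate, key=rate.get)
```

Here the platform with fewer reports is the one whose users are more likely to report a problem, which is exactly the situation that the raw bar chart cannot reveal.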

Where is feedback coming in from?

In [96]:
#Finding the locales with the most positive and negative feedback. Only the 8 locales with the
#highest counts are plotted, and rows whose locale was scraped as 'Permalink' are ignored.

fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(15,5))

data[(data.sentiment=='Sad') & (data.locale!='Permalink')].locale.value_counts()[:8].plot(kind="bar",
                                                                                    ax=axes[0],color='crimson')
axes[0].set_xlabel('Locale')
axes[0].set_title('Locales with negative feedback')
remove_border(axes=axes[0])


data[(data.sentiment=='Happy') & (data.locale!='Permalink')].locale.value_counts()[:8].plot(kind="bar",
                                                                                    ax=axes[1],color='chartreuse')
axes[1].set_ylim([0,8000])  #set the y-range of the right-hand panel explicitly
axes[1].set_xlabel('Locale')
axes[1].set_title('Locales with positive feedback')
remove_border(axes=axes[1])

#fig.savefig('fig4.png', bbox_inches='tight')

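Users with the English (US) locale account for a huge share of the feedback, both positive and negative (8,491 of the 15,031 rows overall). Note that a locale is a language setting rather than a location, so it is only a rough indication of where users are.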

Part 4: Finding Correlations

Task: Find which version/platform combinations are successful/unsuccessful

Completed: Plotted a heatmap showing the distribution

After dabbling with a lot of bar graphs and getting a sense of the general behaviour of the data, it would be interesting to see if there are any correlations between versions and platforms. The most apt graph for this purpose is a heat map. Since the attributes are non-numerical, it is quite a task to get a heatmap out of them. But with a bit of effort (and a LOT of googling), I was successful in getting the desired output!

In the following sections, I have explained in detail how I went about the entire process.

In [97]:
#I'm using a dummy column holding a list of 1s as the count value for each row. This is needed for the pivot table.

data.insert(len(data.columns), column='dummy', value=[1]*len(data))
In [98]:
#Resources: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html

pt1 = pd.pivot_table(data[(data.sentiment=='Sad') & (data.version!='Unknown') & (data.platform!='Unknown')], 
                    index='version', values='dummy', columns='platform', aggfunc=np.sum, fill_value=0)
pt1.head()
#use pt1 to look at the entire pivot table
Out[98]:
platform  Android 4.2.2  Android 4.3  Android 5.0.2  Fedora  FreeBSD  Linux  Maemo  OS X  OpenBSD amd64  Windows 10  Windows 2000  Windows 7  Windows 8  Windows 8.1  Windows NT  Windows NT 4.10  Windows NT 9.0  Windows Vista  Windows XP  X18
version
10.0                  0            0              0       0        0      2      0     0              0           0             0          2          0            0           0                0               0              0           5    0
10.0.2                0            0              0       0        0      0      0     0              0           0             0          1          0            0           0                0               0              0           0    0
11.0                  0            0              0       0        0      0      0     0              0           0             0         14          4            0           0                0               0              0           9    0
12.0                  0            0              0       0        0      0      0     0              0           0             0         14          3            0           0                0               0              1           8    0
13.0                  0            0              0       0        0      1      0     0              0           0             0         10          4            0           0                0               0              0           4    0
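
A count table of this shape can also be built without a dummy column. The sketch below uses pd.crosstab, which counts co-occurrences of two columns directly; it is offered as an alternative under the assumption that only counts are needed, and the toy rows are invented.

```python
import pandas as pd

# Toy rows with the notebook's column names; the values are invented.
toy = pd.DataFrame({
    'sentiment': ['Sad', 'Sad', 'Sad', 'Happy', 'Sad'],
    'version':   ['41.0.1', '41.0.1', '41.0.2', '41.0.1', 'Unknown'],
    'platform':  ['Windows 7', 'Windows 7', 'Linux', 'Windows 7', 'Linux'],
})

# Same filter as the pivot table: negative feedback with known version and platform.
sad = toy[(toy['sentiment'] == 'Sad')
          & (toy['version'] != 'Unknown')
          & (toy['platform'] != 'Unknown')]

# Rows are versions, columns are platforms, cells are counts (0 where absent).
counts = pd.crosstab(sad['version'], sad['platform'])
```

With this approach there is no helper column to add beforehand or delete afterwards.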
In [99]:
#Removing versions having a total negative feedback count of less than 100
#Note that I haven't used a for loop since the upper limit is taken to be non-changing in for loops.
#But in our case, we keep decrementing nrows every time a row is deleted. So I have used a do-while equivalent
#in Python.

nrows = len(pt1)
i = 0
while True:
    row = pt1.iloc[i]  #.iloc selects the row by position (.ix is not available in recent pandas)
    if(row.sum()<100):
        pt1.drop(pt1.index[i],inplace=True)
        i=i-1 #Because of dropping, indexing of rows changes. So need to read next row from same position
        nrows = nrows-1
    i = i+1
    if(i>=nrows):
        break

pt1
Out[99]:
platform  Android 4.2.2  Android 4.3  Android 5.0.2  Fedora  FreeBSD  Linux  Maemo  OS X  OpenBSD amd64  Windows 10  Windows 2000  Windows 7  Windows 8  Windows 8.1  Windows NT  Windows NT 4.10  Windows NT 9.0  Windows Vista  Windows XP  X18
version
40.0                  0            0              0       0        0      3      0    11              0          18             0         55          1           14           0                0               0              2          19    0
40.0.3                0            0              0       0        0     12      0    44              0          87             0        185         10           56           2                0               0             11          77    0
41.0                  0            0              1       0        1     75      0   107              0         297             1        646         22          122           0                1               0             59         136    0
41.0.1                1            0              0       2        0    121      0   351              0         905             3       1680         73          399           1                0               1            143         484    0
41.0.2                0            1              0       3        0     96      0   325              0         943             0       1544         55          402           0                0               0            147         487    1
42.0                  0            1              0       0        0     10      0    65              0         186             0        237         18           42           0                0               0             82          65    0
44.0a1                0            0              0       0        1     41      0    15              0         107             0         74          3           20           0                0               0              2           7    0
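
The same row filter can be written without an explicit loop by using a boolean mask on the row sums. This is a sketch of an equivalent approach, not the code that produced the table above; the toy table reuses a few Linux and Windows 7 counts from the pivot table for three of the versions.

```python
import pandas as pd

# A small slice of the pivot table: Linux and Windows 7 counts for three versions.
toy = pd.DataFrame(
    {'Linux': [2, 75, 121], 'Windows 7': [2, 646, 1680]},
    index=pd.Index(['10.0', '41.0', '41.0.1'], name='version'),
)

# sum(axis=1) adds up each row; the comparison yields one True/False per version,
# and indexing with that mask keeps only the rows that reach the threshold.
kept = toy[toy.sum(axis=1) >= 100]
```

Since nothing is dropped while iterating, there is no index bookkeeping to get wrong.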
In [100]:
#pt1.to_csv('heatmap_sad.csv', index=False)
In [101]:
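# A heatmap can also be plotted with normalized values if needed: replace pt1 with pt1_norm in the next code snippet.
# Each platform column is mean-centred and divided by its range. Maemo and OpenBSD amd64 are all zeros
# in the filtered table, so they cannot be normalized and are dropped.
# Uncomment the last line below to save the normalized table.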

pt1_norm = (pt1 - pt1.mean()) / (pt1.max() - pt1.min())
pt1_norm.drop(['Maemo','OpenBSD amd64'], axis=1, inplace=True)
pt1_norm
#pt1_norm.to_csv('heatmap_sad.csv', index=False)
Out[101]:
platform  Android 4.2.2  Android 4.3  Android 5.0.2     Fedora    FreeBSD      Linux       OS X  Windows 10  Windows 2000  Windows 7  Windows 8  Windows 8.1  Windows NT  Windows NT 4.10  Windows NT 9.0  Windows Vista  Windows XP        X18
version
40.0          -0.142857    -0.285714      -0.142857  -0.238095  -0.285714  -0.407990  -0.353361   -0.373282     -0.190476  -0.354813  -0.347222    -0.352356   -0.214286        -0.142857       -0.142857      -0.425616   -0.339881  -0.142857
40.0.3        -0.142857    -0.285714      -0.142857  -0.238095  -0.285714  -0.331719  -0.256303   -0.298687     -0.190476  -0.274813  -0.222222    -0.244109    0.785714        -0.142857       -0.142857      -0.363547   -0.219048  -0.142857
41.0          -0.142857    -0.285714       0.857143  -0.238095   0.714286   0.202179  -0.071008   -0.071660      0.142857   0.008879  -0.055556    -0.074006   -0.214286         0.857143       -0.142857      -0.032512   -0.096131  -0.142857
41.0.1         0.857143    -0.285714      -0.142857   0.428571  -0.285714   0.592010   0.646639    0.585637      0.809524   0.645187   0.652778     0.639912    0.285714        -0.142857        0.857143       0.546798    0.628869  -0.142857
41.0.2        -0.142857     0.714286      -0.142857   0.761905  -0.285714   0.380145   0.570168    0.626718     -0.190476   0.561495   0.402778     0.647644   -0.214286        -0.142857       -0.142857       0.574384    0.635119   0.857143
42.0          -0.142857     0.714286      -0.142857  -0.238095  -0.285714  -0.348668  -0.194538   -0.191660     -0.190476  -0.242813  -0.111111    -0.280191   -0.214286        -0.142857       -0.142857       0.126108   -0.244048  -0.142857
44.0a1        -0.142857    -0.285714      -0.142857  -0.238095   0.714286  -0.085956  -0.341597   -0.277066     -0.190476  -0.343121  -0.319444    -0.336892   -0.214286        -0.142857       -0.142857      -0.425616   -0.364881  -0.142857
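
The normalization used here subtracts each column's mean and divides by that column's range (max minus min). A column whose values are all equal has a range of zero, so its normalized values come out as NaN. The sketch below, on a toy table that reuses three Windows 7 counts and an all-zero Maemo column, drops such columns automatically rather than by name; this is an assumption about a more general approach, not the original code.

```python
import pandas as pd

# Three Windows 7 counts from the filtered table plus an all-zero column.
toy = pd.DataFrame(
    {'Windows 7': [55, 646, 1680], 'Maemo': [0, 0, 0]},
    index=pd.Index(['40.0', '41.0', '41.0.1'], name='version'),
)

# Column-wise mean normalization: (x - mean) / (max - min).
normed = (toy - toy.mean()) / (toy.max() - toy.min())

# Constant columns give 0/0 = NaN everywhere; drop any column that is entirely NaN.
normed = normed.dropna(axis=1, how='all')
```

After this step every remaining column has mean 0 and spans a range of exactly 1, so columns with very different raw counts share one colour scale.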
In [102]:
#Resources:
#http://stackoverflow.com/questions/14391959/heatmap-in-matplotlib-with-pcolor
#https://plot.ly/python/heatmaps/

fig, ax = plt.subplots()
heatmap = ax.pcolor(pt1, cmap=plt.cm.Reds, alpha=0.89)

# Format
fig = plt.gcf()
fig.set_size_inches(15,5)

# turn off the frame
ax.set_frame_on(False)

# put the major ticks at the middle of each cell
ax.set_yticks(np.arange(pt1.shape[0]) + 0.5, minor=False)
ax.set_xticks(np.arange(pt1.shape[1]) + 0.5, minor=False)

# want a more natural, table-like display
ax.invert_yaxis()
ax.xaxis.tick_top()

# Set the labels
ax.set_xticklabels(pt1.columns, minor=False)
ax.set_yticklabels(pt1.index, minor=False)

# rotate the x labels
plt.xticks(rotation=90)

ax.grid(False)

# Turn off all the ticks
ax = plt.gca()

# Zero-length tick marks hide the ticks but keep the labels. This replaces the
# per-tick tick1On/tick2On attributes, which are not available in recent matplotlib.
ax.tick_params(axis='both', which='both', length=0)
    
# name the axes and plot
ax.set_xlabel('Platform')
ax.set_ylabel('Version') 
ax.xaxis.set_label_position('top')
plt.title('Heatmap of Version vs Platform having negative feedback',y=-0.08)

handles, labels = ax.get_legend_handles_labels()
#lgd = ax.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5,-0.1))
#fig.savefig('heatmap.png', bbox_extra_artists=(lgd,), bbox_inches='tight')
#fig.savefig('fig5.png', bbox_inches='tight')

Voila! Here is our long-awaited heatmap!

Some observations:

  1. Version 41.0.1 is in general problematic on a lot of platforms, Windows 7 and Windows 10 in particular.
  2. Version 41.0.2 (which seems to be an updated version of 41.0.1) has only slightly fewer negative reports (4,004 against 4,164 in the table above), so users are still facing problems. It will have to be looked into.
  3. Linux has a remarkably low negative feedback count, so Firefox may work best on Linux. However, the low count could also simply reflect a smaller number of Linux users.

Repeating the heatmap process for positive feedback

In [103]:
pt2 = pd.pivot_table(data[(data.sentiment=='Happy') & (data.version!='Unknown') & (data.platform!='Unknown')], 
                    index='version', values='dummy', columns='platform', aggfunc=np.sum, fill_value=0)
pt2.head()
#use pt2 to look at the entire pivot table
Out[103]:
platform  Android 4.1.2  Android 4.2.2  Android 5.1.1  CrOS i686 3912.101.0  FC Linux i686  Fedora  Linux  Maemo  OS X  Windows 10  Windows 2000  Windows 7  Windows 8  Windows 8.1  Windows NT  Windows Vista  Windows XP
version
10.0                  0              0              0                     0              0       0      0      0     0           0             0          3          0            0           0              1           1
10.0.2                0              0              0                     0              0       0      0      0     0           0             0          1          0            0           0              0           1
11.0                  0              0              0                     0              0       0      0      0     0           0             0          8          2            0           0              0           6
12.0                  0              0              0                     0              0       0      0      0     0           0             0          8          3            0           1              0          10
13.0                  0              0              0                     0              0       0      0      0     0           0             0          7          5            0           0              0           5
In [104]:
#Removing versions having a total positive feedback count of less than 50

nrows = len(pt2)
i = 0
while True:
    row = pt2.iloc[i]  #.iloc selects the row by position (.ix is not available in recent pandas)
    if(row.sum()<50):
        pt2.drop(pt2.index[i],inplace=True)
        i=i-1 #Because of dropping, indexing of rows changes. So need to read next row from same position
        nrows = nrows-1
    i = i+1
    if(i>=nrows):
        break

pt2
Out[104]:
platform  Android 4.1.2  Android 4.2.2  Android 5.1.1  CrOS i686 3912.101.0  FC Linux i686  Fedora  Linux  Maemo  OS X  Windows 10  Windows 2000  Windows 7  Windows 8  Windows 8.1  Windows NT  Windows Vista  Windows XP
version
40.0.3                0              0              0                     0              0       0      5      0     8          12             0         45          0           10           0              3          17
41.0                  0              0              1                     0              0       0     30      0    21          51             0        110          6           36           1              8          33
41.0.1                1              1              0                     0              0       0     22      0    38         110             0        204          6           51           0             20         143
41.0.2                0              0              0                     1              0       0     24      0    45         121             0        222         11           47           0             20         123
42.0                  0              0              0                     0              0       0      3      0     9          25             0         80          2           10           0              7          32
44.0a1                0              0              0                     0              0       0      9      0     2          18             0         20          1            5           0              0           1
8.0                   0              0              0                     0              0       0     73      0     0           0             0          5          3            0           0              0           5
In [105]:
pt2.to_csv('heatmap_happy.csv', index=False)
In [106]:
#Resources:
#http://stackoverflow.com/questions/14391959/heatmap-in-matplotlib-with-pcolor
#https://plot.ly/python/heatmaps/

fig, ax = plt.subplots()
heatmap = ax.pcolor(pt2, cmap=plt.cm.Greens, alpha=0.89)

# Format
fig = plt.gcf()
fig.set_size_inches(15,5)

# turn off the frame
ax.set_frame_on(False)

# put the major ticks at the middle of each cell
ax.set_yticks(np.arange(pt2.shape[0]) + 0.5, minor=False)
ax.set_xticks(np.arange(pt2.shape[1]) + 0.5, minor=False)

# want a more natural, table-like display
ax.invert_yaxis()
ax.xaxis.tick_top()

# Set the labels
ax.set_xticklabels(pt2.columns, minor=False)
ax.set_yticklabels(pt2.index, minor=False)

# rotate the x labels
plt.xticks(rotation=90)

ax.grid(False)

# Turn off all the ticks
ax = plt.gca()

# Zero-length tick marks hide the ticks but keep the labels. This replaces the
# per-tick tick1On/tick2On attributes, which are not available in recent matplotlib.
ax.tick_params(axis='both', which='both', length=0)
    
# name the axes and plot
ax.set_xlabel('Platform')
ax.set_ylabel('Version') 
ax.xaxis.set_label_position('top')
plt.title('Heatmap of Version vs Platform having positive feedback',y=-0.08)

handles, labels = ax.get_legend_handles_labels()
#lgd = ax.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5,-0.1))
#fig.savefig('heatmap.png', bbox_extra_artists=(lgd,), bbox_inches='tight')
#fig.savefig('fig6.png', bbox_inches='tight')

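We notice a similar trend in this heatmap as well. But it should be noted that the positive counts are much smaller than the negative ones. A heatmap is always relative to the data used for plotting it, so the colours of the two heatmaps cannot be compared directly.
An interesting observation is that Linux users seem to be happy with Version 8.0 (73 positive reports)! So that's a bonus! Version 8.0 also didn't have many negative reviews coming from Linux users: it does not appear in the filtered negative table at all, which means it received fewer than 100 negative reports in total.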
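
Since the two heatmaps are on different scales, one way to combine them is to compute, for each version/platform cell, the share of feedback that is negative. The sketch below does this for four cells, using the Linux and Windows 7 counts for versions 41.0.1 and 41.0.2 from the two pivot tables; the variable names are illustrative, and the approach is a suggestion rather than part of the original analysis.

```python
import pandas as pd

idx = pd.Index(['41.0.1', '41.0.2'], name='version')

# Counts taken from the negative and positive pivot tables.
sad = pd.DataFrame({'Linux': [121, 96], 'Windows 7': [1680, 1544]}, index=idx)
happy = pd.DataFrame({'Linux': [22, 24], 'Windows 7': [204, 222]}, index=idx)

# fill_value=0 keeps a cell's count when it is present in only one of the tables.
total = sad.add(happy, fill_value=0)

# Fraction of all feedback in each cell that is negative.
sad_share = sad.div(total)
```

In these four cells the negative share is high on both platforms, which suggests that the low raw count for Linux says more about volume than about satisfaction.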



Finally we need to delete the dummy column that we had created.

In [107]:
#deleting the dummy column

data.drop('dummy', axis=1, inplace=True)
data.head()
Out[107]:
   sentiment  date  version       platform               locale
0        Sad    31   41.0.2      Windows 7  Spanish (Argentina)
1        Sad    31   41.0.2      Windows 7                Dutch
2        Sad    31   41.0.2      Windows 7              Russian
3        Sad    31   41.0.2      Windows 7         English (US)
4        Sad    31   41.0.2  Windows Vista         English (US)

Final Thoughts

In this analysis, Mozilla Firefox feedback data was analyzed, and certain insights were drawn about user behaviour and version/platform dependency.

Further Analysis:

  1. I hope to scrape a larger amount of data (3 months' worth) and run the analysis again.
  2. It might help to explore some group properties.
  3. An extensive time-series analysis can be performed for data over a longer period of time to determine whether issues have been resolved or new ones have cropped up after a version update