Introduction to Non-Personalized Recommenders¶

The recommendation problem¶

Recommenders have been around since at least 1992. Today we see different flavours of recommenders, deployed across different verticals:

Amazon
Netflix
Facebook
Last.fm.

What exactly do they do?

Definitions from the literature¶

In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997

Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al, 1992

In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005

Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012

Notation¶

$U$ is the set of users in our domain. Its size is $|U|$.
$I$ is the set of items in our domain. Its size is $|I|$.
$I(u)$ is the set of items that user $u$ has rated.
$-I(u)$ is the complement of $I(u)$ i.e., the set of items not yet seen by user $u$.
$U(i)$ is the set of users that have rated item $i$.
$-U(i)$ is the complement of $U(i)$.

Goal of a recommendation system¶

$ \newcommand{\argmax}{\mathop{\rm argmax}\nolimits} \forall{u \in U},\; i^* = \argmax_{i \in -I(u)} [S(u,i)] $

Problem statement¶

The recommendation problem in its most basic form is quite simple to define:

|-------------------+-----+-----+-----+-----+-----|
| user_id, movie_id | m_1 | m_2 | m_3 | m_4 | m_5 |
|-------------------+-----+-----+-----+-----+-----|
| u_1               | ?   | ?   | 4   | ?   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_2               | 3   | ?   | ?   | 2   | 2   |
|-------------------+-----+-----+-----+-----+-----|
| u_3               | 3   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_4               | ?   | 1   | 2   | 1   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_5               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_6               | 2   | ?   | 2   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_7               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_8               | 3   | 1   | 5   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_9               | ?   | ?   | ?   | ?   | 2   |
|-------------------+-----+-----+-----+-----+-----|

Given a partially filled matrix of ratings ($|U|x|I|$), estimate the missing values.

Challenges¶

Availability of item metadata¶

Content-based techniques are limited by the amount of metadata that is available to describe an item. There are domains in which feature extraction methods are expensive or time consuming, e.g., processing multimedia data such as graphics, audio/video streams. In the context of grocery items for example, it's often the case that item information is only partial or completely missing. Examples include:

Ingredients
Nutrition facts
Brand
Description
County of origin

New user problem¶

A user has to have rated a sufficient number of items before a recommender system can have a good idea of what their preferences are. In a content-based system, the aggregation function needs ratings to aggregate.

New item problem¶

Collaborative filters rely on an item being rated by many users to compute aggregates of those ratings. Think of this as the exact counterpart of the new user problem for content-based systems.

Data sparsity¶

When looking at the more general versions of content-based and collaborative systems, the success of the recommender system depends on the availability of a critical mass of user/item iteractions. We get a first glance at the data sparsity problem by quantifying the ratio of existing ratings vs $|U|x|I|$. A highly sparse matrix of interactions makes it difficult to compute similarities between users and items. As an example, for a user whose tastes are unusual compared to the rest of the population, there will not be any other users who are particularly similar, leading to poor recommendations.

Flow chart: the big picture¶

In [4]:

from IPython.core.display import Image 
Image(filename='/Users/chengjun/GitHub/cjc2016/figure/recsys_arch.png')

Out[4]:

The CourseTalk dataset: loading and first look¶

Loading of the CourseTalk database.

The CourseTalk data is spread across three files. Using the pd.read_table method we load each file:

In [5]:

import pandas as pd

unames = ['user_id', 'username']
users = pd.read_table('/Users/chengjun/GitHub/cjc2016/data/users_set.dat',
                      sep='|', header=None, names=unames)

rnames = ['user_id', 'course_id', 'rating']
ratings = pd.read_table('/Users/chengjun/GitHub/cjc2016/data/ratings.dat',
                        sep='|', header=None, names=rnames)

mnames = ['course_id', 'title', 'avg_rating', 'workload', 'university', 'difficulty', 'provider']
courses = pd.read_table('/Users/chengjun/GitHub/cjc2016/data/cursos.dat',
                       sep='|', header=None, names=mnames)

# show how one of them looks
ratings.head(10)

Out[5]:

	user_id	course_id	rating
0	1	1	5
1	2	1	5
2	3	1	5
3	4	1	5
4	5	1	5
5	6	1	5
6	7	1	5
7	8	1	5
8	9	1	5
9	10	1	5

In [6]:

# show how one of them looks
users[:5]

Out[6]:

	user_id	username
0	1	patrickdijusto1
1	2	natalya_ivanova
2	3	justineittreim
3	4	ronmay
4	5	paulstock

In [7]:

courses[:5]

Out[7]:

	course_id	title	avg_rating	workload	university	difficulty	provider
0	1	An Introduction to Interactive Programming in ...	4.9	7-10 hours/week	Rice University	Medium	coursera
1	2	Modern & Contemporary American Poetry	4.9	5-9 hours/week	University of Pennsylvania	Easy/medium	coursera
2	3	A Beginner's Guide to Irrational Behavior	4.9	7-10 hours/week	Duke University	Medium	coursera
3	4	Design: Creation of Artifacts in Society	4.9	5-10 hours/week	University of Pennsylvania	Medium	coursera
4	5	Greek and Roman Mythology	4.9	8-10 hours/week	University of Pennsylvania	Medium	coursera

Using pd.merge we get it all into one big DataFrame.

In [8]:

coursetalk = pd.merge(pd.merge(ratings, courses), users)
coursetalk

Out[8]:

	user_id	course_id	rating	title	avg_rating	workload	university	difficulty	provider	username
0	1	1	5.0	An Introduction to Interactive Programming in ...	4.9	7-10 hours/week	Rice University	Medium	coursera	patrickdijusto1
1	2	1	5.0	An Introduction to Interactive Programming in ...	4.9	7-10 hours/week	Rice University	Medium	coursera	natalya_ivanova
2	3	1	5.0	An Introduction to Interactive Programming in ...	4.9	7-10 hours/week	Rice University	Medium	coursera	justineittreim
3	4	1	5.0	An Introduction to Interactive Programming in ...	4.9	7-10 hours/week	Rice University	Medium	coursera	ronmay
4	5	1	5.0	An Introduction to Interactive Programming in ...	4.9	7-10 hours/week	Rice University	Medium	coursera	paulstock
5	6	1	5.0	An Introduction to Interactive Programming in ...	4.9	7-10 hours/week	Rice University	Medium	coursera	boyarsky
6	6	11	4.5	Functional Programming Principles in Scala	4.8	5-7 hours/week	Ecole Polytechnique Federale de Lausanne	Medium/hard	coursera	boyarsky
7	6	12	4.0	Gamification	4.8	4-8 hours/week	University of Pennsylvania	Easy/medium	coursera	boyarsky
8	6	19	5.0	M101P: MongoDB for Developers	4.7	TBA	NaN	Medium	None	boyarsky
9	6	21	5.0	6.002x: Circuits and Electronics	4.7	12 hours/week.	MIT	Medium/hard	edx	boyarsky
10	6	32	5.0	Internet History, Technology, and Security	4.6	3-5 hours/week	University of Michigan	Easy	coursera	boyarsky
11	6	33	4.0	Web Development	4.6	Self-paced	NaN	Easy/medium	udacity	boyarsky
12	6	93	5.0	CS-169.1x: Software as a Service	4.2	TBA	UC Berkeley	Medium	edx	boyarsky
13	6	108	5.0	Human-Computer Interaction	4.1	10-12 hours/week	Stanford University	Easy/medium	coursera	boyarsky
14	6	134	2.0	Web Intelligence and Big Data	3.9	3-4 hours/week	Indian Institute of Technology Delhi	Medium	coursera	boyarsky
15	6	141	4.0	Coding the Matrix: Linear Algebra through Comp...	3.8	7-10 hours/week	Brown University	Medium/hard	coursera	boyarsky
16	6	145	4.0	Game Theory	3.8	5-7 hours/week	Stanford University	Medium	coursera	boyarsky
17	6	198	2.0	Software Testing	2.9	Self-paced	NaN	Easy	udacity	boyarsky
18	7	1	5.0	An Introduction to Interactive Programming in ...	4.9	7-10 hours/week	Rice University	Medium	coursera	barak
19	8	1	5.0	An Introduction to Interactive Programming in ...	4.9	7-10 hours/week	Rice University	Medium	coursera	alexjeffrey
20	9	1	5.0	An Introduction to Interactive Programming in ...	4.9	7-10 hours/week	Rice University	Medium	coursera	celsoagustinhernandezdiaz
21	10	1	5.0	An Introduction to Interactive Programming in ...	4.9	7-10 hours/week	Rice University	Medium	coursera	vadimsolomonik
22	10	14	5.0	An Introduction to Operations Management	4.8	5-7 hours/week	University of Pennsylvania	Medium	coursera	vadimsolomonik
23	10	35	5.0	Model Thinking	4.5	4-8 hours/week	University of Michigan	Easy/medium	coursera	vadimsolomonik
24	10	49	5.0	Fantasy and Science Fiction: The Human Mind, O...	4.5	8-12 hours/week	University of Michigan	Medium	coursera	vadimsolomonik
25	10	87	4.0	Networked Life	4.2	In session	University of Pennsylvania	Easy/medium	coursera	vadimsolomonik
26	10	142	5.0	Social Network Analysis	3.8	5-7 hours/week (8-10 if completing additional ...	University of Michigan	Medium	coursera	vadimsolomonik
27	10	188	1.0	Computational Investing, Part I	3.2	8-12 hours/week	Georgia Institute of Technology	Medium	coursera	vadimsolomonik
28	11	1	3.5	An Introduction to Interactive Programming in ...	4.9	7-10 hours/week	Rice University	Medium	coursera	CrunchyCookie
29	12	1	5.0	An Introduction to Interactive Programming in ...	4.9	7-10 hours/week	Rice University	Medium	coursera	skywalking
...	...	...	...	...	...	...	...	...	...	...
2743	1988	188	3.0	Computational Investing, Part I	3.2	8-12 hours/week	Georgia Institute of Technology	Medium	coursera	alanwilliams
2744	1989	188	1.0	Computational Investing, Part I	3.2	8-12 hours/week	Georgia Institute of Technology	Medium	coursera	Lyon
2745	1990	188	1.5	Computational Investing, Part I	3.2	8-12 hours/week	Georgia Institute of Technology	Medium	coursera	fernandomontenegro
2746	1991	188	4.5	Computational Investing, Part I	3.2	8-12 hours/week	Georgia Institute of Technology	Medium	coursera	lemic
2747	1992	188	5.0	Computational Investing, Part I	3.2	8-12 hours/week	Georgia Institute of Technology	Medium	coursera	andreamitchell
2748	1993	189	4.5	CB22x: The Ancient Greek Hero	3.1	TBA	Harvard University	Easy/medium	edx	megkawento
2749	1994	189	0.5	CB22x: The Ancient Greek Hero	3.1	TBA	Harvard University	Easy/medium	edx	anonymous
2750	1995	189	2.0	CB22x: The Ancient Greek Hero	3.1	TBA	Harvard University	Easy/medium	edx	anonymous
2751	1996	189	5.0	CB22x: The Ancient Greek Hero	3.1	TBA	Harvard University	Easy/medium	edx	anonymous
2752	1997	190	4.0	Introduction to Systems Biology	3.1	6-8 hours/week	Mount Sinai School of Medicine	Medium/hard	coursera	anonymous
2753	1998	192	1.0	Property and Liability: An Introduction to Law...	3.0	2-4 hours/week	Wesleyan University	Easy/medium	coursera	echolearning
2754	1999	193	3.0	Physics	3.0	Self-paced	NaN	Medium	khanacademy	anonymous
2755	2000	196	3.0	Poetry: What It Is, and How to Understand It	3.0	Self-paced	NaN	Medium	udemy	marcianuffer
2756	2001	197	3.0	Principles of Obesity Economics	2.9	3-5 hours/week	Johns Hopkins University	Easy	coursera	irvwiswall
2757	2002	198	3.5	Software Testing	2.9	Self-paced	NaN	Easy	udacity	ivandutoit
2758	2003	200	1.0	Introduction to Logic	2.8	5-7 hours/week	Stanford University	Medium	coursera	llewellynfalco1
2759	2004	200	4.5	Introduction to Logic	2.8	5-7 hours/week	Stanford University	Medium	coursera	anonymous
2760	2005	200	1.5	Introduction to Logic	2.8	5-7 hours/week	Stanford University	Medium	coursera	caleesarya
2761	2006	200	1.0	Introduction to Logic	2.8	5-7 hours/week	Stanford University	Medium	coursera	nathanhall
2762	2007	200	5.0	Introduction to Logic	2.8	5-7 hours/week	Stanford University	Medium	coursera	valeria
2763	2008	205	2.0	Preparation for Introductory Biology: DNA to O...	2.4	8-10 hours/week	UC Irvine	Hard	coursera	stuartwilloughby
2764	2009	207	1.5	Computer Architecture	2.2	5-8 hours/week	Princeton University	Hard	coursera	jonsnow
2765	2010	212	0.5	HTML5 Game Development	1.8	Self-paced	NaN	Medium	udacity	anonymous
2766	2011	212	3.0	HTML5 Game Development	1.8	Self-paced	NaN	Medium	udacity	florianschaetz
2767	2012	212	1.0	HTML5 Game Development	1.8	Self-paced	NaN	Medium	udacity	anonymous
2768	2013	212	1.0	HTML5 Game Development	1.8	Self-paced	NaN	Medium	udacity	anonymous
2769	2014	213	0.5	A New History for a New China, 1700-2000: New ...	1.4	3-4 hours/week	The Hong Kong University of Science and Techno...	Medium/hard	coursera	chihchengyuan1
2770	2015	213	0.5	A New History for a New China, 1700-2000: New ...	1.4	3-4 hours/week	The Hong Kong University of Science and Techno...	Medium/hard	coursera	kj
2771	2016	214	1.0	Sports and Society	1.3	3-5 hours/week	Duke University	Easy	coursera	debbie
2772	2017	214	0.5	Sports and Society	1.3	3-5 hours/week	Duke University	Easy	coursera	kuba

2773 rows × 10 columns

In [9]:

coursetalk.ix[0]

Out[9]:

user_id                                                       1
course_id                                                     1
rating                                                        5
title         An Introduction to Interactive Programming in ...
avg_rating                                                  4.9
workload                                        7-10 hours/week
university                                      Rice University
difficulty                                               Medium
provider                                               coursera
username                                        patrickdijusto1
Name: 0, dtype: object

Collaborative filtering: generalizations of the aggregation function¶

Non-personalized recommendations¶

Groupby¶

The idea of groupby is that of split-apply-combine:

split data in an object according to a given key;
apply a function to each subset;
combine results into a new object.

To get mean course ratings grouped by the provider, we can use the pivot_table method:

In [14]:

dir(pivot_table)

Out[14]:

['__call__',
 '__class__',
 '__closure__',
 '__code__',
 '__defaults__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__get__',
 '__getattribute__',
 '__globals__',
 '__hash__',
 '__init__',
 '__module__',
 '__name__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'func_closure',
 'func_code',
 'func_defaults',
 'func_dict',
 'func_doc',
 'func_globals',
 'func_name']

In [15]:

from pandas import pivot_table
mean_ratings = pivot_table(coursetalk, values = 'rating', columns='provider', aggfunc='mean')
mean_ratings.order(ascending=False)

Out[15]:

provider
None            4.562500
coursera        4.527835
edx             4.491620
codecademy      4.450000
udacity         4.241071
udemy           4.200000
open2study      4.083333
khanacademy     4.000000
novoed          3.281250
mruniversity    3.250000
Name: rating, dtype: float64

Now let's filter down to courses that received at least 20 ratings (a completely arbitrary number); To do this, I group the data by course_id and use size() to get a Series of group sizes for each title:

In [16]:

ratings_by_title = coursetalk.groupby('title').size()
ratings_by_title[:10]

Out[16]:

title
14.73x: The Challenges of Global Poverty                     2
2.01x: Elements of Structures                                2
3.091x: Introduction to Solid State Chemistry                3
6.002x: Circuits and Electronics                            10
6.00x: Introduction to Computer Science and Programming     21
7.00x: Introduction to Biology - The Secret of Life          3
8.02x: Electricity and Magnetism                             3
8.MReVx: Mechanics ReView                                    1
A Beginner&#39;s Guide to Irrational Behavior              147
A Crash Course on Creativity                                 5
dtype: int64

In [17]:

active_titles = ratings_by_title.index[ratings_by_title >= 20]
active_titles[:10]

Out[17]:

Index([u'6.00x: Introduction to Computer Science and Programming',
       u'A Beginner&#39;s Guide to Irrational Behavior',
       u'An Introduction to Interactive Programming in Python',
       u'An Introduction to Operations Management',
       u'CS-191x: Quantum Mechanics and Quantum Computation',
       u'CS188.1x Artificial Intelligence', u'Calculus: Single Variable',
       u'Computing for Data Analysis',
       u'Critical Thinking in Global Challenges', u'Cryptography I'],
      dtype='object', name=u'title')

The index of titles receiving at least 20 ratings can then be used to select rows from mean_ratings above:

In [18]:

mean_ratings = coursetalk.pivot_table('rating', columns='title', aggfunc='mean')
mean_ratings

Out[18]:

title
14.73x: The Challenges of Global Poverty                                      4.250000
2.01x: Elements of Structures                                                 4.750000
3.091x: Introduction to Solid State Chemistry                                 4.166667
6.002x: Circuits and Electronics                                              4.800000
6.00x: Introduction to Computer Science and Programming                       4.166667
7.00x: Introduction to Biology - The Secret of Life                           4.666667
8.02x: Electricity and Magnetism                                              4.333333
8.MReVx: Mechanics ReView                                                     5.000000
A Beginner&#39;s Guide to Irrational Behavior                                 4.874150
A Crash Course on Creativity                                                  3.500000
A History of the World since 1300                                             4.318182
A Look at Nuclear Science and Technology                                      3.000000
A New History for a New China, 1700-2000: New Data and New Methods, Part 1    0.500000
AIDS                                                                          5.000000
Aboriginal Worldviews and Education                                           4.333333
Algorithms                                                                    4.250000
Algorithms, Part I                                                            4.555556
Algorithms, Part II                                                           4.500000
Algorithms: Design and Analysis, Part 1                                       4.777778
Algorithms: Design and Analysis, Part 2                                       4.500000
An Introduction to Interactive Programming in Python                          4.915652
An Introduction to Operations Management                                      4.785714
An Introduction to the U.S. Food System: Perspectives from Public Health      5.000000
Animal Behaviour                                                              4.500000
Applied Cryptography                                                          4.666667
Archaeology&#39;s Dirty Little Secrets                                        4.928571
Artificial Intelligence Planning                                              3.250000
Artificial Intelligence for Robotics                                          4.333333
Astrobiology and the Search for Extraterrestrial Life                         3.928571
Automata                                                                      4.000000
                                                                                ...   
Sports and Society                                                            0.666667
Stat2.1X: Introduction to Statistics: Descriptive Statistics                  4.642857
Stat2.2x: Introduction to Statistics: Probability                             5.000000
Stat2.3x: Introduction to Statistics: Inference                               4.500000
Statistics One                                                                3.909091
Synapses, Neurons and Brains                                                  4.600000
Teaching Adult Learners (WPTrain)                                             2.500000
Technology Entrepreneurship Part 1                                            2.900000
Technology Entrepreneurship Part 2                                            0.500000
The Ancient Greeks                                                            4.550000
The Eurozone Crisis                                                           3.250000
The Fiction of Relationship                                                   5.000000
The Hardware/Software Interface                                               3.857143
The Language of Hollywood: Storytelling, Sound, and Color                     4.800000
The Massey Method: Learn Spanish from a Former NSA Agent                      4.000000
The Modern World: Global History since 1760                                   4.775862
The Modern and the Postmodern                                                 4.777778
The Science of Gastronomy                                                     4.000000
The Social Context of Mental Health and Illness                               4.333333
Think Again: How to Reason and Argue                                          3.815789
Useful Genetics Part 1                                                        4.500000
VLSI CAD:  Logic to Layout                                                    4.500000
Vaccine Trials: Methods and Best Practices                                    5.000000
Vaccines                                                                      3.750000
Web Development                                                               4.625000
Web Intelligence and Big Data                                                 3.802326
Women and the Civil Rights Movement                                           5.000000
Writing for the Web (WriteWeb)                                                5.000000
Writing in the Sciences                                                       4.000000
jQuery                                                                        4.250000
Name: rating, dtype: float64

By computing the mean rating for each course, we will order with the highest rating listed first.

In [19]:

mean_ratings.ix[active_titles].order(ascending=False)

Out[19]:

title
An Introduction to Interactive Programming in Python             4.915652
Modern &amp; Contemporary American Poetry                        4.901515
Design: Creation of Artifacts in Society                         4.879581
A Beginner&#39;s Guide to Irrational Behavior                    4.874150
Greek and Roman Mythology                                        4.864198
Calculus: Single Variable                                        4.854167
CS188.1x Artificial Intelligence                                 4.833333
Machine Learning                                                 4.830000
Functional Programming Principles in Scala                       4.822581
Gamification                                                     4.796296
An Introduction to Operations Management                         4.785714
The Modern World: Global History since 1760                      4.775862
Programming Languages                                            4.770833
CS-191x: Quantum Mechanics and Quantum Computation               4.727273
Cryptography I                                                   4.700000
Discrete Optimization                                            4.695652
Introduction to Computer Science                                 4.687500
Learn to Program: Crafting Quality Code                          4.585714
Model Thinking                                                   4.578125
Internet History, Technology, and Security                       4.541667
Fantasy and Science Fiction: The Human Mind, Our Modern World    4.522727
Learn to Program: The Fundamentals                               4.303571
6.00x: Introduction to Computer Science and Programming          4.166667
Critical Thinking in Global Challenges                           3.961538
Web Intelligence and Big Data                                    3.802326
Computing for Data Analysis                                      3.187500
Introduction to Finance                                          3.086957
Introduction to Data Science                                     3.060000
Name: rating, dtype: float64

To see the top courses among Coursera students, we can sort by the 'Coursera' column in descending order:

In [20]:

mean_ratings = coursetalk.pivot_table('rating', index='title',columns='provider', aggfunc='mean')
mean_ratings[:10]

Out[20]:

provider	None	codecademy	coursera	edx	khanacademy	mruniversity	novoed	open2study	udacity	udemy
title
14.73x: The Challenges of Global Poverty	NaN	NaN	NaN	4.250000	NaN	NaN	NaN	NaN	NaN	NaN
2.01x: Elements of Structures	NaN	NaN	NaN	4.750000	NaN	NaN	NaN	NaN	NaN	NaN
3.091x: Introduction to Solid State Chemistry	NaN	NaN	NaN	4.166667	NaN	NaN	NaN	NaN	NaN	NaN
6.002x: Circuits and Electronics	NaN	NaN	NaN	4.800000	NaN	NaN	NaN	NaN	NaN	NaN
6.00x: Introduction to Computer Science and Programming	NaN	NaN	NaN	4.166667	NaN	NaN	NaN	NaN	NaN	NaN
7.00x: Introduction to Biology - The Secret of Life	NaN	NaN	NaN	4.666667	NaN	NaN	NaN	NaN	NaN	NaN
8.02x: Electricity and Magnetism	NaN	NaN	NaN	4.333333	NaN	NaN	NaN	NaN	NaN	NaN
8.MReVx: Mechanics ReView	NaN	NaN	NaN	5.000000	NaN	NaN	NaN	NaN	NaN	NaN
A Beginner's Guide to Irrational Behavior	NaN	NaN	4.87415	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A Crash Course on Creativity	NaN	NaN	NaN	NaN	NaN	NaN	3.5	NaN	NaN	NaN

In [21]:

mean_ratings['coursera'][active_titles].order(ascending=False)[:10]

Out[21]:

title
An Introduction to Interactive Programming in Python    4.915652
Modern &amp; Contemporary American Poetry               4.901515
Design: Creation of Artifacts in Society                4.879581
A Beginner&#39;s Guide to Irrational Behavior           4.874150
Greek and Roman Mythology                               4.864198
Calculus: Single Variable                               4.854167
Programming Languages                                   4.850000
Machine Learning                                        4.830000
Functional Programming Principles in Scala              4.822581
Gamification                                            4.796296
Name: coursera, dtype: float64

Now, let's go further! How about rank the courses with the highest percentage of ratings that are 4 or higher ? % of ratings 4+

Let's start with a simple pivoting example that does not involve any aggregation. We can extract a ratings matrix as follows:

In [23]:

# transform the ratings frame into a ratings matrix
ratings_mtx_df = coursetalk.pivot_table(values='rating',
                                             index='user_id',
                                             columns='title')
ratings_mtx_df.ix[ratings_mtx_df.index[:15], ratings_mtx_df.columns[:15]]

Out[23]:

title	14.73x: The Challenges of Global Poverty	2.01x: Elements of Structures	3.091x: Introduction to Solid State Chemistry	6.002x: Circuits and Electronics	6.00x: Introduction to Computer Science and Programming	7.00x: Introduction to Biology - The Secret of Life	8.02x: Electricity and Magnetism	8.MReVx: Mechanics ReView	A Beginner's Guide to Irrational Behavior	A Crash Course on Creativity	A History of the World since 1300	A Look at Nuclear Science and Technology	A New History for a New China, 1700-2000: New Data and New Methods, Part 1	AIDS	Aboriginal Worldviews and Education
user_id
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	NaN	NaN	NaN	5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
12	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
13	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
15	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Let's extract only the rating that are 4 or higher.

In [24]:

ratings_gte_4 = ratings_mtx_df[ratings_mtx_df>=4.0]
# with an integer axis index only label-based indexing is possible

ratings_gte_4.ix[ratings_gte_4.index[:15], ratings_gte_4.columns[:15]]

Out[24]:

title	14.73x: The Challenges of Global Poverty	2.01x: Elements of Structures	3.091x: Introduction to Solid State Chemistry	6.002x: Circuits and Electronics	6.00x: Introduction to Computer Science and Programming	7.00x: Introduction to Biology - The Secret of Life	8.02x: Electricity and Magnetism	8.MReVx: Mechanics ReView	A Beginner's Guide to Irrational Behavior	A Crash Course on Creativity	A History of the World since 1300	A Look at Nuclear Science and Technology	A New History for a New China, 1700-2000: New Data and New Methods, Part 1	AIDS	Aboriginal Worldviews and Education
user_id
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	NaN	NaN	NaN	5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
12	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
13	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
15	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Now picking the number of total ratings for each course and the count of ratings 4+ , we can merge them into one DataFrame.

In [25]:

ratings_gte_4_pd = pd.DataFrame({'total': ratings_mtx_df.count(), 'gte_4': ratings_gte_4.count()})
ratings_gte_4_pd.head(10)

Out[25]:

	gte_4	total
title
14.73x: The Challenges of Global Poverty	2	2
2.01x: Elements of Structures	2	2
3.091x: Introduction to Solid State Chemistry	2	3
6.002x: Circuits and Electronics	10	10
6.00x: Introduction to Computer Science and Programming	15	21
7.00x: Introduction to Biology - The Secret of Life	3	3
8.02x: Electricity and Magnetism	2	3
8.MReVx: Mechanics ReView	1	1
A Beginner's Guide to Irrational Behavior	146	147
A Crash Course on Creativity	2	5

In [26]:

ratings_gte_4_pd['gte_4_ratio'] = (ratings_gte_4_pd['gte_4'] * 1.0)/ ratings_gte_4_pd.total
ratings_gte_4_pd.head(10)

Out[26]:

	gte_4	total	gte_4_ratio
title
14.73x: The Challenges of Global Poverty	2	2	1.000000
2.01x: Elements of Structures	2	2	1.000000
3.091x: Introduction to Solid State Chemistry	2	3	0.666667
6.002x: Circuits and Electronics	10	10	1.000000
6.00x: Introduction to Computer Science and Programming	15	21	0.714286
7.00x: Introduction to Biology - The Secret of Life	3	3	1.000000
8.02x: Electricity and Magnetism	2	3	0.666667
8.MReVx: Mechanics ReView	1	1	1.000000
A Beginner's Guide to Irrational Behavior	146	147	0.993197
A Crash Course on Creativity	2	5	0.400000

In [27]:

ranking = [(title,total,gte_4, score) for title, total, gte_4, score in ratings_gte_4_pd.itertuples()]

for title, total, gte_4, score in sorted(ranking, key=lambda x: (x[3], x[2], x[1])  , reverse=True)[:10]:
    print title, total, gte_4, score

Functional Programming Principles in Scala 31 31 1.0
Introduction to Computer Science 24 24 1.0
Programming Languages 24 24 1.0
Web Development 16 16 1.0
6.002x: Circuits and Electronics 10 10 1.0
Compilers 8 8 1.0
Archaeology&#39;s Dirty Little Secrets 7 7 1.0
How to Build a Startup 7 7 1.0
Introduction to Sociology 7 7 1.0
Stat2.1X: Introduction to Statistics: Descriptive Statistics 7 7 1.0

Let's now go easy. Let's count the number of ratings for each course, and order with the most number of ratings.

In [28]:

ratings_by_title = coursetalk.groupby('title').size()
ratings_by_title.order(ascending=False)[:10]

Out[28]:

title
An Introduction to Interactive Programming in Python    575
Design: Creation of Artifacts in Society                191
A Beginner&#39;s Guide to Irrational Behavior           147
Modern &amp; Contemporary American Poetry               132
An Introduction to Operations Management                 98
Greek and Roman Mythology                                81
Critical Thinking in Global Challenges                   65
Gamification                                             54
Machine Learning                                         50
Web Intelligence and Big Data                            43
dtype: int64

Considering this information we can sort by the most rated ones with highest percentage of 4+ ratings.

In [29]:

for title, total, gte_4, score in sorted(ranking, key=lambda x: (x[2], x[3], x[1])  , reverse=True)[:10]:
    print title, total, gte_4, score

An Introduction to Interactive Programming in Python 572 575 0.994782608696
Design: Creation of Artifacts in Society 190 191 0.994764397906
A Beginner&#39;s Guide to Irrational Behavior 146 147 0.993197278912
Modern &amp; Contemporary American Poetry 130 132 0.984848484848
An Introduction to Operations Management 96 98 0.979591836735
Greek and Roman Mythology 80 81 0.987654320988
Critical Thinking in Global Challenges 47 65 0.723076923077
Gamification 52 54 0.962962962963
Machine Learning 48 49 0.979591836735
Web Intelligence and Big Data 26 43 0.604651162791

Finally using the formula above that we learned, let's find out what the courses that most often occur wit the popular MOOC An introduction to Interactive Programming with Python by using the method "x + y/ x" . For each course, calculate the percentage of Programming with python raters who also rated that course. Order with the highest percentage first, and voilá we have the top 5 moocs.

In [31]:

course_users = coursetalk.pivot_table('rating', index='title', columns='user_id')
course_users.ix[course_users.index[:15], course_users.columns[:15]]

Out[31]:

user_id	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
title
14.73x: The Challenges of Global Poverty	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2.01x: Elements of Structures	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3.091x: Introduction to Solid State Chemistry	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6.002x: Circuits and Electronics	NaN	NaN	NaN	NaN	NaN	5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6.00x: Introduction to Computer Science and Programming	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7.00x: Introduction to Biology - The Secret of Life	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8.02x: Electricity and Magnetism	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8.MReVx: Mechanics ReView	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A Beginner's Guide to Irrational Behavior	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A Crash Course on Creativity	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A History of the World since 1300	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A Look at Nuclear Science and Technology	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A New History for a New China, 1700-2000: New Data and New Methods, Part 1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
AIDS	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Aboriginal Worldviews and Education	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

First, let's get only the users that rated the course An Introduction to Interactive Programming in Python

In [32]:

ratings_by_course = coursetalk[coursetalk.title == 'An Introduction to Interactive Programming in Python']
ratings_by_course.set_index('user_id', inplace=True)

Now, for all other courses let's filter out only the ratings from users that rated the Python course.

In [33]:

their_ids = ratings_by_course.index
their_ratings = course_users[their_ids]
course_users[their_ids].ix[course_users[their_ids].index[:15], course_users[their_ids].columns[:15]]

Out[33]:

user_id	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
title
14.73x: The Challenges of Global Poverty	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2.01x: Elements of Structures	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3.091x: Introduction to Solid State Chemistry	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6.002x: Circuits and Electronics	NaN	NaN	NaN	NaN	NaN	5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6.00x: Introduction to Computer Science and Programming	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7.00x: Introduction to Biology - The Secret of Life	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8.02x: Electricity and Magnetism	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8.MReVx: Mechanics ReView	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A Beginner's Guide to Irrational Behavior	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A Crash Course on Creativity	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A History of the World since 1300	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A Look at Nuclear Science and Technology	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
A New History for a New China, 1700-2000: New Data and New Methods, Part 1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
AIDS	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Aboriginal Worldviews and Education	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

By applying the division: number of ratings who rated Python Course and the given course / total of ratings who rated the Python Course we have our percentage.

In [34]:

course_count =  their_ratings.ix['An Introduction to Interactive Programming in Python'].count()
sims = their_ratings.apply(lambda profile: profile.count() / float(course_count) , axis=1)

Ordering by the score, highest first excepts the first one which contains the course itself.

In [35]:

sims.order(ascending=False)[1:][:10]

Out[35]:

title
Cryptography I                             0.006957
Machine Learning                           0.006957
CS-169.1x: Software as a Service           0.005217
Python                                     0.005217
Introduction to Computer Science           0.005217
Human-Computer Interaction                 0.005217
Computational Investing, Part I            0.005217
Learn to Program: Crafting Quality Code    0.005217
Web Development                            0.005217
Gamification                               0.005217
dtype: float64

In [ ]: