Recommenders have been around since at least 1992. Today we see different flavours of recommenders, deployed across different verticals:
What exactly do they do?
In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997
Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al, 1992
In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005
Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012
$ \newcommand{\argmax}{\mathop{\rm argmax}\nolimits} \forall{u \in U},\; i^* = \argmax_{i \in -I(u)} [S(u,i)] $
The recommendation problem in its most basic form is quite simple to define:
|-------------------+-----+-----+-----+-----+-----|
| user_id, movie_id | m_1 | m_2 | m_3 | m_4 | m_5 |
|-------------------+-----+-----+-----+-----+-----|
| u_1 | ? | ? | 4 | ? | 1 |
|-------------------+-----+-----+-----+-----+-----|
| u_2 | 3 | ? | ? | 2 | 2 |
|-------------------+-----+-----+-----+-----+-----|
| u_3 | 3 | ? | ? | ? | ? |
|-------------------+-----+-----+-----+-----+-----|
| u_4 | ? | 1 | 2 | 1 | 1 |
|-------------------+-----+-----+-----+-----+-----|
| u_5 | ? | ? | ? | ? | ? |
|-------------------+-----+-----+-----+-----+-----|
| u_6 | 2 | ? | 2 | ? | ? |
|-------------------+-----+-----+-----+-----+-----|
| u_7 | ? | ? | ? | ? | ? |
|-------------------+-----+-----+-----+-----+-----|
| u_8 | 3 | 1 | 5 | ? | ? |
|-------------------+-----+-----+-----+-----+-----|
| u_9 | ? | ? | ? | ? | 2 |
|-------------------+-----+-----+-----+-----+-----|
Given a partially filled matrix of ratings ($|U| \times |I|$), estimate the missing values.
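As a rough illustration of "estimate the missing values, then recommend the highest estimate", the sketch below rebuilds the toy table above as a pandas DataFrame. The item-mean estimator and the names `toy_ratings` and `recommend` are assumptions made for this example only. They are a naive baseline, not the method developed in this notebook.

```python
import numpy as np
import pandas as pd

nan = np.nan
# The toy ratings matrix from the table above; NaN marks an unknown rating.
toy_ratings = pd.DataFrame(
    [[nan, nan, 4, nan, 1],
     [3, nan, nan, 2, 2],
     [3, nan, nan, nan, nan],
     [nan, 1, 2, 1, 1],
     [nan, nan, nan, nan, nan],
     [2, nan, 2, nan, nan],
     [nan, nan, nan, nan, nan],
     [3, 1, 5, nan, nan],
     [nan, nan, nan, nan, 2]],
    index=['u_%d' % k for k in range(1, 10)],
    columns=['m_%d' % k for k in range(1, 6)],
)

# Baseline assumption: estimate each missing rating with the mean of the
# ratings that the item has already received.
item_means = toy_ratings.mean(axis=0)
estimated = toy_ratings.fillna(item_means)

def recommend(user):
    """Return the unrated item with the highest estimated rating for `user`."""
    unrated = toy_ratings.columns[toy_ratings.loc[user].isna().values]
    return estimated.loc[user, unrated].idxmax()

print(recommend('u_1'))
print(recommend('u_5'))
```

With this baseline every user without ratings receives the same suggestion, which is one reason real systems go beyond item means.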
Content-based techniques are limited by the amount of metadata that is available to describe an item. There are domains in which feature extraction methods are expensive or time consuming, e.g., processing multimedia data such as graphics, audio/video streams. In the context of grocery items for example, it's often the case that item information is only partial or completely missing. Examples include:
A user has to have rated a sufficient number of items before a recommender system can have a good idea of what their preferences are. In a content-based system, the aggregation function needs ratings to aggregate.
Collaborative filters rely on an item being rated by many users to compute aggregates of those ratings. Think of this as the exact counterpart of the new user problem for content-based systems.
When looking at the more general versions of content-based and collaborative systems, the success of the recommender system depends on the availability of a critical mass of user/item interactions. We get a first glance at the data sparsity problem by quantifying the ratio of existing ratings to the $|U| \times |I|$ possible ones. A highly sparse matrix of interactions makes it difficult to compute similarities between users and items. As an example, for a user whose tastes are unusual compared to the rest of the population, there will not be any other users who are particularly similar, leading to poor recommendations.
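To make the ratio concrete, here is a minimal sketch that computes the density (and its complement, the sparsity) of the 9 × 5 toy matrix shown earlier. The helper name `density` is hypothetical and introduced only for this example.

```python
import numpy as np

nan = np.nan
# Same toy matrix as in the table above: 9 users by 5 movies.
toy = np.array([
    [nan, nan, 4, nan, 1],
    [3, nan, nan, 2, 2],
    [3, nan, nan, nan, nan],
    [nan, 1, 2, 1, 1],
    [nan, nan, nan, nan, nan],
    [2, nan, 2, nan, nan],
    [nan, nan, nan, nan, nan],
    [3, 1, 5, nan, nan],
    [nan, nan, nan, nan, 2],
])

def density(matrix):
    """Fraction of user/item cells that hold an observed rating."""
    observed = np.count_nonzero(~np.isnan(matrix))
    return observed / float(matrix.size)

d = density(toy)
print('density: %.3f, sparsity: %.3f' % (d, 1 - d))
```

Here 16 of the 45 cells are filled. Real rating datasets are typically far sparser than this toy example.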
from IPython.display import Image
Image(filename='/Users/chengjun/GitHub/cjc2016/figure/recsys_arch.png')
Loading of the CourseTalk database.
The CourseTalk data is spread across three files. Using the pd.read_table method, we load each file:
import pandas as pd
unames = ['user_id', 'username']
users = pd.read_table('/Users/chengjun/GitHub/cjc2016/data/users_set.dat',
sep='|', header=None, names=unames)
rnames = ['user_id', 'course_id', 'rating']
ratings = pd.read_table('/Users/chengjun/GitHub/cjc2016/data/ratings.dat',
sep='|', header=None, names=rnames)
mnames = ['course_id', 'title', 'avg_rating', 'workload', 'university', 'difficulty', 'provider']
courses = pd.read_table('/Users/chengjun/GitHub/cjc2016/data/cursos.dat',
sep='|', header=None, names=mnames)
# show how one of them looks
ratings.head(10)
| | user_id | course_id | rating |
---|---|---|---|
0 | 1 | 1 | 5 |
1 | 2 | 1 | 5 |
2 | 3 | 1 | 5 |
3 | 4 | 1 | 5 |
4 | 5 | 1 | 5 |
5 | 6 | 1 | 5 |
6 | 7 | 1 | 5 |
7 | 8 | 1 | 5 |
8 | 9 | 1 | 5 |
9 | 10 | 1 | 5 |
# show how one of them looks
users[:5]
| | user_id | username |
---|---|---|
0 | 1 | patrickdijusto1 |
1 | 2 | natalya_ivanova |
2 | 3 | justineittreim |
3 | 4 | ronmay |
4 | 5 | paulstock |
courses[:5]
| | course_id | title | avg_rating | workload | university | difficulty | provider |
---|---|---|---|---|---|---|---|
0 | 1 | An Introduction to Interactive Programming in ... | 4.9 | 7-10 hours/week | Rice University | Medium | coursera |
1 | 2 | Modern & Contemporary American Poetry | 4.9 | 5-9 hours/week | University of Pennsylvania | Easy/medium | coursera |
2 | 3 | A Beginner's Guide to Irrational Behavior | 4.9 | 7-10 hours/week | Duke University | Medium | coursera |
3 | 4 | Design: Creation of Artifacts in Society | 4.9 | 5-10 hours/week | University of Pennsylvania | Medium | coursera |
4 | 5 | Greek and Roman Mythology | 4.9 | 8-10 hours/week | University of Pennsylvania | Medium | coursera |
Using pd.merge, we combine everything into one big DataFrame.
coursetalk = pd.merge(pd.merge(ratings, courses), users)
coursetalk
| | user_id | course_id | rating | title | avg_rating | workload | university | difficulty | provider | username |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 5.0 | An Introduction to Interactive Programming in ... | 4.9 | 7-10 hours/week | Rice University | Medium | coursera | patrickdijusto1 |
1 | 2 | 1 | 5.0 | An Introduction to Interactive Programming in ... | 4.9 | 7-10 hours/week | Rice University | Medium | coursera | natalya_ivanova |
2 | 3 | 1 | 5.0 | An Introduction to Interactive Programming in ... | 4.9 | 7-10 hours/week | Rice University | Medium | coursera | justineittreim |
3 | 4 | 1 | 5.0 | An Introduction to Interactive Programming in ... | 4.9 | 7-10 hours/week | Rice University | Medium | coursera | ronmay |
4 | 5 | 1 | 5.0 | An Introduction to Interactive Programming in ... | 4.9 | 7-10 hours/week | Rice University | Medium | coursera | paulstock |
5 | 6 | 1 | 5.0 | An Introduction to Interactive Programming in ... | 4.9 | 7-10 hours/week | Rice University | Medium | coursera | boyarsky |
6 | 6 | 11 | 4.5 | Functional Programming Principles in Scala | 4.8 | 5-7 hours/week | Ecole Polytechnique Federale de Lausanne | Medium/hard | coursera | boyarsky |
7 | 6 | 12 | 4.0 | Gamification | 4.8 | 4-8 hours/week | University of Pennsylvania | Easy/medium | coursera | boyarsky |
8 | 6 | 19 | 5.0 | M101P: MongoDB for Developers | 4.7 | TBA | NaN | Medium | None | boyarsky |
9 | 6 | 21 | 5.0 | 6.002x: Circuits and Electronics | 4.7 | 12 hours/week. | MIT | Medium/hard | edx | boyarsky |
10 | 6 | 32 | 5.0 | Internet History, Technology, and Security | 4.6 | 3-5 hours/week | University of Michigan | Easy | coursera | boyarsky |
11 | 6 | 33 | 4.0 | Web Development | 4.6 | Self-paced | NaN | Easy/medium | udacity | boyarsky |
12 | 6 | 93 | 5.0 | CS-169.1x: Software as a Service | 4.2 | TBA | UC Berkeley | Medium | edx | boyarsky |
13 | 6 | 108 | 5.0 | Human-Computer Interaction | 4.1 | 10-12 hours/week | Stanford University | Easy/medium | coursera | boyarsky |
14 | 6 | 134 | 2.0 | Web Intelligence and Big Data | 3.9 | 3-4 hours/week | Indian Institute of Technology Delhi | Medium | coursera | boyarsky |
15 | 6 | 141 | 4.0 | Coding the Matrix: Linear Algebra through Comp... | 3.8 | 7-10 hours/week | Brown University | Medium/hard | coursera | boyarsky |
16 | 6 | 145 | 4.0 | Game Theory | 3.8 | 5-7 hours/week | Stanford University | Medium | coursera | boyarsky |
17 | 6 | 198 | 2.0 | Software Testing | 2.9 | Self-paced | NaN | Easy | udacity | boyarsky |
18 | 7 | 1 | 5.0 | An Introduction to Interactive Programming in ... | 4.9 | 7-10 hours/week | Rice University | Medium | coursera | barak |
19 | 8 | 1 | 5.0 | An Introduction to Interactive Programming in ... | 4.9 | 7-10 hours/week | Rice University | Medium | coursera | alexjeffrey |
20 | 9 | 1 | 5.0 | An Introduction to Interactive Programming in ... | 4.9 | 7-10 hours/week | Rice University | Medium | coursera | celsoagustinhernandezdiaz |
21 | 10 | 1 | 5.0 | An Introduction to Interactive Programming in ... | 4.9 | 7-10 hours/week | Rice University | Medium | coursera | vadimsolomonik |
22 | 10 | 14 | 5.0 | An Introduction to Operations Management | 4.8 | 5-7 hours/week | University of Pennsylvania | Medium | coursera | vadimsolomonik |
23 | 10 | 35 | 5.0 | Model Thinking | 4.5 | 4-8 hours/week | University of Michigan | Easy/medium | coursera | vadimsolomonik |
24 | 10 | 49 | 5.0 | Fantasy and Science Fiction: The Human Mind, O... | 4.5 | 8-12 hours/week | University of Michigan | Medium | coursera | vadimsolomonik |
25 | 10 | 87 | 4.0 | Networked Life | 4.2 | In session | University of Pennsylvania | Easy/medium | coursera | vadimsolomonik |
26 | 10 | 142 | 5.0 | Social Network Analysis | 3.8 | 5-7 hours/week (8-10 if completing additional ... | University of Michigan | Medium | coursera | vadimsolomonik |
27 | 10 | 188 | 1.0 | Computational Investing, Part I | 3.2 | 8-12 hours/week | Georgia Institute of Technology | Medium | coursera | vadimsolomonik |
28 | 11 | 1 | 3.5 | An Introduction to Interactive Programming in ... | 4.9 | 7-10 hours/week | Rice University | Medium | coursera | CrunchyCookie |
29 | 12 | 1 | 5.0 | An Introduction to Interactive Programming in ... | 4.9 | 7-10 hours/week | Rice University | Medium | coursera | skywalking |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2743 | 1988 | 188 | 3.0 | Computational Investing, Part I | 3.2 | 8-12 hours/week | Georgia Institute of Technology | Medium | coursera | alanwilliams |
2744 | 1989 | 188 | 1.0 | Computational Investing, Part I | 3.2 | 8-12 hours/week | Georgia Institute of Technology | Medium | coursera | Lyon |
2745 | 1990 | 188 | 1.5 | Computational Investing, Part I | 3.2 | 8-12 hours/week | Georgia Institute of Technology | Medium | coursera | fernandomontenegro |
2746 | 1991 | 188 | 4.5 | Computational Investing, Part I | 3.2 | 8-12 hours/week | Georgia Institute of Technology | Medium | coursera | lemic |
2747 | 1992 | 188 | 5.0 | Computational Investing, Part I | 3.2 | 8-12 hours/week | Georgia Institute of Technology | Medium | coursera | andreamitchell |
2748 | 1993 | 189 | 4.5 | CB22x: The Ancient Greek Hero | 3.1 | TBA | Harvard University | Easy/medium | edx | megkawento |
2749 | 1994 | 189 | 0.5 | CB22x: The Ancient Greek Hero | 3.1 | TBA | Harvard University | Easy/medium | edx | anonymous |
2750 | 1995 | 189 | 2.0 | CB22x: The Ancient Greek Hero | 3.1 | TBA | Harvard University | Easy/medium | edx | anonymous |
2751 | 1996 | 189 | 5.0 | CB22x: The Ancient Greek Hero | 3.1 | TBA | Harvard University | Easy/medium | edx | anonymous |
2752 | 1997 | 190 | 4.0 | Introduction to Systems Biology | 3.1 | 6-8 hours/week | Mount Sinai School of Medicine | Medium/hard | coursera | anonymous |
2753 | 1998 | 192 | 1.0 | Property and Liability: An Introduction to Law... | 3.0 | 2-4 hours/week | Wesleyan University | Easy/medium | coursera | echolearning |
2754 | 1999 | 193 | 3.0 | Physics | 3.0 | Self-paced | NaN | Medium | khanacademy | anonymous |
2755 | 2000 | 196 | 3.0 | Poetry: What It Is, and How to Understand It | 3.0 | Self-paced | NaN | Medium | udemy | marcianuffer |
2756 | 2001 | 197 | 3.0 | Principles of Obesity Economics | 2.9 | 3-5 hours/week | Johns Hopkins University | Easy | coursera | irvwiswall |
2757 | 2002 | 198 | 3.5 | Software Testing | 2.9 | Self-paced | NaN | Easy | udacity | ivandutoit |
2758 | 2003 | 200 | 1.0 | Introduction to Logic | 2.8 | 5-7 hours/week | Stanford University | Medium | coursera | llewellynfalco1 |
2759 | 2004 | 200 | 4.5 | Introduction to Logic | 2.8 | 5-7 hours/week | Stanford University | Medium | coursera | anonymous |
2760 | 2005 | 200 | 1.5 | Introduction to Logic | 2.8 | 5-7 hours/week | Stanford University | Medium | coursera | caleesarya |
2761 | 2006 | 200 | 1.0 | Introduction to Logic | 2.8 | 5-7 hours/week | Stanford University | Medium | coursera | nathanhall |
2762 | 2007 | 200 | 5.0 | Introduction to Logic | 2.8 | 5-7 hours/week | Stanford University | Medium | coursera | valeria |
2763 | 2008 | 205 | 2.0 | Preparation for Introductory Biology: DNA to O... | 2.4 | 8-10 hours/week | UC Irvine | Hard | coursera | stuartwilloughby |
2764 | 2009 | 207 | 1.5 | Computer Architecture | 2.2 | 5-8 hours/week | Princeton University | Hard | coursera | jonsnow |
2765 | 2010 | 212 | 0.5 | HTML5 Game Development | 1.8 | Self-paced | NaN | Medium | udacity | anonymous |
2766 | 2011 | 212 | 3.0 | HTML5 Game Development | 1.8 | Self-paced | NaN | Medium | udacity | florianschaetz |
2767 | 2012 | 212 | 1.0 | HTML5 Game Development | 1.8 | Self-paced | NaN | Medium | udacity | anonymous |
2768 | 2013 | 212 | 1.0 | HTML5 Game Development | 1.8 | Self-paced | NaN | Medium | udacity | anonymous |
2769 | 2014 | 213 | 0.5 | A New History for a New China, 1700-2000: New ... | 1.4 | 3-4 hours/week | The Hong Kong University of Science and Techno... | Medium/hard | coursera | chihchengyuan1 |
2770 | 2015 | 213 | 0.5 | A New History for a New China, 1700-2000: New ... | 1.4 | 3-4 hours/week | The Hong Kong University of Science and Techno... | Medium/hard | coursera | kj |
2771 | 2016 | 214 | 1.0 | Sports and Society | 1.3 | 3-5 hours/week | Duke University | Easy | coursera | debbie |
2772 | 2017 | 214 | 0.5 | Sports and Society | 1.3 | 3-5 hours/week | Duke University | Easy | coursera | kuba |
2773 rows × 10 columns
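When called without arguments, `pd.merge` joins on the columns the two frames have in common. The sketch below, on small made-up frames (the `*_toy` names and values are hypothetical), spells out the join keys explicitly, which can make the intent easier to read and guards against accidental matches on other shared column names.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
ratings_toy = pd.DataFrame({'user_id': [1, 2, 2],
                            'course_id': [10, 10, 11],
                            'rating': [5.0, 4.0, 3.5]})
courses_toy = pd.DataFrame({'course_id': [10, 11],
                            'title': ['Course A', 'Course B']})
users_toy = pd.DataFrame({'user_id': [1, 2],
                          'username': ['alice', 'bob']})

# Join ratings to courses on course_id, then to users on user_id.
merged = (ratings_toy
          .merge(courses_toy, on='course_id', how='inner')
          .merge(users_toy, on='user_id', how='inner'))
print(merged)
```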
coursetalk.loc[0]
user_id       1
course_id     1
rating        5
title         An Introduction to Interactive Programming in ...
avg_rating    4.9
workload      7-10 hours/week
university    Rice University
difficulty    Medium
provider      coursera
username      patrickdijusto1
Name: 0, dtype: object
The idea of groupby is that of split-apply-combine:
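As a rough illustration of the three steps, the sketch below runs a groupby on a small made-up frame. The data and the variable names are assumptions for this example and are not taken from the CourseTalk files.

```python
import pandas as pd

toy = pd.DataFrame({
    'provider': ['coursera', 'edx', 'coursera', 'edx', 'udacity'],
    'rating': [5.0, 4.0, 4.0, 3.0, 2.0],
})

# Split: rows are partitioned into one group per provider.
# Apply: the mean is computed independently within each group.
# Combine: the per-group results are assembled into a Series indexed by provider.
mean_by_provider = toy.groupby('provider')['rating'].mean()
print(mean_by_provider)
```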
To get mean course ratings grouped by the provider, we can use the pivot_table method:
from pandas import pivot_table
dir(pivot_table)
['__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__doc__', '__format__', '__get__', '__getattribute__', '__globals__', '__hash__', '__init__', '__module__', '__name__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'func_closure', 'func_code', 'func_defaults', 'func_dict', 'func_doc', 'func_globals', 'func_name']
# index='provider' yields one row per provider; selecting 'rating' gives a Series
mean_ratings = pivot_table(coursetalk, values='rating', index='provider', aggfunc='mean')['rating']
mean_ratings.sort_values(ascending=False)
provider
None            4.562500
coursera        4.527835
edx             4.491620
codecademy      4.450000
udacity         4.241071
udemy           4.200000
open2study      4.083333
khanacademy     4.000000
novoed          3.281250
mruniversity    3.250000
Name: rating, dtype: float64
Now let's filter down to courses that received at least 20 ratings (a completely arbitrary number). To do this, I group the data by title and use size() to get a Series of group sizes for each title:
ratings_by_title = coursetalk.groupby('title').size()
ratings_by_title[:10]
title
14.73x: The Challenges of Global Poverty                     2
2.01x: Elements of Structures                                2
3.091x: Introduction to Solid State Chemistry                3
6.002x: Circuits and Electronics                            10
6.00x: Introduction to Computer Science and Programming     21
7.00x: Introduction to Biology - The Secret of Life          3
8.02x: Electricity and Magnetism                             3
8.MReVx: Mechanics ReView                                    1
A Beginner's Guide to Irrational Behavior                  147
A Crash Course on Creativity                                 5
dtype: int64
active_titles = ratings_by_title.index[ratings_by_title >= 20]
active_titles[:10]
Index([u'6.00x: Introduction to Computer Science and Programming', u'A Beginner's Guide to Irrational Behavior', u'An Introduction to Interactive Programming in Python', u'An Introduction to Operations Management', u'CS-191x: Quantum Mechanics and Quantum Computation', u'CS188.1x Artificial Intelligence', u'Calculus: Single Variable', u'Computing for Data Analysis', u'Critical Thinking in Global Challenges', u'Cryptography I'], dtype='object', name=u'title')
The index of titles receiving at least 20 ratings can then be used to select rows from mean_ratings above:
mean_ratings = coursetalk.pivot_table('rating', index='title', aggfunc='mean')['rating']
mean_ratings
title
14.73x: The Challenges of Global Poverty  4.250000
2.01x: Elements of Structures  4.750000
3.091x: Introduction to Solid State Chemistry  4.166667
6.002x: Circuits and Electronics  4.800000
6.00x: Introduction to Computer Science and Programming  4.166667
7.00x: Introduction to Biology - The Secret of Life  4.666667
8.02x: Electricity and Magnetism  4.333333
8.MReVx: Mechanics ReView  5.000000
A Beginner's Guide to Irrational Behavior  4.874150
A Crash Course on Creativity  3.500000
A History of the World since 1300  4.318182
A Look at Nuclear Science and Technology  3.000000
A New History for a New China, 1700-2000: New Data and New Methods, Part 1  0.500000
AIDS  5.000000
Aboriginal Worldviews and Education  4.333333
Algorithms  4.250000
Algorithms, Part I  4.555556
Algorithms, Part II  4.500000
Algorithms: Design and Analysis, Part 1  4.777778
Algorithms: Design and Analysis, Part 2  4.500000
An Introduction to Interactive Programming in Python  4.915652
An Introduction to Operations Management  4.785714
An Introduction to the U.S. Food System: Perspectives from Public Health  5.000000
Animal Behaviour  4.500000
Applied Cryptography  4.666667
Archaeology's Dirty Little Secrets  4.928571
Artificial Intelligence Planning  3.250000
Artificial Intelligence for Robotics  4.333333
Astrobiology and the Search for Extraterrestrial Life  3.928571
Automata  4.000000
...
Sports and Society  0.666667
Stat2.1X: Introduction to Statistics: Descriptive Statistics  4.642857
Stat2.2x: Introduction to Statistics: Probability  5.000000
Stat2.3x: Introduction to Statistics: Inference  4.500000
Statistics One  3.909091
Synapses, Neurons and Brains  4.600000
Teaching Adult Learners (WPTrain)  2.500000
Technology Entrepreneurship Part 1  2.900000
Technology Entrepreneurship Part 2  0.500000
The Ancient Greeks  4.550000
The Eurozone Crisis  3.250000
The Fiction of Relationship  5.000000
The Hardware/Software Interface  3.857143
The Language of Hollywood: Storytelling, Sound, and Color  4.800000
The Massey Method: Learn Spanish from a Former NSA Agent  4.000000
The Modern World: Global History since 1760  4.775862
The Modern and the Postmodern  4.777778
The Science of Gastronomy  4.000000
The Social Context of Mental Health and Illness  4.333333
Think Again: How to Reason and Argue  3.815789
Useful Genetics Part 1  4.500000
VLSI CAD: Logic to Layout  4.500000
Vaccine Trials: Methods and Best Practices  5.000000
Vaccines  3.750000
Web Development  4.625000
Web Intelligence and Big Data  3.802326
Women and the Civil Rights Movement  5.000000
Writing for the Web (WriteWeb)  5.000000
Writing in the Sciences  4.000000
jQuery  4.250000
Name: rating, dtype: float64
With the mean rating for each course computed, we sort the active titles so that the highest-rated ones are listed first.
mean_ratings.loc[active_titles].sort_values(ascending=False)
title
An Introduction to Interactive Programming in Python  4.915652
Modern & Contemporary American Poetry  4.901515
Design: Creation of Artifacts in Society  4.879581
A Beginner's Guide to Irrational Behavior  4.874150
Greek and Roman Mythology  4.864198
Calculus: Single Variable  4.854167
CS188.1x Artificial Intelligence  4.833333
Machine Learning  4.830000
Functional Programming Principles in Scala  4.822581
Gamification  4.796296
An Introduction to Operations Management  4.785714
The Modern World: Global History since 1760  4.775862
Programming Languages  4.770833
CS-191x: Quantum Mechanics and Quantum Computation  4.727273
Cryptography I  4.700000
Discrete Optimization  4.695652
Introduction to Computer Science  4.687500
Learn to Program: Crafting Quality Code  4.585714
Model Thinking  4.578125
Internet History, Technology, and Security  4.541667
Fantasy and Science Fiction: The Human Mind, Our Modern World  4.522727
Learn to Program: The Fundamentals  4.303571
6.00x: Introduction to Computer Science and Programming  4.166667
Critical Thinking in Global Challenges  3.961538
Web Intelligence and Big Data  3.802326
Computing for Data Analysis  3.187500
Introduction to Finance  3.086957
Introduction to Data Science  3.060000
Name: rating, dtype: float64
To see the top courses among Coursera students, we can sort by the 'coursera' column in descending order:
mean_ratings = coursetalk.pivot_table('rating', index='title',columns='provider', aggfunc='mean')
mean_ratings[:10]
provider | None | codecademy | coursera | edx | khanacademy | mruniversity | novoed | open2study | udacity | udemy |
---|---|---|---|---|---|---|---|---|---|---|
title | ||||||||||
14.73x: The Challenges of Global Poverty | NaN | NaN | NaN | 4.250000 | NaN | NaN | NaN | NaN | NaN | NaN |
2.01x: Elements of Structures | NaN | NaN | NaN | 4.750000 | NaN | NaN | NaN | NaN | NaN | NaN |
3.091x: Introduction to Solid State Chemistry | NaN | NaN | NaN | 4.166667 | NaN | NaN | NaN | NaN | NaN | NaN |
6.002x: Circuits and Electronics | NaN | NaN | NaN | 4.800000 | NaN | NaN | NaN | NaN | NaN | NaN |
6.00x: Introduction to Computer Science and Programming | NaN | NaN | NaN | 4.166667 | NaN | NaN | NaN | NaN | NaN | NaN |
7.00x: Introduction to Biology - The Secret of Life | NaN | NaN | NaN | 4.666667 | NaN | NaN | NaN | NaN | NaN | NaN |
8.02x: Electricity and Magnetism | NaN | NaN | NaN | 4.333333 | NaN | NaN | NaN | NaN | NaN | NaN |
8.MReVx: Mechanics ReView | NaN | NaN | NaN | 5.000000 | NaN | NaN | NaN | NaN | NaN | NaN |
A Beginner's Guide to Irrational Behavior | NaN | NaN | 4.87415 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A Crash Course on Creativity | NaN | NaN | NaN | NaN | NaN | NaN | 3.5 | NaN | NaN | NaN |
mean_ratings['coursera'].loc[active_titles].sort_values(ascending=False)[:10]
title
An Introduction to Interactive Programming in Python  4.915652
Modern & Contemporary American Poetry  4.901515
Design: Creation of Artifacts in Society  4.879581
A Beginner's Guide to Irrational Behavior  4.874150
Greek and Roman Mythology  4.864198
Calculus: Single Variable  4.854167
Programming Languages  4.850000
Machine Learning  4.830000
Functional Programming Principles in Scala  4.822581
Gamification  4.796296
Name: coursera, dtype: float64
Now, let's go further! How about ranking the courses by the percentage of their ratings that are 4 or higher?
Let's start with a simple pivoting example that does not involve any aggregation. We can extract a ratings matrix as follows:
# transform the ratings frame into a ratings matrix
ratings_mtx_df = coursetalk.pivot_table(values='rating',
index='user_id',
columns='title')
ratings_mtx_df.iloc[:15, :15]
title | 14.73x: The Challenges of Global Poverty | 2.01x: Elements of Structures | 3.091x: Introduction to Solid State Chemistry | 6.002x: Circuits and Electronics | 6.00x: Introduction to Computer Science and Programming | 7.00x: Introduction to Biology - The Secret of Life | 8.02x: Electricity and Magnetism | 8.MReVx: Mechanics ReView | A Beginner's Guide to Irrational Behavior | A Crash Course on Creativity | A History of the World since 1300 | A Look at Nuclear Science and Technology | A New History for a New China, 1700-2000: New Data and New Methods, Part 1 | AIDS | Aboriginal Worldviews and Education |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | NaN | NaN | NaN | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
13 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Let's extract only the ratings that are 4 or higher.
ratings_gte_4 = ratings_mtx_df[ratings_mtx_df>=4.0]
# positional indexing: first 15 rows and first 15 columns
ratings_gte_4.iloc[:15, :15]
title | 14.73x: The Challenges of Global Poverty | 2.01x: Elements of Structures | 3.091x: Introduction to Solid State Chemistry | 6.002x: Circuits and Electronics | 6.00x: Introduction to Computer Science and Programming | 7.00x: Introduction to Biology - The Secret of Life | 8.02x: Electricity and Magnetism | 8.MReVx: Mechanics ReView | A Beginner's Guide to Irrational Behavior | A Crash Course on Creativity | A History of the World since 1300 | A Look at Nuclear Science and Technology | A New History for a New China, 1700-2000: New Data and New Methods, Part 1 | AIDS | Aboriginal Worldviews and Education |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | NaN | NaN | NaN | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
13 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Now, taking the total number of ratings for each course and the count of ratings that are 4 or higher, we can combine them into one DataFrame.
ratings_gte_4_pd = pd.DataFrame({'total': ratings_mtx_df.count(), 'gte_4': ratings_gte_4.count()})
ratings_gte_4_pd.head(10)
gte_4 | total | |
---|---|---|
title | ||
14.73x: The Challenges of Global Poverty | 2 | 2 |
2.01x: Elements of Structures | 2 | 2 |
3.091x: Introduction to Solid State Chemistry | 2 | 3 |
6.002x: Circuits and Electronics | 10 | 10 |
6.00x: Introduction to Computer Science and Programming | 15 | 21 |
7.00x: Introduction to Biology - The Secret of Life | 3 | 3 |
8.02x: Electricity and Magnetism | 2 | 3 |
8.MReVx: Mechanics ReView | 1 | 1 |
A Beginner's Guide to Irrational Behavior | 146 | 147 |
A Crash Course on Creativity | 2 | 5 |
ratings_gte_4_pd['gte_4_ratio'] = (ratings_gte_4_pd['gte_4'] * 1.0)/ ratings_gte_4_pd.total
ratings_gte_4_pd.head(10)
gte_4 | total | gte_4_ratio | |
---|---|---|---|
title | |||
14.73x: The Challenges of Global Poverty | 2 | 2 | 1.000000 |
2.01x: Elements of Structures | 2 | 2 | 1.000000 |
3.091x: Introduction to Solid State Chemistry | 2 | 3 | 0.666667 |
6.002x: Circuits and Electronics | 10 | 10 | 1.000000 |
6.00x: Introduction to Computer Science and Programming | 15 | 21 | 0.714286 |
7.00x: Introduction to Biology - The Secret of Life | 3 | 3 | 1.000000 |
8.02x: Electricity and Magnetism | 2 | 3 | 0.666667 |
8.MReVx: Mechanics ReView | 1 | 1 | 1.000000 |
A Beginner's Guide to Irrational Behavior | 146 | 147 | 0.993197 |
A Crash Course on Creativity | 2 | 5 | 0.400000 |
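The same kind of ratio can also be computed directly from a long-format ratings frame with groupby, without building the full user-by-title matrix first. The sketch below uses made-up data. The names `toy`, `is_high`, and `summary` are assumptions for this example. Note that, unlike the pivot-table route, it counts every rating row, so the two can differ if a user rated the same title more than once.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B'],
    'rating': [5.0, 4.0, 2.0, 3.5, 4.5],
})

# Boolean flag per rating row: True when the rating is 4 or higher.
is_high = toy['rating'] >= 4.0

summary = pd.DataFrame({
    'total': toy.groupby('title').size(),
    'gte_4': is_high.groupby(toy['title']).sum(),
})
summary['gte_4_ratio'] = summary['gte_4'] / summary['total']
print(summary)
```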
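# Select the columns explicitly so each tuple is (title, gte_4, total, score)
cols = ['gte_4', 'total', 'gte_4_ratio']
ranking = [(title, gte_4, total, score) for title, gte_4, total, score in ratings_gte_4_pd[cols].itertuples()]
# Sort by ratio first, then by total ratings, then by the count of 4+ ratings
for title, gte_4, total, score in sorted(ranking, key=lambda x: (x[3], x[2], x[1]), reverse=True)[:10]:
    print(title, gte_4, total, score)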
Functional Programming Principles in Scala 31 31 1.0
Introduction to Computer Science 24 24 1.0
Programming Languages 24 24 1.0
Web Development 16 16 1.0
6.002x: Circuits and Electronics 10 10 1.0
Compilers 8 8 1.0
Archaeology's Dirty Little Secrets 7 7 1.0
How to Build a Startup 7 7 1.0
Introduction to Sociology 7 7 1.0
Stat2.1X: Introduction to Statistics: Descriptive Statistics 7 7 1.0
Now for something simpler. Let's count the number of ratings for each course and list the most-rated courses first.
ratings_by_title = coursetalk.groupby('title').size()
ratings_by_title.sort_values(ascending=False)[:10]
title
An Introduction to Interactive Programming in Python    575
Design: Creation of Artifacts in Society                191
A Beginner's Guide to Irrational Behavior               147
Modern & Contemporary American Poetry                   132
An Introduction to Operations Management                 98
Greek and Roman Mythology                                81
Critical Thinking in Global Challenges                   65
Gamification                                             54
Machine Learning                                         50
Web Intelligence and Big Data                            43
dtype: int64
Considering this information we can sort by the most rated ones with highest percentage of 4+ ratings.
# Each ranking tuple is (title, gte_4, total, score); sort by total ratings, then by ratio
for title, gte_4, total, score in sorted(ranking, key=lambda x: (x[2], x[3], x[1]), reverse=True)[:10]:
    print(title, gte_4, total, score)
An Introduction to Interactive Programming in Python 572 575 0.994782608696
Design: Creation of Artifacts in Society 190 191 0.994764397906
A Beginner's Guide to Irrational Behavior 146 147 0.993197278912
Modern & Contemporary American Poetry 130 132 0.984848484848
An Introduction to Operations Management 96 98 0.979591836735
Greek and Roman Mythology 80 81 0.987654320988
Critical Thinking in Global Challenges 47 65 0.723076923077
Gamification 52 54 0.962962962963
Machine Learning 48 49 0.979591836735
Web Intelligence and Big Data 26 43 0.604651162791
Finally, using the formula we learned above, let's find the courses that most often co-occur with the popular MOOC An Introduction to Interactive Programming in Python, using the "(x and y) / x" method: the number of users who rated both x and y divided by the number of users who rated x. For each course, calculate the percentage of Interactive Programming in Python raters who also rated that course. Order with the highest percentage first, and voilà, we have the top 5 MOOCs.
course_users = coursetalk.pivot_table('rating', index='title', columns='user_id')
course_users.iloc[:15, :15]
title / user_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14.73x: The Challenges of Global Poverty | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2.01x: Elements of Structures | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3.091x: Introduction to Solid State Chemistry | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6.002x: Circuits and Electronics | NaN | NaN | NaN | NaN | NaN | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6.00x: Introduction to Computer Science and Programming | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7.00x: Introduction to Biology - The Secret of Life | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8.02x: Electricity and Magnetism | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8.MReVx: Mechanics ReView | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A Beginner's Guide to Irrational Behavior | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A Crash Course on Creativity | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A History of the World since 1300 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A Look at Nuclear Science and Technology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A New History for a New China, 1700-2000: New Data and New Methods, Part 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
AIDS | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Aboriginal Worldviews and Education | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
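The table is almost entirely NaN because each user rates only a handful of courses. A small sketch on invented data shows how pivot_table produces this course-by-user layout. It uses iloc for the positional slice, since the .ix indexer was removed in pandas 1.0.

```python
import pandas as pd

# Invented ratings: three users, two courses.
toy = pd.DataFrame({
    "user_id": [1, 2, 1, 3],
    "title": ["Python", "Python", "Poetry", "Poetry"],
    "rating": [5, 4, 3, 4],
})

# Rows are courses, columns are users; a missing (course, user) pair becomes NaN.
matrix = toy.pivot_table("rating", index="title", columns="user_id")

# Positional slice of the top-left corner; out-of-range slice bounds are fine.
print(matrix.iloc[:15, :15])
```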
First, let's select only the ratings for the course An Introduction to Interactive Programming in Python and index them by user.
ratings_by_course = coursetalk[coursetalk.title == 'An Introduction to Interactive Programming in Python']
# Re-index by user_id without modifying the filtered slice in place
ratings_by_course = ratings_by_course.set_index('user_id')
Now, for every course, let's keep only the ratings from users who also rated the Python course.
their_ids = ratings_by_course.index
their_ratings = course_users[their_ids]
their_ratings.iloc[:15, :15]
title / user_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14.73x: The Challenges of Global Poverty | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2.01x: Elements of Structures | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3.091x: Introduction to Solid State Chemistry | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6.002x: Circuits and Electronics | NaN | NaN | NaN | NaN | NaN | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6.00x: Introduction to Computer Science and Programming | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7.00x: Introduction to Biology - The Secret of Life | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8.02x: Electricity and Magnetism | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8.MReVx: Mechanics ReView | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A Beginner's Guide to Irrational Behavior | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A Crash Course on Creativity | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A History of the World since 1300 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A Look at Nuclear Science and Technology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A New History for a New China, 1700-2000: New Data and New Methods, Part 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
AIDS | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Aboriginal Worldviews and Education | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Applying the division gives us our percentage: the number of users who rated both the Python course and the given course, divided by the total number of users who rated the Python course.
course_count = their_ratings.loc['An Introduction to Interactive Programming in Python'].count()
sims = their_ratings.apply(lambda profile: profile.count() / float(course_count) , axis=1)
Order by the score, highest first, and skip the first entry, which is the Python course itself (its score is 1.0 by construction).
sims.sort_values(ascending=False).iloc[1:11]
title
Cryptography I                              0.006957
Machine Learning                            0.006957
CS-169.1x: Software as a Service            0.005217
Python                                      0.005217
Introduction to Computer Science            0.005217
Human-Computer Interaction                  0.005217
Computational Investing, Part I             0.005217
Learn to Program: Crafting Quality Code     0.005217
Web Development                             0.005217
Gamification                                0.005217
dtype: float64
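The scores are small because few Python course raters rated anything else: 0.006957 and 0.005217 are consistent with 4 and 3 shared raters out of 575. The whole procedure can also be sketched as one function over sets of user ids. The function name co_rated_share and the toy data are assumptions made for illustration; this is an alternative formulation, not the code used above.

```python
import pandas as pd

def co_rated_share(ratings, target, top_n=10):
    """Share of `target` raters who also rated each other course."""
    # One set of user ids per course title.
    raters = {title: set(group["user_id"])
              for title, group in ratings.groupby("title")}
    target_raters = raters[target]
    # |users who rated both| / |users who rated the target|, target excluded.
    shares = pd.Series({
        title: len(users & target_raters) / len(target_raters)
        for title, users in raters.items()
        if title != target
    })
    return shares.sort_values(ascending=False).head(top_n)

# Invented ratings: four Python raters, two of whom rated ML, one Crypto.
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 1, 2, 3, 5],
    "title": ["Python"] * 4 + ["ML", "ML", "Crypto", "Crypto"],
})
result = co_rated_share(toy, "Python")
print(result)
```

Dropping the target course by name, rather than skipping the first sorted entry, avoids relying on the target sorting first when another course is tied with it at a score of 1.0.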