DataPhilly is a local data meetup group I started back in 2012. I had attended a few data science conferences and was really disappointed by the lack of a local meetup group for people interested in data science. And so DataPhilly was born!
The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.
Through Jupyter's kernel and messaging architecture, the Notebook allows code to be run in a range of different programming languages. For each notebook document that a user opens, the web application starts a kernel that runs the code for that notebook. Each kernel is capable of running code in a single programming language, and kernels are available for many languages, including Python, R, and Julia.
The default kernel runs Python code. The notebook provides a simple way for users to pick which of these kernels is used for a given notebook.
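If you're curious which kernels are installed on your own machine, recent versions of Jupyter ship a kernelspec command you can run right from a notebook cell (the ! prefix runs a shell command; the output will vary from machine to machine):

!jupyter kernelspec list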
Jupyter examples and tutorials can be found in the Jupyter GitHub repo.
The task I'll be walking you through today will demonstrate how to use Python for exploratory data analysis. The dataset I'll use is one I created by querying the Meetup API for the DataPhilly meetup. I'll walk you through using Jupyter Notebook (the web app we're using now), Pandas (an Excel-like tool for data exploration) and scikit-learn (a Python machine learning library) to explore the DataPhilly dataset. I won't go in depth into these tools, but my hope is that you'll leave my talk wanting to learn more about using Python for exploratory data analysis and that you'll learn some interesting things about DataPhilly in the process.
First let's start off by initializing our environment
%matplotlib inline
initializes matplotlib so that we can display graphs and charts in our notebook.

import seaborn as sns

imports seaborn, a graphing library built on top of matplotlib.

import pandas as pd

imports pandas, a tool I'll explain in the next section.

Hint: If you've installed Jupyter Notebook and you're running this on your machine, you can use the run button in the toolbar at the top of the page to execute each cell.
Click on the cell above and the cell below. You'll notice that the cell above is Markdown; you can edit it by double-clicking on it. The cell below contains Python code, which can be modified and executed. If the code has any output, it will be printed out below the cell with Out [n]: in front of it.
%matplotlib inline
import seaborn as sns
import pandas as pd
from matplotlib import rcParams
# Modify aesthetics for visibility during presentation
sns.set_style('darkgrid', {'axes.facecolor': '#C2C2C8'})
sns.set_palette('colorblind')
# Make everything bigger for visibility during presentation
rcParams['figure.figsize'] = 20, 10
rcParams['axes.titlesize'] = 'xx-large'
rcParams['axes.labelsize'] = 'x-large'
rcParams['xtick.labelsize'] = 'x-large'
rcParams['ytick.labelsize'] = 'x-large'
rcParams['legend.fontsize'] = 'xx-large'
rcParams['lines.linewidth'] = 4.0
rcParams['grid.linewidth'] = 2.0
# Hide warnings in the notebook
import warnings
warnings.filterwarnings('ignore')
Pandas is a library that provides data analysis tools for the Python programming language. You can think of it as Excel on steroids, but in Python.
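For example, here's a minimal sketch of building a DataFrame by hand from a Python dict, where each key becomes a column (the values here are just made-up toy data, not part of the DataPhilly dataset):

import pandas as pd

# Each dict key becomes a column in the resulting table
toy_df = pd.DataFrame({'meetup': ['DataPhilly', 'GeoPhilly'],
                       'attendees': [60, 40]})
toy_df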
To start off, I've used the Meetup API to gather a bunch of data on members of the DataPhilly meetup group. First let's look at the events we've had over the past few years. I've loaded the data into a pandas DataFrame and stored it in the file events.pkl. A DataFrame is a table similar to an Excel spreadsheet. Let's load it and see what it looks like:
events_df = pd.read_pickle('events.pkl')
events_df = events_df.sort_values(by='time')
events_df
| | created | name | rating | time | waitlist_count | yes_rsvp_count | id |
---|---|---|---|---|---|---|---
0 | 1351948193000 | Meet and greet | {u'count': 3, u'average': 5} | 1352934000000 | 0 | 17 | 89769502 |
1 | 1357781071000 | DataPhilly January 2013 Meetup - An Introducti... | {u'count': 6, u'average': 4.17000007629} | 1359588600000 | 0 | 61 | 98833672 |
2 | 1359732939000 | DataPhilly February 2013 Meetup - Data Science... | {u'count': 5, u'average': 5} | 1361316600000 | 0 | 47 | 102502622 |
3 | 1361647778000 | DataPhilly March 2013 Meetup - Data Analysis u... | {u'count': 8, u'average': 5} | 1364423400000 | 0 | 62 | 106043892 |
4 | 1362506708000 | DataPhilly April 2013 Meetup - Machine Learnin... | {u'count': 7, u'average': 4.57000017166} | 1366151400000 | 2 | 54 | 107740582 |
5 | 1369104714000 | DataPhilly June 2013 - Hadoop: BigSheets & Pig | {u'count': 4, u'average': 3} | 1370471400000 | 5 | 41 | 120425212 |
6 | 1375999505000 | DataPhilly August 2013 - Data Science with R | {u'count': 11, u'average': 4.55000019073} | 1377037800000 | 0 | 77 | 133803672 |
7 | 1378332108000 | DataPhilly September 2013 - Data Storytime | {u'count': 9, u'average': 5} | 1380234600000 | 0 | 64 | 138415912 |
8 | 1381360216000 | DataPhilly October 2013 - Data Science Tools a... | {u'count': 11, u'average': 4.73000001907} | 1382565600000 | 0 | 50 | 144769822 |
9 | 1383762778000 | DataPhilly November 2013 - Data in Practice | {u'count': 3, u'average': 4.67000007629} | 1384815600000 | 0 | 67 | 149515412 |
10 | 1389631621000 | DataPhilly January 2014 - Two Hours of Lightni... | {u'count': 6, u'average': 4.82999992371} | 1391036400000 | 0 | 69 | 160323532 |
11 | 1393608501000 | DataPhilly March 2014 - Interactive Data Visua... | {u'count': 9, u'average': 4.67000007629} | 1394661600000 | 0 | 69 | 168747852 |
12 | 1396956902000 | DataPhilly April 2014: Art and Data | {u'count': 4, u'average': 4.75} | 1397685600000 | 0 | 39 | 175993712 |
13 | 1400001749000 | DataPhilly May 2014: Data Discovery | {u'count': 7, u'average': 5} | 1400709600000 | 0 | 60 | 182860422 |
14 | 1410488369000 | Explore All the Data! | {u'count': 2, u'average': 5} | 1412719200000 | 0 | 44 | 206754182 |
15 | 1414103507000 | Explore All the Data! | {u'count': 3, u'average': 4} | 1415314800000 | 0 | 41 | 215265722 |
16 | 1417659431000 | DataPhilly - December 2014 | {u'count': 5, u'average': 5} | 1418770800000 | 2 | 68 | 219055217 |
17 | 1421280214000 | DataPhilly & GeoPhilly: Open Data Day Meetup | {u'count': 4, u'average': 4.5} | 1424386800000 | 83 | 57 | 219840555 |
18 | 1423955223000 | DataPhilly: March Meetup | {u'count': 3, u'average': 4.67000007629} | 1426802400000 | 0 | 114 | 220526799 |
19 | 1426720048000 | DataPhilly: April; Philly Tech Week Edition | {u'count': 9, u'average': 5} | 1429221600000 | 19 | 115 | 221245827 |
20 | 1442763491000 | DataPhilly October | {u'count': 6, u'average': 4.82999992371} | 1445551200000 | 7 | 139 | 225488147 |
You can access values in a DataFrame column like this:
events_df['yes_rsvp_count']
0      17
1      61
2      47
3      62
4      54
5      41
6      77
7      64
8      50
9      67
10     69
11     69
12     39
13     60
14     44
15     41
16     68
17     57
18    114
19    115
20    139
Name: yes_rsvp_count, dtype: int64
You can access a row of a DataFrame using iloc:
events_df.iloc[4]
created                                            1362506708000
name          DataPhilly April 2013 Meetup - Machine Learnin...
rating                 {u'count': 7, u'average': 4.57000017166}
time                                               1366151400000
waitlist_count                                                 2
yes_rsvp_count                                                54
id                                                     107740582
Name: 4, dtype: object
We can view the first few rows using the head method:
events_df.head()
| | created | name | rating | time | waitlist_count | yes_rsvp_count | id |
---|---|---|---|---|---|---|---
0 | 1351948193000 | Meet and greet | {u'count': 3, u'average': 5} | 1352934000000 | 0 | 17 | 89769502 |
1 | 1357781071000 | DataPhilly January 2013 Meetup - An Introducti... | {u'count': 6, u'average': 4.17000007629} | 1359588600000 | 0 | 61 | 98833672 |
2 | 1359732939000 | DataPhilly February 2013 Meetup - Data Science... | {u'count': 5, u'average': 5} | 1361316600000 | 0 | 47 | 102502622 |
3 | 1361647778000 | DataPhilly March 2013 Meetup - Data Analysis u... | {u'count': 8, u'average': 5} | 1364423400000 | 0 | 62 | 106043892 |
4 | 1362506708000 | DataPhilly April 2013 Meetup - Machine Learnin... | {u'count': 7, u'average': 4.57000017166} | 1366151400000 | 2 | 54 | 107740582 |
And similarly the last few using tail:
events_df.tail(3)
| | created | name | rating | time | waitlist_count | yes_rsvp_count | id |
---|---|---|---|---|---|---|---
18 | 1423955223000 | DataPhilly: March Meetup | {u'count': 3, u'average': 4.67000007629} | 1426802400000 | 0 | 114 | 220526799 |
19 | 1426720048000 | DataPhilly: April; Philly Tech Week Edition | {u'count': 9, u'average': 5} | 1429221600000 | 19 | 115 | 221245827 |
20 | 1442763491000 | DataPhilly October | {u'count': 6, u'average': 4.82999992371} | 1445551200000 | 7 | 139 | 225488147 |
We can see that the yes_rsvp_count column contains the number of people who RSVPed yes for each event. First let's look at some basic statistics:
yes_rsvp_count = events_df['yes_rsvp_count']
yes_rsvp_count.sum(), yes_rsvp_count.mean(), yes_rsvp_count.min(), yes_rsvp_count.max()
(1355, 64.523809523809518, 17, 139)
When we access a single column of the DataFrame like this we get a Series object, which is just a 1-dimensional version of a DataFrame.
type(yes_rsvp_count)
pandas.core.series.Series
We can use the built-in describe method to print out a lot of useful stats in a nice tabular format:
yes_rsvp_count.describe()
count     21.000000
mean      64.523810
std       28.212797
min       17.000000
25%       47.000000
50%       61.000000
75%       69.000000
max      139.000000
Name: yes_rsvp_count, dtype: float64
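As an aside, describe isn't limited to a single Series; calling it on the whole DataFrame summarizes every numeric column at once:

# Summary statistics for every numeric column in the events DataFrame
events_df.describe()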
Next I'd like to graph the number of RSVPs over time to see if there are any interesting trends. To do this let's first sum the waitlist_count and yes_rsvp_count columns and make a new column called total_RSVP_count.
events_df['total_RSVP_count'] = events_df['waitlist_count'] + events_df['yes_rsvp_count']
events_df['total_RSVP_count']
0      17
1      61
2      47
3      62
4      56
5      46
6      77
7      64
8      50
9      67
10     69
11     69
12     39
13     60
14     44
15     41
16     70
17    140
18    114
19    134
20    146
Name: total_RSVP_count, dtype: int64
We can plot these values using the plot method:
events_df['total_RSVP_count'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1045c68d0>
The plot method utilizes the matplotlib library behind the scenes to draw the plot. This is interesting, but it would be nice to have the dates of the meetups on the X-axis of the plot.

To accomplish this, let's convert the time field from a Unix epoch timestamp to a Python datetime utilizing the apply method and a function.
events_df.head(2)
| | created | name | rating | time | waitlist_count | yes_rsvp_count | id | total_RSVP_count |
---|---|---|---|---|---|---|---|---
0 | 1351948193000 | Meet and greet | {u'count': 3, u'average': 5} | 1352934000000 | 0 | 17 | 89769502 | 17 |
1 | 1357781071000 | DataPhilly January 2013 Meetup - An Introducti... | {u'count': 6, u'average': 4.17000007629} | 1359588600000 | 0 | 61 | 98833672 | 61 |
import datetime

def get_datetime_from_epoch(epoch):
    # Meetup timestamps are in milliseconds, so convert to seconds
    # before handing them to fromtimestamp
    return datetime.datetime.fromtimestamp(epoch/1000.0)
events_df['time'] = events_df['time'].apply(get_datetime_from_epoch)
events_df['time']
0    2012-11-14 18:00:00
1    2013-01-30 18:30:00
2    2013-02-19 18:30:00
3    2013-03-27 18:30:00
4    2013-04-16 18:30:00
5    2013-06-05 18:30:00
6    2013-08-20 18:30:00
7    2013-09-26 18:30:00
8    2013-10-23 18:00:00
9    2013-11-18 18:00:00
10   2014-01-29 18:00:00
11   2014-03-12 18:00:00
12   2014-04-16 18:00:00
13   2014-05-21 18:00:00
14   2014-10-07 18:00:00
15   2014-11-06 18:00:00
16   2014-12-16 18:00:00
17   2015-02-19 18:00:00
18   2015-03-19 18:00:00
19   2015-04-16 18:00:00
20   2015-10-22 18:00:00
Name: time, dtype: datetime64[ns]
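As an aside, pandas can do this conversion without a hand-rolled function: to_datetime accepts epoch timestamps directly when you pass it the unit (milliseconds in Meetup's case). It's shown commented out here because our time column has already been converted; note too that, unlike datetime.fromtimestamp, to_datetime yields UTC rather than local time.

# Vectorized equivalent of the apply above (for reference only):
# events_df['time'] = pd.to_datetime(events_df['time'], unit='ms')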
Next let's make the time column the index of the DataFrame using the set_index method and then re-plot our data.
events_df.set_index('time', inplace=True)
events_df[['total_RSVP_count']].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x109664550>
We can also easily plot multiple columns on the same plot.
all_rsvps = events_df[['yes_rsvp_count', 'waitlist_count', 'total_RSVP_count']]
all_rsvps.plot(title='Attendance over time')
<matplotlib.axes._subplots.AxesSubplot at 0x10993a890>
Alright so I'm seeing some interesting trends here. Let's take a look at something different.
The Meetup API also provides us access to member info. Let's have a look at the data we have available:
members_df = pd.read_pickle('members.pkl')
for column in ['joined', 'visited']:
members_df[column] = members_df[column].apply(get_datetime_from_epoch)
members_df.head(3)
| | anon_id | anon_name | bio | city | country | gender | hometown | joined | lat | lon | membership_count | state | topics | visited |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | James | NaN | Philadelphia | us | male | Philadelphia | 2015-10-10 21:43:33 | 39.94 | -75.23 | 0 | PA | [{u'name': u'Poker', u'urlkey': u'poker', u'id... | 2015-10-10 21:43:33 |
1 | 1 | Vijay | java software developer in center city philly | Philadelphia | us | male | Philadelphia | 2013-11-22 22:32:04 | 39.96 | -75.20 | 0 | PA | [] | 2015-04-13 20:20:24 |
2 | 2 | Justin | NaN | Philadelphia | us | male | NaN | 2015-06-10 16:18:43 | 40.00 | -75.14 | 63 | PA | [{u'name': u'Extreme Programming', u'urlkey': ... | 2015-10-22 15:42:23 |
You'll notice that I've anonymized the meetup member_id and the member's name. I've also used the Python module SexMachine to infer members' genders based on their first names. I ran SexMachine on the original names before I anonymized them. Let's have a closer look at the gender breakdown of our members:
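In case you're curious, the inference looks roughly like this (a sketch assuming the SexMachine package is installed; the name is just an example). Ambiguous names come back as 'andy' (androgynous), which is why that value shows up in the counts below:

import sexmachine.detector as gender

# SexMachine guesses a gender from a first name
detector = gender.Detector()
detector.get_gender(u'Sarah')  # u'female'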
gender_counts = members_df['gender'].value_counts()
gender_counts
male             716
andy             257
female           175
mostly_male       91
mostly_female     35
Name: gender, dtype: int64
Next let's use the hist method to plot a histogram of membership_count. This is the number of groups each member is in.
members_df['membership_count'].hist(bins=20)
<matplotlib.axes._subplots.AxesSubplot at 0x10b34a990>
Something looks odd here; let's check out the value_counts:
members_df['membership_count'].value_counts().head()
0    124
2    105
1     96
3     86
5     77
Name: membership_count, dtype: int64
Okay so most members are members of 0 meetup groups?! This seems odd! I did a little digging and came up with the answer: members can set their membership details to be private, in which case this value will be zero. Let's filter out these members and recreate the histogram.
members_df_non_zero = members_df[members_df['membership_count'] != 0]
members_df_non_zero['membership_count'].hist(bins=50)
<matplotlib.axes._subplots.AxesSubplot at 0x10b534bd0>
Okay so most members are only members of a few meetup groups. There are some outliers that are pretty hard to read; let's try plotting this on a logarithmic scale to see if that helps:
ax = members_df_non_zero['membership_count'].hist(bins=50)
ax.set_yscale('log')
ax.set_xlim(0, 500)
(0, 500)
Let's use a mask to single out the outliers so we can dig into them a little further:
all_the_meetups = members_df[members_df['membership_count'] > 100]
filtered = all_the_meetups[['membership_count', 'city', 'country', 'state']]
filtered.sort_values(by='membership_count', ascending=False)
| | membership_count | city | country | state |
---|---|---|---|---
301 | 1838 | Berlin | de | NaN |
25 | 816 | San Francisco | us | CA |
141 | 651 | Jerusalem | il | NaN |
67 | 303 | Philadelphia | us | PA |
420 | 295 | Baltimore | us | MD |
1178 | 278 | Princeton | us | NJ |
257 | 241 | New York | us | NY |
223 | 207 | Scarsdale | us | NY |
150 | 197 | Philadelphia | us | PA |
174 | 166 | Philadelphia | us | PA |
86 | 166 | West Chester | us | PA |
449 | 146 | Exton | us | PA |
154 | 119 | San Francisco | us | CA |
1158 | 119 | Philadelphia | us | PA |
1022 | 113 | Levittown | us | PA |
868 | 106 | Seattle | us | WA |
987 | 102 | San Francisco | us | CA |
The people from Philly might actually be legitimate members, so let's use a compound mask to filter them out as well:
all_the_meetups = members_df[
(members_df['membership_count'] > 100) & (members_df['city'] != 'Philadelphia')
]
filtered = all_the_meetups[['membership_count', 'city', 'country', 'state']]
filtered.sort_values(by='membership_count', ascending=False)
| | membership_count | city | country | state |
---|---|---|---|---
301 | 1838 | Berlin | de | NaN |
25 | 816 | San Francisco | us | CA |
141 | 651 | Jerusalem | il | NaN |
420 | 295 | Baltimore | us | MD |
1178 | 278 | Princeton | us | NJ |
257 | 241 | New York | us | NY |
223 | 207 | Scarsdale | us | NY |
86 | 166 | West Chester | us | PA |
449 | 146 | Exton | us | PA |
154 | 119 | San Francisco | us | CA |
1022 | 113 | Levittown | us | PA |
868 | 106 | Seattle | us | WA |
987 | 102 | San Francisco | us | CA |
That's strange; I don't think we've ever had any members from Berlin, San Francisco, or Jerusalem in attendance :-).
Moving on, we also have all the events that each member RSVPed to:
rsvps_df = pd.read_pickle('rsvps.pkl')
rsvps_df.head(3)
| | 102502622 | 106043892 | 107740582 | 120425212 | 133803672 | 138415912 | 144769822 | 149515412 | 160323532 | 168747852 | ... | 206754182 | 215265722 | 219055217 | 219840555 | 220526799 | 221245827 | 225488147 | 89769502 | 98833672 | member_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 151 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 370 |
2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 157 |
3 rows × 22 columns
We can utilize the pandas merge method to join our members DataFrame and our rsvps DataFrame:
joined_with_rsvps_df = pd.merge(members_df, rsvps_df, left_on='anon_id', right_on='member_id')
joined_with_rsvps_df.head(3)
| | anon_id | anon_name | bio | city | country | gender | hometown | joined | lat | lon | ... | 206754182 | 215265722 | 219055217 | 219840555 | 220526799 | 221245827 | 225488147 | 89769502 | 98833672 | member_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 4 | Edward | NaN | Downingtown | us | male | Philadelphia | 2015-05-20 05:24:59 | 40.02 | -75.71 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 4 |
1 | 8 | John | CTO as SnipSnap, the coupon app | Woodbury | us | male | NaN | 2013-01-29 22:49:12 | 39.83 | -75.13 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
2 | 11 | John | Founder and CEO of Azavea, a firm that builds ... | Philadelphia | us | male | NaN | 2012-11-06 12:18:12 | 39.95 | -75.16 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 11 |
3 rows × 36 columns
joined_with_rsvps_df.columns
Index([ u'anon_id', u'anon_name', u'bio', u'city', u'country', u'gender', u'hometown', u'joined', u'lat', u'lon', u'membership_count', u'state', u'topics', u'visited', u'102502622', u'106043892', u'107740582', u'120425212', u'133803672', u'138415912', u'144769822', u'149515412', u'160323532', u'168747852', u'175993712', u'182860422', u'206754182', u'215265722', u'219055217', u'219840555', u'220526799', u'221245827', u'225488147', u'89769502', u'98833672', u'member_id'], dtype='object')
Now we have a ton of data; let's see what kind of interesting things we can discover. Let's look at some stats on male attendees vs. female attendees:
First we can use the isin method to make DataFrames for male and female members.
male_attendees = joined_with_rsvps_df[joined_with_rsvps_df['gender'].isin(['male', 'mostly_male'])]
male_attendees.tail(3)
| | anon_id | anon_name | bio | city | country | gender | hometown | joined | lat | lon | ... | 206754182 | 215265722 | 219055217 | 219840555 | 220526799 | 221245827 | 225488147 | 89769502 | 98833672 | member_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
578 | 1261 | Tom | NaN | Philadelphia | us | male | NaN | 2015-10-06 17:50:59 | 39.96 | -75.20 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1261 |
579 | 1262 | Daniel | NaN | Philadelphia | us | male | NaN | 2015-10-07 15:48:58 | 39.97 | -75.17 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1262 |
583 | 1271 | Chris | NaN | Philadelphia | us | male | NaN | 2015-10-20 15:00:55 | 39.96 | -75.20 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1271 |
3 rows × 36 columns
female_attendees = joined_with_rsvps_df[joined_with_rsvps_df['gender'].isin(['female', 'mostly_female'])]
female_attendees.tail(3)
| | anon_id | anon_name | bio | city | country | gender | hometown | joined | lat | lon | ... | 206754182 | 215265722 | 219055217 | 219840555 | 220526799 | 221245827 | 225488147 | 89769502 | 98833672 | member_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
580 | 1265 | Erin | NaN | Philadelphia | us | female | NaN | 2015-10-13 18:13:37 | 39.95 | -75.16 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1265 |
581 | 1268 | Anne | NaN | Philadelphia | us | female | Philadelphia | 2015-10-18 15:29:42 | 39.96 | -75.20 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1268 |
582 | 1269 | Stacey | NaN | Philadelphia | us | female | NaN | 2015-10-20 09:55:35 | 39.96 | -75.20 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1269 |
3 rows × 36 columns
Next we can use the sum method to count the number of male and female attendees per event and create a Series for each.
event_ids = [
'102502622', '106043892', '107740582', '120425212', '133803672', '138415912', '144769822', '149515412',
'160323532', '168747852', '175993712', '182860422', '206754182', '215265722', '219055217', '219840555',
'220526799', '221245827', '225488147', '89769502', '98833672'
]
male_attendees[event_ids].sum().head(3)
102502622    30
106043892    35
107740582    33
dtype: float64
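As a side note, rather than typing the event IDs out by hand, we could have derived the same list from the rsvps DataFrame, since every column except member_id is an event:

# The same list of event IDs, pulled from the rsvps DataFrame's columns
event_ids = [col for col in rsvps_df.columns if col != 'member_id']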
We can then recombine the male and female Series into a new DataFrame.
gender_attendance = pd.DataFrame({'male': male_attendees[event_ids].sum(), 'female': female_attendees[event_ids].sum()})
gender_attendance.head(3)
| | female | male |
---|---|---
102502622 | 2 | 30 |
106043892 | 6 | 35 |
107740582 | 3 | 33 |
And then we can use merge again to combine this with our events DataFrame.
events_with_gender_df = pd.merge(events_df, gender_attendance, left_on='id', right_index=True)
events_with_gender_df.head(3)
time | created | name | rating | waitlist_count | yes_rsvp_count | id | total_RSVP_count | female | male |
---|---|---|---|---|---|---|---|---|---
2012-11-14 18:00:00 | 1351948193000 | Meet and greet | {u'count': 3, u'average': 5} | 0 | 17 | 89769502 | 17 | 1 | 8 |
2013-01-30 18:30:00 | 1357781071000 | DataPhilly January 2013 Meetup - An Introducti... | {u'count': 6, u'average': 4.17000007629} | 0 | 61 | 98833672 | 61 | 1 | 36 |
2013-02-19 18:30:00 | 1359732939000 | DataPhilly February 2013 Meetup - Data Science... | {u'count': 5, u'average': 5} | 0 | 47 | 102502622 | 47 | 2 | 30 |
Then we can plot the attendance by gender over time:
gender_df = events_with_gender_df[['female', 'male']]
gender_df.plot(title='Attendance by gender over time')
<matplotlib.axes._subplots.AxesSubplot at 0x10c19f290>
This might be easier to interpret by looking at the percentage of females in attendance. We can use the div (divide) method to calculate this.
female_ratio = gender_df['female'].div(gender_df['male'] + gender_df['female'])
female_ratio.plot(title='Percentage female attendance over time', ylim=(0.0, 1.0))
<matplotlib.axes._subplots.AxesSubplot at 0x10c8e5110>