%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('max_columns', 50)
# this would be a huge pain to load into a database
mo = pd.read_csv('data/mariano-rivera.csv')
mo.tail()
pandas can fetch data from a URL ...
clean = lambda s: s.replace('$', '')[:-1] if '.' in s else s.replace('$', '') # strip the '$' and, when there's a decimal, the trailing period
url = 'https://raw.github.com/gjreda/best-sandwiches/master/data/best-sandwiches-geocode.tsv'
sandwiches = pd.read_table(url, sep='\t', converters={'price': lambda s: float(clean(s))})
sandwiches.head(3)
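A quick sanity check on that converter - the prices appear to be strings like '$10' or '$9.50.' (with a trailing period after decimals); the examples below are made up to illustrate.
for s in ['$10', '$9.50.']:
    print(s, '->', float(clean(s)))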
gh = pd.read_json('https://api.github.com/repos/pydata/pandas/issues?per_page=3')
gh[['body', 'created_at', 'title', 'url']].head(3)
You'll likely need to do some parsing, though - pandas' read_json doesn't do well with nested JSON yet.
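If you do hit nested JSON, one option (sketched here) is to pull the raw JSON down yourself and flatten it with json_normalize - it lives at the top level in newer pandas (pandas.io.json.json_normalize in older versions).
import requests
raw = requests.get('https://api.github.com/repos/pydata/pandas/issues?per_page=3').json()
issues = pd.json_normalize(raw)  # nested fields become dotted columns, e.g. 'user.login'
issues[['number', 'title', 'user.login']].head(3)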
Possibly my favorite way to read data into pandas ...
clip = pd.read_clipboard()
clip.head()
# All of this is basically the same as it would be with Postgres, MySQL, or any other database
# Just pass pandas a connection object and it'll take care of the rest.
import sqlite3
conn = sqlite3.connect('data/towed.db')
query = "SELECT * FROM towed"
towed = pd.read_sql(query, con=conn, parse_dates={'date': '%m/%d/%Y'})
towed.head()
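Writing back out is just as painless - a quick sketch using a throwaway table name.
towed.to_sql('towed_backup', conn, if_exists='replace', index=False)  # 'towed_backup' is just for illustration
pd.read_sql("SELECT COUNT(*) FROM towed_backup", con=conn)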
We're going to work with the towed dataset for a bit. We've already been using .head(), but there's also .tail(). You can also use standard Python slicing.
towed[100:105]
towed.info() # empty string showing up as non-null value
mo.info() # note the nulls in the Awards column
mo.describe() # basic stats for any numeric column
towed.set_index('date', inplace=True)
# SELECT *
# FROM towed
# WHERE date = '2014-11-04'
# LIMIT 5;
towed.loc['2014-11-04']
towed.loc['2014-11-04', 'make'] # get a Series back (or individual values, if unique)
# SELECT *
# FROM towed
# WHERE date = '2014-11-04';
towed.reset_index(inplace=True)
towed[towed['date'] == '2014-11-04']
towed.set_index('date', inplace=True)
cols = ['make', 'color', 'state'] # just passing a list
towed[cols][:5]
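And since date is the index here, .loc takes rows and columns in one shot - roughly SELECT make, color, state FROM towed WHERE date = '2014-11-04'.
towed.loc['2014-11-04', cols].head()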
# SELECT *
# FROM towed
# WHERE make = 'FORD'
# ORDER BY state DESC
# LIMIT 5;
towed[towed['make'] == 'FORD'].sort_values('state', ascending=False).head()
# SELECT *
# FROM towed
# WHERE color = 'BLK'
# AND state = 'MI';
towed[(towed.color == 'BLK') & (towed.state == 'MI')]
# SELECT *
# FROM towed
# WHERE make = 'PORS'
# OR state = 'NY'
# ORDER BY make, color DESC;
towed[(towed.make == 'PORS') | (towed.state == 'NY')].sort_values(['make', 'color'], ascending=[True, False]) # note the red Porsche
m = (towed.style == '4D') & (towed.state == 'IL') # create a filter mask
m.value_counts() # it's just a boolean series
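Because True and False behave like 1 and 0, you get counts (and the complement) for free.
m.sum()       # number of rows passing the filter
(~m).sum()    # ~ negates the mask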
# SELECT make, model, color
# FROM towed
# WHERE style = '4D' AND state = 'IL'
# ORDER BY date
# LIMIT 5;
towed[m][['make', 'model', 'color']].sort_index().head() # date is the index, so this matches ORDER BY date
towed[towed.duplicated()] # this would suck in SQL ... but that's also what primary keys are for
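Dropping them is one call - keep='first' (the default) keeps the first occurrence of each duplicated row.
towed.drop_duplicates().info()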
# thousands-separator commas in the tripduration field ... that'd be really annoying to load into a DB
!egrep "[0-9],[0-9]+" data/divvy/Divvy_Trips_2013.csv | head -n 5
! head -n 5 data/divvy/Divvy_Stations_2013.csv
! head -n 5 data/weather.csv
from datetime import datetime
to_datetime = lambda d: datetime.strptime(d, '%m/%d/%Y %H:%M')
# if you know the datetime format, it's much, much faster to explicitly parse it
# see here: https://gist.github.com/gjreda/7433f5f70299610d9b6b
trips = pd.read_csv('data/divvy/Divvy_Trips_2013.csv',
converters={'starttime': to_datetime, 'stoptime': to_datetime},
thousands=',')
stations = pd.read_csv('data/divvy/Divvy_Stations_2013.csv')
trips.info()
stations.info()
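A rough illustration of why the explicit format matters - generate some throwaway timestamp strings in the same format and time both approaches (exact numbers will vary by machine and pandas version).
rng = pd.date_range('2013-01-01', periods=10000, freq='T')
sample = pd.Series(rng.strftime('%m/%d/%Y %H:%M'))
%timeit pd.to_datetime(sample)                            # pandas has to infer the format
%timeit pd.to_datetime(sample, format='%m/%d/%Y %H:%M')   # explicit format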
# CREATE TABLE divvy AS (
# SELECT *
# FROM trips
# LEFT JOIN stations
# ON trips.from_station_name = stations.name
# );
divvy = pd.merge(trips, stations, how='left', left_on='from_station_name', right_on='name')
divvy.info()
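A LEFT JOIN leaves NaNs where a trip's station didn't match, so it's worth a quick sanity check before moving on.
divvy[divvy['name'].isnull()].shape[0]  # unmatched trips - ideally 0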
# SELECT trip_id, ... divvy.name AS name_origin ... stations.name AS name_dest
# FROM divvy
# INNER JOIN stations
# ON divvy.to_station_name = stations.name
divvy = pd.merge(divvy, stations, how='inner', left_on='to_station_name', right_on='name', suffixes=['_origin', '_dest'])
divvy.info()
df1 = pd.DataFrame({'a': np.random.randn(3), 'b': np.square(range(100,103))})
df2 = pd.DataFrame({'a': [np.nan, 11, 99], 'b': np.random.randn(3)})
# SELECT a, b
# FROM df1
# UNION
# SELECT a, b
# FROM df2;
# takes a list, so you can pass 2+ dataframes
# note that the indexes are the same
pd.concat([df1, df2])
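One caveat: pd.concat is really UNION ALL - SQL's UNION de-duplicates. Chain .drop_duplicates() to mimic that, and pass ignore_index=True if you don't want the repeated 0-1-2 index.
pd.concat([df1, df2], ignore_index=True).drop_duplicates()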
# the data is precise to the minute - we want counts by the nearest hour
divvy.starttime.value_counts().sort_values()[:3]
# I don't want to just extract the hour from the datetime ... NEAREST hour
from datetime import timedelta
def round_time(dt, mins=60):
"""
Rounds a datetime object to its nearest hour (by default).
(e.g. 2014-02-01 10:45:13 becomes 2014-02-01 11:00:00)
"""
dt += timedelta(minutes=mins/2.)
dt -= timedelta(minutes=dt.minute % mins, seconds=dt.second)
return dt
divvy['starthour'] = divvy.starttime.apply(round_time)
divvy['stophour'] = divvy.stoptime.apply(round_time)
divvy[['starttime', 'starthour']].head(3)
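Newer pandas (0.18+) also has a vectorized Series.dt.round, which should agree with round_time up to tie-breaking at exactly half past the hour - a sketch (pd.to_datetime first, since our converter left plain Python datetimes behind).
rounded = pd.to_datetime(divvy.starttime).dt.round('H')
rounded.head(3)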
pandas' groupby draws largely from Hadley Wickham's "The Split-Apply-Combine Strategy for Data Analysis" (it's a good paper; you should read it).
Count vs Size
.count() tallies non-null values per column, while .size() counts rows, nulls included - a quick demo follows below.
# SELECT starthour, count(1)
# FROM divvy
# GROUP BY starthour
divvy.groupby('starthour').size()
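The demo - .size() and .count() diverge wherever a column has missing data (gender is blank for many rows in this dataset).
grouped = divvy.groupby('usertype')
pd.DataFrame({'rows': grouped.size(), 'non_null_gender': grouped['gender'].count()})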
# SELECT starthour, avg(tripduration)
# FROM divvy
# GROUP BY starthour
# ORDER BY avg(tripduration) DESC
# LIMIT 5
divvy.groupby('starthour')['tripduration'].mean().sort_values(ascending=False)[:5]
divvy.groupby('usertype').agg({'birthyear': pd.Series.nunique, 'tripduration': [np.mean, np.median]})
# number of trips started by starthour
divvy.groupby('starthour').size().plot(figsize=(16,8))
# distribution of tripduration
divvy.tripduration.hist(figsize=(16,8), bins=1000)
plt.xlim(0, 10000);
Sounds like we need to plot the cumulative distribution of trip duration.
(If someone knows of a better way of doing this, I'd love to hear it).
duration_counts = divvy.tripduration.value_counts()
duration_counts.index.name = 'seconds'
duration_counts.name = 'trips'
duration_counts.head()
df = duration_counts.reset_index()
df['minutes'] = df.seconds/60.
df.set_index('minutes', inplace=True)
df.sort_index(inplace=True)
(df.trips.cumsum() / df.trips.sum()).plot(figsize=(16,8))
plt.xlim(0, 60)
plt.yticks(np.arange(0, 1.1, 0.1));
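One alternative sketch: sort the durations and plot rank / n directly, which skips the value_counts dance entirely.
d = divvy.tripduration.sort_values() / 60.  # minutes
plt.figure(figsize=(16, 8))
plt.plot(d.values, np.arange(1, len(d) + 1) / float(len(d)))
plt.xlim(0, 60)
plt.yticks(np.arange(0, 1.1, 0.1));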
plt.figure(figsize=(9, 18))
divvy.groupby('birthyear').size().sort_values().plot(kind='barh')
divvy.groupby(divvy['starttime'].apply(lambda d: d.dayofweek))['trip_id'].count().plot(kind='bar')
plt.title('Divvy trips by weekday') # 0 = Monday ...
plt.xlabel('Weekday')
plt.ylabel('# of trips started');
Plotting multiple lines and subplotting.
divvy['startdate'] = divvy.starthour.apply(lambda d: d.date())
by_gender = divvy.groupby(['startdate', 'gender']).size()
by_gender.head()
by_gender.unstack(1).head()
pandas wins here.
# SELECT startdate
# , COUNT(IF(gender = 'Female', 1, NULL))
# , COUNT(IF(gender = 'Male', 1, NULL))
# FROM divvy
# GROUP BY startdate
# LIMIT 5;
divvy.groupby(['startdate', 'gender']).size().unstack(1).head()
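pivot_table gets you the same table in one call - a sketch, assuming trip_id is never null (so count matches size).
pd.pivot_table(divvy, index='startdate', columns='gender',
               values='trip_id', aggfunc='count').head()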
by_gender.unstack(1).plot(figsize=(16,8))
divvy.groupby(['startdate', 'usertype']).size().unstack(1).plot(figsize=(16,8), subplots=True);
weekdays = divvy['starttime'].apply(lambda d: d.dayofweek)
hours = divvy['starttime'].apply(lambda d: d.hour)
by_weekday_hour = divvy.groupby([weekdays, hours])['trip_id'].count()
by_weekday_hour.index.names = ['weekday', 'hour'] # rename MultiIndex
by_weekday_hour.unstack(0).plot(figsize=(16,8))
plt.title('Trips by weekday hour')
plt.ylabel('# of trips started')
plt.xlabel('Hour of day')
plt.xticks(range(24))
plt.xlim(0, 23);