Trevor Muñoz
10 January 2014
For more background, please see the accompanying blog post.
# By convention
import pandas as pd
For this I want to set up two DataFrames: one with the contents of Dish.csv (~25 MB in the latest download) and one with the contents of MenuItem.csv (~114 MB).
As mentioned, Open Refine hangs before it can load MenuItem.csv unless the Java heap size is increased to provide more memory. So, though the timings will vary across systems, I've included time profiling information to show how speedily Pandas handles this job.
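(If even Pandas ran short of memory, `read_csv` can also stream a file in chunks via its `chunksize` parameter. A minimal sketch with inline, made-up data rather than the NYPL files:)

```python
import io
import pandas as pd

# A tiny inline stand-in for a large CSV file (made-up rows)
csv_text = "id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n"

# chunksize makes read_csv return an iterator of smaller DataFrames,
# so the whole file never has to fit in memory at once
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)

print(total_rows)  # 5
```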
%%time
dish_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/Dish.csv'
dish_df = pd.read_csv(dish_data, index_col='id')
CPU times: user 701 ms, sys: 111 ms, total: 812 ms
Wall time: 888 ms
%%time
menu_item_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/MenuItem.csv'
item_df = pd.read_csv(menu_item_data, index_col='dish_id')
CPU times: user 3.43 s, sys: 519 ms, total: 3.95 s
Wall time: 4.42 s
We can make it easier to join up the data by selecting the dish id (the "primary key" of the first DataFrame) as the column to index by when loading the second DataFrame.
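As a toy illustration of why this indexing choice helps (made-up miniature tables, not the NYPL files):

```python
import pandas as pd

# Dish table indexed by its primary key
dish = pd.DataFrame({'name': ['Chicken gumbo', 'Onion au gratin']},
                    index=pd.Index([2, 4], name='id'))

# Item table indexed by the foreign key dish_id
items = pd.DataFrame({'price': [0.10, 0.35, 0.50]},
                     index=pd.Index([2, 4, 4], name='dish_id'))

# Because the indexes line up, join() matches rows automatically
joined = items.join(dish)
print(joined['name'].tolist())  # ['Chicken gumbo', 'Onion au gratin', 'Onion au gratin']
```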
With both DataFrames set up, we can print the first few rows to check that everything looks right.
dish_df[:5]
name | description | menus_appeared | times_appeared | first_appeared | last_appeared | lowest_price | highest_price | |
---|---|---|---|---|---|---|---|---|
id | ||||||||
1 | Consomme printaniere royal | NaN | 10 | 10 | 1897 | 1927 | 0.20 | 0.4 |
2 | Chicken gumbo | NaN | 110 | 116 | 1895 | 1960 | 0.10 | 0.8 |
3 | Tomato aux croutons | NaN | 15 | 15 | 1893 | 1917 | 0.25 | 0.4 |
4 | Onion au gratin | NaN | 41 | 41 | 1900 | 1971 | 0.35 | 1.0 |
5 | St. Emilion | NaN | 62 | 64 | 1881 | 1981 | 0.00 | 18.0 |
item_df[:5]
id | menu_page_id | price | high_price | created_at | updated_at | xpos | ypos | |
---|---|---|---|---|---|---|---|---|
dish_id | ||||||||
1 | 1 | 1389 | 0.4 | NaN | 2011-03-28 15:00:44 UTC | 2011-04-19 04:33:15 UTC | 0.111429 | 0.254735 |
2 | 2 | 1389 | 0.6 | NaN | 2011-03-28 15:01:13 UTC | 2011-04-19 15:00:54 UTC | 0.438571 | 0.254735 |
3 | 3 | 1389 | 0.4 | NaN | 2011-03-28 15:01:40 UTC | 2011-04-19 19:10:05 UTC | 0.140000 | 0.261922 |
4 | 4 | 1389 | 0.5 | NaN | 2011-03-28 15:01:51 UTC | 2011-04-19 19:07:01 UTC | 0.377143 | 0.262720 |
5 | 5 | 3079 | 0.5 | 1 | 2011-03-28 15:21:26 UTC | 2011-04-13 15:25:27 UTC | 0.105714 | 0.313178 |
Pandas integrates with the Notebook to display DataFrames as nice HTML tables. It's also possible to print a summary description of a DataFrame.
dish_df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 399455 entries, 1 to 477346
Data columns (total 8 columns):
name              399455 non-null values
description       0 non-null values
menus_appeared    399455 non-null values
times_appeared    399455 non-null values
first_appeared    399455 non-null values
last_appeared     399455 non-null values
lowest_price      377962 non-null values
highest_price     377962 non-null values
dtypes: float64(3), int64(4), object(1)
For the moment, I'll focus on the Dish data.
With a single function call, we can see how many different values we have in the 'name' column before we start de-duping:
unique_values_t0 = len(dish_df.name.unique())
print unique_values_t0
399421
To start, almost all of the values of dish name are supposedly unique.
# Drop the 'description' column because it's empty and taking up screen real estate
del dish_df['description']
In order to find very similar values that we can identify as duplicates and normalize, I'll pursue the same series of steps I would take using Open Refine.
def strip_and_lower(x):
    """
    Basic string normalization:
    1) strip leading and trailing whitespace
    2) convert to lowercase
    3) normalize internal whitespace (remove extra spaces between tokens)
    """
    tokens = x.strip().lower().split()
    result = ' '.join(filter(None, tokens))
    return result
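A quick check of what the normalization does (the function is repeated here so the snippet stands alone; note that split() with no arguments already drops empty strings):

```python
def strip_and_lower(x):
    # strip ends, lowercase, and collapse internal runs of whitespace
    tokens = x.strip().lower().split()
    return ' '.join(tokens)

print(strip_and_lower('  Chicken   GUMBO '))  # 'chicken gumbo'
print(strip_and_lower('St. Emilion'))         # 'st. emilion'
```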
# Apply the function to every value of name using map() — rather than, for instance, using a loop
dish_df['name_modified'] = dish_df['name'].map(strip_and_lower)
A new column has been created:
dish_df[:5]
name | menus_appeared | times_appeared | first_appeared | last_appeared | lowest_price | highest_price | name_modified | |
---|---|---|---|---|---|---|---|---|
id | ||||||||
1 | Consomme printaniere royal | 10 | 10 | 1897 | 1927 | 0.20 | 0.4 | consomme printaniere royal |
2 | Chicken gumbo | 110 | 116 | 1895 | 1960 | 0.10 | 0.8 | chicken gumbo |
3 | Tomato aux croutons | 15 | 15 | 1893 | 1917 | 0.25 | 0.4 | tomato aux croutons |
4 | Onion au gratin | 41 | 41 | 1900 | 1971 | 0.35 | 1.0 | onion au gratin |
5 | St. Emilion | 62 | 64 | 1881 | 1981 | 0.00 | 18.0 | st. emilion |
This already yields some duplicates which had been showing up as distinct values because of small (inconsequential) differences in whitespace or capitalization.
len(dish_df.name_modified.unique())
373336
Now, if we implement Open Refine-style clustering, we should find more candidates for de-duping.
import re
def fingerprint(x):
    """
    A modified version of the fingerprint clustering algorithm implemented by Open Refine.
    See https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
    This does not normalize to ASCII characters since diacritics may be significant in this dataset.
    """
    alphanumeric_tokens = filter(None, re.split(r'\W', x))
    seen = set()
    seen_add = seen.add
    deduped = sorted([i for i in alphanumeric_tokens if i not in seen and not seen_add(i)])
    fingerprint = ''.join(deduped)
    return fingerprint
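Before applying it to the whole column, a quick check on a single value. (This snippet repeats the logic in an equivalent, slightly more compact form: since the tokens get sorted anyway, a set performs the same deduplication as the seen/seen_add idiom.)

```python
import re

def fingerprint(x):
    # split on non-word characters, drop empties, dedupe, sort, concatenate
    tokens = [t for t in re.split(r'\W+', x) if t]
    return ''.join(sorted(set(tokens)))

print(fingerprint('potatoes, au gratin'))  # 'augratinpotatoes'
print(fingerprint('tomato aux croutons'))  # 'auxcroutonstomato'
```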
dish_df['fingerprint'] = dish_df['name_modified'].map(fingerprint)
We can see the results as another new column
dish_df[:5]
name | menus_appeared | times_appeared | first_appeared | last_appeared | lowest_price | highest_price | name_modified | fingerprint | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
1 | Consomme printaniere royal | 10 | 10 | 1897 | 1927 | 0.20 | 0.4 | consomme printaniere royal | consommeprintaniereroyal |
2 | Chicken gumbo | 110 | 116 | 1895 | 1960 | 0.10 | 0.8 | chicken gumbo | chickengumbo |
3 | Tomato aux croutons | 15 | 15 | 1893 | 1917 | 0.25 | 0.4 | tomato aux croutons | auxcroutonstomato |
4 | Onion au gratin | 41 | 41 | 1900 | 1971 | 0.35 | 1.0 | onion au gratin | augratinonion |
5 | St. Emilion | 62 | 64 | 1881 | 1981 | 0.00 | 18.0 | st. emilion | emilionst |
To see the "clusters" of potential duplicate name values that the fingerprinting method identifies, we can group the data by the contents of the fingerprint column:
clusters = dish_df.groupby('fingerprint')
# Order the clusters by size
s = clusters.size()
s.order()
fingerprint
000000dedouzainehuitreslazelande 1
0001001938duringhavemenmoreorplacedwho 1
000100populationportslopesunnyto 1
000112bottledrinkginhollandperrum 1
000112bottledrinkginjamaicaperrum 1
000112bottledrinkginoldperrumtom 1
000112bottledrinkginperplymouthrum 1
000113030aacresafterallandareasavebeenbeingbodegasbuttsbybyasscocompanyconsidereddelicatedondryextremelyfamilyfavoritefinestfirmforfoundedgenerationsgonzalesgonzalezhavinginisitjerezjoejudgedkeptknownlargestliterallymanuelmariamaturingmembersminimumnamedofoldolderoneoroverownerspepeproducingsschieffelinselectedsherriessherrysolerasspainstandardthethroughouttiotodaytranslatedultimateunclevineyardvineyardswarehouseswaswhichwhosewithworldyears 1
00012belosdzlalesno 1
000150additioncupsinofpintpricequarttowine 1
000150boxcigarettesfranclargepersizesmall 1
000150boxcigarettesfrancslargepersizesmall 1
0001550andchampagnejouetmilsperrier 1
000160cgermainegorgelmacons 1
000165amontilladoptsqtssherry 1
...
broiledchickenhalf 36
brownedhashedpotatoes 36
codryextraghmumm 36
eggsfriedtwo 36
alecgingerimported 38
mashedpotatoes 38
dryextraghmumms 39
creamhashedinpotatoes 40
creamicevanilla 40
augratinpotatoes 41
2eggsonpoachedtoast 44
friedpotatoessweet 45
boiledpotatoes 46
frenchfriedpotatoes 47
                                 81
Length: 330038, dtype: int64
We might want to know how much variation there is in the sizes of the clusters. This can be accomplished in one line:
s.value_counts()
1     294118
2      23204
3       6230
4       2564
5       1370
6        788
7        491
8        323
9        249
10       178
11       110
12       102
15        50
14        46
13        39
18        24
20        20
16        20
17        20
19        14
22        12
21        10
25         7
36         6
31         5
26         5
30         5
24         4
23         3
33         3
38         2
29         2
40         2
27         2
45         1
44         1
46         1
41         1
32         1
39         1
35         1
47         1
28         1
81         1
dtype: int64
So, by this measure, there are still 294,118 singleton values (down from 399,421), but there are 23,204 two-item clusters that could be normalized, 110 11-item clusters, just 1 47-item cluster, etc.
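The two-step pattern (group by fingerprint, then count the cluster sizes) on a made-up miniature frame:

```python
import pandas as pd

toy = pd.DataFrame({'fingerprint': ['augratinonion', 'chickengumbo',
                                    'chickengumbo', 'chickengumbo',
                                    'boiledpotatoes']})

# Size of each cluster of identical fingerprints
sizes = toy.groupby('fingerprint').size()
print(sizes['chickengumbo'])    # 3

# How many clusters there are of each size
print(sizes.value_counts()[1])  # 2 singleton clusters
```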
The print display of the various clusters above shows only the fingerprint value and the count of rows that share that common value, but this is not the most immediately apprehensible way of seeing what's going on. For clarity, let's print the 5 largest clusters:
# Exclude the last item of the series since its fingerprint value is blank
for f in s.order()[-6:-1].index:
    print dish_df[dish_df.fingerprint == f].name
id 2759 Potatoes, au gratin 7176 Potatoes au Gratin 8373 Potatoes--Au gratin 35728 Potatoes: au gratin 44271 Au Gratin Potatoes 84510 Au Gratin (Potatoes) 94968 Potatoes, au Gratin, 97166 POTATOES:- Au gratin 185040 Au Gratin [potatoes] 313168 Au Gratin Potatoes 315697 (Potatoes) Au Gratin 325940 Au Gratin Potatoes 330420 au-Gratin Potatoes 353435 Potatoes: Au gratin 373639 Potatoes Au Gratin 376396 Potatoes - au gratin 376414 Potatoes, Au Gratin 378597 Au gratin potatoes 378732 Potatoes au gratin 379157 Potatoes, au Gratin 379534 Potatoes AuGratin 379829 [Potatoes] au gratin 380247 [Potatoes] Au Gratin 380907 Potatoes au Gratin 383920 Au gratin Potatoes 385327 Potatoes Au Gratin 385777 Au Gratin potatoes 387869 Potatoes Au gratin 388838 potatoes au gratin 389716 Potatoes - Au gratin 391056 POTATOES: Au Gratin 398307 Potatoes: Au gratin 398348 Potatoes-Au gratin 414322 au gratin potatoes 414927 Potatoes -Au Gratin 420093 POTATOES Au Gratin 435770 Potatoes, Au gratin 441471 POTATOES: AU GRATIN 448532 POTATOES - Au gratin 450243 POTATOES AU GRATIN 453840 potatoes, au gratin Name: name, dtype: object id 22361 Poached Eggs (2) on toast 22562 Eggs, 2 Poached on Toast 22763 2 Poached Eggs on Toast 31324 Eggs, Poached on toast (2) 48043 Eggs poached on toast (2) 51940 Poached eggs on toast (2) 79275 2 Poached (Eggs) on Toast 81059 Eggs Poached (2), on Toast 81421 2 Eggs, poached on toast 90960 2 Poached (eggs) on Toast 91432 2 (Eggs) Poached on Toast 108224 Poached Eggs, 2 on toast 111031 Poached Eggs on Toast, [2] 127128 Eggs, Poached (2) on Toast 128696 Eggs (2) poached on toast 183077 EGGS Poached (2) on Toast 206923 2 Poached on Toast, Eggs 224031 Eggs: 2 Poached on Toast 296286 Eggs - Poached on Toast (2) 334815 (2) Poached Eggs on Toast 369966 Eggs (2), Poached on Toast 375226 Eggs, poached on toast(2) 376027 Eggs, poached on toast (2) 376449 EGGS, poached on toast (2) 376635 [EGGS] poached on toast (2) 376746 EGGS, poached on toast (2) 377100 Poached eggs (2), on 
toast 380354 EGGS poached on toast (2) 381711 2 Poached Eggs on toast 382721 EGGS, poached on toast [2] 383950 Poached Eggs on Toast (2) 389804 Poached eggs on Toast (2) 399299 2 Eggs, Poached on Toast 408143 Poached eggs (2) on toast 417205 Eggs, poached on toast [2] 433441 2 eggs poached on toast 440712 Eggs, Poached on Toast (2) 442689 Eggs poached on toast [2] 445035 2 eggs, poached on toast 452711 POACHED EGGS ON TOAST (2) 458993 Eggs, Poached on Toast [2] 465510 EGGS poached on toast (2) 466056 EGGS poached on toast [2] 476441 Eggs poached, on toast (2) Name: name, dtype: object id 2089 Sweet Potatoes Fried 3780 Sweet Potatoes, Fried 4075 fried Sweet Potatoes 9476 Potatoes, sweet fried 17042 Fried sweet [potatoes] 17116 Potatoes, Sweet, Fried 38431 Sweet Potatoes - Fried 42055 Potatoes sweet, fried 56061 Potatoes, fried sweet 76902 Potatoes Sweet Fried 78525 Sweet (potatoes), fried 86202 POTATOES: Fried sweet 92455 Sweet Potatoes: fried 101532 Potatoes Fried Sweet 136255 Sweet fried potatoes 198754 Sweet Potatoes, fried 349957 Sweet Potatoes--Fried 354008 Potatoes SWEET Fried 356717 Sweet Potatoes---Fried 373866 Fried sweet potatoes 373917 FRIED SWEET POTATOES 373973 Sweet potatoes - Fried 374052 Fried Sweet Potatoes 376416 Potatoes, Fried Sweet 376488 Sweet Potatoes: Fried 379831 Fried [Sweet Potatoes] 380251 Fried [sweet potatoes] 387427 Sweet potatoes - fried 390331 Sweet potatoes, fried 392859 Fried Sweet potatoes 396587 SWEET POTATOES---Fried 397513 Sweet potatoes -- Fried 397930 SWEET POTATOES - Fried 401304 Sweet potatoes-fried 402285 Sweet potatoes-Fried 406090 fried sweet potatoes 409582 Sweet potatoes fried 411471 Sweet potatoes--Fried 431756 Sweet Potatoes - fried 452385 Sweet Potatoes, fried 454978 fried sweet potatoes 456340 Sweet Fried Potatoes 461269 SWEET POTATOES - fried 465671 Sweet Potatoes, Fried 471570 Fried Sweet Potatoes. 
Name: name, dtype: object id 219 Boiled potatoes 2752 Potatoes, boiled 4290 Potatoes--Boiled 17902 Boiled Potatoes, 20308 Potatoes Boiled 31198 Potatoes: boiled 34056 Boiled Potatoes. 37644 Potatoes - Boiled 40611 Boiled Potatoes 44255 Boiled (potatoes) 59475 Boiled [Potatoes] 63979 POTATOES-boiled 74342 Potatoes Boiled 76369 POTATOES Boiled 85397 Potatoes , boiled 94958 Potatoes, Boiled, 97154 Potatoes:- Boiled 125408 POTATOES: *Boiled 194143 Boiled Potatoes 199538 Boiled-Potatoes 225209 Potatoes:--boiled 362463 Potatoes---Boiled 374956 Boiled Potatoes 376397 Potatoes, Boiled 379856 Potatoes -- Boiled 380179 Potatoes -- Boiled 380244 Boiled [potatoes] 385714 potatoes, boiled 389060 *Boiled Potatoes 394137 boiled potatoes 394579 Boiled POTATOES 395878 BOILED POTATOES 398306 Potatoes: Boiled 398346 Potatoes-Boiled 401302 Potatoes boiled 407244 Potatoes - boiled 412613 Potatoes, boiled 426362 Potatoes boiled 427434 BoiledPotatoes 427492 Potatoes Boiled 432003 Boiled Potatoes 436408 Potatoes, Boiled 451614 POTATOES--Boiled 452667 POTATOES - Boiled 465416 POTATOES: Boiled 471029 Boiled potatoes. Name: name, dtype: object id 1259 French fried potatoes 2761 Potatoes, French fried 11893 Potatoes---French Fried 14987 Potatoes: French fried 17047 French fried [potatoes] 17949 Potatoes - French Fried 18441 Potatoes French fried 21120 Potatoes-French Fried 21264 Potatoes--French Fried 29043 POTATOES: French Fried. 32765 Potatoes French, fried 35844 French Fried Potatoes. 
37852 French Fried (POTATOES) 39455 Potatoes, French, fried 57562 Potatoes- French Fried 74346 Potatoes French Fried 76376 POTATOES French Fried 97164 POTATOES:- French fried 100589 Potatoes - French Fried, 112767 Potatoes, Fried, French 147020 Potatoes: French Fried 181396 Potatoes -- French Fried 248905 French Fried Potatoes 328845 French-Fried Potatoes 373638 Potatoes French Fried 375011 French Fried Potatoes 376400 Potatoes, French Fried 376487 Potatoes: French Fried 376508 Potatoes - French fried 379156 French Fried potatoes 379828 French Fried [Potatoes] 386120 Potatoes, french fried 387212 Potatoes--French fried 387671 French Fried (potatoes) 393034 French fried Potatoes 395877 FRENCH FRIED POTATOES 398352 Potatoes-French fried 401306 Potatoes french fried 401838 French fried potatoes 404612 Potatoes: French fried 407806 French fried potatoes 410281 french fried potatoes 410591 Potatoes: French fried 436571 POTATOES - French Fried 441467 POTATOES: FRENCH FRIED 455232 potatoes, french fried 473731 POTATOES--FRENCH FRIED Name: name, dtype: object
At a glance these do indeed look like values that should be normalized to a single "name". Since the default way that IPython prints a Pandas Series object includes the 'id', it is easy to construct URLs to allow for some sanity checking.
(I'm not automating the sanity checking here because I don't want to get in bad odor with NYPL's anti-spambot mechanisms.)
testing_cluster = dish_df[dish_df.fingerprint == 'augratinpotatoes']
for item in testing_cluster.index.tolist():
    url = "http://menus.nypl.org/dishes/{0}".format(item)
    print url
http://menus.nypl.org/dishes/2759
http://menus.nypl.org/dishes/7176
http://menus.nypl.org/dishes/8373
http://menus.nypl.org/dishes/35728
http://menus.nypl.org/dishes/44271
http://menus.nypl.org/dishes/84510
http://menus.nypl.org/dishes/94968
http://menus.nypl.org/dishes/97166
http://menus.nypl.org/dishes/185040
http://menus.nypl.org/dishes/313168
http://menus.nypl.org/dishes/315697
http://menus.nypl.org/dishes/325940
http://menus.nypl.org/dishes/330420
http://menus.nypl.org/dishes/353435
http://menus.nypl.org/dishes/373639
http://menus.nypl.org/dishes/376396
http://menus.nypl.org/dishes/376414
http://menus.nypl.org/dishes/378597
http://menus.nypl.org/dishes/378732
http://menus.nypl.org/dishes/379157
http://menus.nypl.org/dishes/379534
http://menus.nypl.org/dishes/379829
http://menus.nypl.org/dishes/380247
http://menus.nypl.org/dishes/380907
http://menus.nypl.org/dishes/383920
http://menus.nypl.org/dishes/385327
http://menus.nypl.org/dishes/385777
http://menus.nypl.org/dishes/387869
http://menus.nypl.org/dishes/388838
http://menus.nypl.org/dishes/389716
http://menus.nypl.org/dishes/391056
http://menus.nypl.org/dishes/398307
http://menus.nypl.org/dishes/398348
http://menus.nypl.org/dishes/414322
http://menus.nypl.org/dishes/414927
http://menus.nypl.org/dishes/420093
http://menus.nypl.org/dishes/435770
http://menus.nypl.org/dishes/441471
http://menus.nypl.org/dishes/448532
http://menus.nypl.org/dishes/450243
http://menus.nypl.org/dishes/453840
Identifying the clusters is great and we can use this information not only to improve the quality of the curated data set but also to clarify and improve analysis of the data about dish frequencies and "timespans." For the purposes of example, I'll use the "french fried potatoes" cluster.
With 47 values to choose from, what should we select as the normalized value? How about the most common string already represented in the cluster?
# Get a DataFrame consisting of all the rows that share a common fingerprint
result_df = dish_df[dish_df.fingerprint == 'frenchfriedpotatoes']
# Get the number of times appeared for each result by selecting that column (Series) and sort in descending order
by_appearance = result_df.times_appeared.order(ascending=False)
#Use the index of the max value to look up the most common value for the name
print "Most common value: '{0}'".format(dish_df.name[by_appearance.idxmax()])
Most common value: 'French fried potatoes'
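A note for readers with a newer Pandas: `Series.order` has since been removed; the same descending sort is spelled `sort_values`. A sketch with made-up counts:

```python
import pandas as pd

# times_appeared for three hypothetical dishes, keyed by dish id
times_appeared = pd.Series({1259: 1392, 2761: 243, 11893: 1})

# sort_values(ascending=False) replaces the older Series.order
by_appearance = times_appeared.sort_values(ascending=False)

# idxmax returns the index label of the largest value
print(by_appearance.idxmax())  # 1259
```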
Does clustering the values change the number of times a dish appeared or the earliest and latest dates it appears?
result_df.times_appeared[:5]
id
1259     1392
2761      243
11893       1
14987      23
17047       3
Name: times_appeared, dtype: int64
According to this data (it varies slightly from what appears on the web site), the dish with id 1259 and name 'French fried potatoes' appears 1,392 times. Dish 2761 (name "Potatoes, French Fried") appears 243 times, etc.
If we assert that all of these instances are in fact the same type of dish, we can quickly calculate a new total for how often this dish appears on the menus in the data set.
total_appearances = result_df.times_appeared.sum()
print "French fried potatoes appear {0} times in the menus digitized by NYPL.".format(total_appearances)
French fried potatoes appear 2140 times in the menus digitized by NYPL.
We can do something similar for the date ranges. (It isn't even necessary to do anything with the data type of the values for first_appeared and last_appeared.)
# It turns out we can't use min() as straightforwardly as max(),
# because some dishes apparently have '0' for first_appeared.
aggregate_first_appeared = result_df.first_appeared.values[0]
aggregate_last_appeared = result_df.last_appeared.max()
print "French fried potatoes first appeared on a menu in {0} and last appeared on a menu in {1}"\
.format(aggregate_first_appeared, aggregate_last_appeared)
French fried potatoes first appeared on a menu in 1884 and last appeared on a menu in 1989
For an apparently ever-popular dish like 'French fried potatoes' this 100-or-so year range is not that different from what appears on the page for the most-popular variant (Dish 1259). The aggregate values might shift more for other dishes.
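Instead of grabbing the first row's value, a more defensive way to get the earliest year would be to mask out the zero placeholders before taking the minimum. A sketch with made-up years (not the actual cluster):

```python
import pandas as pd

# first_appeared values where 0 means "unknown" (made-up data)
first_appeared = pd.Series([0, 1884, 1900, 0, 1895])

# Keep only the real years, then take the earliest
earliest = first_appeared[first_appeared > 0].min()
print(earliest)  # 1884
```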
Of course, there are also still 'pommes frites' to account for …
Reproducing NYPL's lovely appearance-by-year frequency plots takes a little more doing, but it provides a good opportunity to demonstrate the facilities that Pandas provides for pulling together different sets of data using SQL-like mechanisms.
# Load data from the other CSV files included in the NYPL's regular data dumps because we'll need it all
page_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/MenuPage.csv'
page_df = pd.read_csv(page_data, index_col='id')
menu_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/Menu.csv'
menu_df = pd.read_csv(menu_data, index_col='id', parse_dates=True)
(For this final bit, I'm not entirely confident that I'm doing this in the best way. It seems to "work" but it feels a little verbose. Caveat lector.)
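The index-based lookups below can also be expressed with `pd.merge`, which follows the foreign keys explicitly. A sketch on made-up miniature stand-ins for the three tables:

```python
import pandas as pd

# Miniature stand-ins for MenuItem, MenuPage, and Menu (invented values)
items = pd.DataFrame({'dish_id': [1259, 1259, 2761],
                      'menu_page_id': [10, 11, 12]})
pages = pd.DataFrame({'id': [10, 11, 12],
                      'menu_id': [100, 101, 102]})
menus = pd.DataFrame({'id': [100, 101, 102],
                      'date': ['1900-04-15', '1900-04-16', '1900-02-18']})

# items -> pages -> menus, one merge per foreign-key hop
merged = (items
          .merge(pages, left_on='menu_page_id', right_on='id')
          .merge(menus, left_on='menu_id', right_on='id',
                 suffixes=('_page', '_menu')))
print(merged[['dish_id', 'date']])
```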
# This should create a DataFrame from the Menu Items data set that contains
# only "rows" related to our example cluster of 'French fried potato' dishes
lookup_items_df = item_df[item_df.index.isin(result_df.index)]
# The length of this DataFrame should be same as the sum of occurrences we calculated above
try:
    assert len(lookup_items_df) == result_df.times_appeared.sum()
    print "Length of the DataFrame is consistent"
except AssertionError:
    print "Something's wrong"
Length of the DataFrame is consistent
lookup_items_df.head(10)
id | menu_page_id | price | high_price | created_at | updated_at | xpos | ypos | |
---|---|---|---|---|---|---|---|---|
dish_id | ||||||||
1259 | 454 | 5217 | NaN | NaN | 2011-04-19 03:29:41 UTC | 2011-10-05 21:07:14 UTC | 0.342857 | 0.761442 |
1259 | 530 | 130 | NaN | NaN | 2011-04-19 13:17:01 UTC | 2011-05-13 05:14:16 UTC | 0.667143 | 0.735197 |
1259 | 1194 | 6471 | NaN | NaN | 2011-04-19 19:22:19 UTC | 2011-05-06 18:52:50 UTC | 0.315714 | 0.747705 |
1259 | 1792 | 134 | NaN | NaN | 2011-04-19 20:42:20 UTC | 2011-04-19 20:42:20 UTC | 0.550000 | 0.685084 |
1259 | 2093 | 146 | NaN | NaN | 2011-04-19 22:22:22 UTC | 2011-04-19 22:22:22 UTC | 0.568571 | 0.683963 |
1259 | 2953 | 1627 | NaN | NaN | 2011-04-20 02:18:30 UTC | 2011-04-20 02:18:30 UTC | 0.508571 | 0.623720 |
1259 | 3715 | 4783 | 0.10 | NaN | 2011-04-20 05:06:28 UTC | 2011-10-02 16:31:41 UTC | 0.301429 | 0.338671 |
2761 | 3966 | 3531 | 0.10 | NaN | 2011-04-20 05:25:58 UTC | 2011-04-20 05:25:58 UTC | 0.342857 | 0.477359 |
1259 | 4565 | 1432 | 0.10 | NaN | 2011-04-20 16:04:24 UTC | 2011-05-11 20:53:58 UTC | 0.131429 | 0.824837 |
2761 | 5788 | 5006 | 0.15 | NaN | 2011-04-20 19:12:01 UTC | 2011-05-31 20:07:02 UTC | 0.535714 | 0.652158 |
From this intermediate DataFrame we want to grab the values for menu_page_id so that we can look up the relevant menus to which they belong:
target_pages = lookup_items_df.menu_page_id.values
lookup_pages_df = page_df[page_df.index.isin(target_pages)]
lookup_pages_df.head(10)
menu_id | page_number | image_id | full_height | full_width | uuid | |
---|---|---|---|---|---|---|
id | ||||||
130 | 12463 | 1 | 466928 | 3049 | 2004 | 510D47DB-491F-A3D9-E040-E00A18064A99 |
134 | 12465 | 1 | 466931 | 3411 | 2291 | 510d47db-4923-a3d9-e040-e00a18064a99 |
146 | 12470 | 1 | 466940 | 3427 | 2282 | 510d47db-4933-a3d9-e040-e00a18064a99 |
263 | 12505 | 1 | 475815 | 4949 | 3261 | 510d47db-4309-a3d9-e040-e00a18064a99 |
265 | 12506 | 3 | 4000008161 | 5559 | 3229 | 510d47db-430b-a3d9-e040-e00a18064a99 |
309 | 12518 | 3 | 4000013961 | 4659 | 2823 | 510d47db-662f-a3d9-e040-e00a18064a99 |
332 | 12523 | 5 | 4000013986 | 3703 | 2427 | 510d47db-6652-a3d9-e040-e00a18064a99 |
460 | 12562 | 1 | 466438 | 2803 | 1979 | 510d47db-44a4-a3d9-e040-e00a18064a99 |
482 | 12569 | 1 | 466597 | 2600 | 1811 | 510d47db-461a-a3d9-e040-e00a18064a99 |
501 | 12577 | 1 | 466783 | 2884 | 2083 | 510d47db-47c1-a3d9-e040-e00a18064a99 |
Now we can grab all the relevant values of 'menu_id' for menus and do the last lookup:
target_menus = lookup_pages_df.menu_id.values
lookup_menus_df = menu_df[menu_df.index.isin(target_menus)]
Now we should have a DataFrame containing only menus that have pages on which one of the dishes from our 'French fried potato' cluster appears:
lookup_menus_df.head(10)
name | sponsor | event | venue | place | physical_description | occasion | notes | call_number | keywords | language | date | location | location_type | currency | currency_symbol | status | page_count | dish_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||
12463 | NaN | HOTEL EASTMAN | BREAKFAST | COMMERCIAL | HOT SPRINGS, AR | CARD; 4.75X7.5; | EASTER; | NaN | 1900-2822 | NaN | NaN | 1900-04-15 | Hotel Eastman | NaN | NaN | NaN | under review | 2 | 67 |
12465 | NaN | NORDDEUTSCHER LLOYD BREMEN | FRUHSTUCK/BREAKFAST; | COMMERCIAL | DAMPFER KAISER WILHELM DER GROSSE; | CARD; ILLU; COL; 5.5X8.0; | NaN | MENU IN GERMAN AND ENGLISH; ILLUS, STEAMSHIP A... | 1900-2827 | NaN | NaN | 1900-04-16 | Norddeutscher Lloyd Bremen | NaN | NaN | NaN | under review | 2 | 84 |
12470 | NaN | NORDDEUTSCHER LLOYD BREMEN | FRUHSTUCK/BREAKFAST | COMMERCIAL | SCHNELLDAMPFER KAISER WILHELM DER GROSSE; | BROADSIDE; ILLUS; COL; 5.5X8.50; | NaN | MENU IN GERMAN AND ENGLISH; ILLUS, LIGHTHOUSE;... | 1900-2839 | NaN | NaN | 1900-04-17 | Norddeutscher Lloyd Bremen | NaN | NaN | NaN | under review | 2 | 80 |
12505 | NaN | BARTHOLDI HOTEL | BREAKFAST | COMMERCIAL | NY | BROADSIDE; COL; 7 X 11.75; | DAILY | VERY POOR CONDITION; | 1900-1137 | NaN | NaN | 1900-02-18 | Bartholdi Hotel | NaN | Dollars | $ | under review | 2 | 203 |
12506 | NaN | BARTHOLDI HOTEL | LUNCH & DINNER | COMMERCIAL | 23RD ST & BWAY NY | FOLDER; ILLUS;8.25 X 14; | DAILY | MANY ITEMS HANDWRITTEN; HOURS INCLUDED; | 1900-1138 | NaN | NaN | 1900-02-18 | Bartholdi Hotel | NaN | Dollars | $ | under review | 4 | 336 |
12518 | NaN | COLUMBIA RESTAURANT | DAILY MENU | COMMERCIAL | 48 EAST 14TH STREET,[NEW YORK,NY?] | FOLDER;ILLUS;7.25X11.5; | DAILY; | 2 COPIES;WINE LIST; | 1901-1527 | NaN | NaN | 1901-06-03 | Columbia Restaurant | NaN | Dollars | $ | under review | 4 | 475 |
12523 | NaN | CITIZENS'STEAMBOAT COMPANY | LUNCH | COMMERCIAL | STEAMER SARATOGA,TROY LINE | BOOKLET;ILLUS;COL;6.25X9; | DAILY; | AU SABLE CHASM ON FRONT COVER;BLUE MOUNTAIN LA... | 1901-1537 | NaN | NaN | 1901-01-01 | Citizens'steamboat Company | NaN | Dollars | $ | under review | 8 | 246 |
12562 | NaN | CUNARD LINE | BREAKFAST | COMMERCIAL | EN ROUTE ON BOARD R.M.S. CAMPANIA | CARD; 4.5 X 6.5; | NaN | PRICED WINE LIST ON BACK OF MENU; PRICES IN BR... | 1900-2170 | NaN | NaN | 1900-02-28 | Cunard Line | NaN | NaN | NaN | under review | 2 | 110 |
12569 | NaN | MAXWELL HOUSE | SUPPER | COMMERCIAL | NASHVILLE, [TN?] | CARD; ILLUS; 4.25X6.25 | NaN | ELABORATE PRINTING OF HOTEL NAME FORMS ILLUSTR... | 1900-2397 | NaN | NaN | 1900-03-12 | Maxwell House | NaN | NaN | NaN | under review | 2 | 38 |
12577 | NaN | HADDON HALL | SUPPER | COMMERCIAL | [PHILADELPHIA,PA.] | BROADSIDE; ILLUS; COL; 5 X 7; | DAILY | NaN | 1900-2625 | NaN | NaN | 1900-03-31 | Haddon Hall | NaN | NaN | NaN | under review | 2 | 60 |
This next bit is a workaround, since parsing the dates in the date column doesn't seem to be working for me as expected. I need to extract the year for each menu so that I can group the number of appearances of a dish by year.
import dateutil.parser

def extract_year(x):
    try:
        parsed = dateutil.parser.parse(x)
        return parsed.year
    except (TypeError, ValueError):
        return None
lookup_menus_df['date_year'] = lookup_menus_df['date'].map(extract_year)
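In more recent Pandas versions the same extraction can be done in one vectorized step: `pd.to_datetime` with `errors='coerce'` turns unparseable strings into `NaT` rather than raising. A sketch with made-up values:

```python
import pandas as pd

dates = pd.Series(['1900-04-15', '1901-06-03', 'not a date'])

# Bad values become NaT, and .dt.year then yields NaN for them
years = pd.to_datetime(dates, errors='coerce').dt.year

print(years.tolist())
```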
With a little IPython magic, it's possible to plot frequency of appearances:
%matplotlib inline
by_year = lookup_menus_df.groupby(lookup_menus_df['date_year'])
year_series = by_year.size()
year_series.plot(figsize=(20,10))
<matplotlib.axes.AxesSubplot at 0x10e5ff4d0>
Of course, this graph does not look right. What's going on with that huge spike at the beginning of the 20th century? And why does the number of appearances decrease toward mid-century?
This skew is an artifact of the coverage of the underlying menu collection that NYPL holds. From the Library's own collection guide:
Thousands of menus beginning with the oldest items in the collection from 1851. The collection is strongest for the period between 1890 and 1910.
This is why NYPL, on their own site, has normalized the plot of dish appearances by total number of menus held for a given year.
# Extract years in the complete DataFrame for menus
menu_df['date_year'] = menu_df['date'].map(extract_year)
# Group by year and calculate the number of menus held for each year
total_menu_by_year = menu_df.groupby(menu_df['date_year'])
menu_total_series = total_menu_by_year.size()
With this information, we can create a normalized graph:
normed = [year_series[date_year]/menu_total_series[date_year].astype(float) for date_year in year_series.index.tolist()]
pd.Series(normed, index=year_series.index).plot(figsize=(20,10))
<matplotlib.axes.AxesSubplot at 0x10e60a0d0>
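Because Pandas aligns Series on their index labels during arithmetic, the list comprehension above can also be written as a direct division; years present in only one of the two series come out as NaN and can be dropped. A sketch with made-up counts:

```python
import pandas as pd

# Appearances of one dish per year, and total menus held per year (made up)
year_series = pd.Series({1900: 2, 1901: 4})
menu_total_series = pd.Series({1900: 4, 1901: 8, 1902: 10})

# Division aligns on the year labels; 1902 has no appearances, so drop it
normed = (year_series / menu_total_series).dropna()
print(normed.tolist())  # [0.5, 0.5]
```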
I hope this notebook demonstrates some ways to use the data analysis tools provided by Pandas in the service of curating data from What's On the Menu?