Trevor Muñoz
10 January 2014
For more background, please see the accompanying blog post.
# By convention
import pandas as pd
For this I want to set up two DataFrames: one with the contents of Dish.csv (~25 MB in the latest download) and one with the contents of MenuItem.csv (~114 MB).
As mentioned, Open Refine hangs before it can load MenuItem.csv unless the Java heap size is increased to provide more memory. So, though the timings will vary across systems, I've included time profiling information to show how speedily Pandas handles this job.
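(If even Pandas ran short of memory, `read_csv` can also stream a file in chunks via its `chunksize` parameter. A minimal sketch with inline, made-up data rather than the NYPL files:)

```python
import io
import pandas as pd

# A tiny inline stand-in for a large CSV file (made-up rows)
csv_text = "id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n"

# chunksize makes read_csv return an iterator of smaller DataFrames,
# so the whole file never has to fit in memory at once
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)

print(total_rows)  # 5
```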
%%time
dish_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/Dish.csv'
dish_df = pd.read_csv(dish_data, index_col='id')
CPU times: user 701 ms, sys: 111 ms, total: 812 ms
Wall time: 888 ms
%%time
menu_item_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/MenuItem.csv'
item_df = pd.read_csv(menu_item_data, index_col='dish_id')
CPU times: user 3.43 s, sys: 519 ms, total: 3.95 s
Wall time: 4.42 s
We can make it easier to join up the data by selecting the dish id (the "primary key" of the first DataFrame) as the column to index by when loading the second DataFrame.
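As a toy illustration of why this indexing choice helps (made-up miniature tables, not the NYPL files):

```python
import pandas as pd

# Dish table indexed by its primary key
dish = pd.DataFrame({'name': ['Chicken gumbo', 'Onion au gratin']},
                    index=pd.Index([2, 4], name='id'))

# Item table indexed by the foreign key dish_id
items = pd.DataFrame({'price': [0.10, 0.35, 0.50]},
                     index=pd.Index([2, 4, 4], name='dish_id'))

# Because the indexes line up, join() matches rows automatically
joined = items.join(dish)
print(joined['name'].tolist())  # ['Chicken gumbo', 'Onion au gratin', 'Onion au gratin']
```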
With both DataFrames set up, we can print the first few rows to check that everything looks right.
dish_df[:5]
name | description | menus_appeared | times_appeared | first_appeared | last_appeared | lowest_price | highest_price | |
---|---|---|---|---|---|---|---|---|
id | ||||||||
1 | Consomme printaniere royal | NaN | 10 | 10 | 1897 | 1927 | 0.20 | 0.4 |
2 | Chicken gumbo | NaN | 110 | 116 | 1895 | 1960 | 0.10 | 0.8 |
3 | Tomato aux croutons | NaN | 15 | 15 | 1893 | 1917 | 0.25 | 0.4 |
4 | Onion au gratin | NaN | 41 | 41 | 1900 | 1971 | 0.35 | 1.0 |
5 | St. Emilion | NaN | 62 | 64 | 1881 | 1981 | 0.00 | 18.0 |
item_df[:5]
id | menu_page_id | price | high_price | created_at | updated_at | xpos | ypos | |
---|---|---|---|---|---|---|---|---|
dish_id | ||||||||
1 | 1 | 1389 | 0.4 | NaN | 2011-03-28 15:00:44 UTC | 2011-04-19 04:33:15 UTC | 0.111429 | 0.254735 |
2 | 2 | 1389 | 0.6 | NaN | 2011-03-28 15:01:13 UTC | 2011-04-19 15:00:54 UTC | 0.438571 | 0.254735 |
3 | 3 | 1389 | 0.4 | NaN | 2011-03-28 15:01:40 UTC | 2011-04-19 19:10:05 UTC | 0.140000 | 0.261922 |
4 | 4 | 1389 | 0.5 | NaN | 2011-03-28 15:01:51 UTC | 2011-04-19 19:07:01 UTC | 0.377143 | 0.262720 |
5 | 5 | 3079 | 0.5 | 1 | 2011-03-28 15:21:26 UTC | 2011-04-13 15:25:27 UTC | 0.105714 | 0.313178 |
Pandas integrates with the Notebook to display DataFrames as nice HTML tables. It's also possible to print a summary description of a DataFrame.
dish_df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 399455 entries, 1 to 477346
Data columns (total 8 columns):
name              399455 non-null values
description       0 non-null values
menus_appeared    399455 non-null values
times_appeared    399455 non-null values
first_appeared    399455 non-null values
last_appeared     399455 non-null values
lowest_price      377962 non-null values
highest_price     377962 non-null values
dtypes: float64(3), int64(4), object(1)
For the moment, I'll focus on the Dish data.
With a single function call, we can see how many different values we have in the 'name' column before we start de-duping:
unique_values_t0 = len(dish_df.name.unique())
print unique_values_t0
399421
To start, almost all of the values of dish name are supposedly unique.
# Drop the 'description' column because it's empty and taking up screen real estate
del dish_df['description']
In order to find very similar values that we can identify as duplicates and normalize, I'll pursue the same series of steps I would take using Open Refine.
def strip_and_lower(x):
    """
    Basic string normalization:
    1) strip leading and trailing whitespace
    2) convert to lowercase
    3) normalize internal whitespace (remove extra spaces between tokens)
    """
    tokens = x.strip().lower().split()
    result = ' '.join(filter(None, tokens))
    return result
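A quick check of what the normalization does (the function is repeated here so the snippet stands alone; note that split() with no arguments already drops empty strings):

```python
def strip_and_lower(x):
    # strip ends, lowercase, and collapse internal runs of whitespace
    tokens = x.strip().lower().split()
    return ' '.join(tokens)

print(strip_and_lower('  Chicken   GUMBO '))  # 'chicken gumbo'
print(strip_and_lower('St. Emilion'))         # 'st. emilion'
```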
# Apply the function to every value of name using map() — rather than, for instance, using a loop
dish_df['name_modified'] = dish_df['name'].map(strip_and_lower)
A new column has been created:
dish_df[:5]
name | menus_appeared | times_appeared | first_appeared | last_appeared | lowest_price | highest_price | name_modified | |
---|---|---|---|---|---|---|---|---|
id | ||||||||
1 | Consomme printaniere royal | 10 | 10 | 1897 | 1927 | 0.20 | 0.4 | consomme printaniere royal |
2 | Chicken gumbo | 110 | 116 | 1895 | 1960 | 0.10 | 0.8 | chicken gumbo |
3 | Tomato aux croutons | 15 | 15 | 1893 | 1917 | 0.25 | 0.4 | tomato aux croutons |
4 | Onion au gratin | 41 | 41 | 1900 | 1971 | 0.35 | 1.0 | onion au gratin |
5 | St. Emilion | 62 | 64 | 1881 | 1981 | 0.00 | 18.0 | st. emilion |
This already yields some duplicates which had been showing up as distinct values because of small (inconsequential) differences in whitespace or capitalization.
len(dish_df.name_modified.unique())
373336
Now, if we implement Open Refine-style clustering, we should find more candidates for de-duping.
import re
def fingerprint(x):
    """
    A modified version of the fingerprint clustering algorithm implemented by Open Refine.
    See https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
    This does not normalize to ASCII characters since diacritics may be significant in this dataset.
    """
    alphanumeric_tokens = filter(None, re.split(r'\W', x))
    seen = set()
    seen_add = seen.add
    deduped = sorted([i for i in alphanumeric_tokens if i not in seen and not seen_add(i)])
    fingerprint = ''.join(deduped)
    return fingerprint
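Before applying it to the whole column, a quick check on a single value. (This snippet repeats the logic in an equivalent, slightly more compact form: since the tokens get sorted anyway, a set performs the same deduplication as the seen/seen_add idiom.)

```python
import re

def fingerprint(x):
    # split on non-word characters, drop empties, dedupe, sort, concatenate
    tokens = [t for t in re.split(r'\W+', x) if t]
    return ''.join(sorted(set(tokens)))

print(fingerprint('potatoes, au gratin'))  # 'augratinpotatoes'
print(fingerprint('tomato aux croutons'))  # 'auxcroutonstomato'
```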
dish_df['fingerprint'] = dish_df['name_modified'].map(fingerprint)
We can see the results as another new column
dish_df[:5]
name | menus_appeared | times_appeared | first_appeared | last_appeared | lowest_price | highest_price | name_modified | fingerprint | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
1 | Consomme printaniere royal | 10 | 10 | 1897 | 1927 | 0.20 | 0.4 | consomme printaniere royal | consommeprintaniereroyal |
2 | Chicken gumbo | 110 | 116 | 1895 | 1960 | 0.10 | 0.8 | chicken gumbo | chickengumbo |
3 | Tomato aux croutons | 15 | 15 | 1893 | 1917 | 0.25 | 0.4 | tomato aux croutons | auxcroutonstomato |
4 | Onion au gratin | 41 | 41 | 1900 | 1971 | 0.35 | 1.0 | onion au gratin | augratinonion |
5 | St. Emilion | 62 | 64 | 1881 | 1981 | 0.00 | 18.0 | st. emilion | emilionst |
To see the "clusters" of potential duplicate name values that the fingerprinting method identifies, we can group the data by the contents of the fingerprint column:
clusters = dish_df.groupby('fingerprint')
# Order the clusters by size
s = clusters.size()
s.order()
fingerprint
000000dedouzainehuitreslazelande 1
0001001938duringhavemenmoreorplacedwho 1
000100populationportslopesunnyto 1
000112bottledrinkginhollandperrum 1
000112bottledrinkginjamaicaperrum 1
000112bottledrinkginoldperrumtom 1
000112bottledrinkginperplymouthrum 1
000113030aacresafterallandareasavebeenbeingbodegasbuttsbybyasscocompanyconsidereddelicatedondryextremelyfamilyfavoritefinestfirmforfoundedgenerationsgonzalesgonzalezhavinginisitjerezjoejudgedkeptknownlargestliterallymanuelmariamaturingmembersminimumnamedofoldolderoneoroverownerspepeproducingsschieffelinselectedsherriessherrysolerasspainstandardthethroughouttiotodaytranslatedultimateunclevineyardvineyardswarehouseswaswhichwhosewithworldyears 1
00012belosdzlalesno 1
000150additioncupsinofpintpricequarttowine 1
000150boxcigarettesfranclargepersizesmall 1
000150boxcigarettesfrancslargepersizesmall 1
0001550andchampagnejouetmilsperrier 1
000160cgermainegorgelmacons 1
000165amontilladoptsqtssherry 1
...
broiledchickenhalf 36
brownedhashedpotatoes 36
codryextraghmumm 36
eggsfriedtwo 36
alecgingerimported 38
mashedpotatoes 38
dryextraghmumms 39
creamhashedinpotatoes 40
creamicevanilla 40
augratinpotatoes 41
2eggsonpoachedtoast 44
friedpotatoessweet 45
boiledpotatoes 46
frenchfriedpotatoes 47
                                 81
Length: 330038, dtype: int64
We might want to know how much variation there is in the sizes of the clusters. This can be accomplished in one line:
s.value_counts()
1     294118
2      23204
3       6230
4       2564
5       1370
6        788
7        491
8        323
9        249
10       178
11       110
12       102
15        50
14        46
13        39
18        24
20        20
16        20
17        20
19        14
22        12
21        10
25         7
36         6
31         5
26         5
30         5
24         4
23         3
33         3
38         2
29         2
40         2
27         2
45         1
44         1
46         1
41         1
32         1
39         1
35         1
47         1
28         1
81         1
dtype: int64
So, by this measure, there are still 294,118 singleton values (down from 399,421), but there are 23,204 two-item clusters that could be normalized, 110 11-item clusters, just 1 47-item cluster, etc.
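The two-step pattern (group by fingerprint, then count the cluster sizes) on a made-up miniature frame:

```python
import pandas as pd

toy = pd.DataFrame({'fingerprint': ['augratinonion', 'chickengumbo',
                                    'chickengumbo', 'chickengumbo',
                                    'boiledpotatoes']})

# Size of each cluster of identical fingerprints
sizes = toy.groupby('fingerprint').size()
print(sizes['chickengumbo'])    # 3

# How many clusters there are of each size
print(sizes.value_counts()[1])  # 2 singleton clusters
```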
The print display of the various clusters above shows only the fingerprint value and the count of rows that share that common value, but this is not the most immediately apprehensible way of seeing what's going on. For clarity, let's print the 5 largest clusters:
# Exclude the last item of the series since its fingerprint value is blank
for f in s.order()[-6:-1].index:
    print dish_df[dish_df.fingerprint == f].name
id 2759 Potatoes, au gratin 7176 Potatoes au Gratin 8373 Potatoes--Au gratin 35728 Potatoes: au gratin 44271 Au Gratin Potatoes 84510 Au Gratin (Potatoes) 94968 Potatoes, au Gratin, 97166 POTATOES:- Au gratin 185040 Au Gratin [potatoes] 313168 Au Gratin Potatoes 315697 (Potatoes) Au Gratin 325940 Au Gratin Potatoes 330420 au-Gratin Potatoes 353435 Potatoes: Au gratin 373639 Potatoes Au Gratin 376396 Potatoes - au gratin 376414 Potatoes, Au Gratin 378597 Au gratin potatoes 378732 Potatoes au gratin 379157 Potatoes, au Gratin 379534 Potatoes AuGratin 379829 [Potatoes] au gratin 380247 [Potatoes] Au Gratin 380907 Potatoes au Gratin 383920 Au gratin Potatoes 385327 Potatoes Au Gratin 385777 Au Gratin potatoes 387869 Potatoes Au gratin 388838 potatoes au gratin 389716 Potatoes - Au gratin 391056 POTATOES: Au Gratin 398307 Potatoes: Au gratin 398348 Potatoes-Au gratin 414322 au gratin potatoes 414927 Potatoes -Au Gratin 420093 POTATOES Au Gratin 435770 Potatoes, Au gratin 441471 POTATOES: AU GRATIN 448532 POTATOES - Au gratin 450243 POTATOES AU GRATIN 453840 potatoes, au gratin Name: name, dtype: object id 22361 Poached Eggs (2) on toast 22562 Eggs, 2 Poached on Toast 22763 2 Poached Eggs on Toast 31324 Eggs, Poached on toast (2) 48043 Eggs poached on toast (2) 51940 Poached eggs on toast (2) 79275 2 Poached (Eggs) on Toast 81059 Eggs Poached (2), on Toast 81421 2 Eggs, poached on toast 90960 2 Poached (eggs) on Toast 91432 2 (Eggs) Poached on Toast 108224 Poached Eggs, 2 on toast 111031 Poached Eggs on Toast, [2] 127128 Eggs, Poached (2) on Toast 128696 Eggs (2) poached on toast 183077 EGGS Poached (2) on Toast 206923 2 Poached on Toast, Eggs 224031 Eggs: 2 Poached on Toast 296286 Eggs - Poached on Toast (2) 334815 (2) Poached Eggs on Toast 369966 Eggs (2), Poached on Toast 375226 Eggs, poached on toast(2) 376027 Eggs, poached on toast (2) 376449 EGGS, poached on toast (2) 376635 [EGGS] poached on toast (2) 376746 EGGS, poached on toast (2) 377100 Poached eggs (2), on 
toast 380354 EGGS poached on toast (2) 381711 2 Poached Eggs on toast 382721 EGGS, poached on toast [2] 383950 Poached Eggs on Toast (2) 389804 Poached eggs on Toast (2) 399299 2 Eggs, Poached on Toast 408143 Poached eggs (2) on toast 417205 Eggs, poached on toast [2] 433441 2 eggs poached on toast 440712 Eggs, Poached on Toast (2) 442689 Eggs poached on toast [2] 445035 2 eggs, poached on toast 452711 POACHED EGGS ON TOAST (2) 458993 Eggs, Poached on Toast [2] 465510 EGGS poached on toast (2) 466056 EGGS poached on toast [2] 476441 Eggs poached, on toast (2) Name: name, dtype: object id 2089 Sweet Potatoes Fried 3780 Sweet Potatoes, Fried 4075 fried Sweet Potatoes 9476 Potatoes, sweet fried 17042 Fried sweet [potatoes] 17116 Potatoes, Sweet, Fried 38431 Sweet Potatoes - Fried 42055 Potatoes sweet, fried 56061 Potatoes, fried sweet 76902 Potatoes Sweet Fried 78525 Sweet (potatoes), fried 86202 POTATOES: Fried sweet 92455 Sweet Potatoes: fried 101532 Potatoes Fried Sweet 136255 Sweet fried potatoes 198754 Sweet Potatoes, fried 349957 Sweet Potatoes--Fried 354008 Potatoes SWEET Fried 356717 Sweet Potatoes---Fried 373866 Fried sweet potatoes 373917 FRIED SWEET POTATOES 373973 Sweet potatoes - Fried 374052 Fried Sweet Potatoes 376416 Potatoes, Fried Sweet 376488 Sweet Potatoes: Fried 379831 Fried [Sweet Potatoes] 380251 Fried [sweet potatoes] 387427 Sweet potatoes - fried 390331 Sweet potatoes, fried 392859 Fried Sweet potatoes 396587 SWEET POTATOES---Fried 397513 Sweet potatoes -- Fried 397930 SWEET POTATOES - Fried 401304 Sweet potatoes-fried 402285 Sweet potatoes-Fried 406090 fried sweet potatoes 409582 Sweet potatoes fried 411471 Sweet potatoes--Fried 431756 Sweet Potatoes - fried 452385 Sweet Potatoes, fried 454978 fried sweet potatoes 456340 Sweet Fried Potatoes 461269 SWEET POTATOES - fried 465671 Sweet Potatoes, Fried 471570 Fried Sweet Potatoes. 
Name: name, dtype: object id 219 Boiled potatoes 2752 Potatoes, boiled 4290 Potatoes--Boiled 17902 Boiled Potatoes, 20308 Potatoes Boiled 31198 Potatoes: boiled 34056 Boiled Potatoes. 37644 Potatoes - Boiled 40611 Boiled Potatoes 44255 Boiled (potatoes) 59475 Boiled [Potatoes] 63979 POTATOES-boiled 74342 Potatoes Boiled 76369 POTATOES Boiled 85397 Potatoes , boiled 94958 Potatoes, Boiled, 97154 Potatoes:- Boiled 125408 POTATOES: *Boiled 194143 Boiled Potatoes 199538 Boiled-Potatoes 225209 Potatoes:--boiled 362463 Potatoes---Boiled 374956 Boiled Potatoes 376397 Potatoes, Boiled 379856 Potatoes -- Boiled 380179 Potatoes -- Boiled 380244 Boiled [potatoes] 385714 potatoes, boiled 389060 *Boiled Potatoes 394137 boiled potatoes 394579 Boiled POTATOES 395878 BOILED POTATOES 398306 Potatoes: Boiled 398346 Potatoes-Boiled 401302 Potatoes boiled 407244 Potatoes - boiled 412613 Potatoes, boiled 426362 Potatoes boiled 427434 BoiledPotatoes 427492 Potatoes Boiled 432003 Boiled Potatoes 436408 Potatoes, Boiled 451614 POTATOES--Boiled 452667 POTATOES - Boiled 465416 POTATOES: Boiled 471029 Boiled potatoes. Name: name, dtype: object id 1259 French fried potatoes 2761 Potatoes, French fried 11893 Potatoes---French Fried 14987 Potatoes: French fried 17047 French fried [potatoes] 17949 Potatoes - French Fried 18441 Potatoes French fried 21120 Potatoes-French Fried 21264 Potatoes--French Fried 29043 POTATOES: French Fried. 32765 Potatoes French, fried 35844 French Fried Potatoes. 
37852 French Fried (POTATOES) 39455 Potatoes, French, fried 57562 Potatoes- French Fried 74346 Potatoes French Fried 76376 POTATOES French Fried 97164 POTATOES:- French fried 100589 Potatoes - French Fried, 112767 Potatoes, Fried, French 147020 Potatoes: French Fried 181396 Potatoes -- French Fried 248905 French Fried Potatoes 328845 French-Fried Potatoes 373638 Potatoes French Fried 375011 French Fried Potatoes 376400 Potatoes, French Fried 376487 Potatoes: French Fried 376508 Potatoes - French fried 379156 French Fried potatoes 379828 French Fried [Potatoes] 386120 Potatoes, french fried 387212 Potatoes--French fried 387671 French Fried (potatoes) 393034 French fried Potatoes 395877 FRENCH FRIED POTATOES 398352 Potatoes-French fried 401306 Potatoes french fried 401838 French fried potatoes 404612 Potatoes: French fried 407806 French fried potatoes 410281 french fried potatoes 410591 Potatoes: French fried 436571 POTATOES - French Fried 441467 POTATOES: FRENCH FRIED 455232 potatoes, french fried 473731 POTATOES--FRENCH FRIED Name: name, dtype: object
At a glance these do indeed look like values that should be normalized to a single "name". Since the default way that IPython prints a Pandas Series object includes the 'id', it is easy to construct URLs to allow for some sanity checking.
(I'm not automating the sanity checking here because I don't want to get in bad odor with NYPL's anti-spambot mechanisms.)
testing_cluster = dish_df[dish_df.fingerprint == 'augratinpotatoes']
for item in testing_cluster.index.tolist():
    url = "http://menus.nypl.org/dishes/{0}".format(item)
    print url
http://menus.nypl.org/dishes/2759
http://menus.nypl.org/dishes/7176
http://menus.nypl.org/dishes/8373
http://menus.nypl.org/dishes/35728
http://menus.nypl.org/dishes/44271
http://menus.nypl.org/dishes/84510
http://menus.nypl.org/dishes/94968
http://menus.nypl.org/dishes/97166
http://menus.nypl.org/dishes/185040
http://menus.nypl.org/dishes/313168
http://menus.nypl.org/dishes/315697
http://menus.nypl.org/dishes/325940
http://menus.nypl.org/dishes/330420
http://menus.nypl.org/dishes/353435
http://menus.nypl.org/dishes/373639
http://menus.nypl.org/dishes/376396
http://menus.nypl.org/dishes/376414
http://menus.nypl.org/dishes/378597
http://menus.nypl.org/dishes/378732
http://menus.nypl.org/dishes/379157
http://menus.nypl.org/dishes/379534
http://menus.nypl.org/dishes/379829
http://menus.nypl.org/dishes/380247
http://menus.nypl.org/dishes/380907
http://menus.nypl.org/dishes/383920
http://menus.nypl.org/dishes/385327
http://menus.nypl.org/dishes/385777
http://menus.nypl.org/dishes/387869
http://menus.nypl.org/dishes/388838
http://menus.nypl.org/dishes/389716
http://menus.nypl.org/dishes/391056
http://menus.nypl.org/dishes/398307
http://menus.nypl.org/dishes/398348
http://menus.nypl.org/dishes/414322
http://menus.nypl.org/dishes/414927
http://menus.nypl.org/dishes/420093
http://menus.nypl.org/dishes/435770
http://menus.nypl.org/dishes/441471
http://menus.nypl.org/dishes/448532
http://menus.nypl.org/dishes/450243
http://menus.nypl.org/dishes/453840
Identifying the clusters is great and we can use this information not only to improve the quality of the curated data set but also to clarify and improve analysis of the data about dish frequencies and "timespans." For the purposes of example, I'll use the "french fried potatoes" cluster.
With 47 values to choose from, what should we select as the normalized value? How about the most common string already represented in the cluster?
# Get a DataFrame consisting of all the rows that share a common fingerprint
result_df = dish_df[dish_df.fingerprint == 'frenchfriedpotatoes']
# Get the number of times appeared for each result by selecting that column (Series) and sort in descending order
by_appearance = result_df.times_appeared.order(ascending=False)
#Use the index of the max value to look up the most common value for the name
print "Most common value: '{0}'".format(dish_df.name[by_appearance.idxmax()])
Most common value: 'French fried potatoes'
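A note for readers with a newer Pandas: `Series.order` has since been removed; the same descending sort is spelled `sort_values`. A sketch with made-up counts:

```python
import pandas as pd

# times_appeared for three hypothetical dishes, keyed by dish id
times_appeared = pd.Series({1259: 1392, 2761: 243, 11893: 1})

# sort_values(ascending=False) replaces the older Series.order
by_appearance = times_appeared.sort_values(ascending=False)

# idxmax returns the index label of the largest value
print(by_appearance.idxmax())  # 1259
```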
Does clustering the values change the number of times a dish appeared or the earliest and latest dates it appears?
result_df.times_appeared[:5]
id
1259     1392
2761      243
11893       1
14987      23
17047       3
Name: times_appeared, dtype: int64
According to this data (it varies slightly from what appears on the web site), the dish with id 1259 and name 'French fried potatoes' appears 1,392 times. Dish 2761 (name "Potatoes, French Fried") appears 243 times, etc.
If we assert that all of these instances are in fact the same type of dish, we can quickly calculate a new total for how often this dish appears on the menus in the data set.
total_appearances = result_df.times_appeared.sum()
print "French fried potatoes appear {0} times in the menus digitized by NYPL.".format(total_appearances)
French fried potatoes appear 2140 times in the menus digitized by NYPL.
We can do something similar for the date ranges. (It isn't even necessary to do anything with the data type of the values for first_appeared and last_appeared.)
# It turns out we can't use min() as straightforwardly as max(),
# because some dishes apparently have '0' for first_appeared.
aggregate_first_appeared = result_df.first_appeared.values[0]
aggregate_last_appeared = result_df.last_appeared.max()
print "French fried potatoes first appeared on a menu in {0} and last appeared on a menu in {1}"\
.format(aggregate_first_appeared, aggregate_last_appeared)
French fried potatoes first appeared on a menu in 1884 and last appeared on a menu in 1989
For an apparently ever-popular dish like 'French fried potatoes' this 100-or-so year range is not that different from what appears on the page for the most-popular variant (Dish 1259). The aggregate values might shift more for other dishes.
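Instead of grabbing the first row's value, a more defensive way to get the earliest year would be to mask out the zero placeholders before taking the minimum. A sketch with made-up years (not the actual cluster):

```python
import pandas as pd

# first_appeared values where 0 means "unknown" (made-up data)
first_appeared = pd.Series([0, 1884, 1900, 0, 1895])

# Keep only the real years, then take the earliest
earliest = first_appeared[first_appeared > 0].min()
print(earliest)  # 1884
```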
Of course, there are also still 'pommes frites' to account for …
Reproducing NYPL's lovely appearance-by-year frequency plots takes a little more doing, but it provides a good opportunity to demonstrate the facilities that Pandas provides for pulling together different sets of data using SQL-like mechanisms.
# Load data from the other CSV files included in the NYPL's regular data dumps because we'll need it all
page_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/MenuPage.csv'
page_df = pd.read_csv(page_data, index_col='id')
menu_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/Menu.csv'
menu_df = pd.read_csv(menu_data, index_col='id', parse_dates=True)
(For this final bit, I'm not entirely confident that I'm doing this in the best way. It seems to "work" but it feels a little verbose. Caveat lector.)
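The index-based lookups below can also be expressed with `pd.merge`, which follows the foreign keys explicitly. A sketch on made-up miniature stand-ins for the three tables:

```python
import pandas as pd

# Miniature stand-ins for MenuItem, MenuPage, and Menu (invented values)
items = pd.DataFrame({'dish_id': [1259, 1259, 2761],
                      'menu_page_id': [10, 11, 12]})
pages = pd.DataFrame({'id': [10, 11, 12],
                      'menu_id': [100, 101, 102]})
menus = pd.DataFrame({'id': [100, 101, 102],
                      'date': ['1900-04-15', '1900-04-16', '1900-02-18']})

# items -> pages -> menus, one merge per foreign-key hop
merged = (items
          .merge(pages, left_on='menu_page_id', right_on='id')
          .merge(menus, left_on='menu_id', right_on='id',
                 suffixes=('_page', '_menu')))
print(merged[['dish_id', 'date']])
```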
# This should create a DataFrame from the Menu Items data set that contains
# only "rows" related to our example cluster of 'French fried potato' dishes
lookup_items_df = item_df[item_df.index.isin(result_df.index)]
# The length of this DataFrame should be same as the sum of occurrences we calculated above
try:
    assert len(lookup_items_df) == result_df.times_appeared.sum()
    print "Length of the DataFrame is consistent"
except AssertionError:
    print "Something's wrong"
Length of the DataFrame is consistent
lookup_items_df.head(10)
id | menu_page_id | price | high_price | created_at | updated_at | xpos | ypos | |
---|---|---|---|---|---|---|---|---|
dish_id | ||||||||
1259 | 454 | 5217 | NaN | NaN | 2011-04-19 03:29:41 UTC | 2011-10-05 21:07:14 UTC | 0.342857 | 0.761442 |
1259 | 530 | 130 | NaN | NaN | 2011-04-19 13:17:01 UTC | 2011-05-13 05:14:16 UTC | 0.667143 | 0.735197 |
1259 | 1194 | 6471 | NaN | NaN | 2011-04-19 19:22:19 UTC | 2011-05-06 18:52:50 UTC | 0.315714 | 0.747705 |
1259 | 1792 | 134 | NaN | NaN | 2011-04-19 20:42:20 UTC | 2011-04-19 20:42:20 UTC | 0.550000 | 0.685084 |
1259 | 2093 | 146 | NaN | NaN | 2011-04-19 22:22:22 UTC | 2011-04-19 22:22:22 UTC | 0.568571 | 0.683963 |
1259 | 2953 | 1627 | NaN | NaN | 2011-04-20 02:18:30 UTC | 2011-04-20 02:18:30 UTC | 0.508571 | 0.623720 |
1259 | 3715 | 4783 | 0.10 | NaN | 2011-04-20 05:06:28 UTC | 2011-10-02 16:31:41 UTC | 0.301429 | 0.338671 |
2761 | 3966 | 3531 | 0.10 | NaN | 2011-04-20 05:25:58 UTC | 2011-04-20 05:25:58 UTC | 0.342857 | 0.477359 |
1259 | 4565 | 1432 | 0.10 | NaN | 2011-04-20 16:04:24 UTC | 2011-05-11 20:53:58 UTC | 0.131429 | 0.824837 |
2761 | 5788 | 5006 | 0.15 | NaN | 2011-04-20 19:12:01 UTC | 2011-05-31 20:07:02 UTC | 0.535714 | 0.652158 |
From this intermediate DataFrame we want to grab the values for menu_page_id so that we can look up the relevant menus to which they belong:
target_pages = lookup_items_df.menu_page_id.values
lookup_pages_df = page_df[page_df.index.isin(target_pages)]
lookup_pages_df.head(10)
menu_id | page_number | image_id | full_height | full_width | uuid | |
---|---|---|---|---|---|---|
id | ||||||
130 | 12463 | 1 | 466928 | 3049 | 2004 | 510D47DB-491F-A3D9-E040-E00A18064A99 |
134 | 12465 | 1 | 466931 | 3411 | 2291 | 510d47db-4923-a3d9-e040-e00a18064a99 |
146 | 12470 | 1 | 466940 | 3427 | 2282 | 510d47db-4933-a3d9-e040-e00a18064a99 |
263 | 12505 | 1 | 475815 | 4949 | 3261 | 510d47db-4309-a3d9-e040-e00a18064a99 |
265 | 12506 | 3 | 4000008161 | 5559 | 3229 | 510d47db-430b-a3d9-e040-e00a18064a99 |
309 | 12518 | 3 | 4000013961 | 4659 | 2823 | 510d47db-662f-a3d9-e040-e00a18064a99 |
332 | 12523 | 5 | 4000013986 | 3703 | 2427 | 510d47db-6652-a3d9-e040-e00a18064a99 |
460 | 12562 | 1 | 466438 | 2803 | 1979 | 510d47db-44a4-a3d9-e040-e00a18064a99 |
482 | 12569 | 1 | 466597 | 2600 | 1811 | 510d47db-461a-a3d9-e040-e00a18064a99 |
501 | 12577 | 1 | 466783 | 2884 | 2083 | 510d47db-47c1-a3d9-e040-e00a18064a99 |
Now we can grab all the relevant values of 'menu_id' for menus and do the last lookup:
target_menus = lookup_pages_df.menu_id.values
lookup_menus_df = menu_df[menu_df.index.isin(target_menus)]
Now we should have a DataFrame containing only menus that have pages on which one of the dishes from our 'French fried potato' cluster appears:
lookup_menus_df.head(10)
name | sponsor | event | venue | place | physical_description | occasion | notes | call_number | keywords | language | date | location | location_type | currency | currency_symbol | status | page_count | dish_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||
12463 | NaN | HOTEL EASTMAN | BREAKFAST | COMMERCIAL | HOT SPRINGS, AR | CARD; 4.75X7.5; | EASTER; | NaN | 1900-2822 | NaN | NaN | 1900-04-15 | Hotel Eastman | NaN | NaN | NaN | under review | 2 | 67 |
12465 | NaN | NORDDEUTSCHER LLOYD BREMEN | FRUHSTUCK/BREAKFAST; | COMMERCIAL | DAMPFER KAISER WILHELM DER GROSSE; | CARD; ILLU; COL; 5.5X8.0; | NaN | MENU IN GERMAN AND ENGLISH; ILLUS, STEAMSHIP A... | 1900-2827 | NaN | NaN | 1900-04-16 | Norddeutscher Lloyd Bremen | NaN | NaN | NaN | under review | 2 | 84 |
12470 | NaN | NORDDEUTSCHER LLOYD BREMEN | FRUHSTUCK/BREAKFAST | COMMERCIAL | SCHNELLDAMPFER KAISER WILHELM DER GROSSE; | BROADSIDE; ILLUS; COL; 5.5X8.50; | NaN | MENU IN GERMAN AND ENGLISH; ILLUS, LIGHTHOUSE;... | 1900-2839 | NaN | NaN | 1900-04-17 | Norddeutscher Lloyd Bremen | NaN | NaN | NaN | under review | 2 | 80 |
12505 | NaN | BARTHOLDI HOTEL | BREAKFAST | COMMERCIAL | NY | BROADSIDE; COL; 7 X 11.75; | DAILY | VERY POOR CONDITION; | 1900-1137 | NaN | NaN | 1900-02-18 | Bartholdi Hotel | NaN | Dollars | $ | under review | 2 | 203 |
12506 | NaN | BARTHOLDI HOTEL | LUNCH & DINNER | COMMERCIAL | 23RD ST & BWAY NY | FOLDER; ILLUS;8.25 X 14; | DAILY | MANY ITEMS HANDWRITTEN; HOURS INCLUDED; | 1900-1138 | NaN | NaN | 1900-02-18 | Bartholdi Hotel | NaN | Dollars | $ | under review | 4 | 336 |
12518 | NaN | COLUMBIA RESTAURANT | DAILY MENU | COMMERCIAL | 48 EAST 14TH STREET,[NEW YORK,NY?] | FOLDER;ILLUS;7.25X11.5; | DAILY; | 2 COPIES;WINE LIST; | 1901-1527 | NaN | NaN | 1901-06-03 | Columbia Restaurant | NaN | Dollars | $ | under review | 4 | 475 |
12523 | NaN | CITIZENS'STEAMBOAT COMPANY | LUNCH | COMMERCIAL | STEAMER SARATOGA,TROY LINE | BOOKLET;ILLUS;COL;6.25X9; | DAILY; | AU SABLE CHASM ON FRONT COVER;BLUE MOUNTAIN LA... | 1901-1537 | NaN | NaN | 1901-01-01 | Citizens'steamboat Company | NaN | Dollars | $ | under review | 8 | 246 |
12562 | NaN | CUNARD LINE | BREAKFAST | COMMERCIAL | EN ROUTE ON BOARD R.M.S. CAMPANIA | CARD; 4.5 X 6.5; | NaN | PRICED WINE LIST ON BACK OF MENU; PRICES IN BR... | 1900-2170 | NaN | NaN | 1900-02-28 | Cunard Line | NaN | NaN | NaN | under review | 2 | 110 |
12569 | NaN | MAXWELL HOUSE | SUPPER | COMMERCIAL | NASHVILLE, [TN?] | CARD; ILLUS; 4.25X6.25 | NaN | ELABORATE PRINTING OF HOTEL NAME FORMS ILLUSTR... | 1900-2397 | NaN | NaN | 1900-03-12 | Maxwell House | NaN | NaN | NaN | under review | 2 | 38 |
12577 | NaN | HADDON HALL | SUPPER | COMMERCIAL | [PHILADELPHIA,PA.] | BROADSIDE; ILLUS; COL; 5 X 7; | DAILY | NaN | 1900-2625 | NaN | NaN | 1900-03-31 | Haddon Hall | NaN | NaN | NaN | under review | 2 | 60 |
This next bit is a workaround, since parsing the dates in the date column doesn't seem to be working for me as expected. I need to extract the year for each menu so that I can group the number of appearances of a dish by year.
import dateutil.parser

def extract_year(x):
    try:
        parsed = dateutil.parser.parse(x)
        return parsed.year
    except (TypeError, ValueError):
        return None
lookup_menus_df['date_year'] = lookup_menus_df['date'].map(extract_year)
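In more recent Pandas versions the same extraction can be done in one vectorized step: `pd.to_datetime` with `errors='coerce'` turns unparseable strings into `NaT` rather than raising. A sketch with made-up values:

```python
import pandas as pd

dates = pd.Series(['1900-04-15', '1901-06-03', 'not a date'])

# Bad values become NaT, and .dt.year then yields NaN for them
years = pd.to_datetime(dates, errors='coerce').dt.year

print(years.tolist())
```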
With a little IPython magic, it's possible to plot frequency of appearances:
%matplotlib inline
by_year = lookup_menus_df.groupby(lookup_menus_df['date_year'])
year_series = by_year.size()
year_series.plot(figsize=(20,10))
<matplotlib.axes.AxesSubplot at 0x10e5ff4d0>
Of course, this graph does not look right. What's going on with that huge spike at the beginning of the 20th century? And why does the number of appearances decrease toward mid-century?
This skew is an artifact of the coverage of the underlying menu collection that NYPL holds. From the Library's own collection guide:
Thousands of menus beginning with the oldest items in the collection from 1851. The collection is strongest for the period between 1890 and 1910.
This is why NYPL, on their own site, has normalized the plot of dish appearances by total number of menus held for a given year.
# Extract years in the complete DataFrame for menus
menu_df['date_year'] = menu_df['date'].map(extract_year)
# Group by year and calculate the number of menus held for each year
total_menu_by_year = menu_df.groupby(menu_df['date_year'])
menu_total_series = total_menu_by_year.size()
With this information, we can create a normalized graph:
normed = [year_series[date_year]/menu_total_series[date_year].astype(float) for date_year in year_series.index.tolist()]
pd.Series(normed, index=year_series.index).plot(figsize=(20,10))
<matplotlib.axes.AxesSubplot at 0x10e60a0d0>
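Because Pandas aligns Series on their index labels during arithmetic, the list comprehension above can also be written as a direct division; years present in only one of the two series come out as NaN and can be dropped. A sketch with made-up counts:

```python
import pandas as pd

# Appearances of one dish per year, and total menus held per year (made up)
year_series = pd.Series({1900: 2, 1901: 4})
menu_total_series = pd.Series({1900: 4, 1901: 8, 1902: 10})

# Division aligns on the year labels; 1902 has no appearances, so drop it
normed = (year_series / menu_total_series).dropna()
print(normed.tolist())  # [0.5, 0.5]
```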
I hope this notebook demonstrates some ways to use the data analysis tools provided by Pandas in the service of curating data from What's On the Menu?