In this notebook we use TextBlob to extract nouns, verbs, and sentences from the OCRd text of a 19th-century cookery book. We try to clean things up a bit, using regular expressions to discard likely OCR errors. Then we recombine the various parts in random combinations to create delicious recipes for all occasions. Enjoy!
Inspired by Australian Plain Cookery by a Practical Cook, 1882.
import requests
from textblob import TextBlob
import re
import random
import pandas as pd
from IPython.display import display, HTML
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# The Cloudstor URL links to the repository of OCRd text from Trove digitised books
CLOUDSTOR_URL = 'https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL'
# File name of the cookery book
text_file = 'australian-plain-cookery-by-a-practical-cook-nla.obj-579917051.txt'
First we procure a recipe book.
# Download the text of the book
response = requests.get(f'{CLOUDSTOR_URL}/download?files={text_file}')
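Before we start chopping, it's worth checking that our ingredients arrived intact. This optional peek (not part of the original recipe) prints the opening of the OCRd text.
# Take a quick look at the start of the downloaded text
print(response.text[:500])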
Then we slice and dice the words to create a new TextBlob.
# Create a TextBlob using the text
blob = TextBlob(response.text)
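As a rough sanity check, we can weigh our raw ingredients; this optional cell counts the words and sentences TextBlob found.
# How much text do we have to work with?
print(f'{len(blob.words)} words, {len(blob.sentences)} sentences')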
Carefully we remove the nouns and the verbs, discarding any that are spoiled.
# Get the verbs, filtering out short words and those including non-alpha characters.
# 'VBD' is the part of speech tag for a past tense verb
verbs = [w.title() for w, t in blob.tags if t == 'VBD' and len(w) > 3 and w.isalpha()]
# Get the nouns, filtering out short words and those including non-alpha characters.
# 'NNP' is the POS tag for a proper noun ('NNPS' for plurals, hence startswith)
nouns = [w.title() for w, t in blob.tags if t.startswith('NNP') and len(w) > 3 and w.isalpha()]
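If you'd like to inspect the pantry before cooking, this optional check samples a few of the harvested words (your results will vary, as the sampling is random).
# Sample a few of the extracted verbs and nouns
print('Verbs:', random.sample(verbs, 5))
print('Nouns:', random.sample(nouns, 5))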
Now it is necessary to prepare the sentences. First extract them from the blob. Discard any that seem ill-formed.
# Get the sentences from the blob
# Uses a regexp to exclude those that include anything other than standard letters, numbers, and punctuation.
sentences = [str(s).replace('\n', ' ') for s in blob.sentences if re.match(r'^[a-zA-Z\s\-,\.;0-9\'&\(\):]*$', str(s))]
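Another optional taste test: display a random sentence that made it through the regular expression.
# Show one of the surviving sentences
print(random.choice(sentences))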
The sentences now need to be divided, to separate out the titles, which are recognised by their case.
# Titles in this cookbook are in uppercase, so we can separate them out from the rest of the sentences.
titles = [s for s in sentences if s.strip('.').isupper()]
sentences = [s for s in sentences if not s.strip('.').isupper()]
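To confirm the division worked, we can sample a few of the titles (again, an optional step).
# Sample a few of the uppercase titles
print(random.sample(titles, 3))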
Now we are ready to start cooking!
def recipe_maker(num=5):
    # Get a random title
    title = random.choice(titles)
    html = f'<h4>{title}</h4>'
    html += '<h5>Ingredients:</h5>'
    html += '<ol>'
    # Make a random selection from the nouns & verbs
    for _ in range(num):
        html += f'<li>{random.choice(verbs)} {random.choice(nouns)}</li>'
    html += '</ol>'
    html += '<h5>Method:</h5>'
    # Get random sentences and combine them
    html += f'<p>{" ".join(random.sample(sentences, num))}</p>'
    display(HTML(html))
recipe_maker(6)
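If one dish isn't enough, you can serve up a whole menu by calling the function a few times.
# Cook up a three-course meal
for _ in range(3):
    recipe_maker(4)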
There's a full list of the Penn Treebank POS (Part of Speech) tags online if you'd like to play with different combinations.
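For example, you could season the ingredients with adjectives. Here's a sketch using the 'JJ' tag (the Penn Treebank tag for adjectives), with the same length and non-alpha filters as above.
# Get the adjectives, filtering out short words and those including non-alpha characters
# 'JJ' is the POS tag for an adjective
adjectives = [w.title() for w, t in blob.tags if t == 'JJ' and len(w) > 3 and w.isalpha()]
print(random.sample(adjectives, 5))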
Perhaps we could add some more cookbooks? Let's load details of all the digitised books in Trove that include the word 'cookery' in the title and have OCRd text available.
df = pd.read_csv('https://raw.githubusercontent.com/GLAM-Workbench/trove-books/master/trove_digitised_books_with_ocr.csv')
df.loc[(df['title'].str.contains('cookery')) & (df['text_downloaded'] == True)]
To use a different one of these as the source for our recipe generator, just copy the index value, then use it to look up the name of the text_file. Like this:
df.loc[8173]['text_file']
Copy and paste the file name into the text_file value at the top of this notebook, and then re-run the cells.
How might we combine ingredients from all of these cookbooks?
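One possible approach (a sketch reusing the dataframe and Cloudstor URL from above; downloading every book will take a while) is to pour all the texts into a single TextBlob and then re-run the extraction cells.
# Combine the OCRd text of all the downloaded cookery books into one blob
cookery = df.loc[(df['title'].str.contains('cookery')) & (df['text_downloaded'] == True)]
combined_text = ''
for text_file in cookery['text_file']:
    response = requests.get(f'{CLOUDSTOR_URL}/download?files={text_file}')
    combined_text += response.text + '\n'
blob = TextBlob(combined_text)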
Created by Tim Sherratt for the GLAM Workbench.