The National Museum of Australia provides access to its collection data through an API. But if you're going to do any large-scale analysis of the data, you probably want to harvest and save it locally. This notebook helps you do just that.
According to the API documentation, the possible endpoints are:
- `/object`: the museum catalogue plus images/media
- `/narrative`: narratives by Museum staff about featured topics
- `/party`: people and organisations associated with collection items
- `/place`: locations associated with collection items
- `/collection`: sub-collections within the museum catalogue
- `/media`: images and other media associated with collection items

This notebook should harvest records from any of these endpoints, though I've only tested `object`, `party`, and `place`.
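Each endpoint returns paginated JSON. Based on the fields the harvesting code below relies on, a response contains a `meta` block with the total result count, a `data` list of records, and a `links` block with a `next` URL while more pages remain. Here's a sketch of that structure with invented placeholder records, showing how the pieces are read:

```python
# A trimmed stand-in for an NMA API response. The field names
# ('meta', 'results', 'data', 'links', 'next') are the ones the
# harvesting code below uses; the record contents are invented.
sample_response = {
    "meta": {"results": 250},  # total number of matching records
    "data": [
        {"id": "12345", "title": "Example object"},
        {"id": "67890", "title": "Another example"},
    ],
    "links": {"next": "https://data.nma.gov.au/object?offset=100"},
}

# Pull out the values the harvester cares about
total = sample_response["meta"]["results"]
records = sample_response["data"]
has_more = "next" in sample_response.get("links", {})

print(total, len(records), has_more)
```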
It harvests records in the simple JSON format and saves them as they are to a file-based database using TinyDB. See the other notebooks in this repository for examples of loading the JSON data into a DataFrame for manipulation and analysis.
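If you just want to get the saved records back out, you don't strictly need TinyDB: its default JSON storage wraps records in a `_default` table keyed by an internal integer id (a format assumption worth checking against your own file). This sketch writes a tiny stand-in database and flattens it back into a plain list of records, ready to hand to something like `pandas.DataFrame(records)`:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "nma_object_db.json"
    # Write a tiny stand-in for a harvested TinyDB database;
    # the '_default' table wrapper is TinyDB's default storage layout
    db_path.write_text(json.dumps({
        "_default": {
            "1": {"id": "12345", "title": "Example object"},
            "2": {"id": "67890", "title": "Another example"},
        }
    }))
    # Flatten the table back into a plain list of records
    records = list(json.loads(db_path.read_text())["_default"].values())

print(len(records))
```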
If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!
Some tips:
Is this thing on? If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to load a live version running on Binder.
import requests
from tinydb import TinyDB, Query
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('http://', HTTPAdapter(max_retries=retries))
s.mount('https://', HTTPAdapter(max_retries=retries))
API_BASE_URL = 'https://data.nma.gov.au/{}'
To make full use of the NMA API and avoid rate limits, you should go and get yourself an API key. Once you have your key, paste it in below.
# Paste your key in between the quotes
API_KEY = 'YOUR API KEY'
def get_total(endpoint, params, headers):
    '''
    Get the total number of results.
    '''
    response = s.get(endpoint, headers=headers, params=params)
    data = response.json()
    return data['meta']['results']
def harvest_records(record_type):
    # Put the API key in the request headers
    headers = {
        'apikey': API_KEY
    }
    # Set basic params
    params = {
        'text': '*',
        'limit': 100,  # Number of records per request
        'offset': 0  # We'll change this as we loop through the results
    }
    # Create a db to hold the results
    db = TinyDB('nma_{}_db.json'.format(record_type))
    # Get the endpoint for this type of record
    endpoint = API_BASE_URL.format(record_type)
    # Are there more records? We'll check this on each request.
    more = True
    # Get the total number of records
    total_records = get_total(endpoint, params, headers)
    # Make a progress bar
    with tqdm(total=total_records) as pbar:
        # Continue while 'more' is True
        while more:
            # Get the data
            response = s.get(endpoint, headers=headers, params=params)
            data = response.json()
            # Insert the records (in the 'data' field) into the db
            db.insert_multiple(data['data'])
            # If there's no 'next' link, set more to False
            more = data.get('links', {}).get('next', False)
            # Increase the offset by the number of records per request
            params['offset'] += params['limit']
            # Update the progress bar
            pbar.update(len(data['data']))
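The pagination logic above can be sketched in isolation, run against faked pages instead of the live API, to show how the `offset` parameter and the `next` link together drive the loop. The page contents here are invented placeholders:

```python
# Simulate an API that serves 250 records in pages of 100,
# including a 'next' link only while more pages remain
def fake_get(offset, limit=100, total=250):
    """Return a page shaped like an NMA API response."""
    records = [{"id": i} for i in range(offset, min(offset + limit, total))]
    links = {"next": "..."} if offset + limit < total else {}
    return {"meta": {"results": total}, "data": records, "links": links}

harvested = []
params = {"limit": 100, "offset": 0}
more = True
while more:
    data = fake_get(params["offset"], params["limit"])
    harvested.extend(data["data"])
    # Stop once the API no longer supplies a 'next' link
    more = data.get("links", {}).get("next", False)
    params["offset"] += params["limit"]

print(len(harvested))
```

The loop makes one final, partial request (50 records) and then stops because the last page has no `next` link, so all 250 records end up harvested exactly once.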
harvest_records('place')
harvest_records('party')
harvest_records('object')
Created by Tim Sherratt for the GLAM Workbench.
Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.