Harvest summary data from Trove lists

Using the Trove API we'll harvest some information about Trove lists and create a dataset containing the following fields:

  • id — the list identifier; you can use this to get more information about a list from either the web interface or the API
  • title
  • number_items — the number of items in the list
  • created — the date the list was created
  • updated — the date the list was last updated
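
As a quick illustration of the dataset's shape, each harvested list ends up as a simple dictionary with those five fields. The values below are invented for the example:

```python
# A hypothetical example of one harvested record -- the field names match
# the dataset described above, but the values are made up for illustration.
sample_list = {
    'id': '12345',                        # list identifier
    'title': 'My example list',           # list title
    'number_items': 7,                    # number of items in the list
    'created': '2017-03-20T11:45:58Z',    # when the list was created
    'updated': '2017-03-20T11:45:58Z',    # when the list was last updated
}

print(sorted(sample_list.keys()))
```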

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

Some tips:

  • Code cells have boxes around them.
  • To run a code cell click on the cell and then hit Shift+Enter. The Shift+Enter combo will also move you to the next cell, so it's a quick way to work through the notebook.
  • While a cell is running, a * appears in the square brackets next to the cell. Once the cell has finished running, the asterisk will be replaced with a number.
  • In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.
  • To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.

Setting up...

In [ ]:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
import pandas as pd
from tqdm.auto import tqdm
import altair as alt
import datetime
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from textblob import TextBlob
from operator import itemgetter
import nltk
from IPython.display import display, HTML
import time
from json import JSONDecodeError
nltk.download('stopwords')
nltk.download('punkt')

s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 500, 502, 503, 504 ])
s.mount('http://', HTTPAdapter(max_retries=retries))
s.mount('https://', HTTPAdapter(max_retries=retries))

Add your Trove API key

In [ ]:
# This creates a variable called 'api_key', paste your key between the quotes
# <-- Then click the run icon 
api_key = 'YOUR API KEY'

# This displays a message with your key
print('Your API key is: {}'.format(api_key))

Set some parameters

You could change the value of q if you only want to harvest a subset of lists.

In [3]:
api_url = 'https://api.trove.nla.gov.au/v2/result'
params = {
    'q': ' ',
    'zone': 'list',
    'encoding': 'json',
    'n': 100,
    's': '*',
    'key': api_key,
    'reclevel': 'full',
    'bulkHarvest': 'true'
}
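
For example, to harvest only the lists matching a particular search term, you could set q to that term instead of leaving it blank. The query below is just an illustration; any Trove search query will work, and 'YOUR API KEY' is a placeholder:

```python
# A hypothetical example -- the same parameters as above, but with 'q' set
# to a search term so only matching lists are harvested.
subset_params = {
    'q': 'family history',   # only harvest lists matching this query
    'zone': 'list',
    'encoding': 'json',
    'n': 100,
    's': '*',
    'key': 'YOUR API KEY',
    'reclevel': 'full',
    'bulkHarvest': 'true'
}
```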

Harvest the data

In [4]:
def get_total():
    '''
    Get the total number of lists, so we can make a nice progress bar.
    '''
    # Use the session created above so requests are retried on server errors
    response = s.get(api_url, params=params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])
In [ ]:
lists = []
total = get_total()
with tqdm(total=total) as pbar:
    while params['s']:
        response = s.get(api_url, params=params)
        try:
            data = response.json()
        except JSONDecodeError:
            # If Trove returns something that isn't JSON, show it and stop
            # rather than requesting the same page forever
            print(response.text)
            break
        else:
            records = data['response']['zone'][0]['records']
            try:
                params['s'] = records['nextStart']
            except KeyError:
                params['s'] = None
            for record in records['list']:
                lists.append({
                    'id': record['id'], 
                    'title': record.get('title', ''), 
                    'number_items': record['listItemCount'], 
                    'created': record['created'],
                    'updated': record['lastupdated']
                })
            # The last page may contain fewer than 100 records
            pbar.update(len(records['list']))
            time.sleep(0.2)
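
The loop above pages through the results using the s parameter, updating it with the nextStart value returned in each response until no nextStart is supplied. The extraction logic can be pulled out into a small function, which also makes it easy to test without hitting the API. The sample response below is a simplified, made-up fragment of what the API returns:

```python
def parse_records(data):
    '''Extract the list records and the next start value from an API response.'''
    records = data['response']['zone'][0]['records']
    rows = []
    for record in records.get('list', []):
        rows.append({
            'id': record['id'],
            'title': record.get('title', ''),
            'number_items': record['listItemCount'],
            'created': record['created'],
            'updated': record['lastupdated']
        })
    # nextStart is missing on the final page, so return None to end the loop
    return rows, records.get('nextStart')

# A made-up response fragment for illustration only
sample_response = {
    'response': {
        'zone': [{
            'records': {
                'list': [{
                    'id': '100638',
                    'title': 'DOHSE',
                    'listItemCount': 7,
                    'created': '2017-03-20T11:45:58Z',
                    'lastupdated': '2017-03-20T11:45:58Z'
                }],
                'nextStart': 'abc123'
            }
        }]
    }
}

rows, next_start = parse_records(sample_response)
```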
In [18]:
# Load data from a previous harvest (skip this if you've just run the harvest above)
df = pd.read_csv('data/trove-lists-2020-09-22.csv', parse_dates=['created'])
df.shape
Out[18]:
(95358, 5)

Inspect the results

In [9]:
df = pd.DataFrame(lists)
df.head()
Out[9]:
id title number_items created updated
0 100638 DOHSE 7 2017-03-20T11:45:58Z 2017-03-20T11:45:58Z
1 10064 Goldsmith Group 17 2011-05-21T10:37:40Z 2012-07-05T12:29:09Z
2 100640 Berrima District Highwaymen 29 2017-03-20T12:20:58Z 2020-10-19T11:37:11Z
3 100643 Preddy, Harold William (WW1 Soldier) 9 2017-03-20T21:43:27Z 2017-03-20T21:43:27Z
4 100645 Women's Peace Army, Victoria 13 2017-03-20T22:59:31Z 2017-03-20T22:59:31Z
In [12]:
df.describe()
Out[12]:
id number_items
count 95358.000000 95358.000000
mean 76292.690839 18.663783
std 41917.889859 82.783191
min 51.000000 0.000000
25% 40438.500000 1.000000
50% 77276.500000 4.000000
75% 112079.750000 12.000000
max 148422.000000 10351.000000

Save the harvested data as a CSV file

In [17]:
csv_file = 'data/trove-lists-{}.csv'.format(datetime.datetime.now().isoformat()[:10])
df.to_csv(csv_file, index=False)
display(HTML('<a target="_blank" href="{}">Download CSV</a>'.format(csv_file)))

How many items are in lists?

In [20]:
total_items = df['number_items'].sum()
print('There are {:,} items in {:,} lists.'.format(total_items, df.shape[0]))
There are 1,779,741 items in 95,358 lists.

What is the biggest list?

In [9]:
# idxmax() returns an index label, so use .loc rather than .iloc
biggest = df.loc[df['number_items'].idxmax()]
biggest
Out[9]:
id                                  71461
title           Victoria and elsewhere...
number_items                        10351
created              2015-04-03T11:50:51Z
updated              2016-02-22T04:27:12Z
Name: 74391, dtype: object
In [10]:
display(HTML('The biggest list is <a target="_blank" href="https://trove.nla.gov.au/list?id={}">{}</a> with {:,} items.'.format(biggest['id'], biggest['title'], biggest['number_items'])))
The biggest list is Victoria and elsewhere... with 10,351 items.

When were they created?

In [15]:
# This makes it possible to include more than 5000 records
# alt.data_transformers.enable('json', urlpath='files')
alt.data_transformers.disable_max_rows()
alt.Chart(df[['created']]).mark_line().encode(
    x='yearmonth(created):T',
    y='count()',
    tooltip=[alt.Tooltip('yearmonth(created):T', title='Month'), alt.Tooltip('count()', title='Lists')]
).properties(width=600)
Out[15]: