Convert a Trove list into a CSV file

This notebook converts Trove lists into CSV files (spreadsheets). Separate CSV files are created for newspaper articles and works from Trove's other zones. You can also save the OCRd text, a PDF, and an image of each newspaper article.

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

Some tips:

  • Code cells have boxes around them.
  • To run a code cell, click on the cell and then hit Shift+Enter. The Shift+Enter combo will also move you to the next cell, so it's a quick way to work through the notebook.
  • While a cell is running, a * appears in the square brackets next to the cell. Once the cell has finished running, the asterisk will be replaced with a number.
  • In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.
  • To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.

Add your values to these two cells

This is the only section that you'll need to edit. Paste your API key and list id in the cells below as indicated.

If necessary, follow the instructions in the Trove Help to obtain your own Trove API Key.

The list id is the number in the url of your Trove list. So the list with this url https://trove.nla.gov.au/list/83774 has an id of 83774.
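
If you'd rather not pick the number out of the url by hand, a regular expression can extract it for you. The cell below is just an optional convenience sketch (the example_url variable is only there for illustration), not one of the notebook's required steps.

In [ ]:
# Optional sketch: extract the list id from a Trove list url
import re

example_url = 'https://trove.nla.gov.au/list/83774'
match = re.search(r'/list/(\d+)', example_url)
if match:
    print('The list id is: {}'.format(match.group(1)))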

In [ ]:
# Paste your API key between the quotes, and then run the cell
API_KEY = 'YOUR API KEY GOES HERE'
print('Your API key is: {}'.format(API_KEY))

Paste your list id below, and set your preferences for saving newspaper articles.

In [ ]:
# Paste your list id between the quotes, and then run the cell
list_id = '83777'

# If you don't want to save all the OCRd text, change True to False below
save_texts = True

# Change this to True if you want to save PDFs of newspaper articles
save_pdfs = False

# Change this to False if you don't want to save images of newspaper articles
save_images = True

Set things up

Run the cell below to import the necessary libraries and create a requests session that automatically retries failed requests.

In [ ]:
import requests
from requests.exceptions import HTTPError, Timeout
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
import pandas as pd
import os
import re
import shutil
from tqdm.auto import tqdm
from IPython.display import display, HTML
from pathlib import Path
from bs4 import BeautifulSoup
from PIL import Image
from io import BytesIO
import time

s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 500, 502, 503, 504 ])
s.mount('http://', HTTPAdapter(max_retries=retries))
s.mount('https://', HTTPAdapter(max_retries=retries))

Define some functions

Run the cell below to set up all the functions we'll need for the conversion.

In [ ]:
def listify(value):
    '''
    Sometimes values can be lists and sometimes not.
    Turn them all into lists to make life easier.
    '''
    if isinstance(value, (str, int)):
        value = [str(value)]
    return value

def get_url(identifiers, linktype):
    '''
    Loop through the identifiers to find the request url.
    '''
    url = ''
    for identifier in identifiers:
        if identifier['linktype'] == linktype:
            url = identifier['value']
            break
    return url

def save_as_csv(list_dir, data, data_type):
    df = pd.DataFrame(data)
    df.to_csv('{}/{}-{}.csv'.format(list_dir, list_id, data_type), index=False)
    
def make_filename(article):
    '''
    Create a filename for a text file or PDF.
    For easy sorting/aggregation the filename has the format:
        PUBLICATIONDATE-NEWSPAPERID-ARTICLEID
    '''
    date = article['date']
    date = date.replace('-', '')
    newspaper_id = article['newspaper_id']
    article_id = article['id']
    return '{}-{}-{}'.format(date, newspaper_id, article_id)

def get_list(list_id):
    list_url = f'https://api.trove.nla.gov.au/v2/list/{list_id}?encoding=json&reclevel=full&include=listItems&key={API_KEY}'
    response = s.get(list_url)
    return response.json()

def get_article(id):
    article_api_url = f'https://api.trove.nla.gov.au/v2/newspaper/{id}/?encoding=json&reclevel=full&include=articletext&key={API_KEY}'
    response = s.get(article_api_url)
    return response.json()

def make_dirs(list_id):
    list_dir = Path('data', 'converted-lists', list_id)
    list_dir.mkdir(parents=True, exist_ok=True)
    Path(list_dir, 'text').mkdir(exist_ok=True)
    Path(list_dir, 'image').mkdir(exist_ok=True)
    Path(list_dir, 'pdf').mkdir(exist_ok=True)
    return list_dir

def ping_pdf(ping_url):
    '''
    Check to see if a PDF is ready for download.
    If a 200 status code is received, return True.
    '''
    ready = False
    try:
        response = s.get(ping_url, timeout=30)
        response.raise_for_status()
    except HTTPError:
        if response.status_code == 423:
            ready = False
        else:
            raise
    else:
        ready = True
    return ready
    
def get_pdf_url(article_id, zoom=3):
    '''
    Get the url for the PDF version of an article.
    PDFs can take a while to generate, so we ping the server to check that the file is ready before returning the url.
    '''
    pdf_url = None
    # Ask for the PDF to be created
    prep_url = f'https://trove.nla.gov.au/newspaper/rendition/nla.news-article{article_id}/level/{zoom}/prep'
    response = s.get(prep_url)
    # Get the hash
    prep_id = response.text
    # Url to check if the PDF is ready
    ping_url = f'https://trove.nla.gov.au/newspaper/rendition/nla.news-article{article_id}.{zoom}.ping?followup={prep_id}'
    tries = 0
    ready = False
    time.sleep(2)  # Give some time to generate pdf
    # Are you ready yet?
    while ready is False and tries < 5:
        ready = ping_pdf(ping_url)
        if not ready:
            tries += 1
            time.sleep(2)
    # Download if ready
    if ready:
        pdf_url = f'https://trove.nla.gov.au/newspaper/rendition/nla.news-article{article_id}.{zoom}.pdf?followup={prep_id}'
    return pdf_url

def get_box(zones):
    '''
    Loop through all the zones to find the outer limits of each boundary.
    Return a bounding box around the article.
    '''
    left = 10000
    right = 0
    top = 10000
    bottom = 0
    page_id = zones[0]['data-page-id']
    for zone in zones:
        if int(zone['data-y']) < top:
            top = int(zone['data-y'])
        if int(zone['data-x']) < left:
            left = int(zone['data-x'])
        if (int(zone['data-x']) + int(zone['data-w'])) > right:
            right = int(zone['data-x']) + int(zone['data-w'])
        if (int(zone['data-y']) + int(zone['data-h'])) > bottom:
            bottom = int(zone['data-y']) + int(zone['data-h'])
    return {'page_id': page_id, 'left': left, 'top': top, 'right': right, 'bottom': bottom}

def get_article_boxes(article_url):
    '''
    Positional information about the article is attached to each line of the OCR output in data attributes.
    This function loads the HTML version of the article and scrapes the x, y, and width values for each line of text
    to determine the coordinates of a box around the article.
    '''
    boxes = []
    response = s.get(article_url)
    soup = BeautifulSoup(response.text, 'lxml')
    # Lines of OCR are in divs with the class 'zone'
    # 'onPage' limits to those on the current page
    zones = soup.select('div.zone.onPage')
    boxes.append(get_box(zones))
    off_page_zones = soup.select('div.zone.offPage')
    if off_page_zones:
        current_page = off_page_zones[0]['data-page-id']
        zones = []
        for zone in off_page_zones:
            if zone['data-page-id'] == current_page:
                zones.append(zone)
            else:
                boxes.append(get_box(zones))
                zones = [zone]
                current_page = zone['data-page-id']
        boxes.append(get_box(zones))
    return boxes

def get_page_images(list_dir, article, size=3000):
    '''
    Extract an image of the article from the page image(s), save it, and return the filename(s).
    '''
    # Get position of article on the page(s)
    boxes = get_article_boxes(f'http://nla.gov.au/nla.news-article{article["id"]}')
    image_filename = make_filename(article)
    for box in boxes:
        # print(box)
        # Construct the url we need to download the page image
        page_url = f'https://trove.nla.gov.au/ndp/imageservice/nla.news-page{box["page_id"]}/level7'
        # Download the page image
        response = s.get(page_url)
        # Open download as an image for editing
        img = Image.open(BytesIO(response.content))
        # Use the coordinates of the article's bounding box to crop the page image
        points = (box['left'], box['top'], box['right'], box['bottom'])
        # Crop image to article box
        cropped = img.crop(points)
        # Resize if necessary (LANCZOS is the current name for the old ANTIALIAS filter)
        if size:
            cropped.thumbnail((size, size), Image.LANCZOS)
        # Save the cropped image
        cropped_file = Path(list_dir, 'image', f'{image_filename}-{box["page_id"]}.jpg')
        cropped.save(cropped_file)

def harvest_list(list_id, save_texts=True, save_pdfs=False, save_images=False):
    list_dir = make_dirs(list_id)
    data = get_list(list_id)
    works = []
    articles = []
    for item in tqdm(data['list'][0]['listItem']):
        for zone, record in item.items():
            if zone == 'work':
                work = {
                    'id': record.get('id', ''),
                    'title': record.get('title', ''),
                    'type': '|'.join(listify(record.get('type', ''))),
                    'issued': '|'.join(listify(record.get('issued', ''))),
                    'contributor': '|'.join(listify(record.get('contributor', ''))),
                    'trove_url': record.get('troveUrl', ''),
                    'fulltext_url': get_url(record.get('identifier', ''), 'fulltext'),
                    'thumbnail_url': get_url(record.get('identifier', ''), 'thumbnail')
                }
                works.append(work)
            elif zone == 'article':
                article = {
                    'id': record.get('id'),
                    'title': record.get('heading', ''),
                    'category': record.get('category', ''),
                    'date': record.get('date', ''),
                    'newspaper_id': record.get('title', {}).get('id'),
                    'newspaper_title': record.get('title', {}).get('value'),
                    'page': record.get('page', ''),
                    'page_sequence': record.get('pageSequence', ''),
                    'trove_url': f'http://nla.gov.au/nla.news-article{record.get("id")}'
                }
                full_details = get_article(record.get('id'))
                article['words'] = full_details['article'].get('wordCount', '')
                article['illustrated'] = full_details['article'].get('illustrated', '')
                article['corrections'] = full_details['article'].get('correctionCount', '')
                if 'trovePageUrl' in full_details['article']:
                    page_id = re.search(r'page\/(\d+)', full_details['article']['trovePageUrl']).group(1)
                    article['page_url'] = f'http://trove.nla.gov.au/newspaper/page/{page_id}'
                else:
                    article['page_url'] = ''
                filename = make_filename(article)
                if save_texts:
                    text = full_details['article'].get('articleText')
                    if text:
                        # Strip HTML tags and squash whitespace in the OCRd text
                        text = re.sub('<[^<]+?>', '', text)
                        text = re.sub(r'\s\s+', ' ', text)
                        text_file = Path(list_dir, 'text', f'{filename}.txt')
                        with open(text_file, 'wb') as text_output:
                            text_output.write(text.encode('utf-8'))
                if save_pdfs:
                    pdf_url = get_pdf_url(record['id'])
                    if pdf_url:
                        pdf_file = Path(list_dir, 'pdf', f'{filename}.pdf')
                        response = s.get(pdf_url, stream=True)
                        with open(pdf_file, 'wb') as pf:
                            for chunk in response.iter_content(chunk_size=128):
                                pf.write(chunk)
                if save_images:
                    get_page_images(list_dir, article)
                
                articles.append(article)
    if articles:
        save_as_csv(list_dir, articles, 'articles')
    if works:
        save_as_csv(list_dir, works, 'works')
    return works, articles

Let's do it!

Run the cell below to start the conversion.

In [ ]:
works, articles = harvest_list(list_id, save_texts, save_pdfs, save_images)

View the results

You can browse the harvested files in the data/converted-lists/[your list id] directory.
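
If you'd like a quick look at what was saved without leaving the notebook, the optional cell below just lists the files in the harvest directory. It's a convenience sketch that assumes the conversion cell above has already run; it isn't required.

In [ ]:
# Optional sketch: list the files saved in this harvest
list_dir = Path('data', 'converted-lists', list_id)
for harvested_file in sorted(list_dir.glob('**/*')):
    if harvested_file.is_file():
        print(harvested_file)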

Run the cells below for a preview of the CSV files.

In [ ]:
# Preview newspaper articles CSV
df_articles = pd.DataFrame(articles)
df_articles

In [ ]:
# Preview works CSV
df_works = pd.DataFrame(works)
df_works

Download the results

Run the cell below to zip up all the harvested files and create a download link.

In [ ]:
list_dir = Path('data', 'converted-lists', list_id)
shutil.make_archive(list_dir, 'zip', list_dir)
display(HTML(f'<a download="{list_id}.zip" href="{list_dir}.zip">Download your harvest</a>'))

Created by Tim Sherratt for the GLAM Workbench.