Harvesting the text of digitised books (and ephemera)

This notebook harvests metadata and OCRd text from digitised books in Trove. There are three main steps:

  • Harvest metadata of digitised books using the Trove API
  • Extract the number of pages for each book from the Trove web interface (the number of pages is necessary to download the OCRd text)
  • Download the OCRd text for each book

It's not easy to identify all the digitised books with OCRd text in Trove. I'm starting with a search in the book zone for books that include the phrase "nla.obj" and are available online. This currently returns 65,050 results in the web interface. In amongst these are a range of government papers and reports, as well as publications where access to the digital copy is restricted. I think these are mostly recent books submitted in digital form under legal deposit. Things are made even more confusing by the fact that the contents of the 'Books & libraries' category in the web interface are not the same as the contents of the API's book zone. Anyway, I've used the new fullTextInd index to try and filter out works without any OCRd text. This reduces the total to 40,751 results.

But some of those 40,751 results are actually parent records that contain multiple volumes or parts. When I find the number of pages in each book, I also check whether the record is a 'Multi Volume Book' with child works. If it is, I add the child works to the list of books. After this stage there are 42,174 works. However, not all of these records have OCRd text. Parent records of multi-volume works, and ebook formats like PDF or MOBI, don't have individual pages, and therefore don't have any text to download. If we exclude works without pages, there are 31,402 works that might have some OCRd text to download.

After downloading all the OCRd text files (ignoring any that were empty) I ended up with a grand total of 26,762 files.

If you compare the number of downloaded files to the number of records in the CSV file identified as having OCRd text, you'll notice a difference – 26,762 compared to 29,652. After a bit more poking around I realised that there are some duplicates in the list of works. This seems to be because more than one Trove metadata record can point to the same digitised work. As they're not exact duplicates, I've left them in the results, but they're easy enough to remove, as in the sketch below.
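
If you'd rather work with a single row per digitised work, here's a minimal pandas sketch (using the trove_digitised_books_with_ocr.csv file created at the end of this notebook) that drops the duplicates:

In [ ]:
import pandas as pd

# Load the harvested metadata and keep the first record for each digitised work
df = pd.read_csv('trove_digitised_books_with_ocr.csv', keep_default_na=False)
df_unique = df.drop_duplicates(subset='trove_id')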

Looking through the downloaded text files, it's clear that we're getting ephemera (particularly pamphlets and posters) as well as books. There doesn't seem to be an obvious way to filter these out up front, but you could filter afterwards by the number of pages, as in the sketch below.
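
For example, here's a hedged page-count filter, continuing with the dataframe loaded in the sketch above. The 10-page cut-off is an arbitrary assumption, not something built into the harvest:

In [ ]:
# Exclude short works such as pamphlets and posters (threshold is arbitrary)
longer_works = df.loc[df['pages'] >= 10]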

Here's the metadata I've harvested in CSV format (the trove_digitised_books_with_ocr.csv file created below).

This file includes the following columns:

  • title – title of the work
  • url – link to the metadata record in Trove
  • contributors – pipe-separated names of contributors
  • date – publication date
  • format – the type of work, eg 'Book' or 'Government publication', can have multiple values (pipe-separated)
  • fulltext_url – link to the digital version
  • trove_id – unique identifier of the digital version
  • language – main language of the work
  • rights – copyright status
  • pages – number of pages
  • form – work format, generally one of 'Book', 'Multi Volume Book', or 'Digital Publication'
  • volume – volume/part number
  • children – pipe-separated ids of any child works
  • parent – id of parent work (if any)
  • text_downloaded – True/False: has the OCRd text for this work been downloaded?
  • text_file – file name of the downloaded OCRd text (if any)

Browse and download text files from CloudStor:

  • 26,762 text files (about 3.6GB in total) downloaded from the book zone in August 2021 (see the download sketch below).
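
If you just want to grab a single file from the share, a sketch along these lines should work (the file name here is a hypothetical placeholder; substitute a value from the text_file column; the share URL is the one used in the Datasette section below):

In [ ]:
import requests

# Public CloudStor share containing the OCRd text files
cloudstor_url = 'https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL/download'
file_name = 'example-title-nla.obj-123456789.txt'  # hypothetical placeholder
response = requests.get(cloudstor_url, params={'path': file_name})
response.encoding = 'utf-8'
with open(file_name, 'w', encoding='utf-8') as text_file:
    text_file.write(response.text)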

The full list of books in digital format is also available as a searchable database running on Glitch. It includes links to download OCRd text from CloudStor. You can use this database to filter the titles and create your own list of books. Search results can be downloaded in CSV or JSON format.

Setting things up

In [45]:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
from IPython.display import display, FileLink
import pandas as pd
import json
import re
import time
import os
import arrow
from copy import deepcopy
from bs4 import BeautifulSoup
from slugify import slugify
import requests_cache
In [15]:
s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
In [92]:
# Add your Trove API key below
api_key = 'YOUR API KEY'
In [17]:
params = {
    'key': api_key,
    'zone': 'book',
    'q': '"nla.obj" fullTextInd:y', # API v 2.1 added the full text indicator
    'bulkHarvest': 'true',
    'n': 100,
    'encoding': 'json',
    'l-availability': 'y',
    'l-format': 'Book',
    'include': 'links,workversions'
}

Harvest metadata using the API

In [50]:
def get_total_results():
    '''
    Get the total number of results for a search.
    '''
    these_params = params.copy()
    these_params['n'] = 0
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])


def get_fulltext_url(links):
    '''
    Loop through the identifiers to find a link to the full text version of the book.
    '''
    url = None
    for link in links:
        if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:
            url = link['value']
            break
    return url

def get_version_record(record):
    '''
    Find the version record supplied by the digitised version (metadata source 'ANL:DL').
    '''
    for version in record.get('version', []):
        for version_record in version['record']:
            try:
                if version_record['metadataSource'].get('value') == 'ANL:DL':
                    return version_record
            except (AttributeError, TypeError, KeyError):
                pass
                
def join_list(record, key):
    # A field may have a single value or an array.
    # If it's an array, join the values into a string.
    string_list = ''
    if record:
        value = record.get(key, [])
        if not isinstance(value, list):
            value = [value]
        string_list = '|'.join(value)
    return string_list


def harvest_books():
    '''
    Harvest metadata relating to digitised books.
    '''
    books = []
    total = get_total_results()
    start = '*'
    these_params = params.copy()
    with tqdm(total=total) as pbar:
        while start:
            these_params['s'] = start
            response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
            data = response.json()
            # The nextStart parameter is used to get the next page of results.
            # If there's no nextStart then it means we're on the last page of results.
            try:
                start = data['response']['zone'][0]['records']['nextStart']
            except KeyError:
                start = None
            for record in data['response']['zone'][0]['records']['work']:
                # See if there's a link to the full text version.
                if 'identifier' in record:
                    fulltext_url = get_fulltext_url(record['identifier'])
                    # Save the record if there's a full text link.
                    if fulltext_url:
                        trove_id = re.search(r'(nla\.obj\-\d+)', fulltext_url).group(1)
                        # Get the basic metadata.
                        book = {
                            'title': record.get('title'),
                            'url': record.get('troveUrl'),
                            'contributors': join_list(record, 'contributor'),
                            'date': record.get('issued'),
                            'format': join_list(record, 'type'),
                            'fulltext_url': fulltext_url,
                            'trove_id': trove_id
                        }
                        # Add some extra info if available
                        version = get_version_record(record)
                        book['language'] = join_list(version, 'language')
                        book['rights'] = join_list(version, 'rights')
                        books.append(book)
                        # print(book)
            if not response.from_cache:
                time.sleep(0.2)
            pbar.update(100)
    return books
In [ ]:
# Do the harvest!
books = harvest_books()
In [52]:
len(books)
Out[52]:
40751

Get the number of pages in each book

In order to download the OCRd text we need to know the number of pages in a work. This information isn't available via the API, but it is embedded as a JavaScript variable in each work's HTML page, so we can scrape it from there.

In [55]:
def get_work_data(url):
    '''
    Extract work data in a JSON string from the work's HTML page.
    '''
    response = s.get(url)
    try:
        work_data = re.search(r'var work = JSON\.parse\(JSON\.stringify\((\{.*\})', response.text).group(1)
    except AttributeError:
        work_data = '{}'
    if not response.from_cache:
        time.sleep(0.2)
    return json.loads(work_data)


def get_pages(work):
    '''
    Get the number of pages from the work data.
    '''
    try:
        pages = len(work['children']['page'])
    except KeyError:
        pages = 0
    return pages


def get_volumes(parent_id):
    '''
    Get the ids of volumes that are children of the current record.
    '''
    start_url = 'https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c'
    # The initial startIdx value
    start = 0
    # Number of results per page
    n = 20
    parts = []
    # Keep requesting pages of results until fewer than 20 are returned,
    # which means we've reached the end.
    while n == 20:
        # Get the browse page
        response = s.get(start_url.format(parent_id, start))
        # Beautifulsoup turns the HTML into an easily navigable structure
        soup = BeautifulSoup(response.text, 'lxml')
        # Find all the divs containing issue details and loop through them
        details = soup.find_all(class_='l-item-info')
        for detail in details:
            title = detail.find('h3')
            if title:
                issue_id = title.parent['href'].strip('/')
            else:
                issue_id = detail.find('a')['href'].strip('/')
            # Get the issue id
            parts.append(issue_id)
        if not response.from_cache:
            time.sleep(0.2)
        # Increment the startIdx
        start += n
        # Set n to the number of results on the current page
        n = len(details)
    return parts


def add_pages(books):
    '''
    Add the number of pages to the metadata for each book.
    Add volumes from multi volume books.
    '''
    books_with_pages = []
    for book in tqdm(books):
        # print(book['fulltext_url'])
        work = get_work_data(book['fulltext_url'])
        form = work.get('form')
        pages = get_pages(work)
        book['pages'] = pages
        book['form'] = form
        book['volume'] = ''
        book['parent'] = ''
        book['children'] = ''
        # Multi volume books are containers with child volumes
        # so we have to get the ids of each individual volume and process them
        if pages == 0 and form == 'Multi Volume Book':
            # Get child volumes
            volumes = get_volumes(book['trove_id'])
            # For each volume get details and add as a new book entry
            for index, volume_id in enumerate(volumes):
                volume = book.copy()
                # Add link up to the container
                volume['parent'] = book['trove_id']
                volume['fulltext_url'] = 'http://nla.gov.au/{}'.format(volume_id)
                volume['trove_id'] = volume_id
                work = get_work_data(volume['fulltext_url'])
                form = work.get('form')
                pages = get_pages(work)
                volume['form'] = form
                volume['pages'] = pages
                volume['volume'] = str(index + 1)
                # print(volume)
                books_with_pages.append(volume)
            # Add links from container to volumes
            book['children'] = '|'.join(volumes)
        # print(book)
        books_with_pages.append(book)
    return books_with_pages
In [ ]:
# Add number of pages to the book metadata
books_with_pages = add_pages(deepcopy(books))

Convert and save results

Getting the page numbers takes quite a while, so it's a good idea to save the results to a CSV file before proceeding. That way, you won't have to repeat the process if something goes wrong and you lose the data that's sitting in memory.
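
If you're worried about losing a long-running harvest, one possible alternative to the single add_pages() call above (a sketch only; the batch size and checkpoint file name are arbitrary assumptions) is to process the books in batches and save a checkpoint after each one:

In [ ]:
# Process the books in batches, writing a checkpoint CSV after each batch,
# so a failure only costs you the current batch.
batch_size = 500
books_with_pages = []
for i in tqdm(range(0, len(books), batch_size)):
    books_with_pages.extend(add_pages(deepcopy(books[i:i + batch_size])))
    pd.DataFrame(books_with_pages).to_csv('books_checkpoint.csv', index=False)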

In [57]:
df = pd.DataFrame(books_with_pages)
In [58]:
df.head()
Out[58]:
title url contributors date format fulltext_url trove_id language rights pages form volume parent children
0 Goliath Joe, fisherman / by Charles Thackeray ... https://trove.nla.gov.au/work/10013347 Thackeray, Charles 1900-1919 Book|Book/Illustrated https://nla.gov.au/nla.obj-2831231419 nla.obj-2831231419 English Out of Copyright|http://rightsstatements.org/v... 130 Book
1 Grammar of the Narrinyeri tribe of Australian ... https://trove.nla.gov.au/work/10029401 Taplin, George 1878-1880 Book|Government publication http://nla.gov.au/nla.obj-688657424 nla.obj-688657424 English Out of Copyright|http://rightsstatements.org/v... 24 Book
2 The works of the Rev. Sydney Smith https://trove.nla.gov.au/work/1004403 Smith, Sydney, 1771-1845 1839-1900 Book|Book/Illustrated|Microform https://nla.gov.au/nla.obj-630176596 nla.obj-630176596 English No known copyright restrictions|http://rightss... 65 Book
3 Nellie Doran : a story of Australian home and ... https://trove.nla.gov.au/work/10049667 Miriam Agatha 1914-1923 Book http://nla.gov.au/nla.obj-24357566 nla.obj-24357566 English Out of Copyright|http://rightsstatements.org/v... 246 Book
4 Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... https://trove.nla.gov.au/work/10053234 Germany. Heer. Heereswaffenamt 1942 Book|Book/Illustrated|Government publication https://nla.gov.au/nla.obj-51530748 nla.obj-51530748 German Out of Copyright|http://rightsstatements.org/v... 80 Book
In [59]:
# How many records?
df.shape
Out[59]:
(42174, 14)
In [60]:
# How many have pages?
df.loc[df['pages'] != 0].shape
Out[60]:
(31402, 14)
In [61]:
# How many of each format?
df['form'].value_counts()
Out[61]:
Book                   29069
Digital Publication     9808
Multi Volume Book       2348
Picture                  523
Journal                  357
Manuscript                36
Other - General           14
Map                        2
Other - Australian         1
Name: form, dtype: int64
In [62]:
# Breakdown by language
df['language'].value_counts()
Out[62]:
English                                       25674
                                              14284
Chinese                                        1219
French                                          210
Undetermined                                    193
German                                           92
Japanese                                         63
Dutch                                            56
Australian languages                             55
Austronesian (Other)                             55
Italian                                          31
Latin                                            31
Spanish                                          22
Maori                                            20
Swedish                                          19
Portuguese                                       16
Korean                                           15
Tahitian                                         13
Indonesian                                       12
Danish                                           11
Multiple languages                                8
Tongan                                            7
Greek, Modern (1453- )                            7
Finnish                                           7
Russian                                           6
Norwegian                                         5
Czech                                             4
Samoan                                            4
Thai                                              4
Polish                                            3
Fijian                                            2
Miscellaneous languages                           2
Papiamento                                        2
Malay                                             2
Welsh                                             2
Papuan (Other)                                    2
No linguistic content                             2
Tagalog                                           1
Niger-Kordofanian (Other)                         1
Sanskrit                                          1
Javanese                                          1
pol                                               1
Philippine (Other)                                1
Scottish Gaelic                                   1
Vietnamese                                        1
Yiddish                                           1
Hawaiian                                          1
Creoles and Pidgins, English-based (Other)        1
Irish                                             1
Gã                                                1
Nauru                                             1
Name: language, dtype: int64
In [63]:
# Save as CSV
df.to_csv('trove_digitised_books.csv', index=False)
display(FileLink('trove_digitised_books.csv'))

Download the OCRd texts

In [35]:
# Run this cell if you need to reload the books data from the CSV
df = pd.read_csv('trove_digitised_books.csv', keep_default_na=False)
books_with_pages = df.to_dict('records')
In [64]:
def save_ocr(books, output_dir='text'):
    '''
    Download the OCRd text for each book.
    '''
    os.makedirs(output_dir, exist_ok=True)
    for book in tqdm(books):
        # Default values
        book['text_downloaded'] = False
        book['text_file'] = ''
        if book['pages'] != 0:       
            # print(book['title'])
            # The index value for the last page of a book will be the total pages - 1
            last_page = book['pages'] - 1
            file_name = '{}-{}.txt'.format(slugify(str(book['title'])[:50]), book['trove_id'])
            file_path = os.path.join(output_dir, file_name)
            # Check to see if the file has already been harvested
            if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
                # print('Already saved')
                book['text_file'] = file_name
                book['text_downloaded'] = True
            else:
                url = 'https://trove.nla.gov.au/{}/download?downloadOption=ocr&firstPage=0&lastPage={}'.format(book['trove_id'], last_page)
                # print(url)
                # Get the file
                r = s.get(url)
                # Check there was no error
                if r.status_code == requests.codes.ok:
                    # Check that the file's not empty
                    r.encoding = 'utf-8'
                    if len(r.text) > 0 and not r.text.isspace():
                        # Check that the file isn't HTML (some not found pages don't return 404s)
                        if BeautifulSoup(r.text, 'html.parser').find('html') is None:
                            # If everything's ok, save the file
                            with open(file_path, 'w', encoding='utf-8') as text_file:
                                text_file.write(r.text)
                            # print('Saved')
                            book['text_file'] = file_name
                            book['text_downloaded'] = True
                if not r.from_cache:
                    time.sleep(1)
In [ ]:
save_ocr(books_with_pages, '/Volumes/bigdata/mydata/Trove/books')

Convert and save updated results

The new books list includes the file name of the downloaded text file (if there is one), and a boolean field indicating whether the text has been downloaded.

In [75]:
# Convert this to df
df_downloaded = pd.DataFrame(books_with_pages)
In [76]:
df_downloaded.head()
Out[76]:
title url contributors date format fulltext_url trove_id language rights pages form volume parent children text_downloaded text_file
0 Goliath Joe, fisherman / by Charles Thackeray ... https://trove.nla.gov.au/work/10013347 Thackeray, Charles 1900-1919 Book|Book/Illustrated https://nla.gov.au/nla.obj-2831231419 nla.obj-2831231419 English Out of Copyright|http://rightsstatements.org/v... 130 Book True goliath-joe-fisherman-by-charles-thackeray-wob...
1 Grammar of the Narrinyeri tribe of Australian ... https://trove.nla.gov.au/work/10029401 Taplin, George 1878-1880 Book|Government publication http://nla.gov.au/nla.obj-688657424 nla.obj-688657424 English Out of Copyright|http://rightsstatements.org/v... 24 Book True grammar-of-the-narrinyeri-tribe-of-australian-...
2 The works of the Rev. Sydney Smith https://trove.nla.gov.au/work/1004403 Smith, Sydney, 1771-1845 1839-1900 Book|Book/Illustrated|Microform https://nla.gov.au/nla.obj-630176596 nla.obj-630176596 English No known copyright restrictions|http://rightss... 65 Book True the-works-of-the-rev-sydney-smith-nla.obj-6301...
3 Nellie Doran : a story of Australian home and ... https://trove.nla.gov.au/work/10049667 Miriam Agatha 1914-1923 Book http://nla.gov.au/nla.obj-24357566 nla.obj-24357566 English Out of Copyright|http://rightsstatements.org/v... 246 Book True nellie-doran-a-story-of-australian-home-and-sc...
4 Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... https://trove.nla.gov.au/work/10053234 Germany. Heer. Heereswaffenamt 1942 Book|Book/Illustrated|Government publication https://nla.gov.au/nla.obj-51530748 nla.obj-51530748 German Out of Copyright|http://rightsstatements.org/v... 80 Book True lastkraftwagen-3-t-ford-baumuster-v-3000-s-ger...
In [77]:
# How many have been downloaded?
df_downloaded.loc[df_downloaded['text_downloaded'] == True].shape
Out[77]:
(29652, 16)

Why is the number above different to the number of files actually downloaded? Let's have a look for duplicates.

As you can see below, some digitised works are linked to from multiple metadata records. Hence there are duplicates.

In [78]:
df_downloaded.loc[df_downloaded.duplicated('trove_id', keep=False) == True].sort_values('trove_id')
Out[78]:
title url contributors date format fulltext_url trove_id language rights pages form volume parent children text_downloaded text_file
25788 Three weeks in Southland : being the account o... https://trove.nla.gov.au/work/237350529 Reid, Stuart, active 1884-1885 1885 Book https://nla.gov.au/nla.obj-101207695 nla.obj-101207695 English Out of Copyright|http://rightsstatements.org/v... 66 Book True three-weeks-in-southland-being-the-account-of-...
7469 Three weeks in Southland : being the account o... https://trove.nla.gov.au/work/19178390 Reid, Stuart, active 1884-1885 1885 Book http://nla.gov.au/nla.obj-101207695 nla.obj-101207695 66 Book 2 nla.obj-477008239 True three-weeks-in-southland-being-the-account-of-...
25790 A recent visit to several of the Polynesian is... https://trove.nla.gov.au/work/237350531 Bennett, George, active 1830-1831 1831 Book https://nla.gov.au/nla.obj-101212925 nla.obj-101212925 English No known copyright restrictions|http://rightss... 8 Book True a-recent-visit-to-several-of-the-polynesian-is...
7771 A recent visit to several of the Polynesian is... https://trove.nla.gov.au/work/19241288 Bennett, George, active 1830-1831 1831-1832 Book/Illustrated|Book http://nla.gov.au/nla.obj-101212925 nla.obj-101212925 8 Book True a-recent-visit-to-several-of-the-polynesian-is...
25807 How Capt. Cook died : new light from an old book https://trove.nla.gov.au/work/237350548 1908 Book https://nla.gov.au/nla.obj-101227721 nla.obj-101227721 English No known copyright restrictions|http://rightss... 10 Book True how-capt-cook-died-new-light-from-an-old-book-...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
37508 A Wonderful Illawarra waterfall : a rare beaut... https://trove.nla.gov.au/work/24063846 1895 Book http://nla.gov.au/nla.obj-99671695 nla.obj-99671695 1 Book True a-wonderful-illawarra-waterfall-a-rare-beauty-...
25811 The Results of the census of 1871 : supplement... https://trove.nla.gov.au/work/237350552 1873 Book https://nla.gov.au/nla.obj-99716940 nla.obj-99716940 English No known copyright restrictions|http://rightss... 2 Book True the-results-of-the-census-of-1871-supplement-t...
4099 The Results of the census of 1871 : supplement... https://trove.nla.gov.au/work/17856108 1873 Book http://nla.gov.au/nla.obj-99716940 nla.obj-99716940 2 Book True the-results-of-the-census-of-1871-supplement-t...
25795 Regular packets for Australia : emigration to ... https://trove.nla.gov.au/work/237350536 1850 Book https://nla.gov.au/nla.obj-99727992 nla.obj-99727992 English No known copyright restrictions|http://rightss... 1 Book True regular-packets-for-australia-emigration-to-po...
909 Regular packets for Australia : emigration to ... https://trove.nla.gov.au/work/12328620 1850 Book http://nla.gov.au/nla.obj-99727992 nla.obj-99727992 1 Book True regular-packets-for-australia-emigration-to-po...

6234 rows × 16 columns

In [79]:
# Save as CSV
df_downloaded.to_csv('trove_digitised_books_with_ocr.csv', index=False)
display(FileLink('trove_digitised_books_with_ocr.csv'))

Create a searchable database using Datasette

To make it easy to explore the list of books, let's load the CSV file into Datasette. First we'll drop some columns, do some reordering, and add links to the downloaded text files stored on CloudStor.

In [87]:
df_datasette = df_downloaded.copy()

# Add link to CloudStor
df_datasette['cloudstor_url'] = df_datasette.loc[df_datasette['text_downloaded'] == True]['text_file'].apply(lambda x: f'https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL/download?path={x}')

Remove some columns that aren't going to be useful.

In [88]:
df_datasette = df_datasette[['title', 'contributors', 'date', 'format', 'language', 'rights', 'pages', 'url', 'fulltext_url', 'cloudstor_url', 'form', 'volume', 'parent', 'children']]

Rename columns for clarity.

In [90]:
df_datasette.columns = ['title', 'contributors', 'date', 'format', 'language', 'copyright', 'pages', 'view_details_url', 'view_book_url', 'download_text_url', 'form', 'volume', 'parent', 'children']
df_datasette.head()
Out[90]:
title contributors date format language copyright pages view_details_url view_book_url download_text_url form volume parent children
0 Goliath Joe, fisherman / by Charles Thackeray ... Thackeray, Charles 1900-1919 Book|Book/Illustrated English Out of Copyright|http://rightsstatements.org/v... 130 https://trove.nla.gov.au/work/10013347 https://nla.gov.au/nla.obj-2831231419 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... Book
1 Grammar of the Narrinyeri tribe of Australian ... Taplin, George 1878-1880 Book|Government publication English Out of Copyright|http://rightsstatements.org/v... 24 https://trove.nla.gov.au/work/10029401 http://nla.gov.au/nla.obj-688657424 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... Book
2 The works of the Rev. Sydney Smith Smith, Sydney, 1771-1845 1839-1900 Book|Book/Illustrated|Microform English No known copyright restrictions|http://rightss... 65 https://trove.nla.gov.au/work/1004403 https://nla.gov.au/nla.obj-630176596 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... Book
3 Nellie Doran : a story of Australian home and ... Miriam Agatha 1914-1923 Book English Out of Copyright|http://rightsstatements.org/v... 246 https://trove.nla.gov.au/work/10049667 http://nla.gov.au/nla.obj-24357566 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... Book
4 Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... Germany. Heer. Heereswaffenamt 1942 Book|Book/Illustrated|Government publication German Out of Copyright|http://rightsstatements.org/v... 80 https://trove.nla.gov.au/work/10053234 https://nla.gov.au/nla.obj-51530748 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... Book
In [91]:
df_datasette.to_csv('trove-digital-books-datasette.csv', index=False)

Datasette's creator, Simon Willison, has described how you can load your CSV files into Datasette using Glitch. Here's the result – a searchable database of Trove books available in digital form.
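
If you'd rather run Datasette locally instead of on Glitch, one option (a sketch; the database and table names are my own assumptions) is to load the CSV into a SQLite database, which Datasette can serve directly:

In [ ]:
import sqlite3
import pandas as pd

# Load the CSV created above into a SQLite database for Datasette
conn = sqlite3.connect('trove-books.db')
df_books = pd.read_csv('trove-digital-books-datasette.csv', keep_default_na=False)
df_books.to_sql('books', conn, if_exists='replace', index=False)
conn.close()
# From the command line you can then run: datasette trove-books.db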

Some leftover bits used for renaming the text files

In [ ]:
# Rename files to include truncated title of book
# Note: the harvested metadata uses 'trove_id' as the identifier column
for row in df.itertuples():
    try:
        os.rename(os.path.join('text', '{}.txt'.format(row.trove_id)), os.path.join('text', '{}-{}.txt'.format(slugify(str(row.title)[:50]), row.trove_id)))
    except FileNotFoundError:
        pass
In [ ]:
# Convert all filenames back to just nla.obj- form
for filename in [f for f in os.listdir('text') if f.endswith('.txt')]:
    try:
        objname = re.search(r'.*(nla\.obj.*)', filename).group(1)
    except AttributeError:
        # No nla.obj id in the file name, so skip this file
        print(filename)
        continue
    os.rename(os.path.join('text', filename), os.path.join('text', objname))

Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.