Harvesting the text of digitised books (and ephemera)

This notebook harvests metadata and OCRd text from digitised books in Trove. There are three main steps:

  • Harvest metadata of digitised books using the Trove API
  • Extract the number of pages for each book from the Trove web interface (the number of pages is necessary to download the OCRd text)
  • Download the OCRd text for each book

It's not easy to identify all the digitised books with OCRd text in Trove. I'm starting with a search in the book zone for books that include the phrase "nla.obj" and are available online. This currently returns 65,050 results in the web interface. In amongst these are a range of government papers and reports, as well as publications where access to the digital copy is restricted. I think these are mostly recent books submitted in digital form under legal deposit. Things are made even more confusing by the fact that the contents of the 'Books & libraries' category in the web interface are not the same as the contents of the API's book zone. Anyway, I've used the new fullTextInd index to try and filter out works without any OCRd text. This reduces the total to 40,751 results.

But some of those 40,751 results are actually parent records that contain multiple volumes or parts. When I find the number of pages in each book, I also check whether the record is a 'Multi Volume Book' with child works. If it is, I add the child works to the list of books. After this stage there are 42,174 works. However, not all of these records have OCRd text. Parent records of multi-volume works, and ebook formats like PDF or MOBI, don't have individual pages, and therefore don't have any text to download. If we exclude works without pages, there are 31,402 works that might have some OCRd text to download.

After downloading all the OCRd text files (ignoring any that were empty) I ended up with a grand total of 26,762 files.

If you compare the number of downloaded files to the number of records in the CSV file identified as having OCRd text, you'll notice a difference – 26,762 compared to 29,652. After a bit more poking around I realised that there are some duplicates in the list of works. This seems to be because more than one Trove metadata record can point to the same digitised work. As they're not exact duplicates, I've left them in the results, but they're easy enough to remove, as in the sketch below.
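
If you'd rather work with a single row per digitised work, here's a minimal pandas sketch (using the trove_digitised_books_with_ocr.csv file created at the end of this notebook) that drops the duplicates:

In [ ]:
import pandas as pd

# Load the harvested metadata and keep the first record for each digitised work
df = pd.read_csv('trove_digitised_books_with_ocr.csv', keep_default_na=False)
df_unique = df.drop_duplicates(subset='trove_id')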

Looking through the downloaded text files, it's clear that we're getting ephemera (particularly pamphlets and posters) as well as books. There doesn't seem to be an obvious way to filter these out up front, but you could filter afterwards by the number of pages, as in the sketch below.
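
For example, here's a hedged page-count filter, continuing with the dataframe loaded in the sketch above. The 10-page cut-off is an arbitrary assumption, not something built into the harvest:

In [ ]:
# Exclude short works such as pamphlets and posters (threshold is arbitrary)
longer_works = df.loc[df['pages'] >= 10]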

Here's the metadata I've harvested in CSV format (the trove_digitised_books_with_ocr.csv file created below).

This file includes the following columns:

  • title – title of the work
  • url – link to the metadata record in Trove
  • contributors – pipe-separated names of contributors
  • date – publication date
  • format – the type of work, eg 'Book' or 'Government publication', can have multiple values (pipe-separated)
  • fulltext_url – link to the digital version
  • trove_id – unique identifier of the digital version
  • language – main language of the work
  • rights – copyright status
  • pages – number of pages
  • form – work format, generally one of 'Book', 'Multi Volume Book', or 'Digital Publication'
  • volume – volume/part number
  • children – pipe-separated ids of any child works
  • parent – id of parent work (if any)
  • text_downloaded – True/False: has the OCRd text for this work been downloaded?
  • text_file – file name of the downloaded OCRd text (if any)

Browse and download text files from CloudStor:

  • 26,762 text files (about 3.6GB in total) downloaded from the book zone in August 2021 (see the download sketch below).
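
If you just want to grab a single file from the share, a sketch along these lines should work (the file name here is a hypothetical placeholder; substitute a value from the text_file column; the share URL is the one used in the Datasette section below):

In [ ]:
import requests

# Public CloudStor share containing the OCRd text files
cloudstor_url = 'https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL/download'
file_name = 'example-title-nla.obj-123456789.txt'  # hypothetical placeholder
response = requests.get(cloudstor_url, params={'path': file_name})
response.encoding = 'utf-8'
with open(file_name, 'w', encoding='utf-8') as text_file:
    text_file.write(response.text)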

The full list of books in digital format is also available as a searchable database running on Glitch. It includes links to download OCRd text from CloudStor. You can use this database to filter the titles and create your own list of books. Search results can be downloaded in CSV or JSON format.

Setting things up

In [45]:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
from IPython.display import display, FileLink
import pandas as pd
import json
import re
import time
import os
import arrow
from copy import deepcopy
from bs4 import BeautifulSoup
from slugify import slugify
import requests_cache
In [15]:
s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
In [92]:
# Add your Trove API key below
api_key = 'YOUR API KEY'
In [17]:
params = {
    'key': api_key,
    'zone': 'book',
    'q': '"nla.obj" fullTextInd:y', # API v 2.1 added the full text indicator
    'bulkHarvest': 'true',
    'n': 100,
    'encoding': 'json',
    'l-availability': 'y',
    'l-format': 'Book',
    'include': 'links,workversions'
}

Harvest metadata using the API

In [50]:
def get_total_results():
    '''
    Get the total number of results for a search.
    '''
    these_params = params.copy()
    these_params['n'] = 0
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])


def get_fulltext_url(links):
    '''
    Loop through the identifiers to find a link to the full text version of the book.
    '''
    url = None
    for link in links:
        if link['linktype'] == 'fulltext' and 'nla.obj' in link['value']:
            url = link['value']
            break
    return url

def get_version_record(record):
    '''
    Find the version record supplied by the digitised version (metadata source 'ANL:DL').
    '''
    for version in record.get('version', []):
        for version_record in version['record']:
            try:
                if version_record['metadataSource'].get('value') == 'ANL:DL':
                    return version_record
            except (AttributeError, TypeError, KeyError):
                pass
                
def join_list(record, key):
    # A field may have a single value or an array.
    # If it's an array, join the values into a string.
    string_list = ''
    if record:
        value = record.get(key, [])
        if not isinstance(value, list):
            value = [value]
        string_list = '|'.join(value)
    return string_list


def harvest_books():
    '''
    Harvest metadata relating to digitised books.
    '''
    books = []
    total = get_total_results()
    start = '*'
    these_params = params.copy()
    with tqdm(total=total) as pbar:
        while start:
            these_params['s'] = start
            response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
            data = response.json()
            # The nextStart parameter is used to get the next page of results.
            # If there's no nextStart then it means we're on the last page of results.
            try:
                start = data['response']['zone'][0]['records']['nextStart']
            except KeyError:
                start = None
            for record in data['response']['zone'][0]['records']['work']:
                # See if there's a link to the full text version.
                if 'identifier' in record:
                    fulltext_url = get_fulltext_url(record['identifier'])
                    # Save the record if there's a full text link.
                    if fulltext_url:
                        trove_id = re.search(r'(nla\.obj\-\d+)', fulltext_url).group(1)
                        # Get the basic metadata.
                        book = {
                            'title': record.get('title'),
                            'url': record.get('troveUrl'),
                            'contributors': join_list(record, 'contributor'),
                            'date': record.get('issued'),
                            'format': join_list(record, 'type'),
                            'fulltext_url': fulltext_url,
                            'trove_id': trove_id
                        }
                        # Add some extra info if available
                        version = get_version_record(record)
                        book['language'] = join_list(version, 'language')
                        book['rights'] = join_list(version, 'rights')
                        books.append(book)
                        # print(book)
            if not response.from_cache:
                time.sleep(0.2)
            pbar.update(100)
    return books
In [ ]:
# Do the harvest!
books = harvest_books()
In [52]:
len(books)
Out[52]:
40751

Get the number of pages in each book

In order to download the OCRd text we need to know the number of pages in a work. This information isn't available via the API, but it is embedded as a JavaScript variable in each work's HTML page, so we can scrape it from there.

In [55]:
def get_work_data(url):
    '''
    Extract work data in a JSON string from the work's HTML page.
    '''
    response = s.get(url)
    try:
        work_data = re.search(r'var work = JSON\.parse\(JSON\.stringify\((\{.*\})', response.text).group(1)
    except AttributeError:
        work_data = '{}'
    if not response.from_cache:
        time.sleep(0.2)
    return json.loads(work_data)


def get_pages(work):
    '''
    Get the number of pages from the work data.
    '''
    try:
        pages = len(work['children']['page'])
    except KeyError:
        pages = 0
    return pages


def get_volumes(parent_id):
    '''
    Get the ids of volumes that are children of the current record.
    '''
    start_url = 'https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c'
    # The initial startIdx value
    start = 0
    # Number of results per page
    n = 20
    parts = []
    # Keep requesting pages of results until fewer than 20 are returned,
    # which means we've reached the end.
    while n == 20:
        # Get the browse page
        response = s.get(start_url.format(parent_id, start))
        # Beautifulsoup turns the HTML into an easily navigable structure
        soup = BeautifulSoup(response.text, 'lxml')
        # Find all the divs containing issue details and loop through them
        details = soup.find_all(class_='l-item-info')
        for detail in details:
            title = detail.find('h3')
            if title:
                issue_id = title.parent['href'].strip('/')
            else:
                issue_id = detail.find('a')['href'].strip('/')
            # Get the issue id
            parts.append(issue_id)
        if not response.from_cache:
            time.sleep(0.2)
        # Increment the startIdx
        start += n
        # Set n to the number of results on the current page
        n = len(details)
    return parts


def add_pages(books):
    '''
    Add the number of pages to the metadata for each book.
    Add volumes from multi volume books.
    '''
    books_with_pages = []
    for book in tqdm(books):
        # print(book['fulltext_url'])
        work = get_work_data(book['fulltext_url'])
        form = work.get('form')
        pages = get_pages(work)
        book['pages'] = pages
        book['form'] = form
        book['volume'] = ''
        book['parent'] = ''
        book['children'] = ''
        # Multi volume books are containers with child volumes
        # so we have to get the ids of each individual volume and process them
        if pages == 0 and form == 'Multi Volume Book':
            # Get child volumes
            volumes = get_volumes(book['trove_id'])
            # For each volume get details and add as a new book entry
            for index, volume_id in enumerate(volumes):
                volume = book.copy()
                # Add link up to the container
                volume['parent'] = book['trove_id']
                volume['fulltext_url'] = 'http://nla.gov.au/{}'.format(volume_id)
                volume['trove_id'] = volume_id
                work = get_work_data(volume['fulltext_url'])
                form = work.get('form')
                pages = get_pages(work)
                volume['form'] = form
                volume['pages'] = pages
                volume['volume'] = str(index + 1)
                # print(volume)
                books_with_pages.append(volume)
            # Add links from container to volumes
            book['children'] = '|'.join(volumes)
        # print(book)
        books_with_pages.append(book)
    return books_with_pages
In [ ]:
# Add number of pages to the book metadata
books_with_pages = add_pages(deepcopy(books))

Convert and save results

Getting the page numbers takes quite a while, so it's a good idea to save the results to a CSV file before proceeding. That way, you won't have to repeat the process if something goes wrong and you lose the data that's sitting in memory.
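
If you're worried about losing a long-running harvest, one possible alternative to the single add_pages() call above (a sketch only; the batch size and checkpoint file name are arbitrary assumptions) is to process the books in batches and save a checkpoint after each one:

In [ ]:
# Process the books in batches, writing a checkpoint CSV after each batch,
# so a failure only costs you the current batch.
batch_size = 500
books_with_pages = []
for i in tqdm(range(0, len(books), batch_size)):
    books_with_pages.extend(add_pages(deepcopy(books[i:i + batch_size])))
    pd.DataFrame(books_with_pages).to_csv('books_checkpoint.csv', index=False)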

In [57]:
df = pd.DataFrame(books_with_pages)
In [58]:
df.head()
Out[58]:
title url contributors date format fulltext_url trove_id language rights pages form volume parent children
0 Goliath Joe, fisherman / by Charles Thackeray ... https://trove.nla.gov.au/work/10013347 Thackeray, Charles 1900-1919 Book|Book/Illustrated https://nla.gov.au/nla.obj-2831231419 nla.obj-2831231419 English Out of Copyright|http://rightsstatements.org/v... 130 Book
1 Grammar of the Narrinyeri tribe of Australian ... https://trove.nla.gov.au/work/10029401 Taplin, George 1878-1880 Book|Government publication http://nla.gov.au/nla.obj-688657424 nla.obj-688657424 English Out of Copyright|http://rightsstatements.org/v... 24 Book
2 The works of the Rev. Sydney Smith https://trove.nla.gov.au/work/1004403 Smith, Sydney, 1771-1845 1839-1900 Book|Book/Illustrated|Microform https://nla.gov.au/nla.obj-630176596 nla.obj-630176596 English No known copyright restrictions|http://rightss... 65 Book
3 Nellie Doran : a story of Australian home and ... https://trove.nla.gov.au/work/10049667 Miriam Agatha 1914-1923 Book http://nla.gov.au/nla.obj-24357566 nla.obj-24357566 English Out of Copyright|http://rightsstatements.org/v... 246 Book
4 Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... https://trove.nla.gov.au/work/10053234 Germany. Heer. Heereswaffenamt 1942 Book|Book/Illustrated|Government publication https://nla.gov.au/nla.obj-51530748 nla.obj-51530748 German Out of Copyright|http://rightsstatements.org/v... 80 Book
In [59]:
# How many records?
df.shape
Out[59]:
(42174, 14)
In [60]:
# How many have pages?
df.loc[df['pages'] != 0].shape
Out[60]:
(31402, 14)
In [61]:
# How many of each format?
df['form'].value_counts()
Out[61]:
Book                   29069
Digital Publication     9808
Multi Volume Book       2348
Picture                  523
Journal                  357
Manuscript                36
Other - General           14
Map                        2
Other - Australian         1
Name: form, dtype: int64
In [62]:
# Breakdown by language
df['language'].value_counts()
Out[62]:
English                                       25674
                                              14284
Chinese                                        1219
French                                          210
Undetermined                                    193
German                                           92
Japanese                                         63
Dutch                                            56
Australian languages                             55
Austronesian (Other)                             55
Italian                                          31
Latin                                            31
Spanish                                          22
Maori                                            20
Swedish                                          19
Portuguese                                       16
Korean                                           15
Tahitian                                         13
Indonesian                                       12
Danish                                           11
Multiple languages                                8
Tongan                                            7
Greek, Modern (1453- )                            7
Finnish                                           7
Russian                                           6
Norwegian                                         5
Czech                                             4
Samoan                                            4
Thai                                              4
Polish                                            3
Fijian                                            2
Miscellaneous languages                           2
Papiamento                                        2
Malay                                             2
Welsh                                             2
Papuan (Other)                                    2
No linguistic content                             2
Tagalog                                           1
Niger-Kordofanian (Other)                         1
Sanskrit                                          1
Javanese                                          1
pol                                               1
Philippine (Other)                                1
Scottish Gaelic                                   1
Vietnamese                                        1
Yiddish                                           1
Hawaiian                                          1
Creoles and Pidgins, English-based (Other)        1
Irish                                             1
Gã                                                1
Nauru                                             1
Name: language, dtype: int64
In [63]:
# Save as CSV
df.to_csv('trove_digitised_books.csv', index=False)
display(FileLink('trove_digitised_books.csv'))

Download the OCRd texts

In [35]:
# Run this cell if you need to reload the books data from the CSV
df = pd.read_csv('trove_digitised_books.csv', keep_default_na=False)
books_with_pages = df.to_dict('records')
In [64]:
def save_ocr(books, output_dir='text'):
    '''
    Download the OCRd text for each book.
    '''
    os.makedirs(output_dir, exist_ok=True)
    for book in tqdm(books):
        # Default values
        book['text_downloaded'] = False
        book['text_file'] = ''
        if book['pages'] != 0:       
            # print(book['title'])
            # The index value for the last page of a book will be the total pages - 1
            last_page = book['pages'] - 1
            file_name = '{}-{}.txt'.format(slugify(str(book['title'])[:50]), book['trove_id'])
            file_path = os.path.join(output_dir, file_name)
            # Check to see if the file has already been harvested
            if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
                # print('Already saved')
                book['text_file'] = file_name
                book['text_downloaded'] = True
            else:
                url = 'https://trove.nla.gov.au/{}/download?downloadOption=ocr&firstPage=0&lastPage={}'.format(book['trove_id'], last_page)
                # print(url)
                # Get the file
                r = s.get(url)
                # Check there was no error
                if r.status_code == requests.codes.ok:
                    # Check that the file's not empty
                    r.encoding = 'utf-8'
                    if len(r.text) > 0 and not r.text.isspace():
                        # Check that the file isn't HTML (some not found pages don't return 404s)
                        if BeautifulSoup(r.text, 'html.parser').find('html') is None:
                            # If everything's ok, save the file
                            with open(file_path, 'w', encoding='utf-8') as text_file:
                                text_file.write(r.text)
                            # print('Saved')
                            book['text_file'] = file_name
                            book['text_downloaded'] = True
                if not r.from_cache:
                    time.sleep(1)
In [ ]:
save_ocr(books_with_pages, '/Volumes/bigdata/mydata/Trove/books')

Convert and save updated results

The new books list includes the file name of the downloaded text file (if there is one), and a boolean field indicating whether the text has been downloaded.

In [75]:
# Convert this to df
df_downloaded = pd.DataFrame(books_with_pages)
In [76]:
df_downloaded.head()
Out[76]:
title url contributors date format fulltext_url trove_id language rights pages form volume parent children text_downloaded text_file
0 Goliath Joe, fisherman / by Charles Thackeray ... https://trove.nla.gov.au/work/10013347 Thackeray, Charles 1900-1919 Book|Book/Illustrated https://nla.gov.au/nla.obj-2831231419 nla.obj-2831231419 English Out of Copyright|http://rightsstatements.org/v... 130 Book True goliath-joe-fisherman-by-charles-thackeray-wob...
1 Grammar of the Narrinyeri tribe of Australian ... https://trove.nla.gov.au/work/10029401 Taplin, George 1878-1880 Book|Government publication http://nla.gov.au/nla.obj-688657424 nla.obj-688657424 English Out of Copyright|http://rightsstatements.org/v... 24 Book True grammar-of-the-narrinyeri-tribe-of-australian-...
2 The works of the Rev. Sydney Smith https://trove.nla.gov.au/work/1004403 Smith, Sydney, 1771-1845 1839-1900 Book|Book/Illustrated|Microform https://nla.gov.au/nla.obj-630176596 nla.obj-630176596 English No known copyright restrictions|http://rightss... 65 Book True the-works-of-the-rev-sydney-smith-nla.obj-6301...
3 Nellie Doran : a story of Australian home and ... https://trove.nla.gov.au/work/10049667 Miriam Agatha 1914-1923 Book http://nla.gov.au/nla.obj-24357566 nla.obj-24357566 English Out of Copyright|http://rightsstatements.org/v... 246 Book True nellie-doran-a-story-of-australian-home-and-sc...
4 Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... https://trove.nla.gov.au/work/10053234 Germany. Heer. Heereswaffenamt 1942 Book|Book/Illustrated|Government publication https://nla.gov.au/nla.obj-51530748 nla.obj-51530748 German Out of Copyright|http://rightsstatements.org/v... 80 Book True lastkraftwagen-3-t-ford-baumuster-v-3000-s-ger...
In [77]:
# How many have been downloaded?
df_downloaded.loc[df_downloaded['text_downloaded'] == True].shape
Out[77]:
(29652, 16)

Why is the number above different to the number of files actually downloaded? Let's have a look for duplicates.

As you can see below, some digitised works are linked to from multiple metadata records. Hence there are duplicates.

In [78]:
df_downloaded.loc[df_downloaded.duplicated('trove_id', keep=False) == True].sort_values('trove_id')
Out[78]:
title url contributors date format fulltext_url trove_id language rights pages form volume parent children text_downloaded text_file
25788 Three weeks in Southland : being the account o... https://trove.nla.gov.au/work/237350529 Reid, Stuart, active 1884-1885 1885 Book https://nla.gov.au/nla.obj-101207695 nla.obj-101207695 English Out of Copyright|http://rightsstatements.org/v... 66 Book True three-weeks-in-southland-being-the-account-of-...
7469 Three weeks in Southland : being the account o... https://trove.nla.gov.au/work/19178390 Reid, Stuart, active 1884-1885 1885 Book http://nla.gov.au/nla.obj-101207695 nla.obj-101207695 66 Book 2 nla.obj-477008239 True three-weeks-in-southland-being-the-account-of-...
25790 A recent visit to several of the Polynesian is... https://trove.nla.gov.au/work/237350531 Bennett, George, active 1830-1831 1831 Book https://nla.gov.au/nla.obj-101212925 nla.obj-101212925 English No known copyright restrictions|http://rightss... 8 Book True a-recent-visit-to-several-of-the-polynesian-is...
7771 A recent visit to several of the Polynesian is... https://trove.nla.gov.au/work/19241288 Bennett, George, active 1830-1831 1831-1832 Book/Illustrated|Book http://nla.gov.au/nla.obj-101212925 nla.obj-101212925 8 Book True a-recent-visit-to-several-of-the-polynesian-is...
25807 How Capt. Cook died : new light from an old book https://trove.nla.gov.au/work/237350548 1908 Book https://nla.gov.au/nla.obj-101227721 nla.obj-101227721 English No known copyright restrictions|http://rightss... 10 Book True how-capt-cook-died-new-light-from-an-old-book-...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
37508 A Wonderful Illawarra waterfall : a rare beaut... https://trove.nla.gov.au/work/24063846 1895 Book http://nla.gov.au/nla.obj-99671695 nla.obj-99671695 1 Book True a-wonderful-illawarra-waterfall-a-rare-beauty-...
25811 The Results of the census of 1871 : supplement... https://trove.nla.gov.au/work/237350552 1873 Book https://nla.gov.au/nla.obj-99716940 nla.obj-99716940 English No known copyright restrictions|http://rightss... 2 Book True the-results-of-the-census-of-1871-supplement-t...
4099 The Results of the census of 1871 : supplement... https://trove.nla.gov.au/work/17856108 1873 Book http://nla.gov.au/nla.obj-99716940 nla.obj-99716940 2 Book True the-results-of-the-census-of-1871-supplement-t...
25795 Regular packets for Australia : emigration to ... https://trove.nla.gov.au/work/237350536 1850 Book https://nla.gov.au/nla.obj-99727992 nla.obj-99727992 English No known copyright restrictions|http://rightss... 1 Book True regular-packets-for-australia-emigration-to-po...
909 Regular packets for Australia : emigration to ... https://trove.nla.gov.au/work/12328620 1850 Book http://nla.gov.au/nla.obj-99727992 nla.obj-99727992 1 Book True regular-packets-for-australia-emigration-to-po...

6234 rows × 16 columns

In [79]:
# Save as CSV
df_downloaded.to_csv('trove_digitised_books_with_ocr.csv', index=False)
display(FileLink('trove_digitised_books_with_ocr.csv'))

Create a searchable database using Datasette

To make it easy to explore the list of books, let's load the CSV file into Datasette. First we'll drop some columns, do some reordering, and add links to the downloaded text files stored on CloudStor.

In [87]:
df_datasette = df_downloaded.copy()

# Add link to CloudStor
df_datasette['cloudstor_url'] = df_datasette.loc[df_datasette['text_downloaded'] == True]['text_file'].apply(lambda x: f'https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL/download?path={x}')

Remove some columns that aren't going to be useful.

In [88]:
df_datasette = df_datasette[['title', 'contributors', 'date', 'format', 'language', 'rights', 'pages', 'url', 'fulltext_url', 'cloudstor_url', 'form', 'volume', 'parent', 'children']]

Rename columns for clarity.

In [90]:
df_datasette.columns = ['title', 'contributors', 'date', 'format', 'language', 'copyright', 'pages', 'view_details_url', 'view_book_url', 'download_text_url', 'form', 'volume', 'parent', 'children']
df_datasette.head()
Out[90]:
title contributors date format language copyright pages view_details_url view_book_url download_text_url form volume parent children
0 Goliath Joe, fisherman / by Charles Thackeray ... Thackeray, Charles 1900-1919 Book|Book/Illustrated English Out of Copyright|http://rightsstatements.org/v... 130 https://trove.nla.gov.au/work/10013347 https://nla.gov.au/nla.obj-2831231419 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... Book
1 Grammar of the Narrinyeri tribe of Australian ... Taplin, George 1878-1880 Book|Government publication English Out of Copyright|http://rightsstatements.org/v... 24 https://trove.nla.gov.au/work/10029401 http://nla.gov.au/nla.obj-688657424 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... Book
2 The works of the Rev. Sydney Smith Smith, Sydney, 1771-1845 1839-1900 Book|Book/Illustrated|Microform English No known copyright restrictions|http://rightss... 65 https://trove.nla.gov.au/work/1004403 https://nla.gov.au/nla.obj-630176596 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... Book
3 Nellie Doran : a story of Australian home and ... Miriam Agatha 1914-1923 Book English Out of Copyright|http://rightsstatements.org/v... 246 https://trove.nla.gov.au/work/10049667 http://nla.gov.au/nla.obj-24357566 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... Book
4 Lastkraftwagen 3 t Ford : Baumuster V 3000 S :... Germany. Heer. Heereswaffenamt 1942 Book|Book/Illustrated|Government publication German Out of Copyright|http://rightsstatements.org/v... 80 https://trove.nla.gov.au/work/10053234 https://nla.gov.au/nla.obj-51530748 https://cloudstor.aarnet.edu.au/plus/s/ugiw3gd... Book
In [91]:
df_datasette.to_csv('trove-digital-books-datasette.csv', index=False)

Datasette's creator, Simon Willison, has described how you can load your CSV files into Datasette using Glitch. Here's the result – a searchable database of Trove books available in digital form.
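
If you'd rather run Datasette locally instead of on Glitch, one option (a sketch; the database and table names are my own assumptions) is to load the CSV into a SQLite database, which Datasette can serve directly:

In [ ]:
import sqlite3
import pandas as pd

# Load the CSV created above into a SQLite database for Datasette
conn = sqlite3.connect('trove-books.db')
df_books = pd.read_csv('trove-digital-books-datasette.csv', keep_default_na=False)
df_books.to_sql('books', conn, if_exists='replace', index=False)
conn.close()
# From the command line you can then run: datasette trove-books.db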

Some leftover bits used for renaming the text files

In [ ]:
# Rename files to include truncated title of book
# Note: the harvested metadata uses 'trove_id' as the identifier column
for row in df.itertuples():
    try:
        os.rename(os.path.join('text', '{}.txt'.format(row.trove_id)), os.path.join('text', '{}-{}.txt'.format(slugify(str(row.title)[:50]), row.trove_id)))
    except FileNotFoundError:
        pass
In [ ]:
# Convert all filenames back to just nla.obj- form
for filename in [f for f in os.listdir('text') if f.endswith('.txt')]:
    try:
        objname = re.search(r'.*(nla\.obj.*)', filename).group(1)
    except AttributeError:
        # No nla.obj id in the file name, so skip this file
        print(filename)
        continue
    os.rename(os.path.join('text', filename), os.path.join('text', objname))

Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.