If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!
The Trove Newspaper & Gazette Harvester is a command line tool that helps you download large quantities of digitised articles from Trove's newspapers and gazettes.
Instead of working your way through page after page of search results using Trove’s web interface, the newspaper & gazette harvester will save the results of your search to a CSV (spreadsheet) file which you can then filter, sort, or analyse.
Even better, the harvester can save the full OCRd (and possibly corrected) text of each article to an individual file. You could, for example, collect the text of thousands of articles on a particular topic and then feed them to a text analysis engine like Voyant to look for patterns in the language.
If you'd like to install and run the TroveHarvester on your local system see the GitHub repository.
If you'd like to try before you buy, you can run a fully-functional version of the TroveHarvester from this very notebook!
Run the two cells below to set some things up and check for an existing API key in a file called .env. Don't worry if you don't have a .env file; you can just paste your API key where indicated below.
import os
import shutil
from IPython.display import HTML, display
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv
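If you'd rather not paste your key into the notebook itself, you can create a .env file in the same directory. A minimal sketch of its contents (the variable name TROVE_API_KEY matches what the code further below checks for):

```shell
# Contents of a .env file alongside this notebook
TROVE_API_KEY=your-api-key-goes-here
```

The %dotenv magic above will load this value into the environment when the notebook starts.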
If you were running TroveHarvester on your local system, you could access the basic help information by entering this on the command line:
troveharvester -h
In this notebook environment you need to start with a ! to run the command-line TroveHarvester script. Click on the cell below and hit Shift+Enter to view the TroveHarvester's basic options.
!troveharvester -h
Before we go any further you should make sure you have a Trove API key. For non-commercial projects, you just fill out a simple form and your API key is generated instantly. Follow the instructions in the Trove Help to obtain your own Trove API Key.
Once you've created a key, you can access it at any time on the 'For developers' tab of your Trove user profile.
Copy your API key now and paste it in the cell below, between the quotes. Then hit Shift+Enter to save your key as a variable called API_KEY.
# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")
The TroveHarvester translates queries from the Trove web interface into something that the API can understand, so all you need to do is construct your query using the web interface. Once you're happy with the results you're getting, just copy the URL.
Once you've constructed your query and copied the URL, paste it between the quotes in the cell below and hit Shift+Enter to save it as a variable.
query = "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge%201902&l-artType=newspapers&l-state=Queensland&l-title=840"
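If you're curious about what the harvester will find in your query, you can decode the URL's parameters yourself with Python's standard library. This is just for exploration – the harvester does its own translation – but it shows how the keyword and facet filters are carried in the query string:

```python
from urllib.parse import parse_qs, urlparse

query = "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge%201902&l-artType=newspapers&l-state=Queensland&l-title=840"

# Split the URL into components and decode the query string into a dict
params = parse_qs(urlparse(query).query)

print(params["keyword"])  # ['wragge 1902']
print(params["l-state"])  # ['Queensland']
```

Each l- prefixed parameter corresponds to a facet you selected in the web interface.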
By default the harvester will save all the article metadata to a CSV formatted file called results.csv. If you'd like to save the full OCRd text of all the articles, just add the --text parameter. If you'd like copies of the articles as JPG images, add the --image option. You can also save PDFs of all the articles by adding the --pdf parameter, but be warned that this will slow down your harvest considerably and can consume large amounts of disk space. So use with care!
Now we're ready to start the harvest! Just run the code in the cell below. You can delete the --text parameter if you're not interested in saving the full text of every article. You could also try adding --image to save articles as images (this will slow down the harvest).
!troveharvester start "$query" $API_KEY --text
You'll know the harvest is finished when the asterisk in the square brackets of the cell above turns into a number.
If the harvest stops before it's finished, you can restart it by running the cell below.
!troveharvester restart
If you want to check the details of a finished harvest, just run the cell below.
!troveharvester report
There will be at least three files created for each harvest:

harvester_config.json – a file that captures the parameters used to launch the harvest
ro-crate-metadata.json – a metadata file documenting the harvest in RO-Crate format
results.csv – contains details of all the harvested articles in plain text CSV (Comma Separated Values) format

The details recorded for each article are:
article_id – a unique identifier for the article
title – the title of the article
date – in ISO format, YYYY-MM-DD
page – page number (of course), but might also indicate the page is part of a supplement or special section
newspaper_id – a unique identifier for the newspaper or gazette title (this can be used to retrieve more information or build a link to the web interface)
newspaper_title – the name of the newspaper (or gazette)
category – one of ‘Article’, ‘Advertising’, ‘Detailed lists, results, guides’, ‘Family Notices’, or ‘Literature’
words – number of words in the article
illustrated – is it illustrated (values are y or n)
edition – edition of newspaper (rarely used)
supplement – section of newspaper (rarely used)
section – section of newspaper (rarely used)
url – the persistent url for the article
page_url – the persistent url of the page on which the article is published
snippet – short text sample
relevance – search relevance score of this result
status – some articles that are still being processed will have the status "coming soon" and might be missing other fields
corrections – number of text corrections
last_correction – date of last correction
tags – number of attached tags
comments – number of attached comments
lists – number of lists this article is included in
text – path to text file
pdf – path to PDF file
images – path to image file(s)

If you've asked for text files, PDFs, or images, there will be additional directories containing those files. Files containing the OCRd text of the articles will be saved in a directory named text. These are just plain text files, stripped of any HTML. These files include some basic metadata in their filenames – the date of the article, the id number of the newspaper, and the id number of the article. So, for example, the filename 19460104-1002-206680758.txt tells you:
19460104 – the article was published on 4 January 1946 (YYYYMMDD)
1002 – the article was published in The Tribune
206680758 – the article's unique identifier

As you can see, you can use the newspaper and article ids to create direct links into Trove:
https://trove.nla.gov.au/newspaper/title/[newspaper id]
http://nla.gov.au/nla.news-article[article id]
Similarly, if you've asked for copies of the articles as images, they'll be in a directory named image
. The image file names are similar to the text files, but with an extra id number for the page from which the image was extracted. So, for example, the image filename 19250411-460-140772994-11900413.jpg
tells you:
19250411 – the article was published on 11 April 1925 (YYYYMMDD)
460 – the article was published in The Australasian
140772994 – the article's unique identifier
11900413 – the page's unique identifier (some articles can be split over multiple pages)

Once your harvest is complete, you probably want to download the results. The easiest way to do this is to zip up the results folder. Run the following cell to zip up the folder containing all the data from your most recent harvest.
# List all the harvest folders and sort by date
harvests = sorted(
    [d for d in os.listdir("data") if os.path.isdir(os.path.join("data", d))]
)

# Get the most recent
timestamp = harvests[-1]

# Zip up the folder
shutil.make_archive(
    os.path.join("data", timestamp), "zip", os.path.join("data", timestamp)
)
Once your zip file has been created you can find it in the data directory. Or just run the cell below to create a handy download link.
display(
HTML(
f'<a href="data/{timestamp}.zip" download="{timestamp}.zip">data/{timestamp}.zip</a>'
)
)
Have a look at the Exploring your TroveHarvester data notebook for some ideas.
Created by Tim Sherratt (@wragge) for the GLAM Workbench.
Support this project by becoming a GitHub sponsor.