If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!
The Trove Newspaper Harvester is a command line tool that helps you download large quantities of digitised newspaper articles from Trove.
Instead of working your way through page after page of search results using Trove’s web interface, the newspaper harvester will save the results of your search to a CSV (spreadsheet) file which you can then filter, sort, or analyse.
Even better, the harvester can save the full OCRd (and possibly corrected) text of each article to an individual file. You could, for example, collect the text of thousands of articles on a particular topic and then feed them to a text analysis engine like Voyant to look for patterns in the language.
If you'd like to install and run the TroveHarvester on your local system, see the installation instructions.
If you'd like to try before you buy, you can run a fully-functional version of the TroveHarvester from this very notebook!
If you were running TroveHarvester on your local system, you could access the basic help information by entering this on the command line:
troveharvester -h
In this notebook you need to use the magic %run command to call the TroveHarvester script. Click on the cell below and hit Shift+Enter to view the TroveHarvester's basic options.
%run -m troveharvester -h
Before we go any further you should make sure you have a Trove API key. For non-commercial projects, you just fill out a simple form and your API key is generated instantly. Follow the instructions in the Trove Help to obtain your own Trove API Key.
Once you've created a key, you can access it at any time on the 'For developers' tab of your Trove user profile.
Copy your API key now, and paste it in the cell below, between the quotes. Then hit Shift+Enter to save your key as a variable called api_key.
api_key = 'ju3rgk0jp354ikmh'
print('Your API key is: {}'.format(api_key))
Your API key is: ju3rgk0jp354ikmh
The TroveHarvester translates queries from the Trove web interface into something that the API can understand. So all you need to do is construct your query using the web interface. Once you're happy with the results you're getting just copy the url.
It's important to note that there are currently a few differences between the indexes used by the web interface and the API, so some queries won't translate directly. For example, the state facet doesn't exist in the API index. If you use the state facet the TroveHarvester will try to replace it with a list of newspapers from that state, but there are now so many newspaper titles that this could fail. Similarly, the API index won't recognise has:corrections. However, most queries should translate without any problems.
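If you're curious about what a query url actually carries, you can pull it apart with Python's standard library. This just illustrates the search parameters embedded in the url — it's not how the harvester performs its translation internally:

```python
from urllib.parse import urlparse, parse_qs

# An example query url copied from the Trove web interface
query = 'https://trove.nla.gov.au/newspaper/result?q=cyclone+wragge&l-category=Article&l-decade=191'

# parse_qs turns the query string into a dict of parameter lists
params = parse_qs(urlparse(query).query)
print(params)
# {'q': ['cyclone wragge'], 'l-category': ['Article'], 'l-decade': ['191']}
```

Here `q` is the search term, while the `l-` parameters are facets — these are what the harvester maps onto the API's own index.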
Once you've constructed your query and copied the url, paste it between the quotes in the cell below and hit Shift+Enter to save it as a variable.
query = 'https://trove.nla.gov.au/newspaper/result?q=cyclone+wragge&l-category=Article&l-decade=191'
By default the harvester will save all the article metadata to a CSV formatted file called results.csv. If you'd like to save the full OCRd text of all the articles, just add the --text parameter. You can also save PDFs of all the articles by adding the --pdf parameter, but be warned that this will slow down your harvest considerably and can consume large amounts of disk space. So use with care!
Now we're ready to start the harvest! Just run the code in the cell below. You can delete the --text parameter if you're not interested in saving the full text of every article.
%run -m troveharvester start $query $api_key --text
You'll know the harvest is finished when the asterisk in the square brackets of the cell above turns into a number.
If the harvest stops before it's finished, you can restart it by running the cell below.
%run -m troveharvester restart
If you want to check the details of a finished harvest, just run the cell below.
%run -m troveharvester report
When you start a new harvest, the harvester looks for a directory called data. Within this directory it creates another directory for your harvest. The name of this directory will be a unix timestamp – a very large number representing the number of seconds since 1 January 1970. This means the directory with the largest number will contain the most recent harvest.
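If you want to check when a particular harvest was started, you can convert its timestamp directory name back into a readable date. A quick sketch, using an example timestamp:

```python
from datetime import datetime, timezone

# An example harvest directory name (a unix timestamp as a string)
timestamp = '1536281447'

# Convert the number of seconds since 1 January 1970 into a date
started = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
print(started.isoformat())
# 2018-09-07T01:30:47+00:00
```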
The harvester saves your results inside this directory. There will be at least two files created for each harvest:
results.csv – a text file containing the details of all harvested articles
metadata.json – a configuration file which stores all the details of the harvest

If you've asked for PDFs or text files, there will be additional directories containing those files.
The results.csv file is a plain text CSV (Comma Separated Values) file. You can open it with any spreadsheet program. The details recorded for each article are:
article_id – a unique identifier for the article
title – the title of the article
newspaper_id – a unique identifier for the newspaper (this can be used to retrieve more information or build a link to the web interface)
newspaper_title – the name of the newspaper
page – page number (of course), but might also indicate the page is part of a supplement or special section
date – in ISO format, YYYY-MM-DD
category – one of 'Article', 'Advertising', 'Detailed lists, results, guides', 'Family Notices', or 'Literature'
words – number of words in the article
illustrated – is it illustrated (values are y or n)
corrections – number of text corrections
url – the persistent url for the article
page_url – the persistent url of the page on which the article is published

Files containing the OCRd text of the articles will be saved in a directory named text. These are just plain text files, stripped of any HTML. These files include some basic metadata in their file titles – the date of the article, the id number of the newspaper, and the id number of the article. So, for example, the filename 19460104-1002-206680758.txt tells you:

19460104 – the article was published on 4 January 1946 (YYYYMMDD)
1002 – the article was published in The Tribune
206680758 – the article's unique identifier

As you can see, you can use the newspaper and article ids to create direct links into Trove:
https://trove.nla.gov.au/newspaper/title/[newspaper id]
http://nla.gov.au/nla.news-article[article id]
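Putting this together, here's a small sketch that parses a harvested text filename and builds both links. The trove_links helper is hypothetical — it just applies the url patterns above:

```python
def trove_links(filename):
    """Build direct Trove links from the ids embedded in a harvested
    text filename like '19460104-1002-206680758.txt'."""
    # Filenames are [date]-[newspaper id]-[article id].txt
    date, newspaper_id, article_id = filename.replace('.txt', '').split('-')
    return {
        'newspaper': 'https://trove.nla.gov.au/newspaper/title/{}'.format(newspaper_id),
        'article': 'http://nla.gov.au/nla.news-article{}'.format(article_id),
    }

links = trove_links('19460104-1002-206680758.txt')
print(links['article'])
# http://nla.gov.au/nla.news-article206680758
```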
Browse the contents of the data directory to find the results of your harvest.
Once your harvest is complete, you probably want to download the results. The easiest way to do this is to zip up the results folder. Run the following cell to zip up the folder containing all the data from your most recent harvest.
import shutil
import os
# List all the harvest folders and sort by date
harvests = sorted([d for d in os.listdir('data') if os.path.isdir(os.path.join('data', d))])
# Get the most recent
timestamp = harvests[-1]
# Zip up the folder
shutil.make_archive(os.path.join('data', timestamp), 'zip', os.path.join('data', timestamp))
'/Users/tim/Dropbox/working_code/glam-workbench-presentations/notebooks/trove/data/1536281447.zip'
Once your zip file has been created you can find it in the data directory. Or just run the cell below to create a handy download link.
from IPython.core.display import display, HTML
display(HTML('<a target="_blank" href="data/{}.zip">Download your harvest</a>'.format(timestamp)))
Have a look at Exploring your TroveHarvester data for some ideas.
Created by Tim Sherratt (@wragge) as part of the OzGLAM workbench.
If you think this project is worthwhile you can support it on Patreon.