Using TroveHarvester to get newspaper and gazette articles in bulk

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

Some tips:

  • Code cells have boxes around them.
  • To run a code cell click on the cell and then hit Shift+Enter. The Shift+Enter combo will also move you to the next cell, so it's a quick way to work through the notebook.
  • While a cell is running a * appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.
  • In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.
  • To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.

The Trove Newspaper & Gazette Harvester is a command line tool that helps you download large quantities of digitised articles from Trove's newspapers and gazettes.

Instead of working your way through page after page of search results using Trove’s web interface, the newspaper & gazette harvester will save the results of your search to a CSV (spreadsheet) file which you can then filter, sort, or analyse.

Even better, the harvester can save the full OCRd (and possibly corrected) text of each article to an individual file. You could, for example, collect the text of thousands of articles on a particular topic and then feed them to a text analysis engine like Voyant to look for patterns in the language.

If you'd like to install and run the TroveHarvester on your local system see the GitHub repository.

If you'd like to try before you buy, you can run a fully-functional version of the TroveHarvester from this very notebook!

Getting started

If you were running TroveHarvester on your local system, you could access the basic help information by entering this on the command line:

troveharvester -h

In this notebook you need to use the magic %run command to call the TroveHarvester script. Click on the cell below and hit Shift+Enter to view the TroveHarvester's basic options.

In [ ]:
%run -m troveharvester -- -h

Before we go any further you should make sure you have a Trove API key. For non-commercial projects, you just fill out a simple form and your API key is generated instantly. Follow the instructions in the Trove Help to obtain your own Trove API Key.

Once you've created a key, you can access it at any time on the 'For developers' tab of your Trove user profile.

Copy your API key now, and paste it in the cell below, between the quotes. Then hit Shift+Enter to save your key as a variable called api_key.

In [ ]:
api_key = ''
print('Your API key is: {}'.format(api_key))

What do you want to harvest?

The TroveHarvester translates queries from the Trove web interface into something that the API can understand. So all you need to do is construct your query using the web interface. Once you're happy with the results you're getting just copy the url.

Once you've constructed your query and copied the url, paste it between the quotes in the cell below and hit Shift+Enter to save it as a variable.

In [ ]:
query = ''
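
If you're curious about what a search url actually carries, the sketch below pulls one apart using Python's standard urllib.parse module. The url is a made-up example rather than a real query, and this only illustrates how search parameters are stored in a url; it is not a description of how the TroveHarvester itself processes them.

```python
from urllib.parse import urlparse, parse_qs


def search_parameters(url):
    """Return the query-string parameters of a search url as a dictionary of lists."""
    return parse_qs(urlparse(url).query)


# Hypothetical example url: swap in the one you copied from the Trove web interface
example_url = 'https://trove.nla.gov.au/newspaper/result?q=wragge&sortby=dateAsc'

for name, values in search_parameters(example_url).items():
    print(f'{name}: {", ".join(values)}')
```

Checking the parameters this way is a quick test that you copied the whole url, including any filters you applied.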

Running the harvest

By default the harvester will save all the article metadata to a CSV formatted file called results.csv. If you'd like to save the full OCRd text of all the articles, just add the --text parameter. If you'd like copies of the articles as JPG images, add the --image option. You can also save PDFs of all the articles by adding the --pdf parameter, but be warned that this will slow down your harvest considerably and can consume large amounts of disk space. So use with care!

Now we're ready to start the harvest! Just run the code in the cell below. You can delete the --text parameter if you're not interested in saving the full text of every article. You could also try adding --image to save articles as images (this will slow down the harvest).

In [ ]:
%run -m troveharvester -- start $query $api_key --text

You'll know the harvest is finished when the asterisk in the square brackets of the cell above turns into a number.

If the harvest stops before it's finished, you can restart it by running the cell below.

In [ ]:
%run -m troveharvester -- restart

If you want to check the details of a finished harvest, just run the cell below.

In [ ]:
%run -m troveharvester -- report

Harvest results

When you start a new harvest, the harvester looks for a directory called data. Within this directory it creates another directory for your harvest. The name of this directory is a Unix timestamp – a very large number that represents the number of seconds since 1 January 1970. This means the directory with the largest number will contain the most recent harvest.
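
To make those directory names easier to read, here is a minimal sketch that converts a timestamp into a date and picks the most recent harvest from a list of names. The function names and the example directory names are invented for illustration; they are not part of the TroveHarvester.

```python
from datetime import datetime, timezone


def harvest_started(dir_name):
    """Convert a harvest directory name (a Unix timestamp) into a UTC datetime."""
    return datetime.fromtimestamp(int(dir_name), tz=timezone.utc)


def most_recent(dir_names):
    """Return the numerically largest directory name, ignoring non-numeric names."""
    timestamps = [name for name in dir_names if name.isdigit()]
    return max(timestamps, key=int) if timestamps else None


# Hypothetical directory names, for illustration only
example_dirs = ['1500000000', '1600000000', '.ipynb_checkpoints']
latest = most_recent(example_dirs)
print(latest, harvest_started(latest).isoformat())
```

Comparing the names as numbers rather than as text means the result stays correct even if the names ever differ in length.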

The harvester saves your results inside this directory. There will be at least two files created for each harvest:

  • results.csv – a text file containing the details of all harvested articles
  • metadata.json – a configuration file which stores all the details of the harvest
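
Because metadata.json is an ordinary JSON file, you can inspect it with Python's standard json module. The sketch below makes no assumptions about which fields the file contains; it just loads the file and prints whatever is there. The directory name is a placeholder and the function name is invented for this example.

```python
import json
from pathlib import Path


def load_harvest_metadata(harvest_dir):
    """Read metadata.json from a harvest directory and return its contents."""
    with open(Path(harvest_dir, 'metadata.json'), encoding='utf-8') as json_file:
        return json.load(json_file)


# Hypothetical directory name: replace it with one of your own harvest directories
harvest_dir = Path('data', '1500000000')
if Path(harvest_dir, 'metadata.json').exists():
    print(json.dumps(load_harvest_metadata(harvest_dir), indent=2))
```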

If you’ve asked for text files, images, or PDFs, there will be additional directories containing those files.

The results.csv file is a plain text CSV (Comma Separated Values) file. You can open it with any spreadsheet program. The details recorded for each article are:

  • article_id – a unique identifier for the article
  • title – the title of the article
  • newspaper_id – a unique identifier for the newspaper or gazette (this can be used to retrieve more information or build a link to the web interface)
  • newspaper_title – the name of the newspaper or gazette
  • page – page number (of course), but might also indicate the page is part of a supplement or special section
  • date – in ISO format, YYYY-MM-DD
  • category – one of ‘Article’, ‘Advertising’, ‘Detailed lists, results, guides’, ‘Family Notices’, or ‘Literature’
  • words – number of words in the article
  • illustrated – is it illustrated (values are y or n)
  • corrections – number of text corrections
  • snippet – short text sample
  • url – the persistent url for the article
  • page_url – the persistent url of the page on which the article is published
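
As a small example of working with these columns, the following sketch counts how many harvested articles came from each newspaper, using only Python's standard library. It relies on the newspaper_title column listed above; the function name and the directory name in the path are placeholders invented for this example.

```python
import csv
from collections import Counter
from pathlib import Path


def count_by_newspaper(csv_path):
    """Count the harvested articles per newspaper_title in a results.csv file."""
    with open(csv_path, newline='', encoding='utf-8') as csv_file:
        reader = csv.DictReader(csv_file)
        return Counter(row['newspaper_title'] for row in reader)


# Hypothetical path: replace the timestamp with the name of your own harvest directory
results_path = Path('data', '1500000000', 'results.csv')
if results_path.exists():
    # Show the ten newspapers with the most articles
    for title, total in count_by_newspaper(results_path).most_common(10):
        print(f'{total:>6}  {title}')
```

The same pattern works for any other column: swap newspaper_title for category, for instance, to see how the harvest breaks down by article type.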

Files containing the OCRd text of the articles will be saved in a directory named text. These are just plain text files, stripped of any HTML. The file names include some basic metadata: the date of the article, the id number of the newspaper, and the id number of the article. So, for example, the filename 19460104-1002-206680758.txt tells you:

  • 19460104 – the article was published on 4 January 1946
  • 1002 – the id number of the newspaper
  • 206680758 – the id number of the article

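Similarly, if you've asked for copies of the articles as images, they'll be in a directory named image. The image file names are similar to the text files, but with an extra id number for the page from which the image was extracted. So, for example, the image filename 19250411-460-140772994-11900413.jpg tells you:

  • 19250411 – the article was published on 11 April 1925
  • 460 – the id number of the newspaper
  • 140772994 – the id number of the article
  • 11900413 – the id number of the page from which the image was extracted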

As you can see, you can use the newspaper and article ids to create direct links into Trove:

  • to a newspaper or gazette[newspaper id]
  • to an article[article id]
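
Before you can build links like these you need to pull the ids out of the file names. Here is a minimal sketch that splits a harvested file name into its parts, following the naming pattern described above (date, newspaper id, article id, and a page id for images). The function name and the dictionary keys are invented for this example.

```python
from datetime import datetime
from pathlib import Path


def parse_harvest_filename(filename):
    """Split a harvested text or image file name into its date and id numbers."""
    parts = Path(filename).stem.split('-')
    if len(parts) not in (3, 4):
        raise ValueError(f'Unexpected file name: {filename}')
    details = {
        'date': datetime.strptime(parts[0], '%Y%m%d').date(),
        'newspaper_id': parts[1],
        'article_id': parts[2],
    }
    # Image file names carry an extra id number for the page
    if len(parts) == 4:
        details['page_id'] = parts[3]
    return details


print(parse_harvest_filename('19460104-1002-206680758.txt'))
print(parse_harvest_filename('19250411-460-140772994-11900413.jpg'))
```

The ids are kept as text rather than converted to numbers, since they are only ever used as labels or pasted into urls.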

Download your data

If you're using this notebook through the MyBinder service (it'll say `mybinder` in the url), make sure you download your data once the harvest is finished, as it will not be preserved!

Once your harvest is complete, you probably want to download the results. The easiest way to do this is to zip up the results folder. Run the following cell to zip up the folder containing all the data from your most recent harvest.

In [ ]:
import shutil
import os

# List all the harvest folders and sort by date
harvests = sorted([d for d in os.listdir('data') if os.path.isdir(os.path.join('data', d))])
# Get the most recent
timestamp = harvests[-1]
# Zip up the folder
shutil.make_archive(os.path.join('data', timestamp), 'zip', os.path.join('data', timestamp))

Once your zip file has been created you can find it in the data directory. Or just run the cell below to create a handy download link.

In [ ]:
from IPython.display import display, HTML

display(HTML(f'<a href="data/{timestamp}.zip" download="{timestamp}.zip">data/{timestamp}.zip</a>'))

Explore your data

Have a look at the Exploring your TroveHarvest data notebook for some ideas.

Created by Tim Sherratt (@wragge) for the GLAM Workbench.
Support this project by becoming a GitHub sponsor.