Trove includes more than 370,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. You can view them all in Trove by searching for nuc:"APAR:PR" in the journals zone.
This notebook shows you how to harvest both metadata and full text from a search of the parliamentary press releases. The metadata is available from Trove, but to get the full text we have to go back to the Parliamentary Library's database, ParlInfo. The code in this notebook updates my original GitHub repository.
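There are two main steps: harvesting metadata for the matching press releases from the Trove API, then using the links in that metadata to save the full text of each press release from ParlInfo.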
Sometimes multiple press releases can be grouped together as 'works' in Trove. This is because Trove thinks that they're versions of the same thing. Indeed, there are multiple versions of some press releases. For example, sometimes the office of a Minister and the Minister's department both issue a copy of the same press release or transcript. But these versions are not always identical, and sometimes Trove has grouped press releases together incorrectly. To make sure that we harvest as many individual press releases as possible, the code below unpacks any versions contained within a 'work' and turns them into individual records. This means there will be more duplicates, but it also means you can explore how the versions might differ.
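To make the unpacking concrete, here is a minimal sketch using a made-up work record. The field layout (a `version` list, a `record` that may be a dict or a list, and a space-separated `id` string) mirrors what the harvesting code in this notebook expects, but the sample data and the `unpack_versions` name are assumptions for illustration only.

```python
def unpack_versions(work):
    """Flatten a Trove-style 'work' into one dict per version record."""
    unpacked = []
    for version in work.get('version', []):
        # A version's id can hold several ids in one space-separated string
        ids = version['id'].split()
        records = version['record']
        # A single record arrives as a dict rather than a list
        if not isinstance(records, list):
            records = [records]
        for version_id, record in zip(ids, records):
            unpacked.append({'version_id': version_id, 'title': record.get('title')})
    return unpacked


# Hypothetical sample data, not a real API response
sample_work = {
    'version': [
        {'id': '111', 'record': {'title': 'Transcript of doorstop'}},
        {'id': '222 333', 'record': [
            {'title': 'Press release (Minister)'},
            {'title': 'Press release (Department)'},
        ]},
    ]
}

flat = unpack_versions(sample_work)
print(len(flat))  # 3
```

One work with two versions becomes three records here, which is why the harvested record count can exceed the number of search results.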
It looks like the earlier documents have been OCRd and the results are quite variable. If you follow the fulltext_url link you should be able to view a PDF version for comparison.
It also seems that some documents only have a PDF version and not any OCRd text. These documents will be ignored by the save_texts() function, so you might end up with fewer texts than records.
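If you want to know which records ended up without a text file, something like the following might help. It is only a sketch: `find_missing_texts` is a hypothetical helper, and it assumes the text files are named so that they end with `-<version_id>.txt`, as the file names built by `save_texts()` in this notebook do.

```python
import re
import tempfile
from pathlib import Path


def find_missing_texts(records, text_dir):
    """Return the records that have no matching text file in text_dir."""
    saved_ids = set()
    for path in Path(text_dir).glob('*.txt'):
        # File names are assumed to end with '-<version_id>.txt'
        match = re.search(r'-(\d+)\.txt$', path.name)
        if match:
            saved_ids.add(match.group(1))
    return [r for r in records if str(r['version_id']) not in saved_ids]


# Hypothetical demonstration using a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, '1991-09-16-evans-gareth-214098272.txt').write_text('text', encoding='utf-8')
    demo_records = [{'version_id': '214098272'}, {'version_id': '211168619'}]
    missing = find_missing_texts(demo_records, tmp)

print([r['version_id'] for r in missing])  # ['211168619']
```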
The copyright statement attached to each record in Trove reads:
Copyright remains with the copyright holder. Contact the Australian Copyright Council for further information on your rights and responsibilities.
So depending on what you want to do with them, you might need to contact individual copyright holders for permission.
I've used this notebook to update an example dataset relating to refugees that I first generated in December 2017. It's been created by searching the press releases for the terms 'illegal arrival', 'immigrant', 'immigrants', 'asylum seeker', 'boat people', 'refugee', and 'boat arrivals'. The exact query used is:
nuc:"APAR:PR" AND ("illegal arrival" OR text:"immigrant" OR text:"immigrants" OR "asylum seeker" OR "boat people" OR refugee OR "boat arrivals")
You can view the results of this query on Trove.
After unpacking the versions and harvesting the available texts I ended up with 12,619 text files. You can browse the files on CloudStor, or download the complete dataset as a zip file (43 MB).
In the cell below you need to insert your search query and your Trove API key.
The search query can be anything you would enter in the Trove search box. As you can see from the examples below it can include phrases, exact phrases, and boolean operators (AND, OR, and NOT).
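The harvesting code in this notebook wraps whatever query you supply in parentheses and joins it to the `nuc:"APAR:PR"` filter with AND, so boolean operators inside your query stay grouped together. The sketch below shows that wrapping on a few sample queries; `build_trove_query` is a hypothetical name used only for this illustration.

```python
def build_trove_query(query):
    """Wrap a user query so it only matches the press release collection."""
    return 'nuc:"APAR:PR" AND ({})'.format(query)


examples = [
    'atomic',                             # single keyword
    '"asylum seeker" OR "boat people"',   # exact phrases joined with OR
    'refugee NOT "boat arrivals"',        # excluding a phrase
]
for example in examples:
    print(build_trove_query(example))
```

Without the parentheses, a query such as `a OR b` would combine with the collection filter in a way that depends on operator precedence.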
You can get a Trove API key by following these instructions.
You can change output_dir to save the results to a specific directory on your machine.
# Insert your query between the single quotes.
# query = '"illegal arrival" OR text:"immigrant" OR text:"immigrants" OR "asylum seeker" OR "boat people" OR refugee OR "boat arrivals"'
query = 'atomic'
# Insert your Trove API key between the single quotes
api_key = 'YOUR API KEY GOES HERE'
# You don't have to change this
output_dir = 'press-releases'
import requests
import time
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup
from slugify import slugify
import pandas as pd
from datetime import datetime
import os
import shutil
from IPython.display import display, HTML, FileLink
from tqdm import tqdm_notebook
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
def get_total_results(params):
    '''
    Get the total number of results for a search.
    '''
    these_params = params.copy()
    these_params['n'] = 0
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=these_params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])
def get_fulltext_url(links):
    '''
    Loop through the identifiers to find a link to the digital version of the journal.
    '''
    url = None
    for link in links:
        if link['linktype'] == 'fulltext':
            url = link['value']
            break
    return url
def get_source(version):
    '''
    Get the metadata source of a version.
    '''
    source = version.get('metadataSource')
    # The source is sometimes a dict with a 'value' key and sometimes a plain string
    if isinstance(source, dict):
        source = source.get('value')
    return source
def harvest_prs(query, api_key):
    '''
    Harvest details of parliamentary press releases using the Trove API.
    This function saves the 'version' level records individually (these are grouped under 'works').
    '''
    # Define parameters for the search -- you could change this of course
    # The nuc:"APAR:PR" limits the results to the Parliamentary Press Releases
    params = {
        'q': 'nuc:"APAR:PR" AND ({})'.format(query),
        'zone': 'article',
        'n': 100,
        'key': api_key,
        'bulkHarvest': 'true',
        'encoding': 'json',
        'include': 'workVersions',
        'l-availability': 'y'
    }
    start = '*'
    total = get_total_results(params)
    records = []
    url = 'https://api.trove.nla.gov.au/v2/result'
    with tqdm_notebook(total=total) as pbar:
        while start:
            params['s'] = start
            response = s.get(url, params=params)
            data = response.json()
            results = data['response']['zone'][0]['records']
            # If there's a nextStart value then we use it to request the next page of results
            start = results.get('nextStart')
            for work in results['work']:
                # Different records can be grouped within works as versions.
                # So we're going to extract each version as a separate record.
                for version in work['version']:
                    # Sometimes there are even versions grouped together in a version... ¯\_(ツ)_/¯
                    # We need to extract their ids from a single string
                    ids = version['id'].split()
                    # This may or may not be a list...
                    if isinstance(version['record'], list):
                        version_records = version['record']
                    else:
                        version_records = [version['record']]
                    # Loop through versions in versions.
                    for index, record in enumerate(version_records):
                        source = get_source(record)
                        if source == 'APAR:PR':
                            # Add the id to the version record
                            record['version_id'] = ids[index]
                            record = clean_metadata(record)
                            records.append(record)
            # Update the progress bar by the number of works on this page
            pbar.update(len(results['work']))
            # Try to avoid hitting the API request limit
            time.sleep(0.2)
    return records
def stringify_values(version, field):
    '''
    If a value is a list, join it into a pipe-separated string.
    Otherwise just return the string value.
    '''
    try:
        if isinstance(version[field], list):
            values = [str(v) for v in version.get(field)]
            value = '|'.join(values)
        else:
            value = version.get(field, '')
    except KeyError:
        value = ''
    return value
def clean_metadata(version):
    '''
    Standardises, cleans, and stringifies record metadata.
    '''
    record = {}
    record['version_id'] = version['version_id']
    record['title'] = version.get('title')
    record['date'] = version.get('date')
    # Fields that can hold multiple values are joined into pipe-separated strings
    record['creators'] = stringify_values(version, 'creator')
    record['subjects'] = stringify_values(version, 'subject')
    record['source'] = stringify_values(version, 'source')
    # Get the fulltext url from the list of identifiers
    try:
        record['fulltext_url'] = get_fulltext_url(version['identifier'])
    except KeyError:
        record['fulltext_url'] = ''
    record['trove_url'] = 'https://trove.nla.gov.au/version/{}'.format(version['version_id'])
    return record
def save_texts(records, output_dir, query):
    '''
    Get the text of press releases in the ParlInfo db.
    This function uses urls harvested from Trove to request press releases from Parlinfo.
    Text is extracted from the HTML files and saved as individual text files.
    '''
    # Loop through all the previously harvested records
    for record in tqdm_notebook(records):
        output_path = os.path.join(output_dir, 'press-releases-{}'.format(slugify(query)), 'texts')
        os.makedirs(output_path, exist_ok=True)
        filename = '{}-{}-{}.txt'.format(record['date'], slugify(record['creators']), record['version_id'])
        file_path = os.path.join(output_path, filename)
        # Skip records without a full text link, and only save files we haven't saved before
        if record['fulltext_url'] and not os.path.exists(file_path):
            # Get the Parlinfo web page
            response = s.get(record['fulltext_url'])
            # Parse web page in Beautiful Soup
            soup = BeautifulSoup(response.text, 'lxml')
            content = soup.find('div', class_='box')
            # If we find some text on the web page then save it.
            if content:
                with open(file_path, 'w', encoding='utf-8') as text_file:
                    # Get the contents of each paragraph and write it to the file
                    for para in content.find_all('p'):
                        text_file.write('{}\n\n'.format(para.get_text().strip()))
            # Pause between requests to avoid overloading ParlInfo
            time.sleep(0.5)
Running the cell below will harvest details of all the press releases matching our query using the Trove API. The results will be saved in the records variable for further use.
records = harvest_prs(query, api_key)
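The harvest works by following a cursor: each page of results may include a `nextStart` value, which is sent back as the `s` parameter to get the next page, and the loop stops when no `nextStart` is returned. The sketch below shows that pattern against a fake, in-memory set of pages rather than the real API; `fetch_all` and `FAKE_PAGES` are hypothetical names, and the page structure is simplified.

```python
def fetch_all(get_page):
    """Collect items by following 'nextStart' cursors until none is returned."""
    items = []
    start = '*'  # Starting cursor, as used by the harvesting code
    while start:
        page = get_page(start)
        items.extend(page['work'])
        # No 'nextStart' key means this was the last page
        start = page.get('nextStart')
    return items


# Hypothetical stand-in for the API: three tiny pages keyed by cursor
FAKE_PAGES = {
    '*': {'work': ['a', 'b'], 'nextStart': 'cursor-1'},
    'cursor-1': {'work': ['c', 'd'], 'nextStart': 'cursor-2'},
    'cursor-2': {'work': ['e']},
}

results = fetch_all(lambda start: FAKE_PAGES[start])
print(results)  # ['a', 'b', 'c', 'd', 'e']
```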
The cells below convert the records variable into a Pandas DataFrame, have a little peek inside, and then save all the harvested metadata as a CSV formatted text file. This file provides an index to the harvested press releases.
df = pd.DataFrame(records)
df.head()
| | creators | date | fulltext_url | source | subjects | title | trove_url | version_id |
|---|---|---|---|---|---|---|---|---|
| 0 | Evans, Gareth | 1991-09-16 | http://parlinfo.aph.gov.au/parlInfo/search/dis... | Minister for Foreign Affairs and Trade | | The International Atomic Energy Agency and the... | https://trove.nla.gov.au/version/214098272 | 214098272 |
| 1 | ALP | 1902-01-01 | http://parlinfo.aph.gov.au/parlInfo/search/dis... | AUSTRALIAN LABOR PARTY | History of the Federal Capital and Parliament ... | Australian Labor Party: 2nd Commonwealth Confe... | https://trove.nla.gov.au/version/211168619 | 211168619 |
| 2 | ALP | 1964-04-27 | http://parlinfo.aph.gov.au/parlInfo/search/dis... | LEADER OF THE OPPOSITION | | Outside control of the liberal party | https://trove.nla.gov.au/version/211168681 | 211168681 |
| 3 | ALP | 1965-06-10 | http://parlinfo.aph.gov.au/parlInfo/search/dis... | LEADER OF THE OPPOSITION | | Decisions of the federal executive | https://trove.nla.gov.au/version/211168736 | 211168736 |
| 4 | ALP | 1964-11-09 | http://parlinfo.aph.gov.au/parlInfo/search/dis... | LEADER OF THE OPPOSITION | | Decisions of the federal executive | https://trove.nla.gov.au/version/211168720 | 211168720 |
Note that the number of records in the harvested data might be different to the number of search results. This is because we've unpacked versions that had been combined into a single work.
# How many records
df.shape
(1771, 8)
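Because versions are unpacked into separate records, some of the harvested records will be copies, or near copies, of each other. One rough way to find candidates for comparison is to group records that share a date and a title. The sketch below assumes records are dicts with `version_id`, `date`, and `title` keys like those harvested above; the `group_possible_duplicates` name and the sample records are made up, and matching on a normalised title is a simplification, not a reliable test of duplication.

```python
from collections import defaultdict


def group_possible_duplicates(records):
    """Group records sharing a date and a normalised title."""
    groups = defaultdict(list)
    for record in records:
        # Normalise case and surrounding whitespace before comparing titles
        title = (record.get('title') or '').strip().lower()
        key = (record.get('date'), title)
        groups[key].append(record['version_id'])
    # Keep only the groups that contain more than one version
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical records for illustration
sample = [
    {'version_id': '1', 'date': '1965-06-10', 'title': 'Decisions of the federal executive'},
    {'version_id': '2', 'date': '1965-06-10', 'title': 'Decisions of the Federal Executive '},
    {'version_id': '3', 'date': '1964-04-27', 'title': 'Outside control of the liberal party'},
]
duplicates = group_possible_duplicates(sample)
print(duplicates)
```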
# Save the data as a CSV file
os.makedirs(os.path.join(output_dir, 'press-releases-{}'.format(slugify(query))), exist_ok=True)
df.to_csv(os.path.join(output_dir, 'press-releases-{}'.format(slugify(query)), 'press-releases-{}.csv'.format(slugify(query))), index=False)
The details we've harvested from the Trove API include a url that points to the full text of the press release in the ParlInfo database. Now we can loop through all those urls, saving the text of the press releases.
# Only run this cell if you need to reload the harvested metadata from the CSV
df = pd.read_csv(os.path.join(output_dir, 'press-releases-{}'.format(slugify(query)), 'press-releases-{}.csv'.format(slugify(query))), keep_default_na=False)
records = df.to_dict('records')
save_texts(records, output_dir, query)
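The text extraction step looks for paragraphs inside a `div` with the class `box`. The notebook does this with Beautiful Soup; the sketch below illustrates the same idea using only Python's standard library `html.parser`, so it can run without extra packages. The `BoxParagraphs` class and the HTML fragment are invented for this example and are not real ParlInfo markup. Unlike `soup.find()`, which stops at the first matching `div`, this version collects paragraphs from every matching `div`.

```python
from html.parser import HTMLParser


class BoxParagraphs(HTMLParser):
    """Collect the text of <p> elements found inside <div class="box">."""

    def __init__(self):
        super().__init__()
        self.box_depth = 0      # Greater than zero while inside the target div
        self.in_para = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            classes = (dict(attrs).get('class') or '').split()
            if self.box_depth or 'box' in classes:
                # Count nested divs so we know when the box closes
                self.box_depth += 1
        elif tag == 'p' and self.box_depth:
            self.in_para = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'div' and self.box_depth:
            self.box_depth -= 1
        elif tag == 'p':
            self.in_para = False

    def handle_data(self, data):
        if self.in_para:
            self.paragraphs[-1] += data


# Hypothetical page fragment, not real ParlInfo markup
html = '''
<div class="nav"><p>Menu</p></div>
<div class="box"><p> First paragraph. </p><div><p>Second <b>one</b>.</p></div></div>
<p>Footer</p>
'''
parser = BoxParagraphs()
parser.feed(html)
text = '\n\n'.join(p.strip() for p in parser.paragraphs)
print(text)
```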
The metadata and text files we've harvested are all sitting in a directory named using the query value. If you're running this notebook on a cloud service, like Binder, you probably want to download it all. Running the cell below will zip up the whole directory and provide a convenient download link.
output_path = os.path.join(output_dir, 'press-releases-{}'.format(slugify(query)))
shutil.make_archive(output_path, 'zip', output_path)
display(HTML('<b>Download results</b>'))
display(FileLink('{}.zip'.format(output_path)))
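If you'd like to check what ends up in the zip file before downloading it, you can list its contents with the standard library's `zipfile` module. The sketch below builds a small throwaway directory to stand in for a harvest (the directory and file names are made up), zips it with `shutil.make_archive()` in the same way as the cell above, and then reads back the names of the archived files.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical harvest directory with a CSV and one text file
    harvest_dir = Path(tmp, 'press-releases-atomic')
    Path(harvest_dir, 'texts').mkdir(parents=True)
    Path(harvest_dir, 'press-releases-atomic.csv').write_text('version_id\n1\n', encoding='utf-8')
    Path(harvest_dir, 'texts', '1991-09-16-evans-gareth-1.txt').write_text('text', encoding='utf-8')

    # make_archive(base_name, format, root_dir) returns the path of the new archive
    archive_path = shutil.make_archive(str(harvest_dir), 'zip', str(harvest_dir))

    # Read back the names stored in the archive
    with zipfile.ZipFile(archive_path) as archive:
        names = sorted(archive.namelist())

print(names)
```

Because the archive is written next to the harvest directory rather than inside it, the zip file does not end up containing itself.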
Created by Tim Sherratt.
Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.