Australian GLAM organisations have made a large number of openly-licensed datasets available through government data portals. But they're not always easy to find. Some are in state-based portals, others are in the national portal. And who would go looking for library data in a government data portal anyway?
To encourage people to explore these datasets, I've harvested them all from the different portals and combined them into one big CSV file.
I've harvested data from the following portals: data.qld.gov.au, data.gov.au, data.sa.gov.au, data.nsw.gov.au, data.wa.gov.au, and data.vic.gov.au.
Data.gov.au actually provides two portals – an old one that includes datasets not found in the state portals, and a new one that brings all the state and national datasets together. So why didn't I just harvest everything from the new data.gov.au portal? I did at first, but it soon became apparent that the new portal had problems managing duplicate organisations and datasets, which made the results difficult to use. So now I've gone back to aggregating everything myself.
For each portal, I've used the web interface to manually search for terms like 'library', 'archives', 'records', and 'museum' to find GLAM organisations. This isn't always straightforward. Sometimes the GLAM organisation is identified as an 'organisation' by the data portal. But other times, the GLAM organisation is hidden beneath a parent organisation, and relevant datasets are identified by tags that include the GLAM organisation's name. In some cases there are neither organisations nor tags, and you just have to search for datasets that include the organisation name somewhere in their notes. Because of these inconsistencies, it's entirely possible that I've missed some organisations.
I've saved all of the organisation names, tags, and queries into the portals list you'll see below, along with each portal's API endpoint. Fortunately, all of the portals use CKAN behind the scenes, so the API is consistent. Yay! This makes things so much easier. Unfortunately, Victoria makes you register and get an API key before you can access their CKAN API, so if you want to run this harvest yourself, you'll have to insert your own API key where indicated.
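If you haven't used a CKAN API before, a single request looks something like the sketch below – the endpoint is the real data.qld.gov.au API, but the search term 'library' is just an illustration, not part of the harvest itself.
# A minimal sketch of a CKAN package_search request (illustrative only)
import requests

response = requests.get(
    'https://data.qld.gov.au/api/action/package_search',
    params={'q': 'library', 'rows': 5}
)
result = response.json()['result']
print(result['count'])  # total number of matching packages
for package in result['results']:
    print(package['title'])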
The datasets themselves are arranged in a hierarchy of packages and resources. A package can contain multiple resources, or files. These might be the same data in different formats, data files and documentation, or versions of the data that change over time. I flatten out this hierarchy as I harvest the packages to create a CSV file where each row is a single file. The fields I'm capturing are:
dataset_title – name of the package
publisher – organisation that created/published the package
author – usually an email of the person who uploaded the package
dataset_issued – date the package was created
dataset_modified – date the package was last changed
dataset_description – a description of the package
source – the portal it was harvested from
info_url – a link to the portal page for more information
start_date – earliest date in the data
end_date – latest date in the data
file_title – name of the file (resource)
download_url – url to directly download the data file
format – format of the file, eg. 'CSV' or 'JSON'
file_description – description of the file
file_created – date the file was created
file_modified – date the file was last changed
file_size – size of the file
licence – licence string, eg. 'CC-BY'
You can browse a list of datasets, download a CSV containing all the harvested data, or just the CSVs. You can also search the harvested data using Datasette on Glitch.
To start exploring the contents of the datasets, give the GLAM CSV Explorer a spin.
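Before we get to the harvesting code, here's a simplified, entirely made-up sketch of the package/resource hierarchy that CKAN returns – real records contain many more fields. Each resource in a package becomes one row in the final CSV.
# A simplified, invented example of a CKAN package record
example_package = {
    'title': 'Sample library dataset',
    'organization': {'title': 'Sample State Library'},
    'metadata_created': '2020-01-01T00:00:00',
    'metadata_modified': '2020-06-01T00:00:00',
    'notes': 'A description of the dataset.',
    'resources': [
        {'name': 'Data as CSV', 'url': 'https://example.com/data.csv', 'format': 'CSV'},
        {'name': 'Data as JSON', 'url': 'https://example.com/data.json', 'format': 'JSON'}
    ]
}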
import requests
import json
from json import JSONDecodeError
import pandas as pd
from pandas.errors import ParserError
from urllib.parse import urlparse
import os
import re
import time
from slugify import slugify
from tqdm.notebook import tqdm
from requests_cache import CachedSession
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
s = CachedSession()
retries = Retry(total=10, backoff_factor=1, status_forcelist=[502, 503, 504])
# Mount the retry adapter for both http and https – all of the portal APIs use https
s.mount('http://', HTTPAdapter(max_retries=retries))
s.mount('https://', HTTPAdapter(max_retries=retries))
portals = [
{
'name': 'data.qld.gov.au',
'api_url': 'https://data.qld.gov.au/api/action/',
'orgs': ['state-library-queensland'],
'tags': ['Queensland State Archives', 'queensland state archives'],
'queries': ['Queensland Museum'],
'groups': [],
'base_url': 'https://data.qld.gov.au/dataset/',
'package_ids': []
},
{
'name': 'data.gov.au',
'api_url': 'https://data.gov.au/api/3/action/',
'orgs': [
'aiatsis',
'nationallibraryofaustralia',
'libraries-tasmania',
'nationalarchivesofaustralia',
'national-portrait-gallery'
],
'tags':[],
'queries':[],
'groups': [],
'base_url': 'https://data.gov.au/dataset/',
'package_ids': []
},
{
'name': 'data.sa.gov.au',
'api_url': 'https://data.sa.gov.au/data/api/3/action/',
'orgs': [
'state-library-of-south-australia',
'mount-gambier-library',
'state-records',
'history-sa',
'south-australian-museum'
],
'tags':[],
'queries':[],
'groups': [],
'base_url': 'https://data.sa.gov.au/data/dataset/',
'package_ids': []
},
{
'name': 'data.nsw.gov.au',
'api_url': 'https://data.nsw.gov.au/data/api/3/action/',
'orgs': [
'state-library-of-nsw',
'nsw-state-archives',
'maas',
'australian-museum'
],
'tags':[],
'queries':[],
'groups': [],
'base_url': 'https://data.nsw.gov.au/data/dataset/',
'package_ids': []
},
{
'name': 'data.wa.gov.au',
'api_url': 'https://catalogue.data.wa.gov.au/api/3/action/',
'orgs': [
'state-library-of-western-australia',
'state-records-office-of-western-australia',
'western-australian-museum'
],
'tags':[],
'queries':[],
'groups': [],
'base_url': 'https://catalogue.data.wa.gov.au/dataset/',
'package_ids': []
},
{
'name': 'data.vic.gov.au',
'api_url': 'https://discover.data.vic.gov.au/api/3/action/',
# 'apikey': 'YOUR API KEY',
'orgs': [
'state-library-of-victoria'
],
'tags':[],
'queries':['PROV', 'Public Records Office', 'Museums Victoria'],
'groups': [],
'base_url': 'https://www.data.vic.gov.au/data/dataset/',
'package_ids': []
},
]
def get_value(field):
'''
Sometimes values are strings and sometimes objects in strings.
Get string values.
'''
try:
s = field.replace("u'", "'").replace("'", '"')
j = json.loads(s)
value = j['name']
except JSONDecodeError:
value = field
except AttributeError:
value = None
return value
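A couple of illustrative values (not taken from a real dataset) to show the cases get_value() handles:
# Illustrative values only
get_value('data@library.example.gov.au')  # returns the string unchanged
get_value("{u'name': u'State Library'}")  # returns 'State Library'
get_value(None)                           # returns None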
def fix_github_links(url):
'''
Make sure github links point to downloadable files.
'''
return url.replace('//github.com', '//raw.githubusercontent.com').replace('/blob', '')
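For example, a made-up GitHub 'blob' link gets rewritten to point to the raw file:
# Hypothetical URL, just to show the rewriting
fix_github_links('https://github.com/example/repo/blob/master/data.csv')
# 'https://raw.githubusercontent.com/example/repo/master/data.csv'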
def check_http_status(url):
'''
Do a HEAD request of downloadable datasets to check if they're still there.
'''
response = s.head(url, allow_redirects=True)
return response.status_code
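This isn't used in the harvest below (the call is commented out), but you could check any download link like this – the url here is just a placeholder:
# Placeholder url – substitute a real download_url from the harvested data
check_http_status('https://example.com/data.csv')  # eg 200 if available, 404 if gone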
def get_format(resource):
# First try getting file extension
try:
url = fix_github_links(resource['url'])
        file_format = re.search(r'\.([a-zA-Z]+)$', url).group(1).upper()
# If that fails just use the supplied value (which may be dodgy)
except AttributeError:
file_format = resource['format']
return file_format
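Two made-up resources to show both paths – extracting the format from the file extension, and falling back to the value supplied by the portal:
# Hypothetical resources, for illustration only
get_format({'url': 'https://example.com/files/data.csv', 'format': 'CSV'})  # 'CSV' (from the extension)
get_format({'url': 'https://example.com/api/data', 'format': 'JSON'})  # 'JSON' (falls back to the supplied format)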
def add_key(portal):
'''Add an API KEY into headers.'''
if 'apikey' in portal:
headers = {
'apikey': portal['apikey'],
"Content-Type": "application/json",
"Accept": "application/json"
}
else:
headers = {}
return headers
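For portals without a key this just returns an empty dict; for Victoria it builds the headers CKAN expects (the key value here is a placeholder):
# 'YOUR API KEY' is a placeholder – use your own key from data.vic.gov.au
add_key({'name': 'data.qld.gov.au'})
# {}
add_key({'name': 'data.vic.gov.au', 'apikey': 'YOUR API KEY'})
# {'apikey': 'YOUR API KEY', 'Content-Type': 'application/json', 'Accept': 'application/json'}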
def get_package_resources(package_id, portal, org=None):
    '''
    Given a package id and a portal, download details of all the files (resources)
    associated with that package.
    '''
resources = []
api_url = portal['api_url']
url = '{}package_show?id={}'.format(api_url, package_id)
# print(url)
response = s.get(url, headers=add_key(portal))
package_data = response.json()
try:
title = package_data['result']['title']
except KeyError:
# Not found
pass
else:
if org:
organisation = org
else:
organisation = package_data['result']['organization']['title']
author = get_value(package_data['result']['author'])
try:
date_from = package_data['result']['temporal_coverage_from']
except KeyError:
date_from = ''
try:
date_to = package_data['result']['temporal_coverage_to']
except KeyError:
date_to = ''
for resource in package_data['result']['resources']:
dataset = {}
resource_url = fix_github_links(resource['url'])
dataset['dataset_title'] = title.strip()
dataset['publisher'] = organisation
dataset['author'] = author
dataset['dataset_issued'] = package_data['result']['metadata_created']
dataset['dataset_modified'] = package_data['result']['metadata_modified']
dataset['dataset_description'] = package_data['result']['notes']
dataset['source'] = portal['name']
dataset['info_url'] = portal['base_url'] + package_id
dataset['start_date'] = date_from
dataset['end_date'] = date_to
dataset['file_title'] = resource['name'].strip()
dataset['download_url'] = resource_url
dataset['format'] = get_format(resource)
dataset['file_description'] = resource['description']
dataset['file_created'] = resource['created']
dataset['file_modified'] = resource['last_modified']
dataset['file_size'] = resource['size']
# dataset['status'] = check_http_status(resource_url)
dataset['licence'] = package_data['result']['license_title']
resources.append(dataset)
return resources
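If you want to check a single package without running the whole harvest, you can call this directly – the package id below is just a placeholder:
# 'some-package-id' is a placeholder – substitute a real package id from the portal
sample = get_package_resources('some-package-id', portals[0])
pd.DataFrame(sample)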
def process_packages(url, portal, results_label, org=None):
'''
Get list of packages associated with an organisation, or returned by a search,
then get details of all the files (resources) inside that package.
'''
tqdm.write(url)
resources = []
response = s.get(url, headers=add_key(portal))
data = response.json()
for package in data['result'][results_label]:
resources.extend(get_package_resources(package['id'], portal, org=org))
time.sleep(0.2)
return resources
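The results_label parameter just reflects where CKAN puts the list of packages – package_search responses use 'results', while organization_show and group_show use 'packages'. For example, using the Queensland portal defined above:
# Harvest everything published by the State Library of Queensland
url = portals[0]['api_url'] + 'package_search?fq=organization:state-library-queensland&rows=1000'
slq_resources = process_packages(url, portals[0], 'results')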
def process_portals():
'''
Get all of the resources from the defined portals.
'''
resources = []
for portal in tqdm(portals):
api_url = portal['api_url']
for org in portal['orgs']:
#url = f'{api_url}organization_show?id={org}&include_datasets=true'
url = f'{api_url}package_search?fq=organization:{org}&rows=1000'
#resources.extend(process_packages(url, portal, 'packages'))
resources.extend(process_packages(url, portal, 'results'))
for tag in portal['tags']:
url = f'{api_url}package_search?q=tags:"{tag}"&rows=1000'
resources.extend(process_packages(url, portal, 'results', org=tag))
for query in portal['queries']:
url = f'{api_url}package_search?q="{query}"&rows=1000'
resources.extend(process_packages(url, portal, 'results', org=query))
for group in portal['groups']:
url = f'{api_url}group_show?id={group}&include_datasets=True'
resources.extend(process_packages(url, portal, 'packages'))
return resources
resources = process_portals()
df = pd.DataFrame(resources)
# Standardise some names
df['publisher'] = df['publisher'].str.replace('queensland state archives', 'Queensland State Archives')
df['publisher'] = df['publisher'].str.replace('Public Records Office', 'Public Records Office Victoria')
df['publisher'] = df['publisher'].str.replace('Public Record Office', 'Public Records Office Victoria')
df['publisher'] = df['publisher'].str.replace('PROV', 'Public Records Office Victoria')
df['publisher'] = df['publisher'].str.replace(r'^State Records$', 'State Records South Australia', regex=True)
# Just in case there are any duplicates, we can use the download_url (which incorporates a unique id) to get rid of them
df.drop_duplicates(subset=['download_url'], inplace=True)
df.shape[0]
df['source'].value_counts()
Remember that a single dataset (package) might contain multiple files (resources).
df_datasets = df.drop_duplicates(subset=['info_url'])
df_datasets.shape[0]
df_datasets['source'].value_counts()
First let's look at datasets.
df_datasets['publisher'].value_counts()
Now let's look at files.
df['publisher'].value_counts()
df['format'].value_counts()
df['licence'].value_counts()
df.to_csv('glam-datasets-from-gov-portals.csv', index=False)
csvs = df.loc[df['format'] == 'CSV']
csvs.shape
csvs.to_csv('glam-datasets-from-gov-portals-csvs.csv', index=False)
# Write results to a markdown file
from slugify import slugify
orgs = df.sort_values(by=['publisher', 'dataset_title', 'dataset_modified']).groupby('publisher')
with open('glam_datasets_from_gov_portals.md', 'w') as md_file:
md_file.write('# GLAM datasets from Australian government data portals\n')
for org, group in orgs:
md_file.write('* [{}](#{})\n'.format(org, slugify(org)))
for org, group in orgs:
md_file.write('\n## {}\n'.format(org))
for dataset, files in group.groupby(['dataset_title', 'info_url']):
md_file.write('\n### [{}]({})\n'.format(dataset[0], dataset[1]))
for row in files.itertuples():
md_file.write('* [{}]({}) ({}, {})\n'.format(row.file_title, row.download_url.replace(' ', '+'), row.format, row.file_created))
Created by Tim Sherratt (@wragge) for the GLAM Workbench. Support me by becoming a GitHub sponsor!