Australian GLAM organisations have made a large number of openly-licensed datasets available through government data portals. But they're not always easy to find. Some are in state-based portals, others are in the national portal. And who would go looking for library data in a government data portal anyway?
To encourage people to explore these datasets, I've harvested them all from the different portals and combined them into one big CSV file.
I've harvested data from the following portals:

* data.qld.gov.au
* data.gov.au
* data.sa.gov.au
* data.nsw.gov.au
* data.wa.gov.au
* data.vic.gov.au
In fact, data.gov.au provides two portals – an old one that includes datasets not available through the state portals, and a new one that brings all the state and national datasets together. So why didn't I just harvest everything from the new data.gov.au portal? I did, but it soon became apparent that the new portal had problems managing duplicate organisations and datasets, which made the results difficult to use. So now I've gone back to aggregating everything myself.
For each portal, I've used the web interface to manually search for terms like 'library', 'archives', 'records', and 'museum' to find GLAM organisations. This isn't always straightforward. Sometimes the GLAM organisation is identified as an 'organisation' by the data portal. Other times, the GLAM organisation is hidden beneath a parent organisation, and relevant datasets are identified by tags that include the GLAM organisation's name. In some cases there are neither organisations nor tags, and you just have to search for datasets that include the organisation's name somewhere in their notes. Because of these inconsistencies, it's entirely possible that I've missed some organisations.
I've saved all of the organisation names, tags, and queries into the `portals` dictionary you'll see below, along with each portal's API endpoint. Fortunately, all of the portals use CKAN behind the scenes, so the API is consistent. Yay! This makes things so much easier. Unfortunately, Victoria makes you register and get an API key before you can access their CKAN API, so if you want to run this harvest yourself, you'll have to insert your own API key where indicated.
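Because every portal exposes the standard CKAN Action API, the same request pattern works everywhere. As a minimal sketch (using one of the organisation ids from the `portals` dictionary below), a `package_search` request looks like this:

```python
# Build a CKAN package_search request - the same pattern works on any of
# these portals because they all expose the standard CKAN Action API.
api_url = "https://data.qld.gov.au/api/action/"
org = "state-library-queensland"
url = f"{api_url}package_search?fq=organization:{org}&rows=1000"
print(url)

# To actually fetch results you'd do something like:
# import requests
# packages = requests.get(url).json()["result"]["results"]
```

The harvesting code below builds urls in exactly this way, adding headers (including Victoria's API key) as needed.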
The datasets themselves are arranged in a hierarchy of packages and resources. A package can contain multiple resources, or files. These might be the same data in different formats, data files and documentation, or versions of the data that change over time. I flatten out this hierarchy as I harvest the packages to create a CSV file where each row is a single file. The fields I'm capturing are:
* `dataset_title` – name of the package
* `publisher` – organisation that created/published the package
* `author` – usually an email of the person who uploaded the package
* `dataset_issued` – date the package was created
* `dataset_modified` – date the package was last changed
* `dataset_description` – a description of the package
* `source` – the portal it was harvested from
* `info_url` – a link to the portal page for more information
* `start_date` – earliest date in the data
* `end_date` – latest date in the data
* `file_title` – name of the file (resource)
* `download_url` – url to directly download the data file
* `format` – format of the file, eg. 'CSV' or 'JSON'
* `file_description` – description of the file
* `file_created` – date the file was created
* `file_modified` – date the file was last changed
* `file_size` – size of the file
* `licence` – licence string, eg. 'CC-BY'

You can browse a list of datasets, download a CSV containing all the harvested data, or just the CSVs. You can also search the harvested data using Datasette on Glitch.
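The flattening step can be sketched with mock data: each package contributes one output row per resource, with the package-level fields repeated on every row. (The package here is hypothetical and uses only a few of the fields listed above.)

```python
# A minimal sketch of the package -> resource flattening, using mock data.
# Real CKAN packages carry many more fields; these mirror the CSV columns.
package = {
    "title": "Example dataset",
    "resources": [
        {"name": "Data", "format": "CSV"},
        {"name": "Documentation", "format": "PDF"},
    ],
}

rows = []
for resource in package["resources"]:
    rows.append(
        {
            "dataset_title": package["title"],  # repeated for every file
            "file_title": resource["name"],
            "format": resource["format"],
        }
    )

print(len(rows))  # -> 2, one row per file
```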
To start exploring the contents of the datasets, give the GLAM CSV Explorer a spin.
import json
import re
import time
from json import JSONDecodeError
import pandas as pd
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from requests_cache import CachedSession
from slugify import slugify
from tqdm.notebook import tqdm
s = CachedSession()
s.headers.update(
{
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0"
}
)
retries = Retry(total=10, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("http://", HTTPAdapter(max_retries=retries))
s.mount("https://", HTTPAdapter(max_retries=retries))
portals = [
{
"name": "data.qld.gov.au",
"api_url": "https://data.qld.gov.au/api/action/",
"orgs": ["state-library-queensland"],
"tags": ["Queensland State Archives", "queensland state archives"],
"queries": ["Queensland Museum"],
"groups": [],
"base_url": "https://data.qld.gov.au/dataset/",
"package_ids": [],
},
{
"name": "data.gov.au",
"api_url": "https://data.gov.au/api/3/action/",
"orgs": [
"aiatsis",
"nationallibraryofaustralia",
"libraries-tasmania",
"nationalarchivesofaustralia",
"national-portrait-gallery",
],
"tags": [],
"queries": [],
"groups": [],
"base_url": "https://data.gov.au/dataset/",
"package_ids": [],
},
{
"name": "data.sa.gov.au",
"api_url": "https://data.sa.gov.au/data/api/3/action/",
"orgs": [
"state-library-of-south-australia",
"mount-gambier-library",
"state-records",
"history-sa",
"south-australian-museum",
],
"tags": [],
"queries": [],
"groups": [],
"base_url": "https://data.sa.gov.au/data/dataset/",
"package_ids": [],
},
{
"name": "data.nsw.gov.au",
"api_url": "https://data.nsw.gov.au/data/api/3/action/",
"orgs": [
"state-library-of-nsw",
"nsw-state-archives",
"maas",
"australian-museum",
],
"tags": [],
"queries": [],
"groups": [],
"base_url": "https://data.nsw.gov.au/data/dataset/",
"package_ids": [],
},
{
"name": "data.wa.gov.au",
"api_url": "https://catalogue.data.wa.gov.au/api/3/action/",
"orgs": [
"state-library-of-western-australia",
"state-records-office-of-western-australia",
"western-australian-museum",
],
"tags": [],
"queries": [],
"groups": [],
"base_url": "https://catalogue.data.wa.gov.au/dataset/",
"package_ids": [],
},
{
"name": "data.vic.gov.au",
"api_url": "https://discover.data.vic.gov.au/api/3/action/",
# 'apikey': 'YOUR API KEY',
"orgs": ["state-library-of-victoria"],
"tags": [],
"queries": ["PROV", "Public Records Office", "Museums Victoria"],
"groups": [],
"base_url": "https://www.data.vic.gov.au/data/dataset/",
"package_ids": [],
},
]
def get_value(field):
"""
Field values are sometimes plain strings and sometimes stringified objects.
Return a simple string value in either case.
"""
try:
s = field.replace("u'", "'").replace("'", '"')
j = json.loads(s)
value = j["name"]
except JSONDecodeError:
value = field
except AttributeError:
value = None
return value
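For example, `get_value()` handles both shapes (the function is repeated here so the demo is self-contained; the input values are made up):

```python
import json
from json import JSONDecodeError


def get_value(field):
    """Copy of the function above, repeated so this demo is self-contained."""
    try:
        s = field.replace("u'", "'").replace("'", '"')
        j = json.loads(s)
        value = j["name"]
    except JSONDecodeError:
        value = field
    except AttributeError:
        value = None
    return value


# A stringified Python dict, as sometimes returned by the API
print(get_value("{u'name': u'State Library'}"))  # -> State Library
# A plain string passes through unchanged
print(get_value("data@example.gov.au"))  # -> data@example.gov.au
# A missing (None) field returns None
print(get_value(None))  # -> None
```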
def fix_github_links(url):
"""
Make sure github links point to downloadable files.
"""
return url.replace("//github.com", "//raw.githubusercontent.com").replace(
"/blob", ""
)
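For example, a hypothetical GitHub 'blob' url gets rewritten to its raw equivalent (the function is repeated here so the demo is self-contained):

```python
def fix_github_links(url):
    """Copy of the function above, repeated so this demo is self-contained."""
    return url.replace("//github.com", "//raw.githubusercontent.com").replace(
        "/blob", ""
    )


print(fix_github_links("https://github.com/user/repo/blob/master/data.csv"))
# -> https://raw.githubusercontent.com/user/repo/master/data.csv
```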
def check_http_status(url):
"""
Do a HEAD request of downloadable datasets to check if they're still there.
"""
response = s.head(url, allow_redirects=True)
return response.status_code
def get_format(resource):
# First try getting file extension
try:
url = fix_github_links(resource["url"])
file_format = re.search(r"\.([a-zA-Z]+)$", url).group(1).upper()
# If that fails just use the supplied value (which may be dodgy)
except AttributeError:
file_format = resource["format"]
return file_format
def add_key(portal):
"""Add an API KEY into headers."""
if "apikey" in portal:
headers = {
"apikey": portal["apikey"],
"Content-Type": "application/json",
"Accept": "application/json",
}
else:
headers = {}
return headers
def get_package_resources(package_id, portal, org=None):
"""
Given a package id and a portal, download details of all associated resources (files).
"""
resources = []
api_url = portal["api_url"]
url = "{}package_show?id={}".format(api_url, package_id)
# print(url)
response = s.get(url, headers=add_key(portal))
package_data = response.json()
try:
title = package_data["result"]["title"]
except KeyError:
# Not found
pass
else:
if org:
organisation = org
else:
organisation = package_data["result"]["organization"]["title"]
try:
author = get_value(package_data["result"]["author"])
except KeyError:
author = None
try:
date_from = package_data["result"]["temporal_coverage_from"]
except KeyError:
date_from = ""
try:
date_to = package_data["result"]["temporal_coverage_to"]
except KeyError:
date_to = ""
for resource in package_data["result"]["resources"]:
dataset = {}
resource_url = fix_github_links(resource["url"])
dataset["dataset_title"] = title.strip()
dataset["publisher"] = organisation
if author:
dataset["author"] = author
dataset["dataset_issued"] = package_data["result"]["metadata_created"]
dataset["dataset_modified"] = package_data["result"]["metadata_modified"]
dataset["dataset_description"] = package_data["result"]["notes"]
dataset["source"] = portal["name"]
dataset["info_url"] = portal["base_url"] + package_id
dataset["start_date"] = date_from
dataset["end_date"] = date_to
dataset["file_title"] = resource["name"].strip()
dataset["download_url"] = resource_url
dataset["format"] = get_format(resource)
dataset["file_description"] = resource.get("description")
dataset["file_created"] = resource["created"]
try:
dataset["file_modified"] = resource["last_modified"]
except KeyError:
pass
dataset["file_size"] = resource["size"]
# dataset['status'] = check_http_status(resource_url)
dataset["licence"] = package_data["result"]["license_title"]
resources.append(dataset)
return resources
def process_packages(url, portal, results_label, org=None):
"""
Get list of packages associated with an organisation, or returned by a search,
then get details of all the files (resources) inside that package.
"""
tqdm.write(url)
resources = []
response = s.get(url, headers=add_key(portal))
data = response.json()
for package in data["result"][results_label]:
resources.extend(get_package_resources(package["id"], portal, org=org))
time.sleep(0.2)
return resources
def process_portals():
"""
Get all of the resources from the defined portals.
"""
resources = []
for portal in tqdm(portals):
api_url = portal["api_url"]
for org in portal["orgs"]:
# url = f'{api_url}organization_show?id={org}&include_datasets=true'
url = f"{api_url}package_search?fq=organization:{org}&rows=1000"
# resources.extend(process_packages(url, portal, 'packages'))
resources.extend(process_packages(url, portal, "results"))
for tag in portal["tags"]:
url = f'{api_url}package_search?q=tags:"{tag}"&rows=1000'
resources.extend(process_packages(url, portal, "results", org=tag))
for query in portal["queries"]:
url = f'{api_url}package_search?q="{query}"&rows=1000'
resources.extend(process_packages(url, portal, "results", org=query))
for group in portal["groups"]:
url = f"{api_url}group_show?id={group}&include_datasets=True"
resources.extend(process_packages(url, portal, "packages"))
return resources
resources = process_portals()
https://data.qld.gov.au/api/action/package_search?fq=organization:state-library-queensland&rows=1000
https://data.qld.gov.au/api/action/package_search?q=tags:"Queensland State Archives"&rows=1000
https://data.qld.gov.au/api/action/package_search?q=tags:"queensland state archives"&rows=1000
https://data.qld.gov.au/api/action/package_search?q="Queensland Museum"&rows=1000
https://data.gov.au/api/3/action/package_search?fq=organization:aiatsis&rows=1000
https://data.gov.au/api/3/action/package_search?fq=organization:nationallibraryofaustralia&rows=1000
https://data.gov.au/api/3/action/package_search?fq=organization:libraries-tasmania&rows=1000
https://data.gov.au/api/3/action/package_search?fq=organization:nationalarchivesofaustralia&rows=1000
https://data.gov.au/api/3/action/package_search?fq=organization:national-portrait-gallery&rows=1000
https://data.sa.gov.au/data/api/3/action/package_search?fq=organization:state-library-of-south-australia&rows=1000
https://data.sa.gov.au/data/api/3/action/package_search?fq=organization:mount-gambier-library&rows=1000
https://data.sa.gov.au/data/api/3/action/package_search?fq=organization:state-records&rows=1000
https://data.sa.gov.au/data/api/3/action/package_search?fq=organization:history-sa&rows=1000
https://data.sa.gov.au/data/api/3/action/package_search?fq=organization:south-australian-museum&rows=1000
https://data.nsw.gov.au/data/api/3/action/package_search?fq=organization:state-library-of-nsw&rows=1000
https://data.nsw.gov.au/data/api/3/action/package_search?fq=organization:nsw-state-archives&rows=1000
https://data.nsw.gov.au/data/api/3/action/package_search?fq=organization:maas&rows=1000
https://data.nsw.gov.au/data/api/3/action/package_search?fq=organization:australian-museum&rows=1000
https://catalogue.data.wa.gov.au/api/3/action/package_search?fq=organization:state-library-of-western-australia&rows=1000
https://catalogue.data.wa.gov.au/api/3/action/package_search?fq=organization:state-records-office-of-western-australia&rows=1000
https://catalogue.data.wa.gov.au/api/3/action/package_search?fq=organization:western-australian-museum&rows=1000
https://discover.data.vic.gov.au/api/3/action/package_search?fq=organization:state-library-of-victoria&rows=1000
https://discover.data.vic.gov.au/api/3/action/package_search?q="PROV"&rows=1000
https://discover.data.vic.gov.au/api/3/action/package_search?q="Public Records Office"&rows=1000
https://discover.data.vic.gov.au/api/3/action/package_search?q="Museums Victoria"&rows=1000
df = pd.DataFrame(resources)
# Standardise some names
df["publisher"] = df["publisher"].str.replace(
"queensland state archives", "Queensland State Archives"
)
df["publisher"] = df["publisher"].str.replace(
"Public Records Office", "Public Records Office Victoria"
)
df["publisher"] = df["publisher"].str.replace(
"Public Record Office", "Public Records Office Victoria"
)
df["publisher"] = df["publisher"].str.replace("PROV", "Public Records Office Victoria")
df["publisher"] = df["publisher"].str.replace(
r"^State Records$", "State Records South Australia", regex=True
)
# Just in case there are any duplicates, we can use the download_url (which incorporates a unique id) to get rid of them
df.drop_duplicates(subset=["download_url"], inplace=True)
df.shape[0]
1192
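Deduplicating on `download_url` works because CKAN resource urls incorporate each resource's unique id, so two rows with the same url are the same file. A tiny sketch with mock urls:

```python
import pandas as pd

# Mock rows - the same resource harvested twice via different queries.
# The urls are hypothetical, but mimic CKAN's id-based resource paths.
df = pd.DataFrame(
    [
        {"dataset_title": "A", "download_url": "https://example.gov.au/r/123/data.csv"},
        {"dataset_title": "A", "download_url": "https://example.gov.au/r/123/data.csv"},
        {"dataset_title": "B", "download_url": "https://example.gov.au/r/456/data.csv"},
    ]
)
df.drop_duplicates(subset=["download_url"], inplace=True)
print(df.shape[0])  # -> 2
```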
df["source"].value_counts()
data.qld.gov.au    559
data.sa.gov.au     262
data.wa.gov.au     177
data.gov.au         98
data.vic.gov.au     68
data.nsw.gov.au     28
Name: source, dtype: int64
Remember a single dataset might contain multiple files.
df_datasets = df.drop_duplicates(subset=["info_url"])
df_datasets.shape[0]
463
df_datasets["source"].value_counts()
data.qld.gov.au    175
data.sa.gov.au     102
data.vic.gov.au     67
data.wa.gov.au      65
data.gov.au         38
data.nsw.gov.au     16
Name: source, dtype: int64
First let's look at datasets.
df_datasets["publisher"].value_counts()
Queensland State Archives                                                          108
Public Records Office Victoria                                                      61
State Records Office of Western Australia                                           44
State Library of South Australia                                                    37
State Library of Queensland                                                         36
Queensland Museum                                                                   31
State Records South Australia                                                       30
Libraries Tasmania                                                                  29
State Library of Western Australia                                                  18
South Australian Museum                                                             17
History Trust of South Australia                                                    16
NSW State Archives                                                                   8
State Library of NSW                                                                 5
State Library of Victoria                                                            5
National Library of Australia                                                        4
Museum of Applied Arts and Sciences                                                  3
Western Australian Museum                                                            3
Mount Gambier Library                                                                2
National Archives of Australia                                                       2
Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS)      2
National Portrait Gallery                                                            1
Museums Victoria                                                                     1
Name: publisher, dtype: int64
Now let's look at files.
df["publisher"].value_counts()
State Library of Queensland                                                        304
Queensland State Archives                                                          204
State Library of South Australia                                                   136
State Records Office of Western Australia                                           86
Libraries Tasmania                                                                  85
South Australian Museum                                                             74
State Library of Western Australia                                                  73
Public Records Office Victoria                                                      62
Queensland Museum                                                                   51
State Records South Australia                                                       30
History Trust of South Australia                                                    20
NSW State Archives                                                                  19
Western Australian Museum                                                           18
State Library of NSW                                                                 5
National Library of Australia                                                        5
State Library of Victoria                                                            5
Museum of Applied Arts and Sciences                                                  4
National Archives of Australia                                                       3
Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS)      3
Mount Gambier Library                                                                2
National Portrait Gallery                                                            2
Museums Victoria                                                                     1
Name: publisher, dtype: int64
df["format"].value_counts()
CSV                              752
JSON                              78
XML                               61
JPG                               53
XLSX                              51
IMG                               37
RTF                               16
JPEG                              15
HTML                              12
website link                      10
DOC                               10
API                               10
ZIP                               10
DOCX                               9
TXT                                9
GeoJSON                            7
OBJ                                7
MTL                                5
API ArcGIS Server Map Service      4
Mixed Formats                      4
GeoPackage                         3
SHP                                3
FGDB                               3
KML                                2
WMS                                2
PDF                                2
                                   2
GEOJSON                            2
Website                            1
TIF                                1
app                                1
JS                                 1
MUSEUM                             1
api                                1
RSS                                1
CSS, Java, PHP, JavaScript         1
.txt                               1
XSD                                1
plain                              1
page                               1
WFS                                1
Name: format, dtype: int64
df["licence"].value_counts()
Creative Commons Attribution 4.0                                      646
Creative Commons Attribution                                          277
Creative Commons Attribution 4.0 International                        127
                                                                       77
Creative Commons Attribution 2.5 Australia                             27
Creative Commons Attribution-NonCommercial                              9
Creative Commons Attribution 3.0 Australia                              8
Creative Commons Attribution 3.0                                        7
Other (Open)                                                            4
notspecified                                                            3
Creative Commons Attribution Share-Alike 4.0                            3
Creative Commons Attribution Non-Commercial 4.0                         2
Custom (Other)                                                          1
Creative Commons Attribution No Derivative Works 4.0 International      1
Name: licence, dtype: int64
df.to_csv("glam-datasets-from-gov-portals.csv", index=False)
csvs = df.loc[df["format"] == "CSV"]
csvs.shape
(752, 18)
csvs.to_csv("glam-datasets-from-gov-portals-csvs.csv", index=False)
# Write results to a markdown file
orgs = df.sort_values(by=["publisher", "dataset_title", "dataset_modified"]).groupby(
"publisher"
)
with open("glam_datasets_from_gov_portals.md", "w") as md_file:
md_file.write("# GLAM datasets from Australian government data portals\n")
for org, group in orgs:
md_file.write("* [{}](#{})\n".format(org, slugify(org)))
for org, group in orgs:
md_file.write("\n## {}\n".format(org))
for dataset, files in group.groupby(["dataset_title", "info_url"]):
md_file.write("\n### [{}]({})\n".format(dataset[0], dataset[1]))
for row in files.itertuples():
# NaN is truthy, so check explicitly for missing values
if pd.notnull(row.file_modified) and row.file_modified:
file_date = row.file_modified
else:
file_date = row.file_created
md_file.write(
"* [{}]({}) ({}, {})\n".format(
row.file_title,
row.download_url.replace(" ", "+"),
row.format,
file_date,
)
)
Created by Tim Sherratt (@wragge) for the GLAM workbench. Support me by becoming a GitHub sponsor!