Harvest GLAM datasets from government data portals

Australian GLAM organisations have made a large number of openly-licensed datasets available through government data portals. But they're not always easy to find. Some are in state-based portals, others are in the national portal. And who would go looking for library data in a government data portal anyway?

To encourage people to explore these datasets, I've harvested them all from the different portals and combined them into one big CSV file.

Method

I've harvested data from the following portals:

  • data.gov.au
  • data.qld.gov.au
  • data.sa.gov.au
  • data.nsw.gov.au
  • data.wa.gov.au
  • data.vic.gov.au

In fact, data.gov.au provides two portals – an old one that includes datasets not available in the state portals, and a new one that brings all the state and national datasets together. So why didn't I just harvest everything from the new data.gov.au portal? I did, but it soon became apparent that the new portal had problems managing duplicate organisations and datasets, which made the results difficult to use. So I've gone back to aggregating everything myself.

For each portal, I've used the web interface to manually search for terms like 'library', 'archives', 'records', and 'museum' to find GLAM organisations. This isn't always straightforward. Sometimes the GLAM organisation is identified as an 'organisation' by the data portal. Other times, it's hidden beneath a parent organisation, and relevant datasets are identified by tags that include the GLAM organisation's name. In some cases there are neither organisations nor tags, and you just have to search for datasets that include the organisation's name somewhere in their notes. Because of these inconsistencies, it's entirely possible that I've missed some organisations.

I've saved all of the organisation names, tags, and queries into the portals list you'll see below, along with each portal's API endpoint. Fortunately all of the portals use CKAN behind the scenes, so the API is consistent. Yay! This makes things so much easier. Unfortunately Victoria makes you register and get an API key before you can access their CKAN API, so if you want to run this harvest yourself, you'll have to insert your own API key where indicated.
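For example, every CKAN portal exposes the same package_search endpoint. Here's a minimal sketch of a query against data.qld.gov.au (the endpoint and response fields are standard CKAN; the search term is just an illustration):

import requests

# CKAN's package_search endpoint works the same way on every portal
response = requests.get(
    'https://data.qld.gov.au/api/action/package_search',
    params={'q': 'library', 'rows': 5}
)
data = response.json()
print(data['result']['count'])  # total number of matching packages
for package in data['result']['results']:
    print(package['title'])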

The datasets themselves are arranged in a hierarchy of packages and resources. A package can contain multiple resources, or files. These might be the same data in different formats, data files and documentation, or versions of the data that change over time. I flatten out this hierarchy as I harvest the packages, creating a CSV file where each row is a single file (there's a small sketch of this after the field list below). The fields I'm capturing are:

  • dataset_title – name of the package
  • publisher – organisation that created/published the package
  • author – usually an email of the person who uploaded the package
  • dataset_issued – date the package was created
  • dataset_modified – date the package was last changed
  • dataset_description – a description of the package
  • source – the portal it was harvested from
  • info_url – a link to the portal page for more information
  • start_date – earliest date in the data
  • end_date – latest date in the data
  • file_title – name of the file (resource)
  • download_url – url to directly download the data file
  • format – format of the file, eg. 'CSV' or 'JSON'
  • file_description – description of the file
  • file_created – date the file was created
  • file_modified – date the file was last changed
  • file_size – size of the file
  • licence – licence string, eg. 'CC-BY'
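
To make the flattening concrete, here's a simplified sketch (with entirely made-up values) of how one package containing two resources becomes two rows:

# A simplified, hypothetical CKAN package with two files attached
package = {
    'title': 'Passenger arrivals',
    'organization': {'title': 'Example Archives'},
    'resources': [
        {'name': 'Passenger arrivals 1900-1910', 'format': 'CSV', 'url': 'https://example.gov.au/arrivals.csv'},
        {'name': 'Data dictionary', 'format': 'PDF', 'url': 'https://example.gov.au/dictionary.pdf'},
    ]
}

# One output row per resource, with the package-level fields
# repeated on every row
rows = [
    {
        'dataset_title': package['title'],
        'publisher': package['organization']['title'],
        'file_title': resource['name'],
        'format': resource['format'],
        'download_url': resource['url'],
    }
    for resource in package['resources']
]
# rows now holds two dicts – one per file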

You can browse a list of datasets, download a CSV containing all the harvested data, or download a CSV listing just the CSV-formatted files. You can also search the harvested data using Datasette on Glitch.

To start exploring the contents of the datasets, give the GLAM CSV Explorer a spin.

In [1]:
import requests
import json
from json import JSONDecodeError
import pandas as pd
from pandas.errors import ParserError
from urllib.parse import urlparse
import os
import re
import time
from slugify import slugify
from tqdm.notebook import tqdm
from requests_cache import CachedSession

from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

# Cache responses so re-runs don't hammer the portals,
# and retry automatically on server errors
s = CachedSession()
retries = Retry(total=10, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount('http://', HTTPAdapter(max_retries=retries))
# All of the portal APIs are served over https, so mount the adapter there too
s.mount('https://', HTTPAdapter(max_retries=retries))
In [2]:
portals = [
    {
        'name': 'data.qld.gov.au',
        'api_url': 'https://data.qld.gov.au/api/action/',
        'orgs': ['state-library-queensland'],
        'tags': ['Queensland State Archives', 'queensland state archives'],
        'queries': ['Queensland Museum'],
        'groups': [],
        'base_url': 'https://data.qld.gov.au/dataset/',
        'package_ids': []
    },
    {
        'name': 'data.gov.au',
        'api_url': 'https://data.gov.au/api/3/action/',
        'orgs': [
            'aiatsis',
            'nationallibraryofaustralia',
            'libraries-tasmania',
            'nationalarchivesofaustralia',
            'national-portrait-gallery'
        ],
        'tags':[],
        'queries':[],
        'groups': [],
        'base_url': 'https://data.gov.au/dataset/',
        'package_ids': []
    },
    {
        'name': 'data.sa.gov.au',
        'api_url': 'https://data.sa.gov.au/data/api/3/action/',
        'orgs': [    
            'state-library-of-south-australia',
            'mount-gambier-library',
            'state-records',
            'history-sa',
            'south-australian-museum'
        ],
        'tags':[],
        'queries':[],
        'groups': [],
        'base_url': 'https://data.sa.gov.au/data/dataset/',
        'package_ids': []
    },
    {
        'name': 'data.nsw.gov.au',
        'api_url': 'https://data.nsw.gov.au/data/api/3/action/',
        'orgs': [
            'state-library-of-nsw',
            'nsw-state-archives',
            'maas',
            'australian-museum'
        ],
        'tags':[],
        'queries':[],
        'groups': [],
        'base_url': 'https://data.nsw.gov.au/data/dataset/',
        'package_ids': []
    },
    {
        'name': 'data.wa.gov.au',
        'api_url': 'https://catalogue.data.wa.gov.au/api/3/action/',
        'orgs': [
            'state-library-of-western-australia',
            'state-records-office-of-western-australia',
            'western-australian-museum'
        ],
        'tags':[],
        'queries':[],
        'groups': [],
        'base_url': 'https://catalogue.data.wa.gov.au/dataset/',
        'package_ids': []
    },
    {
        'name': 'data.vic.gov.au',
        'api_url': 'https://discover.data.vic.gov.au/api/3/action/',
        # 'apikey': 'YOUR API KEY',
        'orgs': [
            'state-library-of-victoria'
        ],
        'tags':[],
        'queries':['PROV', 'Public Records Office', 'Museums Victoria'],
        'groups': [],
        'base_url': 'https://www.data.vic.gov.au/data/dataset/',
        'package_ids': []
    },
]
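As a quick sanity check, the settings above can be used to ask a portal how many packages an organisation has. This sketch queries the first portal in the list (and assumes the portal is online); setting rows=0 asks CKAN for just the match count without any package records:

portal = portals[0]  # data.qld.gov.au
org = portal['orgs'][0]  # state-library-queensland
url = f"{portal['api_url']}package_search?fq=organization:{org}&rows=0"
print(s.get(url).json()['result']['count'])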
In [21]:
def get_value(field):
    '''
    Author fields are sometimes plain strings and sometimes JSON objects
    embedded in strings. Extract a string value either way.
    '''
    try:
        # Normalise Python-style quoting so the string parses as JSON
        cleaned = field.replace("u'", "'").replace("'", '"')
        j = json.loads(cleaned)
        value = j['name']
    except JSONDecodeError:
        # Not JSON – just use the string as is
        value = field
    except AttributeError:
        # Not a string at all (probably None)
        value = None
    return value

def fix_github_links(url):
    '''
    Make sure github links point to downloadable files.
    '''
    return url.replace('//github.com', '//raw.githubusercontent.com').replace('/blob', '')

def check_http_status(url):
    '''
    Do a HEAD request of downloadable datasets to check if they're still there.
    '''
    response = s.head(url, allow_redirects=True)
    return response.status_code

def get_format(resource):
    # First try getting the file extension from the url
    try:
        url = fix_github_links(resource['url'])
        file_format = re.search(r'\.([a-zA-Z]+)$', url).group(1).upper()
    # If that fails just use the supplied value (which may be dodgy)
    except AttributeError:
        file_format = resource['format']
    return file_format

def add_key(portal):
    '''Add an API KEY into headers.'''
    if 'apikey' in portal:
        headers = {
            'apikey': portal['apikey'],
            "Content-Type": "application/json",
            "Accept": "application/json"
        }
    else:
        headers = {}
    return headers

def get_package_resources(package_id, portal, org=None):
    '''
    Given a package id and a portal, download details of all the associated files (resources).
    '''
    resources = []
    api_url = portal['api_url']
    url = '{}package_show?id={}'.format(api_url, package_id)
    response = s.get(url, headers=add_key(portal))
    package_data = response.json()
    try:
        title = package_data['result']['title']
    except KeyError:
        # Not found
        pass
    else:
        if org:
            organisation = org
        else:
            organisation = package_data['result']['organization']['title']
        author = get_value(package_data['result']['author'])
        try:
            date_from = package_data['result']['temporal_coverage_from']
        except KeyError:
            date_from = ''
        try:
            date_to = package_data['result']['temporal_coverage_to']
        except KeyError:
            date_to = ''
        for resource in package_data['result']['resources']:
            dataset = {}
            resource_url = fix_github_links(resource['url'])
            dataset['dataset_title'] = title.strip()
            dataset['publisher'] = organisation
            dataset['author'] = author
            dataset['dataset_issued'] = package_data['result']['metadata_created']
            dataset['dataset_modified'] = package_data['result']['metadata_modified']
            dataset['dataset_description'] = package_data['result']['notes']
            dataset['source'] = portal['name']
            dataset['info_url'] = portal['base_url'] + package_id
            dataset['start_date'] = date_from
            dataset['end_date'] = date_to
            dataset['file_title'] = resource['name'].strip()
            dataset['download_url'] = resource_url
            dataset['format'] = get_format(resource)
            dataset['file_description'] = resource['description']
            dataset['file_created'] = resource['created']
            dataset['file_modified'] = resource['last_modified']
            dataset['file_size'] = resource['size']
            # dataset['status'] = check_http_status(resource_url)
            dataset['licence'] = package_data['result']['license_title']
            resources.append(dataset)
    return resources

def process_packages(url, portal, results_label, org=None):
    '''
    Get the list of packages associated with an organisation, or returned by a search,
    then get details of all the files (resources) inside each package.
    '''
    tqdm.write(url)
    resources = []
    response = s.get(url, headers=add_key(portal))
    data = response.json()
    for package in data['result'][results_label]:
        resources.extend(get_package_resources(package['id'], portal, org=org))
        time.sleep(0.2)
    return resources

def process_portals():
    '''
    Get all of the resources from the defined portals.
    '''
    resources = []
    for portal in tqdm(portals):
        api_url = portal['api_url']
        for org in portal['orgs']:
            url = f'{api_url}package_search?fq=organization:{org}&rows=1000'
            resources.extend(process_packages(url, portal, 'results'))
        for tag in portal['tags']:
            url = f'{api_url}package_search?q=tags:"{tag}"&rows=1000'
            resources.extend(process_packages(url, portal, 'results', org=tag))
        for query in portal['queries']:
            url = f'{api_url}package_search?q="{query}"&rows=1000'
            resources.extend(process_packages(url, portal, 'results', org=query))
        for group in portal['groups']:
            url = f'{api_url}group_show?id={group}&include_datasets=True'
            resources.extend(process_packages(url, portal, 'packages'))
    return resources
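
To see what these helpers do, here are a few illustrative calls (all the input values are made up):

# get_value() unpacks author fields that arrive as JSON objects in strings
get_value('{"name": "someone@example.gov.au"}')  # 'someone@example.gov.au'
get_value('Just a plain string')                 # 'Just a plain string'

# fix_github_links() rewrites GitHub page urls as raw file urls
fix_github_links('https://github.com/user/repo/blob/master/data.csv')
# 'https://raw.githubusercontent.com/user/repo/master/data.csv'

# get_format() prefers the file extension in the url over the supplied format
get_format({'url': 'https://example.gov.au/data.csv', 'format': 'Mixed Formats'})  # 'CSV'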
In [ ]:
resources = process_portals()

Explore the data

In [23]:
df = pd.DataFrame(resources)

# Standardise some names
df['publisher'] = df['publisher'].str.replace('queensland state archives', 'Queensland State Archives')
df['publisher'] = df['publisher'].str.replace('Public Records Office', 'Public Records Office Victoria')
df['publisher'] = df['publisher'].str.replace('Public Record Office', 'Public Records Office Victoria')
df['publisher'] = df['publisher'].str.replace('PROV', 'Public Records Office Victoria')
df['publisher'] = df['publisher'].str.replace(r'^State Records$', 'State Records South Australia', regex=True)

# Just in case there are any duplicates, we can use the download_url (which incorporates a unique id) to get rid of them
df.drop_duplicates(subset=['download_url'], inplace=True)

How many files are there?

In [24]:
df.shape[0]
Out[24]:
1076

How many files come from each portal?

In [25]:
df['source'].value_counts()
Out[25]:
data.qld.gov.au    523
data.sa.gov.au     261
data.wa.gov.au     105
data.gov.au         98
data.vic.gov.au     62
data.nsw.gov.au     27
Name: source, dtype: int64

How many unique datasets are there?

Remember a single dataset might contain multiple files.

In [26]:
df_datasets = df.drop_duplicates(subset=['info_url'])
df_datasets.shape[0]
Out[26]:
413

How many datasets come from each portal?

In [27]:
df_datasets['source'].value_counts()
Out[27]:
data.qld.gov.au    169
data.sa.gov.au     101
data.vic.gov.au     61
data.gov.au         38
data.wa.gov.au      28
data.nsw.gov.au     16
Name: source, dtype: int64

How many datasets and files come from each organisation?

First let's look at datasets.

In [28]:
df_datasets['publisher'].value_counts()
Out[28]:
Queensland State Archives                                                          105
Public Records Office Victoria                                                      55
State Library of South Australia                                                    36
State Library of Queensland                                                         35
State Records South Australia                                                       30
Queensland Museum                                                                   29
Libraries Tasmania                                                                  29
State Library of Western Australia                                                  18
South Australian Museum                                                             17
History Trust of South Australia                                                    16
NSW State Archives                                                                   8
State Records Office of Western Australia                                            7
State Library of NSW                                                                 5
State Library of Victoria                                                            5
National Library of Australia                                                        4
Museum of Applied Arts and Sciences                                                  3
Western Australian Museum                                                            3
Mount Gambier Library                                                                2
National Archives of Australia                                                       2
Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS)      2
National Portrait Gallery                                                            1
Museums Victoria                                                                     1
Name: publisher, dtype: int64

Now let's look at files.

In [29]:
df['publisher'].value_counts()
Out[29]:
State Library of Queensland                                                        277
Queensland State Archives                                                          197
State Library of South Australia                                                   135
Libraries Tasmania                                                                  85
South Australian Museum                                                             74
State Library of Western Australia                                                  73
Public Records Office Victoria                                                      56
Queensland Museum                                                                   49
State Records South Australia                                                       30
History Trust of South Australia                                                    20
NSW State Archives                                                                  19
Western Australian Museum                                                           19
State Records Office of Western Australia                                           13
National Library of Australia                                                        5
State Library of Victoria                                                            5
State Library of NSW                                                                 5
National Archives of Australia                                                       3
Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS)      3
Museum of Applied Arts and Sciences                                                  3
Mount Gambier Library                                                                2
National Portrait Gallery                                                            2
Museums Victoria                                                                     1
Name: publisher, dtype: int64

What formats are the files in?

In [30]:
df['format'].value_counts()
Out[30]:
CSV                              711
JSON                              78
XML                               61
XLSX                              50
ZIP                               22
JPG                               17
JPEG                              15
RTF                               15
HTML                              12
DOC                               10
API                               10
TXT                                9
DOCX                               8
website link                       7
OBJ                                7
MTL                                5
Mixed Formats                      4
GeoJSON                            4
API ArcGIS Server Map Service      4
                                   2
PDF                                2
RSS                                2
KML                                2
WMS                                2
GEOJSON                            2
CSS, Java, PHP, JavaScript         1
JS                                 1
XSD                                1
MPK                                1
.txt                               1
app                                1
Website                            1
WFS                                1
RDF                                1
api                                1
PHP                                1
plain                              1
MUSEUM                             1
page                               1
TIF                                1
Name: format, dtype: int64

What licences have been applied to the files?

In [31]:
df['licence'].value_counts()
Out[31]:
Creative Commons Attribution 4.0                                      611
Creative Commons Attribution                                          275
Creative Commons Attribution 4.0 International                        121
Creative Commons Attribution 2.5 Australia                             27
Creative Commons Attribution-NonCommercial                              9
Creative Commons Attribution 3.0 Australia                              8
Creative Commons Attribution 3.0                                        7
                                                                        4
Other (Open)                                                            4
notspecified                                                            3
Creative Commons Attribution Share-Alike 4.0                            3
Creative Commons Attribution Non-Commercial 4.0                         2
Creative Commons Attribution No Derivative Works 4.0 International      1
Custom (Other)                                                          1
Name: licence, dtype: int64

Save as a CSV file

In [32]:
df.to_csv('glam-datasets-from-gov-portals.csv', index=False)

Save a CSV of CSV files only!

In [33]:
csvs = df.loc[df['format'] == 'CSV']
In [34]:
csvs.shape
Out[34]:
(711, 18)
In [35]:
csvs.to_csv('glam-datasets-from-gov-portals-csvs.csv', index=False)

Create a human-readable list

In [36]:
# Write results to a markdown file
orgs = df.sort_values(by=['publisher', 'dataset_title', 'dataset_modified']).groupby('publisher')
with open('glam_datasets_from_gov_portals.md', 'w') as md_file:
    md_file.write('# GLAM datasets from Australian government data portals\n')
    for org, group in orgs:
        md_file.write('* [{}](#{})\n'.format(org, slugify(org)))
    for org, group in orgs:
        md_file.write('\n## {}\n'.format(org))
        for dataset, files in group.groupby(['dataset_title', 'info_url']):
            md_file.write('\n### [{}]({})\n'.format(dataset[0], dataset[1]))
            for row in files.itertuples():
                md_file.write('* [{}]({}) ({}, {})\n'.format(row.file_title, row.download_url.replace(' ', '+'), row.format, row.file_created))
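
The resulting markdown file is organised by publisher, with a linked heading for each dataset and a bullet point for each file. A (hypothetical) entry looks something like this:

## Example Archives

### [Passenger arrivals](https://data.example.gov.au/dataset/passenger-arrivals)
* [Passenger arrivals 1900-1910](https://data.example.gov.au/arrivals.csv) (CSV, 2019-07-01T00:00:00)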

Created by Tim Sherratt (@wragge) for the GLAM workbench. Support me by becoming a GitHub sponsor!