If you've ever poked around in Trove's 'map' zone, you might have noticed the beautiful deep-zoomable images available for many of the NLA's digitised maps. Even better, in many cases the high-resolution TIFF versions of the digitised maps are available for download.
I knew there were lots of great maps you could download from Trove, but how many? And how big were the files? I thought I'd try to quantify this a bit by harvesting and analysing the metadata.
The sizes of the downloadable files (in both bytes and pixels) are embedded within the landing pages for the digitised maps, so harvesting the metadata involves a mix of Trove API requests and screen scraping.
2023 update! It turns out that the embedded data includes MARC descriptions containing some metadata that's not available through the API, such as the map scale and coordinates. The coordinates can be either a point or a bounding box. I've saved these values as well, and explored some ways of parsing and visualising the coordinates in this notebook.
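As a taste of what parsing the coordinates might involve, here's a rough sketch that converts a bounding box string in the 'W--E/N--S' format (eg 'E 130⁰50'--E 131⁰00'/S 12⁰30'--S 12⁰40'') into decimal degrees. This is a simplified, hypothetical parser, not the one used in the exploration notebook: it ignores seconds and assumes degrees and minutes are always present in that order.

```python
import re


def parse_coord(value):
    """Convert a single coordinate like "E 130⁰50'" to decimal degrees."""
    match = re.search(r"([EWNS])\s*(\d+)[°⁰]\s*(\d+)?", value)
    if not match:
        return None
    direction, degrees, minutes = match.groups()
    decimal = int(degrees) + int(minutes or 0) / 60
    # West and South are negative in decimal notation
    return -decimal if direction in "WS" else decimal


def parse_coordinates(coords):
    """Split a point or 'W--E/N--S' bounding box string into decimal values."""
    parts = re.split(r"--|/", coords.strip("() ."))
    return [parse_coord(part) for part in parts]


print(parse_coordinates("E 130⁰50'--E 131⁰00'/S 12⁰30'--S 12⁰40'"))
```

A bounding box yields four values (in the stored W, E, N, S order), while a point like (E 149°08'/S 35°18') yields just two.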
The fields in the harvested dataset are:

title – title of the map
url – url to the map in the digitised file viewer
work_url – url to the work in the Trove map category
identifier – NLA identifier
date – date published or created
creators – creators of the map
publication – publication place, publisher, and publication date (if available)
extent – physical description of the map
copyright_status – copyright status based on available metadata (scraped from the web page)
scale – map scale
coordinates – map coordinates, either a point or a bounding box (format is 'W--E/N--S', eg: 'E 130⁰50'--E 131⁰00'/S 12⁰30'--S 12⁰40')
filesize_string – filesize string in MB
filesize – size of TIFF file in bytes
width – width of TIFF in pixels
height – height of TIFF in pixels
copy_role – I'm not sure what the values in this field signify, but as described below, you can use them to download high-res TIFF images

There are a couple of undocumented tricks that make it easy to programmatically download images of the maps:

Add /image to the map url to view an image of the map. For example: http://nla.gov.au/nla.obj-232162256/image
Add the wid parameter to specify a pixel width. For example: http://nla.gov.au/nla.obj-232162256/image?wid=400
Add the copy_role value to the url to download the high-resolution version. For example, if the copy_role is 'm' this url will download the TIFF: http://nla.gov.au/nla.obj-232162256/m (note that some of these files are very, very large – you might want to check the filesize before downloading)
import datetime
import json
import os
import re
import time
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
import altair as alt
import pandas as pd
import requests_cache
from bs4 import BeautifulSoup
from IPython.display import FileLink, display
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv
# This creates a variable called 'api_key', paste your key between the quotes
api_key = ""
# Use an api key value from environment variables if it is available (useful for testing)
if os.getenv("TROVE_API_KEY"):
api_key = os.getenv("TROVE_API_KEY")
# This displays a message with your key
print("Your API key is: {}".format(api_key))
Your API key is: gq29l1g1h75pimh4
def get_total_results(params):
"""
Get the total number of results for a search.
"""
these_params = params.copy()
these_params["n"] = 0
response = s.get("https://api.trove.nla.gov.au/v2/result", params=these_params)
data = response.json()
return int(data["response"]["zone"][0]["records"]["total"])
def get_fulltext_url(links):
"""
Loop through the identifiers to find a link to the digital version of the map.
"""
url = None
for link in links:
if link["linktype"] == "fulltext" and "nla.obj" in link["value"]:
url = link["value"]
break
return url
def get_copyright_status(response=None, url=None):
"""
Scrape copyright information from a digital work page.
"""
if url and not response:
response = s.get(url)
if response:
soup = BeautifulSoup(response.text, "lxml")
try:
copyright_status = str(
soup.find("div", id="tab-access").find("p", class_="decorative").string
)
return copyright_status
# No access tab
except AttributeError:
pass
return None
def get_work_data(url):
"""
Extract work data in a JSON string from the work's HTML page.
"""
response = s.get(url)
try:
work_data = json.loads(
re.search(
r"var work = JSON\.parse\(JSON\.stringify\((\{.*\})", response.text
).group(1)
)
except (AttributeError, TypeError):
work_data = {}
# else:
# If there's no copyright info in the work data, then scrape it
# if "copyrightPolicy" not in work_data:
# work_data["copyrightPolicy"] = get_copyright_status(response)
if not response.from_cache:
time.sleep(0.2)
return work_data
def find_field_content(record, tag, subfield):
"""
Loop through a MARC record looking for tag/subfield.
If found, return the subfield value.
"""
try:
for field in record["datafield"]:
if field["tag"] == tag:
if isinstance(field["subfield"], list):
for sfield in field["subfield"]:
if sfield["code"] == subfield:
return sfield["content"]
else:
if field["subfield"]["code"] == subfield:
return field["subfield"]["content"]
except (KeyError, TypeError):
pass
return None
def get_marc_field(work_data, tag, subfield):
"""
Loop through all the MARC records in work metadata looking for a tag/subfield.
If found, return the subfield value.
"""
if "marcData" in work_data and work_data["marcData"]:
for record in work_data["marcData"]["record"]:
content = find_field_content(record, tag, subfield)
if content:
return content
return None
def format_bytes(size):
"""
Format bytes as a human-readable string
"""
# 2**10 = 1024
power = 2**10
n = 0
power_labels = {0: "", 1: "K", 2: "M", 3: "G", 4: "T"}
while size > power:
size /= power
n += 1
return size, power_labels[n] + "B"
def get_publication_details(work_data):
"""
Get MARC values for publication details and combine into a single string.
"""
parts = []
for code in ["a", "b", "c"]:
value = get_marc_field(work_data, 260, code)
if value:
parts.append(str(value))
return " ".join(parts)
def get_map_data(work_data):
"""
Look for file size information in the embedded data
"""
map_data = {}
width = None
height = None
num_bytes = None
try:
# Make sure there's a downloadable version
if (
work_data.get("accessConditions") == "Unrestricted"
and "copies" in work_data
):
for copy in work_data["copies"]:
# Get the pixel dimensions
if "technicalmetadata" in copy:
width = copy["technicalmetadata"].get("width")
height = copy["technicalmetadata"].get("height")
# Get filesize in bytes
elif (
copy["copyrole"] in ["m", "o", "i", "fd"]
and copy["access"] == "true"
):
num_bytes = copy.get("filesize")
copy_role = copy["copyrole"]
if width and height and num_bytes:
size, unit = format_bytes(num_bytes)
# Convert bytes to something human friendly
map_data["filesize_string"] = "{:.2f}{}".format(size, unit)
map_data["filesize"] = num_bytes
map_data["width"] = width
map_data["height"] = height
map_data["copy_role"] = copy_role
except AttributeError:
pass
return map_data
def get_maps():
"""
Harvest metadata about maps.
"""
url = "https://api.trove.nla.gov.au/v2/result"
maps = []
params = {
"q": '"nla.obj-"',
"zone": "map",
"l-availability": "y",
"l-format": "Map/Single map",
"bulkHarvest": "true", # Needed to maintain a consistent order across requests
"key": api_key,
"n": 100,
"encoding": "json",
}
start = "*"
total = get_total_results(params)
with tqdm(total=total) as pbar:
while start:
params["s"] = start
response = s.get(url, params=params)
data = response.json()
# If there's a nextStart value, use it to request the next page of results
try:
start = data["response"]["zone"][0]["records"]["nextStart"]
except KeyError:
start = None
for work in tqdm(
data["response"]["zone"][0]["records"]["work"], leave=False
):
# Check to see if there's a link to a digital version
try:
fulltext_url = get_fulltext_url(work["identifier"])
except KeyError:
pass
else:
if fulltext_url:
work_data = get_work_data(fulltext_url)
map_data = get_map_data(work_data)
obj_id = re.search(r"(nla\.obj\-\d+)", fulltext_url).group(1)
try:
contributors = "|".join(work.get("contributor"))
except TypeError:
contributors = work.get("contributor")
# Get basic metadata
# You could add more work data here
# Check the Trove API docs for work record structure
map_data["title"] = work["title"]
map_data["url"] = fulltext_url
map_data["work_url"] = work.get("troveUrl")
map_data["identifier"] = obj_id
map_data["date"] = work.get("issued")
map_data["creators"] = contributors
map_data["publication"] = get_publication_details(work_data)
map_data["extent"] = work_data.get("extent")
# I think the copyright status scraped from the page (below) is more likely to be accurate
# map_data["copyright_policy"] = work_data.get("copyrightPolicy")
map_data["copyright_status"] = get_copyright_status(
url=fulltext_url
)
map_data["scale"] = get_marc_field(work_data, 255, "a")
map_data["coordinates"] = get_marc_field(work_data, 255, "c")
maps.append(map_data)
# print(map_data)
if not response.from_cache:
time.sleep(0.2)
pbar.update(100)
return maps
maps = get_maps()
# Convert to dataframe
# Convert dtypes converts numbers to integers rather than floats
df = pd.DataFrame(maps).convert_dtypes()
# Reorder columns
df = df[
[
"identifier",
"title",
"url",
"work_url",
"date",
"creators",
"publication",
"extent",
"copyright_status",
"scale",
"coordinates",
"filesize_string",
"filesize",
"width",
"height",
"copy_role",
]
]
df.head()
# Save to CSV
csv_file = f"single_maps_{datetime.datetime.now().strftime('%Y%m%d')}.csv"
df.to_csv(csv_file, index=False)
display(FileLink(csv_file))
# Reload data from CSV if necessary
df = pd.read_csv(
"https://raw.githubusercontent.com/GLAM-Workbench/trove-maps-data/main/single_maps_20230131.csv"
)
How many digitised maps are available?
print("{:,} maps".format(df.shape[0]))
33,161 maps
How many of the maps have high-resolution downloads?
df.loc[df["filesize"].notnull()].shape
(29190, 16)
What are the copy_role values?
df["copy_role"].value_counts()
m    28809
i      355
o       26
Name: copy_role, dtype: int64
How much map data is available for download?
size, unit = format_bytes(df["filesize"].sum())
print("{:.2f}{}".format(size, unit))
13.29TB
What's the copyright status of the maps?
df["copyright_status"].value_counts()
Out of Copyright            24490
In Copyright                 7573
Edition Out of Copyright      625
Copyright Undetermined        305
Copyright Uncertain           111
Unknown                        17
Edition In Copyright            4
Name: copyright_status, dtype: int64
Let's show the copyright status as a chart...
counts = df["copyright_status"].value_counts().to_frame().reset_index()
counts.columns = ["status", "count"]
alt.Chart(counts).mark_bar().encode(
y="status:N", x="count", tooltip="count"
).properties(height=200)
Let's look at the sizes of the download files. To make this easier we'll divide the filesizes into ranges (bins) and count the number of files in each range.
# Convert bytes to mb
df["mb"] = df["filesize"] / 2**10 / 2**10
# Create bins (mostly 500MB wide) and count the number of files in each bin
sizes = (
pd.cut(df["mb"], bins=[0, 500, 1000, 1500, 2000, 3000, 3500])
.value_counts()
.to_frame()
.reset_index()
)
sizes.columns = ["mb", "count"]
# Convert intervals to strings for display in chart
sizes["mb"] = sizes["mb"].astype(str)
sizes
  | mb | count |
---|---|---|
0 | (0, 500] | 16007 |
1 | (500, 1000] | 10208 |
2 | (1000, 1500] | 2574 |
3 | (1500, 2000] | 312 |
4 | (2000, 3000] | 78 |
5 | (3000, 3500] | 11 |
alt.Chart(sizes).mark_bar().encode(
x=alt.X("mb:N", sort=None), y="count:Q", tooltip="count:Q"
).properties(width=400)
So while most are less than 500MB, more than 10,000 are between 0.5 and 1GB!
What's the biggest file available for download?
df.loc[df["filesize"].idxmax()]
identifier                                          nla.obj-591001246
title               Map of the City of Rangoon and suburbs 1928-29...
url                               http://nla.gov.au/nla.obj-591001246
work_url                      https://trove.nla.gov.au/work/182743876
date                                                             1932
creators                                   Geological Survey of India
publication                                                       NaN
extent              1 map on 4 sheets : colour ; 154 x 126 cm, sheets
copyright_status                                     Out of Copyright
scale                                Scale 1:12,000. 1 in. = 1000 ft.
coordinates                  (E 96°06ʹ--E 96°13ʹ/N 16°53ʹ--N 16°44ʹ).
filesize_string                                                3.38GB
filesize                                                 3623879488.0
width                                                         31769.0
height                                                        38023.0
copy_role                                                           m
mb                                                        3456.000793
Name: 5046, dtype: object
Let's list all the downloads greater than 3GB.
df.loc[(df["filesize"] / 2**10 / 2**10 / 2**10) > 3]
  | identifier | title | url | work_url | date | creators | publication | extent | copyright_status | scale | coordinates | filesize_string | filesize | width | height | copy_role | mb |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2424 | nla.obj-2567709383 | Map of the coastal plain of British Guiana | https://nla.gov.au/nla.obj-2567709383 | https://trove.nla.gov.au/work/152215030 | 1955 | Bleackley, D. (David) | [S.l.] : Geological Survey of British Guiana, ... | 1 map : col. ; 88 x 205 cm. | In Copyright | Scale [ca. 1:143,000]. | (W 60°00ʹ--W 57°00ʹ/N 9°00ʹ--N 6°00ʹ). | 3.08GB | 3.305391e+09 | 49731.0 | 22155.0 | m | 3152.266552 |
5046 | nla.obj-591001246 | Map of the City of Rangoon and suburbs 1928-29... | http://nla.gov.au/nla.obj-591001246 | https://trove.nla.gov.au/work/182743876 | 1932 | Geological Survey of India | NaN | 1 map on 4 sheets : colour ; 154 x 126 cm, sheets | Out of Copyright | Scale 1:12,000. 1 in. = 1000 ft. | (E 96°06ʹ--E 96°13ʹ/N 16°53ʹ--N 16°44ʹ). | 3.38GB | 3.623879e+09 | 31769.0 | 38023.0 | m | 3456.000793 |
6038 | nla.obj-3009772762 | Shqipëria, hartë fiziko-politike : shkalla 1... | https://nla.gov.au/nla.obj-3009772762 | https://trove.nla.gov.au/work/191812727 | 1965 | Samimi, Ergjin | NaN | 1 map on 3 sheets : color ; 173 x 91 cm, sheet... | In Copyright | Scale 1:200,000. 1 cm to 2 km ; | (E 18°58ʹ--E 21°12ʹ/N 42°40ʹ--N 39°35ʹ). | 3.04GB | 3.266078e+09 | 23106.0 | 47117.0 | m | 3114.774906 |
7442 | nla.obj-568387103 | Peta geologi teknik daerah Jakarta - Bogor : E... | http://nla.gov.au/nla.obj-568387103 | https://trove.nla.gov.au/work/20208553 | 1970 | Indonesia. Direktorat Geologi | NaN | 1 map : colour ; 157 x 107 cm | In Copyright | Scale 1:50,000 | (E 106°33'00"--E 106°59'00"/S 5°59'00"--S 6°38... | 3.05GB | 3.279211e+09 | 26384.0 | 41429.0 | m | 3127.298904 |
7916 | nla.obj-400826638 | Nyūginia-tō zenzu / Taiwan Sōtokufu Gaijibu... | http://nla.gov.au/nla.obj-400826638 | https://trove.nla.gov.au/work/205481810 | 1942 | Taiwan | NaN | 1 map on 4 sheets : colour ; 172 x 99 cm | Out of Copyright | Scale 1:5,000,000 ; | (E 126°00ʹ--E 156°00ʹ/N 4°00ʹ--S 12°00ʹ). | 3.04GB | 3.264456e+09 | 42659.0 | 25508.0 | m | 3113.228321 |
11317 | nla.obj-568387099 | Geological map of Djawa and Madura / compiled ... | http://nla.gov.au/nla.obj-568387099 | https://trove.nla.gov.au/work/218208895 | 1963 | Indonesia. Direktorat Geologi | NaN | 1 map : colour ; 78 x 216 cm. | In Copyright | Scale 1:500,000 | (E 104°58ʹ28ʺ--E 113°98ʹ28ʺ/S 5°30ʹ00ʺ--S 9°00... | 3.08GB | 3.311802e+09 | 52593.0 | 20990.0 | m | 3158.380127 |
14297 | nla.obj-1954049619 | A new chart of the South Pacific Ocean, includ... | https://nla.gov.au/nla.obj-1954049619 | https://trove.nla.gov.au/work/237421392 | 1849-1857 | James Imray and Son | NaN | 1 map ; 96.4 x 183.0 cm | Edition Out of Copyright | Scale approximately 1:11,000,000 at the equator | (E 111°--W 60°/N 20°--S 60°). | 3.00GB | 3.223027e+09 | 44606.0 | 24085.0 | m | 3073.717865 |
14717 | nla.obj-2618718155 | Proposed plan for the site for the federal cap... | https://nla.gov.au/nla.obj-2618718155 | https://trove.nla.gov.au/work/239126400 | 1911 | Wilson, George, died 1923 | NaN | 1 map : colour ; 141 x 141 cm | Out of Copyright | Scale 1:4,800 ; | (E 149°08'/S 35°18'). | 3.12GB | 3.344969e+09 | 33600.0 | 33184.0 | m | 3190.011211 |
14887 | nla.obj-2824965115 | Map of the mandated territory of New Guinea / ... | https://nla.gov.au/nla.obj-2824965115 | https://trove.nla.gov.au/work/239997009 | 1925 | Krahe, R. E. | NaN | 1 map : transparent architectural linen ; 210 ... | In Copyright | Scale 1:1,000,000 | (E 140°50'00"--E 159°41'00"/S 0°33'00"--S 11°5... | 3.37GB | 3.622362e+09 | 53028.0 | 22770.0 | m | 3454.553661 |
The widest image?
df.loc[df["width"].idxmax()]
identifier                                            nla.obj-636346192
title                 Land status petroleum mining agreement in resp...
url                                 http://nla.gov.au/nla.obj-636346192
work_url                        https://trove.nla.gov.au/work/230363372
date                                                               1968
creators                                 Brunei Shell Petroleum Company
publication                                                         NaN
extent                                              1 map ; 286 x 58 cm
copyright_status                                           In Copyright
scale                                                    Scale 1:10,000
coordinates           (E 114°09ʹ53ʺ--E 114°23ʹ34ʺ/N 4°38ʹ42ʺ--N 4°32...
filesize_string                                                  2.80GB
filesize                                                   3008938460.0
width                                                           68453.0
height                                                          14652.0
copy_role                                                             m
mb                                                          2869.547329
Name: 13113, dtype: object
The tallest image?
df.loc[df["height"].idxmax()]
identifier                                           nla.obj-2824964225
title                 Traverse of the Ramu River, navigated by the "...
url                               https://nla.gov.au/nla.obj-2824964225
work_url                        https://trove.nla.gov.au/work/240049759
date                                                          1940-1945
creators                     Stanley, Evan R. (Evan Richard), 1885-1924
publication                                                         NaN
extent                     1 map : on architectural linen ; 410 x 76 cm
copyright_status                                           In Copyright
scale                                                    Scale 1:31,760
coordinates                    (E 144°35'--E 144°50'/S 4°01'--S 5°11').
filesize_string                                                  2.85GB
filesize                                                   3057135688.0
width                                                           13840.0
height                                                          73630.0
copy_role                                                             m
mb                                                          2915.511787
Name: 14904, dtype: object
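To finish, the undocumented download tricks described at the start of this notebook can be wrapped in a couple of small helpers. These aren't part of the harvest above, just a sketch of how you might assemble preview and TIFF urls from an identifier and a copy_role value.

```python
def preview_url(identifier, width=400):
    """Build the url for a resized preview image of a digitised map."""
    return f"http://nla.gov.au/{identifier}/image?wid={width}"


def tiff_url(identifier, copy_role="m"):
    """Build the url for the downloadable high-resolution TIFF."""
    return f"http://nla.gov.au/{identifier}/{copy_role}"


# For example, a small preview of the tallest map found above
print(preview_url("nla.obj-2824964225"))
```

In the notebook you could then fetch the image with the cached session, eg s.get(preview_url("nla.obj-2824964225")).content – but remember to check the filesize field before requesting any TIFFs.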
Created by Tim Sherratt for the GLAM Workbench.
Work on this notebook was originally supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.