New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
Systems supporting the Memento protocol provide machine-readable information about web archive captures, even if other APIs are not available. In this notebook we'll look at the way the Memento protocol is supported across five web archive repositories – the UK Web Archive, the National Library of Australia, the National Library of New Zealand, the Internet Archive, and the UK Government Web Archive. In particular we'll examine:
Notebooks using Timegates or Timemaps to access capture data include:
import json
import re
import arrow
import requests
# Alternatively use the python Memento client
# These are the repositories we'll be using
TIMEGATES = {
"awa": "https://web.archive.org.au/awa/",
"nzwa": "https://ndhadeliver.natlib.govt.nz/webarchive/",
"ukwa": "https://www.webarchive.org.uk/wayback/archive/",
"ia": "https://web.archive.org/web/",
"ukgwa": "https://webarchive.nationalarchives.gov.uk/ukgwa/"
}
Timegates let you query a web archive for the capture closest to a specific date. You do this by supplying your target date as the Accept-Datetime
value in the headers of your request.
For example, if you wanted to query the Australian Web Archive to find the version of http://nla.gov.au/
that was captured as close as possible to 1 January 2010, you'd set the Accept-Datetime
header to 'Fri, 01 Jan 2010 01:00:00 GMT' and request the url:
https://web.archive.org.au/awa/http://nla.gov.au/
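The Accept-Datetime value uses the standard HTTP date format. As a minimal sketch (standard library only, and assuming the target date is already expressed in UTC), the header value used in this example could be built like this:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Target date: 1 January 2010, 01:00 UTC
target = datetime(2010, 1, 1, 1, 0, 0, tzinfo=timezone.utc)

# usegmt=True writes the timezone as "GMT" rather than "+0000"
accept_datetime = format_datetime(target, usegmt=True)

# This dictionary can be passed as the headers of a request to a Timegate
headers = {"Accept-Datetime": accept_datetime}
print(headers)
```

The functions later in this notebook use arrow to do the same job, with the added step of converting from a local timezone.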
A get
request will return the captured page, but if all you want is the url of the archived page you can use a head
request and extract the information you need from the response headers. Try this:
response = requests.head(
"https://web.archive.org.au/awa/http://nla.gov.au/",
headers={"Accept-Datetime": "Fri, 01 Jan 2010 01:00:00 GMT"},
)
response.headers
{'Server': 'nginx', 'Date': 'Thu, 23 Mar 2023 15:03:12 GMT', 'Content-Length': '0', 'Connection': 'keep-alive', 'Location': 'https://web.archive.org.au/awa/20100205144751/http://www.nla.gov.au/', 'Link': '<http://www.nla.gov.au/>; rel="original", <https://web.archive.org.au/awa/http://www.nla.gov.au/>; rel="timegate", <https://web.archive.org.au/awa/timemap/link/http://www.nla.gov.au/>; rel="timemap"; type="application/link-format", <https://web.archive.org.au/awa/20100205144751mp_/http://www.nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:47:51 GMT"', 'Vary': 'accept-datetime'}
The request above returns the following headers:
{
'Server': 'nginx',
'Date': 'Wed, 06 May 2020 04:34:50 GMT',
'Content-Length': '0', 'Connection': 'keep-alive',
'Location': 'https://web.archive.org.au/awa/20100205144227/http://nla.gov.au/',
'Link': '<http://nla.gov.au/>; rel="original", <https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", <https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; rel="timemap"; type="application/link-format", <https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"',
'Vary': 'accept-datetime'
}
The Link
parameter contains the Memento information. You can see that it's actually providing information on four types of link:
original
url (ie the url that was archived) – <http://nla.gov.au/>
timegate
for the harvested url (which is what we just used) – <https://web.archive.org.au/awa/http://nla.gov.au/>
timemap
for the harvested url (we'll look at this below) – <https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>
memento
– <https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>
The memento
link is the capture closest in time to the date we requested. In this case there's only about a month's difference, but of course this will depend on how frequently a url is captured. Opening the link will display the capture in the web archive. As we'll see below, some systems provide additional links such as first memento
, last memento
, prev memento
, and next memento
.
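To make the structure of the Link header more concrete, here's a rough sketch that splits a header value like the one above into its separate links using regular expressions. The function name parse_link_header is just for illustration, and the sketch assumes every parameter value is quoted. The functions in this notebook rely on the requests library's response.links instead.

```python
import re

# One link: a url in angle brackets followed by zero or more ;name="value" parameters
LINK_PATTERN = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_PATTERN = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value):
    """Return a dictionary of links keyed by their rel value (illustrative helper)."""
    links = {}
    for url, params in LINK_PATTERN.findall(value):
        attrs = dict(PARAM_PATTERN.findall(params))
        rel = attrs.get("rel")
        if rel:
            links[rel] = {"url": url, **attrs}
    return links


# The Link value from the response headers shown above
link_header = (
    '<http://nla.gov.au/>; rel="original", '
    '<https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", '
    '<https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; rel="timemap"; type="application/link-format", '
    '<https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"'
)

links = parse_link_header(link_header)
for rel, details in links.items():
    print(rel, details["url"])
```

Note that the datetime value itself contains a comma, so simply splitting the header on commas would break the memento link in two.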
Here are some functions to query a timegate in one of the five systems we're exploring. We'll use them to compare the results we get from each.
def format_date_for_headers(iso_date, tz):
    """
    Convert an ISO date (YYYY-MM-DD) to a datetime at noon in the specified timezone.
    Convert the datetime to UTC and format as required by Accept-Datetime headers:
    eg Fri, 23 Mar 2007 01:00:00 GMT
    """
    local = arrow.get(f"{iso_date} 12:00:00 {tz}", "YYYY-MM-DD HH:mm:ss ZZZ")
    gmt = local.to("utc")
    return f'{gmt.format("ddd, DD MMM YYYY HH:mm:ss")} GMT'


def parse_links_from_headers(response):
    """
    Extract original, timegate, timemap, and memento links from 'Link' header.
    """
    links = response.links
    return {k: v["url"] for k, v in links.items()}


def format_timestamp(timestamp, date_format="YYYY-MM-DD HH:mm:ss"):
    """
    Convert a 14-digit capture timestamp (YYYYMMDDHHmmss) to the specified date format.
    """
    return arrow.get(timestamp, "YYYYMMDDHHmmss").format(date_format)
def test_timegate(
    timegate,
    url,
    date=None,
    tz="Australia/Canberra",
    request_type="head",
    allow_redirects=True,
):
    headers = {}
    if date:
        formatted_date = format_date_for_headers(date, tz)
        headers["Accept-Datetime"] = formatted_date
    # Note that you don't get a timegate response if you leave off the trailing slash
    tg_url = (
        f"{TIMEGATES[timegate]}{url}/"
        if not url.endswith("/")
        else f"{TIMEGATES[timegate]}{url}"
    )
    print(tg_url)
    if request_type == "head":
        response = requests.head(
            tg_url, headers=headers, allow_redirects=allow_redirects
        )
    else:
        response = requests.get(
            tg_url, headers=headers, allow_redirects=allow_redirects
        )
    response.raise_for_status()
    # print(response.headers)
    return parse_links_from_headers(response)
A HEAD
request that follows redirects returns no results
result = test_timegate("awa", "http://www.nla.gov.au")
# Test for expected result
assert result == {}
result
https://web.archive.org.au/awa/http://www.nla.gov.au/
{}
A HEAD
request that doesn't follow redirects returns results as expected
result = test_timegate("awa", "http://www.nla.gov.au", allow_redirects=False)
# Test for expected result
assert "memento" in result
result
https://web.archive.org.au/awa/http://www.nla.gov.au/
{'original': 'https://www.nla.gov.au/', 'timegate': 'https://web.archive.org.au/awa/https://www.nla.gov.au/', 'timemap': 'https://web.archive.org.au/awa/timemap/link/https://www.nla.gov.au/', 'memento': 'https://web.archive.org.au/awa/20230303002359mp_/https://www.nla.gov.au/'}
A query without an Accept-Datetime
value returns a recent capture.
result = test_timegate("awa", "http://www.nla.gov.au", allow_redirects=False)
# Test for expected result
assert "memento" in result
result
https://web.archive.org.au/awa/http://www.nla.gov.au/
{'original': 'https://www.nla.gov.au/', 'timegate': 'https://web.archive.org.au/awa/https://www.nla.gov.au/', 'timemap': 'https://web.archive.org.au/awa/timemap/link/https://www.nla.gov.au/', 'memento': 'https://web.archive.org.au/awa/20230303002359mp_/https://www.nla.gov.au/'}
A query with an Accept-Datetime
value of 1 January 2002 returns a capture from 20 January 2002.
result = test_timegate(
"awa", "http://www.education.gov.au/", date="2002-01-01", allow_redirects=False
)
# Test for expected result
assert "memento" in result
assert "20020120" in result["memento"]
result
https://web.archive.org.au/awa/http://www.education.gov.au/
{'original': 'http://www.education.gov.au:80/', 'timegate': 'https://web.archive.org.au/awa/http://www.education.gov.au:80/', 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://www.education.gov.au:80/', 'memento': 'https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'}
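The capture date is embedded in the memento url as a 14-digit timestamp. As a small sketch (the function name memento_datetime is only for illustration, and the pattern assumes the url layout seen in the results here), you could pull it out and see how far the capture is from the date you asked for:

```python
import re
from datetime import datetime


def memento_datetime(memento_url):
    """Return the capture time embedded in a memento url, or None if there isn't one."""
    # Look for a 14-digit timestamp, optionally followed by a modifier such as 'mp_'
    match = re.search(r"/(\d{14})(?:[a-z]{2}_)?/", memento_url)
    if not match:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


memento = "https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/"
captured = memento_datetime(memento)

# The requested date: noon on 1 January 2002 in Canberra, expressed in UTC
requested = datetime(2002, 1, 1, 1, 0, 0)
gap = captured - requested
print(captured.isoformat(), gap.days)
```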
Using a GET
rather than a HEAD
request returns no Memento information when redirects are followed.
result = test_timegate(
"awa", "http://www.education.gov.au/", date="2002-01-01", request_type="get"
)
# Test for expected result
assert result == {}
result
https://web.archive.org.au/awa/http://www.education.gov.au/
{}
Using a GET
rather than a HEAD
request returns Memento information when redirects are not followed.
result = test_timegate(
"awa",
"http://www.education.gov.au/",
date="2002-01-01",
request_type="get",
allow_redirects=False,
)
# Test for expected result
assert "memento" in result
result
https://web.archive.org.au/awa/http://www.education.gov.au/
{'original': 'http://www.education.gov.au:80/', 'timegate': 'https://web.archive.org.au/awa/http://www.education.gov.au:80/', 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://www.education.gov.au:80/', 'memento': 'https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'}
Changing whether or not redirects are followed has no effect on any of these responses.
A query without an Accept-Datetime
returns a recent capture.
result = test_timegate("nzwa", "http://natlib.govt.nz")
# Test for expected result
assert "memento" in result
result
A query with an Accept-Datetime
value of 1 January 2005 returns a memento
from July 2004.
result = test_timegate("nzwa", "http://natlib.govt.nz", date="2005-01-01")
# Test for expected result
assert "memento" in result
assert "20040711" in result["memento"]
result
A GET
request returns the same results as a HEAD
request.
result_head = test_timegate("nzwa", "http://natlib.govt.nz", date="2005-01-01")
result_get = test_timegate(
"nzwa", "http://natlib.govt.nz", date="2005-01-01", request_type="get"
)
# Test for expected result
assert result_head == result_get
result_get
Using a HEAD
request that follows redirects returns results as expected.
result = test_timegate("ia", "http://discontents.com.au")
# Test for expected result
assert "memento" in result
# IA responses have additional fields
assert "first memento" in result
result
https://web.archive.org/web/http://discontents.com.au/
{'original': 'http://discontents.com.au/', 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/', 'timegate': 'https://web.archive.org/web/http://discontents.com.au/', 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/', 'prev memento': 'https://web.archive.org/web/20230313181957/https://discontents.com.au/', 'memento': 'https://web.archive.org/web/20230318003745/http://discontents.com.au/', 'last memento': 'https://web.archive.org/web/20230318003745/http://discontents.com.au/'}
Using a HEAD
request returns no Memento information if redirects are not followed.
result = test_timegate("ia", "http://discontents.com.au", allow_redirects=False)
# Test for expected result
assert result == {}
result
https://web.archive.org/web/http://discontents.com.au/
{}
A query without an Accept-Datetime value returns a memento and also includes first memento, prev memento, and last memento links. A next memento link appears as well when a later capture exists.
result = test_timegate("ia", "http://discontents.com.au")
# Test for expected result
assert "memento" in result
# IA responses have additional fields
assert "first memento" in result
result
https://web.archive.org/web/http://discontents.com.au/
{'original': 'http://discontents.com.au/', 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/', 'timegate': 'https://web.archive.org/web/http://discontents.com.au/', 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/', 'prev memento': 'https://web.archive.org/web/20220323201952/http://www.discontents.com.au/', 'memento': 'https://web.archive.org/web/20220331081122/http://discontents.com.au/', 'last memento': 'https://web.archive.org/web/20220331081122/http://discontents.com.au/'}
A query with an Accept-Datetime
value of 1 January 2010 returns a memento
from 9 February 2010.
result = test_timegate("ia", "http://discontents.com.au", date="2010-01-01")
# Test for expected result
assert "memento" in result
assert "20100209" in result["memento"]
result
https://web.archive.org/web/http://discontents.com.au/
{'original': 'http://discontents.com.au:80/', 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au:80/', 'timegate': 'https://web.archive.org/web/http://discontents.com.au:80/', 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/', 'prev memento': 'https://web.archive.org/web/20091030053520/http://discontents.com.au/', 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au:80/', 'next memento': 'https://web.archive.org/web/20100523101442/http://discontents.com.au:80/', 'last memento': 'https://web.archive.org/web/20220331081122/http://discontents.com.au/'}
GET
requests return different results if redirects are not followed.
result = test_timegate(
"ia", "http://discontents.com.au", date="2010-01-01", request_type="get"
)
result_no_redirects = test_timegate(
"ia",
"http://discontents.com.au",
date="2010-01-01",
request_type="get",
allow_redirects=False,
)
# Test for expected result
assert result != result_no_redirects
result_no_redirects
https://web.archive.org/web/http://discontents.com.au/ https://web.archive.org/web/http://discontents.com.au/
{'original': 'http://discontents.com.au/', 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au/', 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/'}
Changing whether or not redirects are followed has no effect on any of these responses.
A query without an Accept-Datetime
value returns a recent capture.
result = test_timegate("ukwa", "http://bl.uk")
# Test for expected result
assert "memento" in result
result
https://www.webarchive.org.uk/wayback/archive/http://bl.uk/
{'original': 'https://www.bl.uk/', 'timegate': 'https://www.webarchive.org.uk/wayback/archive/https://www.bl.uk/', 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/https://www.bl.uk/', 'memento': 'https://www.webarchive.org.uk/wayback/archive/20230319105859mp_/https://www.bl.uk/'}
A query with an Accept-Datetime
value of 1 January 2006 returns a memento
from 4 May 2004.
result = test_timegate("ukwa", "http://bl.uk", date="2006-01-01")
# Test for expected result
assert "memento" in result
assert "20040504" in result["memento"]
result
https://www.webarchive.org.uk/wayback/archive/http://bl.uk/
{'original': 'http://www.bl.uk/', 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://www.bl.uk/', 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://www.bl.uk/', 'memento': 'https://www.webarchive.org.uk/wayback/archive/20040504230000mp_/http://www.bl.uk/'}
A GET
request returns the same results as a HEAD
request.
result_head = test_timegate("ukwa", "http://bl.uk", date="2006-01-01")
result_get = test_timegate(
"ukwa", "http://bl.uk", date="2006-01-01", request_type="get"
)
# Test for expected result
assert result_head == result_get
result_get
https://www.webarchive.org.uk/wayback/archive/http://bl.uk/ https://www.webarchive.org.uk/wayback/archive/http://bl.uk/
{'original': 'http://www.bl.uk/', 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://www.bl.uk/', 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://www.bl.uk/', 'memento': 'https://www.webarchive.org.uk/wayback/archive/20040504230000mp_/http://www.bl.uk/'}
Changing whether or not redirects are followed has no effect on any of these responses.
A query without an Accept-Datetime
value returns a recent capture.
result = test_timegate("ukgwa", "https://www.nationalarchives.gov.uk/")
# Test for expected result
assert "memento" in result
result
https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/
{'original': 'https://www.nationalarchives.gov.uk/', 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/', 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/https://www.nationalarchives.gov.uk/', 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20230311073241mp_/https://www.nationalarchives.gov.uk/'}
A query with an Accept-Datetime
value of 1 January 2006 returns a memento
from 13 February 2006.
result = test_timegate("ukgwa", "https://www.nationalarchives.gov.uk/", date="2006-01-01")
# Test for expected result
assert "memento" in result
assert "20060213" in result["memento"]
result
https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/
{'original': 'http://www.nationalarchives.gov.uk/', 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/http://www.nationalarchives.gov.uk/', 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/http://www.nationalarchives.gov.uk/', 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20060213205514mp_/http://www.nationalarchives.gov.uk/'}
A GET
request returns the same results as a HEAD
request.
result_head = test_timegate("ukgwa", "https://www.nationalarchives.gov.uk/", date="2006-01-01")
result_get = test_timegate(
"ukgwa", "https://www.nationalarchives.gov.uk/", date="2006-01-01", request_type="get")
# Test for expected result
assert result_head == result_get
result_get
https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/ https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/
{'original': 'http://www.nationalarchives.gov.uk/', 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/http://www.nationalarchives.gov.uk/', 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/http://www.nationalarchives.gov.uk/', 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20060213205514mp_/http://www.nationalarchives.gov.uk/'}
As you can see above, there are a couple of significant differences in the way that Timegates behave across the five repositories.
IA responses include additional links (first memento, last memento, prev memento, and next memento).
Results are the same for HEAD or GET requests with UKWA, NZWA, and UKGWA, but IA and AWA behave differently depending on the type of request and whether redirects are followed. To get results from either a HEAD or GET request, AWA requests should not follow redirects. To get results from a HEAD request, IA requests should follow redirects. GET requests to IA will return results whether or not redirects are allowed; however, those results differ.
Here's some code to smooth out the differences between systems and return Memento data as a Python dictionary. Specifically:
It follows redirects for IA requests, but not for requests to the other repositories.
If there's no memento value in the response (as sometimes happens with NLNZ), it looks for a first, last, prev, or next value instead.
def query_timegate(timegate, url, date=None, tz="Australia/Canberra"):
    """
    Query the specified repository for a Memento.
    """
    headers = {}
    if date:
        formatted_date = format_date_for_headers(date, tz)
        headers["Accept-Datetime"] = formatted_date
    # Note that you don't get a timegate response if you leave off the trailing slash, but extras don't hurt!
    tg_url = (
        f"{TIMEGATES[timegate]}{url}/"
        if not url.endswith("/")
        else f"{TIMEGATES[timegate]}{url}"
    )
    # print(tg_url)
    # IA only works if redirects are followed -- this defaults to False with HEAD requests...
    if timegate == "ia":
        allow_redirects = True
    else:
        allow_redirects = False
    response = requests.head(tg_url, headers=headers, allow_redirects=allow_redirects)
    response.raise_for_status()
    return parse_links_from_headers(response)
def get_memento(timegate, url, date=None, tz="Australia/Canberra"):
    """
    Return the url of a Memento, or None if nothing suitable is found.
    If there's no memento in the results, look for an alternative.
    """
    links = query_timegate(timegate, url, date, tz)
    # NLNZ doesn't always seem to return a Memento, so we'll build in some fuzziness
    # The relations are checked in order of preference
    for rel in ["memento", "prev memento", "next memento", "last memento", "first memento"]:
        if rel in links:
            return links[rel]
    # Either there were no links at all, or none of the expected relations were present
    return None
Now we can request a Memento from any of the five repositories and get back the results as a Python dictionary. You can see this code in action in the Get full page screenshots from archived web pages notebook.
result = query_timegate("ukgwa", "https://www.nationalarchives.gov.uk/", date="2015-01-01")
# Test for expected result
assert "memento" in result
result
{'original': 'http://nationalarchives.gov.uk/', 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/http://nationalarchives.gov.uk/', 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/http://nationalarchives.gov.uk/', 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20141223091614mp_/http://nationalarchives.gov.uk/'}
Or we can just get the url for a Memento (falling back to alternative values if memento
is missing).
result = get_memento("nzwa", "http://natlib.govt.nz")
# Test for expected result
assert result.startswith("https://ndhadeliver.natlib.govt.nz/webarchive/")
result
'https://ndhadeliver.natlib.govt.nz/webarchive/20220801082654mp_/http://natlib.govt.nz/'
Memento Timemaps provide machine-processable lists of web page captures from a particular archive. They are available from both OpenWayback and Pywb systems, though there are some differences. The Pywb documentation notes that the following formats are available:
Timemaps are requested using a url with the following format:
http://[address.of.archive]/[collection]/timemap/[format]/[web page url]
So if you wanted to query the Australian Web Archive to get a list of captures in JSON format from http://nla.gov.au/ you'd use this url:
https://web.archive.org.au/awa/timemap/json/http://nla.gov.au/
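As a quick illustration of that url pattern, here's a sketch of a helper that assembles a Timemap url from its parts. The name build_timemap_url and the set of accepted formats (link, json, and cdxj, the three used in the examples in this notebook) are assumptions made for the example.

```python
# Formats requested in the examples in this notebook
TIMEMAP_FORMATS = {"link", "json", "cdxj"}


def build_timemap_url(archive_base, url, fmt="json"):
    """Assemble a Timemap url from the archive's base url, a format, and a page url."""
    if fmt not in TIMEMAP_FORMATS:
        raise ValueError(f"Unexpected Timemap format: {fmt}")
    # Make sure there's exactly one slash between the base and 'timemap'
    base = archive_base if archive_base.endswith("/") else f"{archive_base}/"
    return f"{base}timemap/{fmt}/{url}"


timemap_url = build_timemap_url("https://web.archive.org.au/awa/", "http://nla.gov.au/")
print(timemap_url)
```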
The examples below show how the format and behaviour of Timemaps vary slightly across the five repositories we're interested in.
def get_timemap(timegate, url, format="json"):
    """
    Basic function to get a Timemap for the supplied url.
    """
    tg_url = f"{TIMEGATES[timegate]}timemap/{format}/{url}/"
    response = requests.get(tg_url)
    response.raise_for_status()
    # Show the content-type
    # print(response.headers['content-type'])
    return response.headers["content-type"], response.text
Request a Timemap in link
format. Note that response headers include content-type
of application/link-format
.
content_type, timemap = get_timemap("awa", "http://www.gov.au", "link")
print(content_type)
# Test content type
assert content_type == "application/link-format"
# Show the first 5 lines
print("\n".join(timemap.splitlines()[:5]))
application/link-format <https://web.archive.org.au/awa/timemap/link/http://www.gov.au/>; rel="self"; type="application/link-format"; from="Wed, 06 Dec 2000 21:15:00 GMT", <https://web.archive.org.au/awa/http://www.gov.au/>; rel="timegate", <http://www.gov.au/>; rel="original", <https://web.archive.org.au/awa/20001206211500mp_/http://www.gov.au/>; rel="memento"; datetime="Wed, 06 Dec 2000 21:15:00 GMT"; collection="awa", <https://web.archive.org.au/awa/20010118203600mp_/http://www.gov.au/>; rel="memento"; datetime="Thu, 18 Jan 2001 20:36:00 GMT"; collection="awa",
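A link format Timemap can be turned into a list of captures with a little parsing. The following is only a sketch: the function name mementos_from_link_timemap is made up for the example, and it assumes that every parameter value is quoted, as in the output above.

```python
import re
from email.utils import parsedate_to_datetime

# One entry: a url in angle brackets followed by zero or more ;name="value" parameters
ENTRY_PATTERN = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_PATTERN = re.compile(r'([\w-]+)="([^"]*)"')


def mementos_from_link_timemap(text):
    """Return a list of (datetime, url) tuples for each memento in a link format Timemap."""
    captures = []
    for url, params in ENTRY_PATTERN.findall(text):
        attrs = dict(PARAM_PATTERN.findall(params))
        # rel can hold several values, eg 'first memento'
        if "memento" in attrs.get("rel", "").split() and "datetime" in attrs:
            captures.append((parsedate_to_datetime(attrs["datetime"]), url))
    return captures


# The first five lines of the Timemap shown above
sample = """<https://web.archive.org.au/awa/timemap/link/http://www.gov.au/>; rel="self"; type="application/link-format"; from="Wed, 06 Dec 2000 21:15:00 GMT",
<https://web.archive.org.au/awa/http://www.gov.au/>; rel="timegate",
<http://www.gov.au/>; rel="original",
<https://web.archive.org.au/awa/20001206211500mp_/http://www.gov.au/>; rel="memento"; datetime="Wed, 06 Dec 2000 21:15:00 GMT"; collection="awa",
<https://web.archive.org.au/awa/20010118203600mp_/http://www.gov.au/>; rel="memento"; datetime="Thu, 18 Jan 2001 20:36:00 GMT"; collection="awa","""

captures = mementos_from_link_timemap(sample)
for captured, url in captures:
    print(captured.strftime("%Y-%m-%d %H:%M:%S"), url)
```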
Request a Timemap in json
format. This returns ndjson
(Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include content-type
of text/x-ndjson
.
content_type, timemap = get_timemap(
"awa",
"http://www.aph.gov.au/Senate/committee/eet_ctte/uni_finances/report/index.htm",
"json",
)
print(content_type)
# Test content type
assert content_type == "text/x-ndjson"
# Show the first line
print("\n".join(timemap.splitlines()[:1]))
text/x-ndjson {"urlkey": "au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm", "timestamp": "20031122074837", "url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", "mime": "text/html", "status": "200", "digest": "3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE", "offset": "97170362", "filename": "NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz", "length": "3446", "source": "awa", "source-coll": "awa"}
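Because each line of an ndjson response is a complete JSON object, you can't pass the whole response to json.loads() in one go. A minimal sketch of parsing it line by line (parse_ndjson is just an illustrative name) might look like this:

```python
import json


def parse_ndjson(text):
    """Convert ndjson text into a list of dictionaries, skipping any blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# The first line of the Timemap shown above, followed by a blank line
sample = (
    '{"urlkey": "au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm", "timestamp": "20031122074837", "url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", "mime": "text/html", "status": "200", "digest": "3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE", "offset": "97170362", "filename": "NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz", "length": "3446", "source": "awa", "source-coll": "awa"}\n'
    "\n"
)

records = parse_ndjson(sample)
print(records[0]["timestamp"], records[0]["url"])
```

Note that all the values, including status and length, are strings.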
Request a Timemap in cdxj
format. Note that response headers include content-type
of text/x-cdxj
.
content_type, timemap = get_timemap(
"awa",
"http://www.aph.gov.au/Senate/committee/eet_ctte/uni_finances/report/index.htm",
"cdxj",
)
print(content_type)
# Test content type
assert content_type == "text/x-cdxj"
# Show the first line
print("\n".join(timemap.splitlines()[:1]))
text/x-cdxj au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm 20031122074837 {"url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", "mime": "text/html", "status": "200", "digest": "3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE", "offset": "97170362", "filename": "NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz", "length": "3446", "source": "awa", "source-coll": "awa"}
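Each cdxj line is made up of the url key, the capture timestamp, and a JSON object, separated by spaces. Here's a sketch of one way to unpack a line into a single dictionary (parse_cdxj_line is an illustrative name, and it assumes the layout seen in the output above):

```python
import json


def parse_cdxj_line(line):
    """Split a cdxj line into urlkey, timestamp, and JSON data, returned as one dictionary."""
    # Only split on the first two spaces, as the JSON itself contains spaces
    urlkey, timestamp, payload = line.split(" ", 2)
    record = json.loads(payload)
    record["urlkey"] = urlkey
    record["timestamp"] = timestamp
    return record


# The first line of the Timemap shown above
sample = 'au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm 20031122074837 {"url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", "mime": "text/html", "status": "200", "digest": "3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE", "offset": "97170362", "filename": "NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz", "length": "3446", "source": "awa", "source-coll": "awa"}'

record = parse_cdxj_line(sample)
print(record["urlkey"], record["timestamp"], record["mime"])
```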
Request a Timemap in link
format. Note that response headers include content-type
of application/link-format
.
content_type, timemap = get_timemap("ukwa", "http://bl.uk", "link")
print(content_type)
# Test content type
assert content_type == "application/link-format"
print("\n".join(timemap.splitlines()[:5]))
application/link-format <https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/>; rel="self"; type="application/link-format"; from="Tue, 30 Oct 2001 00:00:19 GMT", <https://www.webarchive.org.uk/wayback/archive/http://bl.uk/>; rel="timegate", <http://bl.uk/>; rel="original", <https://www.webarchive.org.uk/wayback/archive/20011030000019mp_/http://www.bl.uk/>; rel="memento"; datetime="Tue, 30 Oct 2001 00:00:19 GMT"; collection="archive", <https://www.webarchive.org.uk/wayback/archive/20011113000000mp_/http://www.bl.uk/>; rel="memento"; datetime="Tue, 13 Nov 2001 00:00:00 GMT"; collection="archive",
Request a Timemap in json
format. This returns ndjson
(Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include content-type
of text/x-ndjson
.
content_type, timemap = get_timemap("ukwa", "http://bl.uk", "json")
print(content_type)
# Test content type
assert content_type == "text/x-ndjson"
print("\n".join(timemap.splitlines()[:1]))
text/x-ndjson {"urlkey": "uk,bl)/", "timestamp": "20011030000019", "url": "http://www.bl.uk/", "mime": "text/html", "status": "200", "digest": "JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW", "redirect": "-", "robotflags": "-", "length": "0", "offset": "10813988", "filename": "/data/102148/31031347/WARCS/BL-31031347.warc.gz", "load_url": "", "source": "archive", "source-coll": "archive", "access": "allow"}
Request a Timemap in cdxj
format. Note that response headers include content-type
of text/x-cdxj
.
content_type, timemap = get_timemap("ukwa", "http://bl.uk", "cdxj")
print(content_type)
# Test content type
assert content_type == "text/x-cdxj"
print("\n".join(timemap.splitlines()[:1]))
text/x-cdxj uk,bl)/ 20011030000019 {"url": "http://www.bl.uk/", "mime": "text/html", "status": "200", "digest": "JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW", "redirect": "-", "robotflags": "-", "length": "0", "offset": "10813988", "filename": "/data/102148/31031347/WARCS/BL-31031347.warc.gz", "load_url": "", "source": "archive", "source-coll": "archive", "access": "allow"}
Request a Timemap in link
format. Note that response headers include content-type
of application/link-format
.
content_type, timemap = get_timemap("ukgwa", "https://www.nationalarchives.gov.uk/", "link")
print(content_type)
# Test content type
assert content_type == "application/link-format"
print("\n".join(timemap.splitlines()[:5]))
application/link-format <https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/https://www.nationalarchives.gov.uk//>; rel="self"; type="application/link-format"; from="Mon, 20 Oct 2003 01:04:12 GMT", <https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk//>; rel="timegate", <https://www.nationalarchives.gov.uk//>; rel="original", <https://webarchive.nationalarchives.gov.uk/ukgwa/20031020010412mp_/http://www.nationalarchives.gov.uk:80/>; rel="memento"; datetime="Mon, 20 Oct 2003 01:04:12 GMT"; collection="full_zipnum", <https://webarchive.nationalarchives.gov.uk/ukgwa/20040104233258mp_/http://www.nationalarchives.gov.uk/>; rel="memento"; datetime="Sun, 04 Jan 2004 23:32:58 GMT"; collection="full_zipnum",
Request a Timemap in json
format. This returns ndjson
(Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include content-type
of text/x-ndjson
.
content_type, timemap = get_timemap("ukgwa", "https://www.nationalarchives.gov.uk/", "json")
print(content_type)
# Test content type
assert content_type == "text/x-ndjson"
print("\n".join(timemap.splitlines()[:1]))
text/x-ndjson {"urlkey": "uk,gov,nationalarchives)/", "timestamp": "20031020010412", "url": "http://www.nationalarchives.gov.uk:80/", "mime": "text/html", "status": "200", "digest": "U2IC276V3AKMWIJGWWJXCVQ2KZ6AMU5J", "redirect": "-", "robotflags": "-", "length": "951", "offset": "898", "filename": "UKGOV-WEEKLY-010-031019180412-000.warc.gz", "source": "full_zipnum", "source-coll": "full_zipnum", "access": "allow"}
Request a Timemap in cdxj
format. Note that response headers include content-type
of text/x-cdxj
.
content_type, timemap = get_timemap("ukgwa", "https://www.nationalarchives.gov.uk/", "cdxj")
print(content_type)
# Test content type
assert content_type == "text/x-cdxj"
print("\n".join(timemap.splitlines()[:1]))
text/x-cdxj uk,gov,nationalarchives)/ 20031020010412 {"url": "http://www.nationalarchives.gov.uk:80/", "mime": "text/html", "status": "200", "digest": "U2IC276V3AKMWIJGWWJXCVQ2KZ6AMU5J", "redirect": "-", "robotflags": "-", "length": "951", "offset": "898", "filename": "UKGOV-WEEKLY-010-031019180412-000.warc.gz", "source": "full_zipnum", "source-coll": "full_zipnum", "access": "allow"}
Request a Timemap in link
format. Note that response headers include content-type
of application/link-format
.
content_type, timemap = get_timemap("nzwa", "http://natlib.govt.nz", "link")
print(content_type)
# Test content type
assert content_type == "application/link-format"
print("\n".join(timemap.splitlines()[:5]))
application/link-format <https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://natlib.govt.nz/>; rel="self"; type="application/link-format"; from="Sun, 11 Jul 2004 21:32:25 GMT", <https://ndhadeliver.natlib.govt.nz/webarchive/http://natlib.govt.nz/>; rel="timegate", <http://natlib.govt.nz/>; rel="original", <https://ndhadeliver.natlib.govt.nz/webarchive/20040711213225mp_/http://www.natlib.govt.nz/>; rel="memento"; datetime="Sun, 11 Jul 2004 21:32:25 GMT"; collection="webarchive", <https://ndhadeliver.natlib.govt.nz/webarchive/20060704033135mp_/http://www.natlib.govt.nz/>; rel="memento"; datetime="Tue, 04 Jul 2006 03:31:35 GMT"; collection="webarchive",
Request a Timemap in json
format. This returns ndjson (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include content-type of text/x-ndjson.
content_type, timemap = get_timemap("nzwa", "http://natlib.govt.nz", "json")
print(content_type)
# Test content type
assert content_type == "text/x-ndjson"
print("\n".join(timemap.splitlines()[:1]))
text/x-ndjson {"urlkey": "nz,govt,natlib)/", "timestamp": "20040711213225", "url": "http://www.natlib.govt.nz/", "mime": "text/html", "status": "200", "digest": "JV66FPIIX6IJTB42TNHMQDEU5Z3LFBCK", "redirect": "-", "robotflags": "-", "length": "0", "offset": "976", "filename": "V1-FL1645590.arc", "load_url": "http://10.4.1.66:80/nlnzwebarchive_PROD/ap/20040711213225id_/http://www.natlib.govt.nz/", "source": "webarchive", "source-coll": "webarchive"}
Request a Timemap in cdxj
format. Note that response headers include content-type
of text/x-cdxj
.
content_type, timemap = get_timemap("nzwa", "http://natlib.govt.nz", "cdxj")
print(content_type)
# Test content type
assert content_type == "text/x-cdxj"
print("\n".join(timemap.splitlines()[:1]))
text/x-cdxj nz,govt,natlib)/ 20040711213225 {"url": "http://www.natlib.govt.nz/", "mime": "text/html", "status": "200", "digest": "JV66FPIIX6IJTB42TNHMQDEU5Z3LFBCK", "redirect": "-", "robotflags": "-", "length": "0", "offset": "976", "filename": "V1-FL1645590.arc", "load_url": "http://10.4.1.66:80/nlnzwebarchive_PROD/ap/20040711213225id_/http://www.natlib.govt.nz/", "source": "webarchive", "source-coll": "webarchive"}
Request a Timemap in link format. Note that the response headers include a content-type of application/link-format.
content_type, timemap = get_timemap("ia", "http://discontents.com.au", "link")
print(content_type)
# Test content type
assert content_type == "application/link-format"
print("\n".join(timemap.splitlines()[:5]))
application/link-format
<http://www.discontents.com.au:80/>; rel="original",
<https://web.archive.org/web/timemap/link/http://discontents.com.au/>; rel="self"; type="application/link-format"; from="Sun, 06 Dec 1998 01:22:33 GMT",
<https://web.archive.org/web/http://discontents.com.au/>; rel="timegate",
<https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/>; rel="first memento"; datetime="Sun, 06 Dec 1998 01:22:33 GMT",
<https://web.archive.org/web/19981212024410/http://www.discontents.com.au:80/>; rel="memento"; datetime="Sat, 12 Dec 1998 02:44:10 GMT",
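To work with link format directly, each entry can be split into its url and attributes. The following is a rough sketch rather than a complete link-format parser: `parse_link_timemap` and the two regular expressions are hypothetical, and they assume every attribute value is wrapped in double quotes, as in the output above. Note that `rel` can hold more than one space-separated value (for example `first memento`), so the sketch checks for `memento` among the values rather than testing for equality.

```python
import re
from email.utils import parsedate_to_datetime

# <url> followed by any number of ; name="value" attributes
LINK_PATTERN = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_PATTERN = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_timemap(text):
    """Convert a link-format Timemap into a list of dicts, one per link."""
    links = []
    for url, attr_text in LINK_PATTERN.findall(text):
        link = {"url": url}
        link.update(dict(ATTR_PATTERN.findall(attr_text)))
        if "datetime" in link:
            # Memento datetimes use the HTTP date format
            link["datetime"] = parsedate_to_datetime(link["datetime"])
        links.append(link)
    return links


# Sample entries taken from the Internet Archive output above
sample = (
    '<http://www.discontents.com.au:80/>; rel="original",\n'
    '<https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/>; '
    'rel="first memento"; datetime="Sun, 06 Dec 1998 01:22:33 GMT",\n'
    '<https://web.archive.org/web/19981212024410/http://www.discontents.com.au:80/>; '
    'rel="memento"; datetime="Sat, 12 Dec 1998 02:44:10 GMT",'
)
links = parse_link_timemap(sample)
mementos = [link for link in links if "memento" in link["rel"].split()]
for memento in mementos:
    print(memento["datetime"].isoformat(), memento["url"])
```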
A request for a Timemap in json format returns results as a JSON array of arrays, where the first row provides the column headings. The response headers include a content-type of application/json.
content_type, timemap = get_timemap("ia", "http://discontents.com.au", "json")
print(content_type)
# Test content type
assert content_type == "application/json"
print("\n".join(timemap.splitlines()[:5]))
application/json
[["urlkey","timestamp","original","mimetype","statuscode","digest","redirect","robotflags","length","offset","filename"],
["au,com,discontents)/","19981206012233","http://www.discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1610","43993900","green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz"],
["au,com,discontents)/","19981212024410","http://www.discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1613","17792789","slash-913417727-c/slash-913430608.arc.gz"],
["au,com,discontents)/","19990125094813","http://www.discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1613","11419234","slash-913417727-c/slash_19990124232053-917257670.arc.gz"],
["au,com,discontents)/","19990208004052","http://discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1612","13269748","slash-913417727-c/slash-918434425.arc.gz"],
A request for a Timemap in cdxj format returns results in plain text, with fields separated by spaces, and captures separated by line breaks. The response headers include a content-type of text/plain.
content_type, timemap = get_timemap("ia", "http://discontents.com.au", "cdxj")
print(content_type)
# Test content type
assert content_type == "text/plain"
print("\n".join(timemap.splitlines()[:1]))
text/plain
au,com,discontents)/ 19981206012233 http://www.discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1610 43993900 green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz
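Because this plain text output has no labels, you need to supply the field names yourself. The sketch below assumes the fields appear in the same order as the column headings in the first row of the Internet Archive's json output; `IA_CDX_FIELDS` and `parse_ia_cdx_line` are hypothetical names introduced for illustration.

```python
# Assumed field order, copied from the heading row of the IA json Timemap
IA_CDX_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode", "digest",
    "redirect", "robotflags", "length", "offset", "filename",
]


def parse_ia_cdx_line(line, fields=IA_CDX_FIELDS):
    """Pair the space-separated values in one line with their field names."""
    return dict(zip(fields, line.split(" ")))


# Sample line taken from the output above
sample = (
    "au,com,discontents)/ 19981206012233 http://www.discontents.com.au:80/ "
    "text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1610 43993900 "
    "green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz"
)
record = parse_ia_cdx_line(sample)
print(record["timestamp"], record["original"], record["statuscode"])
```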
If we compare the Pywb JSON output with the IA Wayback output, we see there are also some differences in the field labels. In particular, original in IA Wayback is just url in Pywb, while statuscode and mimetype are shortened to status and mime in Pywb.
_, timemap = get_timemap("ia", "http://bl.uk", "json")
data = json.loads(timemap)
# Test for `mimetype` label
assert "mimetype" in data[0]
data[0]
['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'redirect', 'robotflags', 'length', 'offset', 'filename']
_, timemap = get_timemap("ukwa", "http://bl.uk", "json")
data = [json.loads(line) for line in timemap.splitlines()]
# Test for `mime` label
assert "mime" in data[0]
list(data[0].keys())
['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'length', 'offset', 'filename', 'load_url', 'source', 'source-coll', 'access']
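The good news is that all repositories provide Timemaps in the standard link format as required by the Memento specification. However, there's more variation when it comes to other formats. In particular, the Internet Archive's json format is different to the Pywb format from UKWA, UKGWA, NLNZ, and NLA.
With the information above we can construct some functions to return normalised Timemap results as JSON. To do this we need to check the content-type of each response and convert the results to a common structure: ndjson is parsed line by line, the Internet Archive's array of arrays is converted to a list of dictionaries with renamed keys, and link format is parsed to extract the timestamp and url of each Memento.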
Because the link format provides less information than the json format, we could also try to enrich the NLNZ data by requesting more information about individual Mementos.
def convert_lists_to_dicts(results):
    """
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    """
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        d["status"] = d.pop("statuscode")
        d["mime"] = d.pop("mimetype")
        d["url"] = d.pop("original")
    return results_as_dicts
def get_capture_data_from_memento(url, request_type="head"):
    """
    For OpenWayback systems this can get some extra capture info to insert into Timemaps.
    """
    if request_type == "head":
        response = requests.head(url)
    else:
        response = requests.get(url)
    headers = response.headers
    length = headers.get("x-archive-orig-content-length")
    status = headers.get("x-archive-orig-status")
    status = status.split(" ")[0] if status else None
    mime = headers.get("x-archive-orig-content-type")
    mime = mime.split(";")[0] if mime else None
    return {"length": length, "status": status, "mime": mime}
def convert_link_to_json(results, enrich_data=False):
    """
    Converts link formatted Timemap to JSON.
    This was originally needed for NLNZ, but now all five archives
    return JSON data.
    """
    data = []
    for line in results.splitlines():
        parts = line.split("; ")
        if len(parts) > 1:
            match = re.search(
                r'rel="(original|self|timegate|first memento|last memento|memento)"',
                parts[1],
            )
            # Skip lines without a recognised rel value
            link_type = match.group(1) if match else None
            if link_type == "memento":
                link = parts[0].strip("<>")
                # Allow an optional two-letter modifier such as mp_ after the timestamp
                capture_match = re.search(
                    r"/(\d{12}|\d{14})(?:[a-z]{2}_)?/(.*)$", link
                )
                if capture_match:
                    timestamp, original = capture_match.groups()
                    capture = {"timestamp": timestamp, "url": original}
                    if enrich_data:
                        capture.update(get_capture_data_from_memento(link))
                    data.append(capture)
    return data
def get_timemap_as_json(timegate, url):
    """
    Get a Timemap then normalise results (if necessary) to return a list of dicts.
    """
    tg_url = f"{TIMEGATES[timegate]}timemap/json/{url}/"
    response = requests.get(tg_url)
    response.raise_for_status()
    response_type = response.headers["content-type"]
    # print(response_type)
    if response_type == "text/x-ndjson":
        data = [json.loads(line) for line in response.text.splitlines()]
    elif response_type == "application/json":
        data = convert_lists_to_dicts(response.json())
    elif response_type in ["application/link-format", "text/html;charset=utf-8"]:
        data = convert_link_to_json(response.text)
    else:
        # Fail with a clear message instead of returning an undefined variable
        raise ValueError(f"Unexpected content-type: {response_type}")
    return data
Now we can get information about captures in a standardised JSON format from all five repositories. You can see this in action in the Display changes in the text of an archived web page over time notebook.
timemap = get_timemap_as_json("ukwa", "http://bl.uk")
# Test for `mime` label
assert "mime" in timemap[0]
timemap[0]
{'urlkey': 'uk,bl)/', 'timestamp': '20011030000019', 'url': 'http://www.bl.uk/', 'mime': 'text/html', 'status': '200', 'digest': 'JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW', 'redirect': '-', 'robotflags': '-', 'length': '0', 'offset': '10813988', 'filename': '/data/102148/31031347/WARCS/BL-31031347.warc.gz', 'load_url': '', 'source': 'archive', 'source-coll': 'archive', 'access': 'allow'}
timemap = get_timemap_as_json("ia", "http://bl.uk")
# Test for `mime` label
assert "mime" in timemap[0]
timemap[0]
{'urlkey': 'uk,bl)/', 'timestamp': '19970218190613', 'digest': 'Z42UMUL76GODKO3EMNSLXDTCST66VDAX', 'redirect': '-', 'robotflags': '-', 'length': '1208', 'offset': '19524651', 'filename': 'GR-001114-c/GR-002277.arc.gz', 'status': '200', 'mime': 'text/html', 'url': 'http://www.bl.uk:80/'}
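Once the results share the same keys, the same code can summarise captures from any of the repositories. As a small illustration, the sketch below converts the 14-digit timestamps to datetimes and counts captures per year. The helper names `timestamp_to_datetime` and `captures_per_year` are hypothetical, the sample records are cut-down versions of the two results above, and the code assumes every timestamp has 14 digits and is in UTC.

```python
from collections import Counter
from datetime import datetime, timezone


def timestamp_to_datetime(timestamp):
    """Convert a 14-digit capture timestamp (YYYYMMDDhhmmss) to a datetime."""
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    # Assumes capture timestamps are recorded in UTC
    return parsed.replace(tzinfo=timezone.utc)


def captures_per_year(captures):
    """Count the captures in a normalised Timemap by year."""
    return Counter(timestamp_to_datetime(c["timestamp"]).year for c in captures)


# Cut-down records based on the UKWA and IA results above
sample = [
    {"timestamp": "20011030000019", "url": "http://www.bl.uk/", "status": "200"},
    {"timestamp": "19970218190613", "url": "http://www.bl.uk:80/", "status": "200"},
]
counts = captures_per_year(sample)
for year in sorted(counts):
    print(year, counts[year])
```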
You can also modify the url of a Memento to change the way it's presented. In particular, adding id_ after the timestamp will tell the server that you want the original harvested version of the webpage, without any rewriting of links, or web archive navigation features. For example:
https://web.archive.org.au/awa/20200302223537id_/http://discontents.com.au/
This works with all five repositories. Note, however, that for the Australian Web Archive you need to use the web.archive.org.au domain, not webarchive.nla.gov.au.
In addition, IA supports the if_ option, which provides a view of the archived page without web archive headers or navigation inserted, but with links to CSS, JS, and images rewritten to point to archived versions. This is as close as you can get to looking at the original page, and I've used it in the Get full page screenshots from archived web pages notebook. Note that if you add if_ to requests from the UKWA, NLNZ, or the NLA you'll be redirected to the standard view with the original page framed by the web archive navigation.
Pywb's page on url rewriting has some useful information about this.
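If you're building these urls in code, a small helper can insert or swap the modifier for you. This is a sketch only: `add_modifier` is a hypothetical function, and it assumes the Memento url contains a 14-digit timestamp, optionally followed by an existing two-letter modifier such as mp_.

```python
import re


def add_modifier(memento_url, modifier="id_"):
    """Insert a modifier after the timestamp, replacing any existing one."""
    return re.sub(
        r"/(\d{14})(?:[a-z]{2}_)?/",
        rf"/\g<1>{modifier}/",
        memento_url,
        count=1,
    )


# Urls taken from examples elsewhere in this notebook
plain = "https://web.archive.org.au/awa/20200302223537/http://discontents.com.au/"
framed = "https://web.archive.org.au/awa/20100205144751mp_/http://www.nla.gov.au/"
print(add_modifier(plain))
print(add_modifier(framed, "id_"))
```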
Created by Tim Sherratt for the GLAM Workbench. Support me by becoming a GitHub sponsor!
Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020.
The Web Archives section of the GLAM Workbench is sponsored by the British Library.