Getting data from web archives using Memento¶

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

Systems supporting the Memento protocol provide machine-readable information about web archive captures, even if other APIs are not available. In this notebook we'll look at the way the Memento protocol is supported across five web archive repositories – the UK Web Archive, the National Library of Australia, the National Library of New Zealand, the Internet Archive, and the UK Government Web Archive. In particular we'll examine:

Timegates – request web page captures from (around) a particular date
Timemaps – request a list of web archive captures from a particular url
Mementos – use url modifiers to change the way an archived web page is presented

Notebooks using Timegates or Timemaps to access capture data include:

Useful tools and documentation¶

In [1]:

import json
import re

import arrow
import requests

# Alternatively use the python Memento client

In [2]:

# These are the repositories we'll be using
TIMEGATES = {
    "awa": "https://web.archive.org.au/awa/",
    "nzwa": "https://ndhadeliver.natlib.govt.nz/webarchive/",
    "ukwa": "https://www.webarchive.org.uk/wayback/archive/",
    "ia": "https://web.archive.org/web/",
    "ukgwa": "https://webarchive.nationalarchives.gov.uk/ukgwa/"
}

Timegates¶

Timegates let you query a web archive for the capture closest to a specific date. You do this by supplying your target date as the Accept-Datetime value in the headers of your request.

For example, if you wanted to query the Australian Web Archive to find the version of http://nla.gov.au/ that was captured as close as possible to 1 January 2010, you'd set the Accept-Datetime header to 'Fri, 01 Jan 2010 01:00:00 GMT' and request the url:

https://web.archive.org.au/awa/http://nla.gov.au/

A get request will return the captured page, but if all you want is the url of the archived page you can use a head request and extract the information you need from the response headers. Try this:

In [3]:

response = requests.head(
    "https://web.archive.org.au/awa/http://nla.gov.au/",
    headers={"Accept-Datetime": "Fri, 01 Jan 2010 01:00:00 GMT"},
)
response.headers

Out[3]:

{'Server': 'nginx', 'Date': 'Thu, 23 Mar 2023 15:03:12 GMT', 'Content-Length': '0', 'Connection': 'keep-alive', 'Location': 'https://web.archive.org.au/awa/20100205144751/http://www.nla.gov.au/', 'Link': '<http://www.nla.gov.au/>; rel="original", <https://web.archive.org.au/awa/http://www.nla.gov.au/>; rel="timegate", <https://web.archive.org.au/awa/timemap/link/http://www.nla.gov.au/>; rel="timemap"; type="application/link-format", <https://web.archive.org.au/awa/20100205144751mp_/http://www.nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:47:51 GMT"', 'Vary': 'accept-datetime'}

An earlier run of the same request (on 6 May 2020) returned the following headers, which are the ones discussed below:

{
    'Server': 'nginx', 
    'Date': 'Wed, 06 May 2020 04:34:50 GMT', 
    'Content-Length': '0', 'Connection': 'keep-alive', 
    'Location': 'https://web.archive.org.au/awa/20100205144227/http://nla.gov.au/', 
    'Link': '<http://nla.gov.au/>; rel="original", <https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", <https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; rel="timemap"; type="application/link-format", <https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"', 
    'Vary': 'accept-datetime'
}

The Link header contains the Memento information. You can see that it's actually providing information on four types of link:

the original url (ie the url that was archived) – <http://nla.gov.au/>
the timegate for the harvested url (which is what we just used) – <https://web.archive.org.au/awa/http://nla.gov.au/>
the timemap for the harvested url (we'll look at this below) – <https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>
the memento – <https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>

The memento link is the capture closest in time to the date we requested. In this case there's only about a month's difference, but of course this will depend on how frequently a url is captured. Opening the link will display the capture in the web archive. As we'll see below, some systems provide additional links such as first memento, last memento, prev memento, and next memento.
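To make the structure of the Link header concrete, here is a minimal sketch that splits the header value shown above into its four links using only the standard library. The function name parse_link_header is hypothetical, and the regular expressions assume every parameter value is double-quoted, as in this example; a real client would more likely rely on the parsing built into its HTTP library.

```python
import re

# The Link header value from the example response above
LINK_HEADER = (
    '<http://nla.gov.au/>; rel="original", '
    '<https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", '
    '<https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; '
    'rel="timemap"; type="application/link-format", '
    '<https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; '
    'rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"'
)


def parse_link_header(value):
    """
    Hypothetical helper: return a dict keyed by rel, eg 'memento'.
    Assumes each link has the form <url>; key="value"; key="value"
    """
    links = {}
    link_pattern = r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)'
    for url, params in re.findall(link_pattern, value):
        # Collect the quoted parameters that follow the url
        attrs = dict(re.findall(r'([\w-]+)="([^"]*)"', params))
        rel = attrs.pop("rel", None)
        if rel:
            links[rel] = {"url": url, **attrs}
    return links


parsed_links = parse_link_header(LINK_HEADER)
for rel, details in parsed_links.items():
    print(rel, details["url"])
```

Only the memento entry carries a datetime value, which records when that capture was made.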

Here are some functions to query a timegate in one of the five systems we're exploring. We'll use them to compare the results we get from each.

In [4]:

def format_date_for_headers(iso_date, tz):
    """
    Convert an ISO date (YYYY-MM-DD) to a datetime at noon in the specified timezone.
    Convert the datetime to UTC and format as required by Accept-Datetime headers:
    eg Fri, 23 Mar 2007 01:00:00 GMT
    """
    local = arrow.get(f"{iso_date} 12:00:00 {tz}", "YYYY-MM-DD HH:mm:ss ZZZ")
    gmt = local.to("utc")
    return f'{gmt.format("ddd, DD MMM YYYY HH:mm:ss")} GMT'


def parse_links_from_headers(response):
    """
    Extract original, timegate, timemap, and memento links from 'Link' header.
    """
    links = response.links
    return {k: v["url"] for k, v in links.items()}


def format_timestamp(timestamp, date_format="YYYY-MM-DD HH:mm:ss"):
    return arrow.get(timestamp, "YYYYMMDDHHmmss").format(date_format)


def test_timegate(
    timegate,
    url,
    date=None,
    tz="Australia/Canberra",
    request_type="head",
    allow_redirects=True,
):
    headers = {}
    if date:
        formatted_date = format_date_for_headers(date, tz)
        headers["Accept-Datetime"] = formatted_date
    # Note that you don't get a timegate response if you leave off the trailing slash
    tg_url = (
        f"{TIMEGATES[timegate]}{url}/"
        if not url.endswith("/")
        else f"{TIMEGATES[timegate]}{url}"
    )
    print(tg_url)
    if request_type == "head":
        response = requests.head(
            tg_url, headers=headers, allow_redirects=allow_redirects
        )
    else:
        response = requests.get(
            tg_url, headers=headers, allow_redirects=allow_redirects
        )
    response.raise_for_status()
    # print(response.headers)
    return parse_links_from_headers(response)
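If arrow isn't available, the Accept-Datetime value can also be built with the standard library. This is only a sketch: the function name accept_datetime and its utc_offset_hours parameter are hypothetical, and a fixed offset is used instead of a named timezone, so daylight saving changes are not handled.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def accept_datetime(iso_date, utc_offset_hours=0):
    """
    Hypothetical helper: convert an ISO date (YYYY-MM-DD) to noon at a
    fixed UTC offset, then format it in UTC for an Accept-Datetime header.
    """
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    local_noon = datetime.strptime(iso_date, "%Y-%m-%d").replace(
        hour=12, tzinfo=local_tz
    )
    # usegmt=True writes the zone as 'GMT' rather than '+0000'
    return format_datetime(local_noon.astimezone(timezone.utc), usegmt=True)


# Noon at UTC+11 (Canberra's offset in January) is 01:00 GMT
print(accept_datetime("2010-01-01", utc_offset_hours=11))
```

With an offset of 11 hours this produces the same 'Fri, 01 Jan 2010 01:00:00 GMT' value used in the first example.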

Australian Web Archive¶

A HEAD request that follows redirects returns no results.

In [5]:

result = test_timegate("awa", "http://www.nla.gov.au")

# Test for expected result
assert result == {}

result

https://web.archive.org.au/awa/http://www.nla.gov.au/

Out[5]:

{}

A HEAD request that doesn't follow redirects returns results as expected

In [6]:

result = test_timegate("awa", "http://www.nla.gov.au", allow_redirects=False)

# Test for expected result
assert "memento" in result

result

https://web.archive.org.au/awa/http://www.nla.gov.au/

Out[6]:

{'original': 'https://www.nla.gov.au/',
 'timegate': 'https://web.archive.org.au/awa/https://www.nla.gov.au/',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/https://www.nla.gov.au/',
 'memento': 'https://web.archive.org.au/awa/20230303002359mp_/https://www.nla.gov.au/'}

A query without an Accept-Datetime value returns a recent capture.

In [7]:

result = test_timegate("awa", "http://www.nla.gov.au", allow_redirects=False)

# Test for expected result
assert "memento" in result

result

https://web.archive.org.au/awa/http://www.nla.gov.au/

Out[7]:

{'original': 'https://www.nla.gov.au/',
 'timegate': 'https://web.archive.org.au/awa/https://www.nla.gov.au/',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/https://www.nla.gov.au/',
 'memento': 'https://web.archive.org.au/awa/20230303002359mp_/https://www.nla.gov.au/'}

A query with an Accept-Datetime value of 1 January 2002 returns a capture from 20 January 2002.

In [8]:

result = test_timegate(
    "awa", "http://www.education.gov.au/", date="2002-01-01", allow_redirects=False
)

# Test for expected result
assert "memento" in result
assert "20020120" in result["memento"]

result

https://web.archive.org.au/awa/http://www.education.gov.au/

Out[8]:

{'original': 'http://www.education.gov.au:80/',
 'timegate': 'https://web.archive.org.au/awa/http://www.education.gov.au:80/',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://www.education.gov.au:80/',
 'memento': 'https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'}
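The capture date can be read straight from a memento url, because the 14-digit timestamp (YYYYMMDDHHmmss) sits between the archive prefix and the original url. As a rough sketch, the hypothetical helper below pulls the timestamp out with a regular expression; it assumes the timestamp is optionally followed by a two-letter modifier such as mp_, as in the results above.

```python
import re
from datetime import datetime


def memento_timestamp(memento_url):
    """
    Hypothetical helper: return the capture time embedded in a memento
    url as a datetime, or None if no 14-digit timestamp is found.
    """
    match = re.search(r"/(\d{14})(?:[a-z]{2}_)?/", memento_url)
    if not match:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


captured = memento_timestamp(
    "https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/"
)
print(captured.isoformat())
```

Comparing this value with the date you asked for shows how far the closest capture is from your target.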

Using a GET rather than a HEAD request returns no Memento information when redirects are followed.

In [9]:

result = test_timegate(
    "awa", "http://www.education.gov.au/", date="2002-01-01", request_type="get"
)

# Test for expected result
assert result == {}

result

https://web.archive.org.au/awa/http://www.education.gov.au/

Out[9]:

{}

Using a GET rather than a HEAD request returns Memento information when redirects are not followed.

In [10]:

result = test_timegate(
    "awa",
    "http://www.education.gov.au/",
    date="2002-01-01",
    request_type="get",
    allow_redirects=False,
)

# Test for expected result
assert "memento" in result

result

https://web.archive.org.au/awa/http://www.education.gov.au/

Out[10]:

{'original': 'http://www.education.gov.au:80/',
 'timegate': 'https://web.archive.org.au/awa/http://www.education.gov.au:80/',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://www.education.gov.au:80/',
 'memento': 'https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'}

New Zealand Web Archive¶

Changing whether or not redirects are followed has no effect on any of these responses.

A query without an Accept-Datetime returns a recent capture.

In [ ]:

result = test_timegate("nzwa", "http://natlib.govt.nz")

# Test for expected result
assert "memento" in result

result

A query with an Accept-Datetime value of 1 January 2005 returns a memento from July 2004.

In [ ]:

result = test_timegate("nzwa", "http://natlib.govt.nz", date="2005-01-01")

# Test for expected result
assert "memento" in result
assert "20040711" in result["memento"]

result

A GET request returns the same results as a HEAD request.

In [ ]:

result_head = test_timegate("nzwa", "http://natlib.govt.nz", date="2005-01-01")
result_get = test_timegate(
    "nzwa", "http://natlib.govt.nz", date="2005-01-01", request_type="get"
)

# Test for expected result
assert result_head == result_get

result_get

Internet Archive¶

Using a HEAD request that follows redirects returns results as expected.

In [19]:

result = test_timegate("ia", "http://discontents.com.au")

# Test for expected result
assert "memento" in result
# IA responses have additional fields
assert "first memento" in result

result

https://web.archive.org/web/http://discontents.com.au/

Out[19]:

{'original': 'http://discontents.com.au/',
 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/',
 'timegate': 'https://web.archive.org/web/http://discontents.com.au/',
 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',
 'prev memento': 'https://web.archive.org/web/20230313181957/https://discontents.com.au/',
 'memento': 'https://web.archive.org/web/20230318003745/http://discontents.com.au/',
 'last memento': 'https://web.archive.org/web/20230318003745/http://discontents.com.au/'}

Using a HEAD request returns no Memento information if redirects are not followed.

In [16]:

result = test_timegate("ia", "http://discontents.com.au", allow_redirects=False)

# Test for expected result
assert result == {}

result

https://web.archive.org/web/http://discontents.com.au/

Out[16]:

{}

A query without an Accept-Datetime value returns a memento and also includes first memento, prev memento, and last memento links.

In [17]:

result = test_timegate("ia", "http://discontents.com.au")

# Test for expected result
assert "memento" in result
# IA responses have additional fields
assert "first memento" in result

result

https://web.archive.org/web/http://discontents.com.au/

Out[17]:

{'original': 'http://discontents.com.au/',
 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/',
 'timegate': 'https://web.archive.org/web/http://discontents.com.au/',
 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',
 'prev memento': 'https://web.archive.org/web/20220323201952/http://www.discontents.com.au/',
 'memento': 'https://web.archive.org/web/20220331081122/http://discontents.com.au/',
 'last memento': 'https://web.archive.org/web/20220331081122/http://discontents.com.au/'}

A query with an Accept-Datetime value of 1 January 2010 returns a memento from 9 February 2010.

In [18]:

result = test_timegate("ia", "http://discontents.com.au", date="2010-01-01")

# Test for expected result
assert "memento" in result
assert "20100209" in result["memento"]

result

https://web.archive.org/web/http://discontents.com.au/

Out[18]:

{'original': 'http://discontents.com.au:80/',
 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au:80/',
 'timegate': 'https://web.archive.org/web/http://discontents.com.au:80/',
 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',
 'prev memento': 'https://web.archive.org/web/20091030053520/http://discontents.com.au/',
 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au:80/',
 'next memento': 'https://web.archive.org/web/20100523101442/http://discontents.com.au:80/',
 'last memento': 'https://web.archive.org/web/20220331081122/http://discontents.com.au/'}

GET requests return different results if redirects are not followed.

In [19]:

result = test_timegate(
    "ia", "http://discontents.com.au", date="2010-01-01", request_type="get"
)
result_no_redirects = test_timegate(
    "ia",
    "http://discontents.com.au",
    date="2010-01-01",
    request_type="get",
    allow_redirects=False,
)

# Test for expected result
assert result != result_no_redirects

result_no_redirects

https://web.archive.org/web/http://discontents.com.au/
https://web.archive.org/web/http://discontents.com.au/

Out[19]:

{'original': 'http://discontents.com.au/',
 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au/',
 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/'}

UK Web Archive¶

Changing whether or not redirects are followed has no effect on any of these responses.

A query without an Accept-Datetime value returns a recent capture.

In [14]:

result = test_timegate("ukwa", "http://bl.uk")

# Test for expected result
assert "memento" in result

result

https://www.webarchive.org.uk/wayback/archive/http://bl.uk/

Out[14]:

{'original': 'https://www.bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/https://www.bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/https://www.bl.uk/',
 'memento': 'https://www.webarchive.org.uk/wayback/archive/20230319105859mp_/https://www.bl.uk/'}

A query with an Accept-Datetime value of 1 January 2006 returns a memento from 4 May 2004.

In [21]:

result = test_timegate("ukwa", "http://bl.uk", date="2006-01-01")

# Test for expected result
assert "memento" in result
assert "20040504" in result["memento"]

result

https://www.webarchive.org.uk/wayback/archive/http://bl.uk/

Out[21]:

{'original': 'http://www.bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://www.bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://www.bl.uk/',
 'memento': 'https://www.webarchive.org.uk/wayback/archive/20040504230000mp_/http://www.bl.uk/'}

A GET request returns the same results as a HEAD request.

In [22]:

result_head = test_timegate("ukwa", "http://bl.uk", date="2006-01-01")
result_get = test_timegate(
    "ukwa", "http://bl.uk", date="2006-01-01", request_type="get"
)

# Test for expected result
assert result_head == result_get

result_get

https://www.webarchive.org.uk/wayback/archive/http://bl.uk/
https://www.webarchive.org.uk/wayback/archive/http://bl.uk/

Out[22]:

{'original': 'http://www.bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://www.bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://www.bl.uk/',
 'memento': 'https://www.webarchive.org.uk/wayback/archive/20040504230000mp_/http://www.bl.uk/'}

UK Government Web Archive¶

Changing whether or not redirects are followed has no effect on any of these responses.

A query without an Accept-Datetime value returns a recent capture.

In [15]:

result = test_timegate("ukgwa", "https://www.nationalarchives.gov.uk/")

# Test for expected result
assert "memento" in result

result

https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/

Out[15]:

{'original': 'https://www.nationalarchives.gov.uk/',
 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/',
 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/https://www.nationalarchives.gov.uk/',
 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20230311073241mp_/https://www.nationalarchives.gov.uk/'}

A query with an Accept-Datetime value of 1 January 2006 returns a memento from 13 February 2006.

In [20]:

result = test_timegate("ukgwa", "https://www.nationalarchives.gov.uk/", date="2006-01-01")

# Test for expected result
assert "memento" in result
assert "20060213" in result["memento"]

result

https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/

Out[20]:

{'original': 'http://www.nationalarchives.gov.uk/',
 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/http://www.nationalarchives.gov.uk/',
 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/http://www.nationalarchives.gov.uk/',
 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20060213205514mp_/http://www.nationalarchives.gov.uk/'}

A GET request returns the same results as a HEAD request.

In [21]:

result_head = test_timegate("ukgwa", "https://www.nationalarchives.gov.uk/", date="2006-01-01")
result_get = test_timegate(
    "ukgwa", "https://www.nationalarchives.gov.uk/", date="2006-01-01", request_type="get")

# Test for expected result
assert result_head == result_get

result_get

https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/
https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk/

Out[21]:

{'original': 'http://www.nationalarchives.gov.uk/',
 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/http://www.nationalarchives.gov.uk/',
 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/http://www.nationalarchives.gov.uk/',
 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20060213205514mp_/http://www.nationalarchives.gov.uk/'}

Summarising the differences¶

As you can see above, there are a couple of significant differences in the way that Timegates behave across the five repositories.

Wayback systems (IA) provide more information than the Pywb systems, adding first memento, prev memento, next memento, and last memento links.
You can use either HEAD or GET with UKWA, NZWA, and UKGWA, but IA and AWA behave differently depending on the type of request and whether redirects are followed. To get results from either a HEAD or GET request, AWA requests should not follow redirects. To get results from a HEAD request, IA requests should follow redirects. GET requests to IA will return results whether or not redirects are allowed; however, those results differ.

Normalising Timegate responses and queries¶

Here's some code to smooth out the differences between systems, and return Memento data as a Python dictionary. Specifically it:

Follows redirects for requests to the IA.
If there is no memento value in the response (as sometimes happens with NLNZ), it looks for a prev, next, or last value instead.

In [22]:

def query_timegate(timegate, url, date=None, tz="Australia/Canberra"):
    """
    Query the specified repository for a Memento.
    """
    headers = {}
    if date:
        formatted_date = format_date_for_headers(date, tz)
        headers["Accept-Datetime"] = formatted_date
    
    # Note that you don't get a timegate response if you leave off the trailing slash, but extras don't hurt!
    tg_url = (
        f"{TIMEGATES[timegate]}{url}/"
        if not url.endswith("/")
        else f"{TIMEGATES[timegate]}{url}"
    )
    # print(tg_url)
    # IA only works if redirects are followed -- this defaults to False with HEAD requests...
    if timegate == "ia":
        allow_redirects = True
    else:
        allow_redirects = False
    response = requests.head(tg_url, headers=headers, allow_redirects=allow_redirects)
    response.raise_for_status()
    return parse_links_from_headers(response)


def get_memento(timegate, url, date=None, tz="Australia/Canberra"):
    """
    Return the url of a Memento.
    If there's no memento in the results, look for an alternative.
    Returns None if no suitable link is found.
    """
    links = query_timegate(timegate, url, date, tz)
    # NLNZ doesn't always seem to return a Memento, so we'll build in some fuzziness
    for rel in ["memento", "prev memento", "next memento", "last memento"]:
        if rel in links:
            return links[rel]
    # No links at all, or none of the expected link types
    return None

Now we can request a Memento from any of the five repositories and get back the results as a Python dictionary. You can see this code in action in the Get full page screenshots from archived web pages notebook.

In [22]:

result = query_timegate("ukgwa", "https://www.nationalarchives.gov.uk/", date="2015-01-01")

# Test for expected result
assert "memento" in result

result

Out[22]:

{'original': 'http://nationalarchives.gov.uk/',
 'timegate': 'https://webarchive.nationalarchives.gov.uk/ukgwa/http://nationalarchives.gov.uk/',
 'timemap': 'https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/http://nationalarchives.gov.uk/',
 'memento': 'https://webarchive.nationalarchives.gov.uk/ukgwa/20141223091614mp_/http://nationalarchives.gov.uk/'}

Or we can just get the url for a Memento (falling back to alternative values if memento is missing).

In [23]:

result = get_memento("nzwa", "http://natlib.govt.nz")

# Test for expected result
assert result.startswith("https://ndhadeliver.natlib.govt.nz/webarchive/")

result

Out[23]:

'https://ndhadeliver.natlib.govt.nz/webarchive/20220801082654mp_/http://natlib.govt.nz/'

Timemaps¶

Memento Timemaps provide machine-processable lists of web page captures from a particular archive. They are available from both OpenWayback and Pywb systems, though there are some differences. The Pywb documentation notes that the following formats are available:

link – returns an application/link-format as required by the Memento spec
cdxj – returns a timemap in the native CDXJ format
json – returns the timemap in newline-delimited JSON (NDJSON) format

Timemaps are requested using a url with the following format:

http://[address.of.archive]/[collection]/timemap/[format]/[web page url]

So if you wanted to query the Australian Web Archive to get a list of captures in JSON format from http://nla.gov.au/ you'd use this url:

https://web.archive.org.au/awa/timemap/json/http://nla.gov.au/

The examples below show how the format and behaviour of Timemaps vary slightly across the five repositories we're interested in.

In [14]:

def get_timemap(timegate, url, format="json"):
    """
    Basic function to get a Timemap for the supplied url.
    """
    tg_url = f"{TIMEGATES[timegate]}timemap/{format}/{url}/"
    response = requests.get(tg_url)
    response.raise_for_status()
    # Show the content-type
    # print(response.headers['content-type'])
    return response.headers["content-type"], response.text

National Library of Australia¶

Request a Timemap in link format. Note that response headers include content-type of application/link-format.

In [23]:

content_type, timemap = get_timemap("awa", "http://www.gov.au", "link")

print(content_type)
# Test content type
assert content_type == "application/link-format"

# Show the first 5 lines
print("\n".join(timemap.splitlines()[:5]))

application/link-format
<https://web.archive.org.au/awa/timemap/link/http://www.gov.au/>; rel="self"; type="application/link-format"; from="Wed, 06 Dec 2000 21:15:00 GMT",
<https://web.archive.org.au/awa/http://www.gov.au/>; rel="timegate",
<http://www.gov.au/>; rel="original",
<https://web.archive.org.au/awa/20001206211500mp_/http://www.gov.au/>; rel="memento"; datetime="Wed, 06 Dec 2000 21:15:00 GMT"; collection="awa",
<https://web.archive.org.au/awa/20010118203600mp_/http://www.gov.au/>; rel="memento"; datetime="Thu, 18 Jan 2001 20:36:00 GMT"; collection="awa",
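To turn a link-format Timemap into something easier to work with, you can pick out the entries whose rel value includes memento. The sketch below runs against the five lines printed above rather than a live response. The function name parse_link_timemap is hypothetical, and the pattern assumes all parameter values are double-quoted.

```python
import re

# The first five lines of the Timemap shown above
SAMPLE_TIMEMAP = '''<https://web.archive.org.au/awa/timemap/link/http://www.gov.au/>; rel="self"; type="application/link-format"; from="Wed, 06 Dec 2000 21:15:00 GMT",
<https://web.archive.org.au/awa/http://www.gov.au/>; rel="timegate",
<http://www.gov.au/>; rel="original",
<https://web.archive.org.au/awa/20001206211500mp_/http://www.gov.au/>; rel="memento"; datetime="Wed, 06 Dec 2000 21:15:00 GMT"; collection="awa",
<https://web.archive.org.au/awa/20010118203600mp_/http://www.gov.au/>; rel="memento"; datetime="Thu, 18 Jan 2001 20:36:00 GMT"; collection="awa",
'''


def parse_link_timemap(text):
    """
    Hypothetical helper: return a list of captures from a link-format Timemap.
    """
    captures = []
    link_pattern = r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)'
    for url, params in re.findall(link_pattern, text):
        attrs = dict(re.findall(r'([\w-]+)="([^"]*)"', params))
        # rel can hold several values, eg "first memento"
        if "memento" in attrs.get("rel", "").split():
            captures.append({"url": url, "datetime": attrs.get("datetime")})
    return captures


captures = parse_link_timemap(SAMPLE_TIMEMAP)
for capture in captures:
    print(capture["datetime"], capture["url"])
```

The self, timegate, and original entries are skipped, leaving one dictionary per capture.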

Request a Timemap in json format. This returns ndjson (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include content-type of text/x-ndjson.

In [15]:

content_type, timemap = get_timemap(
    "awa",
    "http://www.aph.gov.au/Senate/committee/eet_ctte/uni_finances/report/index.htm",
    "json",
)

print(content_type)
# Test content type
assert content_type == "text/x-ndjson"

# Show the first line
print("\n".join(timemap.splitlines()[:1]))

text/x-ndjson
{"urlkey": "au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm", "timestamp": "20031122074837", "url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", "mime": "text/html", "status": "200", "digest": "3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE", "offset": "97170362", "filename": "NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz", "length": "3446", "source": "awa", "source-coll": "awa"}
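Because each line of an ndjson Timemap is a complete JSON object, the response text can't be passed to json.loads() in one go; it needs to be parsed line by line. A minimal sketch, using the line printed above as sample data (the function name parse_ndjson is hypothetical):

```python
import json

# The first line of the Timemap shown above
SAMPLE_NDJSON = '{"urlkey": "au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm", "timestamp": "20031122074837", "url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", "mime": "text/html", "status": "200", "digest": "3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE", "offset": "97170362", "filename": "NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz", "length": "3446", "source": "awa", "source-coll": "awa"}\n'


def parse_ndjson(text):
    """
    Hypothetical helper: convert ndjson text into a list of dictionaries,
    skipping any blank lines.
    """
    return [json.loads(line) for line in text.splitlines() if line.strip()]


ndjson_captures = parse_ndjson(SAMPLE_NDJSON)
print(ndjson_captures[0]["timestamp"], ndjson_captures[0]["url"])
```

Note that all the values, including status and length, are returned as strings.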

Request a Timemap in cdxj format. Note that response headers include content-type of text/x-cdxj.

In [26]:

content_type, timemap = get_timemap(
    "awa",
    "http://www.aph.gov.au/Senate/committee/eet_ctte/uni_finances/report/index.htm",
    "cdxj",
)

print(content_type)
# Test content type
assert content_type == "text/x-cdxj"

# Show the first line
print("\n".join(timemap.splitlines()[:1]))

text/x-cdxj
au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm 20031122074837 {"url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", "mime": "text/html", "status": "200", "digest": "3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE", "offset": "97170362", "filename": "NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz", "length": "3446", "source": "awa", "source-coll": "awa"}
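In the cdxj output above, each line consists of a url key, a timestamp, and a JSON object, separated by spaces. Here is a minimal sketch of how a single line could be unpacked, using the line printed above as sample data. The function name parse_cdxj_line is hypothetical, and it assumes the url key itself contains no spaces.

```python
import json

# The first line of the Timemap shown above
SAMPLE_CDXJ = 'au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm 20031122074837 {"url": "http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm", "mime": "text/html", "status": "200", "digest": "3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE", "offset": "97170362", "filename": "NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz", "length": "3446", "source": "awa", "source-coll": "awa"}'


def parse_cdxj_line(line):
    """
    Hypothetical helper: split a cdxj line into urlkey, timestamp,
    and the fields from the trailing JSON object.
    """
    # Split on the first two spaces only, as the JSON itself contains spaces
    urlkey, timestamp, fields = line.split(" ", 2)
    return {"urlkey": urlkey, "timestamp": timestamp, **json.loads(fields)}


cdxj_capture = parse_cdxj_line(SAMPLE_CDXJ)
print(cdxj_capture["timestamp"], cdxj_capture["url"])
```

The result has the same keys as the corresponding ndjson record, so either format can feed the same downstream code.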

UK Web Archive¶

Request a Timemap in link format. Note that response headers include content-type of application/link-format.

In [27]:

content_type, timemap = get_timemap("ukwa", "http://bl.uk", "link")

print(content_type)
# Test content type
assert content_type == "application/link-format"

print("\n".join(timemap.splitlines()[:5]))

application/link-format
<https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/>; rel="self"; type="application/link-format"; from="Tue, 30 Oct 2001 00:00:19 GMT",
<https://www.webarchive.org.uk/wayback/archive/http://bl.uk/>; rel="timegate",
<http://bl.uk/>; rel="original",
<https://www.webarchive.org.uk/wayback/archive/20011030000019mp_/http://www.bl.uk/>; rel="memento"; datetime="Tue, 30 Oct 2001 00:00:19 GMT"; collection="archive",
<https://www.webarchive.org.uk/wayback/archive/20011113000000mp_/http://www.bl.uk/>; rel="memento"; datetime="Tue, 13 Nov 2001 00:00:00 GMT"; collection="archive",

In [28]:

content_type, timemap = get_timemap("ukwa", "http://bl.uk", "json")

print(content_type)
# Test content type
assert content_type == "text/x-ndjson"

print("\n".join(timemap.splitlines()[:1]))

text/x-ndjson
{"urlkey": "uk,bl)/", "timestamp": "20011030000019", "url": "http://www.bl.uk/", "mime": "text/html", "status": "200", "digest": "JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW", "redirect": "-", "robotflags": "-", "length": "0", "offset": "10813988", "filename": "/data/102148/31031347/WARCS/BL-31031347.warc.gz", "load_url": "", "source": "archive", "source-coll": "archive", "access": "allow"}

Request a Timemap in cdxj format. Note that response headers include content-type of text/x-cdxj.

In [29]:

content_type, timemap = get_timemap("ukwa", "http://bl.uk", "cdxj")

print(content_type)
# Test content type
assert content_type == "text/x-cdxj"

print("\n".join(timemap.splitlines()[:1]))

text/x-cdxj
uk,bl)/ 20011030000019 {"url": "http://www.bl.uk/", "mime": "text/html", "status": "200", "digest": "JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW", "redirect": "-", "robotflags": "-", "length": "0", "offset": "10813988", "filename": "/data/102148/31031347/WARCS/BL-31031347.warc.gz", "load_url": "", "source": "archive", "source-coll": "archive", "access": "allow"}

UK Government Web Archive¶

Request a Timemap in link format. Note that response headers include content-type of application/link-format.

In [25]:

content_type, timemap = get_timemap("ukgwa", "https://www.nationalarchives.gov.uk/", "link")

print(content_type)
# Test content type
assert content_type == "application/link-format"

print("\n".join(timemap.splitlines()[:5]))

application/link-format
<https://webarchive.nationalarchives.gov.uk/ukgwa/timemap/link/https://www.nationalarchives.gov.uk//>; rel="self"; type="application/link-format"; from="Mon, 20 Oct 2003 01:04:12 GMT",
<https://webarchive.nationalarchives.gov.uk/ukgwa/https://www.nationalarchives.gov.uk//>; rel="timegate",
<https://www.nationalarchives.gov.uk//>; rel="original",
<https://webarchive.nationalarchives.gov.uk/ukgwa/20031020010412mp_/http://www.nationalarchives.gov.uk:80/>; rel="memento"; datetime="Mon, 20 Oct 2003 01:04:12 GMT"; collection="full_zipnum",
<https://webarchive.nationalarchives.gov.uk/ukgwa/20040104233258mp_/http://www.nationalarchives.gov.uk/>; rel="memento"; datetime="Sun, 04 Jan 2004 23:32:58 GMT"; collection="full_zipnum",

In [26]:

content_type, timemap = get_timemap("ukgwa", "https://www.nationalarchives.gov.uk/", "json")

print(content_type)
# Test content type
assert content_type == "text/x-ndjson"

print("\n".join(timemap.splitlines()[:1]))

text/x-ndjson
{"urlkey": "uk,gov,nationalarchives)/", "timestamp": "20031020010412", "url": "http://www.nationalarchives.gov.uk:80/", "mime": "text/html", "status": "200", "digest": "U2IC276V3AKMWIJGWWJXCVQ2KZ6AMU5J", "redirect": "-", "robotflags": "-", "length": "951", "offset": "898", "filename": "UKGOV-WEEKLY-010-031019180412-000.warc.gz", "source": "full_zipnum", "source-coll": "full_zipnum", "access": "allow"}

Request a Timemap in cdxj format. Note that response headers include content-type of text/x-cdxj.

In [27]:

content_type, timemap = get_timemap("ukgwa", "https://www.nationalarchives.gov.uk/", "cdxj")

print(content_type)
# Test content type
assert content_type == "text/x-cdxj"

print("\n".join(timemap.splitlines()[:1]))

text/x-cdxj
uk,gov,nationalarchives)/ 20031020010412 {"url": "http://www.nationalarchives.gov.uk:80/", "mime": "text/html", "status": "200", "digest": "U2IC276V3AKMWIJGWWJXCVQ2KZ6AMU5J", "redirect": "-", "robotflags": "-", "length": "951", "offset": "898", "filename": "UKGOV-WEEKLY-010-031019180412-000.warc.gz", "source": "full_zipnum", "source-coll": "full_zipnum", "access": "allow"}

National Library of New Zealand¶

Request a Timemap in link format. Note that response headers include content-type of application/link-format.

In [32]:

content_type, timemap = get_timemap("nzwa", "http://natlib.govt.nz", "link")

print(content_type)
# Test content type
assert content_type == "application/link-format"

print("\n".join(timemap.splitlines()[:5]))

application/link-format
<https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://natlib.govt.nz/>; rel="self"; type="application/link-format"; from="Sun, 11 Jul 2004 21:32:25 GMT",
<https://ndhadeliver.natlib.govt.nz/webarchive/http://natlib.govt.nz/>; rel="timegate",
<http://natlib.govt.nz/>; rel="original",
<https://ndhadeliver.natlib.govt.nz/webarchive/20040711213225mp_/http://www.natlib.govt.nz/>; rel="memento"; datetime="Sun, 11 Jul 2004 21:32:25 GMT"; collection="webarchive",
<https://ndhadeliver.natlib.govt.nz/webarchive/20060704033135mp_/http://www.natlib.govt.nz/>; rel="memento"; datetime="Tue, 04 Jul 2006 03:31:35 GMT"; collection="webarchive",
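
Each line in the link format pairs a url in angle brackets with a series of `name="value"` attributes. The following sketch shows one way the url, `rel` values, and capture date could be pulled out of a single line. The `parse_link_line` helper is hypothetical, and it assumes attribute values never contain double quotes. The sample line is copied from the output above.

```python
import re
from email.utils import parsedate_to_datetime


def parse_link_line(line):
    """Extract the url, rel value(s), and capture datetime from one link-format line."""
    line = line.strip().rstrip(",")
    url = re.search(r"<([^>]*)>", line).group(1)
    # Attributes look like rel="memento"; datetime="Sun, 11 Jul 2004 21:32:25 GMT"
    attrs = dict(re.findall(r'(\w+)="([^"]*)"', line))
    captured = attrs.get("datetime")
    return {
        "url": url,
        # rel can hold several values, e.g. "first memento"
        "rel": attrs.get("rel", "").split(),
        "datetime": parsedate_to_datetime(captured) if captured else None,
    }


sample_link = (
    "<https://ndhadeliver.natlib.govt.nz/webarchive/20040711213225mp_/http://www.natlib.govt.nz/>; "
    'rel="memento"; datetime="Sun, 11 Jul 2004 21:32:25 GMT"; collection="webarchive",'
)
link_info = parse_link_line(sample_link)
print(link_info["rel"], link_info["datetime"])
```

Splitting `rel` into a list makes it easy to check for `"memento"` regardless of whether a line is also marked as the first or last capture.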

Request a Timemap in json format. This returns ndjson (Newline Delimited JSON) – each capture is a JSON object, separated by a line break. Note that the response headers include content-type of text/x-ndjson.

In [33]:

content_type, timemap = get_timemap("nzwa", "http://natlib.govt.nz", "json")

print(content_type)
# Test content type
assert content_type == "text/x-ndjson"

print("\n".join(timemap.splitlines()[:1]))

text/x-ndjson
{"urlkey": "nz,govt,natlib)/", "timestamp": "20040711213225", "url": "http://www.natlib.govt.nz/", "mime": "text/html", "status": "200", "digest": "JV66FPIIX6IJTB42TNHMQDEU5Z3LFBCK", "redirect": "-", "robotflags": "-", "length": "0", "offset": "976", "filename": "V1-FL1645590.arc", "load_url": "http://10.4.1.66:80/nlnzwebarchive_PROD/ap/20040711213225id_/http://www.natlib.govt.nz/", "source": "webarchive", "source-coll": "webarchive"}

Request a Timemap in cdxj format. Note that response headers include content-type of text/x-cdxj.

In [34]:

content_type, timemap = get_timemap("nzwa", "http://natlib.govt.nz", "cdxj")

print(content_type)
# Test content type
assert content_type == "text/x-cdxj"

print("\n".join(timemap.splitlines()[:1]))

text/x-cdxj
nz,govt,natlib)/ 20040711213225 {"url": "http://www.natlib.govt.nz/", "mime": "text/html", "status": "200", "digest": "JV66FPIIX6IJTB42TNHMQDEU5Z3LFBCK", "redirect": "-", "robotflags": "-", "length": "0", "offset": "976", "filename": "V1-FL1645590.arc", "load_url": "http://10.4.1.66:80/nlnzwebarchive_PROD/ap/20040711213225id_/http://www.natlib.govt.nz/", "source": "webarchive", "source-coll": "webarchive"}

Internet Archive¶

Request a Timemap in link format. Note that response headers include content-type of application/link-format.

In [35]:

content_type, timemap = get_timemap("ia", "http://discontents.com.au", "link")

print(content_type)
# Test content type
assert content_type == "application/link-format"

print("\n".join(timemap.splitlines()[:5]))

application/link-format
<http://www.discontents.com.au:80/>; rel="original",
<https://web.archive.org/web/timemap/link/http://discontents.com.au/>; rel="self"; type="application/link-format"; from="Sun, 06 Dec 1998 01:22:33 GMT",
<https://web.archive.org/web/http://discontents.com.au/>; rel="timegate",
<https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/>; rel="first memento"; datetime="Sun, 06 Dec 1998 01:22:33 GMT",
<https://web.archive.org/web/19981212024410/http://www.discontents.com.au:80/>; rel="memento"; datetime="Sat, 12 Dec 1998 02:44:10 GMT",

Request a Timemap in json format. This returns results as a JSON array of arrays, where the first row provides the column headings. Note that the response headers include content-type of application/json.

In [36]:

content_type, timemap = get_timemap("ia", "http://discontents.com.au", "json")

print(content_type)
# Test content type
assert content_type == "application/json"

print("\n".join(timemap.splitlines()[:5]))

application/json
[["urlkey","timestamp","original","mimetype","statuscode","digest","redirect","robotflags","length","offset","filename"],
["au,com,discontents)/","19981206012233","http://www.discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1610","43993900","green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz"],
["au,com,discontents)/","19981212024410","http://www.discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1613","17792789","slash-913417727-c/slash-913430608.arc.gz"],
["au,com,discontents)/","19990125094813","http://www.discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1613","11419234","slash-913417727-c/slash_19990124232053-917257670.arc.gz"],
["au,com,discontents)/","19990208004052","http://discontents.com.au:80/","text/html","200","FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36","-","-","1612","13269748","slash-913417727-c/slash-918434425.arc.gz"],

Request a Timemap in cdxj format. This returns results in plain text, with fields separated by spaces and captures separated by line breaks. Note that the response headers include content-type of text/plain.

In [37]:

content_type, timemap = get_timemap("ia", "http://discontents.com.au", "cdxj")

print(content_type)
# Test content type
assert content_type == "text/plain"

print("\n".join(timemap.splitlines()[:1]))

text/plain
au,com,discontents)/ 19981206012233 http://www.discontents.com.au:80/ text/html 200 FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1610 43993900 green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz
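
The plain text output has no column headings, but the values appear in the same order as the headings in the first row of IA's JSON output. This sketch assumes that ordering always holds, and pairs each value with its heading. The `IA_FIELDS` list and the `parse_ia_cdx_line` helper are hypothetical names introduced here for illustration.

```python
# Column names copied from the first row of IA's JSON Timemap
IA_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "robotflags", "length", "offset", "filename",
]


def parse_ia_cdx_line(line):
    """Pair the space-separated values in one line with IA's column names."""
    return dict(zip(IA_FIELDS, line.split()))


# First line of the plain text output above
sample_cdx = (
    "au,com,discontents)/ 19981206012233 http://www.discontents.com.au:80/ text/html 200 "
    "FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36 - - 1610 43993900 "
    "green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz"
)
cdx_capture = parse_ia_cdx_line(sample_cdx)
print(cdx_capture["timestamp"], cdx_capture["original"])
```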

Differences in field labels¶

If we compare the Pywb JSON output with the IA Wayback output, we see there are also some differences in the field labels. In particular original in IA Wayback is just url in Pywb, while statuscode and mimetype are shortened to status and mime in Pywb.

In [28]:

_, timemap = get_timemap("ia", "http://bl.uk", "json")
data = json.loads(timemap)

# Test for `mimetype` label
assert "mimetype" in data[0]

data[0]

Out[28]:

['urlkey',
 'timestamp',
 'original',
 'mimetype',
 'statuscode',
 'digest',
 'redirect',
 'robotflags',
 'length',
 'offset',
 'filename']

In [16]:

_, timemap = get_timemap("ukwa", "http://bl.uk", "json")
data = [json.loads(line) for line in timemap.splitlines()]

# Test for `mime` label
assert "mime" in data[0]

list(data[0].keys())

Out[16]:

['urlkey',
 'timestamp',
 'url',
 'mime',
 'status',
 'digest',
 'redirect',
 'robotflags',
 'length',
 'offset',
 'filename',
 'load_url',
 'source',
 'source-coll',
 'access']

Summarising the differences¶

The good news is that all repositories provide Timemaps in the standard link format as required by the Memento specification. However, there's more variation when it comes to other formats.

IA's json format is different to the Pywb format from UKWA, UKGWA, NLNZ, and NLA.
IA uses different labels for some values.

Normalising Timemaps¶

With the information above we can construct some functions to return normalised Timemap results as JSON. To do this we need to:

Restructure the JSON output from IA to match the Pywb format
Change some of the column headings in the IA data to match the Pywb format

Because the link format provides less information than the json format, we could also try to enrich the NLNZ data by requesting more information about individual Mementos.

In [12]:

def convert_lists_to_dicts(results):
    """
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    """
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        d["status"] = d.pop("statuscode")
        d["mime"] = d.pop("mimetype")
        d["url"] = d.pop("original")
    return results_as_dicts


def get_capture_data_from_memento(url, request_type="head"):
    """
    For OpenWayback systems this can get some extra capture info to insert into Timemaps.
    """
    if request_type == "head":
        response = requests.head(url)
    else:
        response = requests.get(url)
    headers = response.headers
    length = headers.get("x-archive-orig-content-length")
    status = headers.get("x-archive-orig-status")
    status = status.split(" ")[0] if status else None
    mime = headers.get("x-archive-orig-content-type")
    mime = mime.split(";")[0] if mime else None
    return {"length": length, "status": status, "mime": mime}


def convert_link_to_json(results, enrich_data=False):
    """
    Converts link formatted Timemap to JSON.

    This was originally needed for NLNZ, but now all five archives
    return JSON data.
    """
    data = []
    for line in results.splitlines():
        parts = line.split("; ")
        if len(parts) > 1:
            link_type = re.search(
                r'rel="(original|self|timegate|first memento|last memento|memento)"',
                parts[1],
            ).group(1)
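            # "first memento" and "last memento" are captures too
            if link_type.endswith("memento"):
                link = parts[0].strip("<>")
                # Allow an optional modifier such as mp_ or id_ after the timestamp
                timestamp, original = re.search(
                    r"/(\d{12}|\d{14})(?:[a-z]{2}_)?/(.*)$", link
                ).groups()
                capture = {"timestamp": timestamp, "url": original}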
                if enrich_data:
                    capture.update(get_capture_data_from_memento(link))
                    # print(capture)
                data.append(capture)
    return data


def get_timemap_as_json(timegate, url):
    """
    Get a Timemap then normalise results (if necessary) to return a list of dicts.
    """
    tg_url = f"{TIMEGATES[timegate]}timemap/json/{url}/"
    response = requests.get(tg_url)
    response.raise_for_status()
    response_type = response.headers["content-type"]
    # print(response_type)
    if response_type == "text/x-ndjson":
        data = [json.loads(line) for line in response.text.splitlines()]
    elif response_type == "application/json":
        data = convert_lists_to_dicts(response.json())
    elif response_type in ["application/link-format", "text/html;charset=utf-8"]:
        data = convert_link_to_json(response.text)
    else:
        # Fail clearly instead of trying to return an unassigned variable
        raise ValueError(f"Unexpected Timemap content type: {response_type}")
    return data

Now we can get information about captures in a standardised JSON format from all five repositories. You can see this in action in the Display changes in the text of an archived web page over time notebook.

In [13]:

timemap = get_timemap_as_json("ukwa", "http://bl.uk")

# Test for `mime` label
assert "mime" in timemap[0]

timemap[0]

Out[13]:

{'urlkey': 'uk,bl)/',
 'timestamp': '20011030000019',
 'url': 'http://www.bl.uk/',
 'mime': 'text/html',
 'status': '200',
 'digest': 'JN4RHYLNGS7X64HADIX3XHDIMYDBLAAW',
 'redirect': '-',
 'robotflags': '-',
 'length': '0',
 'offset': '10813988',
 'filename': '/data/102148/31031347/WARCS/BL-31031347.warc.gz',
 'load_url': '',
 'source': 'archive',
 'source-coll': 'archive',
 'access': 'allow'}

In [14]:

timemap = get_timemap_as_json("ia", "http://bl.uk")

# Test for `mime` label
assert "mime" in timemap[0]

timemap[0]

Out[14]:

{'urlkey': 'uk,bl)/',
 'timestamp': '19970218190613',
 'digest': 'Z42UMUL76GODKO3EMNSLXDTCST66VDAX',
 'redirect': '-',
 'robotflags': '-',
 'length': '1208',
 'offset': '19524651',
 'filename': 'GR-001114-c/GR-002277.arc.gz',
 'status': '200',
 'mime': 'text/html',
 'url': 'http://www.bl.uk:80/'}
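
Once Timemaps are normalised, every capture has a 14-digit `timestamp` value in `YYYYMMDDhhmmss` form, which makes it straightforward to work with capture dates. As a rough sketch, the hypothetical helpers below convert a timestamp to a Python `datetime` and count captures by year. They use the standard library rather than `arrow`, and the sample list contains only the two timestamps shown in the outputs above, not a full Timemap.

```python
from collections import Counter
from datetime import datetime


def timestamp_to_datetime(timestamp):
    """Convert a 14-digit capture timestamp (YYYYMMDDhhmmss) to a datetime."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def captures_per_year(captures):
    """Count captures by the year in each capture's `timestamp` value."""
    return Counter(timestamp_to_datetime(c["timestamp"]).year for c in captures)


# Illustrative sample only: the first captures returned by IA and UKWA above
sample_captures = [
    {"timestamp": "19970218190613", "url": "http://www.bl.uk:80/"},
    {"timestamp": "20011030000019", "url": "http://www.bl.uk/"},
]
print(captures_per_year(sample_captures))
```

In practice you would pass the list returned by `get_timemap_as_json()` to `captures_per_year()`.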

Mementos¶

You can also modify the url of a Memento to change the way it's presented. In particular, adding id_ after the timestamp will tell the server that you want the original harvested version of the webpage, without any rewriting of links, or web archive navigation features. For example:

https://web.archive.org.au/awa/20200302223537id_/http://discontents.com.au/

This works with all five repositories. However, note that for the Australian Web Archive you need to use the web.archive.org.au domain, not webarchive.nla.gov.au.

In addition, IA supports the if_ option, which provides a view of the archived page without the web archive's header and navigation inserted, but with links to CSS, JS, and images rewritten to point to archived versions. This is as close as you can get to looking at the original page, and I've used it in the Get full page screenshots from archived web pages notebook. Note that if you add if_ to requests from the UKWA, NLNZ, or the NLA you'll be redirected to the standard view with the original page framed by the web archive navigation.
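
Modifiers such as id_ and if_ sit directly after the timestamp in a Memento's url, so they can be added with a simple substitution. This is a minimal sketch, assuming a 14-digit timestamp and a modifier made up of two lowercase letters and an underscore. The `add_modifier` helper is hypothetical and not part of this notebook's code.

```python
import re


def add_modifier(memento_url, modifier="id_"):
    """Insert (or replace) a modifier such as id_ or if_ after a Memento's timestamp."""
    # Match the 14-digit timestamp plus any existing modifier, e.g. mp_
    return re.sub(
        r"/(\d{14})(?:[a-z]{2}_)?/", rf"/\g<1>{modifier}/", memento_url, count=1
    )


raw_url = add_modifier(
    "https://web.archive.org.au/awa/20200302223537/http://discontents.com.au/"
)
print(raw_url)
```

Because any existing modifier is replaced, the same function could be applied to the `mp_` urls returned in link format Timemaps.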

Pywb's page on url rewriting has some useful information about this.

Created by Tim Sherratt for the GLAM Workbench. Support me by becoming a GitHub sponsor!

Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020.

The Web Archives section of the GLAM Workbench is sponsored by the British Library.