New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
You can find all the archived versions of a web page by requesting a Timemap from a Memento-compliant repository. If the repository has a CDX API, you can get much the same data by doing an exact url search.
import json
import re
import requests
from surt import surt
Works with AWA, IA, NZWA & UKWA
Variations in the way Memento is implemented across repositories are documented in Getting data from web archives using Memento. The functions below smooth out these variations to provide a (mostly) consistent interface to the UK Web Archive, Australian Web Archive, New Zealand Web Archive, and the Internet Archive. They could be easily modified to work with other Memento-compliant repositories.
To get all captures of a url in JSON format:
get_timemap_as_json([timegate], [url], enrich_data=[True or False])
Parameters:

- `timegate` – one of 'ukwa' (UK), 'awa' (Australia), 'nzwa' (New Zealand), or 'ia' (Internet Archive)
- `url` – the url you want to look for in the archive
- `enrich_data` – NZWA Timemaps include less information; if you set this to `True` the script will query each memento in turn to try and find more capture information (such as `mime` and `status`). This will slow things down quite a bit, and isn't always successful, so leave it as `False` unless you have a good reason.

The data is returned in JSON format. The number of fields returned varies, but these will always be present:

- `urlkey` – SURT formatted url (in the case of NZWA this is generated by the script rather than the archive)
- `timestamp` – the date and time when the page was captured by the archive, in `YYYYMMDDHHmmss` format
- `url` – the url of the page that was captured

The AWA, IA, and UKWA Timemaps also include:

- `status` – HTTP status code returned by the capture request
- `mime` – the mimetype of the captured resource
- `digest` – algorithmically generated string that uniquely identifies the contents of the captured resource

For more information on the contents of these fields, see Exploring the Internet Archive's CDX API.
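The `timestamp` values are plain 14-digit strings, so they need parsing before you can do any date arithmetic. A minimal sketch using Python's standard library (the sample timestamp is taken from the first capture shown further down):

```python
from datetime import datetime

# Capture timestamps are 14-digit strings in YYYYMMDDHHmmss format
ts = "19981206012233"
captured = datetime.strptime(ts, "%Y%m%d%H%M%S")
print(captured.isoformat())  # 1998-12-06T01:22:33
```

Once parsed, the captures can be sorted, grouped by year, or plotted on a timeline.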
# These are the repositories we'll be using
TIMEGATES = {
    "awa": "https://web.archive.org.au/awa/",
    "nzwa": "https://ndhadeliver.natlib.govt.nz/webarchive/",
    "ukwa": "https://www.webarchive.org.uk/wayback/en/archive/",
    "ia": "https://web.archive.org/web/",
}
def convert_lists_to_dicts(results):
    """
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    """
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        d["status"] = d.pop("statuscode")
        d["mime"] = d.pop("mimetype")
        d["url"] = d.pop("original")
    return results_as_dicts
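To see what `convert_lists_to_dicts` is doing, here's the same zip-and-rename pattern applied to a made-up two-row IA-style response (the first row holds the field names, the rest are capture records):

```python
# Hypothetical IA-style response: header row followed by capture rows
sample = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode"],
    ["au,com,example)/", "20050101000000", "http://example.com.au/", "text/html", "200"],
]

keys = sample[0]
records = [dict(zip(keys, row)) for row in sample[1:]]

# Rename the IA field names to match other Timemaps
for d in records:
    d["status"] = d.pop("statuscode")
    d["mime"] = d.pop("mimetype")
    d["url"] = d.pop("original")

print(records[0])
```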
def get_capture_data_from_memento(url, request_type="head"):
    """
    For OpenWayback systems this can get some extra capture info to insert into Timemaps.
    """
    if request_type == "head":
        response = requests.head(url)
    else:
        response = requests.get(url)
    headers = response.headers
    length = headers.get("x-archive-orig-content-length")
    status = headers.get("x-archive-orig-status")
    status = status.split(" ")[0] if status else None
    mime = headers.get("x-archive-orig-content-type")
    mime = mime.split(";")[0] if mime else None
    return {"length": length, "status": status, "mime": mime}
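The header parsing above can be tried without a network request. This sketch applies the same splitting logic to a hypothetical set of `x-archive-orig-*` headers:

```python
# Hypothetical headers from an OpenWayback memento response
headers = {
    "x-archive-orig-content-length": "1610",
    "x-archive-orig-status": "200 OK",
    "x-archive-orig-content-type": "text/html; charset=UTF-8",
}

# Keep just the numeric code from '200 OK'
status = headers.get("x-archive-orig-status")
status = status.split(" ")[0] if status else None

# Drop the charset suffix from the content type
mime = headers.get("x-archive-orig-content-type")
mime = mime.split(";")[0] if mime else None

print({"length": headers.get("x-archive-orig-content-length"), "status": status, "mime": mime})
```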
def convert_link_to_json(results, enrich_data=False):
    """
    Converts link formatted Timemap to JSON.
    """
    data = []
    for line in results.splitlines():
        parts = line.split("; ")
        if len(parts) > 1:
            link_type = re.search(
                r'rel="(original|self|timegate|first memento|last memento|memento)"',
                parts[1],
            ).group(1)
            if link_type == "memento":
                link = parts[0].strip("<>")
                timestamp, original = re.search(r"/(\d{14})/(.*)$", link).groups()
                capture = {
                    "urlkey": surt(original),
                    "timestamp": timestamp,
                    "url": original,
                }
                if enrich_data:
                    capture.update(get_capture_data_from_memento(link))
                    print(capture)
                data.append(capture)
    return data
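The regular expression at the heart of `convert_link_to_json` pulls the timestamp and original url out of a memento link. Here it is in isolation, applied to a made-up `application/link-format` line:

```python
import re

# A typical memento line from a link-format Timemap (made up for this example)
line = '<https://web.archive.org/web/19981206012233/http://discontents.com.au/>; rel="memento"; datetime="Sun, 06 Dec 1998 01:22:33 GMT"'

parts = line.split("; ")
link = parts[0].strip("<>")

# The 14-digit timestamp sits between slashes, followed by the original url
timestamp, original = re.search(r"/(\d{14})/(.*)$", link).groups()
print(timestamp, original)
```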
def get_timemap_as_json(timegate, url, enrich_data=False):
    """
    Get a Timemap then normalise results (if necessary) to return a list of dicts.
    """
    tg_url = f"{TIMEGATES[timegate]}timemap/json/{url}/"
    response = requests.get(tg_url)
    response_type = response.headers["content-type"]
    # print(response_type)
    if response_type == "text/x-ndjson":
        data = [json.loads(line) for line in response.text.splitlines()]
    elif response_type == "application/json":
        data = convert_lists_to_dicts(response.json())
    elif response_type in [
        "application/link-format",
        "application/link-format;charset=ISO-8859-1",
        "text/html;charset=utf-8",
    ]:
        data = convert_link_to_json(response.text, enrich_data=enrich_data)
    return data
t1 = get_timemap_as_json("ia", "http://discontents.com.au")
len(t1)
351
# First -- results in date order
t1[0]
{'urlkey': 'au,com,discontents)/', 'timestamp': '19981206012233', 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36', 'redirect': '-', 'robotflags': '-', 'length': '1610', 'offset': '43993900', 'filename': 'green-0133-19990218235953-919455657-c/green-0141-912907270.arc.gz', 'status': '200', 'mime': 'text/html', 'url': 'http://www.discontents.com.au:80/'}
# Last -- the most recent
t1[-1]
{'urlkey': 'au,com,discontents)/', 'timestamp': '20220331081122', 'digest': 'LK7AWVZ7UN745CBGJNEVA3QJMKLJ4N4V', 'redirect': '-', 'robotflags': '-', 'length': '6582', 'offset': '684200982', 'filename': 'AHREFS-20220331070812-crawl800/AHREFS-20220331074718-00077.warc.gz', 'status': '200', 'mime': 'text/html', 'url': 'http://discontents.com.au/'}
t2 = get_timemap_as_json("ukwa", "http://bl.uk")
len(t2)
829
t3 = get_timemap_as_json("nzwa", "http://natlib.govt.nz")
len(t3)
1370
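Once a timemap is a list of dicts, standard Python tools make it easy to summarise. This sketch uses a couple of made-up captures (in the same format as the results above) rather than a live request:

```python
from collections import Counter

# Made-up captures in the format returned by get_timemap_as_json()
captures = [
    {"urlkey": "au,com,example)/", "timestamp": "20050101000000", "url": "http://example.com.au/", "status": "200"},
    {"urlkey": "au,com,example)/", "timestamp": "20061231120000", "url": "http://example.com.au/", "status": "200"},
    {"urlkey": "au,com,example)/", "timestamp": "20070615000000", "url": "http://example.com.au/", "status": "301"},
]

# Count captures by year (first four digits of the timestamp) and by status code
years = Counter(c["timestamp"][:4] for c in captures)
statuses = Counter(c.get("status") for c in captures)
print(years)
print(statuses)
```

The same pattern works on `t1`, `t2`, or `t3` above, though note that NZWA results won't include `status` unless you set `enrich_data=True`.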
Works with AWA, IA, NZWA, & UKWA
The CDX APIs of the Internet Archive and PyWb-based systems such as the AWA, UKWA, and NZWA behave slightly differently. These differences are documented in Comparing CDX APIs. The functions below smooth out some of these bumps and should return consistently formatted results from the three repositories.
To get all the captures of a url in JSON format:
query_cdx([timegate], [url], [other optional parameters])
Required parameters:

- `timegate` – one of 'ukwa' (UK), 'awa' (Australia), 'nzwa' (New Zealand), or 'ia' (Internet Archive)
- `url` – the url you want to look for in the archive

Supplying these parameters only is essentially the equivalent of asking for a Timemap (though when I compared results, I found the CDX API included more duplicates). One advantage of the CDX API is that you can filter results by supplying additional parameters. These optional parameters can be anything the CDX APIs support, such as `from`, `to`, and `filter`. However, note that `from` is a reserved keyword in Python, so use `from_` instead. See below for some examples.

The data is returned in JSON format. The number of fields returned varies, but these will always be present:

- `urlkey` – SURT formatted url (in the case of NZWA this is generated by the script rather than the archive)
- `timestamp` – the date and time when the page was captured by the archive, in `YYYYMMDDHHmmss` format
- `url` – the url of the page that was captured
- `status` – HTTP status code returned by the capture request
- `mime` – the mimetype of the captured resource
- `digest` – algorithmically generated string that uniquely identifies the contents of the captured resource

APIS = {
    "ia": {"url": "http://web.archive.org/cdx/search/cdx", "type": "wb"},
    "awa": {"url": "https://web.archive.org.au/awa/cdx", "type": "pywb"},
    "nzwa": {
        "url": "https://ndhadeliver.natlib.govt.nz/webarchive/cdx",
        "type": "pywb",
    },
    "ukwa": {
        "url": "https://www.webarchive.org.uk/wayback/archive/cdx",
        "type": "pywb",
    },
}
def normalise_filter(api, f):
    """
    Normalise parameter names and regexp formatting across CDX systems.
    """
    sys_type = APIS[api]["type"]
    if sys_type == "pywb":
        f = f.replace("mimetype:", "mime:")
        f = f.replace("statuscode:", "status:")
        f = f.replace("original:", "url:")
        f = re.sub(r"^(!{0,1})(\w)", r"\1~\2", f)
    elif sys_type == "wb":
        f = f.replace("mime:", "mimetype:")
        f = f.replace("status:", "statuscode:")
        f = f.replace("url:", "original:")
    return f
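The trickiest part of `normalise_filter` is the regular expression that inserts pywb's `~` prefix (for regexp matching) while preserving a leading `!` (negation). Here's that pywb branch condensed into a hypothetical standalone helper so you can see the transformations:

```python
import re

def to_pywb(f):
    """Hypothetical helper: condensed version of normalise_filter's pywb branch."""
    f = f.replace("mimetype:", "mime:")
    f = f.replace("statuscode:", "status:")
    # Insert '~' after an optional leading '!', before the field name
    return re.sub(r"^(!{0,1})(\w)", r"\1~\2", f)

print(to_pywb("statuscode:200"))   # ~status:200
print(to_pywb("!mime:text/html"))  # !~mime:text/html
```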
def normalise_filters(api, filters):
    """
    Normalise a single filter string, or a list of filter strings.
    """
    if isinstance(filters, list):
        normalised = []
        for f in filters:
            normalised.append(normalise_filter(api, f))
    else:
        normalised = normalise_filter(api, filters)
    return normalised
def query_cdx(api, url, **kwargs):
    """
    Query a CDX API and return the results as a list of dicts.
    """
    params = kwargs
    if "filter" in params:
        params["filter"] = normalise_filters(api, params["filter"])
    # CDX accepts a 'from' parameter, but this is a reserved word in Python.
    # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.
    if "from_" in params:
        params["from"] = params["from_"]
        del params["from_"]
    params["url"] = url
    params["output"] = "json"
    response = requests.get(APIS[api]["url"], params=params)
    response.raise_for_status()
    response_type = response.headers["content-type"].split(";")[0]
    if response_type == "text/x-ndjson":
        data = [json.loads(line) for line in response.text.splitlines()]
    elif response_type == "application/json":
        data = convert_lists_to_dicts(response.json())
    return data
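The `from_` handling can be seen in isolation. This hypothetical helper (not part of the notebook's own functions) reproduces just the keyword-argument preparation from `query_cdx`, without making a request:

```python
def prepare_params(url, **kwargs):
    """Hypothetical helper: the params-building step of query_cdx."""
    params = kwargs
    # 'from' is reserved in Python, so callers pass 'from_' and we rename it here
    if "from_" in params:
        params["from"] = params.pop("from_")
    params["url"] = url
    params["output"] = "json"
    return params

print(prepare_params("http://example.com", from_="2005", to="2006"))
```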
# No filters -- give us all the captures!
d1 = query_cdx("ia", "http://discontents.com.au")
len(d1)
351
# First result
d1[0]
{'urlkey': 'au,com,discontents)/', 'timestamp': '19981206012233', 'digest': 'FQJ6JMPIZ7WEKYPQ4SGPVHF57GCV6B36', 'length': '1610', 'status': '200', 'mime': 'text/html', 'url': 'http://www.discontents.com.au:80/'}
# Last result -- note that the results are in date order, so this is the most recent
d1[-1]
{'urlkey': 'au,com,discontents)/', 'timestamp': '20220331081122', 'digest': 'LK7AWVZ7UN745CBGJNEVA3QJMKLJ4N4V', 'length': '6582', 'status': '200', 'mime': 'text/html', 'url': 'http://discontents.com.au/'}
# Filter by status code - note the number of results decreases
d2 = query_cdx("ia", "http://discontents.com.au", filter="status:200")
len(d2)
308
# Filter by date range using from_ and to
d3 = query_cdx("ia", "http://discontents.com.au", from_="2005", to="2006")
len(d3)
25
# First result should be from 2005
d3[0]
{'urlkey': 'au,com,discontents)/', 'timestamp': '20050209204432', 'digest': 'IWLJRLZLB7WBQNHYTVXJGD7TTARRGAXM', 'length': '1024', 'status': '200', 'mime': 'text/html', 'url': 'http://www.discontents.com.au:80/'}
# Last result should be from 2006
d3[-1]
{'urlkey': 'au,com,discontents)/', 'timestamp': '20061205043957', 'digest': 'QGCDU54UYAOMFBTZKGOV27NGYAFE27HZ', 'length': '1122', 'status': '200', 'mime': 'text/html', 'url': 'http://discontents.com.au:80/'}
# Same as d1, except from AWA
d4 = query_cdx("awa", "http://discontents.com.au")
len(d4)
152
Created by Tim Sherratt for the GLAM Workbench. Support me by becoming a GitHub sponsor!
Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020.