New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
CDX APIs provide access to an index of resources captured by web archives. The results can be filtered in a number of ways, making them a convenient way of harvesting and exploring the holdings of web archives. This notebook focuses on the data you can obtain from the Internet Archive's CDX API. For more information on the differences between this and other CDX APIs see Comparing CDX APIs. To examine differences between CDX data and Timemaps see Timemaps vs CDX APIs.
Notebooks demonstrating ways of getting and using CDX data include:
import re
from base64 import b32encode
from hashlib import sha1
import altair as alt
import arrow
import pandas as pd
import requests
from tqdm.auto import tqdm
Let's have a look at the sort of data the CDX server gives us. At the very least, we have to provide a url
parameter to point to a particular page (or domain as we'll see below). To avoid flinging too much data about, we'll also add a limit
parameter that tells the CDX server how many rows of data to give us.
# 8 April 2020 -- without the 'User-Agent' header parameter I get a 445 error
# 27 April 2020 - now seems ok without changing User-Agent
# Feel free to change these values
params1 = {"url": "http://nla.gov.au", "limit": 10}
# Get the data and print the results
response = requests.get("https://web.archive.org/cdx/search/cdx", params=params1)
print(response.text)
By default, the results are returned in a simple text format – fields are separated by spaces, and each result is on a separate line. It's a bit hard to read in this format, so let's add the output
parameter to get the results in JSON format. We'll then use Pandas to display the results in a table.
params2 = {"url": "http://nla.gov.au", "limit": 10, "output": "json"}
# Get the data and print the results
response = requests.get("http://web.archive.org/cdx/search/cdx", params=params2)
results = response.json()
# Use Pandas to turn the results into a DataFrame then display
pd.DataFrame(results[1:], columns=results[0]).head(10)
The JSON results are, in Python terms, a list of lists, rather than a list of dictionaries. The first of these lists contains the field names. If you look at the last line of the code above, you'll see that we use the first list (results[0]) to set the column names in the dataframe, while the rest of the data (results[1:]) makes up the rows.
Let's have a look at the fields:

urlkey – the page url expressed as a SURT (Sort-friendly URI Reordering Transform)
timestamp – the date and time of the capture in a YYYYMMDDhhmmss format
original – the url that was captured
mimetype – the type of file captured, expressed in a standard format
statuscode – a standard code provided by the web server that reports on the result of the capture request
digest – also known as a 'checksum' or 'fingerprint', the digest provides an algorithmically generated string that uniquely identifies the content of the captured url (there's a quick example below)
length – the size of the captured content in bytes (compressed on disk)

All makes perfect sense right? Hmmm, we'll dig a little deeper below, but first...
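About that digest value – as far as I can tell it's simply a SHA-1 hash of the captured content, encoded as a Base32 string. Here's a quick sketch that generates a string in the same format (hashing some made-up bytes rather than a real capture):

# Hash some sample bytes with SHA-1 and encode the result as Base32 --
# this produces a string in the same format as the CDX digest values
print(b32encode(sha1(b"some captured content").digest()).decode())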
We can use the timestamp
value to retrieve the contents of a particular capture. A url like this will open the captured resource in the Wayback Machine:
https://web.archive.org/web/[timestamp]/[url]
For example: https://web.archive.org/web/20130201130329/http://www.nla.gov.au/
If you want the original contents, without the modifications and navigation added by the Wayback Machine, just add id_
after the timestamp
:
https://web.archive.org/web/[timestamp]id_/[url]
For example: https://web.archive.org/web/20130201130329id_/http://www.nla.gov.au/
You'll probably notice that the original version doesn't look very pretty because links to CSS or JavaScript files are still pointing to their old, broken addresses. If you want a version without the Wayback Machine navigation, but with urls to any linked files rewritten to point to archived versions, then add if_ after the timestamp:
https://web.archive.org/web/[timestamp]if_/[url]
For example: https://web.archive.org/web/20130201130329if_/http://www.nla.gov.au/
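To save some typing, here's a little sketch that builds all three url forms from a capture's timestamp and original values (using the example capture above):

# Build the three Wayback Machine url forms for a single capture
timestamp = "20130201130329"
original = "http://www.nla.gov.au/"
print(f"https://web.archive.org/web/{timestamp}/{original}")  # with Wayback Machine navigation
print(f"https://web.archive.org/web/{timestamp}id_/{original}")  # original content, unmodified
print(f"https://web.archive.org/web/{timestamp}if_/{original}")  # no navigation, but links rewritten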
If you want to get all the captures of a particular page, you can just leave out the limit
parameter. However, there is (supposedly) a limit on the number of results returned in a single request. The API documentation says the current limit is 150,000, but it seems much larger – if you ask for cnn.com
without using limit
you get more than 290,000 results! To try and make sure that you're getting everything, there are a couple of ways you can break up the results set into chunks. The first is to set the showResumeKey
parameter to true
. Then, if there are more results available than are returned in your initial request, a couple of extra rows of data will be added to your results. The last row will include a resumption key, while the second last row will be empty, for example:
[],
['com%2Ccnn%29%2F+20000621011732']
You then set the resumeKey
parameter to the value of the resumption key, and add it to your next request. You can combine the use of the resumption key with the limit parameter to break a large collection of captures into manageable chunks.
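To see what this looks like in practice, here's a small sketch that asks for a handful of results with showResumeKey set, then pulls the key out of the last rows of the response:

# Request a few results with showResumeKey set to true
params = {
    "url": "http://nla.gov.au",
    "limit": 5,
    "output": "json",
    "showResumeKey": "true",
}
response = requests.get("http://web.archive.org/cdx/search/cdx", params=params)
results = response.json()
# If more results are available, the second-last row is empty
# and the last row contains the resumption key
if not results[-2]:
    print("Resumption key:", results[-1][0])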
The other way is to add a page
parameter, starting at 0
then incrementing the page
value by one until you've worked through the complete set of results. But how do you know the total number of pages? If you add showNumPages=true
to your query, the server will return a single number representing the total pages. But the pages themselves come from a special index and can contain different numbers of results depending on your query, so there's no obvious way to calculate the number of captures from the number of pages. Also, the maximum size of a page seems quite large and this sometimes causes errors. You can control this by adding a pageSize
parameter. The meaning of this value seems a bit mysterious, but I've found that a pageSize
of 5
seems to be a reasonable balance between the amount of data returned by each request, and the number of requests you have to make.
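For example, here's a quick check of how many pages of results there are for nla.gov.au with a pageSize of 5 (the number will change as the index grows):

# Ask the CDX server how many pages of results there are for this query
params = {"url": "http://nla.gov.au", "showNumPages": "true", "pageSize": 5}
response = requests.get("http://web.archive.org/cdx/search/cdx", params=params)
print(response.text)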
Let's put all this together in a few functions that will help us construct CDX queries of any size or complexity.
def check_for_resumption_key(results):
"""
    Checks to see if the second-last row is an empty list.
    If it is, returns the value in the last row as the resumption key.
"""
if not results[-2]:
return results[-1][0]
def get_total_pages(params):
"""
Gets the total number of pages in a set of results.
"""
these_params = params.copy()
these_params["showNumPages"] = "true"
response = requests.get(
"http://web.archive.org/cdx/search/cdx",
params=these_params,
headers={"User-Agent": ""},
)
return int(response.text)
def prepare_params(url, use_resume_key=False, **kwargs):
"""
    Prepare the parameters for a CDX API request.
Adds all supplied keyword arguments as parameters (changing from_ to from).
Adds in a few necessary parameters and showResumeKey if requested.
"""
params = kwargs
params["url"] = url
params["output"] = "json"
if use_resume_key:
params["showResumeKey"] = "true"
# CDX accepts a 'from' parameter, but this is a reserved word in Python
# Use 'from_' to pass the value to the function & here we'll change it back to 'from'.
if "from_" in params:
params["from"] = params["from_"]
del params["from_"]
return params
def get_cdx_data(params):
"""
Make a request to the CDX API using the supplied parameters.
Check the results for a resumption key, and return the key (if any) and the results.
"""
response = requests.get(
"http://web.archive.org/cdx/search/cdx",
params=params,
headers={"User-Agent": ""},
)
response.raise_for_status()
results = response.json()
resumption_key = check_for_resumption_key(results)
# Remove the resumption key from the results
if resumption_key:
results = results[:-2]
return resumption_key, results
def query_cdx_by_page(url, **kwargs):
    """
    Harvest results from the CDX API using the page parameter.
    Gets the total number of pages, then loops through them one at a time,
    keeping a single header row at the start of the combined results.
    """
all_results = []
page = 0
params = prepare_params(url, **kwargs)
total_pages = get_total_pages(params)
with tqdm(total=total_pages - page) as pbar1:
with tqdm() as pbar2:
while page < total_pages:
params["page"] = page
_, results = get_cdx_data(params)
if page == 0:
all_results += results
else:
all_results += results[1:]
page += 1
pbar1.update(1)
pbar2.update(len(results) - 1)
return all_results
def query_cdx_with_key(url, **kwargs):
"""
Harvest results from the CDX API using the supplied parameters.
    Uses showResumeKey to check whether there is more than one page of results,
and if so loops through pages until all results are downloaded.
"""
params = prepare_params(url, use_resume_key=True, **kwargs)
with tqdm() as pbar:
# This will include the header row
resumption_key, all_results = get_cdx_data(params)
pbar.update(len(all_results) - 1)
while resumption_key is not None:
params["resumeKey"] = resumption_key
resumption_key, results = get_cdx_data(params)
# Remove the header row and add
all_results += results[1:]
pbar.update(len(results) - 1)
return all_results
To harvest all of the captures of 'http://www.nla.gov.au', you can just call:
results = query_cdx_with_key('http://www.nla.gov.au')
To break the harvest down into chunks of 1,000 results at a time, you'd call:
results = query_cdx_with_key('http://www.nla.gov.au', limit=1000)
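If you'd rather use the page-based approach, the equivalent call (with a pageSize of 5, as suggested above) would be something like:

results = query_cdx_by_page('http://www.nla.gov.au', pageSize=5)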
There are a number of other parameters you can use to filter results from the CDX API – you can supply any of these as well. We'll see some examples below.
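For example, a call like this (using the API's from, to, and filter parameters – the values here are just for illustration) would harvest only captures made between 2010 and 2015 that returned a 200 status code:

results = query_cdx_with_key('http://www.nla.gov.au', from_='2010', to='2015', filter='statuscode:200')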
So let's get all the captures of 'http://www.nla.gov.au'.
results = query_cdx_with_key("http://www.nla.gov.au")
And convert them into a dataframe.
df = pd.DataFrame(results[1:], columns=results[0])
How many captures are there?
df.shape
Ok, now we've got a dataset, let's look at the structure of the data in a little more detail.
As noted above, the urlkey
field contains things that are technically known as SURTs (Sort-friendly URI Reordering Transform). Basically, the order of components in the url's domain are reversed to make captures easier to sort and group. So instead of nla.gov.au
we have au,gov,nla
. The path component of the url, the bit that points to a specific file within the domain, is tacked on the end of the urlkey
after a closing bracket. Here are some examples:
http://www.nla.gov.au
becomes au,gov,nla
plus the path /
, so the urlkey is:
au,gov,nla)/
http://www.dsto.defence.gov.au/attachments/9%20LEWG%20Oct%2008%20DEU.ppt
becomes au,gov,defence,dsto
plus the path /attachments/9%20lewg%20oct%2008%20deu.ppt
, so the urlkey is:
au,gov,defence,dsto)/attachments/9%20lewg%20oct%2008%20deu.ppt
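If you're curious about how the transformation works, here's a very simplified sketch (the real implementation – see, for example, the surt Python package – handles many more edge cases):

# A very simplified SURT conversion -- just enough to illustrate the idea
def simple_surt(url):
    # Lowercase everything, then strip the protocol and any www/www2 prefix
    url = re.sub(r"^https?://(www\d*\.)?", "", url.lower())
    host, _, path = url.partition("/")
    # Reverse the domain components and reattach the path after a closing bracket
    return ",".join(reversed(host.split("."))) + ")/" + path

simple_surt("http://www.dsto.defence.gov.au/attachments/9%20LEWG%20Oct%2008%20DEU.ppt")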
From the examples above, you'll notice there's a bit of extra normalisation going on. For example, the url components are all converted to lowercase. You might also be wondering what happened to the www
subdomain. By convention these are aliases that just point to the underlying domain – www.nla.gov.au
ends up at the same place as nla.gov.au
– so they're removed from the SURT. We can explore this a bit further by comparing the original
urls in our dataset to the urlkeys
.
How many unique urlkey
s are there? Hopefully just one, as we're gathering captures from a single url!
df["urlkey"].unique().shape[0]
But how many different original
urls were captured?
df["original"].unique().shape[0]
Let's have a look at them.
df["original"].value_counts()
So we can see that as well as removing www
, the normalisation process removes www2
and port numbers, and groups together the http
and https
protocols. There are also some odd things that look like email addresses and were probably harvested by mistake from mailto
links.
But wait a minute, our original query was just for the url http://nla.gov.au
, why did we get all these other urls? When we request a particular url
from the CDX API, it matches results based on the url's SURT, not on the original url. This ensures that we get all the variations in the way the url might be expressed. If we want to limit results to a specific form of the url, we can do that by filtering on the original
field, as we'll see below.
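For example, once the results are in a dataframe, you can quickly pull out the captures that use one particular form of the url (assuming that form appears in your results):

# Show only the captures where the original url matches one specific form
df.loc[df["original"] == "http://www.nla.gov.au/"].head()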
Because the urlkey
is essentially a normalised identifier for an individual url, you can use it to group together all the captures of individual pages across a whole domain. For example, if we wanted to know how many urls have been captured from the nla.gov.au
domain, we can call our query function like this:
results = query_cdx_with_key('nla.gov.au/*', collapse='urlkey', limit=1000)
Note that the url
parameter includes a *
to indicate that we want everything under the nla.gov.au
domain. The collapse='urlkey'
parameter says that we only want unique urlkey
values – so we'll get just one capture for each individual url within the nla.gov.au
domain. This can be a useful way of gathering a domain-level summary.
The timestamp
field is pretty straightforward: it contains the date and time of the capture expressed in the format YYYYMMDDhhmmss. Once we have the harvested results in a dataframe, we can easily convert the timestamps into datetime objects.
df["date"] = pd.to_datetime(df["timestamp"])
This makes it possible to plot the number of captures over time. Here we group the captures by year.
alt.Chart(df).mark_bar().encode(x="year(date):T", y="count()").properties(
width=700, height=200
)