Exploring the Internet Archive's CDX API

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

CDX APIs provide access to an index of resources captured by web archives. The results can be filtered in a number of ways, making them a convenient way of harvesting and exploring the holdings of web archives. This notebook focuses on the data you can obtain from the Internet Archive's CDX API. For more information on the differences between this and other CDX APIs see Comparing CDX APIs. To examine differences between CDX data and Timemaps see Timemaps vs CDX APIs.

Notebooks demonstrating ways of getting and using CDX data include:

In [4]:
import re
from base64 import b32encode
from hashlib import sha1

import altair as alt
import arrow
import pandas as pd
import requests
from tqdm.auto import tqdm

Useful resources

Your first CDX request

Let's have a look at the sort of data the CDX server gives us. At the very least, we have to provide a url parameter to point to a particular page (or domain as we'll see below). To avoid flinging too much data about, we'll also add a limit parameter that tells the CDX server how many rows of data to give us.

In [5]:
# 8 April 2020 -- without the 'User-Agent' header parameter I get a 445 error
# 27 April 2020 - now seems ok without changing User-Agent

# Feel free to change these values
params1 = {"url": "http://nla.gov.au", "limit": 10}

# Get the data and print the results
response = requests.get("https://web.archive.org/cdx/search/cdx", params=params1)
print(response.text)
au,gov,nla)/ 19961019064223 http://www.nla.gov.au:80/ text/html 200 M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI 1135
au,gov,nla)/ 19961221102755 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1138
au,gov,nla)/ 19961221132358 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603
au,gov,nla)/ 19961223031839 http://www2.nla.gov.au:80/ text/html 200 6XHDP66AXEPMVKVROHHDN6CPZYHZICEX 457
au,gov,nla)/ 19970212053405 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1141
au,gov,nla)/ 19970215222554 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603
au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1126
au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1140
au,gov,nla)/ 19970413005246 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603
au,gov,nla)/ 19970418074154 http://www.nla.gov.au:80/ text/html 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1123

By default, the results are returned in a simple text format – fields are separated by spaces, and each result is on a separate line. It's a bit hard to read in this format, so let's add the output parameter to get the results in JSON format. We'll then use Pandas to display the results in a table.

In [6]:
params2 = {"url": "http://nla.gov.au", "limit": 10, "output": "json"}

# Get the data and print the results
response = requests.get("http://web.archive.org/cdx/search/cdx", params=params2)
results = response.json()

# Use Pandas to turn the results into a DataFrame then display
pd.DataFrame(results[1:], columns=results[0]).head(10)
Out[6]:
urlkey timestamp original mimetype statuscode digest length
0 au,gov,nla)/ 19961019064223 http://www.nla.gov.au:80/ text/html 200 M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI 1135
1 au,gov,nla)/ 19961221102755 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1138
2 au,gov,nla)/ 19961221132358 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603
3 au,gov,nla)/ 19961223031839 http://www2.nla.gov.au:80/ text/html 200 6XHDP66AXEPMVKVROHHDN6CPZYHZICEX 457
4 au,gov,nla)/ 19970212053405 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1141
5 au,gov,nla)/ 19970215222554 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603
6 au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1126
7 au,gov,nla)/ 19970315230640 http://www.nla.gov.au:80/ text/html 200 TM4WSQIGWXAXMB36G4GVOY7MVPTO6CSE 1140
8 au,gov,nla)/ 19970413005246 http://nla.gov.au:80/ text/html 200 65SH4ZQ7ZYTTPYSVFQUSKZXJPZKSI6XA 603
9 au,gov,nla)/ 19970418074154 http://www.nla.gov.au:80/ text/html 200 NOUNS3AYAIAOO4LRFD23MQWW3QIGDMFB 1123

The JSON results are, in Python terms, a list of lists, rather than a list of dictionaries. The first of these lists contains the field names. If you look at the line below, you'll see that we use the first list (results[0]) to set the column names in the dataframe, while the rest of the data (results[1:]) makes up the rows.

pd.DataFrame(results[1:], columns=results[0]).head(10)

Let's have a look at the fields.

  • urlkey – the page url expressed as a SURT (Sort-friendly URI Reordering Transform)
  • timestamp – the date and time of the capture in a YYYYMMDDhhmmss format
  • original – the url that was captured
  • mimetype – the type of file captured, expressed in a standard format
  • statuscode – a standard code provided by the web server that reports on the result of the capture request
  • digest – also known as a 'checksum' or 'fingerprint', the digest provides an algorithmically generated string that uniquely identifies the content of the captured url
  • length – the size of the captured content in bytes (compressed on disk)

All makes perfect sense, right? Hmmm, we'll dig a little deeper below, but first...

Requesting a particular capture

We can use the timestamp value to retrieve the contents of a particular capture. A url like this will open the captured resource in the Wayback Machine:

https://web.archive.org/web/[timestamp]/[url]

For example: https://web.archive.org/web/20130201130329/http://www.nla.gov.au/

If you want the original contents, without the modifications and navigation added by the Wayback Machine, just add id_ after the timestamp:

https://web.archive.org/web/[timestamp]id_/[url]

For example: https://web.archive.org/web/20130201130329id_/http://www.nla.gov.au/

You'll probably notice that the original version doesn't look very pretty because links to CSS or JavaScript files are still pointing to their old, broken addresses. If you want a version without the Wayback Machine navigation, but with urls to any linked files rewritten to point to archived versions, then add if_ after the timestamp.

https://web.archive.org/web/[timestamp]if_/[url]

For example: https://web.archive.org/web/20130201130329if_/http://www.nla.gov.au/
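
We can also use the id_ version together with the digest field we saw earlier. As far as I can tell, the digest is just the base32-encoded SHA-1 hash of the captured content, so if we download the original version of a capture we should be able to recreate it. Here's a rough sketch using the first capture from the results above – note that re-downloading the content won't always reproduce the stored digest exactly, so treat this as illustrative rather than a guaranteed check.

# Values from the first capture in the results above
timestamp = "19961019064223"
original = "http://www.nla.gov.au:80/"
stored_digest = "M5ORM4XQ5QCEZEDRNZRGSWXPCOGUVASI"

# Request the unmodified content using the id_ modifier
capture = requests.get(f"https://web.archive.org/web/{timestamp}id_/{original}")

# Recompute the digest – base32-encoded SHA-1 of the raw content
computed_digest = b32encode(sha1(capture.content).digest()).decode()

print(f"Stored:   {stored_digest}")
print(f"Computed: {computed_digest}")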

Getting all the captures of a particular page

If you want to get all the captures of a particular page, you can just leave out the limit parameter. However, there is (supposedly) a limit on the number of results returned in a single request. The API documentation says the current limit is 150,000, but it seems much larger – if you ask for cnn.com without using limit you get more than 290,000 results! To try and make sure that you're getting everything, there are a couple of ways you can break the results set into chunks. The first is to set the showResumeKey parameter to true. Then, if there are more results available than are returned in your initial request, a couple of extra rows of data will be added to your results. The last row will include a resumption key, while the second-last row will be empty, for example:

[],
['com%2Ccnn%29%2F+20000621011732']

You then set the resumeKey parameter to the value of the resumption key, and add it to your next request. You can combine the use of the resumption key with the limit parameter to break a large collection of captures into manageable chunks.
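
To see what this looks like in practice, here's a minimal sketch of a single request that combines showResumeKey with a small limit – the last two rows of the JSON results should be the empty list and the resumption key.

# A small request using showResumeKey
params = {
    "url": "http://nla.gov.au",
    "limit": 5,
    "output": "json",
    "showResumeKey": "true",
}
response = requests.get("https://web.archive.org/cdx/search/cdx", params=params)
results = response.json()

# The empty row and the resumption key are appended after the results
print(results[-2])
print(results[-1])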

The other way is to add a page parameter, starting at 0 and incrementing the page value by one until you've worked through the complete set of results. But how do you know the total number of pages? If you add showNumPages=true to your query, the server will return a single number representing the total number of pages. But the pages themselves come from a special index and can contain different numbers of results depending on your query, so there's no obvious way to calculate the number of captures from the number of pages. Also, the maximum size of a page seems quite large and this sometimes causes errors. You can control this by adding a pageSize parameter. The meaning of this value seems a bit mysterious, but I've found that a pageSize of 5 seems to be a reasonable balance between the amount of data returned by each request and the number of requests you have to make.
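
If you're curious about the effect of pageSize, here's a quick sketch that asks showNumPages to report the page count for a few different pageSize values – the exact numbers will change over time as new captures are added, but you should see the page count drop as the page size grows.

# Compare the number of pages reported for different pageSize values
for size in [1, 5, 50]:
    params = {"url": "nla.gov.au/*", "showNumPages": "true", "pageSize": size}
    response = requests.get("https://web.archive.org/cdx/search/cdx", params=params)
    print(f"pageSize {size}: {response.text.strip()} pages")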

Let's put all this together in a few functions that will help us construct CDX queries of any size or complexity.

In [7]:
def check_for_resumption_key(results):
    """
    Check whether the second-last row is an empty list;
    if it is, return the last value as the resumption key.
    """
    if not results[-2]:
        return results[-1][0]


def get_total_pages(params):
    """
    Gets the total number of pages in a set of results.
    """
    these_params = params.copy()
    these_params["showNumPages"] = "true"
    response = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params=these_params,
        headers={"User-Agent": ""},
    )
    return int(response.text)


def prepare_params(url, use_resume_key=False, **kwargs):
    """
    Prepare the parameters for a CDX API request.
    Adds all supplied keyword arguments as parameters (changing from_ to from).
    Adds in a few necessary parameters and showResumeKey if requested.
    """
    params = kwargs
    params["url"] = url
    params["output"] = "json"
    if use_resume_key:
        params["showResumeKey"] = "true"
    # CDX accepts a 'from' parameter, but this is a reserved word in Python
    # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.
    if "from_" in params:
        params["from"] = params["from_"]
        del params["from_"]
    return params


def get_cdx_data(params):
    """
    Make a request to the CDX API using the supplied parameters.
    Check the results for a resumption key, and return the key (if any) and the results.
    """
    response = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params=params,
        headers={"User-Agent": ""},
    )
    response.raise_for_status()
    results = response.json()
    resumption_key = check_for_resumption_key(results)
    # Remove the resumption key from the results
    if resumption_key:
        results = results[:-2]
    return resumption_key, results


def query_cdx_by_page(url, **kwargs):
    """
    Harvest results from the CDX API using the supplied parameters.
    Uses showNumPages to get the total number of pages,
    then loops through them, requesting one page at a time.
    """
    all_results = []
    page = 0
    params = prepare_params(url, **kwargs)
    total_pages = get_total_pages(params)
    with tqdm(total=total_pages - page) as pbar1:
        with tqdm() as pbar2:
            while page < total_pages:
                params["page"] = page
                _, results = get_cdx_data(params)
                if page == 0:
                    all_results += results
                else:
                    all_results += results[1:]
                page += 1
                pbar1.update(1)
                pbar2.update(len(results) - 1)
    return all_results


def query_cdx_with_key(url, **kwargs):
    """
    Harvest results from the CDX API using the supplied parameters.
    Uses showResumeKey to check if there is more than one page of results,
    and if so loops through pages until all results are downloaded.
    """
    params = prepare_params(url, use_resume_key=True, **kwargs)
    with tqdm() as pbar:
        # This will include the header row
        resumption_key, all_results = get_cdx_data(params)
        pbar.update(len(all_results) - 1)
        while resumption_key is not None:
            params["resumeKey"] = resumption_key
            resumption_key, results = get_cdx_data(params)
            # Remove the header row and add the rest to the results
            all_results += results[1:]
            pbar.update(len(results) - 1)
    return all_results

To harvest all of the captures of 'http://www.nla.gov.au', you can just call:

results = query_cdx_with_key('http://www.nla.gov.au')

To break the harvest down into chunks of 1,000 results at a time, you'd call:

results = query_cdx_with_key('http://www.nla.gov.au', limit=1000)

There are a number of other parameters you can use to filter results from the CDX API; you can supply any of these as well. We'll see some examples below.

So let's get all the captures of 'http://www.nla.gov.au'.

In [8]:
results = query_cdx_with_key("http://www.nla.gov.au")

And convert them into a dataframe.

In [9]:
df = pd.DataFrame(results[1:], columns=results[0])

How many captures are there?

In [10]:
df.shape
Out[10]:
(4405, 7)

Ok, now we've got a dataset, let's look at the structure of the data in a little more detail.

CDX data in depth

SURTs, urlkeys, & urls

As noted above, the urlkey field contains things that are technically known as SURTs (Sort-friendly URI Reordering Transform). Basically, the order of components in the url's domain is reversed to make captures easier to sort and group. So instead of nla.gov.au we have au,gov,nla. The path component of the url, the bit that points to a specific file within the domain, is tacked on the end of the urlkey after a closing bracket. Here are some examples:

http://www.nla.gov.au becomes au,gov,nla plus the path /, so the urlkey is:

au,gov,nla)/

http://www.dsto.defence.gov.au/attachments/9%20LEWG%20Oct%2008%20DEU.ppt becomes au,gov,defence,dsto plus the path /attachments/9%20lewg%20oct%2008%20deu.ppt, so the urlkey is:

au,gov,defence,dsto)/attachments/9%20lewg%20oct%2008%20deu.ppt

From the examples above, you'll notice there's a bit of extra normalisation going on. For example, the url components are all converted to lowercase. You might also be wondering what happened to the www subdomain. By convention these are aliases that just point to the underlying domain – www.nla.gov.au ends up at the same place as nla.gov.au – so they're removed from the SURT. We can explore this a bit further by comparing the original urls in our dataset to the urlkeys.
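
To make the transformation a little more concrete, here's a very simplified sketch of the basic steps. It only handles straightforward cases like the examples above – the full SURT rules deal with ports, query strings, session ids and plenty of other edge cases, and there's a dedicated surt Python package if you need the real thing.

import re
from urllib.parse import urlparse


def simple_surt(url):
    """A simplified, illustrative SURT – not the full specification."""
    parsed = urlparse(url.lower())
    host = parsed.hostname or ""
    # Drop a leading 'www', 'www2' etc subdomain
    host = re.sub(r"^www\d*\.", "", host)
    # Reverse the order of the domain components
    key = ",".join(reversed(host.split(".")))
    # Tack the path on the end after a closing bracket
    return f"{key}){parsed.path or '/'}"


print(simple_surt("http://www.nla.gov.au"))
print(simple_surt("http://www.dsto.defence.gov.au/attachments/9%20LEWG%20Oct%2008%20DEU.ppt"))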

How many unique urlkeys are there? Hopefully just one, as we're gathering captures from a single url!

In [11]:
df["urlkey"].unique().shape[0]
Out[11]:
1

But how many different original urls were captured?

In [12]:
df["original"].unique().shape[0]
Out[12]:
20

Let's have a look at them.

In [13]:
df["original"].value_counts()
Out[13]:
https://www.nla.gov.au/                  1526
http://www.nla.gov.au/                   1199
http://www.nla.gov.au:80/                 868
http://nla.gov.au/                        604
http://nla.gov.au:80/                      77
https://nla.gov.au/                        62
http://www.nla.gov.au//                    21
http://www.nla.gov.au                      11
http://www2.nla.gov.au:80/                 10
https://www.nla.gov.au                     10
http://[email protected]/                    6
http://www.nla.gov.au:80/?                  2
http://www.nla.gov.au./                     2
http://nla.gov.au                           1
http://mailto:[email protected]/             1
http://[email protected]/                1
http://mailto:[email protected]/       1
http://mailto:[email protected]/               1
http://www.nla.gov.au:80//                  1
http://www.nla.gov.au/?                     1
Name: original, dtype: int64

So we can see that as well as removing www, the normalisation process removes www2 and port numbers, and groups together the http and https protocols. There are also some odd things that look like email addresses and were probably harvested by mistake from mailto links.

But wait a minute – our original query was just for the url http://nla.gov.au, so why did we get all these other urls? When we request a particular url from the CDX API, it matches results based on the url's SURT, not on the original url. This ensures that we get all the variations in the way the url might be expressed. If we want to limit results to a specific form of the url, we can do that by filtering on the original field, as we'll see below.

Because the urlkey is essentially a normalised identifier for an individual url, you can use it to group together all the captures of individual pages across a whole domain. For example, if we wanted to know how many urls have been captured from the nla.gov.au domain, we can call our query function like this:

results = query_cdx_with_key('nla.gov.au/*', collapse='urlkey', limit=1000)

Note that the url parameter includes a * to indicate that we want everything under the nla.gov.au domain. The collapse='urlkey' parameter says that we only want unique urlkey values – so we'll get just one capture for each individual url within the nla.gov.au domain. This can be a useful way of gathering a domain-level summary.
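
Here's a sketch of how you might turn that domain-level query into a quick summary with Pandas. Be warned that harvesting every unique url from a large domain can take quite a while and return a lot of data.

# Harvest one capture per unique urlkey across the nla.gov.au domain
# (limit just breaks the harvest into chunks of 1,000 results per request)
domain_results = query_cdx_with_key("nla.gov.au/*", collapse="urlkey", limit=1000)
domain_df = pd.DataFrame(domain_results[1:], columns=domain_results[0])

# How many unique urls have been captured from the domain?
print(domain_df.shape[0])

# What types of files are they?
domain_df["mimetype"].value_counts().head(10)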

Timestamps

The timestamp field is pretty straightforward – it contains the date and time of the capture expressed in the format YYYYMMDDhhmmss. Once we have the harvested results in a dataframe, we can easily convert the timestamps into datetime objects.

In [14]:
df["date"] = pd.to_datetime(df["timestamp"])

This makes it possible to plot the number of captures over time. Here we group the captures by year.

In [15]:
alt.Chart(df).mark_bar().encode(x="year(date):T", y="count()").properties(
    width=700, height=200
)
Out[15]:
[Bar chart showing the number of captures per year]