Harvesting data about a domain using the IA CDX API¶

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

In this notebook we'll look at how we can get domain level data from a CDX API. There are two types of search you can use:

a 'prefix' query – searching for nla.gov.au/* returns captures from the nla.gov.au domain
a 'domain' query – searching for *.nla.gov.au returns captures from the nla.gov.au domain and any subdomains

These searches can be combined with any of the other filters supported by the CDX API, such as mimetype and statuscode.
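
As a rough illustration of how these query types and filters end up in a request, the sketch below builds CDX query URLs using only the standard library. The helper name build_cdx_url is made up for this example and isn't part of the notebook; the endpoint is the one used by the functions further down.

```python
from urllib.parse import urlencode

# Endpoint used by the functions later in this notebook
CDX_ENDPOINT = 'http://web.archive.org/cdx/search/cdx'

def build_cdx_url(url, **params):
    '''
    Hypothetical helper: combine a url pattern and any extra CDX
    parameters into a query string. List values (eg multiple filters)
    are repeated as separate parameters.
    '''
    query = {'url': url, **params}
    return f'{CDX_ENDPOINT}?{urlencode(query, doseq=True)}'

# Prefix query using a url wildcard, limited by two filters
prefix_url = build_cdx_url('nla.gov.au/*', filter=['statuscode:200', 'mimetype:text/html'])

# Domain query using the matchType parameter instead of a wildcard
domain_url = build_cdx_url('nla.gov.au', matchType='domain')

print(prefix_url)
print(domain_url)
```

The harvesting functions below don't build URLs by hand like this. They pass a dictionary of parameters to requests, which handles the encoding.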

As noted in Comparing CDX APIs, support for domain level searching varies across systems. The Australian Web Archive (AWA) allows prefix queries, but not domain queries. The UK Web Archive (UKWA) provides both in theory, but timeouts are common for large domains. Neither the AWA nor the UKWA supports pagination, so harvesting data from large domains can be difficult. For these reasons it seems sensible to focus on the Internet Archive (IA) CDX API, unless you're after data from a single, modestly sized domain.

Related notebooks:

Exploring the Internet Archive's CDX API
Comparing CDX APIs
Find all the archived versions of a web page – shows how to use an 'exact' search with the CDX API
Find and explore Powerpoint presentations from a specific domain – example of finding particular types of files within a domain

In most other notebooks using the CDX API we've harvested data into memory and then saved to disk later on. Because we're potentially harvesting much larger quantities of data, it's probably a good idea to reverse this and save harvested data to disk as we download it. We can also use requests-cache to save responses from the API and make it easy to restart a failed harvest. This is the same strategy used in the Exploring subdomains in the gov.au domain notebook where I harvest data about 189 million captures.
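
The 'save as we download' idea can be sketched without any of the harvesting code. This example uses the standard json module rather than the ndjson package, and the helper names append_records and count_records are invented for illustration. Each batch is appended to the file as it arrives, so nothing accumulates in memory.

```python
import json
import tempfile
from pathlib import Path

def append_records(file_path, records):
    '''
    Hypothetical helper: append a batch of records to a
    newline-delimited JSON file, one JSON object per line.
    '''
    with Path(file_path).open('a', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

def count_records(file_path):
    '''Count the records saved so far without loading them all into memory.'''
    with Path(file_path).open(encoding='utf-8') as f:
        return sum(1 for line in f if line.strip())

# Simulate two 'pages' of results arriving one after the other
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp, 'demo.ndjson')
    append_records(demo_file, [{'page': 0, 'n': 1}, {'page': 0, 'n': 2}])
    append_records(demo_file, [{'page': 1, 'n': 3}])
    saved = count_records(demo_file)

print(saved)
```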

Usage¶

Prefix query¶

Either using a url wildcard:

harvest_cdx_query_to_file('[domain]/*', [optional parameters])

or the matchType parameter:

harvest_cdx_query_to_file('[domain]', matchType='prefix', [optional parameters])

Domain query¶

Either using a url wildcard:

harvest_cdx_query_to_file('*.[domain]', [optional parameters])

or the matchType parameter:

harvest_cdx_query_to_file('[domain]', matchType='domain', [optional parameters])

Output¶

The results of each harvest are stored in a timestamped .ndjson file in a subdirectory of the domains directory. For example, a harvest from nla.gov.au is stored in domains/nla-gov-au. The file names combine the domain, the type of query (either 'prefix' or 'domain') and a timestamp. For example, a prefix query on nla.gov.au might generate a file named:

nla-gov-au-prefix-20200526113338.ndjson

Each harvest also creates a metadata file that has a similar name, but is in JSON format, for example:

nla-gov-au-prefix-20200526113338-metadata.json

The metadata file captures information about your harvest including:

params – the parameters used in your query (including any filters)
timestamp – date and time the harvest was started
file – path to the ndjson data file.
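
Because each harvest has its own metadata file, you can get an overview of past harvests by reading them back in. This is only a sketch: list_harvests is a made-up helper, and the demonstration writes invented metadata into a temporary directory rather than touching a real domains directory.

```python
import json
import tempfile
from pathlib import Path

def list_harvests(base_dir='domains'):
    '''
    Hypothetical helper: read every *-metadata.json file in the
    subdirectories of base_dir and return them sorted by timestamp.
    '''
    base = Path(base_dir)
    if not base.is_dir():
        return []
    harvests = []
    for md_path in base.glob('*/*-metadata.json'):
        with md_path.open() as md_json:
            harvests.append(json.load(md_json))
    return sorted(harvests, key=lambda md: md['timestamp'])

# Demonstrate with a throwaway directory and made-up metadata
with tempfile.TemporaryDirectory() as tmp:
    harvest_dir = Path(tmp, 'nla-gov-au')
    harvest_dir.mkdir()
    for ts in ['20200526123711', '20200526113338']:
        md = {
            'params': {'url': 'nla.gov.au/*', 'output': 'json'},
            'timestamp': ts,
            'file': str(Path('domains', 'nla-gov-au', f'nla-gov-au-prefix-{ts}.ndjson')),
        }
        Path(harvest_dir, f'nla-gov-au-prefix-{ts}-metadata.json').write_text(json.dumps(md))
    harvests = list_harvests(tmp)

for harvest in harvests:
    print(harvest['timestamp'], harvest['params']['url'], harvest['file'])
```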

Import what we need¶

In [47]:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm.auto import tqdm
import pandas as pd
import time
from requests_cache import CachedSession
import ndjson
from pathlib import Path
from slugify import slugify
import arrow
import json

# By using a cached session, all responses will be saved in a local cache
s = CachedSession()
retries = Retry(total=10, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))

Define some functions¶

In [48]:

def get_total_pages(params):
    '''
    Gets the total number of pages in a set of results.
    '''
    these_params = params.copy()
    these_params['showNumPages'] = 'true'
    response = s.get('http://web.archive.org/cdx/search/cdx', params=these_params, headers={'User-Agent': ''})
    return int(response.text)

def prepare_params(url, **kwargs):
    '''
    Prepare the parameters for a CDX API request.
    Adds all supplied keyword arguments as parameters (changing from_ to from).
    Adds in a few necessary parameters.
    '''
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    params['pageSize'] = 5
    # CDX accepts a 'from' parameter, but this is a reserved word in Python
    # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.
    if 'from_' in params:
        params['from'] = params.pop('from_')
    return params

def convert_lists_to_dicts(results):
    '''
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    '''
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        d['status'] = d.pop('statuscode')
        d['mime'] = d.pop('mimetype')
        d['url'] = d.pop('original')
    return results_as_dicts

def check_query_type(url):
    if url.startswith('*'):
        query_type = 'domain'
    elif url.endswith('*'):
        query_type = 'prefix'
    else:
        query_type = ''
    return query_type

def get_cdx_data(params):
    '''
    Make a request to the CDX API using the supplied parameters.
    Return results converted to a list of dicts.
    '''
    response = s.get('http://web.archive.org/cdx/search/cdx', params=params)
    response.raise_for_status()
    results = response.json()
    try:
        if not response.from_cache:
            time.sleep(0.2)
    except AttributeError:
        # Not using cache
        time.sleep(0.2)
    return convert_lists_to_dicts(results)

def save_metadata(output_dir, params, query_type, timestamp, file_path):
    md_path = Path(output_dir, f'{slugify(params["url"])}-{query_type}-{timestamp}-metadata.json')
    md = {
        'params': params,
        'timestamp': timestamp,
        'file': str(file_path)
    }
    with md_path.open('wt') as md_json:
        json.dump(md, md_json)

def harvest_cdx_query_to_file(url, **kwargs):
    '''
    Harvest capture data from a CDX query.
    Save results to an NDJSON-formatted file.
    '''
    params = prepare_params(url, **kwargs)
    total_pages = get_total_pages(params)
    output_dir = Path('domains', slugify(url))
    output_dir.mkdir(parents=True, exist_ok=True)
    # We'll use a timestamp to distinguish between versions
    timestamp = arrow.now().format('YYYYMMDDHHmmss')
    query_type = params.get('matchType', check_query_type(url))
    file_path = Path(output_dir, f'{slugify(url)}-{query_type}-{timestamp}.ndjson')
    save_metadata(output_dir, params, query_type, timestamp, file_path)
    page = 0
    with tqdm(total=total_pages-page) as pbar1:
        with tqdm() as pbar2:
            while page < total_pages:
                params['page'] = page
                results = get_cdx_data(params)
                with file_path.open('a') as f:
                    writer = ndjson.writer(f, ensure_ascii=False)
                    for result in results:
                        writer.writerow(result)
                page += 1
                pbar1.update(1)
                # The header row has already been removed, so count every result
                pbar2.update(len(results))
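
To make the conversion step in convert_lists_to_dicts a bit more concrete, here's a standalone sketch using made-up rows in the shape of the IA's JSON output, where the first row holds the field names. The values (including the digests) are invented, and rows_to_dicts is a simplified stand-in for the function above rather than a replacement for it.

```python
# Made-up sample: the first row holds the field names, the rest are captures
sample_rows = [
    ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length'],
    ['au,gov,nla)/', '20200101000000', 'http://nla.gov.au/', 'text/html', '200', 'EXAMPLEDIGEST1', '1000'],
    ['au,gov,nla)/about', '20200102000000', 'http://nla.gov.au/about', 'text/html', '200', 'EXAMPLEDIGEST2', '2000'],
]

# The same renaming applied by convert_lists_to_dicts
RENAMES = {'statuscode': 'status', 'mimetype': 'mime', 'original': 'url'}

def rows_to_dicts(rows):
    '''
    Simplified stand-in for convert_lists_to_dicts: pair each row
    with the (renamed) field names from the header row.
    '''
    if not rows:
        return []
    keys = [RENAMES.get(key, key) for key in rows[0]]
    return [dict(zip(keys, row)) for row in rows[1:]]

captures = rows_to_dicts(sample_rows)
print(captures[0]['url'], captures[0]['status'], captures[0]['mime'])
```

Note that the header row isn't included in the converted results, so the length of the list is the number of captures.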

Prefix query¶

For a 'prefix' query either set the matchType parameter to prefix or use a url wildcard like nla.gov.au/*.

Get all successful web page captures from the nla.gov.au domain.

In [ ]:

harvest_cdx_query_to_file('nla.gov.au/*', filter=['statuscode:200', 'mimetype:text/html'])

Use collapse to remove (most) records with duplicate values for urlkey. This should give us a list of unique urls from the nla.gov.au domain.

In [ ]:

harvest_cdx_query_to_file('nla.gov.au/*', filter=['statuscode:200', 'mimetype:text/html'], collapse='urlkey')
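
Why only 'most' duplicates? As I understand it, the CDX server only collapses records that are adjacent in the results, so a value that reappears later can slip through. The sketch below is a local approximation of that behaviour, not the server's actual implementation. The records are made up and collapse_adjacent is a hypothetical helper.

```python
from itertools import groupby

def collapse_adjacent(records, field):
    '''
    Local approximation of collapsing: keep only the first record in
    each run of adjacent records that share the same value for field.
    '''
    return [next(group) for _, group in groupby(records, key=lambda r: r[field])]

# Made-up records: the first urlkey reappears after a different one
records = [
    {'urlkey': 'au,gov,nla)/', 'timestamp': '20200101000000'},
    {'urlkey': 'au,gov,nla)/', 'timestamp': '20200201000000'},
    {'urlkey': 'au,gov,nla)/about', 'timestamp': '20200101000000'},
    {'urlkey': 'au,gov,nla)/', 'timestamp': '20200301000000'},
]

collapsed = collapse_adjacent(records, 'urlkey')
print([r['urlkey'] for r in collapsed])
```

If you need a strictly unique list, you could remove any remaining duplicates after harvesting, for example with drop_duplicates in Pandas.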

Domain query¶

For a 'domain' query either set the matchType parameter to domain or use a url wildcard like *.nla.gov.au.

In [ ]:

harvest_cdx_query_to_file('*.nla.gov.au', filter=['statuscode:200', 'mimetype:text/html'], collapse='urlkey')

Exploring results¶

You should be able to load smaller files using the ndjson module. If you're working with large data files (millions of captures) you might not want to load them all into memory. Have a look at Exploring subdomains in the gov.au domain for some ways of processing the data.

In [50]:

# Edit to point to your data file
data_file = 'domains/nla-gov-au/nla-gov-au-prefix-20200526123711.ndjson'
with open(data_file) as f:
    capture_data = ndjson.load(f)

You could then convert the capture data to a Pandas dataframe for analysis.

In [ ]:

df = pd.DataFrame(capture_data)
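
For files too large to load in one go, one option is to read the file a line at a time and only keep a running summary. This is a minimal sketch using the standard library. The helper count_field_values is invented for this example, and the demonstration runs against a few made-up records in a temporary file rather than a real harvest.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

def count_field_values(data_file, field):
    '''
    Hypothetical helper: count the values of one field in an NDJSON
    file, reading a line at a time so the whole file is never in memory.
    '''
    counts = Counter()
    with open(data_file, encoding='utf-8') as f:
        for line in f:
            if line.strip():
                counts[json.loads(line).get(field)] += 1
    return counts

# Demonstrate with a small, made-up file
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp, 'demo.ndjson')
    demo_rows = [{'mime': 'text/html'}, {'mime': 'text/html'}, {'mime': 'application/pdf'}]
    demo_file.write_text('\n'.join(json.dumps(row) for row in demo_rows) + '\n', encoding='utf-8')
    mime_counts = count_field_values(demo_file, 'mime')

print(mime_counts.most_common())
```

To use it with your own harvest, pass in the data_file defined above, for example count_field_values(data_file, 'mime').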

Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020