New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
In this notebook we'll look at how we can get domain level data from a CDX API. There are two types of search you can use:
nla.gov.au/* returns captures from the nla.gov.au domain
*.nla.gov.au returns captures from the nla.gov.au domain and any subdomains
These searches can be combined with any of the other filters supported by the CDX API, such as mimetype and statuscode.
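As a rough illustration of how the two wildcard forms relate to the matchType parameter used later in this notebook, the sketch below translates a wildcard url into an explicit url and matchType pair. The function name wildcard_to_params is hypothetical and is not part of the harvesting code; it only assumes the equivalence between wildcards and matchType values described in this notebook.

```python
def wildcard_to_params(url):
    """Translate a wildcard url into explicit CDX parameters (illustrative helper only)."""
    if url.startswith('*.'):
        # '*.nla.gov.au' -> the domain and any subdomains
        return {'url': url[2:], 'matchType': 'domain'}
    if url.endswith('*'):
        # 'nla.gov.au/*' -> everything that starts with the given prefix
        return {'url': url[:-1], 'matchType': 'prefix'}
    # No wildcard: leave the url alone and let the API apply its default matching
    return {'url': url}


prefix_params = wildcard_to_params('nla.gov.au/*')
domain_params = wildcard_to_params('*.nla.gov.au')
print(prefix_params)
print(domain_params)
```

Either form should describe the same query, so which one you use is mostly a matter of taste.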
As noted in Comparing CDX APIs, support for domain level searching varies across systems. The AWA allows prefix queries, but not domain queries. The UKWA provides both in theory, but timeouts are common for large domains. Neither the AWA nor the UKWA supports pagination, so harvesting data from large domains can cause difficulties. For these reasons it seems sensible to focus on the IA CDX API, unless you're after data from a single, modestly-sized domain.
Related notebooks:
In most other notebooks using the CDX API we've harvested data into memory and then saved to disk later on. Because we're potentially harvesting much larger quantities of data, it's probably a good idea to reverse this and save harvested data to disk as we download it. We can also use requests-cache to save responses from the API and make it easy to restart a failed harvest. This is the same strategy used in the Exploring subdomains in the gov.au domain notebook where I harvest data about 189 million captures.
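The save-as-you-go idea can be sketched with nothing but the standard library. This is a simplified illustration and not the code used below (which relies on the ndjson module): the function names and the made-up batches are assumptions for the sake of the example.

```python
import json
import tempfile
from pathlib import Path


def append_ndjson(file_path, records):
    """Append each record to file_path as one JSON object per line."""
    with Path(file_path).open('a', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')


def save_batches(batches, file_path):
    """Write each batch to disk as soon as it arrives and return the number of records saved."""
    total = 0
    for batch in batches:
        append_ndjson(file_path, batch)
        total += len(batch)
    return total


# Made-up batches standing in for pages of API results
example_batches = [
    [{'url': 'http://nla.gov.au/', 'status': '200'}],
    [{'url': 'http://nla.gov.au/about', 'status': '200'},
     {'url': 'http://nla.gov.au/missing', 'status': '404'}],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = Path(tmp_dir, 'example.ndjson')
    saved = save_batches(example_batches, example_path)
    lines = example_path.read_text(encoding='utf-8').splitlines()

print(saved, len(lines))
```

Because only one batch is held in memory at a time, the size of the harvest is limited by disk space rather than RAM.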
For a prefix query, either use a url wildcard:
harvest_cdx_query_to_file('[domain]/*', [optional parameters])
or the matchType parameter:
harvest_cdx_query_to_file('[domain]', matchType='prefix', [optional parameters])
For a domain query, either use a url wildcard:
harvest_cdx_query_to_file('*.[domain]', [optional parameters])
or the matchType parameter:
harvest_cdx_query_to_file('[domain]', matchType='domain', [optional parameters])
The results of each harvest are stored in a timestamped .ndjson file in a subdirectory of the domains directory. For example, a harvest from nla.gov.au is stored in domains/nla-gov-au. The file names combine the domain, the type of query (either 'prefix' or 'domain') and a timestamp. For example, a prefix query in nla.gov.au might generate a file named:
nla-gov-au-prefix-20200526113338.ndjson
Each harvest also creates a metadata file that has a similar name, but is in JSON format, for example:
nla-gov-au-prefix-20200526113338-metadata.json
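To make the naming pattern concrete, here is a small sketch that builds both file names from a url, a query type and a start time. It is an illustration only: simple_slug is a rough standard-library stand-in for the slugify function used in the harvesting code, and harvest_file_names is a hypothetical helper.

```python
import re
from datetime import datetime


def simple_slug(text):
    """Rough stand-in for slugify: lowercase, with runs of other characters turned into hyphens."""
    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')


def harvest_file_names(url, query_type, started):
    """Return the (data file, metadata file) names for a harvest."""
    stem = f"{simple_slug(url)}-{query_type}-{started.strftime('%Y%m%d%H%M%S')}"
    return f'{stem}.ndjson', f'{stem}-metadata.json'


data_name, metadata_name = harvest_file_names(
    'nla.gov.au/*', 'prefix', datetime(2020, 5, 26, 11, 33, 38)
)
print(data_name)      # nla-gov-au-prefix-20200526113338.ndjson
print(metadata_name)  # nla-gov-au-prefix-20200526113338-metadata.json
```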
The metadata file captures information about your harvest including:
params – the parameters used in your query (including any filters)
timestamp – date and time the harvest was started
file – path to the ndjson data file.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm.auto import tqdm
import pandas as pd
import time
from requests_cache import CachedSession
import ndjson
from pathlib import Path
from slugify import slugify
import arrow
import json
# By using a cached session, all responses will be saved in a local cache
s = CachedSession()
retries = Retry(total=10, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
def get_total_pages(params):
    '''
    Gets the total number of pages in a set of results.
    '''
    these_params = params.copy()
    these_params['showNumPages'] = 'true'
    response = s.get('http://web.archive.org/cdx/search/cdx', params=these_params, headers={'User-Agent': ''})
    # Fail early on an HTTP error rather than trying to parse an error page as a number
    response.raise_for_status()
    return int(response.text)
def prepare_params(url, **kwargs):
    '''
    Prepare the parameters for a CDX API request.
    Adds all supplied keyword arguments as parameters (changing from_ to from).
    Adds in a few necessary parameters.
    '''
    params = kwargs
    params['url'] = url
    params['output'] = 'json'
    params['pageSize'] = 5
    # CDX accepts a 'from' parameter, but this is a reserved word in Python
    # Use 'from_' to pass the value to the function & here we'll change it back to 'from'.
    if 'from_' in params:
        params['from'] = params.pop('from_')
    return params
def convert_lists_to_dicts(results):
    '''
    Converts IA style timemap (a JSON array of arrays) to a list of dictionaries.
    Renames keys to standardise IA with other Timemaps.
    '''
    if results:
        keys = results[0]
        results_as_dicts = [dict(zip(keys, v)) for v in results[1:]]
    else:
        results_as_dicts = results
    for d in results_as_dicts:
        d['status'] = d.pop('statuscode')
        d['mime'] = d.pop('mimetype')
        d['url'] = d.pop('original')
    return results_as_dicts
def check_query_type(url):
    '''
    Work out the type of query from the position of the wildcard in the url.
    '''
    if url.startswith('*'):
        query_type = 'domain'
    elif url.endswith('*'):
        query_type = 'prefix'
    else:
        query_type = ''
    return query_type
def get_cdx_data(params):
    '''
    Make a request to the CDX API using the supplied parameters.
    Return results converted to a list of dicts.
    '''
    response = s.get('http://web.archive.org/cdx/search/cdx', params=params)
    response.raise_for_status()
    results = response.json()
    try:
        if not response.from_cache:
            time.sleep(0.2)
    except AttributeError:
        # Not using cache
        time.sleep(0.2)
    return convert_lists_to_dicts(results)
def save_metadata(output_dir, params, query_type, timestamp, file_path):
    md_path = Path(output_dir, f'{slugify(params["url"])}-{query_type}-{timestamp}-metadata.json')
    md = {
        'params': params,
        'timestamp': timestamp,
        'file': str(file_path)
    }
    with md_path.open('wt') as md_json:
        json.dump(md, md_json)
def harvest_cdx_query_to_file(url, **kwargs):
    '''
    Harvest capture data from a CDX query.
    Save results to a NDJSON formatted file.
    '''
    params = prepare_params(url, **kwargs)
    total_pages = get_total_pages(params)
    output_dir = Path('domains', slugify(url))
    output_dir.mkdir(parents=True, exist_ok=True)
    # We'll use a timestamp to distinguish between versions
    timestamp = arrow.now().format('YYYYMMDDHHmmss')
    query_type = params['matchType'] if 'matchType' in params else check_query_type(url)
    file_path = Path(output_dir, f'{slugify(url)}-{query_type}-{timestamp}.ndjson')
    save_metadata(output_dir, params, query_type, timestamp, file_path)
    page = 0
    # pbar1 tracks pages harvested, pbar2 tracks the number of captures saved
    with tqdm(total=total_pages-page) as pbar1:
        with tqdm() as pbar2:
            while page < total_pages:
                params['page'] = page
                results = get_cdx_data(params)
                # Append each page of results to the file as soon as it's downloaded
                with file_path.open('a') as f:
                    writer = ndjson.writer(f, ensure_ascii=False)
                    for result in results:
                        writer.writerow(result)
                page += 1
                pbar1.update(1)
                # The header row has already been removed by convert_lists_to_dicts
                pbar2.update(len(results))
For a 'prefix' query either set the matchType parameter to prefix or use a url wildcard like nla.gov.au/*.
Get all successful web page captures from the nla.gov.au domain.
harvest_cdx_query_to_file('nla.gov.au/*', filter=['statuscode:200', 'mimetype:text/html'])
Use collapse to remove (most) records with duplicate values for urlkey. This should give us a list of unique urls from the nla.gov.au domain.
harvest_cdx_query_to_file('nla.gov.au/*', filter=['statuscode:200', 'mimetype:text/html'], collapse='urlkey')
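The '(most)' is there because, as far as the IA CDX server documentation describes it, collapsing only compares adjacent records, so a value that reappears later in the results can slip through. The sketch below imitates that behaviour locally on a few made-up records. It is an illustration of the idea under that assumption, not the server's actual implementation, and collapse_adjacent is a hypothetical helper.

```python
from itertools import groupby


def collapse_adjacent(records, field):
    """Keep the first record from each run of adjacent records that share the same field value."""
    return [next(group) for _, group in groupby(records, key=lambda r: r[field])]


# Made-up records: the first urlkey appears again after a different one
records = [
    {'urlkey': 'au,gov,nla)/', 'timestamp': '20100101000000'},
    {'urlkey': 'au,gov,nla)/', 'timestamp': '20110101000000'},
    {'urlkey': 'au,gov,nla)/about', 'timestamp': '20100101000000'},
    {'urlkey': 'au,gov,nla)/', 'timestamp': '20120101000000'},
]

collapsed = collapse_adjacent(records, 'urlkey')
print([r['timestamp'] for r in collapsed])
```

If you need a strictly unique list, it's probably worth dropping any remaining duplicates after the harvest, for example with drop_duplicates in Pandas.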
For a 'domain' query either set the matchType parameter to domain or use a url wildcard like *.nla.gov.au.
harvest_cdx_query_to_file('*.nla.gov.au', filter=['statuscode:200', 'mimetype:text/html'], collapse='urlkey')
You should be able to load smaller files using the ndjson module. If you're working with large data files (millions of captures) you might not want to load them all into memory. Have a look at Exploring subdomains in the gov.au domain for some ways of processing the data.
# Edit to point to your data_file, eg: 'domains/nla-gov-au/nla-gov-au-prefix-20200526123711.ndjson'
data_file = '[Path to data file]'
data_file = 'domains/nla-gov-au/nla-gov-au-prefix-20200526123711.ndjson'
with open(data_file) as f:
    capture_data = ndjson.load(f)
You could then convert the capture data to a Pandas dataframe for analysis.
df = pd.DataFrame(capture_data)
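For files that are too big to load in one go, one option is to read the NDJSON file a line at a time and keep only a running summary. The sketch below counts the values of a single field using the standard library. The function name count_field_values and the sample records are made up for the example, and this is just one possible approach rather than the method used in the gov.au notebook.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_field_values(file_path, field):
    """Count the values of one field without loading the whole file into memory."""
    counts = Counter()
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                # Each non-empty line is a single JSON object
                counts[json.loads(line).get(field)] += 1
    return counts


# Write a tiny sample file so the example can be run on its own
sample = [
    {'url': 'http://nla.gov.au/', 'status': '200'},
    {'url': 'http://nla.gov.au/about', 'status': '200'},
    {'url': 'http://nla.gov.au/missing', 'status': '404'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir, 'sample.ndjson')
    sample_path.write_text('\n'.join(json.dumps(r) for r in sample) + '\n', encoding='utf-8')
    status_counts = count_field_values(sample_path, 'status')

print(status_counts)
```

To use it with a real harvest, point it at your own data file, for example count_field_values(data_file, 'mime').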
Created by Tim Sherratt for the GLAM Workbench.
Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020