Working with web archives

Current version: v1.1.1

We tend to think of a web archive as a site we go to when links are broken – a useful fallback, rather than a source of new research data. But web archives don't just store old web pages, they capture multiple versions of web resources over time. Using web archives we can observe change – we can ask historical questions. This collection of notebooks is intended to help historians, and other researchers, frame those questions by revealing what sort of data is available, how to get it, and what you can do with it.

Web Archives share systems and standards, making it much easier for researchers wanting to get their hands on useful data. These notebooks focus on four particular web archives: the UK Web Archive, the Australian Web Archive (National Library of Australia ), the New Zealand Web Archive (National Library of New Zealand), and the Internet Archive. However, the tools and approaches here could be easily extended to other web archives.

Web archives are huge, and access is often limited for legal reasons. These notebooks focus on data that is readily accessible and able to be used without the need for special equipment. They use existing APIs to get data in manageable chunks. But many of the examples demonstrated can also be scaled up to build substantial datasets for analysis – you just have to be patient!

These notebooks are a starting point that I hope will encourage researchers to investigate the possibilities of web archives in more detail. They're intended to compliment the fabulous work being by projects such as Archives Unleashed to open web archives to new research uses.

The development of these notebooks was supported by the International Internet Preservation Consortium's Discretionary Funding Programme 2019-2020, with the participation of the British Library, the National Library of Australia, and the National Library of New Zealand. Thanks all!

See the web archives section of the GLAM Workbench for more information.

Notebook topics

Types of data

  • Timegates, Timemaps, and Mementos – explore how the Memento protocol helps you get machine-readable data about web archive captures
  • Exploring the Internet Archive's CDX API – some web archives provide indexes of the web pages they've archived through an API, this notebook looks in detail at the data provided by the Internet Archive's CDX API
  • Comparing CDX APIs – this notebook documents differences between the Internet Archive's Wayback CDX API and the PyWb CDX API (used by AWA and UKWA)
  • Timemaps vs CDX APIs – both Timemaps and CDX APIs can give us a list of captures from a particular web page, this notebook compares the results

Harvesting data and creating datasets

Exploring change over time

Cite as

See the GLAM Workbench or Zenodo for up-to-date citation details.

This repository is part of the GLAM Workbench.
If you think this project is worthwhile, you might like to sponsor me on GitHub.