Observing change in a web page over time

New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.

This notebook explores what we can find when we look at all the captures of a single page over time.

Work in progress – this notebook isn't finished yet. Check back later for more...

In [1]:
import re

import altair as alt
import pandas as pd
import requests
In [2]:
def query_cdx(url, **kwargs):
    """Query the Internet Archive's CDX API for captures of the given url.

    Any extra keyword arguments are passed through as CDX query parameters.
    """
    params = kwargs
    params["url"] = url
    params["output"] = "json"
    response = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params=params,
        # Override the default requests User-Agent string
        headers={"User-Agent": ""},
    )
    response.raise_for_status()
    return response.json()
In [3]:
url = "http://nla.gov.au"

Getting the data

In this example we're using the Internet Archive's CDX API, but this could easily be adapted to use Timemaps from a range of repositories.
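For example, a Timemap-based version of the query function might look something like the sketch below. It reuses the imports from the first cell and queries the Wayback Machine's link-format Timemap endpoint; the regular expression is just a rough way of pulling out the memento urls and datetimes, and everything that follows still uses the CDX results.

def query_timemap(url):
    # Fetch a link-format Timemap for the url from the Wayback Machine
    response = requests.get(
        f"http://web.archive.org/web/timemap/link/{url}",
        headers={"User-Agent": ""},
    )
    response.raise_for_status()
    # Each memento line looks something like:
    # <http://web.archive.org/web/19961019064223/http://www.nla.gov.au/>; rel="memento"; datetime="Sat, 19 Oct 1996 06:42:23 GMT",
    mementos = re.findall(
        r'<([^>]+)>;\s*rel="[^"]*memento[^"]*";\s*datetime="([^"]+)"', response.text
    )
    timemap_df = pd.DataFrame(mementos, columns=["memento_url", "datetime"])
    timemap_df["date"] = pd.to_datetime(timemap_df["datetime"])
    return timemap_df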

In [4]:
data = query_cdx(url)

# Convert to a dataframe
# The column names are in the first row
df = pd.DataFrame(data[1:], columns=data[0])

# Convert the timestamp string (eg '19961019064223') into a datetime object
df["date"] = pd.to_datetime(df["timestamp"], format="%Y%m%d%H%M%S")
df.sort_values(by="date", inplace=True, ignore_index=True)

# Convert the length from a string into an integer
df["length"] = df["length"].astype("int")

As noted in the notebook comparing the CDX API with Timemaps, there are a number of duplicate snapshots in the CDX results, so let's remove them.

In [5]:
print(f"Before: {df.shape[0]}")
df.drop_duplicates(
    subset=["timestamp", "original", "digest", "statuscode", "mimetype"],
    keep="first",
    inplace=True,
)
print(f"After: {df.shape[0]}")
Before: 4460
After: 4359

The basic shape

In [6]:
df["date"].min()
Out[6]:
Timestamp('1996-10-19 06:42:23')
In [7]:
df["date"].max()
Out[7]:
Timestamp('2022-04-12 18:37:39')
In [8]:
df["length"].describe()
Out[8]:
count     4359.000000
mean      8329.515715
std       7868.167415
min        235.000000
25%        533.000000
50%       5702.000000
75%      14854.000000
max      30062.000000
Name: length, dtype: float64
In [9]:
df["statuscode"].value_counts()
Out[9]:
200    2953
301     778
-       316
302     309
503       3
Name: statuscode, dtype: int64
In [10]:
df["mimetype"].value_counts()
Out[10]:
text/html       4041
warc/revisit     316
unk                2
Name: mimetype, dtype: int64

Plotting snapshots over time

In [11]:
# This is just a bit of fancy customisation to assign a colour to each status code
# See https://altair-viz.github.io/user_guide/customization.html#customizing-colors
domain = ["-", "200", "301", "302", "404", "503"]
# grey for revisit records, green for ok, blues for redirects, reds for errors
range_ = ["#888888", "#39a035", "#5ba3cf", "#125ca4", "#e13128", "#b21218"]

alt.Chart(df).mark_point().encode(
    x="date:T",
    y="length:Q",
    color=alt.Color("statuscode", scale=alt.Scale(domain=domain, range=range_)),
    tooltip=["date", "length", "statuscode"],
).properties(width=700, height=300)
Out[11]:
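If you want to keep a copy of the chart outside the notebook, you could assign it to a variable and use Altair's save method. The filename below is just a placeholder.

chart = (
    alt.Chart(df)
    .mark_point()
    .encode(
        x="date:T",
        y="length:Q",
        color=alt.Color("statuscode", scale=alt.Scale(domain=domain, range=range_)),
        tooltip=["date", "length", "statuscode"],
    )
    .properties(width=700, height=300)
)

# Save the interactive chart as a standalone HTML file
chart.save("nla_snapshots.html")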