New to Jupyter notebooks? Try Using Jupyter notebooks for a quick introduction.
This notebook explores what we can find when we look at all captures of a single page over time.
Work in progress – this notebook isn't finished yet. Check back later for more...
import re
import altair as alt
import pandas as pd
import requests
def query_cdx(url, **kwargs):
    """Query the Internet Archive's CDX API and return the results as JSON."""
    params = kwargs
    params["url"] = url
    params["output"] = "json"
    response = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params=params,
        headers={"User-Agent": ""},
    )
    response.raise_for_status()
    return response.json()
url = "http://nla.gov.au"
data = query_cdx(url)
# Convert to a dataframe
# The column names are in the first row
df = pd.DataFrame(data[1:], columns=data[0])
# Convert the timestamp string into a datetime object
df["date"] = pd.to_datetime(df["timestamp"])
df.sort_values(by="date", inplace=True, ignore_index=True)
# Convert the length from a string into an integer
df["length"] = df["length"].astype("int")
As noted in the notebook comparing the CDX API with Timemaps, there are a number of duplicate snapshots in the CDX results, so let's remove them.
print(f"Before: {df.shape[0]}")
df.drop_duplicates(
    subset=["timestamp", "original", "digest", "statuscode", "mimetype"],
    keep="first",
    inplace=True,
)
print(f"After: {df.shape[0]}")
Before: 4451
After: 4350
df["date"].min()
Timestamp('1996-10-19 06:42:23')
df["date"].max()
Timestamp('2022-04-10 18:38:06')
df["length"].describe()
count     4350.000000
mean      8318.689655
std       7854.281544
min        235.000000
25%        533.000000
50%       5699.000000
75%      14852.750000
max      30062.000000
Name: length, dtype: float64
df["statuscode"].value_counts()
200    2948
301     775
-       315
302     309
503       3
Name: statuscode, dtype: int64
df["mimetype"].value_counts()
text/html       4033
warc/revisit     315
unk                2
Name: mimetype, dtype: int64
# This is just a bit of fancy customisation to group the types of responses by color
# See https://altair-viz.github.io/user_guide/customization.html#customizing-colors
domain = ["-", "200", "301", "302", "404", "503"]
# grey for unknown, green for ok, blues for redirects, reds for errors
range_ = ["#888888", "#39a035", "#5ba3cf", "#125ca4", "#e13128", "#b21218"]
alt.Chart(df).mark_point().encode(
    x="date:T",
    y="length:Q",
    color=alt.Color("statuscode", scale=alt.Scale(domain=domain, range=range_)),
    tooltip=["date", "length", "statuscode"],
).properties(width=700, height=300)
Looking at the chart above, it's hard to understand why a request for a page is sometimes redirected and sometimes not. To understand this we have to look a bit more closely at which pages are actually being archived. Let's look at the breakdown of values in the original column. These are the URLs being requested by the archiving bot.
df["original"].value_counts()
https://www.nla.gov.au/                  1508
http://www.nla.gov.au/                   1178
http://www.nla.gov.au:80/                 868
http://nla.gov.au/                        588
http://nla.gov.au:80/                      77
https://nla.gov.au/                        62
http://www.nla.gov.au//                    21
http://www.nla.gov.au                      11
http://www2.nla.gov.au:80/                 10
https://www.nla.gov.au                     10
http://Trove@nla.gov.au/                    6
http://www.nla.gov.au:80/?                  2
http://www.nla.gov.au./                     2
http://nla.gov.au                           1
http://mailto:media@nla.gov.au/             1
http://cmccarthy@nla.gov.au/                1
http://mailto:development@nla.gov.au/       1
http://mailto:www@nla.gov.au/               1
http://www.nla.gov.au:80//                  1
http://www.nla.gov.au/?                     1
Name: original, dtype: int64
Ah ok, so there's actually a mix of things in here – some include the 'www' prefix and some don't, some use the 'https' protocol and some just plain old 'http'. There's also a bit of junk in there from badly parsed mailto links. To look at the differences in more detail, let's create new columns for subdomain and protocol. Captures of the bare nla.gov.au domain will have an empty subdomain value.
# Extract the base domain (ie 'nla') from the url
base_domain = re.search(r"https*:\/\/(\w*)\.", url).group(1)
# Extract any subdomain (eg 'www') from the original url
df["subdomain"] = df["original"].str.extract(
    r"^https*:\/\/(\w*)\.{}\.".format(base_domain), flags=re.IGNORECASE
)
# Captures without a subdomain get an empty string
df["subdomain"].fillna("", inplace=True)
df["subdomain"].value_counts()
www     3602
         738
www2      10
Name: subdomain, dtype: int64
df["protocol"] = df["original"].str.extract(r"^(https*):")
df["protocol"].value_counts()
http     2770
https    1580
Name: protocol, dtype: int64
Let's look to see how the proportion of requests using each of the protocols changes over time. Here we're grouping the rows by year.
alt.Chart(df).mark_bar().encode(
    x="year(date):T",
    y=alt.Y("count()", stack="normalize"),
    color="protocol:N",
    # tooltip=['date', 'length', 'subdomain:N']
).properties(width=700, height=200)
No real surprise there given the increased use of https generally.
Let's now compare the proportion of status codes between the bare nla.gov.au domain and the www subdomain.
alt.Chart(
    df.loc[(df["statuscode"] != "-") & (df["subdomain"] != "www2")]
).mark_bar().encode(
    x="year(date):T",
    y=alt.Y("count()", stack="normalize"),
    color=alt.Color("statuscode", scale=alt.Scale(domain=domain, range=range_)),
    row="subdomain",
    tooltip=["year(date):T", "statuscode"],
).properties(
    width=700, height=100
)
I think we can start to see what's going on. From around 2004, requests to nla.gov.au started to be redirected to www.nla.gov.au, giving a 302 response indicating that the page had been moved temporarily. But why the growth in 301 (moved permanently) responses from both domains after 2018? If we look at the chart above showing the increased use of the https protocol, we might guess that http requests in both domains are being redirected to https.
Let's test that hypothesis by looking at the distribution of status codes by protocol.
alt.Chart(
    df.loc[(df["statuscode"] != "-") & (df["subdomain"] != "www2")]
).mark_bar().encode(
    x="year(date):T",
    y=alt.Y("count()", stack="normalize"),
    color=alt.Color("statuscode", scale=alt.Scale(domain=domain, range=range_)),
    row="protocol",
    tooltip=["year(date):T", "protocol", "statuscode"],
).properties(
    width=700, height=100
)
We can see that by 2019, all requests using http are being redirected to https.
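We could also check this numerically with a crosstab. This is just a minimal sketch, assuming the df created above; the 2019 cut-off is an assumption based on the chart.
# Quick numeric check (assumes the df created above):
# count status codes by protocol for captures from 2019 onwards
recent = df.loc[(df["date"].dt.year >= 2019) & (df["statuscode"] != "-")]
pd.crosstab(recent["protocol"], recent["statuscode"])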
Now that we understand what's going on with the different domains and status codes, I think we can focus on just the www subdomain and the '200' responses. We'll also drop a few tiny captures by keeping only those with a length greater than 1,000.
df_200 = df.copy().loc[
    (df["statuscode"] == "200") & (df["subdomain"] == "www") & (df["length"] > 1000)
]
alt.Chart(df_200).mark_point().encode(
    x="date:T", y="length:Q", tooltip=["date", "length"]
).properties(width=700, height=300)
Pandas makes it easy to calculate the difference between two adjacent values, so let's find the absolute difference in length between each capture and the one before it.
df_200["change_in_length"] = abs(df_200["length"].diff())
Now we can look at the captures that varied most in length from their predecessor.
top_ten_changes = df_200.sort_values(by="change_in_length", ascending=False)[:10]
top_ten_changes
| | urlkey | timestamp | original | mimetype | statuscode | digest | length | date | subdomain | protocol | change_in_length |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3656 | au,gov,nla)/ | 20210701042826 | https://www.nla.gov.au/ | text/html | 200 | 6PC4KROOPHEBXIC6A3GRFFHBMO2EJ4F7 | 29215 | 2021-07-01 04:28:26 | www | https | 13933.0 |
| 4134 | au,gov,nla)/ | 20220202054835 | https://www.nla.gov.au/ | text/html | 200 | O6VCJWSKLU3EAFDSH4IGOIW264Y6IIXZ | 27495 | 2022-02-02 05:48:35 | www | https | 4648.0 |
| 3954 | au,gov,nla)/ | 20220105025646 | https://www.nla.gov.au/ | text/html | 200 | MRIUSTANGOWT3CT5QSSRJ7NJPEN2RSEN | 27273 | 2022-01-05 02:56:46 | www | https | 4463.0 |
| 4058 | au,gov,nla)/ | 20220121065839 | https://www.nla.gov.au/ | text/html | 200 | UAGVH7YN6ZPJYQUIZTP32G4GF3JJ2N7J | 22948 | 2022-01-21 06:58:39 | www | https | 4394.0 |
| 4417 | au,gov,nla)/ | 20220405063728 | https://www.nla.gov.au/ | text/html | 200 | HXSLRIVPKI3ECPC5V6NNEJOIHTDKZ5JJ | 22682 | 2022-04-05 06:37:28 | www | https | 4375.0 |
| 3921 | au,gov,nla)/ | 20211228211936 | https://www.nla.gov.au/ | text/html | 200 | FQZBN2C7DPFPC26F6HX3KDNGWCWVOWWX | 22831 | 2021-12-28 21:19:36 | www | https | 4374.0 |
| 3946 | au,gov,nla)/ | 20220103064507 | https://www.nla.gov.au/ | text/html | 200 | PUJLOJI7OUJ4XFKUDI47HOQLFOMEQLKJ | 22917 | 2022-01-03 06:45:07 | www | https | 4367.0 |
| 4322 | au,gov,nla)/ | 20220323022916 | https://www.nla.gov.au/ | text/html | 200 | WEP3PTHC7CAEF22S3NDIMZHFDAQIK65J | 26919 | 2022-03-23 02:29:16 | www | https | 4359.0 |
| 3939 | au,gov,nla)/ | 20220102132710 | https://www.nla.gov.au/ | text/html | 200 | SBM5ATRRZWVT7HYA6J3BMMOXG4HTDZFD | 27251 | 2022-01-02 13:27:10 | www | https | 4352.0 |
| 4215 | au,gov,nla)/ | 20220224211329 | https://www.nla.gov.au/ | text/html | 200 | WCIDXIQ22M35PWXUGK7GCXG2LEJUAJ5J | 23208 | 2022-02-24 21:13:29 | www | https | 4351.0 |
Let's try visualising this by highlighting the major changes in length.
points = (
    alt.Chart(df_200)
    .mark_point()
    .encode(x="date:T", y="length:Q", tooltip=["date", "length"])
    .properties(width=700, height=300)
)
lines = (
    alt.Chart(top_ten_changes)
    .mark_rule(color="red")
    .encode(x="date:T", tooltip=["date"])
    .properties(width=700, height=300)
)
points + lines
Rather than just a raw number, perhaps the percentage change in length would be more useful. Once again, Pandas makes this easy to calculate. This calculates the percentage change from the previous value – so (length2 - length1) / length1.
df_200["pct_change_in_length"] = abs(df_200["length"].pct_change())
top_ten_changes_pct = df_200.sort_values(by="pct_change_in_length", ascending=False)[
    :10
]
top_ten_changes_pct
| | urlkey | timestamp | original | mimetype | statuscode | digest | length | date | subdomain | protocol | change_in_length | pct_change_in_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3656 | au,gov,nla)/ | 20210701042826 | https://www.nla.gov.au/ | text/html | 200 | 6PC4KROOPHEBXIC6A3GRFFHBMO2EJ4F7 | 29215 | 2021-07-01 04:28:26 | www | https | 13933.0 | 0.911726 |
| 13 | au,gov,nla)/ | 19980205162107 | http://www.nla.gov.au:80/ | text/html | 200 | LIXK3YXSUFO5KPOO22XIMQGPWXKNHV6X | 1920 | 1998-02-05 16:21:07 | www | http | 757.0 | 0.650903 |
| 79 | au,gov,nla)/ | 20011003175018 | http://www.nla.gov.au:80/ | text/html | 200 | BWGDP6NTGVOI2TBA62P7IZ2PPWRLOODN | 3367 | 2001-10-03 17:50:18 | www | http | 1004.0 | 0.424884 |
| 1519 | au,gov,nla)/ | 20160901112433 | http://www.nla.gov.au/ | text/html | 200 | MZD7NTLMH5HBXSFQIHTGTC6IELQDYBN2 | 11541 | 2016-09-01 11:24:33 | www | http | 2738.0 | 0.311030 |
| 1184 | au,gov,nla)/ | 20130211044309 | http://www.nla.gov.au/ | text/html | 200 | QWCVHAK2Y6WXLDNIMTJLZT5RY6YCJ7UN | 8521 | 2013-02-11 04:43:09 | www | http | 1698.0 | 0.248864 |
| 1067 | au,gov,nla)/ | 20110611064218 | http://www.nla.gov.au/ | text/html | 200 | Z7PQG2MVOOQUZ62ASNRAWDFLSYFHATQT | 5601 | 2011-06-11 06:42:18 | www | http | 1739.0 | 0.236921 |
| 2049 | au,gov,nla)/ | 20181212014241 | https://www.nla.gov.au/ | text/html | 200 | C3YSGGHG52WIG6U6X7C3XOL5LOHVIINM | 14813 | 2018-12-12 01:42:41 | www | https | 2831.0 | 0.236271 |
| 786 | au,gov,nla)/ | 20061107083938 | http://www.nla.gov.au:80/ | text/html | 200 | HOW52ARISTA4HTCLFPYNCQWR6NK2N2NF | 5662 | 2006-11-07 08:39:38 | www | http | 1561.0 | 0.216115 |
| 4134 | au,gov,nla)/ | 20220202054835 | https://www.nla.gov.au/ | text/html | 200 | O6VCJWSKLU3EAFDSH4IGOIW264Y6IIXZ | 27495 | 2022-02-02 05:48:35 | www | https | 4648.0 | 0.203440 |
| 3864 | au,gov,nla)/ | 20211121150551 | https://www.nla.gov.au/ | text/html | 200 | GEPXW6EKI7GBG22SMUFIIXQW3KIQXNIY | 25912 | 2021-11-21 15:05:51 | www | https | 4316.0 | 0.199852 |
lines = (
    alt.Chart(top_ten_changes_pct)
    .mark_rule(color="red")
    .encode(x="date:T", tooltip=["date"])
    .properties(width=700, height=300)
)
points + lines
Focusing on the percentage difference gives more prominence to the change in 2001. But rather than just looking at the top ten, should we highlight all changes greater than 10%, or some other threshold?
lines = (
    alt.Chart(df_200.loc[df_200["pct_change_in_length"] > 0.1])
    .mark_rule(color="red")
    .encode(x="date:T", tooltip=["date"])
    .properties(width=700, height=300)
)
points + lines
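Before settling on a threshold, it might help to see how many captures each one would highlight. A quick sketch, assuming the df_200 dataframe created above; the threshold values are arbitrary examples.
# Count the captures that exceed each candidate threshold (assumes df_200 from above)
for threshold in [0.05, 0.1, 0.25, 0.5]:
    count = (df_200["pct_change_in_length"] > threshold).sum()
    print(f"more than {threshold:.0%}: {count} captures")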
Once major changes, such as those above, have been identified, we can use some of the other notebooks in this repository to compare individual captures in more detail.
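As a rough sketch of what such a comparison might involve (not the method used by those notebooks), we could fetch the raw HTML of two captures and measure their similarity with difflib from the standard library. The id_ flag in a Wayback Machine URL requests the archived content without the replay interface; the timestamps below are just examples taken from the table above.
import difflib

# Fetch the raw archived HTML for a capture, using the Wayback Machine's 'id_' flag
# to get the original content without the replay interface
def get_capture(timestamp, url):
    response = requests.get(f"https://web.archive.org/web/{timestamp}id_/{url}")
    response.raise_for_status()
    return response.text

# Compare two captures, using timestamps from the CDX results above as examples
html1 = get_capture("20210701042826", "https://www.nla.gov.au/")
html2 = get_capture("20220202054835", "https://www.nla.gov.au/")
# SequenceMatcher's ratio gives a rough similarity measure between 0 and 1
similarity = difflib.SequenceMatcher(None, html1, html2).ratio()
print(f"Similarity: {similarity:.2%}")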
Created by Tim Sherratt for the GLAM Workbench. Support me by becoming a GitHub sponsor!
Work on this notebook was supported by the IIPC Discretionary Funding Programme 2019-2020.