This notebook examines data from a complete harvest of series publicly available through RecordSearch in April 2022. It also compares the results to an earlier harvest from May 2021. See this notebook for the harvesting method.
import pandas as pd

# Load the most recent harvest (April 2022)
df = pd.read_csv(
    "series_totals_April_2022.csv",
    dtype={
        "described_total": "Int64",
        "digitised_total": "Int64",
        "access_open_total": "Int64",
        "access_owe_total": "Int64",
        "access_closed_total": "Int64",
        "access_nye_total": "Int64",
    },
)

# Load the previous harvest (May 2021) for comparison
df_prev = pd.read_csv(
    "series_totals_May_2021.csv",
    dtype={
        "described_total": "Int64",
        "digitised_total": "Int64",
        "access_open_total": "Int64",
        "access_owe_total": "Int64",
        "access_closed_total": "Int64",
        "access_nye_total": "Int64",
    },
)

df.head()
Note that these numbers might not be exact. To work around the 20,000 search result limit, some totals have been calculated by aggregating a series of searches. In most cases this will be accurate, but some items have multiple control symbols and may be duplicated in the results. I think any errors will be small.
The numbers in brackets indicate the change since the last harvest in May 2021.
print(f"{df.shape[0]:,} series ({df.shape[0] - df_prev.shape[0]:+,})")
print(
    f'{round(df["quantity_total"].sum(), 2):,} metres of records ({round((df["quantity_total"].sum() - df_prev["quantity_total"].sum()), 2):+,})'
)
print(
    f'{df["described_total"].sum():,} items described ({df["described_total"].sum() - df_prev["described_total"].sum():+,})'
)
print(
    f'{df["digitised_total"].sum():,} items digitised ({df["digitised_total"].sum() - df_prev["digitised_total"].sum():+,})'
)
prev_percent = df_prev["digitised_total"].sum() / df_prev["described_total"].sum()
current_percent = df["digitised_total"].sum() / df["described_total"].sum()
print(
    f"{current_percent:0.2%} of described items are digitised ({current_percent - prev_percent:+0.2%})"
)
access_totals = [
    {
        "access status": "Open",
        "total": df["access_open_total"].sum(),
        "change": df["access_open_total"].sum() - df_prev["access_open_total"].sum(),
    },
    {
        "access status": "Open with exceptions",
        "total": df["access_owe_total"].sum(),
        "change": df["access_owe_total"].sum() - df_prev["access_owe_total"].sum(),
    },
    {
        "access status": "Closed",
        "total": df["access_closed_total"].sum(),
        "change": df["access_closed_total"].sum()
        - df_prev["access_closed_total"].sum(),
    },
    {
        "access status": "Not yet examined",
        "total": df["access_nye_total"].sum(),
        "change": df["access_nye_total"].sum() - df_prev["access_nye_total"].sum(),
    },
]
df_access = pd.DataFrame(access_totals)
df_access["percent"] = df_access["total"] / df_access["total"].sum()
df_access.style.format(
    {"total": "{:,.0f}", "change": "{:+,}", "percent": "{:0.2%}"}
).hide()
There's no way of knowing this from the harvested data. However, the recently released Tune Review says that 37% of the NAA's holdings are described. Since we know the number of described items, we can calculate an approximate total number of items.
print(f'Approximately {int(df["described_total"].sum() / 0.37):,} items in total')
To put that another way, this is the approximate number of items not listed on RecordSearch:
print(
    f'Approximately {int(df["described_total"].sum() / 0.37) - df["described_total"].sum():,} items **are not** listed on RecordSearch'
)
That's something to keep in mind if you're just relying on item keyword searches to find relevant content.
The note that accompanies the number of items listed in RecordSearch indicates how much of the series has been described at item level. By looking at the frequency of each of the values of this note, we can get a sense of the level of description across the collection.
df_described = df["described_note"].value_counts().to_frame()
df_described.columns = ["total"]
df_described["percent"] = df_described["total"] / df_described["total"].sum()
df_described.style.format({"total": "{:,.0f}", "percent": "{:0.2%}"})
The numbers above might be a bit misleading because sometimes series are registered on RecordSearch before any items are actually transferred to the NAA. So the reason there are no items listed might be that there are no items currently in Archives custody. To try and get a more accurate picture, we can filter out series where the quantity held by the NAA is equal to zero metres.
df_described_held = (
    df.loc[df["quantity_total"] != 0]["described_note"].value_counts().to_frame()
)
df_described_held.columns = ["total"]
df_described_held["percent"] = (
    df_described_held["total"] / df_described_held["total"].sum()
)
df_described_held.style.format({"total": "{:,.0f}", "percent": "{:0.2%}"})
This brings down the 'undescribed' proportion, though, strangely, it also suggests that some fully described series have zero shelf metres recorded.
df.loc[
    (df["described_note"].str.startswith("All")) & (df["quantity_total"] == 0)
].shape[0]
For example:
df.loc[(df["described_note"].str.startswith("All")) & (df["quantity_total"] == 0)].head(
    2
)
So perhaps in some cases locations and quantities are not reliably recorded on RecordSearch.
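To get a rough sense of how widespread this might be, here's a quick sketch (my addition, using only the columns loaded above) that counts series with no recorded shelf metres and checks how many of them nonetheless have items described.
# A quick check: how many series have no recorded shelf metres,
# and how many of those still have one or more items described?
no_quantity = df.loc[(df["quantity_total"].isna()) | (df["quantity_total"] == 0)]
described_anyway = no_quantity.loc[no_quantity["described_total"].fillna(0) > 0]
print(f"{no_quantity.shape[0]:,} series have no shelf metres recorded")
print(f"{described_anyway.shape[0]:,} of these have at least one item described")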
From the items described note it seems that 19,227 series held by the NAA or AWM have no item-level descriptions. We can check that by simply looking for series where the `described_total` value is zero.
print(
    f'{df.loc[(df["quantity_total"] > 0) & (df["described_total"] == 0)].shape[0]:,} series held by NAA have no item descriptions'
)
Yay! That (almost) matches.
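To make the comparison explicit, here's a hedged sketch that puts the two figures side by side. It assumes the described note for wholly undescribed series begins with 'No items'; check the value_counts() output above and adjust the prefix if the actual wording differs.
# Side-by-side comparison. The 'No items' prefix is an assumption about how the
# described_note is worded for wholly undescribed series; adjust if needed.
note_based = df.loc[
    (df["quantity_total"] > 0)
    & (df["described_note"].str.startswith("No items", na=False))
].shape[0]
count_based = df.loc[(df["quantity_total"] > 0) & (df["described_total"] == 0)].shape[0]
print(f"{note_based:,} series according to the described note")
print(f"{count_based:,} series according to described_total == 0")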
Boo! That's a pretty significant black hole. Let's look at the quantity of records that represents.
print(
    f'{df.loc[(df["quantity_total"] > 0) & (df["described_total"] == 0)]["quantity_total"].sum():,} linear metres in series held by NAA with no item descriptions'
)
Of course, this doesn't include the quantities of series that are partially described.
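For a very rough sense of how much material sits in partially described series, a sketch like the one below could help. It assumes a series counts as partially described when some items are listed but the described note doesn't begin with 'All'. Note that the metres figure over-counts the undescribed portion, since some items in these series are already described.
# Rough estimate: shelf metres in series that are only partially described.
# 'Partially described' here means some items are listed but the described note
# doesn't begin with 'All', which is an assumption about the note's wording.
partial = df.loc[
    (df["quantity_total"] > 0)
    & (df["described_total"].fillna(0) > 0)
    & ~df["described_note"].str.startswith("All", na=False)
]
print(
    f'{round(partial["quantity_total"].sum(), 2):,} linear metres in partially described series'
)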
Created by Tim Sherratt for the GLAM Workbench. Support me by becoming a GitHub sponsor!