Files digitised in the last week

Each Sunday I'm automatically harvesting details of files digitised by the NAA in the previous week. You can view the results in this repository. This notebook analyses the most recent harvest to provide a summary of the results.

In [62]:
import datetime

import arrow
import pandas as pd
from IPython.display import display
from recordsearch_data_scraper.scrapers import RSSeries
from tqdm.auto import tqdm
In [63]:
# Find the date of the most recent Sunday
today = arrow.now().to("Australia/Sydney")
# Todat is Sunday and it's past 2pm so the harvest should have run
if today.weekday() == 6 and today.time() >= datetime.time(14, 0, 0, 0):
    harvest_day = today
# Otherwise get last Sunday
else:
    harvest_day = arrow.now().to("Australia/Sydney").shift(weekday=6).shift(weeks=-1)

print(f'Harvested on {harvest_day.format("dddd, d MMMM YYYY")}.')
Harvested on Sunday, 7 June 2022.
In [64]:
df = pd.read_csv(
    f'https://raw.githubusercontent.com/wragge/naa-recently-digitised/master/data/digitised-week-ending-{harvest_day.format("YYYYMMDD")}.csv'
)
In [65]:
df.shape
Out[65]:
(55633, 6)
In [66]:
df["series"].value_counts()[:10]
Out[66]:
A2571     33686
B884      10150
A2572      8748
C610        961
A9301       735
D874        624
B883        163
J853        161
A14487      102
A2478        21
Name: series, dtype: int64
In [67]:
series_list = list(df["series"].unique())
In [68]:
cited_series = []
for series in tqdm(series_list):
    data = RSSeries(
        series, include_number_digitised=False, include_access_status=False
    ).data
    cited_series.append({"series": series, "series_title": data["title"]})
In [69]:
df_titles = pd.merge(df, pd.DataFrame(cited_series), how="left", on="series")
In [71]:
with pd.option_context("display.max_colwidth", 100):
    df_titles = (
        df_titles.value_counts(["series", "series_title"]).to_frame().reset_index()
    )
    df_titles.columns = ["series", "series_title", "total"]
    display(df_titles[:20])
    totals = ""
    for title in df_titles[:20].itertuples():
        totals += (
            f"{title.series}, {title.series_title}, {title.total} files digitised; "
        )
    # print(totals)
series series_title total
0 A1 Correspondence files, annual single number series [Main correspondence files series of the agency] 1
1 D1915 Investigation case files, single number series with 'SA' (South Australia) prefix 1
2 CP211/2 Correspondence files and other related papers 1
3 CP697/39 Record of grants of land held on Norfolk Island 1
4 D1051 Original drawings, plans and prints of National Estate properties 1
5 D13 Original Agreements and Accounts of Crew (Form M & S 3)1, with Ships Official Log Books (Form M ... 1
6 D1357 Army Pay Files [WWII CMF) single number series with 'S' prefix 1
7 D1358 Army pay files (2AIF), single number series with 'SX' prefix 1
8 D1901 Loveday Internment Camp internees files, single number series with variable alpha prefix 1
9 D2416 Arrival and departure registers, Finsbury Hostel. 1
10 F941 Correspondence files, annual single number series - [portion transferred to NT Archives Service] 1
11 D2419 Arrival and departure registers, Glenelg North Migrant Hostel 1
12 D26 Lighthouse log books, Cape Willoughby (Sturt Light), chronological series 1
13 D400 Correspondence files, annual single number series with 'SA' and 'S' prefix 1
14 D4881 Alien registration cards, alphabetical series 1
15 D874 Still photograph outdoor and studio negatives, annual single number series with N prefix (and pr... 1
16 E1129 Aliens registration cards, lexicographical series 1
17 E40 Aliens Registration Files 1
18 C610 Australian Women's Land Army - personnel cards, alphabetical series 1
19 C123 World War II security investigation dossiers, single number series 1
In [ ]: