Ever searched for items in RecordSearch and wanted to save the results as a CSV file, or in some other machine-readable format? This notebook makes it easy to save the results of an item search as a downloadable dataset. You can even download all the images from items that have been digitised, or save the complete files as PDFs!
RecordSearch doesn't currently have an option for downloading machine-readable data, so to get collection metadata in a structured form we have to resort to screen-scraping. This notebook uses the RecordSearch Data Scraper to do most of the work.
Notes:

- The scraper caches responses as it goes, so if a harvest is interrupted you can just run it again and it will pick up where it left off. Delete the cache_db.sqlite file to start from scratch.
- RecordSearch will only return a maximum of 20,000 results for any search. If there are more, this notebook tries to break the search into harvestable chunks using control symbol wildcards built from the control_range list below. This list supplies a range of prefixes which are supplied (with a trailing '*' for wildcard matches) as the control value.

The available search parameters are the same as those in RecordSearch's Advanced Search form. There are lots of them, but you'll probably only end up using a few, like kw and series. Note that you can use * for wildcard searches, just as you can in the web interface. So setting kw to 'wragge*' will find both 'wragge' and 'wragges'.

See the RecordSearch Data Scraper documentation for more information on search parameters.
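If you just want to check how many results a search will return before harvesting it, you can use the scraper directly. Here's a minimal sketch using the same RSItemSearch class that the harvesting code below is built on:

from recordsearch_data_scraper.scrapers import RSItemSearch

# A wildcard search that matches both 'wragge' and 'wragges'
search = RSItemSearch(kw="wragge*")

# The number of matching items (capped at '20,000+')
print(search.total_results)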
- kw – string containing keywords to search for
- kw_options – how to interpret kw; possible values include 'EXACT', which treats kw as a phrase rather than a list of words
- kw_exclude – string containing keywords to exclude from search
- kw_exclude_options – how to interpret kw_exclude; possible values include 'EXACT', which treats kw_exclude as a phrase rather than a list of words
- search_notes – set to 'on' to search item notes as well as metadata
- series – search for items in this series
- series_exclude – exclude items from this series
- control – search for items matching this control symbol
- control_exclude – exclude items matching this control symbol
- item_id – search for items with this item ID number (formerly called barcode)
- date_from – search for items with a date (year) greater than or equal to this, eg. '1935'
- date_to – search for items with a date (year) less than or equal to this
- formats – limit search to items in a particular format, see possible values below
- formats_exclude – exclude items in a particular format, see possible values below
- locations – limit search to items held in a particular location, see possible values below
- locations_exclude – exclude items held in a particular location, see possible values below
- access – limit to items with a particular access status, see possible values below
- access_exclude – exclude items with a particular access status, see possible values below
- digital – set to True to limit to items that are digitised

Possible values for formats and formats_exclude: 'Paper files and documents', 'Index cards', 'Bound volumes', 'Cartographic records', 'Photographs', 'Microforms', 'Audio-visual records', 'Audio records', 'Electronic records', '3-dimensional records', 'Scientific specimens', 'Textiles'.
Possible values for locations and locations_exclude: 'ACT', 'Adelaide', 'Australian War Memorial', 'Brisbane', 'Darwin', 'Hobart', 'Melbourne', 'Perth', 'Sydney'.
Possible values for access and access_exclude: 'OPEN', 'OWE', 'CLOSED', 'NYE'.
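You can combine any number of these parameters in a single search. For example, here's a sketch (using the format, location, and access values listed above) that counts digitised, open-access photographs held in Canberra:

from recordsearch_data_scraper.scrapers import RSItemSearch

search = RSItemSearch(
    formats="Photographs",
    locations="ACT",
    access="OPEN",
    digital=True,
)
print(search.total_results)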
There are some additional parameters that affect the way the search results are delivered.

- record_detail – controls the amount of information included in each item record; possible values are:
  - 'brief' (the default) – just the fields displayed in the search results
  - 'digitised' – also includes the number of pages for digitised files
  - 'full' – retrieves each item's full record, adding access decision details

Note that if you want to harvest all the digitised page images from a search, you need to set record_detail to either 'digitised' or 'full'.
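For example, here's a rough sketch of a search that includes page counts for digitised files:

from recordsearch_data_scraper.scrapers import RSItemSearch

search = RSItemSearch(kw="wragge", record_detail="digitised")
data = search.get_results()

# Digitised items should now include a digitised_pages value
print(data["results"][0].get("digitised_pages"))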
Once it's downloaded all the results, the harvesting function creates a directory for the harvest and saves three files inside:

- metadata.json – a summary of your harvest, including the parameters you used and the date it was run
- results.ndjson – the harvested data, with each record saved as a JSON object on a new line
- results.csv – the harvested data, with any duplicates removed, saved as a CSV file (if you've saved 'full' records, the list of access_decision_reasons will be saved as a pipe-separated string)

The metadata.json file looks something like this:
{
"date_harvested": "2021-05-22T22:05:10.705184",
"search_params": {"results_per_page": 20, "sort": 9, "record_detail": "digitised"},
"search_kwargs": {"kw": "wragge"},
"total_results": 208,
"total_harvested": 208,
"total_deduplicated": 208
}
The 'total' values represent slightly different things:

- total_results – the number of matching results RecordSearch thinks there are
- total_harvested – the number of results actually harvested
- total_deduplicated – the number of records left after duplicates are removed from the harvested results

Duplicate records sometimes occur when items have an alternative control symbol. The CSV creation process removes any duplicates.
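If you want to check these totals programmatically, you can read them back from metadata.json. A minimal sketch, assuming data_dir is the harvest directory returned by the harvest_search() function below:

import json
from pathlib import Path

with Path(data_dir, "metadata.json").open() as md_file:
    metadata = json.load(md_file)

print(metadata["total_results"], metadata["total_harvested"], metadata["total_deduplicated"])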
The fields in the results files are:

- title
- identifier
- series
- control_symbol
- digitised_status
- digitised_pages – if record_detail is set to 'digitised' or 'full'
- access_status
- access_decision_reasons – if record_detail is set to 'full'
- location
- retrieved – the date/time when this record was retrieved from RecordSearch
- contents_date_str
- contents_start_date
- contents_end_date
- access_decision_date_str – if record_detail is set to 'full'
- access_decision_date – if record_detail is set to 'full'

See below for information on saving digitised images and PDFs.
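The results.csv file can be loaded straight into pandas for analysis. A quick sketch (the harvest path is hypothetical, substitute your own):

import pandas as pd

# Hypothetical harvest directory, substitute the one created by your harvest
df = pd.read_csv("harvests/20210522_220510_kw_wragge/results.csv")

# For example, count the items by access status
print(df["access_status"].value_counts())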
import json
import string
import time
from datetime import datetime
from pathlib import Path
import pandas as pd
import requests
from IPython.display import HTML, FileLink, display
from recordsearch_data_scraper.scrapers import RSItemSearch
from slugify import slugify
from tqdm.auto import tqdm
# This is a workaround for a problem with tqdm adding space to cells
HTML(
"""
<style>
.p-Widget.jp-OutputPrompt.jp-OutputArea-prompt:empty {
padding: 0;
border: 0;
}
</style>
"""
)
# This is basically a list of letters and numbers that we can use to build up control symbol values.
control_range = (
[str(number) for number in range(0, 10)]
+ [letter for letter in string.ascii_uppercase]
+ ["/"]
)
def get_results(data_dir, **kwargs):
"""
Save all the results from a search using the given parameters.
If there are more than 20,000 results, return False.
Otherwise, save the results to the harvest's ndjson file and return True.
"""
s = RSItemSearch(**kwargs)
if s.total_results == "20,000+":
return False
else:
with tqdm(total=s.total_results, leave=False) as pbar:
more = True
while more:
data = s.get_results()
if data["results"]:
save_to_ndjson(data_dir, data["results"])
pbar.update(len(data["results"]))
time.sleep(0.5)
else:
more = False
return True
def refine_controls(current_control, data_dir, **kwargs):
"""
Add additional letters/numbers to the control symbol wildcard search
until the number of results is less than 20,000.
Then harvest the results, saving them to the harvest's ndjson file.
"""
for control in control_range:
new_control = current_control.strip("*") + control + "*"
kwargs["control"] = new_control
results = get_results(data_dir, **kwargs)
if results is False:
refine_controls(new_control, data_dir, **kwargs)
def create_data_dir(search, today):
"""
Create a directory for the harvested data -- using the date and search parameters.
"""
params = search.params.copy()
params.update(search.kwargs)
search_param_str = slugify(
"_".join(
sorted(
[
f"{k}_{v}"
for k, v in params.items()
if v is not None and k not in ["results_per_page", "sort"]
]
)
)
)
data_dir = Path("harvests", f'{today.strftime("%Y%m%d_%H%M%S")}_{search_param_str}')
data_dir.mkdir(exist_ok=True, parents=True)
return data_dir
def save_to_ndjson(data_dir, results):
"""
Save results into a single, newline delimited JSON file.
"""
output_file = Path(data_dir, "results.ndjson")
with output_file.open("a") as ndjson_file:
for result in results:
ndjson_file.write(json.dumps(result) + "\n")
def save_metadata(search, data_dir, today, totals):
"""
Save information about the harvest to a JSON file.
"""
metadata = {
"date_harvested": today.isoformat(),
"search_params": search.params,
"search_kwargs": search.kwargs,
"total_results": search.total_results,
"total_harvested": totals["harvested"],
"total_after_deduplication": totals["deduped"],
}
with Path(data_dir, "metadata.json").open("w") as md_file:
json.dump(metadata, md_file)
def save_csv(data_dir):
"""
Save the harvested results as a CSV file, removing any duplicates.
"""
output_file = Path(data_dir, "results.csv")
input_file = Path(data_dir, "results.ndjson")
df = pd.read_json(input_file, lines=True)
harvested = df.shape[0]
# Flatten list
try:
df["access_decision_reasons"] = (
df["access_decision_reasons"].dropna().apply(lambda l: " | ".join(l))
)
except KeyError:
pass
# Remove any duplicates
df.drop_duplicates(inplace=True)
df.to_csv(output_file, index=False)
deduped = df.shape[0]
return {"harvested": harvested, "deduped": deduped}
def harvest_search(**kwargs):
"""
Harvest all the items from a search using the supplied parameters.
If there are more than 20,000 results, it will use control symbol
wildcard values to try and split the results into harvestable chunks.
"""
# Initialise the search
search = RSItemSearch(**kwargs)
today = datetime.now()
data_dir = create_data_dir(search, today)
# If there are more than 20,000 results, try chunking using control symbols
if search.total_results == "20,000+":
# Loop through the letters and numbers
for control in control_range:
# Add letter/number as a wildcard value
kwargs["control"] = f"{control}*"
# Try getting the results
results = get_results(data_dir, **kwargs)
if results is False:
# If there's still more than 20,000, add more letters/numbers to the control symbol!
refine_controls(control, data_dir, **kwargs)
# If there are fewer than 20,000 results, save them all
else:
get_results(data_dir, **kwargs)
totals = save_csv(data_dir)
save_metadata(search, data_dir, today, totals)
print(f"Harvest directory: {data_dir}")
display(FileLink(Path(data_dir, "metadata.json")))
display(FileLink(Path(data_dir, "results.ndjson")))
display(FileLink(Path(data_dir, "results.csv")))
return data_dir
def save_images(harvest_dir):
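"""
Download an image of every page from each digitised file in the harvest.
"""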
df = pd.read_csv(Path(harvest_dir, "results.csv"))
with tqdm(
total=df.loc[df["digitised_status"] == True].shape[0], desc="Files"
) as pbar:
for item in df.loc[df["digitised_status"] == True].itertuples():
image_dir = Path(
f"{harvest_dir}/images/{slugify(item.series)}-{slugify(str(item.control_symbol))}-{item.identifier}"
)
# Create the folder (and parent if necessary)
image_dir.mkdir(exist_ok=True, parents=True)
# Loop through the page numbers
for page in tqdm(
range(1, int(item.digitised_pages) + 1), desc="Images", leave=False
):
# Define the image filename using the barcode and page number
filename = Path(f"{image_dir}/{item.identifier}-{page}.jpg")
# Check to see if the image already exists (useful if rerunning a failed harvest)
if not filename.exists():
# If it doesn't already exist then download it
img_url = f"https://recordsearch.naa.gov.au/NaaMedia/ShowImage.asp?B={item.identifier}&S={page}&T=P"
response = requests.get(img_url)
try:
response.raise_for_status()
except requests.exceptions.HTTPError:
pass
else:
filename.write_bytes(response.content)
time.sleep(0.5)
pbar.update(1)
def save_pdfs(harvest_dir):
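"""
Download a PDF of each digitised file in the harvest.
"""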
df = pd.read_csv(Path(harvest_dir, "results.csv"))
pdf_dir = Path(harvest_dir, "pdfs")
pdf_dir.mkdir(exist_ok=True, parents=True)
with tqdm(
total=df.loc[df["digitised_status"] == True].shape[0], desc="Files"
) as pbar:
for item in df.loc[df["digitised_status"] == True].itertuples():
pdf_file = Path(
pdf_dir,
f"{slugify(item.series)}-{slugify(str(item.control_symbol))}-{item.identifier}.pdf",
)
if not pdf_file.exists():
pdf_url = f"https://recordsearch.naa.gov.au/SearchNRetrieve/NAAMedia/ViewPDF.aspx?B={item.identifier}&D=D"
response = requests.get(pdf_url)
try:
response.raise_for_status()
except requests.exceptions.HTTPError:
pass
else:
pdf_file.write_bytes(response.content)
time.sleep(0.5)
pbar.update(1)
Insert your search parameters in the brackets below.
Examples:
data_dir = harvest_search(kw='rabbit')
data_dir = harvest_search(kw='rabbit', digital=True)
data_dir = harvest_search(record_detail='full', kw='rabbit', series='A1')
data_dir = harvest_search(series='B13')
If you're running a long harvest, there's a good chance it will get interrupted at some point. Don't worry, just run the cell above again. The scraper caches your results, so it won't need to start from scratch.
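If you do want to throw away the cached responses and start completely from scratch, delete the cache_db.sqlite file first. A sketch, assuming the cache file is in the current working directory:

from pathlib import Path

# Remove the scraper's cache file if it exists (missing_ok requires Python 3.8+)
Path("cache_db.sqlite").unlink(missing_ok=True)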
data_dir = harvest_search(kw="wragge exhibit", record_detail="digitised")
Once you've saved all the metadata from your search, you can use it to download images from all the items that have been digitised.
Note that you can only save the images if you set the record_detail
parameter to 'digitised' or 'full' in the original harvest.
The function below will look for all items that have a digitised_pages
value in the harvest results, and then download an image for each page. The images will be saved in an images
subdirectory, inside the original harvest directory.
# Supply the path to the directory containing the harvested data
# This is the value returned by the `harvest_search()` function.
# eg: 'harvests/20210522_digital_True_kw_wragge_record_detail_full'
save_images(data_dir)
You can also save digitised files as PDFs. The function below will save any digitised files in the results to a pdfs
subdirectory within the harvest directory.
# Supply the path to the directory containing the harvested data
# This is the value returned by the `harvest_search()` function.
# eg: 'harvests/20210522_digital_True_kw_wragge_record_detail_full'
save_pdfs(data_dir)
Created by Tim Sherratt for the GLAM Workbench. Support me by becoming a GitHub sponsor!