The Portable Antiquities Scheme (PAS) is a programme run by the British Museum and the National Museum of Wales. Small artefacts are often found in the course of gardening, metal detecting and other activities; the Scheme allows those finds to be recorded and those objects to become known. The Scheme has a database at finds.org.uk/database containing well over 1 million objects. The database exposes its records in a variety of ways to encourage scholarly re-use. Daniel Pett, who designed and built the database and web API, wrote the following R code to demonstrate how to query the API and retrieve photographic records.
library(jsonlite)
library(RCurl)
Loading required package: bitops
Now we set the base URL for PAS because we'll need it later. We're going to make a search of the database, which will return results to us in JSON format. Some of the key:value pairs will be things like 'filename' and 'imagedir'; to get the images we want, we'll grab those two values and string them together with the base URL to create a download path.
# The base URL for PAS
base <- 'https://finds.org.uk/'
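As a small illustration of how that download path is assembled, the sketch below pastes the base URL together with an 'imagedir' and a 'filename' value. The two values are copied from the first row of the sample output shown further down; the variable names imagedir, filename and image_url are only for illustration and are not used elsewhere in the notebook.

```r
# The base URL for PAS (same value as above)
base <- 'https://finds.org.uk/'
# Example values copied from the sample output further down
imagedir <- 'images/ianr/'
filename <- '2016T920b.jpg'
# String the pieces together into a download path
image_url <- paste0(base, imagedir, filename)
print(image_url)
```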
Now we set up our query. Open a new browser tab, go to the PAS website and do some simple searches to see the kind of information available. When you get the search results, scroll down to see the options for how you can get the data returned to you. Click on 'json', and note the URL in the browser's address bar. That's what we're about to set up in the next cell:
## Set your query up
# The important parameters for you to include in a search are:
# q/{queryString} - which has your free text or parameterised search e.g. q/gold/broadperiod/BRONZE+AGE
# /thumbnail/1 - ask for records with images
# /format/json - ask for json response
##
url <- "https://finds.org.uk/database/search/results/q/gold/broadperiod/BRONZE+AGE/thumbnail/1/format/json"
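If you want to try other searches without retyping the whole address, one option is to assemble the URL from its parts. The helper below is a hypothetical convenience (build_pas_url is not part of the PAS API or of any package); it simply follows the pattern of the URL above, and assumes that other free-text queries and period names slot into the same positions.

```r
# Hypothetical helper: builds a search URL following the pattern shown above
build_pas_url <- function(query, broadperiod) {
  # Period names are upper case with '+' in place of spaces, as in BRONZE+AGE
  period <- gsub(" ", "+", toupper(broadperiod), fixed = TRUE)
  paste0(
    "https://finds.org.uk/database/search/results",
    "/q/", query,
    "/broadperiod/", period,
    "/thumbnail/1",   # ask for records with images
    "/format/json"    # ask for json response
  )
}

example_url <- build_pas_url("gold", "Bronze Age")
print(example_url)
```

The result is stored in example_url rather than url so that running this cell does not overwrite the query set up above.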
The next line, which is using a function from the 'jsonlite' package, goes to the URL we set up above and gets the data.
# Get your JSON and parse
json <- fromJSON(url)
If you look at the results of your search in the other browser window, where you clicked on 'json' at the bottom of the page, you'll see a long list of key:value pairs. Below, we're going to grab some of the values that describe the metadata for our search.
# The total results available
total <- json$meta$totalResults
# Results per page
results <- json$meta$resultsPerPage
# Number of pages
pagination <- ceiling(total/results)
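To see why ceiling() is used for the page count, here is the same arithmetic with made-up numbers (they are not real PAS counts): 45 results at 20 per page fill two full pages plus a partial third one, so the division has to be rounded up.

```r
# Made-up numbers for illustration only, not real PAS counts
example_total <- 45
example_per_page <- 20
# 45 / 20 = 2.25; ceiling() rounds up so the partial last page is counted
example_pages <- ceiling(example_total / example_per_page)
print(example_pages)
```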
We're now going to set up some variables that will specify which values we wish to keep, and pass to a csv file at the end of this process.
# Set which fields to keep
keeps <- c(
"id", "objecttype", "old_findID",
"broadperiod", "institution", "imagedir",
"filename"
)
data <- json$results
# Keep the columns you want
data <- data[,(names(data) %in% keeps)]
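The column selection above relies on %in%, which returns TRUE or FALSE for each column name. A minimal sketch on a toy table (the toy data and the toy_keeps vector are invented for illustration) shows the effect, including the fact that a name listed in the keep vector but absent from the table is silently ignored.

```r
# A toy table with one column we do not want
toy <- data.frame(
  id = c(1, 2),
  objecttype = c("RING", "AXEHEAD"),
  description = c("not needed", "not needed"),
  stringsAsFactors = FALSE
)
# 'filename' is listed here but does not exist in toy
toy_keeps <- c("id", "objecttype", "filename")
# names(toy) %in% toy_keeps is TRUE, TRUE, FALSE
toy <- toy[, names(toy) %in% toy_keeps]
print(names(toy))
```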
Let's take a look at what we've got. We could just call data and see everything. We'll use head(data) instead to see just the first few lines. Any guesses as to what you'd type to see the last few lines?
head(data)
id | old_findID | objecttype | broadperiod | institution | filename | imagedir |
---|---|---|---|---|---|---|
904260 | PAS-011055 | PENANNULAR RING | BRONZE AGE | PAS | 2016T920b.jpg | images/ianr/ |
899611 | CORN-3237B8 | FLAT AXEHEAD | BRONZE AGE | CORN | DSCN0203.JPG | images/atyacke/ |
899113 | YORYM-057F37 | AXEHEAD | BRONZE AGE | YORYM | SWW0001.jpg | images/bmorris/ |
878840 | HESH-83416C | FLAT AXEHEAD | BRONZE AGE | HESH | HESH83416C.jpg | images/preavill/ |
878015 | HAMP-1248F2 | PENANNULAR RING | BRONZE AGE | HAMP | HAMP1248F2.jpg | images/khindshamp/ |
871932 | SUSS-1CBAB0 | PENANNULAR RING | BRONZE AGE | SUSS | RingSUSS1CBAB0.jpg | images/EdwinWood/ |
The next bit is tricky. We're going to loop through all of the pages of results, and bind it all together into a single table. This might take a bit of time, depending on how you framed your query. Be careful: there is a lot of data available. If you're running this notebook through Binder, know that it might time out on you if you're grabbing a vast amount.
# Loop through and bind results
# seq_len(pagination)[-1] gives pages 2 to pagination, and nothing if there is only one page
for (i in seq_len(pagination)[-1]){
  urlDownload <- paste(url, '/page/', i, sep='')
  pagedJson <- fromJSON(urlDownload)
  records <- pagedJson$results
  records <- records[,(names(records) %in% keeps)]
  data <- rbind(data,records)
}
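rbind() expects every page to have the same columns. Whether PAS pages ever differ in this respect is an assumption we have not checked, but if the loop stops with a column mismatch, a sketch like the following may help. It uses two made-up 'pages' in place of real results, and align_columns is a hypothetical helper that fills any absent column with NA before binding.

```r
# Two made-up 'pages' standing in for pagedJson$results
page_a <- data.frame(id = 1:2, objecttype = c("RING", "AXEHEAD"),
                     stringsAsFactors = FALSE)
page_b <- data.frame(id = 3L, stringsAsFactors = FALSE)  # no objecttype column
toy_keeps <- c("id", "objecttype")

# Hypothetical helper: keep the wanted columns, filling absent ones with NA
align_columns <- function(records, keeps) {
  for (missing_col in setdiff(keeps, names(records))) {
    records[[missing_col]] <- NA
  }
  records[, keeps, drop = FALSE]
}

# Collect the pages in a list and bind them once at the end
pages <- list(page_a, page_b)
combined <- do.call(rbind, lapply(pages, align_columns, keeps = toy_keeps))
print(combined)
```

Binding once at the end also avoids growing the table inside the loop, which can be slow when there are many pages.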
We'll write that to csv:
# Write a csv file of the data you want
write.csv(data, file='data.csv',row.names=FALSE, na="")
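The two extra arguments control how the file looks: row.names=FALSE leaves out R's row numbers, and na="" writes missing values as empty cells instead of the text NA. A quick way to see this, sketched here on a toy table written to a temporary file so that your real data.csv is not touched:

```r
# Toy table with one missing filename
toy <- data.frame(id = c(1, 2), filename = c("a.jpg", NA),
                  stringsAsFactors = FALSE)
# Write to a temporary file rather than data.csv
toy_file <- tempfile(fileext = ".csv")
write.csv(toy, file = toy_file, row.names = FALSE, na = "")
# Look at the raw lines, then read the file back in
print(readLines(toy_file))
toy_back <- read.csv(toy_file, stringsAsFactors = FALSE)
print(toy_back)
```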
This is a good place to stop and consider what you've done. You've queried an archaeological database, selected the fields you want, and written them to a csv file! Make sure to save a copy of your data locally (click on the jupyter icon top left and from the file manager download the csv to your computer).
The last part of this notebook iterates through your data, creates the download path for each image and saves the images in a tidy folder structure, where anything categorized as a 'BROOCH' is in a brooch folder, anything categorized as a 'TORQUE' is in a torque folder, and so on.
# Throw in a log file, just in case of troubles or missing files.
# Only the file name is needed; cat() below appends to it directly.
failures <- "failures.log"
We're going to make a 'function', a mini-program that our code can use over and over again, to download the images.
# Download function with test for URL
download <- function(data){
  # apply() passes each row as a named vector, so we look values up by column name
  # The object type, used as the folder name
  object <- data[["objecttype"]]
  # The record's old find ID, used in the log
  record <- data[["old_findID"]]
  # Check and create a folder for that object type if it does not exist
  if (!dir.exists(object)){
    dir.create(object)
  }
  # Create image url from the base URL, the image path and the filename
  URL <- paste0(base, data[["imagedir"]], data[["filename"]])
  # Encode spaces and similar characters before contacting the server
  encoded <- URLencode(URL)
  # Test the file exists
  exist <- url.exists(encoded)
  # If it does, download. If not say 404
  if (exist){
    # mode = "wb" keeps binary image files intact on Windows
    download.file(encoded, destfile = paste(object, basename(URL), sep = '/'), mode = "wb")
  } else {
    print("That file is a 404")
    # Log the errors for sending back to PAS to fix - probably better than csv as you
    # can tail -f and watch the errors come in
    entry <- paste0(record, "|", URL, "|", "404\n")
    # Write to error file
    cat(entry, file = failures, append = TRUE)
  }
}
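Before downloading anything, it can be reassuring to check what the function will do with a row. The sketch below is a dry run that touches neither the network nor the disk: describe_download is a hypothetical helper that only reports the folder, the image URL and the destination path. The toy row is copied from the sample output shown earlier, and values are looked up by column name, which works because apply() passes each row as a named vector.

```r
# One row copied from the sample output shown earlier
toy <- data.frame(
  id = 904260,
  old_findID = "PAS-011055",
  objecttype = "PENANNULAR RING",
  broadperiod = "BRONZE AGE",
  institution = "PAS",
  filename = "2016T920b.jpg",
  imagedir = "images/ianr/",
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the paths instead of downloading anything
describe_download <- function(row) {
  image_url <- paste0("https://finds.org.uk/", row[["imagedir"]], row[["filename"]])
  c(
    folder = row[["objecttype"]],
    url = URLencode(image_url),
    destfile = paste(row[["objecttype"]], basename(image_url), sep = "/")
  )
}

# One column of results per row of toy
plan <- apply(toy, 1, describe_download)
print(plan)
```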
We can now call that function to download our data. When you run the code block below, it might seem as if nothing is happening, but keep an eye on the file manager! Right-click on the Jupyter logo above and open the link in a new tab. You'll see your new folders appearing in the list. Below the code block, you'll also see a message for each image that couldn't be downloaded. These failures are written to a text file called 'failures.log', which lists the record ID and image URL of each missing item. You can try to trace these down manually on the PAS website, or contact them with the information.
To stop the download, just make sure the code block below is highlighted, and then hit the stop button (the big square beside the Run button).
# Apply the function
apply(data, 1, download)
[1] "That file is a 404"
[1] "That file is a 404"
[1] "That file is a 404"
[1] "That file is a 404"
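Because each line of the log is written as record, URL and status separated by '|', the log can be read back into a table for follow-up. The sketch below works on a made-up log in a temporary file (the record IDs and URLs are invented, not real PAS failures); to read your own log you would point read.table at 'failures.log' instead.

```r
# Build a made-up log in a temporary file; IDs and URLs are invented
toy_log <- tempfile(fileext = ".log")
cat("EXAMPLE-000001|https://finds.org.uk/images/example/missing.jpg|404 \n",
    file = toy_log, append = TRUE)
cat("EXAMPLE-000002|https://finds.org.uk/images/example/other.jpg|404 \n",
    file = toy_log, append = TRUE)

# Read the pipe-separated lines back into a table
failed <- read.table(
  toy_log,
  sep = "|",
  col.names = c("record", "url", "status"),
  stringsAsFactors = FALSE,
  strip.white = TRUE,   # drop the space after 404
  quote = "",           # treat quote characters as ordinary text
  comment.char = ""     # do not treat '#' in a URL as a comment
)
print(failed)
```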