Finding unpublished works that might be entering the public domain on 1 January 2019

Changes to Australian copyright legislation mean that many unpublished resources will be entering the public domain on 1 January 2019. This notebook attempts to harvest the details of some of these resources from Trove.

As with most things involving copyright, there's no real way to be certain what will be entering the public domain. The main problem is that if there's no known author, the copyright period depends on whether, and when, the work was 'made public'. Add to that general issues around the accuracy and completeness of the metadata, and all I can really do is create a list of some of the things that could potentially be entering the public domain, based on the available metadata.

The basic methodology is:

  • Search in Trove's 'Diaries, letters, archives' zone for 'Unpublished' Australian materials.
  • For each record, check to see if there are any listed creators.
  • If there are creators, look to see if they have a death date and whether that date is before 1949.
  • If all the creators died before 1949, save the item metadata.
  • If there are no creators, look to see if the creation date of the item is before 1949; if so, save the metadata.
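
Just to make that rule concrete, here's a condensed sketch of the logic applied to each record. It's illustrative only -- the names `might_be_opening` and `latest_year` are mine, not part of the harvest code below, which does the same thing with the `check_creators()` and `check_date()` functions. The `contributor` and `issued` fields are the ones returned by the Trove API.

import re

def latest_year(value):
    # Pull a four-digit year from the end of a string like 'Smith, Jane, 1870-1936'
    match = re.search(r'\b(\d{4})$', str(value))
    return int(match.group(1)) if match else None

def might_be_opening(record, cutoff=1949):
    creators = record.get('contributor', [])
    issued = record.get('issued')
    if creators:
        # Every listed creator needs a death date before the cutoff
        years = [latest_year(c) for c in creators]
        return all(year and year < cutoff for year in years)
    if issued:
        # No creators -- fall back to the latest creation date
        year = latest_year(issued)
        return bool(year and year < cutoff)
    return False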

If you just want the data, here's a CSV file you can download. Look below for a preview.

If you want to play with the data a bit, here's another notebook with a few ideas.

For more information on the changes see the NSLA guide to Preparing for copyright term changes in 2019.

In [9]:
import requests
import re
from tqdm import tqdm
import time
import pandas as pd
import datetime
from IPython.display import display, HTML

from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

# Set up a requests session that automatically retries requests that fail with server errors
sess = requests.Session()
retries = Retry(
    total=10,
    backoff_factor=0.2,
    status_forcelist=[500, 502, 503, 504]
)
sess.mount('http://', HTTPAdapter(max_retries=retries))
sess.mount('https://', HTTPAdapter(max_retries=retries))

Harvest the data

In [50]:
api_url = 'http://api.trove.nla.gov.au/v2/result'
params = {
    'q': ' ',  # a single space as the query -- we're relying on the facet limits below
    'zone': 'collection',  # the 'Diaries, letters, archives' zone
    'encoding': 'json',
    'l-format': 'Unpublished',  # limit to unpublished works
    'l-australian': 'y',  # limit to Australian material
    'include': 'holdings',  # include holdings so we can get the NUC (library symbol) of the contributing organisation
    'key': 'INSERT YOUR API KEY HERE',
    'n': '100'  # number of results per request
}
In [51]:
# How many things are we processing?
response = sess.get(api_url, params = params)
data = response.json()
total = int(data['response']['zone'][0]['records']['total'])
print(total)
246136
In [52]:
def check_creators(creators):
    '''
    Check that every listed creator has a death date before 1949.
    Creators without a parseable year aren't counted, so a single
    undated creator stops the work from being included.
    '''
    opening = False
    count = 0
    for creator in creators:
        year = get_latest_year(creator)
        if year and int(year) < 1949:
            count += 1
    # Only flag the work if every creator had a qualifying death date
    if len(creators) == count:
        opening = True
    return opening


def check_date(issued):
    '''
    Check if the latest issued date is before 1949.
    '''
    opening = False
    year = get_latest_year(issued)
    if year and int(year) < 1949:
        opening = True
    return opening


def get_latest_year(value):
    '''
    Get a four-digit year from the end of a string, e.g. the death date
    from a creator string like 'Kelly, F. S. (Frederick Septimus), 1881-1916'.
    '''
    try:
        year = re.search(r'\b(\d{4})$', value).group(1)
    except (AttributeError, TypeError):
        year = None
    return year
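
A quick, purely illustrative check of how these helpers behave. The first two sample strings come from the results further down; 'Citizen, Jane, 1900-1980' is made up to show the failing case. This cell isn't part of the harvest.

print(get_latest_year('Kelly, F. S. (Frederick Septimus), 1881-1916'))  # '1916'
print(get_latest_year('Freycinet, Rose Marie de, d. 1832'))  # '1832'
print(get_latest_year('No date here'))  # None

# check_creators() only returns True when *every* creator has a death date before 1949
print(check_creators(['Kelly, F. S. (Frederick Septimus), 1881-1916']))  # True
print(check_creators(['Citizen, Jane, 1900-1980', 'Kelly, F. S. (Frederick Septimus), 1881-1916']))  # False

# check_date() looks at the end of a date range like '1863-1925'
print(check_date('1863-1925'))  # True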
        
In [54]:
items = []
s = '*'  # start value for Trove's deep paging; '*' means start from the beginning

with tqdm(total=total) as pbar:
    while s:
        params['s'] = s
        response = sess.get(api_url, params=params)
        # print(response.url)
        data = response.json()
        for record in data['response']['zone'][0]['records']['work']:
            opening = False
            creators = record.get('contributor')
            issued = record.get('issued')
            # Use the creators' death dates if we have them, otherwise fall back to the creation date
            if creators:
                opening = check_creators(creators)
            elif issued:
                opening = check_date(str(issued))
            if opening:
                # 'contributor' can be a list, a single string, or missing entirely
                try:
                    creator = ' | '.join(creators)
                except TypeError:
                    creator = creators
                # Get the NUC (library symbol) of the contributing organisation, if there is one
                try:
                    nuc = record['holding'][0]['nuc']
                except KeyError:
                    nuc = None
                item = {
                    'id': record['id'],
                    'title': record['title'],
                    'creator': creator,
                    'date': issued,
                    'trove_url': record['troveUrl'],
                    'nuc': nuc
                }
                items.append(item)
        # 'nextStart' is the cursor for the next page of results;
        # when it's missing we've reached the end of the result set
        try:
            s = data['response']['zone'][0]['records']['nextStart']
        except KeyError:
            s = None
        pbar.update(100)
        time.sleep(0.5)
        
    
246200it [1:10:47, 55.30it/s]                            

Convert the results to a dataframe and have a look inside

In [55]:
df = pd.DataFrame(items)
df.head()
Out[55]:
|   | creator | date | id | nuc | title | trove_url |
|---|---------|------|----|-----|-------|-----------|
| 0 | Kelly, F. S. (Frederick Septimus), 1881-1916 | 1893-1926 | 10201266 | ANL | Music manuscripts | https://trove.nla.gov.au/work/10201266 |
| 1 | None | 1863-1925 | 10544890 | ANL | Collection of promissory notes from remote are... | https://trove.nla.gov.au/work/10544890 |
| 2 | Gugeri, Peter Anthony, 1845-1930 | 1863-1910 | 14022030 | WLB | Gugeri family papers | https://trove.nla.gov.au/work/14022030 |
| 3 | Kruse, Johann Secundus, 1859-1927 | 1870-1927 | 14952244 | ANL | Papers of Johann Kruse | https://trove.nla.gov.au/work/14952244 |
| 4 | Freycinet, Rose Marie de, d. 1832 | 1802-1927 | 152218670 | ANL | Documents relating to Louis and Rose de Freycinet | https://trove.nla.gov.au/work/152218670 |
In [56]:
# How many items are there?
df.shape[0]
Out[56]:
14743

Save the results as a CSV file

In [57]:
date_str = datetime.datetime.now().strftime('%Y%m%d')
csv_file = 'unpublished_works_entering_pd_{}.csv'.format(date_str)
df.to_csv(csv_file, index=False)
# Make a download link
display(HTML('<a target="_blank" href="{}">Download CSV file</a>'.format(csv_file)))
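
If you'd like a quick peek at the results before moving on to the other notebook, a couple of lines of pandas on the dataframe created above will give you a feel for the data. This is just a sketch of the sort of exploration the other notebook covers in more detail.

# Which contributing organisations (NUC symbols) hold the most of these items?
print(df['nuc'].value_counts().head(10))

# How many items have no identified creator at all?
print(df['creator'].isnull().sum())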
In [ ]: