There are a growing number of non-English newspapers digitised in Trove. However, if you're only searching using English keywords, you might never know that they're there. I thought it would be useful to generate a list of non-English newspapers, but it wasn't quite as straightforward as I thought.
My first thought was that I could start by searching for digitised newspapers amongst the library records in Trove. My theory was that the catalogue metadata would include language information. For example, you can search for newspapers using format:Periodical/Newspaper in the books and libraries category (or the article API zone). To find those that are digitised, you can add a search for 'trove.nla.gov.au'. Unfortunately, this sort of search only returns about 826 results, and there are many more digitised newspapers than that in Trove. It seems links to digitised newspapers are not consistently recorded in the catalogue.
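Here's a rough sketch of that kind of catalogue search using the v2 API – the exact query string and the field I pull out of the response are illustrative assumptions, not a record of the search I actually ran.

import requests

TROVE_API_KEY = '[YOUR API KEY]'

# Search the books & libraries (article) zone for newspaper-format records
# that include a link to a digitised version on Trove.
params = {
    'zone': 'article',
    'encoding': 'json',
    'l-format': 'Periodical/Newspaper',
    'q': '"trove.nla.gov.au"',
    'key': TROVE_API_KEY
}
response = requests.get('https://api.trove.nla.gov.au/v2/result', params=params)
data = response.json()

# How many catalogue records matched?
print(data['response']['zone'][0]['records']['total'])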
My second approach was to get the list of digitised newspapers from the API, extract the ISSN, then use this to search for catalogue records. Here's the code snippet I used.
params = {
    'zone': 'article',
    'encoding': 'json',
    'l-format': 'Periodical/Newspaper',
    'reclevel': 'full',
    'key': TROVE_API_KEY
}

newspapers = get_newspapers()
for newspaper in newspapers:
    print(f'\n{newspaper["title"]}')
    # Not all titles have an ISSN -- see below
    issn = newspaper.get('issn')
    params['q'] = f'issn:{issn}'
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    try:
        works = data['response']['zone'][0]['records']['work']
    except KeyError:
        print('Not found')
    else:
        # Print any language values recorded in the matching catalogue records
        for work in works:
            print(work.get('language'))
    # Pause between requests unless the response came from the local cache
    if not response.from_cache:
        time.sleep(0.2)
The main problem here is that not all titles have ISSNs. You could try searching on the titles if there's no ISSN, but this would involve a fair bit of disambiguation. In any case, in running this I discovered that while there is some language information in the metadata, it's not consistently applied. So basically a metadata-only approach is not going to work. Sigh...
If I couldn't get language details from metadata, then I had to try to extract them from the resources themselves. I spent quite a bit of time looking around for Python packages that provided reliable language detection. The first one I tried regularly identified Mandarin as Korean (it turns out this was a known issue). Another one sent me into dependency hell. Finally I found pycld3, which installed with pip and just worked.
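Just to show what that looks like, here's a minimal sketch of pycld3 in action on a made-up sentence – the sample text is invented, but the fields on the prediction are the ones used in the code further down.

import cld3

# A made-up Italian sentence to sanity-check the detector.
sample = "Il nuovo giornale sarà pubblicato ogni settimana nella lingua italiana."
prediction = cld3.get_language(sample)

# The prediction includes a language code, a probability, and a reliability flag.
print(prediction.language, prediction.probability, prediction.is_reliable)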
My plan was to get the list of newspapers via the API as before, then fire off an empty search for each one. I'd then loop through the results, running the language detector over the article text. I set the query parameters to retrieve the maximum number of results in one request – 100. That seemed like a reasonable sample. To give the language detector enough text to work with, I set the word count parameter to return only articles with between 100 and 1000 words. So the query parameters I used were:
params = {
    'zone': 'newspaper',
    'encoding': 'json',
    'l-word': '100 - 1000 Words',
    'include': 'articletext',
    'key': TROVE_API_KEY,
    'q': ' ',
    'n': 100,
}
Because some of the newspapers had short runs and the word count filter limits the results, I wasn't always getting 100 results per newspaper. To work around this I found the likely language of each article, aggregated the counts, and then calculated each language's share of the sample. This gave me the proportion of articles in each language – a number I could compare across newspapers to find the non-English titles.
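The aggregation step boils down to something like this toy example (the list of detected languages is made up, not real Trove data).

from collections import Counter

# Pretend the detector returned these codes for a small sample of articles.
detected = ['de', 'de', 'de', 'en', 'de']

# Count each language and express it as a proportion of the sample.
for lang, count in Counter(detected).items():
    print(lang, count / len(detected))
# de 0.8
# en 0.2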
In general this worked pretty well, and the result was a list of 48 newspapers (also as a Gist) that have significant amounts of non-English content. However, I had to do a fair bit of fiddling to filter out dodgy results. All the details are included below.
import requests
import time
import requests_cache
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from collections import Counter
import re
from langdetect import detect
from tqdm.auto import tqdm
import pandas as pd
import cld3
import pycountry
from language_tags import tags
import altair as alt
from pathlib import Path
s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
TROVE_API_KEY = '[YOUR API KEY]'
def get_newspapers():
    '''
    Get a list of newspapers in Trove.
    '''
    response = s.get('https://api.trove.nla.gov.au/v2/newspaper/titles', params={'encoding': 'json', 'key': TROVE_API_KEY})
    data = response.json()
    return data['response']['records']['newspaper']
params = {
    'zone': 'newspaper',
    'encoding': 'json',
    # 'l-category': 'Article',
    'l-word': '100 - 1000 Words',
    'include': 'articletext',
    'key': TROVE_API_KEY,
    'q': ' ',
    'n': 100,
}
newspaper_langs = []
newspapers = get_newspapers()
for newspaper in tqdm(newspapers):
    langs = []
    # print(f'\n{newspaper["title"]}')
    params['l-title'] = newspaper['id']
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    n = data['response']['zone'][0]['records']['n']
    try:
        articles = data['response']['zone'][0]['records']['article']
    except KeyError:
        # print('Not found')
        pass
    else:
        # Detect language for each article in results
        for article in articles:
            if 'articleText' in article:
                # Clean up OCRd text by removing tags and extra whitespace
                text = article['articleText']
                text = re.sub('<[^<]+?>', '', text)
                text = re.sub(r'\s\s+', ' ', text)
                # Get the language
                ld = cld3.get_language(text)
                # If the language prediction is reliable, save it
                if ld.is_reliable:
                    langs.append(ld.language)
    # Find the count of each language detected in the sample of articles
    for lang, count in Counter(langs).items():
        # Calculate each language's count as a proportion of the articles with a reliable detection
        prop = count / len(langs)
        newspaper_langs.append({'id': newspaper['id'], 'title': newspaper['title'], 'language': lang, 'proportion': prop, 'number': n})
    if not response.from_cache:
        time.sleep(0.2)
Convert the results into a dataframe.
df = pd.DataFrame(newspaper_langs)
df.head()
|   | id  | title                                             | language | proportion | number |
|---|-----|---------------------------------------------------|----------|------------|--------|
| 0 | 166 | Canberra Community News (ACT : 1925 - 1927)       | en       | 1.0        | 100    |
| 1 | 165 | Canberra Illustrated: A Quarterly Magazine (AC... | en       | 1.0        | 29     |
| 2 | 69  | Federal Capital Pioneer (Canberra, ACT : 1924 ... | en       | 1.0        | 100    |
| 3 | 871 | Good Neighbour (ACT : 1950 - 1969)                | en       | 1.0        | 100    |
| 4 | 665 | Student Notes/Canberra University College Stud... | en       | 1.0        | 100    |
The language detector returns BCP-47-style language codes. To translate these into something that's a bit easier for humans to understand, we can use the language-tags package.
def get_full_language(lc):
    '''
    Get full language names from codes
    '''
    lang = tags.description(lc)
    if lang:
        return lang[0]
    else:
        print(lc)
        return lc
df['language_full'] = df['language'].apply(get_full_language)
If we just look at the numbers of languages detected we might think that Australia's cultural diversity was much greater than we expected! But the likelihood that there were ten newspapers publishing articles in Igbo (the language of the Igbo people in south-eastern Nigeria) seems small. Obviously there are a considerable number of false positives here.
df['language_full'].value_counts()
English                  1565
Maltese                   279
Catalan                    53
Welsh                      35
Japanese                   31
Italian                    31
Somali                     24
Norwegian                  23
Danish                     17
German                     16
Samoan                     10
Igbo                       10
Portuguese                  9
French                      9
Chinese                     8
Estonian                    8
Scottish Gaelic             8
Luxembourgish               8
Vietnamese                  7
Western Frisian             7
Hawaiian                    7
Russian                     6
Modern Greek (1453-)        5
Swedish                     5
Filipino                    5
Afrikaans                   4
Javanese                    4
Indonesian                  4
Polish                      4
Hindi                       4
Bulgarian                   4
Corsican                    4
Dutch                       3
Malagasy                    3
Haitian                     3
Latin                       3
Malay (macrolanguage)       3
Albanian                    2
Spanish                     2
Shona                       2
Kurdish                     2
Cebuano                     2
Irish                       2
Ukrainian                   2
Bosnian                     2
Macedonian                  1
Slovak                      1
Galician                    1
Turkish                     1
Czech                       1
Lithuanian                  1
Croatian                    1
Slovenian                   1
Zulu                        1
Maori                       1
Marathi                     1
Name: language_full, dtype: int64
Remember that for each language detected in a newspaper we calculated the proportion of articles in our results set in that language. So we can, for example, just look at newspapers where 100% of the articles are in a single language. This highlights a few non-English language newspapers, but obviously we're missing a lot of others.
df.loc[df['proportion'] == 1]['language_full'].value_counts()
English                 1112
Italian                    3
German                     3
Modern Greek (1453-)       1
Portuguese                 1
Estonian                   1
Name: language_full, dtype: int64
If we chart the proportions, we see them bunched up at either end of the scale. So there are lots of languages detected in only a small proportion of articles.
alt.Chart(df).mark_bar().encode(
    x=alt.X('proportion:Q', bin=True),
    y='count():Q'
)
If we zoom in on the proportions less than 0.1 (that's 10 articles in a sample of 100), we see that they're mostly less than 0.01 (or 1 article in 100). It seems likely that these are false positives.
alt.Chart(df.loc[df['proportion'] < 0.1]).mark_bar().encode(
    x=alt.X('proportion:Q', bin=True),
    y='count():Q'
)