There are a growing number of non-English newspapers digitised in Trove. However, if you're only searching using English keywords, you might never know that they're there. I thought it would be useful to generate a list of non-English newspapers, but it wasn't quite as straightforward as I thought.
My first thought was I could start by searching for digitised newspapers amongst the library records in Trove. My theory was that catalogue metadata would include language information. For example, you can search for newspapers using format:Periodical/Newspaper in the books and libraries category (or the article API zone). To find those that are digitised, you can add a search for 'trove.nla.gov.au'. Here's the sort of results you get. Unfortunately, you only get about 826 results and there are many more newspapers than that in Trove. It seems links to digitised newspapers are not consistently recorded.
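As a rough sketch, this first approach could be expressed as a request to the same `/result` endpoint used in the snippets that follow. How the two conditions were combined in the original search isn't recorded here, so pairing a quoted `q` value with an `l-format` facet is an assumption, as are the helper name `build_catalogue_search` and the placeholder key.

```python
from urllib.parse import urlencode

API_URL = "https://api.trove.nla.gov.au/v2/result"


def build_catalogue_search(api_key):
    """Build a search URL for newspaper records that mention trove.nla.gov.au."""
    params = {
        # Assumption: a phrase search is enough to find links to digitised copies
        "q": '"trove.nla.gov.au"',
        "zone": "article",
        "l-format": "Periodical/Newspaper",
        "encoding": "json",
        "key": api_key,
    }
    return f"{API_URL}?{urlencode(params)}"


search_url = build_catalogue_search("YOUR API KEY")
print(search_url)
```

Building the URL separately from sending the request makes it easy to paste the search into a browser and compare the total with the number of newspaper titles in Trove.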
My second approach was to get the list of digitised newspapers from the API, extract the ISSN, then use this to search for catalogue records. Here's the code snippet I used.
params = {
    'zone': 'article',
    'encoding': 'json',
    'l-format': 'Periodical/Newspaper',
    'reclevel': 'full',
    'key': TROVE_API_KEY
}
newspapers = get_newspapers()
for newspaper in newspapers:
    print(f'\n{newspaper["title"]}')
    issn = newspaper.get('issn')
    params['q'] = f'issn:{issn}'
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    try:
        works = data['response']['zone'][0]['records']['work']
    except KeyError:
        print('Not found')
    else:
        for work in works:
            print(work.get('language'))
    if not response.from_cache:
        time.sleep(0.2)
The main problem here is that not all titles have ISSNs. You could try searching on the titles if there's no ISSN, but this would involve a fair bit of disambiguation. In any case, in running this I discovered that while there is some language information in the metadata, it's not consistently applied. So basically a metadata-only approach is not going to work. Sigh...
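Because `newspaper.get('issn')` returns `None` when a title has no ISSN, the snippet above ends up searching for `issn:None` in those cases. A small guard, sketched below, makes the gap explicit. The helper name `issn_query` and the sample records are hypothetical; only the `issn` and `title` keys come from the code above.

```python
def issn_query(newspaper):
    """Return an ISSN search string, or None when the title has no ISSN."""
    issn = newspaper.get("issn")
    if not issn:
        return None
    return f"issn:{issn}"


# Hypothetical records, shaped like the items returned by get_newspapers()
sample_titles = [
    {"title": "Example Gazette", "issn": "1234-5678"},
    {"title": "Example Courier"},
]
for record in sample_titles:
    print(record["title"], "->", issn_query(record) or "no ISSN, skipped")
```

Counting the skipped titles would give a quick measure of how much of the collection an ISSN-based lookup can reach.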
If I couldn't get language details from metadata, then I had to try and extract it from the resource itself. I spent quite a bit of time looking around for Python packages that provided reliable language detection. The first one I tried regularly identified Mandarin as Korean (it turns out this was a known issue). Another one sent me into dependency hell. Finally I found pycld3 which installed with pip, and just worked.
My plan was to get the list of newspapers via the API as before, then fire off an empty search for each one. I'd then loop through the results, running the language detector over the article text. I set the query parameters to retrieve the maximum number of results in one request – 100. That seemed like a reasonable sample. To try and provide a big enough amount of text for the language detector to work with, I set the number of words parameter to return articles with between 100 and 1000 words. So the query parameters I used were:
params = {
    'zone': 'newspaper',
    'encoding': 'json',
    'l-word': '100 - 1000 Words',
    'include': 'articletext',
    'key': TROVE_API_KEY,
    'q': ' ',
    'n': 100,
}
Because some of the newspapers had short runs and the word count filter limits the results, I found that I wasn't always getting 100 results per newspaper. To work around this I found the likely language for each article, aggregated the counts, and then calculated the proportion of results for each language. This gave me the proportion of articles in each language – a number I could use across newspapers to find the non-English titles.
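The aggregation step can be sketched on its own with a `Counter`. The function name `language_proportions` and the sample data are made up for illustration; the denominator is the number of articles that produced a language code, which is how the full code below calculates it.

```python
from collections import Counter


def language_proportions(langs):
    """Map each detected language code to its share of the sample."""
    total = len(langs)
    if total == 0:
        return {}
    return {lang: count / total for lang, count in Counter(langs).items()}


# Hypothetical sample: 20 articles, each with one detected language
sample = ["de"] * 18 + ["en"] * 2
print(language_proportions(sample))
```

Working with proportions rather than raw counts means a title with only 20 usable articles can be compared directly with one that returned the full 100.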
In general this worked pretty well, and the result was a list of 52 newspapers (also as a Gist) that have significant amounts of non-English content. However, I had to do a fair bit of fiddling to filter out dodgy results. All the details are included below.
import os
import re
import time
from collections import Counter
from pathlib import Path
import altair as alt
import cld3
import pandas as pd
import requests_cache
from IPython.display import display
from language_tags import tags
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv
# Insert your Trove API key
API_KEY = "YOUR API KEY"
# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")
def get_newspapers():
    """
    Get a list of newspapers in Trove.
    """
    response = s.get(
        "https://api.trove.nla.gov.au/v2/newspaper/titles",
        params={"encoding": "json", "key": API_KEY},
    )
    data = response.json()
    return data["response"]["records"]["newspaper"]
params = {
    "zone": "newspaper",
    "encoding": "json",
    # 'l-category': 'Article',
    "l-word": "100 - 1000 Words",
    "include": "articletext",
    "key": API_KEY,
    "q": " ",
    "n": 100,
}
newspaper_langs = []
newspapers = get_newspapers()
for newspaper in tqdm(newspapers):
    langs = []
    # print(f'\n{newspaper["title"]}')
    params["l-title"] = newspaper["id"]
    response = s.get("https://api.trove.nla.gov.au/v2/result", params=params)
    data = response.json()
    n = data["response"]["zone"][0]["records"]["n"]
    try:
        articles = data["response"]["zone"][0]["records"]["article"]
    except KeyError:
        # print('Not found')
        pass
    else:
        # Detect language for each article in results
        for article in articles:
            if "articleText" in article:
                # Clean up OCRd text by removing tags and extra whitespace
                text = article["articleText"]
                text = re.sub(r"<[^<]+?>", "", text)
                text = re.sub(r"\s\s+", " ", text)
                # Get the language
                ld = cld3.get_language(text)
                # If the language prediction is reliable, save it
                if ld.is_reliable:
                    langs.append(ld.language)
    # Find the count of each language detected in the sample of articles
    for lang, count in dict(Counter(langs)).items():
        # Calculate the language count as a proportion of the articles
        # that had a reliable language prediction
        prop = int(count) / len(langs)
        newspaper_langs.append(
            {
                "id": newspaper["id"],
                "title": newspaper["title"],
                "language": lang,
                "proportion": prop,
                "number": n,
            }
        )
    if not response.from_cache:
        time.sleep(0.2)
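The clean-up step inside the loop can be tried by itself. The sketch below wraps the same two regular expressions in a function (the name `clean_ocr_text` and the sample string are made up for illustration) so that their effect is easy to check before text is passed to the language detector.

```python
import re


def clean_ocr_text(text):
    """Strip markup tags, then collapse runs of whitespace to a single space."""
    text = re.sub(r"<[^<]+?>", "", text)
    text = re.sub(r"\s\s+", " ", text)
    return text


# Made-up fragment with paragraph tags and uneven spacing
raw = "<p>Die Deutsche   Post</p>\n\n<p>Adelaide</p>"
print(clean_ocr_text(raw))
```

Note that the second pattern only matches two or more whitespace characters, so a single newline between words is left as it is.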
Convert the results into a dataframe.
df = pd.DataFrame(newspaper_langs)
df.head()
| | id | title | language | proportion | number |
---|---|---|---|---|---|
0 | 166 | Canberra Community News (ACT : 1925 - 1927) | en | 1.0 | 100 |
1 | 165 | Canberra Illustrated: A Quarterly Magazine (AC... | en | 1.0 | 29 |
2 | 69 | Federal Capital Pioneer (Canberra, ACT : 1924 ... | en | 1.0 | 100 |
3 | 871 | Good Neighbour (ACT : 1950 - 1969) | en | 1.0 | 100 |
4 | 665 | Student Notes/Canberra University College Stud... | en | 1.0 | 100 |
The language detector returns BCP-47-style language codes. To translate these into something that's a bit easier for humans to understand, we can use the language-tags package.
def get_full_language(lc):
    """
    Get full language names from codes
    """
    lang = tags.description(lc)
    if lang:
        return lang[0]
    else:
        print(lc)
        return lc
df["language_full"] = df["language"].apply(get_full_language)
If we just look at the numbers of languages detected we might think that Australia's cultural diversity was much greater than we expected! But the likelihood that there were four newspapers publishing articles in Igbo (the language of the Igbo people in south-eastern Nigeria) seems small. Obviously there are a considerable number of false positives here.
df["language_full"].value_counts()
English                  1680
Maltese                   177
Japanese                   28
Italian                    22
Somali                     18
German                     16
Welsh                      15
Catalan                    12
Portuguese                  9
Norwegian                   9
Chinese                     8
Estonian                    7
Danish                      7
Hindi                       6
French                      6
Western Frisian             6
Corsican                    6
Hawaiian                    4
Bulgarian                   4
Vietnamese                  4
Polish                      4
Igbo                        4
Indonesian                  4
Modern Greek (1453-)        4
Luxembourgish               3
Javanese                    3
Yiddish                     3
Dutch                       3
Scottish Gaelic             3
Swedish                     3
Czech                       2
Samoan                      2
Latin                       2
Kurdish                     2
Malagasy                    2
Filipino                    2
Russian                     2
Malay (macrolanguage)       2
Bosnian                     2
Spanish                     2
Cebuano                     2
Uzbek                       1
Slovenian                   1
Irish                       1
Croatian                    1
Haitian                     1
Turkish                     1
Hebrew                      1
Maori                       1
Zulu                        1
Galician                    1
Latvian                     1
Shona                       1
Ukrainian                   1
Lithuanian                  1
Afrikaans                   1
Hausa                       1
Macedonian                  1
Name: language_full, dtype: int64
Remember that for each language detected in a newspaper we calculated the proportion of articles in our results set in that language. So we can, for example, just look at newspapers where 100% of the articles are in a single language. This highlights a few non-English language newspapers, but obviously we're missing a lot of others.
df.loc[df["proportion"] == 1]["language_full"].value_counts()
English                  1422
German                      3
Italian                     3
Modern Greek (1453-)        2
Estonian                    1
Yiddish                     1
Name: language_full, dtype: int64
If we chart the proportions, we see them bunched up at either end of the scale. So there are lots of languages detected in only a small proportion of articles.
alt.Chart(df).mark_bar().encode(x=alt.X("proportion:Q", bin=True), y="count():Q")
If we zoom in on the proportions less than 0.1 (that's 10 articles in a sample of 100) we see that they're mostly less than 0.01 (or 1 article in 100). It seems likely that these are false positives.
alt.Chart(df.loc[df["proportion"] < 0.1]).mark_bar().encode(
    x=alt.X("proportion:Q", bin=True), y="count():Q"
)
Let's be fairly conservative and filter out languages that have a proportion (per newspaper) less than 0.05. This list seems a bit more in line with what we would expect, but there are still some surprises – 33 newspapers published articles in Maltese?
df.loc[df["proportion"] >= 0.05]["language_full"].value_counts()
English                  1670
Maltese                    33
Italian                    15
German                      9
Chinese                     8
Somali                      5
Modern Greek (1453-)        4
Japanese                    3
Portuguese                  3
Yiddish                     3
French                      3
Polish                      3
Western Frisian             2
Dutch                       2
Malay (macrolanguage)       1
Lithuanian                  1
Ukrainian                   1
Estonian                    1
Indonesian                  1
Vietnamese                  1
Danish                      1
Swedish                     1
Bosnian                     1
Russian                     1
Scottish Gaelic             1
Welsh                       1
Spanish                     1
Corsican                    1
Macedonian                  1
Bulgarian                   1
Name: language_full, dtype: int64
If we focus in on the newspapers that supposedly have a significant proportion of articles in Maltese, we see some very strange results. I seriously doubt that 76% of the Mildura Irrigationist from 1892-3 is in Maltese. So what's going on?
df.loc[(df["proportion"] > 0.1) & (df["language_full"] == "Maltese")]
| | id | title | language | proportion | number | language_full |
---|---|---|---|---|---|---|
203 | 1596 | L'Italo-Australiano = The Italo-Australian (Su... | mt | 0.206349 | 100 | Maltese |
270 | 389 | Reporter and Illawarra Journal (Kiama, NSW : 1... | mt | 0.105882 | 100 | Maltese |
286 | 418 | Southern Morning Herald (Goulburn, NSW : 1920 ... | mt | 0.146667 | 100 | Maltese |
289 | 623 | Sunday News (Sydney, NSW : 1919) | mt | 0.181818 | 100 | Maltese |
530 | 500 | The Richmond River Express and Casino Kyogle A... | mt | 0.126437 | 100 | Maltese |
654 | 810 | Upper Hunter Courier (Murrurundi, NSW : 1871) | mt | 0.142857 | 14 | Maltese |
812 | 892 | Warwick Daily News (Qld. : 1919 -1954) | mt | 0.111111 | 100 | Maltese |
928 | 34 | The Advertiser (Adelaide, SA : 1889 - 1931) | mt | 0.486111 | 100 | Maltese |
1205 | 543 | Cobden Times (Vic. : 1918) | mt | 0.109890 | 100 | Maltese |
1375 | 384 | North Melbourne Gazette (Vic. : 1894 - 1901) | mt | 0.189873 | 100 | Maltese |
1431 | 318 | Sandringham Southern Cross (Vic. : 1914 - 1918) | mt | 0.243902 | 100 | Maltese |
1565 | 1583 | The Mildura Irrigationist (Vic. : 1892 - 1893) | mt | 0.762500 | 100 | Maltese |
1568 | 1581 | The Mildura Irrigationist and Murray River Agr... | mt | 0.626667 | 100 | Maltese |
1577 | 1733 | The Morwell Advocate and Boolara and Mirboo Ch... | mt | 0.625000 | 21 | Maltese |
1580 | 1734 | The Morwell Advocate and Narracan, Boolara and... | mt | 0.170732 | 100 | Maltese |
1927 | 1617 | The Derby News (WA : 1887) | mt | 0.750000 | 5 | Maltese |
If you look at results for the Mildura Irrigationist in Trove you'll see that many of the page images are blurry, and as a result the OCR is very, very bad. Here's a sample:
ill Tatr W lyltwililUmt aat aa«v aa MwOkaWtOPMlkMrf faiflftMMRltitlWBfMNM fmiMW^M^K IMIOHIpM^fQBMMI ft tWMmrwl tWWiltjfNMStW ffw aailwt«M wtMitiar«lHa ifcmH af tlw ial«««l ion «M««f ffantoif wwtMaaM. tto tf h «frwringmhw torf M hr toaiy. Im4. ar, fc> mmirf awlUW wefllaM aA. aaytMaa. l «Wa A tfc» tow waliw Macks b aaM, b wil fVfbH Ja ^IMntaam* Mm' ls tolliac. rt Tto aad nf ttoar UhKMimiwa afM» ftjrwl ans W l OtfWOar jpaaofTwSi aJwwr la'aahS^— attor aakwt mm rvfimMiMh* ttoai. day - Why. aa IH thrf t«fl almd yaa."iw. aal wwifciha m OiO all tto laM amnavaA, fawawNl I r aa4 f wa* tm enr a Mtcfc tto watrr tto wiaaal m a* a* day pfaMat. aa4 (h* ilj amintir* ilm tTtsjtvL.f*' ""j •fria—lhati tow ««4M k." tlml t | r 4m» wtn .aa rUa* I h ha«« t ctoantaf InMM* aMtoclt ttopnaMaf II It la Mat rtgM, t jmi awl a 1 : af but d awtliqg a Mr. Jafc Matwa-(MMa M t «wl y gha yaar «toa anl yaar (ma as «fpai ta af
t«l. i pwwiaf Mtan (tot jw. twy MwUI « a1 a«ry ftajr «ndl tar tlw aad annaH* a*«r aarf a««r aaria. tiaa
What happens when we feed this fragment of bad OCR to the language detector? Remarkably, the language detector is 96% sure that it's Maltese! To find out why this is the case, we'd probably have to dig into the way the language detection model was trained. But for our purposes it's enough to know that some of the languages detected seem to be the result of bad OCR.
ocr = """ill Tatr W lyltwililUmt aat aa«v aa MwOkaWtOPMlkMrf faiflftMMRltitlWBfMNM fmiMW^M^K IMIOHIpM^fQBMMI ft tWMmrwl tWWiltjfNMStW ffw aailwt«M wtMitiar«lH*a ifcmH af tlw ial«««l ion «M««f ffantoif wwtMaaM. tto tf h «frwringmhw torf M hr toaiy. Im*4. ar, fc> mmirf awlUW wefllaM aA. aaytMaa. l «Wa A tfc» tow waliw Macks b aaM, b wil fVfbH Ja ^IMntaam* Mm' ls tolliac. rt Tto aad nf ttoar UhKMimiw*a afM» ftjrwl ans W l OtfWOar jpaaofTwSi aJwwr la'aahS^*— attor aakwt mm rvfimMiMh* ttoai. day - Why. aa IH thrf t«fl almd yaa."iw. aal wwifciha m OiO all tto laM amnavaA, fawawNl I r aa4 f wa* tm enr a Mtcfc tto watrr tto wiaaal m a* a* day pfaMat. aa4 (h* ilj amintir* ilm tTtsjtvL.f**' ""j •fria—lhati* tow ««4M k." tlml t | r 4m» wtn .aa rUa* I h ha«« t ctoantaf InMM* aM*toclt ttopnaMaf II It la Mat rtgM, t jmi awl a 1 : af but d awtliqg a Mr. Jafc Matwa-(MMa M t «wl y gha yaar «toa anl yaar (ma as «fpai ta af <M>t«l. i pwwiaf Mtan (tot jw. twy MwUI «*a1 a«ry ftajr «ndl tar tlw aad annaH* a*«r aarf a««r aaria. tiaa"""
cld3.get_language(ocr)
LanguagePrediction(language='mt', probability=0.960280179977417, is_reliable=True, proportion=1.0)
Of course there might actually be newspapers with articles in Maltese, so we don't want to filter them all out. So let's do some manual inspection of the newspapers that seem to have non-English content. First we'll filter our results to include only languages with proportions of at least 0.05, and then drop out newspapers that seem to be only in English. We end up with 89 different titles.
# The filter on the groupby drops out newspapers that only have articles in English.
filtered = (
    df.loc[df["proportion"] >= 0.05]
    .groupby(by=["title", "id"])
    .filter(lambda x: (len(x) > 1) or (len(x) == 1 and x["language"] != "en"))
)
papers = filtered.groupby(by=["title", "id"])
len(papers)
89
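The rule applied by the groupby filter can be restated in plain Python: after the threshold is applied, a title is kept if any of its remaining languages is something other than English. Each language appears only once per title, so a group with more than one row always includes a non-English language. The sketch below uses a hypothetical function name, `titles_with_non_english`, and a handful of rows copied from the results in this notebook.

```python
from collections import defaultdict


def titles_with_non_english(rows, threshold=0.05):
    """Return ids of titles with a non-English language at or above threshold."""
    languages = defaultdict(set)
    for row in rows:
        if row["proportion"] >= threshold:
            languages[row["id"]].add(row["language"])
    # Keep a title only if something other than 'en' survives the threshold
    return sorted(
        title_id for title_id, langs in languages.items() if langs - {"en"}
    )


# A few rows taken from the results shown in this notebook
rows = [
    {"id": "166", "language": "en", "proportion": 1.0},
    {"id": "1383", "language": "en", "proportion": 0.566667},
    {"id": "1383", "language": "zh", "proportion": 0.388889},
    {"id": "277", "language": "de", "proportion": 1.0},
]
print(titles_with_non_english(rows))
```

Here the English-only title (166) is dropped, while the mixed English and Chinese title (1383) and the German title (277) are kept.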
Let's list those 89 newspapers. From the list below, I think it's pretty easy to pick out the results that are likely to be the product of bad OCR.
for n, l in papers:
    # Only show titles with a non-English language at or above the threshold
    if not l.loc[(~l["language"].isin(["en"])) & (l["proportion"] >= 0.05)].empty:
        print(f"\n{n[0]} ({n[1]})")
        display(
            l[["language_full", "language", "proportion"]]
            .loc[(l["proportion"] > 0.05)]
            .sort_values(by="proportion", ascending=False)
        )
A Voz de Timor (Dili, East Timor : 1970 - 1975) (1498)
| | language_full | language | proportion |
---|---|---|---|
8 | Portuguese | pt | 0.988889 |
Adelaider Deutsche Zeitung (SA : 1851 - 1862) (277)
| | language_full | language | proportion |
---|---|---|---|
828 | German | de | 1.0 |
Auburn and District News (NSW : 1929) (1320)
| | language_full | language | proportion |
---|---|---|---|
43 | English | en | 0.947368 |
44 | Vietnamese | vi | 0.052632 |
Australier Leben = Australian Life (Melbourne, Vic. : 1931 - 1933) (1686)
| | language_full | language | proportion |
---|---|---|---|
1158 | Yiddish | yi | 1.0 |
Australische Zeitung (Adelaide, SA : 1875 - 1916) (1150)
| | language_full | language | proportion |
---|---|---|---|
832 | German | de | 1.0 |
Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946) (1283)
| | language_full | language | proportion |
---|---|---|---|
14 | Malay (macrolanguage) | ms | 0.891304 |
15 | Indonesian | id | 0.108696 |
Chinese Republic News (Sydney, NSW : 1914 - 1937) (1186)
| | language_full | language | proportion |
---|---|---|---|
83 | Chinese | zh | 0.928571 |
Chinese Times (Melbourne, Vic. : 1902 - 1922) (705)
| | language_full | language | proportion |
---|---|---|---|
1194 | Chinese | zh | 0.918367 |
Chronicle and North Coast Advertiser (Qld. : 1903 - 1922) (286)
| | language_full | language | proportion |
---|---|---|---|
695 | English | en | 0.94898 |
696 | Maltese | mt | 0.05102 |
Chung Wah News (Perth, WA : 1981 - 1987) (1383)
| | language_full | language | proportion |
---|---|---|---|
1694 | English | en | 0.566667 |
1693 | Chinese | zh | 0.388889 |
Cobden Times (Vic. : 1918) (543)
| | language_full | language | proportion |
---|---|---|---|
1204 | English | en | 0.857143 |
1205 | Maltese | mt | 0.109890 |
Colac Reformer (Vic. : 1914 - 1918) (763)
| | language_full | language | proportion |
---|---|---|---|
1214 | English | en | 0.947368 |
1215 | Maltese | mt | 0.052632 |
Daily Post (Hobart, Tas. : 1908 - 1918) (860)
| | language_full | language | proportion |
---|---|---|---|
1011 | English | en | 0.719101 |
1012 | Japanese | ja | 0.112360 |
Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952) (1385)
| | language_full | language | proportion |
---|---|---|---|
1716 | German | de | 0.82 |
1717 | English | en | 0.18 |
Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906) (1600)
| | language_full | language | proportion |
---|---|---|---|
125 | German | de | 1.0 |
Deutsche Zeitung für Sud-Australien = German Times for South Australia (Tanunda, SA : 1851) (1577)
| | language_full | language | proportion |
---|---|---|---|
844 | German | de | 0.9 |
843 | English | en | 0.1 |
Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939) (1591)
| | language_full | language | proportion |
---|---|---|---|
126 | German | de | 0.704082 |
127 | English | en | 0.295918 |
Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851) (1576)
| | language_full | language | proportion |
---|---|---|---|
845 | German | de | 0.989583 |
Dutch Australian Weekly (Sydney, NSW : 1951 - 1993) (1044)
| | language_full | language | proportion |
---|---|---|---|
131 | Dutch | nl | 0.969697 |
Dutch Weekly (Sydney, NSW : 1993 - 2004) (1045)
| | language_full | language | proportion |
---|---|---|---|
134 | Dutch | nl | 0.919192 |
135 | English | en | 0.060606 |
Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952) (1384)
| | language_full | language | proportion |
---|---|---|---|
1721 | Polish | pl | 0.91 |
1722 | English | en | 0.09 |
Eco Italiano (Perth, WA : 1958 - 1959) (1387)
| | language_full | language | proportion |
---|---|---|---|
1723 | Italian | it | 1.0 |
Emu Bay Times and North West and West Coast Advocate (Tas. : 1897 - 1899) (116)
| | language_full | language | proportion |
---|---|---|---|
1027 | English | en | 0.933333 |
1028 | Maltese | mt | 0.066667 |
Evelyn Observer, and South and East Bourke Record (Vic. : 1882 - 1902) (145)
| | language_full | language | proportion |
---|---|---|---|
1241 | English | en | 0.913978 |
1240 | Maltese | mt | 0.075269 |
Geraldton Advocate and Johnstone River Guardian (Qld. : 1895 - 1896) (1103)
| | language_full | language | proportion |
---|---|---|---|
704 | English | en | 0.947917 |
705 | Maltese | mt | 0.052083 |
Geraldton Express and Murchison Goldfields News (WA : 1894 - 1896) (1623)
| | language_full | language | proportion |
---|---|---|---|
1734 | English | en | 0.643836 |
1735 | Maltese | mt | 0.095890 |
1739 | Japanese | ja | 0.068493 |
Guang yi hua bao = The Chinese Australian Herald (Sydney, NSW : 1894 - 1923) (704)
| | language_full | language | proportion |
---|---|---|---|
162 | Chinese | zh | 0.854167 |
165 | Western Frisian | fy | 0.062500 |
Hamilton Spectator and Grange District Advertiser (Vic. : 1860 - 1870) (927)
| | language_full | language | proportion |
---|---|---|---|
1282 | English | en | 0.915789 |
1283 | Maltese | mt | 0.073684 |
Hellenic Echo (Perth, WA : 1967 - 1968) (1389)
| | language_full | language | proportion |
---|---|---|---|
1771 | Modern Greek (1453-) | el | 1.0 |
Il Canguro = The Kangaroo (Perth, WA : 1955 - 1957) (1378)
| | language_full | language | proportion |
---|---|---|---|
1773 | Italian | it | 0.97 |
Il Giornale Italiano (Sydney, NSW : 1932 - 1940) (279)
| | language_full | language | proportion |
---|---|---|---|
175 | Italian | it | 0.91 |
176 | English | en | 0.09 |
Il Risveglio = The Awakening (Sydney, NSW : 1944 - 1954) (1601)
| | language_full | language | proportion |
---|---|---|---|
177 | Italian | it | 0.75 |
178 | English | en | 0.25 |
Italian Bulletin of Australia (Sydney, NSW : 1922 - 1928, 1935 - 1940) (1602)
| | language_full | language | proportion |
---|---|---|---|
188 | English | en | 0.833333 |
189 | Italian | it | 0.166667 |
Italian Bulletin of Commerce (Sydney, NSW : 1929 - 1935) (1603)
| | language_full | language | proportion |
---|---|---|---|
190 | English | en | 0.893617 |
191 | Italian | it | 0.106383 |
Italo-Australian (Sydney, NSW : 1927 - 1940) (1595)
| | language_full | language | proportion |
---|---|---|---|
192 | Italian | it | 0.97 |
Japanese Perth Times (Subiaco, WA : 1989 - 1996) (1386)
| | language_full | language | proportion |
---|---|---|---|
1777 | Japanese | ja | 0.9375 |
Kyabram Union (Vic. : 1886 - 1894) (196)
| | language_full | language | proportion |
---|---|---|---|
1326 | English | en | 0.931818 |
1327 | Maltese | mt | 0.068182 |
L'Italo-Australiano = The Italo-Australian (Surry Hills, NSW : 1885) (1596)
| | language_full | language | proportion |
---|---|---|---|
202 | Italian | it | 0.698413 |
203 | Maltese | mt | 0.206349 |
L'Italo-Australiano = The Italo-Australian (Sydney, NSW : 1905 - 1909) (1597)
| | language_full | language | proportion |
---|---|---|---|
208 | Italian | it | 0.97 |
La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984) (1388)
| | language_full | language | proportion |
---|---|---|---|
1796 | Italian | it | 0.98 |
Le Courrier Australien (Sydney, NSW : 1892 - 2011) (829)
| | language_full | language | proportion |
---|---|---|---|
212 | French | fr | 0.76 |
213 | English | en | 0.24 |
Mediterranean Voice (Perth, WA : 1971 - 1972) (1390)
| | language_full | language | proportion |
---|---|---|---|
1815 | Modern Greek (1453-) | el | 0.357143 |
1814 | English | en | 0.224490 |
1816 | Portuguese | pt | 0.153061 |
1809 | French | fr | 0.081633 |
1808 | Spanish | es | 0.061224 |
Meie Kodu = Our Home (Sydney, NSW : 1949 - 1956) (280)
| | language_full | language | proportion |
---|---|---|---|
221 | Estonian | et | 1.0 |
Murchison Times and Cue-Big Bell-Reedy Advocate (WA : 1937 - 1942) (1543)
| | language_full | language | proportion |
---|---|---|---|
1838 | English | en | 0.892857 |
1839 | Maltese | mt | 0.071429 |
Musu Pastoge = Our Haven (Sydney, NSW : 1950 - 1954) (1594)
| | language_full | language | proportion |
---|---|---|---|
233 | Lithuanian | lt | 0.95 |
Nasza droga (Adelaide, SA : 1952 - 1954) (1323)
| | language_full | language | proportion |
---|---|---|---|
869 | Polish | pl | 0.89 |
870 | English | en | 0.11 |
Norden (Melbourne, Vic. : 1914 - 1918) (797)
| | language_full | language | proportion |
---|---|---|---|
1366 | Danish | da | 0.752809 |
1369 | Swedish | sv | 0.112360 |
1367 | English | en | 0.067416 |
North Melbourne Gazette (Vic. : 1894 - 1901) (384)
| | language_full | language | proportion |
---|---|---|---|
1374 | English | en | 0.784810 |
1375 | Maltese | mt | 0.189873 |
Oceania (Sydney, NSW : 1913 - 1915) (1598)
| | language_full | language | proportion |
---|---|---|---|
254 | Italian | it | 0.54 |
255 | English | en | 0.46 |
Reporter and Illawarra Journal (Kiama, NSW : 1887 - 1894) (389)
| | language_full | language | proportion |
---|---|---|---|
269 | English | en | 0.894118 |
270 | Maltese | mt | 0.105882 |
Revue Australienne : Journal des Interets Francais en Australie ... (Sydney, NSW : 1873 - 1874) (1604)
| | language_full | language | proportion |
---|---|---|---|
271 | French | fr | 0.98 |
Ringwood and Croydon Chronicle (Vic. : 1914 - 1918) (329)
| | language_full | language | proportion |
---|---|---|---|
1422 | English | en | 0.938144 |
1423 | Maltese | mt | 0.061856 |
Sandringham Southern Cross (Vic. : 1914 - 1918) (318)
| | language_full | language | proportion |
---|---|---|---|
1430 | English | en | 0.731707 |
1431 | Maltese | mt | 0.243902 |
Seamen's Strike Bulletin (Melbourne, Vic. : 1919) (1043)
| | language_full | language | proportion |
---|---|---|---|
1436 | Polish | pl | 0.4 |
1435 | Bosnian | bs | 0.2 |
1437 | Russian | ru-Latn | 0.2 |
1438 | Western Frisian | fy | 0.2 |
Southern Morning Herald (Goulburn, NSW : 1920 - 1923) (418)
| | language_full | language | proportion |
---|---|---|---|
285 | English | en | 0.800000 |
286 | Maltese | mt | 0.146667 |
287 | Somali | so | 0.053333 |
Stampa Italiana = The Italian Press (Perth, WA : 1931 - 1932) (1380)
| | language_full | language | proportion |
---|---|---|---|
1881 | Italian | it | 0.97 |
Suedaustralische Zeitung (Adelaide, SA : 1850 - 1851) (314)
| | language_full | language | proportion |
---|---|---|---|
924 | German | de | 0.888889 |
925 | English | en | 0.111111 |
Sunday News (Sydney, NSW : 1919) (623)
| | language_full | language | proportion |
---|---|---|---|
290 | English | en | 0.779221 |
289 | Maltese | mt | 0.181818 |
Sunday Times Edizione Italiana (Perth, WA : 1958 - 1959) (1379)
| | language_full | language | proportion |
---|---|---|---|
1888 | Italian | it | 1.0 |
Süd Australische Zeitung (Tanunda and Adelaide, SA : 1860 - 1874) (278)
| | language_full | language | proportion |
---|---|---|---|
922 | German | de | 0.989691 |
The Advertiser (Adelaide, SA : 1889 - 1931) (34)
| | language_full | language | proportion |
---|---|---|---|
927 | English | en | 0.513889 |
928 | Maltese | mt | 0.486111 |
The Australian Jewish News (Melbourne, Vic. : 1935 - 1999) (1685)
| | language_full | language | proportion |
---|---|---|---|
1473 | English | en | 0.810526 |
1475 | Yiddish | yi | 0.157895 |
The Castlereagh (Gilgandra, NSW : 1905 - 1907) (224)
| | language_full | language | proportion |
---|---|---|---|
384 | English | en | 0.609195 |
385 | Somali | so | 0.310345 |
386 | Maltese | mt | 0.080460 |
The Chinese Advertiser (Ballarat, Vic. : 1856) (706)
| | language_full | language | proportion |
---|---|---|---|
1504 | Chinese | zh | 0.500000 |
1506 | English | en | 0.333333 |
1505 | Scottish Gaelic | gd | 0.166667 |
The Derby News (WA : 1887) (1617)
| | language_full | language | proportion |
---|---|---|---|
1927 | Maltese | mt | 0.75 |
1928 | Corsican | co | 0.25 |
The English and Chinese Advertiser (Vic. : 1856 - 1858) (685)
| | language_full | language | proportion |
---|---|---|---|
1522 | English | en | 0.894737 |
1523 | Chinese | zh | 0.052632 |
1524 | Maltese | mt | 0.052632 |
The Hay Standard and Advertiser for Balranald, Wentworth, Maude...(Hay, NSW : 1871 - 1873; 1880 - 1881; 1890 - 1900) (725)
| | language_full | language | proportion |
---|---|---|---|
441 | English | en | 0.947368 |
442 | Maltese | mt | 0.052632 |
The Herald of Tasmania (Hobart, Tas. : 1845) (1741)
| | language_full | language | proportion |
---|---|---|---|
1083 | English | en | 0.857143 |
1085 | Italian | it | 0.095238 |
The Jewish Weekly News (Melbourne, Vic. : 1933 - 1935) (1707)
| | language_full | language | proportion |
---|---|---|---|
1535 | English | en | 0.81 |
1536 | Yiddish | yi | 0.19 |
The Melbourne Advertiser (Vic. : 1838) (935)
| | language_full | language | proportion |
---|---|---|---|
1550 | English | en | 0.666667 |
1551 | Welsh | cy | 0.333333 |
The Mildura Irrigationist (Vic. : 1892 - 1893) (1583)
| | language_full | language | proportion |
---|---|---|---|
1565 | Maltese | mt | 0.7625 |
1564 | English | en | 0.1250 |
1566 | Somali | so | 0.1125 |
The Mildura Irrigationist and Murray River Agricultural Times (Vic. : 1888) (1581)
| | language_full | language | proportion |
---|---|---|---|
1568 | Maltese | mt | 0.626667 |
1569 | English | en | 0.240000 |
1567 | Somali | so | 0.133333 |
The Mildura Irrigationist and Murray River Cultural Advocate (Vic. : 1891 - 1892) (1582)
| | language_full | language | proportion |
---|---|---|---|
1570 | English | en | 0.746667 |
1571 | Somali | so | 0.146667 |
1572 | Maltese | mt | 0.093333 |
The Miner's Right (Boulder, WA : 1897) (1638)
| | language_full | language | proportion |
---|---|---|---|
1984 | English | en | 0.908163 |
1986 | Maltese | mt | 0.061224 |
The Morwell Advocate and Boolara and Mirboo Chronicle (Vic. : 1886) (1733)
| | language_full | language | proportion |
---|---|---|---|
1577 | Maltese | mt | 0.625 |
1578 | English | en | 0.375 |
The Morwell Advocate and Narracan, Boolara and Mirboo Chronicle (Vic. : 1886) (1734)
| | language_full | language | proportion |
---|---|---|---|
1579 | English | en | 0.829268 |
1580 | Maltese | mt | 0.170732 |
The Reporter (Box Hill, Vic. : 1889 - 1925) (244)
| | language_full | language | proportion |
---|---|---|---|
1594 | English | en | 0.904255 |
1593 | Maltese | mt | 0.085106 |
The Richmond River Express and Casino Kyogle Advertiser (NSW : 1904 - 1929) (500)
| | language_full | language | proportion |
---|---|---|---|
532 | English | en | 0.827586 |
530 | Maltese | mt | 0.126437 |
The Voice of Freedom = Elefthera Phoni (Perth, WA : 1956 - 1957) (1381)
| | language_full | language | proportion |
---|---|---|---|
2064 | Modern Greek (1453-) | el | 0.98 |
To Ethnico Vema = Greek National Tribune (Arncliffe, NSW : 1931 - 1954) (1592)
| | language_full | language | proportion |
---|---|---|---|
626 | Modern Greek (1453-) | el | 1.0 |
Tung Wah News (Sydney, NSW : 1898 - 1902) (1185)
| | language_full | language | proportion |
---|---|---|---|
632 | Chinese | zh | 0.94 |
Tung Wah Times (Sydney, NSW : 1901 - 1936) (1184)
| | language_full | language | proportion |
---|---|---|---|
638 | Chinese | zh | 0.926316 |
Twofold Bay and Maneroo Observer (NSW : 1860) (394)
| | language_full | language | proportion |
---|---|---|---|
645 | English | en | 0.886364 |
647 | Maltese | mt | 0.090909 |
Uniamoci (Sydney, NSW : 1903 - 1904) (1599)
| | language_full | language | proportion |
---|---|---|---|
652 | Italian | it | 1.0 |
Upper Hunter Courier (Murrurundi, NSW : 1871) (810)
| | language_full | language | proportion |
---|---|---|---|
653 | English | en | 0.857143 |
654 | Maltese | mt | 0.142857 |
Vesnik (Perth, WA : 1975 - 1994) (1382)
| | language_full | language | proportion |
---|---|---|---|
2093 | Macedonian | mk | 0.408163 |
2092 | English | en | 0.357143 |
2094 | Bulgarian | bg-Latn | 0.224490 |
Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954) (1593)
| | language_full | language | proportion |
---|---|---|---|
655 | Ukrainian | uk | 0.82 |
656 | English | en | 0.18 |
Warwick Daily News (Qld. : 1919 -1954) (892)
| | language_full | language | proportion |
---|---|---|---|
811 | English | en | 0.864198 |
812 | Maltese | mt | 0.111111 |
Williamstown Trade Circular (Vic. : 1855 - 1856) (213)
| | language_full | language | proportion |
---|---|---|---|
1658 | English | en | 0.882353 |
1659 | Portuguese | pt | 0.117647 |
I went through the titles above and compiled a list of title identifiers that seem to be producing dodgy results. We can use this to filter these newspapers out of our results.
# Titles where dodgy OCR causes false positives in language detection
# This was manually created after scanning results
dodgy = [
"1036",
"1043",
"1103",
"116",
"1207",
"1265",
"13",
"1320",
"1336",
"140",
"1400",
"145",
"1488",
"1543",
"1546",
"1581",
"1582",
"1583",
"1617",
"1623",
"1626",
"1638",
"1675",
"1678",
"171",
"1733",
"1734",
"1741",
"196",
"213",
"224",
"244",
"286",
"292",
"318",
"329",
"34",
"384",
"389",
"394",
"418",
"430",
"431",
"452",
"479",
"499",
"500",
"543",
"570",
"623",
"725",
"763",
"810",
"860",
"886",
"892",
"906",
"92",
"926",
"927",
"935",
"937",
"94",
"946",
"970",
"986",
]
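One thing worth checking before using this list: the identifiers are stored as strings, so `isin()` will only match if the `id` column also holds strings. The sketch below uses a small made-up DataFrame (`sample` and `dodgy_sample` are illustrative names, not part of the original notebook) to show what happens with each dtype.

```python
import pandas as pd

# Hypothetical sample rows; the real df comes from the language detection step
sample = pd.DataFrame(
    {
        "id": ["394", "1599", "810"],
        "language": ["mt", "it", "mt"],
    }
)
dodgy_sample = ["394", "810"]

# String ids match the string entries in the list, so those rows are dropped
kept = sample.loc[~sample["id"].isin(dodgy_sample)]
print(kept["id"].tolist())

# If the ids were integers, pandas would treat 394 and "394" as different
# values, so nothing would match and nothing would be filtered out
as_int = sample.assign(id=sample["id"].astype(int))
print(as_int["id"].isin(dodgy_sample).any())
```

If your own `id` column turns out to be numeric, converting it with `astype(str)` before filtering should bring it into line with the list.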
Here we'll add the dodgy title ids into our filter. It seems that we have 52 newspapers with significant amounts of non-English content.
# The filter removes titles that only have one language, which is English
filtered = (
    df.loc[(~df["id"].isin(dodgy)) & (df["proportion"] >= 0.05)]
    .groupby(by=["title", "id"])
    .filter(lambda x: (len(x) > 1) or (x["language"] != "en").any())
)
papers = filtered.groupby(by=["title", "id"])
len(papers)
52
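To see what the filter keeps and drops, here is a minimal sketch run against made-up data. The titles, ids, and proportions are invented for illustration; only the 5% threshold and the grouping columns follow the code above.

```python
import pandas as pd

# Made-up rows for three hypothetical titles (not real Trove data)
toy = pd.DataFrame(
    {
        "title": ["Paper A", "Paper B", "Paper B", "Paper C", "Paper C"],
        "id": ["1", "2", "2", "3", "3"],
        "language": ["en", "it", "en", "en", "mt"],
        "proportion": [1.0, 0.7, 0.3, 0.97, 0.03],
    }
)


def keep_title(group):
    """Keep titles with more than one language, or a single non-English one."""
    return bool(len(group) > 1 or (group["language"] != "en").any())


toy_filtered = (
    toy.loc[toy["proportion"] >= 0.05]  # drop languages under 5% of the sample
    .groupby(by=["title", "id"])
    .filter(keep_title)
)
print(sorted(toy_filtered["title"].unique()))  # ['Paper B']
```

"Paper A" is dropped because it is English only. "Paper C" is dropped because its Maltese detection falls under the threshold, which leaves a single English row. Only "Paper B" survives.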
Let's list them.
for n, l in papers:
    print(n[0])
A Voz de Timor (Dili, East Timor : 1970 - 1975)
Adelaider Deutsche Zeitung (SA : 1851 - 1862)
Australier Leben = Australian Life (Melbourne, Vic. : 1931 - 1933)
Australische Zeitung (Adelaide, SA : 1875 - 1916)
Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946)
Chinese Republic News (Sydney, NSW : 1914 - 1937)
Chinese Times (Melbourne, Vic. : 1902 - 1922)
Chung Wah News (Perth, WA : 1981 - 1987)
Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952)
Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906)
Deutsche Zeitung für Sud-Australien = German Times for South Australia (Tanunda, SA : 1851)
Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939)
Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851)
Dutch Australian Weekly (Sydney, NSW : 1951 - 1993)
Dutch Weekly (Sydney, NSW : 1993 - 2004)
Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952)
Eco Italiano (Perth, WA : 1958 - 1959)
Guang yi hua bao = The Chinese Australian Herald (Sydney, NSW : 1894 - 1923)
Hellenic Echo (Perth, WA : 1967 - 1968)
Il Canguro = The Kangaroo (Perth, WA : 1955 - 1957)
Il Giornale Italiano (Sydney, NSW : 1932 - 1940)
Il Risveglio = The Awakening (Sydney, NSW : 1944 - 1954)
Italian Bulletin of Australia (Sydney, NSW : 1922 - 1928, 1935 - 1940)
Italian Bulletin of Commerce (Sydney, NSW : 1929 - 1935)
Italo-Australian (Sydney, NSW : 1927 - 1940)
Japanese Perth Times (Subiaco, WA : 1989 - 1996)
L'Italo-Australiano = The Italo-Australian (Surry Hills, NSW : 1885)
L'Italo-Australiano = The Italo-Australian (Sydney, NSW : 1905 - 1909)
La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984)
Le Courrier Australien (Sydney, NSW : 1892 - 2011)
Mediterranean Voice (Perth, WA : 1971 - 1972)
Meie Kodu = Our Home (Sydney, NSW : 1949 - 1956)
Musu Pastoge = Our Haven (Sydney, NSW : 1950 - 1954)
Nasza droga (Adelaide, SA : 1952 - 1954)
Norden (Melbourne, Vic. : 1914 - 1918)
Oceania (Sydney, NSW : 1913 - 1915)
Revue Australienne : Journal des Interets Francais en Australie ... (Sydney, NSW : 1873 - 1874)
Stampa Italiana = The Italian Press (Perth, WA : 1931 - 1932)
Suedaustralische Zeitung (Adelaide, SA : 1850 - 1851)
Sunday Times Edizione Italiana (Perth, WA : 1958 - 1959)
Süd Australische Zeitung (Tanunda and Adelaide, SA : 1860 - 1874)
The Australian Jewish News (Melbourne, Vic. : 1935 - 1999)
The Chinese Advertiser (Ballarat, Vic. : 1856)
The English and Chinese Advertiser (Vic. : 1856 - 1858)
The Jewish Weekly News (Melbourne, Vic. : 1933 - 1935)
The Voice of Freedom = Elefthera Phoni (Perth, WA : 1956 - 1957)
To Ethnico Vema = Greek National Tribune (Arncliffe, NSW : 1931 - 1954)
Tung Wah News (Sydney, NSW : 1898 - 1902)
Tung Wah Times (Sydney, NSW : 1901 - 1936)
Uniamoci (Sydney, NSW : 1903 - 1904)
Vesnik (Perth, WA : 1975 - 1994)
Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954)
That's looking pretty good. Let's save the results as a Markdown file to make it easy to explore. We'll include links into Trove. Here's the list of all 52 newspapers (also as a Gist).
with open(Path("non-english-newspapers.md"), "w") as md_file:
    i = 1
    for n, l in papers:
        md_file.write(
            f"\n### {i}. [{n[0]}](http://nla.gov.au/nla.news-title{n[1]})\n\n"
        )
        md_file.write("| Language | Language code | Proportion of sample |\n")
        md_file.write("|---|---|---|\n")
        for row in (
            l[["language_full", "language", "proportion"]]
            .loc[(l["proportion"] > 0.05)]
            .sort_values(by="proportion", ascending=False)
            .itertuples()
        ):
            md_file.write(
                f"| {row.language_full} | {row.language} | {row.proportion} |\n"
            )
        i += 1
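If you want to check the Markdown before writing it to disk, one option is to build each newspaper's section as a string first. This is only a sketch: `title_section` is a hypothetical helper, not part of the original notebook, and unlike the code above it rounds proportions to two decimal places.

```python
def title_section(number, title, title_id, rows):
    """Build the Markdown for one newspaper.

    rows is an iterable of (language_full, language, proportion) tuples.
    """
    lines = [
        f"### {number}. [{title}](http://nla.gov.au/nla.news-title{title_id})",
        "",
        "| Language | Language code | Proportion of sample |",
        "|---|---|---|",
    ]
    # Largest share of the sample first
    for language_full, language, proportion in sorted(
        rows, key=lambda r: r[2], reverse=True
    ):
        # Rounding to two places is a choice made here, not in the original
        lines.append(f"| {language_full} | {language} | {proportion:.2f} |")
    return "\n".join(lines) + "\n"


# Values taken from the Vil'na Dumka results listed earlier
print(
    title_section(
        1,
        "Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954)",
        "1593",
        [("English", "en", 0.18), ("Ukrainian", "uk", 0.82)],
    )
)
```

Each returned string could then be passed to `md_file.write()` inside the loop.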
If you look at the Markdown file you'll see that there are still some dodgy results. For example, 16% of the Chinese Advertiser is detected as 'Scottish Gaelic'. But the point of this exercise was to find non-English newspapers, rather than to accurately measure the proportion of non-English content, so I think we can live with it for now.
Created by Tim Sherratt for the GLAM Workbench.