The full text of newspaper articles in Trove is extracted from page images using Optical Character Recognition (OCR). The accuracy of the OCR process is influenced by a range of factors including the font and the quality of the images. Many errors slip through. Volunteers have done a remarkable job in correcting these errors, but it's a huge task. This notebook explores the scale of OCR correction in Trove.
There are two ways of getting data about OCR corrections using the Trove API. To get aggregate data you can include `has:corrections` in your query to limit the results to articles that have at least one OCR correction.
To get information about the number of corrections made to the articles in your results, you can add the `reclevel=full` parameter, which includes the number of corrections and details of the most recent correction in each article record. For example, note the `correctionCount` and `lastCorrection` values in the record below:
{
"article": {
"id": "41697877",
"url": "/newspaper/41697877",
"heading": "WRAGGE AND WEATHER CYCLES.",
"category": "Article",
"title": {
"id": "101",
"value": "Western Mail (Perth, WA : 1885 - 1954)"
},
"date": "1922-11-23",
"page": 4,
"pageSequence": 4,
"troveUrl": "https://trove.nla.gov.au/ndp/del/article/41697877",
"illustrated": "N",
"wordCount": 1054,
"correctionCount": 1,
"listCount": 0,
"tagCount": 0,
"commentCount": 0,
"lastCorrection": {
"by": "*anon*",
"lastupdated": "2016-09-12T07:08:57Z"
},
"identifier": "https://nla.gov.au/nla.news-article41697877",
"trovePageUrl": "https://trove.nla.gov.au/ndp/del/page/3522839",
"pdf": "https://trove.nla.gov.au/ndp/imageservice/nla.news-page3522839/print"
}
}
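As a rough sketch, the correction details can be pulled out of a record like the one above with a few dictionary lookups. The helper name `summarise_corrections` is hypothetical, and the sample dictionary is a trimmed copy of the record with only the relevant fields.

```python
# A trimmed copy of the article record shown above (only the fields we need)
record = {
    'article': {
        'id': '41697877',
        'heading': 'WRAGGE AND WEATHER CYCLES.',
        'correctionCount': 1,
        'lastCorrection': {'by': '*anon*', 'lastupdated': '2016-09-12T07:08:57Z'},
    }
}

def summarise_corrections(record):
    '''
    Hypothetical helper: return the correction count and the details of the
    last correction from an article record, with defaults if fields are absent.
    '''
    article = record.get('article', {})
    last = article.get('lastCorrection', {})
    return {
        'id': article.get('id'),
        'corrections': int(article.get('correctionCount', 0)),
        'last_corrected_by': last.get('by'),
        'last_corrected': last.get('lastupdated'),
    }

summary = summarise_corrections(record)
print(summary)
```

The defaults are there because this sketch assumes a record without corrections might simply omit `lastCorrection`.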
import requests
import os
import ipywidgets as widgets
from operator import itemgetter # used for sorting
import pandas as pd # makes manipulating the data easier
import altair as alt
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm.auto import tqdm
from IPython.display import display, HTML, FileLink, clear_output
import math
from collections import OrderedDict
import time
# Make sure data directory exists
os.makedirs('data', exist_ok=True)
# Create a session that will automatically retry on server errors
s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('http://', HTTPAdapter(max_retries=retries))
s.mount('https://', HTTPAdapter(max_retries=retries))
# Are you using JupyterLab or the classic Jupyter notebook?
# If you're using the classic notebook, uncomment the following line to make sure Altair charts appear as expected!
# alt.renderers.enable('notebook')
api_key = ''
print('Your API key is: {}'.format(api_key))
Your API key is:
# Basic parameters for Trove API
params = {
'facet': 'year', # Get the data aggregated by year.
'zone': 'newspaper',
'key': api_key,
'encoding': 'json',
'n': 0 # We don't need any records, just the facets!
}
def get_results(params):
    '''
    Get JSON response data from the Trove API.
    Parameters:
        params - dictionary of parameters to send to the API
    Returns:
        JSON formatted response data from Trove API
    '''
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params, timeout=30)
    response.raise_for_status()
    # print(response.url) # This shows us the url that's sent to the API
    data = response.json()
    return data
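To see roughly what ends up in the request URL without calling the API, the parameters can be encoded with the standard library. This is only a sketch: the API key is left out, and `requests` does its own encoding, which may differ in small details from `urlencode`.

```python
from urllib.parse import urlencode

# Example parameters only; the API key is deliberately left out
example_params = {
    'q': 'has:corrections',
    'zone': 'newspaper',
    'facet': 'year',
    'encoding': 'json',
    'n': 0,
}

# Build the query string that would follow the '?' in the request URL
query_string = urlencode(example_params)
print('https://api.trove.nla.gov.au/v2/result?' + query_string)
```

Note that the colon in `has:corrections` is percent-encoded in the resulting query string.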
Let's find out what proportion of newspaper articles have at least one OCR correction.
First we'll get the total number of newspaper articles in Trove.
# Set the q parameter to a single space to get everything
params['q'] = ' '
# Get the data from the API
data = get_results(params)
# Extract the total number of results
total = int(data['response']['zone'][0]['records']['total'])
print('{:,}'.format(total))
225,015,138
Now we'll set the `q` parameter to `has:corrections` to limit the results to newspaper articles that have at least one correction.
# Set the q parameter to 'has:corrections' to limit results to articles with corrections
params['q'] = 'has:corrections'
# Get the data from the API
data = get_results(params)
# Extract the total number of results
corrected = int(data['response']['zone'][0]['records']['total'])
print('{:,}'.format(corrected))
10,852,606
Calculate the proportion of articles with corrections.
print('{:.2%} of articles have at least one correction'.format(corrected/total))
4.82% of articles have at least one correction
You might be thinking that these figures don't seem to match the number of corrections by individuals displayed on the digitised newspapers home page. Remember that these figures show the number of articles that include corrections, while the individual scores show the number of lines corrected by each volunteer.
def get_facets(data):
    '''
    Loop through facets in Trove API response, saving terms and counts.
    Parameters:
        data - JSON formatted response data from Trove API
    Returns:
        A list of dictionaries containing: 'term', 'total_results'
    '''
    facets = []
    try:
        # The facets are buried a fair way down in the results
        # Note that if you ask for more than one facet, you'll have to use the facet['name'] param to find the one you want
        # In this case there's only one facet, so we can just grab the list of terms (which are in fact the results by year)
        for term in data['response']['zone'][0]['facets']['facet']['term']:
            # Get the year and the number of results (converting the count to an integer) before adding to our results
            facets.append({'term': term['search'], 'total_results': int(term['count'])})
        # Sort facets by year
        facets.sort(key=itemgetter('term'))
    except TypeError:
        # No facets in the response, so return an empty list
        pass
    return facets
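If more than one facet is requested, the terms for a particular facet have to be looked up by name. The sketch below shows one way that might be done. It assumes that `facet` is a single dictionary when one facet is requested and a list of dictionaries (each with a `name`) when there are several; the helper `find_facet` and the sample response are made up for illustration.

```python
def find_facet(data, name):
    '''
    Hypothetical helper: return the list of terms for the facet called 'name'.
    Assumes 'facet' is a dict for a single facet and a list of dicts
    (each with a 'name' key) when several facets are requested.
    '''
    facet = data['response']['zone'][0]['facets']['facet']
    # Wrap a single facet in a list so both cases are handled the same way
    candidates = facet if isinstance(facet, list) else [facet]
    for candidate in candidates:
        if candidate.get('name') == name:
            return candidate['term']
    # Nothing matched, so return an empty list
    return []

# Made-up response fragment, for illustration only
sample = {'response': {'zone': [{'facets': {'facet': [
    {'name': 'year', 'term': [{'search': '1803', 'count': '526'}]},
    {'name': 'category', 'term': [{'search': 'Article', 'count': '8216008'}]},
]}}]}}

category_terms = find_facet(sample, 'category')
print(category_terms)
```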
def get_facet_data(params, start_decade=180, end_decade=201):
    '''
    Loop through the decades from 'start_decade' to 'end_decade',
    getting the number of search results for each year from the year facet.
    Combine all the results into a single list.
    Parameters:
        params - parameters to send to the API
        start_decade
        end_decade
    Returns:
        A list of dictionaries containing 'term' (the year) and 'total_results' for the complete
        period between the start and end decades.
    '''
    # Create a list to hold the facets data
    facet_data = []
    # Loop through the decades
    for decade in tqdm(range(start_decade, end_decade + 1)):
        # Avoid confusion by copying the params before we change anything.
        search_params = params.copy()
        # Add decade value to params
        search_params['l-decade'] = decade
        # Get the data from the API
        data = get_results(search_params)
        # Get the facets from the data and add to facet_data
        facet_data += get_facets(data)
    # Remove the progress bar (you can also set leave=False in tqdm, but that still leaves white space in Jupyter Lab)
    clear_output()
    return facet_data
facet_data = get_facet_data(params)
# Convert our data to a dataframe called df
df = pd.DataFrame(facet_data)
df.head()
index | term | total_results |
---|---|---|
0 | 1803 | 526 |
1 | 1804 | 619 |
2 | 1805 | 430 |
3 | 1806 | 367 |
4 | 1807 | 134 |
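The decade values passed to `l-decade` are the first three digits of the year, so 180 covers 1800 to 1809 and 201 covers 2010 to 2019. A small sketch makes the mapping explicit; `decade_to_years` is a hypothetical helper, not part of the API.

```python
def decade_to_years(decade):
    '''
    Hypothetical helper: list the years covered by a decade value
    such as 180 (meaning the 1800s).
    '''
    start = decade * 10
    return list(range(start, start + 10))

# First and last years covered by the default start and end decades
print(decade_to_years(180)[0], decade_to_years(180)[-1])
print(decade_to_years(201)[0], decade_to_years(201)[-1])

# Number of API requests needed to cover the default range, one per decade
n_requests = len(range(180, 201 + 1))
print(n_requests)
```

One request per decade keeps the harvest small: the default range needs only a couple of dozen calls to cover more than two centuries.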
# Reset the 'q' parameter
# Use an empty search (a single space) to get ALL THE ARTICLES
params['q'] = ' '
# Get facet data for all articles
all_facet_data = get_facet_data(params)
# Convert the results to a dataframe
df_total = pd.DataFrame(all_facet_data)
def merge_df_with_total(df, df_total, how='left'):
    '''
    Merge dataframes containing search results with the total number of articles for each term.
    By default this is a left join on the 'term' column (a year, category, or newspaper id).
    The total number of articles will be added as a column to the existing results.
    Once merged, do some reorganisation and calculate the proportion of search results.
    Parameters:
        df - the search results in a dataframe
        df_total - total number of articles per term in a dataframe
        how - the type of join passed to pd.merge (default 'left')
    Returns:
        A dataframe with the following columns - 'term', 'total_results', 'total_articles', 'proportion'
        (plus any other columns that are in the search results dataframe).
    '''
    # Merge the two dataframes on the 'term' column
    df_merged = pd.merge(df, df_total, how=how, on='term')
    # Rename the columns for convenience
    df_merged.rename({'total_results_y': 'total_articles'}, inplace=True, axis='columns')
    df_merged.rename({'total_results_x': 'total_results'}, inplace=True, axis='columns')
    # Set blank values to zero to avoid problems
    df_merged['total_results'] = df_merged['total_results'].fillna(0).astype(int)
    # Calculate proportion by dividing the search results by the total articles
    df_merged['proportion'] = df_merged['total_results'] / df_merged['total_articles']
    return df_merged
# Merge the search results with the total articles
df_merged = merge_df_with_total(df, df_total)
df_merged.head()
index | term | total_results | total_articles | proportion |
---|---|---|---|---|
0 | 1803 | 526 | 526 | 1.0 |
1 | 1804 | 619 | 619 | 1.0 |
2 | 1805 | 430 | 430 | 1.0 |
3 | 1806 | 367 | 367 | 1.0 |
4 | 1807 | 134 | 134 | 1.0 |
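To make the merge logic concrete, here is a minimal plain-Python sketch of the same join-and-divide idea, using toy numbers rather than Trove data. It keeps every term that appears in the totals and treats missing search results as zero, which corresponds to calling `merge_df_with_total` with `how='right'` rather than the default left join. The function name `proportions` is made up for this example.

```python
# Toy data: articles with corrections and total articles, keyed by year
corrected_by_year = {'1803': 50, '1804': 30}
total_by_year = {'1803': 200, '1804': 120, '1805': 80}

def proportions(results, totals):
    '''
    For every term in totals, divide the matching result count by the total.
    Terms missing from results count as zero (like fillna(0) after a right join).
    '''
    rows = []
    for term, total_articles in sorted(totals.items()):
        total_results = results.get(term, 0)
        rows.append({
            'term': term,
            'total_results': total_results,
            'total_articles': total_articles,
            'proportion': total_results / total_articles,
        })
    return rows

rows = proportions(corrected_by_year, total_by_year)
for row in rows:
    print(row)
```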
chart1 = alt.Chart(df).mark_line(point=True).encode(
    x=alt.X('term:Q', axis=alt.Axis(format='c', title='Year')),
    y=alt.Y('total_results:Q', axis=alt.Axis(format=',d', title='Number of articles')),
    tooltip=[alt.Tooltip('term:Q', title='Year'), alt.Tooltip('total_results:Q', title='Articles', format=',')]
).properties(width=700, height=250)
chart2 = alt.Chart(df_merged).mark_line(point=True).encode(
    x=alt.X('term:Q', axis=alt.Axis(format='c', title='Year')),
    # This time we're showing the proportion (formatted as a percentage) on the Y axis
    y=alt.Y('proportion:Q', axis=alt.Axis(format='%', title='Proportion of articles')),
    tooltip=[alt.Tooltip('term:Q', title='Year'), alt.Tooltip('proportion:Q', title='Proportion', format='%')],
    # Make the charts different colors
    color=alt.value('orange')
).properties(width=700, height=250)
# This is a shorthand way of stacking the charts on top of each other
chart1 & chart2
params['q'] = 'has:corrections'
params['facet'] = 'category'
data = get_results(params)
facets = []
for term in data['response']['zone'][0]['facets']['facet']['term']:
    # Get the category and the number of results (converting the count to an integer) before adding to our results
    facets.append({'term': term['search'], 'total_results': int(term['count'])})
df_categories = pd.DataFrame(facets)
df_categories.head()
index | term | total_results |
---|---|---|
0 | Article | 8216008 |
1 | Family Notices | 1227802 |
2 | Advertising | 1029204 |
3 | Detailed Lists, Results, Guides | 380902 |
4 | Literature | 5659 |
params['q'] = ' '
data = get_results(params)
facets = []
for term in data['response']['zone'][0]['facets']['facet']['term']:
    # Get the category and the number of results (converting the count to an integer) before adding to our results
    facets.append({'term': term['search'], 'total_results': int(term['count'])})
df_total_categories = pd.DataFrame(facets)
df_categories_merged = merge_df_with_total(df_categories, df_total_categories)
df_categories_merged
index | term | total_results | total_articles | proportion |
---|---|---|---|---|
0 | Article | 8216008 | 156930606 | 0.052354 |
1 | Family Notices | 1227802 | 1845548 | 0.665278 |
2 | Advertising | 1029204 | 40643498 | 0.025323 |
3 | Detailed Lists, Results, Guides | 380902 | 25589737 | 0.014885 |
4 | Literature | 5659 | 25559 | 0.221409 |
5 | Obituaries | 4507 | 4737 | 0.951446 |
6 | Humour | 3654 | 15508 | 0.235620 |
7 | Law, Courts, And Crime | 2983 | 4105 | 0.726675 |
8 | News | 2726 | 3541 | 0.769839 |
9 | Sport And Games | 2224 | 4038 | 0.550768 |
10 | Letters | 1187 | 6368 | 0.186401 |
11 | Arts And Culture | 929 | 1260 | 0.737302 |
12 | Puzzles | 862 | 7258 | 0.118766 |
13 | Official Appointments And Notices | 390 | 395 | 0.987342 |
14 | Classified Advertisements And Notices | 326 | 341 | 0.956012 |
15 | Editorial | 262 | 3032 | 0.086412 |
16 | Shipping Notices | 238 | 328 | 0.725610 |
17 | Reviews | 204 | 286 | 0.713287 |
18 | Display Advertisement | 202 | 211 | 0.957346 |
19 | Weather | 202 | 399 | 0.506266 |
20 | Commerce And Business | 176 | 184 | 0.956522 |
df_categories_filtered = df_categories_merged.loc[df_categories_merged['total_articles'] > 5000]
df_categories_filtered
index | term | total_results | total_articles | proportion |
---|---|---|---|---|
0 | Article | 8216008 | 156930606 | 0.052354 |
1 | Family Notices | 1227802 | 1845548 | 0.665278 |
2 | Advertising | 1029204 | 40643498 | 0.025323 |
3 | Detailed Lists, Results, Guides | 380902 | 25589737 | 0.014885 |
4 | Literature | 5659 | 25559 | 0.221409 |
6 | Humour | 3654 | 15508 | 0.235620 |
10 | Letters | 1187 | 6368 | 0.186401 |
12 | Puzzles | 862 | 7258 | 0.118766 |
cat_chart1 = alt.Chart(df_categories_filtered).mark_bar().encode(
    x=alt.X('term:N', title='Category'),
    y=alt.Y('total_results:Q', title='Articles with corrections')
)
cat_chart2 = alt.Chart(df_categories_filtered).mark_bar().encode(
    x=alt.X('term:N', title='Category'),
    y=alt.Y('proportion:Q', axis=alt.Axis(format='%', title='Proportion of articles with corrections')),
)
# Display the two charts side by side
cat_chart1 | cat_chart2
params['q'] = 'has:corrections'
params['facet'] = 'title'
data = get_results(params)
facets = []
for term in data['response']['zone'][0]['facets']['facet']['term']:
    # Get the newspaper id and the number of results (converting the count to an integer) before adding to our results
    facets.append({'term': term['search'], 'total_results': int(term['count'])})
df_newspapers = pd.DataFrame(facets)
df_newspapers.head()
index | term | total_results |
---|---|---|
0 | 35 | 741002 |
1 | 13 | 719478 |
2 | 16 | 307910 |
3 | 11 | 287229 |
4 | 30 | 268359 |
Get the total number of articles per newspaper.
params['q'] = ' '
data = get_results(params)
facets = []
for term in data['response']['zone'][0]['facets']['facet']['term']:
    # Get the newspaper id and the number of results (converting the count to an integer) before adding to our results
    facets.append({'term': term['search'], 'total_results': int(term['count'])})
df_newspapers_total = pd.DataFrame(facets)
df_newspapers_merged = merge_df_with_total(df_newspapers, df_newspapers_total, how='right')
df_newspapers_merged.sort_values(by='proportion', ascending=False, inplace=True)
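df_newspapers_merged.rename(columns={'term': 'id'}, inplace=True)
# Facet terms may come back as strings, so make sure the ids are integers to match df_titles
df_newspapers_merged['id'] = df_newspapers_merged['id'].astype(int)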
df_newspapers_merged.head()
index | id | total_results | total_articles | proportion |
---|---|---|---|---|
1272 | 1047 | 38 | 38 | 1.0 |
1233 | 118 | 56 | 56 | 1.0 |
1340 | 201 | 17 | 17 | 1.0 |
875 | 1142 | 437 | 437 | 1.0 |
1317 | 1042 | 21 | 21 | 1.0 |
Let's get the names of the newspapers.
title_params = {
'key': api_key,
'encoding': 'json',
}
title_data = s.get('https://api.trove.nla.gov.au/v2/newspaper/titles', params=title_params).json()
titles = []
for newspaper in title_data['response']['records']['newspaper']:
    titles.append({'title': newspaper['title'], 'id': int(newspaper['id'])})
df_titles = pd.DataFrame(titles)
df_titles.head()
index | id | title |
---|---|---|
0 | 166 | Canberra Community News (ACT : 1925 - 1927) |
1 | 165 | Canberra Illustrated: A Quarterly Magazine (AC... |
2 | 69 | Federal Capital Pioneer (Canberra, ACT : 1924 ... |
3 | 871 | Good Neighbour (ACT : 1950 - 1969) |
4 | 665 | Student Notes/Canberra University College Stud... |
df_titles.shape
(1465, 2)
Exclude the government gazettes.
gazette_data = s.get('https://api.trove.nla.gov.au/v2/gazette/titles', params=title_params).json()
gazettes = []
for gaz in gazette_data['response']['records']['newspaper']:
    gazettes.append({'title': gaz['title'], 'id': int(gaz['id'])})
df_gazettes = pd.DataFrame(gazettes)
df_gazettes.shape
(37, 2)
df_titles_not_gazettes = df_titles[~df_titles['id'].isin(df_gazettes['id'])]
df_newspapers_with_titles = pd.merge(df_titles_not_gazettes, df_newspapers_merged, how='left', on='id').fillna(0).sort_values(by='proportion', ascending=False)
# Convert the totals back to integers
df_newspapers_with_titles[['total_results', 'total_articles']] = df_newspapers_with_titles[['total_results', 'total_articles']].astype(int)
Most corrected.
df_newspapers_with_titles[:25]
index | id | title | total_results | total_articles | proportion |
---|---|---|---|---|---|
1331 | 1142 | Swan River Guardian (WA : 1836 - 1838) | 437 | 437 | 1.000000 |
802 | 1046 | The Derwent Star and Van Diemen's Land Intelli... | 12 | 12 | 1.000000 |
497 | 118 | Moonta Herald and Northern Territory Gazette (... | 56 | 56 | 1.000000 |
197 | 1042 | Society (Sydney, NSW : 1887) | 21 | 21 | 1.000000 |
430 | 729 | The Temora Telegraph and Mining Advocate (NSW ... | 3 | 3 | 1.000000 |
177 | 1000 | Party (Sydney, NSW : 1942) | 6 | 6 | 1.000000 |
838 | 1047 | The Van Diemen's Land Gazette and General Adve... | 38 | 38 | 1.000000 |
913 | 201 | Elsternwick Leader and East Brighton, ... (Vic... | 17 | 17 | 1.000000 |
258 | 686 | The Branxton Advocate: Greta and Rothbury Reco... | 53 | 53 | 1.000000 |
17 | 51 | The Australian Abo Call (National : 1938) | 78 | 78 | 1.000000 |
387 | 1028 | The Satirist and Sporting Chronicle (Sydney, N... | 286 | 286 | 1.000000 |
691 | 314 | Suedaustralische Zeitung (Adelaide, SA : 1850 ... | 47 | 47 | 1.000000 |
808 | 4 | The Hobart Town Gazette and Southern Reporter ... | 1910 | 1923 | 0.993240 |
2 | 69 | Federal Capital Pioneer (Canberra, ACT : 1924 ... | 541 | 545 | 0.992661 |
1155 | 935 | The Melbourne Advertiser (Vic. : 1838) | 120 | 121 | 0.991736 |
677 | 40 | South Australian Gazette and Colonial Register... | 1048 | 1065 | 0.984038 |
767 | 5 | Hobart Town Gazette and Van Diemen's Land Adve... | 1530 | 1556 | 0.983290 |
133 | 624 | Intelligence (Bowral, NSW : 1884) | 117 | 119 | 0.983193 |
1425 | 1131 | York Advocate (WA : 1915) | 236 | 241 | 0.979253 |
136 | 885 | Justice (Narrabri, NSW : 1891) | 44 | 45 | 0.977778 |
359 | 513 | The Newcastle Argus and District Advertiser (N... | 29 | 30 | 0.966667 |
10 | 1283 | Berita Repoeblik (Djakarta, Indonesia : 1945 -... | 471 | 498 | 0.945783 |
381 | 917 | The Record of the Blue Mountains (Katoomba, NS... | 104 | 112 | 0.928571 |
780 | 273 | Tasmanian and Port Dalrymple Advertiser (Launc... | 178 | 193 | 0.922280 |
833 | 1247 | The Tasmanian Telegraph (Hobart Town, Tas. : 1... | 1529 | 1660 | 0.921084 |
Least corrected.
df_newspapers_with_titles[-25:]
index | id | title | total_results | total_articles | proportion |
---|---|---|---|---|---|
235 | 1500 | The Bananacoast Opinion (Coffs Harbour, NSW : ... | 60 | 29322 | 0.002046 |
1396 | 1381 | The Voice of Freedom = Elefthera Phoni (Perth,... | 1 | 511 | 0.001957 |
1140 | 1502 | The Express, Melton (Vic. 1943 - 1954) | 68 | 35201 | 0.001932 |
1407 | 1433 | The Wyalkatchem Wheatsheaf and Bencubbin Banne... | 25 | 13265 | 0.001885 |
1303 | 1446 | Narembeen Observer (WA : 1928 - 1933) | 42 | 23107 | 0.001818 |
1216 | 1395 | Ballidu-Wongan Budget (WA : 1927 - 1941) | 114 | 63251 | 0.001802 |
1263 | 1417 | Gnowangerup Star (WA : 1941 - 1954) | 27 | 15774 | 0.001712 |
49 | 1492 | Border Morning Mail (Albury, NSW : 1938 - 1943) | 442 | 268927 | 0.001644 |
1218 | 1397 | Bonnie Rock-Lake Brown-Mukinbudin Leader (WA :... | 17 | 12885 | 0.001319 |
591 | 1481 | Toowoomba Chronicle and Darling Downs Gazette ... | 786 | 611636 | 0.001285 |
1328 | 1380 | Stampa Italiana = The Italian Press (Perth, WA... | 3 | 2493 | 0.001203 |
1389 | 1437 | The Southern Cross Herald (WA : 1894 - 1896) | 3 | 2653 | 0.001131 |
174 | 1501 | Nota (Hawks Nest and Tea Gardens, NSW : 1970 -... | 17 | 15533 | 0.001094 |
1284 | 1455 | Kookynie Advocate and Northern Goldfields News... | 2 | 2253 | 0.000888 |
1297 | 1449 | Morning Post (Geraldton, WA : 1895 - 1896) | 1 | 1779 | 0.000562 |
1223 | 1399 | Bullfinch Budget (WA : 1910 - 1911) | 0 | 1008 | 0.000000 |
261 | 1299 | The Burrangong Chronicle and Young and Lachlan... | 0 | 25 | 0.000000 |
1283 | 1456 | Kondinin Wheatlander and Kulin and Karlgarin G... | 0 | 1267 | 0.000000 |
1079 | 1043 | Seamen's Strike Bulletin (Melbourne, Vic. : 1919) | 0 | 14 | 0.000000 |
1245 | 1409 | Democrat (Perth, WA : 1904 - 1905) | 0 | 1901 | 0.000000 |
1241 | 1406 | Daily Advertiser (Geraldton, WA : 1890 - 1893) | 0 | 1675 | 0.000000 |
1233 | 1401 | Central Districts Advertiser and Agriculture a... | 0 | 2875 | 0.000000 |
1224 | 1460 | Bullfinch Miner and Yilgarn Advocate (WA : 1910) | 0 | 27 | 0.000000 |
769 | 1262 | Hobart Town Punch (Hobart Town, Tas. : 1878) | 0 | 32 | 0.000000 |
1427 | 1431 | York Gazette and Quairading and Dangin Herald ... | 0 | 2268 | 0.000000 |
df_newspapers_with_titles_csv = df_newspapers_with_titles.copy()
df_newspapers_with_titles_csv.rename({'total_results': 'articles_with_corrections'}, axis=1, inplace=True)
df_newspapers_with_titles_csv['percentage_with_corrections'] = df_newspapers_with_titles_csv['proportion'] * 100
df_newspapers_with_titles_csv.sort_values(by=['percentage_with_corrections'], inplace=True)
df_newspapers_with_titles_csv[['id', 'title', 'articles_with_corrections', 'total_articles', 'percentage_with_corrections']].to_csv('titles_corrected.csv', index=False)
#df_newspapers_with_titles.to_csv('titles_corrected.csv', index=False)
display(FileLink('titles_corrected.csv'))
Let's see if we can combine some guesses about OCR error rates with the correction data to find the newspapers most in need of help.
To make a guesstimate of error rates, we'll use the occurrence of 'tbe', a common OCR error for 'the'. I don't know how valid this is, but it's a place to start.
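As a toy illustration of the idea, the sketch below counts whole-word matches of 'tbe' in a made-up string. It is not how Trove's search works: the API query reports the number of articles containing the term, not the number of occurrences, and the helper `count_tbe` is hypothetical.

```python
import re

def count_tbe(text):
    '''
    Hypothetical helper: count whole-word occurrences of 'tbe', ignoring case.
    '''
    return len(re.findall(r'\btbe\b', text, flags=re.IGNORECASE))

# Made-up sample text with two OCR-style errors
sample_text = 'Tbe weather in tbe west was fine, said the paper.'
print(count_tbe(sample_text))
```

The word boundaries matter here, because without them the pattern would also match inside longer words that happen to contain the letters 'tbe'.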
params['q'] = 'text:"tbe"~0'
params['facet'] = 'title'
data = get_results(params)
facets = []
for term in data['response']['zone'][0]['facets']['facet']['term']:
    # Get the newspaper id and the number of results (converting the count to an integer) before adding to our results
    facets.append({'term': term['search'], 'total_results': int(term['count'])})
df_errors = pd.DataFrame(facets)
df_errors_merged = merge_df_with_total(df_errors, df_newspapers_total, how='right')
df_errors_merged.sort_values(by='proportion', ascending=False, inplace=True)
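df_errors_merged.rename(columns={'term': 'id'}, inplace=True)
# Facet terms may come back as strings, so make sure the ids are integers to match df_titles
df_errors_merged['id'] = df_errors_merged['id'].astype(int)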
df_errors_merged.head()
index | id | total_results | total_articles | proportion |
---|---|---|---|---|
544 | 1316 | 2013 | 2954 | 0.681449 |
343 | 758 | 5271 | 8078 | 0.652513 |
245 | 927 | 9557 | 17227 | 0.554769 |
295 | 382 | 6980 | 12744 | 0.547709 |
314 | 262 | 6313 | 11527 | 0.547671 |
df_errors_with_titles = pd.merge(df_titles_not_gazettes, df_errors_merged, how='left', on='id').fillna(0).sort_values(by='proportion', ascending=False)
df_errors_with_titles[:25]
index | id | title | total_results | total_articles | proportion |
---|---|---|---|---|---|
449 | 1316 | The Weekly Advance (Granville, NSW : 1892 - 1893) | 2013 | 2954 | 0.681449 |
910 | 758 | Dunolly and Betbetshire Express and County of ... | 5271 | 8078 | 0.652513 |
953 | 927 | Hamilton Spectator and Grange District Adverti... | 9557 | 17227 | 0.554769 |
478 | 382 | Wagga Wagga Express and Murrumbidgee District ... | 6980 | 12744 | 0.547709 |
574 | 262 | The North Australian, Ipswich and General Adve... | 6313 | 11527 | 0.547671 |
573 | 264 | The North Australian (Brisbane, Qld. : 1863 - ... | 2894 | 5314 | 0.544599 |
788 | 1250 | Telegraph (Hobart Town, Tas. : 1867) | 75 | 140 | 0.535714 |
778 | 1242 | Morning Star and Commercial Advertiser (Hobart... | 891 | 1703 | 0.523194 |
315 | 725 | The Hay Standard and Advertiser for Balranald,... | 21737 | 42068 | 0.516711 |
191 | 530 | Robertson Advocate (NSW : 1894 - 1923) | 37158 | 72383 | 0.513353 |
217 | 728 | Temora Herald and Mining Journal (NSW : 1882 -... | 641 | 1253 | 0.511572 |
211 | 697 | Sydney Mail (NSW : 1860 - 1871) | 24805 | 48535 | 0.511074 |
782 | 865 | Tasmanian Morning Herald (Hobart, Tas. : 1865 ... | 4876 | 9559 | 0.510095 |
1050 | 1138 | Port Phillip Gazette and Settler's Journal (Vi... | 6158 | 12127 | 0.507793 |
154 | 424 | Molong Argus (NSW : 1896 - 1921) | 52258 | 104984 | 0.497771 |
1049 | 1139 | Port Phillip Gazette (Vic. : 1851) | 241 | 491 | 0.490835 |
283 | 724 | The Cumberland Free Press (Parramatta, NSW : 1... | 6466 | 13247 | 0.488111 |
604 | 986 | Adelaide Chronicle and South Australian Litera... | 938 | 1937 | 0.484254 |
524 | 850 | Logan Witness (Beenleigh, Qld. : 1878 - 1893) | 7050 | 14654 | 0.481097 |
841 | 869 | Trumpeter General (Hobart, Tas. : 1833 - 1834) | 711 | 1482 | 0.479757 |
566 | 257 | The Darling Downs Gazette and General Advertis... | 30339 | 65268 | 0.464837 |
799 | 170 | The Cornwall Chronicle (Launceston, Tas. : 183... | 73080 | 163791 | 0.446178 |
891 | 847 | Chronicle, South Yarra Gazette, Toorak Times a... | 1648 | 3720 | 0.443011 |
816 | 1251 | The Mount Lyell Standard and Strahan Gazette (... | 36532 | 83363 | 0.438228 |
1424 | 1009 | Yilgarn Merredin Times (Southern Cross, WA : 1... | 1534 | 3508 | 0.437286 |
df_errors_with_titles[-25:]
index | id | title | total_results | total_articles | proportion |
---|---|---|---|---|---|
1224 | 1460 | Bullfinch Miner and Yilgarn Advocate (WA : 1910) | 0 | 27 | 0.0 |
691 | 314 | Suedaustralische Zeitung (Adelaide, SA : 1850 ... | 0 | 47 | 0.0 |
697 | 1305 | The Citizen (Port Adelaide, SA : 1938-1940) | 0 | 1284 | 0.0 |
197 | 1042 | Society (Sydney, NSW : 1887) | 0 | 21 | 0.0 |
930 | 233 | Frankston Standard (Frankston, Vic. : 1949) | 0 | 1997 | 0.0 |
1330 | 1379 | Sunday Times Edizione Italiana (Perth, WA : 19... | 0 | 455 | 0.0 |
63 | 996 | Citizen Soldier (Sydney, NSW : 1942) | 0 | 60 | 0.0 |
913 | 201 | Elsternwick Leader and East Brighton, ... (Vic... | 0 | 17 | 0.0 |
656 | 1478 | Port Augusta and Stirling Illustrated News (SA... | 0 | 125 | 0.0 |
1155 | 935 | The Melbourne Advertiser (Vic. : 1838) | 0 | 121 | 0.0 |
1383 | 1201 | The Possum (Fremantle, WA : 1890) | 0 | 105 | 0.0 |
802 | 1046 | The Derwent Star and Van Diemen's Land Intelli... | 0 | 12 | 0.0 |
381 | 917 | The Record of the Blue Mountains (Katoomba, NS... | 0 | 112 | 0.0 |
407 | 1033 | The Sun and New South Wales Independent Press ... | 0 | 405 | 0.0 |
209 | 696 | Sydney General Trade List, Mercantile Chronicl... | 0 | 22 | 0.0 |
44 | 802 | Blayney West Macquarie (NSW : 1949) | 0 | 110 | 0.0 |
1188 | 799 | Vigilante (Melbourne, Vic. : 1918) | 0 | 302 | 0.0 |
1 | 165 | Canberra Illustrated: A Quarterly Magazine (AC... | 0 | 57 | 0.0 |
1135 | 200 | The Elsternwick Leader and Caulfield and Balac... | 0 | 47 | 0.0 |
716 | 719 | The Port Adelaide Post Shipping Gazette, Farme... | 0 | 18 | 0.0 |
29 | 1320 | Auburn and District News (NSW : 1929) | 0 | 25 | 0.0 |
1286 | 1388 | La Rondine (Perth, WA : 1969 - 1994) | 0 | 1383 | 0.0 |
1396 | 1381 | The Voice of Freedom = Elefthera Phoni (Perth,... | 0 | 511 | 0.0 |
1234 | 1383 | Chung Wah News (Perth, WA : 1981 - 1987) | 0 | 860 | 0.0 |
64 | 104 | Clarence and Richmond Examiner (Grafton, NSW :... | 0 | 111 | 0.0 |
corrections_errors_merged_df = pd.merge(df_newspapers_with_titles, df_errors_with_titles, how='left', on='id')
corrections_errors_merged_df.head()
index | id | title_x | total_results_x | total_articles_x | proportion_x | title_y | total_results_y | total_articles_y | proportion_y |
---|---|---|---|---|---|---|---|---|---|
0 | 1142 | Swan River Guardian (WA : 1836 - 1838) | 437 | 437 | 1.0 | Swan River Guardian (WA : 1836 - 1838) | 32 | 437 | 0.073227 |
1 | 1046 | The Derwent Star and Van Diemen's Land Intelli... | 12 | 12 | 1.0 | The Derwent Star and Van Diemen's Land Intelli... | 0 | 12 | 0.000000 |
2 | 118 | Moonta Herald and Northern Territory Gazette (... | 56 | 56 | 1.0 | Moonta Herald and Northern Territory Gazette (... | 0 | 56 | 0.000000 |
3 | 1042 | Society (Sydney, NSW : 1887) | 21 | 21 | 1.0 | Society (Sydney, NSW : 1887) | 0 | 21 | 0.000000 |
4 | 729 | The Temora Telegraph and Mining Advocate (NSW ... | 3 | 3 | 1.0 | The Temora Telegraph and Mining Advocate (NSW ... | 0 | 3 | 0.000000 |
corrections_errors_merged_df['proportion_uncorrected'] = 1 - corrections_errors_merged_df['proportion_x']
corrections_errors_merged_df.rename(columns={'title_x': 'title', 'proportion_x': 'proportion_corrected', 'proportion_y': 'proportion_with_errors'}, inplace=True)
corrections_errors_merged_df.sort_values(by=['proportion_with_errors', 'proportion_uncorrected'], ascending=False, inplace=True)
corrections_errors_merged_df[['title', 'proportion_with_errors', 'proportion_uncorrected']][:25]
index | title | proportion_with_errors | proportion_uncorrected |
---|---|---|---|
1294 | The Weekly Advance (Granville, NSW : 1892 - 1893) | 0.681449 | 0.989506 |
552 | Dunolly and Betbetshire Express and County of ... | 0.652513 | 0.942065 |
419 | Hamilton Spectator and Grange District Adverti... | 0.554769 | 0.921809 |
394 | Wagga Wagga Express and Murrumbidgee District ... | 0.547709 | 0.917687 |
149 | The North Australian, Ipswich and General Adve... | 0.547671 | 0.792053 |
214 | The North Australian (Brisbane, Qld. : 1863 - ... | 0.544599 | 0.855852 |
460 | Telegraph (Hobart Town, Tas. : 1867) | 0.535714 | 0.928571 |
150 | Morning Star and Commercial Advertiser (Hobart... | 0.523194 | 0.792132 |
942 | The Hay Standard and Advertiser for Balranald,... | 0.516711 | 0.970500 |
760 | Robertson Advocate (NSW : 1894 - 1923) | 0.513353 | 0.959977 |
561 | Temora Herald and Mining Journal (NSW : 1882 -... | 0.511572 | 0.942538 |
325 | Sydney Mail (NSW : 1860 - 1871) | 0.511074 | 0.900546 |
448 | Tasmanian Morning Herald (Hobart, Tas. : 1865 ... | 0.510095 | 0.926038 |
212 | Port Phillip Gazette and Settler's Journal (Vi... | 0.507793 | 0.855117 |
569 | Molong Argus (NSW : 1896 - 1921) | 0.497771 | 0.943344 |
213 | Port Phillip Gazette (Vic. : 1851) | 0.490835 | 0.855397 |
389 | The Cumberland Free Press (Parramatta, NSW : 1... | 0.488111 | 0.916434 |
117 | Adelaide Chronicle and South Australian Litera... | 0.484254 | 0.744450 |
283 | Logan Witness (Beenleigh, Qld. : 1878 - 1893) | 0.481097 | 0.890542 |
109 | Trumpeter General (Hobart, Tas. : 1833 - 1834) | 0.479757 | 0.730094 |
243 | The Darling Downs Gazette and General Advertis... | 0.464837 | 0.872234 |
177 | The Cornwall Chronicle (Launceston, Tas. : 183... | 0.446178 | 0.838575 |
998 | Chronicle, South Yarra Gazette, Toorak Times a... | 0.443011 | 0.973656 |
1307 | The Mount Lyell Standard and Strahan Gazette (... | 0.438228 | 0.990307 |
979 | Yilgarn Merredin Times (Southern Cross, WA : 1... | 0.437286 | 0.972349 |
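Sorting on two columns means the second column only breaks ties in the first, so this ranking is driven almost entirely by the 'tbe' proportion. The plain-Python sketch below reproduces that ordering for three rows taken from the tables above (the uncorrected value for the Swan River Guardian is derived as 1 minus its corrected proportion of 1.0).

```python
# Three rows from the results above, in no particular order
rows = [
    {'title': 'Swan River Guardian',
     'proportion_with_errors': 0.073227, 'proportion_uncorrected': 0.0},
    {'title': 'The Mount Lyell Standard and Strahan Gazette',
     'proportion_with_errors': 0.438228, 'proportion_uncorrected': 0.990307},
    {'title': 'The Weekly Advance',
     'proportion_with_errors': 0.681449, 'proportion_uncorrected': 0.989506},
]

# Sort by errors first, then by uncorrected proportion, both descending
ranked = sorted(
    rows,
    key=lambda r: (r['proportion_with_errors'], r['proportion_uncorrected']),
    reverse=True,
)

for row in ranked:
    print(row['title'])
```

Note that The Mount Lyell Standard has a slightly higher uncorrected proportion than The Weekly Advance, but it still ranks lower because the error proportion is compared first.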