This notebook analyses the results of running the column detection script across all of the Stock Exchange images on CloudStor.
The raw results are in CSV files, one for each year. See this notebook for more details.
See this notebook for some visualisations of this data.
import pandas as pd
import os
# We're going to combine all of the CSV files into one big dataframe
# Collect one dataframe per year in a list
year_dfs = []
# Loop through the range of years
for year in range(1901, 1951):
    # Open the CSV file for that year as a dataframe
    year_df = pd.read_csv('{}.csv'.format(year))
    # Add the single year df to the list
    year_dfs.append(year_df)
# Concatenate once at the end (DataFrame.append was removed in pandas 2.0)
# Each file keeps its own row numbers, so index values can repeat across years
combined_df = pd.concat(year_dfs)
# How many images do we have data for?
combined_df.shape
(72932, 11)
# Have a look inside
combined_df.head()
| | directory | name | path | referenceCode | startDate | endDate | year | width | height | columns | column_positions |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | AU NBAC N193-001/ | N193-001_0001.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-001 | 1901-01-01 | 1901-03-01 | 1901 | 6237 | 5000 | 3 | 0,1811,3222 |
1 | AU NBAC N193-001/ | N193-001_0002.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-001 | 1901-01-01 | 1901-03-01 | 1901 | 6266 | 5000 | 3 | 205,1840,3259 |
2 | AU NBAC N193-001/ | N193-001_0003.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-001 | 1901-01-01 | 1901-03-01 | 1901 | 6237 | 5000 | 2 | 286,2068 |
3 | AU NBAC N193-001/ | N193-001_0004.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-001 | 1901-01-01 | 1901-03-01 | 1901 | 6236 | 5000 | 3 | 9,1821,3219 |
4 | AU NBAC N193-001/ | N193-001_0005.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-001 | 1901-01-01 | 1901-03-01 | 1901 | 6236 | 5000 | 3 | 288,1821,3220 |
combined_df['columns'].value_counts()
3    41076
4    26917
2     4825
1       19
0        6
Name: columns, dtype: int64
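As a quick check, the counts above can be tallied to see how many pages have fewer than two detected columns. This is a minimal sketch that copies the reported figures into a plain dictionary (`column_counts` and `too_few` are names introduced here for illustration); with the full dataframe loaded, the same figure would come from filtering `combined_df` directly.

```python
# Counts copied from the value_counts() output above
column_counts = {3: 41076, 4: 26917, 2: 4825, 1: 19, 0: 6}

# Pages where the detector found fewer than two columns
too_few = sum(n for cols, n in column_counts.items() if cols < 2)

# Share of the listed pages that this represents, as a percentage
listed = sum(column_counts.values())
share = 100 * too_few / listed

print(too_few)
print('{:.3f}%'.format(share))
```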
combined_df.loc[combined_df['width'] == 0]
| | directory | name | path | referenceCode | startDate | endDate | year | width | height | columns | column_positions |
---|---|---|---|---|---|---|---|---|---|---|---|
677 | AU NBAC N193-055/ | N193-055_0037.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-055 | 1914-07-01 | 1914-09-01 | 1914 | 0 | 0 | 0 | NaN |
1051 | AU NBAC N193-064/ | N193-064_0078.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-064 | 1916-10-01 | 1916-12-01 | 1916 | 0 | 0 | 0 | NaN |
44 | AU NBAC N193-173/ | N193-173_0045.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 0 | 0 | 0 | NaN |
50 | AU NBAC N193-173/ | N193-173_0051.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 0 | 0 | 0 | NaN |
52 | AU NBAC N193-173/ | N193-173_0053.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 0 | 0 | 0 | NaN |
65 | AU NBAC N193-173/ | N193-173_0066.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 0 | 0 | 0 | NaN |
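The index values in the table above (677, 1051, 44, ...) are not in order because each year's CSV file keeps its own row numbers when the files are combined. The sketch below shows the effect with two tiny in-memory CSVs; the file contents and the names `csv_1901`, `csv_1902`, `kept` and `fresh` are made up for illustration. Passing `ignore_index=True` to `pd.concat` is one way to get a unique index if label-based lookups are needed.

```python
import io
import pandas as pd

# Two tiny in-memory stand-ins for the per-year CSV files (hypothetical data)
csv_1901 = "name,columns\na.tif,3\nb.tif,2\n"
csv_1902 = "name,columns\nc.tif,4\nd.tif,0\n"
frames = [pd.read_csv(io.StringIO(text)) for text in (csv_1901, csv_1902)]

# Default behaviour: each file's row numbers are kept, so labels repeat
kept = pd.concat(frames)
# ignore_index=True renumbers the rows from 0 to n-1
fresh = pd.concat(frames, ignore_index=True)

print(list(kept.index))
print(list(fresh.index))
# A label lookup on the repeated index returns one row per file
print(kept.loc[0, "name"].tolist())
```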
There are 25 pages with 0 or 1 columns detected. Let's see what's up with them...
# Get the problem pages
problems = combined_df.loc[(combined_df['columns'] == 0) | (combined_df['columns'] == 1)]
problems
| | directory | name | path | referenceCode | startDate | endDate | year | width | height | columns | column_positions |
---|---|---|---|---|---|---|---|---|---|---|---|
677 | AU NBAC N193-055/ | N193-055_0037.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-055 | 1914-07-01 | 1914-09-01 | 1914 | 0 | 0 | 0 | NaN |
1051 | AU NBAC N193-064/ | N193-064_0078.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-064 | 1916-10-01 | 1916-12-01 | 1916 | 0 | 0 | 0 | NaN |
515 | AU NBAC N193-090/ | N193-090_0210.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-090 | 1923-04-01 | 1923-06-01 | 1923 | 4264 | 5000 | 1 | 355 |
330 | AU NBAC N193-109/ | N193-109_0331.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-109 | 1928-01-01 | 1928-03-01 | 1928 | 4032 | 5000 | 1 | 26 |
856 | AU NBAC N193-111/ | N193-111_0216.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-111 | 1928-07-01 | 1928-09-01 | 1928 | 5732 | 5000 | 1 | 12 |
1590 | AU NBAC N193-163/ | N193-163_0427.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-163 | 1941-07-01 | 1941-09-01 | 1941 | 3642 | 2464 | 1 | 0 |
44 | AU NBAC N193-173/ | N193-173_0045.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 0 | 0 | 0 | NaN |
50 | AU NBAC N193-173/ | N193-173_0051.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 0 | 0 | 0 | NaN |
52 | AU NBAC N193-173/ | N193-173_0053.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 0 | 0 | 0 | NaN |
65 | AU NBAC N193-173/ | N193-173_0066.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 0 | 0 | 0 | NaN |
414 | AU NBAC N193-173/ | N193-173_0415.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4618 | 1 | 0 |
415 | AU NBAC N193-173/ | N193-173_0416.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4618 | 1 | 0 |
416 | AU NBAC N193-173/ | N193-173_0417.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4689 | 1 | 9 |
417 | AU NBAC N193-173/ | N193-173_0418.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4618 | 1 | 0 |
418 | AU NBAC N193-173/ | N193-173_0419.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4618 | 1 | 0 |
419 | AU NBAC N193-173/ | N193-173_0420.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4618 | 1 | 0 |
420 | AU NBAC N193-173/ | N193-173_0421.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4570 | 1 | 0 |
421 | AU NBAC N193-173/ | N193-173_0422.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4570 | 1 | 0 |
422 | AU NBAC N193-173/ | N193-173_0423.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4570 | 1 | 0 |
423 | AU NBAC N193-173/ | N193-173_0424.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4558 | 1 | 0 |
424 | AU NBAC N193-173/ | N193-173_0425.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4558 | 1 | 0 |
426 | AU NBAC N193-173/ | N193-173_0427.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4558 | 1 | 0 |
427 | AU NBAC N193-173/ | N193-173_0428.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4558 | 1 | 0 |
431 | AU NBAC N193-173/ | N193-173_0432.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4558 | 1 | 0 |
432 | AU NBAC N193-173/ | N193-173_0433.tif | Shared/ANU-Library/Sydney Stock Exchange 1901-... | N193-173 | 1944-01-01 | 1944-03-01 | 1944 | 5879 | 4558 | 1 | 0 |
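To see where the problem pages cluster, they can be grouped by volume. The sketch below is not part of the original analysis: it rebuilds just the `referenceCode` and `columns` values transcribed from the table above into a small dataframe (`problems_summary` is a hypothetical name), and uses `isin()` as an equivalent way of writing the `(== 0) | (== 1)` filter.

```python
import pandas as pd

# (referenceCode, columns) pairs transcribed from the problems table above
rows = (
    [("N193-055", 0), ("N193-064", 0), ("N193-090", 1),
     ("N193-109", 1), ("N193-111", 1), ("N193-163", 1)]
    + [("N193-173", 0)] * 4
    + [("N193-173", 1)] * 15
)
problems_summary = pd.DataFrame(rows, columns=["referenceCode", "columns"])

# isin() selects rows whose column count is 0 or 1
low = problems_summary.loc[problems_summary["columns"].isin([0, 1])]

# Count the problem pages in each volume, largest first
per_volume = low.groupby("referenceCode").size().sort_values(ascending=False)
print(per_volume)
```

Grouping like this makes it easy to spot that most of the problem pages come from a single volume.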
# If running locally, you need to set up a CloudStor client to download the images
# DON'T RUN THIS ON SWAN (or you'll get an error because webdav is not installed)
import webdav.client as wc
from webdav.client import RemoteResourceNotFound
from credentials import * # Storing my CloudStor credentials in another file
# Set the connection options. CLOUDSTOR_USER and CLOUDSTOR_PW are stored in a separate credentials file.
options = {
    'webdav_hostname': 'https://cloudstor.aarnet.edu.au',
    'webdav_login': CLOUDSTOR_USER,
    'webdav_password': CLOUDSTOR_PW,
    'webdav_root': '/plus/remote.php/webdav/'
}
# Ok let's initiate the client.
client = wc.Client(options)
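from PIL import Image
def download_image(image):
    # Download the original TIFF; report it and move on if it's missing from CloudStor
    try:
        client.download_sync(remote_path=image.path, local_path='problems/{}'.format(image.name))
    except RemoteResourceNotFound:
        print('Not found: {}'.format(image.name))
    else:
        filename, ext = os.path.splitext(image.name)
        # Only make a JPEG thumbnail if the file is bigger than about 3MB; smaller files are just reported
        if os.path.getsize('problems/{}'.format(image.name)) > 3000000:
            img = Image.open('problems/{}'.format(image.name))
            img.thumbnail((1000,1000), resample=Image.LANCZOS)
            img.save('problems/{}.jpg'.format(filename))
        else:
            print('Small: {}'.format(image.name))
# Make sure the local download directory exists
os.makedirs('problems', exist_ok=True)
for row in problems.itertuples():
    # Skip images that have already been downloaded
    if not os.path.exists('problems/{}'.format(row.name)):
        download_image(row)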
Not found: N193-055_0037.tif
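The `thumbnail()` call in `download_image()` shrinks each page to fit inside a 1000 x 1000 box while keeping its aspect ratio, and it never enlarges an image. The sketch below approximates that calculation in plain Python so the resulting sizes can be estimated without downloading anything. `thumbnail_size` is a hypothetical helper, not part of Pillow, and Pillow's own rounding may differ by a pixel.

```python
def thumbnail_size(width, height, box=1000):
    """Approximate the size produced by fitting an image inside a box x box square."""
    # Images that already fit are left alone (thumbnail() never enlarges)
    if width <= box and height <= box:
        return width, height
    # Scale by whichever dimension is the tighter fit
    scale = min(box / width, box / height)
    return max(1, round(width * scale)), max(1, round(height * scale))

# Dimensions taken from pages listed in the tables above
for size in [(6237, 5000), (4264, 5000), (3642, 2464)]:
    print(size, thumbnail_size(*size))
```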
Note that 6 of the pages have no width or height recorded. This means that the script couldn't open the images. I manually checked these:
One of the six, N193-055_0037.tif, could not be found on CloudStor when I tried to download it. I downloaded the other 5, which seemed OK, and ran the column detection script on them; the results were as expected. So I think there must have been some temporary problem on CloudStor when the script first tried to access them.
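If the failures were transient, retrying after a short pause might avoid them. The following is only a sketch of that idea and is not part of the original column detection script: `with_retries` and `flaky_open` are hypothetical names, and treating a failed open as an `OSError` is an assumption.

```python
import time

def with_retries(action, attempts=3, delay_seconds=0.0):
    """Call action() until it succeeds or the attempts run out, then re-raise the last error."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError as error:  # assumption: transient failures surface as OSError
            last_error = error
            if attempt < attempts:
                # Wait a little longer before each new attempt
                time.sleep(delay_seconds * attempt)
    raise last_error

# Hypothetical stand-in for opening a remote image: fails twice, then succeeds
calls = {"count": 0}

def flaky_open():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("temporary failure")
    return (6237, 5000)

size = with_retries(flaky_open)
print(size, calls["count"])
```

In a real run `delay_seconds` would be set to a few seconds; it defaults to zero here so the example finishes immediately.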
I downloaded the rest, and all of them were either rotated or not in the usual page format. These were rotated:
Others:
All the rest from N193-173 are hand-written register pages.
So, in summary, the column detector script seems to have worked as expected on all of these.